wikipedia_toxicity_subtypes
Description:
The comments in this dataset come from an archive of Wikipedia talk page
comments. These have been annotated by Jigsaw for toxicity, as well as (for the
main config) a variety of toxicity subtypes, including severe toxicity,
obscenity, threatening language, insulting language, and identity attacks. This
dataset is a replica of the data released for the Jigsaw Toxic Comment
Classification Challenge and Jigsaw Multilingual Toxic Comment Classification
competition on Kaggle, with the test dataset merged with the test_labels
released after the end of the competitions. Test data not used for scoring has
been dropped. This dataset is released under CC0, as is the underlying comment
text.
Source code:
tfds.text.WikipediaToxicitySubtypes
Versions:
- 0.2.0: Updated features for consistency with CivilComments dataset.
- 0.3.0: Added WikipediaToxicityMultilingual config.
- 0.3.1 (default): Added a unique id for each comment. (For the Multilingual config, these ids are only unique within each split.)
Download size: 50.57 MiB
Auto-cached (documentation): Yes
Supervised keys (see the as_supervised doc): ('text', 'toxicity')
Figure (tfds.show_examples): Not supported.
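For example, loading with as_supervised=True yields the ('text', 'toxicity') pairs listed above; a minimal sketch:

```python
import tensorflow_datasets as tfds

# Load the default (EnglishSubtypes) config; as_supervised=True yields
# (text, toxicity) tuples instead of full feature dicts.
ds = tfds.load('wikipedia_toxicity_subtypes', split='train', as_supervised=True)

for text, toxicity in ds.take(3):
    # text is a tf.string scalar; toxicity is a float32 0/1 majority label.
    print(float(toxicity), text.numpy()[:80])
```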
Citation:
@inproceedings{10.1145/3038912.3052591,
author = {Wulczyn, Ellery and Thain, Nithum and Dixon, Lucas},
title = {Ex Machina: Personal Attacks Seen at Scale},
year = {2017},
isbn = {9781450349130},
publisher = {International World Wide Web Conferences Steering Committee},
address = {Republic and Canton of Geneva, CHE},
url = {https://doi.org/10.1145/3038912.3052591},
doi = {10.1145/3038912.3052591},
booktitle = {Proceedings of the 26th International Conference on World Wide Web},
pages = {1391--1399},
numpages = {9},
keywords = {online discussions, wikipedia, online harassment},
location = {Perth, Australia},
series = {WWW '17}
}
wikipedia_toxicity_subtypes/EnglishSubtypes (default config)
- Config description: The comments in the WikipediaToxicitySubtypes config
are from an archive of English Wikipedia talk page comments which have been
annotated by Jigsaw for toxicity, as well as for five toxicity subtypes
(severe_toxicity, obscene, threat, insult, identity_attack). The toxicity
and toxicity subtype labels are binary values (0 or 1) indicating whether
the majority of annotators assigned that attribute to the comment text. This
config is a replica of the data released for the Jigsaw Toxic Comment
Classification Challenge on Kaggle, with the test dataset joined with the
test_labels released after the competition, and test data not used for
scoring dropped.
See the Kaggle documentation
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data or
https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973 for more
details.
Homepage:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

Dataset size: 128.32 MiB

Splits:

| Split     | Examples |
|-----------|----------|
| 'test'    | 63,978   |
| 'train'   | 159,571  |

Feature structure:
FeaturesDict({
'id': Text(shape=(), dtype=string),
'identity_attack': float32,
'insult': float32,
'language': Text(shape=(), dtype=string),
'obscene': float32,
'severe_toxicity': float32,
'text': Text(shape=(), dtype=string),
'threat': float32,
'toxicity': float32,
})
Feature documentation:

| Feature         | Class        | Shape | Dtype   | Description |
|-----------------|--------------|-------|---------|-------------|
|                 | FeaturesDict |       |         |             |
| id              | Text         |       | string  |             |
| identity_attack | Tensor       |       | float32 |             |
| insult          | Tensor       |       | float32 |             |
| language        | Text         |       | string  |             |
| obscene         | Tensor       |       | float32 |             |
| severe_toxicity | Tensor       |       | float32 |             |
| text            | Text         |       | string  |             |
| threat          | Tensor       |       | float32 |             |
| toxicity        | Tensor       |       | float32 |             |
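For instance, loading this config without as_supervised returns the full feature dicts, so the five subtype labels can be read alongside the overall toxicity label; a minimal sketch:

```python
import tensorflow_datasets as tfds

# Load the default EnglishSubtypes config as feature dicts (no as_supervised),
# so the subtype labels are available alongside the overall toxicity.
ds = tfds.load('wikipedia_toxicity_subtypes/EnglishSubtypes', split='train')
subtypes = ('severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack')

for ex in ds.take(1):
    print('toxicity:', float(ex['toxicity']))
    print({name: float(ex[name]) for name in subtypes})
```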
wikipedia_toxicity_subtypes/Multilingual
- Config description: The comments in the WikipediaToxicityMultilingual
config are from an archive of non-English Wikipedia talk page comments
annotated by Jigsaw for toxicity, with a binary value (0 or 1) indicating
whether the majority of annotators rated the comment text as toxic. The
comments in this config are in six languages (Turkish, Italian, Spanish,
Portuguese, Russian, and French). This config is a replica of the data
released for the Jigsaw Multilingual Toxic Comment Classification
competition on Kaggle, with the test dataset joined with the test_labels
released after the competition.
See the Kaggle documentation
https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data
for more details.
Homepage:
https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data

Dataset size: 35.13 MiB

Splits:

| Split        | Examples |
|--------------|----------|
| 'test'       | 63,812   |
| 'validation' | 8,000    |

Feature structure:
FeaturesDict({
'id': Text(shape=(), dtype=string),
'language': Text(shape=(), dtype=string),
'text': Text(shape=(), dtype=string),
'toxicity': float32,
})
Feature documentation:

| Feature  | Class        | Shape | Dtype   | Description |
|----------|--------------|-------|---------|-------------|
|          | FeaturesDict |       |         |             |
| id       | Text         |       | string  |             |
| language | Text         |       | string  |             |
| text     | Text         |       | string  |             |
| toxicity | Tensor       |       | float32 |             |
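Note that this config has no train split; a minimal sketch that loads the validation split and tallies examples per language:

```python
import collections

import tensorflow_datasets as tfds

# The Multilingual config only has 'validation' and 'test' splits.
ds = tfds.load('wikipedia_toxicity_subtypes/Multilingual', split='validation')

counts = collections.Counter()
for ex in ds:
    # The 'language' feature is a tf.string scalar (e.g. b'tr', b'it').
    counts[ex['language'].numpy().decode('utf-8')] += 1

print(counts)  # examples per language in the validation split
```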