ag_news_subset
Stay organized with collections
Save and categorize content based on your preferences.
AG is a collection of more than 1 million news articles. News articles have been
gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of
activity. ComeToMyHead is an academic news search engine which has been running
since July, 2004. The dataset is provided by the academic comunity for research
purposes in data mining (clustering, classification, etc), information retrieval
(ranking, search, etc), xml, data compression, data streaming, and any other
non-commercial activity. For more information, please refer to the link
http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang
(xiang.zhang@nyu.edu) from the dataset above. It is used as a text
classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann
LeCun. Character-level Convolutional Networks for Text Classification. Advances
in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing 4 largest
classes from the original corpus. Each class contains 30,000 training samples
and 1,900 testing samples. The total number of training samples is 120,000 and
testing 7,600.
Split |
Examples |
'test' |
7,600 |
'train' |
120,000 |
FeaturesDict({
'description': Text(shape=(), dtype=string),
'label': ClassLabel(shape=(), dtype=int64, num_classes=4),
'title': Text(shape=(), dtype=string),
})
Feature |
Class |
Shape |
Dtype |
Description |
|
FeaturesDict |
|
|
|
description |
Text |
|
string |
|
label |
ClassLabel |
|
int64 |
|
title |
Text |
|
string |
|
@misc{zhang2015characterlevel,
title={Character-level Convolutional Networks for Text Classification},
author={Xiang Zhang and Junbo Zhao and Yann LeCun},
year={2015},
eprint={1509.01626},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2022-12-06 UTC.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2022-12-06 UTC."],[],[],null,["# ag_news_subset\n\n\u003cbr /\u003e\n\n- **Description**:\n\nAG is a collection of more than 1 million news articles. News articles have been\ngathered from more than 2000 news sources by ComeToMyHead in more than 1 year of\nactivity. ComeToMyHead is an academic news search engine which has been running\nsince July, 2004. The dataset is provided by the academic comunity for research\npurposes in data mining (clustering, classification, etc), information retrieval\n(ranking, search, etc), xml, data compression, data streaming, and any other\nnon-commercial activity. For more information, please refer to the link\n[http://www.di.unipi.it/\\~gulli/AG_corpus_of_news_articles.html](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) .\n\nThe AG's news topic classification dataset is constructed by Xiang Zhang\n(xiang.zhang@nyu.edu) from the dataset above. It is used as a text\nclassification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann\nLeCun. Character-level Convolutional Networks for Text Classification. Advances\nin Neural Information Processing Systems 28 (NIPS 2015).\n\nThe AG's news topic classification dataset is constructed by choosing 4 largest\nclasses from the original corpus. Each class contains 30,000 training samples\nand 1,900 testing samples. The total number of training samples is 120,000 and\ntesting 7,600.\n\n- **Additional Documentation** :\n [Explore on Papers With Code\n north_east](https://paperswithcode.com/dataset/ag-news)\n\n- **Homepage** :\n \u003chttps://arxiv.org/abs/1509.01626\u003e\n\n- **Source code** :\n [`tfds.datasets.ag_news_subset.Builder`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/ag_news_subset/ag_news_subset_dataset_builder.py)\n\n- **Versions**:\n\n - **`1.0.0`** (default): No release notes.\n- **Download size** : `11.24 MiB`\n\n- **Dataset size** : `35.79 MiB`\n\n- **Auto-cached**\n ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):\n Yes\n\n- **Splits**:\n\n| Split | Examples |\n|-----------|----------|\n| `'test'` | 7,600 |\n| `'train'` | 120,000 |\n\n- **Feature structure**:\n\n FeaturesDict({\n 'description': Text(shape=(), dtype=string),\n 'label': ClassLabel(shape=(), dtype=int64, num_classes=4),\n 'title': Text(shape=(), dtype=string),\n })\n\n- **Feature documentation**:\n\n| Feature | Class | Shape | Dtype | Description |\n|-------------|--------------|-------|--------|-------------|\n| | FeaturesDict | | | |\n| description | Text | | string | |\n| label | ClassLabel | | int64 | |\n| title | Text | | string | |\n\n- **Supervised keys** (See\n [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args)):\n `('description', 'label')`\n\n- **Figure**\n ([tfds.show_examples](https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples)):\n Not supported.\n\n- **Examples**\n ([tfds.as_dataframe](https://www.tensorflow.org/datasets/api_docs/python/tfds/as_dataframe)):\n\nDisplay examples... \n\n- **Citation**:\n\n @misc{zhang2015characterlevel,\n title={Character-level Convolutional Networks for Text Classification},\n author={Xiang Zhang and Junbo Zhao and Yann LeCun},\n year={2015},\n eprint={1509.01626},\n archivePrefix={arXiv},\n primaryClass={cs.LG}\n }"]]