References:
Use the following command to load this dataset in TFDS:
import tensorflow_datasets as tfds
ds = tfds.load('huggingface:turkish_ner')
- Description:
The Turkish Wikipedia Named-Entity Recognition and Text Categorization
(TWNERTC) dataset is a collection of automatically categorized and annotated
sentences obtained from Wikipedia. The authors constructed large-scale
gazetteers by using a graph crawler algorithm to extract
relevant entity and domain information
from Freebase, a semantic knowledge base.
The constructed gazetteers contain approximately
300K entities with thousands of fine-grained entity types
under 77 different domains.
- License: No known license
- Version: 0.0.0
- Splits:
Split | Examples |
---|---|
'train' | 532629 |
- Features:
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"tokens": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"domain": {
"num_classes": 25,
"names": [
"architecture",
"basketball",
"book",
"business",
"education",
"fictional_universe",
"film",
"food",
"geography",
"government",
"law",
"location",
"military",
"music",
"opera",
"organization",
"people",
"religion",
"royalty",
"soccer",
"sports",
"theater",
"time",
"travel",
"tv"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"ner_tags": {
"feature": {
"num_classes": 9,
"names": [
"O",
"B-PERSON",
"I-PERSON",
"B-ORGANIZATION",
"I-ORGANIZATION",
"B-LOCATION",
"I-LOCATION",
"B-MISC",
"I-MISC"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}
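Because `ner_tags` is stored as a sequence of integer class IDs, readers usually need to map those IDs back to the label names listed above and group the BIO tags into entity spans. The following is a minimal sketch of that step, not part of the dataset's own tooling. The helper name `decode_entities` and the sample row are hypothetical and made up for illustration, and the label order is assumed to match the `names` list in the feature spec.

```python
from typing import List, Optional, Tuple

# Label names copied from the "ner_tags" feature spec; the list index is the class ID.
NER_NAMES = [
    "O",
    "B-PERSON", "I-PERSON",
    "B-ORGANIZATION", "I-ORGANIZATION",
    "B-LOCATION", "I-LOCATION",
    "B-MISC", "I-MISC",
]


def decode_entities(tokens: List[str], tag_ids: List[int]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper: a stray I- tag (one that does not continue a span
    of the same type) is treated leniently as the start of a new span.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag_id in zip(tokens, tag_ids):
        tag = NER_NAMES[tag_id]
        if tag == "O":
            prefix, entity_type = "O", None
        else:
            prefix, entity_type = tag.split("-", 1)

        # An I- tag of the same type extends the span that is currently open.
        if prefix == "I" and current_type == entity_type:
            current_tokens.append(token)
            continue

        # Any other tag closes the open span, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

        # B- tags (and stray I- tags) open a new span.
        if prefix in ("B", "I"):
            current_tokens, current_type = [token], entity_type

    # Flush a span that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Made-up example row (not taken from the dataset).
example_tokens = ["Orhan", "Pamuk", "ve", "Ankara"]
example_tags = [1, 2, 0, 5]
print(decode_entities(example_tokens, example_tags))
```

The same index-based lookup applies to the `domain` field, which holds a single class ID into its own 25-entry `names` list.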