wiki_dialog
WikiDialog is a large dataset of synthetically generated information-seeking
conversations. Each conversation in the dataset contains two speakers grounded
in a passage from English Wikipedia: one speaker’s utterances consist of exact
sentences from the passage; the other speaker is generated by a large language
model.
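The grounding property described above can be checked mechanically. The following is a minimal sketch, assuming that `author_num` is parallel to `utterances` (one speaker id per turn) and that one id marks the speaker whose turns are copied from the passage. The id `1` and the sample dialog are illustrative assumptions, not values taken from the dataset.

```python
from typing import Dict, List

# Assumption: this author_num value marks the speaker whose turns are copied
# from the passage. Verify it against real examples before relying on it.
GROUNDED_SPEAKER = 1


def ungrounded_turns(example: Dict[str, List],
                     grounded_speaker: int = GROUNDED_SPEAKER) -> List[str]:
    """Return grounded-speaker utterances that are not exact passage sentences."""
    passage_sentences = set(example["sentences"])
    return [
        utterance
        for utterance, author in zip(example["utterances"], example["author_num"])
        if author == grounded_speaker and utterance not in passage_sentences
    ]


# Hypothetical example shaped like the dataset's features (not real data).
example = {
    "sentences": [
        "Ada Lovelace was a mathematician.",
        "She wrote about the Analytical Engine.",
    ],
    "utterances": [
        "Who was Ada Lovelace?",
        "Ada Lovelace was a mathematician.",
        "What did she write about?",
        "She wrote about the Analytical Engine.",
    ],
    "author_num": [0, 1, 0, 1],
}

# An empty list means every grounded turn matched a passage sentence exactly.
print(ungrounded_turns(example))
```

If the check reports many mismatches on real examples, the assumed speaker id is probably the wrong one, so try the other value before concluding that the data is inconsistent.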
- **Splits**:

| Split | Examples |
|----------------|------------|
| `'train'` | 11,264,129 |
| `'validation'` | 113,822 |
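The split sizes listed above imply that the validation split is roughly 1% of all examples. A short sketch of that arithmetic, using only the counts given on this page:

```python
# Split sizes as listed on this page.
SPLIT_SIZES = {"train": 11_264_129, "validation": 113_822}

total = sum(SPLIT_SIZES.values())
for name, count in SPLIT_SIZES.items():
    # Report each split's share of the combined example count.
    print(f"{name}: {count:,} examples ({count / total:.2%} of all examples)")
```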
- **Feature structure**:

    FeaturesDict({
        'author_num': Sequence(int32),
        'passage': Text(shape=(), dtype=string),
        'pid': Text(shape=(), dtype=string),
        'sentences': Sequence(Text(shape=(), dtype=string)),
        'title': Text(shape=(), dtype=string),
        'utterances': Sequence(Text(shape=(), dtype=string)),
    })
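As a rough illustration of how this feature structure might be consumed, the sketch below pairs each utterance with its speaker id. It assumes `author_num` and `utterances` have the same length, which this page does not state explicitly. The helper names `to_turns` and `load_validation_examples` are hypothetical, and the sample dialog is made up. The loader uses `tfds.load` with the `wiki_dialog/OQ` config and is defined but not called here, since the first call downloads and prepares the full dataset.

```python
from typing import Dict, Iterable, List, Tuple


def to_turns(example: Dict[str, Iterable]) -> List[Tuple[int, str]]:
    """Pair each utterance with its speaker id, decoding bytes to str if needed."""
    turns = []
    for author, utterance in zip(example["author_num"], example["utterances"]):
        if isinstance(utterance, bytes):  # tfds.as_numpy yields byte strings
            utterance = utterance.decode("utf-8")
        turns.append((int(author), utterance))
    return turns


def load_validation_examples(limit: int = 3):
    """Yield a few validation examples; requires tensorflow_datasets."""
    # Imported lazily so that to_turns can be used without the library installed.
    import tensorflow_datasets as tfds

    dataset = tfds.load("wiki_dialog/OQ", split="validation")
    yield from tfds.as_numpy(dataset.take(limit))


# Hypothetical example shaped like the dataset's features (not real data).
mock = {
    "author_num": [0, 1],
    "utterances": [b"Who was Ada Lovelace?", "Ada Lovelace was a mathematician."],
}
for author, text in to_turns(mock):
    print(f"speaker {author}: {text}")
```

With the dataset prepared locally, `for ex in load_validation_examples(): print(to_turns(ex))` would render real dialogs the same way.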
- **Feature documentation**:

| Feature | Class | Shape | Dtype | Description |
|------------|------------------|---------|--------|-------------|
| | FeaturesDict | | | |
| author_num | Sequence(Tensor) | (None,) | int32 | |
| passage | Text | | string | |
| pid | Text | | string | |
| sentences | Sequence(Text) | (None,) | string | |
| title | Text | | string | |
| utterances | Sequence(Text) | (None,) | string | |
- **Citation**:

    @inproceedings{dai2022dialoginpainting,
      title={Dialog Inpainting: Turning Documents to Dialogs},
      author={Dai, Zhuyun and Chaganty, Arun Tejasvi and Zhao, Vincent and Amini, Aida and Green, Mike and Rashid, Qazi and Guu, Kelvin},
      booktitle={International Conference on Machine Learning (ICML)},
      year={2022},
      organization={PMLR}
    }
wiki_dialog/OQ (default config)
Last updated 2022-12-06 UTC.
- **Config description**: WikiDialog generated from the dialog inpainter
  finetuned on OR-QuAC and QReCC. `OQ` stands for OR-QuAC and QReCC.

- **Homepage**:
  <https://github.com/google-research/dialog-inpainting#wikidialog-oq>

- **Source code**:
  [`tfds.text.wiki_dialog.WikiDialog`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/text/wiki_dialog/wiki_dialog.py)

- **Versions**:

  - **`1.0.0`** (default): Initial release.

- **Download size**: `7.04 GiB`

- **Dataset size**: `36.58 GiB`

- **Auto-cached**
  ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
  No

- **Supervised keys** (See
  [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args)):
  `None`

- **Figure**
  ([tfds.show_examples](https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples)):
  Not supported.