# user_libri_text
- **Description**:

UserLibri is a dataset containing paired audio-transcripts and additional
text-only data for each of 107 users. It is a reformatting of the LibriSpeech
dataset found at http://www.openslr.org/12, reorganizing the data into users
with an average of 52 LibriSpeech utterances and about 6,700 text example
sentences per user. The `UserLibriAudio` class provides access to the
audio-transcript pairs. See `UserLibriText` for the additional text data.
- **Homepage**:
  https://www.kaggle.com/datasets/google/userlibri

- **Source code**:
  [`tfds.text.userlibri_lm_data.UserLibriText`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/text/userlibri_lm_data/userlibri_lm_data.py)

- **Versions**:

  - **`1.0.0`** (default): No release notes.

- **Download size**: `Unknown size`

- **Dataset size**: `86.86 MiB`

- **Auto-cached**
  ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
  Yes

- **Splits**:

| Split | Examples |
|-----------|----------|
| `'10136'` | 38,496 |
| `'1041'` | 970 |
| `'10540'` | 3,283 |
| `'108'` | 5,864 |
| `'11'` | 1,348 |
| `'11667'` | 3,312 |
| `'1184'` | 22,062 |
| `'12176'` | 1,467 |
| `'12434'` | 2,796 |
| `'12544'` | 4,080 |
| `'13110'` | 2,634 |
| `'13158'` | 3,440 |
| `'13441'` | 4,145 |
| `'135'` | 37,263 |
| `'1353'` | 4,889 |
| `'1399'` | 18,914 |
| `'14420'` | 6,950 |
| `'14566'` | 3,810 |
| `'1477'` | 2,526 |
| `'14958'` | 1,495 |
| `'15263'` | 21,085 |
| `'15265'` | 7,647 |
| `'1549'` | 5,439 |
| `'1572'` | 2,882 |
| `'1597'` | 3,586 |
| `'1608'` | 3,605 |
| `'16127'` | 3,588 |
| `'16653'` | 7,600 |
| `'18096'` | 2,384 |
| `'1827'` | 4,806 |
| `'19019'` | 3,248 |
| `'19215'` | 13,542 |
| `'19717'` | 3,762 |
| `'1989'` | 1,105 |
| `'1998'` | 8,923 |
| `'20019'` | 966 |
| `'2002'` | 239 |
| `'20212'` | 3,363 |
| `'209'` | 2,090 |
| `'21297'` | 4,165 |
| `'22002'` | 4,044 |
| `'2300'` | 22,201 |
| `'24'` | 3,537 |
| `'24585'` | 1,789 |
| `'24811'` | 2,399 |
| `'2488'` | 8,239 |
| `'2529'` | 3,934 |
| `'26177'` | 3,598 |
| `'26379'` | 379 |
| `'2681'` | 8,872 |
| `'27067'` | 3,149 |
| `'27090'` | 3,217 |
| `'2770'` | 3,750 |
| `'2787'` | 4,603 |
| `'28700'` | 5,547 |
| `'28725'` | 3,899 |
| `'28952'` | 2,909 |
| `'2981'` | 54,305 |
| `'3076'` | 7,124 |
| `'30905'` | 2,140 |
| `'3178'` | 8,454 |
| `'33'` | 3,569 |
| `'33800'` | 5,145 |
| `'3436'` | 5,899 |
| `'3440'` | 5,087 |
| `'3441'` | 6,042 |
| `'36508'` | 521 |
| `'3748'` | 4,767 |
| `'38675'` | 2,696 |
| `'38804'` | 5,653 |
| `'39159'` | 2,729 |
| `'4028'` | 9,633 |
| `'40359'` | 7,821 |
| `'41326'` | 6,181 |
| `'4217'` | 6,003 |
| `'4276'` | 10,461 |
| `'434'` | 4,319 |
| `'4602'` | 4,421 |
| `'507'` | 9,093 |
| `'540'` | 5,452 |
| `'5516'` | 4,963 |
| `'5630'` | 1,130 |
| `'574'` | 452 |
| `'5921'` | 6,040 |
| `'6328'` | 5,926 |
| `'6812'` | 5,839 |
| `'732'` | 22,971 |
| `'76'` | 6,454 |
| `'7891'` | 1,476 |
| `'8166'` | 3,190 |
| `'820'` | 11,054 |
| `'833'` | 3,638 |
| `'9189'` | 8,387 |
| `'94'` | 1,722 |
| `'940'` | 6,172 |
| `'9464'` | 1,695 |
| `'955'` | 3,051 |
| `'969'` | 7,799 |
| `'9983'` | 8,898 |
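Each split corresponds to a single user ID, so one user's text-only data can
be loaded directly by split name. A minimal sketch using the standard
`tfds.load` API; the split `'10136'` is simply the first user ID from the
table above:

    import tensorflow_datasets as tfds

    # Each split holds the text examples for one user; load user '10136'.
    ds = tfds.load('user_libri_text', split='10136')

    for example in ds.take(3):
        # Each example pairs a source book ID with one sentence of text.
        print(example['book_id'].numpy().decode('utf-8'),
              example['text'].numpy().decode('utf-8'))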
- **Feature structure**:

    FeaturesDict({
        'book_id': Text(shape=(), dtype=string),
        'text': Text(shape=(), dtype=string),
    })
- **Feature documentation**:

| Feature | Class | Shape | Dtype | Description |
|---------|--------------|-------|--------|-------------------------------------------|
| | FeaturesDict | | | |
| book_id | Text | | string | The book that this text was pulled from |
| text | Text | | string | A sentence of text extracted from a book |

- **Supervised keys** (See
  [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args)):
  `('text', 'text')`

- **Figure**
  ([tfds.show_examples](https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples)):
  Not supported.
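Since the supervised keys are `('text', 'text')`, the dataset can be consumed
either as a dict of the features documented above or as (input, target)
sentence pairs. A minimal sketch; the split name `'1041'` is just one user ID
picked from the splits table:

    import tensorflow_datasets as tfds

    # Load one user's split together with the dataset metadata.
    ds, info = tfds.load('user_libri_text', split='1041', with_info=True)

    # Inspect a few examples as a pandas DataFrame; columns match the
    # FeaturesDict above ('book_id' and 'text').
    print(tfds.as_dataframe(ds.take(5), info))

    # With as_supervised=True the supervised keys ('text', 'text') apply,
    # so each element is an (input, target) pair of identical sentences.
    ds_sup = tfds.load('user_libri_text', split='1041', as_supervised=True)
    for inp, target in ds_sup.take(1):
        print(inp.numpy().decode('utf-8'))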
- **Citation**:

    @inproceedings{breiner2022userlibri,
      title={UserLibri: A Dataset for ASR Personalization Using Only Text},
      author={Breiner, Theresa and Ramaswamy, Swaroop and Variani, Ehsan and Garg, Shefali and Mathews, Rajiv and Sim, Khe Chai and Gupta, Kilol and Chen, Mingqing and McConnaughey, Lara},
      booktitle={Proc. Interspeech 2022},
      year={2022}
    }
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2022-12-06 UTC."],[],[],null,["# user_libri_text\n\n\u003cbr /\u003e\n\n- **Description**:\n\nUserLibri is a dataset containing paired audio-transcripts and additional text\nonly data for each of 107 users. It is a reformatting of the LibriSpeech dataset\nfound at \u003chttp://www.openslr.org/12,\u003e reorganizing the data into users with an\naverage of 52 LibriSpeech utterances and about 6,700 text example sentences per\nuser. The UserLibriAudio class provides access to the audio-transcript pairs.\nSee UserLibriText for the additional text data.\n\n- **Homepage** :\n \u003chttps://www.kaggle.com/datasets/google/userlibri\u003e\n\n- **Source code** :\n [`tfds.text.userlibri_lm_data.UserLibriText`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/text/userlibri_lm_data/userlibri_lm_data.py)\n\n- **Versions**:\n\n - **`1.0.0`** (default): No release notes.\n- **Download size** : `Unknown size`\n\n- **Dataset size** : `86.86 MiB`\n\n- **Auto-cached**\n ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):\n Yes\n\n- **Splits**:\n\n| Split | Examples |\n|-----------|----------|\n| `'10136'` | 38,496 |\n| `'1041'` | 970 |\n| `'10540'` | 3,283 |\n| `'108'` | 5,864 |\n| `'11'` | 1,348 |\n| `'11667'` | 3,312 |\n| `'1184'` | 22,062 |\n| `'12176'` | 1,467 |\n| `'12434'` | 2,796 |\n| `'12544'` | 4,080 |\n| `'13110'` | 2,634 |\n| `'13158'` | 3,440 |\n| `'13441'` | 4,145 |\n| `'135'` | 37,263 |\n| `'1353'` | 4,889 |\n| `'1399'` | 18,914 |\n| `'14420'` | 6,950 |\n| `'14566'` | 3,810 |\n| `'1477'` | 2,526 |\n| `'14958'` | 1,495 |\n| `'15263'` | 21,085 |\n| `'15265'` | 7,647 |\n| `'1549'` | 5,439 |\n| `'1572'` | 2,882 |\n| `'1597'` | 3,586 |\n| `'1608'` | 3,605 |\n| `'16127'` | 3,588 |\n| `'16653'` | 7,600 |\n| `'18096'` | 2,384 |\n| `'1827'` | 4,806 |\n| `'19019'` | 3,248 |\n| `'19215'` | 13,542 |\n| `'19717'` | 3,762 |\n| `'1989'` | 1,105 |\n| `'1998'` | 8,923 |\n| `'20019'` | 966 |\n| `'2002'` | 239 |\n| `'20212'` | 3,363 |\n| `'209'` | 2,090 |\n| `'21297'` | 4,165 |\n| `'22002'` | 4,044 |\n| `'2300'` | 22,201 |\n| `'24'` | 3,537 |\n| `'24585'` | 1,789 |\n| `'24811'` | 2,399 |\n| `'2488'` | 8,239 |\n| `'2529'` | 3,934 |\n| `'26177'` | 3,598 |\n| `'26379'` | 379 |\n| `'2681'` | 8,872 |\n| `'27067'` | 3,149 |\n| `'27090'` | 3,217 |\n| `'2770'` | 3,750 |\n| `'2787'` | 4,603 |\n| `'28700'` | 5,547 |\n| `'28725'` | 3,899 |\n| `'28952'` | 2,909 |\n| `'2981'` | 54,305 |\n| `'3076'` | 7,124 |\n| `'30905'` | 2,140 |\n| `'3178'` | 8,454 |\n| `'33'` | 3,569 |\n| `'33800'` | 5,145 |\n| `'3436'` | 5,899 |\n| `'3440'` | 5,087 |\n| `'3441'` | 6,042 |\n| `'36508'` | 521 |\n| `'3748'` | 4,767 |\n| `'38675'` | 2,696 |\n| `'38804'` | 5,653 |\n| `'39159'` | 2,729 |\n| `'4028'` | 9,633 |\n| `'40359'` | 7,821 |\n| `'41326'` | 6,181 |\n| `'4217'` | 6,003 |\n| `'4276'` | 10,461 |\n| `'434'` | 4,319 |\n| `'4602'` | 4,421 |\n| `'507'` | 9,093 |\n| `'540'` | 5,452 |\n| `'5516'` | 4,963 |\n| `'5630'` | 1,130 |\n| `'574'` | 452 |\n| `'5921'` | 6,040 |\n| `'6328'` | 5,926 |\n| `'6812'` | 5,839 |\n| `'732'` | 22,971 |\n| `'76'` | 6,454 |\n| 
`'7891'` | 1,476 |\n| `'8166'` | 3,190 |\n| `'820'` | 11,054 |\n| `'833'` | 3,638 |\n| `'9189'` | 8,387 |\n| `'94'` | 1,722 |\n| `'940'` | 6,172 |\n| `'9464'` | 1,695 |\n| `'955'` | 3,051 |\n| `'969'` | 7,799 |\n| `'9983'` | 8,898 |\n\n- **Feature structure**:\n\n FeaturesDict({\n 'book_id': Text(shape=(), dtype=string),\n 'text': Text(shape=(), dtype=string),\n })\n\n- **Feature documentation**:\n\n| Feature | Class | Shape | Dtype | Description |\n|---------|--------------|-------|--------|------------------------------------------|\n| | FeaturesDict | | | |\n| book_id | Text | | string | The book that this text was pulled from |\n| text | Text | | string | A sentence of text extracted from a book |\n\n- **Supervised keys** (See\n [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args)):\n `('text', 'text')`\n\n- **Figure**\n ([tfds.show_examples](https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples)):\n Not supported.\n\n- **Examples**\n ([tfds.as_dataframe](https://www.tensorflow.org/datasets/api_docs/python/tfds/as_dataframe)):\n\nDisplay examples... \n\n- **Citation**:\n\n @inproceedings{breiner2022userlibri,\n title={UserLibri: A Dataset for ASR Personalization Using Only Text},\n author={Breiner, Theresa and Ramaswamy, Swaroop and Variani, Ehsan and Garg, Shefali and Mathews, Rajiv and Sim, Khe Chai and Gupta, Kilol and Chen, Mingqing and McConnaughey, Lara},\n booktitle={Proc. Interspeech 2022},\n year={2022}\n }"]]