opus
OPUS is a collection of translated texts from the web.
Create your own config to choose which subsets and language pair to load:

    import tensorflow_datasets as tfds

    # Select the German-English pair from the GNOME and EMEA subsets.
    config = tfds.translate.opus.OpusConfig(
        version=tfds.core.Version('0.1.0'),
        language_pair=("de", "en"),
        subsets=["GNOME", "EMEA"]
    )
    builder = tfds.builder("opus", config=config)
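Once a builder exists, the usual next step is to download the data and read a split. The following is a minimal sketch, assuming the standard `download_and_prepare()` and `as_dataset()` builder methods. The helper name `prepare_and_load` is hypothetical and not part of TFDS; it accepts any builder-like object, so it can be tried without downloading OPUS.

```python
def prepare_and_load(builder, split="train"):
    """Prepare the data behind `builder`, then return the requested split.

    `builder` is assumed to behave like the object returned by
    tfds.builder(...): it exposes download_and_prepare() and as_dataset().
    """
    # Downloads and writes the dataset to disk; existing data is reused.
    builder.download_and_prepare()
    # Returns the split as a dataset object (a tf.data.Dataset in TFDS).
    return builder.as_dataset(split=split)
```

With the builder from the snippet above, `prepare_and_load(builder)` would return the `'train'` split, which is the only split this dataset lists.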
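Additional Documentation: Explore on Papers With Code (https://paperswithcode.com/dataset/opus-100)
Homepage: http://opus.nlpl.eu/
Source code: tfds.datasets.opus.Builder (https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/opus/opus_dataset_builder.py)
Versions:
0.1.0 (default): No release notes.
Feature structure:

    Translation({
        'de': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    })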
Feature documentation:

| Feature | Class       | Shape | Dtype  | Description |
|---------|-------------|-------|--------|-------------|
|         | Translation |       |        |             |
| de      | Text        |       | string |             |
| en      | Text        |       | string |             |
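Each example is a dictionary with one text entry per language, `'de'` and `'en'`. The sketch below shows one way to turn such examples into plain string pairs. The helper names `iter_pairs` and `first_pairs_from_tfds` are hypothetical; the second function assumes TFDS is installed and relies on `tfds.load` and `tfds.as_numpy`, which yield text features as bytes.

```python
def iter_pairs(examples, src="de", tgt="en", limit=None):
    """Yield (source, target) strings from an iterable of translation dicts.

    Values may be bytes (as produced by tfds.as_numpy) or already str.
    """
    def to_text(value):
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    for index, example in enumerate(examples):
        if limit is not None and index >= limit:
            break
        yield to_text(example[src]), to_text(example[tgt])


def first_pairs_from_tfds(name="opus/medical", limit=3):
    """Assumed usage with TFDS installed; triggers a download on first call."""
    # Imported lazily so that iter_pairs itself has no dependencies.
    import tensorflow_datasets as tfds

    dataset = tfds.load(name, split="train")
    return list(iter_pairs(tfds.as_numpy(dataset), limit=limit))
```

Keeping the decoding in `iter_pairs` separate from the loading step means the pairing logic can be checked on a few hand-written dictionaries before any data is downloaded.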
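Supervised keys (see as_supervised doc): ('de', 'en')
Figure (tfds.show_examples): Not supported.
Citation:

    @inproceedings{Tiedemann2012ParallelData,
      author = {Tiedemann, J},
      title = {Parallel Data, Tools and Interfaces in OPUS},
      booktitle = {LREC},
      year = {2012}
    }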
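opus/medical (default config)
Config description: medical documents
Download size: 34.29 MiB
Dataset size: 188.85 MiB
Auto-cached (documentation): Only when shuffle_files=False (train)
Splits:

| Split   | Examples  |
|---------|-----------|
| 'train' | 1,108,752 |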
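opus/law
Config description: law documents
Download size: 46.99 MiB
Dataset size: 214.44 MiB
Auto-cached (documentation): Only when shuffle_files=False (train)
Splits:

| Split   | Examples |
|---------|----------|
| 'train' | 719,372  |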
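opus/koran
Config description: koran documents
Download size: 35.42 MiB
Dataset size: 117.54 MiB
Auto-cached (documentation): Yes
Splits:

| Split   | Examples |
|---------|----------|
| 'train' | 537,128  |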
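opus/IT
Config description: IT documents
Download size: 10.33 MiB
Dataset size: 42.51 MiB
Auto-cached (documentation): Yes
Splits:

| Split   | Examples |
|---------|----------|
| 'train' | 347,817  |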
opus/subtitles
Config description: subtitles documents
Download size: 677.64 MiB
Dataset size: 2.01 GiB
Auto-cached (documentation): No
Splits:

| Split   | Examples   |
|---------|------------|
| 'train' | 22,512,639 |
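The predefined configs differ in size by almost two orders of magnitude, so it can help to cap how many rows are read. The sketch below copies the example counts from the split tables on this page and builds the `opus/<config>` name and a sliced split string. The helper names (`tfds_name`, `split_for_budget`, `load_capped`) are hypothetical; `load_capped` assumes TFDS is installed and uses the `train[:N]` split-slicing syntax accepted by `tfds.load`.

```python
# Training example counts copied from the split tables on this page.
TRAIN_EXAMPLES = {
    "medical": 1_108_752,
    "law": 719_372,
    "koran": 537_128,
    "IT": 347_817,
    "subtitles": 22_512_639,
}


def tfds_name(config):
    """Return the 'opus/<config>' name that selects a predefined config."""
    if config not in TRAIN_EXAMPLES:
        raise KeyError(f"unknown opus config: {config!r}")
    return f"opus/{config}"


def split_for_budget(config, max_examples):
    """Return a split string that reads at most `max_examples` training rows."""
    rows = min(max_examples, TRAIN_EXAMPLES[config])
    return f"train[:{rows}]"


def load_capped(config, max_examples):
    """Assumed usage with TFDS installed; downloads the config on first call."""
    import tensorflow_datasets as tfds  # lazy import keeps the helpers above dependency-free

    return tfds.load(tfds_name(config), split=split_for_budget(config, max_examples))
```

For instance, `load_capped("subtitles", 100_000)` would request `opus/subtitles` with the split `train[:100000]` rather than all 22,512,639 rows. The full config is still downloaded and prepared first; the slice only limits what is read.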
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2022-12-15 UTC.