- Description:
OPUS is a collection of translated texts from the web.
Create your own config to choose which subsets and language pair to load.
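import tensorflow_datasets as tfds

# Select the German-English pair from the GNOME and EMEA subsets.
config = tfds.translate.opus.OpusConfig(
    version=tfds.core.Version('0.1.0'),
    language_pair=("de", "en"),
    subsets=["GNOME", "EMEA"]
)
builder = tfds.builder("opus", config=config)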
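Once a config exists, the builder can prepare the data and yield sentence pairs. The following is a minimal sketch, assuming `tensorflow_datasets` is installed and the download succeeds; `decode_pair` and `iter_opus_pairs` are hypothetical helper names, not part of the library.

```python
def decode_pair(example, source="de", target="en"):
    """Turn one example dict of UTF-8 bytes into a (source, target) tuple of str."""
    return example[source].decode("utf-8"), example[target].decode("utf-8")


def iter_opus_pairs(config, limit=3):
    """Yield up to `limit` decoded sentence pairs from the 'train' split."""
    # Imported lazily so decode_pair can be used without TensorFlow installed.
    import tensorflow_datasets as tfds

    builder = tfds.builder("opus", config=config)
    builder.download_and_prepare()           # downloads and writes the data on first use
    ds = builder.as_dataset(split="train")   # 'train' is the only split listed for OPUS
    for example in tfds.as_numpy(ds.take(limit)):
        yield decode_pair(example)
```

Calling `list(iter_opus_pairs(config))` with a config like the one above would trigger the download the first time it runs.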
- Additional Documentation: Explore on Papers With Code 
- Homepage: http://opus.nlpl.eu/ 
- Source code: tfds.datasets.opus.Builder
- Versions: 0.1.0 (default): No release notes.
 
- Feature structure: 
Translation({
    'de': Text(shape=(), dtype=string),
    'en': Text(shape=(), dtype=string),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | Translation | | | |
| de | Text | | string | |
| en | Text | | string | |
- Supervised keys (see the as_supervised doc): ('de', 'en')
- Figure (tfds.show_examples): Not supported. 
- Citation: 
@inproceedings{Tiedemann2012ParallelData,
  author = {Tiedemann, J},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {LREC},
  year = {2012}
}
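The supervised keys listed above, ('de', 'en'), determine what a loader returns when `as_supervised=True` is passed: each example becomes an (input, target) tuple instead of a feature dictionary. The sketch below imitates that mapping on a plain Python dict so it runs without any download; `to_supervised` is a hypothetical helper and the sentence pair is made up for illustration.

```python
SUPERVISED_KEYS = ("de", "en")  # copied from the "Supervised keys" entry


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Hypothetical helper: map a feature dict to an (input, target) tuple."""
    source_key, target_key = keys
    return example[source_key], example[target_key]


example = {"de": "Guten Morgen", "en": "Good morning"}  # made-up pair, not from OPUS
source, target = to_supervised(example)
print(source, "->", target)
```

With these keys, German is the model input and English is the target; translating in the other direction would require swapping the tuple yourself.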
opus/medical (default config)
- Config description: medical documents 
- Download size: 34.29 MiB
- Dataset size: 188.85 MiB
- Auto-cached (documentation): Only when shuffle_files=False (train)
- Splits: 
| Split | Examples | 
|---|---|
| 'train' | 1,108,752 | 
- Examples (tfds.as_dataframe):
opus/law
- Config description: law documents 
- Download size: 46.99 MiB
- Dataset size: 214.44 MiB
- Auto-cached (documentation): Only when shuffle_files=False (train)
- Splits: 
| Split | Examples | 
|---|---|
| 'train' | 719,372 | 
- Examples (tfds.as_dataframe):
opus/koran
- Config description: koran documents 
- Download size: 35.42 MiB
- Dataset size: 117.54 MiB
- Auto-cached (documentation): Yes 
- Splits: 
| Split | Examples | 
|---|---|
| 'train' | 537,128 | 
- Examples (tfds.as_dataframe):
opus/IT
- Config description: IT documents 
- Download size: 10.33 MiB
- Dataset size: 42.51 MiB
- Auto-cached (documentation): Yes 
- Splits: 
| Split | Examples | 
|---|---|
| 'train' | 347,817 | 
- Examples (tfds.as_dataframe):
opus/subtitles
- Config description: subtitles documents 
- Download size: 677.64 MiB
- Dataset size: 2.01 GiB
- Auto-cached (documentation): No 
- Splits: 
| Split | Examples | 
|---|---|
| 'train' | 22,512,639 | 
- Examples (tfds.as_dataframe):
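To compare the five predefined configs at a glance, the sizes and example counts listed above can be put in a small lookup table. This is only a convenience sketch: the numbers are copied from this page, while `OPUS_CONFIGS`, `configs_within`, and `total_train_examples` are hypothetical names introduced here.

```python
# Figures copied from the per-config sections; 2.01 GiB is converted to MiB.
OPUS_CONFIGS = {
    "medical":   {"download_mib": 34.29,  "dataset_mib": 188.85,      "train_examples": 1_108_752},
    "law":       {"download_mib": 46.99,  "dataset_mib": 214.44,      "train_examples": 719_372},
    "koran":     {"download_mib": 35.42,  "dataset_mib": 117.54,      "train_examples": 537_128},
    "IT":        {"download_mib": 10.33,  "dataset_mib": 42.51,       "train_examples": 347_817},
    "subtitles": {"download_mib": 677.64, "dataset_mib": 2.01 * 1024, "train_examples": 22_512_639},
}


def configs_within(budget_mib):
    """Return 'opus/<name>' configs whose prepared size fits the budget, smallest first."""
    fitting = [
        (meta["dataset_mib"], name)
        for name, meta in OPUS_CONFIGS.items()
        if meta["dataset_mib"] <= budget_mib
    ]
    return [f"opus/{name}" for _, name in sorted(fitting)]


def total_train_examples():
    """Sum the 'train' split sizes over all listed configs."""
    return sum(meta["train_examples"] for meta in OPUS_CONFIGS.values())


print(configs_within(200))
print(total_train_examples())
```

The returned names have the form `opus/<config>` and can be passed to a loader directly; the subtitles config accounts for most of the examples and is the only one listed as not auto-cached.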