um005

References:

bible

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:um005/bible')
  • Description:
UMC005 English-Urdu is a parallel corpus of texts in English and Urdu language with sentence alignments. The corpus can be used for experiments with statistical machine translation.

The texts come from four different sources:
- Quran
- Bible
- Penn Treebank (Wall Street Journal)
- Emille corpus

The authors provide the religious texts of Quran and Bible for direct download. Because of licensing reasons, Penn and Emille texts cannot be redistributed freely. However, if you already hold a license for the original corpora, we are able to provide scripts that will recreate our data on your disk. Our modifications include but are not limited to the following:

- Correction of Urdu translations and manual sentence alignment of the Emille texts.
- Manually corrected sentence alignment of the other corpora.
- Our data split (training-development-test) so that our published experiments can be reproduced.
- Tokenization (optional, but needed to reproduce our experiments).
- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, removal of Urdu diacritics.
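
The optional normalization step described above can be illustrated with a minimal sketch. This is not the authors' script: the function name `normalize`, the mapping tables, and the choice to treat every Unicode nonspacing mark as a diacritic are assumptions made for illustration. It maps Extended Arabic-Indic digits (U+06F0 to U+06F9) to European digits, maps a few Arabic-script punctuation marks to their European counterparts, and optionally strips combining marks.

```python
import unicodedata

# Hypothetical mapping tables, chosen for illustration only.
# Extended Arabic-Indic digits U+06F0..U+06F9 -> "0".."9"
URDU_DIGITS = {0x06F0 + i: str(i) for i in range(10)}
# Arabic full stop, comma, semicolon, question mark -> European equivalents
URDU_PUNCT = {0x06D4: ".", 0x060C: ",", 0x061B: ";", 0x061F: "?"}


def normalize(text, strip_diacritics=True):
    """Map Urdu digits and punctuation to European forms; optionally drop diacritics."""
    text = text.translate({**URDU_DIGITS, **URDU_PUNCT})
    if strip_diacritics:
        # Assumption: every nonspacing mark (category "Mn") counts as a diacritic.
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text


print(normalize("\u06F1\u06F2\u06F3\u06D4"))  # prints 123.
```

Stripping all nonspacing marks is broader than removing only the Urdu vowel signs; a stricter version would list the intended code points explicitly.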
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'test' 257
'train' 7400
'validation' 300
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ur",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}
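
As a rough illustration of how a record with this schema might be consumed, the sketch below extracts an (English, Urdu) sentence pair. It assumes that `translation` arrives as a mapping keyed by the language codes `en` and `ur`, and that values may be either `str` or UTF-8 `bytes` depending on how the dataset is iterated. The helper name `to_pair` and the placeholder record are hypothetical, not real corpus data.

```python
def to_pair(example):
    """Return (english, urdu) text from one record shaped like the schema above."""

    def as_text(value):
        # Values may be bytes (e.g. when iterating as NumPy) or already str.
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    translation = example["translation"]
    return as_text(translation["en"]), as_text(translation["ur"])


# Placeholder record for illustration only; not taken from the corpus.
record = {
    "id": b"0",
    "translation": {"ur": b"<urdu sentence>", "en": b"<english sentence>"},
}
print(to_pair(record))  # prints ('<english sentence>', '<urdu sentence>')
```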

quran

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:um005/quran')
  • Description:
UMC005 English-Urdu is a parallel corpus of texts in English and Urdu language with sentence alignments. The corpus can be used for experiments with statistical machine translation.

The texts come from four different sources:
- Quran
- Bible
- Penn Treebank (Wall Street Journal)
- Emille corpus

The authors provide the religious texts of Quran and Bible for direct download. Because of licensing reasons, Penn and Emille texts cannot be redistributed freely. However, if you already hold a license for the original corpora, we are able to provide scripts that will recreate our data on your disk. Our modifications include but are not limited to the following:

- Correction of Urdu translations and manual sentence alignment of the Emille texts.
- Manually corrected sentence alignment of the other corpora.
- Our data split (training-development-test) so that our published experiments can be reproduced.
- Tokenization (optional, but needed to reproduce our experiments).
- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, removal of Urdu diacritics.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'test' 200
'train' 6000
'validation' 214
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ur",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

all

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:um005/all')
  • Description:
UMC005 English-Urdu is a parallel corpus of texts in English and Urdu language with sentence alignments. The corpus can be used for experiments with statistical machine translation.

The texts come from four different sources:
- Quran
- Bible
- Penn Treebank (Wall Street Journal)
- Emille corpus

The authors provide the religious texts of Quran and Bible for direct download. Because of licensing reasons, Penn and Emille texts cannot be redistributed freely. However, if you already hold a license for the original corpora, we are able to provide scripts that will recreate our data on your disk. Our modifications include but are not limited to the following:

- Correction of Urdu translations and manual sentence alignment of the Emille texts.
- Manually corrected sentence alignment of the other corpora.
- Our data split (training-development-test) so that our published experiments can be reproduced.
- Tokenization (optional, but needed to reproduce our experiments).
- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, removal of Urdu diacritics.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'test' 457
'train' 13400
'validation' 514
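
The split sizes listed for `all` equal the sums of the corresponding `bible` and `quran` splits, which suggests (but does not by itself prove) that `all` is the concatenation of the two. A small check of that arithmetic, using only the counts listed on this page; the names `SPLITS` and `combined` are made up for this sketch:

```python
# Split sizes copied from the tables on this page.
SPLITS = {
    "bible": {"test": 257, "train": 7400, "validation": 300},
    "quran": {"test": 200, "train": 6000, "validation": 214},
    "all": {"test": 457, "train": 13400, "validation": 514},
}


def combined(*configs):
    """Sum the per-split example counts over the given config names."""
    totals = {}
    for name in configs:
        for split, count in SPLITS[name].items():
            totals[split] = totals.get(split, 0) + count
    return totals


print(combined("bible", "quran") == SPLITS["all"])  # prints True
```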
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ur",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}