- Description:
Identifying parallel sentences in comparable corpora. Given two sentence-split monolingual corpora, participant systems are expected to identify pairs of sentences that are translations of each other.
The BUCC mining task is a shared task on parallel sentence extraction from two monolingual corpora, a subset of which is assumed to be parallel. It has been available since 2016. For each language pair, the shared task provides a monolingual corpus for each language and a gold mapping list containing the true translation pairs, which serve as the ground truth. The task is to construct a list of translation pairs from the monolingual corpora. The constructed list is compared to the ground truth and evaluated in terms of the F1 measure.
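The evaluation described above can be illustrated with a minimal sketch. It assumes that a predicted pair counts as correct only when its exact (source ID, target ID) tuple appears in the gold list; the function name `score_pairs` and the sentence IDs are made up for illustration, and the official scoring script may differ in detail.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source_id, target_id)


def score_pairs(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Dict[str, float]:
    """Precision, recall and F1 of predicted pairs against gold pairs.

    Assumption: exact match on the (source_id, target_id) tuple.
    """
    predicted_set = set(predicted)
    gold_set = set(gold)
    true_positives = len(predicted_set & gold_set)

    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical IDs, for illustration only.
gold = [("de-1", "en-7"), ("de-2", "en-3"), ("de-5", "en-9")]
predicted = [("de-1", "en-7"), ("de-2", "en-4"), ("de-5", "en-9"), ("de-6", "en-1")]

scores = score_pairs(predicted, gold)
print(scores)
```

Here two of the four predicted pairs are in the gold list, so precision is 2/4, recall is 2/3, and F1 is 4/7.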
Homepage: https://comparable.limsi.fr/bucc2018/
Source code: tfds.datasets.bucc.Builder
Versions:
1.0.0 (default): Initial release.
Auto-cached (documentation): Yes
Feature structure:
FeaturesDict({
    'source_id': Text(shape=(), dtype=string),
    'source_sentence': Text(shape=(), dtype=string),
    'target_id': Text(shape=(), dtype=string),
    'target_sentence': Text(shape=(), dtype=string),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| source_id | Text | | string | |
| source_sentence | Text | | string | |
| target_id | Text | | string | |
| target_sentence | Text | | string | |
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:
@inproceedings{zweigenbaum2018overview,
title={Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora},
author={Zweigenbaum, Pierre and Sharoff, Serge and Rapp, Reinhard},
booktitle={Proceedings of 11th Workshop on Building and Using Comparable Corpora},
pages={39--42},
year={2018}
}
bucc/bucc_de (default config)
Download size: 29.30 MiB
Dataset size: 3.21 MiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 9,580 |
| 'validation' | 1,038 |
bucc/bucc_fr
Download size: 21.65 MiB
Dataset size: 2.90 MiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 9,086 |
| 'validation' | 929 |
bucc/bucc_zh
Download size: 6.79 MiB
Dataset size: 615.20 KiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 1,899 |
| 'validation' | 257 |
bucc/bucc_ru
Download size: 39.44 MiB
Dataset size: 6.36 MiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 14,435 |
| 'validation' | 2,374 |