- Description:
Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like "flights from New York to Florida" and "flights from Florida to New York". This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for the problem of paraphrase identification.
For further details, see the accompanying paper: PAWS: Paraphrase Adversaries from Word Scrambling at https://arxiv.org/abs/1904.01130
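As a minimal illustration of why lexical overlap alone is not enough, the sketch below compares the two flight queries from the description. The helper name `bag_of_words` is hypothetical and not part of the dataset or the paper; it only shows that the two sentences share the same bag of words while differing in word order.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase whitespace-separated tokens (hypothetical helper)."""
    return Counter(sentence.lower().split())


a = "flights from New York to Florida"
b = "flights from Florida to New York"

# Same multiset of words, so a bag-of-words comparison sees no difference.
same_bag = bag_of_words(a) == bag_of_words(b)

# The token order differs, and that is what changes the meaning.
same_order = a.lower().split() == b.lower().split()

print("same bag of words:", same_bag)
print("same word order:", same_order)
```

A model that ignores word order would treat this pair as identical, which is the failure mode the dataset is meant to expose.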
This corpus contains pairs generated from Wikipedia pages using both word swapping and back translation. All pairs have human judgements on both paraphrasing and fluency, and they are split into Train/Dev/Test sections.
All files are in TSV format with four columns:
- id: A unique id for each pair.
- sentence1: The first sentence.
- sentence2: The second sentence.
- (noisy_)label: (Noisy) label for each pair.
Each label has two possible values: 0 indicates that the two sentences have different meanings, while 1 indicates that the pair is a paraphrase.
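The column layout above can be read with Python's standard `csv` module. The following is a sketch under stated assumptions: it assumes each file starts with a header row naming the columns, and the two sample rows (including their labels) are hypothetical and not taken from the dataset. The function name `read_pairs` is likewise illustrative.

```python
import csv
import io

# Hypothetical sample in the four-column layout described above.
# Assumption: the first row is a header naming the columns.
SAMPLE_TSV = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tflights from New York to Florida\tflights from Florida to New York\t0\n"
    "2\tflights from New York to Florida\tflights to Florida from New York\t1\n"
)


def read_pairs(handle):
    """Parse sentence pairs from a TSV file-like object."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        # The label column is named "label" or "noisy_label".
        label_key = "label" if "label" in row else "noisy_label"
        pairs.append(
            {
                "id": row["id"],
                "sentence1": row["sentence1"],
                "sentence2": row["sentence2"],
                "label": int(row[label_key]),  # 0 = different, 1 = paraphrase
            }
        )
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE_TSV))
for pair in pairs:
    kind = "paraphrase" if pair["label"] == 1 else "different meaning"
    print(pair["id"], kind)
```

To read a real file, pass an open file handle (for example one opened with `newline=""` and `encoding="utf-8"`) instead of the `io.StringIO` sample.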
Additional Documentation: Explore on Papers With Code
Source code:
tfds.datasets.paws_wiki.Builder
Versions:
- 1.0.0: Initial version.
- 1.1.0 (default): Adds configs for different subsets and supports raw text.
Download size:
57.47 MiB
Feature structure:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
'sentence1': Text(shape=(), dtype=string),
'sentence2': Text(shape=(), dtype=string),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| label | ClassLabel | | int64 | |
| sentence1 | Text | | string | |
| sentence2 | Text | | string | |
Supervised keys (See as_supervised doc): None
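Because no supervised keys are defined, each example is a dictionary with the keys `label`, `sentence1`, and `sentence2` rather than an `(input, label)` tuple. The sketch below shows one way to load a config with `tfds.load`. The helper names `config_name`, `load_paws_wiki`, and `preview` are hypothetical; the config naming pattern is inferred from the config names listed on this page. `tensorflow_datasets` is imported lazily so that the naming helper can be used without it installed.

```python
def config_name(subset="labeled_final", tokenized=True):
    """Build a dataset name such as 'paws_wiki/labeled_final_tokenized'."""
    suffix = "tokenized" if tokenized else "raw"
    return f"paws_wiki/{subset}_{suffix}"


def load_paws_wiki(subset="labeled_final", tokenized=True, split="train"):
    """Load one split of a paws_wiki config as a tf.data.Dataset of dicts."""
    # Imported here so the module can be imported without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(config_name(subset, tokenized), split=split)


def preview(dataset, count=3):
    """Print the first few pairs; text features arrive as UTF-8 bytes."""
    import tensorflow_datasets as tfds

    for example in tfds.as_numpy(dataset.take(count)):
        print(int(example["label"]),
              example["sentence1"].decode("utf-8"),
              "|",
              example["sentence2"].decode("utf-8"))


print(config_name())                       # default config name
print(config_name("labeled_swap", False))  # raw-text variant of another subset
```

Calling `preview(load_paws_wiki(split="validation"))` downloads and prepares the data on first use, so it needs network access and enough disk space for the chosen config.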
Figure (tfds.show_examples): Not supported.
Citation:
@InProceedings{paws2019naacl,
title = { {PAWS: Paraphrase Adversaries from Word Scrambling} },
author = {Zhang, Yuan and Baldridge, Jason and He, Luheng},
booktitle = {Proc. of NAACL},
year = {2019}
}
paws_wiki/labeled_final_tokenized (default config)
Config description: Subset: labeled_final tokenized: True
Dataset size:
17.96 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 8,000 |
'train' | 49,401 |
'validation' | 8,000 |
paws_wiki/labeled_final_raw
Config description: Subset: labeled_final tokenized: False
Dataset size:
17.57 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 8,000 |
'train' | 49,401 |
'validation' | 8,000 |
paws_wiki/labeled_swap_tokenized
Config description: Subset: labeled_swap tokenized: True
Dataset size:
8.79 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'train' | 30,397 |
paws_wiki/labeled_swap_raw
Config description: Subset: labeled_swap tokenized: False
Dataset size:
8.60 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'train' | 30,397 |
paws_wiki/unlabeled_final_tokenized
Config description: Subset: unlabeled_final tokenized: True
Dataset size:
177.89 MiB
Auto-cached (documentation): Yes (validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'train' | 645,652 |
'validation' | 10,000 |