- Description:
ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training / validation / test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data-poor and data-rich regimes.
- Source code: tfds.datasets.protein_net.Builder
- Versions:
  - 1.0.0 (default): Initial release.
- Auto-cached (documentation): No
- Feature structure:

    FeaturesDict({
        'evolutionary': Tensor(shape=(None, 21), dtype=float32),
        'id': Text(shape=(), dtype=string),
        'length': int32,
        'mask': Tensor(shape=(None,), dtype=bool),
        'primary': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=20)),
        'tertiary': Tensor(shape=(None, 3), dtype=float32),
    })
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| evolutionary | Tensor | (None, 21) | float32 | |
| id | Text | | string | |
| length | Tensor | | int32 | |
| mask | Tensor | (None,) | bool | |
| primary | Sequence(ClassLabel) | (None,) | int64 | |
| tertiary | Tensor | (None, 3) | float32 | |
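
The declared shapes and class count can be turned into a quick sanity check on a decoded example. The sketch below is illustrative only: `check_example` is a hypothetical helper, the toy example is synthetic, and the comparison of 'length' with the number of 'primary' labels is an assumption that this page does not state. The check places no constraint on how many rows 'tertiary' or 'mask' contain.

```python
# Values taken from the feature structure above.
NUM_AMINO_ACIDS = 20  # num_classes of the 'primary' ClassLabel
PSSM_WIDTH = 21       # second dimension of 'evolutionary'
COORD_WIDTH = 3       # second dimension of 'tertiary'


def check_example(example):
    """Return a list of problems; an empty list means the example fits the spec."""
    problems = []
    if any(len(row) != PSSM_WIDTH for row in example["evolutionary"]):
        problems.append("evolutionary rows must have 21 columns")
    if any(len(row) != COORD_WIDTH for row in example["tertiary"]):
        problems.append("tertiary rows must have 3 columns")
    if any(not 0 <= label < NUM_AMINO_ACIDS for label in example["primary"]):
        problems.append("primary labels must lie in [0, 20)")
    # Assumption (not stated on this page): 'length' counts the residues in 'primary'.
    if example["length"] != len(example["primary"]):
        problems.append("length does not match the number of primary labels")
    return problems


# Synthetic two-residue example, for illustration only.
toy_example = {
    "id": "toy",
    "length": 2,
    "primary": [0, 19],
    "evolutionary": [[0.0] * 21, [0.0] * 21],
    "mask": [True, False],
    "tertiary": [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]],
}

print(check_example(toy_example))
```

With real data, the same function could be applied to examples converted with `tfds.as_numpy`, since NumPy arrays support the `len` and iteration calls used here.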
- Supervised keys (See as_supervised doc): ('primary', 'tertiary')
- Figure (tfds.show_examples): Not supported.
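
As a minimal sketch of how the supervised keys are used: passing `as_supervised=True` to `tfds.load` yields `(primary, tertiary)` tuples instead of the full feature dictionary. The helper names `dataset_name` and `load_supervised` are hypothetical, and the import is deferred so that defining the functions does not require TensorFlow Datasets to be installed. Calling `load_supervised` downloads and prepares the chosen config on first use.

```python
# Config names as listed on this page.
KNOWN_CONFIGS = ("casp7", "casp8", "casp9", "casp10", "casp11", "casp12")


def dataset_name(config="casp7"):
    """Build the 'protein_net/<config>' name expected by tfds.load."""
    if config not in KNOWN_CONFIGS:
        raise ValueError(f"unknown ProteinNet config: {config!r}")
    return f"protein_net/{config}"


def load_supervised(config="casp7", split="validation"):
    """Load a split as (primary, tertiary) pairs.

    Hypothetical convenience wrapper; downloads several GiB on first call.
    """
    import tensorflow_datasets as tfds  # deferred: only needed when loading

    return tfds.load(dataset_name(config), split=split, as_supervised=True)


# Example usage (commented out because it triggers the download):
# ds = load_supervised("casp7", "validation")
# for primary, tertiary in ds.take(1):
#     print(primary.shape, tertiary.shape)
```

Without `as_supervised=True`, each element is a dictionary with all six features, which is the form to use when 'mask' or 'evolutionary' is needed.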
- Citation:
@article{ProteinNet19,
title = { {ProteinNet}: a standardized data set for machine learning of protein structure},
author = {AlQuraishi, Mohammed},
journal = {BMC bioinformatics},
volume = {20},
number = {1},
pages = {1--10},
year = {2019},
publisher = {BioMed Central}
}
protein_net/casp7 (default config)
- Download size: 3.18 GiB
- Dataset size: 2.53 GiB
- Splits:

| Split | Examples |
|---|---|
| 'test' | 93 |
| 'train_100' | 34,557 |
| 'train_30' | 10,333 |
| 'train_50' | 13,024 |
| 'train_70' | 15,207 |
| 'train_90' | 17,611 |
| 'train_95' | 17,938 |
| 'validation' | 224 |
protein_net/casp8
- Download size: 4.96 GiB
- Dataset size: 3.55 GiB
- Splits:

| Split | Examples |
|---|---|
| 'test' | 120 |
| 'train_100' | 48,087 |
| 'train_30' | 13,881 |
| 'train_50' | 17,970 |
| 'train_70' | 21,191 |
| 'train_90' | 24,556 |
| 'train_95' | 25,035 |
| 'validation' | 224 |
protein_net/casp9
- Download size: 6.65 GiB
- Dataset size: 4.54 GiB
- Splits:

| Split | Examples |
|---|---|
| 'test' | 116 |
| 'train_100' | 60,350 |
| 'train_30' | 16,973 |
| 'train_50' | 22,172 |
| 'train_70' | 26,263 |
| 'train_90' | 30,513 |
| 'train_95' | 31,128 |
| 'validation' | 224 |
protein_net/casp10
- Download size: 8.65 GiB
- Dataset size: 5.57 GiB
- Splits:

| Split | Examples |
|---|---|
| 'test' | 95 |
| 'train_100' | 73,116 |
| 'train_30' | 19,495 |
| 'train_50' | 25,897 |
| 'train_70' | 31,001 |
| 'train_90' | 36,258 |
| 'train_95' | 37,033 |
| 'validation' | 224 |
protein_net/casp11
- Download size: 10.81 GiB
- Dataset size: 6.72 GiB
- Splits:

| Split | Examples |
|---|---|
| 'test' | 81 |
| 'train_100' | 87,573 |
| 'train_30' | 22,344 |
| 'train_50' | 29,936 |
| 'train_70' | 36,005 |
| 'train_90' | 42,507 |
| 'train_95' | 43,544 |
| 'validation' | 224 |
protein_net/casp12
- Download size: 13.18 GiB
- Dataset size: 8.05 GiB
- Splits:

| Split | Examples |
|---|---|
| 'test' | 40 |
| 'train_100' | 104,059 |
| 'train_30' | 25,299 |
| 'train_50' | 34,039 |
| 'train_70' | 41,522 |
| 'train_90' | 49,600 |
| 'train_95' | 50,914 |
| 'validation' | 224 |
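
To make the data-poor versus data-rich range concrete, the split counts listed for each config can be compared directly. The sketch below copies the 'train_100' and 'train_30' counts from the tables on this page and computes two ratios per config. It assumes, following the ProteinNet paper, that the numeric suffix of a training split is the sequence-identity threshold (in percent) used to thin the training set; the page itself does not state this.

```python
# Example counts copied from the split tables on this page.
TRAIN_100 = {
    "casp7": 34557, "casp8": 48087, "casp9": 60350,
    "casp10": 73116, "casp11": 87573, "casp12": 104059,
}
TRAIN_30 = {
    "casp7": 10333, "casp8": 13881, "casp9": 16973,
    "casp10": 19495, "casp11": 22344, "casp12": 25299,
}

# Size of each unthinned training set relative to the smallest one (casp7).
growth = {name: count / TRAIN_100["casp7"] for name, count in TRAIN_100.items()}

# Share of each training set that remains in the most heavily thinned split.
kept_at_30 = {name: TRAIN_30[name] / TRAIN_100[name] for name in TRAIN_100}

for name in TRAIN_100:
    print(f"{name}: {growth[name]:.2f}x casp7, "
          f"train_30 keeps {kept_at_30[name]:.1%} of train_100")
```

The validation split has 224 examples in every config, so only the training and test sizes vary between CASP editions.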
- Examples (tfds.as_dataframe):