
protein_net

Description:

ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training/validation/test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data-poor and data-rich regimes.

Homepage: https://github.com/aqlaboratory/proteinnet
Source code: tfds.datasets.protein_net.Builder
Versions:
- 1.0.0 (default): Initial release.
Auto-cached (documentation): No
Feature structure:

FeaturesDict({
    'evolutionary': Tensor(shape=(None, 21), dtype=float32),
    'id': Text(shape=(), dtype=string),
    'length': int32,
    'mask': Tensor(shape=(None,), dtype=bool),
    'primary': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=20)),
    'tertiary': Tensor(shape=(None, 3), dtype=float32),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
evolutionary	Tensor	(None, 21)	float32
id	Text		string
length	Tensor		int32
mask	Tensor	(None,)	bool
primary	Sequence(ClassLabel)	(None,)	int64
tertiary	Tensor	(None, 3)	float32
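
As a rough illustration of how the variable-length features line up, the sketch below filters a per-residue feature by `mask`. It assumes that `mask` has one entry per residue of `primary` and, following the upstream ProteinNet convention, marks residues whose coordinates are available; this page does not state either point. The helper names and the toy values are hypothetical, and the inputs are plain Python lists (for example, obtained from an example converted with `tfds.as_numpy` and `.tolist()`).

```python
def resolved_positions(mask):
    """Return the indices at which the mask is true."""
    return [i for i, keep in enumerate(mask) if keep]


def select_resolved(per_residue, mask):
    """Keep only the per-residue entries whose mask value is true.

    ASSUMPTION: `per_residue` and `mask` have one entry per residue,
    in the same order. This is not documented on the catalog page.
    """
    if len(per_residue) != len(mask):
        raise ValueError(
            f"length mismatch: {len(per_residue)} values vs {len(mask)} mask entries"
        )
    return [value for value, keep in zip(per_residue, mask) if keep]


# Toy values, made up for illustration (not taken from the data set).
toy_primary = [0, 4, 19, 7]          # class ids in the range [0, 20)
toy_mask = [True, False, True, True]

kept_ids = select_resolved(toy_primary, toy_mask)
kept_idx = resolved_positions(toy_mask)
```

The explicit length check is there because the shapes listed above only say that both features are variable-length; if the two lengths ever differ, the assumption does not hold and the helper fails loudly instead of silently truncating.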

Supervised keys (See as_supervised doc): ('primary', 'tertiary')
Figure (tfds.show_examples): Not supported.
Citation:

@article{ProteinNet19,
title = { {ProteinNet}: a standardized data set for machine learning of protein structure},
author = {AlQuraishi, Mohammed},
journal = {BMC bioinformatics},
volume = {20},
number = {1},
pages = {1--10},
year = {2019},
publisher = {BioMed Central}
}
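
A minimal sketch of how one of the configs and splits listed on this page might be loaded. The helper names (`dataset_name`, `train_split`, `load_split`) are hypothetical; only `tfds.load` and its `split` and `as_supervised` arguments come from TFDS. The numeric suffixes of the `train_*` splits are taken from the split tables below; this page does not explain what they mean. Note that calling `load_split` triggers a multi-gigabyte download on first use.

```python
CASP_EDITIONS = (7, 8, 9, 10, 11, 12)         # configs listed on this page
TRAIN_SUFFIXES = (30, 50, 70, 90, 95, 100)    # suffixes of the train_* splits


def dataset_name(casp=7):
    """Build the TFDS name for a config, e.g. 'protein_net/casp7'."""
    if casp not in CASP_EDITIONS:
        raise ValueError(f"unknown CASP edition: {casp}")
    return f"protein_net/casp{casp}"


def train_split(suffix=30):
    """Build a training split name, e.g. 'train_30'."""
    if suffix not in TRAIN_SUFFIXES:
        raise ValueError(f"unknown train split suffix: {suffix}")
    return f"train_{suffix}"


def load_split(casp=7, split="validation", as_supervised=False):
    """Load one split as a tf.data.Dataset.

    TFDS is imported lazily so the name helpers work without it.
    With as_supervised=True, each element is a (primary, tertiary)
    tuple, following the supervised keys listed above.
    """
    import tensorflow_datasets as tfds

    return tfds.load(dataset_name(casp), split=split, as_supervised=as_supervised)
```

For example, `load_split(casp=12, split=train_split(95), as_supervised=True)` would request the `'train_95'` split of `protein_net/casp12`.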

protein_net/casp7 (default config)

Download size: 3.18 GiB
Dataset size: 2.53 GiB
Splits:

Split	Examples
`'test'`	93
`'train_100'`	34,557
`'train_30'`	10,333
`'train_50'`	13,024
`'train_70'`	15,207
`'train_90'`	17,611
`'train_95'`	17,938
`'validation'`	224


protein_net/casp8

Download size: 4.96 GiB
Dataset size: 3.55 GiB
Splits:

Split	Examples
`'test'`	120
`'train_100'`	48,087
`'train_30'`	13,881
`'train_50'`	17,970
`'train_70'`	21,191
`'train_90'`	24,556
`'train_95'`	25,035
`'validation'`	224


protein_net/casp9

Download size: 6.65 GiB
Dataset size: 4.54 GiB
Splits:

Split	Examples
`'test'`	116
`'train_100'`	60,350
`'train_30'`	16,973
`'train_50'`	22,172
`'train_70'`	26,263
`'train_90'`	30,513
`'train_95'`	31,128
`'validation'`	224


protein_net/casp10

Download size: 8.65 GiB
Dataset size: 5.57 GiB
Splits:

Split	Examples
`'test'`	95
`'train_100'`	73,116
`'train_30'`	19,495
`'train_50'`	25,897
`'train_70'`	31,001
`'train_90'`	36,258
`'train_95'`	37,033
`'validation'`	224


protein_net/casp11

Download size: 10.81 GiB
Dataset size: 6.72 GiB
Splits:

Split	Examples
`'test'`	81
`'train_100'`	87,573
`'train_30'`	22,344
`'train_50'`	29,936
`'train_70'`	36,005
`'train_90'`	42,507
`'train_95'`	43,544
`'validation'`	224


protein_net/casp12

Download size: 13.18 GiB
Dataset size: 8.05 GiB
Splits:

Split	Examples
`'test'`	40
`'train_100'`	104,059
`'train_30'`	25,299
`'train_50'`	34,039
`'train_70'`	41,522
`'train_90'`	49,600
`'train_95'`	50,914
`'validation'`	224

