genomics_ood
Bacteria identification based on genomic sequences holds the promise of early
detection of diseases, but requires a model that can output low-confidence
predictions on out-of-distribution (OOD) genomic sequences from new bacteria
that were not present in the training data.
We introduce a genomics dataset for OOD detection that allows other researchers
to benchmark progress on this important problem. New bacterial classes are
gradually discovered over the years, so grouping classes by year of discovery is
a natural way to mimic in-distribution and OOD examples.
The dataset contains genomic sequences sampled from 10 bacteria classes that
were discovered before the year 2011 as in-distribution classes, 60 bacteria
classes discovered between 2011 and 2016 as OOD for validation, and another 60
different bacteria classes discovered after 2016 as OOD for test, for a total of
130 bacteria classes. Note that training, validation, and test data are provided
for the in-distribution classes, while only validation and test data are
provided for the OOD classes. By its nature, OOD data is not available at
training time.
Each genomic sequence is 250 characters long and is composed of the characters
{A, C, G, T}. Each class contributes 100,000 examples to the training set and
10,000 examples to each validation or test set in which it appears.
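As a quick sanity check, the per-class sample sizes and class counts stated above can be multiplied out to give the expected number of examples per split. The sketch below uses only the numbers from the description; the variable names are illustrative and not part of the dataset or of any library.

```python
# Class counts and per-class sample sizes, taken from the description above.
# All names here are illustrative; they are not part of the dataset API.
IN_DIST_CLASSES = 10          # classes discovered before 2011
OOD_CLASSES_PER_SPLIT = 60    # OOD classes in each of validation and test
TRAIN_PER_CLASS = 100_000
EVAL_PER_CLASS = 10_000

# Expected number of examples in each split.
expected_examples = {
    "train": IN_DIST_CLASSES * TRAIN_PER_CLASS,
    "validation": IN_DIST_CLASSES * EVAL_PER_CLASS,
    "test": IN_DIST_CLASSES * EVAL_PER_CLASS,
    "validation_ood": OOD_CLASSES_PER_SPLIT * EVAL_PER_CLASS,
    "test_ood": OOD_CLASSES_PER_SPLIT * EVAL_PER_CLASS,
}

# In-distribution classes plus the two disjoint groups of OOD classes.
total_classes = IN_DIST_CLASSES + 2 * OOD_CLASSES_PER_SPLIT

for split, count in expected_examples.items():
    print(f"{split}: {count:,}")
print(f"total classes: {total_classes}")
```

These products agree with the example counts listed for the five splits and with the 130 classes of the label feature.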
For each example, the features include:
- seq: the input DNA sequence, composed of the characters {A, C, G, T}.
- label: the name of the bacteria class.
- seq_info: the source of the DNA sequence, i.e., the genome name, NCBI accession number, and the position where it was sampled from.
- domain: whether the bacteria class is in-distribution (in) or OOD (ood).
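Because seq is a plain string over {A, C, G, T}, a model usually needs it converted to numbers first. The following is a minimal sketch of one possible encoding, not the preprocessing used by the dataset authors. The index assignment (A=0, C=1, G=2, T=3) and the helper names `encode_sequence`, `one_hot`, and `is_ood` are assumptions made for illustration. Byte strings are accepted because text features are typically returned as bytes when a TensorFlow dataset is iterated as NumPy values.

```python
ALPHABET = "ACGT"  # assumed ordering; any fixed ordering would work
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def _to_str(value):
    """Decode bytes to str; leave str values unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def encode_sequence(seq):
    """Map a DNA sequence to a list of integer indices (A=0, C=1, G=2, T=3)."""
    seq = _to_str(seq)
    try:
        return [CHAR_TO_INDEX[ch] for ch in seq]
    except KeyError as err:
        raise ValueError(f"unexpected character {err.args[0]!r} in sequence") from err


def one_hot(indices):
    """Turn integer indices into one-hot rows of length len(ALPHABET)."""
    return [[1 if i == idx else 0 for i in range(len(ALPHABET))] for idx in indices]


def is_ood(domain):
    """Return True when the domain feature marks an example as OOD."""
    return _to_str(domain) == "ood"


print(encode_sequence("ACGT"))
print(one_hot(encode_sequence("AT")))
print(is_ood(b"ood"))
```

Raising an error on unknown characters is a deliberate choice in this sketch: the description states that sequences contain only the four bases, so anything else most likely indicates a decoding problem.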
Further details of the dataset can be found in the supplementary material of the paper.
| Split | Examples |
|--------------------|-----------|
| 'test' | 100,000 |
| 'test_ood' | 600,000 |
| 'train' | 1,000,000 |
| 'validation' | 100,000 |
| 'validation_ood' | 600,000 |
FeaturesDict({
'domain': Text(shape=(), dtype=string),
'label': ClassLabel(shape=(), dtype=int64, num_classes=130),
'seq': Text(shape=(), dtype=string),
'seq_info': Text(shape=(), dtype=string),
})
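A feature dictionary like this is what each example looks like after loading the dataset through TensorFlow Datasets. The sketch below shows one way this might be done with `tfds.load` and `tfds.as_numpy`. It assumes that the `tensorflow_datasets` package is installed and that the dataset can be downloaded and prepared in your environment. The helper names `load_split` and `preview` and the `KNOWN_SPLITS` tuple are hypothetical conveniences, not part of the library.

```python
# Split names as listed for this dataset.
KNOWN_SPLITS = ("train", "validation", "test", "validation_ood", "test_ood")


def load_split(split, data_dir=None):
    """Load one split of genomics_ood as a tf.data.Dataset of feature dicts."""
    if split not in KNOWN_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {KNOWN_SPLITS}")
    # Imported lazily so the split check works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load("genomics_ood", split=split, data_dir=data_dir)


def preview(split="validation_ood", n=2):
    """Print the domain, integer label, and first 20 bases of n examples."""
    import tensorflow_datasets as tfds

    dataset = load_split(split).take(n)
    for example in tfds.as_numpy(dataset):
        # Text features arrive as bytes; the ClassLabel arrives as an integer.
        print(example["domain"], example["label"], example["seq"][:20])
```

Calling `preview()` triggers the download and preparation step on first use, which may take a while given the prepared dataset size of roughly 927 MiB.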
| Feature | Class | Shape | Dtype | Description |
|----------|--------------|-------|--------|-------------|
| | FeaturesDict | | | |
| domain | Text | | string | |
| label | ClassLabel | | int64 | |
| seq | Text | | string | |
| seq_info | Text | | string | |
@inproceedings{ren2019likelihood,
title={Likelihood ratios for out-of-distribution detection},
author={Ren, Jie and
Liu, Peter J and
Fertig, Emily and
Snoek, Jasper and
Poplin, Ryan and
Depristo, Mark and
Dillon, Joshua and
Lakshminarayanan, Balaji},
booktitle={Advances in Neural Information Processing Systems},
pages={14707--14718},
year={2019}
}
Last updated 2022-12-06 UTC.
- Additional documentation: Explore on Papers With Code (https://paperswithcode.com/dataset/real-bacteria-dataset)
- Homepage: https://github.com/google-research/google-research/tree/master/genomics_ood
- Source code: tfds.structured.GenomicsOod (https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/structured/genomics_ood.py)
- Versions: 0.0.1 (default): No release notes.
- Download size: Unknown size
- Dataset size: 926.87 MiB
- Auto-cached (documentation: https://www.tensorflow.org/datasets/performances#auto-caching): No
- Supervised keys (see the as_supervised doc: https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args): ('seq', 'label')
- Figure (tfds.show_examples): Not supported.