cardiotox
The Drug Cardiotoxicity dataset [1-2] is a molecule classification task: detect
cardiotoxicity caused by binding to the hERG target, a protein associated with
heart rhythm. The data covers over 9,000 molecules with hERG activity.
The data is divided into four splits: train, test-iid, test-ood1, and test-ood2.
Each molecule in the dataset has 2D graph annotations designed to facilitate
graph neural network modeling. Nodes are the atoms of the molecule and edges
are the bonds. Each atom is represented as a vector encoding basic atom
information such as atom type. Bonds are encoded as vectors in the same way.
We include the Tanimoto fingerprint distance (to the training data) for each
molecule in the test sets to facilitate research on distributional shift in
the graph domain.
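To make the distance feature concrete, here is a minimal sketch of a Tanimoto distance between binary fingerprints, plus the mean distance to the k nearest training fingerprints. The set-of-bit-indices representation, the helper names, and the reading of `dist2topk_nbs` as a mean over the k nearest neighbours are assumptions for illustration; the page does not specify how the stored value is computed.

```python
from typing import Iterable, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - |a & b| / |a | b| for fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here: two empty fingerprints are identical
    return 1.0 - len(a & b) / union


def mean_distance_to_nearest(query: Set[int], train: Iterable[Set[int]], k: int) -> float:
    """Mean Tanimoto distance from `query` to its k closest training fingerprints.

    Assumption: this is one plausible way to summarize "distance to training
    data"; it is not confirmed to match the dataset's `dist2topk_nbs` feature.
    """
    distances = sorted(tanimoto_distance(query, fp) for fp in train)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy fingerprints (made up for the example).
train_fps = [{1, 2, 3, 4}, {1, 2, 5}, {7, 8, 9}]
query_fp = {1, 2, 3}
d = mean_distance_to_nearest(query_fp, train_fps, k=2)
print(d)
```

A larger value means the test molecule is further from anything seen in training, which is the sense in which such a feature can help when studying distributional shift.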
For each example, the features include:
atoms: a 2D tensor with shape (60, 27) storing node features. Each atom has 27
atom features. Molecules with fewer than 60 atoms are padded with zeros.
pairs: a 3D tensor with shape (60, 60, 12) storing edge features. Each edge has
12 edge features.
atom_mask: a 1D tensor with shape (60,) storing node masks. 1 indicates that the
corresponding atom is real; otherwise it is padding.
pair_mask: a 2D tensor with shape (60, 60) storing edge masks. 1 indicates that
the corresponding edge is real; otherwise it is padding.
active: a one-hot vector indicating whether the molecule is toxic. [0, 1]
indicates toxic and [1, 0] indicates non-toxic.
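The padding and mask conventions described above can be sketched with NumPy as follows. This is an illustrative sketch, not the dataset's preprocessing code: the helper names are hypothetical, the toy features are all ones, and building `pair_mask` as the outer product of `atom_mask` is an assumption (the dataset may mark only some atom pairs as real edges).

```python
import numpy as np

MAX_ATOMS, ATOM_DIM, PAIR_DIM = 60, 27, 12  # fixed sizes from the feature shapes


def pad_molecule(atom_feats, pair_feats):
    """Zero-pad per-atom and per-pair features to the fixed dataset shapes."""
    n = atom_feats.shape[0]
    atoms = np.zeros((MAX_ATOMS, ATOM_DIM), dtype=np.float32)
    pairs = np.zeros((MAX_ATOMS, MAX_ATOMS, PAIR_DIM), dtype=np.float32)
    atoms[:n] = atom_feats
    pairs[:n, :n] = pair_feats
    atom_mask = np.zeros((MAX_ATOMS,), dtype=np.float32)
    atom_mask[:n] = 1.0  # 1 marks a real atom, 0 marks padding
    # Assumption: a pair is treated as real when both of its atoms are real.
    pair_mask = np.outer(atom_mask, atom_mask)
    return atoms, pairs, atom_mask, pair_mask


def masked_mean(atoms, atom_mask):
    """Average node features over real atoms only, ignoring padded rows."""
    weights = atom_mask[:, None]
    return (atoms * weights).sum(axis=0) / np.maximum(atom_mask.sum(), 1.0)


def is_toxic(active):
    """Decode the one-hot label: [0, 1] is toxic, [1, 0] is non-toxic."""
    return int(np.argmax(active)) == 1


# Toy molecule with 5 atoms and all-ones features (made up for the example).
n_atoms = 5
atoms, pairs, atom_mask, pair_mask = pad_molecule(
    np.ones((n_atoms, ATOM_DIM)), np.ones((n_atoms, n_atoms, PAIR_DIM))
)
pooled = masked_mean(atoms, atom_mask)
print(atoms.shape, pairs.shape, int(atom_mask.sum()), is_toxic(np.array([0, 1])))
```

Dividing by the mask sum rather than by 60 keeps the pooled features independent of how much padding a molecule carries.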
References
[1]: V. B. Siramshetty et al. Critical Assessment of Artificial Intelligence
Methods for Prediction of hERG Channel Inhibition in the Big Data Era. JCIM,
2020. https://pubs.acs.org/doi/10.1021/acs.jcim.0c00884
[2]: K. Han et al. Reliable Graph Neural Networks for Drug Discovery Under
Distributional Shift. NeurIPS DistShift Workshop 2021.
https://arxiv.org/abs/2111.12951
| Split          | Examples |
|----------------|----------|
| 'test'         | 839      |
| 'test2'        | 177      |
| 'train'        | 6,523    |
| 'validation'   | 1,631    |
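As a quick check on the split table, the short sketch below sums the listed example counts and prints each split's share of the total. The counts are copied from the table; nothing else is assumed.

```python
# Example counts per split, copied from the table above.
split_sizes = {"train": 6523, "validation": 1631, "test": 839, "test2": 177}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

for name, count in split_sizes.items():
    print(f"{name:<12}{count:>6}{shares[name]:>8.1%}")
print(f"{'total':<12}{total:>6}")
```

The total comes to 9,170 examples, consistent with the description's "over 9,000 molecules".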
FeaturesDict({
    'active': Tensor(shape=(2,), dtype=int64),
    'atom_mask': Tensor(shape=(60,), dtype=float32),
    'atoms': Tensor(shape=(60, 27), dtype=float32),
    'dist2topk_nbs': Tensor(shape=(1,), dtype=float32),
    'molecule_id': string,
    'pair_mask': Tensor(shape=(60, 60), dtype=float32),
    'pairs': Tensor(shape=(60, 60, 12), dtype=float32),
})
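When feeding examples into a model, it can help to verify that each one matches the feature structure above. The following is a minimal sketch of such a check on NumPy arrays; `CARDIOTOX_SPEC` and `check_example` are hypothetical names introduced here, and the all-zero example and its `molecule_id` are made up.

```python
import numpy as np

# Expected shape and dtype per tensor feature, copied from the feature structure.
CARDIOTOX_SPEC = {
    "active": ((2,), np.int64),
    "atom_mask": ((60,), np.float32),
    "atoms": ((60, 27), np.float32),
    "dist2topk_nbs": ((1,), np.float32),
    "pair_mask": ((60, 60), np.float32),
    "pairs": ((60, 60, 12), np.float32),
}


def check_example(example):
    """Return a list of readable mismatches; the list is empty if all is well."""
    problems = []
    for name, (shape, dtype) in CARDIOTOX_SPEC.items():
        if name not in example:
            problems.append(f"missing feature: {name}")
            continue
        value = np.asarray(example[name])
        if value.shape != shape:
            problems.append(f"{name}: shape {value.shape}, expected {shape}")
        if value.dtype != np.dtype(dtype):
            problems.append(f"{name}: dtype {value.dtype}, expected {np.dtype(dtype)}")
    # molecule_id is a scalar string feature (str or bytes after decoding).
    if not isinstance(example.get("molecule_id"), (str, bytes)):
        problems.append("molecule_id: expected a string")
    return problems


# Hypothetical all-zero example used only to exercise the check.
example = {
    name: np.zeros(shape, dtype=dtype)
    for name, (shape, dtype) in CARDIOTOX_SPEC.items()
}
example["molecule_id"] = "mol-0"  # made-up identifier
print(check_example(example))
```

Returning a list of messages instead of raising on the first mismatch makes it easier to see every problem with a malformed example at once.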
| Feature       | Class        | Shape        | Dtype   | Description |
|---------------|--------------|--------------|---------|-------------|
|               | FeaturesDict |              |         |             |
| active        | Tensor       | (2,)         | int64   |             |
| atom_mask     | Tensor       | (60,)        | float32 |             |
| atoms         | Tensor       | (60, 27)     | float32 |             |
| dist2topk_nbs | Tensor       | (1,)         | float32 |             |
| molecule_id   | Tensor       |              | string  |             |
| pair_mask     | Tensor       | (60, 60)     | float32 |             |
| pairs         | Tensor       | (60, 60, 12) | float32 |             |
@ARTICLE{Han2021-tu,
title = "Reliable Graph Neural Networks for Drug Discovery Under
Distributional Shift",
author = "Han, Kehang and Lakshminarayanan, Balaji and Liu, Jeremiah",
month = nov,
year = 2021,
archivePrefix = "arXiv",
primaryClass = "cs.LG",
eprint = "2111.12951"
}
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2022-12-06 UTC.
Homepage: https://github.com/google/uncertainty-baselines/tree/main/baselines/drug_cardiotoxicity
Source code: tfds.graphs.cardiotox.Cardiotox (https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/graphs/cardiotox/cardiotox.py)
Versions: 1.0.0 (default): Initial release.
Download size: Unknown size
Dataset size: 1.66 GiB
Auto-cached (documentation: https://www.tensorflow.org/datasets/performances#auto-caching): No
Supervised keys (see the as_supervised doc: https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args): None
Figure (tfds.show_examples: https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples): Not supported.