ogbg_molpcba
'ogbg-molpcba' is a molecular dataset sampled from PubChem BioAssay. It is a
graph prediction dataset from the Open Graph Benchmark (OGB).
This dataset is experimental, and the API is subject to change in future
releases.
The description below is adapted from the OGB paper:
Input Format
All the molecules are pre-processed using RDKit ([1]).
- Each graph represents a molecule, where nodes are atoms, and edges are
chemical bonds.
- Input node features are 9-dimensional, containing atomic number and
chirality, as well as additional atom features such as formal charge
and whether the atom is in a ring.
- Input edge features are 3-dimensional, containing bond type, bond
stereochemistry, as well as an additional bond feature indicating whether
the bond is conjugated.
The exact description of all features is available at
https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py
Prediction
The task is to predict 128 different biological activities, each a binary
label (inactive/active). See [2] and [3] for more details on these targets.
Not all targets apply to each molecule: missing targets are indicated by NaNs.
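Because missing targets are stored as NaNs, a loss computed naively over all 128 labels would itself become NaN. The following is a minimal sketch of one common way to handle this, assuming a model that emits one logit per task: mask out the NaN entries and average a sigmoid cross-entropy over the remaining ones. The function name `masked_binary_cross_entropy` is hypothetical and is not part of TFDS or OGB.

```python
import numpy as np


def masked_binary_cross_entropy(logits, labels):
    """Mean sigmoid cross-entropy over the entries of `labels` that are not NaN.

    Hypothetical helper for illustration; assumes `logits` and `labels`
    have the same shape, e.g. (batch, 128).
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)

    # True where a target is present, False where it is missing (NaN).
    mask = ~np.isnan(labels)

    # Replace NaNs with a dummy value so the arithmetic stays finite;
    # these entries are zeroed out by the mask below.
    safe_labels = np.where(mask, labels, 0.0)

    # Numerically stable form: max(x, 0) - x * y + log(1 + exp(-|x|)).
    per_entry = (
        np.maximum(logits, 0.0)
        - logits * safe_labels
        + np.log1p(np.exp(-np.abs(logits)))
    )

    num_valid = int(mask.sum())
    if num_valid == 0:
        return 0.0  # no labeled task in this batch
    return float((per_entry * mask).sum() / num_valid)


# Toy batch: 2 molecules, 3 tasks, with some targets missing.
toy_labels = np.array([[1.0, np.nan, 0.0], [np.nan, np.nan, 1.0]])
toy_logits = np.zeros_like(toy_labels)
toy_loss = masked_binary_cross_entropy(toy_logits, toy_labels)
print(toy_loss)
```

With all-zero logits every labeled entry contributes log 2, so the toy loss equals log 2 regardless of how many targets are missing.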
References
[1]: Greg Landrum, et al. 'RDKit: Open-source cheminformatics'. URL:
https://github.com/rdkit/rdkit
[2]: Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David
Konerding and Vijay Pande. 'Massively Multitask Networks for Drug Discovery'.
URL: https://arxiv.org/pdf/1502.02072.pdf
[3]: Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb
Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. 'MoleculeNet: a
benchmark for molecular machine learning'. Chemical Science, 9(2):513-530, 2018.
| Split | Examples |
|--------------|----------|
| 'test' | 43,793 |
| 'train' | 350,343 |
| 'validation' | 43,793 |
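As a quick check on the split sizes, the short sketch below computes the total number of examples and the share of each split. The variable names are illustrative only; the counts are the ones listed in the splits table.

```python
# Split sizes as listed in the splits table.
split_sizes = {"train": 350_343, "validation": 43_793, "test": 43_793}

# Total number of molecules across all splits.
total = sum(split_sizes.values())

# Fraction of the dataset that falls into each split.
fractions = {name: count / total for name, count in split_sizes.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]} examples ({fraction:.1%})")
```

The counts correspond to roughly an 80/10/10 division between train, validation, and test.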
FeaturesDict({
'edge_feat': Tensor(shape=(None, 3), dtype=float32),
'edge_index': Tensor(shape=(None, 2), dtype=int64),
'labels': Tensor(shape=(128,), dtype=float32),
'node_feat': Tensor(shape=(None, 9), dtype=float32),
'num_edges': Tensor(shape=(None,), dtype=int64),
'num_nodes': Tensor(shape=(None,), dtype=int64),
})
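The feature structure above can be checked against a single example with a few lines of NumPy. This is a sketch under stated assumptions, not part of the TFDS API: `summarize_example` and `iter_examples` are hypothetical helpers, each row of `edge_index` is assumed to be a (source, target) pair of node indices, and `edge_feat` is assumed to have one row per row of `edge_index`. The loader uses `tfds.load` and `tfds.as_numpy`, imported lazily so that the validation helper can be tried on a toy graph without downloading anything.

```python
import numpy as np


def summarize_example(example):
    """Validate one example dict against the feature structure and summarize it."""
    node_feat = np.asarray(example["node_feat"])
    edge_feat = np.asarray(example["edge_feat"])
    edge_index = np.asarray(example["edge_index"])
    labels = np.asarray(example["labels"])

    if node_feat.ndim != 2 or node_feat.shape[1] != 9:
        raise ValueError(f"node_feat should be (num_nodes, 9), got {node_feat.shape}")
    if edge_feat.ndim != 2 or edge_feat.shape[1] != 3:
        raise ValueError(f"edge_feat should be (num_edges, 3), got {edge_feat.shape}")
    if edge_index.ndim != 2 or edge_index.shape[1] != 2:
        raise ValueError(f"edge_index should be (num_edges, 2), got {edge_index.shape}")
    if labels.shape != (128,):
        raise ValueError(f"labels should be (128,), got {labels.shape}")
    # Assumption: one feature row per edge row.
    if edge_feat.shape[0] != edge_index.shape[0]:
        raise ValueError("edge_feat and edge_index disagree on the number of edges")
    # Every endpoint must refer to an existing node.
    if edge_index.size and (edge_index.min() < 0 or edge_index.max() >= node_feat.shape[0]):
        raise ValueError("edge_index refers to a node that does not exist")

    return {
        "num_nodes": int(node_feat.shape[0]),
        "num_edge_rows": int(edge_index.shape[0]),
        "num_labeled_tasks": int((~np.isnan(labels)).sum()),
    }


def iter_examples(split="train", limit=1):
    """Yield up to `limit` examples as NumPy dicts (downloads the dataset on first use)."""
    import tensorflow_datasets as tfds  # lazy import: only needed for real data

    dataset = tfds.load("ogbg_molpcba", split=split)
    for example in tfds.as_numpy(dataset.take(limit)):
        yield example


# Toy graph with 3 atoms and 4 edge rows, standing in for a real example.
toy_labels = np.full((128,), np.nan, dtype=np.float32)
toy_labels[:2] = [1.0, 0.0]
toy_example = {
    "node_feat": np.zeros((3, 9), dtype=np.float32),
    "edge_feat": np.zeros((4, 3), dtype=np.float32),
    "edge_index": np.array([[0, 1], [1, 0], [1, 2], [2, 1]], dtype=np.int64),
    "labels": toy_labels,
}
toy_summary = summarize_example(toy_example)
print(toy_summary)
```

To run the same check on real data, pass each item from `iter_examples("train", limit=5)` to `summarize_example`; this requires `tensorflow_datasets` to be installed.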
| Feature | Class | Shape | Dtype | Description |
|------------|--------------|-----------|---------|-------------|
| | FeaturesDict | | | |
| edge_feat | Tensor | (None, 3) | float32 | |
| edge_index | Tensor | (None, 2) | int64 | |
| labels | Tensor | (128,) | float32 | |
| node_feat | Tensor | (None, 9) | float32 | |
| num_edges | Tensor | (None,) | int64 | |
| num_nodes | Tensor | (None,) | int64 | |

@inproceedings{DBLP:conf/nips/HuFZDRLCL20,
author = {Weihua Hu and
Matthias Fey and
Marinka Zitnik and
Yuxiao Dong and
Hongyu Ren and
Bowen Liu and
Michele Catasta and
Jure Leskovec},
editor = {Hugo Larochelle and
Marc Aurelio Ranzato and
Raia Hadsell and
Maria{-}Florina Balcan and
Hsuan{-}Tien Lin},
title = {Open Graph Benchmark: Datasets for Machine Learning on Graphs},
booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference
on Neural Information Processing Systems 2020, NeurIPS 2020, December
6-12, 2020, virtual},
year = {2020},
url = {https://proceedings.neurips.cc/paper/2020/hash/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html},
timestamp = {Tue, 19 Jan 2021 15:57:06 +0100},
biburl = {https://dblp.org/rec/conf/nips/HuFZDRLCL20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Last updated 2022-12-14 UTC.
- Homepage: https://ogb.stanford.edu/docs/graphprop
- Source code: tfds.datasets.ogbg_molpcba.Builder
(https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/ogbg_molpcba/ogbg_molpcba_dataset_builder.py)
- Versions:
  - 0.1.0: Initial release of experimental API.
  - 0.1.1: Exposes the number of edges in each graph explicitly.
  - 0.1.2: Add metadata field for GraphVisualizer.
  - 0.1.3 (default): Add metadata field for names of individual tasks.
- Download size: 37.70 MiB
- Dataset size: 822.53 MiB
- Auto-cached (documentation:
https://www.tensorflow.org/datasets/performances#auto-caching): No
- Supervised keys (see the as_supervised doc:
https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args): None