FeatureConnector
The tfds.features.FeatureConnector API:

- Defines the structure, shapes, and dtypes of the final tf.data.Dataset
- Abstracts away serialization to/from disk
- Exposes additional metadata (e.g. label names, audio sample rate, ...)
Overview
tfds.features.FeatureConnector defines the dataset features structure (in
tfds.core.DatasetInfo):
tfds.core.DatasetInfo(
    features=tfds.features.FeaturesDict({
        'image': tfds.features.Image(shape=(28, 28, 1), doc='Grayscale image'),
        'label': tfds.features.ClassLabel(
            names=['no', 'yes'],
            doc=tfds.features.Documentation(
                desc='Whether this is a picture of a cat',
                value_range='yes or no',
            ),
        ),
        'metadata': {
            'id': tf.int64,
            'timestamp': tfds.features.Scalar(
                tf.int64,
                doc='Timestamp when this picture was taken as seconds since epoch'),
            'language': tf.string,
        },
    }),
)
Features can be documented either with just a textual description
(doc='description') or by using tfds.features.Documentation directly to
provide a more detailed feature description.
Features can be:

- Scalar values: tf.bool, tf.string, tf.float32, ... When you want to
  document the feature, you can also use
  tfds.features.Scalar(tf.int64, doc='description').
- tfds.features.Audio, tfds.features.Video, ... (see the list of available
  features)
- Nested dict of features:
  {'metadata': {'image': Image(), 'description': tf.string}}, ...
- Nested tfds.features.Sequence: Sequence({'image': ..., 'id': ...}),
  Sequence(Sequence(tf.int64)), ...
During generation, the examples are automatically serialized by
FeatureConnector.encode_example into a format suitable for disk (currently
tf.train.Example protocol buffers):
yield {
    'image': '/path/to/img0.png',  # `np.array`, file bytes, ... also accepted
    'label': 'yes',  # int (0-num_classes) also accepted
    'metadata': {
        'id': 43,
        'language': 'en',
    },
}
When reading the dataset (e.g. with tfds.load), the data is automatically
decoded with FeatureConnector.decode_example. The returned tf.data.Dataset
will match the dict structure defined in tfds.core.DatasetInfo:
ds = tfds.load(...)
ds.element_spec == {
    'image': tf.TensorSpec(shape=(28, 28, 1), dtype=tf.uint8),
    'label': tf.TensorSpec(shape=(), dtype=tf.int64),
    'metadata': {
        'id': tf.TensorSpec(shape=(), dtype=tf.int64),
        'language': tf.TensorSpec(shape=(), dtype=tf.string),
    },
}
Serialize/deserialize to proto
TFDS exposes a low-level API to serialize/deserialize examples to the
tf.train.Example proto.
To serialize a dict[np.ndarray | Path | str | ...] to proto bytes, use
features.serialize_example:
with tf.io.TFRecordWriter('path/to/file.tfrecord') as writer:
    for ex in all_exs:
        ex_bytes = features.serialize_example(ex)
        writer.write(ex_bytes)
To deserialize proto bytes to tf.Tensor, use
features.deserialize_example:
ds = tf.data.TFRecordDataset('path/to/file.tfrecord')
ds = ds.map(features.deserialize_example)
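The two snippets above form a round trip: serialize_example turns a feature-structured dict into bytes, and deserialize_example restores values matching the feature spec. The toy stand-in below illustrates only that contract in pure Python (using json instead of tf.train.Example; the function names mirror the tfds API but this is not its implementation):

```python
import json

# Hypothetical stand-in for the serialize/deserialize round-trip contract.
# Real tfds encodes to tf.train.Example protos, not json.
def serialize_example(data: dict) -> bytes:
    """Encode a structured example into bytes suitable for writing to disk."""
    return json.dumps(data, sort_keys=True).encode('utf-8')

def deserialize_example(raw: bytes) -> dict:
    """Restore the structured example from its on-disk byte representation."""
    return json.loads(raw.decode('utf-8'))

ex = {'label': 1, 'metadata': {'id': 43, 'language': 'en'}}
assert deserialize_example(serialize_example(ex)) == ex
```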
Access metadata

See the
introduction doc
to access features metadata (label names, shape, dtype, ...). Example:
ds, info = tfds.load(..., with_info=True)
info.features['label'].names # ['cat', 'dog', ...]
info.features['label'].str2int('cat') # 0
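The names/str2int lookup above is a bidirectional mapping between label strings and integer ids. A minimal pure-Python sketch of that behavior (an illustrative stand-in, not the tfds.features.ClassLabel implementation):

```python
# Illustrative stand-in for tfds.features.ClassLabel's name <-> id mapping.
class ClassLabelSketch:
    def __init__(self, names):
        self.names = list(names)
        # Each name gets the integer id of its position in `names`.
        self._name_to_id = {name: i for i, name in enumerate(self.names)}

    def str2int(self, name: str) -> int:
        """Label name -> integer id."""
        return self._name_to_id[name]

    def int2str(self, label_id: int) -> str:
        """Integer id -> label name."""
        return self.names[label_id]

labels = ClassLabelSketch(names=['cat', 'dog'])
assert labels.str2int('cat') == 0
assert labels.int2str(1) == 'dog'
```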
Create your own tfds.features.FeatureConnector

If you believe a feature is missing from the
available features,
please open a new issue.

To create your own feature connector, you need to inherit from
tfds.features.FeatureConnector and implement the abstract methods.

- If your feature is a single tensor value, it's best to inherit from
  tfds.features.Tensor and use super() when needed. See the
  tfds.features.BBoxFeature source code for an example.
- If your feature is a container of multiple tensors, it's best to inherit
  from tfds.features.FeaturesDict and use super() to automatically encode
  sub-connectors.
The tfds.features.FeatureConnector object abstracts away how the feature is
encoded on disk from how it is presented to the user. Below is a diagram
showing the abstraction layers of the dataset and the transformation from
the raw dataset files to the tf.data.Dataset object.
To create your own feature connector, subclass tfds.features.FeatureConnector
and implement the abstract methods:

- encode_example(data): Defines how to encode the data given in the
  generator _generate_examples() into tf.train.Example-compatible data.
  Can return a single value, or a dict of values.
- decode_example(data): Defines how to decode the data from the tensor read
  from tf.train.Example into the user tensor returned by tf.data.Dataset.
- get_tensor_info(): Indicates the shape/dtype of the tensor(s) returned by
  tf.data.Dataset. May be optional if inheriting from another
  tfds.features.
- (optionally) get_serialized_info(): If the info returned by
  get_tensor_info() is different from how the data are actually written on
  disk, then you need to override get_serialized_info() to match the specs
  of the tf.train.Example.
- to_json_content/from_json_content: This is required to allow your
  dataset to be loaded without the original source code. See
  Audio feature
  for an example.
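The encode/decode split described above can be sketched without tfds at all. The toy class below is a hypothetical stand-in (not a real tfds.features.FeatureConnector subclass, which works with tf tensors): encode_example prepares the disk-friendly representation, decode_example restores the user-facing value, and get_tensor_info declares the decoded shape/dtype:

```python
# Hypothetical sketch of the FeatureConnector contract in plain Python.
class BoolFeatureSketch:
    def get_tensor_info(self):
        # Shape/dtype of the decoded, user-facing value (a scalar bool here).
        return {'shape': (), 'dtype': 'bool'}

    def encode_example(self, data: bool) -> int:
        # Store the bool as an int: the disk-friendly representation.
        return int(data)

    def decode_example(self, encoded: int) -> bool:
        # Restore the user-facing value from the stored representation.
        return bool(encoded)

feature = BoolFeatureSketch()
stored = feature.encode_example(True)  # what would be written to disk
assert stored == 1
assert feature.decode_example(stored) is True
```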
Note: Make sure to test your feature connectors with self.assertFeature and
tfds.testing.FeatureExpectationItem. Have a look at test examples.

For more info, have a look at the tfds.features.FeatureConnector
documentation. It's also best to look at
real examples.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2023-04-08 UTC.