# Common Transformations
This document describes how to perform common transformations with tf.Transform.

We assume you have already constructed the Beam pipeline along the lines of the
examples, and only describe what needs to be added to `preprocessing_fn` and
possibly the model.
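As a quick reminder, such a pipeline might look like the following sketch,
modeled on the simple examples. The raw data and feature spec here are
hypothetical placeholders, and `preprocessing_fn` stands for whichever of the
functions below you are using.

    import tempfile

    import tensorflow as tf
    import tensorflow_transform.beam as tft_beam
    from tensorflow_transform.tf_metadata import dataset_metadata
    from tensorflow_transform.tf_metadata import schema_utils

    # Hypothetical raw data and schema; substitute your own.
    raw_data = [{'x': 'a'}, {'x': 'b'}, {'x': 'a'}]
    raw_data_metadata = dataset_metadata.DatasetMetadata(
        schema_utils.schema_from_feature_spec(
            {'x': tf.io.FixedLenFeature([], tf.string)}))

    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
      # Analyze computes the full-pass statistics (vocabularies, means, ...);
      # Transform applies preprocessing_fn to every instance.
      transformed_dataset, transform_fn = (
          (raw_data, raw_data_metadata)
          | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

    transformed_data, transformed_metadata = transformed_dataset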
Using String/Categorical data
-----------------------------
The following `preprocessing_fn` will compute a vocabulary over the values of
feature `x` with tokens in descending frequency order, convert feature `x`
values to their index in the vocabulary, and finally perform a one-hot encoding
for the output.

This is common, for example, in use cases where the label feature is a
categorical string. The resulting one-hot encoding is ready for training.

**Note:** this example produces `x_out` as a potentially large dense tensor.
This is fine as long as the transformed data doesn't get materialized, and this
is the format expected in training. Otherwise, a more efficient representation
would be a [`tf.SparseTensor`](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor),
in which case only a single index and value (1) is used to represent each
instance.

    def preprocessing_fn(inputs):
      integerized = tft.compute_and_apply_vocabulary(
          inputs['x'],
          num_oov_buckets=1,
          vocab_filename='x_vocab')
      one_hot_encoded = tf.one_hot(
          integerized,
          depth=tf.cast(
              tft.experimental.get_vocabulary_size_by_name('x_vocab') + 1,
              tf.int32),
          on_value=1.0,
          off_value=0.0)
      return {
          'x_out': one_hot_encoded,
      }
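To make the behavior concrete, here is a small hypothetical walkthrough; the
tokens, counts, and resulting vocabulary are illustrative only.

    # Suppose the analyze phase computed the vocabulary ['a', 'b'] over the
    # training data (descending frequency), so with num_oov_buckets=1:
    #   'a' -> 0, 'b' -> 1, any unseen token -> 2 (the OOV bucket)
    #
    # Transforming the batch ['a', 'b', 'c'] then yields:
    #   integerized     = [0, 1, 2]
    #   one_hot_encoded = [[1., 0., 0.],
    #                      [0., 1., 0.],
    #                      [0., 0., 1.]]
    # depth = vocabulary size (2) + num_oov_buckets (1) = 3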
Mean imputation for missing data
--------------------------------

In this example, feature `x` is an optional feature, represented as a
[`tf.SparseTensor`](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor)
in the `preprocessing_fn`. In order to convert it to a dense tensor, we compute
its mean and set the mean to be the default value when it is missing from an
instance.
The resulting dense tensor will have the shape `[None, 1]`: `None` represents
the batch dimension, and the second dimension is the number of values that `x`
can have per instance. In this case it's 1.

    def preprocessing_fn(inputs):
      return {
          'x_out': tft.sparse_tensor_to_dense_with_shape(
              inputs['x'], default_value=tft.mean(inputs['x']), shape=[None, 1])
      }
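As a sketch of the effect, assume the non-missing values of `x` across the
dataset average to 3.0; the numbers here are purely illustrative.

    # Sparse input batch (at most one optional value per instance):
    #   instance 0: x = [1.0]
    #   instance 1: x missing
    #   instance 2: x = [5.0]
    #
    # tft.mean(inputs['x']) is computed once over the whole dataset during
    # the analyze phase (here (1.0 + 5.0) / 2 = 3.0) and is embedded as a
    # constant in the transform graph.
    #
    # x_out has dense shape [None, 1]:
    #   [[1.0],
    #    [3.0],   # missing value imputed with the mean
    #    [5.0]]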
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2022-11-22 UTC."],[],[],null,["# Common Transformations\n\n\u003cbr /\u003e\n\nIn this document we describe how to do common transformations with tf.transform.\n\nWe assume you have already constructed the beam pipeline along the lines of the\nexamples, and only describe what needs to be added to `preprocessing_fn` and\npossibly model.\n\nUsing String/Categorical data\n-----------------------------\n\nThe following `preprocessing_fn` will compute a vocabulary over the values of\nfeature `x` with tokens in descending frequency order, convert feature `x`\nvalues to their index in the vocabulary, and finally perform a one-hot encoding\nfor the output.\n\nThis is common for example in use cases where the label feature is a categorical\nstring.\nThe resulting one-hot encoding is ready for training.\n**Note:** this example produces `x_out` as a potentially large dense tensor. This is fine as long as the transformed data doesn't get materialized, and this is the format expected in training. Otherwise, a more efficient representation would be a [`tf.SparseTensor`](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor), in which case only a single index and value (1) is used to represent each instance. \n\n def preprocessing_fn(inputs):\n integerized = tft.compute_and_apply_vocabulary(\n inputs['x'],\n num_oov_buckets=1,\n vocab_filename='x_vocab')\n one_hot_encoded = tf.one_hot(\n integerized,\n depth=tf.cast(tft.experimental.get_vocabulary_size_by_name('x_vocab') + 1,\n tf.int32),\n on_value=1.0,\n off_value=0.0)\n return {\n 'x_out': one_hot_encoded,\n }\n\nMean imputation for missing data\n--------------------------------\n\nIn this example, feature `x` is an optional feature, represented as a\n[`tf.SparseTensor`](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor) in the `preprocessing_fn`. In order to convert it to a dense\ntensor, we compute its mean, and set the mean to be the default value when it\nis missing from an instance.\n\nThe resulting dense tensor will have the shape `[None, 1]`, `None` represents\nthe batch dimension, and for the second dimension it will be the number of\nvalues that `x` can have per instance. In this case it's 1. \n\n def preprocessing_fn(inputs):\n return {\n 'x_out': tft.sparse_tensor_to_dense_with_shape(\n inputs['x'], default_value=tft.mean(x), shape=[None, 1])\n }"]]