tf.feature_column.categorical_column_with_vocabulary_file
A CategoricalColumn with a vocabulary file.
tf.feature_column.categorical_column_with_vocabulary_file(
    key, vocabulary_file, vocabulary_size=None, dtype=tf.dtypes.string,
    default_value=None, num_oov_buckets=0
)
Use this when your inputs are in string or integer format, and you have a vocabulary file that maps each value to an integer ID. By default, out-of-vocabulary values are ignored. Use either (but not both) of num_oov_buckets and default_value to specify how to include out-of-vocabulary values.
For the input dictionary features, features[key] is either a Tensor or a SparseTensor. If a Tensor, missing values can be represented by -1 for int and '' for string, which will be dropped by this feature column.
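As a minimal sketch of this missing-value behavior, assuming a small hypothetical three-entry vocabulary written to a temporary path, and using indicator_column with tf.keras.layers.DenseFeatures purely to materialize the lookup, an empty string produces an all-zero row rather than an error:

import tensorflow as tf

# Hypothetical demo vocabulary and path, for illustration only.
with open('/tmp/states_demo.txt', 'w') as f:
    f.write('CA\nNY\nTX\n')

states = tf.feature_column.categorical_column_with_vocabulary_file(
    key='states', vocabulary_file='/tmp/states_demo.txt', vocabulary_size=3)

# '' marks a missing string value; with no default_value and no OOV buckets it
# is simply dropped, so its indicator row below comes out all zeros.
features = {'states': tf.constant([['CA'], [''], ['TX']])}
dense = tf.keras.layers.DenseFeatures([tf.feature_column.indicator_column(states)])
print(dense(features))  # rows: one-hot for 'CA', all zeros for '', one-hot for 'TX'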
Example with num_oov_buckets:
File '/us/states.txt' contains 50 lines, each with a 2-character U.S. state abbreviation. All inputs with values in that file are assigned an ID 0-49, corresponding to the value's line number. All other values are hashed and assigned an ID 50-54.
states = categorical_column_with_vocabulary_file(
    key='states', vocabulary_file='/us/states.txt', vocabulary_size=50,
    num_oov_buckets=5)
columns = [states, ...]
features = tf.io.parse_example(..., features=make_parse_example_spec(columns))
linear_prediction = linear_model(features, columns)
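To see the bucket assignment concretely, here is a small sketch. It uses a hypothetical three-entry vocabulary file instead of the 50-state file, and indicator_column with tf.keras.layers.DenseFeatures only to make the assigned IDs visible:

import tensorflow as tf

# Hypothetical three-entry vocabulary; real use would point at the state file.
with open('/tmp/states_demo.txt', 'w') as f:
    f.write('CA\nNY\nTX\n')

states = tf.feature_column.categorical_column_with_vocabulary_file(
    key='states', vocabulary_file='/tmp/states_demo.txt', vocabulary_size=3,
    num_oov_buckets=2)

# 'ZZ' is not in the vocabulary, so it is hashed into ID 3 or 4 (one of the two
# OOV buckets); in-vocabulary values keep IDs 0-2.
features = {'states': tf.constant([['NY'], ['ZZ']])}
dense = tf.keras.layers.DenseFeatures([tf.feature_column.indicator_column(states)])
print(dense(features))  # second row has its 1 in one of the last two columns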
Example with default_value:
File '/us/states.txt' contains 51 lines: the first line is 'XX', and the other 50 each have a 2-character U.S. state abbreviation. Both a literal 'XX' in the input and any other value missing from the file will be assigned ID 0. All others are assigned the corresponding line number 1-50.
states = categorical_column_with_vocabulary_file(
    key='states', vocabulary_file='/us/states.txt', vocabulary_size=51,
    default_value=0)
columns = [states, ...]
features = tf.io.parse_example(..., features=make_parse_example_spec(columns))
linear_prediction, _, _ = linear_model(features, columns)
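Under a default_value=0 setup like this, a literal 'XX' and any unseen value both resolve to ID 0. A small sketch with a hypothetical four-entry version of the file (and, again, indicator_column plus tf.keras.layers.DenseFeatures only for illustration) shows the two cases producing identical rows:

import tensorflow as tf

# Hypothetical four-entry file: the first line is the catch-all 'XX'.
with open('/tmp/states_default_demo.txt', 'w') as f:
    f.write('XX\nCA\nNY\nTX\n')

states = tf.feature_column.categorical_column_with_vocabulary_file(
    key='states', vocabulary_file='/tmp/states_default_demo.txt',
    vocabulary_size=4, default_value=0)

# The literal 'XX' and the unseen 'ZZ' both map to ID 0; 'NY' keeps ID 2.
features = {'states': tf.constant([['XX'], ['ZZ'], ['NY']])}
dense = tf.keras.layers.DenseFeatures([tf.feature_column.indicator_column(states)])
print(dense(features))  # first two rows are identical one-hots at column 0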
And to make an embedding with either column:
columns = [embedding_column(states, 3), ...]
features = tf.io.parse_example(..., features=make_parse_example_spec(columns))
dense_tensor = input_layer(features, columns)
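The input_layer call above is the TF1-style way to build the dense input; in TensorFlow 2 the same embedding can be materialized with tf.keras.layers.DenseFeatures. A self-contained sketch, reusing the hypothetical demo vocabulary from the earlier snippets:

import tensorflow as tf

# Hypothetical demo vocabulary and path, for illustration only.
with open('/tmp/states_demo.txt', 'w') as f:
    f.write('CA\nNY\nTX\n')

states = tf.feature_column.categorical_column_with_vocabulary_file(
    key='states', vocabulary_file='/tmp/states_demo.txt', vocabulary_size=3,
    num_oov_buckets=2)

# Each ID (including the OOV buckets) gets a trainable 3-dimensional vector.
embedded = tf.feature_column.embedding_column(states, dimension=3)
dense_tensor = tf.keras.layers.DenseFeatures([embedded])(
    {'states': tf.constant([['CA'], ['ZZ']])})
print(dense_tensor.shape)  # (2, 3)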
Args
key: A unique string identifying the input feature. It is used as the column name and the dictionary key for feature parsing configs, feature Tensor objects, and feature columns.
vocabulary_file: The vocabulary file name.
vocabulary_size: The number of elements in the vocabulary. This must be no greater than the length of vocabulary_file; if it is less, later values are ignored. If None, it is set to the length of vocabulary_file.
dtype: The type of features. Only string and integer types are supported.
default_value: The integer ID value to return for out-of-vocabulary feature values; defaults to -1. This cannot be specified with a positive num_oov_buckets.
num_oov_buckets: Non-negative integer, the number of out-of-vocabulary buckets. All out-of-vocabulary inputs will be assigned IDs in the range [vocabulary_size, vocabulary_size + num_oov_buckets) based on a hash of the input value. A positive num_oov_buckets cannot be specified with default_value.

Returns
A CategoricalColumn with a vocabulary file.

Raises
ValueError: vocabulary_file is missing or cannot be opened.
ValueError: vocabulary_size is missing or < 1.
ValueError: num_oov_buckets is a negative integer.
ValueError: num_oov_buckets and default_value are both specified.
ValueError: dtype is neither string nor integer.