tft.compute_and_apply_vocabulary
    
    
      
    
    
      
    
  
  
      
    
  
  
  
  
  
    
  
  
    
    
Generates a vocabulary for x and maps each value of x to an integer using this vocabulary.
tft.compute_and_apply_vocabulary(
    x: common_types.ConsistentTensorType,
    *,
    default_value: Any = -1,
    top_k: Optional[int] = None,
    frequency_threshold: Optional[int] = None,
    num_oov_buckets: int = 0,
    vocab_filename: Optional[str] = None,
    weights: Optional[tf.Tensor] = None,
    labels: Optional[tf.Tensor] = None,
    use_adjusted_mutual_info: bool = False,
    min_diff_from_avg: float = 0.0,
    coverage_top_k: Optional[int] = None,
    coverage_frequency_threshold: Optional[int] = None,
    key_fn: Optional[Callable[[Any], Any]] = None,
    fingerprint_shuffle: bool = False,
    file_format: common_types.VocabularyFileFormatType = analyzers.DEFAULT_VOCABULARY_FILE_FORMAT,
    store_frequency: Optional[bool] = False,
    reserved_tokens: Optional[Union[Iterable[str], tf.Tensor]] = None,
    name: Optional[str] = None
) -> common_types.ConsistentTensorType
If a token is empty or contains the '\n' or '\r' character, it is discarded,
because vocabularies are currently written as text files. This behavior will
likely be fixed/improved in the future.
Note that this function will cause a vocabulary to be computed.  For large
datasets it is highly recommended to either set frequency_threshold or top_k
to control the size of the vocabulary, and also the run time of this
operation.
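
To make the effect of top_k and frequency_threshold concrete, here is a minimal pure-Python sketch of frequency-based vocabulary selection. It is an illustrative model only, not the tf.Transform implementation: the helper name build_vocabulary is hypothetical, the lexicographic tie-break between equally frequent tokens is an assumption, and so is the order in which the two limits are applied (threshold first, then top_k).

```python
from collections import Counter
from typing import Iterable, List, Optional


def build_vocabulary(
    tokens: Iterable[str],
    top_k: Optional[int] = None,
    frequency_threshold: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: model frequency-based vocabulary selection."""
    if top_k is not None and top_k < 0:
        raise ValueError("top_k must not be negative")
    if frequency_threshold is not None and frequency_threshold < 0:
        raise ValueError("frequency_threshold must not be negative")

    # Empty tokens and tokens containing '\n' or '\r' are discarded,
    # mirroring the text-file limitation described above.
    counts = Counter(
        t for t in tokens if t and "\n" not in t and "\r" not in t
    )

    # Keep only tokens whose absolute frequency meets the threshold.
    if frequency_threshold is not None:
        counts = Counter(
            {t: c for t, c in counts.items() if c >= frequency_threshold}
        )

    # Most frequent first; the lexicographic tie-break is an assumption.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return ordered if top_k is None else ordered[:top_k]


corpus = ["a", "b", "a", "c", "a", "b", "", "bad\ntoken"]
print(build_vocabulary(corpus))
print(build_vocabulary(corpus, top_k=2))
print(build_vocabulary(corpus, frequency_threshold=2))
```

In this toy corpus either limit drops the rare token "c", which is the kind of pruning that keeps vocabulary size and run time under control on large datasets.
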
| Args | Description |
|---|---|
| x | A Tensor, SparseTensor, or RaggedTensor of type tf.string or an integer type (tf.int8, tf.int16, tf.int32, or tf.int64). |
| default_value | The value to use for out-of-vocabulary values, unless num_oov_buckets is greater than zero. |
| top_k | Limit the generated vocabulary to the first top_k elements. If set to None, the full vocabulary is generated. |
| frequency_threshold | Limit the generated vocabulary to elements whose absolute frequency is >= the supplied threshold. If set to None, the full vocabulary is generated. Absolute frequency means the number of occurrences of the element in the dataset, as opposed to the proportion of instances that contain that element. If labels are provided and the vocabulary is computed using mutual information, tokens are filtered out if their mutual information with the label is < the supplied threshold. |
| num_oov_buckets | Any lookup of an out-of-vocabulary token will return a bucket ID based on its hash if num_oov_buckets is greater than zero. Otherwise it is assigned the default_value. |
| vocab_filename | The file name for the vocabulary file. If None, a name based on the scope name in the context of this graph will be used as the file name. If not None, it should be unique within a given preprocessing function. NOTE: to make your pipelines resilient to implementation details, set vocab_filename explicitly whenever a downstream component uses the vocabulary file. |
| weights | (Optional) Weights Tensor for the vocabulary. It must have the same shape as x. |
| labels | (Optional) A Tensor of labels for the vocabulary. If provided, the vocabulary is calculated based on mutual information with the label, rather than frequency. The labels must have the same batch dimension as x. If x is sparse, labels should be a 1D tensor reflecting row-wise labels. If x is dense, labels can either be a 1D tensor of row-wise labels, or a dense tensor of the identical shape as x (i.e. element-wise labels). Labels should be a discrete integerized tensor (if the label is numeric, it should first be bucketized; if the label is a string, an integer vocabulary should first be applied). Note: CompositeTensor labels are not yet supported (b/134931826). WARNING: when labels are provided, the frequency_threshold argument functions as a mutual information threshold, which is a float. |
| use_adjusted_mutual_info | If true, use adjusted mutual information. | 
| min_diff_from_avg | Mutual information of a feature will be adjusted to zero whenever the difference between the count of the feature with any label and its expected count is lower than min_diff_from_avg. |
| coverage_top_k | (Optional), (Experimental) The minimum number of elements per key to be included in the vocabulary. |
| coverage_frequency_threshold | (Optional), (Experimental) Limit the coverage arm of the vocabulary only to elements whose absolute frequency is >= this threshold for a given key. |
| key_fn | (Optional), (Experimental) A function that takes in a single entry of x and returns the corresponding key for coverage calculation. If this is None, no coverage arm is added to the vocabulary. |
| fingerprint_shuffle | (Optional), (Experimental) Whether to sort the vocabularies by fingerprint instead of counts. This is useful for load balancing on the training parameter servers. Shuffle only happens while writing the files, so all the filters above will still take effect. |
| file_format | (Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'. |
| store_frequency | If True, the frequency of each word is stored in the vocabulary file. If labels are provided, the mutual information is stored in the file instead. Each line in the file will be of the form 'frequency word'. NOTE: if True and file_format is 'text', spaces will be replaced to avoid information loss. |
| reserved_tokens | (Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens maintain their order and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on cache. |
| name | (Optional) A name for this operation. | 
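
The layout described for reserved_tokens and store_frequency can be pictured with a small pure-Python sketch. It only models the descriptions in the table above and is not tf.Transform's writer: both helper names are hypothetical, and the choice not to repeat a reserved token that also occurs in the data is an assumption. The table does not say how spaces are replaced or how the two options combine, so the sketch leaves both out.

```python
from typing import Dict, List, Sequence


def order_with_reserved(
    computed: Sequence[str], reserved_tokens: Sequence[str]
) -> List[str]:
    """Hypothetical helper: reserved tokens first, in their given order."""
    reserved = list(reserved_tokens)
    seen = set(reserved)
    # Assumption: a reserved token that also occurs in the data is not repeated.
    return reserved + [t for t in computed if t not in seen]


def frequency_lines(counts: Dict[str, int]) -> List[str]:
    """Hypothetical helper: 'frequency word' lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{count} {token}" for token, count in ordered]


vocab = order_with_reserved(["the", "cat", "[PAD]"], ["[PAD]", "[UNK]"])
print(vocab)
print(frequency_lines({"the": 5, "cat": 2}))
```

Because reserved tokens sit at the start of the vocabulary, their integer IDs stay fixed no matter what the data contains.
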
| Returns | 
|---|
| A Tensor, SparseTensor, or RaggedTensor where each string value is mapped to an integer. Each unique string value that appears in the vocabulary is mapped to a different integer and integers are consecutive starting from zero. A string value not in the vocabulary is assigned default_value. Alternatively, if num_oov_buckets is specified, out-of-vocabulary strings are hashed to values in [vocab_size, vocab_size + num_oov_buckets) for an overall range of [0, vocab_size + num_oov_buckets). |
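
The mapping described above can be sketched in plain Python. This is a model of the documented ID ranges, not the library's lookup: the helper name apply_vocabulary is hypothetical, and zlib.crc32 is a stand-in hash chosen only because it is deterministic, so the specific bucket an out-of-vocabulary token lands in will differ from what tf.Transform produces.

```python
import zlib
from typing import List, Sequence


def apply_vocabulary(
    tokens: Sequence[str],
    vocabulary: Sequence[str],
    default_value: int = -1,
    num_oov_buckets: int = 0,
) -> List[int]:
    """Hypothetical helper: model the documented token-to-integer mapping."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vocab_size = len(vocabulary)
    ids = []
    for token in tokens:
        if token in index:
            # In-vocabulary tokens get consecutive IDs starting from zero.
            ids.append(index[token])
        elif num_oov_buckets > 0:
            # Stand-in hash (assumption); the result falls in
            # [vocab_size, vocab_size + num_oov_buckets).
            bucket = zlib.crc32(token.encode("utf-8")) % num_oov_buckets
            ids.append(vocab_size + bucket)
        else:
            ids.append(default_value)
    return ids


vocabulary = ["a", "b", "c"]
print(apply_vocabulary(["a", "z", "b"], vocabulary))
print(apply_vocabulary(["a", "z", "b"], vocabulary, num_oov_buckets=1))
```

With a single bucket, every out-of-vocabulary token shares the ID vocab_size. More buckets spread unseen tokens across several IDs, at the cost of hash collisions between unrelated tokens.
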
| Raises | Description |
|---|---|
| ValueError | If top_k or frequency_threshold is negative. If coverage_top_k or coverage_frequency_threshold is negative. |
  
  
 
  
    
    
      
    
    
  
       
    
    
  
  
  Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
  Last updated 2024-11-01 UTC.
  
  
  
    