tfds.deprecated.text.SubwordTextEncoder
Invertible TextEncoder using word pieces with a byte-level fallback.
Inherits From: TextEncoder
tfds.deprecated.text.SubwordTextEncoder(
vocab_list=None
)
Encoding is fully invertible because all out-of-vocab wordpieces are
byte-encoded.
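The byte-level fallback can be sketched in plain Python. This is a minimal toy illustration, not the actual tfds implementation: subwords get ids in [1, len(vocab)], and any character not covered by a subword is emitted as its raw UTF-8 bytes, offset past the subword id range, so decoding can always reconstruct the exact input.

```python
# Toy sketch of byte-level fallback (not the real tfds implementation).
VOCAB = ["hell", "o"]                        # hypothetical tiny vocabulary
SW_ID = {sw: i + 1 for i, sw in enumerate(VOCAB)}   # subword ids start at 1
ID_SW = {i: sw for sw, i in SW_ID.items()}
BYTE_BASE = len(VOCAB) + 1                   # byte ids live above subword ids

def encode(text):
    ids, pos = [], 0
    while pos < len(text):
        # Greedy longest-match against the subword vocabulary.
        for end in range(len(text), pos, -1):
            if text[pos:end] in SW_ID:
                ids.append(SW_ID[text[pos:end]])
                pos = end
                break
        else:
            # Fallback: emit the character's UTF-8 bytes, shifted past
            # the subword id range, so nothing is ever unrepresentable.
            ids.extend(BYTE_BASE + b for b in text[pos].encode("utf-8"))
            pos += 1
    return ids

def decode(ids):
    parts = []
    for i in ids:
        if i in ID_SW:
            parts.append(ID_SW[i].encode("utf-8"))
        else:
            parts.append(bytes([i - BYTE_BASE]))
    return b"".join(parts).decode("utf-8")

print(encode("hello"))             # [1, 2]
print(decode(encode("hello é!")))  # round-trips exactly: hello é!
```

Because unknown characters become bytes rather than a lossy `<UNK>` id, `decode(encode(s)) == s` holds for any input string.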
The vocabulary is "trained" on a corpus and all wordpieces are stored in a
vocabulary file. To generate a vocabulary from a corpus, use
tfds.deprecated.text.SubwordTextEncoder.build_from_corpus.
Typical usage:
# Build
encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator, target_vocab_size=2**15)
encoder.save_to_file(vocab_fname)
# Load
encoder = tfds.deprecated.text.SubwordTextEncoder.load_from_file(vocab_fname)
ids = encoder.encode("hello world")
text = encoder.decode([1, 2, 3, 4])
Args:
  vocab_list: list<str>, list of subwords for the vocabulary. Note that an
    underscore at the end of a subword indicates the end of the word (i.e. a
    space will be inserted afterwards when decoding). Underscores in the
    interior of subwords are disallowed; use the underscore escape sequence
    instead.
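The trailing-underscore convention can be illustrated with a small hypothetical helper (not part of the tfds API): a subword ending in "_" marks a word boundary, so a space is inserted after it when decoding.

```python
# Hypothetical illustration of the trailing-underscore convention.
def join_subwords(subwords):
    pieces = []
    for sw in subwords:
        if sw.endswith("_"):
            pieces.append(sw[:-1] + " ")   # end of word -> trailing space
        else:
            pieces.append(sw)              # interior piece -> no space
    return "".join(pieces).rstrip(" ")

print(join_subwords(["hel", "lo_", "world_"]))  # hello world
```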
Attributes:
  subwords
  vocab_size: Size of the vocabulary. Decode produces ints [1, vocab_size).
Methods
build_from_corpus
View source
@classmethod
build_from_corpus(
corpus_generator,
target_vocab_size,
max_subword_length=20,
max_corpus_chars=None,
reserved_tokens=None
)
Builds a SubwordTextEncoder based on the corpus_generator.
Args:
  corpus_generator: generator yielding str, from which subwords will be
    constructed.
  target_vocab_size: int, approximate size of the vocabulary to create.
  max_subword_length: int, maximum length of a subword. Note that memory and
    compute scale quadratically in the length of the longest token.
  max_corpus_chars: int, the maximum number of characters to consume from
    corpus_generator for the purposes of building the subword vocabulary.
  reserved_tokens: list<str>, list of tokens that will always be treated as
    whole tokens and not split up. Note that these must contain a mix of
    alphanumeric and non-alphanumeric characters (e.g. "<EOS>") and not end
    in an underscore.

Returns:
  SubwordTextEncoder.
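A hypothetical sketch (not tfds internals) of why reserved tokens behave this way: they are matched as whole units before any subword splitting, and mixing alphanumeric with non-alphanumeric characters keeps them from colliding with ordinary corpus words. The token names below are assumed examples.

```python
# Hypothetical sketch: reserved tokens survive as single pieces.
import re

RESERVED = ["<EOS>", "<pad>"]   # assumed example tokens

def split_reserved(text, reserved=RESERVED):
    """Split text so each reserved token stays a single whole piece."""
    pattern = "(" + "|".join(re.escape(t) for t in reserved) + ")"
    return [piece for piece in re.split(pattern, text) if piece]

print(split_reserved("hi<EOS>bye"))  # ['hi', '<EOS>', 'bye']
```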
decode
View source
decode(
ids
)
Decodes a list of integers into text.
encode
View source
encode(
s
)
Encodes text into a list of integers.
load_from_file
View source
@classmethod
load_from_file(
filename_prefix
)
Extracts list of subwords from file.
save_to_file
View source
save_to_file(
filename_prefix
)
Save the vocabulary to a file.
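Conceptually, the save_to_file/load_from_file round trip amounts to persisting the ordered subword list and rebuilding it from disk. The sketch below is a hypothetical simplification; the real tfds vocabulary file format differs (it escapes subwords and adds a header).

```python
# Hypothetical sketch of a vocabulary save/load round trip
# (the real tfds file format differs).
import os
import tempfile

def save_vocab(subwords, path):
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(subwords))   # one subword per line, in id order

def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return f.read().split("\n")

path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
save_vocab(["hello_", "wor", "ld_"], path)
loaded = load_vocab(path)
print(loaded)  # ['hello_', 'wor', 'ld_']
```

Order matters here: each subword's line position determines its id, so a saved and reloaded vocabulary must preserve the original ordering to keep encodings stable.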
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-04-26 UTC.