Invertible TextEncoder using word pieces with a byte-level fallback.
Inherits From: TextEncoder
tfds.deprecated.text.SubwordTextEncoder(
vocab_list=None
)
Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded.
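To make the byte-level fallback concrete, here is a minimal pure-Python sketch. It is a toy illustration, not the library's implementation: the class name ToyByteFallbackEncoder, the greedy longest-match strategy, and the id layout (0 reserved, then subwords, then 256 byte ids) are all assumptions chosen for clarity.

```python
from typing import List, Sequence

NUM_BYTES = 256  # one id per possible byte value


class ToyByteFallbackEncoder:
    """Hypothetical toy encoder; NOT tfds.deprecated.text.SubwordTextEncoder."""

    def __init__(self, subwords: Sequence[str]):
        self._subwords = list(subwords)
        # Assumed id layout: 0 is reserved (treated as padding here),
        # subwords take ids 1..n, byte values take ids n+1..n+256.
        self._subword_to_id = {sw: i + 1 for i, sw in enumerate(self._subwords)}
        self._byte_offset = len(self._subwords) + 1
        self._max_len = max((len(sw) for sw in self._subwords), default=0)

    @property
    def vocab_size(self) -> int:
        return 1 + len(self._subwords) + NUM_BYTES

    def encode(self, text: str) -> List[int]:
        ids: List[int] = []
        pos = 0
        while pos < len(text):
            # Greedy longest match against the known subwords.
            for length in range(min(self._max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + length]
                if piece in self._subword_to_id:
                    ids.append(self._subword_to_id[piece])
                    pos += length
                    break
            else:
                # No subword matched: fall back to the UTF-8 bytes of one character.
                ids.extend(self._byte_offset + b for b in text[pos].encode("utf-8"))
                pos += 1
        return ids

    def decode(self, ids: Sequence[int]) -> str:
        out = bytearray()
        for i in ids:
            if i == 0:
                continue  # skip the reserved id (an assumption of this sketch)
            if i >= self._byte_offset:
                out.append(i - self._byte_offset)
            else:
                out.extend(self._subwords[i - 1].encode("utf-8"))
        return out.decode("utf-8")


encoder = ToyByteFallbackEncoder(["hello", " world"])
text = "hello wörld"  # "ö" is not covered by any subword
ids = encoder.encode(text)
assert encoder.decode(ids) == text  # round trip succeeds via byte ids
```

Because every character that no subword covers is still representable as bytes, decoding the encoded ids recovers the original string exactly, which is the property the sentence above describes.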
The vocabulary is "trained" on a corpus and all wordpieces are stored in a
vocabulary file. To generate a vocabulary from a corpus, use
tfds.deprecated.text.SubwordTextEncoder.build_from_corpus.
Typical usage:
# Build
encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator, target_vocab_size=2**15)
encoder.save_to_file(vocab_fname)
# Load
encoder = tfds.deprecated.text.SubwordTextEncoder.load_from_file(vocab_fname)
ids = encoder.encode("hello world")
text = encoder.decode([1, 2, 3, 4])
Attributes | |
---|---|
subwords | |
vocab_size | Size of the vocabulary. Encode produces ints in [1, vocab_size). |
Methods
build_from_corpus
@classmethod
build_from_corpus( corpus_generator, target_vocab_size, max_subword_length=20, max_corpus_chars=None, reserved_tokens=None )
Builds a SubwordTextEncoder based on the corpus_generator.
Args | |
---|---|
corpus_generator | generator yielding str, from which subwords will be constructed. |
target_vocab_size | int, approximate size of the vocabulary to create. |
max_subword_length | int, maximum length of a subword. Note that memory and compute scale quadratically in the length of the longest token. |
max_corpus_chars | int, the maximum number of characters to consume from corpus_generator for the purposes of building the subword vocabulary. |
reserved_tokens | list<str>, list of tokens that will always be treated as whole tokens and not split up. Note that these must contain a mix of alphanumeric and non-alphanumeric characters (e.g. " |
Returns | |
---|---|
SubwordTextEncoder. |
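As a sketch of how these arguments fit together, the helpers below prepare a corpus generator and check reserved tokens before calling build_from_corpus. The helper names (corpus_generator, mixes_alnum_and_symbols, build_encoder), the "<sep>" token, and the chosen target_vocab_size are hypothetical; the check is only a plain reading of the reserved_tokens note above, not the library's own validation.

```python
from typing import Iterable, Iterator, List


def corpus_generator(lines: Iterable[str]) -> Iterator[str]:
    """Yields stripped, non-empty strings (a hypothetical preprocessing choice)."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def mixes_alnum_and_symbols(token: str) -> bool:
    """True if the token has both alphanumeric and non-alphanumeric characters."""
    has_alnum = any(c.isalnum() for c in token)
    has_other = any(not c.isalnum() for c in token)
    return has_alnum and has_other


def build_encoder(lines: Iterable[str], reserved_tokens: List[str]):
    """Builds an encoder from in-memory lines; requires tensorflow_datasets."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    bad = [t for t in reserved_tokens if not mixes_alnum_and_symbols(t)]
    if bad:
        raise ValueError(f"Reserved tokens lack mixed characters: {bad}")
    return tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
        corpus_generator(lines),
        target_vocab_size=2**10,  # illustrative value for a small corpus
        max_subword_length=20,    # the documented default
        reserved_tokens=reserved_tokens,
    )
```

A call such as build_encoder(["some text", "more text"], ["<sep>"]) would then return a SubwordTextEncoder, assuming tensorflow_datasets is installed.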
decode
decode(
ids
)
Decodes a list of integers into text.
encode
encode(
s
)
Encodes text into a list of integers.
load_from_file
@classmethod
load_from_file( filename_prefix )
Extracts list of subwords from file.
save_to_file
save_to_file(
filename_prefix
)
Save the vocabulary to a file.