Tokenizes a tensor of UTF-8 strings into words according to labels.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter
text.SplitMergeTokenizer()
Methods
split
split(input)
Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(input)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize
tokenize(input, labels, force_split_at_break_character=True)
Tokenizes a tensor of UTF-8 strings according to labels.
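Example:
>>> import tensorflow_text as text
>>> strings = ["HelloMonday", "DearFriday"]
>>> # "DearFriday" has 10 characters; its 11th label only pads the row.
>>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
...           [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
>>> tokenizer = text.SplitMergeTokenizer()
>>> tokenizer.tokenize(strings, labels)
<tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>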
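The label rows in the example can be built from known word boundaries. The following is a minimal pure-Python sketch, not part of the library: `labels_from_words` and `pad_labels` are hypothetical helper names, and padding with 0 is an assumption chosen only to reproduce the rectangular label rows shown in the example.

```python
from typing import List, Sequence


def labels_from_words(words: Sequence[str]) -> List[int]:
    """Return one split(0)/merge(1) label per character of the joined words."""
    labels: List[int] = []
    for word in words:
        if not word:
            continue  # an empty word contributes no characters
        # The first character opens a new word; the rest merge into it.
        labels.extend([0] + [1] * (len(word) - 1))
    return labels


def pad_labels(rows: Sequence[Sequence[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad label rows to a common length so they form a dense 2-D list."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [pad_value] * (width - len(row)) for row in rows]


rows = [
    labels_from_words(["Hello", "Monday"]),
    labels_from_words(["Dear", "Friday"]),
]
labels = pad_labels(rows)
print(labels)
# [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1], [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
```

Padding is only needed when the labels are passed as a dense list or Tensor; a RaggedTensor of labels can keep rows of different lengths.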
Args | |
---|---|
input | An N-dimensional Tensor or RaggedTensor of UTF-8 strings. |
labels | An (N+1)-dimensional Tensor or RaggedTensor of int32, with labels[i1...iN, j] being the split (0) or merge (1) label of the j-th character of input[i1...iN]. Split means the character starts a new word; merge means the character is appended to the previous word. |
force_split_at_break_character | A bool indicating whether to force the start of a new word after one or more ICU-defined whitespace characters. If True, the first non-whitespace character that follows starts a new word regardless of its label; if False, that character's label decides. |
Returns | |
---|---|
A RaggedTensor of strings where tokens[i1...iN, j] is the string content of the j-th token in input[i1...iN]. | |
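To make the split/merge semantics concrete, here is a small pure-Python sketch of how labels group the characters of a single string. It is an illustration only, not the library's implementation: `split_merge` is a hypothetical name, whitespace handling (`force_split_at_break_character`) is deliberately left out, and raising an error when there are fewer labels than characters is an assumption of this sketch.

```python
from typing import List, Sequence


def split_merge(text: str, labels: Sequence[int]) -> List[str]:
    """Group the characters of `text` into words using split(0)/merge(1) labels."""
    if len(labels) < len(text):
        # Assumption of this sketch: every character needs its own label.
        raise ValueError("need at least one label per character")
    words: List[str] = []
    for char, label in zip(text, labels):  # extra trailing labels are ignored
        if label == 0 or not words:
            words.append(char)   # split: this character starts a new word
        else:
            words[-1] += char    # merge: append to the previous word
    return words


print(split_merge("HelloMonday", [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]))
# ['Hello', 'Monday']
print(split_merge("DearFriday", [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]))
# ['Dear', 'Friday']
```

The first character always opens a word in this sketch, even if its label is 1, because there is no previous word to merge into.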
tokenize_with_offsets
tokenize_with_offsets(input, labels, force_split_at_break_character=True)
Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.
Example:
>>> import tensorflow_text as text
>>> strings = ["HelloMonday", "DearFriday"]
>>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
...           [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
>>> tokenizer = text.SplitMergeTokenizer()
>>> tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, labels)
>>> tokens
<tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
>>> starts
<tf.RaggedTensor [[0, 5], [0, 4]]>
>>> ends
<tf.RaggedTensor [[5, 11], [4, 10]]>
Args | |
---|---|
input | An N-dimensional Tensor or RaggedTensor of UTF-8 strings. |
labels | An (N+1)-dimensional Tensor or RaggedTensor of int32, with labels[i1...iN, j] being the split (0) or merge (1) label of the j-th character of input[i1...iN]. Split means the character starts a new word; merge means the character is appended to the previous word. |
force_split_at_break_character | A bool indicating whether to force the start of a new word after one or more ICU-defined whitespace characters. If True, the first non-whitespace character that follows starts a new word regardless of its label; if False, that character's label decides. |
Returns | |
---|---|
A tuple (tokens, start_offsets, end_offsets) where: | |
tokens | A RaggedTensor of strings where tokens[i1...iN, j] is the string content of the j-th token in input[i1...iN]. |
start_offsets | A RaggedTensor of int64s where start_offsets[i1...iN, j] is the byte offset of the start of the j-th token in input[i1...iN]. |
end_offsets | A RaggedTensor of int64s where end_offsets[i1...iN, j] is the byte offset immediately after the end of the j-th token in input[i1...iN]. |
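Because the offsets are byte offsets rather than character offsets, they diverge from character positions as soon as the input contains multi-byte UTF-8 characters. The sketch below illustrates this in pure Python for a single string. It is not the library's implementation: `split_merge_with_offsets` is a hypothetical name and whitespace handling is omitted.

```python
from typing import List, Sequence, Tuple


def split_merge_with_offsets(
    text: str, labels: Sequence[int]
) -> Tuple[List[bytes], List[int], List[int]]:
    """Return tokens plus [start, end) byte offsets into the UTF-8 encoding of `text`."""
    tokens: List[bytes] = []
    starts: List[int] = []
    ends: List[int] = []
    position = 0  # current byte position in the UTF-8 encoding of `text`
    for char, label in zip(text, labels):
        encoded = char.encode("utf-8")
        if label == 0 or not tokens:
            # split: open a new token at the current byte position
            tokens.append(encoded)
            starts.append(position)
            ends.append(position + len(encoded))
        else:
            # merge: extend the previous token and move its exclusive end
            tokens[-1] += encoded
            ends[-1] = position + len(encoded)
        position += len(encoded)
    return tokens, starts, ends


# "é" occupies two bytes in UTF-8, so "café" ends at byte 5, not 4.
tokens, starts, ends = split_merge_with_offsets("caféBar", [0, 1, 1, 1, 0, 1, 1])
print([t.decode("utf-8") for t in tokens], starts, ends)
# ['café', 'Bar'] [0, 5] [5, 8]
```

Slicing the encoded input with a returned pair, for example `"caféBar".encode("utf-8")[0:5]`, recovers the bytes of the corresponding token.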