Splits each string into a sequence of code points with start offsets.
tf.strings.unicode_split_with_offsets(
    input, input_encoding, errors='replace', replacement_char=65533, name=None
)
This op is similar to tf.strings.unicode_split(...), but it also returns the
start offset for each character in its respective string. This information
can be used to align the characters with the original byte sequence.
Returns a tuple (chars, start_offsets) where:
- chars[i1...iN, j] is the substring of input[i1...iN] that encodes its jth character, when decoded using input_encoding.
- start_offsets[i1...iN, j] is the start byte offset for the jth character in input[i1...iN], when decoded using input_encoding.
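
The relationship between chars and start_offsets can be illustrated without TensorFlow. The following is a minimal pure-Python sketch of the semantics for a single, valid UTF-8 string; the helper name split_utf8_with_offsets is hypothetical and is not part of the TensorFlow API, and the sketch does not model ragged inputs or error handling.

```python
def split_utf8_with_offsets(data: bytes):
    """Return (chars, start_offsets) for one valid UTF-8 byte string.

    Hypothetical helper that mimics the documented semantics for a
    single string; it is not how TensorFlow implements the op.
    """
    chars, start_offsets = [], []
    offset = 0
    for ch in data.decode("utf-8"):
        encoded = ch.encode("utf-8")   # bytes that encode this character
        chars.append(encoded)
        start_offsets.append(offset)   # byte position where it starts
        offset += len(encoded)         # advance by the encoded width
    return chars, start_offsets


chars, start_offsets = split_utf8_with_offsets("G\xf6\xf6dnight".encode("utf-8"))
print(chars)
print(start_offsets)  # [0, 1, 3, 5, 6, 7, 8, 9, 10]
```

Each "ö" occupies two bytes in UTF-8, which is why the offsets jump from 1 to 3 to 5 before advancing by one byte per ASCII character.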
| Args | |
|---|---|
| input | An N dimensional potentially ragged string tensor with shape [D1...DN]. N must be statically known. | 
| input_encoding | String name for the unicode encoding that should be used to decode each string. | 
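| errors | Specifies the response when an input string can't be converted using the indicated encoding. One of: 'strict' (raise an exception for any illegal substrings), 'replace' (replace illegal substrings with replacement_char), or 'ignore' (skip illegal substrings). | 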
| replacement_char | The replacement codepoint to be used in place of invalid substrings in input when errors='replace'. | 
| name | A name for the operation (optional). | 
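
The default replacement_char, 65533, is the Unicode replacement character U+FFFD. As a rough analogy only, Python's built-in bytes.decode accepts error handlers named 'strict', 'replace', and 'ignore'. The sketch below uses those Python handlers, not the TensorFlow op, and assumes (without confirmation from this page) that the op's modes behave in a broadly similar way on invalid input.

```python
REPLACEMENT_CHAR = 65533  # default replacement_char from the signature above

# b"\xff" can never appear in valid UTF-8, so this string is malformed.
bad = b"G\xffd"

# Python analogue of errors='replace': the invalid byte becomes U+FFFD.
replaced = bad.decode("utf-8", errors="replace")

# Python analogue of errors='ignore': the invalid byte is dropped.
ignored = bad.decode("utf-8", errors="ignore")

# Python analogue of errors='strict': decoding raises an exception.
try:
    bad.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(hex(REPLACEMENT_CHAR))          # 0xfffd
print([ord(c) for c in replaced])     # [71, 65533, 100]
print(ignored, strict_failed)         # Gd True
```
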
| Returns | |
|---|---|
| A tuple of N+1 dimensional tensors (chars, start_offsets). The returned tensors are tf.Tensors if input is a scalar, or tf.RaggedTensors otherwise. | 
Example:
>>> input = [s.encode('utf8') for s in (u'G\xf6\xf6dnight', u'\U0001f60a')]
>>> result = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
>>> result[0].to_list()  # character substrings
[[b'G', b'\xc3\xb6', b'\xc3\xb6', b'd', b'n', b'i', b'g', b'h', b't'],
 [b'\xf0\x9f\x98\x8a']]
>>> result[1].to_list()  # offsets
[[0, 1, 3, 5, 6, 7, 8, 9, 10], [0]]
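
One way to use the offsets for alignment is to turn them into (start, end) byte spans over the original string. The sketch below is plain Python operating on the offsets from the example above; byte_spans is a hypothetical helper, and it assumes no bytes were skipped during decoding (which may not hold with errors='ignore').

```python
def byte_spans(start_offsets, total_bytes):
    """Pair each start offset with the next start (or the string length)."""
    ends = list(start_offsets[1:]) + [total_bytes]
    return list(zip(start_offsets, ends))


original = "G\xf6\xf6dnight".encode("utf-8")   # 11 bytes
offsets = [0, 1, 3, 5, 6, 7, 8, 9, 10]         # offsets from the example above

spans = byte_spans(offsets, len(original))
pieces = [original[start:end] for start, end in spans]

# Slicing by span recovers each character's bytes and, joined, the input.
assert b"".join(pieces) == original
print(spans[1], pieces[1])  # (1, 3) b'\xc3\xb6'
```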