Reads and (optionally) parses Avro files into a dataset.
tfio.experimental.columnar.make_avro_record_dataset(
file_pattern, features, batch_size, reader_schema, reader_buffer_size=None,
num_epochs=None, shuffle=True, shuffle_buffer_size=None, shuffle_seed=None,
prefetch_buffer_size=tf.data.experimental.AUTOTUNE, num_parallel_reads=None,
drop_final_batch=False
)
Provides common functionality such as batching, optional parsing, shuffling,
and applying default values.
Args:
file_pattern: List of files or patterns of Avro file paths.
See tf.io.gfile.glob for pattern rules.
features: A map of feature names mapped to feature information.
batch_size: An int representing the number of records to combine
in a single batch.
reader_schema: The reader schema.
reader_buffer_size: (Optional.) An int specifying the reader's buffer
size in bytes. If None (the default), uses the default value from
AvroRecordDataset.
num_epochs: (Optional.) An int specifying the number of times this
dataset is repeated. If None (the default), cycles through the
dataset forever and drops the final batch.
shuffle: (Optional.) A bool that indicates whether the input
should be shuffled. Defaults to True.
shuffle_buffer_size: (Optional.) Buffer size to use for
shuffling. A larger buffer gives better shuffling but
increases memory usage and startup time. If not provided,
defaults to 10,000 records. Note that the shuffle buffer
size is measured in records.
shuffle_seed: (Optional.) Randomization seed to use for shuffling.
By default uses a pseudo-random seed.
prefetch_buffer_size: (Optional.) An int specifying the number of
feature batches to prefetch for performance improvement.
Defaults to auto-tune. Set to 0 to disable prefetching.
num_parallel_reads: (Optional.) Number of records to parse in
parallel. Defaults to None (no parallelization).
drop_final_batch: (Optional.) Whether the last batch should be
dropped in case its size is smaller than batch_size; the
default behavior is not to drop the smaller batch.
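The effect of batch_size and drop_final_batch on a single pass over the data can be illustrated with plain arithmetic. The sketch below is not part of the library; the helper name batch_sizes is hypothetical, and it assumes batching behaves like the drop_remainder option of tf.data.Dataset.batch.

```python
def batch_sizes(num_records, batch_size, drop_final_batch=False):
    """Return the size of each batch produced from num_records records.

    Hypothetical helper for illustration only; it models one pass over
    the data and does not call TensorFlow.
    """
    full_batches, remainder = divmod(num_records, batch_size)
    sizes = [batch_size] * full_batches
    # A smaller trailing batch is kept unless the caller asks to drop it.
    if remainder and not drop_final_batch:
        sizes.append(remainder)
    return sizes


print(batch_sizes(10, 4))                         # [4, 4, 2]
print(batch_sizes(10, 4, drop_final_batch=True))  # [4, 4]
```

Dropping the final batch gives every batch the same leading dimension, which can matter for code that expects a static batch size.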
Returns:
A dataset, where each element matches the output of parser_fn
except it will have an additional leading batch-size dimension,
or a batch_size-length 1-D tensor of strings if parser_fn is
unspecified.
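A call might look like the following minimal sketch. The schema, the field names (id, score), the feature specs, and the helper build_dataset are all hypothetical; it also assumes that reader_schema is passed as a JSON string and that features accepts tf.io.FixedLenFeature entries keyed by Avro field name. Check these assumptions against your tensorflow-io version.

```python
import json

# Hypothetical Avro schema with two fields; replace with your files' schema.
READER_SCHEMA = json.dumps({
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "score", "type": "float"},
    ],
})


def build_dataset(file_pattern, batch_size=32):
    """Build a batched, shuffled dataset from Avro files (illustrative only)."""
    # Imported here so the schema above can be inspected without TensorFlow.
    import tensorflow as tf
    import tensorflow_io as tfio

    # Assumed feature specs: one scalar per record for each schema field.
    features = {
        "id": tf.io.FixedLenFeature([], tf.int64),
        "score": tf.io.FixedLenFeature([], tf.float32),
    }
    return tfio.experimental.columnar.make_avro_record_dataset(
        file_pattern=file_pattern,
        features=features,
        batch_size=batch_size,
        reader_schema=READER_SCHEMA,
        num_epochs=1,              # one pass instead of repeating forever
        shuffle=True,
        shuffle_buffer_size=1000,  # measured in records
        shuffle_seed=42,           # fixed seed for repeatable ordering
        drop_final_batch=False,    # keep the smaller last batch
    )
```

Calling build_dataset with a glob pattern that matches your Avro files (the pattern is up to you) would return a dataset that can be iterated batch by batch.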