A StatsGenerator which computes statistics using a combiner function.
tfdv.CombinerStatsGenerator(
name: Text, schema: Optional[schema_pb2.Schema] = None
) -> None
This class computes statistics using a combiner function. It processes one batch of examples at a time to produce partial states, merges the partial states, and finally computes the statistics from the merged partial state.
This object mirrors a beam.CombineFn except for the add_input interface, which is expected to be defined by its subclasses. Specifically, the generator must implement the following four methods:
- create_accumulator(): Initializes an accumulator to store the partial state and returns it.
- add_input(accumulator, input_record_batch): Incorporates a batch of input examples (represented as an Arrow RecordBatch) into the current accumulator and returns the updated accumulator.
- merge_accumulators(accumulators): Merges the partial states in the accumulators and returns the accumulator containing the merged state.
- extract_output(accumulator): Computes statistics from the partial state in the accumulator and returns the result as a DatasetFeatureStatistics proto.
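The four-method contract above can be sketched without Beam or TFDV installed. This is a minimal stand-in, not the library's implementation: `ValueCountSketch` and `run_locally` are hypothetical names, the class does not subclass tfdv.CombinerStatsGenerator, each batch is a plain dict of feature name to per-example value lists (standing in for an Arrow RecordBatch), and extract_output returns a dict where a real generator would return a DatasetFeatureStatistics proto.

```python
from typing import Dict, Iterable, List

# Hypothetical stand-ins: a batch maps feature name -> per-example value lists,
# and an accumulator maps feature name -> total number of values seen so far.
Batch = Dict[str, List[List[float]]]
Acc = Dict[str, int]


class ValueCountSketch:
    """Counts values per feature, following the four-method combiner contract."""

    def create_accumulator(self) -> Acc:
        return {}

    def add_input(self, accumulator: Acc, input_batch: Batch) -> Acc:
        for feature, examples in input_batch.items():
            n = sum(len(values) for values in examples)
            accumulator[feature] = accumulator.get(feature, 0) + n
        return accumulator

    def merge_accumulators(self, accumulators: Iterable[Acc]) -> Acc:
        merged: Acc = {}
        for acc in accumulators:
            for feature, n in acc.items():
                merged[feature] = merged.get(feature, 0) + n
        return merged

    def extract_output(self, accumulator: Acc) -> Dict[str, int]:
        # A real generator would build a DatasetFeatureStatistics proto here.
        return dict(accumulator)


def run_locally(gen: ValueCountSketch, shards: List[List[Batch]]) -> Dict[str, int]:
    """Drives the generator with one accumulator per shard, then merges them."""
    partials = []
    for shard in shards:
        acc = gen.create_accumulator()
        for batch in shard:
            acc = gen.add_input(acc, batch)
        partials.append(acc)
    return gen.extract_output(gen.merge_accumulators(partials))


shards = [
    [{"age": [[31.0], [42.0]], "tags": [[1.0, 2.0], []]}],
    [{"age": [[27.0]]}, {"tags": [[3.0]]}],
]
result = run_locally(ValueCountSketch(), shards)
```

Each shard builds its own partial state, so the final counts do not depend on how the batches were distributed across shards.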
Args | |
---|---|
name | A unique name associated with the statistics generator. |
schema | An optional schema for the dataset. |
Attributes | |
---|---|
name | |
schema | |
Methods
add_input
add_input(
accumulator: ACCTYPE, input_record_batch: pa.RecordBatch
) -> ACCTYPE
Returns the result of folding a batch of inputs into the accumulator.
Args | |
---|---|
accumulator | The current accumulator, which may be modified and returned for efficiency. |
input_record_batch | An Arrow RecordBatch whose columns are features and rows are examples. The columns are of type List |
Returns | |
---|---|
The accumulator after updating the statistics for the batch of inputs. |
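The note that the accumulator "may be modified and returned for efficiency" can be illustrated as follows. This is a sketch under stated assumptions: `RangeAcc` is a hypothetical accumulator type, and the batch is a plain dict of per-example value lists rather than a real Arrow RecordBatch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RangeAcc:
    """Hypothetical accumulator: running min and max of all values seen."""
    lo: float = float("inf")
    hi: float = float("-inf")


def add_input(accumulator: RangeAcc,
              input_batch: Dict[str, List[List[float]]]) -> RangeAcc:
    # Stand-in for a RecordBatch: feature name -> per-example value lists.
    for examples in input_batch.values():
        for values in examples:
            for v in values:
                accumulator.lo = min(accumulator.lo, v)
                accumulator.hi = max(accumulator.hi, v)
    # Return the same object that was passed in; no copy is made.
    return accumulator


acc = RangeAcc()
out = add_input(acc, {"price": [[9.5, 3.0], [], [12.25]]})
```

Because the update happens in place, `out` and `acc` refer to the same object, which avoids allocating a new accumulator for every batch.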
compact
compact(
accumulator: ACCTYPE
) -> ACCTYPE
Returns a compact representation of the accumulator.
This is optionally called before an accumulator is sent across the wire. The base-class implementation is a no-op; derived classes may override it.
Args | |
---|---|
accumulator | The accumulator to compact. |
Returns | |
---|---|
The compacted accumulator. By default, this is the input accumulator, unchanged. |
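One situation where overriding compact might pay off is an accumulator that buffers raw values. The sketch below is illustrative only: `BufferedSumAcc`, `add_values`, and `mean` are hypothetical names, and the buffering strategy is an assumption, not something the base class prescribes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BufferedSumAcc:
    """Hypothetical accumulator: exact totals plus a buffer of unprocessed values."""
    count: int = 0
    total: float = 0.0
    pending: List[float] = field(default_factory=list)


def add_values(acc: BufferedSumAcc, values: List[float]) -> BufferedSumAcc:
    # Cheap append; the arithmetic is deferred until compact() or extraction.
    acc.pending.extend(values)
    return acc


def compact(acc: BufferedSumAcc) -> BufferedSumAcc:
    # Fold the buffer into the two summary fields so less data is serialized.
    acc.count += len(acc.pending)
    acc.total += sum(acc.pending)
    acc.pending = []
    return acc


def mean(acc: BufferedSumAcc) -> float:
    n = acc.count + len(acc.pending)
    return (acc.total + sum(acc.pending)) / n if n else 0.0


acc = add_values(BufferedSumAcc(), [1.0, 2.0, 3.0, 6.0])
before = mean(acc)
acc = compact(acc)
after = mean(acc)
```

Compaction changes only the representation: the mean computed before and after is the same, while the buffer that would have been sent across the wire is empty.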
create_accumulator
create_accumulator() -> ACCTYPE
Returns a fresh, empty accumulator.
Returns | |
---|---|
An empty accumulator. |
extract_output
extract_output(
accumulator: ACCTYPE
) -> statistics_pb2.DatasetFeatureStatistics
Returns the result of converting the accumulator into the output value.
Args | |
---|---|
accumulator | The final accumulator value. |
Returns | |
---|---|
A proto representing the result of this stats generator. |
merge_accumulators
merge_accumulators(
accumulators: Iterable[ACCTYPE]
) -> ACCTYPE
Merges several accumulators into a single accumulator value.
Args | |
---|---|
accumulators | The accumulators to merge. |
Returns | |
---|---|
The merged accumulator. |
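Since the argument is typed as an Iterable, a cautious implementation reads it in a single pass and does not assume it can be iterated twice. The sketch below uses a hypothetical `MeanAcc` tuple of (sum, count), and a hypothetical `stream` helper to mimic a one-shot iterable. It also checks that merging in two steps gives the same result as merging all at once, a property a combiner generally needs because the grouping of partial states is not under the generator's control.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical accumulator: (sum of values, number of values).
MeanAcc = Tuple[float, int]


def merge_accumulators(accumulators: Iterable[MeanAcc]) -> MeanAcc:
    # Single pass: the iterable may be a generator that cannot be re-read.
    total, count = 0.0, 0
    for s, n in accumulators:
        total += s
        count += n
    return (total, count)


def stream(parts: List[MeanAcc]) -> Iterator[MeanAcc]:
    # Yields accumulators lazily to mimic a one-shot iterable.
    for p in parts:
        yield p


a, b, c = (10.0, 2), (5.0, 1), (9.0, 3)
all_at_once = merge_accumulators(stream([a, b, c]))
in_two_steps = merge_accumulators([merge_accumulators([a, b]), c])
```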
setup
setup() -> None
Prepares an instance for combining.
Subclasses should put costly initializations here instead of in `__init__()`, so that 1) the cost is properly recognized by Beam as setup cost (per worker) and 2) the cost is not paid at pipeline construction time.
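The split between cheap construction and costly setup might look like the sketch below. `VocabLookupSketch` and its `token_id` helper are hypothetical and do not subclass the real generator; the dictionary build stands in for whatever expensive resource a subclass needs.

```python
from typing import Dict, Optional


class VocabLookupSketch:
    """Hypothetical generator-like class that defers a costly build to setup()."""

    def __init__(self, name: str, vocab_size: int) -> None:
        # Cheap: only store configuration at construction time.
        self.name = name
        self._vocab_size = vocab_size
        self._lookup: Optional[Dict[str, int]] = None

    def setup(self) -> None:
        # Costly work (simulated here) is deferred until setup() is called.
        self._lookup = {f"token_{i}": i for i in range(self._vocab_size)}

    def token_id(self, token: str) -> int:
        if self._lookup is None:
            raise RuntimeError("setup() must be called before use")
        return self._lookup.get(token, -1)


gen = VocabLookupSketch(name="vocab_sketch", vocab_size=1000)
built_at_construction = gen._lookup is not None
gen.setup()
token = gen.token_id("token_42")
```

Constructing the object does no heavy work, so building the pipeline stays fast; the lookup table exists only after setup() has run.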