Validates examples in CSV files.
tfdv.validate_examples_in_csv(
    data_location: Text,
    stats_options: tfdv.StatsOptions,
    column_names: Optional[List[types.FeatureName]] = None,
    delimiter: Text = ',',
    output_path: Optional[Text] = None,
    pipeline_options: Optional[PipelineOptions] = None,
    num_sampled_examples=0
) -> Union[
    statistics_pb2.DatasetFeatureStatisticsList,
    Tuple[statistics_pb2.DatasetFeatureStatisticsList, Mapping[str, pd.DataFrame]]
]
Runs a Beam pipeline to detect anomalies on a per-example basis. If this
function detects anomalous examples, it generates summary statistics regarding
the set of examples that exhibit each anomaly.
This is a convenience function for users with data in CSV format.
Users with data in unsupported file/data formats, or users who wish
to create their own Beam pipelines need to use the 'IdentifyAnomalousExamples'
PTransform API directly instead.
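A minimal sketch of a typical call is shown here, assuming a schema is first inferred from a reference CSV with `tfdv.generate_statistics_from_csv` and `tfdv.infer_schema`. The helper names `write_csv`, `find_anomalous_rows`, and `demo`, and the sample rows, are hypothetical and only serve the illustration. TFDV is imported inside the function so the CSV helper can be used on its own.

```python
import csv
import os
import tempfile


def write_csv(path, header, rows):
    """Write a small CSV file whose first line is the header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def find_anomalous_rows(reference_csv, candidate_csv, num_sampled_examples=5):
    """Infer a schema from reference_csv, then validate candidate_csv against it."""
    # Imported lazily; requires the tensorflow_data_validation package.
    import tensorflow_data_validation as tfdv

    reference_stats = tfdv.generate_statistics_from_csv(data_location=reference_csv)
    schema = tfdv.infer_schema(reference_stats)

    # stats_options must carry a schema; otherwise a ValueError is raised.
    options = tfdv.StatsOptions(schema=schema)

    # With a nonzero num_sampled_examples, the result is a
    # (statistics proto, {anomaly reason: DataFrame}) tuple.
    return tfdv.validate_examples_in_csv(
        data_location=candidate_csv,
        stats_options=options,
        num_sampled_examples=num_sampled_examples,
    )


def demo():
    """Hypothetical end-to-end run on two tiny CSV files."""
    workdir = tempfile.mkdtemp()
    reference = os.path.join(workdir, "reference.csv")
    candidate = os.path.join(workdir, "candidate.csv")
    write_csv(reference, ["id", "color"], [[1, "red"], [2, "green"]])
    write_csv(candidate, ["id", "color"], [[3, "red"], [4, "purple"]])

    stats, samples = find_anomalous_rows(reference, candidate)
    for reason, frame in samples.items():
        print(reason, len(frame))
```

Calling `demo()` runs the pipeline locally with the default runner. Whether a given row is flagged depends on the inferred schema, so no particular output is claimed.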
Args |
data_location
|
The location of the input data files.
|
stats_options
|
tfdv.StatsOptions for generating data statistics. This must
contain a schema.
|
column_names
|
A list of column names to be treated as the CSV header. The order
must match the order of the columns in the input CSV files. If this argument
is not specified, the first line of each input CSV file is treated as the header.
Note that this option is valid only for 'csv' input file format.
|
delimiter
|
A one-character string used to separate fields in a CSV file.
|
output_path
|
The file path to output data statistics result to. If None, the
function uses a temporary directory. The output will be a TFRecord file
containing a single data statistics list proto, and can be read with the
'load_statistics' function. If you run this function on Google Cloud, you
must specify an output_path. Specifying None may cause an error.
|
pipeline_options
|
Optional Beam pipeline options. This allows users to
specify various Beam pipeline execution parameters, such as the pipeline runner
(DirectRunner or DataflowRunner), the Cloud Dataflow service project ID, etc.
See https://cloud.google.com/dataflow/pipelines/specifying-exec-params for
more details.
|
num_sampled_examples
|
If nonzero, returns up to this many examples of each
anomaly type as a map from anomaly reason string to pd.DataFrame.
|
Returns |
If num_sampled_examples is zero, returns a single
DatasetFeatureStatisticsList proto in which each dataset consists of the
set of examples that exhibit a particular anomaly. If
num_sampled_examples is nonzero, returns the same statistics
proto as well as a mapping from anomaly to a pd.DataFrame of CSV rows
exhibiting that anomaly.
|
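Because the return shape depends on num_sampled_examples, calling code may want to normalize it. The sketch below is one way to do that; `split_validation_result` and `examples_per_dataset` are hypothetical helpers, not part of TFDV. It relies only on the `datasets`, `name`, and `num_examples` fields of the statistics protos, and it assumes (without confirmation from this page) that each dataset's name identifies the anomaly it covers.

```python
def split_validation_result(result):
    """Return (stats_proto, samples_by_anomaly) for either return shape.

    When num_sampled_examples was zero, only the proto is returned, so an
    empty mapping is substituted for the samples.
    """
    if isinstance(result, tuple):
        stats, samples = result
        return stats, dict(samples)
    return result, {}


def examples_per_dataset(stats):
    """Map each dataset name in the statistics list to its example count."""
    return {dataset.name: dataset.num_examples for dataset in stats.datasets}
```

For instance, `stats, samples = split_validation_result(tfdv.validate_examples_in_csv(...))` works whether or not sampling was requested.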
Raises |
ValueError
|
If the specified stats_options does not include a schema.
|