SDM Profiler Service¶
The Agoora profiler service takes a gRPC stream of ProfileRequest messages (an id string and a JSON string) as input and tries to decode the JSON. If decoding succeeds, pandas profiling is run against the decoded samples. The service also offers the possibility to inspect a stream for its
quality (level of specification and integrity). For more information about data quality, see the Data quality inspection section below.
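Conceptually, the profiling step looks roughly like the sketch below (a simplified illustration, not the service's actual code; `profile_samples` is a hypothetical helper): the streamed JSON samples are decoded, collected into a pandas DataFrame, and handed to pandas profiling.

```python
import json

import pandas as pd
from pandas_profiling import ProfileReport


def profile_samples(json_samples):
    """Decode JSON samples and return a pandas-profiling report as HTML."""
    records = [json.loads(sample) for sample in json_samples]  # fails on invalid JSON
    df = pd.DataFrame(records)                                 # one row per ProfileRequest
    report = ProfileReport(df, title="Pandas Profiling Report")
    return report.to_html()
```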
Configuration¶
Environment variables:
| Environment variable | Default | Description |
|---|---|---|
| `PROFILER_SERVER_PORT` | 8089 | Port the gRPC server listens on |
| `PROFILER_TIMEOUT` | 30 | Seconds until the profiling times out. On huge samples with many correlations the profiling would otherwise take too long. |
There are two ways to configure the pandas profiler:

- Put the config file at `config/pandas_config.yaml` (`/app/config/pandas_config.yaml` within the Docker container).
- Pass environment variables. Note that the variable names are cast to lower case, a single `_` is converted to `.` and `__` to `_`, where the `.` is used to construct the config hierarchy.
Example: the number of displayed samples is set in the settings file as

```yaml
samples:
  head: 10
  tail: 10
```

To change it, one can set `SAMPLES_HEAD=5`. The rule is simply a concatenation of the keys on the different YAML levels, joined by `_`.
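The mapping rule can be illustrated with a small sketch (a hypothetical helper, not part of the service code):

```python
def env_to_config_key(name: str) -> str:
    """Map an environment variable name to a pandas-profiling config key."""
    placeholder = "\x00"                  # protect double underscores first
    key = name.lower().replace("__", placeholder).replace("_", ".")
    return key.replace(placeholder, "_")


assert env_to_config_key("SAMPLES_HEAD") == "samples.head"
assert env_to_config_key("HTML_STYLE_FULL__WIDTH") == "html.style.full_width"
```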
The defaults are set like this:
```yaml
# Title of the document
title: "Pandas Profiling Report"

# Number of workers (0=multiprocessing.cpu_count())
pool_size: 1

# Which diagrams to show
missing_diagrams:
  bar: False
  matrix: False
  heatmap: False
  dendrogram: False

interactions:
  continuous: False

correlations:
  pearson:
    calculate: False
  spearman:
    calculate: False
  kendall:
    calculate: False
  phi_k:
    calculate: False
  cramers:
    calculate: False
  recoded:
    calculate: False
  check_recoded: False

html:
  # Styling options for the HTML report
  style:
    full_width: True
    navbar_show: False
```
Refer to the pandas profiling documentation for more details on the configuration; see `config_default.yaml`.
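As a rough illustration of how such a file can be consumed, the sketch below assumes a pandas-profiling release whose `ProfileReport` accepts a `config_file` argument (older releases load the file through the library's config module instead); the path mirrors the container layout described above.

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame({"x": [0, 1, 2], "y": [2, 3, 4]})

# config_file support is assumed here; adjust for the installed pandas-profiling version.
report = ProfileReport(df, config_file="config/pandas_config.yaml")
report.to_file("report.html")
```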
API¶
See proto files:
syntax = "proto3";
package io.spoud.sdm.profiler.v1alpha1;
option java_package = "io.spoud.sdm.profiler.service.v1alpha1";
option java_multiple_files = true;
import "profiler/domain/v1alpha1/domain.proto";
service Profiler {
rpc ProfileDataStream (stream ProfileRequest) returns (stream ProfileDataStreamResponse);
rpc InspectQuality (stream InspectionRequest) returns (InspectionDataStreamResponse);
}
message ProfileRequest {
string request_id = 1;
string json_data = 2;
}
message ProfileDataStreamResponse {
oneof response {
io.spoud.sdm.profiler.domain.v1alpha1.Meta meta = 1;
string profile = 2;
}
}
message InspectionRequest {
string samples_json = 1;
string schema_json = 2;
bool is_schema_inferred = 3;
}
message InspectionDataStreamResponse {
oneof response {
io.spoud.sdm.profiler.domain.v1alpha1.QualityMetrics metric = 1;
io.spoud.sdm.profiler.domain.v1alpha1.InspectionError error = 2;
}
}
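For reference, a bidirectional-streaming call against this service could look roughly like the following Python sketch (module and stub names depend on how the stubs are generated from the proto, so treat them as assumptions):

```python
import grpc

# Assumed module paths for the generated stubs.
from profiler.service.v1alpha1 import profiler_pb2, profiler_pb2_grpc


def profile_samples(samples):
    channel = grpc.insecure_channel("localhost:8089")
    stub = profiler_pb2_grpc.ProfilerStub(channel)

    def requests():
        for i, sample in enumerate(samples):
            yield profiler_pb2.ProfileRequest(request_id=str(i), json_data=sample)

    # ProfileDataStream is bidirectional: stream requests in, stream responses out.
    for response in stub.ProfileDataStream(requests()):
        if response.HasField("profile"):
            print(response.profile)   # the HTML profile
        else:
            print(response.meta)      # metadata / progress information


profile_samples(['{"x": 0, "y": 2}', '{"x": 1, "y": 3}'])
```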
Test connectivity by GRPCC request¶
To send a sample request to the service with GRPCC run:
```sh
grpcc --proto proto/profiler/service/v1alpha1/profiler.proto --address localhost:8089 -i
```
Within the grpcc command line run:
```js
let em
em = client.profileDataStream(pr)
em.write({request_id: '1', json_data: '{"x": 0, "y": 2 }'})
em.write({request_id: '1', json_data: '{"x": 0, "y": 2 }'})
em.write({request_id: '1', json_data: '{"x": 0, "y": 2 }'})
em.write({request_id: '1', json_data: '{"x": 0, "y": 2 }'})
em.end()
// the output will be dumped here after a few seconds...
```
Data quality inspection¶
At the moment only attribute quality is implemented, so data is validated on the attribute level only; no row-level (e.g. conditional logic between fields) or set-level (e.g. distribution, continuity) metrics exist yet. The design, however, is open for extension, and additional metrics can be implemented.
For attribute quality the following semantics have been defined:
Attribute quality¶
This is the composite metric of integrity and specification, because the quality of data is only meaningful with respect to the corresponding expectations, i.e. the specification. In the current implementation, attribute quality weights integrity and specification equally.
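Expressed as a small sketch (an assumed reading of the equal weighting, with both scores on a 0–1 scale):

```python
def attribute_quality(specification: float, integrity: float) -> float:
    """Equal weighting of specification and integrity, both on a 0..1 scale (assumed)."""
    return 0.5 * specification + 0.5 * integrity
```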
Attribute specification¶
Attribute specification measures how precise the expectations of the schema (e.g. type, min, max) are. The expectations are defined separately for numbers and strings (a scoring sketch follows the lists below):
Numbers¶
- 0%: no type is specified
- 50%: type is specified
- 75%: type and either minimum or maximum are specified
- 100%: type and both minimum and maximum are specified
Strings¶
- 0%: no type is specified
- 50%: type is specified
- 100%: type and a regex pattern are specified
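These rules can be summarized in a small scoring sketch (illustrative only; the JSON-Schema-style keywords `type`, `minimum`, `maximum` and `pattern` are assumptions about how the schema expectations are expressed):

```python
def specification_score(schema: dict) -> float:
    """Score one attribute's schema entry according to the rules above."""
    if "type" not in schema:
        return 0.0
    if schema["type"] in ("integer", "number"):
        bounds = ("minimum" in schema) + ("maximum" in schema)
        return {0: 0.5, 1: 0.75, 2: 1.0}[bounds]
    if schema["type"] == "string":
        return 1.0 if "pattern" in schema else 0.5
    return 0.5  # type specified, no further expectations defined


specification_score({"type": "integer", "minimum": 0})          # -> 0.75
specification_score({"type": "string", "pattern": "^[a-z]+$"})  # -> 1.0
```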
Attribute integrity¶
Integrity evaluates how many of the samples are compliant with the specification.
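Attribute integrity can be read as the fraction of compliant samples (an assumed formalization consistent with the description above):

```python
def attribute_integrity(samples, is_compliant) -> float:
    """Fraction of samples that satisfy the attribute's specification."""
    values = list(samples)
    if not values:
        return 0.0  # no samples -> no integrity score; 0.0 is a convention here
    return sum(is_compliant(value) for value in values) / len(values)


# e.g. 3 of 4 samples match an integer-typed attribute -> integrity of 0.75
attribute_integrity([1, 2, "oops", 4], lambda v: isinstance(v, int))
```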