
SDM Profiler Service

The Agoora profiler service takes a gRPC stream of ProfileRequest messages (an id string and a JSON string) as input and tries to decode the JSON. If decoding succeeds, pandas profiling is run against the decoded samples. The service also offers the possibility to inspect a stream for its quality (level of specification and integrity). For more information about data quality see here.
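
Conceptually, each request on the stream carries one JSON document; once the samples are collected, the service builds a DataFrame and renders a pandas profiling report. The sketch below only illustrates that flow, it is not the service's actual code, and the function name is hypothetical.

import json

import pandas as pd
from pandas_profiling import ProfileReport

def profile_json_samples(json_samples):
    # Decode every sample; invalid JSON raises here, mirroring the decode step above.
    records = [json.loads(sample) for sample in json_samples]
    df = pd.DataFrame(records)
    # Pandas profiling runs against the decoded samples; the defaults are described below.
    report = ProfileReport(df)
    return report.to_html()

# profile_json_samples(['{"x": 0, "y": 2}', '{"x": 1, "y": 3}'])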

Configuration

Environment variables:

Environment variable    Default  Description
PROFILER_SERVER_PORT    8089     Port the gRPC server listens on
PROFILER_TIMEOUT        30       Seconds until profiling times out; on huge samples with many correlations, profiling would otherwise take too long

There are two ways to configure the pandas profiler:

  1. Put the config file at config/pandas_config.yaml (/app/config/pandas_config.yaml within the docker container).
  2. Pass environment variables. Variable names are cast to lower case, a single _ becomes a . and a double __ becomes a literal _, where the . separators build the config hierarchy.

Example: the number of samples displayed is set in the settings file as

samples:
    head: 10
    tail: 10

Alternatively, one can set SAMPLES_HEAD=5 to override samples.head. The rule is simply the concatenation of the keys of the different YAML levels, joined by _.
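
A minimal sketch of that mapping rule, under the assumption that the conversion works exactly as described above (the helper name and the second example key are illustrative):

def env_to_config_key(name: str) -> str:
    # SAMPLES_HEAD            -> samples.head
    # HTML_STYLE_FULL__WIDTH  -> html.style.full_width  (double _ keeps a literal underscore)
    placeholder = "\0"
    return (
        name.lower()
        .replace("__", placeholder)
        .replace("_", ".")
        .replace(placeholder, "_")
    )

assert env_to_config_key("SAMPLES_HEAD") == "samples.head"
assert env_to_config_key("HTML_STYLE_FULL__WIDTH") == "html.style.full_width"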

The defaults are set like this:

# Title of the document
title: "Pandas Profiling Report"

# Number of workers (0=multiprocessing.cpu_count())
pool_size: 1

# which diagrams to show
missing_diagrams:
    bar: False
    matrix: False
    heatmap: False
    dendrogram: False

interactions:
  continuous: False

correlations:
    pearson:
      calculate: False
    spearman:
      calculate: False
    kendall:
      calculate: False
    phi_k:
      calculate: False
    cramers:
      calculate: False
    recoded:
      calculate: False

check_recoded: False

html:
  # Styling options for the HTML report
    style:
      full_width: True
    navbar_show: False

Refer to the pandas profiling documentation for more details on configuration. See config_default.yaml for the defaults shown above.
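
Recent pandas profiling releases accept such a YAML file directly when a report is built; whether the service wires it up exactly this way is an assumption, the snippet only shows how a file like the one above can be consumed:

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame({"x": [0, 1, 2], "y": [2, 3, 4]})
# config_file points at the path mentioned in the Configuration section.
report = ProfileReport(df, config_file="config/pandas_config.yaml")
report.to_file("report.html")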

API

See proto files:

syntax = "proto3";
package io.spoud.sdm.profiler.v1alpha1;

option java_package = "io.spoud.sdm.profiler.service.v1alpha1";
option java_multiple_files = true;

import "profiler/domain/v1alpha1/domain.proto";

service Profiler {
  rpc ProfileDataStream (stream ProfileRequest) returns (stream ProfileDataStreamResponse);
  rpc InspectQuality (stream InspectionRequest) returns (InspectionDataStreamResponse);
}

message ProfileRequest {
  string request_id = 1;
  string json_data = 2;
}

message ProfileDataStreamResponse {
  oneof response {
    io.spoud.sdm.profiler.domain.v1alpha1.Meta meta = 1;
    string profile = 2;
  }
}

message InspectionRequest {
  string samples_json = 1;
  string schema_json = 2;
  bool is_schema_inferred = 3;
}

message InspectionDataStreamResponse {

  oneof response {
    io.spoud.sdm.profiler.domain.v1alpha1.QualityMetrics metric = 1;
    io.spoud.sdm.profiler.domain.v1alpha1.InspectionError error = 2;
  }
}
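
As a complement to the grpcc session below, a Python client generated from these proto files could stream requests roughly as follows; the module names profiler_pb2/profiler_pb2_grpc are the usual protoc output names and, like the plaintext channel, are assumptions:

import grpc

# Assumed names of the stubs generated from the proto files above.
from profiler.service.v1alpha1 import profiler_pb2, profiler_pb2_grpc

def requests():
    for sample in ['{"x": 0, "y": 2}', '{"x": 1, "y": 3}']:
        yield profiler_pb2.ProfileRequest(request_id="1", json_data=sample)

channel = grpc.insecure_channel("localhost:8089")
stub = profiler_pb2_grpc.ProfilerStub(channel)

# ProfileDataStream is bidirectional: responses carry either Meta or the HTML profile.
for response in stub.ProfileDataStream(requests()):
    if response.HasField("profile"):
        print(response.profile)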

Test connectivity with a grpcc request

To send a sample request to the service with grpcc, run:

grpcc --proto proto/profiler/service/v1alpha1/profiler.proto --address localhost:8089 -i

Within the grpcc command line run:

let em
em = client.profileDataStream(pr)
em.write({request_id: '1', json_data: '{"x": 0, "y": 2 }'})
em.write({request_id: '1', json_data: '{"x": 0, "y": 2 }'})
em.write({request_id: '1', json_data: '{"x": 0, "y": 2 }'})
em.write({request_id: '1', json_data: '{"x": 0, "y": 2 }'})
em.end()
// the output is dumped here after a few seconds...

Data quality inspection

At the moment only attribute quality is implemented, so data is validated on the attribute level only; no row metrics (e.g. conditional logic between fields) or set metrics (e.g. distribution, continuity) exist yet. The design, however, is open for extension, and additional metrics can be implemented.

For attribute quality the following semantics have been defined:

Attribute quality

This is the composite metric of integrity and specification, because data quality is only meaningful with respect to corresponding expectations, i.e. specifications. In the current implementation attribute quality weights integrity and specification equally.

Attribute specification

Attribute specification measures how precisely the schema states its expectations (e.g. type, min, max). The expectations are defined separately for numbers and strings:

Numbers
  • 0%: no type is specified
  • 50%: type is specified
  • 75%: type and either minimum or maximum is specified
  • 100%: type and both minimum and maximum are specified
Strings
  • 0%: no type is specified
  • 50%: type is specified
  • 100%: type and a regex pattern is specified

Attribute integrity

Attribute integrity measures how many of the samples comply with the specification. A small worked sketch combining both metrics follows below.
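
The sketch applies the semantics above to a single numeric attribute: the scoring thresholds come from the lists under attribute specification and the equal weighting from attribute quality, while the function names, the JSON-Schema-style keywords and the compliance check are assumptions.

def number_specification_score(schema):
    # 0% / 50% / 75% / 100% per the "Numbers" list above.
    if "type" not in schema:
        return 0.0
    has_min = "minimum" in schema
    has_max = "maximum" in schema
    if has_min and has_max:
        return 1.0
    if has_min or has_max:
        return 0.75
    return 0.5

def integrity_score(samples, schema):
    # Fraction of samples compliant with the (possibly partial) specification.
    def compliant(value):
        return (
            isinstance(value, (int, float))
            and schema.get("minimum", float("-inf")) <= value
            and value <= schema.get("maximum", float("inf"))
        )
    return sum(compliant(value) for value in samples) / len(samples)

schema = {"type": "number", "minimum": 0}   # type and minimum only
samples = [1, 5, -2, 3]                     # one sample violates the minimum

specification = number_specification_score(schema)   # 0.75
integrity = integrity_score(samples, schema)          # 0.75
quality = (specification + integrity) / 2             # equal weighting -> 0.75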