automated testing for undocumented assumptions · 2020. 7. 15. · automated testing for protecting...
Automated Testing For Protecting Data Pipelines from Undocumented Assumptions
Eugene Mandel
Head of Product, Superconductive
Agenda
I. What is pipeline debt?
II. How does Great Expectations beat pipeline debt?
III. How can I get started?
What is pipeline debt?
Technical debt in data pipelines, mainly as a result of missing tests and documentation.
Your data pipeline
wants to be a hairball
Undocumented, Untested, Unstable
Solution: automated testing, BUT code testing ≠ data testing
How does Great Expectations beat pipeline debt?
Always know what to expect from your data
▪ Public launch in 2018
▪ Full-time, active development started June 2019
▪ Most popular OSS library for data pipeline testing
▪ Growing community on Slack and GitHub
An expectation is a declarative statement that describes a property of a dataset
“Values in this column should be between 55 and 90, at least 95% of the time.”
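The "at least 95% of the time" clause can be read as a tolerance on the fraction of rows that pass. Below is a minimal pure-Python sketch of that reading. `values_between_mostly` is a hypothetical helper written for illustration, not the Great Expectations implementation, and skipping missing values is an assumption.

```python
from typing import Iterable, Optional


def values_between_mostly(
    values: Iterable[Optional[float]],
    min_value: float,
    max_value: float,
    mostly: float = 1.0,
) -> dict:
    """Check that at least `mostly` of the non-missing values lie in [min_value, max_value]."""
    # Assumption: missing values (None) are skipped rather than counted as failures.
    present = [v for v in values if v is not None]
    unexpected = [v for v in present if not (min_value <= v <= max_value)]
    # An empty column passes vacuously in this sketch.
    if present:
        fraction_ok = (len(present) - len(unexpected)) / len(present)
    else:
        fraction_ok = 1.0
    return {
        "success": fraction_ok >= mostly,
        "unexpected_count": len(unexpected),
        "fraction_ok": fraction_ok,
    }


# 20 readings, one of which (104) is out of range, so 19 of 20 rows pass.
readings = [68, 70, 71, 72, 69, 75, 80, 66, 73, 74,
            70, 71, 72, 69, 75, 80, 66, 73, 74, 104]
result = values_between_mostly(readings, min_value=55, max_value=90, mostly=0.95)
```

With a stricter threshold such as `mostly=0.97`, the same data would fail, which is how the tolerance turns a single stray reading into either noise or an alert.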
Describe expected behavior
{
  "expectation_type": "expect_column_values_to_be_between",
  "kwargs": {
    "column": "temp_f",
    "max_value": 90,
    "min_value": 55,
    "mostly": 0.97
  },
  "meta": {
    "notes": {
      "format": "markdown",
      "content": [
        "this column contains indoor temp readings - CA, spring and summer"
      ]
    }
  }
}
Declarative language
class PandasDataset ... def expect_column_values_to_be_between( ...
class SparkDFDataset ... def expect_column_values_to_be_between( ...
class SqlAlchemyDataset ... def expect_column_values_to_be_between( ...
Validate: take the compute to the data
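Taking the compute to the data means the check runs where the data lives, and only summary counts travel back. The sketch below illustrates the idea with one expectation evaluated two ways: over a Python list in memory, and as a single aggregate query inside SQLite. The function names and the `readings` table are hypothetical, and this is not how the library's `PandasDataset` or `SqlAlchemyDataset` classes are written.

```python
import sqlite3
from typing import Sequence


def between_in_memory(values: Sequence[float], min_value: float, max_value: float,
                      mostly: float = 1.0) -> bool:
    """Evaluate the expectation on values already loaded into Python."""
    if not values:
        return True
    ok = sum(1 for v in values if min_value <= v <= max_value)
    return ok / len(values) >= mostly


def between_in_sqlite(conn: sqlite3.Connection, table: str, column: str,
                      min_value: float, max_value: float, mostly: float = 1.0) -> bool:
    """Evaluate the same expectation inside the database; only two counts come back."""
    # Identifiers cannot be bound as parameters, so this sketch assumes that
    # `table` and `column` come from trusted configuration.
    query = (
        f'SELECT COUNT("{column}"), '
        f'COALESCE(SUM(CASE WHEN "{column}" BETWEEN ? AND ? THEN 1 ELSE 0 END), 0) '
        f'FROM "{table}" WHERE "{column}" IS NOT NULL'
    )
    total, ok = conn.execute(query, (min_value, max_value)).fetchone()
    return True if total == 0 else ok / total >= mostly


temps = [61.0, 72.5, 68.0, 88.0, 93.0]  # one of five readings is above 90

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (temp_f REAL)")
conn.executemany("INSERT INTO readings VALUES (?)", [(t,) for t in temps])

in_memory_ok = between_in_memory(temps, 55, 90, mostly=0.75)
in_database_ok = between_in_sqlite(conn, "readings", "temp_f", 55, 90, mostly=0.75)
```

Both calls describe the same expectation; only the place where the rows are counted differs, which is the point of having one declarative definition and several execution backends.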
expect_column_to_exist
expect_table_row_count_to_be_between
expect_column_values_to_be_unique
expect_column_values_to_not_be_null
expect_column_values_to_be_between
expect_column_values_to_match_regex
expect_column_values_to_match_strftime_format
expect_column_mean_to_be_between
expect_column_kl_divergence_to_be_less_than
etc. etc. etc.
great_expectations
Expressive and extensible
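Extensible here means a new named check can be added without touching the code that runs checks. One way to picture that is a registry keyed by expectation name. The sketch below is an assumption made for illustration: `EXPECTATIONS`, `expectation`, and `validate` are hypothetical names, and `expect_column_values_to_be_valid_zip5` is an invented custom check, not one shipped by the library.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry mapping an expectation name to the function that checks it.
EXPECTATIONS: Dict[str, Callable[..., bool]] = {}


def expectation(func: Callable[..., bool]) -> Callable[..., bool]:
    """Register a check under its own function name."""
    EXPECTATIONS[func.__name__] = func
    return func


@expectation
def expect_column_values_to_not_be_null(rows: List[dict], column: str) -> bool:
    return all(row.get(column) is not None for row in rows)


@expectation
def expect_column_values_to_match_regex(rows: List[dict], column: str, regex: str) -> bool:
    return all(re.search(regex, str(row[column])) is not None for row in rows)


@expectation
def expect_column_values_to_be_valid_zip5(rows: List[dict], column: str) -> bool:
    # Invented domain-specific check: exactly five digits.
    return all(re.fullmatch(r"\d{5}", str(row[column])) is not None for row in rows)


def validate(rows: List[dict], suite: List[dict]) -> Dict[str, bool]:
    """Run each configured expectation and collect pass/fail by name."""
    return {
        item["expectation_type"]: EXPECTATIONS[item["expectation_type"]](rows, **item["kwargs"])
        for item in suite
    }


rows = [{"zip": "94110", "email": "a@example.com"},
        {"zip": "9411", "email": "b@example.com"}]
suite = [
    {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "email"}},
    {"expectation_type": "expect_column_values_to_match_regex",
     "kwargs": {"column": "email", "regex": r"@"}},
    {"expectation_type": "expect_column_values_to_be_valid_zip5", "kwargs": {"column": "zip"}},
]
report = validate(rows, suite)
```

The custom check is registered the same way as the built-in style ones, so the suite can refer to it by name like any other expectation.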
Your tests are your docs
Your docs are your tests
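Because an expectation is plain data, the same configuration that drives validation can also be rendered as human-readable documentation. A rough sketch of that idea follows. `render_expectation` is a hypothetical helper that handles only the `expect_column_values_to_be_between` example from this talk, and the wording of its output is an assumption, not the format the library produces.

```python
import json

# The temp_f example expectation from this talk, written as valid JSON.
CONFIG = json.loads("""
{
  "expectation_type": "expect_column_values_to_be_between",
  "kwargs": {"column": "temp_f", "min_value": 55, "max_value": 90, "mostly": 0.97},
  "meta": {"notes": {"format": "markdown",
                     "content": ["this column contains indoor temp readings - CA, spring and summer"]}}
}
""")


def render_expectation(config: dict) -> str:
    """Turn one expectation config into a sentence for a docs page."""
    if config["expectation_type"] != "expect_column_values_to_be_between":
        raise ValueError("this sketch renders only expect_column_values_to_be_between")
    kw = config["kwargs"]
    sentence = (f"Values in {kw['column']} should be between "
                f"{kw['min_value']} and {kw['max_value']}")
    if "mostly" in kw:
        # Format the tolerance as a whole-number percentage.
        sentence += f", at least {kw['mostly']:.0%} of the time"
    notes = config.get("meta", {}).get("notes", {}).get("content", [])
    return sentence + "." + "".join(f" Note: {n}." for n in notes)


doc_line = render_expectation(CONFIG)
```

Since the sentence is generated from the test definition, the documentation cannot drift away from what is actually being checked.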
Setup and Configuration
Drift
Outliers
Outage
How can I get started?
▪ Check out GitHub: https://github.com/great-expectations/great_expectations
▪ Read the docs: https://docs.greatexpectations.io/en/latest/
▪ Say hi and ask questions on Slack: https://greatexpectations.io/slack
▪ pip install great_expectations
Thank you!