![Page 1: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/1.jpg)
Automatic Data Validation & Cleaning with PySemantic
Jaidev Deshpande
Data Scientist, Cube26 Software Pvt Ltd
![Page 2: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/2.jpg)
About Me
● Data Scientist at Cube26 Software Pvt Ltd
● Previously software developer at Enthought
● Research assistant at TIFR and UoP
● Active contributor to the SciPy stack
/ jaidevd
/ jaidevd
![Page 3: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/3.jpg)
Typical Data Pipeline
![Page 4: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/4.jpg)
The Problem
● Curating the data and standardizing it across the team
● Data quality problems:
  ○ Unstructured data
  ○ Unorganized data
  ○ Duplicated data
  ○ Irrelevant data
● Communication problems:
  ○ Large and distributed teams
  ○ “What has happened to get the dataset to the current stage?”
  ○ Messier data means more communication.
HOW DO I DESCRIBE THE STRUCTURE OF THE DATA EFFECTIVELY?
![Page 5: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/5.jpg)
![Page 6: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/6.jpg)
PySemantic
![Page 7: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/7.jpg)
Pythonically, PySemantic is:
● A wrapper around pandas parsers and dataframe manipulation routines
● Not a parser
● A loader for feature extraction for machine learning tasks
● A logger for all operations on a dataset

PySemantic supports:
● Recursive elimination of parser errors
● Automatic validation based on rules
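To make "automatic validation based on rules" concrete, here is a minimal sketch in plain pandas. The rule format (`min`, `max`, `allowed`) and the `apply_rules` helper are hypothetical names chosen for illustration; they are not PySemantic's own rule syntax or API.

```python
import pandas as pd


def apply_rules(df, rules):
    """Keep only rows that satisfy every column rule (hypothetical rule format)."""
    mask = pd.Series(True, index=df.index)
    for column, rule in rules.items():
        if "min" in rule:
            mask &= df[column] >= rule["min"]
        if "max" in rule:
            mask &= df[column] <= rule["max"]
        if "allowed" in rule:
            mask &= df[column].isin(rule["allowed"])
    return df[mask]


# Toy data with a negative age, an implausibly large age and an unknown category
df = pd.DataFrame({"age": [25, -3, 40, 130], "sex": ["m", "f", "x", "f"]})
rules = {"age": {"min": 0, "max": 120}, "sex": {"allowed": ["m", "f"]}}

clean = apply_rules(df, rules)
print(clean)
```

Declaring the rules as data rather than as ad hoc filtering code is what would let them live in a shared dictionary file that the whole team can read.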
![Page 8: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/8.jpg)
How it works
$ semantic add mydictionary.yaml

mydataset1:
  path: /path/to/mydataset.csv
  nrows: 100
  use_columns:
    - col_a
    - col_b
    - col_c

>>> from pysemantic import Project
>>> project = Project("myproject")
>>> project.load_dataset("mydataset1")
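Since PySemantic wraps the pandas parsers, a dictionary entry like the one above can be read as a set of parser arguments. The sketch below shows one plausible translation into a `pandas.read_csv` call. The mapping from `use_columns` to `usecols` is an assumption made for illustration, and the in-memory CSV is a hypothetical stand-in for `/path/to/mydataset.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file named in the data dictionary
csv_text = "col_a,col_b,col_c,col_d\n1,2,3,4\n5,6,7,8\n"

# The schema entries from the dictionary, written as a plain dict
schema = {"nrows": 100, "use_columns": ["col_a", "col_b", "col_c"]}

# Assumed translation of schema keys into pandas.read_csv keyword arguments
parser_args = {"nrows": schema["nrows"], "usecols": schema["use_columns"]}

df = pd.read_csv(io.StringIO(csv_text), **parser_args)
print(df.columns.tolist())
```

Keeping these arguments in a shared dictionary file means every team member loads the dataset the same way, without copying `read_csv` calls between scripts.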
![Page 9: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/9.jpg)
PySemantic Internals
● Infer and validate parser arguments from the schema using traits
● Dynamically change parser arguments based on the errors raised, if any
● Log everything
● Post loading a dataset, apply common preprocessing methods by default
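The idea of changing parser arguments in response to errors can be sketched with plain pandas and the standard `logging` module. This is not PySemantic's implementation (which, per the slide, relies on traits); `load_with_retries` and the sample data are hypothetical. The sketch assumes the only failure mode is a declared dtype that the data does not satisfy, and it relaxes just the offending dtypes before retrying.

```python
import io
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("loader_sketch")


def load_with_retries(make_buffer, dtypes):
    """Read a CSV, dropping declared dtypes that the data cannot satisfy."""
    dtypes = dict(dtypes)
    try:
        return pd.read_csv(make_buffer(), dtype=dtypes), dtypes
    except (ValueError, TypeError) as err:
        logger.info("Parsing failed (%s); probing columns one at a time", err)
    # Probe each column on its own to find the dtypes that cause errors
    for column in list(dtypes):
        try:
            pd.read_csv(make_buffer(), usecols=[column],
                        dtype={column: dtypes[column]})
        except (ValueError, TypeError):
            logger.info("Dropping declared dtype for column %r", column)
            del dtypes[column]
    # Retry with the reduced set of dtype arguments
    return pd.read_csv(make_buffer(), dtype=dtypes), dtypes


# col_a is declared as an integer column but contains a stray string
csv_text = "col_a,col_b\n1,x\n2,y\noops,z\n"
declared = {"col_a": "int64", "col_b": "object"}

df, used = load_with_retries(lambda: io.StringIO(csv_text), declared)
print(used)
```

Every adjustment is logged, so the record of what happened to the dataset on its way into memory stays available to the rest of the team.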
![Page 10: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/10.jpg)
Software Development Practices
● Fully test-driven
● Fully documented
● Pylint score > 9.0
![Page 11: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/11.jpg)
Limitations
● Only supports local files and MySQL tables (untested)
● Not as smart as MS Excel
● Architecture isn’t very clean - the main classes are somewhat confusing
![Page 12: Automatic Data Validation and Cleaning with PySemantic](https://reader036.vdocuments.us/reader036/viewer/2022062400/58a03a1e1a28ab5d2e8b63a9/html5/thumbnails/12.jpg)
Feedback, Issues, PRs Welcome!
http://github.com/jaidevd/pysemantic