SCRIBBLE ENRICH
ML Engineering Platform
Scribble Data - Accelerated ML Engineering
Feature Engineering - 80% of the ML lifecycle
Feature Engineering remains a critical
bottleneck in the ML lifecycle: going from a
data store to a feature matrix, rich with
numerous derived variables or features.
Data Science teams routinely spend up to
80% of their time on feature engineering
before they can build their ML models.
Invariably, this work is context sensitive, coupled closely to both the data and the business use cases that the
data science team is solving for. A common problem is that, without guardrails, feature engineering can be messy,
can discourage collaboration, and can be hard to untangle when the models and their output need to be examined or
debugged.
© Scribble Data 2019
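As a rough illustration of what going from a data store to a feature matrix involves, the sketch below derives a few per-customer features from raw transaction rows using only the Python standard library. The sample records, the field names, and the `build_feature_matrix` helper are hypothetical and are not part of Enrich.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw rows, as they might come out of a data store.
TRANSACTIONS = [
    {"customer": "c1", "amount": 20.0, "day": date(2019, 1, 5)},
    {"customer": "c1", "amount": 35.0, "day": date(2019, 1, 20)},
    {"customer": "c2", "amount": 12.5, "day": date(2019, 1, 7)},
]


def build_feature_matrix(rows, as_of):
    """Aggregate raw rows into one row of derived features per customer."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["customer"]].append(row)

    matrix = {}
    for customer, items in sorted(grouped.items()):
        amounts = [item["amount"] for item in items]
        last_day = max(item["day"] for item in items)
        matrix[customer] = {
            "txn_count": len(amounts),
            "total_spend": sum(amounts),
            "avg_spend": sum(amounts) / len(amounts),
            "days_since_last_txn": (as_of - last_day).days,
        }
    return matrix


if __name__ == "__main__":
    features = build_feature_matrix(TRANSACTIONS, as_of=date(2019, 2, 1))
    for customer, row in features.items():
        print(customer, row)
```

Even in this toy case, each derived column encodes a business decision (what counts as recent, what to aggregate), which is why the work is so closely tied to the use case.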
Enrich - Feature Engineering for ML
Enrich streamlines the most laborious parts of ML model training and productionization, and does so with high
auditability, reproducibility, and the highest per-core compute efficiency.
Scribble Enrich is an ML engineering platform
focused on Feature Engineering. It sits behind
customers’ firewalls, takes data from a lake or
other store, and turns it into features. It is built
for scalability, with numerous guardrails to help
data science teams accelerate their productivity,
whether it is in ML model training, model
deployment, or general purpose data enrichment.
Enrich - Architecture Schematic
Scribble is primarily a Python
shop today. The Enrich
platform sits atop customers’
data lakes, and provides
feature matrices to models or
dashboards.
The Enrich stack includes:
● Storage: S3, SQL DBs, Cassandra
● Frontend: Bootstrap, jQuery
● Backend: Django, REST API
● Pipelines: Pandas, Spark
● Hardware: Standard compute (x86, 16 cores, 64 GB)
Enrich - Components and Design Principles
The Enrich platform comprises a
number of different components,
each fit-for-purpose and thought
through in the context of the flow
of the feature engineering
discipline. They represent the four
principles we chose in our design
thinking.
● Quick time-to-market for each feature and model
● Trust (correctness and dependability)
● Flexibility
● Scalability
Enrich - Components
● Catalog: A lightweight data catalog to continuously document what is in the data store
● Health: A programmable health check monitor of data flowing into the data store
● Augment: Extend data by linking with third-party datasets
● Labeling: Generate labeled datasets or extend master for richer features
● Core: Versioned auditable feature computation pipelines
● Audit: Audit interface to understand lineage of every dataset
● Marketplace: Discover features being computed by the system (for status and reuse)
● Search: Filter and export datasets
● Monitor: Monitor model performance
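Two of the ideas in the component descriptions above, a programmable health check on incoming data and a versioned, auditable computation with recorded lineage, can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: `check_health`, `run_pipeline`, `fingerprint`, and the lineage record layout are hypothetical names invented here, not Enrich's actual interfaces.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 digest of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_health(rows, required_fields):
    """Return a list of problems found in the incoming rows."""
    problems = []
    for index, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) is None:
                problems.append(f"row {index}: missing {field}")
    return problems


def run_pipeline(rows, transform, version):
    """Run a transform and return its output with a lineage record."""
    problems = check_health(rows, required_fields=["customer", "amount"])
    if problems:
        raise ValueError("; ".join(problems))
    output = transform(rows)
    lineage = {
        "transform": transform.__name__,
        "version": version,
        "input_digest": fingerprint(rows),
        "output_digest": fingerprint(output),
    }
    return output, lineage


def total_spend(rows):
    """Example transform: total amount per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    rows = [
        {"customer": "c1", "amount": 20.0},
        {"customer": "c1", "amount": 35.0},
        {"customer": "c2", "amount": 12.5},
    ]
    output, lineage = run_pipeline(rows, total_spend, version="1.0.0")
    print(output)
    print(json.dumps(lineage, indent=2))
```

Because the digests depend only on the input rows and the output, rerunning the same transform version on the same data yields an identical lineage record, which is the property an audit trail needs.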
CONTACT US
DENVER: Littleton
BANGALORE: Indiranagar | HSR
Scribble Data - Accelerated ML Engineering