R meetup talk: Scaling data science with dgit
TRANSCRIPT
![Page 1: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/1.jpg)
Scaling Data Science with dgit
Dr. Venkata Pingali
Founder, Scribble Data
[email protected]
https://github.com/pingali
![Page 2: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/2.jpg)
Summary
1. Scaling impact of data science requires increasing trust and efficiency
a. Trust requires auditability and reproducibility of results
b. Efficiency requires standardization and automation
2. Dataset is a fundamental abstraction of data science
3. dgit enables git-like management of datasets
a. Python package, open source, MIT license
b. Familiar git interface with modifications
4. Call to collaborate
![Page 3: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/3.jpg)
dgit - 1 min summary
![Page 4: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/4.jpg)
dgit - git wrapper for datasets
1. Python package, MIT license
2. Application of git
3. Beyond git - “Understands” data
a. Metadata generation and management
b. Automatic scanning of working directory for changes
c. Automatic validation and materialization
d. Dependency tracking across repos
e. Automatic audit trails with execution
f. Pipeline support
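The automatic scanning of a working directory for changes can be illustrated with content hashes. This is a minimal sketch under stated assumptions, not dgit's implementation: the function names `snapshot` and `diff_snapshots` are hypothetical, and only the Python standard library is used.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's relative path to the SHA-256 digest of its content."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(old, new):
    """Classify paths as added, removed, or modified between two snapshots."""
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(k for k in common if old[k] != new[k]),
    }
```

Taking a snapshot before and after an analysis run and diffing the two gives the list of dataset files that would need to be committed, without the analyst having to name them by hand.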
![Page 5: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/5.jpg)
Growing Pains in Data Science
![Page 6: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/6.jpg)
Anonymized Random Slide from an Actual Presentation
Implication: Large wasted spend, poor production design, baseline worsening
![Page 7: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/7.jpg)
Decision-maker Questions
1. Where did the numbers come from? (Correctness, Lineage)
a. Assumption, models, datasets
2. Is this an accident? Does it hold now? (Reproducibility, Retargetability)
a. Model, dataset, and question revisions
3. Can you get the results faster? (Efficiency)
a. Time, effort, cost
4. Can you also analyze X? (Extensibility)
a. Different dataset, question
5. Could we try X? (Dataset generation - synthetic and real)
a. What if scenarios, field experiments
![Page 8: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/8.jpg)
Conceptual Process
[Diagram labels: Biz, Analytics Team, Data Engg; Qtns, Context; Data Req; Datasets; Model Results; Story Telling]
All three roles could be in a single team!
![Page 9: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/9.jpg)
Business Complexity is Discovered Over Time
Incomplete context (history, semantics)
Qtns not thought through
Continuous revisions
[Diagram labels: Biz, Analytics Team, Data Engg; Qtns, Context; Data Req; Datasets; Model Results; Story Telling]
![Page 10: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/10.jpg)
Imperfect Data Queries due to Limited Understanding
Dependencies not specified
Wrong filters
Known outliers
Narrow specification (cubes)
[Diagram labels: Biz, Analytics Team, Data Engg; Qtns, Context; Data Req; Datasets; Model Results; Story Telling]
![Page 11: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/11.jpg)
Weak process
Lack of protocol (email/files)
Missing validation checks
No lineage
No revisions
[Diagram labels: Biz, Analytics Team, Data Engg; Qtns, Context; Data Req; Datasets; Model Results; Story Telling]
![Page 12: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/12.jpg)
Eagerness to Present Great Narratives
Wrong input dataset
Mistakes in pipeline
Excel/adhoc transformations
Model evolution
Continuous revision of narratives
Missing interpretation integrity checks (e.g. other time windows)
Better methodology
[Diagram labels: Biz, Analytics Team, Data Engg; Qtns, Context; Data Req; Datasets; Model Results; Story Telling]
![Page 13: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/13.jpg)
Process in Reality
[Diagram labels: Biz, Analytics Team, Data Engg; Qtns, Context; Data Req; Datasets; Model Results; Story Telling]
Iterative
Expensive
Laborious
![Page 14: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/14.jpg)
Actual Process
[Diagram labels: Biz, Analytics Team, Data Engg; Qtns, Context; Data Req; Datasets; Model Results; Story Telling]
Iterative
Expensive
Laborious
http://fortune.com/2016/02/05/why-big-data-isnt-paying-off-for-companies-yet/
"80% of ..companies strategic decision go haywire.. “flawed” data"
![Page 15: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/15.jpg)
Desired State
1. Trusted
a. Every model should be auditable to the last record and step ⬅
b. Every model should be reproducible with zero human intervention ⬅
c. Enables use and development of mathematical judgment
2. Scalable
a. Highly automated through most of the lifecycle ⬅
b. Continuous reduction in costs ⬅
c. Grow sublinearly with questions, datasets, models
3. Robust
a. Younger, inexperienced staff ⬅
b. Weak processes
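One way to picture "auditable to the last record and step" is a pipeline runner that records a digest of every step's input and output. The sketch below is illustrative only and does not show how dgit builds its audit trails; `digest`, `run_step`, and the two example steps are hypothetical names, and the data is made up for the example.

```python
import hashlib
import json


def digest(obj):
    """Stable SHA-256 digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(name, func, data, trail):
    """Run one pipeline step and append an audit record to `trail`."""
    result = func(data)
    trail.append({"step": name, "input": digest(data), "output": digest(result)})
    return result


trail = []
rows = [{"x": 1, "y": 2}, {"x": 2, "y": None}, {"x": 3, "y": 6}]

# Two hypothetical steps: drop incomplete records, then rescale y.
clean = run_step(
    "drop_missing", lambda rs: [r for r in rs if r["y"] is not None], rows, trail
)
scaled = run_step(
    "scale_y", lambda rs: [{**r, "y": r["y"] * 10} for r in rs], clean, trail
)
```

Because each step's input digest equals the previous step's output digest, a reviewer can verify the chain end to end, and rerunning the same steps on the same data reproduces the same digests.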
![Page 16: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/16.jpg)
Process with Dataset Repository
[Diagram labels: Biz, Analytics Team, Data Engg, Server Side CI; Dataset Rules, Evaluation Rules; Dependencies, Materialized dataset; Materialize, Model Pipeline; Pipeline Execution; Slide Content, URN; Context, Questions; Evaluation, Interpretation; versions v1 to v6]
Dataset as mutable object with memory
No emails/Google Docs
Continuous validation by third party (server)
Separate model development and evaluation
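The idea of dataset rules checked continuously by a server can be sketched as a list of named predicates applied to every record. This is a hedged illustration, not dgit's rule format: `validate`, `RULES`, and the field names `customer_id` and `amount` are assumptions made for the example.

```python
def validate(rows, rules):
    """Return (rule_name, row_index) for every rule a row violates."""
    failures = []
    for index, row in enumerate(rows):
        for name, check in rules:
            if not check(row):
                failures.append((name, index))
    return failures


# Hypothetical dataset rules; a real deployment would load these from the repo.
RULES = [
    ("has_customer_id", lambda r: bool(r.get("customer_id"))),
    (
        "amount_non_negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]
```

A server-side hook could run `validate` on each new dataset version and reject the version when the returned list is non-empty, which keeps the check independent of whoever produced the data.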
![Page 17: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/17.jpg)
dgit
![Page 18: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/18.jpg)
Dgit Structure
[Diagram labels: dgit CLI; dgitcore API; Repo Mgr, Backend, Validator, Generator, Instrumentation, Metadata; Git, S3, MySQL, S3, Regression, Content, Platform, Basic]
![Page 19: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/19.jpg)
Demo Goals
1. Show end-to-end example (command line)
a. Simple regression
2. Explain structure
3. Advanced features
a. Validation (regression quality plugin)
b. Generator (SQL)
c. Pipeline (Dora)
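As a rough picture of what a regression quality check might do, the sketch below fits a simple least-squares line and compares R² against a threshold. It is not the dgit plugin shown in the demo; the function names and the default `min_r2=0.8` threshold are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def check_regression_quality(xs, ys, min_r2=0.8):
    """Fit a line and report whether R^2 meets the (assumed) threshold."""
    slope, intercept = fit_line(xs, ys)
    r2 = r_squared(xs, ys, slope, intercept)
    return {"slope": slope, "intercept": intercept, "r2": r2, "passed": r2 >= min_r2}
```

Running such a check automatically on every commit of a model's output dataset would flag a drop in fit quality before the result reaches a presentation.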
![Page 20: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/20.jpg)
Open Tasks
1. Dgit specific
a. Cleanup and stabilization
i. Python v2/3 compatibility
ii. Plugins to do various tasks (anonymization, hive etc)
b. Testing infrastructure
c. Integration
i. Windows and MacOS support
ii. Support for instabase/dat/other services
2. Ideas for new tools to reduce cost and complexity of data science
![Page 21: R meetup talk scaling data science with dgit](https://reader031.vdocuments.us/reader031/viewer/2022020410/589ee0f61a28ab39498b6e3d/html5/thumbnails/21.jpg)
Speaker
Dr. Venkata Pingali
Founder, Scribble Data
Former VP Analytics, FourthLion
IIT(B), PhD (USC)
http://linkedin.com/in/pingali