ground: managing metadata in the big data ecosystem
TRANSCRIPT
![Page 1: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/1.jpg)
Ground: Managing Metadata in the Big Data Ecosystem
Vikram Sreekanti
AMP Lab, U.C. Berkeley
![Page 2: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/2.jpg)
What is data?
• 20th Century Data: Accounting
  • “02/16: Sally withdrew $100 from checking.”
• 21st Century Data: Raw materials
  • Sally’s online purchase records…
  • ... and timeseries data from her FitBit
  • ... and popular films for various demographics
  • ... and weather forecast for the next 48 hours.
![Page 3: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/3.jpg)
What is Metadata?
• Data about data
• This used to be so simple!
![Page 4: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/4.jpg)
What is Metadata?
• Data about data
• This used to be so simple!
• But... schema on use
• One of many changes
![Page 5: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/5.jpg)
What is Metadata?
Analysis
Interoperability
Interpretation
Reproducibility
Governance & The Collective
![Page 6: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/6.jpg)
Analysis
![Page 7: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/7.jpg)
Case: Data Analysis
Wrangle
Visualize
Analyze
Data
Results
METAMNESIA
![Page 8: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/8.jpg)
—JIM GRAY
One of the things that my research advisor Mike Harrison taught me to do is to WRITE THINGS DOWN.
I’M IN THE FLOW.
WRITE THINGS DOWN.
TENSION
![Page 9: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/9.jpg)
You will never know your data better than when you are wrangling and analyzing it.
The flow state
![Page 10: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/10.jpg)
TAKE ACTION
Data Analytics Infrastructure team
“Write down what you can, we’ll fill in the rest.”
![Page 11: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/11.jpg)
Taking Action: Football
• Video data Annotations.
• Metadata from manual annotation
![Page 12: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/12.jpg)
Taking Action: Football
• Video data Annotations.
• Passive metadata: sensor streams
• NFL + MS = Cool.
![Page 13: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/13.jpg)
Taking Action: Football
• Video data Annotations.
• Passive metadata: sensor streams
• NFL + MS = Cool.
• Metadata + Simulation
• NFL + MS + EA = POV.
![Page 14: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/14.jpg)
Taking Action: Data Analysis
Capture what people do with data. Augment as appropriate. Interpolate as needed.
![Page 15: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/15.jpg)
Analysis
• tap the flow
• fill in the rest
metadata-on-use
Interoperability
![Page 16: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/16.jpg)
CASE: Data Debugging
![Page 17: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/17.jpg)
CASE: Data Debugging
![Page 18: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/18.jpg)
![Page 19: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/19.jpg)
Relationships
Master Data on Customers
Call detail from HDFS
Data Wrangling Script
Python
Numpy
Churn Analysis
Hypothesis
Wrangle
![Page 20: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/20.jpg)
Versioned Relationships
Python v2.7
Numpy v1.9.3
Wrangle v3.0
Master Data on Customers: MDM 10/11/15
Call detail from HDFS: v1.26
Data Wrangling Script: git hash 0x6987a68a9876b7
Churn Analysis: git hash 0x987667e876f033
Hypothesis
![Page 21: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/21.jpg)
Common ground?
• SW market exploding
• n² connections
• Need a shared place to Write it down, Link it up
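The n² point on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: it compares the number of point-to-point integrations among n tools with the number needed when every tool talks to a single shared metadata service. The function names are assumptions of this example, not part of Ground.

```python
def pairwise_links(n: int) -> int:
    """Integrations needed if each of n tools connects directly to every other tool."""
    return n * (n - 1) // 2  # grows quadratically in n


def hub_links(n: int) -> int:
    """Integrations needed if each of n tools connects only to one shared service."""
    return n  # grows linearly in n


for n in (5, 20, 100):
    print(n, pairwise_links(n), hub_links(n))
```

With 100 tools, the direct approach needs 4950 integrations and the shared-service approach needs 100.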
![Page 22: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/22.jpg)
Analysis
• tap the flow
• fill in the rest
metadata-on-use
Interoperability
• metadata as protocol
• general formats
standards-on-use
Interpretation
![Page 23: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/23.jpg)
CASE: Recommender System
• Consider a recommender system like Netflix
• Consists of data (user views & ratings, movie features) and a statistical model.
• For any piece of data: “Sally watched The Shining”
• This fact is much more meaningful with a model: the model makes the recommendations!
• The model is also no good without data: data is used to train & refine the model.
![Page 24: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/24.jpg)
CASE: Recommender System
• Any machine learning system has this coupling.
• Interpretation of the data depends on the model we choose.
• Models are parametrized by data.
• The meaning & value of data in any context is the coupling of model + data.
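A toy illustration of the first coupling: the same fact, "Sally watched The Shining", leads to different recommendations depending on which model interprets it. This is a minimal sketch with a three-film catalog and a hand-written matching rule standing in for a statistical model; the `CATALOG` structure and `recommend` function are assumptions of this example.

```python
# Tiny hand-built catalog; a real system would learn from views and ratings.
CATALOG = {
    "The Shining": {"genre": "horror", "director": "Stanley Kubrick"},
    "The Exorcist": {"genre": "horror", "director": "William Friedkin"},
    "2001: A Space Odyssey": {"genre": "sci-fi", "director": "Stanley Kubrick"},
}


def recommend(watched: str, feature: str) -> list:
    """Recommend other titles that share the chosen feature with the watched title."""
    target = CATALOG[watched][feature]
    return [title for title, features in CATALOG.items()
            if title != watched and features[feature] == target]


# Same datum, two "models": one keys on genre, the other on director.
print(recommend("The Shining", "genre"))
print(recommend("The Shining", "director"))
```

The genre rule suggests The Exorcist, while the director rule suggests 2001: A Space Odyssey, so the meaning of the viewing record depends on the model applied to it.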
![Page 25: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/25.jpg)
Analysis
• tap the flow
• fill in the rest
metadata-on-use
Interoperability
• metadata as protocol
• general formats
standards-on-use
Interpretation
• models interpret data
• data conditions models
(data + metadata)-on-use
Reproducibility
![Page 26: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/26.jpg)
Can metadata cure cancer?
![Page 27: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/27.jpg)
No.
![Page 28: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/28.jpg)
But it’s going to be useful.
![Page 29: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/29.jpg)
Case: Cancer Genomics
General population data (“1000 genomes”)
Compare
Clustering Algorithm
Patient Data
Put leukemia cells on slide
Robot puts chemistry on slides
Robot puts slide on gene sequencer
X 1000 patients
![Page 30: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/30.jpg)
Data Lineage
Back to tissue and bar codes on slides!
Logical & Physical
• Tissue
• Data (and metadata)
• Code
![Page 31: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/31.jpg)
It gets messier
General population data (“1000 genomes”)
Compare
Put leukemia cells on slide
Robot puts chemistry on slides
Robot puts slide on gene sequencer
X 1000 patients
Parameter Sweep
Parameters
Clustering Algorithm
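One way to picture lineage for a parameter sweep is a log that records every run, including the ones that fail. The sketch below is a hypothetical illustration rather than Ground's implementation: `RunRecord`, `sweep`, `toy_score`, and the input label `patients-batch-1` are all invented for this example, and `toy_score` stands in for a real clustering step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class RunRecord:
    """One lab-notebook entry: which parameters ran on which input, and what happened."""
    params: Dict[str, float]
    input_version: str
    succeeded: bool
    result: Optional[float] = None
    error: Optional[str] = None


def sweep(fn: Callable[..., float],
          grid: Iterable[Dict[str, float]],
          input_version: str) -> List[RunRecord]:
    """Run fn once per parameter setting and record successes and failures alike."""
    log: List[RunRecord] = []
    for params in grid:
        try:
            log.append(RunRecord(dict(params), input_version, True, result=fn(**params)))
        except Exception as exc:  # failures are part of the lineage too
            log.append(RunRecord(dict(params), input_version, False, error=repr(exc)))
    return log


def toy_score(k: float) -> float:
    """Stand-in for a clustering run; fails when k == 2."""
    return 1.0 / (k - 2)


log = sweep(toy_score, [{"k": 1}, {"k": 2}, {"k": 3}], input_version="patients-batch-1")
for record in log:
    print(record)
```

Keeping the failed run for `k = 2` in the log is the point: a later reader can see which settings were tried and why one of them produced no result.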
![Page 32: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/32.jpg)
Analysis
• tap the flow
• fill in the rest
metadata-on-use
Interoperability
• metadata as protocol
• general formats
standards-on-use
Interpretation
• models interpret data
• data conditions models
(data + metadata)-on-use
Reproducibility
• instrumentation
• lineage: success & failure
lab notebook-on-use
Governance & The Collective
![Page 33: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/33.jpg)
Back at the Enterprise
We’re talking Governance.
• And self-service for end users
![Page 34: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/34.jpg)
CASE: Jupyter Notebook
• An electronic lab notebook
• Evolution of IPython Notebook
• Writing it down since 2011
![Page 35: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/35.jpg)
Running a Class from Notebooks
Assignments are notebooks
• Students create versions
• Staff solution is a version
Grading
• Execute each notebook on some data
• Annotate the notebook with grades
• Update a grades spreadsheet
![Page 36: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/36.jpg)
Homework Governance
Skools ’n rools!
• Students can’t see each other’s HW
• Students can’t see solution
  • Unless they’ve turned in theirs and it’s after April 12 and they have a Berkeley login
• Graders can’t see student names
• Students can’t update grade spreadsheet
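The solution-visibility rule on this slide is a conjunction of three conditions, which can be written as a small predicate. This is a minimal sketch, assuming a hypothetical `Student` record and a release date passed in as a parameter. The slide gives only "April 12", so the year used below is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Student:
    has_submitted: bool
    has_berkeley_login: bool


def can_see_solution(student: Student, today: date, release: date) -> bool:
    """A student sees the solution only if all three conditions from the slide hold."""
    return (student.has_submitted
            and student.has_berkeley_login
            and today > release)  # "after April 12" is read as strictly after


RELEASE = date(2016, 4, 12)  # the year is an assumption; the slide gives only the day

print(can_see_solution(Student(True, True), date(2016, 4, 13), RELEASE))   # True
print(can_see_solution(Student(False, True), date(2016, 4, 13), RELEASE))  # False
print(can_see_solution(Student(True, True), date(2016, 4, 12), RELEASE))   # False
```

Even this one rule depends on submission state, identity, and time, which suggests how quickly governance logic accumulates metadata requirements.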
![Page 37: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/37.jpg)
Collective Intelligence
Rules should be a small part of school.
If we do things well…
• People get smarter
• Educational software gets smarter
• Organizations get smarter
Fueled by observing, learning, iterating.
Write things down, fill in later.
![Page 38: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/38.jpg)
Collective, Intelligent Governance
By the people. Emergent governance.
• Sandbox → Annotations → Awareness → Reuse → Debate → Consensus
For the people. Collective Intelligence emerges.
http://blogs.forrester.com/michele_goetz/15-09-24-are_data_preparation_tools_changing_data_governance
![Page 39: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/39.jpg)
Analysis
• tap the flow
• fill in the rest
metadata-on-use
Interoperability
• metadata as protocol
• general formats
standards-on-use
Interpretation
• models interpret data
• data conditions models
(data + metadata)-on-use
Reproducibility
• instrumentation
• lineage: success & failure
lab notebook-on-use
Governance & The Collective
• by & for the people
• collective intelligence
governance-on-use
![Page 40: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/40.jpg)
What we’re doing: Ground
• Focus our design on useful & interesting challenges for real problems
• Develop a general but expressive data model
• Don’t prescribe design principles; support as many as possible
![Page 41: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/41.jpg)
Data Model: Core
• “Thing”: basic logical object
• Immutable
• Every “Thing” has a version history.
Models
Usage
Versions
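One possible reading of the core layer is that a "Thing" never changes in place: each change appends a new immutable version that points back to its parent. The sketch below illustrates that reading only. The class names, fields, and `commit` method are assumptions of this example and are not taken from the Ground codebase.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Version:
    """An immutable snapshot of a Thing's metadata."""
    version_id: int
    parent_id: Optional[int]
    payload: Tuple[Tuple[str, str], ...]  # key/value pairs stored as an immutable tuple


class Thing:
    """A logical object whose state is only ever extended, never overwritten."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._versions: List[Version] = []

    def commit(self, **fields: str) -> Version:
        """Append a new version whose parent is the latest existing version."""
        parent = self._versions[-1].version_id if self._versions else None
        version = Version(len(self._versions), parent, tuple(sorted(fields.items())))
        self._versions.append(version)
        return version

    def history(self) -> Tuple[Version, ...]:
        """Return the full version history, oldest first."""
        return tuple(self._versions)


calls = Thing("call_detail")
v0 = calls.commit(location="hdfs")
v1 = calls.commit(location="hdfs", schema="v2")
print(calls.history())
```

Because `Version` is a frozen dataclass, attempting to assign to a field of an existing version raises an error, so earlier states stay available for later inspection.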
![Page 42: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/42.jpg)
Design Principle: Immutability & Versioning
• Recall versioning
• Reproducibility = time travel.
• Alternate histories: What-if?
Python v2.7
Numpy v1.9.3
Wrangle v3.0
Master Data on Customers: MDM 10/11/15
Call detail from HDFS: v1.26
Data Wrangling Script: git hash 0x6987a68a9876b7
Churn Analysis: git hash 0x987667e876f033
Hypothesis
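"Time travel" and "what-if" both fall out of keeping parent pointers between versions: walking the pointers backwards reconstructs how a state was reached, and a second child of an old version is an alternate history. A minimal sketch, with hypothetical version ids:

```python
from typing import Dict, List, Optional

# version id -> parent version id (None marks the root); all ids are hypothetical
parents: Dict[str, Optional[str]] = {
    "v1": None,
    "v2": "v1",
    "v3": "v2",
    "v2-whatif": "v1",  # an alternate history that branches from v1
}


def ancestry(version: str) -> List[str]:
    """Return the chain of versions from the root up to and including `version`."""
    chain: List[str] = []
    current: Optional[str] = version
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain[::-1]


print(ancestry("v3"))
print(ancestry("v2-whatif"))
```

Both chains start at `v1`, which is what lets the main line and the what-if branch be compared from a common starting point.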
![Page 43: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/43.jpg)
Data Model: Mantle
• Structures, Nodes, Edges, Graphs
• Subclasses of “Thing”
• Allows for modeling of data
Models
Usage
Versions
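The slide names four subclasses of "Thing" without giving their fields, so the sketch below fills them in with assumptions: an edge joins two nodes, a graph groups nodes and edges, and a structure lists expected attribute names and types. None of these field choices come from the slides or from the Ground codebase.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Thing:
    name: str


@dataclass(frozen=True)
class Node(Thing):
    pass


@dataclass(frozen=True)
class Edge(Thing):
    source: Node
    target: Node


@dataclass(frozen=True)
class Graph(Thing):
    nodes: Tuple[Node, ...]
    edges: Tuple[Edge, ...]


@dataclass(frozen=True)
class Structure(Thing):
    # assumed meaning: expected attribute names and their types
    attributes: Tuple[Tuple[str, str], ...]


customers = Node("master_customer_data")
script = Node("data_wrangling_script")
feeds = Edge("customers_to_script", customers, script)
pipeline = Graph("churn_pipeline", (customers, script), (feeds,))
print(pipeline)
```

Because every class here is a "Thing", a generic versioning scheme written once for "Thing" would apply to nodes, edges, graphs, and structures alike.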
![Page 44: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/44.jpg)
Data Model: Crust
• Lineage relationships between “Thing”s
Models
Usage
Versions
![Page 45: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/45.jpg)
Design Principle: Lineage
• Lineage is fundamental to any metadata system
• Versions track the state of things (“who”, “what”, and “when”)
• Lineage captures causes & influences (“how”)
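The split between versions and lineage can be sketched as two record types: a version carries who, what, and when, while a lineage edge between two versions carries how one influenced the other. The record layouts, author names, dates, and the "Wrangled call table" output below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class VersionInfo:
    what: str
    who: str
    when: datetime


@dataclass(frozen=True)
class LineageEdge:
    source: VersionInfo
    target: VersionInfo
    how: str


def upstream(target: VersionInfo, edges: List[LineageEdge]) -> List[VersionInfo]:
    """Return the versions that directly influenced `target`."""
    return [edge.source for edge in edges if edge.target == target]


calls = VersionInfo("Call detail from HDFS v1.26", "etl-job", datetime(2015, 10, 1))
script = VersionInfo("Data Wrangling Script", "sally", datetime(2015, 10, 2))
table = VersionInfo("Wrangled call table", "sally", datetime(2015, 10, 3))

edges = [
    LineageEdge(calls, table, how="read as input"),
    LineageEdge(script, table, how="executed to produce the output"),
]

print([version.what for version in upstream(table, edges)])
```

This lookup returns only direct influences; following the edges repeatedly would trace the output back to all of its sources.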
![Page 46: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/46.jpg)
Design Principle: Postel’s Law
• “Be conservative in what you do, be liberal in what you accept from others.”
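Applied to a metadata service, Postel's Law suggests tolerating messy input while emitting one strict, predictable format. The sketch below is a hypothetical illustration: the `accept` and `emit` functions and the two-field record are assumptions of this example, not Ground's interface.

```python
import json
import re
from typing import Any, Dict


def accept(raw: Dict[str, Any]) -> Dict[str, str]:
    """Liberal input: tolerate key case, stray whitespace, a leading 'v', and extra keys."""
    cleaned = {str(key).strip().lower(): str(value).strip() for key, value in raw.items()}
    if not cleaned.get("name"):
        raise ValueError("metadata needs at least a name")
    version = cleaned.get("version", "unknown")
    match = re.fullmatch(r"[vV]?(\d+(?:\.\d+)*)", version)
    if match:
        version = match.group(1)  # normalize "v1.9.3" to "1.9.3"
    return {"name": cleaned["name"], "version": version}


def emit(record: Dict[str, str]) -> str:
    """Conservative output: exactly two known keys, serialized in a fixed order."""
    return json.dumps({"name": record["name"], "version": record["version"]}, sort_keys=True)


print(emit(accept({" Name ": "Numpy", "VERSION": "v1.9.3", "color": "blue"})))
```

The unknown `color` key is accepted without complaint but is not echoed back, so downstream consumers only ever see the canonical fields.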
![Page 47: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/47.jpg)
Design Principle: Neutrality
• Open-source & vendor-neutral
• Related to Postel’s Law: Be as diverse as possible while still being useful.
![Page 48: Ground: Managing Metadata in the Big Data Ecosystem](https://reader035.vdocuments.us/reader035/viewer/2022070601/587ef8b81a28ab35528b58e5/html5/thumbnails/48.jpg)
Check out what we’ve done so far: https://github.com/ground-metadata/ground
Reach out if you’re interested: @viksree
Most slides were taken from Joe Hellerstein’s Strata NYC 2015 Talk: “Time to go Meta (on use)”.