Trio: A System for Data, Uncertainty, and Lineage
![Page 1: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/1.jpg)
Trio: A System for Data, Uncertainty, and Lineage
Search “stanford trio” http://i.stanford.edu/trio
![Page 2: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/2.jpg)
People
Current
• Jennifer Widom (faculty)
• Omar Benjelloun (post-doc)
• Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD)
• Michi Mutsuzaki (MS)
• Tomoe Sugihara (visitor)
Incoming
• Martin Theobald (post-doc)
• Raghu Murthy (MS)
• Ander de Keijzer (visitor)
Alums
• Alon Halevy, Ashok Chandra (visitors)
• Chris Hayworth (MS)
![Page 3: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/3.jpg)
Why Uncertainty + Lineage?
Many applications seem to need both
From a technical standpoint, it turns out that lineage...
1. Enables simple and consistent representation of uncertain data
2. Correlates uncertainty in query results with uncertainty in the input data
3. Can make computation over uncertain data more efficient
![Page 4: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/4.jpg)
Trio Components
1. Data Model
   ULDBs (Uncertainty-Lineage Databases): Simple extension to relational model
2. Query Language
   TriQL: Simple extension to SQL, well-defined semantics and intuitive behavior
3. System
   Version 1: Complete system and GUI built on top of conventional DBMS
![Page 5: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/5.jpg)
Running Example: Crime-Solving
Saw(witness,car) // may be uncertain
Drives(person,car) // may be uncertain
Suspects(person) = πperson(Saw ⋈ Drives)
![Page 6: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/6.jpg)
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences
![Page 7: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/7.jpg)
Our Model for Uncertainty
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
=
witness  car
Amy      { Honda, Toyota, Mazda }
Three possible instances
![Page 8: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/8.jpg)
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
(Betty, Acura) ?
Six possible instances
![Page 9: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/9.jpg)
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty
Saw (witness,car)
(Amy, Honda): 0.5 ∥ (Amy, Toyota): 0.3 ∥ (Amy, Mazda): 0.2
(Betty, Acura): 0.6 ?
Six possible instances, each with a probability
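The six instances and their probabilities can be enumerated by brute force. The sketch below is only an illustration, not Trio code: it assumes that different tuples are independent and that a ‘?’ tuple is absent with whatever probability its confidences leave over. The names `choices` and `possible_instances` are made up for this example.

```python
from itertools import product

# Each tuple is a list of (alternative, confidence) pairs, copied from the slide.
saw = [
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)],
    [(("Betty", "Acura"), 0.6)],
]

def choices(xtuple):
    """Return the (alternative or None, probability) options for one tuple.

    None means "tuple absent". Assumption: the absence probability is the
    mass left over after summing the alternatives' confidences.
    """
    total = sum(conf for _, conf in xtuple)
    options = list(xtuple)
    if total < 1.0 - 1e-9:
        options.append((None, 1.0 - total))
    return options

def possible_instances(relation):
    """Enumerate (instance, probability) pairs, assuming independent tuples."""
    result = []
    for combo in product(*(choices(x) for x in relation)):
        instance = frozenset(alt for alt, _ in combo if alt is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        result.append((instance, prob))
    return result

instances = possible_instances(saw)
for inst, prob in instances:
    print(sorted(inst), round(prob, 2))
```

Amy's tuple contributes three choices and Betty's contributes two (present or absent), which is where the count of six comes from.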
![Page 10: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/10.jpg)
Models for Uncertainty
• Our model (so far) is not especially new
• We spent some time exploring the space of models for uncertainty [ICDE 06, journal]
• Tension between understandability and expressiveness
– Our model is understandable
– But it is not complete, or even closed under common operations
![Page 11: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/11.jpg)
Our Model is Not Closed
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)
Drives (person,car)
(Jimmy, Toyota) ∥ (Jimmy, Mazda)
(Billy, Honda) ∥ (Frank, Honda)
(Hank, Honda)
Suspects = πperson(Saw ⋈ Drives)
Suspects
Jimmy ?
Billy ∥ Frank ?
Hank ?
Does not, and CANNOT, correctly capture possible instances in the result
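The closure failure can be checked by brute force. The sketch below is an illustration under stated assumptions, not Trio code: it evaluates the query in every possible instance of Saw and Drives, then compares that with what a Suspects table without lineage would allow, assuming each of its three tuples carries a ‘?’. All function and variable names are hypothetical.

```python
from itertools import product

# Base data from the slide: each tuple lists its mutually exclusive alternatives.
saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
drives = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Hank", "Honda")],
]

def worlds(relation):
    """Every possible instance: pick exactly one alternative per tuple."""
    return [set(combo) for combo in product(*relation)]

def suspects(saw_inst, drives_inst):
    """Suspects = project person from (Saw join Drives on car)."""
    seen_cars = {car for _, car in saw_inst}
    return frozenset(p for p, car in drives_inst if car in seen_cars)

# The true result instances: run the query in each possible world.
true_instances = {suspects(s, d) for s in worlds(saw) for d in worlds(drives)}

# Result without lineage: Jimmy ?, Billy ∥ Frank ?, Hank ?
# (None stands for "tuple absent"; the '?' marks are an assumption).
naive = [["Jimmy", None], ["Billy", "Frank", None], ["Hank", None]]
naive_instances = {
    frozenset(p for p in combo if p is not None) for combo in product(*naive)
}

# Instances the lineage-free table allows but the query can never produce.
spurious = naive_instances - true_instances
```

For example, `{"Jimmy", "Hank"}` lands in `spurious`: Jimmy requires Cathy to have seen a Mazda, while Hank requires a Honda.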
![Page 12: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/12.jpg)
Lineage to the Rescue
Lineage
• Captures “where data came from”
• In Trio: A function λ from alternatives to other alternatives (or external sources)
![Page 13: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/13.jpg)
Example with Lineage
ID  Saw (witness,car)
11  (Cathy, Honda) ∥ (Cathy, Mazda)

ID  Drives (person,car)
21  (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22  (Billy, Honda) ∥ (Frank, Honda)
23  (Hank, Honda)

Suspects = πperson(Saw ⋈ Drives)

ID  Suspects
31  Jimmy ?
32  Billy ∥ Frank ?
33  Hank ?

λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23

Correctly captures possible instances in the result
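To see why the lineage makes the result correct, one can derive the Suspects instances from λ alone. The sketch below is a hypothetical illustration, not Trio's implementation: it picks one alternative for every base tuple and keeps a result alternative only when all of its lineage was picked. The plain “23” in λ(33) is written as (23, 1) here.

```python
from itertools import product

# Number of alternatives of each base tuple, keyed by ID (from the slide).
base_alternatives = {11: 2, 21: 2, 22: 2, 23: 1}

# Lineage from the slide: result alternative -> set of (tuple ID, alternative index).
lineage = {
    (31, 1): {(11, 2), (21, 2)},  # Jimmy
    (32, 1): {(11, 1), (22, 1)},  # Billy
    (32, 2): {(11, 1), (22, 2)},  # Frank
    (33, 1): {(11, 1), (23, 1)},  # Hank
}
person = {(31, 1): "Jimmy", (32, 1): "Billy", (32, 2): "Frank", (33, 1): "Hank"}

def result_instances():
    """Enumerate the Suspects instances implied by the lineage."""
    ids = sorted(base_alternatives)
    instances = set()
    for picks in product(*(range(1, base_alternatives[i] + 1) for i in ids)):
        chosen = set(zip(ids, picks))
        # A result alternative is present iff all of its lineage was chosen.
        present = frozenset(
            person[alt] for alt, lam in lineage.items() if lam <= chosen
        )
        instances.add(present)
    return instances

instances = result_instances()
```

Jimmy and Hank depend on different alternatives of tuple 11, so no choice of base alternatives can produce both.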
![Page 14: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/14.jpg)
Uncertainty-Lineage Databases (ULDBs)
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences
4. Lineage
ULDBs are closed and complete [VLDB 06]
![Page 15: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/15.jpg)
ULDBs: Lineage
• Conjunctive lineage sufficient for most operations
• Duplicate-elimination: Disjunctive lineage
• Difference: Negative lineage
• General case after multiple operations/queries: Boolean formula
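One way to picture the general case is as a small Boolean formula over base alternatives. The sketch below is an assumed encoding for illustration only, not Trio's internal representation: formulas are nested tuples tagged "and", "or", or "not", and leaves are (tuple ID, alternative index) pairs. The three example formulas are hypothetical.

```python
def holds(formula, chosen):
    """Evaluate a lineage formula against a set of chosen base alternatives."""
    if formula[0] == "and":
        return all(holds(f, chosen) for f in formula[1:])
    if formula[0] == "or":
        return any(holds(f, chosen) for f in formula[1:])
    if formula[0] == "not":
        return not holds(formula[1], chosen)
    return formula in chosen  # leaf: (tuple ID, alternative index)

# Hypothetical lineage for the three cases listed on the slide.
join_lineage = ("and", (11, 1), (22, 1))                    # conjunctive
dedup_lineage = ("or", ("and", (11, 1), (22, 1)),
                       ("and", (11, 1), (23, 1)))           # disjunctive
difference_lineage = ("and", (11, 1), ("not", (23, 1)))     # negative

# One possible world: the alternative picked for each base tuple.
chosen = {(11, 1), (21, 1), (22, 2), (23, 1)}
results = [holds(f, chosen)
           for f in (join_lineage, dedup_lineage, difference_lineage)]
```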
![Page 16: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/16.jpg)
ULDBs: Interesting Questions
• Data-minimality: extraneous alternatives, extraneous “?”
• Lineage-minimality: harder
• Membership: tuple and table, some-instance and all-instances
• Coexistence: multiple tuples
• Extraction: remove tables, retain possible-instances
![Page 17: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/17.jpg)
Example: Extraneous Data
(Diane, Mazda) ∥ (Diane, Acura)
(Diane, Mazda) ?
(Diane, Acura) ?
Diane ? (extraneous)
![Page 18: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/18.jpg)
Example: Coexistence
(Diane, Mazda) ∥ (Diane, Acura)
(Diane, Mazda) ?
(Diane, Acura) ?
Mazda ?
Acura ?
Can’t coexist
![Page 19: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/19.jpg)
Querying ULDBs: Semantics
Query Q on ULDB D
D → possible instances → D1, D2, …, Dn
D1, D2, …, Dn → Q on each instance → Q(D1), Q(D2), …, Q(Dn)
D → implementation of Q (operational semantics) → D’ (D + Result)
D’ is a representation of instances Q(D1), Q(D2), …, Q(Dn)
![Page 20: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/20.jpg)
Querying ULDBs: TriQL
Basic TriQL: SQL with new semantics
• Obeys commutative diagram for uncertain data
• Tracks lineage
• Query results: new table or on-the-fly
Implemented TriQL: also built-in predicates conf(), lineage(), lineage*()
![Page 21: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/21.jpg)
Additional TriQL Constructs
[Language manual on web site]
• “Horizontal subqueries”: Refer to tuple alternatives as a relation
• Unmerged (horizontal duplicates)
• Flatten, GroupAlts
• NoLineage, NoConf, NoMaybe
• Query-specified confidences [done]
• Data modification statements
![Page 22: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/22.jpg)
Confidence Computation
• Confidences computed on-demand based on lineage
– Confidence of alternative A is function of confidences in λ*(A)
– Permits any query plan for data computation
• Default probabilistic interpretation, but queries can override
SELECT person, min(conf(Saw),conf(Drives)) as conf
FROM Saw, Drives
WHERE Saw.car = Drives.car
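A minimal sketch of on-demand confidence computation follows. It is an illustration under stated assumptions, not the Trio algorithm: lineage is purely conjunctive, the base alternatives reached through λ* come from distinct, independent tuples, and the default interpretation multiplies their confidences. The IDs reuse the lineage example; the confidence values and the derived alternative (41, 1) are invented.

```python
from math import prod

# Hypothetical confidences of base alternatives: (tuple ID, alternative index).
base_conf = {
    (11, 1): 0.6, (11, 2): 0.4,
    (21, 2): 0.5,
    (22, 1): 0.7, (22, 2): 0.3,
    (23, 1): 1.0,
}

# Lineage: derived alternative -> alternatives it was derived from.
lineage = {
    (31, 1): {(11, 2), (21, 2)},
    (32, 1): {(11, 1), (22, 1)},
    (41, 1): {(32, 1)},  # derived from a derived alternative (made up)
}

def lineage_star(alt):
    """Transitive closure of lineage, followed down to base alternatives."""
    if alt not in lineage:  # no recorded lineage: treat as base data
        return {alt}
    result = set()
    for parent in lineage[alt]:
        result |= lineage_star(parent)
    return result

def conf_product(alt):
    """Probabilistic default: product over the base alternatives in lambda*."""
    return prod(base_conf[b] for b in lineage_star(alt))

def conf_min(alt):
    """A query-specified override in the spirit of min(conf(Saw), conf(Drives))."""
    return min(base_conf[b] for b in lineage_star(alt))
```

Because the confidence depends only on the base alternatives in λ*(A), it can be computed after the data result exists, independently of the plan that produced it. Collecting λ* as a set keeps a shared base alternative from being counted twice.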
![Page 23: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/23.jpg)
Trio System: Version 1
Command-line client, TrioExplorer (GUI client)
Trio API and translator (Python)
• DDL commands
• TriQL queries
• Schema browsing
• Table browsing
• Explore lineage
• On-demand confidence computation
Standard SQL
Standard relational DBMS
Trio Metadata
• Table types
• Schema-level lineage structure
Trio Stored Procedures
• conf()
• lineage() “==>”
• lineage*() “==>>”
Encoded Data Tables
• “Verticalize”
• Shared IDs for alternatives
• Columns for confidence, “?”
Lineage Tables
• One per result table
• Uses unique IDs
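The encoded-table notes (“verticalize”, shared IDs for alternatives, columns for confidence and “?”) suggest a layout along the following lines. This is only a guess at the idea, using Python's `sqlite3` module as a stand-in for the conventional DBMS; the table name `saw_enc` and the columns `aid`, `xid`, `conf`, and `maybe` are hypothetical and not taken from Trio.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per alternative; alternatives of the same tuple share xid.
    CREATE TABLE saw_enc (
        aid     INTEGER PRIMARY KEY,  -- unique alternative ID (assumed)
        xid     INTEGER NOT NULL,     -- ID shared by a tuple's alternatives
        witness TEXT,
        car     TEXT,
        conf    REAL,                 -- confidence column
        maybe   INTEGER               -- 1 if the tuple carries '?'
    );
    INSERT INTO saw_enc VALUES
        (1, 1, 'Amy',   'Honda',  0.5, 0),
        (2, 1, 'Amy',   'Toyota', 0.3, 0),
        (3, 1, 'Amy',   'Mazda',  0.2, 0),
        (4, 2, 'Betty', 'Acura',  0.6, 1);
""")

# Standard SQL over the encoding: alternatives per tuple and its '?' flag.
rows = conn.execute(
    "SELECT xid, COUNT(*), MAX(maybe) FROM saw_enc GROUP BY xid ORDER BY xid"
).fetchall()
```

With a layout like this, a translator can answer questions about uncertain tuples using ordinary SQL, which is the point of building on a conventional DBMS.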
![Page 24: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/24.jpg)
Current & Future Topics
Algorithms: confidence computation, coexistence, extraneous data
• Minimize lineage traversal
• Memoization
• Batch operations
System
• Full query language
• More internal processing ?
– Storage and indexing
– Statistics and query optimization
![Page 25: Trio: A System for Data, Uncertainty, and Lineage](https://reader036.vdocuments.us/reader036/viewer/2022062517/56813d0f550346895da6c68a/html5/thumbnails/25.jpg)
Current & Future Topics
• Top-K by confidence
• Extend basic uncertainty model
– Incomplete relations
– Continuous uncertainty
– Correlated uncertainty ?
• External lineage, update lineage, versioning