cs 744: big data systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - real-time...
TRANSCRIPT
![Page 1: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/1.jpg)
CS 744: Big Data Systems
Shivaram Venkataraman Fall 2018
![Page 2: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/2.jpg)
ADMINISTRIVIA
- Assignment 2 grades: Tonight - Midterm review session on Nov 2 at 5pm at 1221 CS - Course Project Proposal feedback
![Page 3: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/3.jpg)
EFFICIENT SQL ON MODERN HARDWARE
![Page 4: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/4.jpg)
MOTIVATION
Query Model - Need to handle diverse queries - Real-time streaming, temporal queries on logs, progressive queries etc.
Language Integration
- Support for High-level Language (HLL) - SQL Library
Performance
![Page 5: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/5.jpg)
approach
1. Temporal Logical Data Model 2. DAG of operators (Volcano, Spark, DryadLINQ etc.) 3. Performance
i. Data batching ii. Columnar processing iii. Code Generation iv. Efficient Aggregation
![Page 6: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/6.jpg)
ARCHITECHTURE
![Page 7: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/7.jpg)
DATA MODEL, QUERY
LINQ style queries Similar to SparkSQL, DryadLINQ Includes timestamp by default Support for windowing, aggregation
varstr=Network.ToStream(e=>e.ClickTime,Latency(10secs));varquery=str.Where(e=>e.UserId%100<5).Select(e=>{e.AdId}).GroupApply(e=>e.AdId,s=>s.Window(5min).Aggregate(w=>w.Count()));
![Page 8: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/8.jpg)
DATA BATCHING
Why is batching important ? Vectorized operations, better throughput
Implementing batching
Group a set of events together, each having sync time Aadaptively choose batch size Insert punctuation to enforce batch gets flushed
Example: Punctuation every 5min, batch contains 500 tuples Throughput is 1000 tuples/sec à 600 batches each punctuation
![Page 9: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/9.jpg)
COLUMNAR PROCESSING: LAYOUT
- Separate into control, payload fields - BitVector to indicate absence - Each of these has columnar layout
- Payload generated from user struct
classDataBatch{long[]SyncTime;long[]OtherTime;BitvectorBV;}classUserData_Gen:DataBatch{long[]col_ClickTime;long[]col_UserId;long[]col_AdId;}
![Page 10: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/10.jpg)
COLUMNAR PROCESSING: OPERATORS
Operators à nodes in query DAG Chain operators together with On() Tight-loop from code-gen Further optimizations:
Copy-on-write, Zero-copy pointer-swing
voidOn(UserData_Genbatch){batch.BV.MakeWritable();for(inti=0;i<batch.Count;i++)if((batch.BV[i]==0)&&!(batch.col_UserId[i]%100<5))batch.BitVector[i]=1;nextOperator.On(batch);}
![Page 11: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/11.jpg)
COLUMNAR PROCESSING: OTHER
Serialization - Store data in column batches - Code generation of serialization/deserialization
String Handling
- Bloated string representation in Java/C# - Encode multiple strings into MultiString - stringsplit, substring – operate directly on MultiString
![Page 12: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/12.jpg)
GROUPED AGGREGATION
Temporal Data Model - Each event belongs to a data window or interval - Aggregates can be stateless or stateful (more in next 3 lectures)
other_time - When other_time > sync_time, represents interval - When other_time is infinity, start at sync_time - When other_time < sync_time, end at sync_time
![Page 13: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/13.jpg)
GROUPED AGGREGATION
API for user-defined aggregation functions Efficient implementation using three data structures Example for count:
InitialState:()=>0LAccumulate:(oldCount,timestamp,input)=>oldCount+1Deaccumulate:(oldCount,timestamp,input)=>oldCount-1Difference:(leftCount,rightCount)=>leftCount-rightCountComputeResult:count=>count
![Page 14: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/14.jpg)
MAP-REDUCE on MULTI-CORE
![Page 15: CS 744: Big Data Systemspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-trill.pdf · - Real-time streaming, temporal queries on logs, progressive queries etc. Language Integration](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fdc88fcfd24a1741244b8f4/html5/thumbnails/15.jpg)
SUMMARY
Flexible SQL library to handle workload patterns Integration with high-level language Efficient execution through
- Batching - Columnar processing - Code generation