brigham young university february 17, 2011 matt tolton [email protected] dremel: interactive...
TRANSCRIPT
![Page 1: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/1.jpg)
Brigham Young University
February 17, 2011
Matt [email protected]
Dremel:Interactive Analysis of Web-Scale Datasets
![Page 2: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/2.jpg)
Data mgmt at Googlesample from http://research.google.com/pubs/papers.html
• Classics– GFS [SOSP’03]: distributed file system
– Bigtable [OSDI’06]: key/value store
– MapReduce [OSDI’04]: batch-oriented processing
– Sawzall [Sci.Pr.’05]: PL on top of MR
• Recent– Pregel [SIGMOD’10]: Graph mining
– FlumeJava [PLDI’10]: Java library for data-parallel pipelines
– Megastore [to appear in CIDR’11]• Transactions, consistent repl., secondary idx on top of BT
2
![Page 3: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/3.jpg)
Large-scale data analysis• Data analysis is lifeblood of many companies• Parallel database systems
– Not designed for extreme scale• MapReduce [Dean, Ghemawat ’04]
– Fault-tolerant data processing• Can be used to execute queries!
– Not designed for low latency (coding, batch jobs)• SQL et al layers on MR (e.g., Hive, Pig, Sawzall)
• DBMS/MR hybrids (e.g., Aster, Cloudera, HadoopDB)
3
![Page 4: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/4.jpg)
• What does it take to run queries on 1,000,000,000,000 records in seconds?
• Dremel: data analysis tool that uses speed instead of raw power
• Based on VLDB’10 paper– Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey
Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Interactive speed
4
Brand of power tools that primarily rely on their speed as opposed to torque
![Page 5: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/5.jpg)
Key features• Interactive speed at very large scale
– Multi-TB datasets, 1000s of nodes– Fault and straggler tolerant execution– Columnar storage and processing– Tree architecture (as in web search)
• Nested data model w/ SQL-like language– 1000s of fields, deeply nested– Normalization is prohibitive
• Interoperates with Google's data mgmt tools– In situ data access (e.g., GFS, Bigtable)
5
![Page 6: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/6.jpg)
MR and Dremel execution
Execution time (sec) on 3000 nodes
6
87 TB 0.5 TB 0.5 TB
MR overheads: launch jobs, schedule 0.5M tasks, assemble records
Processing an 85 billion record table
![Page 7: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/7.jpg)
Makes engineers more productive
7
Runs a MapReduce to extractbillions of signals from web pages
Googler Alice
DEFINE TABLE t AS /path/to/data/*SELECT TOP(signal, 100), COUNT(*) FROM t. . .
Feed into another MR pipeline or serving system
1
2
3
What’s the best way to use new signals?
![Page 8: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/8.jpg)
Widely used inside Google• Analysis of crawled web
documents• Tracking install data for
applications on Android Market• Crash reporting for Google
products• OCR results from Google
Books• Spam analysis• Debugging of map tiles on
Google Maps
• Tablet migrations in managed Bigtable instances
• Results of tests run on Google's distributed build system
• Disk I/O statistics for hundreds of thousands of disks
• Resource monitoring for jobs run in Google's data centers
• Symbols and dependencies in Google's codebase
8
10s/1000s-node instances in several data centers
![Page 9: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/9.jpg)
9
![Page 10: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/10.jpg)
NESTED COLUMNAR STORAGE(COLUMNIO)
10
![Page 11: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/11.jpg)
Records vs. columnsA
B
C D
E*
*
*
. . .
. . .r1
r2 r1
r2
r1
r2
r1
r2
11
Challenge: preserve structure, reconstruct from a subset of fields
Read less,cheaperdecompression
![Page 12: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/12.jpg)
Nested data model
message Document { required int64 DocId; [1,1] optional group Links { repeated int64 Backward; [0,*] repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; [0,1] } optional string Url; }}
DocId: 10Links Forward: 20 Forward: 40 Forward: 60Name Language Code: 'en-us' Country: 'us' Language Code: 'en' Url: 'http://A'Name Url: 'http://B'Name Language Code: 'en-gb' Country: 'gb'
r1
DocId: 20Links Backward: 10 Backward: 30 Forward: 80Name Url: 'http://C'
r2
12
http://code.google.com/apis/protocolbuffers
multiplicity:
![Page 13: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/13.jpg)
ColumnIO representation
value r d
10 0 0
20 0 0
DocId
value r d
http://A 0 2
http://B 1 2
NULL 1 1
http://C 0 2
Name.Url
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL 0 1
Name.Language.Code Name.Language.Country
Links.BackwardLinks.Forward
value r d
us 0 3
NULL 2 2
NULL 1 1
gb 1 3
NULL 0 1
value r d
20 0 2
40 1 2
60 1 2
80 0 2
value r d
NULL 0 1
10 0 2
30 1 2
13
![Page 14: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/14.jpg)
Repetition and definition levels
DocId: 10Links Forward: 20 Forward: 40 Forward: 60Name Language Code: 'en-us' Country: 'us' Language Code: 'en' Url: 'http://A'Name Url: 'http://B'Name Language Code: 'en-gb' Country: 'gb'
r1
DocId: 20Links Backward: 10 Backward: 30 Forward: 80Name Url: 'http://C'
r2
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL 0 1
Name.Language.Code
r: At what repeated field in the field’s path the value has repeated
d: How many fields that could be undefined (opt. or rep.) are actually present
14
record (r=0) has repeatedr=2r=1
Language (r=2) has repeated
(non-repeating)
no value: Name (r=1) has repeated,
Name (d=1) is defined
no value: record (r=0) has repeated,
Name is defined (d=1)
![Page 15: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/15.jpg)
Record assembly FSM
Name.Language.CountryName.Language.Code
Links.Backward Links.Forward
Name.Url
DocId
1
0
1
0
0,1,2
2
0,11
0
0
15
For record-oriented data processing (e.g., MapReduce)
Transitionslabeled withrepetition levels
![Page 16: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/16.jpg)
Reading two fields
DocId
Name.Language.Country1,2
0
0
DocId: 10Name Language Country: 'us' LanguageNameName Language Country: 'gb'
DocId: 20Name
s1
s2
16
• Structure of parent fields is preserved.• Useful for queries like /Name[3]/Language[1]/Country• Selective assembly (vs. full assembly in XMill [Liefke, Suciu '00])
Both Dremel andMapReduce can read same columnar data
![Page 17: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/17.jpg)
Selective assembly
columnsrecords
objects
from
rec
ords
from
col
umns
(a) read + decompress
(b) assemble records
(c) parse as C++ objects
(d) read + decompress
(e) parse as C++ objects
time (sec)
number of fields
"cold" time on local disk,averaged over 30 runs
Tablet: 375 MB (compressed), 300K rows, 125 columns
17
2-4x overhead ofusing records
10x speedupusing columnarstorage
![Page 18: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/18.jpg)
18
![Page 19: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/19.jpg)
HIERARCHICAL QUERY PROCESSING
19
![Page 20: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/20.jpg)
Query processing architecture• Optimized for select-project-aggregate
– Very common class of interactive queries– Single scan– Within-record and cross-record aggregation
• Unit of storage: tablet– Self-contained horizontal partition of a table
• Unit of execution: slot– Thread on a server– E.g., 3K servers × 8 threads = 24K slots
20
Schema Metadata Data
keys, order, ranges, … C1 … Cn
![Page 21: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/21.jpg)
Serving tree
storage layer (e.g., GFS)
. . .
. . .. . .leaf servers
(with local storage)
intermediateservers
root server
client
21
• Parallelizes scheduling and aggregation
• Fault tolerance
• Stragglers
• Designed for "small" results (<1M records)
[Dean WSDM'09]
![Page 22: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/22.jpg)
Example: count()
SELECT A, COUNT(B) FROM T GROUP BY AT = {/gfs/1, /gfs/2, …, /gfs/100000}
SELECT A, SUM(c)FROM (R11 UNION ALL R110)GROUP BY A
SELECT A, COUNT(B) AS cFROM T11 GROUP BY AT11 = {/gfs/1, …, /gfs/10000}
SELECT A, COUNT(B) AS cFROM T12 GROUP BY AT12 = {/gfs/10001, …, /gfs/20000}
SELECT A, COUNT(B) AS cFROM T31 GROUP BY AT31 = {/gfs/1}
. . .
0
1
3
22
R11 R12
Data access ops
. . .
. . .
![Page 23: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/23.jpg)
Impact of serving tree depthexecution time (sec)
23
SELECT country, SUM(item.amount) FROM T2GROUP BY country
SELECT domain, SUM(item.amount) FROM T2WHERE domain CONTAINS ’.net’GROUP BY domain
Q2:
Q3:
40 billion nested items
(returns 100s of records) (returns 1M records)
The deeper the tree,the better we canparallelize aggregation
![Page 24: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/24.jpg)
Scalability
execution time (sec)
number ofleaf servers
24
SELECT TOP(aid, 20), COUNT(*) FROM T4WHERE bid = {value1} AND cid = {value2}
Q5 on a trillion-row table T4 (total read: 4.2 TB)
![Page 25: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/25.jpg)
Stragglers at 2× replicationpercentage of processed tablets
processing time per tablet (sec)
stragglers
25
SELECT COUNT(DISTINCT a) FROM T5
Q6: approximate query on a trillion-row table T5 (total read: 1TB)
![Page 26: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/26.jpg)
SQL dialect for nested data
Id: 10Name Cnt: 2 Language Str: 'http://A,en-us' Str: 'http://A,en'Name Cnt: 0
t1
SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt, Name.Url + ',' + Name.Language.Code AS StrFROM tWHERE REGEXP(Name.Url, '^http') AND DocId < 20;
message QueryResult { required int64 Id; repeated group Name { optional uint64 Cnt; repeated group Language { optional string Str; } }}
26
Output table Output schema
No record assembly during distributed query processing
![Page 27: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/27.jpg)
Other features• In the paper
– Algorithms: select-project-aggregate evaluation, column-striping, record assembly, FSM construction
• Not in the paper– Joins, User-defined scalar and table valued
functions, materialized results– Approximate queries (count distinct, top-k)– Cross-data center replication– Quality of service– Bulk updates
27
![Page 28: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/28.jpg)
28
![Page 29: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/29.jpg)
OBSERVATIONS
29
![Page 30: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/30.jpg)
Interactive speed
execution time (sec)
percentage of queries
30
Most queries complete under 10 sec
Monthly query workloadof one 3000-node Dremel instance
![Page 31: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/31.jpg)
What we learned• Possible to analyze large disk-resident datasets
interactively on commodity hardware– 1T records, 1000s of nodes
• MR can benefit from columnar storage just like parallel DBMSes– But record assembly is expensive
• Parallel DBMSes may benefit from serving tree architecture just like search engines
• Interactive SQL and MR can be complementary• Getting to last few percent of data fast is hard
– Replication, approximation, early termination
31
![Page 32: Brigham Young University February 17, 2011 Matt Tolton mtolton@google.com Dremel: Interactive Analysis of Web-Scale Datasets](https://reader035.vdocuments.us/reader035/viewer/2022070400/56649f135503460f94c27bae/html5/thumbnails/32.jpg)
BigQuery: powered by Dremel
32
http://code.google.com/apis/bigquery/
1. Upload
2. Process
Upload your datato Google Storage
Import to tables
Run queries3. Act
Your Data
BigQuery
Your Apps