Download - Adaptive Query Processing on RAW Data
![Page 1: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/1.jpg)
Adaptive Query Processing on RAW Data
Manos Karpathiotakis, Miguel Branco, Ioannis Alagiannis, Anastasia Ailamaki
![Page 2: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/2.jpg)
Many “data problems” lack solutions
• Petabytes of data
• No a priori knowledge about data
• Multiple file formats
2
![Page 3: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/3.jpg)
Many “data problems” lack solutions
• Petabytes of data
• No a priori knowledge about data
• Multiple file formats
3Need declarative, interactive access to arbitrary datasets
![Page 4: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/4.jpg)
CSV JSONXML
DBMSTr
ansf
orm
QueryRAW: A query engine adapting to data
Raw files as native storage format
RAW: Generate plug-in per file
4
Traditional DBMS: Data adapts to query engine
![Page 5: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/5.jpg)
CSV JSONXML
QueryRAW: A query engine adapting to data
Raw files as native storage format
RAW: Generate plug-in per file
5
Traditional DBMS: Data adapts to query engine
![Page 6: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/6.jpg)
CSV JSONXML
QueryRAW: A query engine adapting to data
Raw files as native storage format
RAW: Generate plug-in per file
6
High-performance querying… while keeping data formats, files, and scripts
Traditional DBMS: Data adapts to query engine
![Page 7: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/7.jpg)
How RAW adapts to data
CSVROOT
![Page 8: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/8.jpg)
Vectorized Query Execution
CSVROOT
join
scanplaceholder
scanplaceholder
filter
How RAW adapts to data
scanplaceh.
SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND …
8
![Page 9: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/9.jpg)
Code Generate the Access Paths
Vectorized Query Execution
CSVROOT
join
filter
How RAW adapts to data
scancsv
scanroot
SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan
csv
9
![Page 10: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/10.jpg)
Code Generate the Access Paths
Build Position and Data Caches
Vectorized Query Execution
CSVROOT
join
filter
How RAW adapts to data
scancsv
scanroot
SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan
csv
10
![Page 11: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/11.jpg)
Code Generate the Access Paths
Build Position and Data Caches
Vectorized Query Execution
CSVROOT
join
filter
How RAW adapts to data
scancsv
scanroot
SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan
csv
11
Adapt to format, file instance and queryjust-in-time
![Page 12: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/12.jpg)
Adapting to schema & query
[CSV input]
∀col:
if col needed:
if col isInt
readInt();
if col isFloat
readFloat();
if ...
else:
skipField();
GENERAL-PURPOSE
readInt();
readInt();
skipField();
readFloat();
skipRestLine();
JUST-IN-TIME
Remove overhead of generic operators
12
![Page 13: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/13.jpg)
Adapting to format
• Unroll Columns
• Free navigation in files
• Embedded indexes/existing APIs
13
![Page 14: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/14.jpg)
Adapting to format
• Unroll Columns
• Free navigation in files
• Embedded indexes/existing APIs
∀col:if col needed:
if col isInt...
readInt();skipField();readFloat();skipRest();
14
![Page 15: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/15.jpg)
Adapting to format
• Unroll Columns
• Free navigation in files
• Embedded indexes/existing APIs
∀col:if col needed:
if col isInt...
readInt();skipField();readFloat();skipRest();
- fieldLength:10- tupleLength:100- Need fields 2 & 5
of 2nd row
moveTo(110);readInt();moveTo(140);readFloat();
15
![Page 16: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/16.jpg)
Adapting to format
• Unroll Columns
• Free navigation in files
• Embedded indexes/existing APIs
∀col:if col needed:
if col isInt...
readInt();skipField();readFloat();skipRest();
- fieldLength:10- tupleLength:100- Need fields 2 & 5
of 2nd row
moveTo(110);readInt();moveTo(140);readFloat();
- Bitmaps, R-Trees etc.- readNextField() vs. readField(filename,id)
16
![Page 17: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/17.jpg)
Ad-Hoc Operators for Raw Data Fine-grained, raw-data-aware decisions
![Page 18: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/18.jpg)
Ad-Hoc Operators for Raw Data
Col9
Scan CSVColumns 1,9
Filter
Tuple Construction
Col1
Col1
JIT – OPTION 1
Col1 Col9
CSV file:SELECT col9 WHERE col1 < [X]
Fine-grained, raw-data-aware decisions
18
![Page 19: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/19.jpg)
JIT – OPTION 2
Col1
Ad-Hoc Operators for Raw Data
Col9
Scan CSVColumns 1,9
Filter
Tuple Construction
Col1
Col1
JIT – OPTION 1
Col1 Col9
CSV file:SELECT col9 WHERE col1 < [X]
Scan CSVColumn 1
Fine-grained, raw-data-aware decisions
19
![Page 20: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/20.jpg)
JIT – OPTION 2
Filter
Col1
Col1
543
12
Ad-Hoc Operators for Raw Data
Col9
Scan CSVColumns 1,9
Filter
Tuple Construction
Col1
Col1
JIT – OPTION 1
Col1 Col9
CSV file:SELECT col9 WHERE col1 < [X]
Scan CSVColumn 1
Fine-grained, raw-data-aware decisions
20
![Page 21: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/21.jpg)
JIT – OPTION 2
Filter
Tuple Construction
Col1 Col9
Col1
Col9
Col1
543
12
Ad-Hoc Operators for Raw Data
Col9
Scan CSVColumns 1,9
Filter
Tuple Construction
Col1
Col1
JIT – OPTION 1
Col1 Col9
CSV file:SELECT col9 WHERE col1 < [X]
Scan CSVColumn 1
Fine-grained, raw-data-aware decisions
Scan CSVColumn 9
21
Processes 3/5 raw col9 fields
![Page 22: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/22.jpg)
Electron
eventID INT
eta FLOAT
pt FLOAT
Jet
eventID INT
eta FLOAT
pt FLOATEvent
eventID INT
runNumber INT
Muon
eventID INT
eta FLOAT
pt FLOAT
ROOT - C++ RAWclass Event {
class Muon {float pt, eta;…
}class Electron {
float pt, eta;…
}class Jet {
float pt, eta;…
}int runNumber;vector<Muon> muons;vector<Electron> electrons;vector<Jet> jets; }
Finding the Higgs Boson: Data
22
![Page 23: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/23.jpg)
Finding the Higgs Boson: Queries“Identify events of interest → Filter out background events
→ Plot aggregated results in a histogram”
SELECT eventFROM root:/data1/ATLAS/*.root ,
csv:/data1/ATLAS/events.csvWHERE( csv.id = event.id AND
event.EF_e24vhi_medium1 OR event.EF_e60_medium1 OR event.EF_2e12Tvh_loose1 OR event.EF_mu24i_tight OR event.EF_mu36_tight OR event.EF_2mu13) ANDevent.muon.mu_ptcone20 < 0.1 * event.muon.mu_pt ANDevent.muon.mu_pt > 20000. ANDABS(event.muon.mu_eta) < 2.4 AND
…..
1000+ lines of C++
for (unsigned int imuon = 0 ; imuon<((*curr_entries)[jentry].mu_pt)->size(); imuon++) {if (((*curr_entries)[jentry].
mu_ptcone20)->at(imuon) < 0.1 * ((*curr_entries)[jentry].mu_pt)->at(imuon) && ((*curr_entries)[jentry].mu_pt)->at(imuon) > 20000. &&fabs(((*curr_entries)[jentry].mu_eta)->at(imuon)) < 2.4 &&…
}...
ROOT - C++ RAW
23
![Page 24: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/24.jpg)
0.1
1
10
100
1000
10000
Query 1(Cold)
Query 2 Query 3 Query 4 Query 5 Query 6
Exe
cuti
on
Tim
e (
sec)
RAW ROOT
RAW vs. the ROOT framework[Xeon CPU E7-28867 @ 2.13GHz1TB HDD - 7200RPM,192GB RAM]
ROOT: 900 GB in 127 files
CSV: 1 “table” of IDs
24
![Page 25: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/25.jpg)
0.1
1
10
100
1000
10000
Query 1(Cold)
Query 2 Query 3 Query 4 Query 5 Query 6
Exe
cuti
on
Tim
e (
sec)
RAW ROOT
RAW vs. the ROOT framework[Xeon CPU E7-28867 @ 2.13GHz1TB HDD - 7200RPM,192GB RAM]
ROOT: 900 GB in 127 files
CSV: 1 “table” of IDs
25Declarative access and up to 90x improvement
![Page 26: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/26.jpg)
Accessing raw data is DBMS business
• Adapt to data and queries
• CERN: 90x
• Application in Banking / Healthcare / Astronomy
http://dias.epfl.ch/RAW
26
![Page 27: Adaptive Query Processing on RAW Data](https://reader035.vdocuments.us/reader035/viewer/2022062320/559e9f2e1a28abd3048b478b/html5/thumbnails/27.jpg)
Image Sources
http://www.nasa.gov
http://home.web.cern.ch/
http://www.alacergroup.com/
http://www.humanbrainproject.eu
28