a comparison of approaches to large-scale data analysis
DESCRIPTION
A Comparison of Approaches to Large-Scale Data Analysis. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, Michael Stonebraker SIGMOD 2009 2009-10-09 Summarized by Jaeseok Myung. Intelligent Database Systems Lab - PowerPoint PPT PresentationTRANSCRIPT
A Comparison of Approaches to Large-Scale Data AnalysisA Comparison of Approaches to Large-Scale Data Analysis
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt,
Samuel Madden, Michael Stonebraker
SIGMOD 2009
2009-10-09
Summarized by Jaeseok Myung
Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea
Copyright 2009 by CEBTCenter for E-Business Technology
MapReduce vs. Parallel DBMS
Copyright 2009 by CEBT
MapReduceMapReduce
Center for E-Business Technology
한재선 , SearchDay2008, http://nexr.tistory.com
Copyright 2009 by CEBT
Architectural DifferencesArchitectural Differences
Parallel DBMS MapReduce
Schema Support O X
Indexing O X
Programming ModelStating what you
want(SQL)
Presenting an algorithm
(C/C++, Java, …)
Optimization O X
Flexibility Good
Fault Tolerance Good
Center for E-Business Technology
Copyright 2009 by CEBT
Benchmark Environment (1/2)Benchmark Environment (1/2)
Systems
Hadoop: The most popular open-source MR implementation
DBMS-X: a parallel DBMS that stores data in a row-based format
Vertica: a column-based parallel DBMS
All Three systems were deployed on a 100-node cluster
Analytical Tasks
Data Loading
Selection Task
Aggregation Task
Join Task
UDF Aggregation Task
Center for E-Business Technology
Copyright 2009 by CEBT
Benchmark Environment (2/2)Benchmark Environment (2/2)
Dataset
Documents : 600,000 unique documents for each node
155 million UserVisits records (20GB/node)
18 million Rankings records (1GB/node)
Center for E-Business Technology
Copyright 2009 by CEBT
1. Data Loading1. Data Loading
Center for E-Business Technology
loading timeloading time
ReorganizationReorganization
Copyright 2009 by CEBT
2. Selection Task2. Selection Task
The selection task is a lightweight filter to find the pageURLs in the Rankings table(1GB/node) with a pageRank above a user-defined threshold
Query
SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;
x = 10, which yields approximately 36,000 records per data file on each node
For MR, implementing the same task with Java language
Center for E-Business Technology
Copyright 2009 by CEBT
2. Selection Task - Result2. Selection Task - Result
Center for E-Business Technology
time for combining the output into a single file(Additional MR)
time for combining the output into a single file(Additional MR)
Processing timeProcessing time
Copyright 2009 by CEBT
3. Aggregation Task3. Aggregation Task
The aggregation task is calculating the total adRevenue generated for each sourceIP in the UserVisits(20GB/node), grouped by the sourceIP column
Query
SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
This task always produces 2.5 million records
Center for E-Business Technology
Copyright 2009 by CEBT
3. Aggregation Task - Result3. Aggregation Task - Result
Center for E-Business Technology
Copyright 2009 by CEBT
4. Join Task4. Join Task
The join task consists of two sub-tasks that perform a complex calculation on two data sets
In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range
Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval
Query
SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’) GROUP BY UV.sourceIP;
SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
Center for E-Business Technology
Copyright 2009 by CEBT
4. Join Task - Result4. Join Task - Result
Center for E-Business Technology
Copyright 2009 by CEBT
5. UDF Aggregation Task5. UDF Aggregation Task
The final task is to compute the inlink count for each document in the dataset
Query
SELECT INTO Temp F(contents) FROM Document;
– F : a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database
– With this function F, we populate a temporary table with a list of URLs and then can execute a simple query to calculate the inlink count
SELECT url, SUM(value) FROM Temp GROUP BY url;
Center for E-Business Technology
Copyright 2009 by CEBT
5. UDF Aggregation Task - Result5. UDF Aggregation Task - Result
Center for E-Business Technology
Copyright 2009 by CEBT
ConclusionConclusion
Center for E-Business Technology
MapReduce < Parallel DBMS
HadoopDB: An Architectural Hybrid of MapReduce and HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical WorkloadsDBMS Technologies for Analytical Workloads
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin
VLDB 2009
2009-10-09
Summarized by Jaeseok Myung
Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea
Copyright 2009 by CEBT
HadoopDBHadoopDB
The Basic Idea (An Architectural Hybrid of MR & DBMS)
To use MR as the communication layer above multiple nodes running single-node DBMS instances
Queries are expressed in SQL, translated into MR by extending existing tools, and as much work as possible is pushed into the higher performing single node databases
Center for E-Business Technology
Copyright 2009 by CEBT
The Architecture of HadoopDBThe Architecture of HadoopDB
Center for E-Business Technology
Copyright 2009 by CEBT
HadoopDB – Join TaskHadoopDB – Join Task
Center for E-Business Technology