a comparison of approaches to large-scale data analysis

A Comparison of Approaches to Large-Scale Data AnalysisA Comparison of Approaches to Large-Scale Data Analysis

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt,

Samuel Madden, Michael Stonebraker

SIGMOD 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea

Copyright 2009 by CEBTCenter for E-Business Technology

MapReduce vs. Parallel DBMS

Copyright 2009 by CEBT

MapReduceMapReduce

Center for E-Business Technology

한재선 , SearchDay2008, http://nexr.tistory.com


Architectural DifferencesArchitectural Differences

Parallel DBMS MapReduce

Schema Support O X

Indexing O X

Programming ModelStating what you

want(SQL)

Presenting an algorithm

(C/C++, Java, …)

Optimization O X

Flexibility Good

Fault Tolerance Good



Benchmark Environment (1/2)Benchmark Environment (1/2)

Systems

Hadoop: The most popular open-source MR implementation

DBMS-X: a parallel DBMS that stores data in a row-based format

Vertica: a column-based parallel DBMS

All Three systems were deployed on a 100-node cluster

Analytical Tasks

Data Loading

Selection Task

Aggregation Task

Join Task

UDF Aggregation Task



Benchmark Environment (2/2)Benchmark Environment (2/2)

Dataset

Documents : 600,000 unique documents for each node

155 million UserVisits records (20GB/node)

18 million Rankings records (1GB/node)



1. Data Loading1. Data Loading


loading timeloading time

ReorganizationReorganization


2. Selection Task2. Selection Task

The selection task is a lightweight filter to find the pageURLs in the Rankings table(1GB/node) with a pageRank above a user-defined threshold

Query

SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;

x = 10, which yields approximately 36,000 records per data file on each node

For MR, implementing the same task with Java language



2. Selection Task - Result2. Selection Task - Result


time for combining the output into a single file(Additional MR)

time for combining the output into a single file(Additional MR)

Processing timeProcessing time


3. Aggregation Task3. Aggregation Task

The aggregation task is calculating the total adRevenue generated for each sourceIP in the UserVisits(20GB/node), grouped by the sourceIP column

Query

SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

This task always produces 2.5 million records



3. Aggregation Task - Result3. Aggregation Task - Result



4. Join Task4. Join Task

The join task consists of two sub-tasks that perform a complex calculation on two data sets

In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range

Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval

Query

SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’) GROUP BY UV.sourceIP;

SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;



4. Join Task - Result4. Join Task - Result



5. UDF Aggregation Task5. UDF Aggregation Task

The final task is to compute the inlink count for each document in the dataset

Query

SELECT INTO Temp F(contents) FROM Document;

– F : a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database

– With this function F, we populate a temporary table with a list of URLs and then can execute a simple query to calculate the inlink count

SELECT url, SUM(value) FROM Temp GROUP BY url;



5. UDF Aggregation Task - Result5. UDF Aggregation Task - Result



ConclusionConclusion


MapReduce < Parallel DBMS

HadoopDB: An Architectural Hybrid of MapReduce and HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical WorkloadsDBMS Technologies for Analytical Workloads

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin

VLDB 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea


HadoopDBHadoopDB

The Basic Idea (An Architectural Hybrid of MR & DBMS)

To use MR as the communication layer above multiple nodes running single-node DBMS instances

Queries are expressed in SQL, translated into MR by extending existing tools, and as much work as possible is pushed into the higher performing single node databases



The Architecture of HadoopDBThe Architecture of HadoopDB



HadoopDB – Join TaskHadoopDB – Join Task