HadoopDB: An open source hybrid of MapReduce and DBMS technologies

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz
Yale University
http://hadoopdb.sourceforge.net
October 2, 2009

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.



Outline: Introduction, Candidates, Differences, HadoopDB, Evaluation, Conclusion

Motivation

Major Trends

1. Data explosion: automation of business processes, proliferation of digital devices. eBay has a 6.5 PB warehouse; Yahoo! Everest has 10 PB.
2. Analysis over raw data

Bottom line

Analyzing massive structured data on 1000s of shared-nothing nodes.

Yale University, HadoopWorld 2009 HadoopDB 2/24


Sales Record Example

Consider a large data set of sales log records, each consisting of sales information including:

1. a date of sale
2. a price

We would like to take the log records and generate a report showing the total sales for each year.

Question:

How do we generate this report efficiently and cheaply over massive data contained in a shared-nothing cluster of 1000s of machines?

MapReduce (Hadoop)

MapReduce is a programming model which specifies:

A map function that processes a key/value pair to generate a set of intermediate key/value pairs,

A reduce function that merges all intermediate values associated with the same intermediate key.

Hadoop

Hadoop is a MapReduce implementation for processing large data sets over 1000s of nodes.

Maps (and Reduces) run independently of each other over blocks of data distributed across a cluster.

Sales Record Example using Hadoop

Query: Calculate total sales for each year.

We write a MapReduce program:

Map: Takes log records and extracts a key-value pair of year and sale price in dollars. Outputs the key-value pairs.

Shuffle: Hadoop automatically partitions the key-value pairs by year to the nodes executing the Reduce function.

Reduce: Simply sums up all the dollar values for a year.
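The three phases above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not Hadoop code: the record layout (an ISO date string and a price) and the names `SALES_LOG`, `map_record`, `shuffle`, and `reduce_year` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical log records: (date of sale as YYYY-MM-DD, price in dollars).
SALES_LOG = [
    ("2007-03-14", 20.0),
    ("2008-01-02", 5.5),
    ("2007-11-30", 10.0),
    ("2008-07-19", 4.5),
]


def map_record(record):
    """Map: extract a (year, price) pair from one log record."""
    date, price = record
    return int(date[:4]), price


def shuffle(pairs):
    """Shuffle: group prices by year, the job Hadoop does between Map and Reduce."""
    groups = defaultdict(list)
    for year, price in pairs:
        groups[year].append(price)
    return groups


def reduce_year(year, prices):
    """Reduce: sum all dollar values for one year."""
    return year, sum(prices)


def total_sales_per_year(records):
    """Run map, shuffle, and reduce in sequence over the given records."""
    pairs = [map_record(r) for r in records]
    return dict(reduce_year(y, p) for y, p in shuffle(pairs).items())


if __name__ == "__main__":
    print(total_sales_per_year(SALES_LOG))
```

In a real cluster the map calls run on the nodes holding each block and the shuffle moves data over the network; here everything runs in one process so the roles of the three phases stay visible.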

Relational Databases

Suppose that the data is stored in a relational database system. The sales record example could then be expressed in SQL as:

SELECT YEAR(date) AS year, SUM(price)
FROM sales
GROUP BY year

The execution plan is:

projection(year,price) → hash aggregation(year,price).

Question:

How do we process this efficiently if the data is very large?
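To try the query on a single node, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is an assumption chosen for convenience, and since it has no `YEAR()` function the sketch substitutes `strftime('%Y', ...)`; the sample rows and the `yearly_totals` helper are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (date of sale, price in dollars).
SAMPLE_SALES = [
    ("2007-03-14", 20.0),
    ("2008-01-02", 5.5),
    ("2007-11-30", 10.0),
    ("2008-07-19", 4.5),
]


def yearly_totals(rows):
    """Load rows into an in-memory sales table and run the GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (date TEXT, price REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # SQLite has no YEAR(); strftime('%Y', ...) stands in for it here.
    query = (
        "SELECT CAST(strftime('%Y', date) AS INTEGER) AS year, SUM(price) "
        "FROM sales GROUP BY year ORDER BY year"
    )
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for year, total in yearly_totals(SAMPLE_SALES):
        print(year, total)
```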

Parallel Databases

Parallel Databases are like single-node databases except:

Data is partitioned across nodes

Individual relational operations can be executed in parallel

SELECT YEAR(date) AS year, SUM(price)
FROM sales
GROUP BY year

Execution plan for the query:

projection(year,price) → partial hash aggregation(year,price) → partitioning(year) → final aggregation(year,price).

Note that the execution plan resembles the map and reduce phases of Hadoop.
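The four plan steps can be mimicked in plain Python to show why the plan resembles map and reduce. This is a toy model, not how any particular parallel database is implemented: the node partitions, the `year % num_nodes` routing rule, and all function names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical data, already partitioned across two nodes.
NODE_PARTITIONS = [
    [("2007-03-14", 20.0), ("2008-01-02", 5.5)],
    [("2007-11-30", 10.0), ("2008-07-19", 4.5)],
]


def partial_aggregate(partition):
    """projection(year, price) plus partial hash aggregation on one node."""
    partial = defaultdict(float)
    for date, price in partition:
        partial[int(date[:4])] += price
    return dict(partial)


def repartition(partials, num_nodes):
    """partitioning(year): send each partial sum to the node that owns the year."""
    inboxes = [[] for _ in range(num_nodes)]
    for partial in partials:
        for year, subtotal in partial.items():
            inboxes[year % num_nodes].append((year, subtotal))
    return inboxes


def final_aggregate(inbox):
    """final aggregation(year, price): combine the partial sums for owned years."""
    totals = defaultdict(float)
    for year, subtotal in inbox:
        totals[year] += subtotal
    return dict(totals)


def parallel_total_sales(partitions):
    """Run the whole plan and collect the per-node results into one dict."""
    partials = [partial_aggregate(p) for p in partitions]
    inboxes = repartition(partials, len(partitions))
    result = {}
    for inbox in inboxes:
        result.update(final_aggregate(inbox))
    return result


if __name__ == "__main__":
    print(parallel_total_sales(NODE_PARTITIONS))
```

The partial aggregation plays the role of a map with a combiner, the repartitioning corresponds to the shuffle, and the final aggregation is the reduce.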

Differences between Parallel Databases and Hadoop

To summarize

At Yale, we looked beyond the differences ...

and we discovered ...

... that they complete each other
http://i214.photobucket.com/albums/cc19/brittanybutton/elephants.jpg

Basic design idea

Multiple, independent, single-node databases coordinated by Hadoop.

Hadoop Basics

Architecture

SQL-MR-SQL

SELECT YEAR(saleDate), SUM(revenue)
FROM sales
GROUP BY YEAR(saleDate);
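One way to picture the SQL-MR-SQL idea is a map phase that pushes the aggregate query into each node's local database, followed by a reduce phase that merges the per-node partial sums. The sketch below is a hedged, single-process illustration and not the actual SMS planner: in-memory SQLite databases stand in for the node databases, `strftime` replaces `YEAR()`, and `make_node_db`, `map_phase`, and `reduce_phase` are hypothetical names.

```python
import sqlite3
from collections import defaultdict

# SQL pushed down to every node database (SQLite dialect, an assumption).
LOCAL_SQL = (
    "SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS year, SUM(revenue) "
    "FROM sales GROUP BY year"
)


def make_node_db(rows):
    """Create one independent single-node database holding a data partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_phase(node_dbs):
    """Map: run the local SQL on each node and emit (year, partial sum) pairs."""
    for conn in node_dbs:
        yield from conn.execute(LOCAL_SQL)


def reduce_phase(pairs):
    """Reduce: add up the partial sums that share the same year."""
    totals = defaultdict(float)
    for year, partial in pairs:
        totals[year] += partial
    return dict(totals)


if __name__ == "__main__":
    nodes = [
        make_node_db([("2007-03-14", 20.0), ("2008-01-02", 5.5)]),
        make_node_db([("2007-11-30", 10.0), ("2008-07-19", 4.5)]),
    ]
    print(reduce_phase(map_phase(nodes)))
```

Most of the work happens inside the databases; the coordinating layer only sees one small row per year per node.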

Evaluating HadoopDB

Compare HadoopDB to

1. Hadoop
2. Parallel databases (Vertica, DBMS-X)

Features:

1. Performance: We expected HadoopDB to approach the performance of parallel databases
2. Scalability: We expected HadoopDB to scale as well as Hadoop

We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2 clusters of 10, 50, and 100 nodes.

Load

Figure (two panels): load time on 10, 50, and 100 nodes for Vertica, DB-X, HadoopDB, and Hadoop. Panel 1: Random Unstructured Data (535MB/node), in seconds. Panel 2: Structured data (20GB/node), in thousands of seconds.

Performance: Grep Task

Figure: query time in seconds on 10, 50, and 100 nodes for Vertica, DB-X, HadoopDB, and Hadoop.

SELECT * FROM grep WHERE field LIKE '%xyz%';

1. Full table scan, highly selective filter
2. Random data, no room for indexing
3. Hadoop overhead outweighs query processing time in single-node databases

Performance: Join Task

Figure: query time in seconds on 10, 50, and 100 nodes for Vertica, DB-X, HadoopDB, and Hadoop. Data labels recovered from the chart, in extraction order: 20.6, 34.7, 67.7, 28.0, 29.4, 31.9, 126.4, 224.2, 300.5.

SELECT sourceIP, AVG(pageRank), SUM(adRevenue)
FROM rankings, uservisits
WHERE pageURL = destURL
AND visitDate BETWEEN '2000-1-15' AND '2000-1-22'
GROUP BY sourceIP
ORDER BY SUM(adRevenue) DESC LIMIT 1;

1. No full table scan due to clustered indexing
2. Hash partitioning and efficient join algorithm
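For readers who want to see what the join query returns, here is a small sketch that runs it against SQLite on made-up data. Everything beyond the query text is an assumption: the column types, the sample rows, the zero-padded ISO date strings (needed so that string comparison in `BETWEEN` orders dates correctly in SQLite), and the `top_revenue_source` helper.

```python
import sqlite3

JOIN_SQL = """
SELECT sourceIP, AVG(pageRank), SUM(adRevenue)
FROM rankings, uservisits
WHERE pageURL = destURL
  AND visitDate BETWEEN '2000-01-15' AND '2000-01-22'
GROUP BY sourceIP
ORDER BY SUM(adRevenue) DESC LIMIT 1
"""

# Hypothetical sample data, far smaller than the benchmark tables.
RANKINGS = [("a.com", 10), ("b.com", 20)]
USERVISITS = [
    ("1.1.1.1", "a.com", "2000-01-16", 1.0),
    ("1.1.1.1", "b.com", "2000-01-20", 2.0),
    ("2.2.2.2", "a.com", "2000-01-17", 2.5),
    ("2.2.2.2", "b.com", "2000-02-01", 9.0),  # outside the date range
]


def top_revenue_source(rankings, uservisits):
    """Load both tables into memory and return the single top row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
    conn.execute(
        "CREATE TABLE uservisits "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    conn.executemany("INSERT INTO rankings VALUES (?, ?)", rankings)
    conn.executemany("INSERT INTO uservisits VALUES (?, ?, ?, ?)", uservisits)
    row = conn.execute(JOIN_SQL).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(top_revenue_source(RANKINGS, USERVISITS))
```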

Performance: Bottom Line

1. Unstructured data: HadoopDB’s performance matches Hadoop
2. Structured data: HadoopDB’s performance is close to parallel databases

Scalability: Setup

1. Simple aggregation task - full table scan
2. Data replicated across 10 nodes
3. Fault-tolerance: Kill a node halfway
4. Fluctuation-tolerance: Slow down a node for the entire experiment

Scalability: Results

Figure: percentage slowdown (axis from 0% to 200%) for Vertica, HadoopDB, and Hadoop in the fault-tolerance and fluctuation-tolerance tests.

1. HadoopDB and Hadoop take advantage of runtime scheduling by splitting data into chunks or blocks
2. Parallel databases restart entire query on node failure or wait for the slowest node
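The fluctuation-tolerance point can be illustrated with a toy timing model. The numbers below are invented for the sketch and are not the measurements from the talk; `static_time`, `chunked_time`, and `slowdown_percent` are hypothetical helpers. Static partitioning gives every node an equal share and waits for the slowest one, while chunked scheduling hands the next chunk to whichever node becomes free first.

```python
import heapq


def static_time(total_work, speeds):
    """Equal shares per node; the query finishes when the slowest node does."""
    share = total_work / len(speeds)
    return max(share / s for s in speeds)


def chunked_time(num_chunks, chunk_work, speeds):
    """Runtime scheduling: each chunk goes to the node that is free earliest."""
    # Each heap entry is (time at which the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_chunks):
        start, node = heapq.heappop(free_at)
        done = start + chunk_work / speeds[node]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, node))
    return finish


def slowdown_percent(elapsed, baseline):
    """Extra time relative to the baseline, as a percentage."""
    return 100.0 * (elapsed - baseline) / baseline


if __name__ == "__main__":
    healthy = [1.0, 1.0, 1.0, 1.0]
    one_slow = [1.0, 1.0, 1.0, 0.5]  # assumed: one node runs at half speed
    baseline = static_time(120, healthy)
    print(slowdown_percent(static_time(120, one_slow), baseline))
    print(slowdown_percent(chunked_time(120, 1.0, one_slow), baseline))
```

In this model the slow node simply ends up processing fewer chunks, so the overall finish time moves much less than under static partitioning.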

To summarize

HadoopDB ...

1. is a hybrid of DBMS and MapReduce
2. scales better than commercial parallel databases
3. is as fault-tolerant as Hadoop
4. approaches the performance of parallel databases
5. is free and open-source

http://hadoopdb.sourceforge.net

Future work

Engineering work:

1. Full SQL support in SMS
2. Data compression
3. Integration with other open source databases
4. Full automation of the loading and replication process
5. Out-of-the-box deployment
6. We’re hiring!

Research work:

Incremental loading and on-the-fly repartitioning

Dynamically adjusting fault-tolerance levels based on failure rate

Thank You ...

We welcome all thoughts on how to raise HadoopDB ...
http://www.jpbutler.com/thailand/images/elephant-8-days-old.jpg