
A Fault-Tolerant Environment for Large-Scale Query Processing

Mehmet Can Kurt Gagan AgrawalDepartment of Computer Science and

Engineering The Ohio State University

HiPC’12 Pune, India 1

Motivation

• “big data” problem
– Walmart handles 1 million customer transactions every hour; estimated data volume is 2.5 petabytes
– Facebook handles more than 40 billion images
– LSST generates 6 petabytes every year

• massive parallelism is the key


Motivation

• Mean Time To Failure (MTTF) decreases as systems grow larger

• Typical first year for a new cluster*
– 1000 individual machine failures
– 1 PDU failure (~500-1000 machines suddenly disappear)
– 20 rack failures (40-80 machines disappear, 1-6 hours to get back)


* taken from Jeff Dean’s talk at Google I/O (http://perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx)

Our Work

• supporting fault-tolerant query processing and data analysis for a massive scientific dataset

• focusing on two specific query types:
1. Range Queries on Spatial Datasets
2. Aggregation Queries on Point Datasets

• supported failure types: single-machine failures and rack failures


* rack: a number of machines connected to the same hardware (network switch, …)

Our Work

• Primary Goals
1) high efficiency of execution when there are no failures (indexing if applicable, ensuring load balance)
2) handling failures efficiently up to a certain number of nodes (low-overhead fault tolerance through data replication)
3) a modest slowdown in processing times after recovering from a failure (preserving load balance)


Range Queries on Spatial Data

• nature of the task:
– each data object is a rectangle in 2D space
– each query is defined by a rectangle
– return the data rectangles that intersect the query rectangle

• computational model:
– master/worker model
– master serves as coordinator
– each worker is responsible for a portion of the data
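The intersection test behind a range query can be sketched in C; the `rect_t` layout here is hypothetical, since the slides do not show the actual data structures:

```c
#include <stdbool.h>

/* Hypothetical rectangle layout; the paper's actual data structures are
   not shown in the slides. */
typedef struct { double xmin, ymin, xmax, ymax; } rect_t;

/* Two axis-aligned rectangles intersect iff their projections overlap
   on both the X and Y axes; a range query returns every data rectangle
   for which this test succeeds against the query rectangle. */
static bool intersects(rect_t a, rect_t b)
{
    return a.xmin <= b.xmax && b.xmin <= a.xmax &&
           a.ymin <= b.ymax && b.ymin <= a.ymax;
}
```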


[Figure: master/worker model — the master routes incoming queries to the workers, each of which holds a portion of the data in 2D (X, Y) space.]

Range Queries on Spatial Data

• data organization:
– a chunk is the smallest data unit
– create chunks by grouping data objects together
– assign chunks to workers in round-robin fashion


[Figure: data objects in 2D space grouped into chunks 1-4, which are assigned to the workers.]

* actual number of chunks depends on chunk size parameter.
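Once the objects are in sorted order (next slide), chunk creation and round-robin assignment reduce to index arithmetic; a minimal sketch, with `chunk_size` standing in for the tunable parameter from the footnote:

```c
/* The chunk holding the i-th object in sorted order, when every chunk
   packs `chunk_size` consecutive objects. */
static int chunk_of(int sorted_rank, int chunk_size)
{
    return sorted_rank / chunk_size;
}

/* Round-robin assignment: chunk c lives on worker c mod W. */
static int owner_of(int chunk_id, int num_workers)
{
    return chunk_id % num_workers;
}
```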

Range Queries on Spatial Data

• ensuring load balance:
– enumerate & sort data objects according to the Hilbert space-filling curve, then pack sorted data objects into chunks

• spatial index support:
– a Hilbert R-Tree is deployed on the master node
– leaf nodes correspond to data chunks
– initial filtering at the master tells workers which chunks to examine


[Figure: 2D space traversed by a Hilbert curve through quadrants 1-4, ordering objects o1-o8.]

sorted objects: o1, o3, o8, o6, o2, o7, o4, o5 → packed into chunks 1-4
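The sort key is each object's position along the Hilbert curve. The standard iterative coordinate-to-distance conversion looks like the sketch below; the paper does not show its own implementation, so this is only an illustration:

```c
#include <stdint.h>

/* Distance along the Hilbert curve of point (x, y) on an n x n grid,
   where n is a power of two. Objects close in 2D space get nearby
   distances, which is what makes packing consecutive sorted objects
   into chunks preserve spatial locality. */
static uint64_t hilbert_d(uint64_t n, uint64_t x, uint64_t y)
{
    uint64_t d = 0;
    for (uint64_t s = n / 2; s > 0; s /= 2) {
        uint64_t rx = (x & s) > 0;
        uint64_t ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        if (ry == 0) {                    /* rotate the quadrant */
            if (rx == 1) {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            uint64_t t = x; x = y; y = t; /* swap x and y */
        }
    }
    return d;
}
```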

Range Queries on Spatial Data

• Fault-Tolerance Support – Sub-chunk Replication:
step 1: divide data chunks into k sub-chunks
step 2: distribute sub-chunks in round-robin fashion


[Figure: with k = 2, each of chunks 1-4 on Workers 1-4 is divided into two sub-chunks (step 1), which are then distributed across the workers (step 2).]

* rack failure: same approach, but distribute sub-chunks to nodes in a different rack
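One way to realize step 2 — a hypothetical placement function, as the slides do not give the exact rule — is to walk the workers round-robin starting just after the chunk's owner, so no replica lands on the node it is meant to protect:

```c
/* Worker receiving the j-th sub-chunk (0 <= j < k) of a chunk owned by
   `owner`; skips the owner so losing that node never loses both the
   chunk and its replicas. Assumes k < num_workers. */
static int subchunk_target(int owner, int j, int num_workers)
{
    int w = (owner + 1 + j) % num_workers;
    if (w == owner)
        w = (w + 1) % num_workers;   /* wrap past the owner */
    return w;
}
```

For a rack failure the same idea applies, with the walk restricted to nodes in a different rack.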

Range Queries on Spatial Data

• Fault-Tolerance Support – Bookkeeping:
– add a sub-leaf level to the bottom of the Hilbert R-Tree
– the Hilbert R-Tree serves both as a filtering structure and as a failure-management tool


Aggregation Queries on Point Data

• nature of the task:
– each data object is a point in 2D space
– each query is defined by a dimension (X or Y) and an aggregation function (SUM, AVG, …)

• computational model:
– master/worker model
– divide space into M partitions
– no indexing support
– standard 2-phase algorithm: local and global aggregation


[Figure: with M = 4, the 2D space is divided among workers 1-4; a partial result for a query resides in worker 2.]
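The 2-phase algorithm can be sketched in plain C. In the actual system the combine step would run over MPI (e.g. a reduction across workers); the integer-coordinate binning here is a hypothetical simplification:

```c
typedef struct { double x, y; } point_t;

/* Phase 1 (local): each worker sums its own points into per-X-coordinate
   bins. Aggregation along Y is symmetric. Points whose coordinate falls
   outside [0, nbins) are ignored in this sketch. */
static void local_sum_along_x(const point_t *pts, int n,
                              double *bins, int nbins)
{
    for (int i = 0; i < n; i++) {
        int b = (int)pts[i].x;
        if (b >= 0 && b < nbins)
            bins[b] += pts[i].y;
    }
}

/* Phase 2 (global): partial bins from all workers are combined
   element-wise into the final answer. */
static void global_combine(const double *partial, double *result, int nbins)
{
    for (int b = 0; b < nbins; b++)
        result[b] += partial[b];
}
```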

Aggregation Queries on Point Data

• reducing communication volume
– the initial partitioning scheme has a direct impact
– we have insights about the data and query workload:
P(X) and P(Y) = probability of aggregation along the X- and Y-axis
|rx| and |ry| = range of the X and Y coordinates

• expected communication volume Vcomm defined as: [equation on slide]

• Goal: choose the partitioning scheme (cv and ch) that minimizes Vcomm
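The Vcomm formula itself is a slide image and is not reproduced in this transcript, but the minimization is just a search over factorizations of M; a sketch with a caller-supplied cost function standing in for Vcomm(cv, ch):

```c
/* Pick the grid shape (cv columns x ch rows, cv * ch = M) minimizing a
   cost function; the slide's Vcomm would be plugged in as `cost`. */
static void best_partitioning(int M, double (*cost)(int cv, int ch),
                              int *best_cv, int *best_ch)
{
    double best = -1.0;
    for (int cv = 1; cv <= M; cv++) {
        if (M % cv != 0)
            continue;                 /* cv must divide M */
        int ch = M / cv;
        double c = cost(cv, ch);
        if (best < 0.0 || c < best) {
            best = c;
            *best_cv = cv;
            *best_ch = ch;
        }
    }
}

/* Example cost (purely illustrative, NOT the paper's Vcomm): prefer
   square grids. */
static double square_cost(int cv, int ch)
{
    double d = (double)cv - (double)ch;
    return d * d;
}
```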


Aggregation Queries on Point Data

• Fault-Tolerance Support – Sub-partition Replication:
step 1: divide each partition evenly into M’ sub-partitions
step 2: send each of the M’ sub-partitions to a different worker node

• Important questions:
1) how many sub-partitions (M’)?
2) how to divide a partition (cv’ and ch’)?
3) where to send each sub-partition? (random vs. rule-based)


[Figure: with M’ = 4 (ch’ = 2, cv’ = 2), each partition is split into four sub-partitions; a better distribution reduces communication overhead.]

rule-based selection: assign sub-partitions to nodes that share the same coordinate range
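The rule-based selection can be sketched as follows, assuming workers sit on a ch-row by cv-column grid with row-major ids — a hypothetical layout, since the slides do not pin down the exact mapping. Sending a sub-partition to another worker in the same column keeps it on a node that owns the same X-coordinate range:

```c
/* Row-major worker id on a ch x cv grid (hypothetical layout). */
static int worker_id(int row, int col, int cv)
{
    return row * cv + col;
}

/* Target for the j-th sub-partition cut from the partition at
   (row, col): another row in the same column, so the receiver shares
   the sender's X-coordinate range and aggregation along X needs no
   extra redistribution after a failure. Skips the owner's own row. */
static int rule_based_target(int row, int col, int j, int cv, int ch)
{
    int r = (row + 1 + j) % ch;
    if (r == row)
        r = (r + 1) % ch;
    return worker_id(r, col, cv);
}
```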

Experiments

• local cluster:
– each node has two quad-core 2.53 GHz Intel Xeon processors with 12 GB RAM
• entire system implemented in C using the MPI library
• range queries:
– comparison with the chunk replication scheme
– 32 GB spatial data
– 1000 queries are run, and the aggregate time is reported
• aggregation queries:
– comparison with the partition replication scheme
– 24 GB point data
• 64 nodes used, unless noted otherwise


Experiments: Range Queries

[Figures: optimal chunk size selection; scalability — execution times with no replication and no failures (chunk size = 10000)]

Experiments: Range Queries

[Figures: single-machine failure; rack failure — execution times under failure scenarios (64 workers in total; k is the number of sub-chunks per chunk)]

Experiments: Aggregation Queries

[Figures: effect of partitioning scheme on normal execution; single-machine failure. Parameters: P(X) = P(Y) = 0.5, with |rx| = |ry| = 10000 and |rx| = |ry| = 100000.]

Conclusion

• a fault-tolerant environment that can process
– range queries on spatial data and aggregation queries on point data
– the proposed approaches can be extended to other types of queries and analysis tasks

• high efficiency under normal execution

• sub-chunk and sub-partition replication
– preserve load balance in the presence of failures, and hence
– outperform traditional replication schemes


Thank you for listening …

Questions
