big data platforms

Post on 01-Jan-2016

26 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Big Data Platforms. Mihai Budiu , Oct 6 2014. My work. Ph.D. from Carnegie Mellon, 2003 H ardware synthesis Reconfigurable hardware Compilers and computer architecture Researcher at Microsoft Research Silicon Valley 2004-2014 Computer security - PowerPoint PPT Presentation

TRANSCRIPT

Big Data Platforms

Mihai Budiu

, Oct 6 2014

2

My work• Ph.D. from Carnegie Mellon, 2003• Hardware synthesis• Reconfigurable hardware• Compilers and computer architecture

• Researcher at Microsoft Research Silicon Valley 2004-2014• Computer security• Cloud computing infrastructure:

• distributed computation platforms • monitoring and debugging• performance analysis

• Big data analysis and visualization • Large scale machine learning

3

500 Years Ago

Tycho Brahe(1546-1601)

Johannes Kepler(1571-1630)

4

The Laws of Planetary Motion

Tycho’s measurements Kepler’s laws

5

The Large Hadron Collider

25 PB/year WLHC Grid: 200K computing cores

6

Genetic Code

7

Astronomy

8

Weather

9

The Webs

Internet

Facebook friends graph

10

Big Data

11

Big Computers

12

Talk Outline

• Motivation• Dryad: A distributed runtime• DryadLINQ: A compiler for Dryad• Tools and applications• Sketch: A billion-row spreadsheet

13

Design Space

Throughput(batch)

Latency(interactive)

Internet

Datacenter

Data-parallel

Sharedmemory

DryadSearch

HPC

Grid

Transaction

Sketch

14

Dryad• Eurosys 2007• Continuously deployed in

Microsoft since 2006• Execution engine of Bing

analytics• > 105 machines•Many PB of data analyzed daily

Dryad painting by Evelyn de Morgan

15

Dryad = Execution Layer

Job (application)

Dryad

Cluster

Pipeline

Shell

Machine≈

16

2-D Piping• Unix Pipes: 1-D

grep | sed | sort | awk | perl

• Dryad: 2-D grep1000 | sed500 | sort1000 | awk500 | perl50

17

Virtualized 2-D Pipelines

18

Virtualized 2-D Pipelines

19

Virtualized 2-D Pipelines

20

Virtualized 2-D Pipelines

21

Virtualized 2-D Pipelines• 2D DAG• multi-machine• virtualized

22

Dryad Job Structure

grep

sed

sortawk

perlgrep

grepsed

sort

sort

awk

Inputfiles

Vertices (processes)

Outputfiles

ChannelsStage

23

Dryad System Architecture

Files, TCP, FIFO, Networkjob schedule

data plane

control plane

NS,Sched RE RERE

V V V

job manager cluster

GM code

vertex code

Staging1. Build

2. Send .exe

3. Start manager

5. Generate graph

7. Serializevertices

8. MonitorVertex execution

4. Querycluster resources

Nameserver6. Initialize vertices

Remoteexecutionservice

25

Talk Outline

• Motivation• Dryad: A distributed runtime• DryadLINQ: A compiler for Dryad• Tools and applications• Sketch: A billion-row spreadsheet

26

Distributed Collections

Partition

Collection

.Net objects

27

LINQ

Dryad

=> DryadLINQ

28

LINQ = .Net+ Queries

Collection<T> collection;bool IsLegal(Key);string Hash(Key);

var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};

29

Collection<T> collection;bool IsLegal(Key k);string Hash(Key);

var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};

DryadLINQ = LINQ + Dryad

C#

collection

results

C# C# C#

Vertexcode

Queryplan(Dryad job)Data

30

Language Summary

WhereSelectGroupByOrderByAggregateJoin

31

Very expressive

var result = input.SelectMany(r => Mapper(r)) .GroupBy(r => Key(r)) .Select(g => Reducer(g));

Map-Reduce

Distributed sorting

Iterative machine-learning (EM)

32

Talk Outline

• Motivation• Dryad: A distributed runtime• DryadLINQ: A compiler for Dryad• Tools and applications• Sketch: A billion-row spreadsheet

33

Debugging DryadLINQ jobs

34

Distributed performance counters

35

Training Kinect

Depth map Body parts

Classifier

Xbox GPU

36

Learn from Many Examples

DecisionTree

Classifier

Machine learning

37

Talk Outline

• Motivation• Dryad: A distributed runtime• DryadLINQ: A compiler for Dryad• Tools and applications• Sketch: A billion-row spreadsheet

Bandwidth hierarchy

39

Principles

• Visualizations are bounded data displays• All computations are sketches

• Sketch is a runtime for (1) running streaming (sketching) algorithms(2) implementing visualizations with bounded data renderings

40

Streaming algorithms

• Sketches = randomized streaming algorithms • Input = set of size n• Result same independent of the order• Memory = O(log(n))• Multi-pass

• Linear input transformations

4 billion rows on 155 machines

42

Spreadsheet operations• Browsing/scrolling• Filtering• Using predicates• Heavy hitters• Sampling

• Searching• Sorting• Computing new columns• Set operations (intersection, union, etc.)• Charting

Histograms

Heat Maps

Sketch distributed service

45data

Sketchservice

data

Sketchservice

data

Sketchservice

data

Sketchservice

46

DataSets = distributed objects

Network

46

Client

Servers

DataSet<T>

Application

T T T T T T T T T T T

47

Sketch Spreadsheet architecture

DataSet<Table>

SQL Server CSV Files Column store Cosmos Storage layer

Table operations

GUI

Distributed objects

Spreadsheet logic

Spreadsheet display

48

DataSet API

interface IDataSet<T> { IDataSet<S> Map<S>(Func<T,S> f); IDataSet<Pair<T,S>> Zip(IDataSet<S> other); R Sketch(ISketch<T, R> sketch);}interface ISketch<T,R> {

R Create(T data);R Combine(List<R> parts);

}

49

DataSet Implementations

Application

Network

Client Parallel

Proxy Proxy

GUI

Parallel

Local Local Local Local

Parallel

Local Local

Parallel

Datasetinterface

Rack aggregation

Core parallelism

Cluster parallelism

RMI layer

Proxy

ref ref ref

Parallel

Server 0 Server 1 Server n

Rack 0 Rack r

Address space

T T T T T T

Proxy

Local Local

Parallel

Proxy

Local Local

Parallel

T T S Sff

Map(f)

51

Sketch(s)

Proxy

Local Local

ParallelR R

R

R

s.Combine

T T

s.Create

interface ISketch<T,R> {R Create(T data);R Combine(List<R> parts);

}

52

Zip

Proxy

Local Local

Parallel

Proxy

Local Local

Parallel

T T S S

Proxy

Local Local

Parallel

T,S T,S

53

Histograms

CDF

2Dhistogram

54

Compute

Computing a histogram

Client

Server 1

Server n

Histogram

1D + 2Dcomposit

esketch

Datarangesketch

Render

Displayhistogra

m

User click tr th

ta

55

Some numbers

• Window Server 2012 R2 • 8-core 2.1GHz

AMD Opteron 2373 EE • > 16GB RAM• 3 x 1TB disks using RAID-0• 155 machines • 5 racks • 1Gbps Ethernet

56

1 2 4 8 16 24 32 64 128

155

0

100

200

300

400

500

600 No aggregation network

With aggregation network

Null Sketch

Machines

Tim

e (m

s)

57

Histogram computation

• 26M rows/machine• Scale-out

1 2 4 8 16 24 32 64 128

155

0200400600800

1000120014001600

machines

Tim

e (m

s)

58

Conclusions

• Big data is here to stay• Better tools are needed• Quest for high-level abstractions for

building distributed systems• Execution graphs• Distributed collections• Higher-order transformations• Distributed stateful objects• Sketching algorithms

59

Execution

Application

Data-Parallel Computation

60

Storage

Language

Map-Reduce

GFSBigTable

CosmosAzure

SQL Server

Dryad

DryadLINQScope

Sawzall,FlumeJava

Hadoop

HDFSS3

Pig, Hive≈SQL LINQ, SQLSawzall, Java

top related