
Page 1: Map reducecloudtech

Cloud Computing Systems

Lin Gu

Hong Kong University of Science and Technology

Sept. 14, 2011

Page 2: Map reducecloudtech

How to effectively compute in a datacenter?

Is MapReduce the best answer to computation in the cloud?

What is the limitation of MapReduce?

How to provide general-purpose parallel processing in DCs?

Page 3: Map reducecloudtech

• MapReduce—parallel computing for Web-scale data processing

• Fundamental component in Google’s technological architecture
  – Why didn’t Google use parallel Fortran, MPI, …?

• Followed by many technology firms

The MapReduce Approach: Program Execution on Web-Scale Data

Page 4: Map reducecloudtech

MapReduce

Old ideas can be fabulous, too!

( = Lisp “Lost In Silly Parentheses”) ?

• Map and Fold

– Map: do something to all elements in a list

– Fold: aggregate elements of a list

• Used in functional programming languages such as Lisp

Page 5: Map reducecloudtech

• Map is a higher-order function: apply an operation to all elements in a list
  – The result is a new list

• Parallelizable

MapReduce

(map (lambda (x) (* x x)) '(1 2 3 4 5))  =>  '(1 4 9 16 25)
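For readers less familiar with Lisp syntax, a minimal Python rendering of the same example (Python is used only for illustration; it is not part of the original slides):

# map applies a function to every element of a list and yields a new list.
squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))
print(squares)  # [1, 4, 9, 16, 25]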

Page 6: Map reducecloudtech

• Reduce is also a higher-order function
• Like “fold”: aggregate elements of a list
  – Accumulator set to initial value
  – Function applied to list element and the accumulator
  – Result stored in the accumulator
  – Repeated for every item in the list
  – Result is the final value in the accumulator

(fold + 0 '(1 2 3 4 5))  =>  15
(fold * 1 '(1 2 3 4 5))  =>  120
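The same two fold examples can be reproduced with Python's functools.reduce (again, Python is used only for illustration):

from functools import reduce

# Sum with the accumulator initialized to 0, product with it initialized to 1.
print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0))  # 15
print(reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1))  # 120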

The MapReduce Approach: Program Execution on Web-Scale Data

Page 7: Map reducecloudtech

Massive parallel processing made simple
• Example: word count
• Map: parse a document and generate <word, 1> pairs
• Reduce: receive all pairs for a specific word, and count (sum)

Map:
  // D is a document
  for each word w in D
    output <w, 1>

Reduce:
  // for key w
  count = 0
  for each input item
    count = count + 1
  output <w, count>

The MapReduce Approach: Program Execution on Web-Scale Data
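A self-contained, single-process sketch of this word-count example may help before the framework details. Python is used only for illustration; the grouping dictionary stands in for MapReduce's shuffle step, and all names here are hypothetical:

from collections import defaultdict

def map_doc(doc):
    # Map: parse a document and emit <word, 1> pairs.
    for word in doc.split():
        yield (word, 1)

def reduce_word(word, counts):
    # Reduce: receive all pairs for one word and sum the counts.
    return (word, sum(counts))

def word_count(docs):
    groups = defaultdict(list)          # stands in for the shuffle/grouping step
    for doc in docs:
        for word, one in map_doc(doc):
            groups[word].append(one)
    return sorted(reduce_word(w, c) for w, c in groups.items())

print(word_count(["the cat sat", "the dog sat on the mat"]))
# [('cat', 1), ('dog', 1), ('mat', 1), ('on', 1), ('sat', 2), ('the', 3)]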

Page 8: Map reducecloudtech

Design Context

• Big data, but simple dependence
  – Relatively easy to partition data

• Supported by a distributed system
  – Distributed OS services across thousands of commodity PCs (e.g., GFS)

• First users are search oriented
  – Crawl, index, search

Designed years ago, still working today, growing adoption

Page 9: Map reducecloudtech

Workflow

Single master node, numerous worker threads

Page 10: Map reducecloudtech

Workflow

• 1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

• 2. One of the copies of the program is the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

Page 11: Map reducecloudtech

Workflow
• 3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

• 4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
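The slide leaves the partitioning function of step 4 unspecified; the MapReduce paper describes hash(key) mod R as the default. A minimal sketch (Python for illustration; the names are hypothetical):

def partition(key, R):
    # Map an intermediate key to one of the R reduce regions.
    return hash(str(key)) % R

print(partition("apple", 8))   # some region index in [0, 8)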

Page 12: Map reducecloudtech

Workflow
• 5. When a reduce worker is notified by the master about these locations, it uses RPCs to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.

• 6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

• 7. When all map tasks and reduce tasks have been completed, the MapReduce call returns to the user code.
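A minimal sketch of the grouping performed in steps 5 and 6, assuming the intermediate pairs have already been sorted by key (Python for illustration; names are hypothetical):

from itertools import groupby
from operator import itemgetter

def reduce_phase(sorted_pairs, reduce_fn):
    # sorted_pairs: intermediate <key, value> pairs already sorted by key (step 5),
    # so all occurrences of the same key are adjacent and can be grouped.
    results = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = [v for _, v in group]
        results.append(reduce_fn(key, values))   # step 6: one Reduce call per unique key
    return results

pairs = sorted([("cat", 1), ("dog", 1), ("cat", 1)])
print(reduce_phase(pairs, lambda k, vs: (k, sum(vs))))   # [('cat', 2), ('dog', 1)]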

Page 13: Map reducecloudtech

Programming

• How to write a MapReduce program to
  – Generate inverted indices? (a possible Map/Reduce pair is sketched below)
  – Sort?

• How to express more sophisticated logic?

• What if some workers (slaves) or the master fails?
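As a sketch of one possible answer to the inverted-index question, not necessarily the answer given in the lecture (Python for illustration; the in-memory grouping stands in for the framework's shuffle):

from collections import defaultdict

def map_inverted(doc_id, text):
    # Emit <word, doc_id> for each distinct word in the document.
    for word in set(text.split()):
        yield (word, doc_id)

def reduce_inverted(word, doc_ids):
    # Collect the posting list for each word.
    return (word, sorted(doc_ids))

pairs = list(map_inverted(1, "cloud systems")) + list(map_inverted(2, "cloud computing"))
groups = defaultdict(list)
for w, d in pairs:
    groups[w].append(d)
print(sorted(reduce_inverted(w, ds) for w, ds in groups.items()))
# [('cloud', [1, 2]), ('computing', [2]), ('systems', [1])]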

Page 14: Map reducecloudtech

Workflow

Where is the communication-intensive part?

• Initial data split into 64MB blocks
• Computed, results locally stored
• Master informed of result locations
• R reducers retrieve data from mappers
• Final output written

Page 15: Map reducecloudtech

• Distributed, scalable storage for key-value pairs

• Example: Dynamo (Amazon)

• Another example may be P2P storage (e.g., Chord)

• Key-value store can be a general foundation for more complex data structures

• But performance may suffer

Data Storage – Key-Value Store

Page 16: Map reducecloudtech

Data Storage – Key-Value Store

Dynamo: a decentralized, scalable key-value store
– Used in Amazon
– Uses consistent hashing to distribute data among nodes
– Replicated, versioned, load balanced
– Easy-to-use interface: put()/get()
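A minimal consistent-hashing sketch of how put()/get() keys can be mapped to nodes (Python for illustration; Dynamo additionally uses virtual nodes, replication to successor nodes, and versioning, none of which are shown, and the MD5-based ring hash here is an assumption):

import bisect, hashlib

def _h(s):
    # Hash a string onto the ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self._points = sorted((_h(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key):
        # Walk clockwise to the first node position at or after hash(key), wrapping around.
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # the node that owns this key's position on the ring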

Page 17: Map reducecloudtech

• Networked block storage
  – ND by SUN Microsystems

• Remote block storage over the Internet
  – Use S3 as a block device [Brantner]

• Block-level remote storage may become slow in networks with long latencies

Data Storage – Network Block Device

Page 18: Map reducecloudtech

• PC file systems
• Link together all clusters of a file
  – Directory entry: filename, attributes, date/time, starting cluster, file size

• Boot sector (superblock): file system wide information

• File allocation table, root directory, …

Data Storage – Traditional File Systems

On-disk layout: Boot sector | FAT 1 | FAT 2 (duplicate) | Root directory | Normal directories and files
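A sketch of how such a file system links together the clusters of a file: the directory entry stores the starting cluster, and the file allocation table maps each cluster to the next (Python for illustration; the table contents are made up):

EOF = -1
fat = {5: 9, 9: 12, 12: EOF}   # cluster -> next cluster

def clusters_of(start):
    # Follow the chain from the starting cluster recorded in the directory entry.
    chain, c = [], start
    while c != EOF:
        chain.append(c)
        c = fat[c]
    return chain

print(clusters_of(5))  # [5, 9, 12]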

Page 19: Map reducecloudtech

• NFS—Network File System [Sandberg]
  – Designed by SUN Microsystems in the 1980’s

• Transparent access to files stored remotely
  – XDR, RPC, VNode, VFS
  – Mountable file system, synchronous behavior

• Stateless server

Data Storage – Network File System

Page 20: Map reducecloudtech

NFS organization: client and server

Data Storage – Network File System

Page 21: Map reducecloudtech

• A distributed file system at work (GFS)

• Single master and numerous slaves communicate with each other

• File data unit, “chunk”, is up to 64MB. Chunks are replicated.

• The “master” is a single point of failure and a scalability bottleneck; the consistency model is difficult to use

Data Storage – Google File System (GFS)
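Because chunks have a fixed size, a client can translate a byte offset into a chunk index locally before asking the master for that chunk's replica locations. A sketch of the idea (not the GFS client API; Python for illustration):

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

def chunk_index(offset):
    # Which chunk of the file contains this byte offset?
    return offset // CHUNK_SIZE

print(chunk_index(200 * 1024 * 1024))  # byte offset 200 MB falls in chunk index 3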

Page 22: Map reducecloudtech

Data Storage – Database

PNUTS – a relational database service, designed and used by Yahoo!
• Parallel database
• Replication
• Indexes and views
• Structured schema, e.g.:

  CREATE TABLE Parts (
    ID VARCHAR,
    StockNumber INT,
    Status VARCHAR
    …
  )

[Diagram: sample rows of the Parts table shown partitioned and replicated across nodes]

Page 23: Map reducecloudtech

MapReduce/Hadoop
• Around 2004, Google invented MapReduce to parallelize computation of large data sets. It has been a key component in Google’s technology foundation

• Around 2008, Yahoo! developed the open-source variant of MapReduce named Hadoop

• After 2008, MapReduce/Hadoop became a key technology component in cloud computing

• In 2010, the U.S. granted the MapReduce patent to Google

Page 24: Map reducecloudtech

• MapReduce provides an easy-to-use framework for parallel programming, but is it the most efficient and best solution to program execution in datacenters?

• MapReduce has its discontents
  – DeWitt and Stonebraker: “MapReduce: A major step backwards” – MapReduce is far less sophisticated and efficient than parallel query processing

• MapReduce is a parallel processing framework, not a database system, nor a query language
  – It is possible to use MapReduce to implement some of the parallel query processing functions
  – What are the real limitations?

• Inefficient for general programming (and not designed for that)
  – Hard to handle data with complex dependence, frequent updates, etc.
  – High overhead, bursty I/O, difficult to handle long streaming data
  – Limited opportunity for optimization

MapReduce—Limitations

Page 25: Map reducecloudtech

Critiques

MapReduce: A major step backwards

-- David J. DeWitt and Michael Stonebraker

(MapReduce) is
– A giant step backward in the programming paradigm for large-scale data-intensive applications
– A sub-optimal implementation, in that it uses brute force instead of indexing
– Not novel at all
– Missing features
– Incompatible with all of the tools DBMS users have come to depend on

Page 26: Map reducecloudtech

• Inefficient for general programming (and not designed for that)
  – Hard to handle data with complex dependence, frequent updates, etc.
  – High overhead, bursty I/O

• Experience with developing a Hadoop-based distributed compiler
  – Workload: compile the Linux kernel
  – 4 machines available to Hadoop for parallel compiling
  – Observation: parallel compiling on 4 nodes with Hadoop can be even slower than sequential compiling on one node

MapReduce—Limitations

Page 27: Map reducecloudtech

• Proprietary solution developed in an environment with one prevailing application (web search)
  – The assumptions introduce several important constraints in data and logic
  – Not a general-purpose parallel execution technology

• Design choices in MapReduce
  – Optimizes for throughput rather than latency
  – Optimizes for large data sets rather than small data structures
  – Optimizes for coarse-grained parallelism rather than fine-grained

Re-thinking MapReduce

Page 28: Map reducecloudtech

• A lightweight parallelization framework following the MapReduce paradigm
  – Implemented in C++
  – More than just an efficient implementation of MapReduce
  – Goal: a lightweight “parallelization” service that programs can invoke during execution

• MRlite follows several principles
  – Memory is media—avoid touching hard drives
  – Static facility for dynamic utility—use and reuse threads for map tasks

MRlite: Lightweight Parallel Processing

Page 29: Map reducecloudtech

MRlite: Towards Lightweight, Scalable, and General Parallel Processing

Components (connected by data flow and command flow):
• Application + MRlite client: linked together with the app, the MRlite client library accepts calls from the app and submits jobs to the master
• MRlite master/scheduler: accepts jobs from clients and schedules them to execute on slaves
• Slaves: distributed nodes that accept tasks from the master and execute them
• High-speed distributed storage: stores intermediate files

Page 30: Map reducecloudtech


Computing Capability

Using MRlite, the parallel compilation job (mrcc) runs 10 times faster than it does on Hadoop!

Z. Ma and L. Gu. The Limitation of MapReduce: a Probing Case and a Lightweight Solution. CLOUD COMPUTING 2010

Page 31: Map reducecloudtech

Network activities under MapReduce/Hadoop workload
• Hadoop: open-source implementation of MapReduce
• Processing data with 3 servers (20 cores)

– 116.8GB input data

• Network activities captured with Xen virtual machines

Inside MapReduce-Style Computation

Page 32: Map reducecloudtech

Workflow

Where is the communication-intensive part?

• Initial data split into 64MB blocks
• Computed, results locally stored
• Master informed of result locations
• R reducers retrieve data from mappers
• Final output written

Page 33: Map reducecloudtech

• Packet reception under MapReduce/Hadoop workload
  – Large data volume
  – Bursty network traffic

• Generality—widely observed in MapReduce workloads

Packet reception on a slave server

Inside MapReduce

Page 34: Map reducecloudtech

Packet reception on the master server

Inside MapReduce

Page 35: Map reducecloudtech

Packet transmission on the master server

Inside MapReduce

Page 36: Map reducecloudtech

Major Components of a Datacenter

• Computing hardware (equipment racks)

• Power supply and distribution hardware

• Cooling hardware and cooling fluid distribution hardware

• Network infrastructure

• IT Personnel and office equipment

Datacenter Networking

Page 37: Map reducecloudtech

Growth Trends in Datacenters
• Load on network & servers continues to grow rapidly
  – Rapid growth: a rough estimate of annual growth rate: enterprise data centers ~35%, Internet data centers 50% - 100%
  – Information access anywhere, anytime, from many devices
    • Desktops, laptops, PDAs & smart phones, sensor networks, proliferation of broadband

• Mainstream servers moving towards higher speed links
  – 1-GbE to 10-GbE in 2008-2009
  – 10-GbE to 40-GbE in 2010-2012

• High-speed datacenter-MAN/WAN connectivity
  – High-speed datacenter syncing for disaster recovery

Datacenter Networking

Page 38: Map reducecloudtech

• The network is a large part of the total cost of the DC hardware
  – Large routers and high-bandwidth switches are very expensive
• Relatively unreliable – many components may fail
• Many major operators and companies design their own datacenter networking to save money and improve reliability/scalability/performance
  – The topology is often known
  – The number of nodes is limited
  – The protocols used in the DC are known

• Security is simpler inside the data center, but challenging at the border

• We can distribute applications to servers to distribute load and minimize hot spots

Datacenter Networking

Page 39: Map reducecloudtech

Networking components (examples)

• High Performance & High Density Switches & Routers
  – Scaling to 512 10GbE ports per chassis
  – No need for proprietary protocols to scale

• Highly scalable DC Border Routers
  – 3.2 Tbps capacity in a single chassis
  – 10 Million routes, 1 Million in hardware
  – 2,000 BGP peers
  – 2K L3 VPNs, 16K L2 VPNs
  – High port density for GE and 10GE application connectivity
  – Security
  – 768 1-GE ports downstream, 64 10-GE ports upstream

Datacenter Networking

Page 40: Map reducecloudtech

Common data center topology (from the Internet down to the servers):
• Core: Layer-3 routers
• Aggregation: Layer-2/3 switches
• Access: Layer-2 switches
• Servers

Datacenter Networking

Page 41: Map reducecloudtech

Data center network design goals

• High network bandwidth, low latency
• Reduce the need for large switches in the core
• Simplify the software, push complexity to the edge of the network
• Improve reliability
• Reduce capital and operating cost

Datacenter Networking

Page 42: Map reducecloudtech

Avoid this…

Data Center Networking

and simplify this…

Page 43: Map reducecloudtech

Can we avoid using high-end switches?
• Expensive high-end switches to scale up
• Single point of failure and bandwidth bottleneck
  – Experiences from real systems
• One answer: DCell

Interconnect

Page 44: Map reducecloudtech

DCell Ideas
• #1: Use mini-switches to scale out
• #2: Leverage servers to be part of the routing infrastructure
  – Servers have multiple ports and need to forward packets

• #3: Use recursion to scale and build complete graph to increase capacity

Interconnect

Page 45: Map reducecloudtech

One approach: switched network with a hypercube interconnect

• Leaf switch: 40 1-Gbps ports + 2 10-Gbps ports
  – One switch per rack
  – Not replicated (if a switch fails, lose one rack of capacity)
• Core switch: 10 10-Gbps ports
  – Form a hypercube
• Hypercube – high-dimensional rectangle

Data Center Networking

Page 46: Map reducecloudtech

Hypercube properties
• Minimum hop count
• Even load distribution for all-to-all communication
• Can route around switch/link failures
• Simple routing:
  – Outport = f(Dest xor NodeNum)
  – No routing tables

Interconnect
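A sketch of this table-free routing rule: XOR the destination with the current node number and correct one differing dimension per hop. Resolving the lowest set bit first is an assumption; the slide only states Outport = f(Dest xor NodeNum). (Python for illustration.)

def next_port(node, dest):
    # Return the dimension (output port) to traverse next, or None if arrived.
    diff = node ^ dest
    if diff == 0:
        return None
    return (diff & -diff).bit_length() - 1   # index of the lowest differing dimension

def route(src, dest):
    path, node = [src], src
    while node != dest:
        d = next_port(node, dest)
        node ^= (1 << d)                     # traverse the link along dimension d
        path.append(node)
    return path

print(route(0b0000, 0b1011))   # [0, 1, 3, 11] in a 4-dimensional hypercube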

Page 47: Map reducecloudtech

A 16-node (dimension 4) hypercube

[Diagram: nodes 0–15 connected by links, each link labeled with its dimension (0–3)]

Interconnect

Page 48: Map reducecloudtech

64-switch Hypercube

[Diagram of one container: Level 0: 32 40-port 1 Gb/sec switches; Level 1: 8 10-port 10 Gb/sec switches; Level 2: 2 10-port 10 Gb/sec switches; four 4x4 sub-cubes with 16 links each; 64 10 Gb/sec links; 16 10 Gb/sec links; 1280 Gb/sec of links; 63 * 4 links to other containers]

Interconnect

How many servers can be connected in this system?
• 81920 servers with 1 Gbps bandwidth
• Core switch: 10-Gbps port x 10
• Leaf switch: 1-Gbps port x 40 + 10-Gbps port x 2
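A plausible reading of the 81920 figure (an interpretation of the diagram, not spelled out on the slide): each container provides 32 leaf switches x 40 1-Gbps ports = 1280 server ports, and the 64-switch hypercube joins 64 such containers, so 64 x 1280 = 81920 servers, each with a dedicated 1 Gbps link.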

Page 49: Map reducecloudtech

The Black Box

Data Center Networking

Page 50: Map reducecloudtech

Shipping Container as Data Center Module

• Data Center Module
  – Contains network gear, compute, storage, & cooling
  – Just plug in power, network, & chilled water

• Increased cooling efficiency
  – Water & air flow
  – Better air flow management

• Meet seasonal load requirements

Data Center Network

Page 51: Map reducecloudtech

Unit of Data Center Growth

• One at a time:
  – 1 system
  – Racking & networking: 14 hrs ($1,330)

• Rack at a time:
  – ~40 systems
  – Install & networking: .75 hrs ($60)

• Container at a time:
  – ~1,000 systems
  – No packaging to remove
  – No floor space required
  – Power, network, & cooling only
  – Weatherproof & easy to transport

• Data center construction takes 24+ months

Data Center Network

Page 52: Map reducecloudtech

Multiple-Site Redundancy and Enhanced Performance using load balancing

• Handling site failures transparently
• Providing best site selection per user
• Leveraging both DNS and non-DNS methods for multi-site redundancy
• Providing disaster recovery and non-stop operation

[Diagram: an LB system and DNS directing users to multiple datacenters]

LB (load balancing) System
• The load balancing system regulates global data center traffic
• Incorporates site health, load, user proximity, and service response for user site selection
• Provides transparent site failover in case of disaster or service outage

Global Data Center Deployment Problems

Data Center Network

Page 53: Map reducecloudtech

Challenges and Research Problems

Hardware
– High-performance, reliable, cost-effective computing infrastructure
– Cooling, air cleaning, and energy efficiency

[Barraso] Clusters   [Fan] Power   [Andersen] FAWN   [Reghavendra] Power

Page 54: Map reducecloudtech

Challenges and Research Problems

System software
– Operating systems
– Compilers
– Database
– Execution engines and containers

Ghemawat: GFS   Chang: Bigtable   DeCandia: Dynamo   Brantner: DB on S3   Cooper: PNUTS   Yu: DryadLINQ   Dean: MapReduce   Burrows: Chubby   Isard: Quincy

Page 55: Map reducecloudtech

Challenges and Research Problems

Networking
– Interconnect and global network structuring
– Traffic engineering

Al-Fares: Commodity DC   Guo 2008: DCell   Guo 2009: BCube

Page 56: Map reducecloudtech

Challenges and Research Problems

• Data and programming
  – Data consistency mechanisms (e.g., replication)
  – Fault tolerance
  – Interfaces and semantics

• Software engineering• User interface• Application architecture

Pike: Sawzall

Olston: Pig Latin

Buyya: IT services

Page 57: Map reducecloudtech

Resources

• [Al-Fares] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication (Seattle, WA, USA, August 17 - 22, 2008). SIGCOMM '08. 63-74. http://baijia.info/showthread.php?tid=139

• [Andersen] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan. FAWN: A Fast Array of Wimpy Nodes. SOSP'09. http://baijia.info/showthread.php?tid=179

• [Barraso] Luiz Barroso, Jeffrey Dean, Urs Hoelzle, "Web Search for a Planet: The Google Cluster Architecture," IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003 http://baijia.info/showthread.php?tid=133

• [Brantner] Brantner, M., Florescu, D., Graf, D., Kossmann, D., and Kraska, T. Building a database on S3. In Proceedings of the 2008 ACM SIGMOD international Conference on Management of Data (Vancouver, Canada, June 09 - 12, 2008). SIGMOD '08. 251-264. http://baijia.info/showthread.php?tid=125

Page 58: Map reducecloudtech

Resources

• [Burrows] Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, Washington, November 06 - 08, 2006). 335-350. http://baijia.info/showthread.php?tid=59

• [Buyya] Buyya, R., Chee Shin Yeo, and Venugopal, S. Market-Oriented Cloud Computing. The 10th IEEE International Conference on High Performance Computing and Communications, 2008. HPCC '08. http://baijia.info/showthread.php?tid=248

• [Chang] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, Washington, November 06 - 08, 2006). 205-218. http://baijia.info/showthread.php?tid=4

• [Cooper] Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H., Puz, N., Weaver, D., and Yerneni, R. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1277-1288. http://baijia.info/showthread.php?tid=126

Page 59: Map reducecloudtech

Resources

• [Dean] Dean, J. and Ghemawat, S. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (San Francisco, CA, December 06 - 08, 2004). http://baijia.info/showthread.php?tid=2

• [DeCandia] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: amazon's highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (Stevenson, Washington, USA, October 14 - 17, 2007). SOSP '07. ACM, New York, NY, 205-220. http://baijia.info/showthread.php?tid=120

• [Fan] Fan, X., Weber, W., and Barroso, L. A. Power provisioning for a warehouse-sized computer. In Proceedings of the 34th Annual international Symposium on Computer Architecture (San Diego, California, USA, June 09 - 13, 2007). ISCA '07. 13-23. http://baijia.info/showthread.php?tid=144

Page 60: Map reducecloudtech

Resources

• [Ghemawat] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA, October 19 - 22, 2003). SOSP '03. ACM, New York, NY, 29-43. http://baijia.info/showthread.php?tid=1

• [Guo 2008] Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and Songwu Lu, DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers, in ACM SIGCOMM 08. http://baijia.info/showthread.php?tid=142

• [Guo 2009] Chuanxiong Guo, Guohan Lu, Dan Li, Xuan Zhang, Haitao Wu, Yunfeng Shi, Chen Tian, Yongguang Zhang, and Songwu Lu, BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers, in ACM SIGCOMM 09. http://baijia.info/showthread.php?tid=141

• [Isard] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar and Andrew Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. SOSP'09. http://baijia.info/showthread.php?tid=203

Page 61: Map reducecloudtech

Resources

• [Olston] Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international Conference on Management of Data (Vancouver, Canada, June 09 - 12, 2008). SIGMOD '08. 1099-1110. http://baijia.info/showthread.php?tid=124

• [Pike] Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. 2005. Interpreting the data: Parallel analysis with Sawzall. Sci. Program. 13, 4 (Oct. 2005), 277-298. http://baijia.info/showthread.php?tid=60

• [Reghavendra] Ramya Raghavendra, Parthasarathy Ranganathan, Vanish Talwar, Zhikui Wang, Xiaoyun Zhu. No "Power" Struggles: Coordinated Multi-level Power Management for the Data Center. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Seattle, WA, March 2008. http://baijia.info/showthread.php?tid=183

• [Yu] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI), December 8-10 2008. http://baijia.info/showthread.php?tid=5

Page 62: Map reducecloudtech

Thank you!

Questions?