winter

WINTERTemplate

Distributed Computing at Web Scale

Kyonggi University. DBLAB. Haesung Lee

Introduction

Data analysis at a large scale

• Very large data collections (TB to PB) stored on

distributed file systems

• Query logs

• Search engine indexed

• Sensor data

• Need efficient ways for analyzing, reformatting,

processing them

• In particular, we want

• Parallelization of computation (benefiting of the processing

power of all nodes in a cluster)

• Resilience to failure

Centralized computing with distributed data storage

• Run the program at client node, get data from the

distributed system

• Downsides: important data flows, no use of the

cluster computing resources

Pushing the program near the data

• MapReduce: A programming model to facilitate the

development and execution of distributed tasks

• Published by Google Labs in 2004 at OSDI (DG04).

Widely used since then, open-source implementation in

Hadoop

MapReduce in Brief

• The programmer defines the program logic as two

functions

• Map transforms the input into key-value pairs to

process

• Reduce aggregates the list of values for each key

• The MapReduce environment takes in charge

distribution aspects

• A complex program can be decomposed as a succession

of Map and Reduce tasks

• Higher-level languages (Pig, Hive, etc.) help with

writing distributed applications

MAPREDUCE

Three operations on key-value pairs

① User-defined:

② Fixed behavior:

regroups all the intermediate pairs on the key

③ User-defined:

• Each pair at each phase, is processed independently from

the other pairs

• Network and distribution are transparently managed by the

MapReduce environment

Job workflow in MapReduce

Example: term count in MapReduce (input)

Example: term count in MapReduce

A MapReduce cluster

• Node inside a MapReduce cluster are

decomposed as follows

• A jobtracker acts as a master node

• MapReduce jobs are submitted to it

• Several tasktrackers runs the computation itself,

i.e., map and reduce tasks

• A given tasktracker may run several tasks in

parallel

• Tasktrackers usually also act as data nodes of a

distributed filesystem (e.g. GFS, HDFS)

Processing a MapReduce job

• A MapReduce job takes care of the distribution,

synchronization and failure handling

• The input is split into M groups; each group is assigned to

a mapper (assignment is based on the data locality

principle)

• Each mapper processes a group and stores the

intermediate pairs locally

• Grouped instances are assigned to reducers thanks to a

hash function

• Intermediate pairs are sorted on their key by the reducer

• Remark: the data locality does no longer hold for the

reduce phase, since it reads from the mappers

Assignment to reducer and mappers

• Each mapper task processes a fixed amount of data

(split), usually set to the distributed filesystem block

size (e.g., 64MB)

• The number of mapper nodes is function of the number

of mapper tasks and the number of available nodes in

the cluster: each mapper nodes can process (in parallel

and sequentially) several mapper tasks

• Assignment to mapper tries optimizing data locality

• The number of reducer tasks is set by the user

• Assignment to reducers is done through a hashing of

the key, usually uniformly at random; no data locality

possible

Distributed execution of a MapReduce job

Processing the term count example

Processing the term count example (2)

Failure management

• In case of failure, because the tasks are distributed

over hundreds or thousands of machines, the chances

that a problems occurs somewhere are much larger;

starting the job from the beginning is not a valid option

• The Master periodically checks the availability and

reachability of the tasktrackers (heartbeats) and

whether map or reduce jobs make any progress

• If a reducer fails, its task is reassigned to another

tasktracker; this usually require restarting mapper tasks

as well

• If a mapper fails, its task is reassigned to another

tasktracker

• If the jobtracker fails, the whole job should be re-initiated

Joins in MapReduce

• Two datasets, A and B that we need to join for a

MapReduce task

• If one of the dataset is small, it can be sent over

fully to each tasktracker and exploited inside the

map (and possibly reduce) functions

• Otherwise, each dataset should be grouped

according to the join key, and the result of the join

can be computing in the reduce function

• Not very convenient to express in MapReduce

Using MapReduce for solving a problem

• Prefer

• Simple map and reduce functions

• Mapper tasks processing large data chunks (at least the

size of distributed filesystem block)

• A given application may have

• A chain of map functions (input processing, filtering,

extraction…)

• A sequence of several map-reduce jobs

• No reduce task when everything can be expressed in the

map (zero reducers, or the identity reducer function)

• Not the right tool for everything (see further)

Data replication and consistency

• Replication

• A mechanism that copies data item located on a machine

A to a remote machine B

• One obtains replica

• Consistency

• Ability of a system to behave as if the transaction of each

user always run in isolation from other transactions and

never fails

• Example: shopping basket in an e-commerce application

• Difficult in centralized systems because of multi-users

and concurrency

• Even more difficult in distributed systems because of

replica

• Some illustrative scenarios

• Case (a) : Eager, primary

• Because the replication is managed synchronously, a read

request by Client 2 always access a consistent state of d .

• Because there is a primary copy, requests sent by several

clients relating to a same item can be queued, which ensures

that updates are applied sequentially and not in parallel.

• Down side

• Applications have wait for the completion of other

client’s request, both for writing and reading



• Case (b) : Async, primary

• There is still a primary copy, but the replication is

asynchronous.

• Thus, some of the replicas may be out of date with respect to

Client’s requests

• Because of the primary copy, the replicas will be eventually

consistent because there cannot be independent updates of

distinct replicas

• Being considered acceptable in many modern, NoSQL data

management systems that accept to trade strong consistency

for a higher read throughput



• Case (c) : Eager, no primary

• A complex situation where two clients can simultaneously

write on distinct replicas

• The eager replication implies that these replication must be

synchronized right away.

• Some kind of interlocking, where both Clients wait for some

resource locked by another one



• Case (d) : Async, no primary

• The most flexible case

• Client operations are never stalled by concurrent operations,

at the price of possibly inconsistent states

• Data reconciliation


• Consistency management in distributed systems

• Strong consistency (ACID properties) requires a

synchronous replication, and possibly heavy locking

mechanisms

• Weak consistency – accept to serve some requests with

outdated data

• Eventual consistency – same as before, but the system

is guaranteed to coverage towards a consistent state

based on the last version.

• In a system that is not eventually consistent, conflicts occur

and the application must take care of data reconciliation:

given the two conflicting copies, determine the new current

one

• Standard RDBMS favor consistency over availability – one

of the NoSQL trend


Distributed transactions

• A transaction is a sequence of data update operations

that is required to be an all-or-nothing unit of work

• The two-phase commit(2PC)

• Main algorithm of choice to ensure ACID properties in a

distributed setting

① First, the coordinator asks each participant whether it is

able to perform the required operation with a Prepare

message

② Second, if all participants answered with a confirmation,

the coordinator sends a Decision message: the transaction

is then committed at each site

Distributed transactions

• Two phase commit

Failure management

Failure recovery• Recovery techniques for

centralized architecture• Recovery techniques for

replicated architecture

Failure recovery• Replicated architectures

• Synchronous protocol

• The server acknowledges the Client only when the remote nodes

have sent a confirmation of the successful completion of their

write() operation

• This may severely hinder the efficiency of updates, but the

obvious advantage is that all the replicas are consistent

• Asynchronous protocol

• The client application waits only until one of the copies has been

effectively written

• The multi-log recovery process has a cost, but it brings

availability

Required properties of a distributed system

Scalability• Scalability refers to the ability of a system to

continuously evolve in order to support an ever-growing

amount of tasks

• A scalable system should (i) distribute evenly the task

load to all participants, and (ii) ensure a negligible

distribution management cost

Availability• Availability is the capacity of a system to limit as

much as possible its latency

• Failure detection

• Monitor the participating nodes to detect failures as

early as possible (usually via “heartbeat” messages)

• Design quick restart protocols

• Replication on several nodes

• A scalable system should (i) distribute evenly the task

load to all participants, and (ii) ensure a negligible

distribution management cost

Efficiency• Two usual measures of its efficiency are the response

time (or latency) which denotes the delay to obtain the

first item, and the throughput (or bandwidth) which

denotes the number of items delivered in a given period

unit

• Unit costs

• Number of messages globally sent by the nodes of

the system, regardless of the message size

• Size of message representing the volume of data

exchanges

The CAP theorem• No distributed system can simultaneously provide all

three of the following properties

• Consistency: all nodes see the same data at the same time

• Availability: node failures do not prevent survivors from

continuing to operate

• Partition tolerance: the system continues to operate

despite arbitrary message loss

Consistent hashing

Basics: Centralized Hash files

• The collection consists of (key, value) pairs

• A hash function evenly distributes the values in buckets w.r.t

the key

• This is the basic, static scheme

• The number of buckets is fixed

• Dynamic hashing extends the number of buckets as the

collection grows

• The most popular method is linear hashing

• Issues with hash structures distribution

• Straightforward idea: everybody uses the

same hash function, and buckets are replaced

by servers

• Two issues

• Dynamicity: At web scale, we must be able to

add or remove servers at any moment

• Inconsistencies: It is very hard to ensure that all

participants share an accurate view of the

system (e.g. the hash function)

Basics: Centralized Hash files

Consistent hashing• Let N be the number of servers. The following functions

maps a pair (key, value) to

server

• If N changes, or if a client uses an invalid value of N,

the mapping becomes inconsistent

• With consistent hashing, addition or removal of an

instance does not significantly change the mapping of

keys to servers

• A simple, non-mutable hash function h maps both

keys and the servers IPs to a large address space A

• A is organized as a ring, scanned in clockwise order

• If S and S’ are two adjacent servers on the ring: all

the keys in range [h(S), h(S’)] are mapped to S’

Consistent hashing• Illustration

• A server is added or removed?

• A local re-hashing is sufficient

Consistent hashing• What if a server fails? How can we balance the

load?

• Failure

• Use replication: Put a copy on the next machine (on

the ring), then on the next after the next. And so on.

• Load balancing

• Map a server to several points on the ring

• The more points, the more

load received by a server

• Also useful if the server fails

• Also useful in case

of heterogeneity (the rule in large-scale systems)

Consistent hashing• Distributed indexing based on consistent hashing

• Where is the hash directory? (several possible answers)

• On a specific (“Master”) node acting as a load balancer

• Raises scalability issues

• Each node records its successor on the ring

• Many require O(N) messages for routing queries – not

resilient to failures

• Each node records logN carefully chosen other nodes

• Ensures O(log N) messages for routing queries

• Full duplication of the hash directory at each node

• Ensure 1 message for routing – heavy maintenance

protocol which can be achieved through gossiping

(broadcast of any event affecting the network topology)

Case study: large scale distributed file system (GFS)

History and development of GFS

• Google File System, a paper published in 2003 by

Google Labs at OSDI

• Explains the design and architecture of a distributed

system serving very large data files

• Internally used by Google for storing documents collected

from the Web

• Open Source versions have been developed at once

• Hadoop File System (HDFS), and Kosmos File System (KFS)

The problem

• Why do we need a distributed file system in the first

place?

• Fact

• Standard NFS (left part) does not meet scalability

requirements (what if file 1 get really big?)

• Right part

• GFS/HDFS storage based on (i) a virtual file

namespace, and (ii) partitioning of files in “chunks”

Architecture

• A master node performs administrative tasks, while

servers store “chunks” and send them to Client nodes

• The Client maintains a cache with chunks locations, and

directly communicates with servers

• The following figure shows a non-concurrent append()

operation

• In case of concurrent appends of a chunk, the primary replica

assign serial numbers to the mutation, and coordinates the

secondary replicas

Workflow of a write() operation

• Extension of standard techniques for recovery (left:

centralized ; right: distributed)

• If a node fails, the replicated log file can be used to

recover the last transactions on one of its mirrors

Namespace updates: distributed recovery protocol

winter

Documents

data mapreduce

data nodes

distributed data storagerun

data split

mapreduce joba mapreduce

mapreduce jobprocessing

mapreduce inputexample

hadoop mapreduce