winter
DESCRIPTION
Kyonggi University. DBLAB. Haesung Lee. WINTER. Template. Distributed Computing at Web Scale. Introduction. Data analysis at a large scale. Very large data collections (TB to PB) stored on distributed file systems Query logs Search engine indexed Sensor data - PowerPoint PPT PresentationTRANSCRIPT
WINTERTemplate
Distributed Computing at Web Scale
Kyonggi University. DBLAB. Haesung Lee
Introduction
Data analysis at a large scale
• Very large data collections (TB to PB) stored on
distributed file systems
• Query logs
• Search engine indexed
• Sensor data
• Need efficient ways for analyzing, reformatting,
processing them
• In particular, we want
• Parallelization of computation (benefiting of the processing
power of all nodes in a cluster)
• Resilience to failure
Centralized computing with distributed data storage
• Run the program at client node, get data from the
distributed system
• Downsides: important data flows, no use of the
cluster computing resources
Pushing the program near the data
• MapReduce: A programming model to facilitate the
development and execution of distributed tasks
• Published by Google Labs in 2004 at OSDI (DG04).
Widely used since then, open-source implementation in
Hadoop
MapReduce in Brief
• The programmer defines the program logic as two
functions
• Map transforms the input into key-value pairs to
process
• Reduce aggregates the list of values for each key
• The MapReduce environment takes in charge
distribution aspects
• A complex program can be decomposed as a succession
of Map and Reduce tasks
• Higher-level languages (Pig, Hive, etc.) help with
writing distributed applications
MAPREDUCE
Three operations on key-value pairs
① User-defined:
② Fixed behavior:
regroups all the intermediate pairs on the key
③ User-defined:
• Each pair at each phase, is processed independently from
the other pairs
• Network and distribution are transparently managed by the
MapReduce environment
Job workflow in MapReduce
Example: term count in MapReduce (input)
Example: term count in MapReduce
A MapReduce cluster
• Node inside a MapReduce cluster are
decomposed as follows
• A jobtracker acts as a master node
• MapReduce jobs are submitted to it
• Several tasktrackers runs the computation itself,
i.e., map and reduce tasks
• A given tasktracker may run several tasks in
parallel
• Tasktrackers usually also act as data nodes of a
distributed filesystem (e.g. GFS, HDFS)
Processing a MapReduce job
• A MapReduce job takes care of the distribution,
synchronization and failure handling
• The input is split into M groups; each group is assigned to
a mapper (assignment is based on the data locality
principle)
• Each mapper processes a group and stores the
intermediate pairs locally
• Grouped instances are assigned to reducers thanks to a
hash function
• Intermediate pairs are sorted on their key by the reducer
• Remark: the data locality does no longer hold for the
reduce phase, since it reads from the mappers
Assignment to reducer and mappers
• Each mapper task processes a fixed amount of data
(split), usually set to the distributed filesystem block
size (e.g., 64MB)
• The number of mapper nodes is function of the number
of mapper tasks and the number of available nodes in
the cluster: each mapper nodes can process (in parallel
and sequentially) several mapper tasks
• Assignment to mapper tries optimizing data locality
• The number of reducer tasks is set by the user
• Assignment to reducers is done through a hashing of
the key, usually uniformly at random; no data locality
possible
Distributed execution of a MapReduce job
Processing the term count example
Processing the term count example (2)
Failure management
• In case of failure, because the tasks are distributed
over hundreds or thousands of machines, the chances
that a problems occurs somewhere are much larger;
starting the job from the beginning is not a valid option
• The Master periodically checks the availability and
reachability of the tasktrackers (heartbeats) and
whether map or reduce jobs make any progress
• If a reducer fails, its task is reassigned to another
tasktracker; this usually require restarting mapper tasks
as well
• If a mapper fails, its task is reassigned to another
tasktracker
• If the jobtracker fails, the whole job should be re-initiated
Joins in MapReduce
• Two datasets, A and B that we need to join for a
MapReduce task
• If one of the dataset is small, it can be sent over
fully to each tasktracker and exploited inside the
map (and possibly reduce) functions
• Otherwise, each dataset should be grouped
according to the join key, and the result of the join
can be computing in the reduce function
• Not very convenient to express in MapReduce
Using MapReduce for solving a problem
• Prefer
• Simple map and reduce functions
• Mapper tasks processing large data chunks (at least the
size of distributed filesystem block)
• A given application may have
• A chain of map functions (input processing, filtering,
extraction…)
• A sequence of several map-reduce jobs
• No reduce task when everything can be expressed in the
map (zero reducers, or the identity reducer function)
• Not the right tool for everything (see further)
Data replication and consistency
• Replication
• A mechanism that copies data item located on a machine
A to a remote machine B
• One obtains replica
• Consistency
• Ability of a system to behave as if the transaction of each
user always run in isolation from other transactions and
never fails
• Example: shopping basket in an e-commerce application
• Difficult in centralized systems because of multi-users
and concurrency
• Even more difficult in distributed systems because of
replica
• Some illustrative scenarios
• Case (a) : Eager, primary
• Because the replication is managed synchronously, a read
request by Client 2 always access a consistent state of d .
• Because there is a primary copy, requests sent by several
clients relating to a same item can be queued, which ensures
that updates are applied sequentially and not in parallel.
• Down side
• Applications have wait for the completion of other
client’s request, both for writing and reading
Data replication and consistency
• Some illustrative scenarios
• Case (b) : Async, primary
• There is still a primary copy, but the replication is
asynchronous.
• Thus, some of the replicas may be out of date with respect to
Client’s requests
• Because of the primary copy, the replicas will be eventually
consistent because there cannot be independent updates of
distinct replicas
• Being considered acceptable in many modern, NoSQL data
management systems that accept to trade strong consistency
for a higher read throughput
Data replication and consistency
• Some illustrative scenarios
• Case (c) : Eager, no primary
• A complex situation where two clients can simultaneously
write on distinct replicas
• The eager replication implies that these replication must be
synchronized right away.
• Some kind of interlocking, where both Clients wait for some
resource locked by another one
Data replication and consistency
• Some illustrative scenarios
• Case (d) : Async, no primary
• The most flexible case
• Client operations are never stalled by concurrent operations,
at the price of possibly inconsistent states
• Data reconciliation
Data replication and consistency
• Consistency management in distributed systems
• Strong consistency (ACID properties) requires a
synchronous replication, and possibly heavy locking
mechanisms
• Weak consistency – accept to serve some requests with
outdated data
• Eventual consistency – same as before, but the system
is guaranteed to coverage towards a consistent state
based on the last version.
• In a system that is not eventually consistent, conflicts occur
and the application must take care of data reconciliation:
given the two conflicting copies, determine the new current
one
• Standard RDBMS favor consistency over availability – one
of the NoSQL trend
Data replication and consistency
Distributed transactions
• A transaction is a sequence of data update operations
that is required to be an all-or-nothing unit of work
• The two-phase commit(2PC)
• Main algorithm of choice to ensure ACID properties in a
distributed setting
① First, the coordinator asks each participant whether it is
able to perform the required operation with a Prepare
message
② Second, if all participants answered with a confirmation,
the coordinator sends a Decision message: the transaction
is then committed at each site
Distributed transactions
• Two phase commit
Failure management
Failure recovery• Recovery techniques for
centralized architecture• Recovery techniques for
replicated architecture
Failure recovery• Replicated architectures
• Synchronous protocol
• The server acknowledges the Client only when the remote nodes
have sent a confirmation of the successful completion of their
write() operation
• This may severely hinder the efficiency of updates, but the
obvious advantage is that all the replicas are consistent
• Asynchronous protocol
• The client application waits only until one of the copies has been
effectively written
• The multi-log recovery process has a cost, but it brings
availability
Required properties of a distributed system
Scalability• Scalability refers to the ability of a system to
continuously evolve in order to support an ever-growing
amount of tasks
• A scalable system should (i) distribute evenly the task
load to all participants, and (ii) ensure a negligible
distribution management cost
Availability• Availability is the capacity of a system to limit as
much as possible its latency
• Failure detection
• Monitor the participating nodes to detect failures as
early as possible (usually via “heartbeat” messages)
• Design quick restart protocols
• Replication on several nodes
• A scalable system should (i) distribute evenly the task
load to all participants, and (ii) ensure a negligible
distribution management cost
Efficiency• Two usual measures of its efficiency are the response
time (or latency) which denotes the delay to obtain the
first item, and the throughput (or bandwidth) which
denotes the number of items delivered in a given period
unit
• Unit costs
• Number of messages globally sent by the nodes of
the system, regardless of the message size
• Size of message representing the volume of data
exchanges
The CAP theorem• No distributed system can simultaneously provide all
three of the following properties
• Consistency: all nodes see the same data at the same time
• Availability: node failures do not prevent survivors from
continuing to operate
• Partition tolerance: the system continues to operate
despite arbitrary message loss
Consistent hashing
Basics: Centralized Hash files
• The collection consists of (key, value) pairs
• A hash function evenly distributes the values in buckets w.r.t
the key
• This is the basic, static scheme
• The number of buckets is fixed
• Dynamic hashing extends the number of buckets as the
collection grows
• The most popular method is linear hashing
• Issues with hash structures distribution
• Straightforward idea: everybody uses the
same hash function, and buckets are replaced
by servers
• Two issues
• Dynamicity: At web scale, we must be able to
add or remove servers at any moment
• Inconsistencies: It is very hard to ensure that all
participants share an accurate view of the
system (e.g. the hash function)
Basics: Centralized Hash files
Consistent hashing• Let N be the number of servers. The following functions
maps a pair (key, value) to
server
• If N changes, or if a client uses an invalid value of N,
the mapping becomes inconsistent
• With consistent hashing, addition or removal of an
instance does not significantly change the mapping of
keys to servers
• A simple, non-mutable hash function h maps both
keys and the servers IPs to a large address space A
• A is organized as a ring, scanned in clockwise order
• If S and S’ are two adjacent servers on the ring: all
the keys in range [h(S), h(S’)] are mapped to S’
Consistent hashing• Illustration
• A server is added or removed?
• A local re-hashing is sufficient
Consistent hashing• What if a server fails? How can we balance the
load?
• Failure
• Use replication: Put a copy on the next machine (on
the ring), then on the next after the next. And so on.
• Load balancing
• Map a server to several points on the ring
• The more points, the more
load received by a server
• Also useful if the server fails
• Also useful in case
of heterogeneity (the rule in large-scale systems)
Consistent hashing• Distributed indexing based on consistent hashing
• Where is the hash directory? (several possible answers)
• On a specific (“Master”) node acting as a load balancer
• Raises scalability issues
• Each node records its successor on the ring
• Many require O(N) messages for routing queries – not
resilient to failures
• Each node records logN carefully chosen other nodes
• Ensures O(log N) messages for routing queries
• Full duplication of the hash directory at each node
• Ensure 1 message for routing – heavy maintenance
protocol which can be achieved through gossiping
(broadcast of any event affecting the network topology)
Case study: large scale distributed file system (GFS)
History and development of GFS
• Google File System, a paper published in 2003 by
Google Labs at OSDI
• Explains the design and architecture of a distributed
system serving very large data files
• Internally used by Google for storing documents collected
from the Web
• Open Source versions have been developed at once
• Hadoop File System (HDFS), and Kosmos File System (KFS)
The problem
• Why do we need a distributed file system in the first
place?
• Fact
• Standard NFS (left part) does not meet scalability
requirements (what if file 1 get really big?)
• Right part
• GFS/HDFS storage based on (i) a virtual file
namespace, and (ii) partitioning of files in “chunks”
Architecture
• A master node performs administrative tasks, while
servers store “chunks” and send them to Client nodes
• The Client maintains a cache with chunks locations, and
directly communicates with servers
• The following figure shows a non-concurrent append()
operation
• In case of concurrent appends of a chunk, the primary replica
assign serial numbers to the mutation, and coordinates the
secondary replicas
Workflow of a write() operation
• Extension of standard techniques for recovery (left:
centralized ; right: distributed)
• If a node fails, the replicated log file can be used to
recover the last transactions on one of its mirrors
Namespace updates: distributed recovery protocol