computing in the rain: a reliable array of independent nodes group a3 ka hou wong jahanzeb faizan...

37
Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Upload: bethanie-cook

Post on 29-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Computing in the RAIN: A Reliable Array of Independent Nodes

Group A3Ka Hou Wong

Jahanzeb FaizanJonathan Sippel

Page 2: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Introduction

Presenter: Ka Hou Wong

Page 3: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Introduction RAIN

Research collaboration between Caltech and Jet Propulsion Laboratory

Goal Identify and develop key building

blocks for reliable distributed systems built with inexpensive off-the-shelf components

Page 4: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Hardware Platform Heterogeneous cluster of computing and/or

storage nodes connected via multiple interfaces through a network of switches

C0 C1 C2 C3 C4

S0 S1

C5 C6 C7 C8 C9

S2 S3

C = Computer

S = Switch

Page 5: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Software Platform Collection of software modules that run

in conjunction with operating system services and standard network protocols

Network Connections

Application MPI/PVM

TCP/IP

RAIN

Ethernet Myrinet ATM Servernet

Page 6: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Key Building Blocks For Distributed Computer Systems Communication

Fault-tolerant communication topologies

Reliable communication protocols Fault Management

Group membership techniques Storage

Distributed data storage schemes based on error-control codes

Page 7: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Features of RAIN Communication

Provides fault tolerance in the network via the following mechanisms

Bundled interfaces Link monitoring Fault-tolerant interconnect topologies

Page 8: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Features of RAIN (cont’d) Group membership

Identifies healthy nodes that are participating in the cluster

Data storage Uses redundant storage schemes

over multiple disks for fault tolerance

Page 9: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Communication

Presenter: Jahanzeb Faizan

Page 10: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Communication Fault-tolerant interconnect topologies Network interfaces

Page 11: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Fault-tolerant Interconnect Technologies Goal

To connect computer nodes to a network of switches in order to maximize the network’s resistance to partitioning

SSS

S S

SSS

C

C

C

CC C

CC

How do you connect n nodes to a ring of n switches?

Page 12: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Naïve Approach Connect the computer nodes to the

nearest switches in a regular fashion

SSS

S S

SSS

C

C

C

C C

C

C

C

1-fault-tolerant

The network is easily partitioned with two switch failures

Page 13: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Diameter Construction Approach Connect computer nodes to the switching

network in the most non-local way possible Computer nodes are connected to maximally

distant switches Nodes of degree 2 connected between

switches should form a diameter

Page 14: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Diameter Construction Approach (cont’d)Construction (Diameters). Let ds = 4 and dc = 2. i, 0 < i < n, label all compute nodes ci and switches si. Connect switch si to s(i+1)mod n, i.e., in a ring. Connect node ci to switches si and s(i+ n/2 +1)mod n.S0

S1S6

S2

S4 S3

S5

C0

C1

C2C3

C4

C5

C6

n = 7

S0

S1S7

S6 S2

S4

S3S5

C2

C1

C0

C7C6

C5

C4

C3

n = 8

Can tolerate 3 faults of any kind without partitioning the network

Page 15: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Protocol for Link Failure Goal

Monitoring of available paths Requirements

Correctness Bounded Slack Stability

Page 16: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Correctness Must correctly reflect the true

state of the channel

Bi-directional Communication

A B If one side sees timeouts…

Both sides should mark the channel as being down

Page 17: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Bounded Slack Ensure that both have a maximum

slack of n transactions

Link History

Time

U = link upD = link down

DD

U

D

U

DU

U

BA

DD

U

D

U

D

U

D

U

D

UU

BANode A sees many more transactions than node B

Nodes A and B see tightly

coupled views of the channel

Page 18: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Stability Each real channel event (i.e. time-

out) should cause at most some bounded number state transactions at each endpoint

Page 19: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Consistent-History Protocol for Link Failures Monitor available paths in the

network for proper functioning Modified Ping Protocol guarantees

each side of communication channel sees the same history (bounded slack)

Page 20: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

The Protocol Reliable Message Passing Implementation:

Sliding window protocol Existing reliable communication layer

not needed Reliable messaging built on top of

ping messages

Page 21: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

The Protocol (cont’d)

Protocol

Sending and receiving of token using reliable messaging

Tokens are sent on request

Consistent history maintained

Sending and receiving of Ping messages using unreliable messaging

Detect when link is up or down

Implemented by Pings or hardware feedback

Page 22: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Demonstration

Downt = 1

Upt = 2

Downt = 2

Downt = 0

Upt = 1

T/0 T/1 T/0

tout/1

tout/1

T/1

T/1

t: token countT: token arrival eventtout: time-out event

trigger event / token sent

Start

Page 23: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Group Membership

Presenter: Jonathan Sippel

Page 24: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Group Membership Provides a level of agreement

between non-faulty processes in a distributed application

Tolerates permanent and transient failures in both nodes and links

Based on two mechanisms Token Mechanism 911 Mechanism

Page 25: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Token Mechanism Nodes in the membership are ordered in

a logical ring Token passed at a regular interval from

one node to the next Token carries the authoritative

knowledge of the membership Node updates its local membership

information according to the received token

Page 26: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Token Mechanism (cont’d) Aggressive Failure Detection

A D

B C

A D

B C

Page 27: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Token Mechanism (cont’d) Conservative Failure Detection

A D

B C

A D

B C

Page 28: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

911 Mechanism When is the 911 Mechanism used?

Token Regeneration - Regenerate a token that is lost if a node or a link fails

Dynamic Scalability - Add a new node to the system

What is a 911 message? Request for the right to regenerate the lost

token Must be approved by all the live nodes in

the membership

Page 29: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Token Regeneration Only one node is allowed to regenerate the token Token sequence number is used to guarantee

mutual exclusivity and is incremented every time the token is passed from one node to the next

Each node makes a local copy of the token on receipt

Sequence number on the node’s local copy of the token is added to the 911 message and compared to all the sequence numbers on the local copies of the token on the other live nodes

911 request is denied by any node with a more recent copy of the token

Page 30: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Dynamic Scalability 911 message sent by a new node

to join the group Receiving node

Treats the message as a join request because the originating node is not in the membership

Updates the membership the next time it receives the token and sends it to the new node

Page 31: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Data Storage The RAIN system provides a

distributed storage system based on a class of erasure-correcting codes called array codes that provide a mathematical means of representing data so lost information can be recovered

Page 32: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Data Storage (cont’d) Array codes

With an (n, k) erasure-correcting code, k symbols of original data are represented with n symbols of encoded data

With an m-erasure-correcting code, the original data can be recovered even if m symbols of the encoded data are lost

A code is said to be Maximum Distance Separable (MDS) if m = n – k

The only operations necessary to encode/decode an array code are simple binary XOR operations

Page 33: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Data Storage (cont’d)

A+C+d+eF+B+c+dE+A+b+cD+F+a+bC+E+f+aB+D+e+f

FDDCBA

fedcba

Data Placement Scheme for a (6, 4) Array Code

Page 34: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Data Storage (cont’d)

A+C+d+eF+B+c+dE+A+b+cD+F+a+b??

FDDC??

fedc??

Data Placement Scheme for a (6, 4) Array Code

A = C + d + e + (A + C + d + e)b = A + (E + A + b + c) + c + Ea = b + (D + F + a + b) + D + F

B = a + c + (F + B + c + d) + d

Page 35: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

Data Storage (cont’d) Distributed store/retrieve operations

For a store operation a block of data of size d is encoded into n symbols, each of size d/k, using an (n, k) MDS array code

For a retrieve operation, symbols are collected from any k nodes and decoded

The original data can be recovered with up to n – k node failures

The encoding scheme provides for dynamic reconfigurability and load balancing

Page 36: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

RAIN Contributions to Distributed Computing Systems Fault-tolerant interconnect

topologies and communication protocols providing consistent error reporting of link failures

Fault management techniques based on group membership

Data storage schemes based on computationally efficient error-control codes

Page 37: Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

References Vasken Bohossian, Chenggong C. Fan,

Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, Jehoshua Bruck, “Computing in the RAIN: A Reliable Array of Independent Nodes,” IEEE Transactions On Parallel and Distributed Systems, Vol. 12, No. 2, February 2001

http://www.rainfinity.com/