
Page 1: Helix talk at RelateIQ

Apache Helix: Simplifying Distributed Systems
Kanak Biscuitwala and Jason Zhang

helix.incubator.apache.org
@apachehelix

Page 2: Helix talk at RelateIQ

• Background
• Resource Assignment Problem
• Helix Concepts
• Putting Concepts to Work
• Getting Started
• Plugins
• Current Status

Outline

Page 3: Helix talk at RelateIQ

Building distributed systems is hard.

Responding to node entry and exit

Alerting based on metrics

Managing data replicas

Supporting event listeners

Load balancing

Page 4: Helix talk at RelateIQ

Helix abstracts away problems distributed systems need to solve.

Responding to node entry and exit

Alerting based on metrics

Managing data replicas

Supporting event listeners

Load balancing

Page 5: Helix talk at RelateIQ

System Lifecycle

Single Node

Multi-Node: Partitioning, Discovery, Co-Location

Fault Tolerance: Replication, Fault Detection, Recovery

Cluster Expansion: Throttle movement, Redistribute data

Page 6: Helix talk at RelateIQ

Resource Assignment Problem

Page 7: Helix talk at RelateIQ

NODES / RESOURCES

Resource Assignment Problem

Page 8: Helix talk at RelateIQ

25%

25%

25%

25%

Resource Assignment Problem: Sample Allocation

NODES / RESOURCES

Page 9: Helix talk at RelateIQ

34%

33%

33%

Resource Assignment Problem: Failure Handling

NODES / RESOURCES

Page 10: Helix talk at RelateIQ

Resource Assignment Problem

ZooKeeper provides low-level primitives. We need high-level primitives.

Application → ZooKeeper: file system, lock, ephemeral

Application → Helix → Consensus System: node, partition, replica, state, transition

Making it Work: Take 1 (ZooKeeper)
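For a sense of what "low-level" means here, this is roughly what liveness tracking looks like against raw ZooKeeper (a minimal sketch using the stock ZooKeeper client; the path and payload are illustrative, and leader election, assignment, and state handling would all still be left to the application):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LivenessSketch {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
    // An ephemeral znode vanishes when the session dies, so watchers on
    // /live-nodes observe node entry and exit. (Assumes /live-nodes exists.)
    zk.create("/live-nodes/node1", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }
}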

Page 11: Helix talk at RelateIQ

Resource Assignment Problem: Making it Work, Take 2 (Decisions by Nodes)

Consensus System

config changes, node updates, node changes

(S: a service running on a node)

Page 12: Helix talk at RelateIQ

Resource Assignment Problem: Making it Work, Take 2 (Decisions by Nodes)

Consensus System

config changes, node updates, node changes

Problems: multiple brains, app-specific logic, unscalable traffic


Page 14: Helix talk at RelateIQ

Resource Assignment Problem: Making it Work, Take 3 (Single Brain)

A single controller watches the consensus system for config changes and node changes; the nodes publish node updates through the consensus system.

Node logic is drastically simplified!


Page 16: Helix talk at RelateIQ

Roles: the Controller, NODES (Participants), and Spectators. The controller manages the RESOURCES hosted on the participants.

Resource Assignment Problem: Helix View

Question: How do we make this controller generic enough to work for different resources?

Page 17: Helix talk at RelateIQ

Helix Concepts

Page 18: Helix talk at RelateIQ

Resource

Partition Partition Partition

All partitions can be replicated.

Helix Concepts: Resources

Page 19: Helix talk at RelateIQ

(State diagram: Offline, Slave, Master)

Helix Concepts: Declarative State Model

Page 20: Helix talk at RelateIQ

State Constraints

MASTER: [1, 1]
SLAVE: [0, R]

Special Constraint Values
R: Replica count per partition
N: Number of participants

Helix Concepts: Constraints - Augmenting the State Model

Page 21: Helix talk at RelateIQ

State Constraints
MASTER: [1, 1]
SLAVE: [0, R]

Special Constraint Values
R: Replica count per partition
N: Number of participants

Transition Constraints
Scope Cluster: OFFLINE-SLAVE, 3 concurrent
Scope Resource R1: SLAVE-MASTER, 1 concurrent
Scope Participant P4: OFFLINE-SLAVE, 2 concurrent

Helix Concepts: Constraints - Augmenting the State Model

Page 22: Helix talk at RelateIQ

State Constraints
MASTER: [1, 1]
SLAVE: [0, R]

Special Constraint Values
R: Replica count per partition
N: Number of participants

Transition Constraints
Scope Cluster: OFFLINE-SLAVE, 3 concurrent
Scope Resource R1: SLAVE-MASTER, 1 concurrent
Scope Participant P4: OFFLINE-SLAVE, 2 concurrent

States and transitions are ordered by priority in computing replica states.

Transition constraints can be restricted to cluster, resource, and participant scopes. The most restrictive constraint is used.

Helix Concepts: Constraints - Augmenting the State Model
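In code, a transition constraint like the cluster-scoped throttle above might be registered through the admin API (a sketch based on Helix's message-constraint mechanism; the attribute strings follow the Helix throttling tutorial, and the cluster and constraint names are illustrative):

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ClusterConstraints.ConstraintType;
import org.apache.helix.model.builder.ConstraintItemBuilder;

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
ConstraintItemBuilder builder = new ConstraintItemBuilder();
// At most 3 concurrent OFFLINE-SLAVE transitions across the cluster.
builder.addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
    .addConstraintAttribute("TRANSITION", "OFFLINE-SLAVE")
    .addConstraintAttribute("CONSTRAINT_VALUE", "3");
admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
    "OfflineSlaveThrottle", builder.build());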

Page 23: Helix talk at RelateIQ

Resource

Partition Partition Partition

All partitions can be replicated. Each replica is in a state (e.g., master, slave, offline) governed by the augmented state model.

Helix Concepts: Resources and the Augmented State Model

Page 24: Helix talk at RelateIQ

Helix Concepts: Objectives

Partition Placement: distribution policy for partitions and replicas; making effective use of the cluster and the resource

Failure and Expansion Semantics: changing existing replica states; creating new replicas and assigning states

Page 25: Helix talk at RelateIQ

Putting Concepts to Work

Page 26: Helix talk at RelateIQ

Rebalancing Strategies: Meeting Objectives within Constraints

                    Full-Auto   Semi-Auto   Customized   User-Defined
Replica Placement   Helix       App         App          App code plugged into the Helix controller
Replica State       Helix       Helix       App          App code plugged into the Helix controller

Page 27: Helix talk at RelateIQ

Rebalancing Strategies: Full-Auto

Node 1: P1 (M), P2 (S); Node 2: P3 (M), P1 (S); Node 3: P2 (M), P3 (S)

By default, Helix optimizes for minimal movement and even distribution of partitions and states.



Page 30: Helix talk at RelateIQ

Rebalancing Strategies: Semi-Auto

Node 1: P1 (M), P2 (S); Node 2: P3 (M), P1 (S); Node 3: P2 (M), P3 (S)

Semi-Auto mode maintains the location of the replicas, but allows Helix to adjust the states to follow the state constraints.

This is ideal for resources that are expensive to move.


Page 32: Helix talk at RelateIQ

Rebalancing Strategies: Semi-Auto

Node 1: P1 (M), P2 (S); Node 2: P3 (M), P1 (S); Node 3: P2 (M), P3 (S, promoted to M)

Semi-Auto mode maintains the location of the replicas, but allows Helix to adjust the states to follow the state constraints: when P3's master is lost, its slave replica on Node 3 is promoted in place.

This is ideal for resources that are expensive to move.


Page 34: Helix talk at RelateIQ

Rebalancing Strategies: Customized

The app specifies the location and state of each replica. Helix still ensures that transitions are fired according to constraints.

Need to respond to node changes? Use the Helix custom code invoker to run on one participant, or...

Page 35: Helix talk at RelateIQ

Rebalancing Strategies: User-Defined

The cycle: a node joins or leaves the cluster; the Helix controller invokes code plugged in by the app; the app's rebalancer computes replica placement and state; Helix fires transitions without violating constraints.

The rebalancer receives a full snapshot of the current cluster state, as well as access to the backing data store. Helix's own rebalancers implement the same interface.

Page 36: Helix talk at RelateIQ

Rebalancing Strategies: User-Defined - Distributed Lock Manager

States: Offline, Locked, Released

Nodes: Node 1, Node 2

Each lock is a partition!

Page 37: Helix talk at RelateIQ

Rebalancing Strategies: User-Defined - Distributed Lock Manager

States: Offline, Locked, Released

Nodes: Node 1, Node 2, Node 3

Each lock is a partition!

Page 38: Helix talk at RelateIQ

Rebalancing Strategies: User-Defined - Distributed Lock Manager

public ResourceAssignment computeResourceMapping(
    Resource resource, IdealState currentIdealState,
    CurrentStateOutput currentStateOutput,
    ClusterDataCache clusterData) {
  ...
  int i = 0;
  for (Partition partition : resource.getPartitions()) {
    Map<String, String> replicaMap = new HashMap<String, String>();
    int participantIndex = i % liveParticipants.size();
    String participant = liveParticipants.get(participantIndex);
    replicaMap.put(participant, "LOCKED");
    assignment.addReplicaMap(partition, replicaMap);
    i++;
  }
  return assignment;
}
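For the controller to invoke this rebalancer, the resource's ideal state points at the implementing class (a sketch against the 0.6.x-era API; the resource name and LockRebalancer class are illustrative, and the admin handle is assumed from the surrounding examples):

// Wire the user-defined rebalancer into the resource's ideal state.
IdealState idealState = admin.getResourceIdealState("MyCluster", "lock-group");
idealState.setRebalanceMode(IdealState.RebalanceMode.USER_DEFINED);
idealState.setRebalancerClassName(LockRebalancer.class.getName()); // hypothetical class
admin.setResourceIdealState("MyCluster", "lock-group", idealState);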


Page 40: Helix talk at RelateIQ

Controller: Fault Tolerance

(State diagram: Offline, Standby, Leader)

The augmented state model concept applies to controllers too!
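Starting a fault-tolerant controller is a single call (a sketch; the addresses and names are illustrative):

import org.apache.helix.controller.HelixControllerMain;

// In DISTRIBUTED mode, several controllers use the LeaderStandby model above:
// one leads, the rest stand by and take over on failure.
HelixControllerMain.startHelixController(
    "localhost:2181",   // ZooKeeper address
    "MyCluster",        // cluster to manage
    "controller-1",     // this controller's name
    HelixControllerMain.DISTRIBUTED);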

Page 41: Helix talk at RelateIQ

Controller: Scalability

Controllers 1, 2, and 3 divide responsibility for Clusters 1 through 6, with each cluster managed by one controller.


Page 43: Helix talk at RelateIQ

ZooKeeper View

{    "id"  :  "SampleResource",    "simpleFields"  :  {        "REBALANCE_MODE"  :  "USER_DEFINED",        "NUM_PARTITIONS"  :  "2",        "REPLICAS"  :  "2",        "STATE_MODEL_DEF_REF"  :  "MasterSlave",        "STATE_MODEL_FACTORY_NAME"  :  "DEFAULT"    },    "mapFields"  :  {        "SampleResource_0"  :  {            "node1_12918"  :  "MASTER",            "node2_12918"  :  "SLAVE"        }        ...    },    "listFields"  :  {}}

Ideal State: Replica Placement and State
P1: N1 (M), N2 (S)
P2: N2 (M), N1 (S)

Page 44: Helix talk at RelateIQ

ZooKeeper View: Current State and External View

Current State (per node)
N1: P1: MASTER, P2: MASTER
N2: P1: OFFLINE, P2: OFFLINE

External View (per partition)
P1: N1 (M), N2 (O)
P2: N1 (M), N2 (O)

Helix’s responsibility is to make the external view match the ideal state as closely as possible.
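Both records can be read programmatically through the admin API (a sketch; the cluster and resource names are illustrative):

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ExternalView;
import org.apache.helix.model.IdealState;

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
// Target placement and states, as configured or computed by the rebalancer:
IdealState ideal = admin.getResourceIdealState("MyCluster", "SampleResource");
// Aggregated current states reported by all participants:
ExternalView view = admin.getResourceExternalView("MyCluster", "SampleResource");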

Page 45: Helix talk at RelateIQ

Logical Deployment

ZooKeeper, a Helix controller, a spectator, and three participants; the spectator and each participant embed a Helix agent.

Participant 1: P1 (M), P2 (S); Participant 2: P2 (M), P3 (S); Participant 3: P3 (M), P1 (S)

Page 46: Helix talk at RelateIQ

Getting Started

Page 47: Helix talk at RelateIQ

Node 1: P1, P2, P3, P4 (Master); P5, P6, P9, P10 (Slave)
Node 2: P5, P6, P7, P8 (Master); P1, P2, P11, P12 (Slave)
Node 3: P9, P10, P11, P12 (Master); P3, P4, P7, P8 (Slave)

Partition Management
• multiple replicas
• 1 master
• even distribution

Fault Tolerance
• fault detection
• promote slave to master
• even distribution
• no SPOF

Elasticity
• minimize downtime
• minimize data movement
• throttle movement

Example: Distributed Data Store

Page 48: Helix talk at RelateIQ

Define: state model, state transitions

Configure: create cluster, add nodes, add resource, config rebalancer

Run: start controller, start participants

Example: Distributed Data Store - Helix-Based Solution

Page 49: Helix talk at RelateIQ

States: all possible states, each with a priority

Transitions: legal transitions, each with a priority

Applicable to each partition of a resource

(State diagram: Offline, Slave, Master)

Example: Distributed Data Store - State Model Definition: Master-Slave

Page 50: Helix talk at RelateIQ

builder = new StateModelDefinition.Builder("MasterSlave");

// add states and their ranks to indicate priority
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);

// set the initial state when participant starts
builder.initialState(OFFLINE);

// add transitions
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);

Example: Distributed Data Store - State Model Definition: Master-Slave
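The finished definition is built and registered with the cluster so the controller can use it (a sketch; the admin handle and cluster name carry over from the other examples as assumptions):

// Build the definition and register it with the cluster.
StateModelDefinition masterSlave = builder.build();
admin.addStateModelDef("MyCluster", "MasterSlave", masterSlave);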

Page 51: Helix talk at RelateIQ

Scope       State  Transition
Partition     Y        Y
Resource      -        Y
Node          Y        Y
Cluster       -        Y

(State diagram: Offline, Slave (StateCount=2), Master (StateCount=1))

Example: Distributed Data Store - Defining Constraints

Page 52: Helix talk at RelateIQ

// static constraints
builder.upperBound(MASTER, 1);

// dynamic constraints
builder.dynamicUpperBound(SLAVE, "R");

// unconstrained
builder.upperBound(OFFLINE, -1);

Example: Distributed Data Store - Defining Constraints: Code

Page 53: Helix talk at RelateIQ

@StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "SLAVE", "MASTER"})
class DistributedDataStoreModel extends StateModel {
  @Transition(from = "OFFLINE", to = "SLAVE")
  public void fromOfflineToSlave(Message m, NotificationContext ctx) {
    // bootstrap data, setup replication, etc.
  }

  @Transition(from = "SLAVE", to = "MASTER")
  public void fromSlaveToMaster(Message m, NotificationContext ctx) {
    // catch up previous master, enable writes, etc.
  }
  ...
}

Example: Distributed Data Store - Participant Plug-In Code
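Plugging the model into a running participant looks roughly like this (a sketch; the factory class is illustrative, and the per-partition factory signature follows the 0.6.x-era API):

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.participant.statemachine.StateModelFactory;

// One state model instance is created per partition hosted by this node.
class DistributedDataStoreModelFactory extends StateModelFactory<DistributedDataStoreModel> {
  @Override
  public DistributedDataStoreModel createNewStateModel(String partitionName) {
    return new DistributedDataStoreModel();
  }
}

// Connect as a participant and register the factory for the MasterSlave model.
HelixManager manager = HelixManagerFactory.getZKHelixManager(
    "MyCluster", "localhost_12000", InstanceType.PARTICIPANT, "localhost:2181");
manager.getStateMachineEngine().registerStateModelFactory(
    "MasterSlave", new DistributedDataStoreModelFactory());
manager.connect();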

Page 54: Helix talk at RelateIQ

HelixAdmin --zkSvr <zk-address>

Create Cluster
--addCluster MyCluster

Add Participants
--addNode MyCluster localhost_12000
...

Add Resource
--addResource MyDB 16 MasterSlave SEMI_AUTO

Configure Rebalancer
--rebalance MyDB 3

Example: Distributed Data Store - Configure and Run
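The same setup is available through the Java admin API (a sketch mirroring the CLI above; the ZooKeeper address is illustrative):

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.InstanceConfig;

HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
admin.addCluster("MyCluster");                                          // create cluster
admin.addInstance("MyCluster", new InstanceConfig("localhost_12000"));  // add a participant
admin.addResource("MyCluster", "MyDB", 16, "MasterSlave", "SEMI_AUTO"); // 16 partitions
admin.rebalance("MyCluster", "MyDB", 3);                                // 3 replicas each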

Page 55: Helix talk at RelateIQ

class RoutingLogic {
  public void write(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstance(partition, "MASTER");
    nodes.get(0).write(request);
  }

  public void read(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstance(partition);
    random(nodes).read(request);
  }
}

Example: Distributed Data Store - Spectator Plug-In Code
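The routingTableProvider used above stays current by listening for external view changes on a spectator connection (a sketch; the instance name is illustrative):

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.spectator.RoutingTableProvider;

// Spectators observe cluster state but host no replicas themselves.
HelixManager manager = HelixManagerFactory.getZKHelixManager(
    "MyCluster", "router-1", InstanceType.SPECTATOR, "localhost:2181");
manager.connect();

// RoutingTableProvider answers "which instances host partition X in state Y?"
// and is refreshed on every external view change.
RoutingTableProvider routingTableProvider = new RoutingTableProvider();
manager.addExternalViewChangeListener(routingTableProvider);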

Page 56: Helix talk at RelateIQ

Example: Distributed Data StoreWhere is the Code?

Controller Consensus System

Participant

config changesnode changes

node updatesnode updates

Participant Plug-In Code

Spectator Plug-In Code

Participant Plug-In Code

Participant

Spectator

Page 57: Helix talk at RelateIQ

Node 1: P1, P2, P3, P4
Node 2: P3, P4, P5, P6
Node 3: P5, P6, P1, P2

Each partition is an index shard.

Partition Management
• multiple replicas
• rack-aware placement
• even distribution

Fault Tolerance
• fault detection
• auto create replicas
• controlled creation of replicas

Elasticity
• redistribute partitions
• minimize data movement
• throttle movement

Example: Distributed Search

Page 58: Helix talk at RelateIQ

(State diagram: Offline, Idle, Bootstrap, Online, Error, with transitions labeled: setup node, cleanup, consume data to build index, stop consume data, can serve requests, stop indexing and serving, recover. StateCount=5, StateCount=3.)

Example: Distributed Search - State Model Definition: Bootstrap
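A corresponding definition can be built with the same builder API as MasterSlave (a minimal sketch; the transition endpoints are read off the diagram above and are assumptions, as are the priorities):

// Sketch: Bootstrap state model; endpoints inferred from the diagram.
StateModelDefinition.Builder builder = new StateModelDefinition.Builder("Bootstrap");
builder.addState("ONLINE", 1);
builder.addState("BOOTSTRAP", 2);
builder.addState("IDLE", 3);
builder.addState("OFFLINE");
builder.addState("ERROR");
builder.initialState("OFFLINE");
builder.addTransition("OFFLINE", "IDLE");     // setup node
builder.addTransition("IDLE", "BOOTSTRAP");   // consume data to build index
builder.addTransition("BOOTSTRAP", "ONLINE"); // can serve requests
builder.addTransition("BOOTSTRAP", "IDLE");   // stop consuming data
builder.addTransition("ONLINE", "IDLE");      // stop indexing and serving
builder.addTransition("IDLE", "OFFLINE");     // cleanup
builder.addTransition("ERROR", "OFFLINE");    // recover
StateModelDefinition bootstrap = builder.build();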

Page 59: Helix talk at RelateIQ

Create Cluster
--addCluster MyCluster

Add Participants
--addNode MyCluster localhost_12000
...

Add Resource
--addResource MyIndex 16 Bootstrap CUSTOMIZED

Configure Rebalancer
--rebalance MyIndex 8

Example: Distributed Search - Configure and Run

Page 60: Helix talk at RelateIQ

Partition Management
• one consumer per queue
• even distribution

Elasticity
• redistribute queues among consumers
• minimize movement

Fault Tolerance
• redistribute
• minimize data movement
• limit max queues per consumer

(Diagram: a partitioned queue and its consumers C1, C2, C3, shown through Assignment, Scaling, and Fault Tolerance phases.)

Example: Message Consumers

Page 61: Helix talk at RelateIQ

(State diagram: Offline to Online: start consumption; Online to Offline: stop consumption)

StateCount = 1; max 10 queues per consumer

Example: Message Consumers - State Model Definition: Online-Offline

Page 62: Helix talk at RelateIQ

@StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "ONLINE"})
class MessageConsumerModel extends StateModel {
  @Transition(from = "OFFLINE", to = "ONLINE")
  public void fromOfflineToOnline(Message m, NotificationContext ctx) {
    // register listener
  }

  @Transition(from = "ONLINE", to = "OFFLINE")
  public void fromOnlineToOffline(Message m, NotificationContext ctx) {
    // unregister listener
  }
}

Example: Message Consumers - Participant Plug-In Code

Page 63: Helix talk at RelateIQ

Plugins

Page 64: Helix talk at RelateIQ

• Chaos Monkey
• Data-Driven Testing and Debugging
• Intra-Cluster Messaging
• Rolling Upgrade
• On-Demand Task Scheduling
• Health Monitoring

Plugins: Overview

Page 65: Helix talk at RelateIQ

Analyze invariants like state and transition constraints

Simulate execution with Chaos Monkey

Instrument ZK, controller, and participant logs

Data-Driven Testing and Debugging

The exact sequence of events can be replayed: debugging made easy!

Plugins

Page 66: Helix talk at RelateIQ

Data-Driven Testing and Debugging: Sample Log File

timestamp    partition   participantName     sessionId                            state
1.32331E+12  TestDB_123  express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_91   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_91   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_91   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_60   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_91   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_60   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE

Plugins

Page 67: Helix talk at RelateIQ

Data-Driven Testing and Debugging: Count Aggregation

Time   State    Slave Count  Participant
42632  OFFLINE  0            10.117.58.247_12918
42796  SLAVE    1            10.117.58.247_12918
43124  OFFLINE  1            10.202.187.155_12918
43131  OFFLINE  1            10.220.225.153_12918
43275  SLAVE    2            10.220.225.153_12918
43323  SLAVE    3            10.202.187.155_12918
85795  MASTER   2            10.220.225.153_12918

Error! The state constraint for SLAVE has an upper bound of 2.

Plugins

Page 68: Helix talk at RelateIQ

Data-Driven Testing and Debugging: Time Aggregation

Slave Count  Time       Percentage
0            1082319    0.5
1            35578388   16.46
2            179417802  82.99
3            118863     0.05

Master Count  Time      Percentage
0             1082319   0.5
1             35578388  16.46

83% of the time, there were 2 slaves to a partition

93% of the time, there was 1 master to a partition

We can see for exactly how long the cluster was out of whack.

Plugins

Page 69: Helix talk at RelateIQ

Current Status

Page 70: Helix talk at RelateIQ

Helix at LinkedIn

Oracle (Primary DB) and Espresso emit Data Change Events into Databus, which delivers Updates to Standardization, Search Index, Graph Index, and Read Replicas.

Page 71: Helix talk at RelateIQ

Coming Up Next

• Automatic scaling with YARN
• New APIs
• Non-JVM participants

Page 72: Helix talk at RelateIQ

• Helix: A generic framework for building distributed systems

• Abstraction and modularity allow for modifying and enhancing system behavior

• Simple programming model: declarative state machine

Summary

Page 73: Helix talk at RelateIQ

Questions?

website: helix.incubator.apache.org

dev mailing list: [email protected]

user mailing list: [email protected]

twitter: @apachehelix

Page 74: Helix talk at RelateIQ