apache helix presentation at socc 2012
DESCRIPTION
Untangling Cluster Management with Helix SOCC 2012 presentationTRANSCRIPT
1Recruiting SolutionsRecruiting SolutionsRecruiting Solutions
Untangling Cluster Management with Helix
Helix team @ LinkedIn
Kishore Gopalakrishna
http://www.linkedin.com/in/kgopalak
@kishoreg1980
2
Outline
What is Helix Use case 1: distributed data store Architecture Use case 2: consumer group Helix at LinkedIn Q&A
3
What is Helix
Cluster management framework for distributed systems using declarative state model
4
Distributed system examples
5
Motivation
A system starts out simple… …but gets complex in the real world …as you address real requirements
Application
client library
SystemCall Routing
Replica 1
Replica 2
…
Scale Failover Bootstrapping
…
6
Motivation
These are cluster management problems Helix solves them once… …so you can focus on your system
Scale Failover Bootstrapping
7
Outline
What is Helix Use case 1: distributed data store Architecture Use case 2: consumer group Helix at LinkedIn Q&A
8
Use-Case: Distributed Data Store
Distributed
Node 1 Node 3
P.1
Node 2
9
Use-Case: Distributed Data Store
Distributed Partitioned
Node 1 Node 3Node 2
P.4
P.9 P.10
P.11
P.12
P.1 P.2 P.3 P.7P.5 P.6
P.8
10
Use-Case: Distributed Data Store
Distributed Partitioned Replicated
Node 1 Node 3Node 2
P.4
P.9 P.10
P.11
P.12
P.1 P.2 P.3 P.7P.5 P.6
P.8 P.1P.5 P.6
P.9 P.10
P.4P.3
P.7 P.8P.11 P.12
P.2P.1
11
Partition Layout
Highly Available Master accepts writes Balanced distribution
Node 1 Node 3Node 2
P.4
P.9 P.10
P.11
P.12
P.1 P.2 P.3 P.7P.5 P.6
P.8 P.1P.5 P.6
P.9 P.10
P.4P.3
P.7 P.8P.11 P.12
P.2P.1
Master
Slave
Failover
Node 1
P.5 P.6
P.9 P.10
P.4
Node 3
P.9 P.10
P.11
P.4P.3P.12
P.7 P.8
P.1 P.2 P.3
P.1
Node 2
P.7
P.11 P.12
P.2
P.5 P.6
P.8 P.1
Master
Slave
P.1 P.2 P.3 P.4
Add Capacity
Node 1
P.5 P.6
P.9
P.4
Node 3
P.10
P.11
P.4P.3P.12
P.7
P.2 P.3
P.1
Node 2
P.7
P.11P.10
P.8P.12
P.2
P.9P.1 P.5 P.6
P.8 P.1
Node 4
P.10
P.8P.12 Master
Slave
P.1 P.5 P.9
14
Use-case requirements
• Partition constraints• 1 master per partition• Balance partitions across cluster• No single-point-of-failure: replicas on different nodes
• Handle failures: transfer mastership
• Elasticity• Distribute workload across added nodes Minimize partition movement
• Meet SLAs Throttle concurrent data movement
15
State machine– States
offline, slave, master
– Transitions O-S, S-O, S-M, M-S
COUNT=2
COUNT=1
minimize(maxnj N ∈ S(nj) )t1≤ 5
Declarative Problem Statement
Constraints– States– Transitions
Objective – Partition placement
S
MO
t1 t2
t3 t4
minimize(maxnj N ∈ M(nj) )
16
Generalizing cluster management
STATE MACHINE
CONSTRAINTS OBJECTIVE
HELIX
17
Outline
What is Helix Use case 1: distributed data store Architecture Use case 2: consumer group Helix at LinkedIn Q&A
18
Helix Based System Roles
Node 1 Node 3Node 2
P.4
P.9 P.10 P.11
P.12
P.1 P.2 P.3 P.7P.5 P.6
P.8 P.1P.5 P.6
P.9 P.10
P.4P.3
P.7 P.8P.11 P.12
P.2P.1
RESPONSE COMMAND
Controller Execution Flow
P1:OSP1:SM
20
Controller fault tolerance
21
Controller fault tolerance
22
Participant Plug-in code
23
Spectator Plug-in code
24
Benefits
Cluster operations “just work”– Bootstrapping– Failover– Add nodes
Global vs Local– Helix Controller
Global knowledge Makes cluster decisions
– Participant Local knowledge Follows orders
25
Outline
What is Helix Use case 1: distributed data store Architecture Use case 2: consumer group Helix at LinkedIn Q&A
26
consumer group
27
Consumer group: Scaling
28
Consumer group: Fault tolerance
29
Consumer group: state model
30
Outline
What is Helix Use case 1: distributed data store Architecture Use case 2: consumer group Helix at LinkedIn Q&A
31
Helix usage at LinkedIn (Pictures)
Espresso– a timeline-consistent, distributed data store
Databus– a change data capture service
Search as a Service– a multi-tenant service for multiple search applications
More planned
32
Summary
Building Distributed Data Systems is hard– Abstraction and modularity is key
Helix: A Generic framework for Cluster Management Simple programming model: declarative state machine
Helix: Future Roadmap
• Features• Span multiple data centers• Load balancing
• Announcement • Open source: https://github.com/linkedin/helix• Apache incubation• New contributors
34
Questions?