condor week – april 2006artyom sharov, technion, haifa1 adding high availability to condor central...

23
Artyom Sharov, Technion, Haifa 1 Condor week – April 2006 Adding Adding High Availability High Availability to Condor to Condor Central Manager Central Manager Artyom Sharov Technion – Israel Institute of Technology, Haifa

Upload: trevor-caldwell

Post on 05-Jan-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 1 Condor week – April 2006

Adding Adding High Availability High Availability

to Condor to Condor Central Manager Central Manager

Artyom SharovTechnion – Israel Institute of Technology, Haifa

Page 2: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 2 Condor week – April 2006

Collector

Negotiator

Startd and ScheddStartd and Schedd

Startd and ScheddStartd and ScheddStartd and Schedd

Startd and ScheddStartd and ScheddCentral Manager

Condor Pool without High Availability

Page 3: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 3 Condor week – April 2006

Central Manager is a single-point-of-

failure No additional matches are possible

Condor tools do not work

Unfair resource sharing and user priorities

Our goal - continuous pool functioning

in case of failure

Why Highly Available CM?

Page 4: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 4 Condor week – April 2006

Highly AvailableCentral

ManagerStartd and ScheddStartd and Schedd

Startd and ScheddStartd and Schedd Startd and Schedd

Startd and ScheddStartd and Schedd

Highly Available Condor Pool

Page 5: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 5 Condor week – April 2006

Automatic failure detection

Transparent failover

“Split brain” reconciliation

Persistency of CM state

No changes to CM code

Solution Requirements

Page 6: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 6 Condor week – April 2006

Highly AvailableCentral

Manager

Negotiator

Condor Pool with HA

Replicator

HAD

Replicator

HAD

Collector

Replicator

HAD

Collector

Collector

Page 7: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 7 Condor week – April 2006

Backup 1

Backup 2

Backup 3

HA – Election + Main

#1

#2

Election message

Election message

Election message

I win

Raise Negotiator

I loose

Active

I am alive

I loose

Page 8: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 8 Condor week – April 2006

HA – CrashActive Backup

1Backup 2

#3

Election messages

#4

I am alive

I win

Raise Negotiator

I loose

Active

Page 9: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 9 Condor week – April 2006

Active Backup

Joining

#1

#2

State update

State update

Solicit version

Solicit version reply

#3

Downloading request

Replication – Main + Joining

Page 10: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 10 Condor week – April 2006

Replication – CrashActive Backup

1Backup 2

#4

#5

State update

State updateActive

Page 11: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 11 Condor week – April 2006

Stabilization time Depends on number of CMs and network

performance HAD_CONNECT_TIMEOUT – upper bound on the

time to establish TCP connection Example: HAD_CONNECT_TIMEOUT = 2 and 2

CMs - new Negotiator is guaranteed to be up and

running after 48 seconds

Replication frequency REPLICATION_INTERVAL

Configuration

Page 12: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 12 Condor week – April 2006

Automatic distributed testing framework: simulation of node crashes, network disconnections,

network partition and merges

Extensive testing: distributed testing on 5 machines in the Technion

interactive distributed testing in Wisconsin pool

automatic testing with NMI framework

Testing

Page 13: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 13 Condor week – April 2006

Already deployed and fully functioning for more than a year in Technion GLOW, UW California Department of Water Resources,

Delta Modeling Section, Sacramento, CA Hartford Life Cycle Computing Additional commercial users

HA in Production

Page 14: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 14 Condor week – April 2006

HAD Monitoring System

Configuration/administration utilities

Detailed manual section

Full support by Technion team

Usability and Administration

Page 15: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 15 Condor week – April 2006

HA in WAN HAIFA – High Availability Is For Anyone

HA for any Condor service (e.g.: HA for schedd) More consistency schemes and HA semantics Dynamic registration of services requiring HA Dynamic addition/removal of replicas

More details in "Materializing Highly Available Grids" - hot topic paper, to appear in HPDC 2006.

Future Work

Page 16: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 16 Condor week – April 2006

Ongoing collaboration for 3 years Compliance with Condor coding standards Peer-reviewed code Integration with NMI framework Automation of testing Open-minded attitude of Condor team to

numerous requests and questions Unique experience of working with large

peer-managed group of talented programmers

Collaboration with Condor Team

Page 17: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 17 Condor week – April 2006

This work was a collaborative effort of: Distributed Systems Laboratory in

Technion Prof. Assaf Schuster, Gabi Kliot, Mark

Zilberstein, Artyom Sharov Condor team

Prof. Miron Livny, Nick, Todd, Derek, Greg, Anatoly, Peter, Becky, Bill, Tim

Collaboration with Condor Team

Page 18: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 18 Condor week – April 2006

Part of the official 6.7.18 development release

Will soon appear in stable 6.8 release More information:

http://dsl.cs.technion.ac.il/projects/gozal/project_pages/ha/ha.html

http://dsl.cs.technion.ac.il/projects/gozal/project_pages/replication/replication.html

more details + configuration in my tutorial Contact:

{gabik,marks,sharov}@cs.technion.ac.il [email protected]

You Should Definitely Try It

Page 19: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 19 Condor week – April 2006

In case of time

Page 20: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 20 Condor week – April 2006

Replication – “Split Brain”Active 1

Active 2Merge of networks

I am alive, Active 1

I am alive, Active 2

Decision making :

my ID > ‘Active 2’ ID, I am a leader

Decision making :

my ID < ‘Active 1’ ID, give up

HAD

Replication

HAD

Replication

Page 21: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 21 Condor week – April 2006

Active BackupMerge of networks

HAD

Replication

HAD

Replication

You’re leader

‘Active 2’ last version before merge

merg

ing

vers

ions

from

tw

o p

ools

State update

Replication – “Split Brain”

Page 22: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 22 Condor week – April 2006

HAD State Diagram

Page 23: Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of

Artyom Sharov, Technion, Haifa 23 Condor week – April 2006

RD State Diagram