Migratory File Services for Batch-Pipelined Workloads
John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny
WiND and Condor Projects
6 May 2003
How to Run a Batch-Pipelined Workload?
[Diagram: three pipelines in which each job a1–a3 reads a private input in.1–in.3, passes an intermediate pipe.1–pipe.3 to job b1–b3, and writes an output out.1–out.3, while all pipelines read a common set of batch-shared data.]
Cluster-to-Cluster Computing
[Diagram: a home system and the Internet Archive connected over the wide area to several independent clusters of nodes: a Condor pool, a PBS cluster, and a Grid Engine cluster.]
How to Run a Batch-Pipelined Workload?

“Remote I/O”
• Submit jobs to a remote batch system.
• Let all I/O come directly home.
• Inefficient if re-use is common.
• (( But perfect if no data sharing! ))

“FTP-Net”
• User finds remote clusters.
• Manually stages data in.
• Submits jobs, deals with failures.
• Pulls data out.
• Lather, rinse, repeat.
Hawk: A Migratory File Service for Batch-Pipelined Workloads
• Automatically deploys a “task force” across multiple wide-area systems.
• Manages applications from a high level, using knowledge of process interactions.
• Provides dependable performance with peer-to-peer techniques. (( Locality is key! ))
• Understands and reacts to failures using knowledge of the system and workloads.
Dangers

Failures
• Physical: Networks fail, disks crash, CPUs halt.
• Logical: Out of space/memory, lease expired.
• Administrative: You can’t use cluster X today.

Dependencies
• A comes before C and D, which are simultaneous.
• What do we do if the output of C is lost?

Risk vs. Reward
• A gamble: Staging input data to a remote CPU.
• A gamble: Leaving output data at a remote CPU.
[Diagram: the same wide-area environment, now with Hawk deployed at every site; pipeline jobs a1–a3, b1–b3, and c1–c3 run across the Condor pool, PBS cluster, and Grid Engine cluster.]
Batch-Pipelined Workload
[Diagram: the workload’s first stages a1–a3, each reading its input i1–i3 from home.]
Hawk In Action
[Diagram: the remaining stages b1–b3 and c1–c3 run remotely; their outputs o1–o3 are returned home.]
Workflow Language 1 (Start With Condor DAGMan)

job a a.condor
job b b.condor
job c c.condor
job d d.condor
parent a child c
parent b child d
[Diagram: the DAG a→c and b→d, with read volume v1 holding mydata from archive storage, and scratch volumes v2 and v3.]
Workflow Language 2

volume v1 ftp://archive/mydata
mount v1 a /data
mount v1 b /data
volume v2 scratch
mount v2 a /tmp
mount v2 c /tmp
volume v3 scratch
mount v3 b /tmp
mount v3 d /tmp
Workflow Language 3
extract v2 x ftp://home/out.1
extract v3 x ftp://home/out.2
[Diagram: jobs c and d each write a file x into their scratch volumes; x is extracted from v2 as out.1 and from v3 as out.2.]
Mapping the Workflow to the Migratory File System

Abstract jobs
• Become jobs in a batch system.
• May start, stop, fail, checkpoint, restart...

Logical “scratch” volumes
• Become temporary containers on a scratch disk.
• May be created, replicated, and destroyed...

Logical “read” volumes
• Become blocks in a cooperative proxy cache.
• May be created, cached, and evicted...
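This mapping can be pictured as data structures inside the workflow manager. A minimal C sketch; every type and field name below is illustrative, not Hawk’s actual internals:

    /* Illustrative sketch of the workflow-to-MFS mapping. */
    typedef enum { VOL_READ, VOL_SCRATCH } volume_kind;

    typedef struct {
        const char *name;       /* e.g. "v2"                               */
        volume_kind kind;       /* read volume or scratch volume           */
        const char *source;     /* archive URL for read volumes, else NULL */
        int         container;  /* container id once created, -1 otherwise */
    } volume;

    typedef enum { JOB_IDLE, JOB_RUNNING, JOB_DONE, JOB_FAILED } job_state;

    typedef struct {
        const char *name;        /* e.g. "a"                          */
        const char *submit_file; /* e.g. "a.condor"                   */
        job_state   state;       /* may start, stop, fail, restart... */
        volume    **mounts;      /* volumes mounted at /data, /tmp    */
        int         nmounts;
    } job;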
System Components
[Diagram: at the home site, the workflow manager, the Condor matchmaker and scheduler, and the archive; over the wide area, a Condor pool and a PBS cluster of unmodified nodes.]
Gliding In
[Diagram: a glide-in job deploys a master, a startd, and a proxy onto the nodes of the Condor pool and the PBS cluster, turning them into Hawk workers.]
System Components
[Diagram: the same picture after gliding in: each site now runs startds and proxies, coordinated by the workflow manager at home.]
Cooperative Proxies
[Diagram: the proxies at each site cooperate as a shared cache between the compute nodes and the home archive.]
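On a cache miss, a proxy can look for a block at its peers before paying the wide-area price of going home to the archive. A sketch of that lookup in C; the types and helper functions are hypothetical stand-ins for the proxy internals, not Hawk’s real API:

    typedef struct block block;
    typedef struct proxy proxy;
    struct proxy {
        proxy **peers;    /* other proxies at this site */
        int     npeers;
        void   *archive;  /* handle on the home archive */
    };

    /* Hypothetical helpers. */
    block *local_cache_get(proxy *p, const char *file, int n);
    block *local_cache_put(proxy *p, const char *file, int n, block *b);
    block *peer_fetch(proxy *peer, const char *file, int n);
    block *archive_fetch(void *archive, const char *file, int n);

    block *lookup_block(proxy *p, const char *file, int n)
    {
        block *b = local_cache_get(p, file, n);
        if (b)
            return b;                        /* local hit: no network */

        /* Miss: try peers over the local-area network first. */
        for (int i = 0; i < p->npeers; i++)
            if ((b = peer_fetch(p->peers[i], file, n)))
                return local_cache_put(p, file, n, b);

        /* Last resort: cross the wide-area network to the archive. */
        b = archive_fetch(p->archive, file, n);
        return local_cache_put(p, file, n, b);
    }

This is why locality is key: a block fetched once over a restricted wide-area link can then serve every pipeline at the site.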
System Components
[Diagram: the overview again, leading into the batch execution components.]
Batch Execution System
[Diagram: the Condor matchmaker and scheduler at home dispatch jobs to the startds gliding on the Condor pool and PBS cluster; each node’s proxy serves its jobs’ I/O.]
System Components
[Diagram: the overview again, leading into the workflow manager.]
Workflow Manager Detail
[Diagram: the workflow manager drives the archive, the Condor matchmaker and scheduler, and the remote startd/proxy pairs.]
Create Container
[Diagram: the workflow manager asks a remote proxy, over the wide-area network, to create scratch container 120; the cooperative block cache already holds blocks d15 and d16 of /mydata from the archive.]
POSIX Library Interface
[Diagram: the job calls creat("/tmp/outfile") and open("/data/d15"); the agent, interposed at the POSIX level, maps /tmp to cont://host5/120 and /data to cache://host5/archive/mydata, so writes land in container 120 and reads are served by the cooperative block cache over the local-area network.]
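The agent’s rewriting can be pictured as a prefix substitution on every path the job opens. A self-contained C sketch using the exact mapping from the slide; the function names are ours, not the real agent’s interface:

    #include <stdio.h>
    #include <string.h>

    /* The agent's mount table, as shown on the slide. */
    static const struct { const char *prefix, *target; } mounts[] = {
        { "/tmp",  "cont://host5/120" },
        { "/data", "cache://host5/archive/mydata" },
    };

    /* Rewrite a logical path to its migratory location,
       or return it unchanged if no mount matches. */
    static const char *translate(const char *path, char *buf, size_t len)
    {
        for (size_t i = 0; i < sizeof mounts / sizeof mounts[0]; i++) {
            size_t n = strlen(mounts[i].prefix);
            if (strncmp(path, mounts[i].prefix, n) == 0) {
                snprintf(buf, len, "%s%s", mounts[i].target, path + n);
                return buf;
            }
        }
        return path;
    }

    int main(void)
    {
        char buf[256];
        puts(translate("/tmp/outfile", buf, sizeof buf));
        /* -> cont://host5/120/outfile : the write goes to the container */
        puts(translate("/data/d15", buf, sizeof buf));
        /* -> cache://host5/archive/mydata/d15 : the read hits the cache */
        return 0;
    }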
Execute Job
[Diagram: the startd runs the job; it writes outfile into container 120 and reads d15 and d16 from the cooperative cache over the local-area network, then reports JobCompleted.]
Extract Output
[Diagram: after JobCompleted, the workflow manager extracts outfile from container 120 across the wide-area network, storing it at home as out65.]
Delete Container
[Diagram: with out65 safely home, the workflow manager deletes container 120; the proxy confirms ContainerDeleted.]
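Putting the last four slides together, one scratch container’s life follows a simple create/run/extract/delete sequence. A C sketch with hypothetical types and function names, not the actual Hawk interface:

    typedef struct proxy proxy;  /* proxy managing a remote disk */
    typedef struct job   job;    /* one pipeline job             */

    /* Hypothetical operations matching the four slides. */
    int  create_container(proxy *p);  /* returns a container id */
    int  execute_job(job *j, int container);
    int  extract_output(proxy *p, int c, const char *file, const char *home_url);
    void delete_container(proxy *p, int container);

    int run_stage(proxy *p, job *j, const char *home_url)
    {
        int c = create_container(p);              /* Create Container */
        if (c < 0)
            return -1;

        if (execute_job(j, c) != 0) {             /* Execute Job */
            delete_container(p, c);               /* nothing worth saving */
            return -1;
        }

        if (extract_output(p, c, "outfile", home_url) != 0)
            return -1;                            /* keep container to retry */

        delete_container(p, c);                   /* Delete Container */
        return 0;
    }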
Fault Detection and Repair

The proxy, startd, and agent detect failures:
• Job evicted by machine owner.
• Network disconnection between job and proxy.
• Container evicted by storage owner.
• Out of space at proxy.

The workflow manager knows the consequences:
• Job D couldn’t perform its I/O.
• Check: Are volumes V1 and V3 still in place?
• Aha: Volume V3 was lost -> Run B to create it.
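In code, that reaction might look like the sketch below; the types and helpers are hypothetical. The essential point is that the workflow manager consults the dependency graph and re-runs the producer of any volume that was lost:

    typedef struct volume volume;
    typedef struct job    job;
    struct job { volume **mounts; int nmounts; };

    /* Hypothetical helpers. */
    int  volume_still_exists(volume *v);  /* probe the proxy holding it */
    job *producer_of(volume *v);          /* which job creates v?       */
    void resubmit(job *j);                /* hand back to the scheduler */

    void handle_job_failure(job *failed)
    {
        /* e.g. job D failed: check volumes V1 and V3. */
        for (int i = 0; i < failed->nmounts; i++) {
            volume *v = failed->mounts[i];
            if (!volume_still_exists(v))
                resubmit(producer_of(v));  /* V3 lost: run B again */
        }
        resubmit(failed);                  /* then retry D itself */
    }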
Performance Testbed

Controlled “remote” cluster:
• 32 cluster nodes at UW.
• Hawk submitter also at UW.
• Connected by a restricted 800 Kb/s link.

Also some preliminary tests on uncontrolled systems:
• Hawk over a PBS cluster at Los Alamos.
• Hawk over a Condor system at INFN Italy.
Batch-Pipelined Applications

Name    Stages   Load             Remote (jobs/hr)   Hawk (jobs/hr)
BLAST   1        Batch Heavy      4.67               747.40
CMS     2        Batch and Pipe   33.78              1273.96
HF      3        Pipe Heavy       40.96              3187.22
Failure Recovery
[Diagram: recovery from a cascading failure via rollback of dependent jobs.]
A Little Bit of Philosophy

Most systems are built from the bottom up:
• “This disk must have five nines, or else!”

MFS works from the top down:
• “If this disk fails, we know what to do.”

By working from the top down, we finesse many of the hard problems in traditional filesystems.
Future Work

• Integration with Stork.
• P2P aspects: discovery & replication.
• Optional knowledge: size & time.
• Delegation and disconnection.
• Names, names, names:
  – Hawk – A migratory file service.
  – Hawkeye – A system monitoring tool.
Feeling overwhelmed? Let Hawk juggle your work!
[Closing slide: jobs and data being juggled.]