CoreGRID Summer School
Bonn, July 24th - 28th 2006
HPC4U: Realizing SLA-aware Resource Management
Simon Alexandre, CETIC, Charleroi, Belgium
Matthias Hovestadt, University of Paderborn, Germany

Topics
• Motivation
• Architecture of an SLA-aware RMS
• Phases of Operation
• SLA-aware Scheduling
• Cross-border Migration
• Summary

Grid Computing Today
• What do Grids look like today?
  – Grids are in use, but …
    – … commercial usage is rare and limited
      o only isolated applications
    – … they are mostly used as prototype solutions in research
      o testbeds within research projects
• Problem: no contractually fixed QoS levels
  – deadline-bound, business-critical jobs

What is an SLA?
• Service Level Agreement (SLA)
  – Contract between Provider and Customer
    – Describes all obligations and expectations
  – Flexible formulation for each use case
• SLA is in focus of research in Grid Middleware
[Figure: structure of a Service Level Agreement]
• Name: ID or Description of SLA
• Context: Contract Parties, Responsible Persons
• Terms
  – R-Type: HW, OS, Compiler, Software Packages, …
  – R-Quantity: Number of CPUs, main memory, …
  – R-Quality: CPU > 2 GHz, Network Bandwidth, …
  – Deadline: Date, Time, …
  – Policies: Demands on Security and Privacy, …
  – Price for Resource Consumption (fulfilled SLA)
  – Penalty Fee in case of SLA violation
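The SLA structure shown on this slide (Name, Context, Terms with price and penalty) can be written down as a small record type. The following is only a sketch: the class names, field names, the `settle` helper, and all example values are illustrative assumptions and are not taken from HPC4U or from any SLA specification.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Terms:
    """Terms section of an SLA, following the fields listed on the slide."""
    r_type: dict        # e.g. {"os": "Linux"} (hypothetical keys)
    r_quantity: dict    # e.g. {"cpus": 3}
    r_quality: dict     # e.g. {"cpu_ghz_min": 2.0}
    deadline: datetime
    policies: list = field(default_factory=list)
    price: float = 0.0        # due if the SLA is fulfilled
    penalty_fee: float = 0.0  # due from the provider if the SLA is violated


@dataclass
class ServiceLevelAgreement:
    name: str       # ID or description of the SLA
    context: dict   # contract parties, responsible persons
    terms: Terms

    def settle(self, finished_at: datetime) -> float:
        """Provider's balance: +price if the deadline was met, -penalty otherwise."""
        if finished_at <= self.terms.deadline:
            return self.terms.price
        return -self.terms.penalty_fee


# Hypothetical example values.
sla = ServiceLevelAgreement(
    name="sla-0001",
    context={"provider": "site-a", "customer": "customer-x"},
    terms=Terms(
        r_type={"os": "Linux"},
        r_quantity={"cpus": 3},
        r_quality={"cpu_ghz_min": 2.0},
        deadline=datetime(2006, 7, 25, 6, 0),
        price=100.0,
        penalty_fee=250.0,
    ),
)
```

Keeping price and penalty next to the deadline makes the provider's risk explicit: accepting an SLA it cannot fulfil costs it the penalty fee.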

The Gap between Grid and RMS
[Figure: a user request with an SLA reaches the grid middleware, which sits on top of three RMS instances managing machines M1, M2, M3. Labels: "Reliability? Quality of Service?", "Best Effort!", "Guaranteed!"]
• User asks for Service Level Agreement
• Grid Middleware realizes job by means of local RMS systems
• BUT: These RMS only offer Best Effort!
• Goal: SLA-aware RMS
  – runtime responsibility
  – reliability
    – fault tolerance

Demands on an SLA-aware RMS
• Negotiation
  – active negotiation with upper layers
  – accept new job only if SLA can be fulfilled
• System Management
  – taking terms of SLAs into account
  – allocation of nodes according to SLAs
• Fault Tolerance
  – ensure terms of SLAs also in case of failures
  – mechanisms for failure handling

SLA-aware RMS
• Central component
  – Interface to Grid middleware for SLA-Negotiation
  – Interfaces to Subsystems for provision of FT
• Tasks
  – SLA Negotiation
  – Policies
    – security, …
  – Monitoring
  – FT
    – checkpoints
    – migration
• Open interfaces
[Figure: HPC4U Grid-enabled, SLA-aware Resource Management System (SLA Negotiation, SLA-aware Scheduling, Monitoring, Support of Checkpointing and Migration, Security) with an Interface to Grid Middleware and three subsystem interfaces: Interface to Storage (Storage Solution), Interface to Checkpointing (Checkpointing Solution), Interface to Network (Networking Solution). Each subsystem is additionally labelled "Alternative Solution". Outcomes: Outcome 1 (Open Source), Outcome 2 (Commercial), Outcome 3 (Non-Comm.)]

Process Subsystem
• Concept: "virtual bubble"
  – Virtualization of Resources
    – virtual network devices, virtual process ids, …
  – Application runs in virtual environment
  – only minimal impact on job runtime
• Checkpoint of entire "virtual bubble"
  – no re-linking necessary
    – also applicable for commercial applications
• Restart of checkpointed "virtual bubble"
  – compatibility has to be ensured
  – application does not detect restart

Network Subsystem
• Provision of FT also for parallel jobs
  – communication between nodes
  – checkpoint of network state necessary
    – ensuring consistency between process and network at restart
• Network Checkpointing
  – checkpoint of network queues
  – checkpoint of in-transit packets
• Cooperative Checkpoint Protocol (CCP)
  – direct communication between process checkpointing and network checkpointing

Storage Subsystem
• Task of Storage Subsystem
  – Storage-related QoS
  – Checkpointing of storage
• Overall consistency of checkpoint
  – restoring state of storage at process checkpointing time
  – checkpoint = process + network + storage
• Storage Checkpoint may be huge
  – Problem: Delay until restart on remote resource
    – Grid migration over "slow" WAN connections
  – Solution: Data replication with COW (copy on write)
    – precautionary data transfer to remote resource

Generation of a new Checkpoint
[Figure: message sequence between RMS, CP (checkpointing), Network, and Storage]
1. CP job + halt
2. In-transit packets
3. Return: "Checkpoint completed!"
4. Snapshot!
5. Link to snapshot
6. Resume job
7. Job running again
8. Migration from last checkpoint
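The ordering of steps 1 to 6 on this slide can be sketched as follows. This is not the HPC4U implementation: the `Recorder` stand-in, the function name, and the assignment of each step to a particular subsystem are assumptions made for illustration. The point of the sketch is the ordering constraint: the job is halted before network and storage state are captured, and it is resumed only after the storage snapshot exists.

```python
class Recorder:
    """Stand-in for a subsystem; it only records the calls it receives."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def call(self, action):
        self.log.append(f"{self.name}: {action}")


def generate_checkpoint(log):
    """Append the checkpoint steps to `log` in the order shown on the slide."""
    process = Recorder("process", log)
    network = Recorder("network", log)
    storage = Recorder("storage", log)

    process.call("checkpoint job and halt")        # step 1
    network.call("save in-transit packets")        # step 2
    process.call("return 'checkpoint completed'")  # step 3
    storage.call("create snapshot")                # step 4
    storage.call("link snapshot to checkpoint")    # step 5
    process.call("resume job")                     # step 6
    return log


steps = generate_checkpoint([])
```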

Checkpointing
• Backup of consistent image of running job
  – running process
  – network state in case of parallel jobs
  – storage partition
• Process checkpointing causes delay in job completion
  – depends on number of jobs, memory, interconnect, …
• Delay has to be taken into account at job scheduling
  – partition size = estimated runtime + checkpointing overhead
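The rule "partition size = estimated runtime + checkpointing overhead" can be made concrete with a short calculation. The sketch below assumes a fixed cost per checkpoint and regular intervals; the function name and the 5-minute cost in the example are assumptions, since the actual overhead depends on the checkpointing system and the job.

```python
import math
from datetime import timedelta


def partition_size(estimated_runtime, checkpoint_interval, cost_per_checkpoint):
    """Reserved time = estimated runtime + number of checkpoints * cost of one."""
    # Count only interval boundaries strictly inside the runtime:
    # a checkpoint at the very end of the job would protect nothing.
    n_checkpoints = math.ceil(estimated_runtime / checkpoint_interval) - 1
    n_checkpoints = max(n_checkpoints, 0)
    return estimated_runtime + n_checkpoints * cost_per_checkpoint


# 7 h job, checkpoint every 60 minutes, assumed 5 minutes per checkpoint:
# 6 checkpoints add 30 minutes to the reservation.
size = partition_size(
    timedelta(hours=7), timedelta(minutes=60), timedelta(minutes=5)
)
```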

Phases of Operation
• Negotiation of SLA
• Pre-Runtime: Configuration of Resources
  – e.g. network, storage, compute nodes
• Runtime: Stage-In, Computation, Stage-Out
• Post-Runtime: Re-configuration
[Figure: timeline of the phases Negotiation, Pre-Runtime, Runtime (Stage-In, Computation, Stage-Out), Post-Runtime. Labels: "Acceptance (or rejection) of SLA", "Lifetime of SLA", "Allocation of system resources"]
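The phase sequence can be read as a small state machine. The sketch below is one possible encoding; the enum, the extra `REJECTED` and `DONE` states, and the helper names are assumptions added for illustration.

```python
from enum import Enum


class Phase(Enum):
    NEGOTIATION = 1
    PRE_RUNTIME = 2
    STAGE_IN = 3
    COMPUTATION = 4
    STAGE_OUT = 5
    POST_RUNTIME = 6
    REJECTED = 7   # assumed terminal state: SLA was not accepted
    DONE = 8       # assumed terminal state: resources released


_NEXT = {
    Phase.NEGOTIATION: Phase.PRE_RUNTIME,
    Phase.PRE_RUNTIME: Phase.STAGE_IN,
    Phase.STAGE_IN: Phase.COMPUTATION,
    Phase.COMPUTATION: Phase.STAGE_OUT,
    Phase.STAGE_OUT: Phase.POST_RUNTIME,
    Phase.POST_RUNTIME: Phase.DONE,
}


def advance(phase, sla_accepted=True):
    """Return the phase that follows `phase`."""
    if phase is Phase.NEGOTIATION and not sla_accepted:
        return Phase.REJECTED
    return _NEXT[phase]


def lifecycle(sla_accepted=True):
    """List every phase a request passes through, starting at negotiation."""
    phase = Phase.NEGOTIATION
    seen = [phase]
    while phase not in (Phase.DONE, Phase.REJECTED):
        phase = advance(phase, sla_accepted)
        seen.append(phase)
    return seen
```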

Negotiation Phase
• Negotiation
  – Grid customer and provider try to agree on a Service Level Agreement
    – which resources have to be provided?
    – which QoS level is required?
      o specification of a deadline
• RMS in central position
  – steering of negotiation process
  – current system condition has to be taken into account
[Figure: phase timeline]

Pre-Runtime Phase
[Figure: phase timeline]
• Task of Pre-Runtime Phase
  – Configuration of all allocated resources
  – Goal: Fulfill requirements of SLA
• Reconfiguration affects all system elements
  – Resource Management System
    – e.g. configuration of assigned compute nodes
  – Storage Subsystem
    – e.g. initialization of a new data partition
  – Network Subsystem
    – e.g. configuration of network infrastructure

Runtime Phase
• Runtime Phase
  – lifetime of job in system
  – adherence to SLA has to be assured
  – FT mechanisms have to be utilized
• Phase consists of three distinct steps
  – Stage-In
    – transmission of required input data from Grid customer to compute resource
  – Computation
    – execution of application
  – Stage-Out
    – transmission of generated output data from compute resource back to Grid customer
[Figure: phase timeline]

Post-Runtime Phase
• Task of Post-Runtime Phase:
  – Re-Configuration of all resources
    – e.g. re-configuration of network
    – e.g. deletion of checkpoint datasets
    – e.g. deletion of temporary data
  – Counterpart to Pre-Runtime Phase
• Allocation of resources ends
  – Update of schedules in RMS and storage
  – Resources are available for new jobs
[Figure: phase timeline]

Negotiation of new SLA
• Incoming SLA-request
  – 3 nodes, 7 h runtime, earliest start: 20:00, deadline 6:00
  – request can be accepted, buffer time frame 2 h
• Regular Checkpointing
  – new checkpoint to be generated every 60 minutes
  – checkpointing causes delay in job completion
    – depends on CP-system and job size
[Figure: schedule chart; time axis labels: 12am, 6pm, 12pm, 6am]
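A first feasibility test for such a request is whether the runtime fits between earliest start and deadline. The sketch below checks only this time window; it ignores node availability and checkpointing overhead, so it is not meant to reproduce the 2 h buffer stated on the slide. The function names and the calendar dates are assumptions (the slide gives clock times only).

```python
from datetime import datetime, timedelta


def latest_start(deadline, runtime):
    """Latest point in time at which the job may start and still finish in time."""
    return deadline - runtime


def fits_window(earliest_start, deadline, runtime):
    """True if a job started at `earliest_start` or later can meet the deadline."""
    return earliest_start <= latest_start(deadline, runtime)


# Request from the slide: 7 h runtime, earliest start 20:00, deadline 6:00.
earliest = datetime(2006, 7, 24, 20, 0)   # assumed date
deadline = datetime(2006, 7, 25, 6, 0)    # assumed date (next morning)
runtime = timedelta(hours=7)

feasible = fits_window(earliest, deadline, runtime)
```

Under these assumptions the latest feasible start is 23:00, which is the situation of the request on the next slide.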

Suspending Jobs
• valuable resources may be blocked by non-SLA jobs
  – 23:00: SLA-request: 3 nodes, 7 hours, deadline 6:00
    – insufficient capacity: rejection of new SLA-request
• checkpoint and suspend of non-SLA jobs (best effort only)
• acceptance of request and execution of SLA-job
• resume of suspended non-SLA job
  – completion of best-effort job, completion of SLA-job
[Figure: schedule chart; time axis labels: 12am, 6pm, 12pm, 6am]
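The decision on this slide (suspend best-effort jobs to make room for an SLA request, otherwise reject) can be sketched as below. The `Job` record, the function name, the greedy selection order, and the 4-node cluster in the example are all assumptions; a real RMS would also checkpoint each job before suspending it.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    sla_bound: bool
    suspended: bool = False


def admit_sla_job(total_nodes, running, requested_nodes):
    """Return the names of jobs suspended to make room, or None on rejection."""
    free = total_nodes - sum(j.nodes for j in running if not j.suspended)
    to_suspend = []
    for job in running:
        if free >= requested_nodes:
            break
        # Only best-effort (non-SLA) jobs are candidates for suspension.
        if not job.sla_bound and not job.suspended:
            to_suspend.append(job)
            free += job.nodes
    if free < requested_nodes:
        return None  # insufficient capacity: reject, suspend nothing
    for job in to_suspend:
        job.suspended = True  # checkpoint + suspend would happen here
    return [j.name for j in to_suspend]


# Hypothetical 4-node cluster: one best-effort job and one SLA job running.
jobs = [Job("best-effort-A", 2, sla_bound=False), Job("sla-B", 1, sla_bound=True)]
suspended = admit_sla_job(total_nodes=4, running=jobs, requested_nodes=3)
```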

Increasing system utilization
• Jobs request a number of nodes and a runtime
  – users do not align their requests to free capacities
• Reservations must be guaranteed
  – no other complete job fits into the gaps
• Using job suspension to fill gaps with partial job execution
• Realization of "background jobs"

Runtime of SLA job
• Pre-runtime phase
  – configuration of network, storage, and nodes
• Runtime phase
  – Monitoring of system
  – Regular checkpointing
• Post-runtime phase
[Figure: schedule chart; time axis labels: 12am, 6pm, 12pm, 6am]

Handling of Resource Failures
• Resource outage in partition of job
  – job crashes immediately
  – last checkpoint after 4 h runtime
    – computation time since last checkpoint is lost
• allocation of partition with 3 h runtime
• restore from last checkpointed state
• scheduling of regular checkpoint intervals
• resuming computation
[Figure: schedule chart; time axis labels: 12am, 6pm, 12pm, 6am]
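The 3 h partition follows from the numbers of the running example: a 7 h job whose last checkpoint covers 4 h of computation has 3 h left. The sketch below works this out and checks whether a restart can still meet the deadline. The function names, the optional restart overhead, and the clock times of the failure are assumptions for illustration.

```python
from datetime import datetime, timedelta


def remaining_runtime(total_runtime, checkpointed_progress):
    """Work still to be done when restarting from the last checkpoint."""
    return total_runtime - checkpointed_progress


def can_recover(now, deadline, total_runtime, checkpointed_progress,
                restart_overhead=timedelta(0)):
    """True if restarting now from the last checkpoint still meets the deadline."""
    needed = remaining_runtime(total_runtime, checkpointed_progress) + restart_overhead
    return now + needed <= deadline


total = timedelta(hours=7)
progress = timedelta(hours=4)            # last checkpoint after 4 h runtime
deadline = datetime(2006, 7, 25, 6, 0)   # assumed date, deadline 6:00

left = remaining_runtime(total, progress)  # 3 h, the size of the new partition
```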

Availability of spare resources
• Migration presumes availability of resources
  – but: resources may be blocked by other jobs
• Solution: Suspension of other jobs
• Problem: What to do in case of SLA-jobs blocking the resource?
  – SLA-job can only be suspended if its deadline can still be met
• Buffer nodes: execution of non-SLA jobs only
[Figure: schedule chart; time axis labels: 12am, 6pm, 12pm, 6am]
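The condition for suspending an SLA-job can be stated as a simple predicate: after the suspension, the remaining runtime must still fit before the deadline. The sketch below is one way to express it; the function name, parameters, and example times are assumptions.

```python
from datetime import datetime, timedelta


def may_suspend(sla_bound, now, suspend_for, remaining_runtime, deadline=None):
    """Decide whether a running job may be suspended for `suspend_for`."""
    if not sla_bound:
        return True  # best-effort jobs carry no deadline guarantee
    # An SLA job may be suspended only if it can still finish in time.
    return now + suspend_for + remaining_runtime <= deadline


now = datetime(2006, 7, 24, 22, 0)       # assumed current time
deadline = datetime(2006, 7, 25, 6, 0)   # assumed deadline of the blocking SLA job

ok_short = may_suspend(True, now, timedelta(hours=3), timedelta(hours=4), deadline)
ok_long = may_suspend(True, now, timedelta(hours=5), timedelta(hours=4), deadline)
```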

Cross-border migration
• Goal: Successful execution of SLA-jobs
  – handling of failures depends on local load situation
  – goal of provider: utilization of resources
  – high load + massive failure → no migration → SLA violation
• Idea: Cross-border migration
  – usage of resources on other local machines
    – multiple clusters available on most sites
  – transfer of checkpoint dataset to remote cluster
  – resume of job from checkpointed state
[Figure: two RMS instances]

Grid Migration
• Cross-border migration enhances FT-level
  – additional alternatives for migration process
• Grid Migration = usage of Grid as migration target
  – Virtual Resource Manager as active Grid component
  – Negotiation with Grid on resources
• Migration process
  – request for spare resources
  – transfer using standard protocols
• Transparent for the user
  – user will receive results from new site
• Problem: Compatibility of resources

Compatibility Profile
• Checkpoint dataset needs compatible resources for restart
  – processor architecture
  – main and storage memory
  – interconnect type
  – libraries
    – exact version for loaded libs
    – compatible version for unloaded libs
  – paths
• Compatibility profile describes requirements of checkpointed jobs
• Resource query according to this profile
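A resource query against a compatibility profile could look like the sketch below. It is not the HPC4U profile format: the dictionary keys, the library names, and in particular the definition of "compatible version" (same major version, minor version at least as high) are assumptions chosen for illustration.

```python
def is_compatible(profile, resource):
    """Check a candidate resource against the profile of a checkpointed job."""
    if profile["arch"] != resource["arch"]:
        return False
    if resource["memory_gb"] < profile["memory_gb"]:
        return False
    if resource["storage_gb"] < profile["storage_gb"]:
        return False
    if profile["interconnect"] != resource["interconnect"]:
        return False
    libs = resource["libs"]  # name -> (major, minor)
    # Libraries already loaded into the process image need the exact version.
    for name, version in profile["loaded_libs"].items():
        if libs.get(name) != version:
            return False
    # Libraries not yet loaded only need a compatible version
    # (assumption: same major, minor >= required).
    for name, version in profile["unloaded_libs"].items():
        have = libs.get(name)
        if have is None or have[0] != version[0] or have[1] < version[1]:
            return False
    # Every path the job relies on must exist on the target.
    return set(profile["paths"]) <= set(resource["paths"])


# Hypothetical profile and candidate sites.
profile = {
    "arch": "x86_64", "memory_gb": 4, "storage_gb": 20,
    "interconnect": "gigabit-ethernet",
    "loaded_libs": {"libc": (2, 3)},
    "unloaded_libs": {"libm": (1, 2)},
    "paths": ["/opt/app"],
}
good_site = {
    "arch": "x86_64", "memory_gb": 8, "storage_gb": 100,
    "interconnect": "gigabit-ethernet",
    "libs": {"libc": (2, 3), "libm": (1, 4)},
    "paths": ["/opt/app", "/scratch"],
}
bad_site = dict(good_site, libs={"libc": (2, 4), "libm": (1, 4)})
```

Here `bad_site` fails because a loaded library differs in version, even though the newer version might work for a freshly started job.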

Grid Integration
[Figure: three Grid Customers, the Grid Middleware, and three HPC4U Grid Fabrics; a Grid Interface connects the Grid Middleware to the RMS of an HPC4U Grid Fabric]

Summary
• New requirements from future commercial Grids
  – Transparent fault tolerance, SLA negotiation and management
• SLA-aware Resource Management System
  – orchestrated operation of subsystems for Process, Storage, and Network
  – SLA scheduling in RMS
• Cross-border migration for increased FT level
  – virtual resource management, compatibility profile
• Progress
  – support for single-node jobs running
  – support of parallel applications close to completion
  – next: cross-border migration

Further Information
• please visit our website: http://www.hpc4u.org
• you will find …
  – … general information about HPC4U
  – … movies showing fault tolerance in action
  – … downloadable demo system for playing
  – … links and contact addresses