cloud computing for hpc - university of oklahoma · 2009. 10. 7. · • cloud is happening due to...

Cloud Computing for HPC

William Lu, Platform Computing

Trends

• Increasing demand for compute power in simulation, analysis, and research
  – More data, more compute, more analysis

• Organizations are searching for an integrated management stack to further improve user service level and simplify HPC operations

Plans to Deploy Private Cloud in 2009

Source: International Supercomputing Conference 2009 attendee survey

Yes: 28%

No: 72%

Motivations for Private Cloud

Improving efficiency: 41%
Improving IT responsiveness: 9%
Cutting costs: 17%
Experimenting with cloud …: 15%
Resource scalability: 18%


Barriers to Private Cloud

Organisational culture: 37%
Security: 21%
Upfront costs: 8%
Complexity of managing: 26%
Application software licensing: 8%


Challenges in HPC grid/clusters

• Job interference
  – Memory leak
  – Runtime environment (OS + tools)
• Security
• “One fits all” scheduler
  – Fair share with history vs. current resource allocation

Reality Behind Resource Allocation

• Static (advance) Allocation vs. Real Workload

• Realities
  – Lack of visibility into the real workload
  – Static allocation delivers SLA/QoS
  – But… real utilization is low

[Figure: resource vs. time chart. Labels: Advanced Allocation, Real workload, Backfilled Allocation, Backfilled over run job]
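The gap between a static allocation and the real workload can be put in numbers. The sketch below is a simplified illustration, not the scheduler described in the slides: it assumes hypothetical per-slot core counts, a fixed reserved capacity, and queued jobs that each run for exactly one time slot. It computes utilization before and after greedily backfilling idle capacity.

```python
def utilization(used, capacity):
    """Fraction of the reserved capacity-time that was actually consumed."""
    return sum(used) / (capacity * len(used))


def backfill(real_load, capacity, queue):
    """Greedily place queued one-slot jobs into idle capacity.

    real_load: cores used per time slot by the reserved workload
    queue:     core counts of waiting jobs (hypothetical, one slot each)
    Returns (per-slot load after backfilling, jobs still waiting).
    """
    load = list(real_load)
    waiting = list(queue)
    for t in range(len(load)):
        still_waiting = []
        for cores in waiting:
            if load[t] + cores <= capacity:
                load[t] += cores          # job fits in the idle capacity
            else:
                still_waiting.append(cores)
        waiting = still_waiting
    return load, waiting


# Hypothetical numbers: 8 cores reserved, real workload uses far fewer.
capacity = 8
real = [2, 3, 1, 2]
before = utilization(real, capacity)
filled, left = backfill(real, capacity, [4, 4, 3, 2, 5])
after = utilization(filled, capacity)
print(f"utilization before: {before:.2%}, after backfill: {after:.2%}")
```

A production scheduler would also have to handle jobs that overrun their slot, which is the "backfilled over run job" case in the figure; this sketch ignores it.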

Elastic vs. Traditional Scheduling

[Figure: two resource vs. time charts placing jobs 1 to 4, one per scheduling approach]

Elastic scheduling
• Better SLA
• Higher utilization

Traditional scheduling

Resource Scavenging

• Scavenging resources from desktops, labs, external data centers, public cloud…

[Figure: resource vs. time chart placing jobs 1 to 4 across resources in the HPC center and resources outside of the HPC center]

Different job environment

[Figure: resource vs. time chart in which resources run different environments: SLES 10, RHEL, Scientific Linux, SLES 11]

• Job runtime environment needs to be dynamically scheduled together with the job
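One way to picture scheduling the runtime environment together with the job is the following minimal sketch. The `Host`, `Job`, and `place` names are hypothetical and not part of any Platform product API: the scheduler first looks for an idle host that already runs the required OS and only re-provisions a host when none matches.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    os: str
    busy: bool = False


@dataclass
class Job:
    name: str
    required_os: str


def place(job, hosts):
    """Return (host, reprovisioned), or (None, False) if no host is idle."""
    idle = [h for h in hosts if not h.busy]
    # Prefer an idle host that already runs the required environment.
    for h in idle:
        if h.os == job.required_os:
            h.busy = True
            return h, False
    # Otherwise re-provision an idle host with the required environment.
    if idle:
        h = idle[0]
        h.os = job.required_os
        h.busy = True
        return h, True
    return None, False


hosts = [Host("n1", "SLES 10"), Host("n2", "RHEL")]
first = place(Job("a", "RHEL"), hosts)               # matches n2 as-is
second = place(Job("b", "Scientific Linux"), hosts)  # re-provisions n1
third = place(Job("c", "SLES 11"), hosts)            # nothing idle
```

In practice re-provisioning has a cost (time to boot an image or start a VM), so a real policy would weigh that cost against waiting for a matching host.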

Moving from Cluster/Grid to Cloud

Cluster: single app

Grid: multiple apps / multiple clusters

Cloud: dynamic and elastic

Management of Private Cloud for HPC

HPC Applications

Workload Management

Private Cloud Management
• Service Delivery: service creation, access, delivery, accounting
• Allocation Engine: match supply to demand
• Resource Integration: create a shared pool of physical & virtual resources
• Application & workload integrations

Servers, Storage, Network, VM management, Provisioning

Internal Data Center

External Data Center / Public Cloud

Software Stack from Platform Computing

Platform MPI

Platform LSF

Platform ISF, Platform ISF HPC Cluster

Servers, Storage, Network, VM management, Provisioning

Internal Data Center

External Data Center / Public Cloud

Large Research Institute

Profile
HPC center with thousands of servers used by multiple user groups for data analysis

Problems
• Application environment setup takes too long
• Jobs fail due to application environment conflicts on the same server
• Scheduling policy conflicts across multiple user groups cause user dissatisfaction

Solution
• Elastic and autonomous Platform LSF clusters, scheduled by Platform ISF, for each user group
• When workload changes, Platform ISF dynamically adds and removes resources from individual clusters
• VMs are used as job containers to reduce job interference

Results                                        Before    After
User support incidents                         15/day    2/day
Job failures due to app environment conflict   5/day     ~0
App environment setup time                     1 week    1 day
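The idea of adding and removing resources from per-group clusters as workload changes can be sketched as a proportional rebalancing step. This is an illustrative assumption about how such a policy might work, not Platform ISF's actual algorithm; the `rebalance` function, the cluster names, and the demand figures are all hypothetical.

```python
def rebalance(demand, total_hosts, minimum=1):
    """Split total_hosts across clusters in proportion to pending demand.

    demand:  mapping of cluster name -> pending work (any consistent unit)
    minimum: hosts every cluster keeps even when it has no demand
    """
    clusters = list(demand)
    if total_hosts < minimum * len(clusters):
        raise ValueError("not enough hosts for the per-cluster minimum")
    shares = {c: minimum for c in clusters}
    spare = total_hosts - minimum * len(clusters)
    total_demand = sum(demand.values())
    if total_demand == 0 or spare == 0:
        return shares  # unassigned hosts stay in the shared pool
    exact = {c: spare * demand[c] / total_demand for c in clusters}
    for c in clusters:
        shares[c] += int(exact[c])
    # Hand out the remaining hosts by largest fractional part.
    leftover = total_hosts - sum(shares.values())
    by_fraction = sorted(clusters, key=lambda c: exact[c] - int(exact[c]),
                         reverse=True)
    for c in by_fraction[:leftover]:
        shares[c] += 1
    return shares


# Hypothetical user groups sharing 13 hosts.
morning = rebalance({"genomics": 30, "physics": 10, "climate": 0}, 13)
evening = rebalance({"genomics": 0, "physics": 10, "climate": 30}, 13)
print(morning, evening)
```

Calling `rebalance` again whenever demand shifts moves hosts from idle clusters to busy ones, while the `minimum` keeps each group's cluster alive.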

Reservation & On-Demand Allocations

• On-Demand: first-come, first-served capacity. Pay-per-use for non-critical or burst workloads
• Reservation (guaranteed): reserve resources for critical workloads. Optimized utilization, guaranteed capacity

Resource-Aware Allocation Policies

• Packing: pack workload on the fewest physical servers. Maximizes usable capacity, reduces fragmentation, reduces energy consumption
• Striping: spread workload across as many physical servers as possible. Reduces the impact of host failures, higher application performance
• Load-Aware: allocate the physical servers with the lowest load to new workloads. Higher application performance
• HA-Aware: allocate HA-enabled resources to critical workloads. Matches availability levels to service requirements and costs
• Energy-Aware: place workload according to energy indices and datacenter hot spots. Reduces energy consumption
• Affinity-Aware: place workload close to critical resources such as storage. Higher application performance
• Server Model-Aware: allocate resources to workload according to model types. Maximizes utilization of higher-performing and more expensive resources
• Topology-Aware: allocate resources on the same interconnect to the same application. Improves application performance
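Three of these policies (packing, striping, and load-aware) differ only in how they rank the candidate servers, which a short sketch can make concrete. The host records, field names, and functions below are hypothetical illustrations, not the Platform ISF interface; striping is approximated here as "pick the emptiest host".

```python
def _fitting(hosts, need):
    """Hosts with at least `need` free slots."""
    return [h for h in hosts if h["free"] >= need]


def pack(hosts, need):
    """Packing: choose the fullest host that still fits the workload."""
    fits = _fitting(hosts, need)
    return min(fits, key=lambda h: h["free"])["name"] if fits else None


def stripe(hosts, need):
    """Striping (approximation): choose the emptiest host to spread work."""
    fits = _fitting(hosts, need)
    return max(fits, key=lambda h: h["free"])["name"] if fits else None


def load_aware(hosts, need):
    """Load-aware: choose the fitting host with the lowest measured load."""
    fits = _fitting(hosts, need)
    return min(fits, key=lambda h: h["load"])["name"] if fits else None


# Hypothetical pool: free slots and a normalized load figure per host.
hosts = [
    {"name": "a", "free": 2, "load": 0.9},
    {"name": "b", "free": 6, "load": 0.5},
    {"name": "c", "free": 8, "load": 0.7},
]
print(pack(hosts, 2), stripe(hosts, 2), load_aware(hosts, 2))
```

The same request lands on a different host under each policy, which is why the choice of policy depends on whether the goal is consolidation, failure isolation, or performance.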

Reporting and Accounting


User Access Point – LSF Web Portal

App Centric Interface

Job notification

User Access Point – LSF Web Portal

Job data management

Customize without HTML programming

Summary

• Cloud is happening due to its superior SLA
• Cloud adds dynamics and elasticity to grid and cluster environments
• Platform ISF delivers elastic logical clusters in a shared HPC infrastructure. It transforms cluster and grid to cloud

Thank You

www.platform.com
