cloud computing for hpc - university of oklahoma · 2009. 10. 7. · • cloud is happening due to...

21
Cloud Computing for HPC William Lu Platform Computing

Upload: others

Post on 05-Feb-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • Cloud Computing for HPC

    William LuPlatform Computing

  • Trends

    • Increasing demand for compute power in simulation, analysis, and research– More Data, More Compute, More Analysis

    • Organizations are searching for an integrated management stack to further improve user service level and simplify HPC operations

  • Plans to Deploy Private Cloud in 2009

    Source: International Supercomputing Conference 2009 attendee survey

    Yes28%

    No72%

  • Motivations for Private Cloud

    41%

    9%

    17%

    15%

    18%

    0% 10% 20% 30% 40% 50%

    Improving efficiency

    Improving IT responsiveness

    Cutting costs

    Experimenting with cloud …

    Resource scalability

    Source: International Supercomputing Conference 2009 attendee survey

  • Barriers to Private Cloud

    37%

    21%

    8%

    26%

    8%

    0% 10% 20% 30% 40%

    Organisational culture

    Security

    Upfront costs

    Complexity of managing

    Application software licensing

    Source: International Supercomputing Conference 2009 attendee survey

  • Challenges in HPC grid/clusters

    • Job interference– Memory leak– Run time environment (OS+tools)

    • Security• “One fits all” scheduler

    – Fair share with history vs. current resource allocation

  • Reality Behind Resource Allocation

    • Static (advance) Allocation vs. Real Workload

    • Realities– Lacking of visibility of real workload– This delivers SLA/QoS– But… real utilization is low

    Time

    ResourceAdvanced Allocation

    Real workload

    Backfilled Allocation

    Backfilled over run job

  • Elastic vs. Traditional Scheduling

    4Resource

    132

    Time

    4

    Time

    Resource1

    32

    Elastic scheduling• Better SLA• Higher utilization

    Traditional scheduling

  • Resource Scavenging

    • Scavenging resources in desktop, labs, external data centers, public cloud…

    4Resource

    13

    2Time

    Resources in HPC center

    Resources outside of HPC center

  • Different job environment

    SLES 10

    Resource

    RHEL

    Scientific Linux

    SLES 11

    Time

    • Job runtime environment needs to be dynamically scheduled together with the job

  • Moving from Cluster/Grid to Cloud

    Cluster

    Single App

    Multiple Apps /Multiple Clusters

    Grid

    Cloud

    Dynamic and elastic

  • Management of Private Cloud for HPC

    Workload Management

    Servers Storage Network

    VMmanagement Provisioning

    Private Cloud Management

    Internal Data center

    External Data Center/ Public

    Cloud

    HPC Applications

    Resource Integration

    Allocation Engine

    Service Delivery

    Application & workload integrations

    Match supplyto demand

    Service creation, access, delivery, accounting

    Create a shared pool of physical & virtual resources

  • Software Stack from Platform Computing

    Platform LSF

    Servers Storage Network

    VMmanagement Provisioning

    Platform ISFPlatform ISF HPC Cluster

    Internal Data center

    External Data Center/ Public

    Cloud

    Platform MPI

  • Large Research Institute

    Before AfterUser support incidence 15/day 2/day

    Job failure due to app environment conflict

    5/day ~0

    App env set up time 1 weeks 1 day

    Profile Problems SolutionHPC center with thousands of servers used by multiple user groups for data analysis

    • Application environment set up is too long• Job fails due to application environment conflicts on the same server• Scheduling policy conflicts across multiple user groups causing user dissatisfaction

    • Elastic and autonomous Platform LSF clusters scheduled by Platform ISF for each user group• When workload changes, Platform ISF dynamically add and remove resources from individual clusters• VM is used as job container to reduce job interference

  • First-come, first server capacityOn-Demand:Pay-per-use for non-critical or burst workloads

    Reserve resources for critical workloadsGuaranteedReservation: Optimization utilization, guaranteed capacity

    Reservation & On-Demand Allocations

  • PackingPack workload on fewest number of physical serversMaximizes usable capacity, reduces fragmentations, reduce energy consumption

    Striping Spread workload across as many physical servers as possibleReduce impact of host failures, higher application performance

    Load-Aware Allocate physical servers with lowest load to new workloadsHigher application performance

    HA-Aware Allocate HA-enabled resources to critical workloadsMatch availability levels to service requirements and costs

    Energy-AwarePlace workload according to energy indices and datacenter hot spotsReduce energy consumption

    Affinity-Aware Place workload close to critical resources such as storageHigher application performance

    Server Model- Aware

    Allocate resource to workload according to model typesMaximize utilization of higher performing & more expensive resources

    Topology-AwareAllocate resources on the same interconnect to the same applicationImprove application performance

    Resource-Aware Allocation Policies

  • Reporting and Accounting

    10/9/2009 17

  • User Access Point – LSF Web Portal

    App Centric Interface

    Job notification

  • User Access Point – LSF Web Portal

    Job data management

    Customize without HTML programming

  • Summary

    • Cloud is happening due to its superior SLA• Cloud is to add dynamics and elasticity to

    grid and cluster environment• Platform ISF delivers elastic logical

    clusters in a shared HPC infrastructure. It transforms cluster and grid to cloud

  • Thank You

    www.platform.com

    Cloud Computing for HPCTrendsPlans to Deploy Private Cloud in 2009Motivations for Private CloudBarriers to Private CloudChallenges in HPC grid/clustersReality Behind Resource AllocationElastic vs. Traditional SchedulingResource ScavengingDifferent job environmentMoving from Cluster/Grid to CloudManagement of Private Cloud for HPCSoftware Stack from Platform ComputingLarge Research InstituteReservation & On-Demand AllocationsResource-Aware Allocation PoliciesReporting and AccountingUser Access Point – LSF Web PortalUser Access Point – LSF Web PortalSummaryThank You