efficient high performance computing in the cloud

EFFICIENT HIGH PERFORMANCE COMPUTING IN THE CLOUD

Abhishek Gupta ([email protected])

Department of Computer Science,

University of Illinois at Urbana Champaign, Urbana, IL

1


MOTIVATION: WHY CLOUDS FOR HPC?

Rent vs. own, pay-as-you-go
• No startup/maintenance cost or cluster creation time
Elastic resources
• No risk, e.g. in under-provisioning
• Prevents underutilization
Benefits of virtualization
• Flexibility and customization
• Security and isolation
• Migration and resource control

Cloud for HPC: A cost-effective and timely solution?
Multiple providers – Infrastructure as a Service

MOTIVATION: HPC-CLOUD DIVIDE

3

HPC
• Application performance
• Dedicated execution
• HPC-optimized interconnects, OS
• Not cloud-aware

Cloud
• Service, cost, resource utilization
• Multi-tenancy
• Commodity network, virtualization
• Not HPC-aware

Mismatch: HPC requirements and cloud characteristics
Only embarrassingly parallel, small scale HPC applications in clouds

OBJECTIVES

HPC-cloud: What, why, who

How: Bridge HPC-cloud Gap

HPC in cloud

Improve HPC performance

Improve cloud utilization => Reduce cost

OBJECTIVES AND CONTRIBUTIONS

HPC-cloud: What, why, who

How: Bridge HPC-cloud Gap

Perf, cost analysis
Heterogeneity- and multi-tenancy-aware HPC
HPC in cloud

Diagram rows: Goals, Techniques, Tools extended
• Load balancing framework
• OpenStack Nova scheduler
• Object migration
• CloudSim simulator
• Malleable jobs: dynamic shrink/expand
• Application-aware VM consolidation
• Smart selection of platforms for applications

OUTLINE

Performance of HPC in cloud
• Trends
• Challenges and opportunities

Application-aware cloud schedulers
• HPC-aware schedulers: improve HPC performance
• Application-aware consolidation: improve cloud utilization => reduce cost

Cloud-aware HPC runtime
• Dynamic load balancing: improve HPC performance
• Parallel runtime for shrink/expand: improve cloud utilization => reduce cost

Conclusions

EXPERIMENTAL TESTBED AND APPLICATIONS

7

NAS Parallel Benchmarks class B (NPB3.3-MPI)

NAMD - Highly scalable molecular dynamics
ChaNGa - Cosmology, N-body
Sweep3D - A particle transport ASCI code
Jacobi2D - 5-point stencil computation kernel
Nqueens - Backtracking state space search

Platform/Resource | Network
Ranger (TACC) | Infiniband (10 Gbps)
Taub (UIUC) | Voltaire QDR Infiniband
Open Cirrus (HP) | 10 Gbps Ethernet internal; 1 Gbps Ethernet cross-rack
Private Cloud (HP) | Emulated network card under KVM hypervisor (1 Gbps physical Ethernet)
Public Cloud (HP) | Emulated network under KVM hypervisor (1 Gbps physical Ethernet)

PERFORMANCE (1/3)

8

Some applications cloud-friendly

PERFORMANCE (2/3)

Some applications scale up to 16-64 cores

PERFORMANCE (3/3)

10

Some applications cannot survive in cloud

BOTTLENECKS IN CLOUD: COMMUNICATION LATENCY

11

Cloud message latencies (256 μs) are about 64x those of supercomputers (4 μs), i.e. close to 2 orders of magnitude

Low is better

BOTTLENECKS IN CLOUD: COMMUNICATION BANDWIDTH

12

Cloud communication performance off by 2 orders of magnitude – why?

High is better

COMMODITY NETWORK OR VIRTUALIZATION OVERHEAD (BOTH?)

Significant virtualization overhead (physical vs. virtual)
Led to collaborative work on “Optimizing virtualization for HPC – Thin VMs, Containers, CPU affinity” with HP Labs, Singapore.

Low is better

High is better

13

PUBLIC VS. PRIVATE CLOUD

14

Similar network performance for public and private cloud. Then, why does public cloud perform worse?

Heterogeneity and Multi-tenancy

Low is better

15

HETEROGENEITY AND MULTI-TENANCY CHALLENGE

Heterogeneity: Cloud economics is based on:
• Creation of a cluster from an existing pool of resources, and
• Incremental addition of new resources.

Multi-tenancy: Cloud providers run a profitable business by improving utilization of underutilized resources
• Cluster-level: by serving a large number of users,
• Server-level: by consolidating VMs of complementary nature (such as memory- and compute-intensive) on the same server.

Heterogeneity and multi-tenancy are intrinsic in clouds, driven by cloud economics, but
for HPC, one slow processor => all underutilized processors

16

OUTLINE







Conclusions

Challenges/Bottlenecks: Heterogeneity, Multi-tenancy
Opportunities: VM consolidation

Application-aware cloud schedulers
SCHEDULING/PLACEMENT
HPC in HPC-aware cloud

Next …

17

18

BACKGROUND: OPENSTACK NOVA

Nova scheduler is HPC-agnostic

OpenStack
• Open source cloud management system
• The Linux of cloud management

Nova scheduler (VM placement)
• Ignores the nature of the application
• Ignores heterogeneity and network topology
• Considers the k VMs requested by an HPC user as k separate placement problems
• No correlation between the VMs of a single request

19

HARDWARE, TOPOLOGY-AWARE VM PLACEMENT

CPU Timelines of 8 VMs running Jacobi2D – one iteration

OpenStack on Open Cirrus test bed at HP Labs. 3 types of servers:
• Intel Xeon E5450 (3.00 GHz)
• Intel Xeon X3370 (3.00 GHz)
• Intel Xeon X3210 (2.13 GHz)

KVM as hypervisor, virtio-net for network virtualization, VMs: m1.small

20% improvement in time, across all processors

Decrease in execution time

20

WHAT ABOUT MULTI-TENANCY: VM CONSOLIDATION FOR HPC IN CLOUD (1)

HPC performance (prefers dedicated execution) vs. resource utilization (shared usage in cloud)?

Up to 23% savings
How much interference?

[Figure: consolidation of VM requests; labels: 0.5 GB, VM Requests]

21

VM CONSOLIDATION FOR HPC IN CLOUD (2)

Experiment: Shared mode (2 apps on each node, 2 cores each on a 4-core node), performance normalized with respect to dedicated mode

Challenge: Interference

EP = Embarrassingly Parallel
LU = LU factorization
IS = Integer Sort
ChaNGa = Cosmology

4 VMs per app
High is better

Careful co-locations can actually improve performance. Why?
Correlation: LLC misses/sec and shared mode performance.

Scope

22

METHODOLOGY: (1) APPLICATION CHARACTERIZATION

Characterize applications along two dimensions:

1. Cache intensiveness
• Assign each application a cache score (in units of 100K LLC misses/sec)
• Representative of the pressure they put on the last level cache and memory controller subsystem

2. Parallel synchronization and network sensitivity
• ExtremeHPC: IS (Parallel sorting)
• SyncHPC: LU, ChaNGa
• AsyncHPC: EP, MapReduce applications
• NonHPC: Web applications

23

METHODOLOGY: (2) HPC-AWARE SCHEDULER: DESIGN AND IMPLEMENTATION

Dedicated execution for ExtremeHPC

Resource packing for the remaining classes

Topo-awareness for ExtremeHPC, SyncHPC

Less aggressive packing for SyncHPC - bulk sync apps

Cross-interference aware using LLC misses/sec

Co-locate applications with complementary profiles
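
The design rules above can be sketched as a host-eligibility test. This is only an illustrative reading of the slide, not the actual Nova patch: the `VM` and `Host` types, the `can_place` function, and the rule that the summed cache score on a host must stay within a threshold `beta` are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    app_id: str
    app_class: str      # "ExtremeHPC", "SyncHPC", "AsyncHPC" or "NonHPC"
    cores: int
    cache_score: float  # LLC misses/sec, in units of 100K


@dataclass
class Host:
    total_cores: int
    vms: list = field(default_factory=list)

    def free_cores(self):
        return self.total_cores - sum(vm.cores for vm in self.vms)


def can_place(host, vm, beta):
    """Hypothetical eligibility test for placing `vm` on `host`."""
    if vm.cores > host.free_cores():
        return False
    others = [r for r in host.vms if r.app_id != vm.app_id]
    if vm.app_class == "ExtremeHPC":
        # Dedicated execution: never share a host with another application.
        return not others
    if any(r.app_class == "ExtremeHPC" for r in others):
        return False
    # Assumed interference rule: summed cache score must stay within beta.
    return sum(r.cache_score for r in host.vms) + vm.cache_score <= beta
```

With `beta = 15`, a host already running a VM with cache score 12 would accept a VM with score 1 but reject one with score 10, which is one way to steer complementary profiles onto the same server.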

24

MDOBP (MULTI-DIMENSIONAL ONLINE BIN PACKING):

Pack a VM request into hosts (bins)
Dimension-aware heuristic
Select the host for which the vector of requested resources aligns the most with the vector of remaining capacities*
• the host with the minimum α, where cos(α) is calculated using the dot product of the two vectors, and is given by:

$$\cos(\alpha) = \frac{CPURes \cdot CPUReq + MemRes \cdot MemReq}{\sqrt{CPURes^2 + MemRes^2}\,\sqrt{CPUReq^2 + MemReq^2}}$$

Residual capacities of a host: (CPURes, MemRes); requested VM: (CPUReq, MemReq).

[Figure: requested and remaining resource vectors of a physical host, separated by angle α; axes: CPUs, Memory]

*S. Lee, R. Panigrahy, V. Prabhakaran, V. Ramasubramanian, K. Talwar, L. Uyeda, and U. Wieder, “Validating Heuristics for Virtual Machines Consolidation,” Microsoft Research, Tech. Rep., 2011.
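
A minimal sketch of the vector-alignment heuristic, assuming two dimensions (CPU, memory) and that both are expressed in comparable, normalized units. The function names `cos_alpha` and `pick_host` and the dictionary layout are illustrative, not taken from the scheduler implementation.

```python
import math


def cos_alpha(residual, requested):
    """Cosine of the angle between the residual and requested vectors."""
    dot = sum(r * q for r, q in zip(residual, requested))
    norm = math.hypot(*residual) * math.hypot(*requested)
    return dot / norm if norm else -1.0


def pick_host(hosts, requested):
    """hosts: {name: (cpu_res, mem_res)}; requested: (cpu_req, mem_req).

    Among hosts with enough residual capacity in every dimension,
    return the one with the smallest angle (largest cosine), or None.
    """
    feasible = {name: res for name, res in hosts.items()
                if all(r >= q for r, q in zip(res, requested))}
    if not feasible:
        return None
    return max(feasible, key=lambda name: cos_alpha(feasible[name], requested))
```

Maximizing cos(α) is equivalent to minimizing α on [0, π], so a request shaped like (2 CPUs, 1 GB) is steered to a host whose leftover capacity has the same CPU-to-memory ratio.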

25

IMPLEMENTATION ATOP OPEN STACK NOVA

26

RESULTS: CASE STUDY OF APPLICATION-AWARE SCHEDULING

EP = Embarrassingly Parallel
LU = LU factorization
IS = Integer Sort

Label format, e.g. IS.B.4: application, problem size, number of requested VMs

8 nodes (32 cores)

• Performance gains up to 45% for a single application
• Limiting negative impact of interference to 8%
• But, what about resource utilization?

High is better

Less aggressive packing

27

SIMULATION

CloudSim: simulation tool for modeling a cloud computing environment in a datacenter
Extended the existing vmAllocationPolicySimple class to create a vmAllocationPolicyHPC
• Handle a user request comprising multiple VM instances
• Perform application-aware scheduling
Implemented dynamic VM creation and termination

28

SIMULATION RESULTS

• Assigned each job a cache score from (0-30) using a uniform distribution random number generator
• Modified execution times by -10% and -20% to account for the improvement in performance resulting from cache-awareness
• METACENTRUM-02.swf log from Parallel Workloads Archive
• Simulated first 1500 jobs on 1024 cores, for 100 seconds

β = Cache threshold

259 jobs

For cache threshold of 60 and adjustment of -10%, improvement in throughput by 259/801 = 32.3%

High is better

30


HETEROGENEITY AND MULTI-TENANCY

31

Multi-tenancy => Dynamic heterogeneity
• Interference random and unpredictable
• Challenge: Running in VMs makes it difficult to determine if (and how much of) the load imbalance is
  • Application-intrinsic or
  • Caused by extraneous factors such as interference

Idle times

VMs sharing CPU: application functions appear to be taking longer time

Existing HPC load balancers ignore effect of extraneous factors

Time

CPU/VM

CHARM++ AND LOAD BALANCING

• Migratable objects (chares)
• Object-based over-decomposition

[Figure: objects (work/data units) in HPC VM1 on Physical Host 1 and HPC VM2 on Physical Host 2, with a background/interfering VM running on the same host]

Load balancer migrates objects from overloaded to underloaded VM

32

33

CLOUD-AWARE LOAD BALANCER

Static heterogeneity:
• Estimate the CPU capabilities for each VCPU, and use those estimates to drive the load balancing.
• Simple estimation strategy + periodic load re-distribution

Dynamic heterogeneity:
• Instrument the time spent on each task
• Impact of interference: instrument the load external to the application under consideration (background load)
• Normalize execution time to number of ticks (processor-independent)
• Predict future load based on the loads of recently completed iterations (principle of persistence).
• Create sets of overloaded and underloaded cores
• Migrate objects based on projected loads from overloaded to underloaded VMs (periodic refinement)

LOAD BALANCING APPROACH

• All processors should have load close to the average load
• Average load depends on task execution time and overhead
• Overhead is the time a processor is neither executing tasks nor idle
• Data sources: Charm++ LB database; /proc/stat file

Tlb: wall clock time between two load balancing steps
Ti: CPU time consumed by task i on VCPU p

To get a processor-independent measure of task loads, normalize the execution times to number of ticks

34

35

RESULTS: STENCIL3D

Periodically measuring idle time and migrating load away from time-shared VMs works well in practice.

• OpenStack on Open Cirrus test bed (3 types of processors), KVM, virtio-net, VMs: m1.small, vcpupin for pinning VCPU to physical cores

• Sequential NPB-FT as interference, Interfering VM pinned to one of the cores that the VMs of our parallel runs use

Low is better
Multi-tenancy awareness
Heterogeneity awareness

36

RESULTS

Improved CPU utilization

Load Imbalance

High colored bars are better

37

RESULTS: IMPROVEMENTS BY LB

Heterogeneity and Interference – one Slow node, hence four Slow VMs, rest Fast, one interfering VM (on a Fast core) which starts at iteration 50.

Up to 40% Benefits

High is better

38


MALLEABLE PARALLEL JOBS

Malleable jobs: dynamically shrink/expand the number of processors

Twofold merit in the context of cloud computing.

Cloud user perspective:
• Dynamic pricing offered by cloud providers, such as Amazon EC2
• Better value for the money spent based on priorities and deadlines

Cloud provider perspective:
• Malleable jobs + smart scheduler => better system utilization, response time, and throughput while following QoS
• Honor job priorities

39

[Figure: shrink protocol timeline (tasks/objects vs. time) between the launcher (Charmrun) and the application processes. Labels: CCS shrink request; sync point, check for shrink/expand request; object evacuation, load balancing; checkpoint to Linux shared memory; rebirth (exec) or die (exit), reconnect protocol; restore objects from checkpoint; execution resumes via stored callback; ShrinkAck to external client]

[Figure: expand protocol timeline (tasks/objects vs. time) between the launcher (Charmrun) and the application processes. Labels: CCS expand request; sync point, check for shrink/expand request; checkpoint to Linux shared memory; rebirth (exec) or launch (ssh, fork), connect protocol; restore objects from checkpoint; execution resumes via stored callback; ExpandAck to external client; load balancing]
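
The object evacuation step of a shrink can be sketched as follows. This is a toy model under stated assumptions, not the Charm++ runtime: objects are plain dictionary keys, ranks at or above `new_size` are the ones being removed, and evacuated objects go greedily to the least-loaded surviving rank. The name `evacuate` is hypothetical.

```python
import heapq


def evacuate(placement, loads, new_size):
    """Return a new {object: rank} map with no object on rank >= new_size.

    placement: {object: current rank}
    loads:     {object: load}
    new_size:  number of ranks that survive the shrink (must be >= 1)
    """
    rank_load = [0.0] * new_size
    for obj, rank in placement.items():
        if rank < new_size:
            rank_load[rank] += loads[obj]
    heap = [(load, rank) for rank, load in enumerate(rank_load)]
    heapq.heapify(heap)

    result = dict(placement)
    # Place the heaviest evacuated objects first.
    evicted = sorted((o for o, r in placement.items() if r >= new_size),
                     key=lambda o: -loads[o])
    for obj in evicted:
        load, rank = heapq.heappop(heap)
        result[obj] = rank
        heapq.heappush(heap, (load + loads[obj], rank))
    return result
```

Objects already on surviving ranks stay where they are, which keeps migration traffic limited to the data that has to leave the departing processes.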

42


CONCLUSIONS

Bridge the gap between HPC and cloud
• Performance and cost
• HPC-aware clouds and cloud-aware HPC

Key ideas can be extended beyond HPC-clouds
• Application-aware scheduling, characterization and consolidation
• Load balancing
• Malleable jobs

Comprehensive evaluation and analysis
• Performance benchmarking
• Application characterization

43

44

FINDINGS

Question: Answers

Who
• Small and medium scale organizations (pay-as-you-go benefits)
• Owners of applications that achieve the best performance/cost ratio in cloud vs. other platforms

What
• Applications with less-intensive communication patterns
• Less sensitivity to noise/interference
• Small to medium scale

Why
• HPC users in small-medium enterprises are much more sensitive to the CAPEX/OPEX argument
• Ability to exploit a large variety of different architectures (better utilization at global scale, potential consumer savings)

How
• Technical: lightweight virtualization, CPU affinity, HPC-aware cloud schedulers, cloud-aware HPC runtime
• HPC in the cloud models: cloud bursting, hybrid supercomputer–cloud approach: application-aware mapping

FUTURE WORK
• Application-aware cloud consolidation + cloud-aware HPC load balancer
• Mapping applications to platforms

45

46

POSSIBLE BENEFITS

Interesting cross-over points when considering cost. Best platform depends on scale, budget.

[Figure: platform comparison plots with annotations "Time constraint: choose this" and "Cost constraint: choose this"; low is better]

Cost = Charging rate ($ per core-hour) × P × Time

47

CLOUD PROVIDER PERSPECTIVE

Queue of jobs
Standard scheduling (backfilling, FCFS, priority) + application-awareness

Multi-dimension optimization
• Online job scheduling: Ji = (t, pn), pn = f(t, n), n ∈ N platforms, deadlines
• Output: start time (si), ni (which platform)
• Optimization function: utilization, turnaround time for a job, throughput
• Simplifications

• Less load on SC
• Reduced wait time
• Better cloud utilization

48

WHAT ELSE I HAVE DONE

Large scale HPC applications

EpiSimdemics:
• Collaborated with Virginia Tech researchers to enable parallel simulation of contagion diffusion over very large social networks.
• Scales up to 300,000 cores on Blue Waters.
• My focus: leveraging (and developing) Charm++ runtime features to optimize performance of EpiSimdemics.

Information Set for Game Trees:
• Parallelized information set generation for game tree search applications.
• Analyzed the impact of load balancing strategies, problem sizes, and computational granularity on parallel scaling.

49


Runtime systems and schedulers

Charm++ runtime system:
• Various projects for research and development of the Charm++ parallel programming system and the associated ecosystem (tools etc.).

Adaptive job scheduler:
• Extending an open-source job scheduler (SLURM) to enable malleable HPC jobs.
• Runtime support in Charm++ for such dynamic shrink/expand capability.
• Power-aware load balancing and scheduling

Scalable tree startup:
• A multi-level scalable startup technique for parallel applications.

50


Architectures for data-intensive applications
• Graph500, HPCC GUPS
• Simulation – Sandia SST

51

QUESTIONS?

52

BACKUP SLIDES

HPC-CLOUD ECONOMICS

Then why cloud for HPC?
• Small-medium enterprises, startups with HPC needs
• Lower cost of running in cloud vs. supercomputer?
• For some applications?

53

HPC-CLOUD ECONOMICS*

54


Cloud can be cost-effective till some scale but what about performance?

High means cheaper to run in cloud

[Figure: ratio of $ per CPU-hour on SC to $ per CPU-hour on cloud]

* Ack to Dejan Milojicic and Paolo Faraboschi who originally drew this figure

HPC-CLOUD ECONOMICS

55


Low is better

Best platform depends on application characteristics. How to select a platform for an application?

56

PROPOSED WORK(1): APP-TO-PLATFORM

1. Application characterization and relative performance estimation for structured applications
• One-time benchmarking + interpolation for complex apps.

2. Platform selection algorithms (cloud user perspective)
• Minimize cost meeting performance target
• Maximize performance under cost constraint
• Consider an application set as a whole

Which application, which cloud
Benefits: Performance, Cost
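
The two single-application selection goals can be sketched with the cost model Cost = charging rate × P × Time used earlier in the talk. This is an illustrative sketch only: the function names, the tuple layout, and the assumption that an estimated run time per platform is already available (for example from the one-time benchmarking step) are introduced here.

```python
def cost(rate, cores, hours):
    """Cost = charging rate ($ per core-hour) x P x Time."""
    return rate * cores * hours


def cheapest_meeting_deadline(options, deadline_hours):
    """options: list of (platform, rate, cores, estimated_hours).

    Return the cheapest option that finishes within the deadline, or None.
    """
    ok = [o for o in options if o[3] <= deadline_hours]
    return min(ok, key=lambda o: cost(*o[1:]), default=None)


def fastest_within_budget(options, budget):
    """Return the fastest option whose cost fits the budget, or None."""
    ok = [o for o in options if cost(*o[1:]) <= budget]
    return min(ok, key=lambda o: o[3], default=None)
```

Given per-platform estimates, the first function answers "minimize cost meeting performance target" and the second "maximize performance under cost constraint"; handling an application set as a whole would need a joint optimization that this sketch does not attempt.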

57

IMPACT
• Effective HPC in cloud (performance, cost)
• Some techniques applicable beyond clouds
• Charm++ production system
• OpenStack scheduler
• CloudSim
• Industry participation (HP Labs award, internships)
• 2 patents
