Post on 17-Jan-2016
EFFICIENT HIGH PERFORMANCE COMPUTING IN THE CLOUD
Abhishek Gupta (gupta59@illinois.edu)
Department of Computer Science,
University of Illinois at Urbana Champaign, Urbana, IL
1
MOTIVATION: WHY CLOUDS FOR HPC?
- Rent vs. own, pay-as-you-go: no startup/maintenance cost, no cluster-creation time
- Elastic resources: no risk of, e.g., under-provisioning; prevents underutilization
- Benefits of virtualization: flexibility and customization, security and isolation, migration and resource control
- Multiple providers offering Infrastructure as a Service

Cloud for HPC: a cost-effective and timely solution?
2
MOTIVATION: HPC-CLOUD DIVIDE
3
HPC:
- Application performance
- Dedicated execution
- HPC-optimized interconnects and OS
- Not cloud-aware

Cloud:
- Service, cost, resource utilization
- Multi-tenancy
- Commodity network, virtualization
- Not HPC-aware

Mismatch between HPC requirements and cloud characteristics: only embarrassingly parallel, small-scale HPC applications run well in clouds.
OBJECTIVES
- HPC-cloud: what, why, who
- How: bridge the HPC-cloud gap
- HPC in the cloud: improve HPC performance; improve cloud utilization => reduce cost
OBJECTIVES AND CONTRIBUTIONS
- HPC-cloud: what, why, who
- How: bridge the HPC-cloud gap, via performance/cost analysis and heterogeneity- and multi-tenancy-aware HPC in the cloud
- Tools extended: load balancing framework (object migration), OpenStack Nova scheduler, CloudSim simulator
- Techniques: malleable jobs (dynamic shrink/expand), application-aware VM consolidation, smart selection of platforms for applications
5
6
OUTLINE
- Performance of HPC in cloud: trends, challenges, and opportunities
- Application-aware cloud schedulers
  - HPC-aware schedulers: improve HPC performance
  - Application-aware consolidation: improve cloud utilization => reduce cost
- Cloud-aware HPC runtime
  - Dynamic load balancing: improve HPC performance
  - Parallel runtime for shrink/expand: improve cloud utilization => reduce cost
- Conclusions
EXPERIMENTAL TESTBED AND APPLICATIONS
7
Applications:
- NAS Parallel Benchmarks class B (NPB3.3-MPI)
- NAMD: highly scalable molecular dynamics
- ChaNGa: cosmology, N-body
- Sweep3D: a particle transport ASCI code
- Jacobi2D: 5-point stencil computation kernel
- NQueens: backtracking state-space search

Platforms (resource: network):
- Ranger (TACC): Infiniband (10 Gbps)
- Taub (UIUC): Voltaire QDR Infiniband
- Open Cirrus (HP): 10 Gbps Ethernet internal; 1 Gbps Ethernet cross-rack
- Private Cloud (HP): emulated network card under KVM hypervisor (1 Gbps physical Ethernet)
- Public Cloud (HP): emulated network under KVM hypervisor (1 Gbps physical Ethernet)
PERFORMANCE (1/3)
8
Some applications are cloud-friendly.
PERFORMANCE (2/3)
9
Some applications scale only up to 16-64 cores.
PERFORMANCE (3/3)
10
Some applications cannot survive in the cloud.
BOTTLENECKS IN CLOUD: COMMUNICATION LATENCY
11
Cloud message latencies (256μs) off by 2 orders of magnitude compared to supercomputers (4μs)
Low is better
BOTTLENECKS IN CLOUD: COMMUNICATION BANDWIDTH
12
Cloud communication performance off by 2 orders of magnitude – why?
High is better
COMMODITY NETWORK OR VIRTUALIZATION OVERHEAD (OR BOTH?)
Significant virtualization overhead (physical vs. virtual). This led to collaborative work on "Optimizing virtualization for HPC: thin VMs, containers, CPU affinity" with HP Labs Singapore.
[Figures: latency (low is better) and bandwidth (high is better)]
13
PUBLIC VS. PRIVATE CLOUD
14
Similar network performance for public and private cloud. Then, why does public cloud perform worse?
Heterogeneity and Multi-tenancy
Low is better
15
HETEROGENEITY AND MULTI-TENANCY CHALLENGE
Heterogeneity: cloud economics is based on creation of a cluster from an existing pool of resources and incremental addition of new resources.
Multi-tenancy: cloud providers run a profitable business by improving the utilization of underutilized resources:
- cluster-level, by serving a large number of users;
- server-level, by consolidating VMs of complementary nature (such as memory- and compute-intensive) on the same server.
Heterogeneity and multi-tenancy are intrinsic to clouds, driven by cloud economics. But for HPC, one slow processor leaves all the other processors underutilized.
16
OUTLINE
- Performance of HPC in cloud: trends, challenges, and opportunities
- Application-aware cloud schedulers
  - HPC-aware schedulers: improve HPC performance
  - Application-aware consolidation: improve cloud utilization => reduce cost
- Cloud-aware HPC runtime
  - Dynamic load balancing: improve HPC performance
  - Parallel runtime for shrink/expand: improve cloud utilization => reduce cost
- Conclusions
Next: scheduling/placement.
Challenges/bottlenecks: heterogeneity, multi-tenancy.
Opportunities: VM consolidation, application-aware cloud schedulers, HPC in an HPC-aware cloud.
17
18
BACKGROUND: OPENSTACK NOVA
- OpenStack: open-source cloud management system (the "Linux of cloud management")
- The Nova scheduler (VM placement) is HPC-agnostic:
  - ignores the nature of the application
  - ignores heterogeneity and network topology
  - considers the k VMs requested by an HPC user as k separate placement problems
  - no correlation between the VMs of a single request
19
HARDWARE- AND TOPOLOGY-AWARE VM PLACEMENT
- CPU timelines of 8 VMs running Jacobi2D (one iteration)
- OpenStack on the Open Cirrus testbed at HP Labs; 3 types of servers: Intel Xeon E5450 (3.00 GHz), Intel Xeon X3370 (3.00 GHz), Intel Xeon X3210 (2.13 GHz)
- KVM as hypervisor, virtio-net for network virtualization, VMs: m1.small
- 20% improvement (decrease) in execution time, across all processors
20
WHAT ABOUT MULTI-TENANCY: VM CONSOLIDATION FOR HPC IN CLOUD (1)
- HPC performance (prefers dedicated execution) vs. resource utilization (shared usage in cloud)?
- Up to 23% savings. But how much interference?
[Figure: packing of 0.5 GB VM requests onto hosts]
21
VM CONSOLIDATION FOR HPC IN CLOUD (2)
Experiment: shared mode (2 apps on each node, 2 cores each on a 4-core node); performance normalized w.r.t. dedicated mode.
Challenge: interference.
EP = Embarrassingly Parallel, LU = LU factorization, IS = Integer Sort, ChaNGa = cosmology. 4 VMs per app. High is better.
Careful co-location can actually improve performance. Why? There is a correlation between LLC misses/sec and shared-mode performance — scope for application-aware consolidation.
22
METHODOLOGY: (1) APPLICATION CHARACTERIZATION
Characterize applications along two dimensions:
1. Cache intensiveness
   - Assign each application a cache score (in units of 100K LLC misses/sec)
   - Representative of the pressure it puts on the last-level cache and memory-controller subsystem
2. Parallel synchronization and network sensitivity
   - ExtremeHPC: IS (parallel sorting)
   - SyncHPC: LU, ChaNGa
   - AsyncHPC: EP, MapReduce applications
   - NonHPC: web applications
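As a sketch, this two-dimensional characterization can be encoded directly. The class labels and the cache-score convention come from the slide; the function names and any LLC-miss rates used with them are illustrative assumptions, not the dissertation's implementation.

```python
# Sketch of the two-dimensional application characterization above.
# Class labels are from the talk; numeric inputs are hypothetical.

def cache_score(llc_misses_per_sec):
    """Cache score in units of 100K LLC misses/sec (the slide's convention)."""
    return llc_misses_per_sec / 100_000

# Parallel synchronization / network sensitivity classes from the slide.
APP_CLASSES = {
    "IS": "ExtremeHPC",       # parallel sorting, latency-critical
    "LU": "SyncHPC",
    "ChaNGa": "SyncHPC",
    "EP": "AsyncHPC",
    "MapReduce": "AsyncHPC",
    "WebApp": "NonHPC",
}

def characterize(app_name, llc_misses_per_sec):
    """Return the two-dimensional profile the scheduler consumes."""
    return {
        "class": APP_CLASSES.get(app_name, "NonHPC"),
        "cache_score": cache_score(llc_misses_per_sec),
    }
```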
23
METHODOLOGY: (2) HPC-AWARE SCHEDULER: DESIGN AND IMPLEMENTATION
- Dedicated execution for ExtremeHPC
- Resource packing for the remaining classes
- Topology-awareness for ExtremeHPC and SyncHPC
- Less aggressive packing for SyncHPC (bulk-synchronous apps)
- Cross-interference awareness using LLC misses/sec: co-locate applications with complementary profiles
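A minimal sketch of how these design rules might be encoded, assuming the application profiles produced by the characterization step; the cache-threshold value and function names are illustrative, not the values used in the dissertation.

```python
def placement_policy(app_class):
    """Map a sensitivity class to the design rules listed above (sketch;
    the real scheduler's policies are more involved)."""
    return {
        "dedicated": app_class == "ExtremeHPC",
        "topology_aware": app_class in ("ExtremeHPC", "SyncHPC"),
        # SyncHPC gets less aggressive packing; Async/NonHPC are packed.
        "aggressive_packing": app_class in ("AsyncHPC", "NonHPC"),
    }

def can_colocate(a, b, cache_threshold=60):
    """Cross-interference check: co-locate only complementary cache
    profiles, i.e. combined LLC pressure under a threshold (beta).
    `a` and `b` are dicts with 'class' and 'cache_score' keys."""
    if "ExtremeHPC" in (a["class"], b["class"]):
        return False  # ExtremeHPC always runs dedicated
    return a["cache_score"] + b["cache_score"] <= cache_threshold
```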
24
MDOBP (MULTI-DIMENSIONAL ONLINE BIN PACKING)
Pack a VM request into hosts (bins) with a dimension-aware heuristic: select the host for which the vector of requested resources aligns the most with the vector of remaining capacities*, i.e. the host with the minimum α, where cos(α) is calculated using the dot product of the two vectors:
cos(α) = (CPUReq·CPURes + MemReq·MemRes) / (|(CPUReq, MemReq)| · |(CPURes, MemRes)|)
Residual capacities of a host: (CPURes, MemRes); requested VM: (CPUReq, MemReq).
[Figure: requested-resource and remaining-capacity vectors of a physical host in (CPUs, Memory) space, with angle α between them]
*S. Lee, R. Panigrahy, V. Prabhakaran, V. Ramasubramanian, K. Talwar, L. Uyeda, and U. Wieder, "Validating Heuristics for Virtual Machines Consolidation," Microsoft Research, Tech. Rep., 2011.
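The alignment heuristic can be sketched as follows. This is a minimal illustration of the dot-product selection rule described above, not the scheduler's actual code; the two-dimensional (CPU, memory) representation follows the slide.

```python
import math

def mdobp_select_host(request, hosts):
    """Dimension-aware online bin packing: pick the host whose residual
    capacity vector aligns best with the requested-resource vector,
    i.e. the host minimizing alpha, where
        cos(alpha) = (request . residual) / (|request| * |residual|).

    request: (cpu_req, mem_req); hosts: dict name -> (cpu_res, mem_res).
    Hosts that cannot fit the request are skipped; returns None if none fit.
    """
    best_host, best_cos = None, -1.0
    for name, residual in hosts.items():
        if any(need > have for need, have in zip(request, residual)):
            continue  # request does not fit on this host
        dot = sum(a * b for a, b in zip(request, residual))
        norm = math.hypot(*request) * math.hypot(*residual)
        cos_alpha = dot / norm if norm else 0.0
        if cos_alpha > best_cos:  # larger cos(alpha) => smaller alpha
            best_host, best_cos = name, cos_alpha
    return best_host
```

For example, a request of (2 CPUs, 4 GB) aligns perfectly with a host whose residual capacity is (4, 8) (cos α = 1), and is preferred over a CPU-heavy host with residual (8, 2).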
25
IMPLEMENTATION ATOP OPENSTACK NOVA
26
RESULTS: CASE STUDY OF APPLICATION-AWARE SCHEDULING
EP = Embarrassingly Parallel, LU = LU factorization, IS = Integer Sort
Notation IS.B.4 = application, problem size, number of requested VMs. Testbed: 8 nodes (32 cores).
- Performance gains of up to 45% for a single application
- Negative impact of interference limited to 8%
- Less aggressive packing. But what about resource utilization?
High is better.
27
SIMULATION
- CloudSim: simulation tool for modeling a cloud computing environment in a datacenter
- Extended the existing VmAllocationPolicySimple class to create a VmAllocationPolicyHPC that handles a user request comprising multiple VM instances and performs application-aware scheduling
- Implemented dynamic VM creation and termination
28
SIMULATION RESULTS
- Assigned each job a cache score in (0-30) using a uniformly distributed random number generator
- Modified execution times by -10% and -20% to account for the performance improvement resulting from cache-awareness
- METACENTRUM-02.swf log from the Parallel Workloads Archive
- Simulated the first 1500 jobs on 1024 cores, for 100 seconds
- β = cache threshold
For a cache threshold of 60 and an adjustment of -10%, 259 additional jobs completed over the baseline's 801, an improvement in throughput of 259/801 = 32.3%.
High is better.
30
OUTLINE
- Performance of HPC in cloud: trends, challenges, and opportunities
- Application-aware cloud schedulers
  - HPC-aware schedulers: improve HPC performance
  - Application-aware consolidation: improve cloud utilization => reduce cost
- Cloud-aware HPC runtime
  - Dynamic load balancing: improve HPC performance
  - Parallel runtime for shrink/expand: improve cloud utilization => reduce cost
- Conclusions
HETEROGENEITY AND MULTI-TENANCY
31
- Multi-tenancy => dynamic heterogeneity; interference is random and unpredictable
- Challenge: running in VMs makes it difficult to determine whether (and how much of) the load imbalance is application-intrinsic or caused by extraneous factors such as interference
- When VMs share a CPU, application functions appear to take longer; idle times show up on the per-CPU/VM timelines
- Existing HPC load balancers ignore the effect of extraneous factors
CHARM++ AND LOAD BALANCING
- Migratable objects (chares); object-based over-decomposition
- Objects are the work/data units
- A background/interfering VM may run on the same host
- The load balancer migrates objects from overloaded to underloaded VMs
[Figure: Physical Hosts 1 and 2, each running an HPC VM (HPC VM1, HPC VM2)]
32
33
CLOUD-AWARE LOAD BALANCER
Static heterogeneity:
- Estimate the CPU capability of each VCPU, and use those estimates to drive load balancing
- Simple estimation strategy + periodic load redistribution
Dynamic heterogeneity:
- Instrument the time spent on each task
- Impact of interference: instrument the load external to the application under consideration (background load)
- Normalize execution times to a number of ticks (processor-independent)
- Predict future load from the loads of recently completed iterations (principle of persistence)
- Create sets of overloaded and underloaded cores
- Migrate objects, based on projected loads, from overloaded to underloaded VMs (periodic refinement)
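The periodic refinement step can be sketched as a greedy migration loop. This is an illustrative simplification, assuming projected per-VM loads (in ticks) and task weights as inputs; the real Charm++ balancer operates on its instrumented LB database and considers more factors.

```python
def refine_balance(vm_loads, tasks, threshold=1.05):
    """Greedy refinement sketch: migrate tasks from overloaded to
    underloaded VMs, guided by projected loads.

    vm_loads: dict vm -> projected load in ticks (tasks + background)
    tasks:    list of (task_id, vm, ticks)
    threshold: a VM is overloaded if its load exceeds threshold * average
    Returns (migrations, final_loads)."""
    loads = dict(vm_loads)
    placement = {t: vm for t, vm, _ in tasks}
    ticks = {t: w for t, _, w in tasks}
    avg = sum(loads.values()) / len(loads)
    migrations = []
    for t in sorted(ticks, key=ticks.get, reverse=True):  # heaviest first
        src = placement[t]
        if loads[src] <= threshold * avg:
            continue                           # source is not overloaded
        dst = min(loads, key=loads.get)        # most underloaded VM
        if loads[dst] + ticks[t] < loads[src]:  # migrate only if it helps
            loads[src] -= ticks[t]
            loads[dst] += ticks[t]
            placement[t] = dst
            migrations.append((t, src, dst))
    return migrations, loads
```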
LOAD BALANCING APPROACH
- All processors should have load close to the average load
- The average load depends on task execution time and overhead
- Overhead is the time the processor is neither executing tasks nor idle; it is obtained for the Charm++ LB database from the /proc/stat file
- Tlb: wall-clock time between two load-balancing steps; Ti: CPU time consumed by task i on VCPU p
- To get a processor-independent measure of task loads, normalize the execution times to a number of ticks
34
35
RESULTS: STENCIL3D
Periodically measuring idle time and migrating load away from time-shared VMs works well in practice.
- OpenStack on the Open Cirrus testbed (3 types of processors), KVM, virtio-net, VMs: m1.small, vcpupin for pinning VCPUs to physical cores
- Sequential NPB-FT as interference; the interfering VM is pinned to one of the cores used by the VMs of our parallel runs
Low is better. The plots show multi-tenancy awareness and heterogeneity awareness.
36
RESULTS
- Improved CPU utilization
- Load imbalance (high colored bars are better)
37
RESULTS: IMPROVEMENTS BY LB
Heterogeneity and interference: one slow node (hence four slow VMs), the rest fast; one interfering VM (on a fast core) starts at iteration 50.
Benefits of up to 40%. High is better.
38
OUTLINE
- Performance of HPC in cloud: trends, challenges, and opportunities
- Application-aware cloud schedulers
  - HPC-aware schedulers: improve HPC performance
  - Application-aware consolidation: improve cloud utilization => reduce cost
- Cloud-aware HPC runtime
  - Dynamic load balancing: improve HPC performance
  - Parallel runtime for shrink/expand: improve cloud utilization => reduce cost
- Conclusions
MALLEABLE PARALLEL JOBS
Malleable jobs can dynamically shrink or expand their number of processors. There is twofold merit in the context of cloud computing:
- Cloud user perspective: dynamic pricing offered by cloud providers such as Amazon EC2; better value for the money spent, based on priorities and deadlines
- Cloud provider perspective: malleable jobs + a smart scheduler => better system utilization, response time, and throughput while meeting QoS and honoring job priorities
39
Shrink protocol (timeline of tasks/objects across the application processes and the launcher, Charmrun):
1. The launcher receives a CCS shrink request
2. At a synchronization point, the application checks for a shrink/expand request
3. Object evacuation and load balancing onto the surviving processes
4. Checkpoint to Linux shared memory
5. Processes are reborn (exec) or die (exit); reconnect protocol
6. Objects are restored from the checkpoint
7. Execution resumes via a stored callback; a ShrinkAck is sent to the external client

Expand protocol:
1. The launcher receives a CCS expand request
2. At a synchronization point, the application checks for a shrink/expand request
3. Checkpoint to Linux shared memory
4. Processes are reborn (exec) or newly launched (ssh, fork); connect protocol
5. Objects are restored from the checkpoint
6. Load balancing across the enlarged set of processes
7. Execution resumes via a stored callback; an ExpandAck is sent to the external client
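The two sequences above can be sketched as pure data movement. This models only the object redistribution and the final acknowledgment; the real system checkpoints to shared memory and execs/launches processes, and the function name here is illustrative.

```python
def resize(objects, old_n, new_n):
    """Sketch of the shrink/expand sequence as object redistribution.

    objects: list of migratable object ids (the tasks/objects above)
    old_n:   current number of processes
    new_n:   requested number of processes
    Returns (placement, ack): object -> process rank, and the ack string
    reported back to the external client."""
    assert new_n > 0
    # Load-balancing step: redistribute objects round-robin over the new
    # process set (a stand-in for the runtime's real load balancer).
    placement = {obj: i % new_n for i, obj in enumerate(objects)}
    ack = "ShrinkAck" if new_n < old_n else "ExpandAck"
    return placement, ack
```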
42
OUTLINE
- Performance of HPC in cloud: trends, challenges, and opportunities
- Application-aware cloud schedulers
  - HPC-aware schedulers: improve HPC performance
  - Application-aware consolidation: improve cloud utilization => reduce cost
- Cloud-aware HPC runtime
  - Dynamic load balancing: improve HPC performance
  - Parallel runtime for shrink/expand: improve cloud utilization => reduce cost
- Conclusions
CONCLUSIONS
- Bridge the gap between HPC and cloud: performance and cost; HPC-aware clouds and cloud-aware HPC
- Key ideas can be extended beyond HPC clouds: application-aware scheduling, characterization and consolidation; load balancing; malleable jobs
- Comprehensive evaluation and analysis: performance benchmarking, application characterization
43
44
FINDINGS
Who:
- Small and medium-scale organizations (pay-as-you-go benefits)
- Those owning applications which achieve the best performance/cost ratio in the cloud vs. other platforms
What:
- Applications with less-intensive communication patterns
- Less sensitivity to noise/interference
- Small to medium scale
Why:
- HPC users in small-medium enterprises are much more sensitive to the CAPEX/OPEX argument
- Ability to exploit a large variety of architectures (better utilization at global scale, potential consumer savings)
How:
- Technical: lightweight virtualization, CPU affinity, HPC-aware cloud schedulers, cloud-aware HPC runtime
- HPC-in-the-cloud models: cloud bursting, hybrid supercomputer-cloud approach, application-aware mapping
FUTURE WORK
- Application-aware cloud consolidation + cloud-aware HPC load balancer
- Mapping applications to platforms
45
46
POSSIBLE BENEFITS
There are interesting cross-over points when considering cost; the best platform depends on scale and budget.
[Figures: execution time and cost vs. scale (low is better); under a time constraint choose the fastest feasible platform, under a cost constraint choose the cheapest]
Cost = charging rate ($ per core-hour) × P × Time
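The cost model and the two constrained choices can be made concrete with a small sketch. The formula is the slide's; the rate, core-count, and runtime numbers in the usage example are hypothetical.

```python
def cost(rate_per_core_hour, cores, hours):
    """Cost = charging rate ($ per core-hour) x P x Time (slide's model)."""
    return rate_per_core_hour * cores * hours

def choose_platform(platforms, time_limit=None, budget=None):
    """Pick the cheapest platform satisfying the given constraints
    (breaking ties by time); returns None if no platform is feasible.
    `platforms` maps name -> (rate_per_core_hour, cores, hours)."""
    feasible = {}
    for name, (rate, p, t) in platforms.items():
        c = cost(rate, p, t)
        if time_limit is not None and t > time_limit:
            continue  # violates the time constraint
        if budget is not None and c > budget:
            continue  # violates the cost constraint
        feasible[name] = (c, t)
    if not feasible:
        return None
    return min(feasible, key=lambda n: feasible[n])
```

With hypothetical numbers, a cloud at $0.10/core-hour that needs 10 hours on 64 cores costs $64, against $128 for a supercomputer at $1.00/core-hour finishing in 2 hours: the cloud wins on cost, the supercomputer under a 5-hour deadline.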
47
CLOUD PROVIDER PERSPECTIVE
- Queue of jobs; standard scheduling (backfilling, FCFS, priority) + application-awareness
- Multi-dimensional optimization
- Online job scheduling: Ji = (t, pn), pn = f(t, n), n ∈ N platforms, with deadlines
- Output: start time (si) and ni (which platform)
- Optimization function: utilization, turnaround time for a job, throughput; with simplifications
- Benefits: less load on the supercomputer, reduced wait times, better cloud utilization
48
WHAT ELSE I HAVE DONE
Large-scale HPC applications:
- EpiSimdemics: collaborated with Virginia Tech researchers to enable parallel simulation of contagion diffusion over very large social networks; scales up to 300,000 cores on Blue Waters. My focus: leveraging (and developing) Charm++ runtime features to optimize the performance of EpiSimdemics.
- Information sets for game trees: parallelized information-set generation for game-tree search applications; analyzed the impact of load-balancing strategies, problem sizes, and computational granularity on parallel scaling.
49
WHAT ELSE I HAVE DONE
Runtime systems and schedulers:
- Charm++ runtime system: various projects for research and development of the Charm++ parallel programming system and its associated ecosystem (tools etc.)
- Adaptive job scheduler: extended an open-source job scheduler (SLURM) to enable malleable HPC jobs, with runtime support in Charm++ for this dynamic shrink/expand capability
- Power-aware load balancing and scheduling
- Scalable tree startup: a multi-level scalable startup technique for parallel applications
50
WHAT ELSE I HAVE DONE
Architectures for data-intensive applications:
- Graph500, HPCC GUPS
- Simulation: Sandia SST
51
QUESTIONS?
52
BACKUP SLIDES
HPC-CLOUD ECONOMICS
Then why cloud for HPC?
- Small-medium enterprises and startups with HPC needs
- Lower cost of running in the cloud vs. a supercomputer — for some applications?
53
HPC-CLOUD ECONOMICS*
54
Cost = charging rate ($ per core-hour) × P × Time
The cloud can be cost-effective up to some scale, but what about performance?
[Figure: ratio of $ per CPU-hour on a supercomputer to $ per CPU-hour in the cloud; high means cheaper to run in the cloud]
* Thanks to Dejan Milojicic and Paolo Faraboschi, who originally drew this figure.
HPC-CLOUD ECONOMICS
55
Cost = Charging rate($ per core-hour) × P × Time
Low is better
Best platform depends on application characteristics. How to select a platform for an application?
56
PROPOSED WORK (1): APP-TO-PLATFORM MAPPING
1. Application characterization and relative performance estimation for structured applications; one-time benchmarking + interpolation for complex apps
2. Platform selection algorithms (cloud user perspective):
- Minimize cost while meeting a performance target
- Maximize performance under a cost constraint
- Consider an application set as a whole: which application goes to which cloud
Benefits: performance, cost.
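A sketch of the "application set as a whole" selection, assuming per-app performance estimates are already available from the benchmarking/interpolation step. The function name, the input layout, and all numbers in the usage example are hypothetical.

```python
def map_apps_to_platforms(apps, perf_target):
    """Platform selection over an application set (cloud-user view):
    for each app, pick the cheapest platform whose estimated execution
    time meets that app's performance target; None if no platform does.

    apps: app name -> {platform: (rate_per_core_hour, cores, hours)}
    perf_target: app name -> maximum acceptable hours."""
    mapping = {}
    for app, options in apps.items():
        feasible = [
            (rate * p * t, platform)                # (cost, platform)
            for platform, (rate, p, t) in options.items()
            if t <= perf_target[app]                # meets the time target
        ]
        mapping[app] = min(feasible)[1] if feasible else None
    return mapping
```

For instance, a loosely coupled stencil kernel may map to the cloud on cost, while a communication-bound sort only meets its deadline on the supercomputer.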
57
IMPACT
- Effective HPC in the cloud (performance, cost)
- Some techniques applicable beyond clouds
- Contributions to the Charm++ production system, the OpenStack scheduler, and CloudSim
- Industry participation (HP Labs award, internships), 2 patents