building hpc clusters as code in the (almost) infinite cloud | aws public sector summit 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dr. Jeffrey B. Layton Global Scientific Computing June 20, 2016 Building HPC Clusters as Code in the [Almost] Infinite Cloud

Upload: amazon-web-services

Post on 15-Jan-2017


TRANSCRIPT

Page 1: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Dr. Jeffrey B. Layton
Global Scientific Computing

June 20, 2016

Building HPC Clusters as Code in the [Almost] Infinite Cloud

Page 2: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Agenda

• Why cloud for HPC?
• Tools for creating clusters in the cloud
• SPOT + HPC = peas and carrots
• Fermi National Accelerator Laboratory
• Demo of scaling jobs on a budget
• Summary

Page 3: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Why cloud for HPC?

Scalability
• If you need to run on lots of cores, just spin them up
• If you don’t need nodes, turn them off (and don’t pay for them)

Time to research
• Usually on-premises high-performance computing (HPC) resources are centralized (shared)
• Researchers like to have their own nodes when they need them

World-wide collaboration
• Share data and interact with it by using the cloud

Latest technology and various instance types
Can save $$$
Flexibility: code as infrastructure

Page 4: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

AWS HPC architectures—phases of deployment

• Fork lift
  – Make it look like on-premises

• Cloud “port”
  – Adapt to cloud features
  – Auto Scaling
  – Spot

• Born in the cloud
  – Cycle computing

• Rethink application
  – Microservices and serverless computing

You must think in “cloud”
You cannot think in “on-prem” and transpose
You must think in “cloud”
Do you think you can do that, Mr. Gant?

Page 5: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

AWS HPC architecture

[Diagram: an on-premises cluster (master node, compute nodes, NFS or parallel storage) shown next to its AWS Cloud equivalent (master instance, compute instances, NFS or parallel storage), with additional compute instances on the AWS side]

Page 6: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

HPC tools

MIT StarCluster
• No longer supported or developed

Bright Cluster Manager
• Good for hybrid solutions

CloudyCluster
• Omnibond (out of Clemson University)

Amazon CfnCluster
• Getting started writing your own tools

Alces Flight—on AWS Marketplace

Page 7: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Alces Flight

Alces Flight is software offering self-service supercomputers by using the AWS Marketplace (the cloud’s “App Store”). It creates self-scaling clusters with more than 750 popular scientific applications pre-installed, complete with libraries and various compiler optimizations, ready to run. The clusters use the AWS Spot market by default.

5 minutes

http://alces-flight.com

Page 8: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Alces Flight is familiar and flexible

• Same tools as virtually all HPC systems
  – Environment modules
  – Job scheduler (SGE)

• Catalog of 750+ prebuilt scientific applications and libraries including visualization tools

• Alces gridware tool for application management
  – Integrated with modules

• Defaults to the Spot market
  – Auto Scaling cluster based on queued jobs
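The slide does not say how the cluster decides its size from the queue. As a rough sketch only, one plausible rule is "enough nodes to cover the queued slots, within a minimum and a maximum". The function name `desired_nodes`, its arguments, and the rule itself are assumptions for illustration, not Alces Flight's actual algorithm.

```shell
#!/bin/bash
# Hypothetical scaling rule (an assumption, not Alces Flight's real logic):
# enough nodes to cover the queued slots, clamped between min and max.
# Arguments: <queued_slots> <slots_per_node> <min_nodes> <max_nodes>
desired_nodes() {
  local queued=$1 per_node=$2 min=$3 max=$4
  local want=$(( (queued + per_node - 1) / per_node ))   # ceiling division
  (( want < min )) && want=$min
  (( want > max )) && want=$max
  echo "$want"
}

desired_nodes 10 4 2 8   # 10 queued slots at 4 slots per node: prints 3
```

The minimum and maximum correspond to the "number of initial nodes" and "max number of nodes" fields that appear later in the stack details.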

Page 9: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Flight enables collaboration

Access the graphical console of your control node simultaneously with your collaborators

• Run visual apps that use the elastic cluster to drive visual results, and work together in the visual console in real time

Shared and secure cloud workspaces
• Control access and focus on data analysis
• Make more discoveries faster
• Save lives
• Change the world

Collaborative Integrative Genomics Viewer (IGV) workspace for variant analysis

Page 10: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

SPOT + HPC = peas and carrots

Page 11: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Multiple pricing methods

Page 12: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Spot Market filler

[Chart: # CPUs over time, with a region labeled “Spot Market”]

Our ultimate space filler.

Spot Instances allow you to name your own price for spare AWS computing capacity.

Great for workloads that aren’t time sensitive, and especially popular in research (hint: it’s really cheap).

Page 13: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Spot vs. On-Demand (YMMV)

4 compute nodes, 2 hours
• On-Demand (us-east): $19.13
• Spot (us-west-1): $7.22
• Spot is about 38% of the On-Demand cost, a little over 1/3

16 compute nodes, 32 hours
• On-Demand (us-east): $1,018.77
• Spot (us-west-1): $223.11
• Spot is about 22% of the On-Demand cost, a little over 1/5
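The ratios on this slide can be checked with a few lines of shell. This is only a worked calculation on the four dollar figures quoted above; the function name `spot_pct` is made up for illustration.

```shell
#!/bin/bash
# Spot cost as a percentage of the On-Demand cost, to one decimal place.
# Arguments: <spot_cost> <on_demand_cost>, in dollars, without thousands separators.
spot_pct() {
  awk -v s="$1" -v o="$2" 'BEGIN { printf "%.1f\n", 100 * s / o }'
}

spot_pct 7.22 19.13       # 4 compute nodes, 2 hours: prints 37.7
spot_pct 223.11 1018.77   # 16 compute nodes, 32 hours: prints 21.9
```

The two runs were priced in different regions (us-east vs. us-west-1), so part of the difference may come from regional pricing rather than from Spot alone.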

Page 14: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Fermi National Accelerator Laboratory
Dr. Panagiotis Spentzouris

Page 15: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Fermi National Accelerator Laboratory

Fermilab is America’s particle physics and accelerator lab.
• Mission: solve the mysteries of matter, energy, space and time for the benefit of all.

More than 4,200 scientists worldwide use Fermilab and its particle accelerators, detectors and computers for their research.

Page 16: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Particle Physics Science Drivers

Utilize high-energy particle beam collisions to discover
• the origin of mass, the nature of dark matter, extra dimensions.

Employ high-flux beams to explore
• neutrino interactions, to answer questions about the origins of the universe, matter-antimatter asymmetry, force unification.
• rare processes, to open a doorway to realms of ultra-high energies, close to the unification scale.

Page 17: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Massive instruments generate massive data

Fermilab experiments

Page 18: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

The big data frontier… from Wired

Page 19: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Fermilab Facility Evolution: HEPCloud

HEPCloud: Provide cost-effective and efficient “elastic” resource deployment, utilizing a sophisticated decision engine and middleware for automation. A single portal to heterogeneous computing and storage resources, both local and “rental” (commercial or academic).
• Initial focus on commercial clouds (AWS)

Page 20: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

AWS infrastructure

• AWS CloudFormation automates the setup and teardown of the Amazon Route 53 DNS entries, the Elastic Load Balancing load balancer, the Auto Scaling group, and Amazon CloudWatch monitoring

• Launched in each Availability Zone prior to workflows being run
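A minimal sketch of the per-Availability-Zone setup and teardown described above, using the AWS CLI. The commands are printed rather than executed. The stack-name prefix `services-`, the template file `services.json`, and the `AvailabilityZone` parameter key are placeholders, not HEPCloud's real names.

```shell
#!/bin/bash
# Print (not run) the AWS CLI calls that would create or delete one
# CloudFormation stack per Availability Zone.
# Arguments: <create|delete> <az> [<az> ...]
# Stack names, template file, and parameter key are hypothetical.
stack_commands() {
  local action=$1; shift
  local az
  for az in "$@"; do
    case $action in
      create)
        echo "aws cloudformation create-stack --stack-name services-$az" \
             "--template-body file://services.json" \
             "--parameters ParameterKey=AvailabilityZone,ParameterValue=$az" ;;
      delete)
        echo "aws cloudformation delete-stack --stack-name services-$az" ;;
    esac
  done
}

stack_commands create us-west-2a us-west-2b   # before the workflows run
stack_commands delete us-west-2a us-west-2b   # after they finish
```

Deleting a stack removes every resource the template created, which is what makes teardown as repeatable as setup.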

Page 21: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

On Demand services
● Workflows depend on software services to run
● Automating the deployment of these services on AWS on demand enables scalability and cost savings
  o Services include data caching (e.g. Squid), WMS, submission service, data transfer, etc.
  o As services are made deployable on demand, instantiate an ensemble of services together (e.g. through AWS CloudFormation)
● Example: On-demand Squid

Page 22: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Decision engine

Page 23: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Reaching ~60k slots on AWS with HEPCloud

Page 24: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

HEPCloud/AWS: 25% of CMS global capacity

Page 25: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

HPC needs of particle physics workflows

Now that the HTC use case is out of the way…

Machine learning for pattern recognition

Specialized HPC demands
• Very large computations (petascale) of physics processes necessary for theoretical interpretations
• Very large computations (petascale) for modeling particle accelerators and detectors

Page 26: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Demo: Scaling jobs on a budget

Page 27: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Summary

Page 28: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Summary

Easy to “recreate” clusters in the cloud
• Extremely scalable and flexible

Spot + HPC is a wonderful combination
• Saves time and money

Customer example: FNAL
Alces Flight in AWS Marketplace
This is only the beginning: rethink HPC applications for fault tolerance, extreme scalability, etc.

Page 29: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Thank you!

Page 30: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Demo backups

Page 31: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Introduction

Set up a 2-node cluster (2 compute nodes) where:
• Master node = c4.8xlarge
• 2x compute nodes = c4.xlarge
• 10GigE networking

Run compute nodes on Spot market and master node On-Demand
Access cluster from Microsoft Windows box (using PuTTY)

Page 32: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Step 1

Start up cluster using Alces Flight JSON file
• CloudFormation service
• Click Create Stack
• Answer questions
• Key file is critical! You will use it to log in to master node.
• Choose a reasonable Spot price (check current market in region)
  – http://aws.amazon.com/ec2/pricing/ (near bottom of page)
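Besides the pricing page, the current market can be checked from the command line. The sketch below assumes the AWS CLI is installed and configured. The function names are made up, and the region and instance type in the usage line are example values taken from elsewhere in this deck. It reports only the highest recent price as a reference point and does not recommend a bid.

```shell
#!/bin/bash
# Recent Spot prices for one instance type, one price per line.
# Needs the AWS CLI and credentials. Arguments: <region> <instance_type>
recent_spot_prices() {
  aws ec2 describe-spot-price-history \
    --region "$1" --instance-types "$2" \
    --product-descriptions "Linux/UNIX" --max-items 20 \
    --query 'SpotPriceHistory[].SpotPrice' --output text | tr '\t' '\n'
}

# Highest price seen on stdin, printed with four decimals.
max_price() {
  awk '$1 + 0 > m { m = $1 + 0 } END { printf "%.4f\n", m }'
}

# Usage (not run here): recent_spot_prices us-west-1 c4.xlarge | max_price
```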

Page 33: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

AWS CloudFormation page

Page 34: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Create a stack

• Specify the details of template instantiation
• Called a “stack”
• Allows you to tailor stack to needs

Page 35: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Stack details—top portion

Name of cluster

Spot bid

Instance type for compute nodes

Amazon S3 bucket for customizations

Key pair for that region

Number of initial nodes in cluster

Page 36: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Stack details—bottom portion

Storage capacity for instances

Master node type

Max number of nodes

Network CIDR

User name
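The same form fields can be supplied from the command line as stack parameters. A small sketch, assuming the template exposes one parameter per field: every key and value below is a hypothetical stand-in, since the real parameter names are defined in the Alces Flight template and are not shown on the slides.

```shell
#!/bin/bash
# Turn KEY=VALUE pairs into the ParameterKey/ParameterValue form that
# `aws cloudformation create-stack --parameters` expects, one per line.
cfn_params() {
  local pair
  for pair in "$@"; do
    echo "ParameterKey=${pair%%=*},ParameterValue=${pair#*=}"
  done
}

# Hypothetical keys and values standing in for the form fields.
cfn_params ClusterName=demo ComputeType=c4.xlarge MasterType=c4.8xlarge \
           InitialNodes=2 MaxNodes=8 Username=alces
```

The printed lines can then be passed after `--parameters` in an `aws cloudformation create-stack` call, alongside `--stack-name` and `--template-body`.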

Page 37: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Optional tags

Page 38: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Review of stack configuration—1

Page 39: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Review of stack configuration—2

Don’t forget to check this box!

Page 40: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Stack gets created—1

Page 41: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Stack gets created—2

Page 42: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Stack is done and cluster exists!

Took about 5 minutes

Page 43: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Demo 2

Page 44: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Cluster configuration

Recall that Alces Flight comes with:• Environment modules (connected to Alces Gridware)• Pdsh• SGE job scheduler• GNU Compilers• Alces Gridware• Built on CentOS 7• 750+ applications and libraries (MPI included)

Log into master node and try out commands (PuTTY). Run application.

Page 45: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Start up PuTTY

Copy/Paste IP address

Page 46: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Keep alive in PuTTY
• Go to Connection on left menu
• Click on it
• Select Enable TCP keepalives
• Keeps PuTTY connection alive

Page 47: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Add key to PuTTY session
• Go to SSH on left menu
• Expand menu
• Select Auth
• Use Browse to locate private key (should be the same as was used when cluster was created)

Note: Has to be in .ppk format (might have to convert it from .pem format)
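For the .pem to .ppk conversion, a minimal sketch using the command-line `puttygen` that ships with PuTTY on Linux and macOS (on Windows the graphical PuTTYgen does the same job through Load and Save private key). The function name and the file name in the usage line are placeholders.

```shell
#!/bin/bash
# Convert a .pem private key to PuTTY's .ppk format with command-line puttygen.
# Writes the .ppk next to the input file and prints its path.
pem_to_ppk() {
  local pem=$1 ppk="${1%.pem}.ppk"
  puttygen "$pem" -O private -o "$ppk" && echo "$ppk"
}

# Usage (not run here; "mykey.pem" is a placeholder): pem_to_ppk mykey.pem
```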

Page 48: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Log in to master node!
• Use “alces” as login (should match what you input to create cluster)
• No password needed (uses the private key)

Ready to go!
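From a Linux or macOS machine the same login works with OpenSSH instead of PuTTY. A sketch that only prints the command: `MASTER_IP` stands for the address copied from the console, the key file name is a placeholder, and OpenSSH takes the original .pem key rather than the .ppk file.

```shell
#!/bin/bash
# Build the OpenSSH command matching the PuTTY settings: private key,
# TCP keepalives (already the OpenSSH default), and the cluster user name.
# Arguments: <key_file> <user> <host>. Printed rather than executed.
ssh_command() {
  local key=$1 user=$2 host=$3
  echo "ssh -i $key -o TCPKeepAlive=yes $user@$host"
}

ssh_command mykey.pem alces MASTER_IP
```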

Page 49: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Check number of nodes

pdsh uses genders
– “nodes” are only compute nodes
– “cluster” includes master node

Be sure to check “qhost” for compute nodes
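A short sketch of the node check. pdsh prefixes each output line with the host name, so counting distinct prefixes gives the number of hosts that answered. The helper name `count_hosts` is made up; the group names `nodes` and `cluster` are the genders groups named on the slide.

```shell
#!/bin/bash
# Count distinct hosts in pdsh output, which prefixes every line with "host: ".
count_hosts() {
  awk -F: '!seen[$1]++ { n++ } END { print n + 0 }'
}

# Usage on the master node (not run here):
#   pdsh -g nodes hostname | count_hosts     # compute nodes only
#   pdsh -g cluster hostname | count_hosts   # compute nodes plus master
#   qhost                                    # the scheduler's view of the hosts
```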

Page 50: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Available modules at boot

AWS command line tools installed by default

Page 51: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

“alces gridware list”

Page 52: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Install an application
• Search for application using “alces gridware search …”
• Install application using “alces gridware install …”
• Environment modules are updated when application is installed

Page 53: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Modules after installing application

To run application, don’t forget to load the application module!

Page 54: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Run application

Page 55: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Remove module and application

First, remove module

Second, run “alces gridware purge …”
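The install, load, run, and remove steps from the last few slides can be strung together as below. This is a sketch under assumptions: the slides elide the arguments to the gridware commands, so the package and module names are left as parameters, and it is assumed (not confirmed by the slides) that `purge` accepts the same package name as `install`.

```shell
#!/bin/bash
# Install a Gridware package, use it through its module, then remove both.
# $1 = Gridware package name, $2 = module name, rest = command to run.
try_package() {
  local pkg=$1 mod=$2; shift 2
  alces gridware install "$pkg" || return 1
  module load "$mod"            # the new module must be loaded before running
  "$@"
  module unload "$mod"          # first, remove the module from the session
  alces gridware purge "$pkg"   # second, purge the application itself
}

# Usage (not run here): try_package <package> <module> <command> [args...]
```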

Page 56: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Demo 3

Page 57: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Demo 3

Cluster is up: show running MPI application
• Which MPI application (make it something reasonable)

Install application
• Show change in modules

Job script (go over details)
Submit job and show output of qstat
• Auto Scaling?

Show output from application (yes it’s running)

Page 58: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Cluster MPI definition

Install MPI application using alces gridware
Load module
Set up job script
Submit job
Watch it run (run, app, run)

Page 59: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

List available depots

Depots are prebuilt collections of applications

Page 60: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Install benchmark depot
• Installs depot
• Abbreviated output
• Easy command
• Notice the use of the “depot” option
• Installs dependency as well

Page 61: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Check modules
• Alces Flight always adds the appropriate module files
• Don’t forget to use them!

New modules

Don’t forget to load modules before running!

Page 62: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Create job script

Don’t forget that Alces Flight uses SGE

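#!/bin/bash
# Merge stderr into stdout, name the job "imb", write output to a per-job file
#$ -j y -N imb -o $HOME/imb_out.$JOB_ID
# Request 2 slots in the mpinodes-verbose parallel environment, run from the
# current directory, and export the submission environment to the job
#$ -pe mpinodes-verbose 2 -cwd -V
module load mpi/openmpi
module load apps/imb
mpirun IMB-MPI1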

Alces also has job templates available: “alces gridware templates list”

Page 63: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Submit job and check status
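A minimal sketch of submitting and then polling with the SGE command-line tools. `qsub -terse` prints only the job ID, and `qstat -j` fails once the scheduler no longer knows the job. The function name, the polling interval, and the script file name in the usage line are assumptions.

```shell
#!/bin/bash
# Submit a job script and wait until SGE no longer lists the job.
# Arguments: <job_script> [<poll_interval_seconds>]
# Prints the job ID once the job has left the queue.
submit_and_wait() {
  local script=$1 interval=${2:-10} job_id
  job_id=$(qsub -terse "$script") || return 1
  while qstat -j "$job_id" >/dev/null 2>&1; do
    sleep "$interval"
  done
  echo "$job_id"
}

# Usage (not run here; "imb.sh" is a placeholder file name): submit_and_wait imb.sh
```

With the job script shown earlier, the output would then be in `$HOME/imb_out.<job ID>`.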

Page 64: Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sector Summit 2016

Once job is done, check output

It works! We have MPI!