Linux Clusters for High-Performance Computing


Page 1: Linux Clusters for High-Performance Computing


Linux Clusters for High-Performance Computing

Jim Phillips and Tim Skirvin

Theoretical and Computational Biophysics

Beckman Institute

Page 2: Linux Clusters for High-Performance Computing


HPC vs High-Availability

There are two major types of Linux clusters:

– High-Performance Computing
  • Multiple computers running a single job for increased performance

– High-Availability
  • Multiple computers running the same job for increased reliability

We will be talking about the former!

Page 3: Linux Clusters for High-Performance Computing


Why Clusters?

Cheap alternative to “big iron”

Local development platform for “big iron” code

Built to task (buy only what you need)

Built from COTS components

Runs COTS software (Linux/MPI)

Lower yearly maintenance costs

Single failure does not take down entire facility

Re-deploy as desktops or “throw away”

Page 4: Linux Clusters for High-Performance Computing


Why Not Clusters?

Non-parallelizable or tightly coupled application

Cost of porting large existing codebase too high

No source code for application

No local expertise (don’t know Unix)

No vendor hand holding

Massive I/O or memory requirements

Page 5: Linux Clusters for High-Performance Computing


Know Your Users

Who are you building the cluster for?

– Yourself and two grad students?
– Yourself and twenty grad students?
– Your entire department or university?

Are they clueless, competitive, or malicious?

How will you allocate resources among them?

Will they expect an existing infrastructure?

How well will they tolerate system downtimes?

Page 6: Linux Clusters for High-Performance Computing


Your Users’ Goals

Do you want increased throughput?

– Large number of queued serial jobs.
– Standard applications, no changes needed.

Or decreased turnaround time?
– Small number of highly parallel jobs.
– Parallelized applications, changes required.

Page 7: Linux Clusters for High-Performance Computing


Your Application

The best benchmark for making decisions is your application running your dataset.

Designing a cluster is about trade-offs.
– Your application determines your choices.
– No supercomputer runs everything well either.

Never buy hardware until the application is parallelized, ported, tested, and debugged.

Page 8: Linux Clusters for High-Performance Computing


Your Application: Parallel Performance

How much memory per node?

How would it scale on an ideal machine?

How is scaling affected by:
– Latency (time needed for small messages)?
– Bandwidth (time per byte for large messages)?
– Multiprocessor nodes?

How fast do you need to run? (See the ping-pong sketch below for measuring latency and bandwidth.)

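One way to measure latency and bandwidth on a candidate network is a simple MPI ping-pong test between two nodes. This is a minimal sketch, not a rigorous benchmark; it assumes an MPI implementation with its compiler wrapper (mpicc) is installed and that the job is launched on exactly two processes. The small-message time approximates latency; the large-message rate approximates bandwidth.

    /* pingpong.c - rough MPI latency/bandwidth estimate between ranks 0 and 1
     * Build: mpicc -O2 pingpong.c -o pingpong
     * Run:   mpirun -np 2 ./pingpong
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int reps = 1000;
        int sizes[] = { 8, 1024, 1048576 };  /* small -> latency, large -> bandwidth */

        for (int s = 0; s < 3; s++) {
            int n = sizes[s];
            char *buf = malloc(n);
            MPI_Status st;
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
                } else if (rank == 1) {
                    MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                    MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = (MPI_Wtime() - t0) / reps / 2.0;  /* one-way time per message */
            if (rank == 0)
                printf("%8d bytes: %8.2f us one-way, %8.2f MB/s\n",
                       n, t * 1e6, n / t / 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }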

Page 9: Linux Clusters for High-Performance Computing


Budget

Figure out how much money you have to spend.

Don’t spend money on problems you won’t have.
– Design the system to just run your application.

Never solve problems you can’t afford to have.
– Fast network on 20 nodes or slower on 100?

Don’t buy the hardware until…
– The application is ported, tested, and debugged.
– The science is ready to run.

Page 10: Linux Clusters for High-Performance Computing


Environment

The cluster needs somewhere to live.

– You won’t want it in your office.

– Not even in your grad student’s office.

Cluster needs:
– Space (keep the fire marshal happy)
– Power
– Cooling

Page 11: Linux Clusters for High-Performance Computing


Environment: Power

Make sure you have enough power.
– Kill-A-Watt
  • $30 at ThinkGeek
– A 1.3 GHz Athlon draws 183 VA at full load
  • Newer systems draw more; measure for yourself!
  • More efficient power supplies help
– Wall circuits typically supply about 20 Amps
  • Around 12 PCs @ 183 VA max (8-10 for safety; see the worked example below)
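As a quick sanity check on those numbers, the arithmetic below estimates how many such nodes fit on one wall circuit. This is a rough sketch: the 183 VA per-node figure comes from the slide, while the 120 V wall voltage and the 80% safety derating are assumptions.

    /* circuit.c - rough estimate of nodes per wall circuit */
    #include <stdio.h>

    int main(void) {
        double volts        = 120.0;  /* assumed US wall voltage */
        double breaker_amps = 20.0;   /* typical circuit, per the slide */
        double derate       = 0.8;    /* leave ~20% headroom for safety */
        double node_va      = 183.0;  /* measured 1.3 GHz Athlon at full load */

        double usable_va = volts * breaker_amps * derate;         /* 1920 VA */
        printf("usable capacity:   %.0f VA\n", usable_va);
        printf("nodes per circuit: %.0f\n", usable_va / node_va); /* ~10 */
        return 0;
    }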

Page 12: Linux Clusters for High-Performance Computing


Environment: Power Factor

System                  Current   Real Power   Apparent Power   Power Factor
Athlon 1333 (idle)      1.25 A    98 W         137 VA           0.71
Athlon 1333 (load)      1.67 A    139 W        183 VA           0.76
Dual Athlon MP 2600+    2.89 A    246 W        319 VA           0.77
Dual Xeon 2.8 GHz       2.44 A    266 W        270 VA           0.985

More efficient power supplies do help!

Always test your power under load.

W = V x A x PF
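The formula can be checked directly against the readings above: multiplying each apparent power (VA) by its power factor reproduces the measured wattage. A minimal sketch using the table's values:

    /* pf.c - check W = V x A x PF against the readings above */
    #include <stdio.h>

    int main(void) {
        /* apparent power (VA) and power factor, taken from the table */
        struct { const char *name; double va, pf; } m[] = {
            { "Athlon 1333 (idle)",   137.0, 0.71  },
            { "Athlon 1333 (load)",   183.0, 0.76  },
            { "Dual Athlon MP 2600+", 319.0, 0.77  },
            { "Dual Xeon 2.8 GHz",    270.0, 0.985 },
        };
        for (int i = 0; i < 4; i++)
            /* real (heat-producing) power; UPS and wiring must still be
               sized for the larger VA figure */
            printf("%-22s %6.0f W\n", m[i].name, m[i].va * m[i].pf);
        return 0;
    }

This is why the dual Xeon's near-unity power factor matters: its real and apparent power are nearly equal, so it wastes less circuit and UPS capacity.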

Page 13: Linux Clusters for High-Performance Computing


Environment: Uninterruptible Power

5kVA UPS ($3,000)

– Holds 24 PCs @183VA (safely)

– Will need to work out building power to them

– May not need UPS for all systems, just root node

Page 14: Linux Clusters for High-Performance Computing


Environment: Cooling

Building AC will only get you so far

Make sure you have enough cooling.
– One PC @ 183 VA puts out ~600 BTU/hr of heat.
– 1 ton of AC = 12,000 BTU/hr = ~3,500 Watts
– Can run ~20 CPUs per ton of AC (see the sketch below)
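The ~20-per-ton rule follows from the standard conversions 1 W ≈ 3.41 BTU/hr and 1 ton of AC = 12,000 BTU/hr. A minimal sketch, treating the 183 VA per-PC figure as the approximate heat load in watts, as the slide does:

    /* cooling.c - rough AC sizing from the rule of thumb above */
    #include <stdio.h>

    int main(void) {
        double pc_load_w = 183.0;    /* per-PC load, taken as ~watts of heat */
        double btu_per_w = 3.412;    /* 1 W = 3.412 BTU/hr */
        double ton_btu   = 12000.0;  /* 1 ton of AC = 12,000 BTU/hr */

        double pc_btu = pc_load_w * btu_per_w;            /* ~620 BTU/hr */
        printf("heat per PC: %.0f BTU/hr\n", pc_btu);
        printf("PCs per ton: %.0f\n", ton_btu / pc_btu);  /* ~19-20 */
        return 0;
    }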

Page 15: Linux Clusters for High-Performance Computing


Hardware

Many important decisions to make

Keep application performance, users, environment, local expertise, and budget in mind

An exercise in systems integration, making many separate components work well as a unit

A reliable but slightly slower cluster is better than a fast but non-functioning cluster

Always benchmark a demo system first!

Page 16: Linux Clusters for High-Performance Computing


Hardware: Networking

Two main options:

– Gigabit Ethernet – cheap ($100-200/node), universally supported and tested, cheap commodity switches up to 48 ports.

• 24-port switches seem the best bang-for-buck

– Special interconnects:
  • Myrinet – very expensive (thousands of dollars per node), very low latency, logarithmic cost model for very large clusters.
  • InfiniBand – similar, less common, not as well supported.

Page 17: Linux Clusters for High-Performance Computing


Hardware: Other Components

Filtered Power (Isobar, Data Shield, etc.)

Network Cables: buy good ones; you’ll save debugging time later.

If a cable is at all questionable, throw it away!

Power Cables
Monitor
Video/Keyboard Cables

Page 18: Linux Clusters for High-Performance Computing


User Rules of Thumb

1-4 users:
– Yes, you still want a queueing system.
– Plan ahead to avoid idle time and conflicts.

5-20 users:
– Put one person in charge of running things.
– Work out a fair-share or reservation system.

> 20 users:
– User documentation and examples are essential.
– Decide who makes resource allocation decisions.

Page 19: Linux Clusters for High-Performance Computing


Application Rules of Thumb

1-2 programs:

– Don’t pay for anything you won’t use.

– Benchmark, benchmark, benchmark!
  • Be sure to use your typical data.

• Try different compilers and compiler options.

> 2 programs:

– Select the most standard OS environment.

– Benchmark those that will run the most.
  • Consider a specialized cluster for dominant apps only.

Page 20: Linux Clusters for High-Performance Computing


Parallelization Rules of Thumb

Throughput is easy…app runs as is.

Turnaround is not:
– Parallel speedup is limited by:
  • Time spent in non-parallel code (Amdahl’s law; see the sketch below).
  • Time spent waiting for data from the network.
– Improve serial performance first:
  • Profile to find the most time-consuming functions.
  • Try new algorithms, libraries, hand tuning.
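The non-parallel-code limit is Amdahl's law: with a serial fraction s, the speedup on n processors is at most 1 / (s + (1 - s)/n). A minimal sketch showing why shrinking the serial fraction matters before buying more nodes:

    /* amdahl.c - parallel speedup limit for a given serial fraction */
    #include <stdio.h>

    /* Amdahl's law: speedup on n CPUs when fraction s of the work is serial */
    static double speedup(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    int main(void) {
        int    cpus[]   = { 8, 32, 128, 1024 };
        double serial[] = { 0.10, 0.01 };   /* 10% vs. 1% serial code */

        for (int j = 0; j < 2; j++)
            for (int i = 0; i < 4; i++)
                printf("serial %4.0f%%, %4d CPUs: speedup %6.1f\n",
                       serial[j] * 100.0, cpus[i], speedup(serial[j], cpus[i]));
        return 0;
    }

With 10% serial code the speedup never exceeds 10x no matter how many CPUs you add; with 1% it tops out near 100x.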

Page 21: Linux Clusters for High-Performance Computing


Some Details Matter More

What limiting factor do you hit first?
– Budget?
– Space, power, and cooling?
– Network speed?
– Memory speed?
– Processor speed?
– Expertise?

Page 22: Linux Clusters for High-Performance Computing


Limited by Budget

Don’t waste money solving problems you can’t afford to have right now:
– Regular PCs on shelves (rolling carts)
– Gigabit networking and multiple jobs

Benchmark performance per dollar.
– The last dollar you spend should be on whatever improves your performance.

Ask for equipment funds in proposals!

Page 23: Linux Clusters for High-Performance Computing


Limited by Space

Benchmark performance per rack

Consider all combinations of:
– Rackmount nodes
  • More expensive, but no performance loss
– Dual-processor nodes
  • Less memory bandwidth per processor
– Dual-core processors
  • Less memory bandwidth per core

Page 24: Linux Clusters for High-Performance Computing


Limited by Power/Cooling

Benchmark performance per Watt

Consider:
– Opteron or PowerPC rather than Xeon
– Dual-processor nodes
– Dual-core processors

Page 25: Linux Clusters for High-Performance Computing


Limited by Network Speed

Benchmark your code at NCSA.
– 10,000 CPU-hours is easy to get.
– Try running one process per node.
  • If that works, buy single-processor nodes.
– Try Myrinet.
  • If that works, can you run at NCSA?
– Can you run more, smaller jobs?

Page 26: Linux Clusters for High-Performance Computing


Limited by Serial Performance

Is it memory performance? (A streaming-triad test, sketched below, can help you tell.) Try:
– Single-core Opterons
– Single-processor nodes
– Larger-cache CPUs
– Lower clock speed CPUs

Is it really the processor itself? Try:
– Higher clock speed CPUs
– Dual-core CPUs
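One quick way to tell whether a code is memory-bound is to compare its per-CPU performance against a simple streaming kernel like the triad loop below. This is a rough, single-threaded sketch in the spirit of the STREAM benchmark; the array size is an assumption chosen to be much larger than any cache.

    /* triad.c - crude sustained memory bandwidth estimate (STREAM-style triad) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (16 * 1024 * 1024)   /* three 128 MB arrays, far larger than cache */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t t0 = clock();
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];   /* triad: 2 loads + 1 store per element */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        double bytes = 3.0 * N * sizeof(double);
        printf("check value %.1f, triad bandwidth %.2f GB/s\n",
               a[N / 2], bytes / secs / 1e9);
        free(a); free(b); free(c);
        return 0;
    }

If application performance per CPU tracks this number rather than clock speed, the memory-oriented options above are the ones to pursue.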

Page 27: Linux Clusters for High-Performance Computing


Limited by Expertise

There is no substitute for a local expert.

Qualifications:
– Comfortable with the Unix command line.
– Comfortable with Linux administration.
– Cluster experience if you can get it.

Page 28: Linux Clusters for High-Performance Computing


System Software

“Linux” is just a starting point:
– Operating system
– Libraries – message passing, numerical
– Compilers
– Queueing systems

Performance
Stability
System security
Existing infrastructure considerations

Page 29: Linux Clusters for High-Performance Computing


Scyld Beowulf / Clustermatic

Single front-end master node:
– Fully operational, normal Linux installation.
– Bproc patches incorporate slave nodes.

Severely restricted slave nodes:
– Minimum installation, downloaded at boot.
– No daemons, users, logins, scripts, etc.
– No access to NFS servers except for the master.
– Highly secure slave nodes as a result.

Page 30: Linux Clusters for High-Performance Computing


Oscar/ROCKS

Each node is a full Linux install.
– Offers access to a file system.
– Software tools help manage these large numbers of machines.
– Still more complicated than maintaining only one “master” node.
– Better suited for running multiple jobs on a single cluster, vs. one job on the whole cluster.

Page 31: Linux Clusters for High-Performance Computing


System Software: Compilers

No point in buying fast hardware just to run poorly performing executables.

Good compilers might provide a 50-150% performance improvement.

It may be cheaper to buy a $2,500 compiler license than to buy more compute nodes.

Benchmark your real application with the compiler; get an evaluation license if necessary.

Page 32: Linux Clusters for High-Performance Computing


System Software: Message Passing Libraries

Usually dictated by application code

Choose something that will work well with hardware, OS, and application

User-space message passing?

MPI: industry standard, many implementations by many vendors, as well as several free implementations

Others: Charm++, BIP, Fast Messages
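For orientation, the skeleton that essentially every MPI application shares is shown below. This is a minimal sketch; any standards-conforming MPI implementation should build it with its mpicc wrapper and run it with mpirun.

    /* hello.c - the basic structure of an MPI program
     * Build: mpicc hello.c -o hello
     * Run:   mpirun -np 4 ./hello
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes? */

        printf("process %d of %d checking in\n", rank, size);

        MPI_Finalize();
        return 0;
    }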

Page 33: Linux Clusters for High-Performance Computing


System Software: Numerical Libraries

Can provide a huge performance boost over “Numerical Recipes” or in-house routines

Typically hand-optimized for each platform

When applications spend a large fraction of runtime in library code, it pays to buy a license for a highly tuned library

Examples: BLAS, FFTW, Interval libraries
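To illustrate the payoff, the sketch below computes C = A x B by calling a tuned BLAS routine instead of a hand-written triple loop. It assumes a CBLAS interface (such as the one shipped with ATLAS) is installed; the header name and link flags vary by vendor.

    /* dgemm_demo.c - matrix multiply via a tuned BLAS instead of a naive loop
     * Build (example): gcc -O2 dgemm_demo.c -o dgemm_demo -lcblas -latlas
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>

    #define N 512

    int main(void) {
        double *A = malloc(N * N * sizeof(double));
        double *B = malloc(N * N * sizeof(double));
        double *C = malloc(N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* C = 1.0 * A * B + 0.0 * C, row-major storage; a hand-optimized
           dgemm is usually several times faster than a naive triple loop */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);

        printf("C[0][0] = %.1f (expect %.1f)\n", C[0], 2.0 * N);
        free(A); free(B); free(C);
        return 0;
    }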

Page 34: Linux Clusters for High-Performance Computing


System Software: Batch Queueing

Clusters, although cheaper than “big iron”, are still expensive, so they should be efficiently utilized.

The use of a batch queueing system can keep a cluster running jobs 24/7

Things to consider:
– Allocation of sub-clusters?
– 1-CPU jobs on SMP nodes?

Examples: Sun Grid Engine, PBS, Load Leveler

Page 35: Linux Clusters for High-Performance Computing


System Software: Operating System

Any annoying management or reliability issues get hugely multiplied in a cluster environment.

Plan for security from the outset

Clusters have special needs; use something appropriate for the application and hardware

Page 36: Linux Clusters for High-Performance Computing


System Software: Install It Yourself

Don’t use the vendor’s pre-loaded OS.
– They would love to sell you 100 licenses.
– What happens when you have to reinstall?
– Do you like talking to tech support?
– Are those flashy graphics really useful?
– How many security holes are there?

Page 37: Linux Clusters for High-Performance Computing


Security Tips

Restrict physical access to the cluster, if possible.
– Be involved in all tours, to make sure nobody touches anything.

If you’re on campus, put your clusters into the Fully Closed network group.
– Might cause some limitations if you’re trying to submit from off-site.
– Will cause problems with GLOBUS.
– The built-in firewall is your friend!

Page 38: Linux Clusters for High-Performance Computing


Purchasing Tips: Before You Begin

Get your budget

Work out the space, power, and cooling capacities of the room.

Start talking to vendors early.
– But don’t commit!

Don’t fall in love with any one vendor until you’ve looked at them all.

Page 39: Linux Clusters for High-Performance Computing


Purchasing Tips: Design Notes

Make sure to order some spare nodes.
– Serial nodes and hot-swap spares.
– Keep them running to make sure they work.

If possible, install hard drives only in the head node.
– State law and UIUC policy require all hard drives to be wiped before disposal.
– It doesn’t matter if the drive never stored anything!
– Each drive will take 8-10 hours to wipe.
  • Save yourself a world of pain in a few years…
  • …or just give your machines to some other campus group, and make them worry about it.

Page 40: Linux Clusters for High-Performance Computing


Purchasing Tips: Get Local Service

If a node dies, do you want to ship it?

Two choices:
– Local business (Champaign Computer)
– Major vendor (Sun)

Ask others about responsiveness.

Design your cluster so that you can still run jobs if a couple of nodes are down.

Page 41: Linux Clusters for High-Performance Computing


Purchasing Tips: Dealing with Purchasing

You will want to put the cluster order on a Purchase Order (PO)

– Do not pay for the cluster until it entirely works.

Prepare a ten-point letter

– Necessary for all purchases >$25k.

– Examples are available from your business office (or bug us for our examples).

– These aren’t difficult to write, but will probably be necessary.

Page 42: Linux Clusters for High-Performance Computing


Purchasing Tips: The Bid Process

Any purchase >$28k must go up for bid.
– Exception: sole-source vendors.
– The number grows every year.
– Adds a month or so to the purchase time.
– If you can keep the numbers below the magic $28k, do it!
  • The bid limit may be leverage for vendors to drop their prices just below the limit; plan accordingly.

You will get lots of junk bids.
– Be very specific about your requirements to keep them away!

Page 43: Linux Clusters for High-Performance Computing


Purchasing Tips: Working the Bid Process

Use sole-source vendors where possible.
– This is a major reason why we buy from Sun.
– Check with your purchasing people.
– This won’t help you get around the month-long delay, as the item still has to be posted.

Purchase your clusters in small chunks.
– Only works if you’re looking at a relatively small cluster.
– Again, you may be able to use this as leverage with your vendor to lower their prices.

Page 44: Linux Clusters for High-Performance Computing


Purchasing Tips: Receiving Your Equipment

Let Receiving know that the machines are coming.
– They will take up a lot of space on the loading dock.
– Working with them to save space will earn you goodwill (and faster turnaround).
– Take your machines out of Receiving’s space as soon as reasonably possible.

Page 45: Linux Clusters for High-Performance Computing


Purchasing Tips: Consolidated Inventory

Try to convince your Inventory workers to tag each cluster, and not each machine

– It’s really going to be running as a cluster anyway (right?).

– This will make life easier on you.
  • Repairs are easier when you don’t have to worry about inventory stickers.
– This will make life easier for them.
  • 3 items to track instead of 72.

Page 46: Linux Clusters for High-Performance Computing


Purchasing Tips: Assembly

Get extra help for assembly.
– It’s reasonably fun work…
  • …as long as the assembly line goes fast.
– Demand pizza.

Test the assembly instructions before you begin.
– Nothing is more annoying than having to realign all of the rails after they’re all screwed in.

Page 47: Linux Clusters for High-Performance Computing


Purchasing Tips: Testing and Benchmarking

Test the cluster before you put it into production!
– Sample jobs + cpuburn
– Look at power consumption
– Test for dead nodes

Remember: vendors make mistakes!
– Even their demo applications may not work; check for yourself.

Page 48: Linux Clusters for High-Performance Computing


Case Studies

The best way to illustrate cluster design is to look at how somebody else has done it.
– The TCB Group has designed four separate Linux clusters in the last six years.

Page 49: Linux Clusters for High-Performance Computing


2001 Case Study

Users:
– Many researchers with MD simulations
– Need to supplement time on supercomputers

Application:
– Not memory-bound, runs well on IA32
– Scales to 32 CPUs with 100 Mbps Ethernet
– Scales to 100+ CPUs with Myrinet

Page 50: Linux Clusters for High-Performance Computing


2001 Case Study 2

Budget:

– Initially $20K, eventually grew to $100K

Environment:
– Full machine room, slowly clearing out space
– Under-utilized 12 kVA UPS, staff electrician
– 3-ton chilled-water air conditioner (Liebert)

Page 51: Linux Clusters for High-Performance Computing


2001 Case Study 3

Hardware:

– Fastest AMD Athlon CPUs available (1333 MHz).

– Fast CL2 SDRAM, but not DDR.

– Switched 100Mbps Ethernet, Intel EEPro cards.

– Small 40 GB hard drives and CD-ROMs.

System Software:

– Scyld clusters of 32 machines, 1 job/cluster.

– Existing DQS, NIS, NFS, etc. infrastructure.

Page 52: Linux Clusters for High-Performance Computing


2003 Case Study

What changed since 2001:

– 50% increase in processor speed

– 50% increase in NAMD serial performance

– Improved stability of SMP Linux kernel

– Inexpensive gigabit cards and 24-port switches

– Nearly full machine room and power supply

– Popularity of compact form factor cases

– Emphasis on interactive MD of small systems

Page 53: Linux Clusters for High-Performance Computing


2003 Case Study 2

Budget:

– Initially $65K, eventually grew to ~$100K

Environment:

– Same general machine room environment

– Additional machine room space is available in server room

• Just switched to using rack-mount equipment

– Still using the old clusters; don’t want to get rid of them entirely

• Need to be more space-conscious

Page 54: Linux Clusters for High-Performance Computing


2003 Case Study 3

Option #1:

– Single processor, small form factor nodes.

– Hyperthreaded Pentium 4 processors.

– 32 bit 33 MHz gigabit network cards.

– 24 port gigabit switch (24-processor clusters).

Problems:

– No ECC memory.

– Limited network performance.

– Too small for next-generation video cards.

Page 55: Linux Clusters for High-Performance Computing


2003 Case Study 4

Final decision:
– Dual Athlon MP 2600+ in normal cases.
  • No hard drives or CD-ROMs.
  • 64-bit 66 MHz gigabit network cards.
– 24-port gigabit switch (48-processor clusters).
– Clustermatic OS, boot slaves off of floppy.
  • Floppies have proven very unreliable, especially when left in the drives.

Benefits:
– Server-class hardware with ECC memory.
– Maximum processor count for large simulations.
– Maximum network bandwidth for small simulations.

Page 56: Linux Clusters for High-Performance Computing


2003 Case Study 5

Athlon clusters from 2001 recycled:
– 36 nodes outfitted as desktops
  • Added video cards, hard drives, extra RAM
  • Cost: ~$300/machine
  • Now dead or in a 16-node Condor test cluster
– 32 nodes donated to another group
– Remaining nodes moved to the server room
  • 16-node Clustermatic cluster (used by guests)
  • 12 spares and build/test boxes for developers

Page 57: Linux Clusters for High-Performance Computing


2004 Case Study

What changed since 2003:

– Technologically, not much!
– Space is more of an issue.
– A new machine room has been built for us.
– Vendors are desperate to sell systems at any price.

Page 58: Linux Clusters for High-Performance Computing


2004 Case Study 2

Budget:
– Initially ~$130K, eventually grew to ~$180K

Environment:
– The new machine room will store the new clusters.
– Two five-ton Liebert air conditioners have been installed.
– There is minimal floor space, enough for four racks of equipment.

Page 59: Linux Clusters for High-Performance Computing


2004 Case Study 3

Final decision:

– 72x Sun V60x rack-mount servers
  • Dual 3.06 GHz Intel processors – only slightly faster
  • 2 GB RAM, dual 36 GB HDs, DVD-ROM included in the deal
  • Network-bootable gigabit Ethernet built in
  • Significantly more stable than any old cluster machine
– 3x 24-port gigabit switches (3x 48-processor clusters)
– 6x serial nodes (identical to the above, also serve as spares)
– Sun Rack 900-38
  • 26 systems per rack, plus switch and UPS for head nodes
– Clustermatic 4 on Red Hat 9

Page 60: Linux Clusters for High-Performance Computing


2004 Case Study 4

Benefits:
– Improved stability over the old clusters.
– Management is significantly easier with Sun servers than PC whiteboxes.
– Network booting of slaves allows lights-out management.
– Systems use up minimal floor space.
– Similar performance to 2003 allows all 6 clusters (3 old + 3 new) to take jobs from a single queue.

– Less likely to run out of memory when running an “express queue” job.

– Complete machines easily retasked.

Page 61: Linux Clusters for High-Performance Computing


For More Information…

http://www.ks.uiuc.edu/Development/Computers/Cluster/

http://www.ks.uiuc.edu/Training/Workshop/Clusters/

We will be setting up a Clusters mailing list some time in the next week or two

We will also be setting up a Clusters User Group shortly, but that will take some more effort.