1. Blades for HPTC
-
- Guy Coates
-
- Informatics Systems Group
-
- [email_address]
2. Introduction
- The science.
-
- What is our HPTC workload?
- Why are clusters hard?
-
- What are the challenges of doing cluster computing?
- How do blades help us?
-
- Sanger's experience with blade systems.
- Can blades help you?
-
- What can blades not do?
3. The Science
4. The Post Genomic Era
- Genomes now available for many organisms.
- What does it mean?
TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA TTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA TTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCC AAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC TTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAA ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG AAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCAC TGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG AACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAG AAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCA GAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATT ATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC
5. Deciphering the genome
- The sequence needs to be analysed.
-
- Where are the genes?
-
- What do the genes do?
-
- Are the genes related to other genes via evolution?
- This analysis is known as gene annotation.
- Provides the basis for new questions:
-
- What happens when the genes go wrong?
-
- How do genes interact with one another?
-
- What do the genes we have never seen before do?
6. Annotation at Sanger
- We have both human and machine annotation efforts.
-
- Havana group: manual annotation (10% coverage).
-
- Ensembl project: automated annotation of 26 vertebrate genomes.
- Data pooled into the Ensembl database.
-
- Access via website (8M hits / week).
-
- Perl/Java/SQL APIs.
-
- Bulk download via FTP.
-
- Direct SQL access (~150 queries/second).
-
- Core databases: 250GB / month.
- Software is all Open Source (Apache style license).
- Data is free for download.
7. Ensembl Annotation
8. Ensembl Annotation
9. Ensembl Annotation
10. Ensembl Annotation
11. How is the data generated?
- Ensembl provides a framework for automated annotation.
-
- Scientist describes annotation required.
- Rulemanager generates a set of compute tasks.
-
- ~20,000 jobs for a moderate genome.
-
- ~10,000 CPU/hours.
- Runner executes the jobs.
-
- Takes care of dependencies, failures.
-
- LSF used as DRM for execution of jobs.
-
- Results and state stored in mysql databases.
- Extensible and reusable.
-
- Newly sequenced genomes are incorporated into Ensembl reasonably easily.
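The job and CPU-hour figures above give a rough feel for throughput. A minimal sketch, assuming a 560-CPU cluster (the MK5 size quoted later) and ideal scheduling:

```python
# Back-of-envelope wall-clock estimate for a genebuild run.
# Job and CPU-hour figures come from the slide; the cluster size
# and ideal scheduling are illustrative assumptions.
def wall_clock_hours(cpu_hours, cpus, efficiency=1.0):
    """Ideal runtime for an embarrassingly parallel batch of jobs."""
    return cpu_hours / (cpus * efficiency)

jobs = 20_000        # tasks for a moderate genome
cpu_hours = 10_000   # total compute

avg_job_minutes = cpu_hours * 60 / jobs           # ~30 minutes/job
hours_on_560 = wall_clock_hours(cpu_hours, 560)   # assumed cluster size

print(f"average job length: {avg_job_minutes:.0f} min")
print(f"ideal wall clock on 560 CPUs: ~{hours_on_560:.0f} hours")
```

In practice dependencies and failures (which the runner handles) stretch this out, but it shows why a genome build is an overnight job rather than a month-long one.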
12. Genebuild Workflow
13. System requirements
- Many algorithms involved.
-
- Blast, exonerate (C).
-
- Perl / Java pipeline managers.
-
- 400 binaries in all.
- Integer, not floating point, intensive.
-
- General compute rather than specialised processors.
- Moderate memory sizes.
-
- 64-bit memory addressing is nice, but not essential.
- Lots and lots of disk I/O.
-
- 500GB genomic dataset searched by the pipeline.
-
- I/O bound in many parts.
- Minimal interprocess communication.
-
- The odd 4-node MPI job.
14. System requirements
- System is embarrassingly parallel.
-
- Scales well when we add more nodes.
- We don't need low-latency interconnects.
-
- Ethernet is fine.
- Well suited to clusters of commodity hardware.
- (We also need HA clusters for the queuing system and mysql databases, but that is another presentation)
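A toy sketch of why this workload scales: each chunk of sequence is analysed independently, so adding workers adds throughput and no inter-task communication is needed. The local thread pool here is only a stand-in for LSF dispatching jobs across nodes; the sequence and chunk size are illustrative:

```python
# Toy model of an embarrassingly parallel search: each chunk is
# one independent task, so throughput scales with workers.
# A local thread pool stands in for LSF dispatching to nodes.
from concurrent.futures import ThreadPoolExecutor

def gc_content(chunk):
    """Fraction of G/C bases in one chunk -- one independent task."""
    return sum(base in "GC" for base in chunk) / len(chunk)

sequence = "TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGG" * 100
chunks = [sequence[i:i + 500] for i in range(0, len(sequence), 500)]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(gc_content, chunks))

print(f"{len(chunks)} independent tasks, mean GC {sum(results)/len(results):.2f}")
```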
15. Cluster MK 1
- 360 DS10L 1U servers in 9 racks.
- Bog standard cluster.
16. But...
- Data keeps on coming in.
-
- New genomes are sequenced.
-
- Errors in old genomes corrected.
- We want to compare all genomes against all others.
17. Compute demand grows with the data
- Science exceeds current compute capacity every 18 months.
- We need a bigger cluster every 18 months.
-
- Keep the current one running and help the users!
18. 5 clusters in 6 years
- 20x increase in compute capacity.
-
-
- (Moore's law helps a bit, but that is transistors, not SpecInt.)
-
- What did we learn?
-
- Clusters are really hard.
19. Why are clusters hard?
20. Scaling
- Everyone talks about code scaling.
-
- Will my application run on more nodes?
- Do admins scale?
-
- If we double the cluster size, will we need double the admins?
-
- If it is hard today, what will it be like in 18 months?
-
- If we have to spend less admin time per node, will reliability suffer?
-
- We should be spending time helping users optimise code.
- Everything that can go wrong on a server can go wrong on a cluster node.
-
- But we have hundreds of nodes.
-
- Hundreds of problems.
21. Clusters Get More Complex
- MK1 cluster:
-
- 360 CPUs, local disk storage, single fast ethernet.
- MK5 cluster:
-
- Multiple trunked GigE networks, cluster filesystems, SAN storage, multiple architectures (ia32, AMD64, token ia64 and alpha).
- Bleeding edge hardware / software stacks.
-
- Non trivial problems.
-
- Google may not be your friend if you are the first to find the problem.
22. Manageability is the key
- Numerous, complex systems are hard to manage.
- Clusters need good management tools.
- The fastest cluster in the world is of no use if it does not stay up long enough to run your jobs.
- Manageability is our number 1 priority when designing clusters.
-
- We do not buy on price/performance.
-
- We buy on price/manageability.
23. Cluster Management Life Cycle
- Installation.
-
- Bolting the thing in.
- Commissioning.
-
- Getting the cluster configured.
- Production.
-
- Doing some useful work.
24. Installation
- Where to put the racks?
-
- Like disks, data centres are 80% full 6 months after they are built.
- Power / Aircon.
-
- You need to have enough.
-
- Total heat output vs density.
- Networking.
-
- Each system needs multiple network cables.
-
-
- public network, private network, SAN, mgt network.
-
-
- Don't forget the switching.
- But the cluster got delivered last week, why can't I run jobs?
25. Commissioning
- Getting the system up and running.
-
- OS deployment is usually the last step!
- Initial configuration.
-
- Firmware updates.
-
-
- BIOS, NIC, mgt processor, FC HBA etc.
-
-
- Standardise BIOS settings.
-
-
- HT, memory interleave etc.
-
-
- RAID configuration.
- DOA Discovery.
-
- Machines with failed DIMMs, CPUs.
- OS Deployment.
-
- OS installation, local customisations.
-
- Application stack.
26. Production
- Broken Hardware.
-
- Hardware failures should be detected and the admin told.
-
- Ideally they should be detected before they are fatal.
-
- Black-hole machines (nodes that accept jobs and silently kill them) are especially painful on HPTC clusters.
- Sysadmin tasks.
-
- Software updates etc.
- Emergencies.
-
- Can you get a remote console?
-
- Console logs / oopses.
- Doomsday scenarios.
-
- Power or AC failures.
-
- Can I power off my cluster from home at 2:00am?
-
- Can I do it before my machines melt?
27. How do blades help?
28. How Do Blades Help?
- Manageability touches on hardware and software.
-
- Good manageability requires smart software and smart hardware.
- Blades have smart hardware.
-
- Management processors on blades and in chassis.
-
- (And some servers now.)
- Blades have smart software.
-
- Vendors supply OS deployment and management tools.
- Unit of administration is the chassis, not the blade.
-
- We end up managing a smaller number of smarter entities.
29. Smart Hardware
- Management processor.
-
- Sits on the blade and/or the chassis.
-
- Key enabler. Almost all benefits flow from this.
- Basic Features.
-
- Hardware Inventory (MAC addresses, BIOS revs etc).
-
- Remote power.
-
- Remote console (SOL, VNC).
-
- Machine health (memory, fans, CPUs).
-
- Alerting.
- Advanced Features.
-
- BIOS twiddling (PXE boot).
-
- Firmware updates.
-
- Integrated switch management.
30. Smart software
- Management Suite.
-
- Provides window into what the hardware is doing.
-
- Provides remote console, power and alerting.
- OS deployment suite.
-
- Typically golden image installers.
-
- Allow for rapid and consistent OS installation.
-
- Quick / automated re-tasking of machines.
-
- Software inventories.
- May be integrated into single product.
31.
32.
33. Web interface
34. Management Interface
- Web interfaces are nice.
-
- Easy to get to grips with and find features.
- Command line is even better.
-
- Command line means we can script it.
- Command line tools allow you to integrate blade management with existing tools.
-
- You do not have to use the vendor suggested solution.
-
- Magic of open source.
35. Why Extend Existing Tools?
- Vendor tools can be limiting.
-
- Tend to be Windows-centric, as Windows is a pain to manage.
-
- May not work with non standard network or disk configs.
- Linux already has good deployment tools.
-
- Why re-invent the wheel?
-
- Not quite fully automated.
- Management processor command line interface.
-
- We can script and do whatever we want.
- Extend existing tools.
-
- Use existing deployment tools to install blades.
-
- Can cope with whatever twisted configs we want to run.
36. The Cluster Management Life Cycle Revisited
- ...But with blades.
- How do blades make it easier?
37. Cluster MK5
- 560 CPUs
-
- 140 dual-core / dual-CPU blades.
-
- 10 chassis, 2 cabinets.
- OS:
-
- Debian / AMD64.
- Networking:
-
- 1 GigE external network.
-
- 2 GigE trunked private network.
- Storage:
-
- Disk config: hardware RAID1 for OS.
-
- Cluster filesystem.
38. Installation
- Blades take up less space.
-
- Less space to clear / tidy.
- Integrated power and networking.
-
- Fewer cables.
39. Installation
- 42 1U servers with 3 GigE networks:
-
- 42 10/100 mgt cables.
-
- 126 GigE cables.
-
- 42 power cables.
-
- External switches.
- 70 blades in 5 chassis with 3 GigE networks:
-
- 5 10/100 mgt cables.
-
- 15 GigE cables.
-
- 20 power cables.
-
- No external switches.
- One person can rack and patch a cabinet of blades in a day.
-
- I know, I've done it!
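The saving is easy to verify. A quick sketch of the arithmetic, using the per-unit figures from the counts above (the helper names are ours):

```python
# The slide's cable counts, computed explicitly.
def server_cables(servers, gige_nets=3):
    """1U servers: per-box mgt, data and power cabling."""
    return servers + servers * gige_nets + servers   # mgt + GigE + power

def blade_cables(chassis, gige_uplinks=3, psus=4):
    """Blade chassis: only uplinks and chassis PSUs leave the cabinet."""
    return chassis + chassis * gige_uplinks + chassis * psus

print(server_cables(42), "cables for 42 1U servers")   # 210
print(blade_cables(5), "cables for 5 chassis")         # 40
```

A 5x reduction in cables is also a 5x reduction in things to mislabel, snag, or plug into the wrong port.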
40. Consolidated networking and power
- (Diagram: networking and power for 14 servers consolidated into one chassis.)
41. Cabling
42. Commissioning
- Bootstrap blade chassis.
-
- Configure mgt module.
-
- Script sets static IP addresses, alerts etc.
-
- Script configures network switches.
- FW Updates.
-
- Script updates all blade and mgt module firmwares.
- ~0.5 day for the initial config on 10 chassis.
43. Commissioning
- We extended FAI, the Debian automated installer.
-
- We use it already.
-
- It can cope with our non-standard network and disk topologies.
-
- Open Source generic system: future-proof.
- Install sequence:
-
- Harvest MAC addresses from mgt processor.
-
- PXE boot blades into FAI.
-
- Construct RAID, flash system BIOS, set BIOS flags.
-
- OS and SW installation and customisation.
-
- Set blade to boot off disk and reboot.
- 160 seconds for a full OS and software install.
-
- Run script, go drink tea.
- Command line tools crucial.
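The install sequence above can be sketched as a dry-run driver. The `bladectl` command and the `chassisNN-bayNN` naming are illustrative stand-ins for the vendor's management-processor CLI; `fai-chboot` is FAI's real netboot configuration tool:

```python
# Dry-run sketch of the install sequence for one chassis.
# bladectl and the host naming scheme are hypothetical stand-ins
# for the vendor CLI; real scripts would execute these commands.
def install_plan(chassis_id, n_blades=14):
    steps = []
    for bay in range(1, n_blades + 1):
        blade = f"chassis{chassis_id:02d}-bay{bay:02d}"
        steps += [
            f"bladectl -c {chassis_id} -b {bay} inventory   # harvest MAC",
            f"fai-chboot -IFv {blade}                       # arm PXE install",
            f"bladectl -c {chassis_id} -b {bay} boot pxe    # RAID/BIOS/OS via FAI",
            f"bladectl -c {chassis_id} -b {bay} boot disk   # reboot off disk",
        ]
    return steps

plan = install_plan(1)
print(len(plan), "steps for one 14-blade chassis")
```

Because every step is a scriptable command line, the whole plan runs unattended, which is what makes "run script, go drink tea" possible.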
44. Production
- Management processor.
-
- Remote power and remote console.
-
- Hardware failures.
-
- Alerts go into helpdesk system.
-
- Manage cluster from anywhere I can get ssh.
- Standard linux tools.
-
- DSH: run commands on all blades.
-
- cfengine: manage config files.
-
- ganglia / LSF: load monitoring.
-
- smartmontools for disk failures.
- Doomsday scenario.
-
- Emergency shutdown script.
-
- Runs round mgt processors and powers off blades.
-
- Keep blowers etc going to reduce heat stress.
45. Blades make large clusters easier
-
- Grown from 360 to 1456 CPUs.
-
- Shrunk from 360 systems to 42 chassis.
46. How many admins?
- It takes 1 admin-day / week to look after a 1456-CPU cluster.
-
- This went down when we moved from servers to blades.
-
- cf TCO studies on the web.
-
- 1 full time admin for 40-50 unix machines.
-
-
- (Windows is half that).
-
- We look after all the rest of the Sanger systems too!
- We spend more time helping users than poking hardware.
-
- We get good usage out of our cluster.
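The rule of thumb above makes the point if you treat the chassis, rather than the blade, as the managed unit. A minimal sketch, taking 45 machines per admin as the midpoint of the quoted 40-50:

```python
# The TCO rule of thumb: ~1 full-time unix admin per 40-50
# machines. The 45 midpoint is our assumption; unit counts are
# from the slides (360 1U servers vs 42 chassis).
RULE_OF_THUMB = 45   # machines per admin

def admins_needed(managed_units, per_admin=RULE_OF_THUMB):
    return managed_units / per_admin

print(f"360 1U servers: {admins_needed(360):.1f} admins")
print(f"42 chassis:     {admins_needed(42):.1f} admins")
```

Counting chassis, the rule of thumb predicts under one full-time admin, which is consistent with the 1 admin-day / week we actually spend.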
47. Can blades help you?
48. Blade Pros / Cons
- Blades cost more up front.
-
- Pay for the chassis, even if you never fill it.
- Management savings only realised on larger installations.
-
- Would you use blades for an 8-node cluster?
- However, as cluster size increases, costs change.
-
- Management savings multiply as cluster size increases.
- Power density is high.
-
- Less power overall, but in a small space.
-
- Price / performance / watt ?
49. Interconnects
- We do not use low latency interconnects.
-
- We do gigabit ethernet + SAN.
- Blade chassis share a backplane.
-
- Typically 4 GB/s backplane.
-
- This limits the full bandwidth of the blades.
-
- What is the latency hit?
- Blades have limited specialised network options.
-
- Single half height PCI card.
-
- Currently limited to 4x Infiniband, gigabit and SAN.
50. Conclusions
- Good management is the key, whether you run blades or servers.
-
- Good management is easier on blades.
- Blades can do anything a standard server can.
-
- In less of your space and in less of your time.
- If you are building larger clusters, consider blades.
51. Acknowledgements
- Informatics Systems Group
-
- Tim Cutts
-
- Mark Rae
-
- Simon Kelley
-
- Andy Flint
-
- Gildas Le Nadan
-
- Peter Clapham
- Special Projects Group
-
- John Nicholson
-
- Martin Burton
-
- Russell Vincent
-
- Dave Holland
52.
53. Storage Concepts
54. The data problem
- Pipeline is IO bound in many places.
-
- 500GB of genomic data to search.
- Keep the data as close as possible to the compute.
-
- Blast over NFS is a complete disaster.
-
- Data / IO problems are common on bioinformatics clusters of more than 20 nodes.
(Diagram: all nodes reading data from a single NFS server; the server is the bottleneck.)
55. Initial Strategy
- Keep the data on local disk.
-
- Copy the dataset to each machine in the cluster.
(Diagram: the dataset copied onto each node's local disk.)
56. Data Scaling
- Data management was a real headache.
-
- Ever-expanding dataset was copied to each machine in the farm (400-1000 nodes).
-
- The dataset grew from 50 to 500GB.
- Copying data onto 1000+ machines takes time.
-
- 0.5-2 days for large data pushes, even with clever approaches.
- Ensuring data integrity is hard.
-
- Black-hole syndrome: machines silently missing parts of the dataset.
- Experience showed it was not a scalable approach for the future.
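A back-of-envelope calculation shows why, even under a generous assumption about aggregate push bandwidth:

```python
# Why full data pushes take so long. Dataset size and node count
# are from the slides; the aggregate push bandwidth is an
# illustrative assumption.
def push_hours(dataset_gb, nodes, aggregate_mbytes_s):
    """Hours to copy the dataset onto every node's local disk."""
    total_mb = dataset_gb * 1024 * nodes
    return total_mb / aggregate_mbytes_s / 3600

# 500GB to 1000 nodes at an assumed 10 GBytes/s aggregate:
hours = push_hours(500, 1000, 10 * 1024)
print(f"~{hours:.0f} hours per full push")
```

Even at that optimistic rate a push lands near the lower end of the observed 0.5-2 days, and the cost grows with both the dataset and the node count.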
57. Cluster file systems
- In early 2003 we started investigating cluster file systems for farm use.
- Most machines had gigabit connections.
-
- Network speed is close to local disk speed (~120 MBytes/s).
- Bitten hard by Tru64 end of life.
-
- We have ~300TB of data on Tru64/Advfs clusterfs.
-
- No migration path.
-
- We need a future proof storage solution.
- Should be Open Source.
-
- Binary kernel modules are evil.
-
- We often run non-standard kernels.
58. Initial Implementation
- No cluster file system would scale to all nodes in the cluster.
-
- Assessed a large number of systems.
- GPFS was the one we settled on.
-
- Not all nodes need SAN connections.
-
- Not open source (you have to start somewhere).
- Divide farm up into a number of small systems.
-
- Chassis is an obvious unit.
-
- File systems spanned 2 or 3 chassis of blades.
-
- We end up with 20 file systems.
- Keeping 20 file systems in sync is (relatively) easy.
59. Topology I
- 10x 28-node clusters of local NSDs.
-
- GPFS striped across local disks on all nodes.
-
- Data accessed via gigabit.
- 2 chassis per cluster.
-
- Limited by replication level on GPFS and how often we expect machine failures.
- Performance limited by network.
-
- 80MBytes/s single client.
- Requires no special hardware.
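One design trade-off in this topology, sketched with an assumed per-node disk size (the replication level is the knob mentioned above):

```python
# Capacity cost of the replication that protects against node
# failure: with GPFS data replication r, usable space is raw/r.
# The 73GB per-node disk size is an illustrative assumption.
def usable_tb(nodes, disk_gb_per_node, replication=2):
    return nodes * disk_gb_per_node / replication / 1024

# 28 nodes, an assumed 73GB of local disk each, replication 2:
print(f"{usable_tb(28, 73):.1f} TB usable per 2-chassis filesystem")
```

Doubling the replication level halves the usable space, so the level is set by how often machine failures are actually expected rather than by worst-case paranoia.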
(Diagram: nodes striping data across local disks, accessed via a gigabit switch.)
60. Topology II: Hybrid
- 4x 42-node clusters.
-
- Server machines have SAN storage.
-
- Client machines talk to servers over the LAN.
- Not every machine needs SAN.
-
- Clients do IO to multiple server machines.
-
- Eliminates single server bottleneck.
(Diagram: server nodes attached to the SAN; clients reach them via a LAN switch.)
61. Future implementation
- Expand cluster file system to the whole cluster.
-
- Single copy of the data.
-
- Allows users to manage their own data.
-
- Use cluster file system for general scratch/work space.
-
- Eliminate NFS.
- Implementing Lustre.
-
- Open source (v. x is proprietary, v. x-1 is open sourced).
-
- Scales to 1000s of nodes.
-
- Performs well; in pilots our network is the bottleneck.
-
- Easy (ish) to add more network.
62. Lustre Config
(Diagram: OSTs, an MDS and an admin node, connected by 10G, 4G and 2G links.)
63. The network is vital.
- Cluster IO is very stressful for networks.
-
- We can fill gigabit links from a single client.
- Large amounts of gigabit networking.
-
- Multiple gigabit trunks.
- Non-blocking switches are critical.
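A sketch of why, with assumed port and uplink counts for an edge switch:

```python
# If any client can fill its gigabit link, worst-case demand on a
# switch is clients x 1Gbit; an oversubscribed uplink then becomes
# the bottleneck. Port and trunk counts here are assumptions.
def oversubscription(clients, uplink_gbit, link_gbit=1):
    """Ratio of worst-case client demand to uplink capacity."""
    return clients * link_gbit / uplink_gbit

# an assumed 28-port edge switch with a 4-way gigabit trunk uplink:
print(f"{oversubscription(28, 4):.0f}:1 oversubscribed at full load")
```

Typical campus LANs tolerate heavy oversubscription because clients are mostly idle; cluster IO clients are not, which is why non-blocking (1:1) switching matters here.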
64.