High Throughput Linux Clustering at Fermilab
Steven C. Timm--Fermilab
6/26/01
Outline
• Computing problems in High Energy Physics
• Clusters at Fermilab
• Hardware Configuration
• Software Management Tools
• Future Plans
Fermilab
• Located in Batavia, IL; since 1972, home of the highest-energy accelerator in the world.
Accelerator
• Collides protons and antiprotons at a center-of-mass energy of 2 TeV
Coarse Parallelism
• Basic idea: each “event” is independent (see the sketch after this list)
• Code doesn’t vectorize well or need SMP
• 1000’s of instructions per byte of I/O
• Need lots of small, cheap computers
• Have used VAX, MC68020, IBM, and SGI workstations; now Linux PC’s.
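A minimal sketch of this event-level parallelism, in Python purely for illustration (the real reconstruction code was Fortran or C++, and the event data and loop here are invented): because events share no state, worker processes can split them up with no communication at all.

    from multiprocessing import Pool

    def reconstruct(event):
        # Stand-in for per-event track finding and fitting; events share
        # no state, so no locking or message passing is needed.
        return sum(event)

    if __name__ == "__main__":
        # Invented event data; real events arrive from tape, kilobytes each.
        events = [[i, i + 1, i + 2] for i in range(100_000)]
        with Pool() as pool:                 # one worker per CPU by default
            results = pool.map(reconstruct, events)
        print(len(results), "events reconstructed")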
Types of Computing at Fermilab
• Simulation of detector response
• Data acquisition
• Event reconstruction
• Data analysis
• Theory calculations (Beowulf-like)
• Linux clusters used in all of the above!
Physics Motivation
• Three examples:
• Fixed-target experiment (~1999)
• Collider experiment (running now)
• CMS experiment (starting to run ~5 years in the future)
Fermilab E871
• Called HyperCP, ran in 1997 and 1999
• 3 particles per event
• 10 billion events written to tape
• 22,000 tapes, 5 GB apiece
• More than 100 TB of data! (a quick arithmetic check follows)
• Analysis recently completed; took about 1 year.
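A quick sanity check on those numbers (my arithmetic, not from the talk):

    tapes = 22_000
    gb_per_tape = 5
    events = 10_000_000_000
    total_tb = tapes * gb_per_tape / 1000     # 110 TB, consistent with ">100 TB"
    kb_per_event = total_tb * 1e9 / events    # ~11 KB per event on tape
    print(f"{total_tb:.0f} TB total, ~{kb_per_event:.0f} KB/event")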
Run II Collider Experiments
• CDF and D0—just starting to run now
• Expected data rate: 1 TB/day
• 50-100 tracks per event
• Goal: to reconstruct events as fast as they come in (see the estimate below).
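“As fast as they come in” sets a concrete throughput target. A rough estimate of my own, borrowing the 340 dual-CPU node count from the farm slide later in this talk:

    tb_per_day = 1.0
    mb_per_s = tb_per_day * 1e6 / 86_400                 # ~11.6 MB/s sustained input
    nodes, cpus = 340, 2
    kb_per_s_per_cpu = mb_per_s * 1e3 / (nodes * cpus)   # ~17 KB/s per CPU to keep up
    print(f"{mb_per_s:.1f} MB/s in, ~{kb_per_s_per_cpu:.0f} KB/s per CPU")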
Mass Storage System
• 1 PB-capacity tape robot (ADIC)
• Mammoth tape drives, 11 MB/sec
• Two tape drives per Linux PC
• Unix-like filespace to keep track of files
• Network-attached storage can deliver up to 100 MB/sec throughput (see the estimate below).
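Those per-drive numbers suggest how many tape-mover PCs the quoted aggregate rate takes, assuming each drive streams at full speed (my estimate, not from the talk):

    import math

    drive_mb_s, drives_per_pc = 11, 2
    pc_mb_s = drive_mb_s * drives_per_pc             # 22 MB/s per mover PC
    target = 100                                     # quoted aggregate MB/s
    print(math.ceil(target / pc_mb_s), "mover PCs")  # 5 PCs to reach ~100 MB/s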
Reconstruction Farm
• Five farms currently installed
• 340 dual-CPU nodes in all, 500-1000 MHz
• 50 GB disk each, 512 MB RAM
• One I/O node: SGI Origin 2000, 1 TB disk, 4 CPU’s, 2 x Gigabit Ethernet.
• Farms Batch System software to coordinate batch jobs
Farms I/O Node
• SGI Origin 2200
• 4 x 400 MHz CPU’s
• 2 x Gigabit Ethernet
• 1 TB disk
Farm Workers
• 50 x 500 MHz dual PIII nodes
• 50 GB disk
• 512 MB RAM
Farm Workers
• 2U dual PIII 750 MHz, 50 GB disk.
• 1 GB RAM.
Data mining and analysis facility
• SGI Origin 2000, 176 processors
• 5 terabytes of disk and growing
• Used for repetitive analysis of small subsets of data
• Wouldn’t need the SMP, but it is the easiest way to get a lot of processors near a lot of disk.
CMS Project
• Scheduled to run in 2005 at CERN’s LHC (Geneva, Switzerland)
• Fermilab is managing the US contribution.
• Every 25 ns (40 MHz bunch crossings), expect ~25 overlapping collisions
• Each collision makes 50-100 particles
• 1-10 petabytes of data have to be distributed around the world
• Will need at least 10,000 of today’s fastest PC’s
Qualified Vendors
• We evaluate vendors on hardware reliability, competency in Linux, service quality, and price/performance.
• Vendors chosen for desktops and farm workers
• 13 companies submitted evaluation units, five chosen in each category
Fermi Linux
• Based on Red Hat Linux 6.1 (7.1 coming soon)
• Adds a number of security fixes
• Follows all kernel and installer updates
• Updates sent out to ~1000 nodes by AutoRPM (illustrated below)
• Qualified vendors ship machines with it preloaded.
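AutoRPM is driven by its own configuration files, which aren’t reproduced here. This Python sketch only illustrates the general pull-and-freshen pattern behind such a setup; the server URL and the one-URL-per-line index format are my assumptions, not AutoRPM’s actual mechanism:

    import subprocess
    import urllib.request

    UPDATES_URL = "http://updates.example.gov/fermi-linux/list.txt"  # hypothetical

    def fetch_update_list(url=UPDATES_URL):
        # One RPM URL per line in a plain-text index (assumed format).
        with urllib.request.urlopen(url) as resp:
            return [l.strip() for l in resp.read().decode().splitlines() if l.strip()]

    def freshen(rpm_urls):
        # rpm -F upgrades only packages already installed on the node,
        # a conservative policy suited to unattended updates.
        for url in rpm_urls:
            subprocess.run(["rpm", "-Fvh", url], check=False)

    if __name__ == "__main__":
        freshen(fetch_update_list())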
ICABOD
• Vendor ships system with Linux OS loaded.
• Expect scripts:
– Reinstall the system if necessary
– Change root password, partition disks
– Configure static IP address
– Install Kerberos and ssh keys (the last two steps are sketched below)
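As a rough illustration of the last two items (the real ICABOD tooling is Expect-based; this Python sketch uses hypothetical addresses and a placeholder key, and only shows the kind of per-node setup such scripts perform on a Red Hat system):

    import os

    def configure_static_ip(ip, netmask, gateway, dev="eth0"):
        # Write a Red Hat-style ifcfg file for a static address.
        cfg = (f"DEVICE={dev}\nBOOTPROTO=static\nONBOOT=yes\n"
               f"IPADDR={ip}\nNETMASK={netmask}\nGATEWAY={gateway}\n")
        with open(f"/etc/sysconfig/network-scripts/ifcfg-{dev}", "w") as f:
            f.write(cfg)

    def install_ssh_key(pubkey, home="/root"):
        # Append an administrator's public key so later automation can log in.
        ssh_dir = os.path.join(home, ".ssh")
        os.makedirs(ssh_dir, mode=0o700, exist_ok=True)
        with open(os.path.join(ssh_dir, "authorized_keys"), "a") as f:
            f.write(pubkey + "\n")

    # Hypothetical node values (TEST-NET addresses, placeholder key):
    configure_static_ip("192.0.2.10", "255.255.255.0", "192.0.2.1")
    install_ssh_key("ssh-rsa AAAA... admin@example")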
Burn-in
• All nodes go through a one-month burn-in test.
• Load both CPU’s (2 x SETI@home)
• Load disk (Bonnie)
• Run a network test
• Monitor temperatures and current draw.
• Reject if more than 2% downtime (see the check below).
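The 2% criterion is simple to state exactly. A tiny sketch: the one-month window and the pass/fail rule come from the slide; everything else is assumed:

    BURN_IN_HOURS = 30 * 24            # one-month test window (~720 h)
    MAX_DOWN_FRACTION = 0.02           # reject if downtime exceeds 2%

    def accept_node(down_hours, window=BURN_IN_HOURS):
        # 2% of 720 h allows about 14.4 h of downtime over the month.
        return down_hours / window <= MAX_DOWN_FRACTION

    print(accept_node(10.0))   # True: accept
    print(accept_node(20.0))   # False: reject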
FBSNG
• Farms Batch System, Next Generation
• Allows parallel batch jobs which may be dependent on each other (a toy scheduler is sketched below)
• Abstract and flexible resource definition and management
• Dynamic configuration through an API
• Web-based interface
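FBSNG’s actual job-description format and API aren’t shown here. Instead, a generic toy sketch of the one idea named above: running jobs only after the jobs they depend on have finished. The job graph is hypothetical:

    from graphlib import TopologicalSorter  # Python 3.9+

    # Hypothetical job graph: two reconstruction jobs fan out from a
    # staging step, and a merge step waits for both to finish.
    jobs = {
        "stage_input": set(),
        "reco_a": {"stage_input"},
        "reco_b": {"stage_input"},
        "merge": {"reco_a", "reco_b"},
    }

    def run(job):
        print("running", job)   # stand-in for dispatch to a farm node

    for job in TopologicalSorter(jobs).static_order():
        run(job)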
Future plans
• Next level of integration—1 “pod” of six racks plus switch, console server, display.
• Linux on disk servers, for NFS/NIS
• “Chaotic” analysis servers and compute farms to replace big SMP boxes
• Find NFS replacement (SAN?)
• Abandon tape altogether?