1
LHC-Era Data Rates in 2004 and 2005: Experiences of the PHENIX Experiment with a PetaByte of Data
Martin L. Purschke, Brookhaven National Laboratory
PHENIX Collaboration
RHIC from space
Long Island, NY
2
RHIC/PHENIX at a glance
RHIC: 2 independent rings, one beam clockwise, the other counterclockwise
sqrt(s_NN) = 500 GeV * Z/A
~200 GeV for Heavy Ions
~500 GeV for proton-proton (polarized)
PHENIX:
4 spectrometer arms
12 Detector subsystems
400,000 detector channels
Lots of readout electronics
Uncompressed event size typically 240 / 170 / 90 kB for AuAu, CuCu, pp
Event rate 2.4 kHz (CuCu), 5 kHz (pp)
Front-end data rate 0.5 - 0.9 GB/s
Data logging rate ~300-450 MB/s, 600 MB/s max
3
Where we are w.r.t. others
[Chart comparing approximate data-logging rates (all in MB/s, all approximate) of ATLAS, CMS, LHCb, ALICE, CDF, and PHENIX; the quoted values range from ~25 MB/s to ~1250 MB/s]
400-600 MB/s is not so sci-fi these days
4
Outline
• What was planned, and what we did
• Key technologies
• Online data handling
• How offline / the users cope with the data volumes
Related talks:
Danton Yu, “BNL Wide Area Data Transfer for RHIC and ATLAS: Experience and Plan”, this (Tuesday) afternoon, 16:20
Chris Pinkenburg, “Online monitoring, calibration, and reconstruction in the PHENIX Experiment”, Wednesday 16:40 (OC-6)
5
A Little History
• In its design phase, PHENIX had settled on a “planning” number of 20MB/s data rate (“40 Exabyte tapes, 10 postdocs”)
• In Run 2 (2002) we started to approach that 20MB/s number and ran Level-2 triggers
• In Run 3 (2003) we faced the decision to either go full steam ahead with Level-2 triggers… or first try something else
• Several advances in various technologies conspired to make first 100MB/s, then multi-100MB/s data rates possible
• We are currently logging at about 60 times higher rates than the original design called for
6
But is this a good thing?
We had a good amount of discussions about the merit of going to high data rates. Are we drowning in data? Will we be able to analyze the data quickly enough? Are we recording “boring” events, mostly? Is it not better to trigger and reject?
• In Heavy-Ion collisions, the rejection power of Level-2 triggers is limited (high multiplicity, etc.)
• Triggers take a lot of time to study, and developers usually welcome a delay in the onset of the actual rejection mode
• The rejected events are by no means “boring”; there is high-statistics physics in them, too
7
...good thing? Cont’d
In the end we convinced ourselves that we could (and should) do it.
The increased data rate helped defer the onset of the LVL2 rejection mode in Runs 4 and 5 (in the end we didn’t run rejection at all)
Saved a lot of headaches… we think
• Get the data while the getting is good - the detector system is evolving and hence unique for each run; better get as much data with it as you can
• Get physics that you simply can’t trigger on
• Don’t be afraid to let data sit on tape unanalyzed - computing power increases, time is on your side here
8
Ingredients to achieve the high rates
We implemented several main ingredients that made the high rates possible.
• We compress the data before they hit disk for the first time (cram as much information as possible into each MB)
• We run several local storage servers (“Buffer boxes”) in parallel and dramatically increased the buffer disk space (40 TB currently)
• We improved the overall network connectivity and topology
• We automated most of the file handling so the data rates in the DAQ become manageable
9
Event Builder Layout
[Event Builder diagram: ten Sub-Event Buffers (SEBs) feed a Gigabit crossbar switch connecting to ten Assembly-and-Trigger Processors (ATPs); from there the data go to the Buffer Boxes and on to HPSS]
The data are stored on “Buffer boxes” that help ride out the ebb and flow of the data stream, ride out service interruptions of the HPSS tape system, and help send a steady stream to the tape robot
Sub-Event Buffers (SEBs) see parts of a given event, following the granularity of the detector electronics
With ~50% combined PHENIX * RHIC uptime, we can log 2x the HPSS maximum data rate in the DAQ; it just needs enough buffer depth (40 TB)
The pieces are sent to a given Assembly-and-Trigger Processor (ATP) on a next-available basis; the fully assembled event is available there for the first time (a toy sketch of this assignment follows below)
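To illustrate the next-available assignment, here is a minimal, hypothetical sketch; the dispatcher class and its names are invented for illustration and are not the PHENIX Event Builder code:

```cpp
// Toy model of next-available event assignment: idle ATPs wait in a queue;
// when an event's SEB fragments are ready, the next idle ATP gets the whole
// event, assembles (and later compresses) it, then returns to the queue.
// Illustrative only -- not the actual PHENIX Event Builder implementation.
#include <iostream>
#include <map>
#include <queue>

struct EventDispatcher {
    std::queue<int> idleATPs;       // ATPs currently ready to receive an event
    std::map<int, int> inFlight;    // event number -> ATP assembling it

    explicit EventDispatcher(int nATPs) {
        for (int id = 0; id < nATPs; ++id) idleATPs.push(id);
    }

    // Called when all SEB fragments of an event are available.
    bool assign(int eventNumber) {
        if (idleATPs.empty()) return false;   // back-pressure: try again later
        inFlight[eventNumber] = idleATPs.front();
        idleATPs.pop();
        return true;
    }

    // Called when the ATP has fully assembled the event and shipped it out.
    void release(int eventNumber) {
        idleATPs.push(inFlight[eventNumber]);
        inFlight.erase(eventNumber);
    }
};

int main() {
    EventDispatcher dispatcher(10);                 // ten ATPs, as in the diagram
    for (int evt = 0; evt < 5; ++evt) dispatcher.assign(evt);
    dispatcher.release(0);
    std::cout << "events being assembled: " << dispatcher.inFlight.size() << "\n";
}
```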
Some technical background...
10
Data Compression
We found that the raw data are still gzip-compressible after zero-suppression and other data reduction techniques, so we introduced a compressed raw data format that supports a late-stage compression (a sketch of the idea follows below).
A file normally looks like a plain sequence of buffers. On writing, each buffer is run through the LZO algorithm and a new buffer header is added, so that the compressed buffer becomes the payload of a new buffer; that is what a compressed file then looks like. On readback, the LZO unpack step restores the original uncompressed buffer.
All this is handled completely in the I/O layer; the higher-level routines just receive a buffer as before.
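A minimal sketch of this late-stage compression, using the LZO1X routines from liblzo2; the CompressedHeader layout and function names are invented for illustration and do not reflect the actual PHENIX raw data format:

```cpp
// Sketch: compress a raw buffer with LZO and wrap the result in a new buffer
// whose header records the compressed and original lengths, so the I/O layer
// can transparently restore the original buffer on readback.
// The header layout is illustrative only, not the PHENIX format.
#include <lzo/lzo1x.h>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

struct CompressedHeader {            // hypothetical wrapper header
    uint32_t compressedLength;
    uint32_t originalLength;
};

std::vector<unsigned char> wrapLZO(const unsigned char* in, lzo_uint inLen) {
    static const bool ok = (lzo_init() == LZO_E_OK);
    if (!ok) throw std::runtime_error("lzo_init failed");

    std::vector<unsigned char> work(LZO1X_1_MEM_COMPRESS);
    // LZO worst case: the output may be slightly larger than the input.
    std::vector<unsigned char> out(sizeof(CompressedHeader) + inLen + inLen / 16 + 67);

    lzo_uint outLen = 0;
    lzo1x_1_compress(in, inLen, out.data() + sizeof(CompressedHeader), &outLen,
                     work.data());

    CompressedHeader hdr{static_cast<uint32_t>(outLen), static_cast<uint32_t>(inLen)};
    std::memcpy(out.data(), &hdr, sizeof(hdr));
    out.resize(sizeof(hdr) + outLen);
    return out;                      // new buffer: header + compressed payload
}

std::vector<unsigned char> unwrapLZO(const std::vector<unsigned char>& wrapped) {
    CompressedHeader hdr;
    std::memcpy(&hdr, wrapped.data(), sizeof(hdr));
    std::vector<unsigned char> restored(hdr.originalLength);
    lzo_uint restoredLen = restored.size();
    lzo1x_decompress_safe(wrapped.data() + sizeof(hdr), hdr.compressedLength,
                          restored.data(), &restoredLen, nullptr);
    return restored;                 // original uncompressed buffer
}

int main() {
    std::vector<unsigned char> raw(64 * 1024, 0x42);   // stand-in "raw buffer"
    auto wrapped  = wrapLZO(raw.data(), raw.size());
    auto restored = unwrapLZO(wrapped);
    return (restored == raw) ? 0 : 1;
}
```

LZO trades a bit of compression ratio for very fast compression and decompression, which is part of what makes doing this inside the DAQ affordable.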
11
Breakthrough: Distributed Compression
[Same Event Builder diagram as on the previous slide: SEBs, Gigabit crossbar switch, ATPs, Buffer Boxes, HPSS]
The compression is handled in the Assembly-and-Trigger Processors (ATPs) and can therefore be distributed over many CPUs -- that was the breakthrough
The Event builder has to cope with the uncompressed data flow, e.g. 500MB/s … 900MB/s
The buffer boxes and storage system see the compressed data stream, 250MB/s … 450MB/s
12
Buffer Boxes
• We use COTS dual-CPU machines, some SuperMicro or Tyan motherboards designed for data throughput
• SATA 400 GB disk arrays (16 disks), with a SATA-to-SCSI RAID controller in striping (RAID 0) mode
• Four 1.7 TB file systems (4 disks each) on each of the 6 boxes, called a, b, c, d on each box. All “a” (and b, c, d) file systems across the machines form one “set”, written to or read from at the same time
• A file-system set is either written to or read from, never both (a toy sketch of this rotation follows below)
• Each buffer box adds a quantum of 2x 100 MB/s rate capability (in and out) on 2 interfaces; 6 buffer boxes = demonstrated 600 MB/s. Expansion is cheap, all commodity hardware
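A toy illustration of the set scheme; the directory names and rotation logic here are made up, just to show the idea of same-letter file systems across boxes forming one set:

```cpp
// Toy model of the buffer-box file-system sets: each of the 6 boxes has file
// systems a, b, c, d; the same letter across all boxes forms one "set", and a
// set is either being written to by the DAQ or drained to tape, never both.
// Paths and rotation logic are illustrative only.
#include <array>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const std::array<char, 4> sets = {'a', 'b', 'c', 'd'};
    const int nBoxes = 6;
    int writeSet = 0;      // set currently receiving data from the DAQ
    int readSet  = 2;      // set currently being drained to HPSS

    auto paths = [&](int setIndex) {
        std::vector<std::string> p;
        for (int box = 1; box <= nBoxes; ++box)
            p.push_back("/bufferbox" + std::to_string(box) + "/" + sets[setIndex]);
        return p;
    };

    std::cout << "writing to:\n";
    for (const auto& path : paths(writeSet)) std::cout << "  " << path << "\n";
    std::cout << "draining to tape from:\n";
    for (const auto& path : paths(readSet)) std::cout << "  " << path << "\n";

    // When the write set fills up, move on to the next one; the set that has
    // been fully drained becomes available for writing again later.
    writeSet = (writeSet + 1) % static_cast<int>(sets.size());
    std::cout << "next write set: " << sets[writeSet] << "\n";
}
```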
13
Buffer Boxes (2)
• Short-term data rates are typically as high as ~450MB/s
• But the longer you average, the smaller the average data rate becomes (include machine development days, outages, etc)
• Run 4: highest rate 450MB/s, whole run average about 35MB/s
• So: The longer you can buffer, the better you can take advantage of the breaks in the data flow for your tape logging
• Also, a given file can linger on the disk until the space is needed, 40 hours or so (a back-of-the-envelope estimate follows below) -- the opportunity to do calibrations, skim off some events for fast-track reconstruction, etc.
• “Be ready to reconstruct by the time the file leaves the countinghouse”
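As a back-of-the-envelope check (assuming, for illustration, a sustained drain rate to HPSS of about 300 MB/s, taken from the typical logging rates quoted earlier), the 40 TB of buffer space is consistent with files lingering for roughly 40 hours:

```cpp
// Rough buffer-depth arithmetic with the numbers quoted in these slides:
// 40 TB of buffer disk draining to HPSS at ~300 MB/s turns over in ~37 hours,
// which is where the "40 hours or so" of lingering time comes from.
#include <iostream>

int main() {
    const double bufferTB  = 40.0;    // total buffer-box disk space
    const double drainMBps = 300.0;   // assumed sustained rate to tape
    const double hours = (bufferTB * 1.0e6 / drainMBps) / 3600.0;
    std::cout << "buffer turnover time: " << hours << " hours\n";   // ~37 h
}
```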
14
Run 5 Event statistics
CuCu 200 GeV: 2.2 billion events, 157 TB on tape
CuCu 62.4 GeV: 630 million events, 42 TB on tape
CuCu 22 GeV: 48 million events, 3 TB on tape
p-p 200 GeV: 6.8 billion events, 286 TB on tape
------------------------------
488 TB on tape for Run 5
Just shy of 0.5 PetaByte of raw data on tape in Run 5; we had about 350 TB in Run 4 and about 150 TB in the runs before.
15
How to analyze these amounts of data?
• Priority reconstruction and analysis of “filtered” events (Level2 trigger algorithms offline, filter out the most interesting events)
• Shipped a whole raw data set (Run 5 p-p) to Japan to a regional PHENIX Computing Center (CCJ)
• Radically changed the analysis paradigm to a “train-based” one
16
Near-line Filtering and Reconstruction
We didn’t run Level-2 triggers in the ATPs, but ran the “triggers” on the data on disk and extracted the interesting events for priority reconstruction and analysis (sketched below)
The buffer boxes where the data are stored for ~40 hours come in handy
The filtering was started and monitored as part of the shift operations, overseen by 2-3 experts
The whole CuCu data set was run through the filters
The reduced data set was sent to Oak Ridge, where a CPU farm was available
Results were shown at the Quark Matter conference in August 2005. That’s record speed -- the CuCu part of Run 5 ended in April.
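In spirit, the near-line filter is just a pass over each closed file on the buffer boxes that applies the Level-2 algorithms offline and keeps the selected events. A toy sketch; the Event structure and l2Accept cut are stand-ins, not the actual PHENIX trigger code:

```cpp
// Toy sketch of near-line filtering: read the events of a raw-data file,
// apply an (offline) Level-2 style selection, and keep only accepted events
// for priority reconstruction. Event and l2Accept are stand-ins.
#include <iostream>
#include <vector>

struct Event { int number; double energySum; };   // toy event

// Stand-in for an offline Level-2 decision: select "interesting" events.
bool l2Accept(const Event& evt) { return evt.energySum > 100.0; }

int main() {
    // Stand-in for a raw-data file sitting on a buffer box.
    std::vector<Event> inputFile = {{1, 12.0}, {2, 150.0}, {3, 300.0}, {4, 8.0}};

    std::vector<Event> filtered;                  // the much smaller output
    for (const auto& evt : inputFile)
        if (l2Accept(evt)) filtered.push_back(evt);

    std::cout << "kept " << filtered.size() << " of " << inputFile.size()
              << " events\n";
}
```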
17
Data transfers for p-p
Going into the proton-proton half of the run, the BNL Computing Center RCF was busy with last year’s data and the new CuCu data set
This was already the run-up to Quark Matter (QM), the most important conference in the field: gotta have data, can’t wait and make room for other projects
So there were no adequate resources for pp reconstruction at BNL
Historically, large-scale reconstruction could only be done at BNL, where the raw data are stored; only smaller DST data sets could be exported… but times change
Our computing center in Japan, CCJ, had resources… there were plans to set up a “FedEx tape network” to move the raw data there for speedy reconstruction
But then the GRID made things easier...
“This seems to be the first time that a data transfer of such magnitude was sustained over many weeks in actual production, and was handled as part of routine operation by non-experts.”
18
Shift Crew Sociology
The overall goal is a continuous, standard, and sustainable operation; avoid heroic efforts.
The shift crews need to do all routine operations, let the experts concentrate on expert stuff.
Automate. Automate. And automate some more.
Prevent the shift crews from getting “innovative” -- we find that we usually later throw out the data where crews deviated from the prescribed path
In Run 5 we seemed to be past the “Torpedo in the water!” mode that’s unavoidable in Run 1
Encourage the shift crews to CALL THE EXPERT!
Given that, my own metric for how well it works is how many nights I and a half-dozen other folks get uninterrupted sleep
19
Shifts in the Analysis Paradigm
After a good amount of experimenting, we found that simple concepts work best. PHENIX reduced the level of complexity in the analysis model quite a bit.
We found:
You can overwhelm any storage system by having it run as a free-for-all. Random access to files at those volumes brings anything down to its knees.
People don’t care too much what data they are developing with (ok, it has to be the right kind)
Every once in a while you want to go through a substantial dataset with your established analysis code
20
Shifts in the Analysis Paradigm
• We keep some data of the most-wanted varieties on disk.
• That disk-resident dataset mostly stays the same; we add data to the collection as we get newer runs.
• The stable disk-resident dataset has the advantage that you can immediately compare the effect of a code change while you are developing your analysis module
• Once you think it’s mature enough to see more data, you register it with a “train”.
21
Analysis Trains
• After some hard lessons with the more free-for-all model, we established the concept of analysis trains.
• Pulling a lot of data off the tapes is expensive (in terms of time/resources)
• Once you go to the trouble, you want to get as much “return on investment” for that file as possible - do all the analysis you want while it’s on disk
• We also switched to tape (cartridge)-centric retrievals - once a tape is mounted, get all the files off while the getting is good
• Hard to put a speed-up factor on this, but we went from impossible-to-analyze to an “ok” experience. On paper the speed-up is something like 30.
• So now the data gets pulled from the tapes, and any number of registered analysis modules run over it -- very efficient
• You can still opt out for certain files you don’t want or need
Without the train, the number of file retrievals explodes - I request the file today, the next person requests it tomorrow, a third person next week
22
Analysis Train Etiquette
Your analysis module has to obey certain rules:
• be of a certain module kind to make it manageable by the train conductor
• be a “good citizen” -- mostly enforced by inheriting from the module parent class, starting from templates, and review
• not crash, with no endless loops and no memory leaks
• pass prescribed Valgrind and Insure tests
This train concept has streamlined the PHENIX analysis in ways that are hard to overstate (a minimal sketch of the idea follows below).
After the train, the typical output is relatively small and fits on disk
Made a mistake? Or forgot to include something you need? Bad, but not too bad… fix it, test it, the next train is leaving soon. Be on board again.
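A minimal sketch of the train idea: modules inherit from a common parent class, register with the train, and the conductor runs every module over each file staged from tape, so one retrieval serves all analyses. Class and file names are invented for illustration; the actual PHENIX framework classes differ.

```cpp
// Sketch of the analysis-train concept: user modules derive from a common
// base class and register with the train; the conductor then runs every
// registered module over each file pulled from tape -- one retrieval, many
// analyses. Names are illustrative, not the actual PHENIX framework API.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

class AnalysisModule {                         // the "module parent class"
public:
    virtual ~AnalysisModule() = default;
    virtual void processFile(const std::string& file) = 0;
};

class DimuonAnalysis : public AnalysisModule { // a hypothetical user module
public:
    void processFile(const std::string& file) override {
        std::cout << "dimuon analysis on " << file << "\n";
    }
};

class Train {
public:
    void registerModule(std::unique_ptr<AnalysisModule> module) {
        modules_.push_back(std::move(module));
    }
    void run(const std::vector<std::string>& filesFromTape) {
        for (const auto& file : filesFromTape)   // each file is staged once...
            for (auto& module : modules_)        // ...and every module uses it
                module->processFile(file);
    }
private:
    std::vector<std::unique_ptr<AnalysisModule>> modules_;
};

int main() {
    Train train;
    train.registerModule(std::make_unique<DimuonAnalysis>());
    train.run({"dst_run5_pp_0001.root", "dst_run5_pp_0002.root"});
}
```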
23
Other Advantages
When we started this, there was no production-grade GRID yet, so we had to have something that “just worked”
Less ambitious than current GRID projects
The train allows someone else to run your analysis (and many others’ at the same time, with essentially the same effort) on your behalf
Can make use of other, non-Grid-aware, installations
Only one debugging effort to adapt to slightly different environments
24
Summary (and Lessons)
• We exceeded our design data rate by a factor of 60 (2 by compression, 30 by logging rate increases)
• We wrote 350 TB in Run 4, 500 TB in Run 5, ~150 TB in the runs before -> ~1 PB
• Increased data rate simplifies many tough trigger-related problems
• Get physics that one can't trigger on
• Additional hardware relatively inexpensive and easily expandable
• Offline and analysis cope via the “train” paradigm shift
25
Lessons
• First and foremost, compress. Do NOT let “soft” data hit the disks. CPU power is comparably cheap. Distribute the compression load.
• Buffer the data. Even if your final storage system can deal with the instantaneous rates, take advantage of having the files around for a decent amount of time for monitoring, calibration, etc
• Don't be afraid to take more data than you can conceivably analyze before the next run. There's a new generation of grad students and post-docs coming, and technology is on your side
• Do not allow random access to files on a large scale. Impose a concept that maximizes the “return on investment” (file retrieval)
The End
26