The DØ Computing Model
![Page 1: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/1.jpg)
Sept 12 2001 Wyatt Merritt DØ Collaboration Meeting Plenary Session
The DØ Computing Model
• Overview
  • The picture
  • Planning history
  • Status of acquisitions
  • Performance
• More detail
  • On the current operation
  • On the R & D
• General Status
• Future plan
![Page 2: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/2.jpg)
Overview
• The data handling system
  • SAM
  • ENSTORE
  • Robot(s)
• The offline user computing systems
  • dØmino - O(20 TB) disk
  • linux analysis server(s) - O(2 TB) disk
  • linux development machines - O(0.2 TB)
    • build cluster
    • ClueDØ
    • remote linux machines
  • non-development desktops
• Associated systems
  • Fermilab production farm (raw data reconstruction)
  • Remote production farms (simulation)
  • Database servers
• High bandwidth into robot
![Page 3: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/3.jpg)
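[Diagram: the DØ computing picture. Labeled components: Detector, Robot, dØmino (27 TB), Linux Farms, Database Servers, Analysis Cluster 1, lxbld, ClueDØ, ClueDØ Server, Linux Compute Server, NT Desktops, High speed Network. Other labels: "Monte Carlo handled remotely"; capacities ~1 TB and ~0.2 TB; link rates 12.5 Mb/s and 150 Mb/s.]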
![Page 4: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/4.jpg)
Planning history
• Original plan: January ’97
• DØ Internal Review: February ’97
• External review: Von Rüden Committee
  • Mar ’97, Oct ’97, Jun ’98, Jan ’99, Jun ’99
• Funding profile (DMNAG - joint with CDF) approved ’97
• Plan updates
  • January ’99 for VR IV
  • Global Computing Model reports (’98-’99) [Addition of Analysis Servers to plan]
• Plan implementation ’97 - ’01
  • Run II Computing and Software Project: co-leaders + Computing Planning Board
![Page 5: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/5.jpg)
Status of acquisitions
• Analysis cpu
  • dØmino: 192 proc O2000 - complete (except add memory)
  • Desktops: responsibility of institutions
  • Analysis Clusters/Servers - 1 purchased of (6?)
• Reconstruction cpu
  • 200 processors acquired of 400 planned
  • [40 Hz cap @ current reco cpu perf; 80 Hz @ target reco perf]
• Disk storage
  • 30 TB total - complete (plan was 15 TB)
  • See allocation slide
![Page 6: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/6.jpg)
Disk space in the offline systems

Total available disk space: 30 Tbyte (all units are Tbytes)
• 27 Tbytes are on D0MINO
• 3 Tbytes are on: D0test, d0lxac1, d0lxbld

| Disk space on D0MINO | Available | Used | Allocated |
|---|---|---|---|
| Scratch, releases & other config. | 1 | 1 | 1 |
| SAM cache | 6 | variable | 6 |
| DST/mDST | 12 | variable | 12 |
| Project disks | 4 | ~2.0? | 2.6 |
| Tmp (group space) | 2 | ? | 0.9 |
| contingency | 2 | | |
| TOTAL | 27 | | 22.5 |
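
The totals on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are invented here, and it assumes that the figures 1, 6, 12, 2.6 and 0.9 are the per-area "Allocated" values whose total is the 22.5 shown on the slide.

```python
# Disk figures in Tbytes, copied from the slide; all names are illustrative.
available_on_d0mino = {
    "Scratch, releases & other config.": 1.0,
    "SAM cache": 6.0,
    "DST/mDST": 12.0,
    "Project disks": 4.0,
    "Tmp (group space)": 2.0,
    "contingency": 2.0,
}
on_other_systems = 3.0  # D0test, d0lxac1, d0lxbld

# Assumption: these are the per-area "Allocated" values from the slide.
allocated = [1.0, 6.0, 12.0, 2.6, 0.9]

d0mino_total = sum(available_on_d0mino.values())
offline_total = d0mino_total + on_other_systems
allocated_total = round(sum(allocated), 1)
unallocated = round(d0mino_total - allocated_total, 1)

print(f"D0MINO available:  {d0mino_total:.1f} TB")
print(f"All offline disk:  {offline_total:.1f} TB")
print(f"Allocated so far:  {allocated_total:.1f} TB")
print(f"Not yet allocated: {unallocated:.1f} TB")
```

Under that reading the available areas add up to the quoted 27 TB on D0MINO and 30 TB overall, and the unallocated remainder includes the 2 TB contingency.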
![Page 7: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/7.jpg)
Status of acquisitions cont’d
• Robotic tape storage
  • 1 ADIC robot (750 TB capacity) - complete
  • 18 Mammoth II tape drives - will be retired
  • 6 LTO drives - now
  • 2 STK robots (600 TB capacity) - FY02
  • 9 STK 9940 drives - FY02
  • Post shutdown stopgap - use existing STKen w/ 4 drives
• Database servers - complete
  • 2 SUN systems w/ 600 GB disk
![Page 8: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/8.jpg)
Performance
• Farm production stats
• dØmino cpu & mem stats
• AC1 cpu & mem stats
• SAM & encp stats
• Disk usage stats
• Conclusion: Chief needs
  • More memory for dØmino
  • More reliable tape drives
  • More farm nodes
  • More linux cpu
• Open questions - DB server upgrades?
![Page 9: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/9.jpg)
Farm Production Statistics
See web link from Main DØ Computing for weekly reports
Week of 08/31 - 09/06:
• 800,000 events processed, of which 140,000 were from data collected in that week
• 1.9 M events collected in that week
Problems in this week:
• encp problem (code change from ENSTORE)
• disk failure on dØbbin (the farm IO server)
• several other problems as well...
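
To put the weekly counts on the same footing as the Hz figures quoted elsewhere in the talk, they can be turned into average rates. This is a rough sketch, not an official farm metric: it assumes a full seven-day wall-clock week, so downtime is averaged in, and the variable names are made up for the example.

```python
# Weekly farm figures copied from the slide; variable names are illustrative.
SECONDS_PER_WEEK = 7 * 24 * 3600  # assumes a full 7-day wall-clock week

events_collected = 1_900_000
events_processed = 800_000
processed_from_this_week = 140_000

avg_collect_hz = events_collected / SECONDS_PER_WEEK
avg_process_hz = events_processed / SECONDS_PER_WEEK
share_of_week_reconstructed = processed_from_this_week / events_collected

print(f"Average collection rate: {avg_collect_hz:.2f} Hz")
print(f"Average farm rate:       {avg_process_hz:.2f} Hz")
print(f"This week's data already processed: {share_of_week_reconstructed:.1%}")
```

Averaged this way, the detector wrote roughly 3 Hz over the week while the farm reconstructed roughly 1.3 Hz, and only about 7% of that week's new events had been processed by its end.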
![Page 10: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/10.jpg)
The Current Operation
• Code release model
• Mapping activities to systems
• ClueD0 operation
• Remote farm operation
• Role of the ORB
![Page 11: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/11.jpg)
The code release model
• Weekly test releases
• Production releases every three months
• Weekly subsystem coordinators meeting:
  • Minutes to d0rug mailing list
  • Rules for interface changes
  • Schedules for big disruptive changes (e.g. switch to KAI 4.0)
![Page 12: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/12.jpg)
Mapping activities to systems
• Code development: your Linux box, if possible; d0mino is the backup solution
• Large sample processing: a SAM station
  • d0mino, lxac1, special farm allocation (gtr), (ClueD0 - in R&D)
• Small sample processing: create derived DS on SAM station, transfer to desktop
• Office/Web browsing: use your desktop!
• Remote users: new position to address needs
![Page 13: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/13.jpg)
Mapping activities to systems
• Disk usage
  • Home areas - backed up; you can ask for up to 250 MB (possibility of more for good reason), BUT NFS-mounted - don’t use for data files!
  • TMP areas - not backed up. Code development and/or data files, allocated per institution. 37 institutions are using it so far. A good place to start off if you are not working with a well-defined project.
  • PRJ areas - not backed up. Code development and/or data files, allocated per project. 3 large pools: commissioning, algorithm development, simulation, plus physics and ID groups and some smaller projects.
• Web pages - DØ Main Computing (SAM Data Handling section) --> General description of where data samples are stored in our system
![Page 14: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/14.jpg)
ClueD0 Operation
• The current population is:
  • 111 nodes with 138 CPUs and a total memory of 37 GB
  • 396 users
• Rules for joining and policies can be found at:
  • http://www-clued0.fnal.gov/clued0/
  • http://www-clued0.fnal.gov/clued0/policies.html
• Current difficulties from the lack of Redhat 7.1 builds are being actively worked on
![Page 15: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/15.jpg)
Monte Carlo Production Status
• Current Software – mcp07
  • p07.00.05a: Generator, DØgstar, DØsim
  • p08.12.00: DØreco, recoanalyze
  • 950 kevents generated at reco level
  • Run IIB Simulation is a major effort
  • Will move to p08.13.00 to remove memory leak
• Future Releases – p09.10.00
  • Problem running DØgstar under investigation
  • Plate level available
  • p10 certification will be available by the end of the month
![Page 16: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/16.jpg)
The Offline Resources Board
• Charge: Allocate offline resources according to the experiment’s priorities
  • Project & tmp disk
  • Sample priorities for simulation on remote farms
  • Partitions in SAM cache
  • Batch queues
• Chair: Nick Hadley
• Web Page: http://www-d0.fnal.gov/Run2Physics/orb/d0_private/orb_home.html
• Institutions which have no tmp disk allocation and have active users: email to [email protected] - 18 GB will be allocated
![Page 17: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/17.jpg)
R & D
• Analysis clusters - one in service
• ClueD0 servers (a relocated analysis cluster) - software being tested; networking strategy being developed
• Compute servers for dØmino (a user-accessible farm) - 2 nodes available for tests
• Remote farms for raw data reconstruction and analysis
• Remote desktop analysis
![Page 18: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/18.jpg)
Institutional contributions
• Desktop seats
• Backup tapes
• Remote simulation capacity
• Disk for dØmino via budget code - issues
  • How to allocate between project & tmp?
  • Lifetime for contribution?
  • Unit of contribution: 1 rack of disk
• Analysis cluster for Feynman via budget code
  • Similar issues
• Analysis cluster for ClueDØ - all the above issues + SAM bandwidth, networking, sysadmin, ...
![Page 19: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/19.jpg)
General Status - Where are the limits/problems?
• Online
  • Max rate tested: 40 Hz to tape
  • Max rate sustained for a shift, to date: ~25 Hz to tape
  • Max rate expected with next iteration: 60 Hz to tape
  • Final limitation: tape budget (FY02 = ~400 TB)
• Running p10 on the farms
  • Processes raw data @ 23 sec/event
  • Thanks to Alg Group - worked out of box on raw data
  • Limits: ~2-3 Hz w/ current nodes & cpu perf of reco
  • Output size: HUGE - writing too much tape, breaking DB model, using more than allocated network and disk resources all down the line
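
The farm limit follows from the per-event reconstruction time: a farm of N CPUs that each take t seconds per event can sustain at most N / t events per second. The sketch below inverts that relation. It is a back-of-the-envelope illustration only: it assumes every CPU is kept fully busy on one event at a time with no I/O stalls, and the function name is invented here.

```python
import math

SEC_PER_EVENT = 23.0  # p10 reconstruction time per raw event, from the slide


def cpus_needed(rate_hz: float, sec_per_event: float = SEC_PER_EVENT) -> int:
    """CPUs needed to sustain rate_hz if each CPU handles one event at a time."""
    return math.ceil(rate_hz * sec_per_event)


# Farm limit quoted on the slide: ~2-3 Hz with the current nodes.
implied_farm_cpus = (cpus_needed(2.0), cpus_needed(3.0))

# CPUs of today's speed that would be needed to match the online rates.
needed_for_online = {hz: cpus_needed(hz) for hz in (25.0, 40.0, 60.0)}

print("CPUs implied by the 2-3 Hz limit:", implied_farm_cpus)
for hz, n in needed_for_online.items():
    print(f"{hz:.0f} Hz to tape would need about {n} such CPUs")
```

The second calculation deliberately ignores faster nodes and any reduction in reco time per event, so it overstates what a future farm would need; it only shows how far 23 sec/event is from the online rates.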
![Page 20: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/20.jpg)
Expected Farm Performance

| Farm configuration | @ Current cpu perf | @ Target cpu perf |
|---|---|---|
| Existing farm | 3 Hz | 6 Hz |
| + FY01 purchase (32 nodes) | 5 Hz | 10 Hz |
| + FY02 purchase (200 nodes) | 36 Hz | 72 Hz |
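
Two things can be read off these numbers with simple arithmetic: the speed-up assumed between current and target cpu performance, and the extra rate each purchased node is expected to bring. The sketch assumes the rows are cumulative (each "+" row includes the ones above it); the data layout and names are illustrative.

```python
# Expected farm rates in Hz, copied from the slide; names are illustrative.
# Assumption: rows are cumulative, so each "+" row includes the rows above it.
stages = [
    # (label, nodes added, rate @ current cpu perf, rate @ target cpu perf)
    ("Existing farm", None, 3.0, 6.0),
    ("+ FY01 purchase", 32, 5.0, 10.0),
    ("+ FY02 purchase", 200, 36.0, 72.0),
]

# Ratio of target to current rate in each row.
speedups = [target / current for _, _, current, target in stages]

# Rate gained per added node at current cpu perf, row by row.
gain_per_node = {}
for (_, _, prev_rate, _), (label, nodes, rate, _) in zip(stages, stages[1:]):
    gain_per_node[label] = (rate - prev_rate) / nodes

print("target / current:", speedups)
for label, gain in gain_per_node.items():
    print(f"{label}: {gain:.4f} Hz per added node")
```

Every row assumes the same factor of two from reaching the target reco performance. On a per-node basis the FY02 nodes are expected to contribute about two and a half times as much as the FY01 nodes; the slide does not say why.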
![Page 21: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/21.jpg)
General Status - Where are the limits/problems?
• SAM/ENSTORE status
  • Working for many months with servers on automatic recovery
  • Not all features complete (pick events)
  • 5 GB interfaces can deliver 150 MB/sec to dØmino
• Robot status
  • Design rates met, but robustness severely limited by M II drive error rate - plan switchover by end of shutdown
![Page 22: The DØ Computing Model](https://reader036.vdocuments.us/reader036/viewer/2022062804/56814c0d550346895db90cb1/html5/thumbnails/22.jpg)
Future Plan
• Major purchases still in FY02
  • New robot and reliable drives
  • New farm nodes
  • More memory for dØmino
  • *Some* linux cpu
• Continue R&D for linux analysis strategies
  • Hope to establish effectiveness and practicality of the three proposed models: AC, CS, AC@DØ
• Operational improvements
  • SAM personnel @ DØ
  • RECO: continue with current release schedules; emphasize quality control and testing for releases; push on cpu, memory, output size issues