1
Data Management
• D0 Monte Carlo needs
• The NIKHEF D0 farm
• The data we produce
• The SAM database
• The network
• Conclusions
Kors Bos, NIKHEF, Amsterdam
Fermilab, May 23, 2001
2
D0 Monte Carlo needs
• D0 trigger rate is 100 Hz, 10^7 seconds/yr → 10^9 events/yr
• We want at least 10% of that to be simulated → 10^8 events/yr
• To simulate 1 QCD event takes ~3 minutes (size ~2 Mbyte)
  – On an 800 MHz PIII
• So 1 cpu can produce ~10^5 events/yr (~200 Gbyte)
  – Assuming a 60% overall efficiency
• So our 100-cpu farm can produce ~10^7 events/yr (~20 Tbyte)
  – And this is only 10% of the goal we set ourselves
  – Not counting the Nijmegen D0 farm yet
• So we need at least an order of magnitude more:
  – UTA (50), Lyon (200), Prague (10), BU (64),
  – Nijmegen (50), Lancaster (200), Rio (25),
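The arithmetic on this slide can be checked with a short back-of-the-envelope script (a sketch; the 3.15×10^7 s wall-clock year, ~3 min/event, ~2 Mbyte/event, and 60% efficiency are the slide's own assumptions):

```python
# Back-of-the-envelope check of the MC production numbers on this slide.
SECONDS_PER_YEAR = 3.15e7   # wall-clock seconds in a year
EFFICIENCY = 0.60           # overall farm efficiency (slide assumption)
SIM_TIME_S = 3 * 60         # ~3 minutes per QCD event on an 800 MHz PIII
EVENT_SIZE_GB = 2e-3        # ~2 Mbyte per event
N_CPUS = 100                # size of the NIKHEF D0 farm

events_per_cpu = SECONDS_PER_YEAR * EFFICIENCY / SIM_TIME_S  # ~1e5 events/yr
gbytes_per_cpu = events_per_cpu * EVENT_SIZE_GB              # ~200 Gbyte/yr
farm_events = events_per_cpu * N_CPUS                        # ~1e7 events/yr
farm_tbytes = gbytes_per_cpu * N_CPUS / 1000                 # ~20 Tbyte/yr

print(f"{events_per_cpu:.0f} events/cpu/yr, {gbytes_per_cpu:.0f} GB/cpu/yr")
print(f"{farm_events:.2e} events/yr, {farm_tbytes:.0f} TB/yr on the farm")
```

This reproduces the ~10^5 events (~200 Gbyte) per cpu and ~10^7 events (~20 Tbyte) per year for the 100-cpu farm.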
3
Example: Min.bias
• Did a run with 1000 events on all cpu's
  – Took ~2 min./event
  – So ~1.5 days for the whole run
  – Output file size ~575 Mbyte
• We left those files on the nodes
  – A reason for enough local disk space!
• Intend to repeat that "sometimes"
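The run numbers are easy to verify (a sketch; the assumption that all 100 cpus took part comes from the farm size quoted earlier):

```python
# Sanity check on the minimum-bias run: duration, and disk left on the nodes.
N_EVENTS = 1000        # events per cpu in this run
MIN_PER_EVENT = 2      # ~2 minutes per event
FILE_MB = 575          # output file size per node
N_NODES = 100          # assumption: all 100 farm cpus took part

run_days = N_EVENTS * MIN_PER_EVENT / 60 / 24   # ~1.4 days
local_gb = FILE_MB * N_NODES / 1000             # ~57 GB parked on local disks

print(f"run took ~{run_days:.1f} days, ~{local_gb:.0f} GB left on the nodes")
```

The ~57 GB parked across the nodes is exactly why the slide stresses having enough local disk space.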
4
Output data
-rw-r--r-- 1 a03 computer        298 Nov  5 19:25 RunJob_farm_qcdJob308161443.params
-rw-r--r-- 1 a03 computer 1583995325 Nov  5 10:35 d0g_mcp03_pmc03.00.01_nikhef.d0farm_isajet_qcd-incl-PtGt2.0_mb-none_p1.1_308161443_2000
-rw-r--r-- 1 a03 computer        791 Nov  5 19:25 d0gstar_qcdJob308161443.params
-rw-r--r-- 1 a03 computer        809 Nov  5 19:25 d0sim_qcdJob308161443.params
-rw-r--r-- 1 a03 computer   47505408 Nov  3 16:15 gen_mcp03_pmc03.00.01_nikhef.d0farm_isajet_qcd-incl-PtGt2.0_mb-none_p1.1_308161443_2000
-rw-r--r-- 1 a03 computer       1003 Nov  5 19:25 import_d0g_qcdJob308161443.py
-rw-r--r-- 1 a03 computer        912 Nov  5 19:25 import_gen_qcdJob308161443.py
-rw-r--r-- 1 a03 computer       1054 Nov  5 19:26 import_sim_qcdJob308161443.py
-rw-r--r-- 1 a03 computer        752 Nov  5 19:25 isajet_qcdJob308161443.params
-rw-r--r-- 1 a03 computer        636 Nov  5 19:25 samglobal_qcdJob308161443.params
-rw-r--r-- 1 a03 computer  777098777 Nov  5 19:24 sim_mcp03_psim01.02.00_nikhef.d0farm_isajet_qcd-incl-PtGt2.0_mb-poisson-2.5_p1.1_308161443_2000
-rw-r--r-- 1 a03 computer       2132 Nov  5 19:26 summary.conf
5
Output data translated
0.047 Gbyte  gen_*
1.5 Gbyte    d0g_*
0.7 Gbyte    sim_*
             import_gen_*.py
             import_d0g_*.py
             import_sim_*.py
             isajet_*.params
             RunJob_farm_*.params
             d0gstar_*.params
             d0sim_*.params
             samglobal_*.params
             summary.conf
12 files for generator + d0gstar + psim
But of course only 3 big ones
Total ~2 Gbyte
Per day, on 100 cpu's: total 200 Gbyte/day!
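The daily total follows directly from the per-job sizes above (a sketch using this slide's rounded sizes and its assumption of one job set per cpu per day):

```python
# Extrapolate the per-job output volume to a daily farm total.
SIZES_GB = {"gen_*": 0.047, "d0g_*": 1.5, "sim_*": 0.7}  # the 3 big files
N_CPUS = 100   # one job set per cpu per day (slide assumption)

per_job_gb = sum(SIZES_GB.values())   # ~2 GB: "Total ~2 Gbyte"
per_day_gb = per_job_gb * N_CPUS      # ~200 GB/day on the farm

print(f"~{per_job_gb:.1f} GB per job set, ~{per_day_gb:.0f} GB/day on 100 cpus")
```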
6
Automation
• Mc_runjob (modified)
  – Prepares MC jobs (gen+sim+reco+anal)
    • (e.g.) 300 events per job/cpu
    • Repeat (e.g.) 500 times
  – Submits them into the batch system (FBSNG)
    • Run on the nodes
    • Moves the executable to the nodes + some files
  – Copies to the fileserver after completion
    • A separate batch job on the fileserver
    • Data moves between nodes and server
  – Submits them into SAM
    • SAM does the file transfers to Fermi and SARA
• Runs for a week …
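The chain above runs per job set in a fixed order; schematically (a pure-Python sketch of the control flow only — the stage names mirror the fbs job steps, but the run_stage helper is hypothetical, not the actual mc_runjob/FBSNG code):

```python
# Schematic of the production chain: each job set passes through three
# batch stages in order: mcc (generate+simulate on a node), rcp (copy the
# output to the fileserver), sam (declare files and trigger transfers).
STAGES = ["mcc", "rcp", "sam"]

def run_stage(job_id: int, stage: str, log: list) -> None:
    # Hypothetical stand-in for submitting one FBSNG batch step.
    log.append((job_id, stage))

def produce(n_jobs: int) -> list:
    log = []
    for job_id in range(n_jobs):
        for stage in STAGES:   # rcp only after mcc, sam only after rcp
            run_stage(job_id, stage, log)
    return log

history = produce(2)
print(history)
```

The point of the ordering is the dependency: the copy to the fileserver can only run once the simulation output exists, and the SAM declaration only once the files sit on the server.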
7
[Diagram: MC production dataflow. An mcc request enters the farm server, which launches an fbs job with three steps: 1 mcc (runs on the 50+ nodes, 40 GB local disk), 2 rcp (copies the mcc output to the 1.2 TB file server), 3 sam (registers the files in the SAM DB and ships them to the datastores at SARA and FNAL). Control, data, and metadata flows connect the farm server, nodes, file server, and SAM DB.]
8
Network bandwidth
• NIKHEF - SURFnet: 1 Gbit
• SURFnet: Amsterdam - Chicago 622 Mbit
• ESnet: Chicago - Fermilab 155 Mbit ATM
• But ftp gives us ~4 Mbit/sec
• bbftp gives us ~25 Mbit/sec
• bbftp processes in parallel: ~45 Mbit/sec
For 2002:
• NIKHEF - SURFnet: 2.5 Gbit
• SURFnet: Amsterdam - Chicago 622 Mbit
• SURFnet: Amsterdam - Chicago 2.5 Gbit optical
• Chicago - Fermilab: ? More than 155
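The measured rates put the ~2 Gbyte per job set in perspective (a sketch; the rates are the ftp/bbftp figures from this slide):

```python
# Time to ship one ~2 GB job set at the measured transfer rates.
JOB_GB = 2.0
RATES_MBIT = {"ftp": 4, "bbftp": 25, "parallel bbftp": 45}

times_min = {tool: JOB_GB * 8e9 / (mbit * 1e6) / 60
             for tool, mbit in RATES_MBIT.items()}

for tool, minutes in times_min.items():
    print(f"{tool:>15}: ~{minutes:.0f} min per job set")
```

Plain ftp needs over an hour per job set; parallel bbftp brings it down to a few minutes, which is what makes shipping 100 job sets a day feasible.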
9
Network capacity internally
[Plot: access capacity, 1999-2002, on a log scale from 10 Mbit/s to 100 Gbit/s: from 155 Mbit/s and 1.0 Gbit/s on SURFnet4 to 2.5 Gbit/s, 10 Gbit/s, and 20 Gbit/s on SURFnet5.]
10
TA network capacity
[Map: transatlantic network capacity. National research networks (NL SURFnet, UK SuperJANET4, Fr Renater, It GARR-B) link via GEANT (Geneva) to New York at 622 Mb and 2.5 Gb, connecting through STAR-TAP / STAR-LIGHT to Abilene, ESNET, and MREN.]
11
Network load last week
• Needed for 100 MC cpu's: ~10 Mbit/s (200 GB/day)
• Available to Chicago: 622 Mbit/s
• Available to FNAL: 155 Mbit/s
• Needed next year (double capacity): ~25 Mbit/s
• Available to Chicago: 2.5 Gbit/s: a factor 100 more!!
• Available to FNAL: ??
12
Conclusions
• Producing a lot of data is easy
• Storing a lot of data is less easy, but still easy
• Moving a lot of data is even less easy, but still easy
So what is the problem?
• Managing a lot of data is difficult: metadata, database
• The network around Fermilab/CERN is getting tight
• Otherwise there is enough bandwidth!
Conclusion: do the easiest thing: don't store or move, recalculate!!