New CERN CAF facility: parameters, usage statistics, user support
Marco Meoni, Jan Fiete Grosse-Oetringhaus
CERN - Offline Week, 24.10.2008
Outline
• New CAF: features
• CAF1 vs CAF2
• Processing Rate comparison
• Current Statistics
  – Users, Groups
  – Machines, Files, Disks, Datasets, CPUs
• Staging problems
• Conclusions
Timeline
• 28.09: startup of the new CAF cluster
• 01.10: first day with users on the new cluster
• 07.10: old CAF decommissioned by IT
Usage
• 26 workers instead of 33 (but much faster, see later)
• Head node is « alicecaf » instead of « lxb6046 »
• GSI-based authentication, AliEn certificate needed
  – Announced since July, but many last-minute users whose AliEn account differs from their AFS account, or whose server certificate is unknown
• Datasets cleaned up; only the latest data production is staged (First Physics, stage 3)
• AF v4-15 meta package redistributed
New CAF
Technical Differences
• cmsd (Cluster Management Service Daemon)
  – Why? olbd is not supported any longer
  – What? Dynamic load balancing of files and of the data name space
  – How? The stager daemon benefits from:
    • bulk prepare replacing the per-file touch
    • bulk prepare allowing files to be "co-located" on the same node
• GSI authentication
  – Secure communication using user certificates, and LDAP-based configuration management
Architectural Differences

                    New CAF         Old CAF
Architecture        AMD 64-bit      Intel 32-bit
Machines            13 x 8-core     33 x dual CPU
Space for staging   13 x 2.33 TB    33 x 200 GB
Workers             26 (2/node)     33 (1/node)
Mperf               8570            1307
• Why « only » 26 workers?
  – You could use 104 if you are alone
  – With 26 workers, 4 users can effectively run concurrently
  – Estimated average of 8 concurrent users…
• Processing units 6.5x faster than on the old CAF
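The headline figures above can be cross-checked from the cluster parameters (a back-of-the-envelope sketch, not from the slides):

```python
# Sketch: cross-checking the quoted figures from the cluster parameters.
new_mperf, old_mperf = 8570, 1307          # Mperf ratings from the table
print(round(new_mperf / old_mperf, 1))     # ~6.6, i.e. the quoted ~6.5x

nodes, cores_per_node = 13, 8
max_workers = nodes * cores_per_node
print(max_workers)                         # 104 workers if you are alone

default_workers = 26                       # 2 workers per node
print(max_workers // default_workers)      # 4 users can run concurrently
```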
Outline
• CAF2: features
• CAF1 vs CAF2
• Processing Rate comparison
• Current Statistics
• Users, Groups
• Machines, Files, Disks, Datasets, CPUs
• Staging problems
• Conclusions
CAF1 vs CAF2 (Processing Rate)
• Test dataset
  – First Physics (stage 3), pp, Pythia6, 5 kG (0.5 T), 10 TeV
  – /COMMON/COMMON/LHC08c11_10TeV_0.5T
  – 1840 files, 276k events
• Tutorial task that runs over the ESDs and displays the pT distribution
• Other comparison test: RAW data reconstruction (Cvetan)
Reminder
• The test depends on the file distribution of the dataset used
• Parallel code:
  – creation of the workers
  – file validation (workers opening the files)
  – event loop (execution of the selector on the dataset)
• Serial code:
  – initialization of the PROOF master, session and query objects
  – file lookup
  – packetizer (distribution of file slices)
  – merging (the biggest task)
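The serial/parallel split matters because the serial parts bound the achievable speedup (Amdahl's law). A sketch with purely illustrative numbers, not measurements:

```python
# Sketch (illustrative numbers only): why the serial parts (master init,
# lookup, packetizer, merging) limit PROOF speedup as workers are added.
def total_time(serial, parallel, workers):
    # serial work is unaffected by workers; parallel work scales down
    return serial + parallel / workers

serial, parallel = 5.0, 100.0       # hypothetical seconds of work
t26 = total_time(serial, parallel, 26)
t104 = total_time(serial, parallel, 104)
print(round(t26 / t104, 2))         # ~1.48x, far below the 4x worker ratio
```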
#workers  #events  Size (GB)  Init time  Proc time  Ev/s   MB/s  Speedup  Efficiency
33        2k       0.25       0.8s       3s         644    50
33        20k      1.35                  17s        1143   77
33        120k     8.11                  49s        2423   164
33        200k     13.53                 1m23s      2405   163
33        276k     18.71                 2m34s      1783   120
26        2k       0.25       0.4s       2s         1062   81    1.6x
26        20k      1.35                  6s         3299   225   2.8x
26        120k     8.11                  28s        4253   289   1.8x
26        200k     13.53                 42s        4743   323   2.0x
26        276k     18.71                 55s        4365   340   2.8x
104       2k       0.25       0.9s       2s         848    124   0.8x
104       20k      1.35                  5s         3572   244   1.1x     27%
104       120k     8.11                  19s        6280   427   1.4x     35%
104       200k     13.53                 31s        6365   433   1.3x     32%
104       276k     18.71                 45s        6120   417   1.2x     30%
• Task executed 5 times and averaged
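The Speedup and Efficiency columns can be reproduced from the measured rates (a sketch; the assumed conventions are that speedup for 26 workers is the MB/s ratio against the 33-worker old CAF, and efficiency relates the 104-worker rate to a linear scale-up of the 26-worker run):

```python
# Sketch (assumed conventions, not from the slides): reproducing the
# Speedup and Efficiency columns from the measured MB/s rates.

def speedup(new_rate, old_rate):
    """Rate relative to a baseline run (e.g. the 33-worker old CAF)."""
    return new_rate / old_rate

def efficiency(rate, rate_baseline, workers=104, baseline_workers=26):
    """Fraction of a perfectly linear scale-up of the baseline run."""
    return rate / (workers * rate_baseline / baseline_workers)

# 200k events: 26 workers reach 323 MB/s, the old CAF 163 MB/s
print(round(speedup(323, 163), 1))        # ~2.0x, as in the table

# 20k events: 104 workers reach 244 MB/s vs 225 MB/s with 26 workers
print(round(100 * efficiency(244, 225)))  # ~27%, as in the table
```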
Processing Rate Comparison (1)
• The final average rate is the only important information
(Plots: 104 workers, 200k events; 104 workers, 276k events)
• The final tail reflects the workers stopping one by one
  – data unevenly distributed
• A longer tail shows a worker overloaded on the last packet(s)
  – at most 3 workers help on the same « slow » packet
Processing Rate Comparison (2)
(Plots: events/s and MB/s vs. #events, for 104, 26 and 33 workers)
Outline
• CAF2: features
• CAF1 vs CAF2
• Processing Rate comparison
• Current Statistics
• Users/Groups
• Machines, Files, Disks, Datasets, CPUs
• Staging problems
• Conclusions
CAF Usage
• The available resources on CAF must be used fairly
• Closest attention is paid to how disks and CPUs are used
• Users are grouped (sub-detectors / physics working groups)
• Each group
  – has a disk space quota, used to stage datasets from AliEn
  – has a CPU fairshare target (priority) to regulate concurrent queries
CAF Groups

Group    #Users
PWG0     21 (5)
PWG1     3 (1)
PWG2     39 (21)
PWG3     18 (8)
PWG4     30 (17)
EMCAL    2 (1)
HMPID    1 (1)
ITS      6 (3)
T0       2 (1)
MUON     4 (3)
PHOS     4 (1)
TPC      3 (2)
TOF      1 (1)
TRD      4 (0)
ZDC      1 (1)
VZERO    2 (0)
ACORDE   1 (0)
PMD      3 (0)
DEFAULT

• 19 registered groups
• 145 (60) registered users
• In brackets: the situation at the previous offline week
CAF Status Table
Files Distribution
• Nodes with more files can produce tails in the processing rate
• Above a defined threshold, no further files are stored on a node
• Min: 1727, Max: 1863 files per node
• Max difference: 8%
Disk Usage
• Min: 105, Max: 116
• Max difference: 10%
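The quoted "max difference" figures are consistent with (max − min) / min applied to the per-node counts above (a sketch, not from the slides):

```python
# Sketch: the "max difference" figures above as (max - min) / min.
def max_difference(max_val, min_val):
    return (max_val - min_val) / min_val

print(round(100 * max_difference(1863, 1727)))  # ~8% for files per node
print(round(100 * max_difference(116, 105)))    # ~10% for disk usage
```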
Dataset Monitoring
• 28 TB disk space for staging
  – PWG0: 4 TB
  – PWG1: 1 TB
  – PWG2: 1 TB
  – PWG3: 1 TB
  – PWG4: 1 TB
  – ITS: 0.2 TB
  – COMMON: 2 TB
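Summing the per-group quotas listed above shows how much of the 28 TB is explicitly allocated (a sketch; the remainder presumably covers the other groups and headroom):

```python
# Sketch: summing the per-group staging quotas listed above (TB).
quotas_tb = {"PWG0": 4, "PWG1": 1, "PWG2": 1, "PWG3": 1,
             "PWG4": 1, "ITS": 0.2, "COMMON": 2}
allocated = sum(quotas_tb.values())
print(allocated)       # 10.2 TB explicitly allocated
print(28 - allocated)  # ~17.8 TB of the 28 TB not itemized here
```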
CPU Quotas
• The default group is no longer the most consuming one
Outline
• CAF2: features
• CAF1 vs CAF2
• processing rate comparison
• Current Statistics
• Users, Groups
• Machines, Files, Disks, Datasets, CPUs
• File Staging
• Conclusions
File Stager
• CAF makes intensive use of 'prepare'
  – 0-size files in Castor2 cannot be staged, but their replicas are OK
  – A check at stager level avoids spawning endless prepare requests for the same empty file that can never get online
Stager flow (per file):
• Loop over the replicas (the CERN replica, if any, is taken first)
  – replica[i] in Castor && size == 0? → file corrupted, skip it
  – replica[i] not staged? → add it to StageLIST; otherwise skip it
  – non-Castor replicas are copied (API service)
• Finally, stage the whole StageLIST → STOP
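The decision flow above can be sketched in a few lines (hypothetical data model and names; this is an illustration of the logic, not the actual stager code):

```python
# Sketch of the stager decision flow (hypothetical data model):
# each replica is a dict with 'site', 'in_castor', 'size', 'staged'.
def handle_file(replicas, stage_list):
    """Decide what to do with one file; CERN replicas are tried first."""
    ordered = sorted(replicas, key=lambda r: r["site"] != "CERN")
    for rep in ordered:
        if rep["in_castor"] and rep["size"] == 0:
            # 0-size file in Castor2 can never come online: skip replica
            continue
        if not rep["in_castor"]:
            return ("copy", rep)    # fetch via the API service
        if not rep["staged"]:
            stage_list.append(rep)  # staged later via one bulk 'prepare'
            return ("stage", rep)
        return ("ok", rep)          # already staged, nothing to do
    return ("corrupted", None)      # every replica was unusable

stage_list = []
replicas = [{"site": "CERN", "in_castor": True, "size": 0, "staged": False},
            {"site": "CNAF", "in_castor": False, "size": 5, "staged": False}]
print(handle_file(replicas, stage_list)[0])  # 'copy': the CERN copy is empty
```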
Outline
• CAF2: features
• CAF1 vs CAF2
• Processing Rate comparison
• Current Statistics
• Files Distribution
• Users/Groups
• Staging
• Conclusions
Conclusions
• CAF usage
  – Subscribe to [email protected] using CERN SIMBA (http://listboxservices.web.cern.ch/listboxservices)
  – Web page at http://aliceinfo.cern.ch/Offline/Analysis/CAF
  – CAF tutorial once a month
• New CAF
  – Faster machines, more space, more fun
  – Shaky behavior due to the higher user activity is under intensive investigation
• Credits
  – PROOF team and IT for the prompt support
• If you (ever) cannot connect, just drop a mail and wait for…
  … « please try again »