Site Report
US CMS T2 Workshop
Samir Cury on behalf of T2_BR_UERJ Team
Servers' Hardware profile
• SuperMicro machines
• 2 x Intel Xeon dual core @ 2.0 GHz
• 4 GB RAM
• RAID 1 - 120 GB HDs
Nodes Hardware profile (40)
• Dell PowerEdge 2950
– 2 x Intel Xeon Quad core @ 2.33 GHz
– 16 GB RAM
– RAID 0 – 6 x 1 TB Hard Drives
• CE Resources
– 8 Batch slots
– 66.5 HS06
– 2 GB RAM / Slot
• SE Resources
– 5.8 TB usable for dCache or Hadoop
Private network only
Nodes Hardware profile (2+5)
• Dell R710
– 2 are Xen Servers – not worker nodes
– 2 x Intel Xeon Quad core @ 2.4 GHz
– 16 GB RAM
• RAID 0 – 6 x 2 TB Hard Drives
• CE
– 8 Batch Slots (or more?)
– 124.41 HS06
– 2 GB RAM / Slot
• SE
– 11.8 TB for dCache or Hadoop
Private network only
First-phase nodes profile (82)
• SuperMicro Server
– 2 x Intel Xeon single core @ 2.66 GHz
– 2 GB RAM
– 500 GB Hard Drive & 40 GB Hard Drive
• CE Resources
– Not used – old CPUs & low RAM per node
• SE Resources
– 500 GB per node
Plans for the future - Hardware
• Buying 5 more Dell R710
• Deploying 5 R710 when the disks arrive
– 80 more cores
– 120 TB more storage
– 1,244 HS06 more
Total
• CE – 40 PE 2950 + 10 R710 = 400 cores || 3.9 kHS06 (40 × 66.5 + 10 × 124.41 ≈ 3,904 HS06)
• SE – 240 + 120 + 45 = 405 TB
Software profile – CE
• OS – CentOS 5.3, 64-bit
• 2 OSG Gatekeepers
– Both running OSG - 1.2.x
– Maintenance tasks eased by redundancy – fewer downtimes
• GUMS 1.2.15
• Condor 7.0.3 for job scheduling
Software profile – SE
• OS – CentOS 4.7, 32-bit
• dCache 1.8
– 4 GridFTP Servers
• PNFS 1.8
• PhEDEx 3.2.0
Plans for the future: Software/Network
• SE Migration
– Right now we use dCache/PNFS
– We plan to migrate to BeStMan/Hadoop
• Some effort has already produced results
• Adding the new nodes to the Hadoop SE (see the sketch after this list)
• Migrate the data
• Test in a real production environment – jobs and users accessing it
• Network Improvement
– RNP (our network provider) plans to deliver a 10 Gbps link to us before the next SuperComputing Conference.
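As a rough illustration of the "adding the new nodes" step – a generic sketch for a stock Hadoop DFS of that era, not the site's actual procedure, and with a made-up hostname:

# Sketch only: generic Hadoop (0.20-era) admin commands, assuming the new node
# already has the Hadoop software and the cluster configuration installed.

# On the namenode: list the new worker for the cluster start/stop scripts
# (the hostname below is hypothetical).
echo "node-new-01.hepgrid.uerj.br" >> $HADOOP_HOME/conf/slaves

# On the new node: start the datanode daemon; it registers with the namenode.
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode

# From any node: rebalance existing blocks onto the newly added datanodes.
$HADOOP_HOME/bin/hadoop balancer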
T2 Analysis model & associated Physics groups
We have reserved 30 TB for each of the groups:
• Forward Physics
• B-Physics
• Studying the possibility of reserving space for Exotica
The group has had several MSc & PhD students working on CMS analysis for a long time – these are well supported
Some Grid users submit jobs, occasionally run into trouble and give up – they don't ask for support
Developments
• Condor mechanism based on suspension, to give priority to a very small pool of important users (a config sketch follows below):
– 1 pair of batch slots per core
– When a priority user's job arrives, it suspends the normal job on the paired batch slot
– Once the priority job finishes and vacates the slot, its paired job automatically resumes
– Documentation can be made available for those interested
– Developed by Diego Gomes
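A minimal condor_config sketch of how such a suspend policy can be expressed, shown for a single core for brevity; the slot layout, the PriorityUser job attribute and the expressions are illustrative assumptions, not the site's actual configuration:

# Sketch only: one core advertised as a pair of slots.
# slot1 runs normal jobs, slot2 is reserved for the priority pool.
NUM_CPUS = 2

# Expose each slot's State in the other slots' ads (as slot1_State, slot2_State).
STARTD_SLOT_ATTRS = State

# Assumed convention: priority users add "+PriorityUser = True" to their submit files.
START = (SlotID == 1) || (TARGET.PriorityUser =?= True)

# Pause the normal job while the priority slot is claimed; resume once it is vacated.
WANT_SUSPEND = True
SUSPEND  = (SlotID == 1) && (slot2_State =?= "Claimed")
CONTINUE = (SlotID == 1) && (slot2_State =!= "Claimed")

# A real 8-core node would advertise 16 slots and repeat this pairing per core.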
Developments
• Condor4Web
– Web interface to visualize condor queue
• Shows grid DN’s
– Useful for Grid users who want to know how their jobs are being scheduled inside the site (see the query sketch below) – http://monitor.hepgrid.uerj.br/condor
– Available on http://condor4web.sourceforge.net
– Still has much room to evolve, but it already works
– Developed by Samir
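As a hedged illustration of the kind of queue query behind such a page (not Condor4Web's actual implementation), the grid DN of each queued job can be read from the x509userproxysubject job attribute:

# Sketch only: list id, owner, status and grid DN for queued jobs that carry
# an x509 proxy (jobs without one are filtered out by the constraint).
condor_q -constraint 'x509userproxysubject =!= UNDEFINED' \
         -format "%d." ClusterId \
         -format "%d " ProcId \
         -format "%s " Owner \
         -format "%d " JobStatus \
         -format "%s\n" x509userproxysubject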
CMS Center @ UERJ
During LISHEP 2009 (January) we inaugurated a small control room for CMS at UERJ:
Shifts @ CMS Center
Our computing team has participated in tutorials, and we now have four potential CSP shifters.
CMS Centre (quick) profile
• Hardware
– 4 Dell workstations with 22” monitors
– 2 x 47” TV’s
– Polycom SoundStation
• Software
– All the conferences, including those with the other CMS Centers, are done via EVO
Cluster & Team
• Alberto Santoro (General supervisor)
• Andre Sznajder (Project coordinator)
• Eduardo Revoredo (Hardware coordinator)
• Jose Afonso (Software coordinator)
• Samir Cury (Site admin)
• Fabiana Fortes (Site admin)
• Douglas Milanez (Trainee)
• Raul Matos (Trainee)
2009/2010 year's goals
• In 2009 we worked mostly on
– Getting rid of infrastructure problems
• Electrical insufficiency
• AC – many downtimes due to this
• These are solved now
– Besides those problems
• Running official production on small workflows
• Doing private production & analysis for local and Grid users
• 2010 goals
– Use the new hardware and infrastructure for a more reliable site
– Run heavier workflows and increase participation and presence in official production.
Thanks!
I want to formally thank Fermilab, USCMS and OSG for their financial help in bringing a UERJ representative here.
I also want to thank USCMS for this very useful meeting.
Questions? Comments?