Prague Tier-2 operations (grid2012.jinr.ru/docs/kouba_prague_t2_operations.pdf, 16.7.2012)
-
Prague Tier-2 operations
● Tomáš Kouba, Miloš Lokajíček
● GRID 2012, Dubna
● 16.7.2012
-
Outline
● Who we are, our users
● New HW
● Services
● Internal network
● External connectivity
● IPv6 testbed
-
Who we are, Our users

● Who we are
  – Regional Computing Centre for Particle Physics, Institute of Physics, Academy of Sciences of the Czech Republic
  – basic research in particle physics, solid state physics and optics
● Our users
  – scientists from our and other institutes of the Academy
  – Charles University, Czech Technical University
  – WLCG (ATLAS, ALICE), EGI (AUGER, CTA), D0 grid
-
WLCG grid structure

[Diagram: Prague Tier-2 connects to CERN (Tier-0/1) and KIT (Tier-1), with backup Tier-1s Taipei and BNL (FNAL); local Tier-3 centres at MFF, FJFI and ÚJF; other Tier-2s reached over the Internet.]
Disk space and computing capacity

Next year goal:
● support for Tier3 centers
● user support
-
Capacities over time
Year        HEPSPEC2006    %   TB disk            %
2009             10 340          186
2010             19 064  100      427            100
2011             23 484  100    1 714            100
  D0              9 331   40       35              2
  ATLAS           6 796   29    1 316 (16 MFF)    77
  ALICE           7 357   31      363 (60 Řež)    21
2012             29 192  100    2 521            100
  D0              9 980   34       35              1
  ATLAS          11 600   40    1 880 (16 MFF)    74
  ALICE           7 612   26      606 (100 Řež)   24

(values in parentheses: TB located at MFF / Řež)
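As a sanity check, the per-experiment rows can be verified against the yearly totals; a quick sketch using the 2012 numbers from the table above:

```python
# 2012 per-experiment shares from the capacity table above.
shares_2012 = {
    "D0":    {"hepspec": 9_980,  "disk_tb": 35},
    "ATLAS": {"hepspec": 11_600, "disk_tb": 1_880},
    "ALICE": {"hepspec": 7_612,  "disk_tb": 606},
}

# The experiment rows should add up to the 2012 totals (29 192 HS06, 2 521 TB).
total_hepspec = sum(v["hepspec"] for v in shares_2012.values())
total_disk_tb = sum(v["disk_tb"] for v in shares_2012.values())
```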
-
New HW in 2012
• Worker nodes
  – 23 nodes SGI Rackable C1001-G13
  – 2x Opteron 6274 (16 cores), 64 GB RAM, 2x 300 GB SAS
  – 374 W at full load, more than 5000 HEPSPEC2006 in total
  – delivered in a water-cooled rack
• Disk servers
  – 4 Supermicro units (4 servers + 3 JBODs)
  – 837 TB in total (400 TB still delayed because of floods)
• Infrastructure servers
  – 2x DL360 G7 (Hyper-V server, NFS server)
• UPS PowerWare 9390 (now Eaton)
  – 2x 100 kW, energy-saving mode (offline => 98% efficiency)
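A few figures implied by the worker-node bullets above; the per-node, per-core and total-power values are derived here, not quoted from the slide:

```python
nodes = 23
cores_per_node = 2 * 16          # 2x Opteron 6274, 16 cores each
watts_per_node = 374             # measured at full load
total_hepspec = 5000             # "more than 5000 HEPSpec" -> used as a lower bound

hepspec_per_node = total_hepspec / nodes               # ~217 HS06 per node
hepspec_per_core = hepspec_per_node / cores_per_node   # ~6.8 HS06 per core
farm_power_kw = nodes * watts_per_node / 1000          # ~8.6 kW at full load
```

The full-load draw of the whole batch of worker nodes is thus well within one of the 100 kW UPS units.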
-
[Photos: water-cooled rack (rubus01) housing disk servers and worker nodes; disk servers shown on/off (divider added); good sealing crucial]
-
Services
● Batch system: Torque/Maui
● UMD services
– 2x CreamCE
– MONBox
– SE DPM (1x head node, 15 disk nodes)
● VO specific
– AUGER dashboard
– squid (for cvmfs and frontier – ATLAS)
– VOBOX (ALICE)
– 2x SAM station (D0)
● All nodes installed automatically over network (PXE, kickstart, simple script to end installation)
● All further configuration performed by CFengine (version 2)
– We are evaluating Puppet
● New services in 2012:
– CVMFS (problem with full disks, direct access to CERN stratum 1)
– UMD worker nodes
– perfSONAR
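The PXE/kickstart install flow above typically pairs a generic boot image with a per-host kickstart file; a minimal sketch of generating one from a template (the template body, hostnames, and `finish-install.sh` are invented for illustration):

```python
from string import Template

# Hypothetical kickstart skeleton; a real one also carries partitioning and
# package sections. The %post hook stands in for the finishing script
# mentioned on the slide.
KICKSTART = Template("""\
install
url --url http://ks.example.farm/sl/
network --hostname $hostname --bootproto dhcp
%post
/root/finish-install.sh
%end
""")

def render_kickstart(hostname: str) -> str:
    """Produce the kickstart text for one worker node."""
    return KICKSTART.substitute(hostname=hostname)

ks = render_kickstart("wn123.example.farm")
```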
-
Monitoring
● Nagios
  – health of hardware, systems, SW, syslog monitor, SNMP traps
  – important errors by e-mail and SMS, the rest in consolidated mails 3 times per day
  – 7000 services on 466 hosts
  – WLCG data transfers, job execution
  – Multisite – alternative user interface, mass operations on groups of nodes
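Each of the ~7000 checks follows the usual Nagios plugin contract: one status line on stdout and an exit code of 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. A minimal sketch of such a check (the disk-usage thresholds are invented):

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk_usage(percent_used: float, warn: float = 80.0, crit: float = 90.0):
    """Return (exit_code, status_line) following the Nagios plugin convention."""
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.0f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.0f}% used"
    return OK, f"DISK OK - {percent_used:.0f}% used"

code, status = check_disk_usage(93.0)   # (2, 'DISK CRITICAL - 93% used')
```

A real plugin would print the status line and call `sys.exit(code)` so Nagios can pick up the state.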
-
Multisite Nagios UI
-
Netflow – network monitoring
● Flowtracker, Flowgrapher
● Useful for troubleshooting problems after the fact
  – e.g. finding the reason for poor ALICE efficiency at our site:
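Troubleshooting like the ALICE case boils down to aggregating exported flow records per endpoint and looking at the top talkers; a toy sketch over made-up records (real flows would come from the router's NetFlow export):

```python
from collections import Counter

# Simplified stand-ins for NetFlow records: (src_ip, dst_ip, bytes).
flows = [
    ("192.0.2.10", "198.51.100.5", 4_000_000),
    ("192.0.2.11", "198.51.100.5", 1_500_000),
    ("192.0.2.10", "203.0.113.7",  9_000_000),
]

bytes_by_dst = Counter()
for _src, dst, nbytes in flows:
    bytes_by_dst[dst] += nbytes

top_dst, top_bytes = bytes_by_dst.most_common(1)[0]   # biggest traffic sink
```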
-
Internal network
● CESNET upgraded our main Cisco router
  – 6506 -> 6509
  – supervisor SUP720 -> SUP2T
  – new 8x 10G X2 card
  – planned upgrade of power supplies 2x 3 kW -> 2x 6 kW
  – (2 cards 48x 1 Gbps, 1 card 4x 10 Gbps, FW service module)
● FWSM upgraded to support IPv6
● MTU increased to 9000 during spring
  – experienced problems with ATLAS data transfers
  – ICMP "fragmentation needed" messages were suppressed
  – fixed on the main router
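The transfer problems are the classic path-MTU-discovery failure: a sender using jumbo frames relies on ICMP "fragmentation needed" replies from any 1500-byte hop, so suppressing those ICMP messages makes large transfers hang. The sizes involved, sketched:

```python
IP_HEADER = 20    # bytes, IPv4 without options
TCP_HEADER = 20   # bytes, without options

def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - TCP_HEADER

jumbo_mss = tcp_mss(9000)      # 8960 bytes
standard_mss = tcp_mss(1500)   # 1460 bytes

# A segment sized for MTU 9000 cannot cross an MTU-1500 hop with DF set;
# without the ICMP feedback the sender never learns to shrink it.
fits_standard_hop = jumbo_mss + IP_HEADER + TCP_HEADER <= 1500   # False
```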
-
Central router (Cisco 6509)
-
External connectivity
● Exclusive: 1 Gbps (to FZK) + 10 Gbps (CESNET)
● Shared: 10 Gbps (PASNET – GEANT)

[Graphs: traffic FZU -> FZK, FZK -> FZU, and on the PASNET link]

• Not enough for the ATLAS T2D limit (5 MB/s to/from T1s)
• perfSONAR installed:
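The 5 MB/s T2D figure translates directly into how many Tier-1s a link can nominally serve at once; a quick sketch (decimal Gbit-to-MB conversion assumed, protocol overhead and competing traffic ignored):

```python
T2D_RATE_MB_S = 5   # required MB/s to/from each Tier-1

def max_t1_peers(link_gbps: float) -> int:
    """Tier-1 peers servable at the T2D rate on one link (overhead ignored)."""
    link_mb_s = link_gbps * 1000 / 8    # Gbit/s -> MB/s, decimal units
    return int(link_mb_s // T2D_RATE_MB_S)

peers_fzk = max_t1_peers(1.0)       # 25 on the dedicated 1 Gbps FZK link
peers_cesnet = max_t1_peers(10.0)   # 250 on the 10 Gbps CESNET link
```

Nominal capacity is not the whole story: the T2D criterion is judged on measured per-T1 throughput under real load, which is what the perfSONAR boxes observe.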
-
External connectivity
-
LHCONE - LHC Open Network Environment
● New concept to connect a T2 to other T1s and T2s
● Tier-1s (11), Tier-2s (130), Tier-3s all over the world
● Initially hierarchical model: each T2 communicates with one T1
● T1s interconnected with the private redundant optical LHCOPN
● Change from the hierarchical to a flat model
[Diagram: hierarchical model vs. flat mesh of T1s and T2s]
-
LHCONE cont.
● LHCONE is complementary to the well-working LHCOPN
● LHCONE carries LHC data only
● Realization via L3 VPN using VRF
● Under construction
  – ESnet, Internet2, GEANT+NRENs, NORDUnet, USLHCNet, SURFnet, ASGC, CERN
● Evaluation and further improvements in 2013
● Our implementation and HW requirements are being discussed with CESNET
-
IPv6 testing
● We participate in the HEPiX IPv6 testbed (we focus on an IPv6-only setup)
● HW status (so far tested)
– switches have no problem with IPv6 (only 2 of them can be managed over IPv6)
– firewall upgrade was needed
– none of our servers' management interfaces support IPv6
– no SNMP-monitored facility supports IPv6 (air conditioning, thermometers, UPS, water-cooling unit)
– none of the disk arrays' management interfaces support IPv6
● DNS, DHCPv6 running fine
● NTP server runs fine (lack of stratum 1 NTP servers with IPv6 connectivity)
● Many problems with automatic installation (SL5 is simply not ready for IPv6)
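A first-pass probe for the per-service readiness survey above is simply asking the stack whether a host has any IPv6 endpoint; a minimal standard-library sketch (assumes an IPv6-enabled kernel):

```python
import socket

def ipv6_addresses(host: str, port: int = 443) -> list:
    """IPv6 addresses a host resolves to (empty list if it has none)."""
    try:
        infos = socket.getaddrinfo(host, port,
                                   family=socket.AF_INET6,
                                   type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})

# The IPv6 loopback is always present on a v6-enabled kernel:
loopback = ipv6_addresses("::1")   # ['::1']
```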
-
IPv6 testing cont.
● Running middleware needs regular CRL updates
  – we developed a tool to test CRL availability over IPv6
● IPv6 testing project was partially supported by CESNET, project number 416R1/2011.