RAL Network Upgrade
Alastair Dewhurst
Motivation
In the last few years the RAL Tier-1 network has struggled to deliver what was required of it.
Modest upgrade plans to make use of the planned 100Gb/s LHCOPN link were scheduled for 2020.
Covid-19 meant many plans were delayed or put on hold. This created opportunities for projects that could spend money.
We had access to a pot of money to build a “World Class Laboratory” (WCL): an additional £400k to upgrade the RAL Tier-1 network.
Also, a recent security breach at a different UK research council has led to greater scrutiny of the network design/separation.
Alastair Dewhurst, 23rd March 2021
RAL Campus Upgrades
Several planned upgrades to the campus network were also expedited as a result of the WCL funds.
The RAL Campus offsite connection is being upgraded from 100Gbps to 200Gbps.
  This is a resilient link (2 x 200Gbps active/passive).
  Physical connections done 17th - 18th March.
  Due to go into production March 31st.
New Fortinet 4201F Firewall.
  Partially moved into production (IPv6) in February 2021; however, intermittent packet loss with the Telephone network has meant IPv4 has not been migrated.
  Due for completion end of April.
  Access to some Tier-1 services currently passes through the firewall.
New 100Gbps capable core switches.
  Delivered in March 2021, in production in Q3 2021.
Tier-1 Network
The RAL Tier 1 currently connects to:
The world via Janet via the RAL Campus Network.
Tier 0s and other Tier 1s via the LHCOPN via a private router (OPNR).
Tier 2s via Janet via a private router (OPNR).
Currently the Tier 1 provides compute, storage and services over both IPv4 and IPv6 on a single L2 segment.
  3× IPv4 subnets (one for LHCOPN)
  2× IPv6 subnets (one for LHCOPN)
Routing to deal with this is a little arcane…
  3+1 physical routers
  ~7 virtual routers
  Nodes have ~16 IPv4 and ~8 IPv6 routing table entries
  More default route (gateway) options than subnets
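To illustrate why each node carries many routing table entries, here is a minimal sketch of longest-prefix-match route selection. The prefixes (taken from the documentation-only address ranges) and the next-hop labels are hypothetical placeholders, not the real RAL routing configuration.

```python
import ipaddress

# Hypothetical routing table for illustration only: the prefixes come from
# documentation address ranges and the next-hop labels are placeholders.
ROUTES = [
    ("0.0.0.0/0", "campus-gateway"),   # default route: everything else
    ("192.0.2.0/24", "OPNR"),          # stand-in for an LHCOPN peer prefix
    ("198.51.100.0/24", "OPNR"),       # stand-in for a second peer prefix
]


def next_hop(destination, routes=ROUTES):
    """Return the next hop of the most specific route that contains destination."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, hop in routes:
        net = ipaddress.ip_network(prefix)
        # Keep the matching route with the longest prefix length.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    if best is None:
        raise LookupError(f"no route to {destination}")
    return best[1]
```

With this table, `next_hop("192.0.2.10")` selects the more specific `OPNR` entry, while any other destination falls through to the default route. Every extra peer prefix or gateway adds another entry of this kind on every node.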
Tier-1 Network
[Diagram. Labels: RAL Campus & Janet; LHCOPN]
Tier-1 Subnets
3 x Tier 1 subnets:
OPN: 130.246.176.0/22
Services: 130.246.180.0/22
Compute: 130.246.216.0/21
Subnet design was done a decade before services like CMS AAA were thought of.
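The three subnets listed above can be sanity-checked with Python's standard `ipaddress` module. This is a minimal sketch: the function name `summarise` is illustrative, and only the three prefixes come from the slide.

```python
import ipaddress

# The three Tier-1 IPv4 subnets as listed on the slide.
TIER1_SUBNETS = {
    "OPN": ipaddress.ip_network("130.246.176.0/22"),
    "Services": ipaddress.ip_network("130.246.180.0/22"),
    "Compute": ipaddress.ip_network("130.246.216.0/21"),
}


def summarise(subnets):
    """Return address counts, any overlapping pairs, and the collapsed aggregates."""
    sizes = {name: net.num_addresses for name, net in subnets.items()}
    nets = list(subnets.values())
    overlaps = [
        (a, b)
        for i, a in enumerate(nets)
        for b in nets[i + 1:]
        if a.overlaps(b)
    ]
    # collapse_addresses merges adjacent networks into the shortest covering list.
    aggregates = list(ipaddress.collapse_addresses(nets))
    return sizes, overlaps, aggregates


sizes, overlaps, aggregates = summarise(TIER1_SUBNETS)
```

The OPN and Services ranges are adjacent, so they collapse into a single /21, while Compute remains a separate /21. None of the three overlap.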
[Diagram labels: 130.246.128.0/20; 130.246.192.0/20; 130.246.188.0/22; 130.246.208.0/21]
Designing a new Tier-1 Network
We decided to build a new network for the RAL Tier-1 with a Spine / Leaf topology. New CTA Tape system is on a separate network pod.
Wanted to provide a uniform experience for end users: Dual Stack everywhere.
All machines accessible to the outside world will be on the LHCOPN and LHCONE.
We chose a single vendor (Mellanox) with the Cumulus OS for all systems. We also required storage and CPU nodes to come with Mellanox NICs.
Hardware purchased in the last two years can be easily added to the new network. Other hardware will be reviewed on a case-by-case basis.
SCD Super Spine
In 2018 the RAL Scientific Computing Department (SCD) deployed a Super Spine.
  Designed to move data between big SCD projects (bypassing the site core).
3 Tier, Spine/leaf architecture following data centre best practices.
16 x Mellanox SN2700 switches in 4 blocks.
  32 x 100Gb/s each
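As a rough worked example of the figures above, the raw port capacity of the Super Spine can be computed directly. This sketch counts every port on every switch, so it is an upper bound on port bandwidth and not a usable bisection bandwidth. The function name is illustrative.

```python
def aggregate_port_capacity_gbps(switches, ports_per_switch, port_speed_gbps):
    """Raw sum of port speeds across all switches, in Gb/s."""
    return switches * ports_per_switch * port_speed_gbps


# Figures from the slide: 16 switches in 4 blocks, 32 x 100Gb/s ports each.
total_gbps = aggregate_port_capacity_gbps(16, 32, 100)
per_block_gbps = aggregate_port_capacity_gbps(16 // 4, 32, 100)
```

Here `total_gbps` is 51200 (51.2 Tb/s) and `per_block_gbps` is 12800. Much of that capacity is consumed by inter-switch links in a 3 tier design, so the bandwidth available to attached projects is lower.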
Tier-1 Network Design
[Diagram. Labels, in extraction order: CTA; 200Gb/s; Cloud; 200Gb/s; Super Spine; JASMIN 400Gb/s; Tier-1 Spine; 100Gb/s to CERN; 200Gb/s to JANET; 400Gb/s to Site Core; 400Gb/s; ~40PB storage, 42k CPU cores; Legacy Tier-1]
Joining the LHCONE
[Diagram. Label: Tier-1 Spine]
RAL Tier-2 has been connected to the LHCONE since September 2019
Before the Tier-1 joins we will need to switch T1E Router to peer with CERN.
LHCONE will work as a backup in the event the OPN link is cut.
Time Line
1) Build new Tier-1 Network.
   Vendors are currently doing all installation and cabling inside the data centre.
   Vast majority done this week, completion by April 14th.
   Dedicated contractor effort to configure setup. Target completion mid May (4 weeks work).
2) Switch peering from OPNR to T1E Router.
3) Announce 130.246.216.0/21 and 2001:630:58:1820::/64 to LHCOPN.
4) Announce 130.246.216.0/21 and 2001:630:58:1820::/64 to LHCONE.
5) Migrate older hardware to new network. Q3 2021
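Before announcing prefixes to LHCOPN or LHCONE it is worth checking that they are well formed. The sketch below uses the standard `ipaddress` module and assumes the IPv6 prefix is meant to be read as `2001:630:58:1820::/64`, since a /64 written with only four groups needs the trailing `::` to be a valid network. The helper names are illustrative.

```python
import ipaddress

# The dual-stack pair to announce; the IPv6 entry assumes the "::" form.
ANNOUNCED = ["130.246.216.0/21", "2001:630:58:1820::/64"]


def parse_prefixes(prefixes):
    """Parse prefixes strictly, raising ValueError on host bits or bad syntax."""
    return [ipaddress.ip_network(p, strict=True) for p in prefixes]


def is_valid_prefix(text):
    """Return True if text is a well-formed network prefix with no host bits set."""
    try:
        ipaddress.ip_network(text, strict=True)
        return True
    except ValueError:
        return False


networks = parse_prefixes(ANNOUNCED)
```

The IPv4 prefix parses as the /21 listed as the Compute subnet, and the IPv6 prefix parses as a /64. A four-group form without `::`, or an IPv4 prefix with host bits set such as `130.246.216.1/21`, is rejected.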
[Timeline annotation: Mid May - June]
Future plans
The new network should be in place this summer ready for other hardware/service to be migrated in the autumn.
The hardware shouldn’t change during Run 3.
  4 x 100Gb/s leaf links with 25Gb/s links to CPU/Storage nodes should be more than sufficient for Run 3.
Given current pricing trends, we would hope to upgrade to 2 x 100Gb/s LHCOPN links in the second half of Run 3.
  Cost halves every ~4 years.
In 2026 during LS3, when the warranty expires on current hardware, we would expect to replace 100Gb/s capable spine switches with 400Gb/s capable ones.
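The claim that 4 x 100Gb/s leaf links with 25Gb/s node links are sufficient can be explored with a simple oversubscription calculation. The number of nodes per leaf is not given in the slides, so it is a free parameter here and the values used are hypothetical.

```python
def oversubscription_ratio(nodes_per_leaf, node_link_gbps=25, uplinks=4, uplink_gbps=100):
    """Ratio of total node-facing bandwidth to total uplink bandwidth on one leaf."""
    downlink = nodes_per_leaf * node_link_gbps
    uplink = uplinks * uplink_gbps
    return downlink / uplink


def max_nonblocking_nodes(node_link_gbps=25, uplinks=4, uplink_gbps=100):
    """Largest node count per leaf at which uplinks can carry all nodes at line rate."""
    return (uplinks * uplink_gbps) // node_link_gbps


# Hypothetical node counts, chosen only to show the shape of the calculation.
ratio_16 = oversubscription_ratio(16)
ratio_32 = oversubscription_ratio(32)
limit = max_nonblocking_nodes()
```

With the stated link speeds, 16 nodes per leaf gives a ratio of 1.0 (non-blocking) and 32 nodes gives 2.0. Whether a given ratio is acceptable depends on how often nodes actually send at line rate off the leaf.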
SCD Super Spine
Alastair Dewhurst, 22nd September 2020
[Chart. Fitted curve: $y = 5115\,e^{-0.17x}$; annotation: LS3]
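The fitted exponential can be related to the "cost halves every ~4 years" statement in the future plans. This sketch reads the fit as y = 5115 e^(-0.17x) and assumes x is measured in years. The axis units did not survive extraction, so y is left unitless here.

```python
import math

SCALE = 5115.0      # fitted value at x = 0 (units not recoverable from the chart)
DECAY_RATE = 0.17   # per unit of x; x is assumed to be in years


def fitted_value(x):
    """Evaluate the fitted curve y = SCALE * exp(-DECAY_RATE * x)."""
    return SCALE * math.exp(-DECAY_RATE * x)


def halving_time(rate=DECAY_RATE):
    """Time for an exponential decay with the given rate to fall by half."""
    return math.log(2) / rate


t_half = halving_time()
```

Under that assumption the halving time is ln(2) / 0.17, roughly 4.1 years, which is consistent with the stated ~4 years.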