IBM Network Processor, Development Environment and LHCb Software
LHCb Readout Unit Internal Review, July 24th, 2001
Niko Neufeld, CERN EP
Outline
• IBM NP4GS3 – Architecture
• A Readout Unit based on the NP4GS3
• Dataflow in the NP-based RU
• Sub-event building algorithms
• Software Development Environment
• Performance of sub-event building
• Calibration of simulation results with measurements on Reference Kit hardware
IBM NP4GS3
• 4 full-duplex Gigabit Ethernet MACs
• 16 processors / 2 hardware threads @ 133 MHz → 2128 MIPS, to handle up to 4.1×10⁶ packets/s
• 128 kB on-chip input buffer, up to 128 MB DDR RAM output buffer
• 2 x switch interface (“DASL”) @ 4 Gb/s
• Embedded PPC 405
• In production since the beginning of this year (R1.1)
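A quick sanity check of these headline figures. This sketch assumes the quoted 2128 MIPS means one instruction per cycle on each of the 16 processors, and spreads the aggregate evenly over the quoted packet rate:

```python
# Back-of-envelope check of the NP4GS3 figures quoted above.
# Assumption: "MIPS" counts one instruction per cycle per processor.
CORES = 16
CLOCK_MHZ = 133
aggregate_mips = CORES * CLOCK_MHZ           # aggregate instruction rate
packet_rate = 4.1e6                          # packets/s quoted on the slide
budget = aggregate_mips * 1e6 / packet_rate  # instructions available per packet
print(aggregate_mips)   # 2128
print(round(budget))    # 519
```

So at the maximum packet rate, each packet has a budget of roughly 500 instructions across all threads.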
Readout Unit based on NP4GS3
• 1 or 2 mezzanine cards, each containing
– 1 Network Processor
– All memory needed for the NP
– Connections to the external world: PCI bus, DASL (switch bus), connections to the physical network layer, JTAG, power and clock
• PHY connectors
• L0 Trigger-Throttle output
• Power and clock generation
• LHCb standard ECS interface (CC-PC) with separate Ethernet connection

Board Block Diagram
Data flow in the NP4GS3
[Diagram: ingress event building and egress event building stages, each with access to frame data, connected via the DASL interfaces; a wrap path is used for a single NP only]
Sub-event merging software
• Sub-event building is the main task of the software running on the NP4GS3 when it is used in a Readout Unit
• The two locations where frames can be manipulated (ingress and egress) suggest two different algorithms, with different advantages and different fields of application
• Ingress event building: for high frequencies up to 1.5 MHz and small frames (~64 bytes)
• Egress event building: for large frames (up to 9000 bytes), or fragments spanning multiple frames
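The merging principle shared by both algorithms can be written out as a minimal Python sketch (the real implementation is NP4GS3 picocode; the class and field names here are our own): fragments arriving from N sources are grouped by event number and concatenated into one output frame once every source has delivered its fragment.

```python
# Illustrative sketch of sub-event building (not the NP4GS3 picocode).
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Fragment:
    event_id: int
    source: int
    payload: bytes

class SubEventBuilder:
    def __init__(self, n_sources: int):
        self.n_sources = n_sources
        # event_id -> {source -> payload} for incomplete events
        self.pending: Dict[int, Dict[int, bytes]] = {}

    def add(self, frag: Fragment) -> Optional[bytes]:
        """Store a fragment; return the merged sub-event once it is complete."""
        parts = self.pending.setdefault(frag.event_id, {})
        parts[frag.source] = frag.payload
        if len(parts) == self.n_sources:
            del self.pending[frag.event_id]
            # Concatenate in fixed source order, as the RU must
            return b"".join(parts[s] for s in range(self.n_sources))
        return None

builder = SubEventBuilder(n_sources=4)
for src in range(4):
    merged = builder.add(Fragment(event_id=7, source=src, payload=bytes([src])))
print(merged)  # b'\x00\x01\x02\x03' once the fourth fragment has arrived
```

The ingress/egress distinction is then about where the `pending` buffers live (128 kB on-chip versus external DDR RAM) and how large a frame each side can hold.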
Ingress Event Building
☹ Only 128 kB of memory
☹ Multiple frames are more difficult, because they have to be sent in order over the DASL
☺ On-chip memory (no wait cycles!)
☺ Memory buffers organized via descriptors
☺ Code is more streamlined, because copying is done on linear memory
Egress Event Building
☺ 64 MB of external buffer space (access to memory over a 128-bit-wide bus); two copies of 64 MB each, for increased throughput
☺ Only contested resource is the memory bus
☺ Multiple frames are handled easily
☺ For larger fragments only part of the data (ideally 1/8) needs to be read
☹ Memory is external (wait cycles!)
☹ Buffer data structure is awkward (chunks of 2 x 58 bytes)
Software Development Environment
• The development software consists of an assembler, debugger, simulator and profiler; the debugger can either run on the simulator or connect to real hardware via an Ethernet-to-JTAG interface (“RISC Watch”) attached to the NP4GS3
• Rich set of documentation
• Many examples, such as complete routing software, are available
User interface of Simulator and Remote Debugger
[Annotated screenshot showing: thread control; coprocessor registers; code view with breakpoint and single stepping; view of memory buffers; choice of simulator or real-hardware debugging, as well as processor version]
[Plot: Maximum Allowed Fragment Rate [kHz] (0–350) versus Average Fragment Size [Bytes] (0–600), showing the limit imposed by NP processing power, the limit imposed by a single Gb output link, and the range of possible Level-1 trigger rates]
Performance for 4:1 Egress Event-Building
• Only 16 out of 32 threads were simulated, hence the NP processing-power limit is highly pessimistic; it should increase by at least 50% with 32 threads
• At any frame size the performance of the RU is limited only by the output link speed of 125 MB/s
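The output-link limit follows from a one-line calculation. This is a sketch assuming the merged output event is simply N times the average input fragment size, with transport headers neglected:

```python
# Output-link limit for N:1 event building: the merged event of N fragments
# must fit through one Gigabit Ethernet link at 125 MB/s.
# Assumption: headers neglected, event size = N x average fragment size.
LINK_BYTES_PER_S = 125e6

def max_fragment_rate_hz(avg_fragment_size: float, n_inputs: int = 4) -> float:
    event_size = n_inputs * avg_fragment_size
    return LINK_BYTES_PER_S / event_size

print(max_fragment_rate_hz(100) / 1e3)  # 312.5 (kHz)
print(max_fragment_rate_hz(500) / 1e3)  # 62.5 (kHz)
```

Both values lie comfortably above the range of possible Level-1 trigger rates shown on the plot.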
Performance for 3:1 Ingress Event-Building
• Only 16 out of 32 threads simulated
• Overall performance limited only by the speed of the single output link
[Plot: Maximum Acceptable L0 Trigger Rate [MHz] (0–2) versus Average Input Fragment Payload [Bytes] (0–120), showing the limit imposed by NP performance, the limit imposed by a single Gb output link, and the maximum Level-0 trigger rate]
• NP performance sufficient for any fragment size
Detailed Profiler Information
[Charts for thread 10: coprocessor stall cycles per coprocessor (CLP, DS, TSE0, TSE1, CAB, ENQ, CHK, STR, POL, CNTR, SEM); cycles (7.5 ns) per dispatch, broken down into GPR, bus, instruction, data, coprocessor and execution cycles; and total stall cycles by category]
Details of stall (“enforced NOP”) cycles: they are assigned to the coprocessor which controls access to the resource in question (and which is causing the stall in that case).
Reference Kit Hardware
Our reference kit has 4 x 2-port Gigabit Ethernet I/O cards.
Test Set-up
[Diagram: IBM NP4GS3 R1.1 Reference Kit (embedded PPC 405, 4 x 1000SX full-duplex ports) connected to 4 x Tigon2 NIC 1000SX; RISC Watch = JTAG via Ethernet]

Measurement Procedure
• Download code into the NP4GS3 via RISC Watch (JTAG)
• Send a special frame to the NP4GS3 to trigger a synchronization frame being sent to the Tigons
• The Tigons start sending fragments; they add their internal time-stamp to each frame (1 µs resolution)
• The NP4GS3 processes the fragments and adds its internal time-stamp (1 ms resolution)
• The Tigons receive the frames and calculate the elapsed time

Tigon2 NIC Features
• Gigabit Ethernet network interface card (optical/copper)
• Fully programmable
• Baseline for final-stage event building (“Smart NICs”)
• Up to 620 kHz fragment rate
• 1 µs resolution timer
Calibrating the simulation by measurements
• Event building with a single thread enabled and 4 sources
• Event building with a single source and multiple threads (16) enabled
• With release 1.1 of the NP4GS3 we unfortunately cannot easily run multiple sources on multiple threads (no semaphore coprocessor)
Results from Measurements (Ingress Event-Building)
• Round-trip time (Tigon-out to Tigon-in): 1.7 µs per fragment (frame of 60 bytes)
• Simulation says that 600 ns/fragment are used in the NP4GS3; this is not really measurable with our setup

Configuration          | Measurement [µs/fragment] | Simulation [µs/fragment]
1 source, 1 thread     | 6.6 (4.9)                 | 4.9
4 sources, 1 thread    | 4.5 (2.8)                 | 3.2
1 source, 16 threads   | 1.7 (0.0)                 | 0.5

(Bracketed values: measurement minus the 1.7 µs intrinsic Tigon overhead.)

NOTE: Since the system is pipelined, times smaller than the intrinsic overhead of the Tigon (i.e. 1.7 µs) cannot be measured accurately!
→ We can trust the timing results obtained with the simulator.
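The bracketed values can be reproduced by subtracting the Tigon overhead from the raw measurements; a small consistency check (the configuration labels are our own shorthand):

```python
# Reproduce the overhead-corrected measurements: per-fragment time =
# measured round-trip minus the 1.7 us intrinsic Tigon overhead.
TIGON_OVERHEAD_US = 1.7
measured_us = {
    "1 source / 1 thread": 6.6,
    "4 sources / 1 thread": 4.5,
    "1 source / 16 threads": 1.7,
}
corrected_us = {cfg: round(t - TIGON_OVERHEAD_US, 1) for cfg, t in measured_us.items()}
print(corrected_us)
# {'1 source / 1 thread': 4.9, '4 sources / 1 thread': 2.8, '1 source / 16 threads': 0.0}
```

Each corrected value matches the bracketed number, and the 16-thread point lands exactly at the measurement floor, which is why only the simulator can resolve it.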
Conclusion
• The software development environment makes the full power of this complex chip available to the software developer
• Two sub-event merging codes, one for large fragments at rates of a few 100 kHz and one for small fragments at rates of 1 MHz, have been developed and benchmarked using simulation
• The results for the more demanding of the two (“ingress”) have been verified using the IBM NP4GS3 reference kit hardware
• The simulation results show that the performance requirements on a readout unit are fully met by an NP4GS3-based implementation
Future Work
• With the upgrade of the NP4GS3 reference kit to version 2.0 of the processor, verify (again) the code for egress and ingress event building
• With more Tigon NICs and faster fragment-generation code, also test e.g. 7-to-1 multiplexing
• Develop and test a layer-2 switching application (much simpler than event building due to static routing tables)
• Develop code for the communication between the Linux-operated embedded PPC 405 and the CC-PC
NP4GS3 Architecture
[Block diagram: 128 kB on-chip memory; up to 64 MB of external memory over a 128-bit-wide bus; DASL at 4 Gb/s; accessible from PCI]
EPC Block-Diagram
[Block diagram of the IBM NP4GS3 Embedded Processor Complex (EPC)]
Performance for 4:1 Egress Event-Building
[Plot: “Performance Dependence on Fragment Size”: fragment rate [MHz], normalized to 16 threads, and output bandwidth [MB/s] (0–800) versus the number of active threads (0–20), for average payloads of 100, 200 and 500 bytes; the maximum wire speed at the output is marked]
Throughput as a function of the number of active threads and the payload. The throughput is normalized to that achieved with 16 threads for comparability. Contention for resources such as locks and memory prevents linear scaling. Limit of Gigabit Ethernet: 125 MB/s.
Performance for 4:1 Ingress Event-Building
Fragment rate and output bandwidth as a function of active threads for various amounts of payload. (A 36 byte transport header is always included)
[Plot: “Performance Dependence on Fragment Size”: fragment rate [MHz], normalized to 16 threads, and output bandwidth [MB/s] (0–450) versus the number of active threads (0–20), for average payloads of 1, 30 and 100 bytes; the maximum wire speed at the output is marked]
For all practical working points beyond wire-speed (64 bytes = 1.9 MHz)