IBM Network Processor, Development Environment and LHCb Software
LHCb Readout Unit Internal Review, July 24th, 2001
Niko Neufeld, CERN EP
Outline
• IBM NP4GS3 – Architecture
• A Readout Unit based on the NP4GS3
• Dataflow in the NP-based RU
• Sub-event building algorithms
• Software Development Environment
• Performance of sub-event building
• Calibration of simulation results with measurements on Reference Kit hardware
IBM NP4GS3
• 4 full-duplex Gigabit Ethernet MACs
• 16 processors / 2 hardware threads @ 133 MHz → 2128 MIPS, to handle up to 4.1×10⁶ packets/s
• 128 kB on-chip input buffer, up to 128 MB DDR RAM output buffer
• 2 x switch interface (“DASL”) @ 4 Gb/s
• Embedded PPC 405
• In production since the beginning of this year (R1.1)
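A quick sanity check of these headline figures. This sketch assumes the quoted 2128 MIPS means one instruction per cycle on each of the 16 processors, and spreads the aggregate evenly over the quoted packet rate:

```python
# Back-of-envelope check of the NP4GS3 figures quoted above.
# Assumption: "MIPS" counts one instruction per cycle per processor.
CORES = 16
CLOCK_MHZ = 133
aggregate_mips = CORES * CLOCK_MHZ           # aggregate instruction rate
packet_rate = 4.1e6                          # packets/s quoted on the slide
budget = aggregate_mips * 1e6 / packet_rate  # instructions available per packet
print(aggregate_mips)   # 2128
print(round(budget))    # 519
```

So at the maximum packet rate, each packet has a budget of roughly 500 instructions across all threads.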
Readout Unit based on NP4GS3
• 1 or 2 mezzanine cards, each containing
– 1 Network Processor
– All memory needed for the NP
– Connections to the external world: PCI bus, DASL (switch bus), connections to the physical network layer, JTAG, power and clock
• PHY connectors
• L0 Trigger-Throttle output
• Power and clock generation
• LHCb standard ECS interface (CC-PC) with separate Ethernet connection

Board Block Diagram
Data flow in the NP4GS3
[Diagram: ingress event building and egress event building stages, each with access to frame data, connected via the DASL interfaces; a wrap path is used for a single NP only]
Sub-event merging software
• Sub-event building is the main task of the software running on the NP4GS3 when it is used in a Readout Unit
• The two locations where frames can be manipulated (ingress and egress) suggest two different algorithms, with different advantages and different fields of application
• Ingress event building: for high frequencies up to 1.5 MHz and small frames (~64 bytes)
• Egress event building: for large frames (up to 9000 bytes), or fragments spanning multiple frames
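The merging principle shared by both algorithms can be written out as a minimal Python sketch (the real implementation is NP4GS3 picocode; the class and field names here are our own): fragments arriving from N sources are grouped by event number and concatenated into one output frame once every source has delivered its fragment.

```python
# Illustrative sketch of sub-event building (not the NP4GS3 picocode).
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Fragment:
    event_id: int
    source: int
    payload: bytes

class SubEventBuilder:
    def __init__(self, n_sources: int):
        self.n_sources = n_sources
        # event_id -> {source -> payload} for incomplete events
        self.pending: Dict[int, Dict[int, bytes]] = {}

    def add(self, frag: Fragment) -> Optional[bytes]:
        """Store a fragment; return the merged sub-event once it is complete."""
        parts = self.pending.setdefault(frag.event_id, {})
        parts[frag.source] = frag.payload
        if len(parts) == self.n_sources:
            del self.pending[frag.event_id]
            # Concatenate in fixed source order, as the RU must
            return b"".join(parts[s] for s in range(self.n_sources))
        return None

builder = SubEventBuilder(n_sources=4)
for src in range(4):
    merged = builder.add(Fragment(event_id=7, source=src, payload=bytes([src])))
print(merged)  # b'\x00\x01\x02\x03' once the fourth fragment has arrived
```

The ingress/egress distinction is then about where the `pending` buffers live (128 kB on-chip versus external DDR RAM) and how large a frame each side can hold.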
Ingress Event Building
☹ Only 128 kB of memory
☹ Multiple frames are more difficult, because they have to be sent in order over the DASL
☺ On-chip memory (no wait cycles!)
☺ Memory buffers organized via descriptors
☺ Code is more streamlined, because copying is done on linear memory
Egress Event Building
☺ 64 MB of external buffer space (access to memory over a 128-bit-wide bus); two copies of 64 MB each, for increased throughput
☺ Only contested resource is the memory bus
☺ Multiple frames are handled easily
☺ For larger fragments only part of the data (ideally 1/8) needs to be read
☹ Memory is external (wait cycles!)
☹ Buffer data structure is awkward (chunks of 2 x 58 bytes)
Software Development Environment
• The development software consists of an assembler, debugger, simulator and profiler; the debugger can either run on the simulator or connect to real hardware via an Ethernet-to-JTAG interface (“RISC Watch”) attached to the NP4GS3
• Rich set of documentation
• Many examples, such as complete routing software, are available
User interface of Simulator and Remote Debugger
[Annotated screenshot showing: thread control; coprocessor registers; code view with breakpoint and single stepping; view of memory buffers; choice of simulator or real-hardware debugging, as well as processor version]
[Plot: Maximum Allowed Fragment Rate [kHz] (0–350) versus Average Fragment Size [Bytes] (0–600), showing the limit imposed by NP processing power, the limit imposed by a single Gb output link, and the range of possible Level-1 trigger rates]
Performance for 4:1 Egress Event-Building
• Only 16 out of 32 threads were simulated, hence the NP processing-power limit is highly pessimistic; it should increase by at least 50% with 32 threads
• At any frame size the performance of the RU is limited only by the output link speed of 125 MB/s
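The output-link limit follows from a one-line calculation. This is a sketch assuming the merged output event is simply N times the average input fragment size, with transport headers neglected:

```python
# Output-link limit for N:1 event building: the merged event of N fragments
# must fit through one Gigabit Ethernet link at 125 MB/s.
# Assumption: headers neglected, event size = N x average fragment size.
LINK_BYTES_PER_S = 125e6

def max_fragment_rate_hz(avg_fragment_size: float, n_inputs: int = 4) -> float:
    event_size = n_inputs * avg_fragment_size
    return LINK_BYTES_PER_S / event_size

print(max_fragment_rate_hz(100) / 1e3)  # 312.5 (kHz)
print(max_fragment_rate_hz(500) / 1e3)  # 62.5 (kHz)
```

Both values lie comfortably above the range of possible Level-1 trigger rates shown on the plot.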
Performance for 3:1 Ingress Event-Building
• Only 16 out of 32 threads simulated
• Overall performance limited only by the speed of the single output link
[Plot: Maximum Acceptable L0 Trigger Rate [MHz] (0–2) versus Average Input Fragment Payload [Bytes] (0–120), showing the limit imposed by NP performance, the limit imposed by a single Gb output link, and the maximum Level-0 trigger rate]
• NP performance sufficient for any fragment size
Detailed Profiler Information
[Charts for thread 10: coprocessor stall cycles per coprocessor (CLP, DS, TSE0, TSE1, CAB, ENQ, CHK, STR, POL, CNTR, SEM); cycles (7.5 ns) per dispatch, broken down into GPR, bus, instruction, data, coprocessor and execution cycles; and total stall cycles by category]
Details of stall (“enforced NOP”) cycles: they are assigned to the coprocessor which controls access to the resource in question (and which is causing the stall in that case).
Reference Kit Hardware
Our reference kit has 4 x 2-port Gigabit Ethernet I/O cards.
Test Set-up
[Diagram: IBM NP4GS3 R1.1 Reference Kit (embedded PPC 405, 4 x 1000SX full-duplex ports) connected to 4 x Tigon2 NIC 1000SX; RISC Watch = JTAG via Ethernet]

Measurement Procedure
• Download code into the NP4GS3 via RISC Watch (JTAG)
• Send a special frame to the NP4GS3 to trigger a synchronization frame being sent to the Tigons
• The Tigons start sending fragments; they add their internal time-stamp to each frame (1 µs resolution)
• The NP4GS3 processes the fragments and adds its internal time-stamp (1 ms resolution)
• The Tigons receive the frames and calculate the elapsed time

Tigon2 NIC Features
• Gigabit Ethernet network interface card (optical/copper)
• Fully programmable
• Baseline for final-stage event building (“Smart NICs”)
• Up to 620 kHz fragment rate
• 1 µs resolution timer
Calibrating the simulation by measurements
• Event building with a single thread enabled and 4 sources
• Event building with a single source and multiple threads (16) enabled
• With release 1.1 of the NP4GS3 we unfortunately cannot easily run multiple sources on multiple threads (no semaphore coprocessor)
Results from Measurements (Ingress Event-Building)
• Round-trip time (Tigon-out to Tigon-in): 1.7 µs per fragment (frame of 60 bytes)
• Simulation says that 600 ns/fragment are used in the NP4GS3; this is not really measurable with our setup

Configuration          | Measurement [µs/fragment] | Simulation [µs/fragment]
1 source, 1 thread     | 6.6 (4.9)                 | 4.9
4 sources, 1 thread    | 4.5 (2.8)                 | 3.2
1 source, 16 threads   | 1.7 (0.0)                 | 0.5

(Bracketed values: measurement minus the 1.7 µs intrinsic Tigon overhead.)

NOTE: Since the system is pipelined, times smaller than the intrinsic overhead of the Tigon (i.e. 1.7 µs) cannot be measured accurately!
→ We can trust the timing results obtained with the simulator.
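The bracketed values can be reproduced by subtracting the Tigon overhead from the raw measurements; a small consistency check (the configuration labels are our own shorthand):

```python
# Reproduce the overhead-corrected measurements: per-fragment time =
# measured round-trip minus the 1.7 us intrinsic Tigon overhead.
TIGON_OVERHEAD_US = 1.7
measured_us = {
    "1 source / 1 thread": 6.6,
    "4 sources / 1 thread": 4.5,
    "1 source / 16 threads": 1.7,
}
corrected_us = {cfg: round(t - TIGON_OVERHEAD_US, 1) for cfg, t in measured_us.items()}
print(corrected_us)
# {'1 source / 1 thread': 4.9, '4 sources / 1 thread': 2.8, '1 source / 16 threads': 0.0}
```

Each corrected value matches the bracketed number, and the 16-thread point lands exactly at the measurement floor, which is why only the simulator can resolve it.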
Conclusion
• The software development environment makes the full power of this complex chip available to the software developer
• Two sub-event merging codes, one for large fragments at rates of a few 100 kHz and one for small fragments at rates of 1 MHz, have been developed and benchmarked using simulation
• The results for the more demanding of the two (“ingress”) have been verified using the IBM NP4GS3 reference kit hardware
• The simulation results show that the performance requirements on a readout unit are fully met by an NP4GS3-based implementation
Future Work
• With the upgrade of the NP4GS3 reference kit to version 2.0 of the processor, verify (again) the code for egress and ingress event building
• With more Tigon NICs and faster fragment-generation code, also test e.g. 7-to-1 multiplexing
• Develop and test a layer-2 switching application (much simpler than event building due to static routing tables)
• Develop code for the communication between the Linux-operated embedded PPC 405 and the CC-PC
NP4GS3 Architecture
[Block diagram: 128 kB on-chip memory; up to 64 MB of external memory over a 128-bit-wide bus; DASL at 4 Gb/s; accessible from PCI]
EPC Block-Diagram
[Block diagram of the IBM NP4GS3 Embedded Processor Complex (EPC)]
Performance for 4:1 Egress Event-Building
[Plot: “Performance Dependence on Fragment Size”: fragment rate [MHz], normalized to 16 threads, and output bandwidth [MB/s] (0–800) versus the number of active threads (0–20), for average payloads of 100, 200 and 500 bytes; the maximum wire speed at the output is marked]
Throughput as a function of the number of active threads and the payload. The throughput is normalized to that achieved with 16 threads for comparability. Contention for resources such as locks and memory prevents linear scaling. Limit of Gigabit Ethernet: 125 MB/s.
Performance for 4:1 Ingress Event-Building
Fragment rate and output bandwidth as a function of active threads for various amounts of payload. (A 36 byte transport header is always included)
[Plot: “Performance Dependence on Fragment Size”: fragment rate [MHz], normalized to 16 threads, and output bandwidth [MB/s] (0–450) versus the number of active threads (0–20), for average payloads of 1, 30 and 100 bytes; the maximum wire speed at the output is marked]
For all practical working points beyond wire-speed (64 bytes = 1.9 MHz)