
Network Processors: a generation of multi-core processors

INF5063: Programming Heterogeneous Multi-Core Processors

September 11, 2009

INF5063, Carsten Griwodz & Pål Halvorsen, University of Oslo


Agere Payload Plus APP550

[Block diagram: classifier memory and buffer, scheduler memory and buffer, stream editor memory, statistics memory, and the PCI bus; traffic enters from ingress and a co-processor and leaves towards egress and a co-processor.]

Pattern Processing Engine
- patterns specified by the programmer
- programmable using a special high-level language
- only pattern-matching instructions
- parallelism by hardware using multiple copies and several sets of variables
- access to different memories

State Engine
- gather information (statistics) for scheduling
- verify that flows stay within bounds
- provide an interface to the host
- configure and control other functional units

Packet (protocol data unit) assembler
- collect all blocks of a frame
- not programmable

Stream Editor (SED)
- two parallel engines
- modify outgoing packets (e.g., checksum, TTL, ...)
- configurable, but not programmable

Reorder Buffer Manager
- transfers data between classifier and traffic manager
- ensures packet order despite the parallelism and variable processing time in the pattern processing


PowerNP

[Block diagram: embedded processors with a hardware classifier and dispatch unit, a PowerPC core, instruction memory and control store, internal and external memory, ingress/egress queues and data stores, 4 interfaces in from / 4 out to the network, and 2 interfaces out to / 2 in from the host.]

Embedded PowerPC GPP (general-purpose core)
- no OS on the NP

Coprocessors
- 8 embedded processors
- 4 KB local memory each
- 2 cores/processor
- 2 threads/core

Link layer
- framing outside the processor


IXP1200 Architecture

RISC processor:
- StrongARM running Linux
- control, higher-layer protocols and exceptions
- 232 MHz

Microengines:
- low-level devices with a limited instruction set
- transfers between memory devices
- packet processing
- 232 MHz

Access units:
- coordinate access to external units

Scratchpad:
- on-chip memory
- used for IPC and synchronization


IXP2400 Architecture

[Block diagram: 8 microengines, an embedded RISC CPU (XScale), access units for SRAM, SDRAM, scratch memory, PCI, MSF and the slowport, external SRAM, DRAM, FLASH and a coprocessor, all connected by multiple independent internal buses (PCI bus, receive/transmit buses, DRAM bus, SRAM bus).]

RISC processor:
- StrongARM → XScale
- 233 MHz → 600 MHz

Microengines:
- 6 → 8
- 233 MHz → 600 MHz

Media Switch Fabric:
- forms the fast path for transfers
- interconnect for several IXP2xxx

Receive/transmit buses:
- shared bus → separate buses

Slowport:
- shared interface to external units
- used for FlashROM during bootstrap

Coprocessors:
- hash unit
- 4 timers
- general-purpose I/O pins
- external JTAG connections (in-circuit tests)
- several bulk ciphers (IXP2850 only)
- checksum (IXP2850 only)
- ...

Example: SpliceTCP

INF5063: Programming Heterogeneous Multi-Core Processors

TCP Splicing

[Message-sequence diagrams over several slides: a client on the Internet performs the TCP handshake (SYN, SYN-ACK, ACK), its HTTP GET is forwarded upstream, and DATA flows back to the client over the spliced connections.]

TCP Splicing

[Diagram: protocol stacks at the client, the splicer and the server. An application-level proxy accepts the client connection, connects to the server, and then loops over read and write at the application layer (a minimal user-space sketch of this loop follows after the comparison below); the splicer instead rewrites headers and forwards packets without going up to the application layer.]

Linux Netfilter
• Establish upstream connection
• Receive entire packet
• Rewrite headers
• Forward packet

IXP 2400
• Establish upstream connection
• Parse packet headers
• Rewrite headers
• Forward packet
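A minimal sketch, in ordinary sockets C, of the application-level loop from the diagram above (the listen port, upstream address and buffer size are made-up example values; this is not the SpliceNP code): every byte is read up into user space and written back down, which is the per-byte cost that splicing on the IXP avoids.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in laddr = { 0 };
    laddr.sin_family = AF_INET;
    laddr.sin_addr.s_addr = htonl(INADDR_ANY);
    laddr.sin_port = htons(8080);                       /* example listen port */
    bind(lsock, (struct sockaddr *)&laddr, sizeof(laddr));
    listen(lsock, 16);

    for (;;) {
        int csock = accept(lsock, NULL, NULL);          /* accept: client side */
        if (csock < 0)
            continue;

        int ssock = socket(AF_INET, SOCK_STREAM, 0);    /* connect: server side */
        struct sockaddr_in saddr = { 0 };
        saddr.sin_family = AF_INET;
        saddr.sin_port = htons(80);                     /* example upstream port */
        inet_pton(AF_INET, "192.0.2.10", &saddr.sin_addr);  /* example server */
        if (connect(ssock, (struct sockaddr *)&saddr, sizeof(saddr)) < 0) {
            close(ssock);
            close(csock);
            continue;
        }

        /* The while(1) { read; write } loop from the diagram: the request is
         * copied up to user space and down again, and so is every byte of the
         * response. (A real relay would multiplex both directions, e.g. with
         * poll(); one direction at a time keeps the sketch short.) */
        char buf[4096];
        ssize_t n = read(csock, buf, sizeof(buf));
        if (n > 0)
            write(ssock, buf, n);
        while ((n = read(ssock, buf, sizeof(buf))) > 0)
            write(csock, buf, n);

        close(ssock);
        close(csock);
    }
}
```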


Throughput vs Request File Size

[Graph: throughput (Mbps, 0-800) vs. request file size (1 KB to 1024 KB) for the Linux-based and the NP-based splicer.]

Graph from the presentation of the paper "SpliceNP: A TCP Splicer using a Network Processor", ANCS 2005, Princeton, NJ, Oct 27-28, 2005, by Li Zhao, Yan Luo, Laxmi Bhuyan (Univ. Calif. Riverside), Ravi Iyer (Intel)

Major performance gain at all request sizes

Example: Transparent protocol translation and load balancing in a media streaming scenario

slides from an ACM MM 2007 presentation by Espeland, Lunde, Stensland, Griwodz and Halvorsen

INF5063: Programming Heterogeneous Multi-Core Processors


Load Balancer

[Diagram: mplayer clients reach RTSP/RTP video servers across the network; an IXP 2400 sits in the path, with ingress and egress interfaces, a balancer, a monitor and an RTSP/RTP parser.]

Balancer:
1. identify the connection
2. if it exists, send to the right server (select port to use);
   else create a new session (select one server) and send the packet
   (a minimal lookup sketch follows below)

Monitor:
- historic and current loads of the different servers
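A minimal sketch of the balancer step above, with hypothetical structure and function names (this is not the actual IXP microengine code): look the connection up in a flow table, keep existing sessions on their server, and map new sessions to the least-loaded server reported by the monitor.

```c
#include <stdint.h>
#include <string.h>

#define FLOW_TABLE_SIZE 1024
#define NUM_SERVERS     2

struct flow_key   { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };
struct flow_entry { struct flow_key key; int server; int in_use; };

static struct flow_entry flow_table[FLOW_TABLE_SIZE];
static unsigned server_load[NUM_SERVERS];        /* updated by the monitor */

static unsigned hash_flow(const struct flow_key *k)
{
    return (k->src_ip ^ k->dst_ip ^ k->src_port ^ ((uint32_t)k->dst_port << 16))
           % FLOW_TABLE_SIZE;
}

/* Returns the index of the server this packet should be forwarded to. */
int balance(const struct flow_key *k)
{
    struct flow_entry *e = &flow_table[hash_flow(k)];

    /* 1. identify connection; 2a. existing session: keep its server */
    if (e->in_use && memcmp(&e->key, k, sizeof(*k)) == 0)
        return e->server;

    /* 2b. new session: pick the least-loaded server and remember the mapping
     * (hash collisions simply overwrite the slot in this sketch)            */
    int best = 0;
    for (int s = 1; s < NUM_SERVERS; s++)
        if (server_load[s] < server_load[best])
            best = s;

    e->key    = *k;
    e->server = best;
    e->in_use = 1;
    return best;
}
```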


Transport Protocol Translator

[Diagram: the same setup, but the clients now request the stream over HTTP while the servers still speak RTSP/RTP; the IXP 2400 runs the balancer, monitor and RTSP/RTP parser.]

HTTP streaming is frequently used today!


Transport Protocol Translator

[Diagram: a protocol translator is added next to the RTSP/RTP parser on the IXP 2400; the servers deliver RTSP/RTP over RTP/UDP, and the translator converts the stream to HTTP towards the clients.]


Results

The prototype works and both load balances and translates between HTTP/TCP and RTP/UDP

The protocol translation gives a much more stable bandwidth than using HTTP/TCP all the way from the server

[Graph: measured bandwidth with protocol translation vs. HTTP/TCP from the server.]

Example: Booster Boxes

slide content and structure mainly from the NetGames 2002 presentation by Bauer, Rooney and Scotton

INF5063: Programming Heterogeneous Multi-Core Processors


Client-Server

[Diagram: clients in several local distribution networks communicate across the backbone network with a central server.]


Peer-to-peer

[Diagram: the same local distribution networks and backbone, but the hosts exchange traffic directly with each other.]


Booster boxes

Middleboxes

− Attached directly to ISPs’ access routers

− Less generic than, e.g. firewalls or NAT

Assist distributed event-driven applications

− Improve scalability of client-server and peer-to-peer applications

Application-specific code

− “Boosters”

− Caching on behalf of a server

− Aggregation of events

− Intelligent filtering

− Application-level routing


Booster boxes

[Diagram: booster boxes attached to the access routers of the local distribution networks, on the path between the end hosts and the backbone network.]

Booster boxes

Application-specific code

− Caching on behalf of a server
  • Non-real-time information is cached
  • Booster boxes answer on behalf of servers

− Aggregation of events
  • Information from two or more clients within a time window is aggregated into one packet (a small aggregation sketch follows below)

− Intelligent filtering
  • Outdated or redundant information is dropped

− Application-level routing
  • Packets are forwarded based on packet content, application state, and destination address
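A small sketch of the "aggregation of events" booster, with hypothetical structures and an example 50 ms window (not taken from the booster-box implementation): updates from several clients that arrive within one time window are flushed upstream as a single packet.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_MS  50     /* example aggregation window */
#define MAX_EVENTS 32

struct event { uint32_t client_id; int32_t x, y; };   /* e.g., a position update */

struct aggregator {
    struct event buf[MAX_EVENTS];
    size_t count;
    uint64_t window_start_ms;
};

/* Stand-in for "send one packet carrying all aggregated events upstream". */
static void send_aggregate(const struct event *ev, size_t count)
{
    printf("forwarding %zu events in one packet\n", count);
    (void)ev;
}

void on_event(struct aggregator *agg, const struct event *ev, uint64_t now_ms)
{
    /* Flush when the time window expires or the buffer is full. */
    if (agg->count > 0 &&
        (now_ms - agg->window_start_ms >= WINDOW_MS || agg->count == MAX_EVENTS)) {
        send_aggregate(agg->buf, agg->count);
        agg->count = 0;
    }
    if (agg->count == 0)
        agg->window_start_ms = now_ms;
    agg->buf[agg->count++] = *ev;
}
```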


Architecture

Data Layer

− behaves like a layer-2 switch for the bulk of the traffic

− copies or diverts selected traffic

− IBM’s booster boxes use the packet capture library (“pcap”) filter specification to select traffic
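Since libpcap is a publicly documented library, a tiny example of such a filter specification can be sketched; the device name and the filter string ("udp dst port 6000") are made-up example values, but pcap_open_live / pcap_compile / pcap_setfilter are the standard calls for selecting traffic this way.

```c
#include <pcap.h>
#include <stdio.h>

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *handle = pcap_open_live("eth0", 65535, 1, 100, errbuf);
    if (!handle) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    /* Divert only the application's traffic, e.g. UDP to port 6000. */
    struct bpf_program prog;
    if (pcap_compile(handle, &prog, "udp dst port 6000", 1, PCAP_NETMASK_UNKNOWN) == -1 ||
        pcap_setfilter(handle, &prog) == -1) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(handle));
        return 1;
    }

    /* Matching packets would now be handed to the booster code via pcap_loop(). */
    pcap_close(handle);
    return 0;
}
```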


Data Aggregation Example: Floating Car Data

Main booster task: complex message aggregation
− statistical computations
− context information

Very low real-time requirements

[Diagram: cars transmit position, speed, driven distance, ...; boosters perform statistics gathering, compression, filtering, ...; the aggregated data feeds traffic monitoring/predictions, pay-as-you-drive insurance, car maintenance, car taxes, ...]


Interactive TV Game Show

Main booster task: simple message aggregation

Limited real-time requirements

1. packet generation
2. packet interception
3. packet aggregation
4. packet forwarding


Game with large virtual space

Main booster task: dynamic server selection
− based on current in-game location
− requires application-specific processing

[Diagram: the virtual space is split between server 1 and server 2, each region handled by its own server.]

High real-time requirements


Summary

Scalability

− by application-specific knowledge

− by network awareness

Main mechanisms

− Caching on behalf of a server

− Aggregation of events

− Attenuation

− Intelligent filtering

− Application-level routing

Application of mechanism depends on

− Workload

− Real-time requirements

Multimedia Examples

INF5063: Programming Heterogeneous Multi-Core Processors


Multicast Video-Quality Adjustment

[Introductory diagrams over several slides, including a host architecture with CPU, memory, memory hub and I/O hub.]


Multicast Video-Quality Adjustment

Several ways to do video-quality adjustment:
− frame dropping
− re-quantization
− scalable video codec
− ...

Yamada et al. 2002: use a low-pass filter to eliminate high-frequency components of the MPEG-2 video signal and thus reduce the data rate
− determine a low-pass parameter for each GOP
− use the low-pass parameter to calculate how many DCT coefficients to remove from each macroblock in a picture
− by eliminating the specified number of DCT coefficients, the video data rate is reduced (a minimal sketch of this step follows below)
− implemented the low-pass filter on an IXP1200
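A minimal sketch of the coefficient-elimination step (my illustration, not the IXP1200 code): zero out everything beyond the first "keep" coefficients of an 8×8 DCT block in zig-zag order, so only the low-frequency components survive.

```c
#include <stdint.h>

/* Standard zig-zag scan order for an 8x8 DCT block (low frequencies first). */
static const uint8_t zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Keep the first 'keep' coefficients in zig-zag order, drop the rest;
 * 'keep' would come from the GOP's low-pass parameter. */
void lowpass_block(int16_t block[64], int keep)
{
    for (int i = keep; i < 64; i++)
        block[zigzag[i]] = 0;
}
```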


Multicast Video-Quality Adjustment

Low-pass filter on IXP1200

− parallel execution on 200MHz StrongARM and microengines

− 24 MB DRAM devoted to StrongARM only

− 8 MB DRAM and 8 MB SRAM shared

− test-filtering program on a regular PC determined work-distribution

• 75% of data from the block layer

• 56% of the processing overhead is due to DCT

Five-step algorithm:

1. StrongARM receives a packet and copies it to the shared memory area
2. StrongARM processes headers and generates macroblocks (in shared memory)
3. microengines read data and information from shared memory and perform quality adjustment on each block
4. StrongARM checks whether the last macroblock has been processed (if not, go to 2)
5. StrongARM rebuilds the packet


Multicast Video-Quality Adjustment

Segmentation of MPEG-2 data

− slice = 16-pixel-high stripe
− macroblock = 16 × 16 pixel square
  • four 8 × 8 luminance blocks
  • two 8 × 8 chrominance blocks
− DCT-transformed, with coefficients sorted in ascending (frequency) order

Data packetization for video filtering
− 720 × 576 pixel frames at 30 fps
  → 36 slices of 45 macroblocks per frame
− each slice = one packet
− 8 Mbps stream → roughly 7 kbit per packet (see the calculation below)
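A quick check of that last figure (the arithmetic is mine, not spelled out on the slide): each frame is split into 36 packets and there are 30 frames per second, so

    8 Mbit/s ÷ (36 packets/frame × 30 frames/s) = 8,000,000 ÷ 1,080 ≈ 7.4 kbit ≈ 0.9 kB per packet

which matches the ~7 kbit per packet figure above.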



Multicast Video-Quality Adjustment

Evaluation – three scenarios tested:
− StrongARM only: 550 kbps
− StrongARM + 1 microengine: 350 kbps
− StrongARM + all microengines: 1350 kbps
− the achieved real-time transcoding rates are not enough for practical purposes, but the distribution of workload is nice

Parallelism, Pipelining & Workload Partitioning

INF5063: Programming Heterogeneous Multi-Core Processors


Divide and …

Divide a problem into parts – but how?

Pipelining: [diagram: the task is split into stages that packets traverse in series]

Parallelism: [diagram: identical processing units handle different packets side by side]

Hybrid: [diagram: several pipelines in parallel]


Key Considerations

System topology
− processor capacities: different processors have different capabilities
− memory attachments:
  • different memory types have different rates and access times
  • different memory banks have different access times
− interconnections: different interconnects/buses have different capabilities

Requirements of the workload?
− dependencies

Parameters?
− width of the pipeline (level of parallelism)
− depth of the pipeline (number of stages)
− number of jobs sharing buses


Network Processor Example

Pipelining vs. Multiprocessor by Ning Weng & Tilman Wolf

− network processor example

− pipelining, parallelism and hybrid configurations are all possible

− packet processing scenario

− what is the performance of the different schemes taking into account…?

• … processing dependencies

• … processing demands

• … contention on memory interfaces

• … pipelining and parallelism effects (experimenting with the width and the depth of the pipeline)


Simulations

Several application examples in the paper giving different DAGs, e.g.,…

− ... flow classification: classify flows according to IP addresses and transport protocols

Measuring system throughput while varying all the parameters

− # processors in parallel (width)

− # stages in the pipeline (depth)

− # memory interfaces (busses) between each stage in the pipeline

− memory access times


Results

[Throughput graphs, with # memory interfaces per stage M = 1 and memory service time S = 10.]

Throughput increases with the pipeline depth D
− good scalability – proportional to the number of processors

Throughput increases with the width W initially, but tails off for large W
− poor scalability due to contention on the memory channel

Efficiency per processing engine…?
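A back-of-envelope reading of these trends (my sketch, not the analytic model from the paper): if the total per-packet processing time T is divided over D pipeline stages and W such pipelines run side by side, the ideal throughput is roughly

    throughput ≈ (W × D) / T

Each additional stage brings its own memory interface (M = 1 per stage), so growing D also grows the available memory bandwidth and scales well; growing W makes more engines share the same per-stage interface, so once roughly two processing elements contend for one interface (see the next slide) the memory service time S becomes the bottleneck and extra width no longer helps.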


Lessons learned…

Memory contention can become a severe system bottleneck

− the memory interface saturates with about two processing elements per interface

− off-chip memory accesses cause a significant reduction in throughput and a drastic increase in queuing delay

− performance increases with more memory channels and with lower access times

Most NP applications are of a sequential nature, which leads to highly pipelined NP topologies

Processing tasks must be balanced to avoid slow pipeline stages

Communication and synchronization are the main contributors to the pipeline stage time, next to the memory access delay

“Topology” has a significant impact on performance


Some References

1. Tatsuya Yamada, Naoki Wakamiya, Masayuki Murata, Hideo Miyahara, "Implementation and Evaluation of Video-Quality Adjustment for Heterogeneous Video Multicast", 8th Asia-Pacific Conference on Communications, Bandung, September 2002, pp. 454-457

2. Daniel Bauer, Sean Rooney, Paolo Scotton, “Network Infrastructure for Massively Distributed Games”, NetGames, Braunschweig, Germany, April 2002

3. J.R. Allen, Jr., et al., “IBM PowerNP network processor: hardware, software, and applications”, IBM Journal of Research and Development, 47(2/3), pp. 177-193, March/May 2003

4. Ning Weng, Tilman Wolf, “Profiling and mapping of parallel workloads on network processors”, ACM Symposium of Applied Computing (SAC 2005), pp. 890-896

5. Ning Weng, Tilman Wolf, “Analytic modeling of network processors for parallel workload mapping”, ACM Trans. on Embedded Computing Systems, 8(3), 2009

6. Li Zhao, Yan Luo, Laxmi Bhuyan, Ravi Iyer, "SpliceNP: A TCP Splicer using a Network Processor", ANCS 2005

7. Håvard Espeland, Carl Henrik Lunde, Håkon Stensland, Carsten Griwodz, Pål Halvorsen, ”Transparent Protocol Translation for Streaming”, ACM Multimedia 2007


Summary

TODO