
COMP4211 Project Report

Network Processor Technical Report -- Architecture, Performance and Future Trends

Author: Jiening Jiang (2279326)

S1, 2005 CSE, UNSW

Table of Contents

1 Introduction
2 Architectures
2.1 The Challenge of Packet Processing
2.2 Characteristics of Packet Processing
2.3 The Architecture Techniques
2.4 Generic Packet Processing Architecture
2.5 Evaluating the Architectures
2.6 Pipelining Process Engines
2.7 Memory architecture
2.8 Memory bandwidth
2.9 On-chip Communication
3 Case studies
3.1 IXA2800
3.2 PowerNP NP4GX
4 Conclusion and future trends
5 References

1 Introduction

With the growing number of network users and the emergence of many bandwidth-hungry applications, Internet line speeds and bandwidth have increased tremendously and continue to grow. Edge routers now commonly connect at 10 Gbps, and 40 Gbps is coming. Today's routers must handle not only packet forwarding but also more complex tasks, such as sophisticated queuing, quality-of-service (QoS), and encryption / decryption, all of which require enormous processing power.

When the Internet was invented, routers were built on general-purpose processors and were much like ordinary computers of the day. As user numbers and line speeds increased, ASICs were adopted in routers. However, Internet protocols and applications change and evolve frequently, and developing a new ASIC is time-consuming and expensive. The dedicated network processor (NP) therefore emerged. It provides a robust, flexible, programmable solution for Internet routing, switching and higher-level applications. It targets the design trade-off between performance and flexibility, offering a good balance based on state-of-the-art architectures.

2 Architectures

Before we move to network processor architectures, let us study the challenges and characteristics of packet processing.

2.1 The Challenge of Packet Processing

The biggest challenge is line speed. Table 1 shows the packet arrival rates at different line speeds, assuming a packet size of 40 bytes, which is roughly the common packet size in multimedia data streams.

Line Speed            40-byte Packet Arrival Rate
2.5 Gbps (OC-48)      160 ns
10 Gbps (OC-192)      35 ns
40 Gbps (OC-768)      8 ns

Table 1: Inter-packet arrival rates [7]

The inter-packet arrival time roughly bounds the time a router can spend processing each packet if it is not to drop packets intentionally. In traditional architectures, a single memory access can take longer than the inter-packet arrival time.
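As a back-of-the-envelope check of these budgets, the short C sketch below computes the wire time of a 40-byte packet at each line rate. It is a simplified model that ignores framing and SONET overhead, which is why its results come out slightly below the figures in Table 1.

```c
#include <stdio.h>

/* Back-of-the-envelope inter-packet arrival time: the wire time of one
 * 40-byte packet at a given line rate. Ignores SONET framing overhead,
 * so the numbers land slightly below those quoted in Table 1. */
int main(void) {
    const double line_rates_gbps[] = { 2.5, 10.0, 40.0 };
    const char  *names[]           = { "OC-48", "OC-192", "OC-768" };
    const double packet_bits = 40 * 8;   /* 40-byte packet */

    for (int i = 0; i < 3; i++) {
        /* rate in Gbit/s equals bit/ns, so bits / rate gives ns directly */
        double ns = packet_bits / line_rates_gbps[i];
        printf("%-7s %5.1f Gbps -> %6.1f ns per 40-byte packet\n",
               names[i], line_rates_gbps[i], ns);
    }
    return 0;
}
```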

2.2 Characteristics of Packet Processing

A data stream is divided into a large number of small packets. Each packet carries some duplicated information, such as the IP header, and packets can be processed in parallel. This is the so-called Packet-Level Parallelism (PLP).

Some packets are time-critical but simple to process, such as those in multimedia data streams, while others are complex but not time-critical, such as routing table updates and network management messages. Different types of packets call for different processing strategies.

2.3 The Architecture Techniques

The challenges are so great that new architectures have had to be built. A variety of architectural techniques have been used to address them; they fall into three categories [5]:

· Application-specific logic

i. Extending the RISC instruction set

ii. Use of customized on-chip or off-chip hardware assists

· Advanced processor architectures

i. Multithreading

ii. Instruction-level parallelism

· Macroparallelism

i. Multiple processors

ii. Pipelined processors

Commercial NP products use almost all of these techniques. Most NPs have a so-called multi-core architecture: each core is a small-scale microprocessor that uses multithreading and ILP techniques, and some functions are implemented in dedicated logic units. See the case studies for details.

2.4 Generic Packet Processing Architecture

Figure 1 shows the generic packet processing architecture, which matches the characteristics of packet processing described above. Almost all commercial NPs are based on it.

Figure 1: Generic Packet Processing Architecture

PHY-layer processing converts the analogue signal into a digital signal with some type of frame format. Packet Processing performs all the necessary operations on network traffic at line speed; these are also known as "fast path" or "data path" operations. Host Processing handles functions such as device configuration and network management, which are slow and not time-critical.

In some papers these are referred to as the "data plane" and the "control plane". Finally, Switching handles the forwarding of traffic between the ingress and egress ports of the bus, backplane, or other switch fabric of the router.

The data plane processes time-critical packets, while the control plane handles less time-critical management and system configuration. Control plane operations are more complex and diverse than data plane operations, so a general-purpose processor can execute them, whereas data plane operations are executed by a number of dedicated processing engines.
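To make the split concrete, here is a minimal, hypothetical ingress-dispatch sketch in C. The header fields, function names and the choice of exception conditions are invented for illustration, not taken from any particular NP.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical packet header summary; fields invented for illustration. */
typedef struct { uint8_t proto; uint8_t ttl; bool options; } pkt_hdr;

/* Stubs standing in for the two planes. */
static void fast_path_forward(pkt_hdr *h) { printf("fast path: proto %u\n", (unsigned)h->proto); }
static void slow_path_enqueue(pkt_hdr *h) { printf("slow path: proto %u\n", (unsigned)h->proto); }

static void ingress_dispatch(pkt_hdr *h) {
    /* Exceptional packets (routing protocol traffic, expired TTL, IP
     * options) go to the control plane; everything else stays on the
     * time-critical fast path run by the processing engines. */
    if (h->proto == 89 /* OSPF */ || h->ttl <= 1 || h->options)
        slow_path_enqueue(h);
    else
        fast_path_forward(h);
}

int main(void) {
    pkt_hdr data = { 6, 64, false };   /* ordinary TCP packet      */
    pkt_hdr ospf = { 89, 1, false };   /* routing protocol packet  */
    ingress_dispatch(&data);
    ingress_dispatch(&ospf);
    return 0;
}
```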

2.5 Evaluating the Architectures

For the control plane, a general-purpose processor architecture is preferred. For the data plane there are many candidate architectures, so why have most NPs chosen a multi-core structure? Crowley et al. evaluated the packet-processing performance of four architectures (Superscalar, Fine-Grain Multithreaded, Chip-Multiprocessor and Simultaneous Multithreaded) [1]:

Superscalar (SS): a multiple-issue, out-of-order execution processor.

Fine-Grain Multithreaded (FGMT): extends the out-of-order, superscalar core by adding support for multiple hardware thread contexts.

Simultaneous Multithreaded (SMT): extends the FGMT architecture by allowing instructions to be fetched and issued from multiple threads within one cycle.

Chip-Multiprocessor (CMP): partitions chip resources rigidly into multiple processors, each of which can run a different thread.

Figure 2 shows the results of evaluating the four architectures [1]. They were run on benchmarks ranging from basic IPv4 forwarding to complex cryptographic workloads (MD5, 3DES), all at a 500 MHz clock rate and ignoring operating system overheads.

Figure 2: Performance results of all architectures [1]

With operating system overheads ignored, SMT and CMP achieved the highest performance. The study suggests that the two have roughly equivalent performance, two to four times greater than SS and FGMT, because both are well suited to exploiting the parallel nature of network workloads.

Many NP products choose an architecture close to CMP to exploit PLP for high performance.

2.6 Pipelining Process Engines

On a high-speed link the time budget for each individual packet is only a few nanoseconds, and a single processing engine (PE) cannot process a packet in so short a time. Pipelining PEs is therefore the solution to this performance requirement.

The PEs can be programmed in either a context pipeline or a functional pipeline mode [5].

In a context pipeline, the pipeline stages are mapped to different PEs. Each PE constitutes a context pipe stage, and cascading two or more context pipe stages constitutes a context pipeline, as Figure 3 shows.

Figure 3: Context pipeline of process engines [5]

· Advantages of a context pipeline:

· The entire PE program memory can be dedicated to a single function, which is good when a stage function needs a large program store.

· A context pipeline is also desirable when a pipe stage must maintain state (bit vectors or tables) to perform its work. The state can be held in local memory, eliminating the latency of accessing external memory.

· Disadvantages of a context pipeline:

· If the context is very large, passing it between pipeline stages takes longer, which hurts overall pipeline throughput.

· Since each pipeline stage must execute at the maximum packet arrival rate, it can be difficult to partition the application into balanced stages.

In a functional pipeline, the context remains with one PE while different functions are performed on the packet as time progresses. The PE's execution time is divided into n pipe stages, each performing a different function, so a single PE can constitute a functional pipeline. Figure 4 shows the functional pipeline model.

Figure 4: Functional pipeline of a processor engine [5]

· Advantages of a functional pipeline:

· The context remains locally within the PE

· It supports a longer execution period per packet

· Disadvantages of a functional pipeline:

· The entire PE program memory space must hold multiple functions

· Function control must be passed between stages, so such transfers should be minimized

· Mutual exclusion can be more difficult because multiple PEs access the same data structures

Both models have advantages and disadvantages, and some NPs can be programmed in either. The sketch below illustrates the difference between the two mappings.
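The contrast can be seen in a small C sketch with hypothetical stage functions (a real NP would run these on separate hardware PEs rather than in nested loops): a context pipeline fixes the function per PE and streams packets through it, while a functional pipeline fixes the packet per PE and applies the functions in sequence.

```c
/* Hypothetical sketch contrasting the two pipeline mappings. The stage
 * functions and packet context type are invented for illustration. */
#include <stdio.h>

typedef struct { unsigned id; unsigned state; } pkt_ctx;

static void parse(pkt_ctx *p)  { p->state |= 1; }  /* stage 0 */
static void lookup(pkt_ctx *p) { p->state |= 2; }  /* stage 1 */
static void modify(pkt_ctx *p) { p->state |= 4; }  /* stage 2 */

typedef void (*stage_fn)(pkt_ctx *);
static stage_fn stages[] = { parse, lookup, modify };
enum { NSTAGES = 3, NPKTS = 4 };

int main(void) {
    pkt_ctx pkts[NPKTS];
    for (unsigned i = 0; i < NPKTS; i++) pkts[i] = (pkt_ctx){ i, 0 };

    /* Context pipeline: each stage is mapped to its own PE; the packet
     * context moves from PE to PE. Modeled here as the outer loop. */
    for (int s = 0; s < NSTAGES; s++)        /* "PE s" runs one function */
        for (int i = 0; i < NPKTS; i++)      /* over every packet passing by */
            stages[s](&pkts[i]);

    /* Functional pipeline: the context stays on one PE, which applies
     * every function to its packet as time progresses. */
    for (int i = 0; i < NPKTS; i++)          /* one PE owns packet i */
        for (int s = 0; s < NSTAGES; s++)    /* and runs all stages on it */
            stages[s](&pkts[i]);

    printf("packet 0 state: %u\n", pkts[0].state);
    return 0;
}
```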

2.7 Memory architecture

NPs need massive numbers of memory operations to process packets. Operations such as pattern matching, enqueue / dequeue, and encryption / decryption require many memory reads and writes, so a good memory architecture is a key factor in system performance.

The main parts of the memory system are a large set of local registers, local memory and cache, high-speed SRAM, and high-bandwidth DRAM.

There are several ways to minimize access latency and maximize performance.

· Memory latency hiding

Modern computer architectures use multithreading to hide memory access latency: the processor switches to another thread while a slow memory access is outstanding.

· Memory co-processors

Certain complex, memory-intensive tasks such as table lookup and tree searching require a significant number of processor cycles. A memory co-processor receives a request from the main processor, carries out the necessary operations and returns the result, while the main processor continues with other work.

Some memory co-processors provide Content Addressable Memory (CAM) to accelerate search operations.

· Caching

Caching can significantly improve packet throughput, since it speeds up routing table lookup. One address-caching mechanism is the Host Address Cache (HAC), which works like an ordinary cache. Figure 5 shows the architecture [2].

Figure 5: Host address cache

The least significant k bits of the destination IP address are used as an index to select one of 2^k cache sets.
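Here is a minimal sketch of that indexing scheme, assuming a direct-mapped organization; names such as hac_entry and next_hop are invented, and the hash engine shown in Figure 5 is omitted.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical Host Address Cache lookup. Only the set-indexing scheme
 * follows the text: the low k bits of the destination IP address select
 * one of 2^k cache sets. */
#define K 10                          /* 2^10 = 1024 cache sets */
#define NUM_SETS (1u << K)

typedef struct {
    bool     valid;
    uint32_t tag;       /* remaining high bits of the destination IP */
    uint16_t next_hop;  /* cached lookup result */
} hac_entry;

static hac_entry hac[NUM_SETS];

/* Returns true on a cache hit and fills *next_hop. */
static bool hac_lookup(uint32_t dst_ip, uint16_t *next_hop) {
    uint32_t index = dst_ip & (NUM_SETS - 1);  /* least significant k bits */
    uint32_t tag   = dst_ip >> K;              /* the rest of the address  */
    if (hac[index].valid && hac[index].tag == tag) {
        *next_hop = hac[index].next_hop;
        return true;                 /* fast path: skip the full lookup */
    }
    return false;                    /* miss: do the full routing lookup */
}

/* Fill the entry once a full routing lookup has resolved the miss. */
static void hac_insert(uint32_t dst_ip, uint16_t next_hop) {
    uint32_t index = dst_ip & (NUM_SETS - 1);
    hac[index] = (hac_entry){ true, dst_ip >> K, next_hop };
}

int main(void) {
    uint16_t hop;
    uint32_t ip = 0xC0A80101u;                 /* 192.168.1.1 */
    if (!hac_lookup(ip, &hop))                 /* first access misses */
        hac_insert(ip, 7);                     /* cache the lookup result */
    if (hac_lookup(ip, &hop))                  /* second access hits */
        printf("next hop %u\n", hop);
    return 0;
}
```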

2.8 Memory bandwidth

As link bandwidth has increased dramatically, the memory bandwidth of an NP has become another key factor in system performance; here the concern is throughput.

Multithreading hides memory latency but does not by itself improve memory throughput.

There are three different models:

· Replicating the memory state for each processor, or sharing the state: this becomes very expensive when the problem size is large.

· Pipelined processors with distributed memories: it is very hard to statically partition the different data structures, e.g. different lookup databases.

· Pipelined wide-word memory, which delivers several words per access.

2.9 On-chip Communication

Traditional centralized CPU-memory architectures cannot keep up with high-speed packet processing; memory access schemes instead use distributed memory or other high-performance organizations, as mentioned above.

Ordinary bus-based on-chip communication does not meet the demands of high-speed links such as OC-768. An on-chip crossbar is an alternative, but it is expensive and scales poorly. Most new-generation NPs use high-speed buses and other mechanisms to meet the requirements: the IXA28XX uses Hyper Task Chaining, discussed in the case studies, while the Motorola C-5 and Agere PayloadPlus use high-bandwidth buses.

3 Case studies

There are dozens of NP products on the market, among them the Intel IXA, IBM PowerNP, Agere PayloadPlus, Cisco Toaster2 and Motorola C-5. Here I study only the IXA and the PowerNP.

3.1 IXA2800

· Features

· Second-generation network processor

· Programmable parallel processing architecture

· Solves complex problems at line speed

· XScale core plus sixteen independent 32-bit multithreaded Microengines, providing more than 25 giga-operations per second

· Hyper Task Chaining technique

· Hyper Task Chaining [8]

Hyper Task Chaining implements several significant innovations to ensure low-latency communication among processes:

· "Next Neighbor" registers enable individual Microengines to rapidly pass data and state information to adjacent Microengines.

· Reflector Mode pathways ensure that data and global event signals can be shared among multiple Microengines, using 32-bit unidirectional buses that connect the network processor's internal processing and memory resources.

· Ring Buffer registers provide a highly efficient mechanism for flexibly linking tasks among multiple software pipelines. Ring buffers allow developers to establish "producer-consumer" relationships among Microengines, efficiently propagating results along the pipeline in FIFO order (a software sketch follows this list).

· To minimize the latency of external memory references, these register structures are complemented by 16 entries of Content Addressable Memory (CAM) per Microengine. Configured as a distributed cache, the CAM enables multiple threads and Microengines to manipulate the same data simultaneously while maintaining data coherency.
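To illustrate the ring-buffer idea in software terms, here is a minimal single-producer / single-consumer ring in C. It is a generic model with invented names, not Intel's register-level interface, and it assumes a single thread of control; a real concurrent version would need atomics and memory ordering.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Generic single-producer / single-consumer ring, modeling the
 * "producer-consumer" linkage between pipeline stages. Names and sizes
 * are invented; the IXA2800 implements its rings in hardware. */
#define RING_SIZE 16                 /* power of two for cheap wrap-around */

typedef struct {
    uint32_t slots[RING_SIZE];
    uint32_t head;                   /* next slot the consumer reads  */
    uint32_t tail;                   /* next slot the producer writes */
} ring;

/* Producer stage: returns false if the ring is full. */
static bool ring_put(ring *r, uint32_t pkt_handle) {
    if (r->tail - r->head == RING_SIZE)
        return false;                          /* full: back-pressure */
    r->slots[r->tail & (RING_SIZE - 1)] = pkt_handle;
    r->tail++;                                 /* publish to consumer */
    return true;
}

/* Consumer stage: returns false if the ring is empty. */
static bool ring_get(ring *r, uint32_t *pkt_handle) {
    if (r->head == r->tail)
        return false;                          /* empty: nothing to do */
    *pkt_handle = r->slots[r->head & (RING_SIZE - 1)];
    r->head++;
    return true;
}

int main(void) {
    ring r = {0};
    uint32_t h;
    ring_put(&r, 42);                /* upstream stage enqueues a packet */
    if (ring_get(&r, &h))            /* downstream stage dequeues it, FIFO */
        printf("dequeued packet handle %u\n", h);
    return 0;
}
```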

· Architecture overview

Figure 6: IXA2800 network processor functional block diagram [9]

The major parts of the IXA2800 are the XScale core and the 16 Microengines (MEs).

The Intel XScale® core is a 32-bit general-purpose RISC processor incorporating an extensive list of architectural features that enable high performance. It is compatible with the ARM Version 5 (V5) architecture: it implements the ARM V5 integer instruction set but provides no hardware support for floating-point instructions.

The XScale core logically belongs to the control plane of the NP, handling slow, complex tasks and device configuration.

The Microengines do most of the programmable per-packet processing in the network processor. There are 16 Microengines, connected as shown in Figure 6. The Microengines can access all of the shared resources (SRAM, DRAM, MSF, etc.) and the private connections between adjacent Microengines.

The Microengines provide support for software-controlled multithreaded operation. Given the disparity between processor cycle times and external memory access times, a single thread of execution often blocks waiting for external memory operations to complete. Multiple threads allow such operations to be interleaved: there is usually at least one thread ready to run while others are waiting.

Microengine detail is shown in Figure 7.

The Microengines logically belong to the data plane of the NP, performing the time-critical packet processing operations.

Figure 7: Microengine Block Diagram

The Control Store is a RAM that holds the program that is executed by the Microengine. It holds 8192 instructions, each of which is 40 bits wide. It is initialized by the Intel XScale ® core.

There are eight hardware Contexts available in the Microengine. To allow for efficient context swapping, each Context has its own register set, Program Counter, and Context specific Local registers. Having a copy per Context eliminates the need to move Context specific information to/from shared memory and Microengine registers for each Context swap. Fast context swapping allows a Context to do computation while other Contexts wait for I/O (typically external memory accesses) to complete or for a signal from another Context or hardware unit. (A context swap is similar to a taken branch in timing.)

As shown in the block diagram in Figure 7, each Microengine contains four types of 32-bit datapath registers:

· 256 General Purpose registers

· 512 Transfer registers

· 128 Next Neighbor registers

· 640 32-bit words of Local Memory

Local Memory is addressable storage within the Microengine, read and written exclusively under program control. It supplies operands to the execution datapath as a source and receives results as a destination.

3.2 PowerNP NP4GX

The IBM NP4GX is a network processor for OC-48 / OC-192 line speeds. Figure 8 shows its architecture block diagram.

Figure 8: NP4GX architecture block diagram [10]

The major parts are:

· Protocol processor

The NP4GX has 16 multithreaded protocol processors, arranged as 8 dynamic protocol processor units (DPPUs).

· Coprocessor and hardware assists

Coprocessors and dedicated hardware support packet queuing and header manipulation. Scheduling hardware is designed to provide robust QoS functions. Four embedded search engines perform multiple lookups into very large tables, with access to more than 700 MB of table memory at greater than 87 Gbps of bandwidth. Advanced CRC hardware performs multiple types of CRC calculations.

· Control processor

The NP4GX incorporates a PowerPC 440 as its control processor, supporting control plane functions.

4 Conclusion and future trends

Sophisticated architectures give NPs the enormous processing power needed for very high line speeds and bandwidths. The most widely used approach is a parallel processing architecture that exploits packet-level parallelism as well as instruction-level and thread-level parallelism. NPs also use coprocessors and high-speed on-chip communication to achieve high processing speed.

However, using more coprocessors reduces the flexibility that NPs originally pursued. Reconfigurable circuits are one option for solving this problem, but recent commercial NPs do not yet use them.

Current on-chip communication in NPs can handle today's line speed requirements, but as speeds continue to rise, new mechanisms will have to be invented.

Current NP architectures are all based on a "store-process-forward" strategy. Is it feasible to abolish the store stage? Could we build an architecture based on "arrive-process-forward", in which the NP begins processing a packet as soon as it arrives at the ingress port instead of waiting for, and buffering, the whole packet?

5 References

[1] Patrick Crowley et al., "Characterizing Processor Architectures for Programmable Network Interfaces," Proceedings of the 2000 International Conference on Supercomputing, Santa Fe, N.M., May 2000.

[2] Mohammad Shorfuzzaman et al., "Architectures for Network Processors: Key Features, Evaluation, and Trends," The 2004 International MultiConference in Computer Science & Computer Engineering, Las Vegas, Nevada, USA, June 21-24, 2004.

[3] J. R. Allen Jr. et al., "IBM PowerNP network processor: Hardware, software, and applications," IBM Journal of Research and Development, Vol. 47, No. 2/3, March/May 2003.

[4] Lin Chuang et al., "Analysis and Research on Network Processor," Journal of Software, Feb. 2003, 14(2): 253-267.

[5] Patrick Crowley et al., "Network Processor Design, Issues and Practices," Vol. 1, Morgan Kaufmann Publishers, 2003.

[6] http://biz.yahoo.com/prnews/050121/nyf067_1.html

[7] Intel, "Next Generation Network Processor Technologies, Enabling Cost Effective Solutions for 2.5 Gbps to 40 Gbps Network Services," Oct. 2001.

[8] Intel, "IXA2800 Network Processor Datasheet."

[9] Intel, "IXA2800 Network Processor Hardware Reference Manual," Aug. 2004.

[10] IBM, "NP4GX Datasheet."
