
323854-001

Radio Network Layer (RNL) Data Plane Processing on Embedded Intel® Architecture Processors

June 2010

White Paper

Luis Henriques Software Engineer

Brian Forde Senior Platform Applications Engineer

Chris MacNamara Software Engineer

Intel Corporation


Abstract

This paper describes a software package that implements some of the workload associated with an RNL (Radio Network Layer) Data-Plane pipeline. It describes techniques used for optimizing the CPU workload, and outlines the resulting benchmarks.


Contents

1 Introduction
2 Key Platform Features
3 The RNL-d Application
   3.1 Software Architecture
   3.2 Radio Access Bearer (RAB) Flow Pipeline
   3.3 Downlink Processing
   3.4 Uplink Processing
   3.5 Segmentation and Reassembly
   3.6 Configuration of System-Under-Test
   3.7 Single-Pipeline Performance Summary
4 Simultaneous Multi-Threading
   4.1 Single-Pipeline SMT Performance Summary
5 Multi-Core Scaling
   5.1 Multi-Core Performance Summary
      5.1.1 Multi-Core Two Pipelines Performance
      5.1.2 Multi-Core, Four Pipelines Performance
   5.2 CPU Usage and Interrupt Handling
6 Summary
7 Reference List

Figures

Figure 1. Platform Overview
Figure 2. Hardware Setup
Figure 3. Software Configuration
Figure 4. RNL-d Software Stack
Figure 5. Data Processing Path
Figure 6. Segmentation of SDUs into PDUs
Figure 7. Aggregate Throughput of 81.92Mbps and 245.76Mbps
Figure 8. Aggregate Throughput of 163.84Mbps and 491.52Mbps
Figure 9. Aggregate Throughput of 327.68Mbps and 983.04Mbps
Figure 10. RNL-d Pipeline Partitioning
Figure 11. Aggregate Throughput of 81.92Mbps and 245.76Mbps
Figure 12. Aggregate Throughput of 163.84Mbps and 491.52Mbps
Figure 13. Aggregate Throughput of 327.68Mbps and 983.04Mbps
Figure 14. Multiple RNL-d Pipeline Instances
Figure 15. Aggregate Throughput of 2x81.92Mbps and 2x245.76Mbps
Figure 16. Aggregate Throughput of 2x163.84Mbps and 2x491.52Mbps
Figure 17. Aggregate Throughput of 2x327.68Mbps and 2x983.04Mbps
Figure 18. Aggregate Throughput of 4x81.92Mbps and 4x245.76Mbps
Figure 19. Aggregate Throughput of 4x163.84Mbps and 4x491.52Mbps
Figure 20. Aggregate Throughput of 4x327.68Mbps and 4x983.04Mbps
Figure 21. 128kbps – IRQs per Second

Tables

Table 1. Benchmark Throughput per UE
Table 2. Benchmark Packet Size
Table 3. Throughput per Pipeline


1 Introduction

Traditionally, high-performance packet processing has been implemented using application-specific hardware, typically ASICs, FPGAs, and network processors. Recent solutions utilize general-purpose processors coupled with application-specific accelerators (for example, Security and Encryption engines). For a packet-processing solution to be fit for purpose, it must provide sufficient compute performance in conjunction with tightly coupled high-bandwidth memory I/O.

The latest multi-core Embedded Intel® Architecture processors provide both the I/O bandwidth and the compute resources required to support intensive packet-processing workloads. Embedded Intel Architecture processors are high-performance processing machines with several cores per processor, each supporting Simultaneous Multithreading Technology (SMT). The specific Embedded Intel Architecture processor used for this paper had four cores per processor with SMT. Thus, in a dual-processor (DP) configuration, 16 logical cores are available.

The processor’s Intel® QuickPath Interconnect (QPI) system bus provides both high-bandwidth intra-processor and processor-to-I/O Hub (IOH) chipset communication. Each processor also incorporates three DDR3 channels, providing high memory subsystem throughput.


2 Key Platform Features

Figure 1. Platform Overview

[Figure 1 is a block diagram of the platform: two Intel® Xeon® processors linked by Intel® QPI, each with three DDR3 channels, an I/O Hub providing PCIe* Gen2 x8 and x16 slots, and an ICH10 I/O controller hub with GbE (Intel® 82578 PHY), SATA, USB, SPI flash, and legacy I/O.]

· Multi-core (four cores per processor)

· Simultaneous Multithreading (two threads per core)

· Memory I/O (3-channel DDR3)

· Intel® QuickPath Interconnect running at 4.8 GT/s

· Intel® chipset supporting 40 lanes of PCIe* Gen2 at 5 GT/s

· Performance per Watt

· Virtual interrupts

· Streaming instructions


3 The RNL-d Application

A Radio Network Layer Dedicated Channel (RNL-d) Proof of Concept (PoC) was developed that demonstrates the advantages of executing Data-Plane workload on multi-core Intel Architecture processors. This PoC implements some of the workload associated with the RNL-d pipeline. The implementation is representative of the workload that might be required in a functionally complete RNL-d.

The PoC comprises three separate systems: the RNL-d System-Under-Test (SUT), an upstream generator (simulating the Core Network protocols, the CNs), and a downstream generator (simulating the Node B and User Equipment protocols, the UEs). Figure 2 shows the hardware configuration used in the system.

Figure 2. Hardware Setup

The software implements, as closely as practical, a representative implementation of the relevant portions of an RNL-d protocol stack. The majority of the data movement and processing required in a functional RNL-d are present. In a complete system, the RNL-d might be part of a larger RNC implementation, comprising line cards (e.g., GTP/IP over GbE), RNL-d and RNL-c/sh.


The SUT (the RNL-d) has two GbE (Gigabit Ethernet*) interfaces: one upstream, facing the CNs, and one downstream, facing the UEs.¹

3.1 Software Architecture

The software implementation decomposes into several modules, as follows (please see Figure 3).

Figure 3. Software Configuration

[Figure 3 shows the software configuration: a packet generator feeding the Core Network simulator (IuUP transparent, CRC), the System-Under-Test RNL-d stack (IuUP transparent, CRC, RLC (AM), MAC-d, IuFP, CRC), and a packet generator above the UE/Node B simulator stack (PDCP, RLC (AM), MAC-d, IuFP, CRC).]

In the System-Under-Test (SUT), the center of the system (where the bulk of the processing occurs) is the Data-Plane portion of the Radio Network Layer Dedicated Channel (RNL-d), comprising a MAC-d scheduler and RLC entities, operating in Acknowledged Mode (AM). A peer stack exists in the User Equipment / Node B simulator (UEs). These two protocol stacks communicate via Framing Protocol (IuFP) over the transport layer (in this case, Switch Fabric Protocol (SFP) over Ethernet). On the upstream side, the SUT communicates with the Core Network simulator (CNs) over the transport layer (again SFP over Ethernet). As on the downstream side, a peer entity exists in the CNs. Above the PDCP in the UEs, and above the IuUP in the CNs, packet generators stream packets of specified size at a specified rate through the RNL-d pipeline.

Multiple Radio Access Bearer (RAB) flows are supported by the system, identified by the ‘Entity Id’ field appended to the SFP header of incoming and outgoing packets.
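A minimal sketch of how such a flow lookup could be organized is shown below: a per-interface hash table keyed by the Entity Id, mapping each RAB flow to its RLC context (the architecture notes later in this section mention a 256K-entry table per interface). The structure and helper names (struct rab_flow, flow_lookup, the rlc field) are assumptions for illustration, not the actual RNL-d code; the list and hash macros are used in their 2.6-era kernel form.

```c
/*
 * Illustrative sketch only: a per-interface flow lookup keyed by the
 * 'Entity Id' carried in the SFP header. Names and sizes are assumptions.
 */
#include <linux/list.h>
#include <linux/jhash.h>
#include <linux/types.h>

#define FLOW_HASH_BITS  18                       /* 256K buckets per interface */
#define FLOW_HASH_SIZE  (1 << FLOW_HASH_BITS)

struct rab_flow {
	u32 entity_id;                           /* 'Entity Id' from the SFP header */
	struct rlc_entity *rlc;                  /* per-flow RLC context (assumed) */
	struct hlist_node node;
};

static struct hlist_head flow_table[FLOW_HASH_SIZE];

static struct rab_flow *flow_lookup(u32 entity_id)
{
	u32 bucket = jhash_1word(entity_id, 0) & (FLOW_HASH_SIZE - 1);
	struct rab_flow *flow;
	struct hlist_node *pos;

	hlist_for_each_entry(flow, pos, &flow_table[bucket], node)
		if (flow->entity_id == entity_id)
			return flow;                 /* flow context for this RAB */
	return NULL;
}
```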

The complete software is implemented as a number of Linux*-loadable modules. The modules run in Linux kernel space, as illustrated in Figure 4.

¹ This configuration is valid for a single pipeline instance of the system; when running several instances of the pipeline (see Section 5, Multi-Core Scaling, for details), each pipeline will require two GbE interfaces.


Several architectural characteristics allow the system to achieve high performance. These include:

· One kernel thread per interface

· IRQ Affinity – The MSI interrupt for each interface is tied to a specific core (see the sketch after this list). Pinning IRQs and threads to a single core reduces accesses to remote caches, keeping the data in the processor's local cache and thus reducing cache thrashing between processor sockets across the Intel QPI interface.

· New API (NAPI) mode – when the soft interrupt fires, the kernel is notified that data is available and then polls the interface until its receive queue is empty

· A 256K-entry hash lookup table per interface, scalable to a total of 4 million flow entries
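As a concrete illustration of the IRQ affinity point above, the following hypothetical user-space helper pins an interrupt to one core through the standard Linux /proc/irq/<irq>/smp_affinity interface; the IRQ number and CPU are examples only and would in practice be taken from /proc/interrupts for the NIC port in question.

```c
/*
 * Hypothetical helper: pin a NIC's MSI interrupt to one logical CPU by
 * writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity.
 */
#include <stdio.h>

static int pin_irq_to_cpu(int irq, int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", 1u << cpu);           /* valid for cpu < 32 */
	fclose(f);
	return 0;
}

int main(void)
{
	/* Example values only: IRQ 58 assumed to belong to the uplink NIC port */
	return pin_irq_to_cpu(58, 0) ? 1 : 0;
}
```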

Figure 4. RNL-d Software Stack

3.2 Radio Access Bearer (RAB) Flow Pipeline

Figure 5 shows the data processing path on both the upstream and the downstream interfaces. On both interfaces, packets that arrive at the Ethernet interfaces are placed in a receive queue (one queue for each interface) by the receive handler, for later processing. The receive thread is signaled by the receive handler, indicating the presence of packets in the receive queue.


Figure 5. Data Processing Path


At a later time, the receive thread is woken by the signal originating from the receive handler, indicating the presence of packets in the receive queue. The receive thread then removes any packets from the receive queue, and processes those packets.

The same design pattern is used on both the upstream and downstream interfaces, and is illustrated in Figure 5 by the elements “Rcv Function”, “RxQ”, “Trigger”, and “Rx Proc”.
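A minimal sketch of this Rcv Function / RxQ / Trigger / Rx Proc pattern, written as 2.6-era kernel-module code, is shown below. The function names are illustrative and rnl_process_packet stands in for the IuUP/IuFP handling described in the following sections; this is not the actual RNL-d source.

```c
/* Minimal sketch of the receive pattern; skb_queue_head_init(&rxq) is
 * assumed to be called at module init. */
#include <linux/skbuff.h>
#include <linux/kthread.h>
#include <linux/wait.h>

static struct sk_buff_head rxq;                  /* per-interface receive queue */
static DECLARE_WAIT_QUEUE_HEAD(rx_wq);           /* trigger for the receive thread */

void rnl_process_packet(struct sk_buff *skb);    /* IuUP/IuFP handling, defined elsewhere */

/* "Rcv Function": runs in the receive path and only enqueues and signals */
static void rnl_rcv(struct sk_buff *skb)
{
	skb_queue_tail(&rxq, skb);               /* place packet in the receive queue */
	wake_up_interruptible(&rx_wq);           /* signal the receive thread */
}

/* "Rx Proc": the receive kernel thread, woken by the trigger */
static int rnl_rx_thread(void *arg)
{
	struct sk_buff *skb;

	while (!kthread_should_stop()) {
		wait_event_interruptible(rx_wq,
			!skb_queue_empty(&rxq) || kthread_should_stop());
		while ((skb = skb_dequeue(&rxq)) != NULL)
			rnl_process_packet(skb); /* remove and process queued packets */
	}
	return 0;
}
```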

3.3 Downlink Processing

In the downlink pipeline, packets received on the Ethernet interface are forwarded to the IuUP interface packet handling function. There the packet is sanity checked, the RAB FlowId extracted from the header, the header and trailer removed, and the packet payload forwarded to the upstream side of the RNL-d (the RLC) for further processing.

On receiving a packet from the upstream receive entity, the RNL-d routes the packet to the unique RLC entity associated with the RAB flow of the packet. The RLC entity places the received packet into its upstream receive queue (each RLC entity has a unique upstream receive queue), for later processing.

Periodically, any packets in the RLC entity's upstream receive queue are removed from that queue, segmented (into appropriately sized PDUs) and/or concatenated into the segmentation buffer, and the packet buffer is freed (returned to the OS for reuse). Each RLC entity has a unique and dedicated segmentation buffer.

The MAC-d thread periodically (every 10 ms radio frame) enumerates all active DCH entities. Each DCH entity has a Transmission Time Interval (TTI), which is an integer number of radio frames. Further processing for a given DCH is performed only during radio frames that are one TTI apart. Individual DCHs may have different TTIs.

During a transmission radio frame, the DCH function is invoked by the MAC-d thread. The DCH function allocates a transmission buffer from the OS, and checks the Buffer Occupancy (i.e., how many segments, or Transport Blocks (TB), are available for transmission) of the RLC entity associated with this DCH. A decision is made as to how many Transport Blocks (i.e., the Transport Format) are to be transmitted during this TTI, based on the Buffer Occupancy of the RLC entity. Transport Blocks (of that number) are copied into the already allocated transmission buffer and a Framing Protocol header is prepended, indicating the Transport Format Indicator (TFI), among other things, of the transmission during this TTI. Once the Framing Protocol packet has been constructed, an SFP header is prepended, and the whole packet is then sent (using the OS interface dev_queue_xmit) to the Ethernet NIC for transmission on the downstream NIC.
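The following sketch illustrates the per-TTI transmission sequence just described. The dch structure, the header-length constants, and the rlc_buffer_occupancy, rlc_next_tb, fp_build_header, and sfp_build_header helpers are assumptions standing in for the real code; only alloc_skb, skb_reserve, skb_put, skb_push, and dev_queue_xmit are standard kernel APIs.

```c
/* Illustrative sketch of per-TTI downlink transmission for one DCH. */
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/kernel.h>

#define SFP_HDR_LEN  4                           /* illustrative header sizes only */
#define FP_HDR_LEN   8

struct dch {                                     /* assumed, simplified DCH context */
	struct rlc_entity *rlc;
	struct net_device *downstream_dev;
	u32 flow_id;
	int tb_size;                             /* Transport Block size in bytes */
	int max_tb_per_tti;
};

/* Assumed helpers provided elsewhere in the module */
int   rlc_buffer_occupancy(struct rlc_entity *rlc);
void *rlc_next_tb(struct rlc_entity *rlc);
void  fp_build_header(void *hdr, struct dch *dch, int num_tb);
void  sfp_build_header(void *hdr, u32 flow_id);

static void dch_tti_transmit(struct dch *dch)
{
	struct sk_buff *skb;
	int tb, num_tb;

	/* Transport Format decision: how many TBs to send this TTI,
	 * based on the RLC entity's Buffer Occupancy */
	num_tb = min(rlc_buffer_occupancy(dch->rlc), dch->max_tb_per_tti);
	if (num_tb == 0)
		return;

	skb = alloc_skb(SFP_HDR_LEN + FP_HDR_LEN + num_tb * dch->tb_size, GFP_ATOMIC);
	if (!skb)
		return;
	skb_reserve(skb, SFP_HDR_LEN + FP_HDR_LEN);  /* leave room for the headers */

	/* Copy the Transport Blocks from the RLC segmentation buffer */
	for (tb = 0; tb < num_tb; tb++)
		memcpy(skb_put(skb, dch->tb_size), rlc_next_tb(dch->rlc), dch->tb_size);

	/* Prepend the Framing Protocol header (carries the TFI), then the SFP header */
	fp_build_header(skb_push(skb, FP_HDR_LEN), dch, num_tb);
	sfp_build_header(skb_push(skb, SFP_HDR_LEN), dch->flow_id);

	skb->dev = dch->downstream_dev;
	dev_queue_xmit(skb);                     /* hand the packet to the downstream NIC */
}
```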

3.4 Uplink Processing

In the uplink pipeline, packets received on the Ethernet interface are forwarded to the IuFP interface packet handling function. There the packet is sanity checked, the RAB FlowId extracted from the header, the header and trailer removed, and the packet payload forwarded to the downstream side of the RNL-d (the MAC-d) for further processing.

On receiving a packet from the downstream receive entities, the RNL-d first parses the received packet to determine what Transport Format is used (i.e., how many Transport Blocks, etc.). Any Transport Blocks within the received packet are passed to the DCH entity associated with the RAB flow of the packet. The DCH then passes the PDU (within the TB) to the RLC associated with that DCH. The RLC saves the PDU in the reassembly buffer of that RLC entity (each entity has a unique reassembly buffer), for later processing and reassembly.

The downstream interface receives one frame containing all the PDUs received in a single TTI. The MAC-d entity passes these PDUs individually to the RLC entity associated with the RAB flow. In the case of Acknowledged Mode operation, received PDUs are further discriminated into Data and Control PDUs.

When handling a received Data PDU, the PDU is placed into the reassembly buffer by the receive handler. The reassembly buffer is an array of PDUs that are received from the underlying MAC-d entity from the radio interface.
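A possible shape for such a reassembly buffer is sketched below: an array of PDU slots indexed by the RLC sequence number. The structures are illustrative, and the 12-bit AM sequence-number modulus is an assumption for the sketch rather than a detail taken from the RNL-d code.

```c
/* Sketch of a reassembly buffer indexed by RLC sequence number. */
#define RLC_AM_SN_MODULUS 4096                   /* assumed 12-bit AM sequence-number space */

struct rlc_pdu {
	unsigned int sn;                         /* sequence number from the PDU header */
	/* payload pointer, length, ... */
};

struct reassembly_buffer {
	struct rlc_pdu *slot[RLC_AM_SN_MODULUS]; /* one slot per sequence number */
};

/* Each RLC entity owns one reassembly buffer (illustrative helper) */
static void rlc_store_data_pdu(struct reassembly_buffer *reasm, struct rlc_pdu *pdu)
{
	/* Keep the PDU until the surrounding SDU can be reassembled in sequence */
	reasm->slot[pdu->sn % RLC_AM_SN_MODULUS] = pdu;
}
```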

3.5 Segmentation and Reassembly

The Segmentation function takes the outbound SDU as an argument. It breaks the SDU into PDUs of length determined by the underlying MAC entity. Space is allowed at the beginning of each PDU for an RLC header of appropriate size (16 bits for AM, 8 bits for UM). For the last segment of the current SDU, which is always smaller than a full PDU, a Length Indicator (LI) is added before the SDU datum.


Figure 6. Segmentation of SDUs into PDUs

[Figure 6 shows an outbound SDU (held in an SKB) being copied into cache-line-aligned PDUs in the segmentation buffer; each PDU carries an RLC AM header (D/C, SN, P, HE) and, where needed, LI/E fields (the final PDU uses LI=0x7F ahead of its padding), and the PDUs are then copied into the IuFP buffer behind a TFI/CFN/FT header and CRC.]

If the next SDU in the Queue has more data than can fit into a single PDU (that is, requires more than one PDU for transmission), then that SDU is concatenated with the current SDU, and no additional Length Indicator (LI) is required. If the next SDU in the Queue is smaller than a PDU (that is, requires only one PDU for transmission), or there is no next SDU, then concatenation is not performed. In that case an additional LI is required, indicating that any unused (unfilled) space in the PDU is padding.
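The segmentation and concatenation decisions described above can be summarized in the following sketch. All structures and rlc_* helpers are illustrative assumptions; only the control flow mirrors the text: full PDUs need no LI, the last segment of an SDU carries an LI, and either the next SDU is concatenated into the remaining space or an additional LI marks that space as padding.

```c
/* Sketch of the segmentation/concatenation decision; names are assumptions. */
#include <linux/types.h>

#define RLC_AM_HDR_LEN  2                        /* 16-bit AM header */
#define LI_PADDING      0x7F                     /* "rest of this PDU is padding" */

struct rlc_entity;                               /* opaque per-flow RLC context */

/* Assumed helpers provided elsewhere in the module */
void rlc_fill_full_pdu(struct rlc_entity *rlc, const u8 *data, size_t len);
void rlc_start_partial_pdu(struct rlc_entity *rlc, const u8 *data, size_t len);
int  rlc_next_sdu_spans_multiple_pdus(struct rlc_entity *rlc);
void rlc_concat_next_sdu(struct rlc_entity *rlc);
void rlc_close_partial_pdu(struct rlc_entity *rlc, u8 li);

static void rlc_segment_sdu(struct rlc_entity *rlc, size_t pdu_size,
			    const u8 *sdu, size_t len)
{
	size_t payload = pdu_size - RLC_AM_HDR_LEN;
	size_t off = 0;

	/* Full PDUs carry SDU data only; no Length Indicator is needed */
	while (len - off > payload) {
		rlc_fill_full_pdu(rlc, sdu + off, payload);
		off += payload;
	}

	/* The last segment is smaller than a PDU; an LI records its length */
	rlc_start_partial_pdu(rlc, sdu + off, len - off);

	if (rlc_next_sdu_spans_multiple_pdus(rlc)) {
		/* Concatenate the start of the next SDU; no additional LI required */
		rlc_concat_next_sdu(rlc);
	} else {
		/* No concatenation: an additional LI marks the unused space as padding */
		rlc_close_partial_pdu(rlc, LI_PADDING);
	}
}
```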

3.6 Configuration of System-Under-Test

Hardware

· Intel® Xeon® Processor LC5528

- Uni-processor configuration

- AC and HW prefetchers enabled

- DCU streamer and DCU UP prefetchers enabled

- SMT enabled


- CPU C-states disabled

- ACPI disabled

- Intel SpeedStep® disabled

- Direct Cache Access disabled

- Turbo Mode disabled

- NUMA disabled

· 6 x 2 GB DDR3 1333 MHz memory

· 2 x network interface card (NIC), quad-port (Intel® 82576 Gigabit Ethernet Controller)

Software

· CentOS* v5.2 64 bit

· Vanilla Linux* kernel 2.6.28.9

· Ethernet driver (igb-2.1.1)

· RNL-d application

3.7 Single-Pipeline Performance Summary

This section summarizes the results from benchmarking the software on an Intel® Xeon® processor LC5528, running at a core frequency of 2.1GHz.

Only Downlink traffic at 128kbps and 384kbps has been considered for this benchmarking, for 640, 1280 and 2560 UEs. Table 1 summarizes the throughput obtained in the different benchmark scenarios.

Table 1. Benchmark Throughput per UE

Throughput per UE (kbps)    Number of UEs    Aggregated Throughput (Mbps)
128                          640               81.92
128                         1280              163.84
128                         2560              327.68
384                          640              245.76
384                         1280              491.52
384                         2560              983.04

In each of the following graphs, the traffic per UE remains constant, while the packet rate (Packets Per Second, PPS) and packet size vary, resulting in a constant traffic throughput per UE. The aggregate load is the product of the throughput per UE and the number of UEs.


Table 2. Benchmark Packet Size

PPS    Interval (ms)    Packet Size, 128kbps (bytes)    Packet Size, 384kbps (bytes)
500     2                 31                              95
333     3                 47                             143
250     4                 63                             191
200     5                 79                             239
166     6                 95                             287
125     8                127                             383
100    10                159                             479
 83    12                191                             575
 62    16                255                             767
 50    20                319                             959
 40    25                399                            1199
 33    30                479                               -

Table 2 shows the packet rates and packet sizes used to execute the benchmark. The PPS is computed so that the throughput per UE remains constant (either 128kbps or 384kbps); the inter-packet interval and the packet size are adjusted accordingly.
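The Table 2 values appear to follow a simple rule: the packet size is (rate x interval) / 8 with one byte subtracted (for example, 31 rather than 32 bytes at 128kbps with a 2 ms interval), presumably to account for a header byte, although the paper does not state this. The small program below reconstructs the table under that assumption; note that the 384kbps, 30 ms combination is not listed in Table 2.

```c
/* Rough, assumption-based reconstruction of the Table 2 packet sizes. */
#include <stdio.h>

int main(void)
{
	const int intervals_ms[] = { 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 25, 30 };
	const int rates_kbps[]   = { 128, 384 };
	int r, i;

	for (r = 0; r < 2; r++)
		for (i = 0; i < 12; i++) {
			int pps   = 1000 / intervals_ms[i];
			int bytes = rates_kbps[r] * intervals_ms[i] / 8 - 1;
			printf("%3d kbps, %2d ms interval: %3d PPS, %4d-byte packets\n",
			       rates_kbps[r], intervals_ms[i], pps, bytes);
		}
	return 0;
}
```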

Figure 7. Aggregate Throughput of 81.92Mbps and 245.76Mbps

As Figure 7 shows, the CPU usage does not remain constant even though the throughput per UE remains stable (128kbps or 384kbps).

Although the CPU usage increases almost linearly with the PPS, there are some fluctuations, especially visible at the 384kbps rate. These fluctuations become more pronounced, and are explained, later in this paper when partitioning and multi-core support are discussed.

[Chart for Figure 7: CPU usage (%) versus PPS per UE for 640 UEs, with 128kbps and 384kbps series.]

As shown in Figure 9, when 2560 UEs are used to load the system, the CPU usage is very high and there is no headroom to collect more data points; for the 384kbps scenario, only three data points have been used. When the PPS is increased further, the system starts to drop packets due to the high CPU usage.

Figure 8. Aggregate Throughput of 163.84Mbps and 491.52Mbps

Figure 9. Aggregate Throughput of 327.68Mbps and 983.04Mbps

[Charts for Figures 8 and 9: CPU usage (%) versus PPS per UE for 1280 UEs and 2560 UEs respectively, each with 128kbps and 384kbps series.]


4 Simultaneous Multi-Threading

Simultaneous Multi-Threading (SMT) consists of two or more logical processors sharing the processor’s execution engine and bus interface, each having its own Advanced Programmable Interrupt Controller (APIC). SMT allows multiple independent threads to execute within a single physical core, making more efficient use of the available processor resources.

In order to balance the load between the different active entities of an RNL-d pipeline, these entities are distributed across the logical cores within a single processor, taking advantage of SMT threads. This allows the threads to execute in parallel, reducing concurrent access to CPU resources and thereby reducing the number of involuntary context switches. In addition, the threads in the same pipeline benefit from the upper-level caches, which are shared among SMT threads on the same physical core.

Figure 10 describes the system partitioning for a system executing one RNL-d pipeline on a physical core with two SMT logical processors.

Figure 10. RNL-d Pipeline Partitioning

As shown in the figure, the first logical core (SMT #0) within the physical core will execute the FP and IUP threads. This core will also be responsible for handling the Tx/Rx interrupts for the two Ethernet interfaces used by the pipeline for the Uplink and Downlink. The second logical core (SMT #1) on the physical core is dedicated to executing the MAC thread, which is the most compute-intensive thread.

By partitioning the RNL-d pipeline in this way, data locality is maintained, as the data used within a pipeline is kept within a single physical core. Since the caches are shared, the number of cache misses is kept to a minimum. At the same time, true parallelism within the pipeline is achieved: packets can be processed by the MAC scheduler while newly arriving packets do not interrupt that processing.
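An illustrative sketch of this thread placement, assuming the pipeline threads are created as kernel threads, is shown below. The thread functions are placeholders, and the logical CPU numbers are examples: the actual SMT sibling of CPU 0 depends on the topology the kernel reports (for example, in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list).

```c
/* Sketch of the Figure 10 thread placement using kernel threads. */
#include <linux/kthread.h>
#include <linux/err.h>

int fp_thread_fn(void *arg);                     /* FP processing, defined elsewhere */
int iup_thread_fn(void *arg);                    /* IUP processing, defined elsewhere */
int mac_thread_fn(void *arg);                    /* MAC-d scheduler, defined elsewhere */

static int rnl_start_pipeline(void)
{
	struct task_struct *fp, *iup, *mac;

	fp  = kthread_create(fp_thread_fn,  NULL, "rnl_fp");
	iup = kthread_create(iup_thread_fn, NULL, "rnl_iup");
	mac = kthread_create(mac_thread_fn, NULL, "rnl_mac");
	if (IS_ERR(fp) || IS_ERR(iup) || IS_ERR(mac))
		return -ENOMEM;

	kthread_bind(fp,  0);                    /* SMT #0: FP thread (plus NIC IRQs) */
	kthread_bind(iup, 0);                    /* SMT #0: IUP thread */
	kthread_bind(mac, 1);                    /* SMT #1: compute-intensive MAC scheduler */

	wake_up_process(fp);
	wake_up_process(iup);
	wake_up_process(mac);
	return 0;
}
```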


4.1 Single-Pipeline SMT Performance Summary

This section summarizes the results of benchmarking the RNL-d software on the same hardware platform as presented in Section 3.6. The only difference between the two setups is that SMT has been turned on in BIOS so that the software could take advantage of the two logical processors per core.

Since the pipeline now executes on two logical processors within the same physical core, the figures presented next with the CPU percentage usage will distinguish the CPU usage of the two components:

· The CPU usage for the logical processor that handles the FP, IUP and Tx/Rx interrupts for the two Ethernet interfaces used by the pipelines.

· The CPU usage for the logical processor that executes the MAC scheduler.

As described in Section 3.7, 128kbps and 384kbps Downlink throughput per UE has been used to benchmark the system with 640, 1280 and 2560 UEs.

Figure 11. Aggregate Throughput of 81.92Mbps and 245.76Mbps

As Figure 11 (and following) shows, the CPU usage is higher on the processor that is handling the interrupts from the Ethernet cards. Also, the fluctuations observed in Section 3.7 are still visible. These fluctuations are discussed in later sections of this paper.

Another observation is that the CPU usage is higher for the 384kbps throughput scenario, although Figure 12 (with 1280 UEs) shows that the difference between the CPUs handling the IRQs for 128kbps and for 384kbps tends to become smaller.

[Chart for Figure 11: CPU usage (%) versus PPS per UE for 640 UEs, with separate series for the IRQs CPU and the MAC CPU at 128kbps and 384kbps.]


Figure 12. Aggregate Throughput of 163.84Mbps and 491.52Mbps

Figure 13. Aggregate Throughput of 327.68Mbps and 983.04Mbps

[Charts for Figures 12 and 13: CPU usage (%) versus PPS per UE for 1280 UEs and 2560 UEs respectively, with separate series for the IRQs CPU and the MAC CPU at 128kbps and 384kbps.]


5 Multi-Core Scaling

The RNL-d pipeline implementation (comprising the downlink and the uplink flows) described in previous sections can be replicated and executed on several CPU cores so that several pipeline instances can be executed in parallel on the same multi-core Intel Architecture processor.

The main goal of scaling the RNL-d pipeline across multiple cores was to prove the scalability of the pipeline implementation, thereby increasing overall throughput. This, in turn, allows a higher number of UEs to be serviced by a single system.

Figure 14 describes one possible configuration of multiple RNL-d pipeline instances that can be used. Each pipeline has to be connected to the UE/Node-B simulator and to the Core Network simulator. Thus, two interfaces are used by each pipeline instance. Note that, for maximum performance, the Tx/Rx IRQs of these two interfaces have to be pinned to the SMT core running the FP and the IUP threads for the corresponding pipeline.

Figure 14. Multiple RNL-d Pipeline Instances

[Figure 14 shows four RNL-d pipeline instances, one per physical core (Core #0 to Core #3); on each core, SMT #0 runs the NIC ISR, FP, and IU threads, SMT #1 runs the MAC thread, and each pipeline is bound to its own pair of NIC ports.]

The figure also shows the active instances that are present on each pipeline: each pipeline will have its own IUP, FP and MAC-d threads.

Since RNL-d pipeline instances do not share data, each pipeline is executed independently; no inter-core locking mechanisms are required.

An important aspect of the pipeline is that the current implementation is not NUMA-aware. For this reason, and to achieve high performance, the hardware used to execute the pipeline has to be set to a uni-processor (UP) configuration. If the system has a DP configuration, then memory accesses could be very expensive whenever a pipeline running on one CPU socket has to access non-local memory. Such accesses would be performed across the Intel QPI bus, which could introduce high overheads and decrease performance.


Another important consideration about the hardware setup is related to the PCIe slots used for the Ethernet interfaces. The best performance is achieved when the NICs are connected on a PCIe slot that is directly accessed by the CPU socket. Using a PCIe slot that is indirectly connected to the processor (for example, via the PCH complex) may significantly reduce performance.

5.1 Multi-Core Performance Summary

This section presents the benchmark results for the RNL-d software when executing two pipelines and four pipelines simultaneously. For the two-pipeline configuration, two different physical cores are used (a total of four logical cores); for the four-pipeline configuration, the four available physical cores are used (a total of eight logical CPUs).

Note that the data presented in Section 5.1.1 represents only one of the pipelines that was executing. However, the data for the other pipelines running is very similar and hence is not shown here.

Table 3 summarizes the Downlink throughput that is obtained for each of the configurations.

Table 3. Throughput per Pipeline

Throughput per UE (kbps)    Number of UEs    Aggregated Throughput (Mbps)
                                             1 Pipeline    2 Pipelines    4 Pipelines
128                          640               81.92         163.84         327.68
128                         1280              163.84         327.68         655.36
128                         2560              327.68         655.36        1310.72
384                          640              245.76         491.52         983.04
384                         1280              491.52         983.04        1966.08
384                         2560              983.04        1966.08        3932.16

5.1.1 Multi-Core Two Pipelines Performance

From the following three figures, one of the first interesting observations is that, at the higher packet rates, the most heavily loaded CPU is not the one handling the Ethernet interrupts for the 384kbps throughput, but the one handling the interrupts for 128kbps. The difference in CPU usage between the two CPUs handling interrupts tends to shrink until the curves actually cross.

The MAC scheduler CPU usage curves for 128kbps and 384kbps never cross, and appear to remain separated by a roughly constant margin.


Figure 15. Aggregate Throughput of 2x81.92Mbps and 2x245.76Mbps

Figure 16. Aggregate Throughput of 2x163.84Mbps and 2x491.52Mbps

[Charts for Figures 15 and 16: CPU usage (%) versus PPS per UE for 640 UEs and 1280 UEs respectively (two pipelines), with separate series for the IRQs CPU and the MAC CPU at 128kbps and 384kbps.]


Figure 17. Aggregate Throughput of 2x327.68Mbps and 2x983.04Mbps

5.1.2 Multi-Core, Four Pipelines Performance

The tendency for the CPU handling the Ethernet interrupts at 128kbps to carry the highest load is even more visible in this four-pipeline scenario: its usage is almost always higher than that of the corresponding CPU at 384kbps.

Another interesting observation is that, for 1280 UEs (Figure 19), there is a point where the CPU usage actually decreases as the PPS increases (highlighted in the figure). This phenomenon is detailed in Section 5.2.

[Chart for Figure 17: CPU usage (%) versus PPS per UE for 2560 UEs (two pipelines), with separate series for the IRQs CPU and the MAC CPU at 128kbps and 384kbps.]


Figure 18. Aggregate Throughput of 4x81.92Mbps and 4x245.76Mbps

[Chart for Figure 18: CPU usage (%) versus PPS per UE for 640 UEs (four pipelines), with separate series for the IRQs CPU and the MAC CPU at 128kbps and 384kbps.]


Figure 19. Aggregate Throughput of 4x163.84Mbps and 4x491.52Mbps

Figure 20. Aggregate Throughput of 4x327.68Mbps and 4x983.04Mbps

5.2 CPU Usage and Interrupt Handling

As the data presented in previous sections shows, the CPU usage does not scale completely linearly with the packet rate. For example, Figure 18 shows large fluctuations on the CPUs that are servicing the Ethernet interrupts associated with the pipeline.

This fluctuation was initially identified in the data presented in Section 3.7, in a non-partitioned version of the RNL-d pipeline. When partitioning was introduced to the pipeline, separating the MAC scheduler from the Rx/Tx entities (such as the interrupt handlers), it became clear that these variations in CPU usage were visible mainly on the CPU handling the NICs' IRQs.

[Charts for Figures 19 and 20: CPU usage (%) versus PPS per UE for 1280 UEs and 2560 UEs respectively (four pipelines), with separate series for the IRQs CPU and the MAC CPU at 128kbps and 384kbps.]


This observation triggered the collection of additional data that could help to correlate the CPU usage, the packet rate, and the number of Ethernet IRQs occurring on the system. For this additional data collection, the scenario described in Figure 19 has been selected: four pipelines loaded with 1280 UEs at 128kbps.

Figure 21 shows only the CPU usage for the CPU handling the IRQs in one of the four pipelines. Also, it presents the average IRQs per second on each of the Ethernet interfaces used by the pipeline: the interface that connects to the Core Network (UL IRQs) and the interface that connects to the Node-B/UEs (DL IRQs). Finally, this figure also shows the total number of IRQs as a sum of the DL and the UL IRQs.
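The paper does not describe how the IRQ rates were collected; a hypothetical user-space helper of the kind that could produce such figures is sketched below. It sums the per-CPU counters of every /proc/interrupts line mentioning a given interface name and differences two samples taken one second apart; the interface name is an example only.

```c
/* Hypothetical IRQs-per-second sampler based on /proc/interrupts. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long irq_count(const char *ifname)
{
	char line[512];
	unsigned long total = 0;
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		char *p;
		unsigned long v;
		int n;

		if (!strstr(line, ifname))
			continue;
		p = strchr(line, ':');           /* skip the "NN:" IRQ number */
		if (!p)
			continue;
		p++;
		while (sscanf(p, "%lu%n", &v, &n) == 1) {
			total += v;              /* sum the per-CPU counters */
			p += n;
		}
	}
	fclose(f);
	return total;
}

int main(void)
{
	unsigned long before = irq_count("eth2"); /* example downstream port name */

	sleep(1);
	printf("%lu IRQs/s\n", irq_count("eth2") - before);
	return 0;
}
```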

Figure 21. 128kbps – IRQs per Second

An interesting observation (highlighted in the figure) is that when the CPU usage drops from around 57% to 50% at 100 PPS per UE, there is a spike in the number of IRQs per second, from around 5000 interrupts per second to 7000. This exact pattern can be observed in all four pipelines that were executing simultaneously.

There are two main reasons for the variation in the number of IRQs per second. The first is that the number of packets handled by the Ethernet cards is not constant: the number of packets injected on the Downlink is increasing. The second is that the Ethernet card device driver (igb) makes use of interrupt coalescing techniques that affect the number of IRQs being generated.

The igb device driver behavior can be modified by adjusting the interrupt throttling rate. However, its default behavior (which was used to run all the tests presented in this paper) is to use an adaptive mode that dynamically adjusts the interrupt throttling rate based on the traffic being received.

Thus, the number of interrupts is a factor that influences the overall CPU usage on the system and can justify the fluctuations that have been observed on the RNL-d pipeline implemented.

[Chart for Figure 21: CPU usage (%) and IRQs per second versus PPS per UE for 1280 UEs at 128kbps, with series for the IRQs CPU usage, UL IRQs, DL IRQs, and UL+DL IRQs.]


6 Summary

This paper showcases SMT-enabled Embedded Intel® Architecture processors as an excellent vehicle for handling high-performance data-plane packet processing.

The Intel® Embedded Design Center provides qualified developers with web-based access to technical resources. Access Intel Confidential design materials, step-by-step guidance, application reference solutions, training, Intel’s tool loaner program, and connect with an e-help desk and the embedded community. Design Fast. Design Smart. Get started today. http://intel.com/embedded/edc.


7 Reference List

UMTS Access Stratum Services and Functions, 3GPP TS 23.110, version 8.0.0 Release 8
Radio interface protocol architecture, 3GPP TS 25.301, version 8.4.0 Release 8
Services provided by the physical layer, 3GPP TS 25.302, version 8.2.0 Release 8
High Speed Downlink Packet Access (HSDPA); Overall description; Stage 2, 3GPP TS 25.308, version 8.5.0 Release 8
Enhanced uplink; Overall description; Stage 2, 3GPP TS 25.319, version 8.5.0 Release 8
Medium Access Control (MAC) protocol specification, 3GPP TS 25.321, version 8.4.0 Release 8
Radio Link Control (RLC) protocol specification, 3GPP TS 25.322, version 8.4.0 Release 8
Packet Data Convergence Protocol (PDCP) specification, 3GPP TS 25.323, version 8.4.0 Release 8
UTRAN overall description, 3GPP TS 25.401, version 8.2.0 Release 8
UTRAN Iu interface: General aspects and principles, 3GPP TS 25.410, version 8.1.0 Release 8
UTRAN Iu interface Radio Access Network Application Part (RANAP) signaling, 3GPP TS 25.413, version 8.2.1 Release 8
UTRAN Iu interface user plane protocols, 3GPP TS 25.415, version 8.0.0 Release 8
UTRAN Iur interface general aspects and principles, 3GPP TS 25.420, version 8.1.0 Release 8
UTRAN Iur interface user plane protocols for Common Transport Channel data streams, 3GPP TS 25.425, version 8.2.0 Release 8
UTRAN Iur and Iub interface data transport & transport signalling for DCH data streams, 3GPP TS 25.426, version 8.0.0 Release 8
UTRAN Iur/Iub interface user plane protocol for DCH data streams, 3GPP TS 25.427, version 8.1.0 Release 8
UTRAN Iub Interface: general aspects and principles, 3GPP TS 25.430, version 8.0.0 Release 8
UTRAN Iub interface user plane protocols for Common Transport Channel data streams, 3GPP TS 25.435, version 8.3.0 Release 8
Common test environments for User Equipment (UE); Conformance testing, 3GPP TS 34.108, version 8.6.0 Release 8
IP Header Compression, RFC 2507, February 1999


Authors

Brian Forde, Luis Henriques and Chris MacNamara are software engineers in the Intel Architecture Group.

Acronyms

APIC    Advanced Programmable Interrupt Controller
CN      Core Network
DCH     Dedicated Channel
DDR3    Double Data Rate Three
DL      Downlink
DP      Dual Processor
FP      Framing Protocol
GbE     Gigabit Ethernet
GTP     GPRS Tunnelling Protocol
IOH     Input/Output Hub
IUP     Iu User Plane Protocol
IRQ     Interrupt Request
LI      Length Indicator
MAC     Medium Access Control
MSI     Message Signaled Interrupts
NAPI    New API
NIC     Network Interface Card
PDU     Protocol Data Unit
PPS     Packets Per Second
PoC     Proof of Concept
QPI     Intel® QuickPath Interconnect
RAB     Radio Access Bearer
RLC     Radio Link Control
RNC     Radio Network Controller
RNL     Radio Network Layer
SDU     Service Data Unit
SFP     Switch Fabric Protocol
SMT     Simultaneous Multithreading Technology
SUT     System-Under-Test
TB      Transport Block
TFI     Transport Format Indicator
TTI     Transmission Time Interval
UE      User Equipment
UL      Uplink


INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice.

This paper is for informational purposes only. THIS DOCUMENT IS PROVIDED "AS IS" WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Intel disclaims all liability, including liability for infringement of any proprietary rights, relating to use of information in this specification. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted herein.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Core Inside, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2010 Intel Corporation. All rights reserved.

§