Evaluation of a Dual-Core SMP and AMP Architecture based on an Embedded Case Study

Nico De Witte, Robbie Vincke, Sille Van Landschoot, Eric Steegmans, Jeroen Boydens

Report CW642, July 2013

KU Leuven, Department of Computer Science
Celestijnenlaan 200A – B-3001 Heverlee (Belgium)
Abstract
Typical telecom applications apply a planar architecture pattern based on the processing requirements of each subsystem. In a symmetric multiprocessing environment all applications share the same hardware resources. However, currently embedded hardware platforms are being designed with asymmetric multiprocessor architectures to improve separation and increase performance of non-interfering tasks. These asymmetric multiprocessor architectures allow different planes to be separated and assign dedicated hardware for each responsibility. While planes are logically separated, some hardware is still shared and creates cross-plane influence effects which will impact the performance of the system. The aim of this report is to evaluate, in an embedded environment, the performance of a typical symmetric multiprocessing architecture compared to its asymmetric multiprocessing variant, applied on a telecom application.

Keywords: Multi-core, Telecommunications, Planar Pattern, Symmetric Multiprocessing, Asymmetric Multiprocessing.
CR Subject Classification: D.4.1
Evaluation of a Dual-Core SMP and AMP Architecture based on an Embedded Case Study

De Witte, Nico∗1, Vincke, Robbie1, Van Landschoot, Sille1,
Steegmans, Eric2, Boydens, Jeroen†1,2

July 2, 2013

{Nico.Dewitte, Robbie.Vincke, Sille.Vanlandschoot}@khbo.be, [email protected]

1 KHBO, Dept. of Industrial Engineering Science & Technology, Zeedijk 101, B-8400 Oostende
2 KU Leuven, Dept. of Computer Science, Celestijnenlaan 200A, B-3001 Leuven, Belgium
∗N. De Witte, R. Vincke and S. Van Landschoot are scientific staff members at KHBO, funded by IWT-110174.
†J. Boydens is a professor at KHBO and an affiliated researcher at KU Leuven, dept. CS, research group iMinds-Distrinet.
Contents

1 Introduction
  1.1 Embedded Multi-Core
    1.1.1 Types of Multi-Core Architectures
    1.1.2 Embedded Multi-Core Software Development
  1.2 Embedded Telecommunication and Networking Applications
    1.2.1 A Design Pattern Approach
2 Measurement Setup
  2.1 Hardware Setup
    2.1.1 Freescale P2020
    2.1.2 Data Terminal Equipment
  2.2 Software Tools
    2.2.1 Available Tools
    2.2.2 Scripts
3 Experiments
  3.1 Experiment Setup
    3.1.1 Test Setup
    3.1.2 System Setup
    3.1.3 Measured Parameters
  3.2 DTE Benchmark
    3.2.1 Description
    3.2.2 Measurements
  3.3 Dual-Core SMP
    3.3.1 Description
    3.3.2 Measurements
  3.4 Dual-Core AMP
    3.4.1 Singlecore
    3.4.2 Dividing the Hardware
    3.4.3 Configuring the Linux Kernel for AMP
    3.4.4 Booting the AMP System
    3.4.5 Measurements
4 Comparison
  4.1 Summarized Measurements
  4.2 Interpretation of Results
    4.2.1 SMP
    4.2.2 AMP
5 Future Work
6 Conclusion
1 Introduction
In Section 1 theoretical concepts concerning multi-core symmetric and asymmetric multiprocessing systems in an embedded environment are discussed. Section 2 describes the software and hardware used in the different measurement setups. Next, all experiment setups are explained in depth and all test results are listed in Section 3. In Section 4 a comparison of the different setups and their results is made. Section 5 gives an overview of possible future experiments. Last, a conclusion is formulated in Section 6 based on our test results.

In Section 1.1 the incentive for embedded multi-core processors is briefly described, followed by a general introduction to telecommunication system architectures in Section 1.2.
1.1 Embedded Multi-Core
Moore's Law [1] (shown in Figure 1), which states that every 18 months the number of transistors on a single chip doubles, is reaching its physical limits for singlecore processors. The transition to multi-core processors in high performance computing and personal computing is already a fact. Recently, we have seen a similar transition in embedded systems. The increasing demand for better performance and more functionality forces embedded processors to a multi-core environment. Nevertheless, multi-core software is still difficult to write and requires a thoughtful approach. The lack of tool support and language features hinders rapid development of efficient parallel software.
[Figure 1 plots the number of transistors N (10^3 to 10^10, log scale) against time (1970 to beyond 2010), following the trend N = 2300 × 2^((year − 1971)/2), with the 8080, 80286, Pentium, Pentium 4 and Core i7 marked, and the physical limit separating the single-core from the multi-core era.]

Figure 1: Moore's Law
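The transistor-count trend plotted in Figure 1 can be evaluated directly. A minimal sketch (function name hypothetical), assuming the commonly cited fit N = 2300 · 2^((year − 1971)/2), i.e. the Intel 4004's 2300 transistors in 1971 doubling every two years:

```python
def transistor_count(year, baseline=2300, base_year=1971, doubling_years=2):
    """Transistor-count trend: N doubles every `doubling_years` years."""
    return baseline * 2 ** ((year - base_year) / doubling_years)
```

Under these assumptions the fit gives 4600 transistors in 1973 and roughly 2.4 million by 1991, in line with the curve in the figure.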
As stated by Amdahl's Law [2] (shown in Figure 2), the speedup achieved by increasing the number of processor cores can benefit the application immensely. However, the maximum achievable speedup is limited by the ratio of sequential to parallel code. Therefore, it is important to get the most out of potential concurrency.
[Figure 2 plots the speedup S (0 to 20) against the number of processors (1 to 65536, log scale) for parallel portions P of 50%, 75%, 90% and 95%, with each curve levelling off at 1/(1 − P).]

Figure 2: Amdahl's Law
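Amdahl's bound shown in Figure 2 follows directly from the sequential/parallel split. A small sketch (function name hypothetical) of S = 1 / ((1 − P) + P/n):

```python
def amdahl_speedup(parallel_portion, cores):
    """Amdahl's Law: S = 1 / ((1 - P) + P / n)."""
    sequential = 1.0 - parallel_portion
    return 1.0 / (sequential + parallel_portion / cores)
```

For P = 95% the speedup approaches 20 no matter how many cores are added, which is exactly where the topmost curve in the figure saturates.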
In Section 1.1.1 different types of multi-core systems are briefly described, followed by the current state of the art for multi-core embedded software development in Section 1.1.2.
1.1.1 Types of Multi-Core Architectures
The most commonly used model of multi-core systems is symmetric multiprocessing (SMP). An SMP system consists of multiple processor cores and only one operating system covering all cores. An SMP system is relatively easy to manage since it provides good hardware abstraction and lets the operating system facilitate processor core load balancing. In this SMP model, no special application programming interface (API) is needed to communicate between cores, since all processes and applications run in the same environment.
To ensure process isolation or minimum performance requirements, preventing processes from migrating to another processor core is essential. This type of multi-core model is called asymmetric multiprocessing (AMP). Each processor core runs its own, completely isolated, instance of an operating system and application framework. Other processor cores can run a different instance of the same operating system, a totally different operating system or no operating system at all (this is commonly known as a baremetal or freestanding application). By using this model, singlecore, legacy applications can be reused without any changes. It also provides space for new applications that run in parallel to the existing applications. The applications and processes are thus divided into separate subsystems. When these subsystems require a communication channel, inter-core communication (ICC) can be used. Inter-core communication in an AMP system is based on techniques such as the message passing paradigm or a specific communication API (e.g. MCAPI [3]).
1.1.2 Embedded Multi-Core Software Development
Developing software for multi-core systems is still a difficult and error-prone task. Software executing in an SMP environment must be written carefully to avoid concurrency problems. The main pitfalls are race conditions and shared resources. Race conditions occur when multiple threads of execution are performing operations in a non-fixed order. By performing the same operations in a different order, non-deterministic results can occur. Shared resources can cause non-deterministic behavior by concurrent access to the same data. Therefore, locking mechanisms such as mutexes or semaphores should be used to serialize access to shared resources.
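The shared-resource pitfall can be illustrated with a minimal, hypothetical sketch (not taken from the case study): four threads increment one shared counter, and the lock serializes the read-modify-write so no update is lost.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    """Add n to the shared counter, one step at a time, under the lock."""
    global counter
    for _ in range(n):
        with lock:  # serialize access to the shared resource
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock the final count is deterministically 400,000; without it, interleaved read-modify-write sequences can silently drop increments.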
In high-performance computing (HPC), software mechanisms such as design patterns are migrated to a multi-core environment. Design patterns describe good solutions to common and recurring problems [4]. Parallel programming patterns differ from regular design patterns, as described in Gamma et al. [4], in their way of being structured in abstraction levels, ranging from design patterns for low-level implementation techniques to design patterns for high-level architectural models. We propose a layered structure (Figure 3) for parallel embedded software based on Our Pattern Language (OPL) [5], which is also used for high-performance computing applications. The Finding Concurrency design space [6] contains the first patterns, which decompose the problem into parallel execution units. Next come patterns which describe the overall parallel architecture and software structure. Then there are patterns which introduce typical parallel algorithm structures. Finally, patterns which describe the parallel implementation mechanisms [6] are provided. Each level in this hierarchical structure can interact with the so-called Optimization Design Space [7]. This is a collection of application-specific optimizations in order to obtain optimal parallel execution. The lowest level in the hierarchy, the implementation strategy patterns, can use additional supporting structures, such as mutexes or semaphores.
Figure 3: Parallel Design Pattern Hierarchy
1.2 Embedded Telecommunication and Networking Applications

Telecommunication systems are built using different subsystems. Each subsystem has its own processing requirements. The first one is called the management subsystem. Its function is general management of the system (handling alarms, counters, etc.). Processing requirements on the management subsystem are rather minimal. Small delays in the execution of management functions are not critical for healthy system operation. The second subsystem is called the control subsystem. Its main function is processing of the communication channel (for example codec negotiation). Performance requirements for this subsystem are more demanding. The last subsystem is called the user or data subsystem. This subsystem is exclusively used for data processing and forwarding. Processing requirements are very high for the user subsystem, since high throughput and low latency are required.
1.2.1 A Design Pattern Approach
When following the top-down design pattern hierarchy proposed in Section 1.1.2, the first task is to determine potential concurrently executing tasks. Since a networking application contains independent tasks, it is clear that, for example, data subsystem tasks can be executed in parallel to control or management tasks.
[Figure 4 shows the three planes: the management plane (device configuration, network availability monitoring, alarm handling, counter management), the control plane (user plane initialization, user plane configuration, user plane monitoring, management of the data channel) and the user plane (data processing, data forwarding).]

Figure 4: Telecommunication Network Planes
Since a typical networking application consists of different subsystems, as explained in Section 1.2, the Planar Pattern is applied (as depicted in Figure 4). The Planar Pattern can be applied on systems divisible into self-contained pieces that have significantly different processing requirements [8] and that are, on a lower level, based on Task Decomposition [6]. For dual-core systems each processor core can run data plane or control plane applications as singlecore applications. Since the software is not running in parallel on the application level, no Algorithm Strategy or Implementation Strategy patterns are used.
2 Measurement Setup
This section provides general information about the measurement setup. Section 2.1 provides an overview of the used hardware components and Section 2.2 describes the used network performance tools and automation scripts.
2.1 Hardware Setup
The used hardware consists of a Freescale P2020 communication processor (Section 2.1.1), which is the device under test (DUT), supported by data terminal equipment (DTE) (Section 2.1.2) functioning as data transfer source and sink.
[Figure 5: DTE 1 (data source) → Freescale P2020 (device under test) → DTE 2 (data sink)]

Figure 5: Hardware Overview
2.1.1 Freescale P2020
The Freescale P2020 system-on-chip is a dual-core communications processor of the QorIQ family. The two processor cores are Power Architecture e500v2 CPUs, each having their own L1 instruction and data cache. Both cores also have a shared L2 cache of 512 KB at their disposal.
2.1.2 Data Terminal Equipment
The tests performed with the P2020 mainly consist of a data stream in the form of UDP packets sent from a generator to a receiver with the P2020 in between, as illustrated in Figure 5. This setup validates the user plane performance capabilities of the P2020. When building a test setup like this, it is very important to make sure that the data terminal equipment (DTE) connected to the device under test (DUT) behaves deterministically. Every test should be reproducible and should return similar results. Below we describe the most important aspects that should be considered when configuring the data terminal equipment devices. For ease of use and maximum configurability all DTE devices run the Ubuntu Linux operating system.
Power Management One operating system option that can influence the test results is power management. When conducting high performance tests, where accurate timing is of the essence, all possible forms of power management should be disabled. Entering or leaving a sleep state is time consuming. Even small events such as screen dimming or a screen saver can influence the outcome of heavy duty performance tests.
CPU Dynamic Frequency Scaling Every form of dynamic frequency scaling should be disabled on the DTEs. These features make the tests non-deterministic. Intel technologies such as SpeedStep and Turbo Boost allow the operating system to increase the CPU core frequency when needed. The opposite is also possible: when certain cores consume too much power or their core temperature gets too high, the system might decide to lower the core frequency. These options can be disabled in the BIOS.
[Figure 6 traces a received packet through three stages: the NIC and device driver (hardware DMA into the ring buffer), the kernel protocol stack (softirq, IP processing, TCP/UDP processing, socket receive buffer) and the data receiving process (system call, process scheduler, application).]

Figure 6: Linux Networking Subsystem: Packet Receiving Process [9].
Buffers The trip that a packet takes from the wire to the actual receiver application can be divided into three major stages [9], as shown in Figure 6. First, the packet is received by the network interface card (NIC). This NIC cooperates with its device driver to transfer the frame to a memory buffer called the ring buffer. The maximum size of this buffer is limited by the NIC and device driver. However, the default size of the buffer is not always configured to be equal to the maximum size. Hence, the size of this buffer should definitely be taken into account when trying to increase the network throughput.
Once a packet is copied into main memory, the CPU is informed through an interrupt mechanism. In a second stage the soft interrupt request (softirq) [10] is serviced by the CPU and the packet is transferred from the ring buffer to a socket buffer. Every socket, created in kernel or user space, is bound to a send and a receive socket buffer. Linux uses these buffers to store, manipulate and hand over packets from kernel space to user space [9]. When exchanging large amounts of data, these buffers should be enlarged to accommodate the higher throughput. If not, packets will be dropped when the buffers overflow. The size of the socket buffers can be configured easily and independently of the NIC device driver. In a last stage the end application can retrieve the packets from the corresponding socket buffers through a series of system calls.

The main goal of the ring buffers and socket buffers is to compensate for the difference between the generation rate and consumption rate of network packets. However, care must be taken when configuring the size of the buffers. Small buffers correspond to small delays but cannot match high throughput rates. On the other hand, huge buffers can increase throughput but they will also increase network delays and might not be feasible when considering embedded systems with limited main memory. Socket buffers are generally only limited in size by available main memory; ring buffers on the other hand are limited in size by the NIC and the NIC device driver.
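The socket buffer sizes can also be requested per socket from user space. A minimal sketch (function name hypothetical) using `setsockopt`; note that Linux caps the granted size at the `net.core.rmem_max` / `net.core.wmem_max` sysctl limits and typically reports back double the requested value for kernel bookkeeping:

```python
import socket

def set_udp_buffers(sock, size):
    """Request send/receive buffer sizes; return what the kernel actually granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
granted_rcv, granted_snd = set_udp_buffers(sock, 1 << 20)  # ask for 1 MiB each way
sock.close()
```

Comparing the granted values against the requested size is a quick way to detect that the system-wide sysctl limits still need raising.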
Priority Scheduling The Linux kernel provides three different scheduling policies [10], namely SCHED_OTHER, SCHED_FIFO (First In First Out) and SCHED_RR (Round Robin). The first is the normal non-real-time policy while the latter two are considered real-time scheduling policies. SCHED_OTHER implements a scheduling mechanism based on a priority value called niceness and assigns time slices to invoked processes. The niceness value ranges from -20 to +19, where high priority processes run with a lower niceness value. Indirectly the niceness value determines the size of the time slice given to a running process. The Linux operating system is pre-emptive, meaning that when a process uses up its CPU time slice it is pre-empted and another process is invoked. On the other hand, when a process becomes runnable and it has a higher priority than the currently executing process, the scheduler may invoke the new process, or another one with an even higher priority.

The real-time scheduling policies (SCHED_FIFO and SCHED_RR) use a different priority value, which ranges from 0 to 99, where a high value indicates a high priority. It is important to note that a SCHED_FIFO or SCHED_RR task will always be scheduled over a SCHED_OTHER task. While SCHED_RR still uses time slices, which can cause a process to be pre-empted, SCHED_FIFO does not. This means a SCHED_FIFO task can run indefinitely and hog all the CPU time, leading to a convoy effect on the remaining processes.
In the case of the P2020 test setup, the packet generation and receiver tools will always be scheduled with a raised priority in SCHED_FIFO to make sure that these processes have a full core available on the DTE devices, which are dual-core systems. The rescheduling of these processes is achieved using a simple program written in C++, for which the scheduleFIFO function is shown in Listing 1.
#include <sched.h>
#include <cstring>
#include <cstdlib>

/**
 * Schedule the process with the given ID as FIFO.
 * Also set the priority of the process.
 * @pid   The process ID
 * @prior The priority
 */
void scheduleFIFO(pid_t pid, int prior) {
    // Priorities
    struct sched_param sch_param;
    memset(&sch_param, 0, sizeof(sch_param));
    sch_param.sched_priority = prior;
    // Check the return value explicitly; wrapping the call itself in
    // assert() would compile the call away in an NDEBUG build.
    if (sched_setscheduler(pid, SCHED_FIFO, &sch_param) != 0) {
        abort();
    }
}

Listing 1: Scheduling function in C++.
Jumbo Frame Support Standard Ethernet NICs used to support maximum transmission units (MTUs) of up to 1500 bytes. With the introduction of 1 Gbit and 10 Gbit Ethernet, more NICs started to support bigger Ethernet frames with MTUs up to 9000 bytes, also known as Jumbo Frames [11]. Since the DUT supports Jumbo Frames, it was important to select DTEs with the same support. The DTEs have a built-in Intel 825xx Gigabit NIC. It has to be noted that on Linux the MTU of a network interface is configured to 1500 bytes by default. When conducting tests with larger frames, this should be changed to the appropriate size using ifconfig or the ip tool.
Interrupt Coalescing When interrupt coalescing is enabled, several frames are buffered before an interrupt is generated. This can increase the throughput of the system. On the other hand, the buffering of frames will introduce extra delay and may even result in bursts within the packet stream. When conducting high speed network performance tests with interrupt coalescing (mitigation) disabled, every received or transmitted frame completion generates an interrupt towards the processor. The servicing of these interrupts creates a lot of overhead for the operating system and may become a bottleneck. So care must be taken when configuring the interrupt coalescing parameters. These parameters can be configured using Ethtool. At first glance it seemed that the Intel NICs only supported interrupt coalescing for received frames and not for sent frames. However, the datasheet [12] does confirm that the Intel 825xx Gigabit NIC supports interrupt coalescing for received as well as transmitted frames. This leads us to the conclusion that interrupt coalescing is hardware supported but not software configurable.
Since Linux kernel 2.5/2.6 a similar mechanism is part of the network stack, called "New API" or NAPI. It was designed to improve network performance by, among other techniques, interrupt mitigation and packet throttling [13].
2.2 Software Tools
A lot of network performance measuring tools are available. In Section 2.2.1 the most valuable tools are described briefly. Section 2.2.2 gives an overview of the used automation scripts.
2.2.1 Available Tools
Iperf Iperf [14] is a popular and broadly used tool for testing network performance. It is capable of generating UDP and TCP packets at very high rates and can measure bitrate and packet rate. Like most network performance tools, Iperf consists of a packet generator (the client) and a receiver (the server). One of the major advantages of Iperf is the fact that it can be run over any network and be hosted on Linux as well as Windows or Mac. Iperf is also able to generate multiple parallel data streams from one client to one server. One disadvantage of Iperf is the need for a return path for reporting. The Iperf client first makes a connection with the Iperf server at the start of a test. When the test is finished, the server sends back a report to the client.
NTools NTools [15] is similar to Iperf but has the advantage of not needing a return path from receiver to generator. On the other hand, the packet generator of NTools is not able to achieve the same packet rate as Iperf, especially for smaller packet sizes (as shown in Figure 7). Another major difference between Iperf and NTools is the configuration of the packet sizes. While Iperf defines the packet size as the number of bytes in the UDP payload, NTools defines it as the full size of the packet. For both these reasons, Iperf was chosen for all performance measurements.
[Figure 7 plots the packet rate (0 to 300,000 packets/s) against the payload size (0 to 10,000 bytes) for Iperf and NTools, with Iperf reaching considerably higher packet rates at small payload sizes.]

Figure 7: Iperf versus NTools
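The difference between the two size definitions is the protocol overhead. A small sketch (names hypothetical; header sizes assume untagged IPv4 over Ethernet) converting an Iperf payload size to the full on-wire frame size:

```python
# Header overhead on the wire (untagged IPv4 over Ethernet)
UDP_HEADER = 8
IPV4_HEADER = 20
ETH_HEADER = 14
ETH_FCS = 4

def frame_size(udp_payload):
    """On-wire Ethernet frame size for an Iperf-style UDP payload size."""
    return udp_payload + UDP_HEADER + IPV4_HEADER + ETH_HEADER + ETH_FCS
```

Under these assumptions a 1472-byte Iperf payload exactly fills a standard 1500-byte MTU (1472 + 8 + 20) and produces a 1518-byte frame including the FCS.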
Other tools include:
• thrulay, a network capacity tester which measures capacity, delay and other performance metrics of a network by sending a bulk TCP or UDP stream over it.
• Network Security Toolkit (NST), a bootable ISO live CD based onFedora. This toolkit contains a range of network security and performanceanalyzing tools.
• packETH, a packet generator tool for Ethernet. packETH allows youto create and send any possible packet or sequence of packets.
• pktgen, a high-performance testing tool included in the Linux kernel.
These tools were evaluated but never used in the actual test setup and will not be mentioned further.
11
2.2.2 Scripts
When testing a system, some aspects such as comparable test results and repeatability are mandatory. Hence, automated tests were introduced at the start of the case study. Automated tests also provide a way to minimize test setup errors, lower the necessary human interaction and allow for fast test result analysis.
For this study we decided to create test scripts written in Python [16], a free programming language which has been ported to several operating systems and can be cross-compiled for nearly every Linux distribution. In a first phase the tests were separate, stand-alone scripts, which required a lot of work afterwards to combine the results of the different tests. This first approach was replaced by a client-server model (shown in Figure 8), which allows for fully automated tests, linking all results and merging them at the end.
[Figure 8 shows the client-server Python scripts: on the generator DTE a Python TestClient drives the packet generator and acts as management requester; Python TestServers run on the P2020 (monitoring, answering management requests) and on the sink DTE (packet sink); the UDP stream flows from the generator DTE through the P2020 to the sink DTE.]

Figure 8: Client-Server Python Scripts
While certain statistics, like packet rate and bitrate, need to be measured on the DTEs, specifically the endpoint DTE, other parameters need to be monitored on the DUT, for example CPU usage. In a first approach, Secure Shell (SSH) was used to remotely start monitoring software, such as mpstat, and retrieve the results. However, this approach was inadequate when the DUT was heavily stressed and started to deny and drop SSH connections. At this stage we decided to run a Python server script on the DUT. This did require the cross-compilation of Python for the PowerPC architecture, which was less straightforward than most other cross-compilations, since Python's libraries need to be compiled with Python itself. It should also be taken into account that the full Python environment (compiler and libraries) takes up approximately 56 megabytes of space. This amount of storage was not available on the system used for the case study, so the Python environment was placed on an SD card which was mounted at system startup.
Even with a script running on the target system for monitoring CPU usage using mpstat, tests failed. In rare cases, when CPU utilization was high, the mpstat process did not exit successfully and kept running as a zombie process. The exact reason for this failure is still future work. Instead of using mpstat, the CPU usage is directly retrieved from the kernel statistics file found in /proc/stat. This approach proved to be the best so far.
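The /proc/stat approach can be sketched as follows (function name hypothetical): CPU usage follows from the difference between two samples of the cumulative `cpu` line, whose fields are jiffy counters.

```python
def cpu_busy_fraction(sample1, sample2):
    """Busy fraction of the CPU between two 'cpu ...' lines from /proc/stat.

    The fields after the label are cumulative jiffies:
    user nice system idle iowait irq softirq [steal ...]
    """
    def totals(line):
        values = [int(field) for field in line.split()[1:]]
        idle = values[3] + values[4]  # idle + iowait count as idle time
        return sum(values), idle

    total1, idle1 = totals(sample1)
    total2, idle2 = totals(sample2)
    delta_total = total2 - total1
    delta_idle = idle2 - idle1
    return (delta_total - delta_idle) / delta_total
```

In practice the monitoring script reads the file twice with a fixed interval in between and reports the busy fraction per sampling period.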
3 Experiments
In order to compare different system setups, several experiments are conducted. In Section 3.1 the general setup of such an experiment is described. In Section 3.2 the data terminal equipment is benchmarked by directly connecting both DTEs to each other. Next, in Section 3.3 the DUT (P2020) is inserted between the DTEs with SMP Linux installed. Finally, in Section 3.4 the SMP system is migrated to an AMP system.
3.1 Experiment Setup
In Section 3.1.1 different parameters concerning the test setup are described. In Section 3.1.2 all system setups to be evaluated are described. Last, the measured parameters are discussed in Section 3.1.3.
3.1.1 Test Setup
For all future tests both DTEs are configured according to the guidelines given in Section 2.1.2:

• Power management is disabled.
• Dynamic CPU frequency adjusting is disabled.
• Send and receive socket buffers are configured to a size of 20 MB.
• The Iperf generator is launched with a raised priority of 10 and scheduled using the SCHED_FIFO mechanism.
• Interfaces are configured to support MTUs up to 9000 bytes (Jumbo Frames).
Source/Sink Benchmark Before the DUT can be added to a test setup, it is necessary to analyze the performance of the test control devices when directly connected to each other. This is done by connecting the packet generator data terminal equipment (DTE) directly to the packet receiver DTE, as shown in Figure 9. This gives a benchmark of the throughput the test setup is capable of.
[Figure 9: the generator DTE (eth0, 10.0.0.1/16, Python TestClient with Iperf generator) is connected directly to the receiver DTE (eth0, 10.0.1.1/16, Python TestServer with Iperf receiver) by a UDP stream.]

Figure 9: Benchmark Test Setup
Kernel Space Routing A Linux system can be configured to enable forwarding of IP packets. This turns the device into a router, allowing it to receive packets which are not destined for the host itself and to make forwarding decisions based upon the destination IP address. This IP forwarding functionality is fully embedded in the Linux kernel and is part of the kernel TCP/IP protocol stack [17].
[Figure 10 sketches the Linux message traffic path: a packet arriving at a device is handed to the IP network layer, where it is either dropped, delivered through the transport layer and socket to a local application (internal traffic), or, if forwarding is enabled, routed to an egress device after a route lookup (external traffic); locally generated traffic descends from the application through socket, transport and network layers to the device, which transmits the packet.]

Figure 10: Abstraction of the Linux Message Traffic Path [18]
As shown in Figure 10, three actions can be taken when a received packet has been handed down to the IP network layer. First, the decision can be made to drop the packet altogether when certain checks fail or a rule exists to drop the packet. Secondly, the packet can be handed to the appropriate transport layer service if the packet has a destination IP address corresponding to one of the interfaces of the host itself. Finally, a packet can be forwarded out towards the destination host. The egress interface is hereby chosen based upon the routes in the routing table.
By placing the DUT between two DTE hosts, with the DUT configured to forward IP packets¹, the DUT can be analyzed in a situation where most of the packets (those forwarded) are handled by the kernel TCP/IP protocol stack and never reach user space.
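The forwarding flag lives in procfs. A minimal sketch (function name hypothetical) that checks it; reading is unprivileged, while enabling forwarding (writing "1" to the same file, or `sysctl -w net.ipv4.ip_forward=1`) requires root:

```python
def ip_forwarding_enabled(path="/proc/sys/net/ipv4/ip_forward"):
    """True if the kernel is configured to forward IPv4 packets."""
    with open(path) as flag_file:
        return flag_file.read().strip() == "1"
```

Checking this flag before a routing test run is a cheap way to catch a misconfigured DUT.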
The results of these tests are used as a reference for other test setup scenarios and give an idea of the performance capabilities of the DUT. Figure 11 gives a detailed view of the test setup used for the kernel space routing test. As can be seen in the figure, UDP packets are generated by the DTE on the left-hand side, routed by the P2020 DUT and eventually delivered to the right-hand side DTE.

¹Of course both Ethernet interfaces of the DUT should be configured with different subnets.
[Figure 11: the generator DTE (eth0, 10.0.0.1/24, Python TestClient packet generator) sends a UDP stream to the P2020 DUT (data1, 10.0.0.254/24; data2, 10.0.1.254/24; mgmt1, 172.21.0.200/16), which routes it to the receiver DTE (eth0, 10.0.1.1/24, Python TestServer packet receiver); a management requester (eth0, 172.21.0.201/16) exchanges requests and responses with the DUT.]

Figure 11: Kernel Space Routing Test Setup
User Space Passthrough The routing test through kernel space gives a good idea of what the DUT is able to handle in terms of packet rate and bitrate. In this case packet processing is minimal, since the payload data does not leave kernel space. When the data needs to be processed by a user application, the packet data needs to be moved to user space before it can be accessed. To emulate this behaviour a copy application was created in C, which copies received data from a datagram source socket to a datagram destination socket through user space. All UDP packets processed by the program are redirected to the destination host (DTE sink).
For simplicity and the highest possible performance, this user space passthrough application does no filtering and just copies packets from an incoming interface to an outgoing interface (after minimal processing).
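The core of such a passthrough can be sketched as below. This is a reconstruction, not the original program: function names (`copy_one`, `demo`) are ours, and the self-test runs over loopback with ephemeral ports instead of the DUT's two Ethernet interfaces; the real application would bind to the ingress interface and run the copy in an endless loop.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy one datagram from socket 'in' to address 'dst' via socket 'out'.
   Returns the number of bytes copied, or -1 on error.  The real
   application repeats this forever. */
static ssize_t copy_one(int in, int out, const struct sockaddr_in *dst)
{
    char buf[9000];                       /* large enough for jumbo payloads */
    ssize_t n = recv(in, buf, sizeof buf, 0);
    if (n < 0)
        return -1;
    if (sendto(out, buf, (size_t)n, 0,
               (const struct sockaddr *)dst, sizeof *dst) < 0)
        return -1;
    return n;
}

/* Loopback self-test standing in for the DUT's two interfaces:
   inject one datagram, pass it through user space, and return the
   number of bytes that arrive at the sink socket. */
ssize_t demo(void)
{
    int in = socket(AF_INET, SOCK_DGRAM, 0);
    int out = socket(AF_INET, SOCK_DGRAM, 0);
    int sink = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in a = {0}, b = {0};
    socklen_t alen = sizeof a, blen = sizeof b;
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(in, (struct sockaddr *)&a, sizeof a);      /* ephemeral port */
    getsockname(in, (struct sockaddr *)&a, &alen);

    b = a;
    b.sin_port = 0;
    bind(sink, (struct sockaddr *)&b, sizeof b);
    getsockname(sink, (struct sockaddr *)&b, &blen);

    const char msg[] = "payload";                   /* 8 bytes incl. '\0' */
    sendto(out, msg, sizeof msg, 0, (struct sockaddr *)&a, sizeof a);

    copy_one(in, out, &b);                          /* the user space hop */

    char rx[64];
    ssize_t m = recv(sink, rx, sizeof rx, 0);
    close(in); close(out); close(sink);
    return m;
}
```

Every packet thus crosses the kernel/user boundary twice (one `recv`, one `sendto`), which is exactly the extra cost this test setup is meant to expose compared to kernel space routing.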
Since this application only copies packets from one interface to another, Iperf could not send status reports from server to client. This problem was solved by configuring a return path via Ethernet over Firewire. This did however require both DTEs to have reverse path filtering disabled, which allows packets to be received on an interface that would not have been chosen as egress interface when sending a packet to the other host. In other words, by default Linux drops packets that are received on an interface on which it did not expect to receive them. Reverse filtering can be disabled using the sysctl commands shown in Listing 2 and by adding a route to the routing table of the end DTE.
sudo sysctl -w net.ipv4.conf.eth0.rp_filter=0
sudo sysctl -w net.ipv4.conf.default.rp_filter=0
sudo sysctl -w net.ipv4.conf.all.rp_filter=0
sudo sysctl -w net.ipv4.conf.firewire0.rp_filter=0
Listing 2: Disabling Reverse Filtering on Both DTEs.
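The extra route mentioned above could look as follows; the exact prefix and next hop are an assumption based on the addressing in the test setup, not taken from the original configuration:

```shell
# On the receiving DTE: send traffic for the generator's address back
# over the Firewire link instead of the Ethernet path (illustrative
# addresses; adapt to the actual setup).
sudo ip route add 10.0.0.1/32 via 10.0.2.1 dev firewire0
```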
This leads to the test setup depicted in Figure 12.
[Figure 12: User Space Passthrough Test Setup. As in Figure 11, but the DUT interfaces are eth0 (10.0.0.254/24, ingress), eth2 (10.0.1.254/24, egress) and eth1 (172.21.0.200/16, management); the Iperf response travels back over a Firewire link, from firewire0 (10.0.2.2/24) on the receiving DTE to firewire0 (10.0.2.1/24) on the generating DTE.]
Control/Management Plane Response Time The kernel space routing and the user space passthrough applications test the processing performance of the user plane. To check for cross-plane influence between the user plane and the control/management plane, a computationally intensive CGI script is used. This script emulates a control/management plane application. Measuring its response time against both a stressed and an unstressed user plane setup will show whether there is any influence between the different planes.
This way, both in Figure 11 and in Figure 12, the horizontal axis represents the user plane and the vertical axis represents the control/management plane.
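The response time of such a CGI request can be measured from the management requester with curl; the host address matches the management interface in the test setup, but the script path here is purely an assumed name:

```shell
# Measure the total response time of the management plane CGI script.
# /cgi-bin/fft is a hypothetical path for the FFT script.
curl -s -o /dev/null -w '%{time_total}\n' http://172.21.0.200/cgi-bin/fft
```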
3.1.2 System Setup
The telecommunication system used is based on an existing SMP configuration as depicted in Figure 13. In this setup the user, control and management applications run alongside each other on a dual-core CPU. The first step to move from an SMP to an AMP system is to create a singlecore system. This allows for verification that the hardware description of the system is correct. Once this system is bootable, the next logical step is dividing the available hardware components between the two cores. Since only two cores are available on the platform used, we decided to merge the control and management applications onto one core. The user plane applications are hosted on their own dedicated core. This setup allows for maximal user plane throughput.
Note: this program is written in C and determines the Fast Fourier Transform (FFT) of a discrete number of data samples. It is made available as a CGI script through the lighttpd web daemon after cross-compiling it to the PowerPC architecture.
[Figure 13: System Setup. Left: SMP, with one OS spanning core 0 and core 1 and running processes P1 and P2. Middle: hardware-based singlecore, with the OS (and P1, P2) on core 0 only. Right: AMP, with OS0 on core 0 and OS1 on core 1, and P1 and P2 divided between them.]
3.1.3 Measured Parameters
The key parameters to monitor in the previously explained test setups (Section 3.1.1) are:
• Packet rate, or the number of packets transferred per second. This is one of the main parameters monitored in these test setups. Most of the test results are compared based on the achieved packet rate, which is a direct indication of the performance of the user plane.
• CPU and core usage, which are monitored on the DUT and give an indication of the current system load.
• Bitrate, or the number of bits that the system is transporting per second. This parameter is measured, unless indicated otherwise, on the second layer of the OSI model and represents the number of Ethernet frame bits transported per second.
• Delay and jitter. Delay indicates how much time it takes for a packet to travel through the P2020 system. Jitter indicates how much the delay deviates from the average delay.
Due to incorrect values reported by Iperf and NTools, neither delay nor jitter could be measured in any of the test setups. Delay and jitter can in fact only be accurately measured using an external hardware system capable of high-precision timestamping. At the time of this report a network analyser capable of measuring delay and jitter was not at our disposal.
3.2 DTE Benchmark
Section 3.2.1 gives a general description of the benchmark tests. In Section 3.2.2 a comparison is made between the performance of the data terminal equipment and the theoretical maximum Ethernet throughput.
3.2.1 Description
In a first test run Iperf is configured to generate a single UDP stream. Payload sizes are increased from 64 bytes up to 9000 bytes (jumbo frames). For this test the packet rate, the bitrate and the CPU load are measured on the DTEs.
A second scenario consists of Iperf generating two UDP streams. By adding a second packet stream, Iperf is able to spawn a second thread, hereby using the full power of the second core of the generator DTE. This configuration pushes the test control devices to their maximum potential.
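These two scenarios can be reproduced with Iperf invocations along the following lines; the destination address comes from the test setup, while the payload size and duration shown are just examples:

```shell
# Single UDP stream towards the receiving DTE: 64-byte payloads,
# unthrottled (target bandwidth 1000 Mb/s), for 60 seconds.
iperf -c 10.0.1.1 -u -b 1000M -l 64 -t 60

# Two parallel UDP streams (-P 2), letting Iperf spawn a second
# thread and load the generator's second core as well.
iperf -c 10.0.1.1 -u -b 1000M -l 64 -t 60 -P 2
```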
3.2.2 Measurements
As illustrated in Figure 14, when generating a single UDP stream the maximum packet rate is approximately 265000 packets per second. As the payload size is increased, the packet rate diminishes and the bitrate increases towards the full bandwidth of 1000 Mb/s (1 Gb/s). As the full bitrate is approached, the packet rate starts to decrease faster while the bitrate stays constant.
[Figure 14: Benchmark with Single Packet Stream. Packet rate [packets/s] and bitrate [Mb/s] versus payload size [bytes].]
The CPU usage for a single UDP stream is shown in Figure 15, which shows that Iperf consumes core 1 completely, while core 0 remains almost idle.
[Figure 15: Core Usage of Benchmark with a Single Packet Stream. Core 0 and core 1 usage [%] versus payload size [bytes].]
Figure 16 shows the resulting total packet rate of two UDP streams, with a maximum of up to 447000 packets per second. It does however indicate that two UDP streams are less stable for smaller payloads when generated at full capacity. This is expected behaviour, since both DTE cores are fully utilized when instructing Iperf to generate a second packet stream, as depicted in Figure 17.
[Figure 16: Single and Dual Stream Benchmark. Packet rate [packets/s] versus payload size [bytes] for 1 and 2 streams.]
[Figure 17: Core Usage of Benchmark with Dual Packet Stream. Core 0 and core 1 usage [%] versus payload size [bytes].]
Figure 18 compares the packet rates of a single and a dual UDP packet stream to the ideal maximum that a 1 Gb Ethernet connection can handle [19]. The ideal maximum is given by Formula (1), where n is the number of payload bits, Rb is the bitrate of the connection, 64 is the combined number of bits for the preamble and frame delimiter, 96 is the interframe gap in bits between two frames, and Rp is the packet rate. As shown in Figure 18, there is still a gap for smaller packet sizes. The packet rate of a system is in most cases limited by factors such as CPU power, interrupt handling rates and buffer sizes (software as well as hardware buffers). Taking the results of Figure 17 into account, we noted that in this case the buffer sizes and interrupt rates were not the limiting factor, but CPU power was. This was expected, since the send and receive buffers had been adapted to allow for maximum throughput.
Rp = Rb / (64 + n + 96)    (1)
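As a quick sanity check on Formula (1), the helper below evaluates it (a sketch; the function name is ours). For a 64-byte frame (n = 512 bits) at Rb = 1 Gb/s it yields 10^9 / 672, approximately 1.49 million packets per second, matching the left end of the theoretical curve in Figure 18.

```c
/* Theoretical maximum packet rate according to Formula (1):
   64 bits of preamble + frame delimiter, n payload bits, and a
   96-bit interframe gap are transmitted per packet. */
double max_packet_rate(double n_payload_bits, double bitrate)
{
    return bitrate / (64.0 + n_payload_bits + 96.0);
}
```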
[Figure 18: Packet Rate of Single and Dual Stream Benchmark versus Ideal Maximum. Packet rate [packets/s] versus payload size [bytes] for the theoretical maximum, 1 stream and 2 streams.]
To conclude the test results of the benchmark setup, the bitrate comparison between a single and a double packet stream is presented in Figure 19. As expected, the bitrate for smaller payload sizes increases when a second UDP stream is introduced. Note that the maximum bitrate imposed by the network cannot be exceeded.
[Figure 19: Bitrate of Single versus Dual Stream Benchmark. Bitrate [Mb/s] versus payload size [bytes] for 1 and 2 streams.]
3.3 Dual-Core SMP
In the case of the dual-core SMP experiment all hardware is enabled and assigned to a single operating system, including both cores, as shown in Figure 20. This means that the operating system can use both cores and schedule processes accordingly.
[Figure 20: Dual-Core SMP System. One OS spans core 0 and core 1 and hosts the kernel space routing, user space passthrough, CGI script and other processes. The UDP stream between the two DTEs (python TestClient packet generator and python TestServer packet receiver) forms the user plane; the request/response exchange with the management requester forms the control/management plane.]
In the following sections some important aspects, such as interrupt affinity, are discussed and analyzed. Next the kernel space routing tests are described, and finally the user space passthrough tests, both with and without the management plane application requests.
3.3.1 Description
An important aspect when working with a communication system is interrupt affinity. In many cases interrupts are delivered by the PIC (Programmable Interrupt Controller) to core 0 only [20]. This is not a problem for desktop systems, but it can create a bottleneck for dedicated embedded communication systems. The number of interrupts received by a CPU is already dramatically reduced by mechanisms like NAPI (included in the Linux kernel since version 2.5/2.6) and network card interrupt coalescing. However, the interrupt overhead can still cause a performance drop when not handled correctly.
For the dual-core SMP case the receive interrupts of the first Ethernet interface are delivered to core 0. The transmit interrupts of the third Ethernet interface are routed to core 1, as shown in Listing 3. The default affinity can be configured by setting the CPU mask in the kernel-mapped file /proc/irq/default_smp_affinity. Specific interrupt affinities can be configured by modifying the files /proc/irq/xxx/smp_affinity, where xxx corresponds to the IRQ numbers displayed in the first column of Listing 3. The operating system does not provide these files when the Ethernet interface is disabled or down, so the Ethernet interfaces have to be configured before their interrupt affinity can be modified.
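This configuration can be sketched as follows; the IRQ numbers are taken from Listing 3, and the values written are CPU bit masks (bit 0 for core 0, bit 1 for core 1):

```shell
# Deliver eth0 receive interrupts (IRQ 30 in Listing 3) to core 0 only
echo 1 > /proc/irq/30/smp_affinity
# Deliver eth2 transmit interrupts (IRQ 31) to core 1 only
echo 2 > /proc/irq/31/smp_affinity
# Let newly requested IRQs default to both cores (mask 0x3)
echo 3 > /proc/irq/default_smp_affinity
```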
> cat /proc/interrupts
           CPU0       CPU1
 19:          0          0   OpenPIC   Level   fsl-lbc
 20:          0          0   OpenPIC   Level   fsldma-chan
 21:          0          0   OpenPIC   Level   fsldma-chan
 22:          0          0   OpenPIC   Level   fsldma-chan
 23:          0          0   OpenPIC   Level   fsldma-chan
 28:          0         23   OpenPIC   Level   ehci_hcd:usb1
 29:          0         36   OpenPIC   Level   eth0_g0_tx
 30:     603486          0   OpenPIC   Level   eth0_g0_rx
 31:          0     667202   OpenPIC   Level   eth2_g0_tx
 32:         46          0   OpenPIC   Level   eth2_g0_rx
 33:          0          0   OpenPIC   Level   eth2_g0_er
 34:          0          0   OpenPIC   Level   eth0_g0_er
 42:       7224          0   OpenPIC   Level   serial
 43:         31          0   OpenPIC   Level   i2c-mpc, i2c-mpc
 68:          0          0   OpenPIC   Level   gianfar_ptp
 72:          0       3407   OpenPIC   Level   mmc0
 76:          0          0   OpenPIC   Level   fsldma-chan
 77:          0          0   OpenPIC   Level   fsldma-chan
 78:          0          0   OpenPIC   Level   fsldma-chan
 79:          0          0   OpenPIC   Level   fsldma-chan
251:          0          0   OpenPIC   Edge    ipi call function
252:       2402       8153   OpenPIC   Edge    ipi reschedule
253:         10          8   OpenPIC   Edge    ipi call function single
LOC:      18011      49619   Local timer interrupts
SPU:          0          0   Spurious interrupts
CNT:          0          0   Performance monitoring interrupts
MCE:          0          0   Machine check exceptions
Listing 3: P2020 Default SMP Interrupt Assignment
The Linux operating system assigns interrupts to cores based on the order in which the interfaces come online, unless manually configured otherwise. We therefore manually configure the interrupt affinity of the hardware components that have a direct correlation with the performance of the system.
3.3.2 Measurements
Ethernet Interrupt Affinity A first kernel space routing test was conducted with Ethernet interrupts handled by both cores: eth0 receive interrupts bound to core 0 and eth2 transmit interrupts assigned to core 1. Next, the same kernel space routing test was run, but this time with all Ethernet interrupts bound to core 0. Figure 21 shows the results of both routing tests.
The packet rate for small payloads is lower when all Ethernet interrupts are assigned to a single core. Once the payload size exceeds 650 bytes, there is no noticeable difference anymore, due to the lower packet rate and consequently lower interrupt rate. Listing 4 shows the resulting /proc/interrupts output after running the second test, which indicates that core 0 handled all interrupt requests regarding the Ethernet interfaces.
[Figure 21: Packet Rates on Dual-Core SMP System with Modified Interrupt Affinity. Packet rate [packets/s] versus payload size [bytes], with interrupt affinity set to both cores versus core 0 only.]
> cat /proc/interrupts
           CPU0       CPU1
 19:          0          0   OpenPIC   Level   fsl-lbc
 20:          0          0   OpenPIC   Level   fsldma-chan
 21:          0          0   OpenPIC   Level   fsldma-chan
 22:          0          0   OpenPIC   Level   fsldma-chan
 23:          0          0   OpenPIC   Level   fsldma-chan
 28:          0         23   OpenPIC   Level   ehci_hcd:usb1
 29:        159          0   OpenPIC   Level   eth0_g0_tx
 30:    5629171          0   OpenPIC   Level   eth0_g0_rx
 31:    4555718          0   OpenPIC   Level   eth2_g0_tx
 32:         43          0   OpenPIC   Level   eth2_g0_rx
 33:          0          0   OpenPIC   Level   eth2_g0_er
 34:          0          0   OpenPIC   Level   eth0_g0_er
 42:       2419          0   OpenPIC   Level   serial
 43:         31          0   OpenPIC   Level   i2c-mpc, i2c-mpc
 68:          0          0   OpenPIC   Level   gianfar_ptp
 72:          0       3063   OpenPIC   Level   mmc0
 76:          1          0   OpenPIC   Level   fsldma-chan
 77:          0          0   OpenPIC   Level   fsldma-chan
 78:          0          0   OpenPIC   Level   fsldma-chan
 79:          0          0   OpenPIC   Level   fsldma-chan
251:          0          0   OpenPIC   Edge    ipi call function
252:       5973       7005   OpenPIC   Edge    ipi reschedule
253:         38          7   OpenPIC   Edge    ipi call function single
LOC:      21326      28773   Local timer interrupts
SPU:          0          0   Spurious interrupts
CNT:          0          0   Performance monitoring interrupts
MCE:          0          0   Machine check exceptions
Listing 4: P2020 Interrupt Assignment to Core 0
From now on, tests are conducted with interrupt affinity set to both cores, unless explicitly indicated otherwise.
Kernel Space Routing In this test scenario the CPU usage of the P2020 was measured while performing the kernel space routing function. From the results shown in Figure 22 we see that core 0 usage is at 60% for 64-byte payloads. As the payload size is increased, core 0 usage descends towards 0%. We expected that the packet receiving processes would fully consume core 0, but this was not the case.
[Figure 22: Dual-Core SMP System Routing. Packet rate [packets/s] and core 0/core 1 usage [%] versus payload size [bytes].]
This can easily be explained by comparing the packet rate of the P2020 to the packet rate of the generator, as depicted in Figure 23. The results show that the P2020 is able to process all packets delivered by the generator (a maximum of 259352 packets per second) without consuming all of its processing power.
[Figure 23: Dual-Core SMP System Routing versus Benchmark. Packet rate [packets/s] and bitrate [Mb/s] versus payload size [bytes], for the routing and benchmark runs.]
We remark that if a system is sending as much data as it is receiving, and the receive and transmit interrupts are assigned to different cores (as in Listing 3), the usage of those two cores should be approximately the same. This is however not the case, as can be seen from the results in Figure 22: receiving a packet is a much more CPU-intensive task than the corresponding sending process [21].
Management/Control Plane Baseline Before analyzing the impact of the control/management plane application on the user plane, a baseline response of the management plane CGI script is given in Figure 24. The minimum response time of the script is 0.75 s on an unstressed system.
[Figure 24: Dual-Core SMP Management Plane Baseline Response. Response time [s] per test run.]
Kernel Space Routing with Management/Control Plane Emulation In the next test run the UDP packet stream is introduced while the CGI script response time is measured. When analyzing the packet rate (depicted in Figure 25) and the bitrate (depicted in Figure 26), we note that the difference between both is small, with a maximum difference percentage of 1.31%. This indicates that the management plane requests have only a small cross-influence on the user plane packet processing.
[Figure 25: Packet Rate of a Dual-Core SMP Routed Packet Stream with Management Plane Requests. Packet rate [packets/s] and percentage difference [%] versus payload size [bytes], with and without the CGI script.]
[Figure 26: Bitrate of a Dual-Core SMP Routed Packet Stream with Management Plane Requests. Bitrate [Mb/s] and percentage difference [%] versus payload size [bytes], with and without the CGI script.]
The packet stream has more impact on the execution time of the CGI script, as illustrated in Figure 27. This results in unstable and higher response times of the management plane. The difference percentage reaches a maximum of nearly 300% and is higher for smaller packet sizes than for larger ones. This can be linked to the core usage when routing packets, which is higher for smaller payloads, as shown in Figure 22. Figure 28 shows the CPU usage with the CGI script requests. Core 0 is almost fully utilized, while core 1 usage has increased by 50% for smaller payloads compared to the kernel space routing test (Figure 22).
[Figure 27: Response Time of Management Plane Script on a Dual-Core SMP System with a Routed Packet Stream. Response time [s], baseline and percentage difference [%] versus payload size [bytes].]
[Figure 28: Core Usage of a Dual-Core SMP Routed Packet Stream with Management Plane Requests. Core 0 and core 1 usage [%] versus payload size [bytes].]
User Space Passthrough The next tests were conducted with the user spacepassthrough application running on the P2020.
In the same way as the kernel space routing test, the user space passthrough test was conducted while measuring the usage of both cores, resulting in Figure 29. As opposed to the routing test, where the packet rate was constant for smaller payloads, this is no longer the case. The packet rate starts at a maximum of 118316 packets per second and decreases with increasing payload size.
Core 0, which handles the Ethernet receive interrupts, has a usage profile that differs from that of the kernel space routing test (Figure 22).
The load on core 1, on the other hand, is constant at 100% for payloads below 1500 bytes, after which it starts to decrease towards 60%. Core 1 handles the Ethernet send interrupts and the user space passthrough application. The application was launched with core affinity set to core 1 and FIFO scheduling with a raised priority level of 10. Note that the packet rate is limited for smaller payloads even though the load on core 0 is not yet at 100%.
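The launch described above can be sketched with standard Linux tools; the binary name ./passthrough is an assumption:

```shell
# Pin the passthrough application to core 1 and run it under the
# SCHED_FIFO real-time policy with priority 10.
taskset -c 1 chrt -f 10 ./passthrough
```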
[Figure 29: Dual-Core SMP System with User Space Passthrough. Packet rate [packets/s] and core 0/core 1 usage [%] versus payload size [bytes].]
User Space Passthrough with Control Plane Emulation Again, as was done with the routing test setup, the user space passthrough test is repeated, but this time with the CGI script response time being measured. Looking at the packet rate, shown in Figure 30, and the bitrate, shown in Figure 31, we conclude that the difference in packet processing between the test setup with and without the management plane requests is small, with a maximum difference percentage of 1.9%.
[Figure 30: Packet Rate of a Dual-Core SMP User Space Passthrough Packet Stream with Management Plane Requests. Packet rate [packets/s] and percentage difference [%] versus payload size [bytes], with and without the CGI script.]
[Figure 31: Bitrate of a Dual-Core SMP User Space Passthrough Packet Stream with Management Plane Requests. Bitrate [Mb/s] and percentage difference [%] versus payload size [bytes], with and without the CGI script.]
The impact of the packet stream on the response time of the management plane application is even worse than with the routing test, as shown in Figure 32. Only for larger payloads is the minimal response time within an acceptable range. The response time is capped at 60 seconds because the test runs are limited to 60 seconds; in other words, in those cases the requests timed out.
[Figure 32: Response Time of Management Plane on a Dual-Core SMP System with a User Space Passthrough Packet Stream. Response time [s], baseline and percentage difference [%] versus payload size [bytes].]
Figure 33 shows the core usage. Core 0 is fully utilized for all payload sizes. Core 1, on the other hand, starts to run more and more idle cycles as the payload size increases.
[Figure 33: Core Usage of a Dual-Core SMP User Space Passthrough Packet Stream with Management Plane Requests. Core 0 and core 1 usage [%] versus payload size [bytes].]
As observable in Figure 33, the addition of the management plane application increases the processor usage. Core 1 usage starts to decrease towards 60% once 1500-byte payloads are reached. Core 0 however stays at 100% for all packet sizes.
3.4 Dual-Core AMP
Since the goal of these experiments is to build an AMP system, we decided to first build a singlecore system, where the operating system runs on one core with all other hardware at its disposal.
To turn the P2020 into a singlecore system we could emulate a system with only one core by manually setting the processor affinity of the user space copy application and binding the interrupts to one specific processor core. Emulating a singlecore hardware platform can be a useful baseline, but it is not actually a singlecore system and would not contribute to the AMP case: the operating system would still be aware of the second core, and a number of processes would still be spawned on it. Another approach is to actually disable the core hardware.
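For completeness, Linux can also take a core out of service at runtime through the CPU hotplug interface; this still leaves the kernel aware of the core's existence, which is exactly the limitation described above. A sketch:

```shell
# Take core 1 offline via CPU hotplug: the scheduler stops using it,
# but the kernel still knows the core exists.
echo 0 > /sys/devices/system/cpu/cpu1/online
grep processor /proc/cpuinfo   # only processor 0 remains listed
```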
In Section 3.4.1 a singlecore system is built by disabling the second CPU core. Next, in Section 3.4.2, the available and needed hardware is divided over the two CPU cores, and the Linux kernel configuration is described in Section 3.4.3.
Once the configuration is finished, the system boot process is elucidated in Section 3.4.4. All AMP test results are discussed in Section 3.4.5.
3.4.1 Singlecore
Disabling a CPU core can be achieved by removing the core from the flat device tree (FDT). The device tree can be seen as a database that represents the hardware components of the system [22]. The actual device tree binary (DTB) is compiled from a device tree source (DTS) file. In the case of the P2020, the CPU description of the original dual-core system is given in Listing 5.
cpus {
    #address-cells = <1>;
    #size-cells = <0>;
    PowerPC,P2020@0 {
        device_type = "cpu";
        reg = <0x0>;
        next-level-cache = <&L2>;
    };
    PowerPC,P2020@1 {
        device_type = "cpu";
        reg = <0x1>;
        next-level-cache = <&L2>;
    };
};
Listing 5: Original P2020 DTS CPU Description for Dual-Core SMP
Disabling the second core by removing it from the DTS file, as shown in Listing 6, results in a singlecore system.
cpus {
    #address-cells = <1>;
    #size-cells = <0>;
    PowerPC,P2020@0 {
        device_type = "cpu";
        reg = <0x0>;
        next-level-cache = <&L2>;
    };
};
Listing 6: Singlecore DTS CPU Description
After compiling the DTS file into the DTB and uploading it to the board, the Linux operating system will only detect a single core, as shown in the output of Listing 7, which can be replicated by executing the ls /proc/device-tree/cpus/ command.
> ls /proc/device-tree/cpus/
#address-cells    #size-cells    PowerPC,P2020@0    name
Listing 7: Linux Detecting Singlecore CPU
Confirmation can be found in /proc/cpuinfo, the standard interface to the kernel's information regarding CPUs, as shown in Listing 8.
> cat /proc/cpuinfo
processor       : 0
cpu             : e500v2
clock           : 1000.000000MHz
revision        : 5.1 (pvr 8021 1051)
bogomips        : 125.00
total bogomips  : 125.00
timebase        : 62500000
platform        : P2020 RDB
model           : fsl,P2020
Memory          : 512 MB
Listing 8: Linux CPU Information
Once this singlecore system is operational, we follow the steps of the next sections and begin building a working dual-core AMP system.
3.4.2 Dividing the Hardware
The board hardware, which was completely available to the operating system in the SMP system, obviously has to be divided over the operating systems of the new AMP system. This means that one operating system will not be able to access the hardware assigned to another. There are however a few exceptions, like the cache controller and the interrupt controller, that are shared between all cores and thus available to all operating systems. So, before we can start distributing hardware over operating systems, we must have a clear view of the system goals and the tasks that need to be performed by each operating system.
In the case of the P2020 the goal is to isolate the user plane from the control/management plane and create a setup as depicted in Figure 34. Core 0 and core 1 each run the same operating system as the one used for the SMP case, but now each core runs its own instance of the operating system with a slightly reconfigured kernel. The operating system running on core 0 is responsible for transferring the user plane data traffic between the two Ethernet interfaces, while the second operating system hosts the control/management plane services.
[Figure 34: Dual-Core AMP System. OS 0 on core 0 hosts the kernel space routing, user space passthrough and other user plane processes; OS 1 on core 1 hosts the CGI script and other control/management plane processes. The UDP stream between the two DTEs forms the user plane; the request/response exchange with the management requester forms the control/management plane.]
Taking the P2020 system as an example, the most important hardware components that need to be divided are listed in Listing 9. The PowerPC,P2020 entries found in the device tree file represent an e500 processor core. The ecm and ecm-law nodes represent the e500 Coherency Module, which maintains coherency between cache and cacheable memory. The memory-controller@ff702000 node represents the 64-bit DDR2/DDR3 SDRAM memory controller with ECC (Error-Correcting Code) support, and l2-cache-controller@ff720000 controls the available level 2 cache. The peripherals reflected by the device tree are serial, gpio-controller, usb, ethernet and sdhci. The serial nodes refer to the two available serial interfaces, while the gpio-controller node refers to the General Purpose Input/Output controller. The usb node and the three ethernet nodes refer respectively to the USB interface and the three Ethernet interfaces available on the board. The sdhci node represents the SD card controller. The pic@ff740000 node refers to the Programmable Interrupt Controller, while msi@ff741600 refers to the Message Signaled Interrupt controller. The global-utilities block controls power management, I/O device enabling, power-on-reset configuration monitoring, general-purpose I/O signal configuration, alternate function selection for multiplexed signals, and clock control.
PowerPC,P2020@0
PowerPC,P2020@1
ecm-law@ff700000
ecm@ff701000
memory-controller@ff702000
serial@ff704500
serial@ff704600
gpio-controller@ff70f000
l2-cache-controller@ff720000
usb@ff722000
ethernet@ff724000
ethernet@ff725000
ethernet@ff726000
sdhci@ff72e000
pic@ff740000
msi@ff741600
global-utilities@ff7e0000
Listing 9: Flattened Device Tree Nodes
Next to the necessary hardware components like a CPU core, memory, Ethernet interfaces, ... there are some other hardware devices that need to be assigned to one of the two instances. First, both systems need a serial interface (serial@ff704500 and serial@ff704600 in Listing 9) for basic system setup and debugging. Next, the data plane system needs the SD card controller (sdhci@ff72e000 in Listing 9) because the Python environment is hosted on an SD card.
Once the necessary hardware has been selected for each core, there is still the matter of configuring the interrupt controller. As opposed to an SMP system, each device tree in an AMP system has to specify the protected interrupt sources. This basically tells the core that it must stay away from that interrupt source and should never try to reset or configure it, as this will be taken care of by one of the other cores. So core 0 (hosting the user plane applications) should have the interrupt vectors of core 1 (hosting the management plane applications) listed as protected sources, as shown in Listing 10, and vice versa for core 1, as shown in Listing 11. Remark that if a hardware component uses multiple interrupt vectors, all of them should be specified as protected sources in the DTS files of the other cores.
mpic: pic@40000 {
    protected-sources = <
        42        /* SERIAL1 */
        35 36 40  /* ENET1 */
        68 69 70  /* PTP */
        3         /* MDIO */
    >;
};
Listing 10: Core 0 (Data Plane) Protected Interrupt Sources
mpic: pic@40000 {
    protected-sources = <
        17                   /* ECM */
        18                   /* MEM */
        42                   /* SERIAL0 */
        47                   /* GPIO */
        16                   /* L2 CACHE */
        72                   /* SDHCI */
        29 30 34             /* ENET0 */
        31 32 33             /* ENET2 */
        0xe0 0xe1 0xe2 0xe3  /* MSI */
        0xe4 0xe5 0xe6 0xe7
    >;
};
Listing 11: Core 1 (Control/Management Plane) Protected Interrupt Sources
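The complement relationship between the two listings can be expressed compactly: each core's protected-sources list is the union of the interrupt vectors of all devices assigned to the other core. A minimal sketch of that derivation, using an illustrative device-to-core assignment rather than the complete P2020 interrupt map:

```python
# Sketch: derive each core's protected-sources list as the union of the
# interrupt vectors of all devices assigned to the *other* core.
# The assignment and vector numbers below are illustrative only.
devices = {
    "serial1": {"core": 1, "vectors": [42]},
    "enet1":   {"core": 1, "vectors": [35, 36, 40]},
    "sdhci":   {"core": 0, "vectors": [72]},
    "enet0":   {"core": 0, "vectors": [29, 30, 34]},
}

def protected_sources(core):
    """Vectors this core must never reset or configure."""
    return sorted(v for dev in devices.values() if dev["core"] != core
                  for v in dev["vectors"])

print(protected_sources(0))  # [35, 36, 40, 42]
```

Generating the lists this way also makes it harder to forget one vector of a multi-vector device, which the text above warns about.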
Once the DTS files have been updated, they should be compiled using a device tree compiler (DTC) and put inside the TFTP (Trivial File Transfer Protocol) directory that will later be used to download the kernel images, device tree binaries and RAM disk images to the U-Boot instance hosted on the P2020 platform.
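As a sketch, the compilation step might look as follows with the standalone dtc tool; the file names and the TFTP root path are assumptions for this setup, not values taken from the report:

```shell
# Compile each core's device tree source into a flattened blob (DTB),
# then stage the blobs in the TFTP root for download by U-Boot.
dtc -I dts -O dtb -o core0.dtb core0.dts
dtc -I dts -O dtb -o core1.dtb core1.dts
cp core0.dtb core1.dtb /var/lib/tftpboot/
```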
3.4.3 Con�guring the Linux Kernel for AMP
Before the AMP operating systems can be configured, we must decide how the available memory will be divided between them. Since the P2020 system used for this case has a total of 512MB of DDR2 RAM, we decided to assign each operating system half of the total memory (256MB). The booting process of the AMP system boots core 1 first. This determines the physical location of the operating system running on core 1, which in this case is the lower region of the second memory chunk, at 0x10000000 (256MB).
The next step consists of configuring both Linux kernels for AMP use, which can be done using the menuconfig utility and the command 'make CROSS_COMPILE=powerpc-linux-gnu- ARCH=powerpc menuconfig'. In this case two copies were made of the SMP kernel source, one for each core.
First of all, both kernels should have "Symmetric multi-processing support" (found under "Processor support") disabled. The image for core 0 is booted last and is placed in the first half of the memory region. It therefore requires no further configuration.
The second core is booted first and has its kernel loaded in the second memory region. To allow this, the kernel needs to be configured accordingly. First, "Prompt for advanced kernel configuration options", which can be found under the "Advanced setup" category, should be enabled. Next, the "physical address where the kernel is loaded" is set to 0x10000000, since core 1 starts from 256MB. The other options are left at their default values.
Once these configurations are applied, the kernels are built and placed inside the TFTP directory.
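Assuming a PowerPC kernel tree of that era, the configuration for the core 1 kernel can be summarized by a .config fragment like the following. This is a sketch corresponding to the menuconfig entries mentioned above, not a verified configuration:

```
# Core 1 kernel: uniprocessor build, loaded at the 256MB boundary
# CONFIG_SMP is not set
CONFIG_ADVANCED_OPTIONS=y
CONFIG_PHYSICAL_START=0x10000000
```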
3.4.4 Booting the AMP System
Next in line is booting the system. This can be accomplished using the boot script given in Listing 12, which was compiled using the mkimage tool included with U-Boot. Basically this script loads all necessary files via TFTP into memory. Next it boots the first kernel image on core 1. After releasing core 1, the second kernel image is booted on core 0. A few seconds later both serial terminals show a login prompt.
tftp 11000000 uImage_1
tftp 12000000 rootfs_1.img
tftp 10c00000 core1.dtb
tftp 1000000 uImage_0
tftp 2000000 rootfs_0.img
tftp c00000 core0.dtb
setenv bootm_low 0x10000000
setenv bootm_size 0x10000000
interrupts off
bootm start 11000000 12000000 10c00000
bootm loados
bootm ramdisk
bootm fdt
fdt boardsetup
fdt chosen $initrd_start $initrd_end
bootm prep
cpu 1 release $bootm_low - $fdtaddr -
setenv bootm_low 0
setenv bootm_size 0x10000000
bootm 1000000 2000000 c00000
Listing 12: AMP Boot Script
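The load addresses in the boot script can be cross-checked against the memory split decided earlier: every image must land inside the 256MB region owned by the core that will run it. A small sanity-check sketch:

```python
# Sketch: verify that each image loaded by the boot script lands inside
# the 256MB memory region owned by the core that will run it.
REGION = {0: (0x00000000, 0x10000000),   # core 0: first 256MB
          1: (0x10000000, 0x20000000)}   # core 1: second 256MB

# Load addresses taken from the boot script: kernel, RAM disk, DTB.
loads = {1: [0x11000000, 0x12000000, 0x10c00000],
         0: [0x01000000, 0x02000000, 0x00c00000]}

def in_region(core, addr):
    lo, hi = REGION[core]
    return lo <= addr < hi

assert all(in_region(core, addr)
           for core, addrs in loads.items() for addr in addrs)
```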
3.4.5 Measurements
Kernel Space Routing The first test conducted with the new AMP system is the kernel space routing test. The operating system running on core 0 (user plane) is instructed to route packets through kernel space from one subnet to another. The operating system running on core 1 (management plane) is left in an idle state. The results of this test are shown in Figure 35. Again the packet rate curve has a form similar to the one already encountered with the SMP system (Figure 22). However, this time the system is not able to process all packets sent by the generator for smaller packet payloads. This can be observed by comparing the processed packet rate to the generated packet rate in Figure 35, which also shows that for smaller payloads the core usage is at 100%.
Figure 35: Dual-Core AMP Kernel Space Routing (chart: packet rate [packets/s] and core 0 usage [%] versus payload size [bytes])
Figure 36: Dual-Core AMP Kernel Space Routing versus Benchmark (chart: routing and benchmark packet rates [packets/s] and bitrates [Mb/s] versus payload size [bytes])
Control Plane Baseline As was done with the previous systems, the response time of the management plane application (CGI script), hosted on the second core, was measured on an unstressed system. These results, depicted in Figure 37, serve as the baseline for further tests. The response time of the script is 0.75s on an unstressed system, which is the same as for the SMP dual-core system.
Figure 37: Dual-Core AMP Management Plane Response Baseline (chart: response time [s] per test run)
Kernel Space Routing with Control Plane Emulation Next, the UDP packet stream is processed by core 0 (user plane) while the response time of the CGI script, hosted on core 1 (management plane), is measured. When analyzing the packet rate (depicted in Figure 38) and the bitrate (depicted in Figure 39), there is a notable difference of about 3.58%. These results seem to provide proof of what is called cross-plane influence: the requests made to the management plane application have an obvious influence on the packet processing performance of the user plane application.
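The difference percentage reported here and plotted in the figures is a simple relative drop of the stressed rate against the unstressed one. A sketch with made-up round numbers (not the measured values):

```python
# Sketch of the percentage-difference metric used in the figures; the
# packet rates below are hypothetical, chosen only for illustration.
def pct_difference(baseline, stressed):
    """Relative drop of the stressed rate versus the unstressed baseline."""
    return (baseline - stressed) / baseline * 100.0

only_routing = 200_000   # hypothetical packets/s, routing only
with_cgi = 192_840       # hypothetical packets/s, with CGI requests

print(f"{pct_difference(only_routing, with_cgi):.2f}%")  # prints 3.58%
```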
Figure 38: Packet Rate of a Dual-Core AMP Routed Packet Stream with Management Plane Requests (chart: packet rate [packets/s] with routing only and with the CGI script, plus percentage difference [%], versus payload size [bytes])
Figure 39: Bitrate of a Dual-Core AMP Routed Packet Stream with Management Plane Requests (chart: bitrate [Mb/s] with routing only and with the CGI script, plus percentage difference [%], versus payload size [bytes])
The same behavior is even more pronounced when comparing the baseline response time of the CGI script to the response time of the stressed system, as shown in Figure 40. This figure illustrates that the user plane data processing has a clear impact on the response time of the management plane application.
Figure 40: Response Time of Management Plane on a Dual-Core AMP System with a Routed Packet Stream (chart: response time [s], baseline, and percentage difference [%] versus payload size [bytes])
User Space Passthrough The user space passthrough tests are conducted with the user plane application running on core 0, as indicated before in Figure 34. While the application copies packets from an ingress interface to an egress interface, the other core is left idle.
Using the exact same test configuration for the user space passthrough as used with the dual-core SMP system was not an option here. While the SMP system is able to handle the packet rate generated by Iperf from a certain point on, and drops packets otherwise, this is not the case with the AMP system. When Iperf generates its full potential for smaller payloads, the user space passthrough application hogs all the CPU power and gets almost no packets across from the input socket to the output socket, as is shown in Figure 41. Even changing buffer sizes and coalescing parameters did not provide a better result. For this reason, the Iperf generator was configured to generate a lower bitrate for smaller payloads. The bitrates were chosen to match the highest throughput achievable in the user space passthrough case.
Figure 41: Dual-Core AMP System with User Space Passthrough Maximum Generated Bitrate (chart: packet rate [packets/s] versus payload size [bytes])
The test results, depicted in Figure 42, indicate a much lower maximum packet rate compared to the SMP case (Figure 29). Instead of 118316 packets per second, the user plane application is only able to process a maximum of 90904 packets per second.
(Note that the maximum potential of the Iperf generator is not a full gigabit bitrate for smaller payload sizes, as can be seen in Figure 14. Refer to Section 3.2 for more details.)
Figure 42: Dual-Core AMP System with User Space Passthrough (chart: packet rate [packets/s] and core 0 usage [%] versus payload size [bytes])
User Space Passthrough with Control Plane Emulation The final test for the dual-core AMP system consists of measuring the response time of the management plane running on the second core while copying packets through user space using the user space passthrough program running on the first core. Again these tests were conducted with Iperf configured to generate a lower bitrate, because the system could not handle the maximum bitrate generated by the DTE.
As expected, the test results for the packet rate (shown in Figure 43) and the bitrate (shown in Figure 44) indicate that under heavy load the planes influence each other. The maximum difference percentage of the packet rate lies around 10%.
Figure 43: Packet Rate of a Dual-Core AMP User Space Passthrough Packet Stream with Management Plane Requests (chart: packet rate [packets/s] with passthrough only and with the CGI script, plus percentage difference [%], versus payload size [bytes])
Figure 44: Bitrate of a Dual-Core AMP User Space Passthrough Packet Stream with Management Plane Requests (chart: bitrate [Mb/s] with passthrough only and with the CGI script, plus percentage difference [%], versus payload size [bytes])
The minimum response time of the management plane CGI script confirms the presence of this cross-plane influence effect, as is clearly visible in Figure 45. The execution time of the script increases by approximately 10% when the other core is processing packets at high data rates.
Figure 45: Response Time of Management Plane on a Dual-Core AMP System with a User Space Passthrough Packet Stream (chart: response time [s], baseline, and percentage difference [%] versus payload size [bytes])
4 Comparison
This section compares the measured throughput results of the SMP configuration to those of the AMP configuration.
Section 4.1 gives an overview of all the test results comparing SMP to AMP. Next, Section 4.2 gives an interpretation of the most important test results of this report and tries to link the different cases.
4.1 Summarized Measurements
Figure 46 shows the bitrate comparison of the SMP and AMP systems in the case of a kernel space routing setup. The results show that the AMP system is not able to keep up with the SMP system in terms of packet processing for smaller payloads. The graph shows an average bitrate difference of 60Mb/s. Once the payload reaches 512 bytes, both systems are able to process the generated packet stream.
The test results also show the cross-plane influence within the AMP configuration. Once the management requests are introduced, the user plane shows a small but noticeable performance drop. Once the 512-byte marker is reached, the cross-plane influence becomes insignificantly small.
Figure 46: Achieved Bitrate in a Kernel Space Routing Setup with SMP and AMP (chart: bitrate [Mb/s] for SMP and AMP, each with and without management requests, plus the SMP-AMP difference percentage [%], at payload sizes 64 to 2048 bytes)
The user space passthrough comparison results of the SMP and AMP configurations are depicted in Figure 47. The AMP system is not able to process the full packet stream when the data has to be brought to user space. Even when the payload size increases, the AMP configuration cannot process more than 627Mb/s. Again a cross-plane influence exists once the management plane application is utilized. When comparing these results with those of the kernel space routing test (Figure 46), the performance drop of the user plane application is more irregular.
Figure 47: Achieved Bitrate in a User Space Passthrough Setup with SMP and AMP (chart: bitrate [Mb/s] for SMP and AMP, each with and without management requests, plus the SMP-AMP difference percentage [%], at payload sizes 64 to 2048 bytes)
At first glance it would seem that the cross-plane influence is not as explicitly present in the case of SMP. However, Figure 48 shows a cross-plane influence of the user plane on the management plane, especially in the user space passthrough setup. In certain cases the requests even time out and no response is received. Why this happens around 512 bytes is not exactly clear and is part of future research.
Figure 48: Achieved Management Plane Response Time with SMP (chart: response time [s], ranging up to roughly 70 s, for an unstressed SMP system, with kernel space routing, and with user space passthrough, at payload sizes 64 to 4096 bytes)
The cross-plane influence of the user plane on the management plane application is less severe in the case of AMP, as shown in Figure 49. A small increase of 50ms is measured in both setups. The graph also shows that the difference decreases with increasing payload size in the case of kernel space routing. The response time difference is more irregular in the case of the user space passthrough setup.
Figure 49: Achieved Management Plane Response Time with AMP (chart: response time [s], between 0.5 s and 1.5 s, for an unstressed AMP system, with kernel space routing, and with user space passthrough, at payload sizes 64 to 4096 bytes)
4.2 Interpretation of Results
The following sections discuss the results of the SMP (Section 4.2.1) and AMP (Section 4.2.2) tests.
4.2.1 SMP
Introducing the management plane application requests into an SMP kernel space routing setup creates cross-plane influence effects, as depicted in Figure 25. The data processing of the user plane application suffers slightly under the increased system load. This impact on the user plane can be explained by the fact that the Ethernet receive interrupts are assigned to core 0, which is almost at 100% usage. Core 1, however, has a maximum load of 50%. This is because the operating system schedules the management plane application to a core based on the current system load, and might schedule the process on core 0 for one request while assigning it to core 1 for another. There does, however, seem to be a predilection towards core 0. The performance of the system might be improved by shielding the CGI script from core 0, so that core 0 can be fully utilized by the Ethernet receiving process. This can be accomplished with a Python tool called cset, which allows processes to be shielded from certain CPUs or cores; this tool was not tested in the context of this case study. Shielding might also improve the response time of the management plane application (Figure 27), since it would not have to compete as much with interrupt requests.
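As a sketch of the shielding idea (untested in this study, like the report notes; the exact flags should be verified against the cset documentation, and the web server path is hypothetical), pinning the CGI workload away from core 0 might look like:

```shell
# Create a shield on CPU 1: processes placed in the shield run only
# there, keeping CPU 0 free for the Ethernet receive interrupts.
cset shield --cpu 1
# Launch the web server (and thus the CGI script) inside the shield.
cset shield --exec -- /usr/sbin/lighttpd -f /etc/lighttpd.conf
```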
Replacing the kernel space routing application with the user space passthrough application lowers the maximum throughput of the user plane significantly (Figure 22 versus Figure 29). It is, however, remarkable that the packet rate of the user space passthrough application is limited while the load on core 0 (which handles Ethernet receive interrupts) is below 100%. This may be due to Linux NAPI, which supports early packet dropping when the system is unable to handle more packets. As the payload increases, more packets make it to the user space passthrough process and more CPU processing power is used for the receive process (core 0). However, this is an assumption which has not been verified at this point.
When management plane application requests are made while processing data using the user space passthrough application, the cross-plane influence is even more pronounced than in the case of kernel space routing. This cross-plane influence is especially visible when analyzing the response times of the management plane application. Figure 32 shows the core usage and also explains why the response times are much higher for smaller packet sizes: both cores are fully utilized and there is contention for CPU processing power between the user space passthrough application, the CGI script and the Ethernet packet receiving/transmitting interrupt processes. Apparently, the CGI script succumbs to the user space passthrough application, which was scheduled as a real-time process using FIFO scheduling with a raised priority. This is why the packet rate does not drop significantly in this test setup.
4.2.2 AMP
When analyzing the AMP kernel space routing throughput (Figure 36), it is obvious that the AMP user plane is not able to forward all packets delivered by the generator. For smaller packet sizes, the bottleneck is clearly the available processor power, as can be seen in Figure 35.
Less expected is the obvious cross-plane influence that still exists between both planes, as indicated in Figure 38 and Figure 40. This is explained by the fact that, while both operating systems are fully separated, some hardware components are still shared. Examples of such components are the L2 cache and memory, which are shared between both planes and can have an impact on the system when one of the AMP cores is utilizing a component intensively.
When comparing Figure 38 and Figure 40, we conclude that the impact of the user plane on the management plane is more pronounced than the other way around. Moreover, the difference percentage of the user plane application throughput drops to nearly 0% once the payload gets bigger. The management plane response is fairly constant and is approximately 5% higher for all payload sizes.
The user space passthrough application has a much lower throughput (Figure 42) than the corresponding SMP passthrough application (Figure 29): 90904 instead of 118316 packets per second. We deduce from the core usage graph (Figure 42) that the CPU processing power is the limiting factor in this case. The single core does not only have to handle Ethernet transmit and receive interrupts, it also has to serve the user space passthrough application.
5 Future Work
We were able to confirm a cross-plane influence in the SMP as well as the AMP configuration. In the case of the SMP setup this was expected. The exact cause of this influence in the case of AMP could not be determined and should definitely be investigated further. Several options are available, but a hardware debugger/JTAG probe could help to pinpoint the exact cause.
Iperf was not able to report correct and accurate delay and jitter measurements. These parameters are very important in telecommunication platforms. Therefore they should be measured using specialized hardware capable of generating a gigabit UDP stream with high-accuracy timestamping.
The user plane was hosted on a single core in this case. Higher user plane performance and throughput could be achieved by using a processor with more than two cores and hosting multiple user plane operating systems. This would effectively create a load-balanced user plane.
In many test results, a 564-byte payload is a turning point at which the system starts to perform better or cross-plane influence disappears. Why this happens exactly at this point could not be determined and remains future work.
The tests conducted in this report primarily focused on performance and cross-plane influence. Another important aspect of a telecommunication system is the ability of the control plane to monitor and configure the user plane. This would require an ICC (Inter-Core Communication) API for the AMP configuration to enable communication between both planes.
6 Conclusion
SMP suffers from cross-plane influence. The effects are more manageable when configuring the system for maximum user plane performance. This is achieved by configuring Ethernet interrupt affinities, process scheduling classes and process priorities. This way the throughput of the user plane is stabilized, but at the cost of the management plane.
Bringing data from kernel space to user space has an impact on the performance of the user plane applications and should be taken into consideration when building a packet processing application. When Linux assigns interrupts to processor cores, interrupts are divided based on the order in which the Ethernet interfaces come online. Manual interrupt configuration provides better performance and also creates a deterministic system setup.
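For reference, manual interrupt-affinity configuration on Linux is done by writing a hexadecimal CPU bitmask to the per-IRQ smp_affinity file. The IRQ numbers below are hypothetical placeholders, not the actual P2020 assignments:

```shell
# Pin a (hypothetical) Ethernet RX IRQ 29 to core 0 (CPU mask 0x1)
echo 1 > /proc/irq/29/smp_affinity
# Pin a (hypothetical) second IRQ 35 to core 1 (CPU mask 0x2)
echo 2 > /proc/irq/35/smp_affinity
```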
Even AMP suffers from cross-plane influence, although to a lesser degree than SMP. The exact cause of this effect has not yet been identified; however, it is most likely due to one of the shared hardware components, such as the L2 cache or memory.
Acknowledgment
The authors wish to acknowledge Ivan De Baere, Elie De Brauwer, Yves Deweerdt and Newtec for providing support and the P2020RDB-based platform.