
Dynamic Hardware Monitors for Network Processor Protection

Kekai Hu, Harikrishnan Kumarapillai Chandrikakutty, Zachary Goodman, Russell Tessier, Senior Member, IEEE, and Tilman Wolf, Senior Member, IEEE


Abstract—The importance of the Internet for society is increasing. To ensure a functional Internet, its routers need to operate correctly. However, the need for router flexibility has led to the use of software-programmable network processors in routers, which exposes these systems to data plane attacks. Recently, hardware monitors have been introduced into network processors to verify the expected behavior of processor cores at run time. If instruction-level execution deviates from the expected sequence, an attack is identified, triggering processor core recovery efforts. In this manuscript, we describe a scalable network processor monitoring system that supports the reallocation of hardware monitors to processor cores in response to workload changes. The scalability of our monitoring architecture is demonstrated using theoretical models, simulation, and router system-level experiments implemented on an FPGA-based hardware platform. For a system with four processor cores and six monitors, the monitors result in a 6% logic and 38% memory bit overhead versus the processor's core logic and instruction storage. No slowdown of system throughput due to monitoring is observed.

Index Terms—network security, network infrastructure, data plane attack, hardware monitor, multicore processor, FPGA

1 INTRODUCTION

Network routers are core components of the Internet infrastructure. Routers implement the packet processing and forwarding operations that need to be performed on every packet that is transmitted through the network. Typical processing tasks on routers include security checks, data filtering, and traffic statistics collection. These functions augment basic forwarding as required by the Internet Protocol (IP) [1]. Since network functions may be introduced dynamically, routers require the programmability offered by network processors (NPs) [2] as opposed to fixed-function application-specific integrated circuits (ASICs). Network processors often feature multiple software-programmable processor cores, which operate in parallel to achieve high throughput rates. As a result, NPs are the computing device of choice in a large fraction of contemporary network routers.

• K. Hu, Z. Goodman, R. Tessier and T. Wolf are with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, 01003 USA e-mail: ([email protected]).

• H. Chandrikakutty is with Juniper Networks, Chelmsford, MA, 01824 USA

The programmability offered by network processors, while providing system-level adaptability, also exposes potential security weaknesses. Similar to network end-systems, such as general-purpose desktop and server computers, software-based network processors are vulnerable to remote attacks. These attacks can cause routers to exhibit unpredictable and malicious behavior. Recently, it has been shown that the functionality of a network processor could be modified by processing a single User Datagram Protocol (UDP) packet [3] in the data plane. In this attack, the network processor is reprogrammed to indefinitely retransmit the malicious packet to downstream routers. This self-propagating attack can be particularly difficult to control since it only requires data plane access (i.e., no access to the control plane is needed by the attacker). Thus, rapid identification of these malicious packet processing operations must take place on the network processor to prevent the attack from propagating and disabling the network.

Since network routers are embedded systems, their defense mechanisms are necessarily limited in extent. Network processor cores typically do not execute operating systems, thus anti-malware software is not suitable in this context. Additionally, network intrusion detection systems (e.g., Snort [4] or Bro [5]) are often only positioned on the ingress side of campus networks and thus do not protect the Internet core. In response to this need, hardware monitors for network processor cores have been introduced to provide runtime execution protection [6]. A hardware monitor operates in concert with an embedded network processor core to assess runtime behavior. If anomalous or unexpected operation is observed, the core can be reinitialized to avoid processing malicious code while continuing to process subsequent benign packets.

Most previous hardware monitor designs have focused on processors with a single processor core executing a single program or a program that changes very infrequently. This model is not effective for network processors, which contain core counts in the hundreds and exhibit frequent changes in packet processing objectives with changing traffic patterns [7]. As a result, new monitoring techniques for embedded multicores are needed to ensure adequate system protection. This manuscript presents the architecture and in-circuit hardware evaluation of a Scalable Hardware Monitoring Grid (SHMG) to address multicore monitoring. A lightweight interconnection network between processor cores and monitors is dynamically configured to form monitoring connections in response to packet processing needs. In some cases multiple processor cores can share a single monitor, reducing memory overhead. Results developed using both analysis and simulation indicate that our monitoring approach is scalable for network processors containing hundreds of processing cores.

The specific contributions of our work are:

• The design of a scalable architecture for hardware monitors that can be used in a practical network processor system with a large number of processor cores.

• An algorithm that can dynamically allocate monitors to processor cores as application packet workloads change.

• A simulation and analysis of the performance of the proposed design at runtime that considers the effects of dynamically assigning processors to monitors and the resulting resource contention.

• A prototype implementation of the hardware monitoring system on a field-programmable gate array (FPGA) platform that illustrates the feasibility of our design and provides detailed resource requirement numbers. A system which includes a four-core network processor with six monitors has been deployed and tested.

Our results indicate that our Scalable Hardware Monitoring Grid and associated allocation algorithm provide a low-overhead and scalable solution for network processor protection against data plane attacks, thus securing Internet infrastructure.

The remainder of the manuscript is organized as follows. Section 2 discusses related work. We describe data plane attacks and potential defense mechanisms in detail in Section 3. Section 4 introduces the design of our Scalable Hardware Monitoring Grid, which is evaluated in Section 5. The details of our monitor resource allocation algorithm are described in Section 6. Results from a prototype implementation are presented in Section 7. Section 8 concludes this paper and offers directions for future work.

2 RELATED WORK

Network processors are used in routers to implement standard IP forwarding functions as well as advanced functions related to performance, network management, flow-based operations, etc. [2]. Network processors use on the order of tens to low hundreds of parallel cores in a single multi-processor system-on-chip (MPSoC) configuration. Example devices include Cisco QuantumFlow [8], Cavium Octeon [9], and EZchip NP-5 [10] with data rates in the low hundreds of Gigabits per second.

Attacks on networking devices have been described in [11], but that work explored vulnerabilities in the control plane, where attacks aim to hack into the control interface of a router (e.g., IOS [12]). In more recent work, Chasaki and Wolf have described attacks on network processors through the data plane [3], where attackers merely need to send malformed data packets. In our work, we focus on the latter type of attack.

Since the processor cores of routers are very simple, there are not sufficient resources to run complex intrusion detection or anti-malware software. These resource constraints are similar to what has been encountered in the embedded system domain. Embedded systems (of which network processors are one class) exhibit a range of vulnerabilities [13], [14].

One defense technique for systems where software defenses are not practical is hardware monitoring. A hardware monitor operates in parallel with a processor core and verifies that the core operates within certain constraints (e.g., not accessing certain memory locations, executing certain sequences of instructions, etc.). Hardware monitoring has been studied extensively for embedded systems [15]–[17] and has also been proposed for use in network processors [6]. In our recent work, we describe a high-performance implementation of such a hardware monitoring system that can meet the throughput demands of a network processor with a single processing core [18].

What has been missing in the space of hardware monitoring for network processors is a system-level design of a comprehensive monitoring solution that can support a large number of processor cores and can adapt to quickly changing workloads. Since network processors have many processor cores, it is not practical to equip every core with a monitor that can handle any type of processing since this would lead to prohibitively expensive monitors. Instead, monitors need to be limited to handle one or a few processing tasks and adapt as the workload changes. Since network processors may experience highly dynamic workload changes based on changing traffic patterns [7], effective solutions for such an environment need to be developed.

This paper substantially extends our previous conference publication on network processor multicore monitoring [19]. We examine the system-level operation of monitoring in greater detail using a cycle-accurate multicore simulator and additional benchmark applications. This infrastructure is used to evaluate a new workload allocation algorithm that has been developed to dynamically assign applications and monitors to processor cores. Results obtained via simulation match values determined using an analytical model and using hardware implemented in an FPGA-based board in the laboratory. Additionally, in this extended version we consider both the power and energy consumption of monitoring in addition to area. Direct comparisons are made for these parameters between processor cores, monitors, and interfaces. Finally, we carefully quantify the time needed to

load new monitoring information into a monitor versus the time needed to connect a processor to a previously loaded monitor via programmable interconnect.

Fig. 1: Attack on a network processor.

3 DATA PLANE ATTACKS AND DEFENSES

To provide the necessary context for our Scalable Hardware Monitoring Grid, we briefly describe how network processors can be attacked in the data plane and how hardware monitors can be used to defend against these attacks.

3.1 Vulnerabilities in Networking Infrastructure

The typical system architecture and operation of a network processor is illustrated in Figure 1. Network processors are located at router ports, where they process traffic that is traversing the network.

Due to the very high data rates at the edge and the core of the network, network processors typically need to achieve throughput rates on the order of tens to hundreds of Gigabits per second. To provide the necessary processing performance, network processors are implemented as multi-processor systems-on-chip (MPSoC) with tens to hundreds of parallel processor cores. Each processor has access to local and shared memory and is connected through a global interconnect. Depending on the software configuration of the system, packets are dispatched to a single processor core for processing (run-to-completion processing) or passed between processor cores for different processing steps (pipelined processing). An on-chip control processor performs runtime management of processor core operation.

In order to fit such a large number of processor cores onto a single chip, each processor core can only use a small amount of chip real estate. Therefore, network processor cores are typically implemented as very simple reduced instruction set computer (RISC) cores with only a few kilobytes of instruction and data memory. These cores support a small number of hardware threads, but are not capable of running an operating system. Therefore, conventional software defenses used for workstation and server processors cannot be employed. Nevertheless, these cores are general-purpose processors and can be attacked just like more advanced processors on end-systems.

An attack scenario for network processors is illustrated in Figure 1. The premise for this attack is that the processing code on the network processor exhibits a vulnerability. It was shown in prior work that such a vulnerability can be introduced due to an uncaught integer overflow in an otherwise benign and fully functional packet processing function [3]. If a vulnerability in packet processing code is matched with a suitable attack packet (e.g., a malformed UDP packet), then an attack on a processor core can be launched. In the case of [3], the attack packet smashed the processor stack and led to the execution of code that was carried in the packet payload. The processor ended up re-transmitting the attack packet at full data rate on an outgoing link without recovering until the network processor was reset.

Launching a denial-of-service attack, such as in [3], can be done by using a single packet and can have more impact than conventional botnets, which are more complex to coordinate and are constrained by the access link bandwidth of the bots [20]. In addition, attacks on network processors have been shown both on systems based on the von Neumann architecture [3], leading to arbitrary code execution, and on systems based on the Harvard architecture [18], leading to return-to-libc attacks.

3.2 Defense Mechanisms With Hardware Monitoring

Solutions to protect network processors from attacks on vulnerable processing code are constrained by the limited resources available on these systems. One promising approach is to use hardware monitors, which have been successfully used in resource-constrained embedded systems [15]–[17].

The operation of a hardware monitor is illustrated in Figure 2. The key idea is that the processing core reports what it is doing as a monitoring stream to the monitor. The monitor compares the operations of the processor core with what it thinks the core should be doing. If a discrepancy is detected, the recovery system is activated to reset the processor core. In order to inform the monitor of what processing steps are valid, the processing binary is analyzed offline to extract the “monitoring graph” that contains all possible valid program execution sequences.

The granularity of monitoring can range from basic blocks [15] to individual processor instructions [17]. The detection times are as low as a single processor cycle, and the recovery times are on the order of tens of processor

cycles. Since the monitor does not need to implement a full processor data path and the monitoring information can be compressed through hashing, the overall size of a typical monitor and its memory is about 10–20% of that of a processor core.

Fig. 2: Hardware monitor for a single processor core.
Fig. 3: One-to-one configuration.
Fig. 4: Full interconnect configuration.
Fig. 5: Cluster configuration.

Hardware monitors for single-core network processor systems have been demonstrated in prior work [3], [18]. These solutions, however, do not address three critical problems that appear in practical network processor systems:

• Multiple cores: Practical network processors use multiple processor cores in parallel, and all of these cores need to be protected by hardware monitors.

• Multiple processing binaries: Network processors need to perform different packet processing functions on different types of network traffic. These different operations are represented by different processing binaries on the network processing system. Thus, different cores may need to execute different binaries and need to be monitored by hardware monitors that match these binaries.

• Dynamically changing workload: Due to changes in network traffic during runtime, the workload of processor cores may change dynamically [7]. Thus, hardware monitors need to adapt to the changing processing binaries during runtime.

We present the design and prototype of a hardware monitoring system that can accommodate these requirements.

4 SCALABLE HARDWARE MONITORING GRID

4.1 Design Challenges

The development of a scalable monitoring system for multicore network processors has several challenges. The use of monitoring should not impact the throughput or latency of the network processor. For monitors that track individual instructions, each per-instruction monitoring operation must be completed in real time (i.e., during the execution of the instruction), so that deviations from expected program behavior are identified immediately. Additionally, the amount of hardware resources used for monitoring should be limited to the minimum necessary to reduce chip area and power consumption. Since network processor programs may change frequently, it must be possible to modify monitoring tasks for each NP core to accommodate changing workloads.

These challenges necessitate the design of a customized solution for multicore monitoring. Perhaps the most straightforward monitoring approach would be simply to attach a dedicated monitor to each individual NP core, following previous approaches to single-core monitoring, as shown in Figure 3. Although this approach minimizes the amount of interconnect hardware needed to connect an NP core to a monitor, it suffers from the need to reload monitoring information each time the attached NP core's program is changed. Alternatively, allowing an NP core to dynamically access any monitor among a pool of monitors as shown in Figure 4, while flexible, is expensive and incurs a high processor-to-monitor communication cost. In the next section, we describe a scalable monitoring grid system that balances these two concerns of area and performance overhead by using the clustered approach illustrated in Figure 5.

4.2 Architecture of Scalable Hardware Monitoring Grid

Our model of the multicore NP system including monitoring is shown in Figure 6. The architecture includes a control processor that coordinates overall NP operation

by assigning arriving packets to individual NP cores. Each core executes a program using instructions from its local memory. External memory, which can be used to buffer packets and instructions for currently unused programs, is located off-chip. An on-chip interconnect is used to connect cores to external memory and outside interfaces. In this architecture, processors are grouped into clusters of n processors. Any of the processors in a cluster can be connected to any of m monitors.

Fig. 6: Overview of Scalable Hardware Monitoring Grid with network processor cores organized into clusters. Note that the local processor and control processor memory shown in Figure 1 have been omitted for clarity.

The management of loading application-specific monitoring graphs into monitors and configuring specific processor-to-monitor connections is performed by the same control processor used to assign packets to NP cores. Graph loading is performed in conjunction with security key management hardware (e.g., a trusted platform module, TPM). Copies of monitoring graphs for programs that are currently being executed or are likely to be executed in the near future are stored on-chip in a centralized monitor memory (CMM).

To ensure that monitoring graph information can be installed securely in the CMM, we use the following process to ensure authenticity, integrity and confidentiality:

• Authenticity and integrity: We use asymmetric cryptography to generate digital signatures that enable the TPM to verify authenticity and integrity of a monitoring graph. Public keys are installed using certificates that establish a chain of trust from the router manufacturer to the network operator to the TPM on the router system. Thus, an attacker cannot install a modified monitoring graph without having access to the private keys of the network operator.

• Confidentiality: We use symmetric cryptography to hide the monitoring graph information from an attacker when the monitoring graph is installed remotely through the network. The symmetric key is provided by the network operator securely to the TPM by encrypting it asymmetrically with the TPM's public key. Using this approach, an attacker

cannot obtain any information about the monitoring graph.

[…]
49c: 97c20010 lhu v0,16(s8)
4a0: 00000000 nop
4a4: 2c420033 sltiu v0,v0,51
4a8: 1440000a bnez v0,4d4
4ac: 00000000 nop
4b0: 3c026666 lui v0,0x6666
4b4: 34430191 ori v1,v0,0x191
4b8: 97c20010 lhu v0,16(s8)
[…]

Fig. 7: State machine for hardware monitor generated from processing binary.

The details of the installation and verification process, including a detailed security analysis, are presented in [21].
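To make these two steps concrete, the following sketch shows one way the verify-then-unwrap-then-decrypt flow could be implemented in software on the control path. The cipher choices (RSA-PSS signatures, RSA-OAEP key wrapping, AES-GCM) and all function and variable names are illustrative assumptions, not the mechanism specified in [21].

  from cryptography.hazmat.primitives import hashes
  from cryptography.hazmat.primitives.asymmetric import padding
  from cryptography.hazmat.primitives.ciphers.aead import AESGCM

  def install_monitoring_graph(blob, signature, wrapped_key, nonce,
                               operator_public_key, tpm_private_key):
      # Authenticity and integrity: verify the operator's signature over the
      # encrypted graph; raises InvalidSignature if it has been tampered with.
      operator_public_key.verify(
          signature, blob,
          padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH),
          hashes.SHA256())
      # Confidentiality: the symmetric key was encrypted with the TPM's
      # public key, so only the TPM-held private key can unwrap it.
      key = tpm_private_key.decrypt(
          wrapped_key,
          padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                       algorithm=hashes.SHA256(), label=None))
      # Decrypt the monitoring graph for installation in the CMM.
      return AESGCM(key).decrypt(nonce, blob, None)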

The amount of time needed to load a monitor with a graph from the centralized monitor memory is significant enough that reloading should be minimized. It is desirable to have a program monitor used by different cores at different times during packet processing, necessitating a flexible interconnection between NP cores and monitors. In cases where m > n, a total of m − n monitors are unused at a given point in time, although they can be activated by the control processor, if needed.

4.3 Multi-Ported Hardware Monitor Design

To support scalability, we have optimized the structure of single-processor monitors, which are capable of tracking NP core execution on an instruction-by-instruction basis. The monitoring graph for an application is generated from the application binary (Figure 2) at compile-time by performing an analysis of execution flow and feasible branch targets. Each instruction in the program can be represented as a state in a finite state machine (Figure 7). A transition between states is represented as the hash of the instruction binary. Control flow instructions (e.g., branch, jump) have multiple next states representing different flows of control.
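As an illustration of this offline step, the sketch below derives such a state machine from a disassembled binary. The input data model and the 4-bit hash function are assumptions made for illustration; the handling of hash collisions (the NFA-to-DFA conversion discussed below) is omitted.

  def hash4(word):
      # Hypothetical 4-bit hash that folds a 32-bit instruction word.
      h = word ^ (word >> 8) ^ (word >> 16) ^ (word >> 24)
      return (h ^ (h >> 4)) & 0xF

  def build_monitoring_graph(instrs):
      # instrs: address -> (32-bit instruction word, list of successor
      # addresses); successors are the fall-through address plus any static
      # branch targets, so control flow instructions get multiple edges.
      state_of = {addr: k for k, addr in enumerate(sorted(instrs))}
      graph = []
      for addr in sorted(instrs):
          _, successors = instrs[addr]
          edges = {}  # 4-bit hash of a valid next instruction -> next state
          for succ in successors:
              # Two successors with equal hashes would require the DFA
              # conversion described below; this sketch ignores collisions.
              edges[hash4(instrs[succ][0])] = state_of[succ]
          graph.append(edges)
      return graph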

Fig. 8: Two monitors sharing a single dual-ported memory block.

The monitoring graph includes all feasible next states for the current state and the expected hash values for transitions from the current state. Hash values are used rather than the instruction values themselves to save memory space for the monitoring graph. Our approach handles hash collisions (e.g., multiple edges from a state with the same hash value) by converting the non-deterministic finite automaton (NFA) generated directly from the binary to a deterministic finite automaton (DFA). This process removes the effects of the hash collisions by adding states. Our approach is effective for all programs where the expected execution flow, including branches, can be determined either statically or via profiling. Our benchmark analysis shows that this is the typical case for NP applications. Full details of graph generation, including a detailed example, are shown in [18].

The architecture of two monitors that perform this type of instruction-by-instruction monitoring is shown in Figure 8. The monitoring graph, which is stored in a memory block, includes one entry for each state in the execution state diagram. A k-bit pointer indicates the entry in the graph that corresponds to the currently executed instruction. As an instruction is executed, a four-bit hash value of the instruction is generated, which is then converted to a one-hot encoding. This encoding is then compared against the expected hash values that are stored in the graph entry (valid hashes) for the instruction. The next entry (memory row) in the monitoring graph is determined using next state information stored in the current entry and the matched hash value. The implemented monitor requires only one memory lookup per instruction, limiting the time overhead of monitoring.
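In software terms, the per-instruction check reduces to the following loop body. This is a behavioral sketch using the graph layout and hash4 function from the previous listing, not the monitor's actual RTL implementation.

  def check_instruction(graph, state, instr_word):
      # One monitoring step per executed instruction: hash the fetched
      # instruction, test it against the valid hashes of the current graph
      # entry, and follow the matching edge. A single lookup per instruction.
      h = hash4(instr_word)           # 4-bit hash computed next to the core
      entry = graph[state]            # the one memory access per instruction
      if h not in entry:              # no valid transition: execution deviates
          return None                 # caller asserts reset/recover to the core
      return entry[h]                 # pointer to the next state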

Although separate hash comparison and next state select information is needed for each monitor, multiple monitoring graphs can be packed into the same memory block if the block is multi-ported (Figure 8). In the example, the monitoring graph for the monitor on the left is located in the top half of the memory block while the graph for the monitor on the right is located in the bottom half. For each monitor, the selection of which monitoring graph (top or bottom) is used by the monitor is set by a single graph select bit which forms the top

address bit into the block memory. A benefit of this shared memory block approach is the possibility of both monitors accessing the same monitoring graph at the same time without having to reload monitor memory (e.g., both associated NPs execute the same program and require the same monitor). In this case, the second graph in the memory block would be unused.

Fig. 9: Flexible interconnection between processors and monitors in a cluster.

4.4 Scalable Processor-to-Monitor Interconnection

The detailed interconnection network between a cluster of n processors and m monitors is shown in Figure 9. In this architecture, any processor can be connected to any monitor via a series of n-to-1 (processor-to-monitor) and m-to-1 (monitor-to-processor) multiplexers. The four-bit hash values shown in Figure 8 are generated from instructions close to the processor, reducing processor-to-monitor interconnect. One of n four-bit values from the processors is selected for a specific monitor using ⌈log2 n⌉ multiplexer select bits. During monitoring, a monitor generates a single reset/recover bit, which is returned to the monitored processor to indicate if an attack has occurred. In our implementation, this signal is sent to the target processor via a multiplexer with m single-bit inputs. The monitor and processor select bits are generated by the control processor and sent to the appropriate multiplexers via decoders.
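As a quick check of the select-signal widths implied by this structure (for the prototype cluster of Section 7, n = 4 and m = 6):

  import math

  def mux_select_widths(n, m):
      # Each monitor picks one of n processor hash streams and each processor
      # picks one of m reset/recover bits, so the two multiplexer select
      # signals are ceil(log2 n) and ceil(log2 m) bits wide, respectively.
      return math.ceil(math.log2(n)), math.ceil(math.log2(m))

  print(mux_select_widths(4, 6))  # -> (2, 3) for the prototype cluster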

5 ANALYSIS OF SHMG ADAPTATION

Although the SHMG runtime adaptation can adjust the processing resource distribution at runtime to maximize the system throughput, due to variations in workload there may be a situation where more processors need to execute a particular program than monitors are available. In this case, some processors temporarily block (until a monitor becomes available, at which point they continue processing). We provide a brief analysis of the blocking probability of the system and the resulting throughput for different cluster configurations.

5.1 Monitor Configuration

In the SHMG system with n processors and m monitors defined in Section 4, for each program i (1 ≤ i ≤ p), we assume that t_i represents the average processing time and q_i represents the proportion of traffic that requires this program. We assume \sum_{i=1}^{p} q_i = 1, which implies that each packet is processed only by one program. (The analysis can be extended to consider more complex workload configurations.) The total amount of “work,” w_i, that the network processor needs to do for each program i is the product of the traffic share and the processing time:

w_i = q_i \cdot t_i. \quad (1)

In order to make the assignment of monitors to programs match the operation of the network processors, we need to determine how many of the n processors are executing program i at any given time. We assume that processors randomly draw from available packets (and thus the associated programs) when they are available. Thus, the probability of a processor being busy with processing program i, b_i, is proportional to the amount of work, w_i, that is incurred by the program (see Equation 1):

b_i = \frac{n \cdot w_i}{\sum_{j=1}^{p} w_j}. \quad (2)

That is, more processors are busy with program i if program i is either used by more traffic or has a longer average processing time.

Monitors should be configured to match the proportions of b_i for each program. The fraction of monitors, a_i, that should be assigned to monitor program i is

a_i = \max\left(\frac{m}{n} \cdot b_i,\ 1\right). \quad (3)

Since each program needs to have at least one monitor assigned to it, the lower bound for a_i is 1.

In practice, the number of monitors per program needs to be an integer. We denote the integer allocation of monitors with A_i. One way to translate from a_i to A_i is to use a max-min fair allocation process.

5.2 Blocking Probability and Throughput

Given a monitoring system where A_i monitors are allocated to program i, we need to determine the probability that the number of processors executing program i exceeds A_i (leading to blocking). The number of processors executing program i, B_i, is given by a binomial probability distribution:

\Pr(B_i = k) = \binom{n}{k} \left(\frac{b_i}{n}\right)^{k} \left(1 - \frac{b_i}{n}\right)^{n-k}. \quad (4)

The expected number of processors, R_i, that are blocked because program i does not have enough assigned monitors is

R_i = \sum_{j=A_i+1}^{n} (j - A_i) \Pr(B_i = j). \quad (5)

Fig. 10: Throughput depending on overprovisioning of monitors for different numbers of processors (n).

Fig. 11: Throughput depending on number of clusters for different numbers of total processors (c · n).

The total number of blocked processors, R, across all programs is

R = \sum_{i=1}^{p} R_i. \quad (6)

Note that in this case, the probabilities in R_i are not independent since \sum_{i=1}^{p} B_i = n.

The fraction of blocked processors is then R/n, and the throughput, t, of the system is

t = 1 - \frac{R}{n}. \quad (7)
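As a worked check of Equations (2) through (7), the snippet below evaluates the model for the prototype configuration used later (n = 4 cores, m = 6 monitors, p = 2 programs with w_1 = w_2, so b_i = 2 and A_i = 3). It reproduces the "over 96%" throughput quoted in Section 5.3; summing the per-program blocking terms as if independent is the same simplification noted above.

  from math import comb

  def model_throughput(n, A, b):
      # b[i]: expected processors busy with program i (Equation 2);
      # A[i]: integer monitor allocation for program i (Equation 3).
      R = 0.0
      for b_i, A_i in zip(b, A):
          for j in range(A_i + 1, n + 1):  # cases with blocked processors
              pr = comb(n, j) * (b_i / n) ** j * (1 - b_i / n) ** (n - j)  # (4)
              R += (j - A_i) * pr          # Equations (5) and (6)
      return 1 - R / n                     # Equation (7)

  print(model_throughput(4, [3, 3], [2.0, 2.0]))  # -> 0.96875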

5.3 System Comparison

To illustrate the effect of blocking due to the unavailability of monitoring resources, we present several results based on the above analysis. For simplicity, we assume p = 2 programs with w_1 = w_2. Figure 10 shows the throughput as a function of how many more monitors than processors are in the system. We call this “monitor overprovisioning” (i.e., m/n). In the figure, the overprovisioning factor ranges from 1 (equal number of

monitors and processors) to 2 (twice as many monitors as processors). The figure shows that only for very small configurations (e.g., n = 2 processors) is there a significant decrease in throughput. For larger configurations, there is only a slight decrease for low overprovisioning factors. For our prototype implementation, we choose a configuration of n = 4 processors and m = 6 monitors (i.e., m/n = 1.5), which achieves a throughput of over 96%.

The effect of clustering is shown in Figure 11. Since we need to cluster monitors to achieve scalability in the system implementation, a key question is how much worse a clustered system performs compared to a system with no clustering (i.e., full interconnect between all processors and monitors). We denote the number of clusters with c. The figure shows the throughput for configurations with the same total number of processors and a monitor overprovisioning factor of 1.5. The full interconnect (c = 1) always achieves full throughput. As the number of clusters increases, small systems degrade in throughput slightly. However, if the number of processors per cluster does not drop below 8, throughput of over 99% can be achieved. These results indicate that using a clustered monitoring system instead of a full interconnect can achieve nearly full performance, while being much less costly to implement.

6 RUNTIME RESOURCE REALLOCATION ALGORITHM

While the previous section provides an analytical evaluation of dynamic resource allocation, system throughput based on varying workloads can also be evaluated through experimentation. In the Scalable Hardware Monitoring Grid design, the control processor (see Figure 6) assigns programs to processors and monitors. As the traffic workload changes, the optimal assignment of cores and monitors should reflect the processing workload. In order to achieve this goal, a Runtime Resource Reallocation Algorithm (RRRA) is needed to dynamically reconfigure the SHMG at runtime based on network workload changes.

6.1 Reallocation Algorithm

The control processor periodically monitors network workload to assess the current allocation of processing resources. As network packets enter the network processor, they are buffered in the external memory in a series of packet queues. Each queue stores a different type of network packet. The control processor assigns packets to queues based on processing requirements, and the number of packets in the queue defines the queue length. Similar, stable queue lengths across packet types indicate that packet processing is balanced in terms of the number of assigned processors and router processing speed. If a queue length increases significantly, the network traffic is too heavy compared to the current processing speed and more compute resources are needed for this program.

We assume the input network traffic fully utilizes the system, i.e., when the workload increases for one packet type, the workloads of other packet types decrease. As mentioned in Section 4.2, a processor/monitor cluster consists of n processors and m monitors. The workload of the system consists of p different programs that each monitor may execute (one program per packet type). For practicality, we assume m ≥ n and m ≥ p and no processor is idle.

For each program i (1 ≤ i ≤ p), a_i is the number of processors assigned to the program and ql_i(t) is the queue length at time t. If queue length ql_i increases and exceeds threshold θ, the required packet processing exceeds the current processing power and more processing resources need to be allocated to this program. Our algorithm performs this process in two steps: First, the algorithm examines all p queues to locate a program j which can release resources to program i; second, the algorithm determines which monitoring resources to allocate to the reassigned program (and thus which cluster is used). Each step is explained in detail in the following.

6.1.1 Identification of Program for Reallocation

In order to find the most suitable program j, the following criteria are applied during the search:

1) If a queue is empty, select this program to release one processor. If there is more than one empty queue, select the program that has had an empty queue for the longest time. For this purpose, an empty time marker te_k is used for each empty queue k to record the time the queue drained. The algorithm maintains a priority queue for the empty queues that allows easy identification of the queue that has been empty longest (i.e., with minimum te_k).

2) If no queue is empty, select the program with the shortest queue length that has at least two processors allocated. (Each program is guaranteed at least one active processor in the system if it has a non-zero queue length, so deallocating resources from a program with a single processor is not allowed.)

This queue monitoring algorithm is shown in Algorithm 1. A program that needs an additional core is i and the program that releases a core is j. When a new packet is assigned to queue i, the algorithm assesses the queue length ql_i. If ql_i passes the threshold θ, an additional processor is assigned to program i. The algorithm examines the length of all queues to find program j based on the above criteria.

Algorithm 1 Runtime Resource Reallocation Algorithm

 1: ql_i(t) ← ql_i(t) + 1                 ▷ packet arrival for program i
 2: if ql_i(t) ≥ θ then                   ▷ queue length passes threshold
 3:     ql_min ← θ                        ▷ initialize variables to find j
 4:     te_min ← t
 5:     for k = 1 to p do                 ▷ iterate over all queues
 6:         if ql_k(t) = 0 then           ▷ queue empty
 7:             if te_k ≤ te_min then     ▷ empty for longer time
 8:                 j ← k
 9:                 te_min ← te_k
10:                 ql_min ← 0
11:             end if
12:         else if ql_k(t) ≤ ql_min then ▷ queue is shorter
13:             if a_k ≥ 2 then           ▷ has 2 or more processors allocated
14:                 j ← k
15:                 ql_min ← ql_k(t)
16:             end if
17:         end if
18:     end for
19:     a_j ← a_j − 1                     ▷ reallocate one core from j to i
20:     a_i ← a_i + 1
21: end if

6.1.2 Identification of Monitor for Reallocation

After i and j have been determined, the next step is to select a specific system processor to switch from program j to program i. To minimize monitor reloading during the switching process, the selection is made as follows (a sketch of this selection logic follows the list):

1) Identify all unused monitors in the system.

2) If there is an unused monitor that has a preloaded graph of program i, identify all the processors in the same cluster as this monitor. If there is a processor running program j in the same cluster, switch it to program i, disconnect the program j monitor, and connect the processor to the program i monitor.

3) If there is no processor running program j in the same cluster, try to find another unused monitor with program i in a different cluster.

4) If there is no unused monitor that has preloaded program i, switch one processor from program j to program i and reload the least recently used monitor in that processor's cluster with program i.
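The following sketch captures this selection logic. The data model (clusters holding cores and monitors, an in_use flag and a last_used timestamp per monitor) is hypothetical, and an idle monitor is assumed to exist in each cluster (m > n); the control processor realizes the equivalent steps in the actual system.

  def select_core_and_monitor(clusters, i, j):
      # Steps 1-3: prefer an idle monitor already loaded with program i whose
      # cluster also has a core running program j, avoiding a graph reload.
      for cluster in clusters:
          for mon in cluster.monitors:
              if not mon.in_use and mon.program == i:
                  core = next((c for c in cluster.cores if c.program == j), None)
                  if core is not None:
                      return core, mon, False   # no graph reload needed
      # Step 4: otherwise, pick any cluster with a core running program j and
      # reload its least recently used idle monitor with program i's graph.
      for cluster in clusters:
          core = next((c for c in cluster.cores if c.program == j), None)
          if core is not None:
              idle = [m for m in cluster.monitors if not m.in_use]
              mon = min(idle, key=lambda m: m.last_used)
              return core, mon, True            # graph reload needed
      return None                               # no core currently runs j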

After switching resources, several packets must be processed before the effect of the new configuration is reflected in the queue lengths. To prevent additional programs from passing the threshold soon after resource switching and taking processing resources from the same program j, a mandatory delay δ is introduced. After one adaptation, new switching requests are blocked until δ packets have been processed.

6.1.3 Reallocation Algorithm Complexity

Overall, the Runtime Resource Reallocation Algorithm has two traversal operations:

• Evaluate all p queues to find j when the threshold θ is exceeded by a program i. This action requires O(p) time.

• Evaluate all monitors in the clusters that include program j to find a monitor and processor core to use or switch functionality, which requires O(m + n) time.

In total, RRRA has an asymptotic complexity of O(p + m + n), which is linear in the number of programs, monitors, and cores in the system.

6.2 System Simulation

A Java-based simulator was built to verify RRRA and to compare runtime throughput results with the results obtained from the analytical model in Section 5.3. The simulator can generate network packets that require processing by programs in different ratios and can vary these ratios over time. With this time-changing network traffic input, the simulator assesses the behavior of RRRA and measures the runtime resource allocation and system throughput.

6.2.1 Runtime Adaptation

Figure 12 shows that network traffic is balanced (i.e., the total number of packets during a fixed time period is the same) for three different types of packets. The packet proportions change from 1:1:1 to 8:1:1, then to 1:1:8, finally back to 1:1:1, and then decrease to 0. For simplicity, we assume that the packet processing time for all types is equal (i.e., t_1 = t_2 = t_3). Figure 13 shows the number of processors running each program over time in an experimental system with 16 processors and 24 monitors. We can observe that the ratios of the processors assigned to each program follow the ratios of each packet type in the network traffic, thus validating the effectiveness of the RRRA. In Figure 14, the system throughput is at a maximum shortly after the system starts processing. Then, there are three obvious transitions corresponding to the three network traffic allocation changes. The biggest drop happens when the network traffic changes dramatically from 8:1:1 to 1:1:8. Using RRRA, the system can adapt its resource allocation to traffic changes and the throughput quickly returns to the maximum value.

6.2.2 Overprovisioning Simulation

To verify the monitor overprovisioning analysis, two experiments were conducted in simulation. The first experiment considered the simplest case of p = 2 with w_1 = w_2, total processor counts n = 4, 8, 16, 32, and an overprovisioning factor ranging from 1 to 2. The results shown in Figure 15 match the previous analytical results in Figure 10. The second experiment kept the assumption of an evenly distributed workload, but extended the program number to three and the overprovisioning factor range from 1 to 3. Although the upper bound of the overprovisioning range increases to 3, the throughput is still more than 95% after m/n = 1.5.

The effect of the number of clusters on throughput was measured in a second simulation. The experiment was set up with monitor overprovisioning of 1.5, the same as the previous analytical analysis, and a total number of

processors of 32, 64, 128, and 256. Compared to the analytical results in Figure 11, results from this experiment, shown in Figure 16, demonstrate good consistency, thus supporting the appropriateness of the analytical model.

Fig. 12: Input network traffic used during simulation.
Fig. 13: Processor distribution for different packet types as traffic allocation changes.
Fig. 14: Throughput changes in relation to traffic changes.
Fig. 15: Simulation throughput depending on overprovisioning of monitors for different numbers of processors (n) with two different packet types.
Fig. 16: Simulation throughput depending on number of clusters for different numbers of processors (n) with two different packet types.

7 PROTOTYPE IMPLEMENTATION AND EVALUATION

To demonstrate the effectiveness of our Scalable Hardware Monitoring Grid in a real system, we have implemented a prototype system.

7.1 Experimental Setup

We have implemented a prototype network processor in an Altera Stratix IV FPGA on an Altera DE4 board. This board contains four 1 Gbps Ethernet ports to receive

and send network traffic. We implemented one SHMG cluster in the FPGA, consisting of four processor cores (soft processors created using a synthesizable PLASMA processor [22]) and six hardware monitors (i.e., n = 4 and m = 6). The flexible, multiplexer-based interconnect shown in Figure 9 is used to allow any processor to connect to any monitor within our cluster.

To evaluate the functionality and performance of the monitoring system, we transmit traffic through the prototype system. Packets are received on two of the Ethernet ports and transmitted on the other two. For each packet, a simple flow classifier determines the appropriate NP program for processing. After the packet is processed by a core, it is sent to the appropriate output queue for subsequent transmission.

We use two types of packets, which need different types of processing and thus different monitors: (1) IPv4 packets and (2) IPv4/UDP packets that require congestion management (CM) for processing. The processing code for IPv4 does not exhibit vulnerabilities, but the IPv4+CM processing code exhibits the integer overflow vulnerability described in [3]. We introduce 1% of attack packets, which can trigger a stack smashing attack in the IPv4+CM processing code [3].

To generate the monitoring graph, as described in Section 4.3, the program is first passed through the standard MIPS-GCC compiler flow to generate assembly-level instructions. The compiler output allows for the identification of branch instructions and their branch target addresses. The instructions and branch information are then processed to generate the data structure used inside the hardware monitor. This data structure is then loaded into the SHMG system.

7.2 Experimental Results

Our system was verified through a series of experiments that were run on the FPGA in real time.

7.2.1 Correct Operation

To illustrate the operation of our SHMG, we have assigned two cores to process IPv4 and two cores to process IPv4+CM. Of the available six monitors, two are configured to monitor IPv4 and four are configured to monitor IPv4+CM (since the latter is more processing-intensive). All four NP cores execute program code from internal FPGA memory. The initial configuration of the monitors, program code, and the processor-to-monitor interconnect is set when the design is compiled to the FPGA and the bitstream is loaded into the design on system powerup.

Figure 17 shows the operation of a processor core and its corresponding monitor on the IPv4 program. (Waveform figures are generated through simulation in order to obtain signals; however, the same functionality has been verified in real-time operation of the system on network traffic.) Similarly, Figure 18 shows the operation of a core on the IPv4+CM program. In this case, the

packet is benign and no attack occurs. Figure 19 shows the processing of an attack packet in IPv4+CM. The attack is identified within one cycle of the instruction fetch. Figure 19 shows that the monitor identifies the attack since the stack gets smashed and the control flow is redirected to code that differs from what the program analysis has determined as valid. The processor core is then reset and continues processing the next packet. The reset operation completes in two cycles and thus does not affect the throughput performance of the system (and cannot be used as a target for denial-of-service attacks). Other processor cores continue processing without being affected.

TABLE 1: Resource utilization and dynamic power consumption in the prototype system (percentages are shares of the total resources used by the four components)

            Available    DE4         Network      SHMG       SHMG
            in FPGA      interface   processors   monitors   intercon.
  LUTs      182,400      33,427      15,025       816        96
                         67.8%       30.4%        1.7%       0.1%
  FFs       182,400      36,467      8,367        147        0
  Bits      14,625,792   2,263,888   2,097,134    786,432    0
                         44.0%       40.7%        15.3%      0%
  Pwr (mW)  -            1490.83     388.6        41.76      5.30

A key functionality of SHMG is the dynamic assignment of processors to hardware monitors. In our prototype system, we can trigger the reassignment of processors to monitors on demand. In our experimental setup, we switch one of the processor cores from IPv4 (Figure 17) to IPv4+CM (Figure 18). The processor-to-monitor interconnect for the core that was previously processing IPv4 packets is switched to connect the core to an unused IPv4+CM monitor. The affected NP core and newly connected monitor are then reset, and processing by the core commences. After this runtime reconfiguration, three NP cores process packets for IPv4+CM, while one core processes IPv4.

Thus, we are able to show dynamic reassignment of processors to monitors at runtime as well as correct detection of and recovery from attacks.

7.2.2 Resource Requirements and System Throughput

The resource requirements for the FPGA in our prototype system are shown in Table 1, which lists the lookup table (LUT), flip-flop (FF), and memory (Bits) resources required for the network processor cores, monitors, switches, and other circuitry. A LUT is an n-input, 1-output logic element that can perform any logic function of its n inputs. In Stratix IV devices, LUTs can be configured to have between two and seven inputs. Each monitoring graph can hold up to 4,096 separate entries. The FPGA in the system operates at 125 MHz. For this relatively small cluster size, the logic needed for the processor-to-monitor interconnect is less than 1% of the total logic needed for the monitors, cores, and interconnect, since only hash value, reset, and control signals are communicated.


Fig. 17: Simulation waveforms showing correct forwarding of an IPv4 packet.

Fig. 18: Simulation waveforms showing forwarding of an IPv4+CM packet.

Fig. 19: Simulation waveforms showing identification of and recovery from an IPv4+CM attack packet.


To assess the generality of our area results across different FPGA generations, we resynthesized the network processor cores, monitors, and interconnect for an Altera Stratix II device. The resulting LUT counts of 14,912, 774, and 92 for the processor cores, monitors, and interconnect, respectively, are similar to the Stratix IV numbers. For a Stratix II device, a LUT can range in size from two inputs to seven inputs, depending on the desired logic function. The distribution of LUT input counts across this spectrum was similar for both architectures.

The dynamic power consumption of the components, shown in Table 1, was determined using the Altera PowerPlay power analyzer. The monitors and associated interconnect consumed 12% of the dynamic power of the processors. The network and PCI interfaces on the board consumed 3.4× more dynamic power than the processors, monitors, and interconnect combined. Based on board-level experimentation, the average time to process one hundred 256-byte packets is 6 ms. As a result, the dynamic energy to process 100 256-byte packets at 125 MHz is 6 ms × 1926.49 mW = 11.56 mJ.

The throughput of our system, including monitoring, when processing normal 256-byte packets and an occasional attack packet is shown in Figure 20. The throughput of the system is limited by the processing capability of the processor cores, not by monitoring. The throughput for normal packets is the same with and without monitoring. A small throughput reduction is observed in the presence of attack packets due to the time needed to flush the packet buffer.


[Figure: output rate (Mbps) versus input rate (Mbps) for two traffic mixes: all normal, and attack 100:1.]

Fig. 20: Throughput results when processing normal IPv4 with congestion management packets and when processing IPv4 with congestion management packets if one out of 100 packets is an attack packet.

7.2.3 Monitoring Graph Swap Time Overhead

To better illustrate the benefits of overprovisioning the monitors relative to the processor count (m > n), we assess the average time required to swap monitors during a processor reallocation for the case m/n = 1.5 versus the case in which a monitoring graph must be reloaded from centralized monitor memory for every processor reallocation. Monitor swapping consists of the following steps: identification of a new program i for allocation, identification of a program j to swap out, identification of a target processor core and monitor for the new program, and a monitor reload from centralized monitor memory (if needed). The reallocation operations needed to perform the first three steps were discussed in Section 6.1. The following analysis is performed for a two-cluster system with n = 6 processor cores and m = 9 monitors in each cluster. The control processor operates at 125 MHz, the clock speed of our prototype hardware implementation. Since the processor executes one instruction per clock cycle, we equate an instruction execution to a clock cycle. The instructions for the programs are stored in each processor core's local memory.

The initial allocation and deallocation steps require an examination of packet queues to identify a program i for allocation to a processor and a program j that should give up a processor (Section 6.1.1). Our experimentation using system simulation shows that 28 control processor instructions are needed on average to identify a program i that requires an additional processor. An additional 28 control processor instructions are required to identify a program j that should have a processor deallocated. Combined, these actions require 0.45 µs.

The tasks needed to identify a monitor for reallocation are detailed in Section 6.1.2. This stage attempts to identify a spare monitor that has been previously loaded with the monitoring graph for program i and a processor core that is currently tasked with program j. This core is subsequently switched to program i, and the processor core/monitor interconnect is configured for the new connection.

TABLE 2: NpBench monitor graph reload cost

  Network      Memory graph   Graph reload    Graph reload
  benchmark    size (bits)    time (cycles)   time (µs)
  crc          8,460          529             2.64
  frag         18,660         1,166           5.83
  red          25,410         1,588           7.94
  md5          96,840         6,052           30.26
  ssld         25,620         1,601           8.01
  wfq          28,590         1,787           8.93
  mtc          77,160         4,822           24.11
  mpls (up)    52,590         3,287           16.43
  mpls (down)  51,180         3,199           15.99

The process of identifying a processor core, swapping its program, and locating a suitable preloaded monitor requires 197 instructions (clock cycles) on average, based on our simulation. The configuration of the interconnect between the monitor and processor core in the cluster requires 3 clock cycles. In total, these actions require 1.60 µs.
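For concreteness, a behavioral sketch of one such reallocation is shown below. The class and attribute names (cluster.cores, m.idle, load_graph, and so on) are our illustrative assumptions rather than the control processor's actual code.

    # Behavioral sketch of one reallocation step for m/n = 1.5: reuse a
    # preloaded spare monitor when possible, otherwise reload a graph
    # from centralized monitor memory. Programs i and j are identified
    # from the packet queues before this call (Section 6.1.1).
    def reallocate(cluster, prog_i, prog_j):
        core = next(c for c in cluster.cores if c.program == prog_j)
        spare = next((m for m in cluster.monitors
                      if m.idle and m.graph_id == prog_i), None)
        if spare is None:                # needed in ~16% of reallocations
            spare = next(m for m in cluster.monitors if m.idle)
            spare.load_graph(prog_i)     # ~13.34 us average (Table 2)
        cluster.connect(core, spare)     # 3-cycle interconnect update
        core.reset(program=prog_i)       # core resumes on program i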

In some cases, if a spare monitor with the appropriate graph cannot be found, a graph must be loaded into monitor memory. To evaluate the average cost of reloading a monitoring graph from centralized monitor memory into the dual-ported memory of a monitor, nine benchmarks from the NpBench suite [23] were processed with an offline analysis flow. NpBench is a benchmark suite targeting modern network processor applications. Its applications are categorized into three functional groups: traffic management and quality of service (TQG), security and media processing (SMG), and packet processing (PPG). In our evaluation, monitor graph sizes generated with a 4-bit nibble-sum hash function were calculated, and the graph read/write times to on-chip memory were estimated for each benchmark. The reload time estimate assumes a 200 MHz on-chip SSRAM with a 16-bit data bus. Table 2 shows the evaluation results. The average reload time is 13.34 µs.
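The per-benchmark numbers in Table 2 follow directly from the graph size, bus width, and memory clock; a small helper (names ours) reproduces them:

    # Reload-time estimate behind Table 2: graph bits are streamed over a
    # 16-bit bus from a 200 MHz SSRAM, one bus word per cycle.
    def reload_cost(graph_bits, bus_width=16, clock_mhz=200):
        cycles = round(graph_bits / bus_width)
        return cycles, cycles / clock_mhz    # (cycles, microseconds)

    # e.g., reload_cost(8460) -> (529, 2.645), matching Table 2's crc row
    # up to rounding.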

Based on our simulation, we determined that a monitor must be reloaded from centralized monitor memory in 16% of reallocations for m/n = 1.5. In the remaining cases, a spare monitor preloaded with program i was available in the cluster and could be connected to the newly allocated processor core. As a result, the average time needed to reallocate a processor core can be calculated as program allocation time + monitor/processor identification time + %reload × graph reload time. This yields an average reallocation time of 0.45 µs + 1.60 µs + 0.16 × 13.34 µs = 4.18 µs. In contrast, the time needed if a processor is dedicated to a monitor is program reallocation time + graph reload time; for the m = n case, this yields an average delay of 0.45 µs + 13.34 µs = 13.79 µs.
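In compact form (symbols ours):

$$T_{\text{realloc}} = T_{\text{alloc}} + T_{\text{ident}} + p_{\text{reload}} \cdot T_{\text{reload}} = 0.45 + 1.60 + 0.16 \times 13.34 \approx 4.18\ \mu\text{s}$$

$$T_{\text{dedicated}} = T_{\text{alloc}} + T_{\text{reload}} = 0.45 + 13.34 = 13.79\ \mu\text{s}$$

Overprovisioning the monitors thus reduces the average reallocation latency by roughly 3.3×.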


8 CONCLUSIONS AND FUTURE WORK

The use of general-purpose processors to implement packet forwarding functions in routers has opened the door for a new class of network data plane attacks. Prior work has shown examples of such attacks and their considerable effects. To provide practical protection for network processors, which are multicore systems with highly dynamic workloads, we have presented our design of a Scalable Hardware Monitoring Grid. This monitoring system groups multiple processors and monitors into clusters and provides an interconnect to dynamically assign processor cores to monitors based on their current workload using a Runtime Resource Reallocation Algorithm. We show that the system can correctly identify attacks and recover an attacked core so that it can continue processing. In the future, we plan to assess techniques to speed up the runtime swapping of monitoring information.

REFERENCES

[1] F. Baker, "Requirements for IP version 4 routers," Network Working Group, RFC 1812, Jun. 1995.
[2] W. Eatherton, "The push of network processing to the top of the pyramid," in Keynote Presentation at ACM/IEEE Symposium on Architectures for Networking and Communication Systems (ANCS), Princeton, NJ, Oct. 2005.
[3] D. Chasaki and T. Wolf, "Attacks and defenses in the data plane of networks," IEEE Transactions on Dependable and Secure Computing, vol. 9, no. 6, pp. 798–810, Nov. 2012.
[4] The Open Source Network Intrusion Detection System, Snort, 2014, http://www.snort.org.
[5] The Bro Network Security Monitor, The Bro Project, 2014, http://www.bro-ids.org.
[6] D. Chasaki and T. Wolf, "Design of a secure packet processor," in Proc. of ACM/IEEE Symp. on Arch. for Networking and Communication Systems (ANCS), San Diego, CA, Oct. 2010, pp. 1–10.
[7] Q. Wu and T. Wolf, "Runtime task allocation in multi-core packet processing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 10, pp. 1934–1943, Oct. 2012.
[8] The Cisco QuantumFlow Processor: Cisco's Next Generation Network Processor, Cisco Systems, Inc., San Jose, CA, Feb. 2008.
[9] OCTEON Plus CN58XX 4 to 16-Core MIPS64-Based SoCs, Cavium Networks, Mountain View, CA, 2008.
[10] NP-5 – 240-Gigabit Network Processor for Carrier Ethernet Applications, EZchip Technologies Ltd., Yokneam, Israel, May 2012, http://www.ezchip.com/.
[11] A. Cui, Y. Song, P. V. Prabhu, and S. J. Stolfo, "Brave new world: Pervasive insecurity of embedded network devices," in Proc. of 12th International Symposium on Recent Advances in Intrusion Detection (RAID), ser. Lecture Notes in Computer Science, vol. 5758, Saint-Malo, France, Sep. 2009, pp. 378–380.
[12] Cisco, Inc., "Cisco IOS," http://www.cisco.com.
[13] P. Koopman, "Embedded system security," Computer, vol. 37, no. 7, pp. 95–97, Jul. 2004.
[14] S. Parameswaran and T. Wolf, "Embedded systems security – an overview," Design Automation for Embedded Systems, vol. 12, no. 3, pp. 173–183, Sep. 2008.
[15] D. Arora, S. Ravi, A. Raghunathan, and N. K. Jha, "Secure embedded processing through hardware-assisted run-time monitoring," in Proc. of the Design, Automation and Test in Europe Conference and Exhibition (DATE'05), Munich, Germany, Mar. 2005, pp. 178–183.
[16] R. G. Ragel, S. Parameswaran, and S. M. Kia, "Micro embedded monitoring for security in application specific instruction-set processors," in Proc. of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), San Francisco, CA, Sep. 2005, pp. 304–314.
[17] S. Mao and T. Wolf, "Hardware support for secure processing in embedded systems," IEEE Transactions on Computers, vol. 59, no. 6, pp. 847–854, Jun. 2010.
[18] H. Kumarapillai Chandrikakutty, D. Unnikrishnan, R. Tessier, and T. Wolf, "High-performance hardware monitors to protect network processors from data plane attacks," in Proc. of Design Automation Conference (DAC), Austin, TX, Jun. 2013, pp. 1–6.
[19] K. Hu, H. Chandrikakutty, R. Tessier, and T. Wolf, "Scalable hardware monitors to protect network processors from data plane attacks," in Proc. of the IEEE Conference on Network and Computer Security, Washington, DC, Oct. 2013, pp. 314–322.
[20] D. Geer, "Malicious bots threaten network security," Computer, vol. 38, no. 1, pp. 18–20, 2005.
[21] K. Hu, T. Wolf, T. Teixeira, and R. Tessier, "System-level security for network processors with hardware monitors," in Proc. Design Automation Conf. (DAC), San Francisco, CA, Jun. 2014, pp. 1–6.
[22] S. Rhoads, Plasma – most MIPS I(TM) Opcodes, 2001, http://www.opencores.org/project,plasma.
[23] B. K. Lee and L. K. John, "NpBench: A benchmark suite for control plane and data plane applications for network processors," in Proc. of IEEE International Conference on Computer Design (ICCD), San Jose, CA, Oct. 2003, pp. 226–233.

Kekai Hu is a Ph.D. candidate in the Electrical and Computer Engineering department at the University of Massachusetts, Amherst. He received the B.S. and M.S. degrees in electrical and computer engineering from Wuhan University, China, in 2007 and 2009, respectively. His research interests include network routers, embedded system security, and computer architecture.

Harikrishnan Kumarapillai Chandrikakutty received the B.Tech. degree in applied electronics and instrumentation engineering from the College of Engineering, Trivandrum, India in 2008 and the M.S. degree in electrical and computer engineering from the University of Massachusetts, Amherst in 2013. He is currently with Juniper Networks, Westford, MA, where he is involved in the design and verification of next-generation routers and switch systems.

Zachary Goodman received the B.S. degree in electrical engineering from the University of Massachusetts, Amherst in 2014.

Russell Tessier (M'00-SM'07) received the B.S. degree in computer and systems engineering from Rensselaer Polytechnic Institute, Troy, NY, USA, in 1989, and the S.M. and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 1992 and 1999, respectively. He is currently Professor of Electrical and Computer Engineering with the University of Massachusetts, Amherst, MA. His current research interests include computer architecture and FPGAs.

Tilman Wolf (M'02-SM'07) is Professor of Electrical and Computer Engineering at the University of Massachusetts Amherst. He received a Diplom in informatics from the University of Stuttgart, Germany, in 1998. He also received an M.S. in computer science in 1998, an M.S. in computer engineering in 2000, and a D.Sc. in computer science in 2002, all from Washington University in St. Louis. His research interests include Internet architecture, network routers, and embedded system security.