


hthreads: A Hardware/Software Co-Designed Multithreaded RTOS Kernel

David Andrews, Wesley Peck, Jason Agron, Keith Preston, Ed Komp, Mike Finley, Ron Sass
EECS Department, University of Kansas

Information and Telecommunications Technology Center (ITTC)
{dandrews, peckw, jagron, preston, komp, rsass}@ittc.ku.edu

Abstract

This paper describes the hardware/software co-design of a multithreaded RTOS kernel on a new Xilinx Virtex II Pro FPGA. Our multithreaded RTOS kernel is an integral part of our hybrid thread programming model being developed for hybrid systems, which are comprised of both software-resident and hardware-resident concurrently executing threads. Additionally, we provide new interrupt semantics by migrating uncontrollable asynchronous interrupt invocations into controllable, priority-based thread scheduling requests. Performance tests verify that our hardware/software co-design approach provides significantly tighter bounds on scheduling precision and significant jitter reduction when compared to traditionally implemented RTOS kernels. It also eliminates the jitter associated with asynchronous interrupt invocations.

1 Introduction

This paper describes the hardware/software co-design of a multithreaded RTOS kernel on a new Xilinx Virtex II Pro [16] FPGA. Our multithreaded RTOS kernel is an integral part of our hybrid thread programming model being developed for hybrid systems, such as the new FPGA systems being developed by Xilinx [16] and Altera [1], which integrate general purpose CPUs into the FPGA fabric. The hybrid thread programming model is being developed to provide a uniform high-level programming environment in which programmers are able to access the capabilities of the FPGA from within a comfortable and familiar software programming model. A description of our hybrid thread model is given in [2] [3] [11] [6] [4].

The multithreaded RTOS kernel presented in this paper provides a shared memory programming model similar to POSIX Threads [15] for developers of real-time applications. Currently our system supports up to 256 active software threads, 256 active hardware threads, 64 blocking semaphores, 64 binary spin-lock semaphores, and preemptive priority, round-robin, and FIFO scheduling algorithms. Applications access the hardware-based operating system components through familiar APIs such as create, join, and exit. However, these operations have now been efficiently migrated into state machines which operate in parallel within the FPGA. These operations cause changes in the state of an application thread and may lead to new scheduling decisions. These scheduling decisions are performed in hardware, similar to [7] and [12], allowing the overhead of a scheduling decision to be reduced to a single memory read operation. Thus, a deterministic context switch is the only major source of overhead on the CPU. Migrating the scheduler into hardware has also allowed us to address the major source of jitter within real-time and embedded systems: asynchronous interrupt invocations. Our RTOS kernel now contains a new CPU Bypass Interrupt Scheduler (CBIS) that transforms the semantics of asynchronous interrupts into those of synchronous and controllable thread scheduling requests.

2 hthreads: A Multithreaded HW/SW RTOS Kernel

All operating system services introduce some general level of overhead that results in a reduction of execution capacity for application programs on the CPU. A simple hardware/software co-design of existing system services can reduce this overhead and therefore provide more CPU capacity for application programs [9] [12] [14]. Although a necessary step, simply migrating overhead functionality into the hardware is not sufficient for addressing the more critical issues that challenge developers of real-time systems. Specifically, developing an operating system with minimal jitter, deterministic behavior, and fine scheduling granularity has by far been the most challenging goal for real-time operating system designers. To achieve the best and tightest bounds on application program scheduling, both the scheduler processing and the management of state information must be moved off of the CPU and into hardware components. If the scheduler has access to all necessary state information then the scheduler can perform look-ahead processing concurrently with the application program without interrupting or requiring processing time on the CPU. Providing the scheduler with the necessary state informa-

0-7803-9402-X/05/$20.00 © 2005 IEEE


Figure 1. System Block Diagram (CPU with software interface and software threads; hardware threads behind hardware interfaces; Mutexes, Condition Variables, CBIS, Software Thread Manager, Thread Scheduler, and Shared Memory attached to the system bus)

tion requires redirecting all scheduler entry invocations to the hardware, including API calls that cause state changes in threads, blocking operations such as semaphores, and, more importantly, interrupts. This requires a complete re-partitioning, and in some cases re-interpretation, of traditional operating system semantics.

Figure 1 shows a block diagram of our co-designed components. This co-designed real-time kernel consists of

1. a hardware resident thread manager

2. a hardware resident thread scheduler

3. hardware resident semaphores

4. a CPU Bypass Interrupt Scheduler (CBIS)

5. software APIs, context switching, and kernel code

The semaphore components shown in Figure 1 are outlined in [6] [4] and are not discussed further in this paper.

2.1 Software Thread Manager (SWTM)

Figure 2 shows the block diagram of our Software Thread Manager (SWTM), whose main function is to store and modify thread state and serve as the interface for all state change requests. State change requests can be caused by an application thread, by hardware-resident application threads, by the system jiffy timer, by our semaphore components, or by the CBIS. In order to allow efficient execution of the thread management and scheduling operations, the SWTM and scheduler component have a dedicated interface which is used to communicate thread scheduling requests from the SWTM to the scheduler. The state of all threads is also maintained in the hardware using state tables stored in the Virtex II Pro's Block RAM (BRAM).

All thread management functions are invoked by reading or writing one register in the set of memory-mapped registers shown in Figure 2. A write or read operation on a register invokes a state machine that performs the associated processing and updates the state table. As an example, Figure 3 shows the pseudocode for our co-designed thread create API, with the pseudocode executed on the CPU appearing on the left of Figure 3, and the resulting

Figure 2. Software Thread Manager

actions performed by the hardware-resident state machine shown on the right.

Each register shown in Figure 2 is not a physical register but instead is a virtual register which can have a variable depth. This helps save precious CLB resources inside of the FPGA. The depth of a given register specifies the number of successive 32-bit address locations that comprise the register. Registers having a depth greater than one utilize the least significant address lines to encode parameters, such as a thread identifier, to be passed to the register. In most instances, the 32-bit address is encoded as:

bits 31-8: Register Base Address; bits 7-0: Thread Identifier

where the lower 8 bits contain the identifier of the thread which is requesting the operation. This encoding allows the user of the API to both send information with a request and receive results from that request in a single, atomic bus transaction. This approach was taken for two reasons. First, implementing physical registers within an FPGA requires CLB resources, so their use should be minimized. Second, and more importantly, implementing a request more traditionally as a store instruction would require an additional load instruction to read back returned values. The combination of store and load instructions would then need to be protected by a critical region. This is a fundamental concern when partitioning functionality between distinct software and hardware components that can be interrupted. Although critical regions can be protected by semaphores, this introduces additional performance delays and semaphore resource requirements. For these reasons we chose to use "virtual" registers where


Figure 3. API Example

each virtual register corresponds to a single command whose parameters are encoded in its memory address. By doing so, we create truly atomic operations implemented as single-instruction read requests. This approach trades off additional address space in place of physical resources. Additionally, the approach simplifies the coding of APIs and provides an inherent consistency within the system. The atomicity of an operation is guaranteed by the SWTM, which operates as a slave device on the bus and controls the termination of the bus transaction. Once the register base address and thread identifier are decoded, the SWTM invokes the appropriate state machine and places any return data directly onto the data bus. For operations where no data is returned, the SWTM performs an immediate bus acknowledgement when the address has been decoded. The state machine can then run concurrently with the application program on the CPU. For operations that require return status, the SWTM holds the bus until the state machine can determine what return status to place on the data bus. All operations within the SWTM are performed by state machines that transition at the system clock rate and complete within tens of clock cycles. This approach ensures that all SWTM operations are atomic, deterministic, and have minimal overhead.

2.2 Scheduler Component

A block diagram of the scheduler component is shown in Figure 4. This component currently implements FIFO, round-robin, and priority-based scheduling algorithms. The scheduler was initially developed as an integrated component within the SWTM, but shortly afterwards it was separated out as a standalone component to promote portability for system designers and to allow implementation of custom scheduling algorithms.

The primary functions of the scheduler are to determine

the thread identifier of the next thread in the system to run, and to determine if a context switch interrupt should be generated to the CPU. A context switch interrupt is only generated when a thread has the highest priority of any thread in the ready-to-run queue and its priority is higher than that of the currently running thread. These interrupt semantics allow the CPU to remain undisturbed until the scheduler decides that a context switch is necessary.

The scheduler must maintain its own ready-to-run queue data structure within the FPGA in order to provide fast access to each thread's scheduler attributes (priority, next thread in queue, previous thread in queue, etc.). The scheduler module uses a partitioned ready-to-run queue with a parallel priority encoder that allows all scheduler operations to execute in constant time. The partitioned ready-to-run queue has been implemented using two BRAMs: the Priority Data BRAM and the Thread Data BRAM. The Priority Data BRAM is indexed by priority level and contains the head and tail pointers of the ready-to-run queue for that priority level. The Thread Data BRAM is indexed by thread identifier and contains the thread attribute information, including priority level, a queued/not-queued flag, a pointer to the next thread in queue, and a pointer to the previous thread in queue. Additionally, a parallel priority encoder continually calculates the highest priority in the system using a 128-bit register field that represents which priority levels have active (queued) threads. The parallel priority encoder recalculates the highest priority in the system only when a change occurs in its 128-bit input register, in a quick and constant four clock cycles. The parallel priority encoder along with the partitioned ready-to-run queue eliminates the need to traverse the scheduler's ready-to-run queue, as individual priority queues can be accessed using the Priority Data BRAM and individual thread attributes can be accessed


Figure 4. Thread Scheduler

Figure 5. Scheduler Time Line

using the Thread Data BRAM.

State change requests to the scheduler are initiated by

1. the software thread manager submitting a new thread identifier to be queued, or

2. the CPU reading the Next Thread ID register duringa context switch

During an add thread operation the SWTM will first perform error checking and then update its state information before providing the thread identifier of the new thread to add to the scheduler's ready-to-run queue. All communication is done over the dedicated hardware interface between the SWTM and the scheduler. For this type of queuing operation the scheduler performs the following sequence of actions. First, the thread identifier is linked into the bottom of the ready-to-run queue corresponding to its priority level. Second, the parallel priority encoder's 128-bit input register is updated as necessary for the current enqueue operation. Third, the output of the parallel priority encoder is read and is used to access the head pointer of the highest priority level's ready-to-run queue

out of the Priority Data BRAM. Fourth, the head pointer is placed in the Next Thread ID register. Finally, a preemption check is carried out by accessing the priority level of the next thread to run and the priority level of the currently executing thread. The priority levels are compared to one another, and if the priority level of the next thread to run is better than that of the currently running thread then a context switch (preemption) interrupt is generated to the CPU.

It is important to note that when a new thread identifier is queued, the scheduler does not search through the ready-to-run queue or perform any type of priority sorting. Instead, the following three operations are performed: an enqueue of the thread to the tail of a list, a lookup of the head pointer of the highest priority queue in the system, and a comparison of the priority level of the next thread to run with the priority level of the currently running thread. This approach is not only very fast (approximately 13 clock cycles) but, more importantly, executes in constant time, which eliminates the jitter introduced in traditional software schedulers that must sort or search through a single ready-to-run queue.


Figure 6. Dequeue Operation State Machine:

Dequeue_Begin:
    TID = current_thread_id (the thread to dequeue)
    t_entry = Thread_Data[TID]
    de-assert next_thread_valid
Lookup_Priority_Entry:
    PRI = t_entry.priority
    t_entry.Q = 0
    p_entry = Priority_Data[PRI]
Check_Encoder_Input:
    old_head = Thread_Data[p_entry.head]
    equal_flag = (p_entry.head == p_entry.tail)
Set_Q_To_Empty (if equal_flag = 1):
    encoder_input[PRI] = 0
Update_Q_Head_Ptr (if equal_flag = 0):
    p_entry.head = old_head.next
Write_Back_Entries:
    Thread_Data[TID] = t_entry
    Thread_Data[p_entry.head] = old_head
    Priority_Data[PRI] = p_entry
Lookup_Highest_Priority_Entry:
    h_entry = Priority_Data[encoder_output]
    exist_flag = (encoder_input != 0)
Return_Next_Thread_Valid (if exist_flag = 1):
    next_thread_id = h_entry.head
    assert next_thread_valid
Return_Next_Thread_Invalid (if exist_flag = 0):
    no active threads in the system (no threads are queued)

A new scheduling decision is initiated immediately upon the CPU reading the Next Thread ID register during a context switch. When the Next Thread ID register is read, the scheduler must remove the thread identifier found in the Next Thread ID register from the ready-to-run queue and calculate the new next thread to run. The new next thread can be calculated in constant time by reading the output of the parallel priority encoder and using it to look up the head pointer of the highest priority level's queue. This thread identifier is then placed in the Next Thread ID register, and the next scheduling decision has been completed without any linked-list searching or traversal. This process happens in constant time, regardless of the number of queued threads. It also executes without introducing any jitter into the system. The state machine which implements this operation is shown in Figure 6. Figure 5 shows that the next scheduling decision is always pre-calculated before a context switch completes.

The scheduler component implemented on a Xilinx ML310 development board requires 1034 out of 13696 slices (7.5%), 522 out of 27392 slice flip-flops (1.9%), 1900 out of 27392 4-input LUTs (6.9%), and 2 BRAMs (1.5%). The module has a maximum operating frequency of 143.8 MHz, which easily meets our goal of a 100 MHz clock frequency.

2.3 Precise Interrupt Control

By far the largest sources of jitter within real-time andembedded systems are interrupts. As the integration of

embedded and real-time systems continues to increase, it is becoming even more critical to address the jitter injected from the processing of external interrupts. Software approaches such as those employed by RTLinux [5], RTAI [13] and KURT [8] minimize the effects of this jitter using mark-and-delay processing techniques. Although these approaches can delay the major portion of interrupt processing overhead, they still incur the initial overhead costs of context switching and executing the equivalent of "top halves" of interrupt service routines (ISRs) to mark the pending interrupt. The jitter of context switching and ISR processing cannot be eliminated by software techniques that must first field the interrupt to determine if it should be processed. Completely eliminating this jitter requires evaluating the request within a computational component external to the application processor.

To minimize this jitter and provide precise control over external interrupts, we direct interrupt requests to our hardware-based thread scheduler through a new hardware component that binds interrupt requests to thread identifiers. From a systems perspective this has the positive effect of unifying the priorities of application threads and external interrupts under a single scheduling framework. Within our approach, the semantics of interrupt handlers are modified so that they are standard, schedulable threads. Additionally, routing interrupt processing requests directly to the hardware scheduler component allows the continued or resumed execution of the thread to be considered in relation to the state of all other system threads. This provides programmers with the ability to dynamically modify the priority of an interrupt depending on its request rate, processing time, and criticality. Figure 7 shows a block diagram of our CPU Bypass Interrupt Scheduler (CBIS), which takes the place of the traditional priority interrupt controller (PIC). The CBIS allows simultaneous interrupts to request processing by storing the requests into pending latches, similar to the operation of existing priority interrupt controllers. However, the semantics of the processing sequence for pending interrupts within the CBIS vary from existing PICs. Each interrupt channel (interrupt number) exists within one of the three states shown in the state machine for a single interrupt in Figure 8.

In the default state, no interrupts are pending and no thread has registered for the interrupt. Two scenarios can occur from this state:

1. an interrupt request from a device is asserted beforeany thread has requested that the interrupt be associ-ated with its thread identifier, or

2. a thread requests to be associated with the interruptbefore the interrupt has been asserted

In the first scenario, the CBIS state machine marks the interrupt as pending, but does not issue a scheduling request.


Figure 7. CPU Bypass Interrupt Scheduler (per-channel pending latches with accept/ignore control, a Thread_ID table, a finite state machine controller, global enable/disable, a priority encoder, and a bus interface)

Figure 8. CBIS State Machine (states: Default, Valid = 0, Pending = 0; interrupt received, Valid = 0, Pending = 1; associated, Valid = 1, Pending = 0. An Associate writes the thread identifier to the table; an Associate in the pending state returns the pending status; an Interrupt in the associated state sends the thread identifier to the scheduler)

When a thread issues an associate request for this pending interrupt, the CBIS sets the valid bit, representing a pending request, and returns the pending status, but does not transfer the thread identifier to the scheduler, as the thread is already running on the CPU. Thus, no context switching overhead is incurred and the exception handler thread may continue to run if it is the highest priority thread. In the second scenario, when the thread first associates with no pending interrupt, the CBIS writes the thread identifier into the request table, sets the valid bit, and returns a no-pending status to the software kernel.

When a thread associates its thread identifier with an interrupt, the CBIS will return a status code indicating either that an interrupt is already pending, so that the thread should continue running, or that there is no interrupt pending, so the thread state should be changed from running to blocked. If the thread is blocked because there is no interrupt pending, then the CBIS will place that thread's identifier into the ready-to-run queue once that interrupt occurs.

3 Performance

Performance results are shown for

1. basic scheduling operations of threads with no external interrupts

2. the effect of modifying the semantics of interruptsinto priority-based thread scheduling requests

We have analyzed our thread APIs with no external interrupts to closely study the actual scheduling delays and jitter of our hardware/software co-designed multithreaded kernel. All interrupts were disabled during these tests to eliminate external jitter. More recently, we have begun to quantify the effects of modifying the semantics of interrupts into thread scheduling requests. We present our initial testing of a set of simple, periodic, schedulable tasks and the measured performance gains of our approach versus standard interrupt semantics.

3.1 Thread Scheduling

The following timing results were generated on a MEMEC [10] development board with a Virtex II Pro 7 series hybrid chip using the traditional Xilinx priority interrupt controller. The CPU clock speed was 300 MHz, and the PLB, OPB, and FPGA clock speeds were 100 MHz. The scheduling delay times shown in Figures 9, 10, and 11 represent total end-to-end delay times, including the time taken by the CPU to acknowledge the interrupt, context switch into the non-critical interrupt service routine, handshake the PIC, enter the thread context switch code, and save the old thread's context; the measurement stops immediately


Figure 9. Scheduling Delay - 250 Threads

Figure 10. Scheduling Delay - 128 Threads

before control returns to the newly scheduled application thread. The measurements were obtained by creating an additional counter register in the timer hardware that resets to zero on the clock cycle when a scheduling interrupt is asserted to the PIC, and then continues to increment at the 10 nsec clock cycle time. The free-running counter was then read at the very end of the context switch code, immediately before the new thread runs.

Figures 9, 10, and 11 show mean scheduling delays of 1.97 µsecs, 1.97 µsecs, and 1.75 µsecs and maximum scheduling times of 3.3 µsecs, 3.1 µsecs, and 2.1 µsecs for 250, 128, and 2 active threads respectively. Although low, there is still small jitter (1.4 µsecs, 1.2 µsecs, and 0.3 µsecs) in the system. To identify the source of the jitter, we ran the same tests with the data cache, instruction cache, and both data and instruction caches turned off. With both caches off, the maximum jitter for 2 through 250 threads was around 10 clock cycles (100 nsecs). This is a strong indication that the jitter observed in the scheduling times is a result of data cache misses that occur during the saving and restoring of the interrupt routine and thread context switches, and should not be attributed to the hardware or software implementations.

Figure 12 shows the 0.8 µsec delay from the time that the interrupt request is generated by the hardware-based scheduler, through the CPU acknowledgment, context switching to the ISR, and handshaking the interrupt controller, to determining that the kernel code should run. This raw ISR delay represents over 30% of the total scheduling delay.

Figure 11. Scheduling Delay - 2 Threads

Figure 12. Raw Interrupt Delay

Figure 13. Interrupt Jitter Control (timelines of threads C1, C2, NC1, and NC2 over one scheduling period under the asynchronous interrupt scheme and under CBIS priority settings C1 > INT > C2, C2 > INT, and NC2 > INT)

Thread        C1     C2     NC1    NC2
No Int's      31114  49471  57068  60170
Interrupt     36477  54841  70167  75949
C > Pr > NC   31244  49692  69127  74592
Pr < NC       31090  49433  57050  60381

Figure 14. Worst Case Finish Time (Cycles)

Thread        C1    C2    NC1   NC2
No Int's      5     10    12    16
Interrupt     715   872   938   782
C > Pr > NC   22    47    1533  1783
Pr < NC       7     12    14    14

Figure 15. Jitter for Interrupts (Cycles)


3.2 Interrupt Control

Figure 13 shows a graphical representation of interrupt processing control through simple changing of priorities. For our experimental setup, we set a complete scheduling period of 100,000 clock cycles, or 1 msec, and ran 1,000 periods per test run. Our low-overhead, low-jitter multithreaded programming model allows us to support such a fast scheduling cycle because of our fine-grained and lightweight threads. We generated a periodic interrupt, with a period of 625 µsecs, that continued throughout the complete test run. In order to measure the overhead due to context switching only, our interrupt service routine contained no processing and simply returned. This allowed us to isolate the overhead and jitter due just to the invocation effects of the ISR. Figure 14 provides worst case finish times in clock cycles for our four threads, and Figure 15 shows the corresponding finish time jitter. The first rows in both figures show our ideal case with interrupts disabled. Thread C1 has a worst case finish time of 311 µsecs, C2 494 µsecs, NC1 570 µsecs, and NC2 601 µsecs. The corresponding jitter for C1 is 50 nsecs, C2 100 nsecs, NC1 120 nsecs, and NC2 160 nsecs. The second row of both figures (labeled Interrupt) shows the corresponding worst case finish times and jitter for normal asynchronous interrupt requests. As seen in Figure 15, substantial jitter is introduced in performing traditional interrupt context switching, even for a low-frequency (approximately 1.6 interrupts per scheduling period) interrupt.

The next two rows show the effects of routing the interrupt requests through the CBIS. In these cases, the interrupt request does not cause an ISR invocation. Instead, the interrupt is translated into a thread scheduling request and routed directly to our hardware resident scheduler. In the third row, the priority of the interrupt routine is set between the priorities of the critical and non-critical threads. The jitter effects are substantially eliminated from the critical threads and migrated into the non-critical threads. The last row shows the effects of lowering the interrupt service routine's priority to below that of the non-critical tasks. In this case, the finish times and jitter are within a few clock cycles of the no-interrupt base case. This is intuitive, as we now delay the processing of any interrupt during the scheduling period until after the threads have run.
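
The behavior above follows directly from treating the interrupt routine as just another entry in a priority-ordered ready queue. The toy scheduler below (our own illustration; the names and structure are not the hthreads implementation) shows why a below-NC priority defers interrupt service until all application threads have finished:

```c
#include <assert.h>
#include <string.h>

/* Toy priority scheduler: lower number = higher priority.
 * Scanning in insertion order gives FIFO within a priority level. */
struct thread { const char *name; int priority; int done; };

static struct thread *schedule(struct thread *t, int n)
{
    struct thread *best = NULL;
    for (int i = 0; i < n; i++)
        if (!t[i].done && (best == NULL || t[i].priority < best->priority))
            best = &t[i];
    return best;  /* NULL once every thread has run */
}
```

Once the CBIS turns an interrupt into a ready-thread entry, it competes in this same queue as the ordinary threads; with a priority below NC1 and NC2 it is simply selected last, so the critical and non-critical finish times match the interrupt-free base case.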

4 Conclusion

In this paper we have presented the hardware/software co-design of our hthreads real-time multithreaded operating system kernel. Our co-designed operating system takes advantage of new hybrid CPU/FPGA chips as an enabling technology to provide a level of scheduling precision not obtainable from traditional software-based operating systems. Our current system supports standard multithreaded operations through portable APIs, allowing 256 simultaneous threads with 128 priority levels, and 64 semaphores.

Our co-designed components can be integrated into the FPGA resources along with other standard components to form a complete general purpose SoC solution. This new capability is a fundamental step in providing very precise, small-footprint operating system capabilities using familiar general purpose programming models to the embedded domain. More detailed descriptions of the hybrid threads project can be found at www.ittc.ku.edu/hybridthreads.

Acknowledgement

The work in this article is partially sponsored by National Science Foundation EHS contract CCR-0311599.

References

[1] Altera Corporation. 2005. www.altera.com.

[2] D. Andrews. Programming models for hybrid FPGA/CPU computational components: A missing link. IEEE Micro, pages 42–53, July/August 2004.

[3] D. Andrews, D. Niehaus, and P. Ashenden. Programming models for hybrid FPGA/CPU computational components. IEEE Computer, pages 118–120, January 2004.

[4] D. Andrews, D. Niehaus, and R. Jidin. Implementing the thread programming model on hybrid FPGA/CPU computational components. Proceedings of the 1st Workshop on Embedded Processor Architectures of the International Symposium on Computer Architecture (ISCA), February 2004.

[5] FSMLabs Inc. 2005. www.fsmlabs.com.

[6] R. Jidin, D. Andrews, D. Niehaus, and P. Ashenden. Implementing multi-threaded system support for hybrid FPGA/CPU computational components. Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), June 2004.

[7] P. Kuacharoen, M. Shalan, and V. Mooney. A configurable hardware scheduler for real-time systems. International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'03), pages 96–101, 2003.

[8] KURT Linux. 2005. www.ittc.ku.edu/kurt.

[9] J. Lee, V. Mooney, K. Instom, A. Daleby, T. Klevin, and L. Lindh. A comparison of the RTU hardware RTOS with a hardware/software RTOS. Proceedings of the Asia and South Pacific Design Automation Conference, 2003.

[10] Memec. 2005. www.memec.com.

[11] D. Niehaus and D. Andrews. Using the multi-threaded computation model as a unifying framework for hardware-software co-design and implementation. Proceedings of the 9th Workshop on Object-oriented Real-time Dependable Systems (WORDS 2003), September 2003.

[12] RF RealFast AB. 2005. www.realfast.se.

[13] RTAI Project. 2005. www.rtai.org.

[14] T. Samuelsson, M. Akerholm, P. Nygren, J. Starner, and L. Lindh. A comparison of multiprocessor real-time operating systems implemented in hardware and software. Proceedings of the International Workshop on Advanced Real-Time Operating System Services (ARTOSS), 2003.

[15] The Open Group. 2005. www.opengroup.org.

[16] Xilinx Inc. 2005. www.xilinx.com.