
1

RTSS 2002 -- December 3-5, Austin, Texas

Alex Dean
Center for Embedded Systems Research
Dept. of ECE, NC State
[email protected]
http://www.cesr.ncsu.edu/agdean/

Compiling for Fine-Grain Concurrency: Planning and Performing Software Thread Integration

2

Overview
• STI Background
  – Hardware to software migration
  – Code Transformations
• STI Methods
  – Preparation
  – Planning Integration
  – Performing Integration
• Experiments
  – Target System
  – Analysis
  – Integration
  – Results
    • Code Expansion
    • Performance Gains
• Current and Future Work

3

Hardware to Software Migration
• What is it?
  – Replacing hardware components with software functions on a conventional CPU (uniprocessor)
• Why do it?
  – Custom HW is too expensive unless production volume is high
  – Cost, size, reliability, weight, power
  – Function availability, time to market, field upgrades
• Who does it?
  – Everyone
• Why do they hate it so passionately?
  – Code in assembler and it takes forever to develop
    • Tools are real-time-ambivalent
  – Code in C and you can’t get decent performance, but you finish coding sooner
    • Slow context switches and scheduling
    • Tools are real-time-oblivious

[Figure: Cost of Microcontroller Throughput. Price per MIPS (US ¢), 0 to 60, versus Year, 1995 to 2002. Accompanying block diagrams labeled CPU, RAM, ROM, I/O.]

4

Software Thread Integration for HSM
• Thread-Level - Basic STI
  – Set of transforms which interleave code freely and efficiently based on sub-thread timing requirements
  – Can trade off code size for execution efficiency
  – Transforms implemented in compiler
• System-Level - STI for HSM
  – Choosing guests - finding idle time
  – Choosing hosts - finding temporal determinism
  – Matching threads to integrate - predicting system efficiency and performance
  – Triggering guests to meet thread timing requirements - latency and resources
• Result is greater efficiency
  – Integrated threads more efficient
  – Integration process automated

[Figure: a Hardware Function becomes an RT Guest Thread; the Idle Time in the RT Guest Thread is filled by an Existing Thread to form the Integrated Thread.]

5

Thrint
[Figure: Thrint tool flow. File labels: foo.s, foo.id, foo.int.s. Stages: Control-flow Analysis, Data-flow Analysis, Static Timing Analysis, Integration Analysis, Integration. Auxiliary tool labels: GProf, XVCG, GnuPlot.]

6

Detailed View
[Figure: detailed tool flow. Stages: Control-Flow and Control-Dependence Analysis, Data-Flow Analysis (w/loop iteration analysis), Static Timing Analysis, Planning Integration (Single Events, Looping Events), Performing Integration (Single Events, Looping Events), Code Regeneration.]

7

Single Event Integration - Loop Node

Split Loop (Optimize for Speed)

Guard Guest (Optimize for Space)
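The two options can be illustrated with a minimal Python sketch, standing in for the assembly-level transforms. It assumes a host loop of `n` iterations and a single guest event that must run just before host iteration `k`, with `0 <= k < n`. The function names and the trace list are hypothetical, and the speed/space reading in the comments is an interpretation of the slide labels rather than a description of Thrint's code generator.

```python
def host_body(i, trace):
    trace.append(f"host{i}")

def guest(trace):
    trace.append("guest")

def split_loop(n, k):
    """Split the host loop at iteration k and place the guest between
    the two halves. No per-iteration test, but the loop is duplicated."""
    trace = []
    for i in range(0, k):
        host_body(i, trace)
    guest(trace)
    for i in range(k, n):
        host_body(i, trace)
    return trace

def guard_guest(n, k):
    """Keep one loop and guard the guest with a test that runs on every
    iteration. Smaller code, extra work per iteration."""
    trace = []
    for i in range(n):
        if i == k:
            guest(trace)
        host_body(i, trace)
    return trace
```

Both variants produce the same execution order; they differ only in where the cost is paid.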

8

Loop Integration - Guest Loop Slower
[Figure: original host and guest loops. Labels: Host Iteration, Guest Iteration (Real-Time), Idle Time, Busy-wait Padding. Transformation shown: Unroll Host.]

9

1. Identify Guest Nodes

PROCEDURE VidRef_LineDraw__Fv_en INTO VidRef_LineDraw__Fv_en
  TOLERANCE 0
  BLOCK Video_Reset_Ptr AT 20
  LOOP Video_Loop PERIOD 40 ITERATION_COUNT 128
    BLOCK Video_Pix0 INTO_LOOP Video_Loop FIRST_AT 90
    BLOCK Video_Pix1 INTO_LOOP Video_Loop FIRST_AT 100
    BLOCK Video_Pix2 INTO_LOOP Video_Loop FIRST_AT 110
    BLOCK Video_Pix3 INTO_LOOP Video_Loop FIRST_AT 120
  BLOCK Video_End IMPLICIT
END
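One way to read the loop directives is that each guest block recurs once per loop period, starting at its FIRST_AT time. The sketch below expands the directive values under that assumption; the slide does not state the time unit, and `guest_target_times` is a hypothetical helper, not part of Thrint.

```python
def guest_target_times(first_at, period, iteration_count):
    """Target time of each execution of one guest block,
    assuming iteration k runs at first_at + k * period."""
    return [first_at + k * period for k in range(iteration_count)]

# Values copied from the directives above.
PERIOD, ITERATION_COUNT = 40, 128
first_at = {"Video_Pix0": 90, "Video_Pix1": 100,
            "Video_Pix2": 110, "Video_Pix3": 120}

schedule = {name: guest_target_times(t, PERIOD, ITERATION_COUNT)
            for name, t in first_at.items()}

# Merge all blocks into one timeline of pixel outputs.
timeline = sorted(t for times in schedule.values() for t in times)
```

Under this reading the four blocks interleave into one evenly spaced stream: 4 blocks times 128 iterations gives 512 pixel outputs, 10 time units apart, which matches the 512-pixel row width of the target application.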

10

2. Analyze Host and Guest Functions
• Data Flow Analysis
  • Register reallocation partitions register file
  • Loop iteration counts extracted
• Static Timing Analysis
  • Predict best and worst case execution times
  • Find temporally deterministic host segments
  • Find idle time in guests

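[Figure: Jitter (0 to 10) versus Segment Duration (0 to 100) for segments A, B, C and D, next to a graph of timing-annotated nodes grouped into Segment A, Segment B, Seg. C and Seg. D. The individual node values are not recoverable from the extraction.]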
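A minimal sketch of the best-case/worst-case prediction step, assuming a simple tree model of the code: straight-line code has a fixed cycle count, a sequence sums its children, a predicate takes its fastest or slowest branch, and a loop multiplies its body by a known iteration count. The `Node` type and the definition of jitter as worst case minus best case are assumptions made for illustration; the cost of the branch test itself is ignored.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                      # "code", "seq", "predicate" or "loop"
    cycles: int = 0                # used by "code" nodes
    children: List["Node"] = field(default_factory=list)
    iterations: int = 1            # used by "loop" nodes

def exec_time(node):
    """Return (best_case, worst_case) execution time in cycles."""
    if node.kind == "code":
        return node.cycles, node.cycles
    times = [exec_time(child) for child in node.children]
    best_sum = sum(b for b, _ in times)
    worst_sum = sum(w for _, w in times)
    if node.kind == "seq":
        return best_sum, worst_sum
    if node.kind == "loop":        # children form the loop body
        return best_sum * node.iterations, worst_sum * node.iterations
    if node.kind == "predicate":   # children are alternative branches
        return min(b for b, _ in times), max(w for _, w in times)
    raise ValueError(f"unknown node kind: {node.kind}")

def jitter(node):
    best, worst = exec_time(node)
    return worst - best

segment = Node("seq", children=[
    Node("code", cycles=3),
    Node("predicate", children=[Node("code", cycles=5),
                                Node("code", cycles=9)]),
    Node("loop", iterations=4, children=[Node("code", cycles=2)]),
])
```

In this model a segment with zero jitter is temporally deterministic, which is the property the analysis looks for in hosts.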

11

3.a Plan Integration (w/o Loop Fusion)
For each explicit single event guest in integration directives list
– Find host node which will be executing (in current segment)

Node::Find_Hosts(target_time, int_plan)
  If this node finishes too early
    continue to next node
  Else
    pad node start jitter
    if this node ends in target time
      plan to put guest AFTER this node
    else descend into CDG, switch on this->type
      CODE: plan to put guest WITHIN this node
      PREDICATE: for each condition TV
        get_first_child(TV)->find_hosts
      LOOP: if guest node is not in loop
          find hosts in loop (will plan to split loop or guard guest as needed)
        else
          find hosts (loop fusion has already been planned, so just find correct location without any transformations)

[Figure: timeline for "this node" showing Best Case Start Time, Worst Case Start Time, Best Case End Time and Worst Case End Time against the Guest Target Time Range, with the Pad Start Jitter step marked.]
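The host search above can be sketched in Python under simplifying assumptions: only CODE and PREDICATE nodes are modeled (the LOOP case is omitted), jitter is assumed to be padded away so every node has a single duration, and a predicate lasts as long as its longest branch. The `Node` type, the `IN_PADDING` plan entry and the `[lo, hi]` target window are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    name: str
    kind: str                      # "code" or "predicate"
    cycles: int = 0                # duration of a "code" node
    branches: Dict[str, List["Node"]] = field(default_factory=dict)

def duration(node):
    """Padded duration: a predicate takes as long as its longest branch."""
    if node.kind == "code":
        return node.cycles
    return max(sum(duration(c) for c in seq)
               for seq in node.branches.values())

def find_hosts(nodes, start, lo, hi, plan):
    """Walk sibling nodes from time `start` and plan one guest for the
    target window [lo, hi]. Returns True once a host has been planned."""
    t = start
    for node in nodes:
        end = t + duration(node)
        if end < lo:                   # finishes too early: next node
            t = end
            continue
        if end <= hi:                  # ends inside the target window
            plan.append(("AFTER", node.name))
        elif node.kind == "code":      # window falls inside this code
            plan.append(("WITHIN", node.name, lo - t))
        else:                          # predicate: cover every path
            for cond, seq in node.branches.items():
                if not find_hosts(seq, t, lo, hi, plan):
                    # Short branch: the guest lands in its padding.
                    plan.append(("IN_PADDING", f"{node.name}:{cond}"))
        return True
    return False

host = [
    Node("A", "code", cycles=10),
    Node("P", "predicate", branches={
        "T": [Node("B", "code", cycles=8)],
        "F": [Node("C", "code", cycles=3), Node("D", "code", cycles=2)],
    }),
    Node("E", "code", cycles=5),
]
```

With a target time of 15 and zero tolerance, the guest is planned within `B` on the true path and after `D` on the false path, so it runs at the same time whichever branch is taken.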

12

3.b Plan Loop Fusion
• for each explicit looping guest event in integration directives list
  – while guest loop iterations remain
    • find time covered by guest iters
    • if any host loops overlap, for each host loop
      – if guest loop starts first
        » peel preceding iters from guest loop
        » plan integration as non-looping guests
      – else if host loop starts first
        » plan to unroll guest or host loop
        » mark for fusion
        » if loop iterations remain
        » mark for clean-up loops
    • else no host loops during guest loop
      – peel all guest loop iterations
      – plan their integration as non-looping guests

[Figure: Host and Guest timelines in three steps: peel guest loop iterations (P); fuse guest loop iterations with host (P, F); peel remaining guest loop iterations (P, F, P).]
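The peel/fuse/peel split can be sketched as follows, assuming a single host loop that occupies a known time interval and a guest loop with a fixed period. The second helper covers the case where the guest loop is slower than the host loop: the host is unrolled so that one fused iteration spans one guest period, and any remainder becomes padding. Both function names and the example numbers are hypothetical.

```python
def plan_loop_fusion(g_first, g_period, g_count, h_start, h_end):
    """Classify guest iterations by where their target time falls
    relative to the host loop interval [h_start, h_end).
    Returns (peel_before, fuse, peel_after) iteration counts."""
    before = fused = after = 0
    for k in range(g_count):
        t = g_first + k * g_period
        if t < h_start:
            before += 1        # peeled, integrated as non-looping guest
        elif t < h_end:
            fused += 1         # fused with the host loop
        else:
            after += 1         # peeled, handled after the host loop
    return before, fused, after

def unroll_plan(g_period, h_iter):
    """Guest loop slower than host loop (assumes h_iter <= g_period):
    unroll the host so one fused iteration fills one guest period.
    Returns (unroll_factor, padding_cycles)."""
    factor = g_period // h_iter
    padding = g_period - factor * h_iter
    return factor, padding
```

For example, a guest loop with first target time 90, period 40 and 128 iterations against a host loop covering times 250 to 3000 yields 4 peeled, 69 fused and 55 peeled iterations.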

13

Target Host Nodes Are Now Identified

14

4. Integrate

• Det_Segment::Integrate(plan)
  – for each top_level_guest
    • if guest is loop,
      – for each host
        » if has a loop transform, do it
        » make guarded copy of guest loop after host loop if needed
      – cur_guest <= top_level_guest’s first child
    • else cur_guest <= top_level_guest
  – do
    • pad previous nodes (start jitter)
    • pad host node (end jitter)
    • for each host cur_host
      – do host loop transforms
      – update cur_host based on splitting, prev guests, etc.
      – insert implicit guests and explicit guest
      – advance cur_host
    • if cur_guest in guest loop
      – advance to next guest
  – while cur_guest
  – for each top_level_guest
    • create fused loop control tests
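The padding steps can be pictured as equalizing the lengths of alternative paths so that whatever follows starts at a fixed time. The sketch below assumes padding is counted in cycles of filler added to each shorter path; `pad_branches` is a hypothetical helper, not a Thrint routine.

```python
def pad_branches(branch_cycles):
    """Given the cycle count of each alternative path, return the
    filler cycles each path needs to match the longest one."""
    worst = max(branch_cycles.values())
    return {name: worst - cycles for name, cycles in branch_cycles.items()}

padding = pad_branches({"then": 8, "else": 5})
```

After padding, both paths take 8 cycles, so a guest inserted after the branch sees no start jitter from it.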

15

Application: NTSC Video Signal Generator
• Predictable 100 MIPS CPU (Alpha ISA)
• Fixed memory access time (on-chip memory)
• 512x494 resolution image
• Generate stream of pixels (100 ns each) and sync signals in SW to feed a DAC
• Coarse-grain idle time handled by timer-based ISRs (vertical sync, serration, equalization pulses)
• Fine-grain idle time handled by integration (between pixels)

[Figure: Idle Time Distribution (0% to 60%) versus Idle Time Bubble Size. Annotation: "Generate with integrated thread".]
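The figures on this slide fix the instruction budget between pixels. The arithmetic below assumes that "100 MIPS" means one instruction every 10 ns with no stalls, which is how a predictable CPU with fixed memory access time is read here; the variable names are illustrative.

```python
MIPS = 100          # million instructions per second (from the slide)
PIXEL_NS = 100      # time per pixel in nanoseconds (from the slide)
ROW_PIXELS = 512    # pixels per row (from the slide)

# 100 MIPS -> 1000 / 100 = 10 ns per instruction.
ns_per_instruction = 1000 // MIPS

# Instruction slots between consecutive pixel outputs.
instructions_per_pixel = PIXEL_NS // ns_per_instruction

# Time spent emitting the visible pixels of one row.
active_row_ns = ROW_PIXELS * PIXEL_NS
```

This gives 10 instruction slots per pixel and 51,200 ns (51.2 µs) of pixel output per row. Any slots not needed to emit a pixel are the fine-grain idle time that integration fills with host work.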

16

Software Architecture
• Application defers work via queues
  – Line and sprite drawing functions split into enqueuing stub, dequeuing service routine
  – Interrupt service routine triggers integrated functions for rendering and display refresh (host work fills idle inter-pixel time)
  – If both queues are empty, ISR calls dedicated video refresh function (nops fill idle inter-pixel time)
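A minimal sketch of this queue-based structure, with Python standing in for the embedded code. All names are hypothetical, the log entries stand in for calls to the integrated and dedicated routines, and the choice to serve the line queue before the sprite queue is an assumption; the slide does not state a priority.

```python
from collections import deque

line_queue = deque()
sprite_queue = deque()
log = []

def draw_line(x0, y0, x1, y1):
    """Enqueuing stub: the application call only records the request."""
    line_queue.append((x0, y0, x1, y1))

def draw_sprite(x, y, sprite_id):
    """Enqueuing stub for sprite drawing."""
    sprite_queue.append((x, y, sprite_id))

def refresh_isr(row):
    """Stand-in for the ISR that runs once per video row."""
    if line_queue:
        # Integrated DrawLine + refresh: host work fills inter-pixel time.
        log.append(("integrated_line", row, line_queue.popleft()))
    elif sprite_queue:
        # Integrated DrawSprite + refresh.
        log.append(("integrated_sprite", row, sprite_queue.popleft()))
    else:
        # Dedicated refresh: nops fill inter-pixel time.
        log.append(("dedicated_refresh", row, None))
```

Calling `draw_line(0, 0, 5, 5)` followed by two invocations of `refresh_isr` serves the line request during the first row and falls back to the dedicated refresh on the second.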

17

Initial Integration
[Figure: DrawSprite integrated with Video Refresh; DrawLine integrated with Video Refresh.]

18

Final Integration
[Figure: labels DrawSprite and DrawLine.]

19

Performance
[Figure: bar chart of Pixels Rendered per Video Row Refreshed (0 to 700) for Line Drawing, Circle Drawing and Sprite Drawing, Discrete versus Integrated. Data labels as extracted: 31, 10157.8, 190, 180, 690; the mapping of labels to bars is not recoverable.]
[Figure: CPU Time Remaining (0% to 35%) versus Pixels Drawn per Video Row Refreshed (0 to 600). Series: Discrete Draw Line, Integrated Draw Line, Discrete Draw Sprite, Integrated Draw Sprite.]

20

Code Expansion
[Figure: stacked bar chart of Function Size (instructions), 0 to 1400, for DrawLine, DrawSprite and DrawCircle. Components: Orig. host, Orig. guest, Reg. realloc., Repl. host, Repl. guest, Lp. fusion test, Padding.]

• Code expansion is limited to integrated functions, not entire application

21

Conclusions
• Introduced techniques to plan and perform STI
• STI improves HSM
  – Product (Code)
    • Reclaims idle time better than traditional HSM methods
    • Helps designers meet embedded system constraints
  – Process (Development)
    • Automated, built into post-pass compiler
    • Provides a sharp tool, allows development at higher level (i.e. faster)
• Current Work: Extend concepts of STI
  – HSM for bit-banged communication protocols
  – Retargeted Thrint to support AVR 8-bit microcontroller
    • Communication protocols
    • Encryption
    • Video Generation
  – Integration to boost ILP and performance on wide CPUs
• [email protected], http://www.cesr.ncsu.edu/agdean/
