
Page 1:

Embedded TechCon
Practical Techniques for Embedded System Optimization Processes

Rob Oshana
robert.Oshana@freescale.com

Page 2:

Agenda
• Follow a process
• Define the goals, quantitatively
• The platform architecture makes a big difference
• Don’t be naïve about the algorithms
• Do some estimation and modelling
• Help out the compiler if possible
• Power is becoming more important
• What about multiple cores?
• Track what you are doing


Page 3:

There is a right way and a wrong way

• Donald Knuth: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

• Discipline and an iterative approach are the keys to effective serial performance tuning
  – use measurements and careful analysis to guide decision making
  – change one thing at a time
  – meticulously re-measure to confirm that changes have been beneficial

Page 4:

There are always tradeoffs
• Symptoms:
  – Excessive optimization
  – Premature optimization
  – Fixation on efficiency
• These consume project resources, delay release, and compromise software design without directly improving performance
• Model first before optimizing

Page 5:

Follow a process

Page 6:

Spend time up front understanding your non-functional requirements

Functional: "The embedded software shall… (monitor, control, etc.)"

Non-functional: "The embedded software shall be… (fast, reliable, scalable, etc.)"

Example (IPFwd): "fast", measured in kpps. Should: 600, Must: 550

Functional = what the system should do

Non-functional = how well the system should do it

“it has to be really fast”

“it has to be able to kick <competitor A>’s butt”

-- examples of real performance “requirements”

Page 7:

There is a Difference Between Latency and Throughput

“It is not possible to determine both the position and momentum of an object beyond a certain amount of precision.”

-Heisenberg’s Principle

• Similarly, it is not possible to design a system that provides both the lowest latency and the highest throughput
• However, real-world systems (such as media, eNodeB, etc.) need both
• Need to tune the system for the right balance of latency and throughput

Latency: 10 usec avg, 50 usec max wake-up latency for RT tasks

Throughput: 50 Mbps UL, 100 Mbps DL for 512 B packets

Page 8:

Map the application to the core
• CPU (latency-oriented cores)
• GPU (throughput-oriented cores)
• VLIW DSP core

[Block diagram: VLIW DSP CPU core with two data paths (units D1, M1, S1, L1 with the A register file; units L2, S2, M2, D2 with the B register file), program fetch, instruction dispatch, instruction decode, interrupts, control registers, control logic, emulation, and test]

Or offload to the cloud

Page 9:

Estimating embedded performance can be done prior to writing the code

1. Maximum CPU performance: "What is the maximum number of times the CPU can execute your algorithm?" (max # channels)

2. Maximum I/O performance: "Can the I/O keep up with this maximum # channels?"
   – CPU load (% of maximum)
   – At this CPU load, what other functions can I perform?

3. Available high-speed memory: "Is there enough high-speed internal memory?"

Page 10:

Example – Performance Calculation

Algorithm: 200-tap (nh) low-pass FIR filter

Frame size: 256 (nx) 16-bit elements

Sampling frequency: 48KHz

How many channels can the core handle given this algorithm?

Max # channels: does not include overhead for interrupts, control code, RTOS, etc.

Are the I/O and memory capable of handling this many channels?

Required memory assumes: 60 different filters, 199 element delay buffer, double buffering rcv/xmt

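A back-of-the-envelope sketch of this kind of channel estimate, in C. The core clock and MAC throughput below are illustrative assumptions (a hypothetical 600 MHz core sustaining two 16-bit MACs per cycle), not figures from the slide:

  #include <stdio.h>

  /* Rough channel-count estimate for a block FIR filter.
   * All hardware numbers are illustrative assumptions. */
  int main(void)
  {
      const double cpu_hz       = 600e6; /* assumed core clock          */
      const double macs_per_cyc = 2.0;   /* assumed sustained MAC rate  */
      const double fs = 48e3;            /* sampling frequency (Hz)     */
      const int    nx = 256;             /* frame size (samples)        */
      const int    nh = 200;             /* FIR filter taps             */

      /* One output sample costs nh MACs; one frame costs nx * nh MACs. */
      double cycles_per_frame = (double)nx * nh / macs_per_cyc;
      /* A new frame arrives every nx / fs seconds. */
      double frames_per_sec   = fs / nx;
      /* Cycles one channel consumes per second, and the resulting cap. */
      double cycles_per_chan  = cycles_per_frame * frames_per_sec;
      double max_channels     = cpu_hz / cycles_per_chan;

      printf("cycles/frame: %.0f, max channels: %.1f (before overhead)\n",
             cycles_per_frame, max_channels);
      return 0;
  }

With these assumed numbers the cap works out to roughly 125 channels, which, as the slide notes, still excludes interrupt, control code, and RTOS overhead.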

Page 11:

Estimation results drive options

[CPU load graph showing ~20% load]

Application: simple, low-end (CPU load 5-20%)
What do you do with the other 80-95%?
• Additional functions/tasks
• Increase sampling rate (increase accuracy)
• Add more channels
• Decrease voltage/clock speed (lower power)

Application: complex, high-end (CPU load 100%+)
How do you split up the tasks wisely?
• GPP/uC (user interface), DSP (all signal processing)
• DSP (user i/f, most signal proc), FPGA (hi-speed tasks)
• GPP (user i/f), DSP (most signal proc), FPGA (hi-speed)

[Diagrams: example partitions, such as multiple cores plus an accelerator, or DSP plus accelerator plus GPP]

Page 12:

Help out the compiler
► A compiler maps high-level code to a target platform
  • Preserves the defined behavior of the high-level language
  • The target may provide functionality that is not directly mapped into the high-level language
  • The application may use algorithmic concepts that are not handled by the high-level language
► Understanding how the compiler generates code is important to writing code that will achieve the desired results

Page 13:

Big compiler impact 1: ILP

restrict enables SIMD optimizations

/* Stores may alias loads: operations must be performed sequentially. */
void VecAdd(int *a, int *b, int *c)
{
    for (int i = 0; i < 4; i++)
        a[i] = b[i] + c[i];
}

/* Independent loads and stores: operations can be performed in parallel! */
void VecAdd(int * restrict a, int *b, int *c)
{
    for (int i = 0; i < 4; i++)
        a[i] = b[i] + c[i];
}

Page 14:

Big compiler impact 2: data locality

• Unroll the outer loop and fuse the new copies of the inner loop
• Spatial locality of B is enhanced
• Increases the size of the loop body and hence the available ILP

General guideline: align computation and locality

/* Original loop nest. */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        A[i][j] = B[j][i];

/* Outer loop unrolled by two, inner loops fused. */
for (i = 0; i < N; i += 2)
    for (j = 0; j < N; j++) {
        A[i][j]   = B[j][i];
        A[i+1][j] = B[j][i+1];
    }

Page 15:

Use cache efficiently

[Memory hierarchy diagram: CPU at 600 MHz; on-chip L1 cache at 600 MHz; on-chip L2 cache at 300 MHz; external memory at ~100 MHz. Memory size increases, and speed/cost decrease, moving away from the CPU.]
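One common way to exploit this hierarchy is loop tiling (blocking), sketched below for the transpose-style copy from the previous slide. The matrix size and 32-element tile are illustrative assumptions, not numbers from the deck:

#define N     1024   /* assumed matrix dimension */
#define BLOCK 32     /* assumed tile size; tune to the L1 cache */

/* Tiled transpose-copy: each BLOCK x BLOCK tile of A and B stays
 * resident in cache while it is worked on, so B's column-order
 * accesses stop missing on every element. Assumes N % BLOCK == 0. */
void transpose_tiled(int A[N][N], int B[N][N])
{
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    A[i][j] = B[j][i];
}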

Page 16:

Use the right algorithm: think Big O

[Chart: DFT versus FFT (radix 2) performance, cycle count on a log scale (1 to 10,000,000) versus number of points. The DFT grows as O(n²) while the FFT grows as O(n log n), so the percentage gap widens rapidly with problem size.]
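A quick sketch of why the asymptotics matter, comparing rough operation counts for a direct DFT (about N² complex multiplies) against a radix-2 FFT (about (N/2)·log2 N butterflies). These are order-of-magnitude estimates, not the cycle counts from the chart:

#include <stdio.h>
#include <math.h>

/* Rough operation counts: direct DFT ~ N^2, radix-2 FFT ~ (N/2)*log2(N). */
int main(void)
{
    const int sizes[4] = { 64, 256, 1024, 4096 };
    for (int k = 0; k < 4; k++) {
        int    n   = sizes[k];
        double dft = (double)n * n;
        double fft = (n / 2.0) * log2((double)n);
        printf("N=%5d  DFT ~ %10.0f  FFT ~ %8.0f  ratio ~ %.0fx\n",
               n, dft, fft, dft / fft);
    }
    return 0;
}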

Page 17:

Understand Performance Patterns (and anti-patterns)

[Diagram of performance patterns and anti-patterns; legend: green – data, red – control, blue – termination]

Page 18:

Power Optimization: Active vs. Static Power

Power consumption in CMOS circuits:

  P_total = P_active + P_static

Capacitance is charge over voltage, C = q / V, so q = C·V

Energy per switching event: W = V·q = V·(C·V) = C·V²

Power is work over time; in this discussion, the time of interest is set by how many times per second we oscillate the circuit:

  P = W / T, and since T = 1/F, P = W·F, or substituting, P = C·V²·F
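A worked example of the P = C·V²·F relationship with made-up numbers; the capacitance, voltage, and frequency below are illustrative assumptions, not values from the slides:

#include <stdio.h>

/* Illustrative dynamic-power estimate using P = C * V^2 * F. */
int main(void)
{
    double c = 10e-12; /* switched capacitance: 10 pF (assumed)  */
    double v = 1.0;    /* supply voltage: 1.0 V (assumed)        */
    double f = 600e6;  /* switching frequency: 600 MHz (assumed) */

    double p = c * v * v * f;                 /* watts */
    printf("P_active ~ %.1f mW\n", p * 1e3);  /* ~6.0 mW */

    /* Halving V cuts active power by 4x; halving F cuts it by 2x. */
    printf("at V/2   ~ %.1f mW\n", c * (v / 2) * (v / 2) * f * 1e3);
    return 0;
}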

Page 19:

Top Ten Power Optimization Techniques
1. Architect SW to have natural "idle" points (including low-power boot)
2. Use interrupt-driven programming (no polling; use the OS to block; see the sketch after this list)
3. Place code and data close to the processor to minimize off-chip accesses (and overlay from non-volatile to fast memory)
4. Smart placement so frequently accessed code/data sits close to the CPU (and use hierarchical memory models)
5. Size optimizations to reduce footprint, memory, and the corresponding leakage
6. Optimize for speed to enable more CPU idle modes or a reduced CPU frequency (benchmark and experiment!)
7. Don't over-calculate: use minimum data widths, reduce bus activity, use smaller multipliers
8. Use DMA for efficient transfers (not the CPU)
9. Use co-processors to efficiently handle/accelerate frequent or specialized processing
10. Use more buffering and batch processing to allow more computation at once and more time in low-power modes
11. Use the OS to scale V/F and analyze/benchmark (make it right first!)
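A minimal sketch of item 2, contrasting a busy-poll loop with a blocking wait. The device descriptor dev_fd and handle_device_data() are hypothetical placeholders, and POSIX poll() stands in for whatever blocking primitive the OS provides:

#include <poll.h>

/* Hypothetical device descriptor and handler; illustrative only. */
extern int  dev_fd;
extern void handle_device_data(void);

/* Power-hungry: spins the CPU checking for data, so the core
 * never idles and never drops into a low-power state. */
void busy_poll_loop(void)
{
    for (;;) {
        struct pollfd p = { .fd = dev_fd, .events = POLLIN };
        if (poll(&p, 1, 0) > 0)   /* timeout 0: never blocks */
            handle_device_data();
    }
}

/* Power-friendly: blocks in the OS until the device wakes the
 * task, letting the core sit in an idle/low-power mode meanwhile. */
void blocking_wait_loop(void)
{
    for (;;) {
        struct pollfd p = { .fd = dev_fd, .events = POLLIN };
        if (poll(&p, 1, -1) > 0)  /* timeout -1: sleep until data */
            handle_device_data();
    }
}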

Page 20:

When you have more than one core to optimize (multicore)

Goal: exploit multicore resources

Step 1: optimize the serial implementation
• easier
• less time consuming
• less likely to introduce bugs
• reduces the gap, so less parallelization is needed
• allows parallelization to focus on parallel behavior, rather than a mix of serial and parallel issues

Serial optimization is not the end goal
• Apply changes that will facilitate parallelization and the performance improvements that parallelization can bring
• Serial optimizations that interfere with, or limit, parallelization should be avoided:
  – avoid introducing unnecessary data dependencies
  – avoid exploiting details of the single-core hardware architecture (such as cache capacity)

Page 21:

There’s Amdahl and then there’s Gustafson (know the difference)

• Amdahl's law (conventional wisdom)
  − Speedup decreases with an increasing portion of serial code (S): diminishing returns
  − Imposes a fundamental limit (1/S) on speedup
  − Assumes the parallel vs. serial code ratio is fixed for any given application – unrealistic?
• Gustafson's law (theoretical max?)
  − Applies to applications without a fixed code ratio, e.g. networking/routing
  − Speedup becomes proportional to the number of cores in the system
  − Packet processing provides opportunity for parallelism

[Charts: time vs. problem size under each model]
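A small sketch contrasting the two models for an assumed serial fraction S and core count N; the 10% serial fraction is an illustrative number, not one from the slides:

#include <stdio.h>

/* Amdahl:    speedup = 1 / (S + (1 - S) / N)   (fixed problem size)
 * Gustafson: speedup = N - S * (N - 1)         (problem grows with N)
 * where S is the serial fraction and N the number of cores. */
int main(void)
{
    const double s = 0.10;                    /* assumed 10% serial code */
    const int cores[5] = { 2, 4, 8, 16, 64 };

    for (int i = 0; i < 5; i++) {
        int    n = cores[i];
        double amdahl    = 1.0 / (s + (1.0 - s) / n);
        double gustafson = n - s * (n - 1);
        printf("N=%2d  Amdahl %5.2fx  Gustafson %5.2fx\n",
               n, amdahl, gustafson);
    }
    return 0;
}

With these numbers, Amdahl's curve flattens toward the 1/S = 10x ceiling while Gustafson's keeps scaling roughly with the core count.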

Page 22:

Many types of Parallelism (more than one may apply)

Page 23:

Multithreaded Programming has some hazards

Deadlock

Livelock

False sharing (see the sketch after this list)

Data hazards

Lock contention
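A minimal sketch of one of these hazards, false sharing: two threads update adjacent counters that would normally share a cache line, so the line ping-pongs between cores even though there is no logical sharing. Padding each counter to its own line is one common fix; the 64-byte line size and the GCC-style alignment attribute are assumptions:

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed cache line size */

/* Each counter gets its own cache line so the two writer threads
 * stop invalidating each other's copy on every increment. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_counter counters[2];

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (long i = 0; i < 10000000; i++)
        counters[id].value++;   /* each thread touches only its own line */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = { 0, 1 };

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}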

Page 24:

Optimize for the best-case scenario, not the worst case

No Lock Contention – No System Call

Page 25:

Very useful when the number of threads is low

Optimize for the best-case scenario, not the worst case

Since most operations will not require arbitration between processes, the contended (slow) path will not be taken in most cases
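A rough sketch of the idea behind these two slides, in the spirit of a futex-style lock: the uncontended case is handled entirely in user space with one atomic operation and no system call, and the kernel is only involved when contention is actually observed. slow_path_wait() and slow_path_wake() are hypothetical stand-ins for the OS blocking/wake primitives:

#include <stdatomic.h>

/* Hypothetical OS hooks, e.g. a futex-style wait/wake; not implemented here. */
extern void slow_path_wait(atomic_int *state, int expected);
extern void slow_path_wake(atomic_int *state);

/* state: 0 = unlocked, 1 = locked, 2 = locked with waiters */
void fast_lock(atomic_int *state)
{
    int expected = 0;
    /* Best case: one atomic compare-and-swap, no system call. */
    if (atomic_compare_exchange_strong(state, &expected, 1))
        return;
    /* Contended case: mark "has waiters" and block in the kernel. */
    while (atomic_exchange(state, 2) != 0)
        slow_path_wait(state, 2);
}

void fast_unlock(atomic_int *state)
{
    /* Best case: nobody is waiting, so no system call is needed. */
    if (atomic_exchange(state, 0) == 2)
        slow_path_wake(state);
}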

Page 26:

Top Ten Performance Optimization Techniques for Multicore
1. Achieve proper load balancing (see the sketch after this list)
2. Improve data locality and reduce false sharing
3. Affinity scheduling if necessary
4. Lock granularity
5. Lock frequency & ordering
6. Remove sync barriers
7. Async vs. sync communication
8. Scheduling
9. Worker thread pool
10. Manage thread count
11. Use parallel libraries (pthreads, OpenMP, etc.)
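A tiny sketch touching items 1, 8, and 11: an OpenMP parallel loop where a dynamic schedule balances unevenly sized work items across cores. The work function, problem size, and chunk size are illustrative assumptions:

#include <omp.h>
#include <stdio.h>

/* Hypothetical per-item workload whose cost varies strongly with i. */
static double process_item(int i)
{
    double acc = 0.0;
    for (int k = 0; k < (i % 1000) * 1000; k++)
        acc += k * 1e-9;
    return acc;
}

int main(void)
{
    const int n = 10000;
    double total = 0.0;

    /* schedule(dynamic, 16): idle threads grab the next 16-item chunk,
     * which keeps the load balanced when item costs differ widely. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+ : total)
    for (int i = 0; i < n; i++)
        total += process_item(i);

    printf("total = %f (max threads = %d)\n", total, omp_get_max_threads());
    return 0;
}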

Page 27:

Threads, cache, naïve and smart

Page 28:

Threads, cache, naïve and smart

Page 29:

Recommendation: Start developing crawl charts

1. DL throughput: 60 Mbps (with MCS=27, DL MIMO)
2. UL throughput: 20 Mbps (with MCS=20 for UL)

[Chart: cycle count by optimization stage: "Out of Box C" 521, "C w/ intrinsics" 412, "Hand assembly" 312, "Full entitlement" 288]

[Chart: total cycles vs. optimization type (out of box, intrinsics, pragmas, partial summation, multi-sampling), y-axis 0 to 20,000 cycles]

Page 30:

Recommendation: Form a Performance Engineering Team

[Diagram: performance engineering workflow linking SoC features and NPIs, feature merge, feature integration, the SoC kernel, and the upstream kernel (as content is upstreamed), with feature content, configuration settings, and repository/branch/patches tracked throughout]