maximize machine learning performance with iar embedded … · 2020-03-02 · an example ml...

Maximize Machine Learning performance with IAR Embedded Workbench Aaron Bauch, IAR Sr. FAE


Post on 14-Jul-2020




Maximize Machine Learning performance with IAR Embedded Workbench

Aaron Bauch, IAR Sr. FAE


Agenda

• Challenges of AI in embedded systems
• Helping the Optimizer
• Challenges with debugging optimized code
• Demo
• Summary


An example ML application

• Let’s look at a Convolutional Neural Network (CNN) application that is based on a CIFAR-10 example from Caffe and uses ReLU activation, pooling, and fully-connected functions.
• This network has:
  • 3 convolution layers
  • Interspersed ReLU activation layers
  • Interspersed max pooling layers
  • Fully-connected layer at the end
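
To make two of these layer types concrete, here is a minimal C sketch of a ReLU activation and a 2x2 max pooling step. It is not the code used in the talk: the function names, the 8-bit signed activation format, and the single-channel row-major layout are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* ReLU activation: clamp negative activations to zero, in place.
 * Assumption: activations are stored as signed 8-bit values. */
void relu_q7(int8_t *data, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (data[i] < 0) {
            data[i] = 0;
        }
    }
}

/* 2x2 max pooling with stride 2 over one channel of a row-major
 * width x height feature map. Assumption: width and height are even. */
void maxpool_2x2(const int8_t *in, int8_t *out, size_t width, size_t height)
{
    size_t out_width = width / 2;

    for (size_t y = 0; y < height / 2; y++) {
        for (size_t x = 0; x < out_width; x++) {
            /* Top-left corner of the 2x2 window in the input map. */
            const int8_t *p = &in[(2 * y) * width + 2 * x];
            int8_t m = p[0];
            if (p[1] > m)         m = p[1];
            if (p[width] > m)     m = p[width];
            if (p[width + 1] > m) m = p[width + 1];
            out[y * out_width + x] = m;
        }
    }
}
```

Both functions are tight loops over small integers, which is why the data-size and optimization choices discussed in the rest of the talk matter for this kind of workload.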


Challenges facing ML and AI applications

Size optimization
• Get the smallest possible code
• Important because IoT devices tend to have small memory footprint
• Allows for more features to be added
• Generally has bad performance

Speed optimization
• Get the fastest possible code
• Nobody likes to wait
• Better battery life
• Good “first impression” with customers
• Can be way too big for selected device

The limits of optimization


Optimization Results

                      No Opt      Low Opt     High Size   High Speed  Mix
Cycles                421096744   405915070   271550851   110605342   102258144
% Speed vs None       100.00%     103.74%     155.07%     380.72%     411.80%
RO Code (Bytes)       4880        4632        3894        4954        4244
RO Data (Bytes)       36388       36388       36998       36998       36364
RO Total (Bytes)      41268       41020       40892       41952       40608
RW Data (Bytes)       96852       96852       96852       96852       96852
% Code Size vs None   100.00%     94.92%      79.80%      101.52%     86.97%
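
The two percentage rows can be reproduced from the raw numbers. The sketch below shows the arithmetic as it appears to be used in the table: speed is the unoptimized cycle count divided by the optimized one, and code size is optimized RO code bytes divided by unoptimized. The function names are hypothetical.

```c
#include <stdint.h>

/* Speed relative to the unoptimized build, in percent:
 * 100 * cycles_none / cycles_opt (fewer cycles gives a higher figure). */
double speed_vs_none(uint32_t cycles_none, uint32_t cycles_opt)
{
    return 100.0 * (double)cycles_none / (double)cycles_opt;
}

/* RO code size relative to the unoptimized build, in percent:
 * 100 * bytes_opt / bytes_none (smaller code gives a lower figure). */
double size_vs_none(uint32_t bytes_none, uint32_t bytes_opt)
{
    return 100.0 * (double)bytes_opt / (double)bytes_none;
}
```

For the High Speed column, `speed_vs_none(421096744u, 110605342u)` comes out at roughly 380.72, and for the High Size column, `size_vs_none(4880u, 3894u)` comes out at roughly 79.80, matching the table.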


A deeper look into optimization

Here are some common compiler optimizations and their effect on code:

Optimization              Speed   Size
Common sub-expressions    ↑       →
Loop Unrolling            ↑       ↑
Function Inlining         ↑       ?
Code Motion               ↑       →
Dead Code Elimination     →       ↓
Static Clustering         ↑       ↓
Instruction Scheduling    ↑       →
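
As an illustration of why loop unrolling trades size for speed, the sketch below writes out by hand what a compiler may do automatically to a dot-product loop. The function names and the unroll factor of 4 are assumptions; a real compiler picks its own factor and may not unroll at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one multiply-accumulate and one loop test per element. */
int32_t dot_rolled(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int32_t)a[i] * b[i];
    }
    return sum;
}

/* Unrolled by 4: one loop test per four elements, at the cost of a
 * larger loop body plus a remainder loop for the last 0..3 elements. */
int32_t dot_unrolled4(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t sum = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        sum += (int32_t)a[i]     * b[i];
        sum += (int32_t)a[i + 1] * b[i + 1];
        sum += (int32_t)a[i + 2] * b[i + 2];
        sum += (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) {
        sum += (int32_t)a[i] * b[i];
    }
    return sum;
}
```

Both functions return the same result; the second form is the kind of code the table marks as Speed ↑ and Size ↑.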


Structuring your code

• To make your code efficient and portable:
  • Isolate “device-dependent” code
  • Decide which parts really need speed, like ML engine
  • Optimize everything else for size

[Diagram: program split into Generic Program Files, Tuned Program Files, and Hardware Device Driver Files. Labels: Com Interface (Speed Optimization), General Code (Size Optimization). Example driver call: SetPort(Port,Pin,Status);]
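
A minimal sketch of the "isolate device-dependent code" idea follows. Only the `SetPort(Port,Pin,Status)` call shape comes from the slide; the simulated register array, `ReadPortRegister`, `SignalInferenceDone`, and the port and pin numbers are hypothetical stand-ins so the example can run on a host machine.

```c
#include <stdint.h>

/* --- Driver file: the only place that knows about the hardware. ---
 * Hypothetical stand-in: a real driver would write to the device's GPIO
 * registers instead of this simulated array. */
#define PORT_COUNT 4u
static uint32_t simulated_port_reg[PORT_COUNT];

void SetPort(unsigned port, unsigned pin, unsigned status)
{
    if (port >= PORT_COUNT || pin >= 32u) {
        return; /* ignore out-of-range requests */
    }
    if (status) {
        simulated_port_reg[port] |= (UINT32_C(1) << pin);
    } else {
        simulated_port_reg[port] &= ~(UINT32_C(1) << pin);
    }
}

/* Test hook for the simulated hardware (hypothetical). */
uint32_t ReadPortRegister(unsigned port)
{
    return (port < PORT_COUNT) ? simulated_port_reg[port] : 0u;
}

/* --- Generic program file: portable, no register access. --- */
void SignalInferenceDone(void)
{
    SetPort(1u, 5u, 1u); /* hypothetical: status LED on port 1, pin 5 */
}
```

With this split, porting to another device means rewriting only the driver file, and each group of files can be built with its own optimization goal.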


Use “correct” data sizes

Using an “unnatural” data size will most likely cost you
• 32-bit core will need to hold 64-bit data in multiple registers
• Anything smaller than the natural size will cause shift/mask/sign-extend operations

Use a “natural” data size unless you have compelling reason not to do so
• Generally one natural unit moves in 1 clock
• Smaller units may take several clocks to extract subfields
• Larger units will take multiple loads/stores


Cost of unnatural data sizes

int test_int(int a, int b)
{
    return (a + b);
}
// Arm Cortex-M (32 bit): 1 Cycle
//   ADDS R0,R1,R0

char test_char(char a, char b)
{
    return (a + b);
}
// Arm Cortex-M (32 bit): 2 Cycles
//   ADDS R0,R0,R1
//   UXTB R0,R0


Signed or unsigned?

Think about signedness!

Signed:
– Negative values possible
– Arithmetic operations always performed
– Operations will never be cheaper, but in many cases more expensive

Unsigned:
– Negative values impossible
– Bit operations are consistent
– Arithmetic operations may be optimized to bit-operations.

C promotes to signed unless directed otherwise
• Use uint32_t etc. for control

Arithmetic operators: + - * / %
Bit operators: << >> | & ^ ~
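
One common case where signedness matters is division by a power of two. The sketch below is an illustration with hypothetical function names, not code from the talk. For unsigned operands the compiler can typically replace the division and remainder with a shift and a mask; for signed operands C requires truncation toward zero, so negative inputs generally need extra fix-up instructions.

```c
#include <stdint.h>

/* Unsigned: x / 8 has exactly the same result as x >> 3. */
uint32_t div8_unsigned(uint32_t x)
{
    return x / 8u;
}

/* Unsigned: x % 8 has exactly the same result as x & 7. */
uint32_t mod8_unsigned(uint32_t x)
{
    return x % 8u;
}

/* Signed: division truncates toward zero (for example, -9 / 8 is -1),
 * so a plain shift is not sufficient for negative values. */
int32_t div8_signed(int32_t x)
{
    return x / 8;
}
```

If a value can never be negative, declaring it with an unsigned type such as `uint32_t` gives the compiler the freedom to use the cheaper form.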


Don’t write “clever” code

• Some developers believe that making fewer source lines of C makes tighter code
  – Code becomes difficult to understand
  – Huge “maintenance debt”

• Write code in a clear, logical manner
  – Helps the compiler understand code better
  – Better optimization from compiler


Don’t write “clever” code

In this example, the “clever” code produces more code because it requires a temporary variable to hold the 1 or 0 that is added to the variable str.
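
The slide's own example did not survive in this transcript. The pair below is a hypothetical illustration of the same pattern, not a reconstruction of the original: the function names, parameters, and the bit test are all assumptions. Whether the "clever" form really produces more instructions depends on the compiler and target.

```c
#include <stdint.h>

/* "Clever": the comparison result (0 or 1) is produced as a value
 * and then added to str. */
uint32_t update_clever(uint32_t str, uint32_t flags, uint32_t mask)
{
    str += ((flags & mask) != 0u);
    return str;
}

/* Clear: test the condition, then increment. The intent is explicit
 * and no intermediate 0/1 value is needed. */
uint32_t update_clear(uint32_t str, uint32_t flags, uint32_t mask)
{
    if ((flags & mask) != 0u) {
        str++;
    }
    return str;
}
```

Both functions return the same value for every input, so the clearer form costs nothing in behavior.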


Compiler influence

When you crank up the optimization, whole sections of code can be relocated
• Looks like code has “disappeared”
• Cannot set breakpoints

This can make debugging challenging
Many developers debug at “low” or “no” optimization


Difficulty in debugging

• Things you can try to make debugging easier:
  • Lower optimization level
    • Per file
    • Per function
  • “Bite the bullet”
    • Try to follow how the program responds instead of where it goes
    • Monitor key variables in the code to see how they change
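
Lowering the optimization level for a single function might look like the sketch below. It assumes the IAR `#pragma optimize` directive with the `none` level, placed directly before the function it applies to; check the compiler reference for the exact syntax and restrictions in your version. The `__ICCARM__` guard and the function itself are illustrative assumptions, so that other compilers simply skip the pragma.

```c
#include <stdint.h>

/* Assumption: __ICCARM__ is defined only when building with the IAR
 * C/C++ Compiler for Arm; other compilers never see the pragma. */
#ifdef __ICCARM__
#pragma optimize=none
#endif
int32_t accumulate_debuggable(const int16_t *samples, int32_t count)
{
    int32_t sum = 0;

    /* With optimization off for this function, each source line keeps
     * its own code, so breakpoints and single-stepping behave as written. */
    for (int32_t i = 0; i < count; i++) {
        sum += samples[i];
    }
    return sum;
}
```

The rest of the file keeps its project-level optimization settings, so only the function under investigation pays the size and speed penalty.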


Does optimization cause bugs?

Exposes existing bugs in your code
– Very particular about C semantics
– Reduces redundancy in code
– Code was incorrect to begin with
– Only correct code is optimized correctly

Optimization is necessary
– Otherwise, your code may not fit in device or perform well
– Write maintainable, non-tuned code
– Trust compiler to optimize
– Without optimization, compiler handicapped
– Optimized code is tested more
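
A classic example of an existing bug that only shows up under optimization is a flag shared with an interrupt handler that is not declared `volatile`. The sketch below shows the corrected form; the handler name, the flag, and the bounded wait are hypothetical and not taken from the talk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared between an interrupt handler and the main loop. Without
 * 'volatile', an optimizing compiler is allowed to read the flag once
 * and keep it in a register, so the loop below might never see the
 * update. At low optimization the bug usually stays hidden. */
static volatile bool data_ready = false;

/* Hypothetical interrupt handler: on a real target the hardware calls it. */
void Sample_IRQHandler(void)
{
    data_ready = true;
}

/* Poll the flag up to max_spins times. Returns true and clears the flag
 * if data arrived, false if the wait ran out. */
bool wait_for_data(uint32_t max_spins)
{
    while (max_spins-- > 0u) {
        if (data_ready) {
            data_ready = false;
            return true;
        }
    }
    return false;
}
```

The optimizer did not introduce the bug in the non-volatile version; the code was incorrect by C semantics from the start, which is the point the slide makes.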


Debugging Optimized Code


Breakpoints

Breakpoints can be made to do some interesting things:
– Complex/conditional breakpoints can break execution only when a certain condition is satisfied
– Log breakpoints can be used in place of printf( ) statements to instrument your code



Watchpoints

Data watchpoints are very useful to help you find:
– Code that has gone astray and starts clobbering data
– When data is being unexpectedly altered
– When data values exceed a threshold
– Which pieces of code use a data area
– When you are about to overflow your stack
– When you are about to overflow an array/pointer


ITM Events

• Can be used to get data out of the MCU, such as:
  – Status flags
  – Values of variables
  – Stack pointer value
• Can also help profile your code:

    ITM_EVENT8(1, 0);
    I2C1_Init();
    ITM_EVENT32(2, __get_SP());


Interrupt trace

• This can help you visualize how long your interrupts are taking to process data
• You can also see how deeply you are going into the NVIC
• Can help you figure out how to shorten ISR response times
• Can see how often an ISR is triggering


Demonstration


Summary

• ML/AI in embedded applications is tightly constrained and performance intensive
• Best-in-class optimization makes it possible
• Structuring your code effectively can help
• We provide the tools to make you successful