maximize machine learning performance with iar embedded … · 2020-03-02 · an example ml...
Maximize Machine Learning performance with IAR Embedded Workbench
Aaron Bauch, IAR Sr. FAE
Agenda
• Challenges of AI in embedded systems
• Helping the Optimizer
• Challenges with debugging optimized code
• Demo
• Summary
An example ML application
• Let’s look at a Convolutional Neural Network (CNN) application that is based on a CIFAR-10 example from Caffe and uses ReLU activation, pooling, and fully-connected functions.
• This network has:
  • 3 convolution layers
  • Interspersed ReLU activation layers
  • Interspersed max pooling layers
  • Fully-connected layer at the end
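For readers who have not seen these layer types in C, here is a minimal sketch of two of them: ReLU activation and 2x2 max pooling on 8-bit data. The function names, the 8-bit data type, and the 2x2 window are assumptions for illustration; they are not taken from the example application itself.

```c
#include <stddef.h>
#include <stdint.h>

/* ReLU activation, in place: negative activations become zero. */
void relu_i8(int8_t *data, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (data[i] < 0) {
            data[i] = 0;
        }
    }
}

static int8_t max_i8(int8_t a, int8_t b)
{
    return (a > b) ? a : b;
}

/* 2x2 max pooling with stride 2 over one row-major channel.
   Assumption: width and height are both even. */
void maxpool2x2_i8(const int8_t *in, int8_t *out, size_t width, size_t height)
{
    for (size_t y = 0; y < height; y += 2) {
        for (size_t x = 0; x < width; x += 2) {
            int8_t top = max_i8(in[y * width + x], in[y * width + x + 1]);
            int8_t bottom = max_i8(in[(y + 1) * width + x],
                                   in[(y + 1) * width + x + 1]);
            out[(y / 2) * (width / 2) + (x / 2)] = max_i8(top, bottom);
        }
    }
}
```

Both loops touch every element of the feature map, which is why layers like these dominate the cycle count and are natural candidates for speed optimization.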
Challenges facing ML and AI applications
Size optimization
• Get the smallest possible code
• Important because IoT devices tend to have a small memory footprint
• Allows for more features to be added
• Generally has bad performance

Speed optimization
• Get the fastest possible code
• Nobody likes to wait
• Better battery life
• Good “first impression” with customers
• Can be way too big for selected device
The limits of optimization
Optimization Results
                      No Opt      Low Opt     High Size   High Speed  Mix
Cycles                421096744   405915070   271550851   110605342   102258144
% Speed vs None       100.00%     103.74%     155.07%     380.72%     411.80%
RO Code (Bytes)       4880        4632        3894        4954        4244
RO Data (Bytes)       36388       36388       36998       36998       36364
RO Total (Bytes)      41268       41020       40892       41952       40608
RW Data (Bytes)       96852       96852       96852       96852       96852
% Code Size vs None   100.00%     94.92%      79.80%      101.52%     86.97%
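The two percentage rows run in opposite directions, which is easy to miss. The speed figure is the unoptimized cycle count divided by the build's cycle count, so higher is better. The code size figure is the build's RO code divided by the unoptimized RO code, so lower is better. A small sketch that reproduces the table values (the function names are made up for this illustration):

```c
#include <stdint.h>

/* Speed relative to the unoptimized build, in hundredths of a percent,
   rounded to nearest: 100% * cycles_none / cycles_build. */
uint32_t speed_vs_none_x100(uint64_t cycles_none, uint64_t cycles_build)
{
    return (uint32_t)((cycles_none * 10000u + cycles_build / 2u) / cycles_build);
}

/* Code size relative to the unoptimized build, in hundredths of a percent,
   rounded to nearest: 100% * size_build / size_none. */
uint32_t size_vs_none_x100(uint64_t size_none, uint64_t size_build)
{
    return (uint32_t)((size_build * 10000u + size_none / 2u) / size_none);
}
```

For the High Speed column, speed_vs_none_x100(421096744, 110605342) gives 38072, that is 380.72%, and size_vs_none_x100(4880, 4954) gives 10152, that is 101.52%.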
A deeper look into optimization
Here are some common compiler optimizations and their effect on code:

Optimization              Speed   Size
Common sub-expressions    ↑       →
Loop Unrolling            ↑       ↑
Function Inlining         ↑       ?
Code Motion               ↑       →
Dead Code Elimination     →       ↓
Static Clustering         ↑       ↓
Instruction Scheduling    ↑       →
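To make the "Speed ↑ Size ↑" trade-off of loop unrolling concrete, here is a hand-written sketch of roughly what an unroller does to a dot product. This is an illustration only: the function names and the unroll factor of 4 are assumptions, and in practice the transformation is best left to the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled: one multiply-accumulate per loop test and branch. */
int32_t dot_rolled(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}

/* Unrolled by 4: four multiply-accumulates per loop test and branch,
   plus a short tail loop for the leftover elements. More code, less
   loop overhead. */
int32_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += (int32_t)a[i] * b[i];
        acc += (int32_t)a[i + 1] * b[i + 1];
        acc += (int32_t)a[i + 2] * b[i + 2];
        acc += (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```

Both functions return the same result; the second simply spends fewer cycles on loop bookkeeping at the cost of a larger body.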
Structuring your code
• To make your code efficient and portable:
  • Isolate “device-dependent” code
  • Decide which parts really need speed, like ML engine
  • Optimize everything else for size

[Diagram labels: SetPort(Port,Pin,Status); · ComInterface, Speed Optimization · General Code, Size Optimization · Hardware, Device Driver Files · Generic Program Files · Tuned Program Files]
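One way to isolate device-dependent code is to let the generic program files see only a small interface such as SetPort, and keep every register access inside the driver file. The sketch below is a host-side stand-in under stated assumptions: the parameter types of SetPort, the LED constants, and the fake latch array are all hypothetical, since the slide only shows the call SetPort(Port,Pin,Status).

```c
#include <stdbool.h>
#include <stdint.h>

/* port.h: the only interface the generic code sees.
   Parameter types are an assumption. */
void SetPort(uint8_t Port, uint8_t Pin, bool Status);

/* app.c: generic program file, no register names anywhere. */
#define LED_PORT 0u
#define LED_PIN  5u

void led_on(void)  { SetPort(LED_PORT, LED_PIN, true); }
void led_off(void) { SetPort(LED_PORT, LED_PIN, false); }

/* port_host.c: stand-in driver for this sketch. A real driver file
   would write to the device's GPIO registers here instead. */
#define PORT_COUNT 4u
static uint32_t fake_port_latch[PORT_COUNT];

void SetPort(uint8_t Port, uint8_t Pin, bool Status)
{
    if (Port >= PORT_COUNT || Pin >= 32u) {
        return; /* ignore out-of-range requests */
    }
    if (Status) {
        fake_port_latch[Port] |= ((uint32_t)1 << Pin);
    } else {
        fake_port_latch[Port] &= ~((uint32_t)1 << Pin);
    }
}

/* Test hook for the stand-in driver only. */
uint32_t port_latch(uint8_t Port)
{
    return (Port < PORT_COUNT) ? fake_port_latch[Port] : 0u;
}
```

With this split, porting to a new device means rewriting only the driver file, and the generic files can be built for size while the tuned files are built for speed.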
Use “correct” data sizes
Using an “unnatural” data size will most likely cost you
• 32-bit core will need to hold 64-bit data in multiple registers
• Anything smaller than the natural size will cause shift/mask/sign-extend operations
Use a “natural” data size unless you have a compelling reason not to do so
• Generally one natural unit moves in 1 clock
• Smaller units may take several clocks to extract subfields
• Larger units will take multiple loads/stores
Cost of unnatural data sizes
int test_int(int a, int b)
{
    return (a + b);
}
// Arm Cortex-M (32 bit): 1 cycle
//   ADDS R0,R1,R0

char test_char(char a, char b)
{
    return (a + b);
}
// Arm Cortex-M (32 bit): 2 cycles
//   ADDS R0,R0,R1
//   UXTB R0,R0
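The same cost can show up in loop counters. In the sketch below, the first version forces the counter to behave like an 8-bit value, which may cost extra extend instructions on a 32-bit core. The second uses uint_fast8_t from <stdint.h>, which asks the toolchain for the fastest type with at least 8 bits. Which width that turns out to be is implementation-defined, and the function names are made up for this illustration.

```c
#include <stdint.h>

/* Counter is exactly 8 bits wide: the compiler must preserve
   wrap-around at 255, which may need extra instructions. */
uint32_t sum_narrow(const uint8_t *buf, uint8_t len)
{
    uint32_t total = 0;
    for (uint8_t i = 0; i < len; i++) {
        total += buf[i];
    }
    return total;
}

/* Counter is "at least 8 bits, whatever is fastest": the toolchain
   is free to pick the natural register width. */
uint32_t sum_natural(const uint8_t *buf, uint_fast8_t len)
{
    uint32_t total = 0;
    for (uint_fast8_t i = 0; i < len; i++) {
        total += buf[i];
    }
    return total;
}
```

Both functions return the same sum. Whether the generated code differs depends on the compiler and optimization level, so check the disassembly before assuming a gain.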
Signed or unsigned?
Think about signedness!

Signed:
– Negative values possible
– Arithmetic operations always performed
– Operations will never be cheaper, but in many cases more expensive

Unsigned:
– Negative values impossible
– Bit operations are consistent
– Arithmetic operations may be optimized to bit-operations.

C promotes to signed unless directed otherwise
• Use uint32_t etc. for control

Arithmetic operators: + - * / %
Bit operators: << >> | & ^ ~
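A short sketch of both points, with hypothetical function names. For unsigned operands, division and remainder by a power of two are equivalent to a shift and a mask. For signed operands, C division rounds toward zero, so a plain arithmetic shift is not enough for negative values and the compiler has to add a correction. The last two functions show integer promotion: two uint8_t operands are promoted to int before the addition.

```c
#include <stdint.h>

/* Unsigned: the compiler can emit x >> 3 and x & 7 directly. */
uint32_t div8_unsigned(uint32_t x) { return x / 8u; }
uint32_t mod8_unsigned(uint32_t x) { return x % 8u; }

/* Signed: the result rounds toward zero, so negative inputs need
   a fix-up around the shift. */
int32_t div8_signed(int32_t x) { return x / 8; }

/* Promotion: a + b is computed as int, so 200 + 100 is 300 here... */
int sum_promoted(uint8_t a, uint8_t b) { return a + b; }

/* ...and only wraps to 8 bits when converted back explicitly. */
uint8_t sum_truncated(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }
```

For example, div8_unsigned(29) is 3 and mod8_unsigned(29) is 5, the same as 29 >> 3 and 29 & 7, while div8_signed(-9) is -1.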
Don’t write “clever” code
• Some developers believe that making fewer source lines of C makes tighter code
– Code becomes difficult to understand
– Huge “maintenance debt”
• Write code in a clear, logical manner
– Helps the compiler understand code better
– Better optimization from compiler
Don’t write “clever” code
In this example, the “clever” code produces more code because it needs a temporary variable to hold the 1 or 0 that is added to the variable str.
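The slide's code did not survive in this transcript. The sketch below shows the kind of pattern being described: a comparison result (1 or 0) added directly to a pointer named str, next to the plain version. The helper count_digits and both wrapper names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: counts the leading decimal digits in str. */
static size_t count_digits(const char *str)
{
    size_t n = 0;
    while (str[n] >= '0' && str[n] <= '9') {
        n++;
    }
    return n;
}

/* "Clever": the result of the comparison (1 or 0) is added to str,
   which needs somewhere to hold that 1 or 0. */
size_t digits_clever(const char *str)
{
    return count_digits(str + (*str == '+'));
}

/* Clear: test, then step past the sign if there is one. */
size_t digits_clear(const char *str)
{
    if (*str == '+') {
        str++;
    }
    return count_digits(str);
}
```

Both versions behave the same. The second states the intent directly, which is easier for a reader and gives the compiler a simple conditional increment to work with.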
Compiler influence
When you crank up the optimization, whole sections of code can be relocated
• Looks like code has “disappeared”
• Cannot set breakpoints
This can make debugging challenging
Many developers debug at “low” or “no” optimization
Difficulty in debugging
• Things you can try to make debugging easier:
  • Lower optimization level
    • Per file
    • Per function
  • “Bite the bullet”
    • Try to follow how the program responds instead of where it goes
    • Monitor key variables in the code to see how they change
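Lowering optimization for a single function might look like the sketch below. It assumes the IAR `#pragma optimize` directive with the level `none`, placed immediately before the function; treat the exact syntax as an assumption and check it against the compiler documentation for your version. The guard on `__ICCARM__` keeps the file buildable with other compilers, and the function itself is a made-up example.

```c
#include <stdint.h>

/* Assumption: IAR-specific directive, applies to the next function only. */
#ifdef __ICCARM__
#pragma optimize=none
#endif
uint32_t checksum_under_debug(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        sum += data[i]; /* easy to step through and watch when unoptimized */
    }
    return sum;
}
```

The rest of the file keeps its project-level optimization, so only the function under investigation pays the size and speed penalty.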
Does optimization cause bugs?
Exposes existing bugs in your code
– Very particular about C semantics
– Reduces redundancy in code
– Code was incorrect to begin with
– Only correct code is optimized correctly

Optimization is necessary
– Otherwise, your code may not fit in device or perform well
– Write maintainable, non-tuned code
– Trust compiler to optimize
– Without optimization, compiler handicapped
– Optimized code is tested more
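A classic case of "code was incorrect to begin with" is a flag shared between an interrupt handler and the main loop. Without volatile, the optimizer is allowed to read the flag once and never again, so a wait loop that worked at low optimization can hang at high optimization. The sketch below shows the correct form; the ISR and function names are hypothetical.

```c
#include <stdbool.h>

/* volatile tells the compiler the value can change outside the normal
   program flow, so every read in the source is a real read. */
static volatile bool data_ready = false;

/* Hypothetical interrupt handler: sets the flag. */
void uart_rx_isr(void)
{
    data_ready = true;
}

/* Non-blocking check: returns true once per event. */
bool poll_data_ready(void)
{
    if (data_ready) {
        data_ready = false;
        return true;
    }
    return false;
}

/* Blocking wait: correct only because data_ready is volatile. */
void wait_for_data(void)
{
    while (!data_ready) {
        /* spin until the ISR sets the flag */
    }
    data_ready = false;
}
```

Dropping the volatile qualifier would not make the optimizer wrong. It would make the code wrong, and optimization would merely expose it.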
Debugging Optimized Code
Breakpoints
Breakpoints can be made to do some interesting things:
– Complex/conditional breakpoints can break execution only when a certain condition is satisfied
– Log breakpoints can be used in place of printf( ) statements to instrument your code
Watchpoints
Data watchpoints are very useful to help you find:
– Code that has gone astray and starts clobbering data
– When data is being unexpectedly altered
– When data values exceed a threshold
– Which pieces of code use a data area
– When you are about to overflow your stack
– When you are about to overflow an array/pointer
ITM Events
• Can be used to get data out of the MCU, such as:
  – Status flags
  – Values of variables
  – Stack pointer value
• Can also help profile your code:
  ITM_EVENT8(1, 0);
  I2C1_Init();
  ITM_EVENT32(2, __get_SP());
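One way to use such events for profiling is to emit a marker on entry to and exit from the code of interest, then read the time between the two markers on the debugger's timeline. The sketch below is host-side only: ITM_EVENT8 is replaced by a stand-in that records events in an array, and I2C1_Init is an empty placeholder. On target, the real macros would come from the toolchain (in IAR installations this is the arm_itm.h header, as far as I know; treat that as an assumption), and the channel number and marker values here are arbitrary choices.

```c
#include <stdint.h>

#ifndef ITM_EVENT8
/* Host stand-in: record events instead of sending them to the ITM. */
typedef struct {
    uint8_t  channel;
    uint32_t value;
} itm_event_t;

static itm_event_t itm_log[8];
static unsigned    itm_count;

static void itm_record(uint8_t channel, uint32_t value)
{
    if (itm_count < 8u) {
        itm_log[itm_count].channel = channel;
        itm_log[itm_count].value = value;
        itm_count++;
    }
}

#define ITM_EVENT8(channel, value) itm_record((uint8_t)(channel), (uint8_t)(value))
#endif

/* Placeholder for the peripheral init routine named on the slide. */
static void I2C1_Init(void)
{
}

void init_with_markers(void)
{
    ITM_EVENT8(1, 1); /* marker: entering I2C1_Init */
    I2C1_Init();
    ITM_EVENT8(1, 0); /* marker: leaving I2C1_Init */
}
```

Because each event carries a timestamp in the trace, the gap between the two markers on channel 1 approximates the time spent in I2C1_Init without halting the core.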
Interrupt trace
• This can help you visualize how long your interrupts are taking to process data
• You can also see how deeply you are going into the NVIC
• Can help you figure out how to shorten ISR response times
• Can see how often an ISR is triggering
Demonstration
Summary
• ML/AI in embedded applications is tightly constrained and performance intensive
• Best-in-class optimization makes it possible
• Structuring your code effectively can help
• We provide the tools to make you successful