Lab-1: Profiling/Optimizing Video Decoder Using ADS
National Chiao Tung University
Chun-Jen Tsai
3/3/2011
2/22
Profiling the MPEG-4 SP (Simple Profile) Decoder
Goal: profile and optimize the MPEG-4 video decoder, m4v_dec
Tasks:
- Profile the video decoder under ADS
- Analyze the results and identify the hotspots
- Optimize the decoder based on the hotspot analysis
- Redraw the pie charts after your optimization
Please also write a report (two-column, no cover sheet, at most 4 pages) summarizing your analysis and optimization of the system model.
3/22
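Embedded Software Design Flow
Take ARM-based systems as an example:
[Figure: C/C++ sources (*.c/.cpp) go through the C compiler and assembly sources (*.s) through the assembler, producing ELF object files (*.o). The librarian packs object files into object libraries, and the linker combines object files with the C libraries and object libraries into an *.axf image. The image is debugged with axd on ARMulator (with system models) or on a development board.]
All the tools are integrated in the IDE: ARM Developer Suite (ADS).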
4/22
Generations of ARM Toolchains
- ARM SDT (Software Development Toolkit): final version 2.5, 1998
- ARM ADS (ARM Developer Suite): final version 1.2, 2000; still popular in the industry
- ARM RVDS (RealView Development Suite): latest version 4.0, with emphasis on an Electronic System Level (ESL) design environment
5/22
Cross-Platform Development
[Figure: the host computer, running the ADS IDE environment†, connects to the development board through a serial, Ethernet, or JTAG cable.]
†You can obtain a 45-day full-function ADS 1.2 trial CD image from the TA. A tutorial can be downloaded from http://www.arm.com/support/tutorials/16213.html
6/22
Verification of Your Software
During the development cycle, you often have to debug your software. For embedded firmware development, the process involves three components:
- Debugger (runs on the host computer): axd
- Debug agent: the interface between the debugger and your code
- Target platform: the platform (simulated or emulated) that executes your code
7/22
Debug Agent
A debug agent performs the actions requested by the debugger, for example:
- setting breakpoints
- reading from memory
- writing to memory
The debug agent is neither the program being debugged nor the debugger itself.
Examples: Angel, JTAG circuits
8/22
Target Platform
The target platform can be real hardware or a simulator.
If a simulator platform is used, its core component is an instruction set simulator (ISS). For example, ARMulator is the well-known simulator in ARM ADS.
ARMulator also doubles as a platform simulator, but it is not as powerful as simulators from other vendors (such as CoWare).
9/22
ARM Debugging Setup†
[Figure: ARM debugging setup, with components labeled "Runs on PC Host" and "Probably runs on the same PC Host".]
†AXD and armsd Debuggers Guide, page 1-6.
10/22
ADS Workspace
[Screenshot: ADS workspace with the project window, source window, and build messages.]
11/22
AXD Desktop
[Screenshot: AXD desktop with the disassembly window, source window, console, system view, and system output.]
12/22
Profiling and CPU Cycle Analysis
Profiling and CPU cycle analysis are two different approaches to analyzing your software:
- Profiling gives you a per-function complexity analysis
- CPU cycle analysis gives you more insight into the software regarding computation vs. memory accesses
Under ADS, you use AXD to do both:
- For profiling, ADS generates profiling data, and the command-line tool armprof is used to analyze the data
- For cycle analysis, you must display the ARMulator internal statistics counters in an ADS window
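The following minimal sketch makes the computation-vs-memory view concrete. It assumes you have copied the four ARMulator cycle counters (S, N, I and C cycles) out of AXD by hand. The struct and function names are hypothetical and are not part of ADS. The sketch counts sequential and non-sequential cycles as memory cycles, and internal and coprocessor cycles as non-memory cycles:

```c
/* Hypothetical container for ARMulator statistics counters,
 * transcribed by hand from the AXD window (not an ADS type). */
struct cycle_counts {
    unsigned long long s;  /* sequential memory cycles */
    unsigned long long n;  /* non-sequential memory cycles */
    unsigned long long i;  /* internal cycles: no memory transfer */
    unsigned long long c;  /* coprocessor register transfer cycles */
};

/* Returns the fraction of all cycles that are memory (S + N) cycles,
 * or 0.0 when no cycles were recorded. */
double memory_cycle_fraction(const struct cycle_counts *cc)
{
    unsigned long long mem = cc->s + cc->n;
    unsigned long long total = mem + cc->i + cc->c;
    return total ? (double) mem / (double) total : 0.0;
}
```

For example, the counts {600, 200, 150, 50} give a memory-cycle fraction of 0.8. These are placeholder values, not measurements.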
13/22
CPU Cycle Types of ARM
- Sequential (S cycle): the ARM core requests a transfer to or from an address that is either the same as, or one word or one halfword greater than, the preceding address.
- Non-sequential (N cycle): the ARM core requests a transfer to or from an address that is unrelated to the address used in the preceding cycle.
- Internal (I cycle): the ARM core does not require a transfer, as it is performing an internal function, and no useful prefetching can be performed at the same time.
- Coprocessor register transfer (C cycle): the ARM core wishes to use the data bus to communicate with a coprocessor, but does not require any action by the memory system.
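A simplified model can illustrate the S/N distinction above. It classifies an address trace by comparing each address with the preceding one. This is a sketch under stated assumptions: the function names are hypothetical, the first access is counted as N, and I and C cycles are ignored. ARMulator's own counters remain the authoritative source of cycle types.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified S-cycle test: same address, or one halfword (2) or
 * one word (4) greater than the preceding address. */
static int is_sequential(uint32_t prev, uint32_t cur)
{
    return cur == prev || cur == prev + 2u || cur == prev + 4u;
}

/* Counts S and N accesses in an address trace of length len.
 * The first access has no predecessor and is counted as N. */
void count_s_n(const uint32_t *trace, size_t len,
               size_t *s_count, size_t *n_count)
{
    size_t s = 0, n = 0;
    for (size_t k = 0; k < len; k++) {
        if (k > 0 && is_sequential(trace[k - 1], trace[k]))
            s++;
        else
            n++;
    }
    *s_count = s;
    *n_count = n;
}
```

For the trace 0x100, 0x104, 0x108, 0x200, 0x202, the sketch reports three S accesses and two N accesses.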
14/22
The System Model Used in Labs
For labs, we will use m4v_dec, an MPEG-4 video decoder†, as the system model. It contains 28 files and 5212 lines of C code.
Differences between m4v_dec and xvid 0.9:
- Simpler API
- Support for Simple Profile combined mode with resync markers
- Decoder-only library
- Pure C implementation (thus, it can be used as a system model)
†m4v_dec is based on version 0.9 of the GNU MPEG-4 codec project, xvid (see http://www.xvid.org for the latest xvid source).
15/22
About the Source Package
There are two project files in the project directory, “m4v_dec”:
- “m4v_dec.mcp” is the project workspace file for ADS; double-click it and ADS will bring up the development IDE
- “Makefile” is the makefile for the eCos/gcc toolchain
In the “tools” directory, there is a Win32 program, vidview.exe, for playing the decoded video (output.yuv).
In the “bitstream” directory, there is a sample compressed video bitstream, foreman_150.m4v.
16/22
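Video Decoder Block Diagram
The functional block diagram of m4v_dec:
[Figure: the video bitstream enters VLD. VLD outputs DCT coefficient data, which pass through DC/AC⁻¹, Q⁻¹ and IDCT, and it also outputs the macroblock mode and motion vectors. If MC is used (Y), MC with bilinear interpolation forms a prediction from the reference image, and this prediction is added (+) to the IDCT output. Otherwise (N), the IDCT output is used directly. The result is the decoded image, which goes to output (display).]
VLD: variable length decoding; DC/AC⁻¹: inverse DC/AC prediction; Q⁻¹: inverse quantization; IDCT: inverse transform; MC: motion compensation; Bilinear: half-pel interpolation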
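As an illustration of the Bilinear block, the following sketch performs horizontal half-pel interpolation on one 8x8 block. It uses the MPEG-4 rounding rule (a + b + 1 - rounding) >> 1, where rounding is the rounding-control bit. This is a hypothetical stand-alone function, not the code in bilinear8x8.c. Vertical half-pel positions follow the same pattern. The diagonal case averages four pixels as (a + b + c + d + 2 - rounding) >> 2.

```c
#include <stdint.h>

/* Horizontal half-pel bilinear interpolation of one 8x8 block (sketch).
 * src: top-left reference pixel; dst: output block; both use the same
 * row pitch "stride". rounding: MPEG-4 rounding-control bit (0 or 1).
 * Each row of src must have at least 9 readable pixels. */
void interp8x8_halfpel_h(uint8_t *dst, const uint8_t *src,
                         int stride, int rounding)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int a = src[y * stride + x];
            int b = src[y * stride + x + 1];   /* right neighbour */
            dst[y * stride + x] = (uint8_t) ((a + b + 1 - rounding) >> 1);
        }
    }
}
```

With rounding = 0, a pixel pair (3, 4) interpolates to 4. With rounding = 1, it interpolates to 3. Alternating the rounding bit between frames is how MPEG-4 limits drift from always rounding up.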
17/22
About Optimization
A sample profiling result is as follows:
[Figure: pie chart of decoder CPU time by module: IDCT, inverse quantization, interpolation, boundary extension, color conversion, motion compensation, DC/AC prediction, VLC decoding.]
Obviously, for optimization you want to start with the idct() function in idct.c.
18/22
Main Decoder Modules
You may want to take a deeper look at the following files:
- bilinear8x8.c (interpolation)
- idct.c (inverse DCT)
- mbcoding.c (VLC decoding)
- quant_h263.c (inverse quantization)
- mem_transfer.c (motion compensation)
- mbprediction.c (DC/AC prediction)
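To help with orientation in quant_h263.c, the following sketch shows the inverse quantization rule of the H.263 method in MPEG-4 for a single AC coefficient. It is an illustration with a hypothetical function name, not the code in quant_h263.c. Intra DC coefficients use a separate DC scaler and are not covered.

```c
#include <stdint.h>

/* H.263-style inverse quantization of one AC coefficient (sketch):
 *   |F| = quant * (2*|level| + 1)       if quant is odd
 *   |F| = quant * (2*|level| + 1) - 1   if quant is even
 * The sign of level is kept, level 0 maps to 0, and the result is
 * saturated to [-2048, 2047]. */
int16_t dequant_h263_ac(int16_t level, int quant)
{
    if (level == 0)
        return 0;
    int mag = level < 0 ? -level : level;
    int value = quant * (2 * mag + 1) - ((quant & 1) ? 0 : 1);
    if (level < 0)
        value = -value;
    if (value > 2047)
        value = 2047;
    if (value < -2048)
        value = -2048;
    return (int16_t) value;
}
```

The odd/even branch costs little, but it runs once per nonzero coefficient. In a profile, this per-coefficient work is where inverse quantization time goes.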
19/22
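Hint: Removing Floating Point
In general, floating-point operations can be removed by switching to fixed-point integer arithmetic. For example, the floating-point code
    #include <math.h>
    int main(int argc, char **argv) {
        double a = 3.14159, b = 1.41421;
        int c = (int) floor(a + b + 0.5);   /* round to nearest: c = 5 */
        return 0;
    }
can be rewritten with all values scaled by 2048 (11 fractional bits):
    int main(int argc, char **argv) {
        int a = 6434;                   /* 2048 * 3.14159 */
        int b = 2896;                   /* 2048 * 1.41421 */
        int c = (a + b + 1024) >> 11;   /* +1024 adds 0.5; >>11 divides by 2048: c = 5 */
        return 0;
    }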
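The same technique applies to multiplication by constants, which is the dominant pattern in transform code such as an IDCT. Here is a minimal sketch, assuming Q11 constants as in the hint. The helper name fix_mul_q11 and the COS_PI_4_Q11 constant are hypothetical and are not taken from idct.c.

```c
#include <stdint.h>

#define Q11_SHIFT 11
#define Q11_HALF  (1 << (Q11_SHIFT - 1))   /* 0.5 in Q11 = 1024 */

/* cos(pi/4) = 0.70711 in Q11: round(2048 * 0.70711) = 1448 */
#define COS_PI_4_Q11 1448

/* Multiplies integer x by a Q11 constant and rounds to the nearest
 * integer (ties toward +infinity). The product is formed in 64 bits
 * to avoid overflow for large x. Right-shifting a negative value is
 * implementation-defined in C; ARM compilers and gcc use an
 * arithmetic shift, which this code relies on. */
int32_t fix_mul_q11(int32_t x, int32_t c_q11)
{
    int64_t p = (int64_t) x * c_q11;
    return (int32_t) ((p + Q11_HALF) >> Q11_SHIFT);
}
```

For example, fix_mul_q11(100, COS_PI_4_Q11) returns 71, which matches 100 × 0.70711 ≈ 70.71 rounded to the nearest integer. The 64-bit intermediate is a conservative choice. If the operand range is known to fit, a 32-bit product is cheaper on ARM.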
20/22
After Removing FP Operations
An example of an optimized result:
[Figure: pie chart of decoder CPU time by module after removing FP operations: IDCT, inverse quantization, interpolation, boundary extension, color conversion, motion compensation, DC/AC prediction, VLC decoding.]
21/22
Necessary Charts in Your Report
In your report, you shall provide the following information:
- A pie chart that shows the major CPU load distribution among functions
- For the top-10 functions that consume the most CPU time, a pie chart that shows the distribution of memory cycles
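A sketch like the following can turn per-function numbers into pie-chart shares. It assumes you have already transcribed each function's cycle count (from the profiler or the ARMulator counters) into an array of a hypothetical func_prof type. It sorts the entries in descending order and computes each entry's percentage of the total, so the first ten entries are the top-10 functions. To build the second chart, fill the array with memory-cycle counts for those ten functions and call the routine with n = 10.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-function profile entry. */
struct func_prof {
    const char *name;
    unsigned long long cycles;
    double percent;              /* filled in by compute_shares() */
};

/* qsort comparator: larger cycle counts first. */
static int by_cycles_desc(const void *pa, const void *pb)
{
    const struct func_prof *a = pa, *b = pb;
    if (a->cycles != b->cycles)
        return a->cycles < b->cycles ? 1 : -1;
    return 0;
}

/* Sorts n entries by cycle count (largest first) and stores each
 * entry's share of the total, in percent. */
void compute_shares(struct func_prof *f, size_t n)
{
    unsigned long long total = 0;
    for (size_t k = 0; k < n; k++)
        total += f[k].cycles;
    qsort(f, n, sizeof f[0], by_cycles_desc);
    for (size_t k = 0; k < n; k++)
        f[k].percent = total ? 100.0 * (double) f[k].cycles / (double) total
                             : 0.0;
}
```

The percent values can then be pasted into any charting tool to draw the pie charts.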
22/22
References for This Lab
You can find some PDF e-books related to this lab in the document folder of your ADS installation directory:
- For general ARM programming: ADS Programming Guide (includes three manuals); Writing Efficient C for ARM (ARM App. Note 34)
- For profiling using AXD: AXD and armsd Debuggers Guide
- For using ARMulator: ADS Debug Target Guide; Benchmarking with ARMulator (ARM App. Note 93); The ARMulator Configuration File (ARM App. Note 52)
- For MPEG-4 video decoder knowledge: class slides, “Google” or “Wikipedia”