Page 1
Application Performance Analysis
of the Cortex-A9 MPCore
Bryan Lawrence
This project in ARM is in part funded by ICT-eMuCo, a European project supported under the Seventh Framework Programme (7FP) for research and technological development
Page 2
Agenda
Motivation
Experimentation platforms
Performance exploration of different application classes
Performance evaluation of multiple concurrent applications
Summary and conclusion
Page 3
Phone ++ Upcoming Use Cases
Mobile Internet Browsing
Video conferencing
Gaming on the Go
Multi-player over 3G / 4G network
3D Navigation
Page 4
Mobile Phone Applications
Compute Intensive
Page 5
Tablet Applications
Compute Intensive
Page 6
Achieving Scalable Performance
Clock frequency of the processor is not the only metric of performance
Scalable, energy-efficient performance is required from mobile devices (phones, tablets) through to large enterprise computing
Can multicore processors provide a solution?
Page 7
Hardware Platforms
Versatile Express
ARM-NEC Cortex™-A9 processor test-chip ~400MHz
Cortex-A9 x 4
4x NEON™/FPU
Individual 32KB I & D L1 caches
512KB L2 cache
1GB RAM (32b DDR2)
Early Partner Silicon
Cortex-A9 x 2 @ 1GHz
1GB RAM
Page 8
Video Decode / Encode
Hardware encoders/decoders are common in consumer devices
Video/audio codec standards evolve rapidly
Many codecs are used too infrequently to justify dedicated h/w
Consumer applications involve other video processing
Different from encode / decode (e.g. video editing)
Simultaneous encode / decode required for video conferencing
Page 9
H.264 Decode / Encode
FFmpeg used for decode
x264 library used with FFmpeg for video encode
CIF & VGA resolutions
Commonly used in video conf.
Movie trailers used
Higher order of computation than video conf. streams
Compression factor of 100 - 200
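To give a feel for what a compression factor of 100 - 200 means at these resolutions, the arithmetic can be sketched as below. CIF (352x288) and VGA (640x480) are the standard frame sizes; the 30 fps frame rate and 8-bit YUV 4:2:0 sampling (12 bits per pixel) are assumptions for illustration, not figures from the slides.

```python
# Raw (uncompressed) bitrate of a video stream and the bitrate implied by a
# given compression factor. The frame rate and the 4:2:0 sampling are
# assumptions made for this sketch, not values taken from the slides.

RESOLUTIONS = {"CIF": (352, 288), "VGA": (640, 480)}


def raw_bitrate_bps(width, height, fps=30.0, bits_per_pixel=12):
    """Bits per second of uncompressed video (12 bpp = 8-bit YUV 4:2:0)."""
    return width * height * bits_per_pixel * fps


def compressed_bitrate_bps(raw_bps, compression_factor):
    """Bitrate left after compressing by the given factor."""
    return raw_bps / compression_factor


for name, (w, h) in RESOLUTIONS.items():
    raw = raw_bitrate_bps(w, h)
    low = compressed_bitrate_bps(raw, 200)   # strongest compression
    high = compressed_bitrate_bps(raw, 100)  # weakest compression
    print(f"{name}: raw {raw / 1e6:.1f} Mbit/s, "
          f"compressed {low / 1e3:.0f}-{high / 1e3:.0f} kbit/s")
```

Changing `fps` or `bits_per_pixel` shifts the absolute numbers but not the ratio between raw and compressed streams.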
Page 10
H.264 Decode / Encode
Results for single core operation
Normalized logarithmic scales used
Encode is more compute intensive than decode (at least ~2-3 times)
Writing out decoded streams to secondary storage media is limited by media bandwidth
Page 11
H.264 Decode / Encode
Concurrent video decode + encode
Important use case for video conferencing
Excellent scalability is observed up to all 4 cores
Encoding is at least 2-3 times more compute intensive than decode
Ideally more resources should be dedicated to encode
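One way to act on the observation that encode deserves more resources is to split the available cores in proportion to the relative cost of each task. The sketch below is only an illustration: the function name `split_cores`, the default 3:1 weighting, and the proportional policy itself are assumptions, not the method used for the measurements in this deck.

```python
def split_cores(total_cores, encode_weight=3.0, decode_weight=1.0):
    """Split cores between encode and decode in proportion to their cost.

    Hypothetical helper: the weights stand in for the "2-3 times" ratio
    quoted on the slide. Each task always receives at least one core.
    """
    if total_cores < 2:
        raise ValueError("need at least 2 cores to run both tasks in parallel")
    share = encode_weight / (encode_weight + decode_weight)
    encode_cores = round(total_cores * share)
    # Clamp so that neither task is left without a core.
    encode_cores = max(1, min(total_cores - 1, encode_cores))
    return encode_cores, total_cores - encode_cores


encode_cores, decode_cores = split_cores(4)
print(f"encode: {encode_cores} core(s), decode: {decode_cores} core(s)")
```

In practice the operating system scheduler decides where threads run, so a split like this would typically be expressed as per-task thread counts rather than fixed core assignments.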
Page 12
On2/Google VP8
libvpx library used for decoding VP8 (from the WebM project)
libvpx uses multi-threading and actively takes advantage of the parallelism available in the VP8 codec.
Comparative results obtained on Versatile Express and 1GHz dual-core platforms
Page 13
On2/Google VP8
Shows good scalability with the number of cores.
Scalability is relatively independent of the number of partitions in the video frame
Saturation is observed for no. of threads > no. of cores
Designers can query the platform for the no. of cores to determine the available parallelism
[Charts: 1GHz dual-core; Versatile Express]
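Querying the platform for its core count and capping the worker threads accordingly might look like the sketch below. It uses Python's standard `os.cpu_count()`; the helper name `pick_decoder_threads` is hypothetical, and the code does not call libvpx itself.

```python
import os


def pick_decoder_threads(requested=None, available_cores=None):
    """Cap a requested thread count at the number of cores.

    Hypothetical helper: since throughput saturates once threads outnumber
    cores, asking for more threads than cores is trimmed down.
    """
    # os.cpu_count() may return None on some platforms, so fall back to 1.
    cores = available_cores or os.cpu_count() or 1
    if requested is None:
        return cores
    return max(1, min(requested, cores))


print("threads to use:", pick_decoder_threads(requested=8))
```

The `available_cores` argument exists only so the policy can be exercised without depending on the machine the sketch runs on.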
Page 14
Compilation - ffmpeg
Code compilation has inherent parallelism in terms of modules
Most build systems allow this parallelism to be exploited
E.g. make -j 4
Compilation of FFmpeg and Linux Kernel shown here
[Charts: 1GHz dual-core; Versatile Express]
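A build script can derive the `-j` value from the core count instead of hard-coding it. The following is a minimal sketch; `make_command` is a hypothetical helper and the `all` target is just an example, while `-jN` is the standard GNU make option for parallel jobs.

```python
import os


def make_command(target=None, jobs=None):
    """Build a `make` invocation with one job per available core.

    Hypothetical helper: it only constructs the argument list and does not
    run the build.
    """
    if jobs is None:
        jobs = os.cpu_count() or 1  # cpu_count() may return None
    cmd = ["make", f"-j{jobs}"]
    if target:
        cmd.append(target)
    return cmd


print(" ".join(make_command(jobs=4)))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from inside a source tree to start the build.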
Page 15
Compilation – Linux Kernel
[Charts: 1GHz dual-core; Versatile Express]
Almost linear speed-up is observed with the no. of cores in both cases
Effectively doubles (quadruples) the utilized memory bandwidth for 2 cores (4 cores)
Page 16
Browsers
Browser benchmark using a collection of web pages similar to the mix found in common browsing
Speed-up of 1.54 times observed between single and dual core execution
The ‘webcore’ fraction of the pie grows for multicore execution
[Charts: Normalized Performance (1.54x); Execution time decomposition]
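A 1.54x speed-up on two cores can be read through Amdahl's law, which relates speed-up to the fraction of work that runs in parallel. This is an interpretation added here, not an analysis from the deck, and it assumes the browser workload fits the simple serial-plus-parallel model. Under that assumption the measured figure implies a parallel fraction of roughly 0.70.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speed-up predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def parallel_fraction_from_speedup(speedup, cores):
    """Invert Amdahl's law: parallel fraction implied by a measured speed-up."""
    if cores < 2:
        raise ValueError("a speed-up measurement needs at least 2 cores")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


# 1.54x on 2 cores is the browser result quoted on the slide.
p = parallel_fraction_from_speedup(1.54, 2)
print(f"implied parallel fraction: {p:.2f}")
```

The model ignores memory contention and scheduling overhead, so the fraction is best treated as a rough indicator rather than a measured property of the browser.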
Page 17
Multiple Concurrent Applications
Multitasking is becoming mainstream in mobile devices today
Common combinations include
Browser + Audio playback
E.g. Internet Radio
Browser + background download
Independent applications can benefit immensely from parallelization
Page 18
Browser + Pandora Internet Radio
Speed-up factor of 1.9
Super-linear speed-up can sometimes be observed due to reduced cache pollution from conflicting applications
The speed-up can be traded for energy by slowing the cores down (depends on the fabrication process technology used)
[Charts: Normalized Performance (1.9x); Execution time decomposition]
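The trade of speed-up for energy can be illustrated with the common CMOS approximation that dynamic power scales with C·V²·f. The sketch below rests on several assumptions that are not from the slides: performance scales linearly with frequency, leakage power is ignored, and in the optimistic case the supply voltage can be lowered in proportion to frequency. How far voltage can actually drop is what depends on the process technology.

```python
def relative_dynamic_power(cores, speedup, voltage_tracks_frequency=True):
    """Dynamic power of `cores` slowed-down cores relative to one full-speed core.

    Hypothetical model: each core is slowed by 1/speedup so the multicore
    system matches single-core throughput, and power follows V^2 * f.
    """
    freq_scale = 1.0 / speedup
    # Optimistic case: voltage drops with frequency. Otherwise it stays fixed.
    volt_scale = freq_scale if voltage_tracks_frequency else 1.0
    return cores * freq_scale * volt_scale ** 2


best = relative_dynamic_power(2, 1.9)
worst = relative_dynamic_power(2, 1.9, voltage_tracks_frequency=False)
print(f"voltage scales with frequency: {best:.2f}x single-core power")
print(f"voltage fixed:                 {worst:.2f}x single-core power")
```

With the voltage held fixed the two slowed cores save nothing in this model, which is one way to see why the benefit hinges on the voltage headroom the silicon offers.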
Page 19
Browser + Internet File Download
Speed-up factor of 1.64x
Common use case involves downloading an app from an application store or marketplace while browsing the internet
Email synchronization in the background forms a similar use case
[Charts: Normalized Performance (1.64x); Execution time decomposition]
Page 20
Cortex-A9 MP Benefits – Performance
| Workload | 1 Core | 2 Core |
|---|---|---|
| Browser (single app) | 1 | 1.54 |
Page 22
Cortex-A9 MP Benefits – Richer Experience
| Workload | 1 Core | 2 Core | Speed-up |
|---|---|---|---|
| Browser (single app) | 1 | 1.54 | |
| Browser + Pandora | 0.78 | 1.50 | 1.9x |
| Browser + Download | 0.73 | 1.20 | 1.64x |
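The speed-up figures follow directly from the normalized scores on this slide: each is the 2-core score divided by the 1-core score for the same workload. A short check, using only the numbers shown above (the names `SCORES` and `speedup` are introduced here for illustration):

```python
# Normalized performance scores from the slide: (1 core, 2 cores).
SCORES = {
    "Browser (single app)": (1.00, 1.54),
    "Browser + Pandora": (0.78, 1.50),
    "Browser + Download": (0.73, 1.20),
}


def speedup(one_core, two_core):
    """Ratio of dual-core to single-core normalized performance."""
    return two_core / one_core


for workload, (one_core, two_core) in SCORES.items():
    print(f"{workload}: {speedup(one_core, two_core):.2f}x")
```

The 1-core scores below 1 for the combined workloads show the cost of sharing a single core between the browser and a second application.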
Page 23
Summary and Conclusion
This presentation demonstrates the scalability of the ARM Cortex-A9 MPCore™ processor across various classes of applications on currently available software
Better power/performance can be achieved using an efficient low-power ARM multicore processor than with a single processor at a much higher frequency
Next-generation software will make more intensive use of threads, and scalability will improve further.
Page 24
Thank You
Please visit www.arm.com for ARM related technical details
For any queries contact < [email protected] >