![Page 1: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/1.jpg)
Software Performance Tuning Project
Monkey’s Audio
Prepared by: Meni Orenbach, Roman Kaplan
Advisors: Liat Atsmon, Kobi Gottlieb
![Page 2: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/2.jpg)
MAC – Ape File Encoder
• Monkey’s Audio: a lossless audio codec
• Can compress at different levels
• Can be decompressed back to a WAV file
• Used to save storage space while keeping all the original data
• Playable
![Page 3: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/3.jpg)
Platform and Benchmark Used
• Platform: Intel Core i7 with 3 GB of RAM, running Windows Vista.
• Benchmark:
  - 238 MB song.
  - Original encoding duration: 98.9 sec
![Page 4: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/4.jpg)
Algorithm Description
• The input file is read frame by frame
• Every frame contains a constant number of channels
• Channels are encoded with dependencies between them
• Every frame is encoded and immediately written
![Page 5: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/5.jpg)
The Encoding Process
[Figure: the input file is split into Frame 1 ... Frame n; each frame holds Channel 1 ... Channel 8; the frames become Encoded Frame 1 ... Encoded Frame n and go to the output file. Three "MultiThread Here!" callouts mark the points considered for multi-threading.]
![Page 6: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/6.jpg)
Function Data Flow
[Figure: function data flow with the boxes "Encoding every frame", "Encode with a predictor" and "Encoding the error for every channel". A callout marks the most time-consuming functions.]
![Page 7: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/7.jpg)
Optimization Method
• Focus on the most time-consuming functions
• Two approaches were taken:
  - Multi-threading
  - SIMD
![Page 8: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/8.jpg)
Optimization Method 1: Threads
• Monkey’s Audio originally ran on a single thread
• The threaded version should maintain 1:1 bit compatibility with the original output
• This requires changing the flow of the program
![Page 9: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/9.jpg)
Changing The Program Flow
Originally:
• Each frame is encoded and written immediately
After the change:
• Each frame is encoded and written to a buffer
• The buffer is filled during the encode process
• The buffer is written once all previous frames have been encoded and written
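The ordering rule above can be illustrated with a small reorder buffer. This is only a sketch in Python, not the project's C++ code; the class name `OrderedWriter` and its fields are hypothetical. It holds each finished frame until every earlier frame has been written.

```python
import io


class OrderedWriter:
    """Buffers encoded frames and emits them strictly in frame order."""

    def __init__(self, sink):
        self.sink = sink        # any object with a write(bytes) method
        self.pending = {}       # frame index -> encoded bytes not yet writable
        self.next_index = 0     # index of the next frame allowed to be written

    def submit(self, index, encoded):
        """Store one finished frame, then flush every frame that is now in order."""
        self.pending[index] = encoded
        while self.next_index in self.pending:
            self.sink.write(self.pending.pop(self.next_index))
            self.next_index += 1


out = io.BytesIO()
writer = OrderedWriter(out)
# Frames finish encoding out of order: 2, then 0, then 1.
for index, data in [(2, b"C"), (0, b"A"), (1, b"B")]:
    writer.submit(index, data)
result = out.getvalue()
```

Frame 2 stays in the buffer until frames 0 and 1 have been written, so the output order matches the single-threaded encoder regardless of which frame finishes first.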
![Page 10: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/10.jpg)
Our Implementation
We use the following threads:
• Main thread: transfers frame data to the encode threads
• Write thread: writes the encoded buffers to the output file
• Encode threads: each encodes the frame it is given
Note: we use N+2 threads in total, where N is the number of threads available.
![Page 11: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/11.jpg)
Data Structures Used
ThreadParam – a linked list of objects that contain the encoded data
EncodeParam – an object containing the data needed to encode a frame
WriteParam – an object containing the data needed to write to the output
FramePredictor – a global array that signals dependencies between frames
![Page 12: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/12.jpg)
Threads Schema
[Figure: a linked list of nodes with Head and Tail pointers. Each node holds: Buffer, Buffer Counter, Frame index, Thread ID, Encode Done (T/F), Mutexes, Next. Encode Thread 1, Encode Thread 2, the Write Thread and the Main Thread all operate on the list.]
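A rough Python sketch of this thread layout follows, under several assumptions: `FrameNode` is a hypothetical stand-in for one list node (a plain Python list replaces the linked list), `zlib.compress` is a placeholder for the real frame encoder, and a `threading.Event` plays the role of the "Encode Done" flag and its mutex. It shows the structure of the N+2 threads, not their performance.

```python
import queue
import threading
import zlib
from dataclasses import dataclass, field


@dataclass
class FrameNode:
    """Hypothetical stand-in for one node of the schema."""
    frame_index: int
    raw: bytes
    buffer: bytes = b""      # filled by an encode thread
    encode_done: threading.Event = field(default_factory=threading.Event)


def encode_worker(jobs):
    # Encode thread: take a node, fill its buffer, raise its "encode done" flag.
    while True:
        node = jobs.get()
        if node is None:     # sentinel: no more frames
            break
        node.buffer = zlib.compress(node.raw)   # placeholder for the real codec
        node.encode_done.set()


def write_worker(nodes, out):
    # Write thread: walk the nodes in frame order, waiting on each flag.
    for node in nodes:
        node.encode_done.wait()
        out.append(node.buffer)


def encode_file(frames, n_encoders=2):
    nodes = [FrameNode(i, raw) for i, raw in enumerate(frames)]
    jobs = queue.Queue()
    out = []
    encoders = [threading.Thread(target=encode_worker, args=(jobs,))
                for _ in range(n_encoders)]
    writer = threading.Thread(target=write_worker, args=(nodes, out))
    for t in encoders + [writer]:
        t.start()
    for node in nodes:       # main thread hands the frames to the encoders
        jobs.put(node)
    for _ in encoders:       # one sentinel per encode thread
        jobs.put(None)
    for t in encoders + [writer]:
        t.join()
    return out


frames = [bytes([i]) * 1000 for i in range(16)]
encoded = encode_file(frames, n_encoders=4)
```

Because the write thread waits on each node in list order, the output order never depends on which encode thread finishes first.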
![Page 13: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/13.jpg)
Dependencies Between Frames
Once a frame has finished encoding, there may be leftover data, which can be handled in two ways:
1. Write the leftover data after the encoded frame
2. Re-encode the leftover data with the next frame
We always write the leftover data after the encoded frame.
![Page 14: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/14.jpg)
Dealing With Dependencies Between Frames
• Use the write thread to start a new encode thread
• Remove the ‘wrongly encoded’ frame from the list
• Keep encoding the rest normally
• Keep writing to the output file in the right order!
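One way to read the steps above is as speculative encoding: frames are encoded in parallel under a guessed dependency value, and the writer re-encodes any frame whose guess turns out to be wrong. The sketch below is single-threaded Python and makes strong assumptions: the toy `encode` function, the `carry` value and the guess of 0 are all invented for illustration and are not the Monkey's Audio algorithm.

```python
def encode(frame, carry_in):
    """Toy codec (an assumption): output depends on the previous frame's carry."""
    encoded = [(sample + carry_in) & 0xFF for sample in frame]
    carry_out = sum(frame) % 3      # stand-in for the state left over by a frame
    return encoded, carry_out


def encode_speculatively(frames, guess=0):
    # Phase 1: every frame is encoded independently with a guessed carry.
    # In a threaded design this is the work done by the encode threads.
    speculative = [(guess, encode(frame, guess)) for frame in frames]

    # Phase 2: the writer walks the frames in order and knows the true carry.
    output, carry, redone = [], 0, 0
    for frame, (assumed, (encoded, carry_out)) in zip(frames, speculative):
        if assumed != carry:
            # Wrong guess: drop the result and encode this frame again.
            encoded, carry_out = encode(frame, carry)
            redone += 1
        output.append(encoded)
        carry = carry_out
    return output, redone


def encode_serially(frames):
    output, carry = [], 0
    for frame in frames:
        encoded, carry = encode(frame, carry)
        output.append(encoded)
    return output


frames = [[1, 2, 3], [4, 5, 7], [7, 8, 9], [10, 11, 13]]
parallel_output, redone = encode_speculatively(frames)
serial_output = encode_serially(frames)
```

In this toy run only the third frame is re-encoded, and the output matches the serial encoder. The re-encode costs extra work, which is the kind of overhead that keeps the threaded encoder below the ideal speedup.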
![Page 15: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/15.jpg)
The Problem
• There is also a data leftover between frames
• This dependency is unpredictable
• It is impossible to maintain 1:1 bit compatibility
• We ‘guess’ the best value so that no data is lost!
![Page 16: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/16.jpg)
Results: VTune Thread Profiler
![Page 17: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/17.jpg)
Results: VTune Thread Checker
![Page 18: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/18.jpg)
MultiThreading Conclusion
• Total speedup from using MT: x3.15!
[Chart: "MT SpeedUp", y-axis from 0 to 4, with bars for Original, 8 cores no dep, 8 cores with dep, and 4 cores.]
![Page 19: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/19.jpg)
Explaining The Speedup
• By Amdahl’s law: the benchmark has 2 serial parts (reading the first frames and encoding the last frame) that take roughly 7-8% of the run time. Taking P = 0.93 for the parallel part, we get:

$$t_{\text{expected}} = \frac{1}{(1 - P) + \frac{P}{N}} = \frac{1}{(1 - 0.93) + \frac{0.93}{4.8}} \approx 3.8$$

• In addition, dealing with the dependencies in our solution added ~20% more instructions, so we expect:

$$t_{\text{MT, not optimal}} = \frac{t_{\text{expected}}}{1.2} \approx 3.15$$
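The arithmetic can be checked with a few lines of Python. The values 0.93, 4.8 and 1.2 are taken from the slide; the function name `amdahl_speedup` is just a label for this sketch.

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law: speedup when a fraction of the work is spread over n units."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


expected = amdahl_speedup(0.93, 4.8)    # values taken from the slide
with_overhead = expected / 1.2          # ~20% more instructions
print(f"expected: {expected:.2f}, with overhead: {with_overhead:.2f}")
# prints "expected: 3.79, with overhead: 3.16"
```

The result agrees with the measured x3.15 to within rounding.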
![Page 20: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/20.jpg)
Optimization Method 2: SIMD
• The original code is written using MMX technology
• It operates only on arrays of 16-bit integers
• We applied SSE to two main functions:
  - Adapt()
  - CalculateDotProduct()
Note: these functions are written entirely in assembly.
![Page 21: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/21.jpg)
Adapt() - Improvements
• Add and Sub instructions on arrays of 16-bit integers (supported in MMX)
• Each iteration goes over 32 sequential array elements
• The input and output arrays were aligned to prevent ‘split loads’
![Page 22: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/22.jpg)
Adapt() – Main Loop
Old code (MMX):
    movq   mm0, [eax]          ; load 4 x 16-bit elements
    paddw  mm0, [ecx]          ; lane-wise 16-bit add
    movq   [eax], mm0          ; store the result back
    movq   mm1, [eax + 8]
    ...
    movq   mm3, [eax + 24]
    paddw  mm3, [ecx + 24]
    movq   [eax + 24], mm3
New code (SSE, aligned):
    movdqa xmm0, [eax]         ; aligned load of 8 x 16-bit elements
    movdqa xmm2, [ecx]
    paddw  xmm0, xmm2          ; lane-wise 16-bit add
    movdqa [eax], xmm0         ; aligned store of the result
    movdqa xmm1, [eax + 16]
    movdqa xmm3, [ecx + 16]
    paddw  xmm1, xmm3
    movdqa [eax + 16], xmm1
Note: there is an equivalent loop with SUB operations.
An MMX register is 8 bytes; an SSE register is 16 bytes.
16 vs. 12 instructions per iteration.
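To make the register-width argument concrete, here is a scalar Python model of `paddw`. It is an illustration only: the helper names are hypothetical and the real functions are hand-written assembly. It processes the same 32 elements with 4 lanes (MMX) and with 8 lanes (SSE) and counts the simulated add operations.

```python
def to_i16(value):
    """Wrap an integer to a signed 16-bit lane, as paddw does on overflow."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def paddw(dst, src):
    """Lane-wise wrapping add of two equal-length lists of 16-bit integers."""
    return [to_i16(a + b) for a, b in zip(dst, src)]


def adapt_add(dst, src, lanes):
    """Add src into dst, `lanes` elements per simulated register operation."""
    assert len(dst) == len(src) and len(dst) % lanes == 0
    out, ops = [], 0
    for i in range(0, len(dst), lanes):
        out.extend(paddw(dst[i:i + lanes], src[i:i + lanes]))
        ops += 1
    return out, ops


dst = [32767, -32768, 5, -7] * 8    # 32 elements, one loop iteration
src = [1, -1, 10, 7] * 8
mmx_result, mmx_ops = adapt_add(dst, src, lanes=4)   # 64-bit MMX register
sse_result, sse_ops = adapt_add(dst, src, lanes=8)   # 128-bit SSE register
```

Both widths give identical results, including the wraparound cases, while the wider register needs half as many add operations for the same 32 elements.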
![Page 23: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/23.jpg)
SIMD - CalculateDotProduct()
• Multiply-add over arrays of 16-bit integers.
• Each iteration goes over 32 array elements.
• The speedup will be calculated for both functions together.
![Page 24: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/24.jpg)
CalculateDotProduct()
Old code (MMX):
    movq    mm0, [eax]         ; load 4 x 16-bit elements
    pmaddwd mm0, [ecx]         ; multiply-add into 32-bit lanes
    paddd   mm7, mm0           ; accumulate in mm7
    movq    mm1, [eax + 8]
    ...
    movq    mm3, [eax + 24]
    pmaddwd mm3, [ecx + 24]
    paddd   mm7, mm3
New code (SSE, aligned):
    movdqa  xmm0, [eax]        ; aligned load of 8 x 16-bit elements
    movdqa  xmm4, [ecx]
    pmaddwd xmm0, xmm4         ; multiply-add into 32-bit lanes
    paddd   xmm7, xmm0         ; accumulate in xmm7
    movdqa  xmm1, [eax + 16]
    movdqa  xmm4, [ecx + 16]
    pmaddwd xmm1, xmm4
    paddd   xmm7, xmm1
• Each iteration multiply-adds 32 array elements
16 vs. 12 instructions per iteration.
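The `pmaddwd` / `paddd` pattern can be modelled in scalar Python as follows. This is a sketch under assumptions: the final horizontal add is not shown on the slide and is assumed here, and Python integers do not wrap at 32 bits the way the real registers do.

```python
def pmaddwd(a, b):
    """Multiply 16-bit lanes pairwise and add adjacent products into 32-bit lanes."""
    assert len(a) == len(b) and len(a) % 2 == 0
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, len(a), 2)]


def dot_product(a, b, lanes=8):
    """Dot product built from pmaddwd + paddd on a simulated accumulator register."""
    assert len(a) == len(b) and len(a) % lanes == 0
    acc = [0] * (lanes // 2)                          # plays the role of mm7 / xmm7
    for i in range(0, len(a), lanes):
        partial = pmaddwd(a[i:i + lanes], b[i:i + lanes])
        acc = [x + y for x, y in zip(acc, partial)]   # paddd
    return sum(acc)                                   # assumed final horizontal add


a = list(range(-16, 16))    # 32 elements
b = [3, -2] * 16
result = dot_product(a, b)                 # 8 lanes, like an SSE register
result_mmx = dot_product(a, b, lanes=4)    # 4 lanes, like an MMX register
reference = sum(x * y for x, y in zip(a, b))
```

Both lane widths produce the same sum as the plain scalar loop, so moving from MMX to SSE changes only how many elements each instruction handles.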
![Page 25: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/25.jpg)
SIMD Speedup Achieved
• Adapt() local speedup: x1.72
Overall speedup: x1.2
• CalculateDotProduct() local speedup: x1.62
Overall speedup: x1.2
• Total speedup using SIMD: x1.4!
![Page 26: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/26.jpg)
Intel Tuning Assistant
No micro-architectural problems were found in the optimized code.
![Page 27: Software Performance Tuning Project Monkey’s Audio](https://reader036.vdocuments.us/reader036/viewer/2022070400/5681351b550346895d9c7627/html5/thumbnails/27.jpg)
Final Results
A total speedup of x4.017 was achieved by using only MT and SIMD.
[Chart: "Overall SpeedUp", y-axis from 0 to 4.5, with bars for Original, MT, SIMD, and ALL.]