
Post on 19-Dec-2015


Musepack Encoder Performance Tuning

Tal Rath and Eyal Enav, May 2008

Technion Softlab

Project goals
Project description
◦ What is Musepack?
◦ Using multithreading approach
◦ Applying SIMD
◦ Analyzing micro-architecture problems
Results – Speedup overview
Conclusions and recommendations
Our benefits
Next steps

Agenda

◦ Speeding up and optimizing a Musepack encoder while maintaining bitwise output compatibility:

◦ Examining the encoder's structure and methods.

◦ Analyzing the time distribution of the encoder's functions using Intel's VTune.

◦ Applying multithreading, SIMD instructions and other techniques, guided by VTune, to achieve speedup.

◦ Returning the code to the open source community.

Project goals

Project platform: Intel Core 2 Duo, 2.4 GHz, 64-bit, 2 GB of RAM, Windows XP.

Speedup measurement:

\[ \text{Speedup} = \frac{\text{Original execution time}}{\text{New execution time}} \]

Project description

What is Musepack?

◦ Musepack is an open source audio codec.

◦ It is a lossy encoder.

◦ Musepack has performed well in various listening tests at both lower and higher bitrates.

Project description

Thread-level parallelism reduces program execution time by executing multiple code sections on both cores simultaneously.

Amdahl's law: if P is the fraction of the program that can run in parallel, then the maximum speedup that can be achieved with 2 processors is:

\[ \text{Speedup}_{\max} = \frac{1}{1 - \frac{P}{2}} \]

Therefore, P should be maximized.

Intel's VTune was used to identify time-consuming functions that are suitable for multithreading.

Using Multithreading approach
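As a rough illustration of this bound, the C sketch below evaluates Amdahl's law for a given parallel fraction. The helper name `amdahl_speedup` and the generalisation to `n` processors are assumptions made for illustration; the slide itself only states the two-processor case, which corresponds to `n = 2`.

```c
#include <assert.h>

/* Upper bound on speedup when a fraction p of the run time
 * (0 <= p <= 1) is spread perfectly over n processors.
 * Hypothetical helper, not part of the Musepack sources. */
static double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    /* For n = 2 this equals 1 / (1 - p/2), the form on the slide. */
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With the parallel part of 73.2% reported in the multithreading results, `amdahl_speedup(0.732, 2)` is about 1.58, so the measured 1.43X sits below the theoretical ceiling for two cores.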

Functions’ total timer events:

Psychoakustic_Modell's time consumption is high; it should therefore be a target for multithreading.

Using Multithreading approach

The function contains two separate models with the same instructions and different data.

Each model should be executed in a different thread.

Multithreading Psychoakustic Function – First Attempt

Problem: very high dependency between the models through local and global variables; the second model uses the first one's output.

Multithreading Psychoakustic – First attempt

Observation: the Psychoakustic function contains left and right channel handling functions.

◦ These functions can be divided into two types:

◦ Single channel functions, for example: FunctionL(Left Param1, Left Param2, ..., Local Param1, Local Param2)

◦ Dual channel functions, for example: FunctionLR(Left Param1, Right Param1, …)

◦ Single channel functions do not access the opposite channel's local variables.

Timer events distribution: Single – 84%, Dual – 16%

Multithreading Psychoakustic – Second attempt

Strategy:

◦ One single channel function in each thread.

[Figure: timeline of Thread A and Thread B along a time axis, alternating "Two single channel functions" blocks (Left and Right) with "Dual channel function" blocks.]

Multithreading Psychoakustic – Second attempt
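The strategy on this slide can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the encoder's actual code: POSIX threads stand in for whatever threading API the project used (the slides do not say), and `channel_job`, `single_channel_stage`, `dual_channel_stage`, `process_frame` and `FRAME_LEN` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define FRAME_LEN 8  /* illustrative frame size, not the encoder's */

/* Per-channel working data; each thread touches only its own copy. */
typedef struct {
    const float *samples;  /* input for one channel */
    float energy;          /* output of the single channel stage */
} channel_job;

/* Stand-in for a "single channel function": reads one channel only. */
static void *single_channel_stage(void *arg)
{
    channel_job *job = arg;
    float sum = 0.0f;
    for (size_t i = 0; i < FRAME_LEN; ++i)
        sum += job->samples[i] * job->samples[i];
    job->energy = sum;
    return NULL;
}

/* Stand-in for a "dual channel function": needs both channels,
 * so it runs only after both threads have been joined. */
static float dual_channel_stage(const channel_job *l, const channel_job *r)
{
    return l->energy + r->energy;
}

/* Left channel on thread A, right channel on thread B, then the
 * serial dual channel stage. Returns -1 if a thread cannot start. */
static float process_frame(const float *left, const float *right)
{
    channel_job l = { left, 0.0f }, r = { right, 0.0f };
    pthread_t a, b;

    if (pthread_create(&a, NULL, single_channel_stage, &l) != 0)
        return -1.0f;
    if (pthread_create(&b, NULL, single_channel_stage, &r) != 0) {
        pthread_join(a, NULL);
        return -1.0f;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return dual_channel_stage(&l, &r);
}
```

Because the dual channel stage waits for both joins, only the single channel work (84% of the timer events in this function, according to the previous slide) overlaps in time.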

Implementation:

◦ Left channel local variables are used by thread A, while right channel ones are used by thread B. Shared variables, used by both threads, are duplicated: one copy for each thread.

◦ Technical problem: the program contains a large number of global variables.

◦ These are accessed by both left and right single channel functions, and would therefore be accessed from both threads simultaneously.

Multithreading Psychoakustic – Second attempt

A, About, ANSspec_L, ANSspec_M, ANSspec_R, ANSspec_S, APE_Version, array, b, Bandwidth, Buffer, BufferBytes, BufferedBits, bump_exp, bump_start, Butfly, __C, c, Ci_opt, CombPenalities, Cos_Tab, CosWin, CP_10000, CP_10079, CP_1250, CP_1251, CP_1252, CP_1253, CP_1254, CP_1255, CP_1256, CP_1257, CP_1258, CP_37, CP_42, CP_437, CP_500, CVD_used, __D, d, data_finished, DelInput, DisplayUpdateTime

Solution - "Divide and Conquer" approach:

◦ Map all globals, using a globals marking script.

◦ Duplicate the globals that are accessed by functions at the deepest level of the function call tree.

◦ After these functions are handled, proceed to the next higher level.

◦ The process ends when the global variables accessed from within the upper level (Psychoakustic's own code) have been duplicated.

Multithreading Psychoakustic - Global Variables

Before:
float g_var1; (global/static variable)
...
Function A() { g_var1 = value; }

After:
64-byte-aligned, duplicated struct { float g_var1; }
...
Function A(thread num) { struct.g_var1 = value; }

[Figure: call tree from Psychoakustic() at the upper level down to the deepest level.]

The structs are aligned to 64 bytes (to avoid shared cache lines).
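A minimal C sketch of this duplication scheme is shown below. It assumes C11 `_Alignas` and a 64-byte cache line; the names `thread_globals`, `g_copy`, `g_counter` and `function_a` are hypothetical, and only `g_var1` comes from the slide.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_THREADS 2   /* one copy per thread, as on the slide */
#define CACHE_LINE  64  /* assumed cache line size in bytes */

/* Former globals gathered into one struct. Aligning a member to a
 * cache line raises the struct's alignment to 64 and pads sizeof()
 * to a multiple of 64, so two copies never share a cache line. */
typedef struct {
    _Alignas(CACHE_LINE) float g_var1;
    int g_counter;      /* hypothetical second former global */
} thread_globals;

static thread_globals g_copy[NUM_THREADS];

/* Was: void function_a(void) { g_var1 = value; } */
static void function_a(int thread_num, float value)
{
    g_copy[thread_num].g_var1 = value;
    g_copy[thread_num].g_counter++;
}
```

Passing the thread number down the call chain is what makes the bottom-up order on the slide natural: the deepest functions gain the extra parameter first, and their callers are then updated to supply it.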

Multithreading - Results

After the Psychoakustic multithreading, two more functions were multithreaded using the same mechanism.

Total threading speedup: 1.43X

◦ Parallel part: 73.2%.
◦ Assuming the serial part does not change, the new execution time of the multithreaded part is 57% of its original time.

◦ Threading overhead:
◦ Total program instruction count (IC) increased by 2.6%.
◦ Total timer event count increased by 0.62%.

Intel Thread Checker found no errors.

(Thread Profiler)

The original encoder settings use the "Precise" F.P. model instead of the "Fast" F.P. model.

Precise mode increases calculation time.

The F.P. model was changed to "Fast" (after consulting our instructors).

In the original program, sqrt operations on single-precision F.P. arguments were performed in double precision.

These instructions were changed to single precision.

Speedup gained so far: 1.77X

The output file is bitwise compatible only with the output of the original encoder built in "Fast" F.P. mode:

◦ A value difference of around $10^{-5}$ % from the "Precise mode" output is due to rounding.
◦ Such minor differences cannot be noticed by the human ear.

Floating Point Issues

SIMD is a technique employed to achieve data-level parallelism. SIMD instructions enable the execution of 4 F.P. operations at a time.

Function self time distribution:

The Sqrt function is the main target for SIMD instructions.

SIMD Instructions

SIMD instructions were used in the four functions that call Sqrt.

These functions were transformed into SIMD-oriented functions: sqrt as well as other mathematical operations were performed by SIMD instructions.

In one of the functions, because the number of loop iterations varies, the Sqrt array was calculated in advance using SIMD instructions.

No calls to the original Sqrt remained after applying SIMD.

SIMD gained speedup: 23% (with multithreading).

SIMD Instructions - implementation
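Precomputing a Sqrt array four elements at a time might look like the C sketch below. It assumes SSE intrinsics from `<xmmintrin.h>` on an x86 target; the slides do not name the instruction set, so SSE is inferred from the "4 at a time" figure and the Core 2 Duo platform. The function name `sqrt_array_sse` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* out[i] = sqrt(in[i]) for i in [0, n), four floats per instruction.
 * Assumes n is a multiple of 4. Unaligned loads and stores are used,
 * so the caller's buffers need no special alignment. */
static void sqrt_array_sse(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);         /* load 4 floats */
        _mm_storeu_ps(out + i, _mm_sqrt_ps(v));  /* 4 square roots */
    }
}
```

A loop whose trip count is not known in advance can then read its square roots from the precomputed `out` array instead of calling a scalar sqrt in each iteration.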

Micro Architecture Issues

Using VTune's Tuning Assistant, several micro-architecture problems were discovered:

◦ RAT_STALLS.FLAGS – indicates partial flag stalls. About $10^9$ events, each causing a stall of ~10 cycles, for ~4 sec in total. Possible solution: instruction substitution, such as INC to ADD. The events occur in the 'fread' function and therefore cannot be modified.

◦ LOAD_BLOCK.OVERLAP_STORE – load instructions are blocked. The cause can be 4K (page size) aliasing or a load-store block overlap. Possible solution: increase 4K-sized arrays by the block size and use 64-byte alignment. The solution was applied; the results were unnoticeable.
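The padding and alignment fix for LOAD_BLOCK.OVERLAP_STORE can be sketched in C as below. It assumes C11 `_Alignas` and reads "block size" as one 64-byte cache line, which is an interpretation rather than something the slide states; `padded_buffers` and the macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64                          /* bytes, assumed line size */
#define N_FLOATS   (4096 / sizeof(float))      /* a 4 KB array of floats */
#define PAD_FLOATS (CACHE_LINE / sizeof(float))

/* Without padding, a[i] and b[i] would lie exactly 4096 bytes apart
 * and share their low 12 address bits, which can cause 4K aliasing
 * between a store to one array and a load from the other. One extra
 * cache line per array shifts the second array off that boundary,
 * while _Alignas keeps both arrays 64-byte aligned. */
typedef struct {
    _Alignas(CACHE_LINE) float a[N_FLOATS + PAD_FLOATS];
    _Alignas(CACHE_LINE) float b[N_FLOATS + PAD_FLOATS];
} padded_buffers;
```

With this layout the two arrays start 4160 bytes apart, so corresponding elements no longer differ by a multiple of 4096.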

[Figure: speedup chart; surviving data label: 2.03]

Speedup Overview

Multithreading
◦ Can produce significant program acceleration.
◦ Global variables can be an obstacle in the process of multithreading.

SIMD instructions
◦ Enhance speedup.
◦ Can be implemented only on specific code parts.
◦ Sometimes, implementation should be "creative".

Micro architecture
◦ In this program no major problems were found.
◦ VTune's Tuning Assistant is a powerful tool for tracking micro-architecture problems.

Conclusions

Adapting the encoder to quad-core processors by creating 4 threads.

Designing a multithreading assistance program that will trace and handle global variables using the suggested algorithm.

Optional Future Steps

Improving our expertise in identifying the dominant factors in a process and handling them.

Enhancing our knowledge of multithreading techniques.

Learning how to use SIMD instructions.

Being exposed to a few micro-architecture problems.

Our Benefit

The End

Thank you