estimating multimedia instruction performance based on workload characterization and measurement...

19
Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement Gheewala, A.; Peir, J.-K.; Yen-Kuang Chen; Lai, K.; IEEE International Workshop on Workload Characte rization Pages: 98 - 106 Nov. 2002

Post on 21-Dec-2015

225 views

Category:

Documents


0 download

TRANSCRIPT

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement

Gheewala, A.; Peir, J.-K.; Yen-Kuang Chen; Lai, K.; IEEE International Workshop on Workload Characterization

Pages: 98 - 106

Nov. 2002

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 2/19

Abstract The increasing popularity in multimedia applications

provokes microprocessors to include media-enhancement instructions. In this paper, we describe a methodology to estimate performance improvement of a new set of media instructions on emerging applications based on workload characterization and measurement. Application programs are characterized into a sequential segment, a vectorizable segment, and extra data moves for utilizing the SIMD capability of new media instructions.

Techniques based on benchmarking and measurements on existing systems are used to estimate the execution time of each segment. Based on the measurement results, the speedup and the additional data moves of using the new media instructions can be estimated to help processor architects and designers evaluate different design tradeoffs.

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 3/19

Outline

What’s the problem Introduction Methodology foundation and analysis Proposed performance estimation methodology Experimental results and evaluation Conclusions

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 4/19

What’s the Problem

Traditional performance evaluation of a new set of media instructions is a time-consuming process Requires detailed processor models to handle both

regular and new SIMD media instructions Needs to generate executable binary codes for the new

media-extension instructions to drive simulator

It’s essential to quickly estimate the speedup of applications with a few additional media instructions to assess tradeoffs for new media instructions

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 5/19

Introduction

The proposed methodology Based on timing measurement on existing systems

Where the new SIMD instructions are not available Execution time of the following segments can be derived

Sequential segment Vectorized segment

code segment that can be vectorized by a set of new SIMD instructions

Data move segment Explicit data move code segment in using new SIMD instructions

Execution time of an application with SIMD instructions can be estimated from the three segments Only need existing hardware No cycle-accurate simulator is required

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 6/19

Estimating Speedup for MMX

Amdahl’s law can estimate the speedup of an application

f is fraction of the program that can be vectorized n is the ideal speedup of f

Modify Amdahl’s law to accommodate the MMX technology

O is portion of the code in the vectorizable segment that can’t be replaced by MMX instructions Such as program constructs loop controls and procedure calls

D represents the fraction of the data move instructions Explicitly data move instruction to/from MMX register

m is the speedup of the data moves

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 7/19

SIMD with Data Rearrangement Data Arrangement in Registers for Matrix Multiplication

Packed Multiply-and-Add (PMADDWD) Performs four 16 bits multiplications and two 32 bits additions

Packed-Add (PADDD) Performs two 32 bits additions

16 16

32

32

32

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 8/19

SIMD with Data Rearrangement (cont.) Another Way of Data Arrangement in Registers

More natural data arrangement

Invent new PADDD to accomplish this Adds the high-order and low-order 32 bits of each of the two source

registers

16 16

32 32

32

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 9/19

Workload Characterization and Measurement Four types of code

Equivalent C-code (executable on existing system) Application program written in C

MMX-code (un-executable on existing system) Develops with new SIMD and data move instructions

Pseudo MMX-code (executable on existing system) Replaces new SIMD with equivalent MMX-like C instructions Includes all the data moves as that in the MMX-code

Cripple code (executable on existing system) Removes new SIMD in MMX-code without replacement

Important assumption Four SIMD computation instructions are assumed to be new to the

current MMX ISA PMADDWD, PADDD, PSUBD, PSRAD

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 10/19

Workload Characterization and Measurement

Replaces the corresponding new SIMD instructions with the

equivalent C instructions

Keeps all the data move instructions as that in the

original MMX-code

Portion of the MMX-code and its equivalent pseudo MMX-code from IDCT

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 11/19

Timing Components of Four Types of Code

Sequential segment (1-f)

Vectorizable portion of the C-code (f-O) Unvectorizable portion (O)

Data-move segment (D)

Execution time for the individual components can be derived except for the new SIMD instructions

Main target for improvement with new

SIMD instructions

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 12/19

Performance Projection and Verification Individual Timing Components Derivation

Data-move segment (D) Difference of execution time between equivalent C-code and pseu

do MMX-code Vectorizable portion of the C-code (f-O)

Difference of execution time between Cripple code and pseudo MMX-code

Unvectorizable portion (O) Difference of execution time between vectorizable portion of the

C-code (f-O) and original vectorizable segment (f)

Total execution time and speedup estimation Sequential segment execution time (1-f) Unvectorizable portion execution time (O) Execution time spent on new SIMD instructions (f-O) / n Data-move segment execution time (D)

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 13/19

Performance Projection and Verification Steps for estimating speedup factor (n) of the new SIMD

Step1: Assembly code examined for each new SIMD instruction

Explicit data-move instructionsPMADDWD +

=

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 14/19

Performance Projection and Verification

Step2: Estimates execution latency of the assembly Execution latency of each assembly instruction is specified in

the architectural book Finally, obtains the estimated speedup factor (n)

Step3: Repeats the above steps for new SIMD instructions Obtains the respective speedup of each new SIMD instruction

Step4: Calculates the weighted average speedup According to the number of occurrences of each new SIMD

instruction in the application

Thus, we can estimate the time spent on all the new SIMD instructions : (f-O) / n

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 15/19

IDCT Case Study Results

Estimated Speedup Factor (n) for New SIMD Instructions

8.09=

New SIMD computation instruction equivalent C code

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 16/19

IDCT Case Study Results (cont.)

IDCT Performance Measurement and Project

Sequential Unvectorizable New MMXData moves+ + +

=

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 17/19

Experimental Results and Evaluation

Overall speedup is close 1.5 with 2 times of performance improvement for the new SIMD instructions

Overall speedup is over 2.5 given 10 times improvement of the new SIMD instructions

Overall speedup

Execution time

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 18/19

Experimental Results and Evaluation (cont.)

Overall speedup reduces from 2.9 to 2.7 with 30% more data move overhead

Overall speedup increases from 2.9 to 3.1 if data move overhead can be reduced by 30%

Execution time

Overall speedup

112/04/18 Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement 19/19

Conclusions

Presents a performance estimation method for using new media instructions Base on characterize media workload with benchmarking

and measurement on existing systems No cycle-accurate simulator is required

Given a range of performance improvement of the new media instructions, the proposed method can estimate a range of overall speedup