Bottlenecks of SIMD

Haibin Wang

Wei Tong

Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements

IEEE Transactions on Computers, vol. 52, no. 8, August 2003

Deepu Talla, Member, IEEE, Lizy Kurian John, Senior Member, IEEE, and Doug Burger, Member, IEEE

Outline

Introduction

Bottlenecks Analysis

MediaBreeze Architecture

Summary

Introduction

Multimedia SIMD extensions are widely used to speed up media processing, but they are not used very efficiently.

75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions.
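To put those percentages in perspective, the short calculation below converts a supporting-instruction fraction into the number of supporting instructions executed per core instruction. The function name is illustrative; only the 75 and 85 percent figures come from the slide.

```python
def supporting_per_core(supporting_fraction):
    """Supporting instructions executed for each core (useful) instruction."""
    if not 0.0 <= supporting_fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return supporting_fraction / (1.0 - supporting_fraction)


low = supporting_per_core(0.75)   # 3 supporting instructions per core instruction
high = supporting_per_core(0.85)  # between 5 and 6 per core instruction
```

In other words, at these ratios the processor issues roughly three to six bookkeeping instructions for every instruction that does actual computation.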

Introduction

The bottlenecks are caused by the loop structure and the memory access patterns of media programs.

So instead of exploiting more data-level parallelism, the paper focuses on improving the efficiency of the instructions supporting the core computation.

Introduction

This paper makes two major contributions:

First, it targets the supporting instructions, rather than the core computation, as the way to improve SIMD performance, which is a novel angle.

Second, it proposes the MediaBreeze architecture as a way to reduce or eliminate supporting instructions.

Nested Loop

Analysis of the loop structure

Each sub-block is very small, so there is little DLP to exploit per block, while every block still needs many supporting instructions.

Each block is processed by five nested loops, so a lot of time is spent on branches.

The data must be reorganized before SIMD instructions can be applied.
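A rough way to see the branch cost is to count it. The sketch below assumes a block transform shaped like a 2D DCT (two loops over blocks, two over elements in a block, one inner dot product) and charges one branch per loop iteration. Both the loop shape and the counting rule are simplifying assumptions, not the paper's code.

```python
def count_loop_work(height, width, block=8, simd_width=1):
    """Count loop branches and core operations in a five-deep loop nest.

    simd_width is how many inner-loop elements one core operation handles.
    """
    branches = 0
    core_ops = 0
    for _by in range(0, height, block):              # loop 1: block rows
        branches += 1
        for _bx in range(0, width, block):           # loop 2: block columns
            branches += 1
            for _i in range(block):                  # loop 3: rows in a block
                branches += 1
                for _j in range(block):              # loop 4: output elements
                    branches += 1
                    for _k in range(0, block, simd_width):  # loop 5: dot product
                        branches += 1
                        core_ops += 1
    return branches, core_ops


scalar = count_loop_work(16, 16, simd_width=1)  # (2342, 2048)
simd8 = count_loop_work(16, 16, simd_width=8)   # (550, 256)
```

With 8-way SIMD the core operations drop by 8x, but the branches drop by only about 4x, so branches per core operation roughly double. That is the sense in which small sub-blocks limit what wider SIMD can deliver.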

Access patterns

The addressing sequences are complex and account for a large share of the work, so generating them takes many supporting instructions.

Using general-purpose instruction sets to generate multiple addressing sequences is not very efficient.
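As an illustration of why such sequences are costly, the sketch below generates the addresses of one sub-block inside a row-major image. The layout, the function name, and the parameters are assumptions chosen for the example; the point is that every address needs several multiplies and adds when computed with general-purpose instructions.

```python
def block_addresses(base, row_stride, block_row, block_col, block=8, elem_size=1):
    """Byte addresses of one block x block sub-block in a row-major image.

    row_stride is the image width in elements (an assumed layout).
    """
    addresses = []
    for i in range(block):
        for j in range(block):
            row = block_row * block + i          # one multiply, one add
            col = block_col * block + j          # one multiply, one add
            offset = row * row_stride + col      # one multiply, one add
            addresses.append(base + offset * elem_size)
    return addresses


# A 2x2 block at block position (0, 1) of a 16-element-wide image.
example = block_addresses(0, 16, 0, 1, block=2)  # [2, 3, 18, 19]
```

The addresses jump by the row stride at the end of each block row, so a single linear pointer increment is not enough; a media kernel typically tracks several such sequences at once.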

The overhead instructions

Address generation: address calculation

Address transformation: data movement, data reorganization

Loads and stores: memory

Branches: control transfer, for-loop
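The four categories can be made concrete by tallying them in a small kernel. The sketch below is a toy dot product over signed bytes, written one step at a time. Which step falls into which category is an assumption made for illustration, and the resulting counts are not measurements from the paper.

```python
from collections import Counter


def to_signed8(byte):
    """Reinterpret an unsigned byte (0..255) as a signed 8-bit value."""
    return byte - 256 if byte > 127 else byte


def traced_dot_product(a, b):
    """Dot product of two byte sequences, tallying each step by category."""
    tally = Counter()
    acc = 0
    for i in range(len(a)):
        tally["branch"] += 1                   # loop back-edge
        addr_a = i                             # address for stream a
        addr_b = i                             # address for stream b
        tally["address generation"] += 2
        raw_a = a[addr_a]
        raw_b = b[addr_b]
        tally["load/store"] += 2
        x = to_signed8(raw_a)                  # stand-in for an unpack step
        y = to_signed8(raw_b)
        tally["address transformation"] += 2
        acc += x * y                           # the only core computation
        tally["core"] += 1
    tally["load/store"] += 1                   # final store of the result
    return acc, tally


def overhead_fraction(tally):
    """Share of tallied steps that are not core computation."""
    total = sum(tally.values())
    return (total - tally["core"]) / total


result, tally = traced_dot_product([1, 2, 255, 4], [1, 1, 1, 1])
```

Here `result` is 6 (the byte 255 is read as -1), and only 4 of the 33 tallied steps are core computation.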

Architecture

Instruction Structure

Breeze Instruction Mapping of 1D-DCT

Full Map

five branches

three loads and one store

four address value generations (one on each stream, with each address generation representing multiple RISC instructions)

one SIMD operation (2-way to 16-way parallelism, depending on the data element size)

one accumulation of the SIMD result and one SIMD reduction operation

four SIMD data reorganization (pack/unpack, permute, etc.) operations

shifting and saturation of SIMD results
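One way to picture folding the branches and address generation into a single instruction is as a loop-nest descriptor: loop bounds plus a base and per-loop strides for each stream, from which every address follows without further instructions. The sketch below is a software model under that assumption. `LoopNestDescriptor`, its fields, and the limits are hypothetical names for illustration, not the actual Breeze instruction format.

```python
from dataclasses import dataclass
from itertools import product

MAX_LOOPS = 5     # "five branches" on the slide
MAX_STREAMS = 4   # three load streams plus one store stream


def stream_addresses(base, strides, bounds):
    """Yield one address per innermost iteration of the loop nest."""
    for index in product(*(range(n) for n in bounds)):
        yield base + sum(i * s for i, s in zip(index, strides))


@dataclass
class LoopNestDescriptor:
    bounds: tuple    # iteration count of each loop, outermost first
    streams: dict    # stream name -> (base address, per-loop strides)

    def __post_init__(self):
        if len(self.bounds) > MAX_LOOPS:
            raise ValueError("too many loop levels")
        if len(self.streams) > MAX_STREAMS:
            raise ValueError("too many streams")
        for name, (_base, strides) in self.streams.items():
            if len(strides) != len(self.bounds):
                raise ValueError(f"stream {name!r} needs one stride per loop")

    def addresses(self):
        """Full address sequence of every stream."""
        return {
            name: list(stream_addresses(base, strides, self.bounds))
            for name, (base, strides) in self.streams.items()
        }


desc = LoopNestDescriptor(
    bounds=(2, 2),
    streams={
        "in": (0, (16, 1)),       # walks a 2x2 tile of a 16-wide image
        "coeff": (100, (0, 1)),   # reuses the same two coefficients per row
        "out": (200, (2, 1)),     # writes results contiguously
    },
)
seqs = desc.addresses()
```

A stride of 0 on an outer loop, as in the `coeff` stream, expresses reuse of the same data on every outer iteration, which otherwise costs explicit pointer resets in software.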

Performance Evaluation

Benchmarks: cfa, dct, motest, scale; G711, decrypt; Aud, jpeg, ijpeg

Any improvement?

Why not higher efficiency in cfa?

Memory latency! Solution?

Prefetch!
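A toy cycle model shows what prefetching can and cannot hide. It assumes each block costs a fixed memory latency and a fixed compute time, and that prefetching fetches exactly one block ahead while the current one is processed. The function names and all cycle counts are hypothetical, not figures from the paper.

```python
def cycles_without_prefetch(blocks, mem_latency, compute):
    """Every block waits for its data, then computes."""
    return blocks * (mem_latency + compute)


def cycles_with_prefetch(blocks, mem_latency, compute):
    """Fetch of block i+1 overlaps with compute on block i."""
    if blocks == 0:
        return 0
    # First fetch is exposed, the last compute has nothing left to overlap,
    # and each step in between takes the longer of fetch and compute.
    return mem_latency + (blocks - 1) * max(mem_latency, compute) + compute


baseline = cycles_without_prefetch(4, mem_latency=10, compute=5)  # 60
overlapped = cycles_with_prefetch(4, mem_latency=10, compute=5)   # 45
```

In this model the gain is bounded by the smaller of the two costs per block: when memory latency exceeds compute time, the kernel stays memory-bound even with prefetching.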

Evaluation

Advantages: Eliminates or reduces overhead instructions. Much better than a normal SIMD extension. Costs 0.3% of processor area and less than 1% of total power consumption.

Drawbacks: Complicated instructions. Who will design a compiler for this?
