Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements

Gautam Chakrabarti and Fred Chow

PathScale, LLC.

Open64 Workshop 2008

Outline

Motivation

Types of structure layout optimizations

Criteria for structure layout optimizations

Implementation details

Performance results

Future work

Conclusion

Motivation

Poor data locality in many applications

High data cache miss rates

Growing gap between processor and memory speeds

Our Approach

Change layout of data structures

Requires whole-program optimization

Use Inter-Procedural Analysis and Optimizations (IPA)

Our Aim

Make applications more cache-friendly

IPA

Summarization

Analysis

Optimization

Types of Structure Layout Optimizations

Structure splitting

    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long long l;
        char c;
        struct struct_A * next;
    };

Structure peeling

    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long long l;
        char c;
    };

Structure Splitting Example

Original structure:

    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long long l;
        char c;
        struct struct_A * next;
    };

After splitting:

    struct new_struct_A {
        double d1;
        int i;
        long long l;
        struct new_struct_A * next;
        struct cold_sub_struct_A * p;
    };

    struct cold_sub_struct_A {
        double d2;
        float f;
        char c;
    };
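The effect of splitting on program code can be sketched as follows. This is a minimal illustration that reuses the type names from the slide; the helper functions `sum_hot` and `read_cold_f` are hypothetical and are not part of Open64.

```c
#include <assert.h>
#include <stddef.h>

/* Cold fields, moved out of line (layout copied from the slide). */
struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};

/* Hot fields, the list link, and a pointer to the cold part. */
struct new_struct_A {
    double d1;
    int i;
    long long l;
    struct new_struct_A *next;
    struct cold_sub_struct_A *p;
};

/* Hot loop: walks the list and touches only hot fields. */
long long sum_hot(const struct new_struct_A *head)
{
    long long total = 0;
    for (const struct new_struct_A *n = head; n != NULL; n = n->next)
        total += n->i + n->l;
    return total;
}

/* A reference that was node->f in the original program becomes node->p->f. */
float read_cold_f(const struct new_struct_A *node)
{
    assert(node != NULL && node->p != NULL);
    return node->p->f;
}
```

Cold accesses pay for one extra pointer dereference, which is presumably why only rarely used fields are moved behind `p`.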

Structure Peeling Example

Original structure:

    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long long l;
        char c;
    };

After peeling:

    struct new_struct_A {
        double d1;
        int i;
        long long l;
    };

    struct cold_sub_struct_A {
        double d2;
        float f;
        char c;
    };
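A rough sketch of what peeling means for an array of structures, assuming the peeled types are stored as two parallel arrays indexed the same way. The functions `sum_l` and `elems_per_line` are illustrative helpers, not compiler code.

```c
#include <assert.h>
#include <stddef.h>

/* Original layout, kept here only for the size comparison. */
struct struct_A { double d1; double d2; int i; float f; long long l; char c; };

/* Peeled layout from the slide: hot and cold fields live in separate arrays. */
struct new_struct_A { double d1; int i; long long l; };
struct cold_sub_struct_A { double d2; float f; char c; };

/* An original access a[k].l becomes hot[k].l; a[k].f would become cold[k].f. */
long long sum_l(const struct new_struct_A *hot, size_t n)
{
    long long total = 0;
    assert(hot != NULL || n == 0);
    for (size_t k = 0; k < n; k++)
        total += hot[k].l;
    return total;
}

/* Number of whole elements that fit in one cache line of line_bytes bytes. */
size_t elems_per_line(size_t elem_size, size_t line_bytes)
{
    assert(elem_size > 0);
    return line_bytes / elem_size;
}
```

On a typical x86-64 ABI, `sizeof(struct struct_A)` is 40 bytes and `sizeof(struct new_struct_A)` is 24 bytes, so a 64-byte cache line holds two whole hot elements instead of one. Exact sizes depend on the target ABI.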

Criteria for structure layout optimizations

Legality Analysis

    Type cast
    Address of a field is taken
    Escaped types
    Parameter types
    Full visibility to IPA
    Alignment restrictions

Profitability Analysis

    Hotness
    Affinity
    Field accesses at loop level
    Size
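The slide only names the legality criteria. As one plausible illustration of the first two, the sketch below shows code that depends on the original layout of `struct_A`. The function names are hypothetical, and how Open64 actually detects or treats such patterns is not stated in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct struct_A { double d1; double d2; int i; float f; long long l; char c; };

/* Type cast: the struct is viewed as raw bytes, so the code relies on
   the exact offset of field i within struct_A. */
int read_i_via_cast(const struct struct_A *a)
{
    int value;
    assert(a != NULL);
    memcpy(&value, (const char *)a + offsetof(struct struct_A, i), sizeof value);
    return value;
}

/* Address of a field is taken: the returned pointer can be stored or
   passed on, so not every access to the field is visible as a field access. */
int *address_of_i(struct struct_A *a)
{
    assert(a != NULL);
    return &a->i;
}
```

If fields of `struct_A` were moved into a separate structure, both functions would need to be rewritten consistently, which is the kind of situation a legality check has to rule out or handle.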

Implementation Details

Step 1: Type information summarization (IPL)

Step 2: Symbol table merging (IPA)

Step 3: Legality and profitability analysis (IPA analysis)

Step 4: Transforming the program (IPA optimization)

Implementation Details: Type information summarization

Information summarization in IPL

Framework for computing static profiles using heuristics

New TY flag TY_NO_SPLIT

SUMMARY_TY_INFO

SUMMARY_LOOP

    For each DO_LOOP, WHILE_DO, DO_WHILE
    Bit-vector to track field accesses of up to N structures for each loop
    Considers field accesses immediately inside loop
        These fields are considered affine to each other
    Execution count of statements immediately inside loop
        From statically estimated profiles or from runtime feedback

Implementation Details: IPA Analysis

Inter-procedurally update statically estimated execution count of PUs

Update statically estimated loop frequencies in SUMMARY_LOOP

Consider SUMMARY_LOOP from the hottest P PUs

Determine candidates for structure-layout transformation

Determine new layout of structures

Implementation Details: IPA Analysis Example

Loop   F4   F3   F2   F1   BV
L1     -    22   -    22   0101
L2     -    -    14   -    0010
L3     -    12   -    12   0101
L4     8    8    -    -    1100
L5     -    6    -    6    0101

Group  F4   F3   F2   F1
AG1    -    40   -    40
AG2    -    -    14   -
AG3    8    8    -    -

Li — Loops

Fj — Fields in a struct

AGk — Affinity groups
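The numbers in the example are consistent with merging loops that have identical bit-vectors and summing their counts (22 + 12 + 6 = 40 for bit-vector 0101). A minimal sketch of that grouping, under the assumption that this is how the groups are formed; the types and the function name are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

struct loop_info { unsigned bv; unsigned count; };
struct affinity_group { unsigned bv; unsigned count; };

/* Merge loops with identical bit-vectors, summing their counts.
   groups must have room for n entries; returns the number of groups,
   in order of first appearance. */
size_t build_affinity_groups(const struct loop_info *loops, size_t n,
                             struct affinity_group *groups)
{
    size_t ngroups = 0;
    assert(loops != NULL && groups != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t g = 0;
        while (g < ngroups && groups[g].bv != loops[i].bv)
            g++;
        if (g == ngroups) {            /* first loop with this bit-vector */
            groups[ngroups].bv = loops[i].bv;
            groups[ngroups].count = 0;
            ngroups++;
        }
        groups[g].count += loops[i].count;
    }
    return ngroups;
}
```

Field F3 appears in both AG1 and AG3 in the example. The slide does not say how such overlaps are resolved when the final layout is chosen, and this sketch does not attempt to resolve them.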

Implementation Details: Transforming the program

New type definitions

Field table update

Field access statements

New symbols

Assignment statements

Example (peel T):

Before:

    struct S {
        // N fields
        struct T * p;
        // M fields
    };

    struct T {
        // AG1 fields
        // AG2 fields
    };

After:

    struct S {
        // N fields
        struct T1 * p1;
        struct T2 * p2;
        // M fields
    };

    struct T1 {
        // AG1 fields
    };

    struct T2 {
        // AG2 fields
    };
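How field access statements change once T is peeled can be sketched as below. The field names `hot_count` and `cold_ratio` stand in for the AG1 and AG2 fields and are invented for this example, as are the accessor functions.

```c
#include <assert.h>
#include <stddef.h>

struct T1 { int hot_count; };      /* stands in for the AG1 fields */
struct T2 { double cold_ratio; };  /* stands in for the AG2 fields */

struct S {
    struct T1 *p1;  /* new symbol replacing p for AG1 fields */
    struct T2 *p2;  /* new symbol replacing p for AG2 fields */
};

/* Was: s->p[k].hot_count */
int get_hot_count(const struct S *s, size_t k)
{
    assert(s != NULL && s->p1 != NULL);
    return s->p1[k].hot_count;
}

/* Was: s->p[k].cold_ratio = v */
void set_cold_ratio(struct S *s, size_t k, double v)
{
    assert(s != NULL && s->p2 != NULL);
    s->p2[k].cold_ratio = v;
}
```

Each access is redirected to whichever new pointer owns the field, and the element index is unchanged.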

Implementation Details: Transforming the program (continued)

Function calls to memory management routines

Example:

    p = (T *) malloc (N * sizeof (T));
    if (p == NULL)
        exit (1);

Detect memory management routine calls involving transformed type T

Replicate call, assignment statements

Update size of memory being allocated

Handle comparisons involving pointer p
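For a type T peeled into T1 and T2, the transformed allocation might look like the sketch below. It assumes that the NULL comparison on `p` is rewritten to cover both new pointers, which the slide does not spell out. It also returns an error code instead of calling `exit (1)`, purely to keep the example easy to test. The field names and both functions are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct T1 { int a; };     /* stands in for the AG1 fields */
struct T2 { double b; };  /* stands in for the AG2 fields */

/* Was: p = (T *) malloc (N * sizeof (T)); if (p == NULL) exit (1);
   Returns 0 on success, -1 if either allocation fails. */
int alloc_peeled(size_t n, struct T1 **p1, struct T2 **p2)
{
    assert(p1 != NULL && p2 != NULL);
    *p1 = malloc(n * sizeof **p1);  /* replicated call, size updated to T1 */
    *p2 = malloc(n * sizeof **p2);  /* replicated call, size updated to T2 */
    if (*p1 == NULL || *p2 == NULL) {  /* comparison on p now checks both */
        free(*p1);
        free(*p2);
        *p1 = NULL;
        *p2 = NULL;
        return -1;
    }
    return 0;
}

/* A call free (p) in the original would likewise be replicated. */
void free_peeled(struct T1 *p1, struct T2 *p2)
{
    free(p1);
    free(p2);
}
```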

Performance Results

Compilation options: -Ofast at 32-bit ABI

Speedup due to structure layout optimizations

Benchmark       AMD Opteron™    AMD Barcelona   Intel® EM64T    Intel® Core™    SiCortex MIPS®  Geometric
                (2.8GHz, 4GB,   (2.0GHz, 8GB,   (3.4GHz, 4GB,   (3.0GHz, 4GB,   (500MHz, 4GB,   Mean
                1MB)            512KB)          1MB)            4MB)            256KB)
179.art         134%            66%             56%             47%             41%             62.5%
181.mcf         24%             23%             23%             31%             13%             22.0%
462.libquantum  32%             17%             40%             72%             62%             39.6%
Geometric Mean  46.9%           29.6%           37.2%           47.2%           32.1%           37.9%
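The reported means match a geometric mean taken over the percentage figures themselves rather than over speedup ratios. As a worked check for the 179.art row:

```latex
\[
\mathrm{GM}_{\text{179.art}}
  = \left(134 \cdot 66 \cdot 56 \cdot 47 \cdot 41\right)^{1/5}
  \approx 62.5
\]
```

Averaging the corresponding ratios (2.34, 1.66, 1.56, 1.47, 1.41) geometrically would instead give a speedup of about 66%, so the two conventions should not be mixed when comparing against other results.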

Performance Results (continued)

Compilation options: -Ofast at 64-bit ABI

Speedup due to structure layout optimizations

Benchmark       AMD Opteron™    AMD Barcelona   Intel® EM64T    Intel® Core™    SiCortex MIPS®  Geometric
                (2.8GHz, 4GB,   (2.0GHz, 8GB,   (3.4GHz, 4GB,   (3.0GHz, 4GB,   (500MHz, 4GB,   Mean
                1MB)            512KB)          1MB)            4MB)            256KB)
179.art         169%            66%             53%             60%             45%             69.3%
181.mcf         25%             35%             12%             30%             7%              18.6%
462.libquantum  82%             51%             75%             70%             69%             68.6%
Geometric Mean  70.2%           49.0%           36.3%           50.1%           27.9%           44.6%

Performance Results (continued)

Compilation options: -Ofast at 64-bit ABI

Multiple copies of 462.libquantum running on multi-core chip

Platform: Quad-core AMD Barcelona (2.0 GHz, 8GB, 512KB, 2MB)

3rd level cache shared among 4 cores

Speedup from structure layout optimizations

Benchmark       1 copy   2 copies   4 copies
462.libquantum  51%      69%        123%

Future Work

Tune static profile estimation

Fewer restrictions

Integrate with field-reordering

Conclusion

A framework for performing structure layout transformations is now available in the Open64 compiler.

The superior infrastructure in the Open64 compiler helped us implement the optimizations cleanly and with relatively little effort.

Substantial speedups are possible on some of the CPU2000 and CPU2006 SPEC benchmarks.

Structure layout optimization is a required feature for a compiler to remain competitive.