
Page 1: Floating Point Analysis Using Dyninst

Floating Point Analysis Using Dyninst

Mike Lam, University of Maryland, College Park

Jeff Hollingsworth, Advisor

Page 2: Background

• Floating point represents real numbers as ±significand × 2^exponent
  o Sign bit
  o Exponent
  o Significand ("mantissa" or "fraction")

• Finite precision
  o Single-precision: 24 bits (~7 decimal digits)
  o Double-precision: 53 bits (~16 decimal digits)

Bit layouts:
  IEEE single (32 bits): sign bit, exponent (8 bits), significand (23 bits)
  IEEE double (64 bits): sign bit, exponent (11 bits), significand (52 bits)
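As a concrete illustration (not part of the original slides), the following C++ snippet extracts the sign, exponent, and significand fields of an IEEE double using the layout above:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        double d = -3.25;

        // Reinterpret the double's bytes as a 64-bit integer.
        uint64_t bits;
        std::memcpy(&bits, &d, sizeof(bits));

        uint64_t sign     = bits >> 63;                  // 1 bit
        uint64_t exponent = (bits >> 52) & 0x7FF;        // 11 bits (biased by 1023)
        uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;   // 52 bits (implicit leading 1)

        std::printf("sign=%llu exponent=%llu fraction=0x%llx\n",
                    (unsigned long long)sign,
                    (unsigned long long)exponent,
                    (unsigned long long)fraction);
        return 0;
    }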

Page 3: Motivation

• Finite precision causes round-off error
  o Compromises certain calculations
  o Hard to detect and diagnose

• Increasingly important as HPC scales
  o Computation on streaming processors is faster in single precision
  o Data movement in double precision is a bottleneck
  o Need to balance speed (singles) and accuracy (doubles)

Page 4: Our Goal

Automated analysis techniques to inform developers about floating-point behavior and make recommendations regarding the use of floating-point arithmetic.

Page 5: Framework

CRAFT: Configurable Runtime Analysis for Floating-point Tuning

• Static binary instrumentation
  o Read configuration settings
  o Replace floating-point instructions with new code
  o Rewrite modified binary

• Dynamic analysis
  o Run modified program on representative data set
  o Produce results and recommendations
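As a rough illustration of the static rewriting step (this is not the CRAFT source; the binary path and function name are placeholders, and the actual floating-point instruction replacement is omitted), a minimal Dyninst-based rewriter might look like this:

    // Minimal sketch of Dyninst static binary rewriting with the BPatch interface.
    #include "BPatch.h"
    #include "BPatch_binaryEdit.h"
    #include "BPatch_function.h"
    #include "BPatch_image.h"
    #include <vector>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        BPatch bpatch;

        // Open the target binary for rewriting (no dependencies).
        BPatch_binaryEdit *app = bpatch.openBinary(argv[1], /*openDependencies=*/false);
        if (!app) return 1;

        // Look up candidate functions ("main" is just an example).
        std::vector<BPatch_function *> funcs;
        app->getImage()->findFunction("main", funcs);

        // ... replace floating-point instructions according to the
        //     configuration settings (omitted) ...

        // Write out the modified binary.
        app->writeFile("a.out.rewritten");
        return 0;
    }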

Page 6: Previous Work

• Cancellation detection
  o Reports loss of precision due to subtraction
  o Paper appeared in WHIST '11

• Range tracking
  o Reports min/max values

• Replacement
  o Implements mixed-precision configurations
  o Paper to appear in ICS '13

Page 7: Mixed Precision

Mixed-precision linear solver algorithm:

   1: LU ← PA
   2: solve Ly = Pb
   3: solve Ux0 = y
   4: for k = 1, 2, ... do
   5:   rk ← b − A·xk−1
   6:   solve Ly = P·rk
   7:   solve U·zk = y
   8:   xk ← xk−1 + zk
   9:   check for convergence
  10: end for

Red text in the original slide indicates steps performed in double precision (all other steps are single precision).

• Use double precision where necessary
• Use single precision everywhere else
• Can be difficult to implement
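To make the idea concrete, here is a small self-contained C++ sketch of the scheme above (not from the original slides; pivoting is omitted for brevity): the factorization/solve steps run in single precision, while residuals, updates, and the convergence check stay in double precision.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Solve A*z = r in SINGLE precision by naive Gaussian elimination.
    static void solve_single(std::vector<float> A, std::vector<float> r,
                             std::vector<double>& z, int n) {
        for (int k = 0; k < n; ++k)
            for (int i = k + 1; i < n; ++i) {
                float m = A[i*n + k] / A[k*n + k];
                for (int j = k; j < n; ++j) A[i*n + j] -= m * A[k*n + j];
                r[i] -= m * r[k];
            }
        for (int i = n - 1; i >= 0; --i) {
            float s = r[i];
            for (int j = i + 1; j < n; ++j) s -= A[i*n + j] * (float)z[j];
            z[i] = s / A[i*n + i];
        }
    }

    int main() {
        const int n = 2;
        std::vector<double> A = {4.0, 1.0, 1.0, 3.0}, b = {1.0, 2.0}, x(n, 0.0);
        std::vector<float> As(A.begin(), A.end());   // single-precision copy of A

        std::vector<double> z(n);
        for (int k = 0; k < 10; ++k) {
            // Residual in DOUBLE precision: r = b - A*x
            std::vector<double> r(n);
            for (int i = 0; i < n; ++i) {
                r[i] = b[i];
                for (int j = 0; j < n; ++j) r[i] -= A[i*n + j] * x[j];
            }
            // Correction solve in SINGLE precision.
            std::vector<float> rs(r.begin(), r.end());
            solve_single(As, rs, z, n);
            // Update and convergence check in DOUBLE precision.
            double norm = 0.0;
            for (int i = 0; i < n; ++i) { x[i] += z[i]; norm += z[i]*z[i]; }
            if (std::sqrt(norm) < 1e-14) break;
        }
        std::printf("x = (%g, %g)\n", x[0], x[1]);
        return 0;
    }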

Page 8: Configuration

(Figure slide; no extractable text.)

Page 9: Implementation

• In-place replacement
  o Narrowed focus: doubles → singles
  o In-place downcast conversion
  o Flag in the high bits to indicate replacement

Downcast conversion: a 64-bit double is converted in place into a "replaced double" whose high 32 bits hold the flag 0x7FF4DEAD (a non-signalling NaN pattern) and whose low 32 bits hold the single-precision value.
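A rough C++ sketch of this in-place scheme (illustrative only, not the actual CRAFT code; the flag constant follows the 0x7FF4DEAD pattern shown above): the single-precision value is stored in the low 32 bits of the 64-bit slot, and the high 32 bits carry the NaN flag so instrumentation can detect replaced values.

    #include <cstdint>
    #include <cstring>

    // High-word flag marking a "replaced" double (a non-signalling NaN pattern).
    static const uint32_t REPLACED_FLAG = 0x7FF4DEAD;

    // Store a single-precision value into a 64-bit double slot, tagging it.
    void store_replaced(double* slot, float value) {
        uint32_t fbits;
        std::memcpy(&fbits, &value, sizeof(fbits));
        uint64_t bits = ((uint64_t)REPLACED_FLAG << 32) | fbits;  // single in low 32 bits
        std::memcpy(slot, &bits, sizeof(bits));
    }

    // Has this 64-bit slot been replaced by a flagged single?
    bool is_replaced(const double* slot) {
        uint64_t bits;
        std::memcpy(&bits, slot, sizeof(bits));
        return (uint32_t)(bits >> 32) == REPLACED_FLAG;
    }

    // Read the slot back: either extract the tagged single or use the real double.
    double load_value(const double* slot) {
        if (is_replaced(slot)) {
            uint64_t bits;
            std::memcpy(&bits, slot, sizeof(bits));
            uint32_t fbits = (uint32_t)bits;
            float f;
            std::memcpy(&f, &fbits, sizeof(f));
            return (double)f;                // upcast the stored single
        }
        return *slot;                        // untouched double
    }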

Page 10: Example

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd  0x601e38(%rax,%rbx,8), %xmm0
  2  mulsd  -0x78(%rsp), %xmm0
  3  addsd  -0x4f02(%rip), %xmm0
  4  movsd  %xmm0, 0x601e38(%rax,%rbx,8)

Page 11: Example (with replacement)

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd  0x601e38(%rax,%rbx,8), %xmm0
         check/replace -0x78(%rsp) and %xmm0
  2  mulss  -0x78(%rsp), %xmm0
         check/replace -0x4f02(%rip) and %xmm0
  3  addss  -0x20dd43(%rip), %xmm0
  4  movsd  %xmm0, 0x601e38(%rax,%rbx,8)

Page 12: Block Editing (PatchAPI)

Diagram: double → single conversion of the original instruction in a block. The block splits around the replaced instruction, with inserted snippets for initialization, check/replace, and cleanup.

Page 13: Automated Search

• Manual mixed-precision analysis
  o Hard to use without intuition regarding potential replacements

• Automatic mixed-precision analysis
  o Try lots of configurations (empirical auto-tuning)
  o Test with user-defined verification routine and data set
  o Exploit program control structure: replace larger structures (modules, functions) first
  o If coarse-grained replacements fail, try finer-grained subcomponent replacements (see the sketch below)
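The following C++ sketch illustrates the hierarchical search strategy described above (illustrative only; the component type and the verification hook are assumptions, not the actual CRAFT interfaces):

    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical representation of a replaceable program component
    // (module, function, basic block, or individual instruction).
    struct Component {
        std::string name;
        std::vector<Component> children;   // finer-grained subcomponents
    };

    // Recursively search the program structure: try the coarse-grained
    // replacement first; if verification fails, descend into subcomponents.
    // `try_replacement` is an assumed hook that builds a configuration with the
    // given component in single precision, runs the instrumented program on the
    // representative data set, and returns whether the user-defined
    // verification routine accepted the result.
    void search(const Component& c,
                const std::function<bool(const Component&)>& try_replacement,
                std::vector<std::string>& accepted) {
        if (try_replacement(c)) {
            accepted.push_back(c.name);          // whole component can be single
            return;
        }
        for (const Component& child : c.children) {
            search(child, try_replacement, accepted);   // finer-grained attempts
        }
    }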

Page 14: System Overview

(Figure slide; no extractable text.)

Page 15: NAS Results

Benchmark      Candidates   Configurations   Instructions Replaced
(name.CLASS)                Tested           % Static   % Dynamic
bt.W             6,647         3,854           76.2        85.7
bt.A             6,682         3,832           75.9        81.6
cg.W               940           270           93.7         6.4
cg.A               934           229           94.7         5.3
ep.W               397           112           93.7        30.7
ep.A               397           113           93.1        23.9
ft.W               422            72           84.4         0.3
ft.A               422            73           93.6         0.2
lu.W             5,957         3,769           73.7        65.5
lu.A             5,929         2,814           80.4        69.4
mg.W             1,351           458           84.4        28.0
mg.A             1,351           456           84.1        24.4
sp.W             4,772         5,729           36.9        45.8
sp.A             4,821         5,044           51.9        43.0

Page 16: AMGmk Results

• Algebraic MultiGrid microkernel
  o Multigrid method is highly adaptive
  o Good candidate for replacement

• Automatic search
  o Complete conversion (100% replacement)

• Manually-rewritten version
  o Speedup: 175 sec to 95 sec (1.8X)
  o Conventional x86_64 hardware

Page 17: SuperLU Results

• Package for LU decomposition and linear solves
  o Reports final error residual
  o Both single- and double-precision versions

• Verified manual conversion via automatic search
  o Used error from provided single-precision version as threshold
  o Final config matched single-precision profile (99.9% replacement)

Threshold    Instructions Replaced      Final Error
             % Static    % Dynamic
1.0e-03        99.1         99.9         1.59e-04
1.0e-04        94.1         87.3         4.42e-05
7.5e-05        91.3         52.5         4.40e-05
5.0e-05        87.9         45.2         3.00e-05
2.5e-05        80.3         26.6         1.69e-05
1.0e-05        75.4          1.6         7.15e-07
1.0e-06        72.6          1.6         4.7e-07

Page 18: Retrospective

• Twofold original motivation
  o Faster computation (raw FLOPs)
  o Decreased storage footprint and memory bandwidth
  o Domains vary in sensitivity to these parameters

• Computation-centric analysis
  o Less insight for memory-constrained domains
  o Sometimes difficult to translate instruction-level recommendations to source-code-level transformations

• Data-centric analysis
  o Focus on data motion, which is closer to source-code-level structures

Page 19: Current Project

• Memory-based replacement
  o Perform all computation in double precision
  o Save storage space by storing single-precision values in some cases

• Implementation (sketched below)
  o Register-based computation remains double precision
  o Replace movement instructions (movsd)
  o Memory to register: check and upcast
  o Register to memory: downcast if configured
  o Searching for replaceable writes instead of computes
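A small C++ sketch of the memory-based scheme (illustrative only, not the actual instrumentation; it reuses the 0x7FF4DEAD flag convention from Page 9): loads check the flag and upcast a stored single back to double, while stores downcast to single only when the configuration marks that write as replaceable.

    #include <cstdint>
    #include <cstring>

    static const uint32_t REPLACED_FLAG = 0x7FF4DEAD;  // non-signalling NaN pattern

    // Memory -> register: check the slot and upcast if it holds a flagged single.
    // Stands in for the check inserted around a load (movsd mem, reg).
    double checked_load(const double* slot) {
        uint64_t bits;
        std::memcpy(&bits, slot, sizeof(bits));
        if ((uint32_t)(bits >> 32) == REPLACED_FLAG) {
            uint32_t low = (uint32_t)bits;
            float f;
            std::memcpy(&f, &low, sizeof(f));
            return (double)f;              // upcast stored single to double
        }
        return *slot;                      // ordinary double
    }

    // Register -> memory: downcast to single if this write is configured for
    // replacement; otherwise store the full double.
    // Stands in for the check inserted around a store (movsd reg, mem).
    void checked_store(double* slot, double value, bool replace_this_write) {
        if (replace_this_write) {
            float f = (float)value;        // downcast
            uint32_t low;
            std::memcpy(&low, &f, sizeof(low));
            uint64_t bits = ((uint64_t)REPLACED_FLAG << 32) | low;
            std::memcpy(slot, &bits, sizeof(bits));
        } else {
            *slot = value;
        }
    }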

Page 20: Preliminary Results

Benchmark      Candidates    Writes Replaced
(name.CLASS)                 % Static   % Dynamic
cg.W               284         95.4        77.5
ep.W               226         96.0        28.4
ft.W               452         94.2        45.0
lu.W             1,782         68.3        81.3
mg.W               558         96.2        86.4
sp.W             1,607         80.7        84.7

All benchmarks were single-core versions compiled by the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48 GB of RAM running 64-bit Linux.

Pages 21-27: (No extractable slide content.)

Page 28: Future Work

• Case studies
• Search convergence study

Page 29: Conclusion

Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating-point code, and memory-based replacement provides actionable results.

Page 30: Thank you!

Thank you!

sf.net/p/crafthpc