ARITH’26 | BOCCO Andrea | 11 June 2019 DYNAMIC PRECISION NUMERICS USING A VARIABLE-PRECISION UNUM TYPE I HW COPROCESSOR



INTRODUCTION: STATE OF THE ART

➢ Variable Precision (VP) computing has been investigated as a way to improve the convergence of algorithms, both in:

▪ Software (SW): GMP[2] and MPFR[3]
  ▪ These libraries are slow and may not meet the requirements of high-speed applications

▪ Hardware (HW):
  ▪ Kulisch[4]: large fixed-point accumulator
  ▪ Schulte and Swartzlander[5]: mantissas divided into multiple words

➢ None of these works shows how to store VP Floating Point (FP) numbers efficiently in main memory

▪ They support the IEEE 754[1] FP format in main memory

[1] IEEE. 2008. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008. https://doi.org/10.1109/IEEESTD.2008.4610935

[2] Torbjörn Granlund and the GMP development team. 2012. GNU MP: The GNU Multiple Precision Arithmetic Library. https://gmplib.org/

[3] Laurent Fousse, et al. 2007. MPFR: A Multiple-Precision Binary Floating-Point Library with Correct Rounding. https://doi.org/10.1145/1236463.1236468

[4] Ulrich Kulisch. 2013. Computer Arithmetic and Validity: Theory, Implementation, and Applications.

[5] M. J. Schulte and E. E. Swartzlander. 2000. A family of variable-precision interval arithmetic processors. https://doi.org/10.1109/12.859535


INTRODUCTION: MY WORK

Our previous work[6], a VP FP hardware accelerator:

• Supports the UNUM type I format in main memory
• Computes internally in another (hardware-friendly) FP format
• Supports Interval Arithmetic (IA)

This work:

▪ Refines the UNUM type I FP format.
▪ Proposes a new VP FP architecture.
▪ Proposes a new programming model.
▪ Benchmarks our system.

[6] A. Bocco, Y. Durand, F. de Dinechin. 2019. SMURF: Scalar Multiple-precision UNUM RISC-V Floating-point Accelerator for Scientific Computing.

[Figure: RISC-V Rocket Chip block diagram. Labels: Rocket tile, UNUM co-proc, RoCC, FPU, LSU (two instances), Scratchpad, L1 $ (two instances), RAM (two instances); callouts 1 to 5.]


OUTLINE

• Choice of the memory format: the UNUM type I

• Refinements on the UNUM type I FP format

• The adopted VP FP Architecture

• The programming model

• System benchmark: gauss elimination solver

• Conclusions


CHOICE OF THE MEMORY FORMAT: THE UNUM TYPE I

We decided to use the UNUM type I FP format in main memory.

• It is a self-descriptive FP format with 6 sub-fields, 3 more than conventional IEEE 754 FP numbers
• WHY?
  • UNUM is a VP FP format
  • It self-encodes the exponent and fraction field lengths

However, UNUM type I has some peculiarities that need to be fixed:

• How to organize UNUM arrays in main memory
• How to organize the UNUM fields in memory

UNUM type I fields, in order:

Field   Meaning         Width
s       sign            1 bit
e       exponent        es bits
f       fraction        fs bits
u       ubit            1 bit
es-1    exponent size   fixed by the UNUM environment
fs-1    fraction size   fixed by the UNUM environment
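The self-descriptive layout above can be illustrated with a small decoder. This is a minimal sketch and not the coprocessor's logic: it assumes the field order s, e, f, u, es-1, fs-1 from MSB to LSB, the usual UNUM type I bias of 2^(es-1) - 1, and hypothetical parameter names `ess` and `fss` for the widths of the two size fields. Infinities and NaNs are not handled, and the ubit is only reported, not turned into an interval.

```python
from fractions import Fraction


def decode_unum(bits: int, ess: int, fss: int) -> dict:
    """Split a UNUM type I bit pattern into its six fields and compute its value.

    Assumed layout, MSB to LSB: s | e | f | u | es-1 | fs-1.
    ess and fss are the widths (in bits) of the es-1 and fs-1 fields.
    """
    # The size fields sit at the LSB end and store "size minus one".
    fs = (bits & ((1 << fss) - 1)) + 1
    bits >>= fss
    es = (bits & ((1 << ess) - 1)) + 1
    bits >>= ess
    u = bits & 1
    bits >>= 1
    # Now that es and fs are known, the variable-width fields can be cut out.
    f = bits & ((1 << fs) - 1)
    bits >>= fs
    e = bits & ((1 << es) - 1)
    bits >>= es
    s = bits & 1

    bias = (1 << (es - 1)) - 1
    if e == 0:
        # Subnormal: no hidden bit, exponent fixed at 1 - bias.
        magnitude = Fraction(f, 1 << fs) * Fraction(2) ** (1 - bias)
    else:
        magnitude = (1 + Fraction(f, 1 << fs)) * Fraction(2) ** (e - bias)
    value = -magnitude if s else magnitude
    return {"s": s, "e": e, "f": f, "u": u, "es": es, "fs": fs, "value": value}


if __name__ == "__main__":
    # s=0, e=01, f=1, u=0, es-1=01, fs-1=00 in a (2,2) environment
    print(decode_unum(0b001100100, ess=2, fss=2))
```

The decoder has to read the two size fields before it can locate e and f, which is why the position of the descriptor fields matters for numbers that span several memory words.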


REFINEMENTS ON THE UNUM TYPE I FP FORMAT:

- UNUM FIELD ORGANIZATION

For a UNUM/ubound that spans multiple addresses in main memory, it is important to have the descriptor fields in the lower addresses.

➢ We have reorganized the order of the fields for UNUM and ubound (LSB first):

UNUM:    s | u | es-1 | fs-1 | e | f
ubound:  s u es-1 fs-1 (left) | s u es-1 fs-1 (right) | e (left) | e (right) | f (left) | f (right)

[Figure: memory maps from address 00--00 to FF--FF, p bits wide, showing U1 at address @1' and U2 at address @2'; callouts 1 and 2.]
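With the descriptor in the lowest bits, the first word fetched is enough to know how many more words belong to the number. The sketch below illustrates that idea only. It assumes the LSB-first order s, u, es-1, fs-1 for a scalar UNUM, a word width `p` of 64 bits, and hypothetical names `ess` and `fss` for the widths of the size fields.

```python
def unum_size_from_header(first_word: int, ess: int, fss: int, p: int = 64) -> dict:
    """Read the descriptor from the low bits of the first p-bit word.

    Assumed bit order, LSB first: s, u, es-1 (ess bits), fs-1 (fss bits),
    followed by the e and f fields.
    """
    header_bits = 2 + ess + fss
    if header_bits > p:
        raise ValueError("descriptor does not fit in one word")
    s = first_word & 1
    u = (first_word >> 1) & 1
    es = ((first_word >> 2) & ((1 << ess) - 1)) + 1
    fs = ((first_word >> (2 + ess)) & ((1 << fss) - 1)) + 1
    total_bits = header_bits + es + fs
    words = -(-total_bits // p)  # ceiling division
    return {"s": s, "u": u, "es": es, "fs": fs,
            "total_bits": total_bits, "words": words}


if __name__ == "__main__":
    # Largest number of a (4,8) environment: es-1 = 15, fs-1 = 255, ubit set
    header = (1 << 1) | (15 << 2) | (255 << 6)
    print(unum_size_from_header(header, ess=4, fss=8))
```

With the original field order the size fields would sit at the far end of the number, so a load unit could not know the length before reading past it.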


REFINEMENTS ON THE UNUM TYPE I FP FORMAT:

- UNUM ARRAY ORGANIZATION

Handling a two-element UNUM array in main memory with p-bit parallelism

[Figure: U1 occupies two p-bit words (U1_0, U1_1) and U2 occupies three (U2_0, U2_1, U2_2). Two memory maps (callouts 1 and 2) place U1 at @1' and U2 at @2' or @2''; the result of U3 = U1*U2 needs three words (U3_0, U3_1, U3_2), and one placement is flagged with "!".]

Array support: guarantee an affine addressing scheme.
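The difference between a packed layout and an affine one can be sketched as follows. This is an illustration under assumptions, not the addressing logic of the coprocessor: `slot_bytes` is a hypothetical fixed slot size given to every element, and lengths are counted in p-bit words.

```python
def packed_offsets(lengths_in_words):
    """Word offset of each element when elements are stored back to back.

    Finding element i requires the lengths of all elements before it.
    """
    offsets = []
    position = 0
    for length in lengths_in_words:
        offsets.append(position)
        position += length
    return offsets


def affine_address(base: int, index: int, slot_bytes: int) -> int:
    """Address of element `index` when every element owns a fixed-size slot."""
    return base + index * slot_bytes


if __name__ == "__main__":
    # U1 takes 2 words, U2 takes 3 words, U3 takes 3 words
    print(packed_offsets([2, 3, 3]))
    print(hex(affine_address(0x1000, 3, 16)))
```

In the packed case, writing a result that is longer than the element it replaces shifts every later offset; with fixed slots the address of element i depends only on i.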


THE ADOPTED VP FP ARCHITECTURE

• 1 integer register file (iRF): 32 integer general-purpose registers (GPRs) + pc, in the main processor.
• 1 g-bound register file (gRF): 32 entries, in the co-processor.
• UNUMs/u-bounds are strictly considered as memory formats:
  • Load operations: load UNUMs/u-bounds from main memory and convert them into internal g-bounds.
  • Store operations: convert internal g-bounds (entries of the internal gRF) into u-bounds and store them in main memory.
• The co-processor's internal parallelism is fixed to 64 bits.
• Co-processor status registers:
  • DUE
  • SUE
  • MBB
  • WGP

[Figure: RISC-V Rocket Chip block diagram. Labels: Rocket tile, UNUM co-proc, RoCC, FPU, LSU (two instances), Scratchpad, L1 $ (two instances), RAM (two instances); callouts 1 to 5 and a "NEW!" tag.]


THE MBB: MAXIMUM BYTE BUDGET

The UNUM format is variable length (up to a maximum length).

▪ A compacted array of variable-length elements cannot offer random access to its elements

➢ We define the Maximum Byte Budget (MBB) as the maximum length that a UNUM number can have in main memory

➢ The user can address VP FP numbers by specifying their length with byte granularity.

[Figure: g-bounds g0 to g4 pass through the G2U and BMF blocks and the LSU, producing u0 to u4 and u'0 to u'4; each stored element is bounded by MBB.]


THE BMF: BOUNDED MEMORY FORMAT

Two cases: MBB >= max unum length, and MBB < max unum length.

Descriptor patterns, first row set:

Row   s   u   es-1    fs-1
1a)   0   1   1...1   1...1
2a)   1   1   1...1   1...1
3a)   0   0   1...1   1...1
4a)   1   0   1...1   1...1
5a)   0   1   1...1   1...1
6a)   1   1   1...1   1...1
7a)   0   1   es-1    fs-1
8a)   1   1   es-1    fs-1
9a)   s   u   es-1    fs-1

Descriptor patterns, second row set:

Row   s   u   es-1    fs-1
1b)   0   1   1...1   1...1
2b)   1   1   1...1   1...1
3b)   0   0   1...1   1...1
4b)   1   0   1...1   1...1
5b)   0   1   es-1    fs-1
6b)   1   1   es-1    fs-1
7b)   s   u   es-1    fs-1

[Figure: for each row the slide also drew the e and f fields (all ones, or all ones ending in 10, up to es_max and fs_max bits), the encoded value (0, x, -∞, +∞, the interval ends "(-∞ left" and "+∞) right", sNaN, qNaN), the reduced sizes ess', fss', ess'', fss'', the unused bits, and the total bit length MBB*8.]
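Which of the two cases applies depends only on the environment and on the chosen MBB. The sketch below is a rough check under assumptions: it uses the usual UNUM type I field widths (1 sign bit, 1 ubit, `ess` and `fss` bits for the size fields, at most 2^ess exponent bits and 2^fss fraction bits), and the function names are hypothetical. It does not model how the BMF re-encodes a number that exceeds the budget.

```python
def max_unum_bits(ess: int, fss: int) -> int:
    """Longest UNUM type I encoding in an (ess, fss) environment, in bits."""
    return 2 + ess + fss + (1 << ess) + (1 << fss)


def bmf_case(ess: int, fss: int, mbb_bytes: int):
    """Classify an MBB against the longest UNUM of the environment.

    Returns ("fits", spare_bits) when MBB*8 >= max unum length,
    otherwise ("bounded", missing_bits).
    """
    budget = 8 * mbb_bytes
    longest = max_unum_bits(ess, fss)
    if budget >= longest:
        return ("fits", budget - longest)
    return ("bounded", longest - budget)


if __name__ == "__main__":
    print(max_unum_bits(4, 8))      # longest UNUM of the (4,8) environment
    print(bmf_case(4, 8, 36))       # 36-byte budget
    print(bmf_case(4, 8, 8))        # 8-byte budget
```

In the "fits" case the spare bits stay unused; in the "bounded" case some precision has to be given up so that the stored number respects the budget.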


THE COPROCESSOR PROGRAMMING MODEL

Our hardware is best suited for VP kernels that exploit three different storage types:

• The external (main memory) storage
• The intermediate (L1 cache) storage
• The internal (register-level) storage

Example kernel (iterative solution of A·x = b):

    k = 0
    while convergence not reached do
        for i := 1:n do
            σ = 0
            for j := 1:n do
                if j ≠ i then
                    σ += a_ij · x_j^(k)
                end
            end
            x_i^(k+1) = (1/a_ii) · (b_i - σ)
        end
        k = k + 1
    end

[Figure: Rocket tile diagram (RISC-V core with FPU and LSU, UNUM co-proc on RoCC with LSU and Scratchpad, L1 $, RAM) with callouts 1 to 3, next to A·x = b. Legend: outermost loop, intermediate loop, innermost loop; σ is drawn at the UNUM co-proc.]
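The kernel above is a Jacobi iteration, and it can be tried in software with an adjustable working precision. The sketch below is only an illustration: Python's `decimal` module stands in for variable-precision arithmetic (it is not UNUM and not the coprocessor), and the parameters `digits`, `tol` and `max_iter` are assumptions introduced for the example.

```python
from decimal import Decimal, localcontext


def jacobi(a, b, digits=50, tol=None, max_iter=1000):
    """Jacobi iteration for A·x = b with `digits` decimal digits of precision.

    Returns (x, iterations). Convergence is only expected for suitable
    matrices, for example strictly diagonally dominant ones.
    """
    with localcontext() as ctx:
        ctx.prec = digits
        n = len(b)
        a = [[Decimal(v) for v in row] for row in a]
        b = [Decimal(v) for v in b]
        if tol is None:
            tol = Decimal(10) ** (-(digits - 5))
        else:
            tol = Decimal(tol)
        x = [Decimal(0)] * n
        for k in range(max_iter):
            x_new = []
            for i in range(n):
                # sigma accumulates the off-diagonal contributions of row i
                sigma = Decimal(0)
                for j in range(n):
                    if j != i:
                        sigma += a[i][j] * x[j]
                x_new.append((b[i] - sigma) / a[i][i])
            delta = max(abs(u - v) for u, v in zip(x_new, x))
            x = x_new
            if delta <= tol:
                return x, k + 1
        return x, max_iter


if __name__ == "__main__":
    solution, iterations = jacobi([[4, 1], [2, 5]], [6, 12], digits=50)
    print(solution, iterations)
```

In this sketch the accumulator `sigma` plays the role of the register-level storage, the vectors `x` and `x_new` that of the intermediate storage, and `a` and `b` that of the data kept in main memory.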


SYSTEM BENCHMARK: GAUSS ELIMINATION SOLVER

Our system, benchmarked with a Gauss elimination solver in both UNUM (scalar) and ubound (interval) modes, showed:

• A gain of up to 65 decimal digits over IEEE double precision
• The result precision is constrained by the precision adopted in memory.
• Intervals do not always converge, but they are useful for estimating the computational error (Ax - b).
• A speed-up of 4-10x with respect to the MPFR software library
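The kind of experiment behind these bullets can be sketched in software: solve the system at a chosen precision, then evaluate the residual Ax - b. This is not the benchmark code and it reproduces none of the figures above. It assumes Python's `decimal` module as a stand-in for variable-precision arithmetic, and the function names are hypothetical.

```python
from decimal import Decimal, localcontext


def gauss_solve(a, b, digits):
    """Gaussian elimination with partial pivoting at `digits` decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        n = len(b)
        # Augmented matrix [A | b]
        m = [[Decimal(v) for v in row] + [Decimal(bi)] for row, bi in zip(a, b)]
        for col in range(n):
            pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
            if m[pivot][col] == 0:
                raise ValueError("matrix is singular")
            m[col], m[pivot] = m[pivot], m[col]
            for r in range(col + 1, n):
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
        # Back substitution
        x = [Decimal(0)] * n
        for i in reversed(range(n)):
            s = sum((m[i][j] * x[j] for j in range(i + 1, n)), Decimal(0))
            x[i] = (m[i][n] - s) / m[i][i]
        return x


def residual_norm(a, b, x, digits):
    """Largest component of |Ax - b|, evaluated at `digits` decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        return max(
            abs(sum((Decimal(aij) * xj for aij, xj in zip(row, x)), Decimal(0))
                - Decimal(bi))
            for row, bi in zip(a, b)
        )


if __name__ == "__main__":
    A = [[2, 1], [1, 3]]
    rhs = [3, 5]
    x = gauss_solve(A, rhs, digits=50)
    print(x, residual_norm(A, rhs, x, digits=60))
```

Evaluating the residual at a higher precision than the solve gives an estimate of the error left in the solution.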


CONCLUSIONS

This work proposes a Variable Precision (VP) Floating Point (FP) computing system, based on RISC-V, for high-performance computing servers as an alternative to VP FP software routines.

• It supports the UNUM/ubound format in main memory

• It supports several UNUM environments: from (1,1) to (4,8), up to 256 mantissa bits

• It supports a dedicated internal format in its register file

• 32 intervals; each interval endpoint can have up to 512 mantissa bits

• With the adopted memory format (BMF), it supports VP FP in main memory

• The user can set the memory footprint of data with byte granularity

• With the adopted programming model, it is possible to extend VP FP high-precision variables in main memory.

• The result precision can be significantly improved.

• Its FLOPS performance is better than that of software libraries (MPFR) and stays within the same range as a regular fixed-precision IEEE FPU.

Leti, technology research institute

Commissariat à l’énergie atomique et aux énergies alternatives

Minatec Campus | 17 rue des Martyrs | 38054 Grenoble Cedex | France

www.leti.fr

THANK YOU FOR

YOUR ATTENTION!

Contact: Andrea BOCCO