(Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation

International Supercomputing Conference 2010: Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation, Naoki Shibata, Shiga University

Upload: naoki-shibata-

Post on 12-Dec-2014


DESCRIPTION

Naoki Shibata: Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation, Journal of Computer Science on Research and Development, Proceedings of the International Supercomputing Conference ISC10, Volume 25, Numbers 1-2, pp. 25-32, 2010, DOI: 10.1007/s00450-010-0108-2 (May 2010).

http://www.springerlink.com/content/340228x165742104/
http://freshmeat.net/projects/sleef

Data-parallel architectures like SIMD (Single Instruction Multiple Data) or SIMT (Single Instruction Multiple Thread) have been adopted in many recent CPU and GPU architectures. Although some SIMD and SIMT instruction sets include double-precision arithmetic and bitwise operations, there are no instructions dedicated to evaluating elementary functions like trigonometric functions in double precision. Thus, these functions have to be evaluated one by one using an FPU or using a software library. However, traditional algorithms for evaluating these elementary functions involve heavy use of conditional branches and/or table look-ups, which are not suitable for SIMD computation. In this paper, efficient methods are proposed for evaluating the sine, cosine, arc tangent, exponential and logarithmic functions in double precision without table look-ups, scattering from, or gathering into SIMD registers, or conditional branches. We implemented these methods using the Intel SSE2 instruction set to evaluate their accuracy and speed. The results showed that the average error was less than 0.67 ulp, and the maximum error was 6 ulps. The computation speed was faster than the FPUs on Intel Core 2 and Core i7 processors.

TRANSCRIPT

Page 1: (Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation

International Supercomputing Conference 2010

Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation

Naoki Shibata, Shiga University


Overview

• Methods for evaluating the following functions using SIMD instructions in double precision
  – sin, cos, tan
  – log, exp
  – asin, acos, atan
• Fast
  – Two times as fast as FPU evaluation
• Accurate
  – Maximum error is within 6 ulps of the true value


Overview (contd.)

• Advantages over existing methods
  – No conditional branches
  – No gathering/scattering
  – No table look-ups

[Figure: x and y pass through the same sequence of SIMD operations, producing sin x and sin y.]

Applying exactly the same series of SIMD operations to x and y yields sin x and sin y.
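As a rough illustration of what "no conditional branches" means in practice (this is not code from the talk), the Python sketch below emulates a two-lane SIMD register with a list. A lane-wise comparison produces an all-ones or all-zeros bit mask, and a bitwise select then picks one of two values per lane, so every lane executes the same operations. The helper names `cmp_lt`, `select` and `clamp_negative_to_zero` are hypothetical.

```python
import struct

ALL_ONES = 0xFFFFFFFFFFFFFFFF  # 64-bit mask with every bit set


def to_bits(x: float) -> int:
    """Reinterpret a double as its 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b: int) -> float:
    """Reinterpret a 64-bit pattern as a double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def cmp_lt(a, b):
    """Lane-wise compare: all-ones mask where a < b, zero elsewhere."""
    return [ALL_ONES * (x < y) for x, y in zip(a, b)]


def select(mask, a, b):
    """Lane-wise bitwise select: (mask & a) | (~mask & b)."""
    return [
        from_bits((m & to_bits(x)) | (~m & ALL_ONES & to_bits(y)))
        for m, x, y in zip(mask, a, b)
    ]


def clamp_negative_to_zero(v):
    """Replace negative lanes by 0.0 without branching on the data."""
    zero = [0.0] * len(v)
    return select(cmp_lt(v, zero), zero, v)


print(clamp_negative_to_zero([-1.5, 2.0]))
```

On real hardware the list comprehensions would correspond to single vector instructions; the point of the sketch is only that the control flow does not depend on the lane values.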


Outline

• Overview
• Background
  – Cares needed for SIMD optimization
  – Related works
• Proposed method
  – Trigonometric functions
  – Inverse trigonometric functions
  – Exponential function
  – Logarithmic function
• Evaluation
  – Accuracy
  – Speed
  – Code size
• Conclusion


Background

• SIMD instructions are now pervasive
  – SSE in x86 processors
  – Altivec in Power / PowerPC processors
  – NEON in ARM processors
  – Cell Broadband Engine
  – Many GPU models
• Length of SIMD registers is going to be extended
  – 256 bits in Sandy Bridge
  – 512 bits in Larrabee (or Knights Ferry?) GPUs


Background (contd.)

• SIMD instruction sets do not include instructions for evaluating elementary functions.
  – sin, cos, tan, log, exp, asin, acos, atan
• Two possibilities
  – FPU
    • Elementary functions are available only on a limited number of architectures.
  – Software library
    • Many are not optimized for SIMD calculation


Cares needed in SIMD optimization

• SIMD optimization needs special care
  – Memory access and conditional branches are slow
    • on modern processors with long pipelines
  – Table look-up is slow
  – Gathering and scattering operations are slow
• Some traditionally slow operations are not slow anymore
  – Division and square root can now each be evaluated using one instruction
  – No extra register is needed
• We need new, specialized algorithms to make efficient use of the SIMD unit


Gathering and scattering operations are slow

• We need to compose/decompose each element of the vector register
  – At least one instruction is needed for each element
  – The SIMD ALU may be idle during this operation
• Register spills may happen
  – This requires execution of extra instructions
  – This may cause memory access


Table look-up is slow

• Table look-up is frequently used in traditional implementations for evaluating elementary functions
  – Both in HW and SW

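[Figure: the vector (x, y) is scattered into its elements, each element goes through a table look-up to give x' and y', and the results are gathered back into a vector. The scatter, look-up and gather part must be repeated for each element.]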


Division and SQRT are not too slow

• 69 clocks of latency and throughput for 1 execution of SQRTPD instruction[*]
• 69 clocks of latency and throughput for 1 execution of DIVPD instruction[*]
• Memory access latency is 165 clocks (C2Q 9300, measured by CPU-Z)
• The good point is that they do not require extra instruction execution and registers

[*] Intel 64 and IA-32 Architectures Optimization Reference Manual


Challenge

• Usually, elementary functions are evaluated using calculation with extra precision
  – e.g. the x86 FPU evaluates using 80-bit calculation and rounds the result to 64 bits
• With extra-precision calculation, obtaining accurate output is not very hard
• But in the proposed method, only 64-bit-precision calculation is available
• Thus, an error in each step of the evaluation leads to an error in the output

So, we cannot tolerate error in any step


Related Works

• The GNU C Library and the FPU emulator in the Linux OS utilize conditional branches
  – Thus, they are not suitable for SIMD computation
• There is much research on evaluating elementary functions in hardware
  – Many of these methods utilize table look-ups
  – Not suitable for SIMD computation
• There are several multiple-precision floating-point computation libraries
  – Their design is very different from our implementation



Trigonometric functions

• Finds sin x and cos x at the same time
• Consists of two steps
  – Step 1: The argument is reduced to between 0 and π/4, utilizing symmetry and periodicity
  – Step 2: The sin and cos functions are evaluated on the reduced argument
• tan x can be evaluated by simply calculating (sin x / cos x)


Reducing argument within 0 and π/4

• Given argument d, find s and an integer q so that d = s + q·π/4 and 0 <= s < π/4
• Is it as easy as just dividing d by π/4? No!
• Finding q is easy, but s may be inaccurate because of cancellation error
  – when d is a large number close to a multiple of π/4
• Some implementations, like the FPU on x86 processors, exhibit this problem


Cancellation error

• Cancellation error happens when we subtract one number from another where
  – The two numbers are very close
  – Both numbers have limited precision and are already rounded
• Suppose that we are calculating 1 – x with 5 digits of precision
  – The true value of x is 0.999985, which is rounded to 0.99999
  – True value: 1 – 0.999985 = 0.000015
  – Rounded: 1 – 0.99999 = 0.00001
  – In this case, the result has only 1 digit of accuracy. The loss of accuracy is caused by cancellation
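The 5-digit example above can be replayed with Python's `decimal` module. This is only a small sketch: the context name `five_digits` is hypothetical, and round-half-up is assumed so that 0.999985 becomes 0.99999 as on the slide.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Arithmetic context limited to 5 significant digits.
five_digits = Context(prec=5, rounding=ROUND_HALF_UP)

x_true = Decimal("0.999985")
x_rounded = five_digits.plus(x_true)          # x as stored with 5 digits

diff_true = Decimal(1) - x_true               # exact difference
diff_rounded = five_digits.subtract(Decimal(1), x_rounded)

# Relative error of the difference computed from the rounded input.
rel_error = abs(diff_rounded - diff_true) / diff_true

print("x rounded      :", x_rounded)
print("true 1 - x     :", diff_true)
print("computed 1 - x :", diff_rounded)
print("relative error :", rel_error)
```

The input was wrong only in its sixth digit, yet the difference is off by about a third of its value, because the subtraction removed all the leading digits that the two operands had in common.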


Solution for this problem

• Basic idea: calculate this part with extra precision, utilizing properties of IEEE 754
• q is assumed to be expressed in 26 bits, which is half of the mantissa
• π/4 is split into three parts so that all of the bits in the lower half of the mantissa of each part are zero


Solution for this problem (contd.)

• Multiplying these numbers by q does not produce an error
  – because the lower half of each of these numbers is zero
• Subtracting the resulting numbers from d in descending order does not produce cancellation error
  – The results of each step are all accurate
• In this way, we can obtain s accurately.
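The split-constant reduction described on the last two slides can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the three parts are built by zeroing the low 27 bits of the 53-bit significand, a 50-digit value of π stands in for the exact constant, and the names `upper_half`, `split_constant`, `reduce_split` and `reduce_naive` are hypothetical. The sketch assumes q fits in 26 bits.

```python
import math
import struct
from fractions import Fraction

# pi/4 from about 50 decimal digits of pi, held exactly as a rational.
PI4 = Fraction("3.14159265358979323846264338327950288419716939937510") / 4


def upper_half(x: float) -> float:
    """Zero the low 27 bits of the significand (26 significant bits remain)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits &= ~((1 << 27) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def split_constant(c: Fraction, parts: int = 3):
    """Split c into `parts` doubles whose lower significand halves are zero."""
    out = []
    for _ in range(parts):
        p = upper_half(float(c))
        out.append(p)
        c -= Fraction(p)          # exact remainder still to be represented
    return out


PI4_A, PI4_B, PI4_C = split_constant(PI4)


def reduce_split(d: float):
    """Return (q, s) with d ~= s + q*pi/4, subtracting the parts in descending order."""
    q = math.floor(d / (math.pi / 4))
    s = ((d - q * PI4_A) - q * PI4_B) - q * PI4_C
    return q, s


def reduce_naive(d: float):
    """Same reduction with a single double-precision pi/4, for comparison."""
    q = math.floor(d / (math.pi / 4))
    return q, d - q * (math.pi / 4)


# A large argument slightly above a multiple of pi/4.
d = float(10**6 * PI4) + 1e-6
q, s = reduce_split(d)
exact_s = Fraction(d) - q * PI4   # reference value in exact rational arithmetic

print("split :", s)
print("naive :", reduce_naive(d)[1])
print("exact :", float(exact_s))
```

Because each part has at most 26 significant bits and q has at most 26 bits, every product `q * PI4_x` fits in a double without rounding, which is the property the slide relies on.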


Step 2 in evaluating trigonometric functions

• Simply summing up the terms of the Taylor series does not produce an accurate result
  – Because of accumulation of errors
• Using another kind of reduction is common
  – Utilizing the triple-angle formula of sine
  – e.g. first divide x by 3, evaluate the Taylor series, and then apply the triple-angle formula
  – This speeds up calculation by reducing x


Step 2 (contd.)

• This method is not good for our case.
  – When sin x is small, the difference between the two terms is so large that it produces rounding error
• So, we use the double-angle formula of cosine instead: $\cos 2x = 2\cos^2 x - 1$
• sin x can be found by the following formula: $\sin x = \sqrt{1 - \cos^2 x}$


Step 2 (contd.)

• But we have another problem
  – When x is close to 0, cos² x gets close to 1, and we have cancellation error
• So, we use the double-angle formula of (1 – cos x)
• Let f(x) be (1 – cos x), and we get
  $f(2x) = 4f(x) - 2f(x)^2$
  $\sin x = \sqrt{2f(x) - f(x)^2}$
• And these do not produce cancellation or rounding errors
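Step 2 can be sketched in Python as below. The identities used follow directly from cos 2x = 2cos²x − 1 with f(x) = 1 − cos x, namely f(2x) = 4f(x) − 2f(x)² and sin x = sqrt((2 − f(x))·f(x)). The function name `sincos_reduced`, the number of doublings and the number of Taylor terms are assumptions chosen for illustration, not values taken from the talk.

```python
import math


def sincos_reduced(s: float, doublings: int = 3, terms: int = 8):
    """Return (sin s, cos s) for 0 <= s <= pi/4 via f(x) = 1 - cos x."""
    x = s / (1 << doublings)      # shrink the argument; power-of-two scaling
    x2 = x * x

    # Taylor series of 1 - cos x = x^2/2! - x^4/4! + x^6/6! - ...
    f = 0.0
    term = x2 / 2.0
    for k in range(1, terms + 1):
        f += term
        term *= -x2 / ((2 * k + 1) * (2 * k + 2))

    # Undo the shrinking: f(2x) = 4 f(x) - 2 f(x)^2, applied once per halving.
    for _ in range(doublings):
        f = (4.0 - 2.0 * f) * f

    sin_s = math.sqrt((2.0 - f) * f)   # sqrt(1 - cos^2) written without 1 - ...
    cos_s = 1.0 - f
    return sin_s, cos_s


for s in (1e-8, 0.1, 0.5, math.pi / 4):
    sn, cs = sincos_reduced(s)
    print(s, sn - math.sin(s), cs - math.cos(s))
```

Working with f rather than cos x keeps the quantity being doubled small when x is near 0, so no step subtracts two nearly equal numbers.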


Inverse trigonometric functions

• We are going to find atan x, and obtain asin x and acos x from atan x
• Again, just evaluating terms of the Taylor series produces accumulation of errors
• There seems to be no good existing method of argument reduction for inverse trigonometric functions
  – We propose a new way of argument reduction for atan


New argument reduction for evaluating arc tangent (1/3)

• Suppose x = atan d, thus tan x = d
• If d <= 1, we evaluate the atan function on the argument 1/d instead of d
• Then subtract the result from π/2
• Now, we can assume that d > 1 and π/4 < x < π/2


New argument reduction for evaluating arc tangent (2/3)

• We use the following formulas
  $\cot x = 1 / \tan x$  (13)
  $\cot(x/2) = \cot x + \sqrt{1 + \cot^2 x}$  (14)
• We use (13) to calculate the cotangent of x from the argument d = tan x
• Then we repeatedly use (14) to find the cotangent of (x / 2^n)
  – which is actually "enlarging" cot x
• By (13), enlarging cot x corresponds to reducing tan x


New argument reduction for evaluating arc tangent (3/3)

• Suppose that atan 2 = θ, so tan θ = 2
• Apply (13) to get cot θ = 0.5
• Apply (14) to get cot(θ/2) ≈ 1.618
• Apply (13) to get tan(θ/2) ≈ 0.618
• tan(θ/2) ≈ 0.618 is less than 2, thus we reduced the argument

After argument reduction, we calculate the Taylor series of atan
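The three slides above can be condensed into the following Python sketch. It assumes that (13) is the reciprocal identity cot x = 1/tan x and that (14) is the half-angle identity cot(x/2) = cot x + sqrt(1 + cot² x), which reproduces the 2 → 0.618 example. The function name `atan_sketch` and the numbers of halvings and series terms are hypothetical choices, and Python conditionals stand in for what would be a mask-based select in SIMD code.

```python
import math


def atan_sketch(d: float, halvings: int = 3, terms: int = 16) -> float:
    """atan(d) for d > 0 using the cotangent half-angle reduction."""
    swap = d <= 1.0
    t = 1.0 / d if swap else d          # now t >= 1, so pi/4 <= x < pi/2

    c = 1.0 / t                          # (13): cot x = 1 / tan x
    for _ in range(halvings):
        c = c + math.sqrt(1.0 + c * c)   # (14): cot(x/2) from cot x
    y = 1.0 / c                          # (13) again: y = tan(x / 2**halvings)

    # Taylor series atan y = y - y^3/3 + y^5/5 - ... on the reduced argument.
    total = 0.0
    power = y
    y2 = y * y
    for k in range(terms):
        total += power / (2 * k + 1)
        power *= -y2

    x = total * (1 << halvings)          # undo the halvings of the angle
    return math.pi / 2 - x if swap else x


for d in (0.5, 1.0, 2.0, 100.0):
    print(d, atan_sketch(d), math.atan(d))
```

With three halvings the series argument stays below tan(π/16) ≈ 0.2, so the Taylor series converges quickly. For very small d the final subtraction from π/2 limits the relative accuracy of this sketch; it makes no attempt to address that.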


asin and acos functions

• We have a problem if we simply use the following formulas
  – These produce cancellation errors when |x| is close to 1
• This problem can be avoided by small modifications of the formulas
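The formulas on this slide did not survive in the transcript, so the sketch below is an assumption about their general shape rather than a reproduction. It takes acos x = atan(sqrt(1 − x²) / x) for 0 < x <= 1 as the "simple" formula and, as one possible small modification, factors 1 − x² into (1 + x)(1 − x). The function names are hypothetical.

```python
import math


def acos_naive(x: float) -> float:
    """acos via atan, computing 1 - x*x directly (cancels when x is near 1)."""
    return math.atan(math.sqrt(1.0 - x * x) / x)


def acos_factored(x: float) -> float:
    """Same identity with 1 - x^2 written as (1 + x)(1 - x)."""
    return math.atan(math.sqrt((1.0 + x) * (1.0 - x)) / x)


def rel_err(value: float, reference: float) -> float:
    return abs(value - reference) / abs(reference)


x = 1.0 - 2.0 ** -30            # an argument very close to 1
reference = math.acos(x)

print("naive    relative error:", rel_err(acos_naive(x), reference))
print("factored relative error:", rel_err(acos_factored(x), reference))
```

In the naive form x*x is rounded before the subtraction from 1, and the subtraction then exposes that rounding error. In the factored form 1 − x is computed exactly for x this close to 1, so only ordinary rounding errors remain.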


Exponential function

• Consists of two steps
  – Similar to trigonometric functions
• Step 1: The argument is reduced to between 0 and log_e 2
• Step 2: The argument is reduced further, and the Taylor series is evaluated


Step 1

• Given argument d, we find s and an integer q so that d = s + q·log_e 2 and 0 < s <= log_e 2
• This step is very similar to the first step for trigonometric functions
• The same problem arises, and we solve the problem in the same way


Step 2

• We can use the following formula to further reduce the argument
• And then evaluate the Taylor series
• But if x is close to 0, we will have rounding errors
  – Since the difference between 1 and the other terms is large
• So, we find (exp x) – 1 instead of exp x
• And the remaining part is similar to sin and cos
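Both steps for the exponential function can be sketched in Python as follows. The slide's reduction formula is not preserved in the transcript; the sketch assumes the doubling identity for g(x) = exp(x) − 1, namely g(2x) = (g(x) + 2)·g(x), which follows from exp(2x) = exp(x)². For brevity step 1 uses a plain subtraction of q·log_e 2 instead of the split-constant technique, and it produces 0 <= s < log_e 2 rather than the slide's 0 < s <= log_e 2. The name `exp_sketch` and the parameter values are hypothetical.

```python
import math

LN2 = math.log(2.0)


def exp_sketch(d: float, halvings: int = 4, terms: int = 12) -> float:
    """exp(d) via reduction by ln 2, then Taylor series of exp(x) - 1."""
    # Step 1: d = s + q * ln 2 (simplified; no split constant here).
    q = math.floor(d / LN2)
    s = d - q * LN2

    # Step 2: shrink the argument further, then sum x + x^2/2! + x^3/3! + ...
    x = s / (1 << halvings)
    g = 0.0
    term = x
    for k in range(2, terms + 2):
        g += term
        term *= x / k

    # Undo the shrinking: g(2x) = exp(2x) - 1 = (g(x) + 2) * g(x).
    for _ in range(halvings):
        g = (g + 2.0) * g

    return math.ldexp(g + 1.0, q)        # (g + 1) * 2**q


for d in (-5.0, 0.5, 1.0, 10.0):
    print(d, exp_sketch(d), math.exp(d))
```

Carrying exp(x) − 1 through the doublings avoids adding a tiny series value to 1 at the start, which is the rounding problem the slide describes.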


Logarithmic function

• We use the following series rather than the Taylor series
• This series converges faster than the Taylor series
• Just evaluating this series is enough.
• This series is well known.
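The series itself is missing from the transcript. A well-known series for the logarithm that converges faster than the Taylor series is log x = 2·Σ t^(2k+1)/(2k+1) with t = (x − 1)/(x + 1), and the Python sketch below assumes that this is the one meant. Splitting the argument into mantissa and exponent with `math.frexp` is an additional assumption made here so that t stays small; the name `log_sketch` and the term count are hypothetical.

```python
import math


def log_sketch(d: float, terms: int = 12) -> float:
    """log(d) for d > 0 using the series in t = (m - 1) / (m + 1)."""
    m, e = math.frexp(d)             # d = m * 2**e with 0.5 <= m < 1
    if m < math.sqrt(0.5):           # move m into [sqrt(1/2), sqrt(2))
        m *= 2.0
        e -= 1

    t = (m - 1.0) / (m + 1.0)        # |t| < 0.172 on that interval
    t2 = t * t

    # log m = 2 * (t + t^3/3 + t^5/5 + ...)
    total = 0.0
    power = t
    for k in range(terms):
        total += power / (2 * k + 1)
        power *= t2

    return 2.0 * total + e * math.log(2.0)


for d in (0.01, 0.5, 2.0, 10.0):
    print(d, log_sketch(d), math.log(d))
```

Only odd powers of t appear, and |t| is much smaller than |x − 1|, which is why far fewer terms are needed than with the Taylor series around 1.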



Evaluation

• Accuracy
  – We compared the output of the proposed method with the output of the MPFR library
  – We measured the evaluation accuracy within a few ranges for each function
• Speed
  – We compared the speed of the proposed methods, an FPU and the MPFR library
  – We used a Core i7 920 (2.66GHz) and a Core 2 Duo E7200 (2.53GHz)
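Errors in this kind of comparison are expressed in ulps (units in the last place). As a hedged sketch of how such a measurement can be set up, the Python helper below divides the absolute error by the spacing of doubles at the reference value. The talk used MPFR for the reference; here an exact `Fraction` stands in for it, and the name `ulp_error` is hypothetical. It requires Python 3.9 or later for `math.ulp` and `math.nextafter`.

```python
import math
from fractions import Fraction


def ulp_error(approx: float, reference: Fraction) -> float:
    """Error of `approx` against an exact reference, measured in ulps."""
    spacing = Fraction(math.ulp(float(reference)))   # ulp at the reference
    return float(abs(Fraction(approx) - reference) / spacing)


# The double just above 1.0 is exactly one ulp away from 1.
print(ulp_error(math.nextafter(1.0, 2.0), Fraction(1)))
# A reference that no double represents exactly: 1/3.
print(ulp_error(1.0 / 3.0, Fraction(1, 3)))
```

A correctly rounded result has an error of at most 0.5 ulp, which gives a scale for reading figures such as "6 ulps".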


Accuracy

Accuracy results for sin, cos and tan

Accuracy results for atan

Accuracy results for asin and acos


Accuracy (contd.)

Accuracy results for exp

Accuracy results for log

In all cases, error does not exceed 6 ulps


Speed

Proposed method is about two times as fast as FPU calculation


Code size

• The total code size is very small
• Suitable for Cell B.E.
  – which has only 256K bytes of directly accessible scratch pad memory in each SPE.


Conclusion

• We proposed efficient methods for evaluating elementary functions in double precision
• They do not include table look-ups, scattering from or gathering into SIMD registers, or conditional branches.
• The average and maximum errors were less than 0.67 ulp and 6 ulps, respectively.
• The evaluation speed was faster than the FPUs on Intel Core 2 and Core i7 processors.


Thank you

• An implementation of the proposed method is now available as public domain software.
• Contact
  http://freshmeat.net/projects/sleef