(slides) Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation
DESCRIPTION
Naoki Shibata: Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation. Computer Science - Research and Development (Proceedings of the International Supercomputing Conference ISC'10), Volume 25, Numbers 1-2, pp. 25-32, May 2010. DOI: 10.1007/s00450-010-0108-2
http://www.springerlink.com/content/340228x165742104/
http://freshmeat.net/projects/sleef

Data-parallel architectures like SIMD (Single Instruction Multiple Data) or SIMT (Single Instruction Multiple Thread) have been adopted in many recent CPU and GPU architectures. Although some SIMD and SIMT instruction sets include double-precision arithmetic and bitwise operations, there are no instructions dedicated to evaluating elementary functions like trigonometric functions in double precision. Thus, these functions have to be evaluated one by one using an FPU or a software library. However, traditional algorithms for evaluating these elementary functions involve heavy use of conditional branches and/or table look-ups, which are not suitable for SIMD computation. In this paper, efficient methods are proposed for evaluating the sine, cosine, arc tangent, exponential and logarithmic functions in double precision without table look-ups, scattering from, or gathering into SIMD registers, or conditional branches. We implemented these methods using the Intel SSE2 instruction set to evaluate their accuracy and speed. The results showed that the average error was less than 0.67 ulp, and the maximum error was 6 ulps. The computation speed was faster than the FPUs on Intel Core 2 and Core i7 processors.

TRANSCRIPT
International Supercomputing Conference 2010
Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation
Naoki Shibata, Shiga University
Overview
• Methods for evaluating the following functions using SIMD instructions in double precision
  – sin, cos, tan
  – log, exp
  – asin, acos, atan
• Fast
  – Two times as fast as FPU evaluation
• Accurate
  – Maximum error is within 6 ulps of the true value
Overview (contd.)
• Advantages over existing methods
  – No conditional branches
  – No gathering/scattering
  – No table look-ups
[Figure: a register holding x and y passes through a chain of identical SIMD operations, yielding sin x and sin y]
Apply exactly the same series of SIMD operations to x and y, and we get sin x and sin y.
Outline
• Overview
• Background
  – Cares needed for SIMD optimization
  – Related works
• Proposed method
  – Trigonometric functions
  – Inverse trigonometric functions
  – Exponential function
  – Logarithmic function
• Evaluation
  – Accuracy
  – Speed
  – Code size
• Conclusion
Background
• SIMD instructions are now pervasive
  – SSE in x86 processors
  – AltiVec in Power / PowerPC processors
  – NEON in ARM processors
  – Cell Broadband Engine
  – Many GPU models
• The length of SIMD registers is going to be extended
  – 256 bits in Sandy Bridge
  – 512 bits in Larrabee (or Knights Ferry?) GPUs
Background (contd.)
• SIMD instruction sets do not include instructions for evaluating elementary functions
  – sin, cos, tan, log, exp, asin, acos, atan
• Two possibilities
  – FPU
    • Elementary functions are available only on limited architectures
  – Software library
    • Many are not optimized for SIMD calculation
Cares needed in SIMD optimization
• We need to take special care in SIMD optimization
  – Memory accesses and conditional branches are slow
    • on modern processor models with long pipelines
  – Table look-ups are slow
  – Gathering and scattering operations are slow
• Some traditionally slow operations are not slow anymore
  – Division and square root can now be evaluated using one instruction each
  – No extra registers needed
• We need new, specialized algorithms to make efficient use of the SIMD unit
Gathering and scattering operations are slow
• We need to compose/decompose each element of a vector register
  – At least one instruction is needed for each element
  – The SIMD ALU may be idle during this operation
• Register spills may happen
  – This requires execution of extra instructions
  – This may cause memory accesses
Table look-up is slow
• Table look-up is frequently used in traditional implementations for evaluating elementary functions
  – both in HW and SW
[Figure: the elements x and y are scattered out of the SIMD register, looked up in the table one by one, and the results x' and y' gathered back in; this part must be repeated for each element]
Division and SQRT are not too slow
• 69 clocks of latency and throughput for one execution of the SQRTPD instruction[*]
• 69 clocks of latency and throughput for one execution of the DIVPD instruction[*]
• Memory access latency is 165 clocks (Core 2 Quad Q9300, measured by CPU-Z)
• The good point is that they do not require extra instruction execution or registers
[*] Intel 64 and IA-32 Architectures Optimization Reference Manual
Challenge
• Usually, evaluation of elementary functions is performed using calculation with extra precision
  – e.g. the x86 FPU evaluates using 80-bit calculation, and rounds the result to 64 bits
• With extra-precision calculation, obtaining accurate output is not very hard
• But in the proposed method, only 64-bit-precision calculation is available
• Thus, error in each step of evaluation leads to error in the output
→ So, we cannot tolerate error in each step
Related Works
• The GNU C Library and the FPU emulator in the Linux OS utilize conditional branches
  – Thus, not suitable for SIMD computation
• There is much research on evaluating elementary functions in hardware
  – Many of these works utilize table look-ups
  – Not suitable for SIMD computation
• There are several multiple-precision floating-point computation libraries
  – Their design is very different from our implementation
Outline
• Overview
• Background
  – Cares needed for SIMD optimization
  – Related works
• Proposed method
  – Trigonometric functions
  – Inverse trigonometric functions
  – Exponential function
  – Logarithmic function
• Evaluation
  – Accuracy
  – Speed
  – Code size
• Conclusion
Trigonometric functions
• Finds sin x and cos x at the same time
• Consists of two steps
  – Step 1: The argument is reduced to within 0 and π/4, utilizing symmetry and periodicity
  – Step 2: Evaluate the sin and cos functions on the reduced argument
• tan x can be evaluated by simply calculating (sin x / cos x)
Reducing the argument to within 0 and π/4
• Given argument d, find s and an integer q so that s = d − q·(π/4) and 0 <= s < π/4
• Is it as easy as just dividing d by π/4? – No!
• Finding q is easy, but s may be inaccurate because of cancellation error
  – when d is a large number close to a multiple of π/4
• Some implementations, like the FPU on x86 processors, exhibit this problem
Cancellation error
• Cancellation error happens when we subtract one number from another where
  – the two numbers are very close
  – both numbers have limited precision and are already rounded
• Suppose that we are calculating 1 − x with 5 digits of precision
  – The true value of x is 0.999985, which is rounded to 0.99999
  – True value: 1 − 0.999985 = 0.000015
  – Rounded: 1 − 0.99999 = 0.00001
  – In this case, the result has only 1 digit of accuracy; the loss of accuracy is caused by cancellation
Solution for this problem
• Basic idea: calculate this part with extra precision, utilizing properties of IEEE 754
• q is assumed to be expressible in 26 bits, which is half of the mantissa
• π/4 is split into three parts, so that all of the bits in the lower half of the mantissa of each part are zero
Solution for this problem (contd.)
• Multiplying these numbers by q does not produce an error
  – because the lower halves of these numbers are zero
• Subtracting the resulting numbers in descending order from d does not produce cancellation error
  – The results of each step are all accurate
• In this way, we can obtain s accurately
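The reduction above can be sketched in scalar C as follows. The split of π/4 and the way q is obtained are our illustrative choices, not the paper's exact code (the paper's constants carry π/4 to higher precision than a single double):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the reduction idea: pi/4 is represented as three parts
 * a + b + c, where a and b have their lower mantissa halves zero, so
 * multiplying them by a ~26-bit integer q is exact, and the chained
 * subtractions from d do not cancel. */
static double upper_half(double x) {       /* clear the low 26 mantissa bits */
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    u &= 0xFFFFFFFFFC000000ULL;
    memcpy(&x, &u, sizeof u);
    return x;
}

double reduce_pi4(double d, int *qout) {   /* returns s = d - q*(pi/4); assumes d >= 0 */
    const double PI_4   = 0.78539816339744830962; /* pi/4 rounded to double */
    const double PI_4_C = 3.06161699786838e-17;   /* approx. remaining bits of pi/4 */
    double a = upper_half(PI_4);           /* upper half of pi/4 */
    double b = PI_4 - a;                   /* lower half; this subtraction is exact */
    int q = (int)(d / PI_4);
    double s = ((d - q * a) - q * b) - q * PI_4_C;
    *qout = q;
    return s;
}
```

Each product q·a and q·b fits in 53 mantissa bits and is therefore exact, so the remaining error comes only from the tiny tail term.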
Step 2 in evaluating trigonometric functions
• Simply summing up the terms of the Taylor series does not produce an accurate result
  – because of accumulation of errors
• Using another kind of reduction is common
  – utilizing the triple angle formula of sine
  – e.g. first divide x by 3, evaluate the Taylor series, and then apply the triple angle formula
  – This speeds up the calculation by reducing x
Step 2 (contd.)
• This method is not good for our case
  – When sin x is small, the difference between the two terms is so large that it produces rounding error
• So, we use the double angle formula of cosine instead: cos 2x = 2 cos² x − 1
• sin x can then be found by the following formula: sin x = √(1 − cos² x)
Step 2 (contd.)
• But we have another problem
  – When x is close to 0, cos² x gets close to 1, and we have cancellation error
• So, we use the double angle formula of (1 − cos x) instead
• Let f(x) be (1 − cos x), and we get
  f(2x) = 2 f(x) (2 − f(x)),  sin x = √(f(x) (2 − f(x)))
• These do not produce cancellation or rounding errors
Inverse trigonometric functions
• We are going to find atan x, and obtain asin x and acos x from atan x
• Again, just evaluating terms of the Taylor series produces accumulation of errors
• There seems to be no good way of argument reduction for inverse trigonometric functions
  – We propose a new way of argument reduction for atan
New argument reduction for evaluating arc tangent (1/3)
• Suppose x = atan d, thus tan x = d
• If d <= 1, we evaluate the atan function on the argument 1/d instead of d
• Then we subtract the result from π/2
• Now, we can assume that d > 1 and π/4 < x < π/2
New argument reduction for evaluating arc tangent (2/3)
• We use the following formulas:
  cot x = 1 / tan x    (13)
  cot (x/2) = cot x + √(1 + cot² x)    (14)
• We use (13) to calculate the cotangent from the argument d
• Then we repeatedly use (14) to find the cotangent of (x / 2^n)
  – which actually "enlarges" the cotangent
• By (13), enlarging the cotangent corresponds to reducing the tangent
New argument reduction for evaluating arc tangent (3/3)
• Suppose that θ = atan 2, thus tan θ = 2
• Apply (13) to get cot θ = 0.5
• Apply (14) to get cot (θ/2) = 0.5 + √(1 + 0.5²) ≈ 1.618
• Apply (13) to get tan (θ/2) ≈ 0.618
• 0.618 is less than 2, thus we have reduced the argument
After argument reduction, we calculate the Taylor series of atan
asin and acos functions
• We have a problem if we simply use the following formulas:
  asin x = atan (x / √(1 − x²)),  acos x = atan (√(1 − x²) / x)
  – These produce cancellation errors when |x| is close to 1
• This problem can be avoided by small modifications of the formulas
  – e.g. evaluating 1 − x² as (1 + x)(1 − x)
Exponential function
• Consists of two steps
  – similar to the trigonometric functions
• Step 1: The argument is reduced to within 0 and log_e 2
• Step 2: Further reduce the argument, and evaluate the Taylor series
Step 1
• We find s and an integer q so that 0 < s <= log_e 2
• This step is very similar to the first step for the trigonometric functions
• The same problem arises, and we solve it in the same way
Step 2
• We can use the following formula to further reduce the argument: exp x = (exp (x/2))²
• And then evaluate the Taylor series
• But, if x is close to 0, we will have rounding errors
  – since the difference between 1 and the other terms is large
• So, we find (exp x) − 1 instead of exp x
• The remaining part is similar to sin and cos
Logarithmic function
• We use the following series, rather than the Taylor series:
  log x = 2 (t + t³/3 + t⁵/5 + ...),  where t = (x − 1)/(x + 1)
• This series converges faster than the Taylor series
• Just evaluating this series is enough
• This series is well known
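A scalar C sketch of this evaluation, applying the series to the mantissa after splitting x = m · 2^e; the normalization interval and term count are our illustrative choices:

```c
#include <math.h>

/* Sketch of the log evaluation with the well-known series
 * log m = 2 (t + t^3/3 + t^5/5 + ...),  t = (m - 1)/(m + 1).
 * Normalizing m into [sqrt(1/2), sqrt(2)) keeps |t| <= 0.1716,
 * so few terms are needed. */
double log_sketch(double x) {            /* assumes x > 0 and finite */
    const double LN2 = 0.69314718055994530942;
    int e;
    double m = frexp(x, &e);             /* x = m * 2^e, m in [0.5, 1) */
    if (m < 0.70710678118654752440) { m *= 2.0; e -= 1; }
    double t = (m - 1.0) / (m + 1.0);
    double t2 = t * t;
    /* terms through t^9 of the series */
    double s = 2.0 * t * (1.0 + t2 * (1.0/3 + t2 * (1.0/5 + t2 * (1.0/7 + t2/9))));
    return s + e * LN2;                  /* log x = log m + e log 2 */
}
```

Unlike the Taylor series of log(1 + u), every term of this series has the same sign structure in t², and |t| is bounded well away from 1, which is why it converges quickly.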
Outline
• Overview
• Background
  – Cares needed for SIMD optimization
  – Related works
• Proposed method
  – Trigonometric functions
  – Inverse trigonometric functions
  – Exponential function
  – Logarithmic function
• Evaluation
  – Accuracy
  – Speed
  – Code size
• Conclusion
Evaluation
• Accuracy
  – We compared the output of the proposed method with the output of the MPFR library
  – We measured the evaluation accuracy within a few ranges for each function
• Speed
  – We compared the speed of the proposed methods, an FPU, and the MPFR library
  – We used a Core i7 920 (2.66 GHz) and a Core 2 Duo E7200 (2.53 GHz)
Accuracy
[Table: accuracy results for sin, cos and tan]
[Table: accuracy results for atan]
[Table: accuracy results for asin and acos]
Accuracy (contd.)
[Table: accuracy results for exp]
[Table: accuracy results for log]
In all cases, the error does not exceed 6 ulps
Speed
[Table: speed results]
The proposed method is about two times as fast as FPU calculation
Code size
• The total code size is very small
• Suitable for the Cell B.E.
  – which has only 256K bytes of directly accessible scratch pad memory in each SPE
Conclusion
• We proposed efficient methods for evaluating elementary functions in double precision
• They do not include table look-ups, scattering from or gathering into SIMD registers, or conditional branches
• The average and maximum errors were less than 0.67 ulp and 6 ulps, respectively
• The evaluation speed was faster than the FPUs on Intel Core 2 and Core i7 processors
Thank you
• An implementation of the proposed methods is now available as public domain software
• Contact: http://freshmeat.net/projects/sleef