Title: A gentle introduction to floating point arithmetic
1A gentle introduction to floating point arithmetic
- Ho Chun Hok (cho_at_doc.ic.ac.uk)
- Custom Computing Group Seminar25 Nov 2005
2IEEE 754 floating point standard
vsize
esize
fsize
Let v
- Normal numbers (when exponent gt 0 and lt max
exponent) - v (-1)s x 2exponent x (1.fraction)
- Subnormal numbers (when exponent 0)
- v (-1)s x 2exponent x (0.fraction)
- Special numbers (when exponent max exponent)
- Infinity, Nan (not a number)
- precisions
- Single esize 8, fsize 23, vsize 32
- Double esize 11, fsize 52, vsize 64
- Double extended, vsize gt 64
- Operations
- , -, x, /, sqrt, f2i, i2f, compare, f2d, d2f
- Rounding
- Nearest even, inf, -inf, towards 0
3What IEEE 754 standard supposes to be
- Approximation to real number with expected error
- the epsilon can of any real number can be
determined when mapping to floating point number - Results of all operations can be correctly
rounded in case of inexact result - Ensure some math properties hold (in general),
- xyyx, -(-a) a, agtb cgt0 ? ac gt bc,
- x0 x, yy gt 0
- All exception can be detected
- Using exception flags
- Same results across different machines
4How to ensure the standard?
- Processor?
- Rounding numbers in different mode
- Gradual underflow
- Raise exceptions
- Operating System?
- Handle exception
- Handle function which may not be supported in
hardware (what if a processor cannot handle
subnormal number) - Keep track of the floating point unit state,
(precision, rounding mode) - Programming Language?
- Well-defined semantic for floating point (yes, we
have infamous JAVA language) - Compiler?
- Preserve the semantic defined in that language
- Programmer?
- read What Every Computer Scientist Should Know
About Floating-Point Arithmetic
5Case study 1
- int main (void)
- double ref,index
- double tmp
- int i
- ref (double) 169.0/ (double) 170.0
- for(i0ilt250i)
- indexi
- if(ref (double) (index/(index1.0)) )
break -
- printf("id\n", i)
- return 0
-
6Visual C compiler, running on P-M
- Same result on lulu (pentium 3) and irina (Xeon)
7GCC, running on Pentium 4 (skokie)
8VCC fld qword ptr ebp-10h fadd
qword ptr __real_at_8_at_3fff8000000000000000
(00426028) fdivr qword ptr ebp-10h fcomp
qword ptr ebp-8 fnstsw ax test
ah,40h je main5Fh (0040106f) jmp
main61h (00401071)
- gcc
- fld1
- faddp st,st(1)
- fldl 0xfffffff0(ebp)
- fdivp st,st(1)
- fldl 0xfffffff8(ebp)
- fxch st(1)
- fucompp
- fnstsw ax
- and 0x45,ah
- cmp 0x40,ah
- je 0x80483d2 ltmain102gt
VCC use normal stack (ebp) (64-bits) to store the
result, and compare with a 64bit double precision
value GCC use advanced FPU register stack (st)
(80-bits) to store the result, and compare with a
64bit double precision value
9Case study 1
- Its compiler issue
- Using more precision to calculate the
intermediate result is a good idea - Compiler should convert the 80-bit floating point
number to 64-bit before comparison - And its programmer issue too
- Equality test between FP variables is dangerous
- We can detect the problem before it hurts.
- It is not easy to compliance with the standard
10Case study 2
- Calculate
- When x is large, result 0, rather than
- Beware, even everything compliance with standard,
the standard cannot guarantee the result is
always correct - Again, programmer should detect this before it
hurts - Define routine to trap the exception
- Exceptions are not errors as long as they are
handled correctly
11Case Study 3
- Jean-Michel Mullers Recurrence
12Using double extended precision (80-bits)
- x2 5.590164e00
- x3 5.633431e00
- x4 5.674649e00
- x5 5.713329e00
- x6 5.749121e00
- x7 5.781811e00
- x8 5.811314e00
- x9 5.837660e00
- x10 5.861018e00
- x11 5.882514e00
- x12 5.918471e00
- x13 6.240859e00
- x14 1.115599e01
- x15 5.279838e01
- x16 9.469105e01
- x17 9.966651e01
- x18 9.998007e01
- x19 9.999881e01
Converge to 100.0, it seems correct
13Case Study 3
- This series can converge to either 5, 6, and 100
- Depends on the value of x0, x1
- If x0 11/2, x1 61/11, the series should be
converged to 6 - Little round off error may affect the result
dramatically - We can calculate the result analytically by
substituting - In general, its very difficult to detect this
error
14Case Study 4
- Table makers dilemma
- If we want n-digit accuracy for elementary
function like sine, cosine, we (in most case)
need to calculate the digit up to n2 digit - What if the last 2-digit is 10?
- We can calculate last 3 digit
- What if the last 3-digit is 100?
- We can calculate last 4 digit
- What if
- The result of most elementary function library
(e.g. libm) is not correctly rounded in some
cases
15Conclusion
- We have a standard representation for floating
point number - Comforting the standard requires collaboration
between different parties - Even if we have standard-compliance platform,
cautions must be taken when underflow,
overflow, and make sure the algorithm is
numerically stable - When using elementary function, dont expect the
result can be comparable between different
machine - Elementary function is NOT included in the IEEE
standard - Floating point, when use properly, can do
something serious