Title: Bridged, Three-Path Fused Multiply-Adders
1Bridged, Three-Path Fused Multiply-Adders
- A proposal for the improvement of on-chip FMAs
Eric Quinnell, M.S.E.E. under the supervision
of Professor Earl E. Swartzlander, Jr.
2Qualification Committee
- Dr. Earl E. Swartzlander, Jr.
- Dr. Adnan Aziz
- Dr. Jacob Abraham
- Dr. Tony Ambler
- Dr. Jason Arbaugh
- Mr. Carl Lemonds
3Outline
- Introduction
- Problem Statement
- Previous Work
- Proposed Work
- Expected Results
- Implementation
- Conclusion
- Questions
4Introduction
- Proposed project in IEEE-754 (1985)
double-precision format
- 52-bit significand/mantissa (fraction)
- 11-bit exponent
- 1-bit sign
- Follows format of previous papers on subject
- Focus on arithmetic, operand execution
- Exceptions, specials, denorms, NaN, infinity not
considered
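The field widths listed above can be checked directly on a machine that stores Python floats as IEEE-754 doubles. The following is a minimal sketch, not part of the original slides; the helper name `decode_double` is hypothetical.

```python
import struct

def decode_double(x):
    """Split a float (assumed IEEE-754 binary64) into its three fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1-bit sign
    exponent = (bits >> 52) & 0x7FF      # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)    # 52-bit stored fraction
    return sign, exponent, fraction

sign, exponent, fraction = decode_double(-1.5)
print(sign, exponent, hex(fraction))
```

For -1.5 the sign bit is 1, the biased exponent is 1023 (bias 1023, true exponent 0), and only the top fraction bit is set.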
5Fused Multiply-Add
- Principal paper and patent under Montoye et al.
from IBM in 1990
- Equation found in any polynomial
- Used in DSP, FFT, graphics, division,
transcendentals, dot-products, advanced
mathematics
- Faster than separate FADD, FMUL
- Only one rounding stage
D = (A x B) + C
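As an illustration of why this equation appears in any polynomial, Horner's rule evaluates a polynomial with one multiply-add per coefficient. The sketch below is an assumption-laden software model, not the hardware described in the slides: the hypothetical helper `fma` emulates a single rounding by computing the exact rational result and converting to a float once, and it ignores special values such as NaN and infinity.

```python
from fractions import Fraction

def fma(a, b, c):
    """Emulate D = (A x B) + C with one rounding (finite inputs only)."""
    # Exact rational arithmetic, then a single conversion back to a double.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def horner(coeffs, x):
    """Evaluate a polynomial, highest-order coefficient first."""
    acc = 0.0
    for c in coeffs:
        acc = fma(acc, x, c)   # one fused multiply-add per coefficient
    return acc

# 2x^2 + 3x + 4 evaluated at x = 2.0
value = horner([2.0, 3.0, 4.0], 2.0)
print(value)
```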
6Fused Multiply-Add
RISC System/6000
Montoye et al., IBM 1990
7Industrial Use
- IBM RS/6000
- IBM PowerPC 603, 604 series
- HP PA 8000 series
- MIPS R10000
- ARM VFP10
- Intel Itanium
8Problem Statement
- All industrial FMAs use RS/6000 architecture as
base
- Many FMAs entirely replace FADD, FMUL. This taxes
the stand-alone instructions. (Bad backwards
compatibility)
- FADD = (A x 1.0) + B
- FMUL = (A x B) + 0.0
- FMA has weak success in industry
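The two substitutions above can be modeled in software. This is a rough sketch under stated assumptions: `fma` is a hypothetical single-rounding emulation built on exact rational arithmetic, the wrapper names are invented for illustration, and signed zeros and other special values are ignored, in line with the scope set in the introduction.

```python
from fractions import Fraction

def fma(a, b, c):
    """Emulate (A x B) + C with one rounding (finite inputs only)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fadd_via_fma(a, b):
    """Stand-alone add issued to an FMA unit: (A x 1.0) + B."""
    return fma(a, 1.0, b)

def fmul_via_fma(a, b):
    """Stand-alone multiply issued to an FMA unit: (A x B) + 0.0."""
    return fma(a, b, 0.0)

print(fadd_via_fma(0.1, 0.2) == 0.1 + 0.2)
print(fmul_via_fma(0.1, 0.3) == 0.1 * 0.3)
```

The results match the native operations, but in hardware each stand-alone instruction still travels the full FMA pipeline, which is the latency tax the slide refers to.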
9Problem Statement
In order for the FMA unit to have a future in
processing and to continue the benefits of its
use, a new architecture that both reduces latency
and remains compatible with old applications must
be designed.
10Previous Work PowerPC 603e
11Previous Work HAL SPARC64
- pseudo-FMA forwards finished multiplies
directly to the FPA
- FMA data is rounded twice, hence pseudo
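The effect of rounding twice can be seen with a small numeric example. The sketch below is a software model only, with hypothetical function names: `pseudo_fma` rounds the product before adding, while `fused_fma` emulates one rounding using exact rational arithmetic. The operand values are chosen for illustration.

```python
from fractions import Fraction

def fused_fma(a, b, c):
    """One rounding: exact (A x B) + C, then a single float conversion."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def pseudo_fma(a, b, c):
    """Two roundings: the product is rounded before the add."""
    product = a * b        # first rounding
    return product + c     # second rounding

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double.
print(pseudo_fma(a, b, c))   # the low-order term is lost
print(fused_fma(a, b, c))    # the low-order term survives
```

Here the pseudo-FMA result is 0.0, while the single-rounding result is -2**-60.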
12Previous Work Lang/Bruguera
- Combine addition/rounding stage
- Critical path is through LZA. Data waits at
161-bit normalizer for shifting instructions
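To show what the normalizer needs from the LZA, the sketch below counts leading zeros in a 161-bit word and shifts them out. It is an after-the-fact software model with hypothetical names: a hardware LZA anticipates the shift count in parallel with the addition rather than counting zeros in the finished sum.

```python
WIDTH = 161  # datapath width quoted on the slide

def leading_zeros(value, width=WIDTH):
    """Count the zero bits above the most significant 1 in a width-bit word."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the datapath width")
    return width - value.bit_length()

def normalize(value, width=WIDTH):
    """Shift left until the top bit is 1; return (shifted_value, shift_amount)."""
    shift = leading_zeros(value, width)
    return value << shift, shift

# A sum whose top 60 bits cancelled to zero.
shifted, shift = normalize(1 << 100)
print(shift, shifted == 1 << 160)
```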
13Previous Work Seidel Multi-Path
- 5-cases and paths
- Speculatively compute in parallel.
- Select path at the end based on correct exponent
difference
- Stemmed from the dual-path FPA
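The dual-path FPA idea that this work stems from can be sketched as a selection on exponent difference. The code below models the common two-path adder rule only (near path when an effective subtraction has an exponent difference of at most 1, far path otherwise). It does not reproduce the five cases of the multi-path FMA, whose boundaries are not given on the slide, and the function names are hypothetical.

```python
import math

def exponent(x):
    """Exponent e with x = m * 2**e and 0.5 <= |m| < 1 (from math.frexp)."""
    return math.frexp(x)[1]

def select_adder_path(a, b):
    """Return 'near' when massive cancellation is possible, else 'far'."""
    effective_subtract = (a < 0) != (b < 0)        # operand signs differ
    diff = abs(exponent(a) - exponent(b))
    return "near" if effective_subtract and diff <= 1 else "far"

print(select_adder_path(1.0, -0.75))    # close exponents, subtraction
print(select_adder_path(1.0, 0.75))     # addition never cancels
print(select_adder_path(1024.0, -1.0))  # exponents far apart
```

In hardware both paths compute speculatively in parallel and the selection only steers the final multiplexer.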
14Previous Work Seidel Multi-Path
15Previous Work Xiao
- 3-input LZA equations to speed up critical path
of Lang/Bruguera
16Proposed Work Three-Path FMA
- Variation on Seidel multiple-path suggestion by
reducing 5-cases to 3-paths
- Uses a Lang/Bruguera improvement to combine
addition/rounding with a Xiao 3-input LZA for
near path
- Architecture designed for reduced latency
18Proposed Work Bridged FMA
- Variation of SPARC pseudo FMA.
- Keep full multiplier and adder in FPU.
- Bridge the two by re-using resources when FMA
instruction is called
- Architecture designed for backwards compatibility
with legacy code
20Proposed Work Bridged, 3-Path FMA
- Combination of three-path FMA and bridged FMA
- Three-path for reduced latency
- Bridge for re-use of components, backwards
compatibility
- Only needs to share multiplier
22Expected Results
- 3-path FMA architecture expected to be fastest,
lowest latency FMA to date
- Bridged FMA expected to perform execution without
adding latency to multiplier/adder
- 3-path, bridged FMA expected to provide both reduced
latency and hardware reuse, providing a full
execution hardware set option for future FPUs.
23Implementation
- All proposed hardware, as well as an RS/6000,
will be implemented for latency, area, and power
comparison
- Implementation will be done using AMD 45nm and 65nm
HSpice models, timing tools, floorplanning, power
estimation, routing, and parasitic extraction
- Tool licensing agreement in writing. Tool use
will be considered a PR donation from AMD for
support of UT dissertation research.
24Implementation Schedule
25Conclusion
- FMAs are a rising power in industrial-level
computing, with several chips already putting
them to use
- No reduced-latency improvements to the RS/6000
architecture have been adopted in the 16 years
since its introduction
- Single add or multiplication instructions are
currently latency-taxed in FMAs
- Proposed three-path and bridged architectures
decrease FMA latency significantly and remain
single add/multiply compatible
- Proposed architectures to be implemented on AMD
45nm/65nm technology to prove theorized gains
26Break and Questions
27Courses Graduate
28Courses Upper Division (BSEE)