Bridged, Three-Path Fused Multiply-Adders

Slides: 29

Provided by: Electrical55

1
Bridged, Three-Path Fused Multiply-Adders

A proposal for the improvement of on-chip FMAs

Eric Quinnell, M.S.E.E. under the supervision
of Professor Earl E. Swartzlander, Jr.
2
Qualification Committee

Dr. Earl E. Swartzlander, Jr.
Dr. Adnan Aziz
Dr. Jacob Abraham
Dr. Tony Ambler
Dr. Jason Arbaugh
Mr. Carl Lemonds

3
Outline

Introduction
Problem Statement
Previous Work
Proposed Work
Expected Results
Implementation
Conclusion
Questions

4
Introduction

Proposed project in IEEE-754 (1985)
double-precision format
52-bit fraction (53-bit significand with the implicit leading bit)
11-bit exponent
1-bit sign
Follows format of previous papers on subject
Focus on arithmetic, operand execution
Exceptions, specials, denorms, NaN, infinity not
considered
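As a minimal sketch of the field layout listed above, the following splits a double into its sign, exponent, and fraction fields. The helper name decode_double is hypothetical, and the sketch assumes Python's float is an IEEE-754 double, which holds on common platforms.

```python
import struct

def decode_double(x: float) -> tuple[int, int, int]:
    """Split an IEEE-754 double into (sign, biased exponent, fraction)."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                   # 1-bit sign
    exponent = (bits >> 52) & 0x7FF     # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)   # 52-bit fraction, hidden bit not stored
    return sign, exponent, fraction

sign, exponent, fraction = decode_double(-1.5)
print(sign, exponent, hex(fraction))
```

For -1.5 the sign bit is set, the biased exponent is 1023 (an unbiased exponent of 0), and only the top fraction bit is set.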

5
Fused Multiply-Add

Original paper and patent by Montoye et al.,
IBM, 1990
Multiply-add form appears in every polynomial evaluation
Used in DSP, FFT, graphics, division,
transcendentals, dot-products, advanced
mathematics
Faster than a separate FMUL followed by an FADD
Only one rounding stage

D = (A x B) + C
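The effect of a single rounding stage on D = (A x B) + C can be shown with a small software model. This is a sketch only: fma_exact is a hypothetical helper that computes the exact result with rational arithmetic and rounds once, which models the arithmetic of a fused unit and not any particular hardware design.

```python
from fractions import Fraction

def fma_exact(a: float, b: float, c: float) -> float:
    """Model a fused multiply-add: exact product and sum, then one rounding."""
    # Fraction holds the exact value of each double; float() rounds once.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-52
b = 1.0 - 2.0**-52
c = -1.0

unfused = a * b + c          # product is rounded to 1.0 before the add
fused = fma_exact(a, b, c)   # exact product 1 - 2**-104 survives until the end

print(unfused, fused)
```

Here the exact product is 1 - 2^-104. The separate multiply rounds it to 1.0, so the unfused result is 0.0, while the fused model returns -2^-104.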
6
Fused Multiply-Add
RISC System/6000
Montoye et al., IBM 1990
7
Industrial Use

IBM RS/6000
IBM PowerPC 603 and 604 series
HP PA 8000 series
MIPS R10000
ARM VFP10
Intel Itanium

8
Problem Statement

All industrial FMAs use RS/6000 architecture as
base
Many FMA units entirely replace the FADD and FMUL
units, so stand-alone adds and multiplies pay the
full FMA latency (poor backward compatibility)
FADD = (A x 1.0) + B
FMUL = (A x B) + 0.0
FMA has had limited success in industry
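The mapping of stand-alone adds and multiplies onto an FMA-only unit can be sketched in software. The names fma_exact, fadd_via_fma, and fmul_via_fma are hypothetical; the model rounds once from an exact rational result and, like the slides, ignores zeros, NaN, and infinity.

```python
from fractions import Fraction

def fma_exact(a: float, b: float, c: float) -> float:
    """Software model of a fused multiply-add with a single rounding."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fadd_via_fma(a: float, b: float) -> float:
    """FADD issued as (A x 1.0) + B: the multiplier is used but does no work."""
    return fma_exact(a, 1.0, b)

def fmul_via_fma(a: float, b: float) -> float:
    """FMUL issued as (A x B) + 0.0: the adder is used but does no work."""
    return fma_exact(a, b, 0.0)

print(fadd_via_fma(0.1, 0.2) == 0.1 + 0.2)
print(fmul_via_fma(0.1, 0.3) == 0.1 * 0.3)
```

The results match the native operations, but in hardware each one still travels the whole multiply-add pipeline, which is the latency tax described above.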

9
Problem Statement
In order for the FMA unit to have a future in
processing and to continue the benefits of its
use, a new architecture that both reduces latency
and remains compatible with old applications must
be designed.
10
Previous Work: PowerPC 603e
11
Previous Work: HAL SPARC64

pseudo-FMA forwards finished multiplies
directly to the FPA
FMA data is rounded twice, hence pseudo

12
Previous Work: Lang/Bruguera

Combine addition/rounding stage
Critical path is through the LZA. Data waits at the
161-bit normalizer for the shift amount
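To make the normalizer's job concrete, the sketch below counts leading zeros in a 161-bit value and shifts them out. It computes the count directly from the sum; a real LZA instead predicts this count from the adder inputs in parallel with the addition, so this models only the result the normalizer waits for. WIDTH, leading_zeros, and normalize are hypothetical names.

```python
WIDTH = 161  # datapath width named on the slide

def leading_zeros(value: int, width: int = WIDTH) -> int:
    """Count zero bits above the most significant 1 in a width-bit value."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the datapath")
    return width - value.bit_length()

def normalize(value: int, width: int = WIDTH) -> tuple[int, int]:
    """Shift left until the leading 1 sits in the top bit; return (value, shift)."""
    shift = leading_zeros(value, width)
    return (value << shift) & ((1 << width) - 1), shift

normalized, shift = normalize(0b1011)
print(shift, normalized.bit_length())
```

The shift amount is also what the exponent must be reduced by, which is why the normalizer cannot start until the count is known.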

13
Previous Work: Seidel Multi-Path

Five cases, each with its own path
Speculatively compute in parallel.
Select path at the end based on correct exponent
difference
Stemmed from the dual-path FPA
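The dual-path FPA that this scheme stems from selects a path by exponent difference, which can be sketched as follows. This shows the textbook two-path adder rule (near path for effective subtraction with exponents at most one apart), not Seidel's five FMA cases, whose boundaries the slide does not list. The function name fpa_path and the treatment of zero operands are assumptions.

```python
import math

def fpa_path(a: float, b: float) -> str:
    """Pick the path a textbook dual-path adder would use for a + b."""
    if a == 0.0 or b == 0.0:
        return "far"  # assumption: specials are outside the slides' scope
    _, exp_a = math.frexp(a)   # frexp returns (mantissa, exponent)
    _, exp_b = math.frexp(b)
    effective_subtraction = (a < 0) != (b < 0)
    # Massive cancellation is only possible when the operands have opposite
    # signs and their exponents differ by at most one.
    if effective_subtraction and abs(exp_a - exp_b) <= 1:
        return "near"
    return "far"

print(fpa_path(1.0, -1.5), fpa_path(1.0, 1.5), fpa_path(1024.0, -1.0))
```

In hardware both paths run speculatively and the exponent difference only steers the final select, so the comparison is off the critical path.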

14
Previous Work: Seidel Multi-Path
15
Previous Work: Xiao

3-input LZA equations to speed up critical path
of Lang/Bruguera

16
Proposed Work: Three-Path FMA

Variation on the Seidel multiple-path suggestion,
reducing five cases to three paths
Uses a Lang/Bruguera improvement to combine
addition/rounding with a Xiao 3-input LZA for
near path
Architecture designed for reduced latency

17
(No Transcript)
18
Proposed Work: Bridged FMA

Variation of SPARC pseudo FMA.
Keep full multiplier and adder in FPU.
Bridge the two by re-using resources when FMA
instruction is called
Architecture designed for backwards compatibility
with legacy code

19
(No Transcript)
20
Proposed Work: Bridged, 3-Path FMA

Combination of three-path FMA and bridged FMA
Three-path for reduced latency
Bridge for re-use of components, backwards
compatibility
Only needs to share multiplier

21
(No Transcript)
22
Expected Results

3-path FMA architecture expected to be fastest,
lowest latency FMA to date
Bridged FMA expected to perform execution without
adding latency to multiplier/adder
3-path, bridged expected to provide both reduced
latency and hardware reuse, providing a full
execution hardware set option for future FPUs.

23
Implementation

All proposed hardware, as well as an RS/6000,
will be implemented for latency, area, and power
comparison
Implementation will be done using AMD 45 and 65nm
HSpice models, timing tools, floorplanning, power
estimation, routing, and parasitic extraction
Tool licensing agreement in writing. Tool use
will be considered a PR donation from AMD for
support of UT dissertation research.

24
Implementation Schedule
25
Conclusion

FMAs are a rising power in industrial level
computing, with several chips already putting
them to use
No reduced-latency improvements to the RS/6000
architecture have been adopted in the 16 years
since its introduction
Single add or multiply instructions currently
pay a latency tax in FMAs
Proposed three-path and bridged architectures
decrease FMA latency significantly and remain
single add/multiply compatible
Proposed architectures to be implemented on AMD
45nm/65nm technology to prove theorized gains