CGaAs PowerPC FXU presentation

1
CGaAs PowerPC FXU

Alan J. Drake, Todd D. Basso, Spencer M. Gold,
Keith L. Kraver, Phiroze N. Parakh, Claude R.
Gauthier, P. Sean Stetson, and Richard B. Brown
University of Michigan

2
Overview

CGaAs overview
PUMA project overview
Architectural studies
CMOS vs CGaAs implementations
Design methodology
Test results
Conclusions

3
Motivation for CGaAs

BiCMOS alternative
N and P transistors
Low power
High speed
Radiation hard microelectronics
Motorola CELESTRI
Over 1000 new LEO satellites in < 5 years

4
CGaAs Cross-section
(Figure: device cross-section showing the NFET and PFET. Labels: TiW Schottky gate; ohmic metal (Ni); oxygen isolation; n, p, and p- regions; 30 Å i-GaAs (labeled twice); 250 Å i-AlGaAs; 150 Å i-InGaAs channel; Si delta doping; 2000 Å i-GaAs; i-AlGaAs; LTG GaAs; LEC semi-insulating GaAs. Annotations: no gate or field oxide; no wells; fast recombination time. Source: Brown98.)
5
CGaAs Advantages

Radiation-hard
Total dose (10^8 rads)
SEU (10^-10 upsets/bit-day)
Latchup (10^12 rads)
Potential high-speed
HFET operation
Low-power
VDD 0.9 V to 1.5 V
Versatile
Complementary transistors
SCFL, P-DCFL, Domino, and Complementary Logic

6
CGaAs Disadvantages

High VT
?0.55 V
Low Idsat
Gate leakage
Coarse design rules
Large ohmic contacts
Conservative active spacing
LDD style P doping
Coarse gate metal and M1
N to P sneak paths
Low integration levels

7
History of GaAs at U of M

Aurora project
MIPS architecture
R2000 and R3000
Vitesse GaAs E/D MESFET technology
HGaAs II and HGaAs III
Aurora III (1995)
Dual-issue, 0.8 IPC
500 K transistors, 300 MHz, 40 W (simulated)

8
PUMA Project Goals

Develop CGaAs digital circuit techniques
Complementary and domino logic
Automated design tools
Optimize microarchitecture suitable for CGaAs
Low transistor integration level
500K to 1 million transistors
Evaluate CGaAs for digital VLSI applications
Speed
Power
Area

9
PUMA System Block Diagram
(Figure: system block diagram. Labels: PCI bus; PC board; 64 MB SDRAM; CMOS chip; PCI; MMU; 1 MB L2 i-cache; 1 MB L2 d-cache; CGaAs chip; CGaAs microprocessor; 16 KB L1 d-cache; MCM-D on ceramic. Bus widths are marked (128) and (32).)
10
Microarchitecture Optimizations

Fixed-Point Unit (FXU)
32-bit implementation
PUMA ISA
Subset of PowerPC ISA, integer instructions
GPR and essential supervisor registers
Limited exception model
Traditionally high-performance features
Cache
Instruction prefetching
Branch prediction
Superscalar execution

11
Baseline Microarchitecture
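(Figure: baseline block diagram, with the parameters studied for each block.)
L1-Icache: size, line size, associativity, miss penalty
Fetch: prefetch, branch prediction
RF
Decode
ROB: entries
FPU: rs, latency
BRU: rs, latency
ALU: rs, latency
LSU: rs, latency
L1-Dcache: size, line size, associativity, miss penalty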
12
Cache Optimization Parameters

Optimize cache parameters
Size
512 KB to 32 MB
Line width
4 to 16 bytes
Associativity
Direct-mapped to 8-way
Instruction pre-fetching
0 to 8 entry stream buffer
Latency
0 to 4 cycles
Optimize load/store
Load forwarding
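
As a rough illustration of the kind of parameter sweep this slide lists, the sketch below counts misses for a set-associative cache over an address trace while varying line width and associativity. The function name `miss_rate`, the LRU replacement policy, the synthetic trace, and the deliberately tiny cache size are all assumptions made for illustration; the slides do not describe the simulator that was actually used.

```python
from collections import OrderedDict

def miss_rate(trace, size_bytes, line_bytes, ways):
    """Return the miss rate of a set-associative LRU cache over a byte-address trace."""
    num_sets = size_bytes // (line_bytes * ways)
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        line = addr // line_bytes          # line (block) address
        s = sets[line % num_sets]          # select the set for this line
        if line in s:
            s.move_to_end(line)            # hit: mark as most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)      # evict the least recently used line
            s[line] = True
    return misses / len(trace)

# Hypothetical trace: two passes over 64 sequential word addresses.
trace = [4 * i for i in range(64)] * 2

for line_bytes in (4, 8, 16):
    for ways in (1, 2):
        print(line_bytes, ways, miss_rate(trace, 512, line_bytes, ways))
```

On this sequential trace, wider lines lower the miss rate because each miss brings in more neighboring words; a real study would replay traces from the target applications instead.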

13
Superscalar Optimization

Dependencies
Accurate branch prediction
ISA
System configuration
Application program
Implementation details
Modified Tomasulo algorithm
Reorder buffer
Parameters
Degree of superscalar
1 to 5 instructions wide
Reservation stations
1 to 6 stations
Reorder buffer size
2 to 32 entries

14
Branch Prediction

Two level prediction
Nine schemes analyzed
Single PHT
Global schemes
Per-set schemes
Per-address schemes
Custom simulation program
Branch level
Millions of instructions
Optimization criteria
Cost in bits
128 to 8192 bits
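
To make the global two-level idea concrete, here is a minimal sketch of a gshare-style predictor, the scheme a later slide lists at 256 B. It assumes that the 256 B figure is the pattern history table capacity, which would correspond to 1024 two-bit counters; the class name, the 10-bit history length, the counter initialization, and the loop workload are illustrative assumptions, not details taken from the slides.

```python
class GsharePredictor:
    """Global two-level predictor: PC bits XOR global history index 2-bit counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # 2-bit counters, weakly not-taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask   # drop word-offset bits of the PC

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2      # True means "predict taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, branches):
    """Fraction of (pc, taken) outcomes predicted correctly, training as it goes."""
    correct = 0
    for pc, taken in branches:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(branches)


# Hypothetical workload: one loop branch, taken 7 times then not taken, repeated.
loop = [(0x1000, True)] * 7 + [(0x1000, False)]
print(accuracy(GsharePredictor(), loop * 200))
```

The storage cost in bits, the optimization criterion named above, is simply the number of counters times two, plus the history register.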

(Figure label: 1st-level branch history register, BHR)
15
Branch Prediction
16
Instruction Translation

Compound PowerPC instructions
Multiple operations
Multiple source and destination registers
Unit-operations
One unit of computational time
Modify one architectural register
15% overhead in raw count
-2.8% IPC performance loss
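
A small sketch of what such a translation could look like, using the PowerPC load-with-update instruction `lwzu` as the compound example: it writes both the target register and the base register, so it is split into two operations that each write one register. The choice of `lwzu`, the `UnitOp` type, and the `translate` function are assumptions for illustration; the slides do not list the actual PUMA translation table.

```python
from typing import List, NamedTuple, Tuple

class UnitOp(NamedTuple):
    op: str                  # mnemonic of the simple operation
    dest: str                # the single architectural register written
    srcs: Tuple[str, ...]    # source registers
    imm: int = 0             # immediate or displacement

def translate(mnemonic: str, rd: str, ra: str, d: int) -> List[UnitOp]:
    """Split a D-form load into unit-operations that each write one register."""
    if mnemonic == "lwzu":
        # lwzu rD, d(rA): EA = rA + d; rD = MEM[EA]; rA = EA (two registers written)
        return [
            UnitOp("addi", ra, (ra,), d),   # rA = rA + d
            UnitOp("lwz", rd, (ra,), 0),    # rD = MEM[rA + 0]
        ]
    if mnemonic == "lwz":
        return [UnitOp("lwz", rd, (ra,), d)]   # already a single unit-operation
    raise NotImplementedError(mnemonic)

for uop in translate("lwzu", "r3", "r4", 8):
    print(uop)
```

Each extra unit-operation produced this way adds to the raw instruction count, which is the kind of overhead the slide quantifies.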

17
FXU Logical Pipeline

8-stage pipeline
X has 3-cycle latency for load and store instructions
Forwarding via CDB and ROB

(Figure: pipeline diagram with stages ICA, ICH, IB, D, S, X, W1, and W2, plus the reorder buffer and CDB.)
18
Proposed FXU Microarchitecture

500,000 transistors
Dual-issue OOO superscalar
2 reservation stations per functional unit
8-entry reorder buffer
1 KB on-chip icache / 16 KB off-chip dcache (3-cycle)
256 B gshare (87%)
0.76 IPC
27% improvement over pipeline
High efficiency (1.52 IPC/M-transistors)
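
The efficiency figure on this slide follows directly from the two numbers above it. A minimal check, with a hypothetical helper name:

```python
def ipc_per_million_transistors(ipc, transistors):
    """Architectural efficiency: instructions per cycle per million transistors."""
    return ipc / (transistors / 1e6)

# 0.76 IPC on a 500,000-transistor budget, as listed on the slide.
print(ipc_per_million_transistors(0.76, 500_000))  # 1.52
```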

19
CMOS FXU I Microprocessor

Purpose
Test architecture
Debug design tools
Characteristics
TSMC 0.35 µm 3M CMOS
830K transistors
9.9x9.9 mm (7.5x6.8 mm core)
Development
9 months
10 graduate students

20
CMOS FXU I Block Diagram
(Figure: block diagram. Labels: BP; BTB; Fetch; L1-Icache; Decode; MMU; RF; Dispatch; ROB; BRU; ALU; LSU; 1 KB; 128 bit; 32 bit; 4 KB.)
21
CGaAs FXU II Block Diagram
(Figure: block diagram. Labels: BP; BTB; Fetch; L1-Icache; Decode; MMU; RF; Dispatch; ROB; BRU; ALU; LSU; 1 KB; 128 bit; 32 bit; 16 KB.)
22
CGaAs FXU II Microprocessor

Characteristics
Motorola 0.5 µm 3M CGaAs
380K transistors
13.1x11.4 mm
Area I/O interface
16 KB Dcache
Performance
0.51 IPC
25 MHz
274 mW @ 1.3 V
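
The performance figures above can be combined into throughput and energy-per-instruction estimates. The sketch below is derived arithmetic only, not a result reported in the slides, and it assumes the 274 mW figure was measured at the same 25 MHz operating point.

```python
ipc = 0.51              # instructions per cycle (from the slide)
clock_hz = 25e6         # 25 MHz (from the slide)
power_w = 0.274         # 274 mW at 1.3 V (from the slide)
transistors = 380_000   # transistor count (from the slide)

instr_per_s = ipc * clock_hz                 # sustained instruction rate
mips = instr_per_s / 1e6
energy_nj = power_w / instr_per_s * 1e9      # energy per instruction, in nJ
ipc_per_mtr = ipc / (transistors / 1e6)      # same efficiency metric as slide 18

print(f"{mips:.2f} MIPS")
print(f"{energy_nj:.1f} nJ per instruction")
print(f"{ipc_per_mtr:.2f} IPC per million transistors")
```

Under that assumption the chip sustains about 12.75 MIPS at roughly 21.5 nJ per instruction.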

23
PUMA Design Flow
Behavioral RTL model

Mixed tool set
Custom tools
Verilog checker
Random Test Generator
RAM compiler
Commercial tools
Verilog/VCS
EPOCH
TDS
Mentor IC Station
HSpice
PowerPC simulator

(Figure: design flowchart. Labels: manual synthesis; verification; error (yes/no); gate-level RTL model; physical design; fabrication; testing.)
24
Design Verification and Testing
(Figure: verification and test flow. Labels: test program; Random Test Generator; PowerPC compiler; VCD output; RTL model; PowerPC instruction simulator; Checker; TDS; reg; mem; Standard Vectors; conversion script; Debug; Error (No/Yes); Test Vectors; HP82000.)
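
The flow on this slide pairs a random test generator with a checker that compares the RTL model against an instruction simulator. The sketch below shows that comparison idea on a toy three-instruction machine. The instruction set, the function names, and the injected bug are all hypothetical stand-ins; the real flow used Verilog models and a PowerPC simulator.

```python
import random

def random_program(rng, length=20, num_regs=4):
    """Toy random test generator: a list of (op, dest, src, src_or_immediate)."""
    prog = []
    for _ in range(length):
        op = rng.choice(["addi", "add", "and"])
        rd, ra = rng.randrange(num_regs), rng.randrange(num_regs)
        last = rng.randrange(-8, 8) if op == "addi" else rng.randrange(num_regs)
        prog.append((op, rd, ra, last))
    return prog

def execute(prog, and_op=lambda a, b: a & b, num_regs=4):
    """Run a program on 32-bit registers; and_op lets a faulty model be injected."""
    regs = [0] * num_regs
    for op, rd, ra, last in prog:
        if op == "addi":
            regs[rd] = (regs[ra] + last) & 0xFFFFFFFF
        elif op == "add":
            regs[rd] = (regs[ra] + regs[last]) & 0xFFFFFFFF
        else:
            regs[rd] = and_op(regs[ra], regs[last])
    return regs

def check(prog, dut_and_op):
    """Checker: indices of registers where the model under test differs from the reference."""
    expected = execute(prog)
    actual = execute(prog, and_op=dut_and_op)
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]

# Model under test with a deliberate bug: "and" behaves like "or".
for seed in range(50):
    mismatches = check(random_program(random.Random(seed)), lambda a, b: a | b)
    if mismatches:
        print("seed", seed, "mismatching registers", mismatches)
        break
```

Seeding the generator keeps every failing program reproducible, which matters when the same vectors are later replayed on a tester.
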
25
Physical Design Methodology
(Figure: physical design flow. Labels: Verification; Verilog; Logic Synthesis; Structural Mapping; Place and Route; Floorplanning; Parasitic Extraction; geodb; Timing; EPOCH; Chip layout; Chip netlist.)
26
Physical Design Methodology
(Figure: physical design flow, extended. Labels: Verification; Verilog; Logic Synthesis; Structural Mapping; Place and Route; Floorplanning; Parasitic Extraction; geodb; Timing; EPOCH; Parasitic Extraction; Chip layout; Chip netlist; Cell Check; Verification; Mentor IC; Power/Clock Distribution; DRC; LVS; Final Layout.)
27
Standard Cell Generation
HSpice

HSpice analog analysis
1 to 4 input gates
Optimized transistor sizes
Epoch Cell generation
Used CGaAs design rules
Basic CMOS cell generated
IC Station modification
Gate connection
DRC errors
Cell library
30 types of complementary gates
Drive strength
1x to 6x standard gates
1x to 64x buffers

(Figure: cell generation flow. Labels: EPOCH; geodb; Generate Cells; Masterport; Standard Cells (twice); IO Pads (gdsii); Optimize Cells; Mentor IC.)
28
Process-Independent RAM Compiler

Generates RAM macrocells
Objectives
RAM comparisons
Performance
Area
Multiple processes
Cost/Benefit Analysis
Optimize CGaAs embedded RAM
Components Created
CGaAs Icache
2 KB test RAM

29
RAM Compilation Methodology
(Figure: RAM compilation flow. Labels: Input Parameters; Process Description; RAM Configuration; user-specified target cycle time; Power; Delay; near-optimal power-delay curve; RAM layout; SPICE netlist.)
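
One way to read this flow: the compiler builds a power-delay curve from candidate designs and then picks the point that meets the user-specified target cycle time. The sketch below shows that selection step under stated assumptions; the function names, the dominance rule, and the candidate numbers are invented for illustration and do not come from the actual RAM compiler.

```python
def pareto_front(points):
    """Keep (delay, power) points that no faster-or-equal point beats on power."""
    front = []
    best_power = float("inf")
    for delay, power in sorted(points):      # increasing delay, then power
        if power < best_power:               # lower power than every faster point
            front.append((delay, power))
            best_power = power
    return front

def pick(points, target_cycle):
    """Lowest-power point on the front whose delay meets the target cycle time."""
    feasible = [p for p in pareto_front(points) if p[0] <= target_cycle]
    return min(feasible, key=lambda p: p[1]) if feasible else None

# Hypothetical candidates as (delay in ns, power in mW).
candidates = [(4.0, 90.0), (5.0, 60.0), (5.5, 70.0), (7.0, 40.0), (9.0, 45.0)]
print(pareto_front(candidates))
print(pick(candidates, target_cycle=6.0))
```

Returning `None` when no candidate meets the target makes an infeasible cycle time explicit instead of silently choosing a slower design.
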
30
Test Features

Test limitations
Transistor count
I/O pin count
Test block
Pad ring
32-bit inverter ring oscillator
32-bit nand ring oscillator
32-bit shift register
Clock tree output
Other features
Scan paths (FXU II only)
ALU, Decoder, Dispatch
Functionality Disables
Icache, Dcache, Exceptions, Pipeline

31
Test Flow

Test block
Disable caches and pipelining
Single instruction tests
Multiple instruction tests
Branch tests
Critical path
Enable pipeline
Enable instruction cache
Enable data cache
Power, frequency, voltage
Scan testing

32
FXU II Voltage-Frequency Plot
33
FXU II Voltage-Current Plot
34
Other Results

Problems
Caches
Data output
Sources
Gate leakage currents
Process parameters
µn,p, Idn,p, VTp
Pseudo-DCFL RAM Decoders

35
Conclusions

CGaAs
Flexible: Complementary, P-DCFL, Domino
Low power
Radiation hard
Immature process
CGaAs FXU II
25 MHz @ 1.3 V
First attempt in new process
Contributions
Cost-efficient microarchitecture
Technology independent RAM compiler
Portable verification/test environment
Future work
SOI PUMA

36
(No Transcript)
37
CGaAs vs. CMOS (Device)
38
CGaAs Delay Versus VDD
(Figure: CGaAs, TFSOI, and CMOS performance comparison, plotting delay against VDD (V). Source: Brown98.)
39
CGaAs Pseudo-DCFL

Active load p-type transistor
Ratioed
Poor VOL, noise margins
Static power dissipation
High speed
Cost-effective

40
CGaAs Complementary Logic

Dual networks
Moderate speed
No static power dissipation
Switching power
Good noise margins
Expensive
Some tool compatibility

41
CGaAs Domino Logic

Two-phase operation
High speed
Complex functions
Cost-effective
High power
Non-inverting

42
CGaAs Digital Logic Families
43
FXU II Path Distribution
44
Computation Efficiency
45
CGaAs vs. CMOS (Fan out)
46
CGaAs vs. CMOS (Geometry)

CMOS is metal limited (FXU I)
CGaAs is transistor limited (FXU II)

47
RAM Compiler Methodology
Input Parameters
Process Description
IC design rules
SPICE models
Sheet resistances
Parasitic capacitances
Electromigration rules
RAM Configuration
Capacity
Aspect ratio
Read/write config
Target cycle time
User-specified target cycle time
Power
Delay
Near-optimal power-delay curve
48
2 KB RAM Chip

Motorola 0.5 µm CGaAs
Test chip
Same design as FXU II cache
8 x 2048 bit RAM
Pseudo-DCFL decoder
