Title: CGaAs PowerPC FXU
1CGaAs PowerPC FXU
- Alan J. Drake, Todd D. Basso, Spencer M. Gold, Keith L. Kraver, Phiroze N. Parakh, Claude R. Gauthier, P. Sean Stetson, and Richard B. Brown
- University of Michigan
2Overview
- CGaAs overview
- PUMA project overview
- Architectural studies
- CMOS vs CGaAs implementations
- Design methodology
- Test results
- Conclusions
3Motivation for CGaAs
- BiCMOS alternative
- N and P transistors
- Low power
- High speed
- Radiation hard microelectronics
- Motorola CELESTRI
- Over 1000 new LEO satellites in < 5 years
4CGaAs Cross-section
- Figure: device cross-section showing a PFET and an NFET [Brown98]
- Gate and contacts: TiW Schottky gate, ohmic metal (Ni)
- No gate or field oxide
- Oxygen isolation, no wells
- Layer labels: 30 Å i-GaAs, 250 Å i-AlGaAs, 150 Å i-InGaAs channel, 30 Å i-GaAs, Si delta doping, 2000 Å i-GaAs, i-AlGaAs, LTG GaAs, LEC semi-insulating GaAs
- Doped-region labels: n, p, p-
- Fast recombination time
5CGaAs Advantages
- Radiation-hard
- Total dose (10^8 rads)
- SEU (10^-10 upsets/bit-day)
- Latchup (10^12 rads)
- Potential high-speed
- HFET operation
- Low-power
- VDD 0.9 V to 1.5 V
- Versatile
- Complementary transistors
- SCFL, P-DCFL, Domino, and Complementary Logic
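As a rough illustration of what a per-bit SEU rate means for an on-chip memory, the sketch below multiplies an upset rate by a bit count. It assumes the SEU figure on this slide reads 10^-10 upsets/bit-day and applies it to a 16 KB array purely as an example; the function name is hypothetical and the result is not a measured number from the project.

```python
def expected_upsets_per_day(bits: int, rate_per_bit_day: float) -> float:
    """Expected single-event upsets per day for a memory of `bits` bits."""
    return bits * rate_per_bit_day


# Assumed inputs: a 16 KB array and a rate of 1e-10 upsets/bit-day.
BITS_16KB = 16 * 1024 * 8
upsets_per_day = expected_upsets_per_day(BITS_16KB, 1e-10)

# Mean time between upsets, in years, for the whole array.
years_between_upsets = 1.0 / (upsets_per_day * 365.25)

print(f"{upsets_per_day:.3e} upsets/day")
print(f"about {years_between_upsets:.0f} years between upsets")
```

Under these assumptions the array would see on the order of one upset every couple of centuries.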
6CGaAs Disadvantages
- High VT
- ~0.55 V
- Low Idsat
- Gate leakage
- Coarse design rules
- Large ohmic contacts
- Conservative active spacing
- LDD style P doping
- Coarse gate metal and M1
- N to P sneak paths
- Low integration levels
7History of GaAs at U of M
- Aurora project
- MIPS architecture
- R2000 and R3000
- Vitesse GaAs E/D MESFET technology
- HGaAs II and HGaAs III
- Aurora III (1995)
- Dual-issue, 0.8 IPC
- 500 K transistors, 300 MHz, 40 W (simulated)
8PUMA Project Goals
- Develop CGaAs digital circuit techniques
- Complementary and domino logic
- Automated design tools
- Optimize microarchitecture suitable for CGaAs
- Low transistor integration level
- 500 K–1 million transistors
- Evaluate CGaAs for digital VLSI applications
- Speed
- Power
- Area
9PUMA System Block Diagram
- Figure: PUMA system block diagram
- Block labels: PCI bus, 64 MB SDRAM, PC board, PCI, CMOS chip, MMU, CGaAs chip, 1 MB L2 i-cache, 1 MB L2 d-cache, CGaAs uProcessor, 16 KB L1 d-cache, MCM-D on ceramic
- Bus widths labeled (128) and (32)
10Microarchitecture Optimizations
- Fixed-Point Unit (FXU)
- 32-bit implementation
- PUMA ISA
- Subset of PowerPC ISA, integer instructions
- GPR and essential supervisor registers
- Limited exception model
- Traditionally high-performance features
- Cache
- Instruction prefetching
- Branch prediction
- Superscalar execution
11Baseline Microarchitecture
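- Figure: baseline microarchitecture with its tunable parameters
- L1 I-cache: size, line size, associativity, miss penalty
- Fetch: prefetch, branch prediction
- RF
- Decode
- ROB: entries
- FPU, BRU, ALU, LSU: reservation stations (rs), latency
- L1 D-cache: size, line size, associativity, miss penalty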
12Cache Optimization Parameters
- Optimize cache parameters
- Size
- 512 KB–32 MB
- Line width
- 4–16 bytes
- Associativity
- Direct-mapped (DM) to 8-way
- Instruction prefetching
- 0–8 entry stream buffer
- Latency
- 0–4 cycles
- Optimize load/store
- Load forwarding
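The kind of sweep over size, line width, and associativity listed above can be sketched with a small trace-driven cache model. This is a minimal illustration, not the simulator used in the study: the `Cache` class, the LRU replacement policy, and the synthetic address trace are all assumptions.

```python
from collections import OrderedDict


class Cache:
    """Set-associative cache with LRU replacement (assumed policy)."""

    def __init__(self, size_bytes, line_bytes, ways):
        assert size_bytes % (line_bytes * ways) == 0
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)      # mark as most recently used
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)   # evict least recently used
        entries[tag] = True
        return False


def miss_rate(trace, size_bytes, line_bytes, ways):
    cache = Cache(size_bytes, line_bytes, ways)
    for addr in trace:
        cache.access(addr)
    return cache.misses / len(trace)


# Synthetic trace: two arrays 1 KB apart, accessed alternately, so they
# collide in a 1 KB direct-mapped cache but coexist in a 2-way cache.
trace = [base + 4 * i for _ in range(4) for i in range(8) for base in (0, 1024)]

for ways in (1, 2):
    print(f"{ways}-way: miss rate {miss_rate(trace, 1024, 16, ways):.4f}")
```

On this particular trace the direct-mapped configuration misses on every access, while the 2-way configuration only takes the four compulsory misses, which is the sort of trade-off an associativity sweep is meant to expose.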
13Superscalar Optimization
- Dependencies
- Accurate branch prediction
- ISA
- System configuration
- Application program
- Implementation details
- Modified Tomasulo algorithm
- Reorder buffer
- Parameters
- Degree of superscalar
- 1–5 instructions wide
- Reservation stations
- 1–6 stations
- Reorder buffer size
- 2–32 entries
14Branch Prediction
- Two level prediction
- Nine schemes analyzed
- Single PHT
- Global schemes
- Per-set schemes
- Per-address schemes
- Custom simulation program
- Branch level
- Millions of instructions
- Optimization criteria
- Cost in bits
- 128–8192 bits
- Figure label: 1st-level branch history register (BHR)
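A two-level scheme with a single global history register can be sketched as follows. This is only one illustrative variant, not necessarily any of the nine schemes analyzed: the 2-bit saturating counters, the class name, and the test pattern are assumptions. The `cost_bits` method shows how a "cost in bits" criterion could be computed for such a predictor.

```python
class TwoLevelGlobalPredictor:
    """Global branch history register (BHR) indexing one pattern history table (PHT)."""

    def __init__(self, history_bits):
        self.history_bits = history_bits
        self.mask = (1 << history_bits) - 1
        self.bhr = 0
        # 2-bit saturating counters, initialized to weakly not-taken (assumed).
        self.pht = [1] * (1 << history_bits)

    def predict(self):
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        counter = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask

    def cost_bits(self):
        # Two bits per PHT counter plus the history register itself.
        return 2 * len(self.pht) + self.history_bits


def accuracy(predictor, outcomes):
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)


# Synthetic loop-like pattern: taken three times, then not taken.
pattern = [True, True, True, False] * 250
predictor = TwoLevelGlobalPredictor(history_bits=4)
print("cost in bits:", predictor.cost_bits())
print("accuracy:", accuracy(predictor, pattern))
```

With four history bits the predictor costs 36 bits and, after a short warm-up, predicts this repeating pattern correctly, since each position in the pattern maps to a distinct history value.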
15Branch Prediction
16Instruction Translation
- Compound PowerPC instructions
- Multiple operations
- Multiple source and destination registers
- Unit-operations
- One unit of computational time
- Modify one architectural register
- 15% overhead in raw instruction count
- 2.8% IPC performance loss
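To make the idea of unit-operations concrete, the sketch below splits PowerPC update-form loads and stores, which write both a data register and the base register, into a plain memory operation followed by an `addi` on the base register. The `Instr` tuple, the `UPDATE_FORMS` table, and the choice of `lwzu`/`stwu` as examples are assumptions for illustration; the slides do not list which compound instructions were translated or how.

```python
from typing import List, NamedTuple


class Instr(NamedTuple):
    op: str
    rt: int   # data register (destination for loads, source for stores)
    ra: int   # base register
    imm: int  # displacement


# Hypothetical translation table: update form -> plain memory operation.
UPDATE_FORMS = {"lwzu": "lwz", "stwu": "stw"}


def translate(instr: Instr) -> List[Instr]:
    """Return unit-operations that each modify at most one register."""
    base_op = UPDATE_FORMS.get(instr.op)
    if base_op is None:
        return [instr]
    return [
        # The memory access uses the old base register value plus displacement.
        Instr(base_op, instr.rt, instr.ra, instr.imm),
        # The base register is then updated to the effective address.
        Instr("addi", instr.ra, instr.ra, instr.imm),
    ]


program = [
    Instr("stwu", 1, 1, -16),
    Instr("lwz", 3, 1, 8),
    Instr("lwzu", 4, 5, 4),
    Instr("addi", 3, 3, 1),
]
unit_ops = [u for instr in program for u in translate(instr)]
overhead = len(unit_ops) / len(program) - 1
print(f"{len(program)} instructions -> {len(unit_ops)} unit-operations "
      f"({overhead:.0%} overhead in this toy program)")
```

The overhead printed here depends entirely on the made-up instruction mix and is not comparable to the overhead figure on the slide.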
17FXU Logical Pipeline
- 8-stage pipeline
- X has 3-cycle latency for load and store instructions
- Forwarding via CDB and ROB
- Pipeline stage labels: ICA, ICH, IB, D, S, X, W1, W2
- Other figure labels: Reorder Buffer, CDB
18Proposed FXU Microarchitecture
- 500,000 transistors
- Dual-issue OOO superscalar
- 2 reservation stations per functional unit
- 8-entry reorder buffer
- 1 KB on-chip icache / 16 KB off-chip dcache (3-cycle)
- 256 B gshare (87%)
- 0.76 IPC
- 27% improvement over pipeline
- High efficiency (1.52 IPC/M-transistors)
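The efficiency figure above follows directly from the IPC and transistor count on this slide, as the short check below shows. The function name is hypothetical, and the implied baseline IPC assumes the "27" on the slide is a percentage improvement in IPC over the simple pipeline.

```python
def ipc_per_million_transistors(ipc: float, transistors: int) -> float:
    """Computation efficiency: IPC divided by transistor count in millions."""
    return ipc / (transistors / 1e6)


efficiency = ipc_per_million_transistors(0.76, 500_000)
print(f"{efficiency:.2f} IPC per million transistors")

# Assumption: 0.76 IPC is a 27% improvement over the pipelined baseline.
implied_pipeline_ipc = 0.76 / 1.27
print(f"implied pipeline IPC: {implied_pipeline_ipc:.2f}")
```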
19CMOS FXU I Microprocessor
- Purpose
- Test architecture
- Debug design tools
- Characteristics
- TSMC 0.35 µm 3M CMOS
- 830K transistors
- 9.9 × 9.9 mm (7.5 × 6.8 mm core)
- Development
- 9 months
- 10 graduate students
20CMOS FXU I Block Diagram
- Figure: CMOS FXU I block diagram
- Block labels: BP, BTB, Fetch, L1-Icache (1 KB), Decode, MMU, RF, Dispatch, ROB, BRU, ALU, LSU
- Other labels: 128 bit, 32 bit, 4 KB
21CGaAs FXU II Block Diagram
- Figure: CGaAs FXU II block diagram
- Block labels: BP, BTB, Fetch, L1-Icache (1 KB), Decode, MMU, RF, Dispatch, ROB, BRU, ALU, LSU
- Other labels: 128 bit, 32 bit, 16 KB
22CGaAs FXU II Microprocessor
- Characteristics
- Motorola 0.5 µm 3M CGaAs
- 380K transistors
- 13.1 × 11.4 mm
- Area I/O interface
- 16 KB Dcache
- Performance
- 0.51 IPC
- 25 MHz
- 274 mW @ 1.3 V
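A few derived quantities follow from the measured figures on this slide. The sketch below computes instruction throughput, average energy per instruction, and average supply current from the reported IPC, clock frequency, power, and supply voltage. It assumes the 0.51 IPC applies at the 25 MHz operating point and that the 274 mW is total chip power; the function names are hypothetical.

```python
def instructions_per_second(freq_hz: float, ipc: float) -> float:
    return freq_hz * ipc


def energy_per_instruction_nj(power_w: float, freq_hz: float, ipc: float) -> float:
    """Average energy per completed instruction, in nanojoules."""
    return power_w / instructions_per_second(freq_hz, ipc) * 1e9


def supply_current_ma(power_w: float, vdd_v: float) -> float:
    """Average supply current in milliamps, from P = V * I."""
    return power_w / vdd_v * 1e3


POWER_W, FREQ_HZ, IPC, VDD_V = 0.274, 25e6, 0.51, 1.3

print(f"{instructions_per_second(FREQ_HZ, IPC) / 1e6:.2f} M instructions/s")
print(f"{energy_per_instruction_nj(POWER_W, FREQ_HZ, IPC):.1f} nJ per instruction")
print(f"{supply_current_ma(POWER_W, VDD_V):.0f} mA average supply current")
```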
23PUMA Design Flow
- Mixed tool set
- Custom tools
- Verilog checker
- Random Test Generator
- RAM compiler
- Commercial tools
- Verilog/VCS
- EPOCH
- TDS
- Mentor IC Station
- HSpice
- PowerPC simulator
- Flowchart labels: Behavioral RTL model, Manual synthesis, Verification, Error (Yes/No), Gate-level RTL model, Physical design, Fabrication, Testing
24Design Verification and Testing
- Figure: design verification and testing flow
- Flowchart labels: Test program, Random Test Generator, PowerPC compiler, RTL model, PowerPC instruction simulator, VCD output, Checker (reg, mem), TDS, Standard Vectors, Conversion script, Debug, Error (Yes/No), Test Vectors, HP82000
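The generator-plus-checker arrangement in this flow can be sketched as a comparison between a reference model and a design under test on randomly generated programs. Everything below is a toy stand-in: the three-instruction ISA, the `run` model, and the injected bug are hypothetical and are not the project's Random Test Generator, PowerPC simulator, or checker.

```python
import random

NUM_REGS = 8
MASK = 0xFFFFFFFF  # 32-bit registers


def random_program(rng, length):
    """Generate a random sequence of (op, rd, ra, src) tuples."""
    program = []
    for _ in range(length):
        op = rng.choice(("add", "sub", "addi"))
        rd, ra = rng.randrange(NUM_REGS), rng.randrange(NUM_REGS)
        src = rng.randrange(-128, 128) if op == "addi" else rng.randrange(NUM_REGS)
        program.append((op, rd, ra, src))
    return program


def run(program, sub_is_broken=False):
    """Toy instruction simulator; sub_is_broken injects a deliberate bug."""
    regs = [0] * NUM_REGS
    for op, rd, ra, src in program:
        if op == "addi":
            value = regs[ra] + src
        elif op == "add" or sub_is_broken:
            value = regs[ra] + regs[src]
        else:
            value = regs[ra] - regs[src]
        regs[rd] = value & MASK
    return regs


def check(program, dut):
    """Compare final register state of the DUT against the reference model."""
    expected, actual = run(program), dut(program)
    return [(i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


def buggy_dut(program):
    return run(program, sub_is_broken=True)


rng = random.Random(1)
failures = sum(bool(check(random_program(rng, 20), buggy_dut)) for _ in range(100))
print(f"{failures} of 100 random programs exposed the injected bug")
```

A directed three-instruction program such as `addi r1`, `addi r2`, `sub r3` exposes the same bug deterministically, which mirrors the split between standard vectors and randomly generated ones.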
25Physical Design Methodology
- Figure: physical design flow (first build of the diagram)
- Flow labels: Verification, Verilog, Logic Synthesis, Structural Mapping, Place and Route, Floorplanning, Parasitic Extraction, geodb, Timing, EPOCH, Chip layout, Chip netlist
26Physical Design Methodology
- Figure: physical design flow (complete diagram)
- Flow labels: Verification, Verilog, Logic Synthesis, Structural Mapping, Place and Route, Floorplanning, Parasitic Extraction, geodb, Timing, EPOCH, Chip layout, Chip netlist, Cell Check, Mentor IC, Power/Clock Distribution, DRC, LVS, Final Layout
27Standard Cell Generation
- HSpice analog analysis
- 1–4 input gates
- Optimized transistor sizes
- Epoch Cell generation
- Used CGaAs design rules
- Basic CMOS cell generated
- IC Station modification
- Gate connection
- DRC errors
- Cell library
- 30 types of complementary gates
- Drive strength
- 1–6x standard gates
- 1–64x buffers
- Flow labels: HSpice, EPOCH, geodb, Generate Cells, Masterport, Standard Cells, IO Pads (gdsii), Optimize Cells, Mentor IC
28Process-Independent RAM Compiler
- Generates RAM macrocells
- Objectives
- RAM comparisons
- Performance
- Area
- Multiple processes
- Cost/Benefit Analysis
- Optimize CGaAs embedded RAM
- Components Created
- CGaAs Icache
- 2 KB test RAM
29RAM Compilation Methodology
- Input parameters: process description, RAM configuration
- User-specified target cycle time
- Figure: near-optimal power-delay curve (axes: power, delay)
- Outputs: RAM layout, SPICE netlist
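One way to read the power-delay step is as a selection problem: among candidate RAM configurations, keep those on the power-delay trade-off curve and pick the lowest-power one that meets the target cycle time. The sketch below illustrates that reading only. The function names and the candidate numbers are made up, and the actual compiler's optimization method is not described in the slides.

```python
def pareto_front(points):
    """Keep (delay, power) points that no faster-or-equal point beats on power."""
    front = []
    best_power = float("inf")
    for delay, power in sorted(points):
        if power < best_power:
            front.append((delay, power))
            best_power = power
    return front


def pick_design(points, target_cycle_time):
    """Lowest-power point on the front whose delay meets the target, or None."""
    feasible = [p for p in pareto_front(points) if p[0] <= target_cycle_time]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p[1])


# Made-up candidates as (delay in ns, power in mW), for illustration only.
candidates = [(2.0, 90.0), (2.5, 60.0), (3.0, 65.0), (4.0, 40.0), (5.0, 38.0)]

print("front:", pareto_front(candidates))
print("choice for 4.5 ns target:", pick_design(candidates, 4.5))
```

The (3.0, 65.0) candidate drops out because a faster design already uses less power, and a target that no candidate can meet returns `None` rather than a best-effort choice.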
30Test Features
- Test limitations
- Transistor count
- I/O pin count
- Test block
- Pad ring
- 32-bit inverter ring oscillator
- 32-bit NAND ring oscillator
- 32-bit shift register
- Clock tree output
- Other features
- Scan paths (FXU II only)
- ALU, Decoder, Dispatch
- Functionality Disables
- Icache, Dcache, Exceptions, Pipeline
31Test Flow
- Test block
- Disable caches and pipelining
- Single instruction tests
- Multiple instruction tests
- Branch tests
- Critical path
- Enable pipeline
- Enable instruction cache
- Enable data cache
- Power, frequency, voltage
- Scan testing
32FXU II Voltage–Frequency Plot
33FXU II Voltage–Current Plot
34Other Results
- Problems
- Caches
- Data output
- Sources
- Gate leakage currents
- Process parameters
- µn,p, Idn,p, VTp
- Pseudo-DCFL RAM Decoders
35Conclusions
- CGaAs
- Flexible: Complementary, P-DCFL, Domino
- Low power
- Radiation hard
- Immature process
- CGaAs FXU II
- 25 MHz @ 1.3 V
- First attempt in new process
- Contributions
- Cost-efficient microarchitecture
- Technology independent RAM compiler
- Portable verification/test environment
- Future work
- SOI PUMA
37CGaAs V. CMOS (Device)
38CGaAs Delay Versus VDD
- Figure: CGaAs, TFSOI, and CMOS performance comparison, delay versus VDD (V) [Brown98]
39CGaAs Pseudo-DCFL
- Active load p-type transistor
- Ratioed
- Poor VOL, noise margins
- Static power dissipation
- High speed
- Cost-effective
40CGaAs Complementary Logic
- Dual networks
- Moderate speed
- No static power dissipation
- Switching power
- Good noise margins
- Expensive
- Some tool compatibility
41CGaAs Domino Logic
- Two-phase operation
- High speed
- Complex functions
- Cost-effective
- High power
- Non-inverting
42CGaAs Digital Logic Families
43FXU II Path Distribution
44Computation Efficiency
45CGaAs V. CMOS (Fan out)
46CGaAs V. CMOS (Geometry)
- CMOS is metal limited (FXU I)
- CGaAs is transistor limited (FXU II)
47RAM Compiler Methodology
- Input parameters
- Process description: IC design rules, SPICE models, sheet resistances, parasitic capacitances, electromigration rules
- RAM configuration: capacity, aspect ratio, read/write config, target cycle time
- User-specified target cycle time
- Figure: near-optimal power-delay curve (axes: power, delay)
482 KB RAM Chip
- Motorola 0.5 µm CGaAs
- Test chip
- Same design as FXU II cache
- 8 x 2048 bit RAM
- Pseudo-DCFL decoder