Cell Broadband Processor presentation

Title: Cell Broadband Processor

1
Cell Broadband Processor

Daniel Bagley
Meng Tan

2
Agenda

General Intro
History of development
Technical overview of architecture
Detailed technical discussion of components
Design choices
Other processors like the cell
Programming for the cell

3
History of Development

Sony PlayStation 2
Announced March 1999
Released March 2000 in Japan
128-bit Emotion Engine
294 MHz MIPS CPU
Single-precision FP optimizations
6.2 GFLOPS

4
History Continued

Partnership between Sony, Toshiba, IBM
Summer 2000: high-level development talks
Initial goal of 1000x PS2 performance
March 2001: Sony-IBM-Toshiba design center opened
$400 million investment

5
Overall Goals for Cell

High performance in multimedia apps
Real time performance
Power consumption
Cost
Available by 2005
Avoid memory latency issues associated with control structures

6
The Cell itself

Power PC based main core (PPE)
Multiple SPEs
On die memory controller
Inter-core transport bus
High speed IO

7
Cell Die Layout
8
Cell Implementation

Cell is an architecture
Preliminary PS3 Implementation
1 PPE
8 SPEs (1 disabled to improve yield, leaving 7 usable)
221 mm² die size on a 90 nm process
Clocked at 3-4 GHz
256 GFLOPS single precision at 4 GHz
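
The single-precision peak on this slide can be reproduced with simple arithmetic. The sketch below assumes the figure counts eight SPEs, each completing 8 single-precision operations per cycle (the per-SPE rate quoted on the SPE Execution slide). The function name is illustrative only, and the PPE's contribution is ignored.

```python
def peak_gflops(num_spes: int, flops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak in GFLOPS: units x operations per cycle x clock in GHz."""
    return num_spes * flops_per_cycle * clock_ghz


# Assumption: all 8 SPEs on the die are counted, 8 single-precision ops each.
full_die_at_4ghz = peak_gflops(8, 8, 4.0)

# With one SPE disabled and the low end of the quoted clock range.
seven_spes_at_3ghz = peak_gflops(7, 8, 3.0)

print(full_die_at_4ghz)    # 256.0
print(seven_spes_at_3ghz)  # 168.0
```

These are upper bounds that assume every SPE issues a full-width operation on every cycle.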

9
Why a Cell Architecture

Follows a trend in computing architecture
Natural extension of dual and multi-core
Extremely low hardware overhead
Software controllable
Specialized hardware more useful for multimedia

10
Possible Uses

PlayStation 3 (obviously)
Blade servers (IBM)
Amazing single precision FP performance
Scientific applications
Toshiba HDTV products

11
Power Processing Element

PowerPC instruction set with AltiVec
Used for general-purpose computing and controlling SPEs
Simultaneous multithreading
Separate 32 KB L1 caches and unified 512 KB L2 cache

12
PPE (cont.)

Slow but power-efficient PowerPC instruction set implementation
Two-issue, in-order instruction fetch
Conspicuous lack of instruction window
Compare to conventional PowerPC implementations (G5)
Performance depends on SPE utilization

13
Synergistic Processing Element (SPE)

Specialized hardware
Meant to be used in parallel
(7 on PS3 implementation)
On-chip memory (256 KB)
No branch prediction
In-order execution
Dual issue

14
SPE Architecture

0.99 µm² on 90 nm process
128 registers (128 bits wide)
Instructions assume operands of 4 x 32-bit words
Variant of VMX instruction set
Modified for 128 registers
On chip memory is NOT a cache

15
SPE Execution

Dual issue, in-order
Seven execution units
Vector logic
8 single precision operations per cycle
Significant performance hit for double precision
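
One common reading of the "8 operations per cycle" figure is a fused multiply-add applied across four 32-bit lanes: four multiplies plus four adds. The sketch below models that lane-wise operation on plain Python lists. It is an illustration of the counting, not a description of the actual SPE instruction encoding, and `simd_fma` is a hypothetical name.

```python
LANES = 4  # four 32-bit values per 128-bit register


def simd_fma(a, b, c):
    """Lane-wise a * b + c over four lanes (4 multiplies + 4 adds)."""
    if not (len(a) == len(b) == len(c) == LANES):
        raise ValueError("each operand must hold exactly four lanes")
    return [x * y + z for x, y, z in zip(a, b, c)]


result = simd_fma([1.0, 2.0, 3.0, 4.0],
                  [2.0, 2.0, 2.0, 2.0],
                  [0.5, 0.5, 0.5, 0.5])
ops_per_instruction = LANES * 2  # one multiply and one add per lane

print(result)               # [2.5, 4.5, 6.5, 8.5]
print(ops_per_instruction)  # 8
```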

16
SPE Execution Diagram
17
SPE Local Storage Area

NOT a cache
256 KB, 4 x 64 KB ECC single-port SRAM
Completely private to each SPE
Directly addressable by software
Can be used as a cache, but only with software controls
No tag bits, or any extra hardware
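
To make "a cache, but only with software controls" concrete, here is a minimal sketch of a direct-mapped cache whose tags are ordinary program data rather than hardware state. The line size, the number of lines and the class name are assumptions made for illustration, and a slice copy stands in for the DMA transfer a real SPE program would issue on a miss.

```python
LINE_SIZE = 128  # bytes per line (assumed for illustration)
NUM_LINES = 64   # lines set aside in the local store (assumed)


class SoftwareCache:
    """Direct-mapped cache managed entirely by software."""

    def __init__(self, main_memory: bytes):
        self.main_memory = main_memory
        self.tags = [None] * NUM_LINES   # tags live in software, not hardware
        self.lines = [b""] * NUM_LINES
        self.misses = 0

    def read_byte(self, address: int) -> int:
        line_number = address // LINE_SIZE
        slot = line_number % NUM_LINES
        if self.tags[slot] != line_number:
            # Miss: this copy stands in for a DMA from main memory.
            start = line_number * LINE_SIZE
            self.lines[slot] = self.main_memory[start:start + LINE_SIZE]
            self.tags[slot] = line_number
            self.misses += 1
        return self.lines[slot][address % LINE_SIZE]


memory = bytes(range(256)) * 64      # 16 KB of sample "main memory"
cache = SoftwareCache(memory)
first = cache.read_byte(300)         # miss, fills the line
second = cache.read_byte(301)        # hit in the same line
print(first, second, cache.misses)   # 44 45 1
```

Every lookup costs ordinary instructions, which is the price of having no tag hardware.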

18
SPE LS Scheduling

Software controlled DMA
DMA to and from main memory
Scheduling a HUGE problem
Done primarily in software
IBM predicts 80-90% usage ideally
Request queue handles 16 simultaneous requests
Up to 16 KB per transfer
Priority order: DMA, load/store, instruction fetch
Fetch / execute parallelism
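
The scheduling burden described on this slide usually comes down to two chores: splitting a large transfer into requests of at most 16 KB, and overlapping the fetch of the next block with work on the current one (double buffering). The sketch below simulates both in plain Python. The function names are hypothetical, and the "fetch" is an ordinary slice copy, so nothing actually runs concurrently here.

```python
MAX_DMA_BYTES = 16 * 1024  # largest single transfer, per the slide
QUEUE_DEPTH = 16           # simultaneous requests, per the slide


def split_into_dma_requests(total_bytes: int):
    """Return (offset, size) pairs, each no larger than MAX_DMA_BYTES."""
    return [(offset, min(MAX_DMA_BYTES, total_bytes - offset))
            for offset in range(0, total_bytes, MAX_DMA_BYTES)]


def double_buffered_sum(data: bytes) -> int:
    """Sum all bytes, always requesting the next block before using this one."""
    requests = split_into_dma_requests(len(data))
    if not requests:
        return 0

    def fetch(request):
        offset, size = request
        return data[offset:offset + size]  # stands in for a DMA transfer

    total = 0
    current = fetch(requests[0])
    for next_request in requests[1:]:
        incoming = fetch(next_request)  # on hardware this would be in flight
        total += sum(current)           # while this computation runs
        current = incoming
    total += sum(current)
    return total


requests = split_into_dma_requests(40000)
one_megabyte = split_into_dma_requests(1024 * 1024)
queue_fills = len(one_megabyte) // QUEUE_DEPTH

print(requests)                                 # [(0, 16384), (16384, 16384), (32768, 7232)]
print(len(one_megabyte), queue_fills)           # 64 4
print(double_buffered_sum(bytes([1]) * 40000))  # 40000
```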

19
SPE Control Logic

Very little in comparison
Represents shift in focus
Complete lack of branch prediction
Software branch prediction
Loop unrolling
18-cycle penalty for a mispredicted branch
Software controlled DMA
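
Without hardware branch prediction, the programmer or compiler reduces the number of branches executed. Two standard techniques are sketched below under that assumption: unrolling a loop so that one loop branch covers four elements, and replacing an if/else with a mask-based select. Both functions are illustrative Python, not SPE code.

```python
def sum_unrolled_by_four(values):
    """Sum a sequence with the loop body unrolled four times."""
    totals = [0, 0, 0, 0]                # four independent accumulators
    limit = len(values) - len(values) % 4
    for i in range(0, limit, 4):         # one loop branch per four elements
        totals[0] += values[i]
        totals[1] += values[i + 1]
        totals[2] += values[i + 2]
        totals[3] += values[i + 3]
    for i in range(limit, len(values)):  # leftover elements
        totals[0] += values[i]
    return sum(totals)


def branch_free_max(a: int, b: int) -> int:
    """Pick the larger integer using a bit mask instead of an if/else."""
    mask = -(a > b)                      # -1 (all ones) if a > b, else 0
    return (a & mask) | (b & ~mask)


print(sum_unrolled_by_four(list(range(10))))  # 45
print(branch_free_max(3, 7))                  # 7
print(branch_free_max(-5, -9))                # -5
```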

20
SPE Pipeline

Little ILP, and thus little control logic
Dual issue
Simple commit unit (no reorder buffer or other complexities)
Same execution unit for FP/int

21
SPE Summary

Essentially small vector computer
Based on AltiVec/VMX ISA
Extensions for DMA and LS management
Extended for a 128 x 128-bit register file
Uniquely suited for real time applications
Extremely fast for certain FP operations
Offloads a large amount of work onto the compiler / software

22
Element Interconnect Bus

4 concentric rings connecting all Cell elements
128-bit wide interconnects

23
EIB (cont.)

Designed to minimize coupling noise
Rings of data traveling in alternating directions
Buffers and repeaters at each SPE boundary
Architecture can be scaled up with increased bus latency

24
EIB (cont.)

Total bandwidth of 200 GB/s
EIB controller located physically in center of chip between SPEs
Controller reserves channels for each individual data transfer request
Implementation allows for SPE extension horizontally

25
Memory Interface

Rambus XDR memory to keep Cell at full utilization
3.2 Gbps data rate per pin on each device connected to the XDR interface
Cell uses dual-channel XDR with four devices and 16-bit-wide buses to achieve 25.6 GB/s total memory bandwidth
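
The aggregate memory bandwidth follows from multiplying out the quoted figures. The sketch below assumes the 3.2 Gbps rate applies to each data pin, so four devices with 16-bit buses give 64 data pins in total. The function name is illustrative.

```python
def xdr_bandwidth_gb_per_s(rate_gbps_per_pin: float,
                           bus_width_bits: int,
                           devices: int) -> float:
    """Aggregate bandwidth in GB/s from per-pin rate, bus width and device count."""
    total_gbps = rate_gbps_per_pin * bus_width_bits * devices
    return total_gbps / 8  # 8 bits per byte


per_device = xdr_bandwidth_gb_per_s(3.2, 16, 1)
total = xdr_bandwidth_gb_per_s(3.2, 16, 4)

print(per_device)  # 6.4
print(total)       # 25.6
```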

26
Input / Output Bus

Rambus FlexIO Bus
IO interface consists of 12 unidirectional byte lanes
Each lane supports 6.4 GB/s bandwidth
7 outbound lanes and 5 inbound lanes
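
Taking the slide's per-lane figure at face value, the directional and total I/O bandwidth can be tallied as below. This is plain arithmetic on the numbers quoted above, and the variable names are illustrative.

```python
LANE_GB_PER_S = 6.4   # per-lane figure quoted on the slide
OUTBOUND_LANES = 7
INBOUND_LANES = 5

# round() hides binary floating-point noise in the products.
outbound = round(OUTBOUND_LANES * LANE_GB_PER_S, 1)
inbound = round(INBOUND_LANES * LANE_GB_PER_S, 1)
total = round((OUTBOUND_LANES + INBOUND_LANES) * LANE_GB_PER_S, 1)

print(outbound, inbound, total)  # 44.8 32.0 76.8
```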

27
Design Choices

In-order execution
Abandoning ILP
ILP gains of only 10-20% per generation
Reducing control logic
Real time responsiveness
Cache Design
Software configuration on SPE
Standard L2 cache on PPE

28
Cell Programming Issues

No Cell compiler in existence to manage utilization of SPEs at compile time
SPEs do not natively support context switching; it must be OS managed.
SPEs are vector processors, not efficient for general-purpose computation.
PPEs and SPEs use different instruction sets.

29
Cell Programming (cont.)

Functional Offload Model
Simplest model for Cell programming
Optimize existing libraries for SPE computation
Requires no rebuild of main application logic, which runs on PPE

30
Cell Programming (cont.)

Device Extension Model
Take advantage of SPE DMA
Use SPEs as interfaces to external devices

31
Cell Programming (cont.)

Computational Acceleration Model
Traditional super-computing methods using Cell
Shared-memory or message-passing paradigm for accelerating inherently parallel math operations
Can replace compute-intensive math libraries without rewriting applications
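
As a rough sketch of this model, the example below splits a dot product into chunks, hands each chunk to a worker, and combines the partial results, while the caller sees an ordinary library function. Threads from Python's standard library stand in for SPEs. The worker count and function names are assumptions for illustration, and Python threads will not actually speed up this arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_dot(chunk):
    """Dot product of one chunk: the work a single worker would do."""
    xs, ys = chunk
    return sum(x * y for x, y in zip(xs, ys))


def parallel_dot(xs, ys, workers: int = 6):
    """Drop-in dot product that farms chunks out to a pool of workers."""
    size = max(1, -(-len(xs) // workers))  # ceiling division
    chunks = [(xs[i:i + size], ys[i:i + size])
              for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_dot, chunks))


print(parallel_dot(list(range(100)), [2] * 100))  # 9900
```

The calling code never sees the partitioning, which is the sense in which the application itself needs no rewrite.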

32
Cell Programming (cont.)

Streaming model
Use Cell processor as one large programmable pipeline
Partition algorithms into logically sensible steps. Execute each separately, in serial, on separate processors.
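
The streaming model can be sketched as a chain of stages, each consuming the previous stage's output. In the example below, Python generators stand in for stages that would each run on their own SPE. The three stages are made up for illustration, and everything here runs on a single thread.

```python
def decode(samples):
    """Stage 1: scale each incoming sample."""
    for sample in samples:
        yield sample * 2


def drop_multiples_of_three(samples):
    """Stage 2: discard samples divisible by three."""
    for sample in samples:
        if sample % 3 != 0:
            yield sample


def running_total(samples):
    """Stage 3: emit the running sum of everything seen so far."""
    total = 0
    for sample in samples:
        total += sample
        yield total


# Data flows through the stages one item at a time, in order.
pipeline = running_total(drop_multiples_of_three(decode(range(6))))
output = list(pipeline)
print(output)  # [2, 6, 14, 24]
```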
