Cell Broadband Processor

1
Cell Broadband Processor
  • Daniel Bagley
  • Meng Tan

2
Agenda
  • General Intro
  • History of development
  • Technical overview of architecture
  • Detailed technical discussion of components
  • Design choices
  • Other processors like the Cell
  • Programming for the Cell

3
History of Development
  • Sony PlayStation 2
  • Announced March 1999
  • Released March 2000 in Japan
  • 128-bit Emotion Engine
  • 294 MHz MIPS CPU
  • Single precision FP optimizations
  • 6.2 GFLOPS

4
History Continued
  • Partnership between Sony, Toshiba, IBM
  • Summer of 2000 High level development talks
  • Initial goal of 1000x PS2 Power
  • March 2001, Sony-IBM-Toshiba design center opened
  • $400 million investment

5
Overall Goals for Cell
  • High performance in multimedia apps
  • Real time performance
  • Power consumption
  • Cost
  • Available by 2005
  • Avoid memory latency issues associated with
    control structures

6
The Cell itself
  • PowerPC-based main core (PPE)
  • Multiple SPEs
  • On die memory controller
  • Inter-core transport bus
  • High speed IO

7
Cell Die Layout

8
Cell Implementation
  • Cell is an architecture
  • Preliminary PS3 Implementation
  • 1 PPE
  • 8 SPEs (1 disabled to increase yield, 7 usable)
  • 221 mm² die size on a 90 nm process
  • Clocked at 3-4 GHz
  • 256 GFLOPS single precision at 4 GHz
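
The 256 GFLOPS figure can be reproduced with simple arithmetic. This is a minimal sketch, assuming the peak counts eight SPEs (including the one disabled on PS3) and the 8 single precision operations per cycle quoted on the SPE Execution slide. It ignores any contribution from the PPE, and the function name and the 3.2 GHz second case are illustrative assumptions, not figures from the slides.

```python
def peak_gflops(num_spes, flops_per_cycle, clock_ghz):
    """Peak throughput in GFLOPS: units x operations per cycle x clock (GHz)."""
    return num_spes * flops_per_cycle * clock_ghz


# Assumption: 8 SPEs, 8 single precision operations per cycle each, 4 GHz.
full_chip = peak_gflops(8, 8, 4.0)

# Assumption: the same formula with 7 usable SPEs at a hypothetical 3.2 GHz.
seven_spes = peak_gflops(7, 8, 3.2)

print(f"8 SPEs at 4.0 GHz: {full_chip:.1f} GFLOPS")
print(f"7 SPEs at 3.2 GHz: {seven_spes:.1f} GFLOPS")
```

Peak numbers like these assume every SPE issues a full-width operation on every cycle, which real workloads rarely sustain.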

9
Why a Cell Architecture
  • Follows a trend in computing architecture
  • Natural extension of dual and multi-core
  • Extremely low hardware overhead
  • Software controllable
  • Specialized hardware more useful for multimedia

10
Possible Uses
  • PlayStation 3 (obviously)
  • Blade servers (IBM)
  • Amazing single precision FP performance
  • Scientific applications
  • Toshiba HDTV products

11
Power Processing Element
  • PowerPC instruction set with AltiVec
  • Used for general purpose computing and
    controlling SPEs
  • Simultaneous Multithreading
  • Separate 32 KB L1 Caches and unified 512 KB L2
    Cache

12
PPE (cont.)
  • Slow but power efficient PowerPC instruction set
    implementation
  • Two-issue, in-order instruction fetch
  • Conspicuous lack of instruction window
  • Compare to conventional PowerPC implementations
    (G5)
  • Performance depends on SPE utilization

13
Synergistic Processing Element (SPE)
  • Specialized hardware
  • Meant to be used in parallel
  • (7 on PS3 implementation)
  • On-chip memory (256 KB)
  • No branch prediction
  • In-order execution
  • Dual issue

14
SPE Architecture
  • 0.99 µm² on 90 nm process
  • 128 registers (128 bits wide)
  • Instructions assumed to be 4 x 32-bit
  • Variant of VMX instruction set
  • Modified for 128 registers
  • On chip memory is NOT a cache

15
SPE Execution
  • Dual issue, in-order
  • Seven execution units
  • Vector logic
  • 8 single precision operations per cycle
  • Significant performance hit for double precision
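
One way to read "8 single precision operations per cycle" is a four-lane fused multiply-add, which counts as 4 multiplies plus 4 adds. The sketch below models that on plain Python lists. The lane count follows from the 4 x 32-bit layout of a 128-bit register; the function names are illustrative and are not SPE intrinsics.

```python
LANES = 4  # a 128-bit register holds four 32-bit single precision values


def simd_fma(a, b, c):
    """Lane-wise a * b + c on 4-element vectors, modelling one vector instruction."""
    if not (len(a) == len(b) == len(c) == LANES):
        raise ValueError("each operand must have exactly 4 lanes")
    return [x * y + z for x, y, z in zip(a, b, c)]


def flops_per_fma():
    """One multiply and one add per lane."""
    return 2 * LANES


result = simd_fma([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0], [0.5, 0.5, 0.5, 0.5])
print(result, flops_per_fma())
```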

16
SPE Execution Diagram

17
SPE Local Storage Area
  • NOT a cache
  • 256 KB, 4 x 64 KB ECC single-port SRAM
  • Completely private to each SPE
  • Directly addressable by software
  • Can be used as a cache, but only with software
    controls
  • No tag bits, or any extra hardware
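
Because the local store has no tag bits, using it as a cache means keeping the tags in ordinary software data structures. The following is a rough sketch of a direct-mapped software cache under that idea. The line size, line count, and class name are assumptions made for illustration, and the slice from `main_memory` stands in for a DMA transfer into the local store.

```python
LINE_SIZE = 128   # bytes per cache line (illustrative assumption)
NUM_LINES = 64    # lines reserved in the local store (illustrative assumption)


class SoftwareCache:
    """Direct-mapped cache whose tags live in a plain list, not in hardware."""

    def __init__(self, main_memory):
        self.main_memory = main_memory            # bytes-like stand-in for main memory
        self.tags = [None] * NUM_LINES            # software-managed tag array
        self.lines = [bytes(LINE_SIZE)] * NUM_LINES
        self.misses = 0

    def read_byte(self, addr):
        line_addr = addr // LINE_SIZE
        index = line_addr % NUM_LINES
        if self.tags[index] != line_addr:
            # Miss: fetch the whole line (stand-in for a DMA get into local store).
            start = line_addr * LINE_SIZE
            self.lines[index] = self.main_memory[start:start + LINE_SIZE]
            self.tags[index] = line_addr
            self.misses += 1
        return self.lines[index][addr % LINE_SIZE]


memory = bytes(range(256)) * 64        # 16 KB of sample data
cache = SoftwareCache(memory)
print(cache.read_byte(5), cache.read_byte(6), cache.misses)
```

Every lookup here pays for the tag comparison in software, which is the cost of having no cache hardware.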

18
SPE LS Scheduling
  • Software controlled DMA
  • DMA to and from main memory
  • Scheduling a HUGE problem
  • Done primarily in software
  • IBM predicts 80-90% usage ideally
  • Request queue handles 16 simultaneous requests
  • Up to 16 KB per transfer
  • Priority: DMA, load/store, fetch
  • Fetch / execute parallelism
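
The limits above shape how software has to issue large transfers: anything bigger than one request's maximum must be split, and only a fixed number of requests can be outstanding at once. A minimal sketch of that bookkeeping, using the 16 KB and 16-entry figures from this slide; the function names are illustrative and no real DMA interface is modelled.

```python
MAX_DMA_BYTES = 16 * 1024   # largest single transfer, per the slide
QUEUE_DEPTH = 16            # simultaneous requests, per the slide


def split_transfer(total_bytes):
    """Split one logical transfer into (offset, size) requests of at most 16 KB."""
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_DMA_BYTES, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


def queue_batches(chunks):
    """Group requests into batches that fit in the request queue at one time."""
    return [chunks[i:i + QUEUE_DEPTH] for i in range(0, len(chunks), QUEUE_DEPTH)]


requests = split_transfer(40 * 1024)
print(requests)
print(len(queue_batches(split_transfer(20 * MAX_DMA_BYTES))))
```

Overlapping the next batch's transfers with computation on the current data is what the fetch / execute parallelism bullet refers to.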

19
SPE Control Logic
  • Very little in comparison
  • Represents shift in focus
  • Complete lack of branch prediction
  • Software branch prediction
  • Loop unrolling
  • 18-cycle penalty for a mispredicted branch
  • Software controlled DMA
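
Loop unrolling reduces how many loop-closing branches execute, which matters when each mispredicted branch is expensive. The sketch below shows the transformation itself on a simple sum. Python gains nothing from it; the code only illustrates what a compiler or programmer would do for the SPE, and the function names are illustrative.

```python
def sum_rolled(values):
    """Straightforward loop: one loop-closing branch per element."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled4(values):
    """Same sum, unrolled by four: one loop-closing branch per four elements."""
    n = len(values)
    main = n - (n % 4)
    t0 = t1 = t2 = t3 = 0
    for i in range(0, main, 4):
        t0 += values[i]
        t1 += values[i + 1]
        t2 += values[i + 2]
        t3 += values[i + 3]
    total = t0 + t1 + t2 + t3
    for i in range(main, n):      # remainder loop for leftover elements
        total += values[i]
    return total


def loop_branches(n, unroll):
    """Loop-closing branches executed: main loop trips plus remainder trips."""
    return n // unroll + n % unroll


data = list(range(1000))
print(sum_rolled(data) == sum_unrolled4(data), loop_branches(1000, 1), loop_branches(1000, 4))
```

The four independent accumulators also give an in-order, dual-issue pipeline more independent work per iteration.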

20
SPE Pipeline
  • Little ILP, and thus little control logic
  • Dual issue
  • Simple commit unit (no reorder buffer or other
    complexities)
  • Same execution unit for FP/int

21
SPE Summary
  • Essentially small vector computer
  • Based on Altivec/VMX ISA
  • Extensions for DMA and LS management
  • Extended for a 128 x 128-bit register file
  • Uniquely suited for real time applications
  • Extremely fast for certain FP operations
  • Offloads a large amount of work onto the compiler / software

22
Element Interconnect Bus
  • 4 concentric rings connecting all Cell elements
  • 128-bit wide interconnects

23
EIB (cont.)
  • Designed to minimize coupling noise
  • Rings of data traveling in alternating directions
  • Buffers and repeaters at each SPE boundary
  • Architecture can be scaled up with increased bus
    latency
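
With rings running in both directions, a transfer can take whichever direction reaches its destination in fewer hops. A minimal sketch of that choice, assuming a hypothetical ring with 12 stops numbered 0 to 11. The slides do not describe the actual arbitration policy, so this illustrates only the shortest-direction idea.

```python
def ring_route(src, dst, num_stops):
    """Return (direction, hops) for the shorter way around a ring."""
    clockwise = (dst - src) % num_stops
    counterclockwise = (src - dst) % num_stops
    if clockwise <= counterclockwise:
        return ("clockwise", clockwise)
    return ("counterclockwise", counterclockwise)


NUM_STOPS = 12  # hypothetical number of elements attached to the ring

print(ring_route(0, 3, NUM_STOPS))
print(ring_route(0, 9, NUM_STOPS))
```

Adding more stops lengthens the worst-case path, which is the increased bus latency the last bullet mentions.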

24
EIB (cont.)
  • Total bandwidth of 200 GB/s
  • EIB controller located physically in center of
    chip between SPEs
  • Controller reserves channels for each individual
    data transfer request
  • Implementation allows for SPE extension
    horizontally

25
Memory Interface
  • Rambus XDR memory to keep Cell at full
    utilization
  • 3.2 Gbps per-pin data rate for each device on
    the XDR interface
  • Cell uses dual channel XDR with four devices and
    16-bit wide buses to achieve 25.6 GB/s total
    memory bandwidth
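
The parameters on this slide multiply out to about 25.6 GB/s. A short sketch of the arithmetic, assuming the 3.2 Gbps figure is a per-pin data rate and that all four 16-bit devices transfer in parallel; the function and parameter names are illustrative.

```python
def xdr_bandwidth_gbytes(devices, bus_width_bits, pin_rate_mbps):
    """Aggregate bandwidth in GB/s from device count, bus width, and per-pin rate."""
    total_mbps = devices * bus_width_bits * pin_rate_mbps   # megabits per second
    return total_mbps / 8 / 1000                            # bits to bytes, mega to giga


# Assumption: 3.2 Gbps (3200 Mbps) per pin, four devices, 16 data pins each.
total = xdr_bandwidth_gbytes(devices=4, bus_width_bits=16, pin_rate_mbps=3200)
print(f"{total:.1f} GB/s")
```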

26
Input / Output Bus
  • Rambus FlexIO Bus
  • IO interface consists of 12 unidirectional byte
    lanes
  • Each lane supports 6.4 GB/s bandwidth
  • 7 outbound lanes and 5 inbound lanes
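
The lane counts and per-lane rate above imply aggregate figures in each direction. A minimal sketch of that sum, taking the 6.4 GB/s per lane at face value from this slide; the constant and function names are illustrative.

```python
LANE_GBYTES_PER_S = 6.4   # per-lane bandwidth, per the slide
OUTBOUND_LANES = 7
INBOUND_LANES = 5


def aggregate(lanes, per_lane=LANE_GBYTES_PER_S):
    """Total bandwidth in GB/s for a group of lanes, rounded to one decimal."""
    return round(lanes * per_lane, 1)


outbound = aggregate(OUTBOUND_LANES)
inbound = aggregate(INBOUND_LANES)
combined = aggregate(OUTBOUND_LANES + INBOUND_LANES)
print(f"outbound {outbound} GB/s, inbound {inbound} GB/s, combined {combined} GB/s")
```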

27
Design Choices
  • In-order execution
  • Abandoning ILP
  • ILP: 10-20% increase per generation
  • Reducing control logic
  • Real time responsiveness
  • Cache Design
  • Software configuration on SPE
  • Standard L2 cache on PPE

28
Cell Programming Issues
  • No Cell compiler in existence to manage
    utilization of SPEs at compile time
  • SPEs do not natively support context switching.
    Must be OS managed.
  • SPEs are vector processors. Not efficient for
    general-purpose computation.
  • PPEs and SPEs use different instruction sets.

29
Cell Programming (cont.)
  • Functional Offload Model
  • Simplest model for Cell programming
  • Optimize existing libraries for SPE computation
  • Requires no rebuild of main application logic
    which runs on PPE
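
The idea of the functional offload model is that the caller's code does not change: a library routine keeps its interface while its internals hand work to other processors. The sketch below imitates that with a thread pool standing in for the SPEs. The worker count and function names are hypothetical, and Python threads do not model SPE performance; only the structure is the point.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 6  # hypothetical number of workers standing in for SPEs


def _partial_dot(pair):
    """Work done by one worker: dot product of one slice of each vector."""
    a, b = pair
    return sum(x * y for x, y in zip(a, b))


def dot(a, b):
    """Library entry point: same signature as a serial dot product."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    n = len(a)
    step = max(1, -(-n // NUM_WORKERS))          # ceiling division
    pieces = [(a[i:i + step], b[i:i + step]) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return sum(pool.map(_partial_dot, pieces))


# The application calls dot() exactly as it would call a serial version.
print(dot([1, 2, 3], [4, 5, 6]))
```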

30
Cell Programming (cont.)
  • Device Extension Model
  • Take advantage of SPE DMA
  • Use SPEs as interfaces to external devices

31
Cell Programming (cont.)
  • Computational Acceleration Model
  • Traditional super-computing methods using Cell
  • Shared memory or message passing paradigm for
    accelerating inherently parallel math operations
  • Can replace compute-intensive math libraries
    without rewriting applications

32
Cell Programming (cont.)
  • Streaming model
  • Use Cell processor as one large programmable
    pipeline
  • Partition algorithms into logically sensible
    steps. Execute each separately, in serial, on
    separate processors.
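
The streaming model can be pictured as a chain of stages, each consuming the previous stage's output. Below is a minimal sketch using Python generators, where each stage stands in for code that would run on its own SPE. The three stages and their parameters are invented purely for illustration.

```python
def stage_scale(stream, factor):
    """Stage 1: multiply each item by a constant."""
    for x in stream:
        yield x * factor


def stage_offset(stream, offset):
    """Stage 2: add a constant to each item."""
    for x in stream:
        yield x + offset


def stage_clamp(stream, low, high):
    """Stage 3: clamp each item into [low, high]."""
    for x in stream:
        yield max(low, min(high, x))


def run_pipeline(data):
    """Chain the stages so each item flows through them in order."""
    return list(stage_clamp(stage_offset(stage_scale(data, 2), -3), 0, 10))


print(run_pipeline([1, 2, 5, 9]))
```

On real hardware the stages would run concurrently on different items, so throughput is set by the slowest stage.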

33
Cell Programming (cont.)
  • Asymmetric Thread Runtime Model
  • Abstract Cell architecture away from programmer.
  • Use the OS to schedule a different thread onto
    each processor.

34
Sample Performance
  • Demonstration physics engine for real-time game
  • http://www.research.ibm.com/cell/whitepapers/cell_online_game.pdf
  • 18:2 compute-to-DMA ratio on SPEs
  • For the right tasks, Cell architecture can be
    extremely efficient.