1
New Trends in Designing Processors
  • ASIC Seminar
  • Instructor: Dr. S. M. Fakhraie
  • Presented by: Amir Naghdinezhad
  • Spring 2006

This is a class presentation. All data are copyrighted by the respective
authors listed in the references and are used here for educational
purposes only.
2
Outline
  • Introduction
  • Raw
  • Imagine
  • Smart Memories
  • TRIPS
  • Conclusion
  • References

3
Introduction
  • In the 1970s, memory was expensive, so CISC architectures used
      • Dense instruction encoding
      • Variable-length instructions
      • Small numbers of registers
  • In the 1980s, an entire RISC processor could fit on a single chip
      • RISC processors attained high performance despite their reduced complexity

4
Introduction
  • Over the past 20 years
      • Aggressive pipelining and compiler scheduling
      • 40% per year performance scaling [10]
      • Only small changes to the ISA
  • Future architectures face
      • Limits on pipeline depth
      • Limits on clock-speed acceleration and on power
      • Increasing delays through global on-chip wires
  • So new designs are required

5
Raw
  • Developed at MIT
  • A general-purpose architecture with interrupts, caches, and context switches
  • Attacks the wire-delay problem
      • Lets the programmer or compiler program the wiring resources directly
  • Composed of 16 identical programmable tiles
  • All signals are registered at tile boundaries
      • One clock cycle of delay at each boundary
  • No global signals

6
Raw Architecture
Figure: The Raw microprocessor [1]
7
Raw Tile Architecture
  • A tile contains
      • An 8-stage, in-order, single-issue, MIPS-style processing pipeline
      • A 4-stage, pipelined, single-precision FPU
      • A 32 KB data cache
      • 96 KB of instruction caches
      • Two types of communication routers: static and dynamic

8
Raw Tile Interconnections
  • Four 32-bit, full-duplex on-chip networks
      • Two static networks
          • Route operands among local and remote ALUs
          • Route data streams among tiles, DRAM, and I/O ports
      • Two dynamic networks
          • Carry cache misses, interrupts, and dynamic messages
  • Each tile is connected only to its four neighbors

Figure: Raw tile architecture [2]
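Because every signal is registered at a tile boundary and each tile talks only to its four neighbors, the cost of moving an operand grows with the distance it travels. The sketch below is a simplified Python model, not Raw's actual router: it assumes the 16 tiles are numbered row by row on a 4 × 4 grid, that messages use dimension-ordered (X-then-Y) routing, and that each boundary crossing costs exactly one cycle. The names `xy_route` and `hop_latency` are hypothetical.

```python
# Simplified model of a Raw-style mesh (assumptions: 4 x 4 grid numbered row by row,
# X-then-Y routing, one cycle per registered tile boundary). Not the real Raw router.
MESH_WIDTH = 4  # 16 tiles arranged as 4 x 4


def tile_coords(tile_id: int) -> tuple[int, int]:
    """Map a tile number 0..15 to (x, y) grid coordinates."""
    return tile_id % MESH_WIDTH, tile_id // MESH_WIDTH


def xy_route(src: int, dst: int) -> list[int]:
    """Tiles visited when routing X first, then Y, through nearest neighbors only."""
    x, y = tile_coords(src)
    dx, dy = tile_coords(dst)
    path = [src]
    while x != dx:                      # move horizontally one tile per step
        x += 1 if dx > x else -1
        path.append(y * MESH_WIDTH + x)
    while y != dy:                      # then move vertically one tile per step
        y += 1 if dy > y else -1
        path.append(y * MESH_WIDTH + x)
    return path


def hop_latency(src: int, dst: int) -> int:
    """Cycles to reach dst, assuming one cycle per boundary crossed."""
    return len(xy_route(src, dst)) - 1


if __name__ == "__main__":
    print(xy_route(0, 15))     # corner-to-corner path
    print(hop_latency(0, 15))  # number of boundary crossings
```

Under these assumptions a corner-to-corner transfer crosses six boundaries, while neighbors are one cycle apart; this locality is what the compiler can exploit when it places communicating instructions on nearby tiles.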
9
Raw Compute Processor
Figure: Raw compute processor pipeline [1]
10
Raw Performance Survey
Figure: Raw performance [3]
11
Raw Fabrication
  • 180 nm, six-metal-layer copper ASIC process
  • 3.6 GFLOPS peak
  • 18.23 mm × 18.23 mm die
  • Clock
      • 420 MHz (actual)
  • Power
      • 10 W (power-save mode)
      • 18 W typical
      • 35 W maximum

Figure: Raw die layout [3]
12
Imagine: A Stream Processor
  • Developed at Stanford University
  • A programmable stream processor for media applications
  • Controlled by a host processor
  • A peak performance of 20 GFLOPS [5], with
      • A 128 KB stream register file
      • 48 floating-point arithmetic units in eight arithmetic clusters
      • A streaming memory system with four SDRAM channels
      • A microcontroller, a network interface, and a stream controller

13
Imagine Stream Processors
  • Bridge the gap between inflexible special-purpose architectures and programmable ones
  • Are DSPs targeted at high-performance embedded applications
  • Contain clusters of functional units, supporting hundreds of arithmetic units
  • Exploit
      • Instruction-level parallelism (ILP)
      • Data parallelism (DP)
      • Task parallelism (TP): kernel execution overlapped with stream data transfers

14
Imagine Stream Processors
  • The idea is to organize an application into streams and kernels
  • A stream is a set of elements of the same type
      • Elements can be simple or complex
  • A kernel is the computational unit that operates on streams
      • Can have one or more input and output streams
      • Performs computations ranging from a few to thousands of operations per input element
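The stream/kernel organization can be illustrated with a small Python sketch. It is not Imagine's actual programming interface; the `Pixel` record, the `luminance` kernel, and `run_kernel` are hypothetical. It shows a kernel applied to every element of a same-typed stream, with elements dealt round-robin to eight clusters so that, in each step, all clusters run the same kernel code on different elements, mirroring how Imagine's eight clusters operate as a SIMD array.

```python
from dataclasses import dataclass
from typing import Callable, List

NUM_CLUSTERS = 8  # Imagine has eight arithmetic clusters


@dataclass
class Pixel:
    """Hypothetical stream element: every element of the stream has this type."""
    r: int
    g: int
    b: int


def run_kernel(kernel: Callable[[Pixel], int], stream: List[Pixel]) -> List[int]:
    """Apply a kernel to every element of an input stream, producing an output stream.

    Elements are dealt out round-robin to the clusters: in each SIMD step, cluster c
    processes element base + c, and all clusters execute the same kernel code.
    """
    out = [0] * len(stream)
    for base in range(0, len(stream), NUM_CLUSTERS):  # one SIMD step per batch
        for cluster in range(NUM_CLUSTERS):           # conceptually in parallel
            i = base + cluster
            if i < len(stream):
                out[i] = kernel(stream[i])
    return out


def luminance(p: Pixel) -> int:
    """Hypothetical kernel: integer approximation of luma (weights sum to 256)."""
    return (77 * p.r + 150 * p.g + 29 * p.b) >> 8


if __name__ == "__main__":
    frame = [Pixel(255, 255, 255), Pixel(0, 0, 0), Pixel(255, 0, 0)] * 4  # 12 elements
    print(run_kernel(luminance, frame))  # two SIMD steps over eight clusters
```

The kernel only sees one element at a time and never touches global memory, which is the property that lets a stream processor keep its arithmetic units fed from local storage.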

15
Imagine Architecture
  • Eight VLIW computation clusters are arranged in a SIMD array
  • The Imagine chip is controlled by a host processor
  • Operands for arithmetic operations are kept locally in Local Register Files (LRFs) near the ALUs
  • Streams of data are stored in the Stream Register File (SRF), which transfers data to and from the LRFs
  • Global data is stored in off-chip memory
  • Each Imagine chip has a network interface that allows high-speed communication among Imagine chips
  • The Imagine memory system allows multiple streaming memory accesses to proceed simultaneously
Figure: Imagine architecture [5], [7]
16
Imagine Fabrication
  • 150 nm static CMOS standard-cell technology
  • Die size of 1.44 cm²
  • 2.8 and 6.2 billion operations per second
  • Clock
      • 500 MHz
  • 2.4 GFLOPS per watt [4]
  • For comparison, a Pentium 4 achieves a peak performance of 12 GFLOPS at 80 W [4]

Figure: Imagine die layout [4]
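The efficiency comparison on this slide can be checked with a few lines of arithmetic. The script reuses only the figures quoted above [4]; note that the Pentium 4 number is a peak rating, so the ratio compares against peak rather than sustained performance.

```python
# Figures quoted on the slide [4]
IMAGINE_GFLOPS_PER_WATT = 2.4
P4_PEAK_GFLOPS = 12.0
P4_POWER_WATTS = 80.0

# Pentium 4 efficiency at peak: 12 GFLOPS / 80 W
p4_gflops_per_watt = P4_PEAK_GFLOPS / P4_POWER_WATTS
# How many times more GFLOPS per watt Imagine delivers
advantage = IMAGINE_GFLOPS_PER_WATT / p4_gflops_per_watt

print(f"Pentium 4: {p4_gflops_per_watt:.2f} GFLOPS/W")
print(f"Imagine advantage: about {advantage:.0f}x")
```

The Pentium 4 works out to 0.15 GFLOPS/W, so the quoted Imagine figure is roughly 16 times better per watt.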
17
Smart Memories
  • Developed at Stanford University
  • A multiprocessor system
  • Processing units take the form of tiles
      • 64 tiles on a chip
      • A group of four tiles forms a quad
          • Reduces the number of global network interfaces
  • The memories, the wires, and the computational model can all be altered to match the application
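The quad grouping can be sketched as a simple mapping from tiles to quads. This is a hypothetical Python illustration: it assumes tiles 4q to 4q+3 form quad q and that each quad shares one global network interface, which is one way to read "reduces the number of global network interfaces"; the function names are invented.

```python
TILES_PER_CHIP = 64
TILES_PER_QUAD = 4
NUM_QUADS = TILES_PER_CHIP // TILES_PER_QUAD  # 16 quads


def quad_of(tile: int) -> int:
    """Hypothetical numbering: tiles 4q .. 4q+3 belong to quad q."""
    if not 0 <= tile < TILES_PER_CHIP:
        raise ValueError(f"tile {tile} is out of range")
    return tile // TILES_PER_QUAD


def uses_global_network(src: int, dst: int) -> bool:
    """Only traffic between tiles in different quads has to leave the quad."""
    return quad_of(src) != quad_of(dst)


if __name__ == "__main__":
    # Assuming one shared interface per quad: 16 global interfaces instead of 64.
    print(NUM_QUADS)
    print(uses_global_network(0, 3), uses_global_network(3, 4))
```

Under this assumption, the global network sees a quarter as many endpoints, and traffic among the four tiles of a quad stays local.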

18
Smart Memories Architecture
Figure: Smart Memories chips [8]
19
Smart Memories Tile Architecture
  • A reconfigurable memory system
      • 16 independent 8 KB mats (1024 × 64-bit words)
      • Each 64-bit word has an extra valid bit and a 4-bit configurable control field
      • Dual-ported, allowing a read-modify-write operation every cycle
      • Can be flash-cleared via special opcodes
      • Contains logic in the output read path for comparisons
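The per-word metadata and mat operations listed above can be made concrete with a behavioral Python model. This is a hypothetical sketch, not the Smart Memories hardware: the class and method names are invented, the meaning of the 4-bit control field is left open, the one-cycle read-modify-write that dual porting enables is modeled as an ordinary read followed by a write, the flash clear is modeled as a loop, and the read-path comparison is shown as a tag match, which is only one possible use.

```python
from dataclasses import dataclass, field
from typing import Callable, List

WORDS_PER_MAT = 1024        # 1024 x 64-bit words = 8 KB per mat
WORD_MASK = (1 << 64) - 1   # keep data to 64 bits
CTRL_MASK = 0xF             # 4-bit configurable control field


@dataclass
class MatWord:
    data: int = 0
    valid: bool = False     # the extra valid bit
    ctrl: int = 0           # meaning depends on how the mat is configured


@dataclass
class MemoryMat:
    words: List[MatWord] = field(
        default_factory=lambda: [MatWord() for _ in range(WORDS_PER_MAT)])

    def write(self, addr: int, data: int, ctrl: int = 0) -> None:
        self.words[addr] = MatWord(data & WORD_MASK, True, ctrl & CTRL_MASK)

    def read_modify_write(self, addr: int, op: Callable[[int], int]) -> int:
        """Read a word, write back op(old), and return the old value."""
        word = self.words[addr]
        old = word.data
        word.data = op(old) & WORD_MASK
        word.valid = True
        return old

    def flash_clear_valid(self) -> None:
        """Clear every valid bit (a single operation in hardware)."""
        for word in self.words:
            word.valid = False

    def read_compare(self, addr: int, tag: int) -> bool:
        """Comparison in the read path: 'hit' if the word is valid and equals tag."""
        word = self.words[addr]
        return word.valid and word.data == tag


if __name__ == "__main__":
    mat = MemoryMat()
    mat.write(3, 10)
    print(mat.read_modify_write(3, lambda x: x + 1), mat.words[3].data)  # old, new
    print(mat.read_compare(3, 11))
    mat.flash_clear_valid()
    print(mat.read_compare(3, 11))
```

Keeping the valid bit and control field alongside each word is what lets the same array serve as plain memory, a cache, or a FIFO under different configurations.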

20
Smart Memories Tile Architecture
  • A processor core
      • A 64-bit processing engine
      • Two integer clusters, each with an ALU, a register file, and a load/store unit
      • One floating-point (FP) cluster
  • A quad network interface
      • Connects the memory mats to the processor
      • Supports up to eight concurrent references

21
Smart Memories Tile Architecture
Figure: Smart Memories tile [8]
22
Smart Memories Latency and Bandwidth
  • Peak bandwidth with a 1 GHz clock [9]
      • To/from tile memories
          • 16 GB/s per mat
          • 128 GB/s per tile memory system
      • To/from a tile: 64 GB/s
      • Quad network bandwidth: 64 GB/s
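The quoted peak figures are consistent with simple word-per-cycle arithmetic at 1 GHz with 64-bit (8-byte) words. The breakdowns in the sketch below (two ports per mat, one word from each of 16 mats, eight concurrent 64-bit references) are assumptions that happen to reproduce the numbers; they are not necessarily how [9] derived them, and `peak_gb_per_s` is a hypothetical helper.

```python
CLOCK_HZ = 1_000_000_000   # 1 GHz, as quoted on the slide
WORD_BYTES = 8             # one 64-bit word
GB = 1_000_000_000         # decimal gigabyte


def peak_gb_per_s(words_per_cycle: int) -> int:
    """Peak bandwidth if `words_per_cycle` 64-bit words move every clock cycle."""
    return words_per_cycle * WORD_BYTES * CLOCK_HZ // GB


# Assumed breakdowns that reproduce the quoted figures:
per_mat = peak_gb_per_s(2)           # two ports per mat            -> 16 GB/s
per_tile_memory = peak_gb_per_s(16)  # 16 mats, one word each       -> 128 GB/s
to_from_tile = peak_gb_per_s(8)      # eight concurrent references  -> 64 GB/s

print(per_mat, per_tile_memory, to_from_tile)
```

Integer arithmetic is used so the results are exact.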

23
TRIPS
  • Tera-op, Reliable, Intelligently adaptive Processing System
  • Developed at the University of Texas at Austin
  • An EDGE (Explicit Data Graph Execution) architecture
      • Conveys the compile-time dependence graph through the ISA
  • Direct instruction communication
      • The hardware delivers a producer instruction's output directly as an input to a consumer instruction
      • Eliminates most of a conventional processor's register writes
      • Delivers values from producing to consuming instructions more energy-efficiently
  • The compiler groups instructions into blocks
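Direct instruction communication can be illustrated with a toy dataflow interpreter for one block. This Python sketch is hypothetical and is not the TRIPS ISA or microarchitecture: each instruction names its consumers as (instruction index, operand slot) pairs instead of a destination register, and an instruction fires as soon as all of its operands have arrived. The `inputs` list stands in for values read into the block at its start.

```python
import operator
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Tuple


@dataclass
class Inst:
    """One instruction in a block; it names its consumers, not a destination register."""
    op: Callable[..., int]
    arity: int                                  # number of operands it waits for
    targets: List[Tuple[int, int]]              # (consumer index, operand slot)
    operands: Dict[int, int] = field(default_factory=dict)


def execute_block(block: List[Inst], inputs: List[Tuple[int, int, int]]) -> Dict[int, int]:
    """Dataflow execution of one block; returns each fired instruction's result.

    Operand state lives in the Inst objects, so build a fresh block for each run.
    """
    results: Dict[int, int] = {}
    ready: Deque[int] = deque()

    def deliver(idx: int, slot: int, value: int) -> None:
        inst = block[idx]
        if slot in inst.operands:
            raise ValueError(f"operand {slot} of instruction {idx} delivered twice")
        inst.operands[slot] = value
        if len(inst.operands) == inst.arity:    # all operands present: can fire
            ready.append(idx)

    for idx, slot, value in inputs:
        deliver(idx, slot, value)

    while ready:
        idx = ready.popleft()
        inst = block[idx]
        value = inst.op(*(inst.operands[s] for s in range(inst.arity)))
        results[idx] = value
        for consumer, slot in inst.targets:     # forward directly, no register write
            deliver(consumer, slot, value)
    return results


if __name__ == "__main__":
    # (a + b) * (a - b) as a three-instruction block, with a = 7 and b = 3
    block = [
        Inst(operator.add, 2, [(2, 0)]),
        Inst(operator.sub, 2, [(2, 1)]),
        Inst(operator.mul, 2, []),
    ]
    print(execute_block(block, [(0, 0, 7), (0, 1, 3), (1, 0, 7), (1, 1, 3)]))
```

The intermediate sums never pass through a register file: the add and subtract results go straight into the multiply's operand slots, which is the source of the register-write and energy savings claimed for EDGE.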

24
TRIPS Architecture
  • Two processing cores
      • Each is a 16-wide, out-of-order issue processor
      • A 4 × 4 array of ALUs with buffers
      • Four register file banks
      • Four instruction cache banks
      • Four data cache banks
      • Four ports into the L2 cache network
      • Up to eight blocks executing concurrently
  • 2 MB of integrated L2 cache
      • Organized as 32 banks
      • Connected by a routing network

Figure: TRIPS architecture [10]
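Splitting the 2 MB L2 into 32 banks gives 64 KB per bank. How addresses are spread over the banks is not stated here, so the Python sketch below assumes a 64-byte line and simple line interleaving (consecutive lines go to consecutive banks), which tends to spread concurrent accesses from the four L2 ports over different banks. The line size and `l2_bank` are hypothetical.

```python
L2_BYTES = 2 * 1024 * 1024   # 2 MB of L2, from the slide
NUM_BANKS = 32               # from the slide
LINE_BYTES = 64              # assumption: cache-line size, not given on the slide

BANK_BYTES = L2_BYTES // NUM_BANKS   # capacity of each bank


def l2_bank(addr: int) -> int:
    """Hypothetical line-interleaved mapping: consecutive lines go to consecutive banks."""
    return (addr // LINE_BYTES) % NUM_BANKS


if __name__ == "__main__":
    print(BANK_BYTES // 1024, "KB per bank")
    print([l2_bank(a) for a in range(0, 5 * LINE_BYTES, LINE_BYTES)])
```

With this mapping, a sequential stream of line fetches touches all 32 banks before reusing one.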
25
TRIPS Architecture
Figure: TRIPS processor core architecture [10]
26
TRIPS Fabrication
  • IBM CU-11 process (130 nm)
  • 18 mm × 18 mm chip area
  • 533 MHz clock rate
  • 32 GFLOPS at 130 nm; 5 TFLOPS when scaled to 35 nm

Figure: TRIPS die layout [11]
27
Conclusion
28
References
  • [1] M. B. Taylor et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs," IEEE Micro, Mar. 2002, pp. 25-35.
  • [2] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. I. Frank, S. Amarasinghe, and A. Agarwal, "Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams," Proc. International Symposium on Computer Architecture (ISCA), June 2004.
  • [3] http://www.cag.csail.mit.edu/raw/
  • [4] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens, "Programmable Stream Processors," IEEE Computer, vol. 36, no. 8, pp. 54-62, Aug. 2003.
  • [5] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, and A. Chang, "Imagine: Media Processing with Streams," IEEE Micro, Mar./Apr. 2001.
  • [6] http://cva.stanford.edu/imagine/

29
References
  • [7] S. Sardashti, "Designing a Stream Processor," seminar report, University of Tehran, June 2005.
  • [8] K. Mai et al., "Smart Memories: A Modular Reconfigurable Architecture," Proc. 27th Int'l Symp. Computer Architecture (ISCA 00), ACM Press, 2000, pp. 161-171.
  • [9] http://www-vlsi.stanford.edu/smart_memories/
  • [10] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, and W. Yoder, "Scaling to the End of Silicon with EDGE Architectures," IEEE Computer, vol. 37, no. 7, pp. 44-55, 2004.
  • [11] http://www.cs.utexas.edu/users/cart/trips/