Title: New Trends in Designing Processors
1 New Trends in Designing Processors
- ASIC Seminar
- Instructor: Dr. S.M. Fakhraie
- Presented by: Amir Naghdinezhad
- Spring 2006
This is a class presentation. All data are copyrighted
to the respective authors as listed in the references
and have been used here for educational purposes only.
2 Outline
- Introduction
- Raw
- Imagine
- Smart Memories
- TRIPS
- Conclusion
- References
3 Introduction
- In the 1970s
- Memory was expensive
- So CISC architectures were designed with
- Dense instruction encoding
- Variable-length instructions
- Small numbers of registers
- In the 1980s
- An entire RISC processor could fit on a single chip
- RISC processors attained high performance despite reduced complexity
4 Introduction
- Past 20 years
- Aggressive pipelining and compiler scheduling
- 40% per year performance scaling [10]
- With only small ISA changes
- Future architectures
- Pipeline depth limits
- Acceleration of clock speeds and power limits
- Increasing delays through global on-chip wires
- So new designs are required
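To put the scaling figure on this slide in perspective, here is a minimal arithmetic sketch. It assumes the rate is a constant 40% per year compounded over the full 20 years; the function names are illustrative and not taken from the references.

```python
import math

def compound_speedup(annual_rate: float, years: int) -> float:
    """Cumulative speedup after `years` of growth at a fixed `annual_rate`."""
    return (1.0 + annual_rate) ** years

def doubling_time(annual_rate: float) -> float:
    """Years needed for performance to double at a fixed `annual_rate`."""
    return math.log(2.0) / math.log(1.0 + annual_rate)

speedup_20y = compound_speedup(0.40, 20)   # assumed constant 40% per year
years_to_double = doubling_time(0.40)

print(f"Speedup over 20 years: about {speedup_20y:.0f}x")
print(f"Doubling time: about {years_to_double:.2f} years")
```

Under these assumptions the cumulative gain is several hundred times, which is the kind of growth new designs would need to sustain.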
5 Raw
- At MIT
- A general-purpose architecture with interrupts, caches and context switches
- Attacks the wire-delay problem
- Enables the programmer or compiler to directly program the wiring resources
- Composed of 16 identical programmable tiles
- All signals registered at tile boundaries
- One clock cycle delay
- No global signals
6 Raw Architecture
Figure: The Raw microprocessor [1]
7 Raw Tile Architecture
- A tile contains
- An 8-stage in-order single-issue MIPS-style processing pipeline
- A 4-stage single-precision pipelined FPU
- A 32 KB data cache
- 96 KB of instruction caches
- Two types of communication routers: static and dynamic
8 Raw Tile Interconnections
- Four 32-bit full-duplex on-chip networks
- Two static
- To route operands among local and remote ALUs
- To route data streams among tiles, DRAM, and I/O ports
- Two dynamic
- Cache misses, interrupts and dynamic messages
- Each tile is connected only to its four neighbors
Figure: Raw tile architecture [2]
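The nearest-neighbor wiring can be illustrated with a small model. This is a sketch under stated assumptions, not the Raw router logic: it assumes the 16 tiles form a 4 x 4 grid, that a message moves one tile per clock cycle (signals are registered at tile boundaries), and that routes follow the shortest path. All names are hypothetical.

```python
MESH_DIM = 4  # assumption: 16 tiles laid out as a 4 x 4 grid

def tile_coords(tile_id: int) -> tuple[int, int]:
    """Map a tile number (0..15) to its (row, column) position."""
    return divmod(tile_id, MESH_DIM)

def neighbors(tile_id: int) -> list[int]:
    """On-chip tiles directly wired to `tile_id` (north, south, west, east)."""
    row, col = tile_coords(tile_id)
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < MESH_DIM and 0 <= c < MESH_DIM:
            result.append(r * MESH_DIM + c)
    return result

def hop_latency(src: int, dst: int, cycles_per_hop: int = 1) -> int:
    """Cycles to cross the mesh, assuming one cycle per tile boundary."""
    (r1, c1), (r2, c2) = tile_coords(src), tile_coords(dst)
    return (abs(r1 - r2) + abs(c1 - c2)) * cycles_per_hop

print(sorted(neighbors(5)))   # an interior tile has four on-chip neighbors
print(hop_latency(0, 15))     # corner to corner in this model
```

In this model, tiles on the edge of the grid have fewer on-chip neighbors. Latency grows with distance rather than being hidden behind a global wire, which is the wire-delay cost the architecture exposes to the compiler.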
9 Raw Compute Processor
Figure: Raw compute processor pipeline [1]
10 Raw Performance Survey
Figure: Performance [3]
11 Raw Fabrication
- 180 nm, 6-metal copper ASIC process
- 3.6 GFLOPS peak
- 18.23 mm x 18.23 mm
- Clock
- 420 MHz (actual)
- Power
- 10 watts (power save mode)
- 18 watts typical
- 35 watts max
Figure: Raw die layout [3]
12 Imagine: A Stream Processor
- At Stanford University
- A programmable stream processor for media applications
- Imagine is controlled by a host processor
- A peak performance of 20 GFLOPS [5]
- With
- 128-Kbyte stream register file
- 48 floating-point arithmetic units in eight arithmetic clusters
- A streaming memory system with four SDRAM channels
- A microcontroller, a network interface and a stream controller
13 Imagine Stream Processors
- A bridge between inflexible special-purpose and programmable architectures
- Are DSPs targeted at high-performance embedded applications
- Contain clusters of functional units, supporting hundreds of arithmetic units
- Exploit
- Instruction Level Parallelism (ILP)
- Data Parallelism (DP)
- Task Parallelism (TP) (kernel execution and stream data transfers)
14 Imagine Stream Processors
- The idea is to organize an application into streams and kernels
- A stream contains a set of elements of the same type
- Simple or complex
- A kernel is the computational unit that works on streams
- Can have one or more input and output streams
- Complex calculations ranging from a few to thousands of operations per input element
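The stream and kernel split above can be sketched in a few lines. This is only an illustration of the programming model, not Imagine's actual stream or kernel languages. The two kernels (`to_gray`, `threshold`) and the pixel data are hypothetical.

```python
from typing import Iterable, Iterator

Pixel = tuple[int, int, int]  # a "complex" stream element: one RGB record

def to_gray(pixels: Iterable[Pixel]) -> Iterator[int]:
    """Kernel 1: consumes a stream of RGB records, produces a stream of intensities."""
    for r, g, b in pixels:
        yield (r + g + b) // 3

def threshold(values: Iterable[int], cutoff: int = 128) -> Iterator[int]:
    """Kernel 2: consumes a stream of intensities, produces a stream of 0/1 flags."""
    for v in values:
        yield 1 if v >= cutoff else 0

# Every element of a stream has the same type; kernels are chained stream to stream.
pixel_stream = [(255, 255, 255), (0, 0, 0), (200, 100, 60)]
mask = list(threshold(to_gray(pixel_stream)))
print(mask)
```

Each kernel only touches the elements flowing through it, so the intermediate stream never has to be written back to a global memory in this model.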
15 Imagine Architecture
There are eight VLIW computation clusters
arranged in a SIMD array.
The Imagine chip is controlled by a host
processor.
Streams of data are stored in the Stream Register
File (SRF), which can transfer data to and from
the Local Register Files (LRFs).
Operands for arithmetic operations are kept
locally in the LRFs, near the
ALUs.
Global data is stored in off-chip memory.
Each Imagine chip has a network interface to
allow high speed communication among Imagine
chips.
The memory system of Imagine allows multiple
streaming memory accesses to occur simultaneously.
Figure: Imagine architecture [5][7]
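A rough way to picture the eight clusters working as a SIMD array is shown below. This is a sketch, not the Imagine microarchitecture: it assumes consecutive stream elements are interleaved across the clusters (element i goes to cluster i mod 8) and that every cluster applies the same operation in lockstep. All function names are hypothetical.

```python
import math

NUM_CLUSTERS = 8  # eight arithmetic clusters, as listed on the slides

def cluster_of(index: int) -> int:
    """Assumed interleaving: stream element i is handled by cluster i mod 8."""
    return index % NUM_CLUSTERS

def simd_steps(stream_length: int) -> int:
    """Lockstep iterations needed to consume a stream of the given length."""
    return math.ceil(stream_length / NUM_CLUSTERS)

def run_kernel_simd(stream, op):
    """Apply `op` to the stream one batch at a time, one element per cluster."""
    out = []
    for base in range(0, len(stream), NUM_CLUSTERS):
        batch = stream[base:base + NUM_CLUSTERS]   # read from the stream register file
        out.extend(op(x) for x in batch)           # same operation in every cluster
    return out

squares = run_kernel_simd(list(range(10)), lambda x: x * x)
print(squares, simd_steps(10), cluster_of(9))
```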
16 Imagine Fabrication
- 150 nm, static CMOS standard-cell technology
- Die size of 1.44 cm²
- 2.8 and 6.2 billion operations per second
- Clock
- 500 MHz
- 2.4 Gflops per watt [4]
- Pentium 4 achieves a peak performance of 12 Gflops at 80 watts [4]
Figure: Imagine die layout [4]
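The efficiency comparison on this slide can be made explicit with a short calculation. It uses only the figures quoted above and treats the Pentium 4 number as peak Gflops divided by power; the variable names are illustrative.

```python
imagine_gflops_per_watt = 2.4   # quoted on the slide
pentium4_peak_gflops = 12.0     # quoted on the slide
pentium4_watts = 80.0           # quoted on the slide

pentium4_gflops_per_watt = pentium4_peak_gflops / pentium4_watts
efficiency_ratio = imagine_gflops_per_watt / pentium4_gflops_per_watt

print(f"Pentium 4: {pentium4_gflops_per_watt:.2f} Gflops per watt")
print(f"Imagine advantage: about {efficiency_ratio:.0f}x")
```

On these numbers Imagine delivers roughly 16 times more floating-point work per watt.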
17 Smart Memories
- At Stanford University
- A multiprocessor system
- Processing units are in the form of tiles
- 64 tiles on a chip
- A group of four tiles forms a quad
- Reduces the number of global network interfaces
- The memories, the wires, and the computational
model can all be altered to match the
applications.
18 Smart Memories Architecture
Figure: Smart Memories chips [8]
19 Smart Memories Tile Architecture
- A reconfigurable memory system
- 16 independent 8 KB (1024 x 64b) memory mats
- Each 64b word
- Has an extra valid bit and a 4-bit configurable control field
- Is dual ported to allow read-modify-write operations each cycle
- Can be flash cleared via special opcodes
- Contains logic in the output read path for comparisons
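The mat organization can be checked and illustrated with a small model. This is a sketch under stated assumptions rather than the Smart Memories design: it assumes a flash clear invalidates every word and resets its control field, and that the read-path comparison only matches valid words. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field

WORDS_PER_MAT = 1024
DATA_BITS = 64
MATS_PER_TILE = 16
DATA_MASK = (1 << DATA_BITS) - 1

@dataclass
class Word:
    data: int = 0
    valid: bool = False   # the extra valid bit
    control: int = 0      # the 4-bit configurable control field

@dataclass
class Mat:
    words: list = field(
        default_factory=lambda: [Word() for _ in range(WORDS_PER_MAT)])

    def write(self, addr: int, data: int, control: int = 0) -> None:
        word = self.words[addr]
        word.data, word.control, word.valid = data & DATA_MASK, control & 0xF, True

    def read_modify_write(self, addr: int, update) -> int:
        """Return the old value and store update(old) in the same operation."""
        word = self.words[addr]
        old = word.data
        word.data = update(old) & DATA_MASK
        return old

    def compare(self, addr: int, value: int) -> bool:
        """Comparison on the read path; assumed to match valid words only."""
        word = self.words[addr]
        return word.valid and word.data == value

    def flash_clear(self) -> None:
        """Assumed semantics: invalidate every word and reset its control field."""
        for word in self.words:
            word.valid, word.control = False, 0

mat_bytes = WORDS_PER_MAT * DATA_BITS // 8   # 1024 x 64b, data bits only
tile_bytes = mat_bytes * MATS_PER_TILE
print(mat_bytes, tile_bytes)
```

The capacity arithmetic confirms the slide: 1024 words of 64 bits is 8 KB per mat, so 16 mats give 128 KB of data storage per tile, not counting the valid and control bits.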
20 Smart Memories Tile Architecture
- A processor core
- A 64-bit processing engine
- Two integer clusters
- An ALU, register file, and load/store unit for each
- One floating point (FP) cluster
- A quad network interface
- Connects the different memory mats to the processor
- Supports up to eight concurrent references
21 Smart Memories Tile Architecture
Figure: Smart Memories tile [8]
22 Smart Memories Latency and Bandwidth
- Peak bandwidth with 1 GHz clock [9]
- To/from tile memories
- 16 GB/s per mat
- 128 GB/s per tile memory system
- To/from tile
- 64 GB/s
- Quad network bandwidth
- 64 GB/s
23 TRIPS
- Tera-op, Reliable, Intelligently adaptive Processing System
- At the University of Texas at Austin
- An EDGE (Explicit Data Graph Execution) architecture
- Conveys the compile-time dependence graph through the ISA
- Direct instruction communication
- The hardware delivers a producer instruction's output directly as an input to a consumer instruction
- Eliminates the majority of a conventional processor's register writes
- More energy-efficient delivery from producing to consuming instructions
- The compiler groups instructions into blocks of instructions
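Direct instruction communication can be illustrated with a toy dataflow interpreter. This is a sketch of the idea only: the instruction format, the `BLOCK` contents, and `execute_block` are hypothetical and do not reflect the TRIPS ISA encoding. Each instruction names the consumers of its result instead of a destination register, and an instruction fires as soon as both of its operands have arrived.

```python
import operator

# One block computing (6 * 7) + 7. Each target is (consumer index, operand slot).
BLOCK = {
    0: {"op": "const", "value": 6, "targets": [(2, 0)]},
    1: {"op": "const", "value": 7, "targets": [(2, 1), (3, 1)]},
    2: {"op": "mul", "targets": [(3, 0)]},
    3: {"op": "add", "targets": []},          # block output
}
OPS = {"add": operator.add, "mul": operator.mul}

def execute_block(block):
    """Fire instructions in dataflow order; no shared register file is written."""
    operands = {index: {} for index in block}   # per-instruction operand buffers
    results = {}
    ready = [i for i, ins in block.items() if ins["op"] == "const"]
    while ready:
        index = ready.pop()
        ins = block[index]
        if ins["op"] == "const":
            value = ins["value"]
        else:
            value = OPS[ins["op"]](operands[index][0], operands[index][1])
        results[index] = value
        # Deliver the result directly to each consumer's operand buffer.
        for consumer, slot in ins["targets"]:
            operands[consumer][slot] = value
            if len(operands[consumer]) == 2:    # both operands have arrived
                ready.append(consumer)
    return results

print(execute_block(BLOCK)[3])
```

In this model the intermediate product never passes through a register. It travels from producer to consumer, which is the saving in register writes the slide describes.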
24 TRIPS Architecture
- Two processing cores
- Each is a 16-wide out-of-order issue core
- A 4 x 4 array of ALUs with buffers
- Four register file banks
- Four instruction cache banks
- Four data cache banks
- Four ports into the L2 cache network
- Up to eight blocks executing concurrently
- 2 Mbytes of integrated L2 cache
- Organized as 32 banks
- Connected with a routing network
Figure: TRIPS architecture [10]
25 TRIPS Architecture
Figure: TRIPS processor core architecture [10]
26 TRIPS Fabrication
- IBM CU-11 process (130 nm)
- 18 x 18 mm chip area
- 533 MHz clock rate
- 5 TFLOPS in a 35 nm process, 32 GFLOPS in a 130 nm process
Figure: TRIPS die layout [11]
27 Conclusion
28 References
- [1] M. B. Taylor, et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs," IEEE Micro, Mar. 2002, pp. 25-35.
- [2] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. I. Frank, S. Amarasinghe, and A. Agarwal, "Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams," Proc. International Symposium on Computer Architecture (ISCA), June 2004.
- [3] http://www.cag.csail.mit.edu/raw/
- [4] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens, "Programmable Stream Processors," IEEE Computer, vol. 36, no. 8, pp. 54-62, Aug. 2003.
- [5] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, and A. Chang, "Imagine: Media Processing with Streams," IEEE Micro, Mar./Apr. 2001.
- [6] http://cva.stanford.edu/imagine/
29 References
- [7] S. Sardashti, "Designing a Stream Processor," seminar report, University of Tehran, June 2005.
- [8] K. Mai, et al., "Smart Memories: A Modular Reconfigurable Architecture," Proc. 27th Int'l Symp. Computer Architecture (ISCA 00), ACM Press, 2000, pp. 161-171.
- [9] http://www-vlsi.stanford.edu/smart_memories/
- [10] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, and W. Yoder, "Scaling to the End of Silicon with EDGE Architectures," IEEE Computer, vol. 37, no. 7, pp. 44-55, 2004.
- [11] http://www.cs.utexas.edu/users/cart/trips/