Title: Implementing Advanced Intelligent Memory
1Implementing Advanced Intelligent Memory
Josep Torrellas, U of Illinois and IBM Watson Ctr.
David Padua and Dan Reed, U of Illinois
torrella_at_watson.ibm.com, padua_at_cs.uiuc.edu, reed_at_cs.uiuc.edu
September 1998
2Technological Opportunity
We can fabricate a large silicon area of Merged Logic and DRAM (MLD)
Question: How best to exploit this capability to advance computing?
3Pieces of the Puzzle
- 256 Mbit MLD process at 0.25 um
- Includes logic running at 200 MHz
- E.g. 2 IBM PowerPC 603 with 8 KB I+D caches take 10% of the chip
- IBM CMOS-7LD technology available Fall 98
- Japanese manufacturers (NEC, Fujitsu) are in the lead
- In a couple of years: 512 Mbit MLD process at 0.18 um
4Key Applications Clamor for HW
- Data Mining (decision trees and neural networks)
- Computational Biology (DNA sequence matching)
- Financial Modeling (stock options, derivatives)
- Molecular Dynamics (short-range forces)
- Plus the typical ones: MPEG, TPC-D, speech recognition
All are Data Intensive Applications
5Our Solution Principles
1. Extract high bandwidth from DRAM
=> Many simple processing units
2. Run legacy codes w/ high performance
=> Do not replace off-the-shelf uP in workstation
=> Take place of memory chip. Same interface as DRAM
=> Intelligent memory defaults to plain DRAM
3. Small increase in cost over DRAM
=> Simple processing units, still dense
4. General purpose
=> Do not hardwire any algorithm. No special purpose
6Architecture Proposed
[Figure: architecture block diagram. Labels: P.Host (L1, L2 cache), plain DRAM, FlexRAM (P.Mem with cache, P.Array with DRAM), network]
7Proposed Work
- Design an architecture based on key IBM
applications
- Fabricate chips using IBM CMOS-7LD technology
- Build a workstation w/ an intelligent memory
system
- Build a language and compiler for the
intelligent memory
- Demonstrate significant speedups on the
applications
8Example App: DNA Matching
BLAST code from NIH web site
Problem: Find areas of the database DNA chains that match (modulo some mutations) the sample DNA chain
[Figure: a sample DNA chain compared against a database of DNA chains]
9How the Algorithm Works
1. Pick 4 consecutive amino acids from the sample (e.g. bbcf)
2. Generate the 50 most-likely mutations (e.g. becf)
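10Example App: DNA Matching
3. Compare them to every position in the database DNAs (e.g. look for becf)
4. If a match is found, try to extend it
[Figure: becf from the sample DNA aligned with becf in a database DNA chain; ? marks the positions on either side where the match may extend]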
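The four steps above can be sketched in a few lines of Python. This is a minimal illustration and not the BLAST code itself: the alphabet, the function names, and the mutation model (single-residue substitutions taken in enumeration order, instead of the 50 most likely candidates ranked by a scoring matrix) are all assumptions made for the example.

```python
from typing import Iterator, List, Tuple

ALPHABET = "abcdef"  # hypothetical residue alphabet, chosen to fit the bbcf/becf example


def seeds(sample: str, k: int = 4) -> Iterator[Tuple[int, str]]:
    """Step 1: yield every window of k consecutive residues with its offset."""
    for i in range(len(sample) - k + 1):
        yield i, sample[i:i + k]


def mutations(seed: str, limit: int = 50) -> List[str]:
    """Step 2 (simplified): the seed itself plus its single-residue substitutions.

    Assumption: enumeration order stands in for "most likely"; a real
    implementation would rank candidates with a scoring matrix.
    """
    out = [seed]
    for pos in range(len(seed)):
        for ch in ALPHABET:
            if ch != seed[pos]:
                out.append(seed[:pos] + ch + seed[pos + 1:])
    return out[:limit]


def extend(sample: str, s_pos: int, chain: str, c_pos: int, k: int) -> Tuple[int, int]:
    """Step 4: grow the hit left and right while residues agree exactly.

    Returns (start offset in chain, length of the extended match).
    """
    left = 0
    while (s_pos - left - 1 >= 0 and c_pos - left - 1 >= 0
           and sample[s_pos - left - 1] == chain[c_pos - left - 1]):
        left += 1
    right = 0
    while (s_pos + k + right < len(sample) and c_pos + k + right < len(chain)
           and sample[s_pos + k + right] == chain[c_pos + k + right]):
        right += 1
    return c_pos - left, left + k + right


def search(sample: str, database: List[str], k: int = 4) -> List[Tuple[int, int, int]]:
    """Steps 1-4: return sorted (chain index, start, length) hits."""
    hits = set()
    for s_pos, seed in seeds(sample, k):
        for cand in mutations(seed):
            for idx, chain in enumerate(database):
                c_pos = chain.find(cand)  # Step 3: scan every position of the chain
                while c_pos != -1:
                    hits.add((idx, *extend(sample, s_pos, chain, c_pos, k)))
                    c_pos = chain.find(cand, c_pos + 1)
    return sorted(hits)


print(search("abbcfa", ["ddabecfadd", "ffff"]))  # prints [(0, 2, 6)]
```

Step 3 is the inner scan over every database position. It touches the whole database and does little computation per byte, which is the kind of data-intensive loop the slides target.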
11P.Arrays
- Total of 64 per chip (90 mm^2)
- SPMD engines, not SIMD. Cycling at 200 MHz
- 32-bit datapath, integer only, including MPY. 28 instructions
- Organized as a ring, no need for a mesh
- 1 Mbyte of DRAM memory each. Can also access the memory of N and S neighbors
- Two 1-Kbyte row buffers to capture data locality
- 8 Kbyte of SRAM I-memory shared by 4 P.Arrays
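To make the ring organization concrete, the sketch below models which DRAM banks a given P.Array could reach. The counts (64 P.Arrays, 1 Mbyte each) come from the slide. The linear address layout, the choice of p-1 as the N neighbor and p+1 as the S neighbor, and all function names are assumptions made for illustration.

```python
NUM_PARRAYS = 64       # from the slide: 64 P.Arrays per chip
BANK_BYTES = 1 << 20   # from the slide: 1 Mbyte of DRAM per P.Array


def owner(addr: int) -> int:
    """P.Array whose bank holds a chip-local byte address (assumed linear layout)."""
    if not 0 <= addr < NUM_PARRAYS * BANK_BYTES:
        raise ValueError("address outside the chip's DRAM")
    return addr // BANK_BYTES


def reachable_banks(p: int) -> set:
    """Own bank plus the two ring neighbors (assumed: N is p-1, S is p+1, wrapping)."""
    return {(p - 1) % NUM_PARRAYS, p, (p + 1) % NUM_PARRAYS}


def can_access(p: int, addr: int) -> bool:
    """True if P.Array p can reach addr directly, without involving another processor."""
    return owner(addr) in reachable_banks(p)


print(sorted(reachable_banks(0)))  # prints [0, 1, 63]
```

Under these assumptions each P.Array sees 3 Mbyte directly. Data further away would have to be moved by some other agent, which the slide does not describe.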
12P.Array Design
[Figure: P.Array block diagram. Labels: ALU, switches, input reg., R. reg., sense amp / column decoder, controller, DRAM block with ports 0, 1 and 2, address generator, instruction memory, row decoder, broadcast bus]
13P.Mem
- IBM PowerPC 603 with 8 KB D + 8 KB I caches
- Also includes the memory interface
14DRAM Memory
- 512 Mbit (64 Mbyte) with 0.18um
- Organized as 64 banks of 1 MB each (one per
P.Array)
- Internal memory bandwidth 102 Gbytes/s at 200
MHz
- Memory access time at 200 MHz:
  - 2 cycles for row buffer hit
  - 4 cycles for miss
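The numbers on this slide can be cross-checked with a little arithmetic. The sketch below assumes the quoted 102 Gbytes/s is 102.4 Gbytes/s rounded down. The resulting 8 bytes per bank per cycle is an inference from that assumption, not a figure stated in the slides, and the hit-rate model is a plain weighted average.

```python
BANKS = 64            # from the slide: 64 banks, one per P.Array
BANK_MBYTES = 1       # from the slide: 1 MB per bank
CLOCK_HZ = 200e6      # from the slide: 200 MHz
TOTAL_BW = 102.4e9    # assumption: "102 Gbytes/s" is 102.4e9 bytes/s rounded down

CYCLE_NS = 1e9 / CLOCK_HZ                              # 5 ns per cycle
capacity_mbit = BANKS * BANK_MBYTES * 8                # 64 Mbyte expressed in Mbit
bytes_per_bank_cycle = TOTAL_BW / (BANKS * CLOCK_HZ)   # implied width of each bank access


def avg_access_ns(hit_rate: float, hit_cycles: int = 2, miss_cycles: int = 4) -> float:
    """Average access time for a given row-buffer hit rate (weighted average)."""
    cycles = hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles
    return cycles * CYCLE_NS


print(capacity_mbit)          # prints 512
print(bytes_per_bank_cycle)   # prints 8.0
print(avg_access_ns(0.5))     # prints 15.0
```

So the 64 banks of 1 MB account for the full 512 Mbit, and an access costs between 10 ns (row buffer hit) and 20 ns (miss) at 200 MHz.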
15Chip Architecture
16Basic Block
[Figure: basic block layout. Labels: 2 Mb block (512 rows x 4k columns), 256 kB blocks, 8 Mb block, memory control blocks, four P.Arrays, multiplier, 8 kB instruction memory (4-port SRAM)]
17Language & Compiler
- High-level, C-like, explicitly parallel language that exposes the architecture
- Compiler that automatically translates it into structured assembly
- Libraries of Intelligent Memory Operations (IMOs) written in assembly
18Intelligent Memory Ops
- General-purpose operations such as:
  - Arithmetic/logic/symbolic array operations
  - Set operations. Iterators over elements of a set
  - Regular/irregular structure search and update (CAM operations)
- Domain-specific operations, e.g. FFT
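As an illustration of what a general-purpose IMO might look like from the host's side, the sketch below simulates an array operation and a search over data partitioned across 64 banks. The slides do not define an IMO interface, so the names `partition`, `imo_map`, and `imo_search` are hypothetical, and the sequential loops stand in for work that would run on separate P.Arrays in SPMD fashion.

```python
from typing import Callable, List, Sequence

NUM_PARRAYS = 64  # from the slides: 64 P.Arrays per chip


def partition(data: Sequence[int], parts: int = NUM_PARRAYS) -> List[Sequence[int]]:
    """Split data into `parts` contiguous slices, one per simulated bank."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(parts)]


def imo_map(data: Sequence[int], fn: Callable[[int], int]) -> List[int]:
    """Hypothetical array IMO: apply fn to every element, bank by bank."""
    out: List[int] = []
    for bank in partition(data):  # each bank would be handled by its own P.Array
        out.extend(fn(v) for v in bank)
    return out


def imo_search(data: Sequence[int], pred: Callable[[int], bool]) -> List[int]:
    """Hypothetical search IMO: global indices of the elements satisfying pred."""
    banks = partition(data)
    size = len(banks[0])
    hits: List[int] = []
    for bank_id, bank in enumerate(banks):  # local scan per bank, merged by the host
        for offset, value in enumerate(bank):
            if pred(value):
                hits.append(bank_id * size + offset)
    return hits


data = list(range(200))
print(imo_search(data, lambda v: v % 50 == 0))  # prints [0, 50, 100, 150]
```

Each bank is scanned independently, so the pattern fits SPMD engines that mostly touch their own DRAM. A real library would be written in assembly for the P.Arrays, as the previous slide states.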
19Performance Evaluation
- Hardware performance monitoring embedded in the chip
- Software tools to extract and interpret performance info
20Preliminary Results
[Figure: bar chart of relative execution time (axis from 0 to 120) for MPEG2 and Chroma/Keying, comparing Uniprocessor, 1 FlexRAM, and 4 FlexRAM]
21Current Status
- Identified and wrote all applications
- Designed architecture based on apps & IBM technology
- Conceived ideas behind language/compiler
- Need to do: chip layout and fabrication, development of the compiler
- Funds needed for: processor core (P.Mem), chip fabrication, hardware and software engineers
22Conclusion
- A promising technology (MLD)
- Key applications of industrial interest
- Real chance to transform the computing landscape
23Current Research Work
Josep Torrellas, U of Illinois and IBM Watson Ctr.
torrella_at_cs.uiuc.edu http://iacoma.cs.uiuc.edu
September 1998
24Current Research Projects
1. Illinois Aggressive COMA (I-ACOMA): Scalable NUMA and COMA architectures
2. FlexRAM: Advanced Intelligent Memory
3. Speculative Parallelization Hardware
4. Database Workload Characterization: TPC-C, TPC-D, Data mining
=> All projects are in collaboration with IBM Watson
=> Project 4 is also in collaboration with Intel Oregon
25Publications 1997 and 98
1. "Architectural Advances in DSMs: A Possible Road Ahead" by Josep Torrellas, Ninth SIAM Conference on Parallel Processing for Scientific Computing, Spring 1999.
2. "A Direct-Execution Framework for Fast and Accurate Simulation of Superscalar Processors" by Venkata Krishnan and Josep Torrellas, International Conference on Parallel Architectures and Compilation Techniques (PACT), October 1998.
3. "Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor" by Venkata Krishnan and Josep Torrellas, International Conference on Supercomputing (ICS), July 1998.
4. "Comparing Data Forwarding and Prefetching for Communication-Induced Misses in Shared-Memory MPs" by David Koufaty and Josep Torrellas, International Conference on Supercomputing (ICS), July 1998.
5. "Cache-Only Memory Architectures" by Fredrik Dahlgren and Josep Torrellas, IEEE Computer Magazine, to appear 1998.
6. "Executing Sequential Binaries on a Multithreaded Architecture with Speculation Support" by Venkata Krishnan and Josep Torrellas, Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC'98), January 1998.
7. "A Clustered Approach to Multithreaded Processors" by Venkata Krishnan and Josep Torrellas, International Parallel Processing Symposium, March 1998.
8. "Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors" by Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas, Fourth International Symposium on High-Performance Computer Architecture, February 1998.
9. "Enhancing Memory Use in Simple Coma: Multiplexed Simple Coma" by Sujoy Basu and Josep Torrellas, Fourth International Symposium on High-Performance Computer Architecture, February 1998.
10. "How Processor-Memory Integration Affects the Design of DSMs" by Liuxi Yang, Anthony-Trung Nguyen, and Josep Torrellas, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
11. "Efficient Use of Processing Transistors for Larger On-Chip Storage: Multithreading" by Venkata Krishnan and Josep Torrellas, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
12. "The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors" by Pedro Trancoso, Josep-L. Larriba-Pey, Zheng Zhang, and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.
13. "Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA" by Zheng Zhang and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.
14. "Speeding up the Memory Hierarchy in Flat COMA Multiprocessors" by Liuxi Yang and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.