Title: Toward an Advanced Intelligent Memory System
Slide 1: Toward an Advanced Intelligent Memory System
FlexRAM
Josep Torrellas
University of Illinois
http://iacoma.cs.uiuc.edu
torrellas_at_cs.uiuc.edu
Slide 2: People Involved
Students: Michael Huang, Joe Renau, Seung Yoo, Jaejin Lee
Other faculty: David Padua, H. V. Jagadish, Daniel Reed
Slide 3: Technological Landscape
Merged Logic and DRAM (MLD)
- IBM, Mitsubishi, Samsung, Toshiba and others
- Powerful: e.g., IBM SA-27E ASIC (Feb 99)
- 0.18 µm (chips for 1 Gbit DRAM)
- IBM PowerPC 603 processor + 16 KB I and D caches: 3% of the chip
- Further advances on the horizon
Opportunity: How best to exploit MLD?
Slide 4: Terminology
Processor In Memory (PIM)
Intelligent Memory or Intelligent RAM (IRAM)
Slide 5: Key Applications Benefit from HW
- Data Mining (decision trees and neural networks)
- Computational Biology (protein sequence matching)
- Financial Modeling (stock options, derivatives)
- Molecular Dynamics (short-range forces)
- Decision Support Systems (TPC-D)
All these are Data Intensive Applications
Slide 6: Example App: DNA Matching
- Problem: Find areas of the database DNA chains that match (modulo some mutations) the sample DNA chains
Slide 7: How the Algorithm Works
- Pick 4 consecutive amino acids from the sample
- Generate the 50 most likely mutations
Slide 8: Example App: DNA Matching
- Compare them to every position in the database DNAs
- If a match is found, try to extend it
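The steps on slides 7 and 8 (pick a 4-residue seed, generate its most likely mutations, scan the database, extend any hit) can be sketched as follows. This is a minimal illustration, not the code used in the project: the similarity grouping `TOY_GROUPS`, the scoring function `toy_score`, and the restriction to single-substitution mutations are all assumptions made for the example; a real implementation would rank mutations with a proper substitution matrix.

```python
SEED_LEN = 4                         # from the slides: 4 consecutive amino acids
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"    # the 20 standard amino acids

# Hypothetical similarity groups, for illustration only (not a real matrix).
TOY_GROUPS = ("AILMV", "FWY", "DE", "KRH", "STNQ", "CGP")


def toy_score(a, b):
    """Return 1 if a and b fall in the same toy group, else 0."""
    return 1 if any(a in g and b in g for g in TOY_GROUPS) else 0


def neighbors(seed, k=50, score=toy_score):
    """Return the k highest-scoring single-substitution variants of seed."""
    variants = []
    for i, original in enumerate(seed):
        for letter in ALPHABET:
            if letter != original:
                variant = seed[:i] + letter + seed[i + 1:]
                variants.append((score(original, letter), variant))
    # Highest score first; ties broken alphabetically so the result is stable.
    variants.sort(key=lambda sv: (-sv[0], sv[1]))
    return [v for _, v in variants[:k]]


def extend(sample, s_pos, db, d_pos, length):
    """Grow a hit to the right while sample and database agree exactly."""
    while (s_pos + length < len(sample) and d_pos + length < len(db)
           and sample[s_pos + length] == db[d_pos + length]):
        length += 1
    return length


def find_matches(sample, db, k=50):
    """Return (sample_pos, db_pos, matched_length) for every seed hit."""
    hits = []
    for s_pos in range(len(sample) - SEED_LEN + 1):
        seed = sample[s_pos:s_pos + SEED_LEN]
        for candidate in [seed] + neighbors(seed, k):
            d_pos = db.find(candidate)
            while d_pos != -1:        # compare against every database position
                hits.append((s_pos, d_pos,
                             extend(sample, s_pos, db, d_pos, SEED_LEN)))
                d_pos = db.find(candidate, d_pos + 1)
    return hits


hits = find_matches("ACDEFG", "KKACDEFGKK")
```

The inner scan touches every database position for each of the candidate seeds with little reuse, which is the data-intensive, poor-locality access pattern the slides have in mind.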
Slide 9: How to Use MLD
1. Main compute engine of the machine
Incremental gains
- Include a vector processor
- or multiple processors
Hard to program
UC Berkeley: IRAM; Notre Dame: Execube, Petaflops; MIT: Raw; Stanford: Smart Memories
Slide 10: How to Use MLD (II)
2. Co-processor, special-purpose processor
- Process data beside the disk
Stanford: Imagine; UC Berkeley: ISTORE
Slide 11: How to Use MLD (III)
3. Our approach: take the place of memory chips in a workstation or server
- PIM chip processes the memory-intensive parts of the program
Illinois: FlexRAM; UC Davis: Active Pages; USC-ISI: DIVA
Slide 12: Our Solution: Principles
- Extract high bandwidth from DRAM
  - Many simple processing units
- Run legacy codes with high performance
  - Do not replace the off-the-shelf microprocessor in the workstation
  - Take the place of a memory chip, with the same interface as DRAM
  - Intelligent memory defaults to plain DRAM
- Small increase in cost over DRAM
  - Simple processing units, still dense
- General purpose
  - Do not hardwire any algorithm; not special purpose
Slide 13: Architecture Proposed
Slide 14: The FlexRAM Memory System
Can exploit multiple levels of parallelism
For a high-end workstation
- 1 P.Host processor (e.g. Merced, IBM GP)
- 100s of P.Mems in memory (e.g. IBM PowerPC 603)
- 100,000s of very simple P.Arrays in memory
Slide 15: Chip Organization
Slide 16: Memory in one FlexRAM Chip
- 64 Mbytes of DRAM organized as 16M x 32 bits
- Organized in 64 1-Mbyte banks
- Two 2-Kbyte row buffers (no P.Array cache)
- P.Array access to memory: 10 ns (row hit) or 20 ns (miss)
- On-chip memory bandwidth: 102 Gbytes/second
Slide 17: Memory in one FlexRAM Chip
- Each group of 4 P.Arrays shares one 8-Kbyte, 4-ported SRAM instruction memory
- Aggressive access time: 1 cycle = 2.5 ns
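The memory figures on these two slides can be cross-checked with a few lines of arithmetic. The sketch below confirms the capacity and the clock rate implied by a 2.5 ns cycle, and then works backwards from the quoted 102 Gbytes/second to the number of bytes each bank would have to deliver per access. The slides do not state the bank port width or the access rate behind the bandwidth figure, so `implied_bytes_per_access` and the assumption that all 64 banks stream concurrently are illustrative only.

```python
MBYTE = 2 ** 20

# Capacity: 16M words of 32 bits, and equivalently 64 banks of 1 Mbyte.
capacity_bytes = 16 * 2 ** 20 * 32 // 8
banks = 64
bank_bytes = capacity_bytes // banks

# Clock implied by "1 cycle = 2.5 ns".
cycle_ns = 2.5
clock_mhz = 1e3 / cycle_ns


def implied_bytes_per_access(bandwidth_gb_s, n_banks, access_ns):
    """Bytes each bank must supply per access to reach the given bandwidth.

    Assumes every bank completes one access every access_ns, all in parallel.
    (Gbytes/second times nanoseconds gives bytes.)
    """
    return bandwidth_gb_s * access_ns / n_banks


per_access_at_miss = implied_bytes_per_access(102, banks, 20)  # 20 ns (miss)
per_access_at_hit = implied_bytes_per_access(102, banks, 10)   # 10 ns (row hit)
```

Under the 20 ns assumption the quoted bandwidth corresponds to roughly 32 bytes per bank access; under the 10 ns assumption it corresponds to roughly 16 bytes.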
Slide 18: P.Array
- 64 P.Arrays per chip. Not SIMD but SPMD
- 32-bit integer arithmetic; 16 registers
- No caches, no floating point
- 4 P.Arrays share one multiplier
- 28 different 16-bit instructions
- Can access own 1 Mbyte of DRAM plus the DRAM of its left and right neighbors. Connection forms a ring
- Broadcast and notify primitives: barrier
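The ring connectivity can be made concrete with a small sketch. It assumes one 1-Mbyte bank per P.Array (64 banks, 64 P.Arrays) and that bank numbers coincide with P.Array numbers; the function name `accessible_banks` is hypothetical.

```python
NUM_P_ARRAYS = 64  # from the slides: 64 P.Arrays per chip


def accessible_banks(p_array_id, n=NUM_P_ARRAYS):
    """Return the banks a P.Array can reach: left neighbor, own, right neighbor."""
    left = (p_array_id - 1) % n    # wraps around, closing the ring
    right = (p_array_id + 1) % n
    return [left, p_array_id, right]


first = accessible_banks(0)    # [63, 0, 1]
last = accessible_banks(63)    # [62, 63, 0]
```

Each P.Array therefore sees 3 Mbytes of DRAM directly; data outside that window has to be moved by other means, such as the P.Mem.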
Slide 19: P.Mem
- 2-issue static superscalar like IBM PowerPC 603
- Communication with P.Arrays
  - Broadcast/notify or plain write/read to memory
- Communication with other P.Mems
  - Memory in all chips is visible
  - Access via the inter-chip network
  - Must flush caches to ensure data coherence
Slide 20: Issues
Communication P.Mem-P.Host
- P.Mem cannot be the master of the bus
- P.Host starts P.Mems by writing a register in the Rambus interface
- P.Host polls a register in the Rambus interface of the master P.Mem
- If the P.Mem has not finished, the memory controller retries. Retries are invisible to P.Host
Virtual memory
- P.Mems and P.Arrays use virtual memory
- They share a range of virtual addresses with P.Host
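The start/poll handshake between P.Host and a P.Mem can be modeled in a few lines. This is a toy simulation under stated assumptions: the class `PMemInterface`, its methods, and the notion of discrete work steps are all hypothetical and stand in for the registers in the Rambus interface. It only illustrates that the host issues a single read while the memory controller absorbs the retries.

```python
class PMemInterface:
    """Toy model of the start and status registers of one P.Mem (hypothetical)."""

    def __init__(self, work_steps):
        self.started = False
        self.remaining = work_steps

    def write_start(self):
        """P.Host writes the start register to launch the P.Mem."""
        self.started = True

    def step(self):
        """Advance the P.Mem by one unit of work, if it is running."""
        if self.started and self.remaining > 0:
            self.remaining -= 1

    def done(self):
        return self.started and self.remaining == 0


def controller_read_done(iface):
    """Serve one host poll; the controller retries until the P.Mem finishes."""
    if not iface.started:
        raise RuntimeError("P.Mem was never started")
    retries = 0
    while not iface.done():
        iface.step()      # time passes while the controller retries
        retries += 1
    return True, retries  # the host only ever sees the final answer


iface = PMemInterface(work_steps=3)
iface.write_start()                      # P.Host starts the P.Mem
finished, retries = controller_read_done(iface)
```

Because the P.Mem cannot master the bus, completion is always discovered by the host side; the retries keep that discovery cheap from the host's point of view.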
Slide 21: Chip Architecture
Slide 22: Basic Block
Slide 23: Area Estimation (mm²)
VERY CONSERVATIVE
PowerPC 603 + caches                     12
64 Mbytes of DRAM                        330
SRAM instruction memory                  34
P.Arrays                                 96
Multipliers                              10
Rambus interface                         3.4
Pads + network interf. + refresh logic   20
Total                                    505
Of which 28% logic, 65% DRAM, 7% SRAM
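The totals in the area table can be verified directly. The sketch below sums the entries and recomputes the logic/DRAM/SRAM split; the classification of every entry other than the DRAM and the SRAM instruction memory as "logic" is an assumption made here to reproduce the stated percentages.

```python
# (category, area in mm^2) for each entry in the table.
areas = {
    "PowerPC 603 + caches": ("logic", 12),
    "64 Mbytes of DRAM": ("dram", 330),
    "SRAM instruction memory": ("sram", 34),
    "P.Arrays": ("logic", 96),
    "Multipliers": ("logic", 10),
    "Rambus interface": ("logic", 3.4),
    "Pads + network interf. + refresh logic": ("logic", 20),
}

total = sum(area for _, area in areas.values())

# Percentage of the chip taken by each category, rounded to whole percent.
shares = {}
for category, area in areas.values():
    shares[category] = shares.get(category, 0) + area
percent = {cat: round(100 * area / total) for cat, area in shares.items()}
```

The entries add up to 505.4 mm², which the slide rounds to 505, and the split comes out at 28% logic, 65% DRAM, and 7% SRAM.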
Slide 24: Evaluation
Slide 25: Utilization
Slide 26: Utilization
Slide 27: Speedups
Slide 28: Speedups
Slide 29: Programming FlexRAM
- FlexRAM is programmed in C with extensions: C-Flex
- Library of Intelligent Memory Operations (IMOs)
  - C subroutines that can be called from the main program
  - Executed by P.Arrays or P.Mem
  - Operate on large data sets with poor locality
- Library also contains plain subroutines
- Link program with IMOs or plain subroutines
Slide 30: C-Flex Programming Extensions
- On processor_range: where the following code is executed
- Waitfor processor_range: processors waiting for others
- Map object to processor_range: mapping of pages
- Flush(object), FlushInval(object): flush from cache
- Broadcast(address), Poll(), Receive(address), Notify()
- FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()
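The SPMD pattern behind the On and Waitfor extensions can be emulated on an ordinary machine. The sketch below is not C-Flex and does not reproduce its syntax: it uses Python threads to stand in for P.Arrays, and the helper names `on` and `waitfor`, the data set, and the four-way split are all assumptions made for illustration.

```python
import threading


def on(processor_range, body):
    """Start body(pid) on every processor id in the range (cf. On)."""
    threads = [threading.Thread(target=body, args=(pid,))
               for pid in processor_range]
    for t in threads:
        t.start()
    return threads


def waitfor(threads):
    """Block until every processor in the range has finished (cf. Waitfor)."""
    for t in threads:
        t.join()


NUM_PROCS = 4
data = list(range(64))
partial = [0] * NUM_PROCS      # one result slot per emulated processor


def body(pid):
    """Same program on every processor, each working on its own slice (SPMD)."""
    chunk = len(data) // NUM_PROCS
    partial[pid] = sum(data[pid * chunk:(pid + 1) * chunk])


handles = on(range(NUM_PROCS), body)
waitfor(handles)
total = sum(partial)
```

In the real system the slices would be placed with Map so that each processor works on pages in its own bank, and the host would Flush the data from its cache before starting the processors.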
Slide 31: Performance Evaluation
- Hardware performance monitoring embedded in the chip
- Software tools to extract and interpret performance info
Slide 32: Current Status
- Identified and wrote all applications
- Designed architecture based on apps and feasible technology
- Conceived ideas behind language/compiler
- Need to do: chip layout and fabrication, development of the compiler
- Funds needed for
  - processor core (P.Mem)
  - chip fabrication
  - hardware and software engineers
Slide 33: Overall Goal
- Build a workstation with an intelligent memory system
- Build a compiler for the intelligent memory system
- Demonstrate significant speedups on real applications
Slide 34: Conclusion
- We have a handle on
  - A promising technology (MLD)
  - Key applications of industrial interest
- Real chance to transform the computing landscape