Toward an Advanced Intelligent Memory System - PowerPoint PPT Presentation

About This Presentation

Title:

Toward an Advanced Intelligent Memory System

Description:

Need to do: chip layout and fabrication development of the compiler. Funds needed for: ... Fabricate chips. Build a workstation with an intelligent memory system ... – PowerPoint PPT presentation

Slides: 35

Provided by: torr3

Learn more at: https://iacoma.cs.uiuc.edu

Transcript and Presenter's Notes

Title: Toward an Advanced Intelligent Memory System

1
Toward an Advanced Intelligent Memory System
FlexRAM
Josep Torrellas
University of Illinois
http://iacoma.cs.uiuc.edu
torrellas@cs.uiuc.edu
2
People Involved
Students: Michael Huang, Joe Renau, Seung Yoo, Jaejin Lee
Other faculty: David Padua, H. V. Jagadish, Daniel Reed
3
Technological Landscape
Merged Logic and DRAM (MLD)

IBM, Mitsubishi, Samsung, Toshiba and others

Powerful: e.g. IBM SA-27E ASIC (Feb 99)

0.18 µm (chips for 1 Gbit DRAM)

Logic frequency: 400 MHz

IBM PowerPC 603 proc + 16 KB I, D caches 3

Further advances on the horizon

Opportunity: How to exploit MLD best?
4
Terminology
Processor In Memory (PIM)
Intelligent Memory or Intelligent RAM (IRAM)

5
Key Applications Benefit from HW

Data Mining (decision trees and neural networks)

Computational Biology (protein sequence matching)

Financial Modeling (stock options, derivatives)

Molecular Dynamics (short-range forces)

Multimedia (MPEG-2)

Decision Support Systems (TPC-D)

Speech Recognition

All these are Data Intensive Applications
6
Example App: DNA Matching

Problem: Find areas of database DNA chains that match (modulo some mutations) the sample DNA chains

7
How the Algorithm Works

Pick 4 consecutive amino acids from sample

Generate 50 most-likely mutations

8
Example App: DNA Matching

Compare them to every position in the database DNAs

If match is found try to extend it
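
The steps on these two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenters' code: the slides do not say how the "most-likely" mutations are ranked, so `toy_score` is a placeholder with no biological meaning, only single-substitution variants are considered, and the names `likely_mutations`, `extend` and `find_matches` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids
SEED_LEN = 4         # from the slides: 4 consecutive amino acids
NUM_MUTATIONS = 50   # from the slides: 50 most-likely mutations


def toy_score(original, replacement):
    """Placeholder ranking (assumption): closer in the alphabet string = more likely."""
    return -abs(AMINO_ACIDS.index(original) - AMINO_ACIDS.index(replacement))


def likely_mutations(seed, score=toy_score, limit=NUM_MUTATIONS):
    """Return the seed followed by its best-scoring single-substitution variants."""
    variants = []
    for i, original in enumerate(seed):
        for aa in AMINO_ACIDS:
            if aa != original:
                variants.append((score(original, aa), seed[:i] + aa + seed[i + 1:]))
    variants.sort(key=lambda pair: -pair[0])  # stable sort: ties keep generation order
    return [seed] + [variant for _, variant in variants[:limit]]


def extend(sample, s, database, d):
    """Extend a seed hit to the right while the residues agree exactly."""
    length = SEED_LEN
    while (s + length < len(sample) and d + length < len(database)
           and sample[s + length] == database[d + length]):
        length += 1
    return length


def find_matches(sample, database):
    """Compare every seed (and its mutations) to every database position."""
    hits = []
    for s in range(len(sample) - SEED_LEN + 1):
        candidates = set(likely_mutations(sample[s:s + SEED_LEN]))
        for d in range(len(database) - SEED_LEN + 1):
            if database[d:d + SEED_LEN] in candidates:
                hits.append((s, d, extend(sample, s, database, d)))
    return hits


hits = find_matches("MKVLA", "GGMKVLAGG")
```

The inner loop touches every database position for every seed, which is the data-intensive access pattern the talk is targeting.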

9
How to Use MLD
1. Main compute engine of the machine

Add proc to DRAM chip

Incremental gains

Include a vector processor or multiple processors

Hard to program
UC Berkeley: IRAM; Notre Dame: Execube, Petaflops; MIT: Raw; Stanford: Smart Memories
10
How to Use MLD (II)
2. Co-processor, special-purpose processor

ATM switch controller

Process data beside the disk

Graphics accelerator

Stanford: Imagine; UC Berkeley: ISTORE
11
How to Use MLD (III)
3. Our approach: take the place of memory chips in a workstation or server

PIM chip processes the memory-intensive parts of the program

Illinois: FlexRAM; UC Davis: Active Pages; USC-ISI: DIVA
12
Our Solution: Principles

Extract high bandwidth from DRAM
  Many simple processing units
Run legacy codes with high performance
  Do not replace off-the-shelf µP in workstation
  Take place of memory chip. Same interface as DRAM
  Intelligent memory defaults to plain DRAM
Small increase in cost over DRAM
  Simple processing units, still dense
General purpose
  Do not hardwire any algorithm. Not special purpose

13
Architecture Proposed
14
The FlexRAM Memory System
Can exploit multiple levels of parallelism
For a high-end workstation:

1 P.Host processor (e.g. Merced, IBM GP)

100s of P.Mems in memory (e.g. IBM PowerPC 603)

100,000s of very simple P.Arrays in memory

15
Chip Organization
16
Memory in one FlexRAM Chip

64 Mbytes of DRAM organized as 16M x 32 bits

Organized in 64 1-Mbyte banks

Each bank is associated with 1 P.Array

Each bank has:

1 single port

2 2-Kbyte row buffers (no P.Array cache)

P.Array access to memory: 10 ns (row hit) or 20 ns (miss)

On-chip memory bandwidth: 102 Gbytes/second
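
The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes that each of the 64 banks delivers one 32-bit word per 2.5 ns cycle (the 400 MHz logic clock quoted elsewhere in the talk). The slides do not state how the 102 Gbytes/second figure was derived, so that derivation is an assumption, and the `mean_access_ns` helper with its hit-rate parameter is hypothetical.

```python
BANKS = 64
BANK_BYTES = 1 << 20   # 1 Mbyte per bank
WORD_BYTES = 4         # 32-bit words
CYCLE_NS = 2.5         # assumption: one access per 400 MHz cycle per bank

# Capacity: 64 banks x 1 Mbyte = 64 Mbytes, i.e. 16M words of 32 bits.
capacity_bytes = BANKS * BANK_BYTES
words = capacity_bytes // WORD_BYTES

# Peak on-chip bandwidth if all banks stream in parallel.
# Bytes per nanosecond is numerically the same as Gbytes per second.
peak_gbytes_per_s = BANKS * WORD_BYTES / CYCLE_NS


def mean_access_ns(row_hit_rate, hit_ns=10.0, miss_ns=20.0):
    """Average P.Array access time for a given row-buffer hit rate (0.0 to 1.0)."""
    return row_hit_rate * hit_ns + (1.0 - row_hit_rate) * miss_ns


print(capacity_bytes >> 20, "Mbytes,", words >> 20, "M words")
print(peak_gbytes_per_s, "Gbytes/s peak")
```

Under these assumptions the peak comes to 102.4 Gbytes/second, in line with the figure on the slide.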

17
Memory in one FlexRAM Chip
Group of 4 P.Arrays share one 8-Kbyte, 4-ported SRAM instruction memory

Holds the P.Array code

Small because short code

Aggressive access time: 1 cycle = 2.5 ns

18
P.Array

64 P.Arrays per chip. Not SIMD but SPMD

32-bit integer arithmetic; 16 registers

No caches, no floating point

4 P.Arrays share one multiplier

28 different 16-bit instructions

Can access own 1 Mbyte of DRAM plus DRAM of left and right neighbors. Connection forms a ring

Broadcast and notify primitives; barrier
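
The ring connectivity described above can be sketched as follows. The mapping of bank `i` to the chip-local address range `[i Mbyte, (i+1) Mbyte)` is an assumption made for illustration, and the function names are hypothetical; the slides only state that each P.Array reaches its own bank and those of its left and right neighbors.

```python
NUM_P_ARRAYS = 64
BANK_BYTES = 1 << 20  # 1 Mbyte of DRAM per P.Array


def reachable_banks(p_array):
    """Banks a P.Array can access: left neighbor, its own, right neighbor."""
    left = (p_array - 1) % NUM_P_ARRAYS   # wraps around: the connection forms a ring
    right = (p_array + 1) % NUM_P_ARRAYS
    return (left, p_array, right)


def can_access(p_array, address):
    """True if a chip-local byte address falls in a bank this P.Array can reach.

    Assumes bank i holds addresses [i * BANK_BYTES, (i + 1) * BANK_BYTES).
    """
    if not 0 <= address < NUM_P_ARRAYS * BANK_BYTES:
        raise ValueError("address outside the 64-Mbyte chip")
    return address // BANK_BYTES in reachable_banks(p_array)


print(reachable_banks(0))
print(can_access(0, 2 * BANK_BYTES))
```

Data outside the three reachable banks would have to be handled some other way, for example by the P.Mem, which the talk describes as seeing the memory of all chips.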

19
P.Mem

2-issue static superscalar, like the IBM PowerPC 603

16-Kbyte I, D caches

Executes serial sections

Communication with P.Arrays:

Broadcast/notify or plain write/read to memory

Communication with other P.Mems:

Memory in all chips is visible

Access via the inter-chip network

Must flush caches to ensure data coherence

20
Issues
Communication P.Mem-P.Host:

P.Mem cannot be the master of the bus

P.Host starts P.Mems by writing a register in the Rambus interface

P.Host polls a register in the Rambus interface of the master P.Mem

If the P.Mem has not finished, the memory controller retries. Retries are invisible to P.Host

Virtual memory:

P.Mems and P.Arrays use virtual memory

They share a range of virtual addresses with P.Host
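
The start-and-poll handshake between P.Host and P.Mem can be modelled with a toy simulation. Everything here is a sketch under assumptions: the class and method names are hypothetical, the "work" is just a cycle counter, and the real register layout of the Rambus interface is not given in the slides.

```python
class RambusInterface:
    """Toy model of the start and status registers of the master P.Mem."""

    def __init__(self, work_cycles):
        self.started = False
        self.remaining = work_cycles  # stand-in for the P.Mem's outstanding work

    def write_start(self):
        self.started = True

    def tick(self):
        """Let the P.Mem make one step of progress."""
        if self.started and self.remaining > 0:
            self.remaining -= 1

    def finished(self):
        return self.started and self.remaining == 0


class MemoryController:
    """Retries the status read until the P.Mem is done; P.Host never sees the retries."""

    def __init__(self, interface, max_retries=1000):
        self.interface = interface
        self.max_retries = max_retries
        self.retries = 0

    def read_status(self):
        while not self.interface.finished():
            if self.retries >= self.max_retries:
                raise TimeoutError("P.Mem did not finish")
            self.interface.tick()
            self.retries += 1
        return True


def host_offload(controller):
    """P.Host side: write the start register, then poll the status register."""
    controller.interface.write_start()
    return controller.read_status()  # looks like a single (slow) read to P.Host


controller = MemoryController(RambusInterface(work_cycles=3))
done = host_offload(controller)
```

Because the P.Mem cannot be the bus master, completion has to be pulled by the host side; hiding the retries in the controller keeps the polling read looking like an ordinary memory read to P.Host.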

21
Chip Architecture
22
Basic Block
23
Area Estimation (mm²)
VERY CONSERVATIVE
PowerPC 603 + caches: 12
64 Mbytes of DRAM: 330
SRAM instruction memory: 34
P.Arrays: 96
Multipliers: 10
Rambus interface: 3.4
Pads + network interf. + refresh logic: 20
Total: 505
Of which 28% logic, 65% DRAM, 7% SRAM
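
The total and the percentage split can be re-derived from the individual entries. The grouping of entries into logic, DRAM and SRAM below is an assumption (the slide does not spell it out), chosen because it reproduces the stated 28/65/7 split.

```python
# Area in mm^2 per component, tagged with an assumed category.
AREAS_MM2 = {
    "PowerPC 603 + caches": ("logic", 12),
    "64 Mbytes of DRAM": ("dram", 330),
    "SRAM instruction memory": ("sram", 34),
    "P.Arrays": ("logic", 96),
    "Multipliers": ("logic", 10),
    "Rambus interface": ("logic", 3.4),
    "Pads + network interf. + refresh logic": ("logic", 20),
}

total = sum(area for _, area in AREAS_MM2.values())

# Sum the area of each category, then express it as a percentage of the total.
by_category = {}
for category, area in AREAS_MM2.values():
    by_category[category] = by_category.get(category, 0) + area

shares = {category: 100 * area / total for category, area in by_category.items()}

print(round(total), "mm^2 total")
for category, share in shares.items():
    print(category, round(share), "%")
```

The entries sum to 505.4 mm², which the slide rounds to 505.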
24
Evaluation
25
Utilization