Toward an Advanced Intelligent Memory System

1
Toward an Advanced Intelligent Memory System
FlexRAM
Josep Torrellas
University of Illinois
http://iacoma.cs.uiuc.edu
torrellas@cs.uiuc.edu
2
People Involved
Students: Michael Huang, Joe Renau, Seung Yoo, Jaejin Lee
Other faculty: David Padua, H. V. Jagadish, Daniel Reed
3
Technological Landscape
Merged Logic and DRAM (MLD)
  • IBM, Mitsubishi, Samsung, Toshiba and others
  • Powerful: e.g., IBM SA-27E ASIC (Feb. 99)
  • 0.18 µm (chips for 1 Gbit DRAM)
  • Logic frequency: 400 MHz
  • IBM PowerPC 603 proc + 16 KB I, D caches: 3%
  • Further advances on the horizon

Opportunity: How to exploit MLD best?
4
Terminology
Processor In Memory (PIM)
Intelligent Memory or Intelligent RAM (IRAM)

5
Key Applications Benefit from HW
  • Data Mining (decision trees and neural networks)
  • Computational Biology (protein sequence matching)
  • Financial Modeling (stock options, derivatives)
  • Molecular Dynamics (short-range forces)
  • Multimedia (MPEG-2)
  • Decision Support Systems (TPC-D)
  • Speech Recognition

All these are Data Intensive Applications
6
Example App: DNA Matching
  • Problem: Find areas of database DNA chains that
    match (modulo some mutations) the sample DNA
    chains

7
How the Algorithm Works
  • Pick 4 consecutive amino acids from the sample
  • Generate the 50 most likely mutations

8
Example App: DNA Matching
  • Compare them to every position in the database
    DNAs
  • If a match is found, try to extend it
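
The steps on the last two slides (pick a 4-residue seed, generate its most likely mutations, scan every database position, extend on a hit) can be sketched as follows. This is only an illustration under stated assumptions: the similarity classes in GROUPS, the scoring, the restriction to single-residue substitutions, and the function names are all hypothetical and are not taken from the slides.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids
# Hypothetical similarity classes, used only to rank mutations.
GROUPS = ["AVLIM", "FWY", "STNQ", "DE", "KRH", "CGP"]

def score(a, b):
    """Toy substitution score: 2 if identical, 1 if same class, else 0."""
    if a == b:
        return 2
    return 1 if any(a in g and b in g for g in GROUPS) else 0

def likely_mutations(seed, k=50):
    """Return the k best-scoring variants of seed (seed itself included).

    Assumption: only single-residue substitutions are considered.
    """
    variants = {seed: sum(score(c, c) for c in seed)}
    for i, original in enumerate(seed):
        for residue in ALPHABET:
            if residue != original:
                v = seed[:i] + residue + seed[i + 1:]
                variants[v] = sum(score(x, y) for x, y in zip(seed, v))
    ranked = sorted(variants, key=lambda v: (-variants[v], v))
    return ranked[:k]

def find_matches(sample, database, seed_len=4, k=50):
    """Return (sample_pos, database_pos, length) for every extended hit."""
    hits = []
    for s in range(len(sample) - seed_len + 1):
        seeds = set(likely_mutations(sample[s:s + seed_len], k))
        # Compare against every position in the database chain.
        for d in range(len(database) - seed_len + 1):
            if database[d:d + seed_len] in seeds:
                # Extend to the right while residues are identical.
                end = seed_len
                while (s + end < len(sample) and d + end < len(database)
                       and sample[s + end] == database[d + end]):
                    end += 1
                hits.append((s, d, end))
    return hits

hits = find_matches("MKTAYIAK", "GGMKSAYIAKGG")
```

The inner scan over database positions is the data-intensive part: each position is touched once with little reuse, which is the access pattern the slides argue belongs in memory.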

9
How to Use MLD
1. Main compute engine of the machine
  • Add proc to DRAM chip

Incremental gains
  • Include a vector processor
  • or multiple processors

Hard to program
UC Berkeley: IRAM; Notre Dame: Execube, Petaflops;
MIT: Raw; Stanford: Smart Memories
10
How to Use MLD (II)
2. Co-processor, special-purpose processor
  • ATM switch controller
  • Process data beside the disk
  • Graphics accelerator

Stanford: Imagine; UC Berkeley: ISTORE
11
How to Use MLD (III)
3. Our approach: take the place of memory
chips in a workstation or server
  • PIM chip processes the memory-intensive parts
    of the program

Illinois: FlexRAM; UC Davis: Active Pages; USC-ISI: DIVA
12
Our Solution: Principles
  • Extract high bandwidth from DRAM
      • Many simple processing units
  • Run legacy codes with high performance
      • Do not replace off-the-shelf microprocessor in workstation
      • Take place of memory chip. Same interface as DRAM
      • Intelligent memory defaults to plain DRAM
  • Small increase in cost over DRAM
      • Simple processing units, still dense
  • General purpose
      • Do not hardwire any algorithm. No special purpose

13
Architecture Proposed
14
The FlexRAM Memory System
Can exploit multiple levels of parallelism
For a high-end workstation
  • 1 P.Host processor (e.g. Merced, IBM GP)
  • 100s of P.Mems in memory (e.g. IBM PowerPC 603)
  • 100,000s of very simple P.Arrays in memory

15
Chip Organization
16
Memory in one FlexRAM Chip
  • 64 Mbytes of DRAM organized as 16M x 32 bits
  • Organized in 64 1-Mbyte banks
      • Each associated with 1 P.Array
  • Each bank
      • 1 single port
      • 2 2-Kbyte row buffers (no P.Array cache)
      • P.Array access to memory: 10 ns (row hit) or 20
        ns (miss)
  • On-chip memory bandwidth: 102 Gbytes/second
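
One way to arrive at the capacity and bandwidth figures above is sketched below. The slides do not show the derivation, so the key assumption (each of the 64 banks delivers one 32-bit word per 2.5 ns cycle, i.e. at the 400 MHz logic frequency) is a guess that happens to reproduce the quoted number.

```python
BANKS = 64            # banks per chip (from the slide)
WORD_BYTES = 4        # 32-bit organization (from the slide)
CYCLE_NS = 2.5        # assumption: one word per bank per 2.5 ns cycle

# 16M x 32 bits expressed in Mbytes.
capacity_mbytes = 16 * 2**20 * WORD_BYTES / 2**20

# Bytes per nanosecond is numerically the same as Gbytes per second.
per_bank_gbytes_s = WORD_BYTES / CYCLE_NS
total_gbytes_s = BANKS * per_bank_gbytes_s

print(capacity_mbytes, per_bank_gbytes_s, total_gbytes_s)
```

Under these assumptions the aggregate comes to about 102.4 Gbytes/second, consistent with the 102 Gbytes/second on the slide.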

17
Memory in one FlexRAM Chip
Each group of 4 P.Arrays shares one 8-Kbyte,
4-ported SRAM instruction memory
  • Holds the P.Array code
  • Small because the code is short
  • Aggressive access time: 1 cycle = 2.5 ns

18
P.Array
  • 64 P.Arrays per chip. Not SIMD but SPMD
  • 32-bit integer arithmetic; 16 registers
  • No caches, no floating point
  • 4 P.Arrays share one multiplier
  • 28 different 16-bit instructions
  • Can access own 1 Mbyte of DRAM plus DRAM of
    left and right neighbors. Connection forms a
    ring
  • Broadcast and notify primitives: barrier
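
The ring connectivity can be illustrated with a short sketch. The numbering (P.Array i owns bank i, with indices 0 to 63) and the function name are assumptions made for the example; the slides only state that each P.Array reaches its own bank and those of its left and right neighbors.

```python
NUM_P_ARRAYS = 64  # P.Arrays (and banks) per chip, from the slides

def accessible_banks(p_array, n=NUM_P_ARRAYS):
    """Banks reachable by one P.Array: left neighbor, own, right neighbor.

    Indices wrap around, so the first and last P.Arrays are neighbors.
    """
    return ((p_array - 1) % n, p_array, (p_array + 1) % n)

first = accessible_banks(0)
last = accessible_banks(63)
```

With this numbering each P.Array sees 3 Mbytes directly; data anywhere else on the chip has to be reached some other way, for example through the P.Mem.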

19
P.Mem
  • 2-issue static superscalar like IBM PowerPC 603
  • 16-Kbyte I, D caches
  • Executes serial sections
  • Communication with P.Arrays
      • Broadcast/notify or plain write/read to memory
  • Communication with other P.Mems
      • Memory in all chips is visible
      • Access via the inter-chip network
      • Must flush caches to ensure data coherence

20
Issues
Communication P.Mem-P.Host
  • P.Mem cannot be the master of the bus
  • P.Host starts P.Mems by writing a register in the
    Rambus interface
  • P.Host polls a register in the Rambus interface of
    the master P.Mem
  • If P.Mem is not finished, the memory controller
    retries. Retries are invisible to P.Host
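
The start/poll handshake above can be modeled with a toy simulation. Everything here is hypothetical: the register names (start, done), the class and function names, and the idea of counting P.Mem work in ticks are illustration only and do not describe the real Rambus interface.

```python
class RambusInterface:
    """Toy model of the two registers the slide mentions (names assumed)."""

    def __init__(self, work_cycles):
        self.start = 0
        self.done = 0
        self._remaining = work_cycles

    def tick(self):
        # P.Mem only makes progress once the host has written `start`.
        if self.start and not self.done:
            self._remaining -= 1
            if self._remaining <= 0:
                self.done = 1

def memory_controller_read(iface, max_retries=1000):
    """Read `done`, retrying until P.Mem finishes.

    Returns (value, retries); the retry count stays inside the controller.
    """
    retries = 0
    while not iface.done:
        if retries >= max_retries:
            raise TimeoutError("P.Mem did not finish")
        iface.tick()  # time passes while the controller retries
        retries += 1
    return iface.done, retries

def host_run(iface):
    """P.Host view: write the start register, then issue one read."""
    iface.start = 1
    value, _hidden_retries = memory_controller_read(iface)
    return value  # the host never sees how many retries happened

result = host_run(RambusInterface(work_cycles=5))
```

The point of the split between host_run and memory_controller_read is that P.Host issues what looks like an ordinary memory read, so P.Mem never needs to become a bus master.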

Virtual memory
  • P.Mems and P.Arrays use virtual memory
  • They share a range of virtual addresses with
    P.Host

21
Chip Architecture
22
Basic Block
23
Area Estimation (mm²)
VERY CONSERVATIVE
PowerPC 603 + caches                      12
64 Mbytes of DRAM                         330
SRAM instruction memory                   34
P.Arrays                                  96
Multipliers                               10
Rambus interface                          3.4
Pads + network interf. + refresh logic    20
Total                                     505
Of which 28% logic, 65% DRAM, 7% SRAM
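
The totals and the percentage split can be checked with a few lines of arithmetic. The grouping of rows into logic, DRAM, and SRAM is an assumption (in particular, the PowerPC 603 caches are counted as logic here); the slide gives only the final percentages.

```python
# (area in mm^2, assumed category) for each row of the estimate.
areas = {
    "PowerPC 603 + caches": (12, "logic"),
    "64 Mbytes of DRAM": (330, "DRAM"),
    "SRAM instruction memory": (34, "SRAM"),
    "P.Arrays": (96, "logic"),
    "Multipliers": (10, "logic"),
    "Rambus interface": (3.4, "logic"),
    "Pads + network interf. + refresh logic": (20, "logic"),
}

total = sum(area for area, _ in areas.values())

by_category = {}
for area, category in areas.values():
    by_category[category] = by_category.get(category, 0.0) + area

percent = {c: round(100 * a / total) for c, a in by_category.items()}
print(total, percent)
```

Under this grouping the rows sum to about 505 mm² and the rounded shares agree with the 28/65/7 split quoted on the slide.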
24
Evaluation
25
Utilization
  • High P.Array utilization
  • Low P.Mem utilization

26
Utilization
  • Low P.Host Utilization

27
Speedups
  • Constant problem size
  • Scaled problem size

28
Speedups
  • Varying Logic Frequency

29
Programming FlexRAM
  • FlexRAM is programmed in C with extensions (C-Flex)
  • Library of Intelligent Memory Operations (IMOs)
      • C subroutines that can be called from the main program
      • Executed by P.Arrays or P.Mem
      • Operate on large data sets with poor locality
  • Library also contains plain subroutines
  • Link program with IMOs or plain subroutines
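
As a rough picture of what an IMO might look like from the caller's side, the sketch below models one in plain Python rather than C-Flex, whose syntax the slides do not show. The function name, the list-of-banks data layout, and the two-phase structure are assumptions for illustration only.

```python
def imo_count_matches(banks, key):
    """Hypothetical IMO: count occurrences of `key` across all banks.

    `banks` is a list of lists, one per DRAM bank.
    """
    # SPMD phase: every P.Array runs the same code on its own bank.
    partial = [sum(1 for word in bank if word == key) for bank in banks]
    # Serial phase: the P.Mem combines the per-bank results.
    return sum(partial)

banks = [[1, 2, 3], [3, 3], [4]]
count = imo_count_matches(banks, 3)
```

The main program would call such a routine like any other subroutine; whether it runs on the P.Arrays, the P.Mem, or as a plain host subroutine would be decided at link time, as the last bullet suggests.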

30
C-Flex Programming Extensions
  • On processor_range: where the following code is
    executed
  • Waitfor processor_range: processors waiting for
    others
  • Map object to processor_range: mapping of pages
  • Release object
  • Flush(object), FlushInval(object): flush from
    cache
  • Broadcast(address), Poll(), Receive(address),
    Notify()
  • FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()

31
Performance Evaluation
  • Hardware performance monitoring embedded in the
    chip
  • Software tools to extract and interpret
    performance info

32
Current Status
  • Identified and wrote all applications
  • Designed architecture based on apps and feasible
    technology
  • Conceived ideas behind language/compiler
  • Need to do: chip layout and fabrication;
    development of the compiler
  • Funds needed for:
      • processor core (P.Mem)
      • chip fabrication
      • hardware and software engineers

33
Overall Goal
  • Fabricate chips
  • Build a workstation with an intelligent memory
    system
  • Build a compiler for the intelligent memory system
  • Demonstrate significant speedups on real
    applications

34
Conclusion
  • We have a handle on:
      • A promising technology (MLD)
      • Key applications of industrial interest
  • Real chance to transform the computing landscape