Implementing Advanced Intelligent Memory - PowerPoint PPT Presentation

1
Implementing Advanced Intelligent Memory
Josep Torrellas, U of Illinois & IBM Watson Ctr.
David Padua and Dan Reed, U of Illinois
torrella_at_watson.ibm.com, padua_at_cs.uiuc.edu, reed_at_cs.uiuc.edu
September 1998
2
Technological Opportunity
We can fabricate a large silicon area of Merged Logic and DRAM (MLD)
Question: How best to exploit this capability to advance computing?
3
Pieces of the Puzzle
  • Today
      • 256 Mbit MLD process with 0.25um
      • Includes logic running at 200 MHz
      • E.g. 2 IBM PowerPC 603 with 8 KB I and D caches take 10% of the chip
  • Manufacturers
      • IBM CMOS-7LD technology available Fall 98
      • Japanese manufacturers (NEC, Fujitsu) are in the lead
  • In a couple of years: 512 Mbit MLD process at 0.18um

4
Key Applications Clamor for HW
  • Data Mining (decision trees and neural networks)
  • Computational Biology (DNA sequence matching)
  • Financial Modeling (stock options, derivatives)
  • Molecular Dynamics (short-range forces)
  • Plus the typical ones: MPEG, TPC-D, speech recognition

All are Data Intensive Applications
5
Our Solution Principles
1. Extract high bandwidth from DRAM
=> Many simple processing units
2. Run legacy codes w/ high performance
=> Do not replace off-the-shelf uP in workstation
=> Take place of memory chip. Same interface as DRAM
=> Intelligent memory defaults to plain DRAM
3. Small increase in cost over DRAM
=> Simple processing units, still dense
4. General purpose
=> Do not hardwire any algorithm. No special purpose
6
Architecture Proposed
[Figure: block diagram of the proposed architecture. Labels, in order: P.Host; L1, L2 cache; P.Mem; cache; plain DRAM; P.Array; DRAM; FlexRAM; network.]
7
Proposed Work
  • Design an architecture based on key IBM
    applications
  • Fabricate chips using IBM CMOS-7LD technology
  • Build a workstation w/ an intelligent memory
    system
  • Build a language and compiler for the
    intelligent memory
  • Demonstrate significant speedups on the
    applications

8
Example App: DNA Matching
BLAST code from NIH web site
Problem: Find areas of the database DNA chains that match (modulo some mutations) the sample DNA chain
[Figure labels: sample DNA; database of DNA chains.]
9
How the Algorithm Works
1. Pick 4 consecutive amino acids from the sample, e.g. bbcf
2. Generate the 50 most-likely mutations, e.g. becf
10
Example App: DNA Matching
3. Compare them to every position in the database DNAs
4. If a match is found, try to extend it
[Figure labels: sample DNA; becf; ?; ?; database of DNA chains; becf.]
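The four steps above can be sketched in a few lines of Python. This is a toy illustration, not the NIH BLAST code: the six-letter alphabet, the function names, and the mismatch budget are all assumptions made for the example, and step 2 is simplified to "all single-letter substitutions" because ranking the 50 most likely mutations would need a scoring matrix that the slides do not give.

```python
# Illustrative sketch only; names and parameters are hypothetical.
AMINO_ALPHABET = "abcdef"  # toy alphabet matching the letters used on the slides


def seeds(sample, width=4):
    """Step 1: every run of `width` consecutive residues in the sample."""
    return [(i, sample[i:i + width]) for i in range(len(sample) - width + 1)]


def mutations(word, alphabet=AMINO_ALPHABET):
    """Step 2 (simplified): the word plus all single-letter substitutions."""
    out = {word}
    for pos, old in enumerate(word):
        for new in alphabet:
            if new != old:
                out.add(word[:pos] + new + word[pos + 1:])
    return out


def extend(sample, s_pos, chain, c_pos, width, max_mismatch=1):
    """Step 4 (simplified): grow the hit to the right, tolerating a few mismatches."""
    length, mismatches = width, 0
    while s_pos + length < len(sample) and c_pos + length < len(chain):
        if sample[s_pos + length] != chain[c_pos + length]:
            mismatches += 1
            if mismatches > max_mismatch:
                break
        length += 1
    return length


def search(sample, database, width=4):
    """Steps 1-4: list of (chain index, chain position, sample position, length)."""
    hits = []
    for s_pos, word in seeds(sample, width):
        variants = mutations(word)
        for c_idx, chain in enumerate(database):
            for c_pos in range(len(chain) - width + 1):  # step 3: every position
                if chain[c_pos:c_pos + width] in variants:
                    hits.append((c_idx, c_pos, s_pos,
                                 extend(sample, s_pos, chain, c_pos, width)))
    return hits
```

Step 3 is the data-intensive part: every seed variant is compared against every position of every database chain, which is the kind of scan that many simple processing units sitting next to the DRAM could share.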
11
P.Arrays
  • Total of 64 per chip (90 mm²)
  • SPMD engines, not SIMD. Cycling at 200 MHz
  • 32-bit datapath, integer only, including MPY. 28 instructions
  • Organized as a ring, no need for a mesh
  • Each P.Array:
      • 1 Mbyte of DRAM memory. Can also access the memory of N and S neighbors
      • Two 1-Kbyte row buffers to capture data locality
      • 8 Kbyte of SRAM I-memory shared by 4 P.Arrays
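The ring organization and the sharing of instruction memory can be made concrete with a little index arithmetic. The sketch below is only an illustration: mapping the "N" neighbor to index i-1 and the "S" neighbor to i+1, and grouping consecutive P.Arrays onto one instruction memory, are assumptions; the slides do not specify the numbering.

```python
NUM_PARRAYS = 64      # P.Arrays per chip (from the slide)
BANK_BYTES = 1 << 20  # 1 Mbyte of DRAM per P.Array (from the slide)
IMEM_GROUP = 4        # P.Arrays sharing one 8-Kbyte instruction memory


def neighbors(i):
    """Ring neighbors of P.Array i; N = i-1 and S = i+1 is an assumed numbering."""
    return (i - 1) % NUM_PARRAYS, (i + 1) % NUM_PARRAYS


def reachable_banks(i):
    """Banks P.Array i can address: its own plus its two ring neighbors."""
    north, south = neighbors(i)
    return [north, i, south]


def imem_index(i):
    """Shared instruction memory used by P.Array i, assuming consecutive grouping."""
    return i // IMEM_GROUP
```

Under these assumptions each P.Array can address 3 Mbytes directly (its own bank and two neighbors), and the chip holds 64 / 4 = 16 instruction memories.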
12
P.Array Design
[Figure: P.Array design. Labels: ALU; switches; input reg.; R.reg.; sense amp/col. dec.; controller; port 0; DRAM block; port 1; addr. gen.; port 2; switches; instr. mem; row decoder; broadcast bus.]
13
P.Mem
  • IBM PowerPC 603 with 8 KB D + 8 KB I cache
  • About 15 mm²
  • 200 MHz
  • Also included: memory interface

14
DRAM Memory
  • 512 Mbit (64 Mbyte) with 0.18um
  • Organized as 64 banks of 1 MB each (one per
    P.Array)
  • 2.2V operating voltage
  • Internal memory bandwidth: 102 Gbytes/s at 200 MHz
  • Memory access time at 200 MHz:
      • 2 cycles for row buffer hit
      • 4 cycles for miss
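The figures on this slide can be cross-checked with simple arithmetic. In the sketch below, the capacity, bank count, clock rate, and hit/miss cycle counts come from the slide; the 8-byte-per-bank-per-cycle transfer width is an assumption chosen because it reproduces the quoted bandwidth, and the hit rates passed to `avg_access_ns` are hypothetical.

```python
BANKS = 64                      # one bank per P.Array (from the slide)
BANK_BYTES = 1 << 20            # 1 MB per bank (from the slide)
CLOCK_HZ = 200e6                # 200 MHz (from the slide)
BYTES_PER_BANK_PER_CYCLE = 8    # assumption, not stated on the slide

# Capacity: 64 banks x 1 MB = 64 MB = 512 Mbit.
capacity_mbit = BANKS * BANK_BYTES * 8 / (1 << 20)

# Aggregate bandwidth if every bank delivers 8 bytes each cycle.
bandwidth_gb_s = BANKS * BYTES_PER_BANK_PER_CYCLE * CLOCK_HZ / 1e9


def avg_access_ns(hit_rate, hit_cycles=2, miss_cycles=4):
    """Mean access time in ns for a given row-buffer hit rate."""
    cycles = hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles
    return cycles / CLOCK_HZ * 1e9
```

With these numbers the capacity comes to 512 Mbit and the bandwidth to 102.4 Gbytes/s, consistent with the slide's rounded 102 Gbytes/s. At 200 MHz a cycle is 5 ns, so a row buffer hit costs 10 ns and a miss 20 ns.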

15
Chip Architecture
16
Basic Block
[Figure: basic block floorplan. Labels: 512 rows x 4k columns; 2 Mb block; two 256 kB blocks; four memory control blocks; four P.Arrays; multiplier; 8 kB instruction memory (4-port SRAM); 8 Mb block.]
17
Language & Compiler
  • High-level, C-like, explicitly parallel language that exposes the architecture
  • Compiler that automatically translates it into structured assembly
  • Libraries of Intelligent Memory Operations (IMOs) written in assembly

18
Intelligent Memory Ops
  • General-purpose operations such as:
      • Arithmetic/logic/symbolic array operations
      • Set operations. Iterators over elements of a set
      • Regular/irregular structure search and update (CAM operations)
  • Domain-specific operations, e.g. FFT
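As a rough picture of what a search-style IMO might do, the sketch below simulates an SPMD search in plain Python: the data is split into one chunk per P.Array, every P.Array runs the same small program on its own chunk, and the results are gathered. The function names, the contiguous data layout, and the gather step are all hypothetical; the slides do not define the IMO interface, and the loop runs sequentially here where the chip would run the banks in parallel.

```python
NUM_PARRAYS = 64  # P.Arrays per chip


def chunk_size(n, parts=NUM_PARRAYS):
    """Elements per bank, rounded up so that all n elements are covered."""
    return -(-n // parts)  # ceiling division


def partition(data, parts=NUM_PARRAYS):
    """Split data into contiguous chunks, one per P.Array bank (assumed layout)."""
    size = chunk_size(len(data), parts)
    return [data[k * size:(k + 1) * size] for k in range(parts)]


def local_search(bank, key):
    """The program each P.Array runs on its own bank: offsets where key occurs."""
    return [off for off, value in enumerate(bank) if value == key]


def imo_search(data, key):
    """Hypothetical search IMO: scatter, search each bank locally, gather indices."""
    size = chunk_size(len(data))
    hits = []
    for pid, bank in enumerate(partition(data)):  # conceptually parallel on chip
        hits.extend(pid * size + off for off in local_search(bank, key))
    return hits
```

For example, `imo_search(list(range(200)) + [7], 7)` returns the global indices `[7, 200]`, each found by a different simulated bank.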

19
Performance Evaluation
  • Hardware performance monitoring embedded in the chip
  • Software tools to extract and interpret performance info

20
Preliminary Results
[Figure: bar chart. Y axis: relative execution time, 0 to 120. Series: Uniprocessor, 1 FlexRAM, 4 FlexRAM. Applications: MPEG2, Chroma/Keying.]
21
Current Status
  • Identified and wrote all applications
  • Designed architecture based on apps & IBM technology
  • Conceived ideas behind language/compiler
  • Need to do:
      • chip layout and fabrication
      • development of the compiler
  • Funds needed for:
      • processor core (P.Mem)
      • chip fabrication
      • hardware and software engineers

22
Conclusion
  • We have a handle on:
      • A promising technology (MLD)
      • Key applications of industrial interest
  • Real chance to transform the computing landscape

23
Current Research Work
Josep Torrellas, U of Illinois & IBM Watson Ctr.
torrella_at_cs.uiuc.edu, http://iacoma.cs.uiuc.edu
September 1998
24
Current Research Projects
  • 1. Illinois Aggressive COMA (I-ACOMA): Scalable NUMA and COMA architectures
  • 2. FlexRAM: Advanced Intelligent Memory
  • 3. Speculative Parallelization Hardware
  • 4. Database Workload Characterization: TPC-C, TPC-D, data mining

=> All projects are in collaboration with IBM Watson
=> Project 4 is also in collaboration with Intel Oregon
25
Publications 1997 and 98
1. "Architectural Advances in DSMs: A Possible Road Ahead" by Josep Torrellas, Ninth SIAM Conference on Parallel Processing for Scientific Computing, Spring 1999.
2. "A Direct-Execution Framework for Fast and Accurate Simulation of Superscalar Processors" by Venkata Krishnan and Josep Torrellas, International Conference on Parallel Architectures and Compilation Techniques (PACT), October 1998.
3. "Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor" by Venkata Krishnan and Josep Torrellas, International Conference on Supercomputing (ICS), July 1998.
4. "Comparing Data Forwarding and Prefetching for Communication-Induced Misses in Shared-Memory MPs" by David Koufaty and Josep Torrellas, International Conference on Supercomputing (ICS), July 1998.
5. "Cache-Only Memory Architectures" by Fredrik Dahlgren and Josep Torrellas, IEEE Computer Magazine, to appear 1998.
6. "Executing Sequential Binaries on a Multithreaded Architecture with Speculation Support" by Venkata Krishnan and Josep Torrellas, Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC'98), January 1998.
7. "A Clustered Approach to Multithreaded Processors" by Venkata Krishnan and Josep Torrellas, International Parallel Processing Symposium, March 1998.
8. "Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors" by Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas, Fourth International Symposium on High-Performance Computer Architecture, February 1998.
9. "Enhancing Memory Use in Simple COMA: Multiplexed Simple COMA" by Sujoy Basu and Josep Torrellas, Fourth International Symposium on High-Performance Computer Architecture, February 1998.
10. "How Processor-Memory Integration Affects the Design of DSMs" by Liuxi Yang, Anthony-Trung Nguyen, and Josep Torrellas, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
11. "Efficient Use of Processing Transistors for Larger On-Chip Storage: Multithreading" by Venkata Krishnan and Josep Torrellas, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
12. "The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors" by Pedro Trancoso, Josep-L. Larriba-Pey, Zheng Zhang, and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.
13. "Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA" by Zheng Zhang and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.
14. "Speeding up the Memory Hierarchy in Flat COMA Multiprocessors" by Liuxi Yang and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.