Alpha 21172 Inside out

Date: Friday, October 10, 1997

Provided by: myronca

Transcript and Presenter's Notes

Title: Alpha 21172 Inside out


1
Alpha 21172 Inside out
  • Zhihui Huang (Jerry)
  • University of Michigan

2
Components
  • One 21172-CA chip: control, I/O, address chip (CIA)
    388 pins, plastic ball grid array (PBGA)
  • Four 21172-BA chips: data switch chip (DSW)
    208 pins, plastic quad flat pack (PQFP)

3
Data Paths
  • 64-bit data path between CIA and DSW (iod)
  • 128-bit data path between 21164 and DSW (cpu_dat)
  • 256-bit memory data path between DSW and memory (mem_dat)

4
3-way Interface
[Figure: 3-way interface, with parts labeled Vein, Brain, and Heart]
5
Memory
The DRAM is contained in one bank of SIMMs,
whether there are 4 SIMMs or 8 SIMMs.
[Figure: SIMMs labeled DRAM 1 through DRAM 8]
4 SIMMs fill a data bus of 128 bits
8 SIMMs fill a data bus of 256 bits
Needs a jumper
6
Memory block
It is better to use the 256-bit configuration;
otherwise you pay the full price for the DSWs and use
only half of their resources.
A 256-bit block is composed of bit slices across
all 8 SIMMs. The slices are interleaved
across the 4 DSWs.
[Figure: bit slices 15-0, 31-16, 47-32, 63-48, 79-64, 95-80, 102-96, 127-102]
It may now be clear why this is a one-bank scheme
in which all the SIMMs must have the same size.
As you just saw, the 4 DSWs together provide the
lower 128 bits of the memory bus. In the
256-bit configuration, the DSWs also provide
the upper part of the bus.
7
Bcache and Memory
Direct-mapped: a cache where the cache location for a given
address is determined from the middle
address bits. If the cache line size is 2^n, then
the bottom n address bits correspond to an offset
within a cache entry. If the cache can hold 2^m
entries, then the next m address bits give the
cache location. The remaining top address bits
are stored as a TAG along with the entry. In
this scheme, there is no choice of which block to
flush on a cache miss, since there is only one
place for any block to go. This simple scheme has
the disadvantage that if the program alternately
accesses different addresses that map to the
same cache location, it will suffer a cache
miss on every access to these locations. This
kind of cache conflict is quite likely on a
multi-processor.
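The address split described above can be sketched in a few lines. This is only an illustration: the function name split_address and the example sizes (64-byte lines, a 1-Mbyte cache, so n = 6 and m = 14) are assumptions chosen for the example, not values taken from the 21172 documentation.

```python
def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, offset) for a direct-mapped cache.

    offset_bits is n (line size 2**n bytes), index_bits is m (2**m entries).
    """
    offset = addr & ((1 << offset_bits) - 1)                  # bottom n bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)   # next m bits
    tag = addr >> (offset_bits + index_bits)                  # remaining top bits
    return tag, index, offset


# Assumed example geometry: 64-byte lines (n = 6), 2**14 entries (m = 14).
N, M = 6, 14
a = 0x123440
b = a + (1 << (N + M))   # exactly one cache size above a

tag_a, index_a, _ = split_address(a, N, M)
tag_b, index_b, _ = split_address(b, N, M)
# Same index, different tag: a and b evict each other on alternating accesses.
print(index_a == index_b, tag_a != tag_b)
```

Two addresses that differ by a multiple of the cache size always land on the same entry, which is the conflict case the note describes.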
Write-back: a cache architecture in which data is
written to main memory only when it is forced out of
the cache. The opposite of write-through.
Block size: the Scache and Bcache block size is either
64 bytes or 32 bytes. The Scache and Bcache
always have identical block sizes. All Bcache
and main memory FILLs and write transactions are
of the selected block size.
Write-allocate: a cache line is allocated when a
memory write misses the cache.
  • 3rd Level Cache for the 21164
  • Attributes
  • optional, external, physical, synchronous SRAM
  • direct-mapped, write-back, write-allocate
  • 256-bit or 512-bit block
  • cache size of 1, 2, 4, 8, 16, 32, or 64 Mbytes
  • supports up to 512 MB of memory
  • 1MBx36, 2MBx36, 4MBx36, 8MBx36, 16MBx36

8
PCI features
  • Supports 64-bit PCI bus width
  • Supports 64-bit PCI addressing (DAC cycles)
  • Accepts PCI fast back-to-back cycles
  • addr, data0, data1, data2, ..., addr_again!
  • FRAME is deasserted for only one cycle to
    allow the last transfer to finish
  • Issues PCI fast back-to-back cycles in dense
    address space

9
CIA Transactions
  • 21164 memory read miss
  • 21164 memory read miss with victim
  • 21164 I/O read
  • 21164 I/O write
  • DMA read
  • DMA read (prefetch)
  • DMA write

10
DSW Data Paths
11
DSW Buffers
  • DMA Buffer Sets (0 and 1)
  • PCI buffer for PCI DMA write data
  • Memory buffer for memory data
  • Flush buffer for system bus data

[Figure: DSW buffers. Labels: DMA 0, DMA 1, PCI, Flush, IOD, MEM]
12
DMA Writes
Memory
  • Data arrives in the PCI Buffer
  • Memory Buffer loaded at the same time
  • Bcache line flushed and Flush buffer loaded
  • 3 sources merged and data written back to memory

As you just saw, the DMA operation causes the PCI
buffer to be loaded from the IOD bus, the MEM buffer
to be loaded from memory, and the Flush buffer to be
loaded from the system bus, all at the same time.
The 3 sources are then merged and written back
to main memory.
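A minimal sketch of the three-way merge follows. The slides do not spell out the merge rule, so the priority used here is an assumption: bytes written by the PCI device win, a flushed Bcache line (if any) overrides stale memory data, and the memory copy fills in the rest. The function name merge_dma_write and the per-byte mask are hypothetical.

```python
def merge_dma_write(mem_block, flush_block, pci_data, pci_byte_mask):
    """Merge the three DMA-write sources into the block written back to memory.

    mem_block     -- bytes loaded into the MEM buffer from main memory
    flush_block   -- bytes loaded into the Flush buffer from the Bcache,
                     or None if the Bcache held no copy of the line
    pci_data      -- bytes loaded into the PCI buffer from the IOD bus
    pci_byte_mask -- True for each byte the PCI device actually wrote

    Assumed priority (not stated on the slides): PCI > Flush > MEM.
    """
    base = flush_block if flush_block is not None else mem_block
    return bytes(p if written else b
                 for p, b, written in zip(pci_data, base, pci_byte_mask))


mem = bytes([0x00] * 8)
flush = bytes([0x11] * 8)
pci = bytes([0x99] * 8)
mask = [True] * 4 + [False] * 4   # device wrote only the first 4 bytes

print(merge_dma_write(mem, flush, pci, mask).hex())
```

The point of the sketch is only that a partial DMA write cannot go straight to memory: the untouched bytes must come from the most recent copy of the line.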
[Figure: DMA write data flow. Labels: DMA 0, 21164 BCache, Flush, PCI, IOD, MEM]
13
21164 Read Transaction
  • If hit in the Bcache, no memory access is
    required

HIT !!
14
21164 Read Miss
  • If a read misses in the Bcache, a memory
    access is required.

[Figure: Read Miss Path. Labels: command, Memory, 21164 BCache, 21172 BA, SYS, MEM, Miss!!]
15
Read Miss With Victim
  • Two scenarios
  • write data with a different address tag into a
    valid cache line (write allocate!!)
  • read data with a different address tag into a valid
    cache line (read allocate!!)
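The two scenarios can be illustrated with a toy direct-mapped, write-back, write-allocate cache that classifies each access. This is a sketch, not a model of the 21164: the class name is hypothetical, and it assumes that only a modified (dirty) line has to be written back as a victim, which is the usual write-back behavior; the slide itself only says "valid".

```python
class DirectMappedCache:
    """Toy direct-mapped, write-back, write-allocate cache (tags only, no data)."""

    def __init__(self, offset_bits, index_bits):
        self.offset_bits = offset_bits
        self.index_bits = index_bits
        self.lines = {}  # index -> (tag, dirty)

    def access(self, addr, is_write):
        index = (addr >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        line = self.lines.get(index)
        if line is not None and line[0] == tag:
            # Hit: a write marks the line dirty, a read leaves it unchanged.
            self.lines[index] = (tag, line[1] or is_write)
            return "hit"
        # Miss: allocate the line for reads and writes alike.
        # Assumption: a victim write-back is needed only if the old line is dirty.
        has_victim = line is not None and line[1]
        self.lines[index] = (tag, is_write)
        return "miss with victim" if has_victim else "miss"


cache = DirectMappedCache(offset_bits=6, index_bits=14)  # assumed geometry
for addr, is_write in [(0x0, False), (0x0, True), (1 << 20, False), (2 << 20, False)]:
    print(hex(addr), "write" if is_write else "read", "->", cache.access(addr, is_write))
```

The third access maps to the same entry as the first two but carries a different tag, so the dirty line must leave the cache before the new one can be filled.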
[Figure: Read Miss Path and Victim Path. Labels: command, Memory, 21164 BCache, Merge data, SYS, MEM, Miss!!]
16
Traffic Jam on MEM bus
Don't forget that instruction fetch uses memory too.
Let's think about this scenario: during a PCI
DMA transfer, memory READs and WRITEs
are happening at the same time.
[Figure: MEM bus traffic, showing the Victim Path and Instruction Queue; I/O paths not shown. Labels: DMA 0, DMA 1, PCI, Flush, IOD, MEM, SYS]
17
How Fast can DMA be?
  • 2 fetches and 2 writes to memory per DMA
  • 64 bytes / 240 ns = 266 Mbytes/s
  • 8 bytes / 30 ns = 266 Mbytes/s
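The two figures on this slide are the same rate computed two ways, as a quick check shows. The helper name below is made up for the example; the byte counts and times are the ones on the slide.

```python
def bandwidth_mbytes_per_s(num_bytes, time_ns):
    """Return the transfer rate in Mbytes/s (1 Mbyte = 10**6 bytes)."""
    # bytes per nanosecond is Gbytes/s, so multiply by 1000 for Mbytes/s
    return num_bytes / time_ns * 1000.0


memory_side = bandwidth_mbytes_per_s(64, 240)  # one 64-byte block per 240 ns
pci_side = bandwidth_mbytes_per_s(8, 30)       # 8 bytes per 30 ns

print(f"memory side: {memory_side:.1f} Mbytes/s")
print(f"PCI side:    {pci_side:.1f} Mbytes/s")
```

Both come out at about 266.7 Mbytes/s, which the slide truncates to 266.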

Overhead, retries, read lines, read lines with
victim, and instruction fetches all share the same
bandwidth!! It turns out that in the worst
case only 17 MBytes/s is achieved, just above the
bottom line.
[Figure: DSW buffers and buses. Labels: DMA 0, DMA 1, PCI, Flush, IOD, MEM, SYS]
18
Performance of the MB2PCI
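  • Worst case
  • 29.9 MBytes/s (no interference)
  • 25.5 MBytes/s (read line, instruction fetch)
  • 17.5 MBytes/s (read line, read line with victim,
    instruction fetch)
  • Best case
  • 95 MBytes/s (no interference)
  • 80 MBytes/s (read line, instruction fetch)
  • 72 MBytes/s (read line, read line with victim,
    instruction fetch)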
19
Conclusion
  • If we want to improve
  • use a 256-bit cache block instead of 512-bit
  • Is there a next version of the 21172 chip that
    supports a 512-bit memory bus?
  • Are there DRAM chips faster than 60 ns?
  • Can we afford a 64-Mbyte Bcache (SRAM)?

There is a trade-off here. With a smaller
block, the 21164 generates more cache miss
cycles and may slow down. On the other hand, for
a DMA transfer of only 128 bits of data,
there is no longer the overhead of a 512-bit memory
read; only a 256-bit read is needed. This
improves the worst-case performance.
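The DMA side of this trade-off is simple arithmetic: how many bits must be read from memory for each bit the transfer actually carries. The sketch below uses the 128-bit transfer and the two block sizes from the note; the function name is hypothetical, and it ignores every other cost (cache misses on the 21164 side, retries, and so on).

```python
def read_overhead_ratio(block_bits, transfer_bits=128):
    """Bits read from memory per bit of DMA payload for one block read."""
    return block_bits / transfer_bits


for block_bits in (512, 256):
    ratio = read_overhead_ratio(block_bits)
    print(f"{block_bits}-bit block: {ratio:.0f}x the data of a 128-bit transfer")
```

Halving the block size halves the memory read needed per small DMA transfer, at the price of more miss cycles on the 21164.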