Title: Alpha 21172 Inside out
1. Alpha 21172 Inside out
- Zhihui Huang (Jerry)
- University of Michigan
2. Components
- One 21172-CA chip
  - Control, I/O, address chip (CIA)
  - 388 pins, plastic ball grid array (PBGA)
- Four 21172-BA chips
  - Data switch chip (DSW)
  - 208 pins, plastic quad flat pack (PQFP)
3. Data Paths
- 64-bit data path between the CIA and the DSWs
  - iod
- 128-bit data path between the 21164 and the DSWs
  - cpu_dat
- 256-bit memory data path between the DSWs and memory
  - mem_dat
4. 3-way Interface
[Diagram: the 3-way interface, with its parts labeled "Vein", "Brain", and "Heart"]
5. Memory
The DRAM is contained in one bank of SIMMs,
whether there are 4 SIMMs or 8 SIMMs.
[Diagram: one bank of eight SIMM slots, DRAM 1 through DRAM 8]
- 4 SIMMs fill a 128-bit data bus
- 8 SIMMs fill a 256-bit data bus
- A jumper selects the configuration
6. Memory Block
It is better to use the 256-bit configuration; otherwise you pay the
full price for the DSWs and use only half of their resources.
A 256-bit block is composed of bit slices across all 8 SIMMs. The
slices are interleaved across the 4 DSWs:
15:0
31:16
47:32
63:48
79:64
95:80
111:96
127:112
It should be clear now why this is a one-bank scheme in which all the
SIMMs must have the same size.
As shown above, the 4 DSWs together provide the lower 128 bits of the
memory bus. In the 256-bit configuration, the same DSWs also provide
the upper half of the bus.
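The slicing can be sketched in a few lines of code. This is only an illustration: the exact slice-to-DSW assignment below (DSW d handling slices d and d+4) is an assumption for the sketch, not the documented 21172 arrangement.

```python
# Split the lower 128-bit memory bus into eight 16-bit slices and deal
# the slices out across the 4 DSWs. The interleaving pattern used here
# (DSW d gets slices d and d+4) is an illustrative assumption.

def slice_128bit_word(word: int) -> list:
    """Split a 128-bit value into eight 16-bit slices; slice 0 is bits 15:0."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(8)]

def assign_slices_to_dsws(slices: list) -> dict:
    """Interleave the eight slices across 4 DSWs (assumed mapping)."""
    return {d: [slices[d], slices[d + 4]] for d in range(4)}

word = 0x0123456789ABCDEF_FEDCBA9876543210
slices = slice_128bit_word(word)
dsws = assign_slices_to_dsws(slices)
```

Reassembling the slices in order recovers the original 128-bit word, which is exactly what the DSWs do in the other direction on a memory read.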
7. Bcache and Memory
A cache where the cache location for a given
address is determined from the middle
address bits. If the cache line size is 2^n, then
the bottom n address bits correspond to an offset
within a cache entry. If the cache can hold 2^m
entries, then the next m address bits give the
cache location. The remaining top address bits
are stored as a TAG along with the entry. In
this scheme, there is no choice of which block to
flush on a cache miss, since there is only one
place for any block to go. This simple scheme has
the disadvantage that if the program alternately
accesses different addresses which map to the
same cache location, then it will suffer a cache
miss on every access to these locations. This
kind of cache conflict is quite likely on a
multi-processor.
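The offset/index/tag decomposition described above can be written out directly. The geometry below is illustrative (a 64-byte line and 2^15 entries, i.e. a 2 MB direct-mapped cache), not a specific 21172 configuration:

```python
# Direct-mapped address decomposition: bottom n bits are the offset,
# the next m bits the index, the rest the tag.

LINE_BITS = 6    # 64-byte cache line (2**6)
INDEX_BITS = 15  # 32768 entries (2**15): a 2 MB direct-mapped cache

def decompose(addr: int):
    offset = addr & ((1 << LINE_BITS) - 1)
    index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (LINE_BITS + INDEX_BITS)
    return tag, index, offset

# Two addresses exactly one cache-size apart share an index but differ
# in tag, so alternating between them misses on every single access.
a = 0x100040
b = a + (1 << (LINE_BITS + INDEX_BITS))
```

This is the conflict case the paragraph warns about: `a` and `b` map to the same cache location with different tags.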
A cache architecture in which data is only
written to main memory when it is forced out of
the cache. Opposite of write-through.
The Scache and Bcache block size is either
64 bytes or 32 bytes, and the Scache and Bcache
always have identical block sizes. All Bcache
and main-memory FILL or write transactions are
of the selected block size.
A cache line is allocated when a memory write
misses the cache (write-allocate).
- 3rd-level cache for the 21164
- Attributes
  - optional, external, physical, synchronous SRAM
  - direct-mapped, write-back, write-allocate
  - 256-bit or 512-bit block
  - cache size of 1, 2, 4, 8, 16, 32, or 64 Mbytes
- Supports up to 512 MB of memory
  - 1Mx36, 2Mx36, 4Mx36, 8Mx36, or 16Mx36 SIMMs
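The SIMM arithmetic behind these attributes checks out as follows. Two assumptions in this sketch: a x36 SIMM splits as 32 data bits plus 4 check bits, and the x36 parts are read as word-deep SIMMs (a 16Mx36 SIMM holds 16M words of 36 bits):

```python
# SIMM arithmetic: bus width from the number of SIMMs, and maximum
# memory capacity from the SIMM depth. The 32-data/4-check split of a
# x36 SIMM is an assumption of this sketch.

DATA_BITS_PER_SIMM = 36 - 4   # assumed 32 data + 4 check bits

def bus_width_bits(num_simms: int) -> int:
    return num_simms * DATA_BITS_PER_SIMM

def max_memory_bytes(depth_words: int, num_simms: int) -> int:
    return depth_words * (DATA_BITS_PER_SIMM // 8) * num_simms

# 4 SIMMs -> 128-bit bus, 8 SIMMs -> 256-bit bus,
# eight 16Mx36 SIMMs -> the 512 MB maximum.
```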
8. PCI Features
- Supports 64-bit PCI bus width
- Supports 64-bit PCI addressing (DAC cycles)
- Accepts PCI fast back-to-back cycles
  - addr, data0, data1, data2, ..., addr_again!
  - FRAME# is deasserted for only one cycle, to allow the last data phase to finish
- Issues PCI fast back-to-back cycles in dense address space
9. CIA Transactions
- 21164 memory read miss
- 21164 memory read miss with victim
- 21164 I/O read
- 21164 I/O write
- DMA read
- DMA read (prefetch)
- DMA write
10. DSW Data Paths
11. DSW Buffers
- DMA buffer sets (0 and 1)
  - PCI buffer for PCI DMA write data
  - Memory buffer for memory data
  - Flush buffer for system bus data
[Diagram: the two buffer sets (DMA 0 and DMA 1), each containing a PCI, a Flush, and a MEM buffer, connected between the IOD and MEM buses]
12. DMA Writes
- Data arrives in the PCI buffer
- The Memory buffer is loaded at the same time
- The Bcache line is flushed and the Flush buffer is loaded
- The 3 sources are merged and the data is written back to memory
As shown, a DMA write loads the PCI buffer from the IOD bus, the MEM
buffer from memory, and the Flush buffer from the system bus, all at
the same time. The three sources are then merged and written back to
main memory.
[Diagram: a DMA write through buffer set DMA 0, merging PCI data from the IOD bus, MEM data from memory, and Flush data from the 21164 Bcache]
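The three-source merge can be sketched as a per-byte priority select. The buffer names and the byte-enable scheme here are illustrative assumptions, not the actual DSW merge logic: a PCI (DMA write) byte wins where its byte enable is set, then a flushed Bcache byte if the line was dirty, then the copy fetched from memory.

```python
# Per-byte merge of the three DMA-write sources (illustrative sketch):
# PCI byte if its byte enable is set, else the Bcache flush byte if the
# line was dirty in the Bcache, else the byte fetched from memory.

def merge_dma_write(pci_buf, pci_mask, flush_buf, flush_valid, mem_buf):
    out = []
    for i in range(len(mem_buf)):
        if pci_mask[i]:
            out.append(pci_buf[i])
        elif flush_valid:
            out.append(flush_buf[i])
        else:
            out.append(mem_buf[i])
    return bytes(out)

mem = bytes([0xAA] * 8)             # stale copy fetched from memory
flush = bytes([0xBB] * 8)           # dirty line flushed from the Bcache
pci = bytes([0xCC] * 8)             # incoming PCI DMA write data
mask = [True, True] + [False] * 6   # DMA write touches only 2 bytes
merged = merge_dma_write(pci, mask, flush, True, mem)
```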
13. 21164 Read Transaction
- If the read hits in the Bcache, no memory access is required
[Diagram: a 21164 read hitting in the Bcache ("HIT!!")]
14. 21164 Read Miss
- If the read misses in the Bcache, a memory access is involved
[Diagram: the Read Miss Path, with the command sent to memory and the data returning through the 21172-BA over the SYS and MEM buses to the 21164 Bcache ("Miss!!")]
15. Read Miss with Victim
- Two scenarios
  - a write whose address tag differs from that of the valid cache line it maps to (write allocate!)
  - a read whose address tag differs from that of the valid cache line it maps to (read allocate!)
[Diagram: read miss with victim, with the victim data merged and written back to memory over the Victim Path while the new data returns over the Read Miss Path ("Miss!!")]
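The two allocate-on-miss scenarios can be modeled with a toy direct-mapped, write-back cache: any access (read or write) that finds a valid line with a different tag at its index first evicts that line as the victim, then allocates the new one. The geometry below is made up for illustration:

```python
# Toy direct-mapped cache: a miss on an index already holding a valid
# line with a different tag produces a victim before the new line is
# allocated (the slide's "read miss with victim" case).

class ToyBcache:
    def __init__(self, index_bits=4, line_bits=5):
        self.index_bits, self.line_bits = index_bits, line_bits
        self.lines = {}      # index -> tag of the valid line
        self.victims = []    # (index, tag) pairs written back to memory

    def access(self, addr: int) -> str:
        index = (addr >> self.line_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.line_bits + self.index_bits)
        if self.lines.get(index) == tag:
            return "hit"
        if index in self.lines:          # valid line, different tag
            self.victims.append((index, self.lines[index]))
            self.lines[index] = tag
            return "miss+victim"
        self.lines[index] = tag
        return "miss"

bc = ToyBcache()
# 0x000 and 0x200 map to the same index in this toy geometry.
results = [bc.access(0x000), bc.access(0x000), bc.access(0x200)]
```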
16. Traffic Jam on the MEM Bus
Don't forget that instruction fetch uses memory too.
Consider this scenario: during a PCI DMA transfer, memory READs and
WRITEs are happening at the same time.
[Diagram: both DMA buffer sets, the Victim Path, and the instruction queue all contending for the MEM bus (I/O paths not shown)]
17. How Fast Can DMA Be?
- 2 fetches and 2 writes to memory per DMA transfer
- 64 bytes / 240 ns, i.e. 266 Mbytes/s
- 8 bytes / 30 ns, i.e. 266 Mbytes/s
Overhead, retries, read lines, read lines with victims, and
instruction fetches all share the same bandwidth!! It turns out that
in the worst case only about 17 Mbytes/s is achieved, just above the
bottom line.
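The peak-bandwidth figures above are simple arithmetic, checked here (decimal Mbytes are assumed, matching the slide's units):

```python
# Peak DMA bandwidth: one 64-byte block every 240 ns, which is the
# same rate as 8 bytes every 30 ns.

def mbytes_per_sec(nbytes: int, ns: float) -> float:
    return nbytes / ns * 1e9 / 1e6   # bytes per ns, scaled to Mbytes/s

peak = mbytes_per_sec(64, 240)       # about 266 Mbytes/s
```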
18. Performance of the MB2PCI
- Worst case
  - 29.9 Mbytes/s (no interference)
  - 25.5 Mbytes/s (read line, instruction fetch)
  - 17.5 Mbytes/s (read line, read line with victim, instruction fetch)
- Best case
  - 95 Mbytes/s (no interference)
  - 80 Mbytes/s (read line, instruction fetch)
  - 72 Mbytes/s (read line, read line with victim, instruction fetch)
19. Conclusion
- If we want to improve:
  - use a 256-bit cache block instead of 512-bit
  - Is there a next version of the 21172 chip that supports a 512-bit memory bus?
  - Are there DRAM chips faster than 60 ns?
  - Can we afford a 64 MB Bcache (SRAM)?
There is a trade-off here: with the smaller block, the 21164 will
generate more cache-miss cycles and may slow down. On the other hand,
a DMA transfer that touches only 128 bits of data no longer incurs a
512-bit memory-read overhead; only a 256-bit read is needed, which
improves the worst-case performance.
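The trade-off can be made concrete with the 2-fetch, 2-write accounting from slide 17. This is a back-of-the-envelope sketch, not a cycle-accurate model; the point is simply that halving the block halves the bytes the merge path must move per DMA transfer:

```python
# Bytes moved on the MEM bus per DMA write: 2 fetches (memory and
# Bcache flush) plus 2 writes of one block each, following the
# slide 17 accounting.

def dma_bytes_moved(block_bytes: int) -> int:
    return 4 * block_bytes

# 512-bit (64-byte) block vs 256-bit (32-byte) block per DMA transfer.
full_block = dma_bytes_moved(64)
half_block = dma_bytes_moved(32)
```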