Title: Alpha 21172 Inside out
1. Alpha 21172 Inside out
- Zhihui Huang (Jerry)
- University of Michigan
2. Components
- One 21172-CA chip
  - Control, I/O, address chip (CIA)
  - 388 pins, plastic ball grid array (PBGA)
- Four 21172-BA chips
  - Data switch chip (DSW)
  - 208 pins, plastic quad flat pack (PQFP)
3. Data Paths
- 64-bit data path between the CIA and the DSWs
  - iod
- 128-bit data path between the 21164 and the DSWs
  - cpu_dat
- 256-bit memory data path between the DSWs and memory
  - mem_dat
4. 3-way Interface
[Diagram: the 3-way interface, with its parts labeled "Vein", "Brain", and "Heart"]
5. Memory
The DRAM is contained in one bank of SIMMs,
whether there are 4 SIMMs or 8 SIMMs.
[Diagram: one bank of eight SIMM slots, DRAM 1 through DRAM 8]
- 4 SIMMs fill a 128-bit data bus
- 8 SIMMs fill a 256-bit data bus
- A jumper selects the configuration
6. Memory Block
It is better to use the 256-bit configuration; otherwise you pay the
full price for the DSWs and use only half of their resources.
A 256-bit block is composed of bit slices across all 8 SIMMs. The
slices are interleaved across the 4 DSWs:
15:0
31:16
47:32
63:48
79:64
95:80
111:96
127:112
It should be clear now why this is a one-bank scheme in which all the
SIMMs must have the same size.
As shown above, the 4 DSWs together provide the lower 128 bits of the
memory bus. In the 256-bit configuration, the same DSWs also provide
the upper half of the bus.
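The slicing can be sketched in a few lines of code. This is only an illustration: the exact slice-to-DSW assignment below (DSW d handling slices d and d+4) is an assumption for the sketch, not the documented 21172 arrangement.

```python
# Split the lower 128-bit memory bus into eight 16-bit slices and deal
# the slices out across the 4 DSWs. The interleaving pattern used here
# (DSW d gets slices d and d+4) is an illustrative assumption.

def slice_128bit_word(word: int) -> list:
    """Split a 128-bit value into eight 16-bit slices; slice 0 is bits 15:0."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(8)]

def assign_slices_to_dsws(slices: list) -> dict:
    """Interleave the eight slices across 4 DSWs (assumed mapping)."""
    return {d: [slices[d], slices[d + 4]] for d in range(4)}

word = 0x0123456789ABCDEF_FEDCBA9876543210
slices = slice_128bit_word(word)
dsws = assign_slices_to_dsws(slices)
```

Reassembling the slices in order recovers the original 128-bit word, which is exactly what the DSWs do in the other direction on a memory read.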
7. Bcache and Memory
A cache where the cache location for a given
address is determined from the middle
address bits. If the cache line size is 2^n, then
the bottom n address bits correspond to an offset
within a cache entry. If the cache can hold 2^m
entries, then the next m address bits give the
cache location. The remaining top address bits
are stored as a TAG along with the entry. In
this scheme, there is no choice of which block to
flush on a cache miss, since there is only one
place for any block to go. This simple scheme has
the disadvantage that if the program alternately
accesses different addresses which map to the
same cache location, then it will suffer a cache
miss on every access to these locations. This
kind of cache conflict is quite likely on a
multi-processor.
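The offset/index/tag decomposition described above can be written out directly. The geometry below is illustrative (a 64-byte line and 2^15 entries, i.e. a 2 MB direct-mapped cache), not a specific 21172 configuration:

```python
# Direct-mapped address decomposition: bottom n bits are the offset,
# the next m bits the index, the rest the tag.

LINE_BITS = 6    # 64-byte cache line (2**6)
INDEX_BITS = 15  # 32768 entries (2**15): a 2 MB direct-mapped cache

def decompose(addr: int):
    offset = addr & ((1 << LINE_BITS) - 1)
    index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (LINE_BITS + INDEX_BITS)
    return tag, index, offset

# Two addresses exactly one cache-size apart share an index but differ
# in tag, so alternating between them misses on every single access.
a = 0x100040
b = a + (1 << (LINE_BITS + INDEX_BITS))
```

This is the conflict case the paragraph warns about: `a` and `b` map to the same cache location with different tags.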
A cache architecture in which data is only
written to main memory when it is forced out of
the cache. Opposite of write-through.
The Scache and Bcache block size is either
64 bytes or 32 bytes, and the Scache and Bcache
always have identical block sizes. All Bcache
and main-memory FILL or write transactions are
of the selected block size.
A cache line is allocated when a memory write
misses the cache (write-allocate).
- 3rd-level cache for the 21164
- Attributes
  - optional, external, physical, synchronous SRAM
  - direct-mapped, write-back, write-allocate
  - 256-bit or 512-bit block
  - cache size of 1, 2, 4, 8, 16, 32, or 64 Mbytes
- Supports up to 512 MB of memory
  - 1Mx36, 2Mx36, 4Mx36, 8Mx36, or 16Mx36 SIMMs
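The SIMM arithmetic behind these attributes checks out as follows. Two assumptions in this sketch: a x36 SIMM splits as 32 data bits plus 4 check bits, and the x36 parts are read as word-deep SIMMs (a 16Mx36 SIMM holds 16M words of 36 bits):

```python
# SIMM arithmetic: bus width from the number of SIMMs, and maximum
# memory capacity from the SIMM depth. The 32-data/4-check split of a
# x36 SIMM is an assumption of this sketch.

DATA_BITS_PER_SIMM = 36 - 4   # assumed 32 data + 4 check bits

def bus_width_bits(num_simms: int) -> int:
    return num_simms * DATA_BITS_PER_SIMM

def max_memory_bytes(depth_words: int, num_simms: int) -> int:
    return depth_words * (DATA_BITS_PER_SIMM // 8) * num_simms

# 4 SIMMs -> 128-bit bus, 8 SIMMs -> 256-bit bus,
# eight 16Mx36 SIMMs -> the 512 MB maximum.
```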
8. PCI Features
- Supports 64-bit PCI bus width
- Supports 64-bit PCI addressing (DAC cycles)
- Accepts PCI fast back-to-back cycles
  - addr, data0, data1, data2, ..., addr_again!
  - FRAME# is deasserted for only one cycle, to allow the last data phase to finish
- Issues PCI fast back-to-back cycles in dense address space
9. CIA Transactions
- 21164 memory read miss
- 21164 memory read miss with victim
- 21164 I/O read
- 21164 I/O write
- DMA read
- DMA read (prefetch)
- DMA write
10. DSW Data Paths
11. DSW Buffers
- DMA buffer sets (0 and 1)
  - PCI buffer for PCI DMA write data
  - Memory buffer for memory data
  - Flush buffer for system bus data
[Diagram: the two buffer sets (DMA 0 and DMA 1), each containing a PCI, a Flush, and a MEM buffer, connected between the IOD and MEM buses]
12. DMA Writes
- Data arrives in the PCI buffer
- The Memory buffer is loaded at the same time
- The Bcache line is flushed and the Flush buffer is loaded
- The 3 sources are merged and the data is written back to memory
As shown, a DMA write loads the PCI buffer from the IOD bus, the MEM
buffer from memory, and the Flush buffer from the system bus, all at
the same time. The three sources are then merged and written back to
main memory.
[Diagram: a DMA write through buffer set DMA 0, merging PCI data from the IOD bus, MEM data from memory, and Flush data from the 21164 Bcache]
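The three-source merge can be sketched as a per-byte priority select. The buffer names and the byte-enable scheme here are illustrative assumptions, not the actual DSW merge logic: a PCI (DMA write) byte wins where its byte enable is set, then a flushed Bcache byte if the line was dirty, then the copy fetched from memory.

```python
# Per-byte merge of the three DMA-write sources (illustrative sketch):
# PCI byte if its byte enable is set, else the Bcache flush byte if the
# line was dirty in the Bcache, else the byte fetched from memory.

def merge_dma_write(pci_buf, pci_mask, flush_buf, flush_valid, mem_buf):
    out = []
    for i in range(len(mem_buf)):
        if pci_mask[i]:
            out.append(pci_buf[i])
        elif flush_valid:
            out.append(flush_buf[i])
        else:
            out.append(mem_buf[i])
    return bytes(out)

mem = bytes([0xAA] * 8)             # stale copy fetched from memory
flush = bytes([0xBB] * 8)           # dirty line flushed from the Bcache
pci = bytes([0xCC] * 8)             # incoming PCI DMA write data
mask = [True, True] + [False] * 6   # DMA write touches only 2 bytes
merged = merge_dma_write(pci, mask, flush, True, mem)
```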
13. 21164 Read Transaction
- If the read hits in the Bcache, no memory access is required
[Diagram: a 21164 read hitting in the Bcache ("HIT!!")]
14. 21164 Read Miss
- If the read misses in the Bcache, a memory access is involved
[Diagram: the Read Miss Path, with the command sent to memory and the data returning through the 21172-BA over the SYS and MEM buses to the 21164 Bcache ("Miss!!")]
15. Read Miss with Victim
- Two scenarios
  - a write whose address tag differs from that of the valid cache line it maps to (write allocate!)
  - a read whose address tag differs from that of the valid cache line it maps to (read allocate!)
[Diagram: read miss with victim, with the victim data merged and written back to memory over the Victim Path while the new data returns over the Read Miss Path ("Miss!!")]
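The two allocate-on-miss scenarios can be modeled with a toy direct-mapped, write-back cache: any access (read or write) that finds a valid line with a different tag at its index first evicts that line as the victim, then allocates the new one. The geometry below is made up for illustration:

```python
# Toy direct-mapped cache: a miss on an index already holding a valid
# line with a different tag produces a victim before the new line is
# allocated (the slide's "read miss with victim" case).

class ToyBcache:
    def __init__(self, index_bits=4, line_bits=5):
        self.index_bits, self.line_bits = index_bits, line_bits
        self.lines = {}      # index -> tag of the valid line
        self.victims = []    # (index, tag) pairs written back to memory

    def access(self, addr: int) -> str:
        index = (addr >> self.line_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.line_bits + self.index_bits)
        if self.lines.get(index) == tag:
            return "hit"
        if index in self.lines:          # valid line, different tag
            self.victims.append((index, self.lines[index]))
            self.lines[index] = tag
            return "miss+victim"
        self.lines[index] = tag
        return "miss"

bc = ToyBcache()
# 0x000 and 0x200 map to the same index in this toy geometry.
results = [bc.access(0x000), bc.access(0x000), bc.access(0x200)]
```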
16. Traffic Jam on the MEM Bus
Don't forget that instruction fetch uses memory too.
Consider this scenario: during a PCI DMA transfer, memory READs and
WRITEs are happening at the same time.
[Diagram: both DMA buffer sets, the Victim Path, and the instruction queue all contending for the MEM bus (I/O paths not shown)]
17. How Fast Can DMA Be?
- 2 fetches and 2 writes to memory per DMA transfer
- 64 bytes / 240 ns, i.e. 266 Mbytes/s
- 8 bytes / 30 ns, i.e. 266 Mbytes/s
Overhead, retries, read lines, read lines with victims, and
instruction fetches all share the same bandwidth!! It turns out that
in the worst case only about 17 Mbytes/s is achieved, just above the
bottom line.
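The peak-bandwidth figures above are simple arithmetic, checked here (decimal Mbytes are assumed, matching the slide's units):

```python
# Peak DMA bandwidth: one 64-byte block every 240 ns, which is the
# same rate as 8 bytes every 30 ns.

def mbytes_per_sec(nbytes: int, ns: float) -> float:
    return nbytes / ns * 1e9 / 1e6   # bytes per ns, scaled to Mbytes/s

peak = mbytes_per_sec(64, 240)       # about 266 Mbytes/s
```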
18. Performance of the MB2PCI
- Worst case
  - 29.9 Mbytes/s (no interference)
  - 25.5 Mbytes/s (read line, instruction fetch)
  - 17.5 Mbytes/s (read line, read line with victim, instruction fetch)
- Best case
  - 95 Mbytes/s (no interference)
  - 80 Mbytes/s (read line, instruction fetch)
  - 72 Mbytes/s (read line, read line with victim, instruction fetch)
19. Conclusion
- If we want to improve:
  - use a 256-bit cache block instead of 512-bit
  - Is there a next version of the 21172 chip that supports a 512-bit memory bus?
  - Are there DRAM chips faster than 60 ns?
  - Can we afford a 64 MB Bcache (SRAM)?
There is a trade-off here: with the smaller block, the 21164 will
generate more cache-miss cycles and may slow down. On the other hand,
a DMA transfer that touches only 128 bits of data no longer incurs a
512-bit memory-read overhead; only a 256-bit read is needed, which
improves the worst-case performance.
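The trade-off can be made concrete with the 2-fetch, 2-write accounting from slide 17. This is a back-of-the-envelope sketch, not a cycle-accurate model; the point is simply that halving the block halves the bytes the merge path must move per DMA transfer:

```python
# Bytes moved on the MEM bus per DMA write: 2 fetches (memory and
# Bcache flush) plus 2 writes of one block each, following the
# slide 17 accounting.

def dma_bytes_moved(block_bytes: int) -> int:
    return 4 * block_bytes

# 512-bit (64-byte) block vs 256-bit (32-byte) block per DMA transfer.
full_block = dma_bytes_moved(64)
half_block = dma_bytes_moved(32)
```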