Title: Cache Coherency on Heterogeneous Multiprocessor Platform with Shared Memory
Taeweon Suh
April 21, 2003
1. System Block Diagram
- ARM920T D-cache
  - 16 KB
  - 32-byte block
  - 64-way set associative
  - Replacement: round-robin or random
  - No cache coherency
- PowerPC D-cache
  - 32 KB
  - 32-byte block
  - 8-way set associative
  - MEI protocol
- Registers: shared page register, snoop hit address register, FIQ status register
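The cache parameters listed above fix the number of sets and the tag/index/offset split of a 32-bit address. The following is a small worked sketch of that arithmetic; the struct and function names are illustrative and not part of the original design.

```c
#include <assert.h>
#include <stdint.h>

/* Cache geometry as given on the slide; names are illustrative only. */
struct cache_geom {
    uint32_t size_bytes;
    uint32_t block_bytes;
    uint32_t ways;
};

static const struct cache_geom ARM920T_D = { 16u * 1024u, 32u, 64u };
static const struct cache_geom PPC755_D  = { 32u * 1024u, 32u, 8u };

/* Number of sets = cache size / (block size * associativity). */
static uint32_t num_sets(const struct cache_geom *g)
{
    assert(g->block_bytes != 0 && g->ways != 0);
    return g->size_bytes / (g->block_bytes * g->ways);
}

/* log2 for exact powers of two. */
static uint32_t log2u(uint32_t x)
{
    uint32_t n = 0;
    assert(x != 0 && (x & (x - 1)) == 0);
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Field widths of a 32-bit address for a given geometry. */
static uint32_t offset_bits(const struct cache_geom *g) { return log2u(g->block_bytes); }
static uint32_t index_bits(const struct cache_geom *g)  { return log2u(num_sets(g)); }
static uint32_t tag_bits(const struct cache_geom *g)
{
    return 32u - offset_bits(g) - index_bits(g);
}
```

With these numbers the ARM920T D-cache has 8 sets (5 offset bits, 3 index bits, 24 tag bits) and the PowerPC D-cache has 128 sets (5 offset bits, 7 index bits, 20 tag bits).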
2. Hardware Design: Snoop logic
[Figure: snoop logic block diagram on the ASB (Advanced System Bus). Labels: FIQ; To ARM; From ARM; address bits 31 to 0; BWAIT; BGNT; BREQ; Valid; 8; TAG CAM, 64-way; snoop hit]
- Registers: shared page register, snoop hit address register, FIQ status register
- Pseudo bus mastership logic: initiates retry transaction
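As a rough software model of what the snoop logic appears to do, the sketch below keeps a 64-entry tag CAM, checks addresses from the other bus master against the shared page, and raises an FIQ on a hit. Everything beyond the register names and the 64-way CAM is an assumption: the 4 KB page size, the single CAM (the real design may keep one per set), the function names, and the idea that the ARM's FIQ handler responds by invalidating the matching line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WAYS        64u  /* 64-way TAG CAM, as on the slide */
#define BLOCK_SHIFT 5u   /* 32-byte block: address bits 4..0 are the offset */
#define PAGE_SHIFT  12u  /* ASSUMPTION: 4 KB shared page */

struct snoop_logic {
    uint32_t shared_page;     /* shared page register (page number) */
    uint32_t snoop_hit_addr;  /* snoop hit address register */
    uint32_t fiq_status;      /* FIQ status register: 1 = FIQ pending */
    uint32_t tag[WAYS];       /* block addresses tracked in the CAM */
    bool     valid[WAYS];     /* valid bit per CAM entry */
};

/* Hypothetical helper: record that the ARM has cached the block at addr. */
static void snoop_track_fill(struct snoop_logic *s, unsigned way, uint32_t addr)
{
    assert(way < WAYS);
    s->tag[way] = addr >> BLOCK_SHIFT;
    s->valid[way] = true;
}

/* Observe an address driven by the other master; true means snoop hit. */
static bool snoop_observe(struct snoop_logic *s, uint32_t addr)
{
    if ((addr >> PAGE_SHIFT) != s->shared_page)
        return false;  /* only the shared page is snooped */
    uint32_t block = addr >> BLOCK_SHIFT;
    for (unsigned w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == block) {
            s->snoop_hit_addr = block << BLOCK_SHIFT;
            s->fiq_status = 1u;  /* raise FIQ to the ARM */
            return true;
        }
    }
    return false;
}

/* Model of the ARM's FIQ service: drop the line, then clear the status. */
static void snoop_fiq_service(struct snoop_logic *s)
{
    uint32_t block = s->snoop_hit_addr >> BLOCK_SHIFT;
    for (unsigned w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == block)
            s->valid[w] = false;
    }
    s->fiq_status = 0u;
}
```

In the hardware, the retry transaction from the pseudo bus mastership logic would presumably hold off the other master until the FIQ service has completed; that handshake is not modeled here.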
2. Hardware Design (cont.): Wrapper
[Figure: wrapper between the PowerPC755 and the ASB (Advanced System Bus)]
- Protocol conversion: AMBA to PowerPC
- Protocol conversion: PowerPC to AMBA
- PowerPC signals shown: GBL, ADDR, ARTRY
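To illustrate why GBL and ARTRY matter to the wrapper, here is a simplified textbook model of how an MEI cache line reacts to a snooped access. It is a sketch under assumptions, not the PowerPC755's full state table: it ignores special transaction types, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum mei_state { MEI_INVALID, MEI_EXCLUSIVE, MEI_MODIFIED };

struct snoop_result {
    enum mei_state next;  /* line state after the snoop */
    bool artry;           /* address retry asserted to the other master */
    bool push;            /* modified data written back to memory */
};

/* Reaction of one cache line to a snooped read or write from another
 * master. gbl mirrors the GBL signal: only global accesses are snooped. */
static struct snoop_result mei_snoop(enum mei_state cur, bool gbl)
{
    assert(cur == MEI_INVALID || cur == MEI_EXCLUSIVE || cur == MEI_MODIFIED);
    struct snoop_result r = { cur, false, false };

    if (!gbl || cur == MEI_INVALID)
        return r;  /* not snooped, or a snoop miss */

    if (cur == MEI_MODIFIED) {
        r.artry = true;  /* other master must retry ... */
        r.push = true;   /* ... after the dirty line is pushed out */
    }
    r.next = MEI_INVALID;  /* MEI has no shared state, so the line is dropped */
    return r;
}
```

Because MEI has no shared state, any snoop hit ends with the line invalid, which is consistent with the deck's later point that only an invalidation scheme is possible.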
3. Simulation Results: Worst Case Scenario
- Simulation environment
  - Seamless CVE (Mentor Graphics)
  - PowerPC 100 MHz, ARM920T 50 MHz
  - Atalanta RTOS
  - Bakery algorithm for lock implementation
  - 1 task on each CPU
  - I-cache enabled, D-cache selectively enabled
  - Wait cycles: 6 (120 ns)
    for (i = 0; i < 100; i++) {
        akc_entercritical();
        // critical section
        c = buffer;
        c++;
        buffer = c;
        // critical section
        akc_exitcritical();
    }
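The lock is described as a bakery algorithm. A minimal sketch of Lamport's bakery algorithm for two CPUs follows; it is not the Atalanta RTOS implementation of akc_entercritical() and akc_exitcritical(), which the slides do not show. The names bakery_lock and bakery_unlock are hypothetical, and C11 sequentially consistent atomics stand in for whatever uncached or coherent accesses the real platform uses for the lock variables.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPU 2  /* one task on each of the two CPUs */

static atomic_int choosing[NCPU];  /* 1 while CPU id is picking a ticket */
static atomic_int number[NCPU];    /* ticket number, 0 = not competing */

static void bakery_lock(int id)
{
    assert(id >= 0 && id < NCPU);

    /* Take a ticket one larger than any ticket currently held. */
    atomic_store(&choosing[id], 1);
    int max = 0;
    for (int j = 0; j < NCPU; j++) {
        int n = atomic_load(&number[j]);
        if (n > max)
            max = n;
    }
    atomic_store(&number[id], max + 1);
    atomic_store(&choosing[id], 0);

    /* Wait until every other CPU with a smaller (ticket, id) pair is done. */
    for (int j = 0; j < NCPU; j++) {
        if (j == id)
            continue;
        while (atomic_load(&choosing[j])) {
            /* spin: CPU j is still picking its ticket */
        }
        for (;;) {
            int nj = atomic_load(&number[j]);
            int ni = atomic_load(&number[id]);
            if (nj == 0 || nj > ni || (nj == ni && j > id))
                break;  /* CPU j is not ahead of us */
        }
    }
}

static void bakery_unlock(int id)
{
    assert(id >= 0 && id < NCPU);
    atomic_store(&number[id], 0);  /* give up the ticket */
}
```

The bakery algorithm needs only ordinary loads and stores, with no atomic read-modify-write instruction, which is one plausible reason to choose it when the two CPUs share no common atomic primitive.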
3. Simulation Results (cont.): Typical(?) Case Scenario
- Simulation environment
  - Seamless CVE (Mentor Graphics)
  - PowerPC 100 MHz, ARM920T 50 MHz
  - Atalanta RTOS
  - Bakery algorithm for lock implementation
  - 1 task on each CPU
  - I-cache enabled, D-cache selectively enabled
  - Wait cycles: 6 (120 ns)
    for (i = 0; i < 10; i++) {
        akc_entercritical();
        // critical section
        for (j = 1; j < 100; j++) {
            c = buffer;
            c++;
            buffer = c;
        }
        // critical section
        akc_exitcritical();
    }
4. In the context of HPC with cache coherency
- The best way to run your applications as fast as possible:
  - Split the application into parts that have no dependencies on each other
- However, this is not applicable in most cases. Then what?
  - Avoid data manipulation that resembles the worst case scenario (WCS)
  - Align data to cache block boundaries and try to fully use each block (cache line)
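The alignment advice above can be sketched in C11 as follows. The 32-byte block size comes from the cache parameters earlier in the deck; the struct names and the same_block helper are illustrative assumptions. Per-CPU data is padded so that each CPU's item owns a whole block, and data used together is packed so that it fills exactly one block.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_BLOCK 32  /* block size of both D-caches, in bytes */

/* One counter per CPU, each alone in its own cache block, so that a
 * write by one CPU never invalidates the other CPU's counter. */
struct per_cpu_counter {
    alignas(CACHE_BLOCK) uint32_t value;
};

/* Data that is used together, packed to fill exactly one block. */
struct line_buffer {
    alignas(CACHE_BLOCK) uint32_t word[CACHE_BLOCK / sizeof(uint32_t)];
};

static_assert(sizeof(struct per_cpu_counter) == CACHE_BLOCK,
              "each counter should occupy one full cache block");
static_assert(sizeof(struct line_buffer) == CACHE_BLOCK,
              "the buffer should fit in exactly one cache block");

static struct per_cpu_counter counters[2];  /* index 0 and 1: one per CPU */

/* True if both addresses fall into the same cache block. */
static bool same_block(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_BLOCK == (uintptr_t)b / CACHE_BLOCK;
}
```

The cost of the padding is 28 unused bytes per counter; the benefit is that the two CPUs no longer contend for one block when each updates only its own data.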
5. Conclusion
- Successfully implemented cache coherency on a heterogeneous-processor platform with shared memory
  - Only an invalidation scheme is possible
- Can be generalized to any processor platform, no matter what cache coherency protocol it supports, or whether it supports one at all
- Could be used in SoC designs, which could have multiple heterogeneous processors inside
Any Questions?