Title: PowerPoint Presentation
SINGLE CHIP MULTIPROCESSORS
Computer Architecture Term Paper (11.12.2003)
Esra KIRBAS 2002701357
Evaluation of Design Alternatives for a Multiprocessor Microprocessor
By Basem A. Nayfeh, Lance Hammond, and Kunle Olukotun. ISCA 23, 1996, pp. 67-77.
- Advanced integrated-circuit technology makes several design options available for high-performance microprocessors.
- In the multiprocessor design option, a small number of processors are interconnected on a single chip or on a multi-chip module (MCM) substrate.
- We concentrate on single-chip multiprocessors.
- Our goal is to study two proposed cache-sharing mechanisms for single-chip multiprocessors:
  - Shared Level-1 (L1) Cache Architecture
  - Shared Level-2 (L2) Cache Architecture
- (The performance of these two architectures will be compared with that of a single-bus-based shared-memory multiprocessor.)
- A multiprocessor whose interconnect sits closer to the CPUs in the memory hierarchy can exploit fine-grained parallelism more efficiently than one whose interconnect sits farther away, because data passes between processors in fewer cycles.
- The aim is to achieve good performance on fine-grained parallel applications without sacrificing the performance of independent jobs running in parallel.
CPU CHARACTERISTICS
- We use the same CPU in all three architectures.
- 2-way issue processor
- Dynamic scheduling
- Speculative execution
- Non-blocking caches
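Dynamic scheduling with speculative execution depends on a reorder buffer. Instructions may finish out of order, but they retire in program order, so wrong-path work can be discarded. The sketch below models the 32-entry reorder buffer of this CPU as a circular queue. It is an illustration and does not reproduce the simulator's model. The class and method names are hypothetical, the commit width of 2 is assumed to match the 2-way issue width, and misprediction recovery is reduced to squashing all younger entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ROBEntry:
    tag: str            # hypothetical instruction label, e.g. "i0"
    done: bool = False  # set once the instruction has finished executing


class ReorderBuffer:
    """Circular buffer: allocate at the tail in program order, retire from the head."""

    def __init__(self, size: int = 32) -> None:
        self.size = size
        self.slots: List[Optional[ROBEntry]] = [None] * size
        self.head = 0   # slot of the oldest in-flight instruction
        self.count = 0  # number of occupied slots

    def dispatch(self, tag: str) -> Optional[int]:
        """Allocate the next slot; return None (dispatch stalls) if the buffer is full."""
        if self.count == self.size:
            return None
        idx = (self.head + self.count) % self.size
        self.slots[idx] = ROBEntry(tag)
        self.count += 1
        return idx

    def complete(self, idx: int) -> None:
        """Mark an entry finished; completion order is arbitrary."""
        entry = self.slots[idx]
        if entry is not None:
            entry.done = True

    def commit(self, width: int = 2) -> List[str]:
        """Retire up to `width` finished entries, strictly in program order."""
        retired: List[str] = []
        while len(retired) < width and self.count > 0:
            entry = self.slots[self.head]
            if entry is None or not entry.done:
                break  # oldest entry still executing: younger ones must wait
            retired.append(entry.tag)
            self.slots[self.head] = None
            self.head = (self.head + 1) % self.size
            self.count -= 1
        return retired

    def squash_younger(self, idx: int) -> None:
        """Discard every entry dispatched after `idx` (e.g. after a mispredicted branch)."""
        keep = (idx - self.head) % self.size + 1  # entries from head through idx
        for k in range(keep, self.count):
            self.slots[(self.head + k) % self.size] = None
        self.count = keep
```

Commit stops at the first unfinished entry. A long-latency load at the head therefore holds up the retirement of everything behind it until the load returns.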
- 16 KB, 2-way set-associative instruction and data caches
- 32-entry centralized instruction window
- 32-entry reorder buffer
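The L1 parameters above determine how an address is decoded into tag, set index, and byte offset. The sketch below assumes a 32-byte line, which this slide does not state, and power-of-two sizes. The function names are hypothetical.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> tuple:
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    All sizes are assumed to be powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = sets.bit_length() - 1         # log2(sets)
    return sets, offset_bits, index_bits


def split_address(addr: int, offset_bits: int, index_bits: int) -> tuple:
    """Split a byte address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# 16 KB, 2-way L1 with an assumed (hypothetical) 32-byte line.
SETS, OFFSET_BITS, INDEX_BITS = cache_geometry(16 * 1024, 2, 32)
```

With these numbers the cache has 256 sets, so each way holds 8 KB. Any two addresses that are a multiple of 8 KB apart map to the same set and compete for its two ways.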
Shared L1-Cache Multiprocessor
Advantages of this Architecture
- It provides the lowest interprocessor communication latency, because the processors communicate through the shared L1 cache.
- Low communication latency helps achieve high performance on fine-grained parallel applications.
- Processors may fetch shared data into the cache for each other.
- It eliminates cache-coherence logic and implicitly provides sequentially consistent memory without sacrificing performance.
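In a shared L1, every load and store goes to one cache, so all processors observe a single global order of memory operations. The store-buffering litmus test below is a standard illustration of what that guarantees and is not taken from the paper. Enumerating every interleaving that respects each thread's program order shows that the outcome r0 = r1 = 0 can never occur. Hardware with private write buffers may allow that outcome. The thread encoding and function names are hypothetical.

```python
from itertools import combinations

# Each thread stores 1 to one shared variable, then loads the other
# (the classic "store buffering" litmus test).
T0 = [("store", "x"), ("load", "y", "r0")]
T1 = [("store", "y"), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in positions else next(ib) for i in range(n)]


def sc_outcomes():
    """Run every interleaving against one memory and collect the (r0, r1) results."""
    results = set()
    for order in interleavings(T0, T1):
        mem = {"x": 0, "y": 0}
        regs = {}
        for op in order:
            if op[0] == "store":
                mem[op[1]] = 1
            else:
                regs[op[2]] = mem[op[1]]
        results.add((regs["r0"], regs["r1"]))
    return results
```

Running `sc_outcomes()` yields only (0, 1), (1, 0), and (1, 1). Whichever load executes last must see the other thread's store.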
Disadvantages of this Architecture
- The crossbar that connects the processors to the cache increases the L1 access time. (We assume an average access time of three cycles.)
- All memory references go to the shared L1 cache, so there may be extra delays due to bank conflicts.
- If the processors are not executing fine-grained parallel applications, the miss rate will increase, because processors working on unrelated data compete for the same cache.
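Bank conflicts arise when several processors hit the same cache bank in the same cycle. The sketch below is for illustration only and makes three hypothetical assumptions: 4 banks, 32-byte lines interleaved across banks, and a fixed-priority arbiter that serves one request per bank per cycle. None of these values is given on the slide.

```python
from collections import defaultdict
from typing import Dict, List


def bank_of(addr: int, line_bytes: int = 32, n_banks: int = 4) -> int:
    """Interleave cache lines across banks: consecutive lines map to consecutive banks."""
    return (addr // line_bytes) % n_banks


def conflict_delays(requests: Dict[int, int], line_bytes: int = 32,
                    n_banks: int = 4) -> Dict[int, int]:
    """Given {cpu: address} issued in the same cycle, return the extra cycles each CPU waits.

    Requests to the same bank are served one per cycle in CPU-number order
    (a hypothetical fixed-priority arbiter).
    """
    queues: Dict[int, List[int]] = defaultdict(list)
    for cpu in sorted(requests):
        queues[bank_of(requests[cpu], line_bytes, n_banks)].append(cpu)
    delays: Dict[int, int] = {}
    for waiting in queues.values():
        for position, cpu in enumerate(waiting):
            delays[cpu] = position  # 0 for the first request to a bank, 1 for the next, ...
    return delays
```

Four requests to consecutive lines spread over all four banks and see no delay. Requests to lines 0, 1, 4, and 5 collide pairwise on banks 0 and 1, so two processors each wait one extra cycle.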
The secondary cache and main memory are organized as in a uniprocessor system:
- L2: 2 MB, 10-cycle latency, 2-cycle occupancy
- Main memory: 50-cycle latency, 6-cycle occupancy
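Latency and occupancy are different costs. Latency is how long one access takes. Occupancy is how long the L2 or the memory stays busy before it can start the next access. The sketch below assumes a single unbanked resource that serves requests in arrival order, and it uses the figures above (L2: 10/2, memory: 50/6). The function name is hypothetical.

```python
from typing import List


def completion_times(arrivals: List[int], latency: int, occupancy: int) -> List[int]:
    """Cycle at which each request's data returns from one pipelined resource.

    Requests are served in arrival order (results follow sorted arrival times).
    Latency runs from the start of service until the data returns; occupancy
    is how long the resource is busy before it can start the next request.
    """
    free_at = 0
    done = []
    for t in sorted(arrivals):
        start = max(t, free_at)      # wait while the resource is still occupied
        free_at = start + occupancy  # resource busy for `occupancy` cycles
        done.append(start + latency)
    return done
```

Three simultaneous L2 requests return at cycles 10, 12, and 14. Occupancy, not latency, spaces back-to-back accesses, which is why a low-occupancy L2 can sustain several outstanding misses.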
Shared L2-Cache Multiprocessor
- The write-through primary caches have a 1-cycle access time.
- The L2 cache latency increases to 14 cycles due to crossbar overhead.
- The L2 cache has four independent banks to increase its bandwidth and allow it to support four independent access streams.
- The data path is 64 bits wide.
- Occupancy is 4 cycles (for the transfer of a 32-byte cache line).
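The 4-cycle occupancy follows from the line size and the data-path width, assuming one 64-bit transfer per cycle:

```latex
\text{occupancy} = \frac{32\ \text{bytes per line}}{64\ \text{bits} / 8\ \text{bits per byte}} = \frac{32}{8} = 4\ \text{transfers} = 4\ \text{cycles}
```

Under the same assumption, each bank delivers 8 bytes per cycle while busy, so the four banks together can supply at most 32 bytes per cycle.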
- Only memory accesses that miss in the L1 cache pay the penalty of the slower L2 cache.
- MCM (multi-chip module) technology can be used (as of 1996).
- Main memory:
  - 50-cycle latency
  - 6-cycle occupancy
- To keep the primary caches coherent, a coherence protocol is needed.
- For simplicity, we assume that each primary cache uses a write-through policy for shared data.
- Additional hardware is required to implement this.
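The slide does not specify the coherence hardware. A common way to pair write-through L1s with a shared L2 is write-invalidate: every write goes to the L2 and removes the line from the other processors' L1s. The sketch below assumes that scheme. The class name and the no-write-allocate choice are hypothetical, and capacity, timing, and write buffers are ignored.

```python
from typing import Dict, List


class SharedL2System:
    """Private write-through L1 caches kept coherent by invalidating other copies on a write."""

    def __init__(self, n_cpus: int) -> None:
        self.l2: Dict[int, int] = {}                              # authoritative copy
        self.l1: List[Dict[int, int]] = [{} for _ in range(n_cpus)]

    def read(self, cpu: int, addr: int) -> int:
        cache = self.l1[cpu]
        if addr not in cache:                  # L1 miss: fetch from the shared L2
            cache[addr] = self.l2.get(addr, 0)
        return cache[addr]

    def write(self, cpu: int, addr: int, value: int) -> None:
        self.l2[addr] = value                  # write-through: L2 always holds the latest value
        for other, cache in enumerate(self.l1):
            if other != cpu:
                cache.pop(addr, None)          # invalidate stale copies in other L1s
        if addr in self.l1[cpu]:
            self.l1[cpu][addr] = value         # update the writer's copy (no write-allocate)
```

After processor 0 writes, processor 1's next read misses and fetches the new value from the L2. That miss is the 14-cycle communication path of this architecture.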
Shared Main Memory Multiprocessor
- Primary cache access time is 1 cycle.
- Secondary cache access time is 12 cycles.
- All CPUs must access main memory to communicate.
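The three architectures differ mainly in where the processors meet in the memory hierarchy. The rough sketch below assumes that every interprocessor data transfer costs the ideal latency of the first shared level. It ignores contention and the extra cycles that the shared L1 adds to every ordinary hit. The 50-cycle figure for the shared-memory design is an assumption, since its slide gives no main-memory latency. The names are hypothetical.

```python
# Ideal latency (CPU cycles) of the first memory level the processors share.
SHARED_LEVEL_LATENCY = {
    "shared-L1": 3,       # crossbar to the shared L1
    "shared-L2": 14,      # write-through L1s; communication through the shared L2
    "shared-memory": 50,  # assumed: communication only through main memory
}


def communication_stall_cycles(arch: str, comm_misses: int) -> int:
    """Cycles spent on misses that move data between processors (no contention)."""
    return comm_misses * SHARED_LEVEL_LATENCY[arch]
```

For an application that incurs 1000 communication misses, this gives 3000, 14000, and 50000 cycles. The gap illustrates why fine-grained programs favor an interconnect close to the CPUs.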
[Table: Ideal Memory Latencies of the Three Architectures, in CPU Clock Cycles]
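The standard average-memory-access-time formula shows how these latencies trade off: L1 hit time + L1 miss rate × (L2 latency + L2 local miss rate × memory latency). The sketch below uses the latencies quoted on the slides. The miss rates are made-up values for illustration. The shared-memory design's 50-cycle memory latency is an assumption, since its slide does not state it. Contention is ignored.

```python
from dataclasses import dataclass


@dataclass
class Hierarchy:
    l1: int   # L1 hit time (cycles)
    l2: int   # L2 access latency (cycles)
    mem: int  # main-memory latency (cycles)


# Ideal latencies quoted on the slides; the shared-memory `mem` value is assumed.
ARCHS = {
    "shared-L1": Hierarchy(l1=3, l2=10, mem=50),
    "shared-L2": Hierarchy(l1=1, l2=14, mem=50),
    "shared-memory": Hierarchy(l1=1, l2=12, mem=50),
}


def amat(h: Hierarchy, l1_miss: float, l2_miss: float) -> float:
    """Average memory access time using local miss rates for each level."""
    return h.l1 + l1_miss * (h.l2 + l2_miss * h.mem)
```

With the same hypothetical miss rates everywhere (5% in L1, 20% in L2), the results are 4.0, 2.2, and 2.1 cycles. The shared L1 looks worst here because every hit pays 3 cycles. Its advantage has to come from fewer misses, which is what fine-grained sharing is expected to provide.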
SIMULATION ENVIRONMENT
- The SimOS simulation environment is used.
- The IRIX 5.3 operating system is simulated.
- Workloads:
  - Hand-parallelized scientific and engineering applications
  - Compiler-parallelized scientific and engineering applications
  - Multiprogramming workload
- Two kinds of simulation are performed:
  - Simple simulation (no speculative execution, dynamic scheduling, or non-blocking memory references)
  - Dynamic superscalar simulation
SIMPLE SIMULATION RESULTS
High degree of interprocessor communication:
- EAR
- EQNTOTT
Moderate degree of interprocessor communication:
- VOLPACK
- FFT kernel
Low degree of interprocessor communication:
- Multiprogramming workload
- OCEAN
DYNAMIC SUPERSCALAR SIMULATION RESULTS
In the dynamic superscalar simulation, the performance of the shared-L1 architecture can diminish substantially, whereas the shared-L2 and shared-memory architectures retain much of the relative performance predicted by the simple simulation.
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
By Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. ISCA 27, 2000, pp. 282-293.
- Targeted at online transaction processing (OLTP) systems
- Standard ASIC design technology is used.
- The centerpiece of the Piranha architecture is a highly integrated processing node with eight simple Alpha processor cores, separate instruction and data caches for each core, a shared second-level cache, eight memory controllers, two coherence protocol engines, and a network router, all on a single chip.
SIMULATION