Title: Lecture 26: Case Studies
1. Lecture 26: Case Studies
- Topics: processor case studies, Flash memory
- Final exam stats:
- Highest: 83, median: 67
- 70+: 16 students, 60-69: 20 students
- 1st 3 problems and 7th problem: gimmes
- 4th problem (LSQ): half the students got full points
- 5th problem (cache hierarchy): 1 correct solution
- 6th problem (coherence): most got more than 15 points
- 8th problem (TM): very few mentioned frequent aborts, starvation, and livelock
- 9th problem (TM): no one got close to full points
- 10th problem (LRU): 1 correct solution with the tree structure
2. Finals Discussion: LSQ, Caches, TM, LRU
3. Case Study I: Sun's Niagara
- Commercial servers require high thread-level throughput and suffer from cache misses
- Sun's Niagara focuses on:
- simple cores (low power, low design complexity, can accommodate more cores)
- fine-grain multi-threading (to tolerate long memory latencies)
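A minimal Python sketch of why fine-grain multi-threading tolerates long memory latencies. The function `simulate`, the miss pattern (every `miss_every`-th instruction of a thread misses), and the latency value are illustrative assumptions, not Niagara's actual parameters. The toy core issues one instruction per cycle from the next ready thread in round-robin order.

```python
def simulate(num_threads, instrs_per_thread, miss_every, miss_latency):
    """Toy single-issue core with round-robin thread selection.

    Every `miss_every`-th instruction of a thread is treated as a cache
    miss that stalls only that thread for `miss_latency` cycles
    (hypothetical workload). Returns the total number of cycles.
    """
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads   # first cycle in which each thread may issue
    issued = [0] * num_threads
    cycle = 0
    next_thread = 0
    while any(r > 0 for r in remaining):
        # Search for a ready thread, starting after the last one that issued.
        for i in range(num_threads):
            t = (next_thread + i) % num_threads
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                issued[t] += 1
                if issued[t] % miss_every == 0:
                    ready_at[t] = cycle + 1 + miss_latency
                next_thread = (t + 1) % num_threads
                break
        cycle += 1   # the cycle is spent whether or not anything issued
    return cycle


one = simulate(num_threads=1, instrs_per_thread=8, miss_every=4, miss_latency=10)
four = simulate(num_threads=4, instrs_per_thread=8, miss_every=4, miss_latency=10)
print(one, four)
```

With these made-up numbers, four threads complete four times the work in roughly twice the cycles of a single thread, because other threads issue while one waits on memory.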
4. Niagara Overview
5. SPARC Pipe
- No branch predictor
- Low clock speed (1.2 GHz)
- One FP unit shared by all cores
6. Case Study II: Sun's Rock
- 16 cores, each with 2 thread contexts
- 10 W per core (14 mm²), each core is in-order and 2.3 GHz (10-12 FO4) (65 nm), total of 240 W and 396 mm²!
- New features: scout threads that prefetch while the main thread is stalled on memory access, support for HTM (lazy versioning and eager conflict detection)
- Each cluster of 4 cores shares a 32KB I-cache, two 32KB D-caches (one D-cache for two cores), and 2 FP units. Caches are 4-way p-LRU.
- L2 cache is 4-banked, 8-way p-LRU, and 2 MB
- Clusters are connected with a crossbar switch
- Good read: http://www.opensparc.net/pubs/preszo/08/RockISSCC08.pdf
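The 4-way p-LRU (pseudo-LRU) policy mentioned for the L1 caches is commonly built as a small binary tree of bits. The sketch below shows that general technique for one 4-way set; the class name `TreePLRU4` and the bit encoding are assumptions for illustration, and the slides do not say how Rock implements it.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way set: 3 bits instead of a full LRU order.

    Assumed encoding: each bit points toward the side to victimize next.
    bits[0] selects ways {0,1} (value 0) or ways {2,3} (value 1);
    bits[1] selects within {0,1}; bits[2] selects within {2,3}.
    """

    def __init__(self):
        self.bits = [0, 0, 0]

    def touch(self, way):
        """On an access to `way`, point every bit on its path away from it."""
        if way < 2:
            self.bits[0] = 1             # next victim comes from {2,3}
            self.bits[1] = 1 - way       # within {0,1}, prefer the other way
        else:
            self.bits[0] = 0             # next victim comes from {0,1}
            self.bits[2] = 1 - (way - 2)

    def victim(self):
        """Follow the bits from the root to choose the way to replace."""
        if self.bits[0] == 0:
            return self.bits[1]          # way 0 or 1
        return 2 + self.bits[2]          # way 2 or 3


plru = TreePLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
print(plru.victim())   # 0, same as true LRU
plru.touch(0)
print(plru.victim())   # 2, whereas true LRU would pick way 1
```

The second result shows why the policy is only "pseudo" LRU: three bits cannot record the full access order, so the victim sometimes differs from the true least recently used way.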
7. Rock Overview
8. Case Study III: Intel Pentium 4
- Pursues high clock speed, ILP, and TLP
- CISC instrs are translated into micro-ops and stored in a trace cache to avoid translations every time
- Uses register renaming with 128 physical registers
- Supports up to 48 loads and 32 stores
- Rename/commit width of 3; up to 6 instructions can be dispatched to functional units every cycle
- Simple instruction has to traverse a 31-stage pipeline
- Combining branch predictor with local and global histories
- 16KB 8-way L1: 4-cyc for ints, 12-cyc for FPs; 2MB 8-way L2: 18-cyc
9. Clock Rate vs. CPI: AMD Opteron vs. P4
2.8 GHz AMD Opteron vs. 3.8 GHz Intel P4: Opteron provides a speedup of 1.08
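A short worked calculation of what this speedup implies about CPI. It assumes both processors execute the same number of instructions (plausible since both run x86 code, but an assumption nonetheless), so that execution time is proportional to CPI / clock rate. The function name `cpi_ratio` is made up for this sketch.

```python
def cpi_ratio(speedup, f_winner_ghz, f_loser_ghz):
    """Return CPI_loser / CPI_winner implied by a speedup and two clock rates.

    With equal instruction counts, time = IC * CPI / f, so
    speedup = T_loser / T_winner = (CPI_loser / CPI_winner) * (f_winner / f_loser).
    """
    return speedup * f_loser_ghz / f_winner_ghz


ratio = cpi_ratio(1.08, f_winner_ghz=2.8, f_loser_ghz=3.8)
print(f"P4 CPI is about {ratio:.2f}x the Opteron CPI")
```

Under that assumption, the P4's CPI is roughly 1.47 times the Opteron's, which more than cancels its 36% clock-rate advantage.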
10. Case Study IV: Intel Core Architecture
- Single-thread execution is still considered important:
- out-of-order execution and speculation very much alive
- initial processors will have few heavy-weight cores
- To reduce power consumption, the Core architecture (14 pipeline stages) is closer to the Pentium M (12 stages) than the P4 (30 stages)
- Many transistors invested in a large branch predictor to reduce wasted work (power)
- Similarly, SMT is also not guaranteed for all incarnations of the Core architecture (SMT makes a hotspot hotter)
11. Case Study V: Intel Nehalem
- Quad core, each with 2 SMT threads
- ROB of 96 in Core 2 has been increased to 128 in Nehalem; ROB dynamically allocated across threads
- Lots of power modes; in-built power control unit
- 32KB I and D L1 caches, 10-cycle 256KB private L2 cache per core, 8MB shared L3 cache (40 cycles)
- L1 dTLB: 64/32 entries (page sizes of 4KB or 4MB), 512-entry L2 TLB (small pages only)
12. Nehalem Memory Controller Organization
[Figure: four sockets (Socket 1 to Socket 4) connected by QPI. Each socket contains four cores (Core 1 to Core 4) and three memory controllers (MC1 to MC3), each attached to its own DIMM.]
13. Flash Memory
- Technology cost-effective enough that flash memory can now replace magnetic disks on laptops (also known as solid-state disks)
- Non-volatile; fast read times (15 MB/sec), though slower than DRAM; a write requires an entire block to be erased first (about 100K erases are possible; block sizes can be 16-512KB)
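A toy Python model of the erase-before-write constraint and the limited erase budget described above. The class `FlashBlock`, the four-page block, and the in-place read-erase-rewrite policy are simplifying assumptions for illustration; real solid-state disks typically remap writes to already-erased blocks rather than rewriting in place.

```python
class FlashBlock:
    """One flash block made of several pages (hypothetical, simplified model)."""

    def __init__(self, pages_per_block=4, max_erases=100_000):
        self.pages = [None] * pages_per_block   # None means erased (writable)
        self.erase_count = 0
        self.max_erases = max_erases            # about 100K, per the slide

    def erase(self):
        """Erase the whole block; each erase consumes part of its lifetime."""
        if self.erase_count >= self.max_erases:
            raise RuntimeError("block worn out")
        self.pages = [None] * len(self.pages)
        self.erase_count += 1

    def write(self, page, data):
        """Write one page. Overwriting a programmed page forces a
        save / erase / write-back of the entire block."""
        if self.pages[page] is None:
            self.pages[page] = data
            return
        saved = list(self.pages)    # copy the live contents out
        self.erase()
        saved[page] = data
        self.pages = saved          # write everything back


blk = FlashBlock()
blk.write(0, "a")
blk.write(1, "b")
blk.write(0, "c")                   # overwrite triggers a block erase
print(blk.pages, blk.erase_count)
```

Updating a single page costs an erase of the whole block in this model, which is why small overwrites are expensive and why wear accumulates per block rather than per page.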
14. Advanced Course
- Spring 2009: CS 7810, Advanced Computer Architecture
- co-taught by Al Davis and me
- lots of multi-core topics: cache coherence, TM, networks
- memory technologies: DRAM layouts, new technologies, memory controller design
- Major course project on evaluating original ideas with simulators (can lead to publications)
- One programming assignment, take-home final
15. Case Studies: More Processors
- AMD Barcelona: 4 cores, issue width of 3, each core has private L1 (64 KB) and L2 (512 KB), shared L3 (2 MB), 95 W (AMD also has announcements for 3-core chips)
- Sun Niagara2: 8 threads per core, up to 8 cores, 60-123 W, 0.9-1.4 GHz, 4 MB L2 (8 banks), 8 FP units
- IBM Power6: 2 cores, 4.7 GHz, each core has a private 4 MB L2
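16. Alpha Address Mapping
[Figure: the 64-bit virtual address is split into unused bits (21 bits), Level 1 index (10 bits), Level 2 index (10 bits), Level 3 index (10 bits), and page offset (13 bits). The page table base register locates the L1 page table; the PTE selected at each level points to the next table (L1, then L2, then L3). The L3 PTE supplies a 32-bit physical page number, which is combined with the page offset to form the 45-bit physical address.]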
17. Alpha Address Mapping
- Each PTE is 8 bytes; if page size is 8KB, a page can contain 1024 PTEs, so 10 bits index into each level
- If page size doubles, we need 47 bits of virtual address
- Since a PTE only stores 32 bits of physical page number, the physical memory can be addressed by at most 32 + offset bits
- First two levels are in physical memory; third is in virtual
- Why the three-level structure? Even a flat structure would need PTEs for the PTEs that would have to be stored in physical memory; more levels of indirection make it easier to dynamically allocate pages
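The bit counts above can be checked with a short Python sketch. The helper names `va_bits` and `split_va` are made up for illustration; the only inputs taken from the slides are the 8-byte PTE, the 8KB page, and the three levels. Page sizes are assumed to be powers of two.

```python
def field_widths(page_size_bytes, pte_bytes=8):
    """Return (index_bits, offset_bits) for a power-of-two page size."""
    offset_bits = page_size_bytes.bit_length() - 1                # log2(page size)
    index_bits = (page_size_bytes // pte_bytes).bit_length() - 1  # log2(PTEs per page)
    return index_bits, offset_bits


def va_bits(page_size_bytes, pte_bytes=8, levels=3):
    """Virtual address bits covered by a multi-level table of one-page nodes."""
    index_bits, offset_bits = field_widths(page_size_bytes, pte_bytes)
    return levels * index_bits + offset_bits


def split_va(va, page_size_bytes=8192, pte_bytes=8, levels=3):
    """Split a virtual address into ([level-1 .. level-N indices], page offset)."""
    index_bits, offset_bits = field_widths(page_size_bytes, pte_bytes)
    offset = va & ((1 << offset_bits) - 1)
    va >>= offset_bits
    indices = []
    for _ in range(levels):
        indices.append(va & ((1 << index_bits) - 1))
        va >>= index_bits
    indices.reverse()    # lowest field was the last level; put level 1 first
    return indices, offset


print(va_bits(8 * 1024))     # 43 = 3 * 10 + 13
print(va_bits(16 * 1024))    # 47 = 3 * 11 + 14
print(split_va((1 << 33) | (2 << 23) | (3 << 13) | 5))   # ([1, 2, 3], 5)
```

Doubling the page size adds one bit to the offset and one bit to each of the three indices, which is where the 47-bit figure comes from. Likewise, a 32-bit physical page number plus the 13-bit offset gives the 45-bit physical address for 8KB pages.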