Lecture 26: Case Studies - PowerPoint PPT Presentation

Provided by: rajeevbala

Learn more at: https://my.eng.utah.edu

Transcript and Presenter's Notes

1
Lecture 26: Case Studies

Topics: processor case studies, Flash memory
Final exam stats:
Highest: 83, median: 67
70+: 16 students, 60-69: 20 students
1st 3 problems and 7th problem: gimmes
4th problem (LSQ): half the students got full points
5th problem (cache hierarchy): 1 correct solution
6th problem (coherence): most got more than 15 points
8th problem (TM): very few mentioned frequent aborts, starvation, and livelock
9th problem (TM): no one got close to full points
10th problem (LRU): 1 correct solution with the tree structure


2
Finals Discussion: LSQ, Caches, TM, LRU
3
Case Study I: Sun's Niagara

Commercial servers require high thread-level throughput and suffer from cache misses
Sun's Niagara focuses on:
simple cores (low power, low design complexity, can accommodate more cores)
fine-grain multi-threading (to tolerate long memory latencies)
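
As a rough illustration of how fine-grain multi-threading hides memory latency, the sketch below models a single-issue core that picks a ready thread each cycle in round-robin order and skips any thread waiting on a cache miss. The function name, the instruction traces, and the miss latency are hypothetical values chosen for illustration; they are not Niagara parameters.

```python
def run_fine_grain(traces, miss_latency):
    """Simulate a single-issue core that picks a ready thread each cycle.

    traces: one instruction list per thread; each entry is "alu" or "miss".
    Returns (total_cycles, idle_cycles).
    """
    n = len(traces)
    pc = [0] * n            # next instruction index for each thread
    ready_at = [0] * n      # first cycle at which each thread may issue again
    cycle = 0
    idle = 0
    last = -1               # thread that issued most recently
    while any(pc[t] < len(traces[t]) for t in range(n)):
        issued = False
        # Round-robin search, starting just after the last thread that issued.
        for k in range(1, n + 1):
            t = (last + k) % n
            if pc[t] < len(traces[t]) and ready_at[t] <= cycle:
                if traces[t][pc[t]] == "miss":
                    # The thread cannot issue again until the miss returns.
                    ready_at[t] = cycle + 1 + miss_latency
                pc[t] += 1
                last = t
                issued = True
                break
        if not issued:
            idle += 1       # every unfinished thread is stalled this cycle
        cycle += 1
    return cycle, idle


one = [["miss", "alu"]]
four = [["miss", "alu"] for _ in range(4)]
print(run_fine_grain(one, miss_latency=3))   # (5, 3)
print(run_fine_grain(four, miss_latency=3))  # (8, 0)
```

With one thread, three of the five cycles are idle; with four threads, the misses overlap and the core issues every cycle.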

4
Niagara Overview
5
SPARC Pipe
No branch predictor
Low clock speed (1.2 GHz)
One FP unit shared by all cores
6
Case Study II: Sun's Rock

16 cores, each with 2 thread contexts
10 W per core (14 mm²), each core is in-order and 2.3 GHz (10-12 FO4) (65 nm), total of 240 W and 396 mm²!
New features: scout threads that prefetch while the main thread is stalled on memory access, support for HTM (lazy versioning and eager CD)
Each cluster of 4 cores shares a 32KB I-cache, two 32KB D-caches (one D-cache for two cores), and 2 FP units. Caches are 4-way p-LRU.
L2 cache is 4-banked 8-way p-LRU and 2 MB. Clusters are connected with a crossbar switch
Good read: http://www.opensparc.net/pubs/preszo/08/RockISSCC08.pdf

7
Rock Overview
8
Case Study III: Intel Pentium 4

Pursues high clock speed, ILP, and TLP
CISC instrs are translated into micro-ops and stored in a trace cache to avoid translations every time
Uses register renaming with 128 physical registers
Supports up to 48 loads and 32 stores
Rename/commit width of 3; up to 6 instructions can be dispatched to functional units every cycle
Simple instruction has to traverse a 31-stage pipeline
Combining branch predictor with local and global histories
16KB 8-way L1: 4-cyc for ints, 12-cyc for FPs
2MB 8-way L2, 18-cyc

9
Clock Rate vs. CPI: AMD Opteron vs. P4
2.8 GHz AMD Opteron vs. 3.8 GHz Intel P4: Opteron provides a speedup of 1.08
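
The comparison can be unpacked with the usual relation: execution time = instruction count × CPI / clock rate. The sketch below assumes both processors execute the same number of instructions (an assumption; the slide does not say so) and derives the CPI ratio implied by the 1.08 speedup. The function name is hypothetical.

```python
def implied_cpi_ratio(speedup_a_over_b, clock_a_ghz, clock_b_ghz):
    """Return CPI_b / CPI_a, assuming equal instruction counts.

    time = instructions * CPI / clock, so
    speedup_a_over_b = time_b / time_a = (CPI_b / CPI_a) * (clock_a / clock_b).
    """
    return speedup_a_over_b * clock_b_ghz / clock_a_ghz


# a = Opteron (2.8 GHz), b = P4 (3.8 GHz), speedup of a over b = 1.08
ratio = implied_cpi_ratio(1.08, clock_a_ghz=2.8, clock_b_ghz=3.8)
print(f"P4 CPI / Opteron CPI = {ratio:.2f}")  # 1.47
```

Under that assumption, the P4 needs about 1.47 times as many cycles per instruction, which more than offsets its roughly 36% higher clock rate.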
10
Case Study IV: Intel Core Architecture

Single-thread execution is still considered important:
out-of-order execution and speculation very much alive
initial processors will have few heavy-weight cores
To reduce power consumption, the Core architecture (14 pipeline stages) is closer to the Pentium M (12 stages) than the P4 (30 stages)
Many transistors invested in a large branch predictor to reduce wasted work (power)
Similarly, SMT is also not guaranteed for all incarnations of the Core architecture (SMT makes a hotspot hotter)

11
Case Study V: Intel Nehalem

Quad core, each with 2 SMT threads
ROB of 96 in Core 2 has been increased to 128 in Nehalem; ROB dynamically allocated across threads
Lots of power modes; in-built power control unit
32KB I and D L1 caches, 10-cycle 256KB private L2 cache per core, 8MB shared L3 cache (40 cycles)
L1 dTLB: 64/32 entries (page sizes of 4KB or 4MB), 512-entry L2 TLB (small pages only)

12
Nehalem Memory Controller Organization
[Figure: four sockets (Socket 1 to Socket 4) linked by QPI; each socket contains four cores (Core 1 to Core 4) and three memory controllers (MC1 to MC3) with DIMMs attached.]
13
Flash Memory

Technology cost-effective enough that flash memory can now replace magnetic disks on laptops (also known as solid-state disks)
Non-volatile, fast read times (15 MB/sec) (slower than DRAM), a write requires an entire block to be erased first (about 100K erases are possible) (block sizes can be 16-512KB)
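
A minimal sketch of the erase-before-write constraint: updating any bytes in a block means reading the block, erasing it, and programming it back, and each erase counts against the block's endurance. The class name, the tiny block size, and the in-memory model are hypothetical simplifications; real devices add a translation layer and wear leveling, which are not modeled here.

```python
class FlashBlock:
    """Toy model of one flash block (hypothetical, for illustration only)."""

    def __init__(self, size, max_erases=100_000):
        self.size = size
        self.max_erases = max_erases   # "about 100K erases" from the slide
        self.erase_count = 0
        self.data = bytearray(b"\xff" * size)  # erased flash reads as all 1s

    def erase(self):
        """Erase the whole block; this is the operation that wears it out."""
        if self.erase_count >= self.max_erases:
            raise RuntimeError("block worn out")
        self.erase_count += 1
        self.data = bytearray(b"\xff" * self.size)

    def write(self, offset, payload):
        """Update part of the block: copy out, erase, then program back."""
        if offset < 0 or offset + len(payload) > self.size:
            raise ValueError("write does not fit in block")
        shadow = bytearray(self.data)                    # read the whole block
        shadow[offset:offset + len(payload)] = payload   # modify the copy
        self.erase()                                     # erase the entire block
        self.data = shadow                               # program new contents

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])


blk = FlashBlock(size=16)
blk.write(0, b"abc")
blk.write(3, b"de")
print(blk.read(0, 5), blk.erase_count)  # b'abcde' 2
```

Two small updates cost two full-block erases, which is why small writes are expensive on flash and why the erase budget matters.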

14
Advanced Course

Spr09: CS 7810, Advanced Computer Architecture
co-taught by Al Davis and me
lots of multi-core topics: cache coherence, TM, networks
memory technologies: DRAM layouts, new technologies, memory controller design
Major course project on evaluating original ideas with simulators (can lead to publications)
One programming assignment, take-home final

15
Case Studies: More Processors

AMD Barcelona: 4 cores, issue width of 3, each core has private L1 (64 KB) and L2 (512 KB), shared L3 (2 MB), 95 W (AMD also has announcements for 3-core chips)
Sun Niagara2: 8 threads per core, up to 8 cores, 60-123 W, 0.9-1.4 GHz, 4 MB L2 (8 banks), 8 FP units
IBM Power6: 2 cores, 4.7 GHz, each core has a private 4 MB L2

16
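Alpha Address Mapping
[Figure: the virtual address is divided into unused bits (21 bits), Level 1 (10 bits), Level 2 (10 bits), Level 3 (10 bits), and page offset (13 bits). The page table base register locates the L1 page table; the PTE selected at each level leads to the L2 page table and then the L3 page table, whose PTE supplies a 32-bit physical page number. The physical page number and the page offset together form the 45-bit physical address.]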
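
The field widths in the figure can be turned into a small address-splitting sketch. It assumes the layout, from most to least significant bit, is 21 unused bits, three 10-bit level indices, and a 13-bit page offset (the ordering is inferred from the figure labels), and it uses nested Python dicts as stand-in page tables. The function names and table contents are made up for illustration.

```python
OFFSET_BITS = 13   # 2**13 = 8 KB pages
LEVEL_BITS = 10    # 2**10 = 1024 entries per page table


def split_virtual_address(va):
    """Return (level1, level2, level3, offset) for a 64-bit virtual address."""
    mask = (1 << LEVEL_BITS) - 1
    offset = va & ((1 << OFFSET_BITS) - 1)
    l3 = (va >> OFFSET_BITS) & mask
    l2 = (va >> (OFFSET_BITS + LEVEL_BITS)) & mask
    l1 = (va >> (OFFSET_BITS + 2 * LEVEL_BITS)) & mask
    return l1, l2, l3, offset


def translate(va, l1_table):
    """Walk three levels of dict-based page tables to form a physical address."""
    l1, l2, l3, offset = split_virtual_address(va)
    ppn = l1_table[l1][l2][l3]            # level-3 PTE holds the page number
    return (ppn << OFFSET_BITS) | offset  # 32-bit PPN + 13-bit offset = 45 bits


# Hypothetical mapping: level indices (1, 2, 3) -> physical page 0x1234.
tables = {1: {2: {3: 0x1234}}}
va = (1 << 33) | (2 << 23) | (3 << 13) | 0x0AB
print(split_virtual_address(va))   # (1, 2, 3, 171)
print(hex(translate(va, tables)))  # 0x24680ab
```

Each translation that misses in the TLB costs three dependent table lookups in this scheme, one per level.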
17
Alpha Address Mapping