Lecture 26: Case Studies
1
Lecture 26: Case Studies
  • Topics: processor case studies, Flash memory
  • Final exam stats
  • Highest: 83, median: 67
  • 70: 16 students, 60-69: 20 students
  • 1st 3 problems and 7th problem: gimmes
  • 4th problem (LSQ): half the students got full points
  • 5th problem (cache hierarchy): 1 correct solution
  • 6th problem (coherence): most got more than 15 points
  • 8th problem (TM): very few mentioned frequent aborts, starvation, and livelock
  • 9th problem (TM): no one got close to full points
  • 10th problem (LRU): 1 correct solution with the tree structure

2
Finals Discussion: LSQ, Caches, TM, LRU
3
Case Study I: Sun's Niagara
  • Commercial servers require high thread-level throughput and suffer from cache misses
  • Sun's Niagara focuses on:
  • simple cores (low power, design complexity, can accommodate more cores)
  • fine-grain multi-threading (to tolerate long memory latencies; see the sketch after this list)
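A rough back-of-envelope (my own illustration, not from the slide): if each thread computes for c cycles between cache misses that each stall for L cycles, a single thread keeps the pipeline busy only c/(c+L) of the time, and N fine-grain-multithreaded threads can overlap their stalls. The 25-cycle and 100-cycle numbers below are assumptions.

# Minimal sketch of why fine-grain multithreading hides memory latency.
def utilization(n_threads, compute_cycles, miss_latency):
    # Fraction of time one thread keeps the pipeline busy.
    single = compute_cycles / (compute_cycles + miss_latency)
    # With round-robin switching, stalls of different threads overlap.
    return min(1.0, n_threads * single)

# 25 cycles of work per 100-cycle miss: 1 thread -> 20% busy,
# Niagara-style 4 threads per core -> ~80% busy.
print(utilization(1, 25, 100), utilization(4, 25, 100))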

4
Niagara Overview
5
SPARC Pipe
No branch predictor; low clock speed (1.2 GHz); one FP unit shared by all cores
6
Case Study II: Sun's Rock
  • 16 cores, each with 2 thread contexts
  • 10 W per core (14 mm2), each core is in-order and 2.3 GHz (10-12 FO4) (65 nm), total of 240 W and 396 mm2!
  • New features: scout threads that prefetch while the main thread is stalled on a memory access, support for HTM (lazy versioning and eager CD)
  • Each cluster of 4 cores shares a 32KB I-cache, two 32KB D-caches (one D-cache for two cores), and 2 FP units. Caches are 4-way p-LRU.
  • L2 cache is 4-banked, 8-way p-LRU, and 2 MB.
  • Clusters are connected with a crossbar switch
  • Good read: http://www.opensparc.net/pubs/preszo/08/RockISSCC08.pdf

7
Rock Overview
8
Case Study III: Intel Pentium 4
  • Pursues high clock speed, ILP, and TLP
  • CISC instrs are translated into micro-ops and stored in a trace cache to avoid translations every time
  • Uses register renaming with 128 physical registers
  • Supports up to 48 loads and 32 stores
  • Rename/commit width of 3; up to 6 instructions can be dispatched to functional units every cycle
  • A simple instruction has to traverse a 31-stage pipeline
  • Combining branch predictor with local and global histories
  • 16KB 8-way L1: 4-cyc for ints, 12-cyc for FPs; 2MB 8-way L2: 18-cyc

9
Clock Rate vs. CPI: AMD Opteron vs. P4
2.8 GHz AMD Opteron vs. 3.8 GHz Intel P4: the Opteron provides a speedup of 1.08 (see the back-of-envelope check below)
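As a quick sanity check on those numbers (the speedup and clock rates are from the slide; equal dynamic instruction counts are assumed), execution time is instructions × CPI / clock rate, so the measured 1.08 speedup implies the P4's CPI is roughly 1.47× the Opteron's:

# Relate clock rate, CPI, and the observed speedup.
f_opteron = 2.8e9   # Hz (slide)
f_p4      = 3.8e9   # Hz (slide)
speedup   = 1.08    # Opteron over P4 (slide)

# time = insts * CPI / f  =>  speedup = (CPI_p4 / f_p4) / (CPI_opt / f_opt)
# =>  CPI_p4 / CPI_opt = speedup * f_p4 / f_opteron
cpi_ratio = speedup * f_p4 / f_opteron
print(f"P4 CPI is about {cpi_ratio:.2f}x the Opteron CPI")   # ~1.47x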
10
Case Study IV: Intel Core Architecture
  • Single-thread execution is still considered important: out-of-order execution and speculation are very much alive; initial processors will have few heavy-weight cores
  • To reduce power consumption, the Core architecture (14 pipeline stages) is closer to the Pentium M (12 stages) than the P4 (30 stages)
  • Many transistors invested in a large branch predictor to reduce wasted work (power)
  • Similarly, SMT is also not guaranteed for all incarnations of the Core architecture (SMT makes a hotspot hotter)

11
Case Study V: Intel Nehalem
  • Quad core, each with 2 SMT threads
  • ROB of 96 in Core 2 has been increased to 128 in Nehalem
  • ROB dynamically allocated across threads
  • Lots of power modes; in-built power control unit
  • 32KB I and D L1 caches, 10-cycle 256KB private L2 cache per core, 8MB shared L3 cache (40 cycles)
  • L1 dTLB: 64/32 entries (page sizes of 4KB or 4MB), 512-entry L2 TLB (small pages only)

12
Nehalem Memory Controller Organization
(Figure: four sockets connected by QPI links; each socket holds four cores and three integrated memory controllers, MC1-MC3, each driving its own set of DIMMs.)
13
Flash Memory
  • Technology cost-effective enough that flash memory can now replace magnetic disks on laptops (also known as solid-state disks)
  • Non-volatile, fast read times (15 MB/sec) (slower than DRAM); a write requires an entire block to be erased first (about 100K erases are possible) (block sizes can be 16-512KB); see the sketch after this list
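To make the erase-before-write constraint concrete, here is a minimal sketch (illustrative only, not any real SSD's firmware; the page size is an assumption, while the block-size range and erase limit follow the slide):

BLOCK_SIZE  = 128 * 1024     # bytes per erase block (slide: 16-512KB)
PAGE_SIZE   = 2 * 1024       # bytes per programmable page (assumed)
ERASE_LIMIT = 100_000        # erase cycles before wear-out (slide)

class FlashBlock:
    def __init__(self):
        self.pages = [None] * (BLOCK_SIZE // PAGE_SIZE)  # None = erased
        self.erase_count = 0

    def erase(self):
        if self.erase_count >= ERASE_LIMIT:
            raise RuntimeError("block worn out")
        self.pages = [None] * len(self.pages)
        self.erase_count += 1

    def write(self, page_idx, data):
        # Rewriting a programmed page forces a whole-block erase first,
        # losing the other pages (a real FTL would copy them elsewhere).
        if self.pages[page_idx] is not None:
            self.erase()
        self.pages[page_idx] = data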

14
Advanced Course
  • Spr'09: CS 7810 Advanced Computer Architecture
  • co-taught by Al Davis and me
  • lots of multi-core topics: cache coherence, TM, networks
  • memory technologies: DRAM layouts, new technologies, memory controller design
  • Major course project on evaluating original ideas with simulators (can lead to publications)
  • One programming assignment, take-home final

15
Case Studies: More Processors
  • AMD Barcelona: 4 cores, issue width of 3, each core has private L1 (64 KB) and L2 (512 KB), shared L3 (2 MB), 95 W (AMD also has announcements for 3-core chips)
  • Sun Niagara2: 8 threads per core, up to 8 cores, 60-123 W, 0.9-1.4 GHz, 4 MB L2 (8 banks), 8 FP units
  • IBM Power6: 2 cores, 4.7 GHz, each core has a private 4 MB L2

16
Alpha Address Mapping
(Figure: the virtual address is split into unused bits (21), three 10-bit indices for the Level 1, 2, and 3 page tables, and a 13-bit page offset. The page table base register points to the L1 page table; each level's PTE points to the next table, and the L3 PTE supplies a 32-bit physical page number that, combined with the page offset, forms the 45-bit physical address.)
17
Alpha Address Mapping
  • Each PTE is 8 bytes; if page size is 8KB, a page can contain 1024 PTEs, hence 10 bits to index into each level (a small sketch of the resulting walk follows this list)
  • If page size doubles, we need 47 bits of virtual address
  • Since a PTE only stores 32 bits of physical page number, the physical memory can be addressed by at most 32 + offset bits
  • First two levels are in physical memory; the third is in virtual memory
  • Why the three-level structure? Even a flat structure would need PTEs for the PTEs, and those would have to be stored in physical memory; more levels of indirection make it easier to dynamically allocate pages
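A small sketch of the walk implied by the figure (field widths from the slide: 13-bit page offset, three 10-bit level indices, 32-bit physical page number; the nested dictionaries below simply stand in for the in-memory page tables):

OFFSET_BITS = 13          # 8KB pages
LEVEL_BITS  = 10          # 8KB page / 8-byte PTE = 1024 entries per level

def split_va(va):
    # Carve the virtual address into L1/L2/L3 indices and the page offset.
    offset = va & ((1 << OFFSET_BITS) - 1)
    l3 = (va >> OFFSET_BITS) & ((1 << LEVEL_BITS) - 1)
    l2 = (va >> (OFFSET_BITS + LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
    l1 = (va >> (OFFSET_BITS + 2 * LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
    return l1, l2, l3, offset

def translate(va, l1_table):
    # Walk L1 -> L2 -> L3; the final PTE holds a 32-bit physical page number.
    l1, l2, l3, offset = split_va(va)
    l2_table = l1_table[l1]
    l3_table = l2_table[l2]
    ppn = l3_table[l3]
    return (ppn << OFFSET_BITS) | offset   # 45-bit physical address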
