Multicore: Commercial Processors - PowerPoint PPT Presentation

Provided by: cre29
Title: Multicore: Commercial Processors


1
Multicore: Commercial Processors
2
Some Examples
  • Desktop and Server/Enterprise Space: Intel, AMD,
    SUN Microsystems
  • The Embedded Space: Freescale Semiconductor

3
Focus
  • The Chip Level Architecture
  • What do we have on chip?
  • The Core Architecture
  • Note the presence/absence/configuration of
    concepts studied earlier in class
  • Rationalize the design decisions that led to the
    preceding
  • What can/should we expect next?
  • Building systems using multicore chips

4
The Intel Core Duo Processor Series
5
Intel Core Duo
  • Homogeneous cores
  • Bus-based on-chip interconnect
  • Shared Memory
  • Traditional I/O

Classic OOO: reservation stations, issue ports, schedulers, etc.
Source Intel Corp.
L2 cache: large, shared, set associative, with prefetch, etc.
6
Intel Core Duo Vital Stats
  • 151 million transistors; shared 2 MB L2 cache
  • Each core has a 12-stage pipeline (Yonah)
  • Low-power (less than 25 watts) dual-core
    microprocessor
  • Supports Intel's Vanderpool virtualization
    technology
  • EM64T (Intel x86-64 extensions) is not supported
  • For the desktop market this is not severe, given
    the lack of 64-bit OS and software
  • The Sossaman server processor, which is based on
    Yonah, also lacks EM64T support, a severe
    disadvantage in that market
  • Communication between the L2 cache and both
    execution cores is handled by an arbitration bus
    unit
  • Eliminates cache coherency traffic over the FSB
  • Raises the core-to-L2 latency
  • The increase in clock frequency offsets the
    impact
  • Core processors communicate with the system
    chipset over a 667 MT/s front side bus (FSB), up
    from 533 MT/s used by the fastest Pentium M.
  • Intel Core Solo uses the same two-core die as the
    Core Duo, but features only one active core
  • Chips with one core failing quality control can
    still be sold
  • Core 2 Duo processors will also include the
    ability to disable one core to conserve power
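
As a rough check on the FSB figures above, peak bandwidth is the transfer rate multiplied by the bus width. The sketch below assumes a 64-bit (8-byte) data bus, which the slide does not state; the function name is illustrative only.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Peak bus bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_bytes=8 is an assumption (64-bit data bus), not a figure from the slide.
    """
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


core_duo = peak_bandwidth_gb_s(667)   # 667 MT/s FSB
pentium_m = peak_bandwidth_gb_s(533)  # 533 MT/s FSB

print(f"Core Duo FSB:  {core_duo:.2f} GB/s")
print(f"Pentium M FSB: {pentium_m:.2f} GB/s")
```

Under that assumption the step from 533 MT/s to 667 MT/s raises peak bandwidth by about a quarter.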

7
The Core micro-architecture
Source Ars Technica
8
The Core Execution core
Source Ars Technica
9
Intel Core Duo
  • High memory latency due to the lack of on-die
    memory controller (further aggravated by
    system-chipset's use of DDR-II RAM)
  • Main-memory transactions have to pass through
    the Northbridge of the chipset
  • Higher latency compared to AMD's Turion
    platform.
  • Weakness shared by the entire line of Pentium
    processors
  • L2-cache is quite effective at hiding main-memory
    latency
  • Execution units
  • Three 64-bit integer exec units
  • One CIU (complex) and two SIUs (simple)
  • Two FPUs
  • Poor Floating Point Unit (FPU) throughput
  • Little or no improvement in "performance per
    watt" in single-threaded applications compared
    to its predecessor.

10
Core 2 Duo and Core Duo
Source Intel Corp.
  • Very similar architectures
  • Bump in the processor speed
  • Increase in Level 2 cache. (2MB to 4MB)
  • Both chips are built on a 65-nm process and
    support a 667 MHz front-side bus (FSB).
  • Core 2 Duo has a 14-stage pipeline

11
Intel Core™2 Duo Processor
Process Technology: 65 nm
Number of Processor Cores: 2
L2 Cache Size (shared between 2 processor cores): Up to 4 MB
Transistor Gate Height / Gate Oxide Thickness (65 nm): 1.2 nm
Transistor Gate Length (for 65 nm Process Technology): 35 nm
Line Width: 65 nm
Number of Transistors: 291 million
Processor Die Size: 143 mm²
Average Power: < 1.1 Watt
12
Intel Core 2 Duo
Source Hard Core Hardware
13
Wide Dynamic Execution
Source Bit Tech
14
Wide Dynamic Execution
Source Bit Tech
15
Wide Dynamic Execution
  • Pipe width of 4 execution units per chip (Pentium
    M/Pentium 4 Netburst have 3)
  • Delivery of more instructions per clock cycle
  • Pipeline depth of 14 vs. 31 in Pentium 4 Prescott
  • Compromise between efficient execution of short
    instructions and long instructions
  • Ops fusion
  • Less work for the processor pipeline to run
  • Micro-ops fusion
  • fuse together repetitive instructions in x86 code
  • Macro-ops fusion
  • works on the x86 instructions themselves, not
    just their micro derivatives.
  • Instruction loads and micro-ops can be reduced by
    approximately 15% and 10%, respectively
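
To make the macro-ops fusion bullet concrete, here is a toy sketch that merges an adjacent compare (or test) and conditional jump into one fused entry. It is an illustration only, not Intel's decoder: the mnemonic sets and the rule that only this pair fuses are assumptions made for the example.

```python
# Illustrative mnemonic sets (assumed for this sketch, not an exhaustive list).
COMPARES = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def macro_fuse(instructions):
    """Merge each adjacent (compare, conditional jump) pair into one entry."""
    fused = []
    i = 0
    while i < len(instructions):
        op = instructions[i].split()[0]
        nxt = instructions[i + 1].split()[0] if i + 1 < len(instructions) else None
        if op in COMPARES and nxt in COND_JUMPS:
            # The pair travels down the pipeline as a single fused operation.
            fused.append(instructions[i] + " + " + instructions[i + 1])
            i += 2
        else:
            fused.append(instructions[i])
            i += 1
    return fused


program = [
    "mov eax, [esi]",
    "cmp eax, ebx",
    "jne loop",
    "add ecx, 1",
    "test ecx, ecx",
    "je done",
]
fused_program = macro_fuse(program)
print(len(program), "instructions ->", len(fused_program), "pipeline entries")
```

In this small program two pairs fuse, so six x86 instructions occupy four pipeline entries.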

16
Intelligent Power Capability
Source Bit Tech
17
Intelligent Power Capability
  • SpeedStep technology
  • Dynamic clock speed reduction
  • Intel mobile processors include this already
  • Enhanced SpeedStep used in Core 2 Duo
  • Controller that turns on sections of the
    processor as needed. One core can be shut down
    for single-threaded applications
  • Power consumption decreased by enhancements to
    Intel's 65nm process node
  • use Low-K dielectrics and strained silicon
  • use low-leakage and "sleep" transistors
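
The benefit of dynamic clock speed reduction can be estimated with the usual first-order model of dynamic power, P ∝ C·V²·f. The sketch below uses that model only; the operating point (80% voltage, 50% frequency) is a hypothetical example and not a published SpeedStep setting.

```python
def relative_dynamic_power(v_ratio: float, f_ratio: float) -> float:
    """Dynamic power relative to nominal, using P proportional to V^2 * f.

    Capacitance is assumed unchanged, and leakage power is ignored.
    """
    return v_ratio ** 2 * f_ratio


# Hypothetical low-power state: 80% of nominal voltage, 50% of nominal clock.
scaled = relative_dynamic_power(v_ratio=0.8, f_ratio=0.5)
print(f"Dynamic power at the reduced state: {scaled:.0%} of nominal")
```

Because voltage enters squared, lowering voltage together with frequency saves more than lowering frequency alone.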

18
Advanced Smart Cache
Source Bit Tech
19
Advanced Smart Cache
Source Bit Tech
  • Both cores share data stored in the L2 cache via
    an arbitration bus unit embedded in the cache.
  • Dynamically allocates cache space between the two
    cores, minimising bus traffic by allowing both
    cores to access one copy of data
  • Does larger L2 cache matter?
  • Studies point out that improvements in execution
    time are small (2-4%) when going from 2 MB to
    4 MB for most applications
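
The effect of dynamically sharing cache space can be illustrated with a toy LRU model: one shared cache of 8 lines versus two private caches of 4 lines each. The workload (core A cycling over 6 lines, core B over 2) and the sizes are invented for the example, and LRU is an assumed replacement policy.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with LRU replacement; counts misses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return
        self.misses += 1
        self.lines[tag] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict least recently used


def trace(rounds=10):
    """Interleave core A (6-line working set) with core B (2-line working set)."""
    for _ in range(rounds):
        for i in range(6):
            yield ("A", i)
            yield ("B", i % 2)


def run_private():
    caches = {"A": LRUCache(4), "B": LRUCache(4)}
    for core, line in trace():
        caches[core].access((core, line))
    return caches["A"].misses + caches["B"].misses


def run_shared():
    cache = LRUCache(8)
    for core, line in trace():
        cache.access((core, line))
    return cache.misses


print("private 4+4 lines:", run_private(), "misses")
print("shared 8 lines:   ", run_shared(), "misses")
```

With private caches, core A thrashes its 4 lines while core B leaves half of its own unused; the shared cache lets A use the space B does not need, so only the cold misses remain.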

20
Smart Memory Access
Source Bit Tech
21
Smart Memory Access
Execution with and without memory disambiguation
Memory Aliasing
  • Improved prefetch units
  • Memory disambiguation
  • Allows re-ordering instructions more efficiently

Execution without memory disambiguation
Example from http://arstechnica.com/articles/paedia/cpu/core.ars/8
Source Ars Technica
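
A minimal sketch of the idea behind memory disambiguation: a load may be moved ahead of earlier stores only if it does not alias any of them. Real hardware has to predict this before store addresses are known; the sketch assumes all addresses are already available, and the operation list is made up for illustration.

```python
def hoistable_loads(ops):
    """Return indices of loads that do not alias any earlier store.

    ops is a list of (kind, address) tuples in program order,
    where kind is "load" or "store".
    """
    stored = set()
    result = []
    for idx, (kind, addr) in enumerate(ops):
        if kind == "store":
            stored.add(addr)
        elif kind == "load" and addr not in stored:
            result.append(idx)   # no earlier store writes this address
    return result


ops = [
    ("store", 0x100),
    ("load", 0x200),   # independent of the store above
    ("store", 0x300),
    ("load", 0x100),   # aliases the first store, must wait
    ("load", 0x400),
]
safe = hoistable_loads(ops)
print("loads that may be reordered ahead of earlier stores:", safe)
```

Without disambiguation every load would wait behind all earlier stores; with it, only the load that truly aliases a store is held back.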
22
Advanced Digital Media Boost
Source Bit Tech
23
Advanced Digital Media Boost
  • Streaming SIMD Extension (SSE) instructions
  • SSE instructions are an extension of the standard
    x86 instruction set.
  • Utilized in multimedia encoding, decoding, image
    manipulation and encryption
  • SSE instructions are 128-bit.
  • Executed on a full 128-bit datapath, up from
    64 bits
  • Double the SSE performance over previous
    generation

24
Comparison of SSE to prior processors
Source Ars Technica
25
Intel Conroe Vs Presler
Conroe
Presler
  • What is the major difference?
  • Shared L2 versus separate caches

Source Bit Tech
26
Intel's Roadmap for Multicore
[Figure: roadmap chart for mobile, desktop, and enterprise processors, 2006 to 2008; entries give core count (SC, DC, QC, 8C) and cache size, from 512 KB up to 16 MB, with 45 nm parts appearing later in the period]
Source Adapted from Tom's Hardware
  • Drivers are
  • Market segments
  • More cache
  • More cores
  • An 80-core processor prototype has been designed!

27
Intel Chipset Example
Source Extreme Tech
28
References and Links
  • http://www.intel.com/products/processor/coreduo/
  • http://en.wikipedia.org/wiki/Intel_Core
  • http://www.hothardware.com/viewarticle.aspx?articleid=845&cid=1
  • http://www.bit-tech.net/hardware/2006/03/10/intel_core_microarchitecture/
  • http://www.bit-tech.net/hardware/2006/05/19/intel_core_duo_t2600_on_the_desktop
  • http://www.bit-tech.net/hardware/2006/07/14/intel_core_2_duo_processors/
  • http://www.hardcoreware.net/reviews/review-347-1.htm
  • http://www.trustedreviews.com/cpu-memory/review/2006/08/28/Intel-Core-2-Duo-Merom-Notebooks/p1
  • http://www.trustedreviews.com/cpu-memory/review/2006/07/14/Intel-Core-2-Duo-Conroe-E6400-E6600-E6700-X6800/p1
  • http://techreport.com/reviews/2006q2/core-duo/index.x?pg=1
  • http://arstechnica.com/articles/paedia/cpu/core.ars/1
  • http://www.anandtech.com/mobile/showdoc.aspx?i=2663&p=4
  • http://www.extremetech.com/article2/0,1697,1988794,00.asp
  • http://www.coreduoinfo.com/blog/about-intel-core-duo/
  • http://67.91.114.164/intel_c2d_info.htm
  • http://www.pcper.com/article.php?aid=272&type=expert

29
AMD MultiCore Processors
30
Dual Core AMD Opteron
Source AMD
31
AMD Multicore (Dualcore) Opteron
  • Two AMD Opteron CPU cores on a single die
  • Each has 1MB L2 cache
  • 90nm, 205 million transistors
  • Approximately same die size as 130nm single-core
    AMD Opteron processor
  • 95 watt power envelope
  • fits into 90nm power infrastructure
  • Introduced with K8 Revision E core in April 2005

Source AMD
32
Opteron Core Pipeline
Source Chip Architect
33
AMD Opteron Processor Core Architecture
Source The 3D shop
34
Dual Core AMD Opteron
  • AMD64 technology
  • Runs 32-bit applications and is 64-bit capable
  • Compatible with the x86 software infrastructure
  • Enables a single architecture across 32- and
    64-bit environments
  • Direct Connect Architecture
  • NUMA system
  • Each processor shares its memory with other
    processors in the system
  • Integrated Memory Controller on-die
  • DDR2 DRAM memory controller offers memory BW up
    to 10.7 GB/s per processor
  • HyperTransport
  • Point-to-point interconnect can be used to build
    a mesh of multiple-processor Opteron systems
  • Scalable bandwidth interconnect between
    processors, I/O subsystems, and other chipsets
  • 24.0 GB/s peak bandwidth per processor
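
The two bandwidth figures above can be reproduced with simple arithmetic. The decomposition below is an assumption that happens to match the quoted numbers: DDR2 at 667 MT/s on a 128-bit (dual-channel) interface, and three HyperTransport links at 8 GB/s each, counting both directions. The slide itself gives only the totals.

```python
def memory_bandwidth_gb_s(mega_transfers_per_s: float, bus_bits: int) -> float:
    """Peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * 1e6 * (bus_bits / 8) / 1e9


# Assumed configuration: DDR2 at 667 MT/s, 128-bit wide memory interface.
ddr2 = memory_bandwidth_gb_s(667, 128)

# Assumed configuration: 3 HyperTransport links, 8 GB/s aggregate per link.
HT_LINKS = 3
HT_GB_S_PER_LINK = 8.0
ht_total = HT_LINKS * HT_GB_S_PER_LINK

print(f"Memory controller: {ddr2:.1f} GB/s")
print(f"HyperTransport:    {ht_total:.1f} GB/s")
```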

35
Dual Core AMD Opteron
  • Not a simple aggregation of K8 cores
  • Integrated the cores for efficiency
  • Dual-core Opteron acts very much like an SMP
    system
  • Compatible with existing single-threaded,
    multi-threaded (hyperthreaded) software
  • MOESI coherency protocol (O = Owned)
  • Updates through system request interface
  • SSE3 support with 10 new instructions.
  • Quad-core upgradeability
  • Hardware assisted AMD Virtualization
  • Optimized Power Management
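
For readers who want the MOESI states spelled out, the following is a simplified textbook-style state function for a single cache line. It is a sketch of the generic protocol, not AMD's implementation; the event names are invented for the example, and bus actions such as invalidating other copies are only noted in comments.

```python
def next_state(state: str, event: str, others_have_copy: bool = False) -> str:
    """Next MOESI state of one cache line in one cache.

    States: M (Modified), O (Owned), E (Exclusive), S (Shared), I (Invalid).
    Events (names are illustrative): local_read, local_write,
    snoop_read (another cache reads), snoop_write (another cache writes).
    """
    if event == "local_read":
        if state == "I":
            # Fill the line: Shared if another cache holds it, else Exclusive.
            return "S" if others_have_copy else "E"
        return state
    if event == "local_write":
        # From S or O this also requires invalidating the other copies.
        return "M"
    if event == "snoop_read":
        # A dirty line stays dirty but becomes Owned: this cache supplies
        # the data without first writing it back to memory.
        return {"M": "O", "O": "O", "E": "S", "S": "S", "I": "I"}[state]
    if event == "snoop_write":
        return "I"
    raise ValueError(f"unknown event: {event}")


state = "I"
for event in ["local_read", "local_write", "snoop_read", "snoop_write"]:
    state = next_state(state, event)
    print(f"{event:12s} -> {state}")
```

The Owned state is what distinguishes MOESI from MESI: a modified line can be shared with another cache directly, deferring the write-back to memory.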

36
Dual Core AMD Opteron
Source Elec Design
37
AMD Opteron (SOI)
Source Chip Architect
38
AMD 64 bit Core
  • 1MB L2 Cache
  • Detailed discussion of the 64-bit core
    architecture at
  • http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

39
Multiprocessor Systems using AMD Opteron
[Diagram labels: PCI-E Bridge, I/O Hub, USB, PCI]
  • Legacy x86 Architecture
  • CPUs, Memory, I/O all share a bus
  • Major bottleneck to performance
  • Faster CPUs or more cores for performance
  • Symmetric Multiprocessing
  • AMD64 Direct Connect Architecture
  • Eliminates FSB bottleneck
  • HyperTransport Technology interconnect for high
    bandwidth and low latency
  • Each CPU has its own memory
  • Each CPU can access the main memory of another
    processor, transparent to the programmer; this
    is different from SMP

Source AMD
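
Because each CPU has its own memory but can also reach another processor's memory, the average access latency depends on how much data is local. The sketch below shows that weighted average; the 60 ns and 100 ns latencies are hypothetical placeholders, not measured Opteron values.

```python
def average_latency_ns(local_fraction: float, local_ns: float, remote_ns: float) -> float:
    """Average memory latency when a fraction of accesses hit local memory."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies: 60 ns to local memory, 100 ns via another processor.
avg = average_latency_ns(local_fraction=0.75, local_ns=60.0, remote_ns=100.0)
print(f"Average latency with 75% local accesses: {avg:.1f} ns")
```

The access is transparent to the programmer, but placing data near the core that uses it still pays off in latency.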
40
Multiprocessor Systems using AMD Opteron
Source XBitlabs
41
Cache coherency
Source Chip Architect
42
AMD Athlon 64 X2
Source AMD
43
References and Links
  • http://techreport.com/reviews/2005q2/opteron-x75/index.x?pg=1
  • http://www.tomshardware.com/2005/06/03/dual_core_stress_test/index.html
  • http://www.a1-electronics.net/AMD_Section/CPUs/2005/AMD_Athlon64x2_Apr.shtml
  • http://en.wikipedia.org/wiki/Opteron
  • http://en.wikipedia.org/wiki/Athlon_64_X2
  • http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_14309,00.html
  • http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
  • http://firingsquad.com/hardware/amd_dual-core_opteron_875/page2.asp
  • http://www.xbitlabs.com/articles/cpu/display/opteron-ws_4.html
  • http://www.extremetech.com/article2/0,1697,1675784,00.asp
  • http://www.elecdesign.com/Articles/Index.cfm?AD=1&ArticleID=11991
  • http://www.the3dshop.com/userimages/amd_systems/opteron_dualcore.htm
  • http://www.nextcomputing.com/advantages/thruadv.shtml
  • http://arstechnica.com/news.ars/post/20060817-7535.html
  • http://www.bit-tech.net/hardware/2005/05/09/amd_a64x2_4800/1.html

44
SUN UltraSPARC Multicore
45
SUN UltraSPARC T1
  • Eight cores, each 4-way threaded
  • 1.2 GHz
  • Cache
  • L1 instruction cache: 16 KB, 4-way, 32 B lines
  • L1 data cache: 8 KB, 4-way, 16 B lines
  • 3 MB internal L2 cache partitioned into four
    banks, with four memory controllers
  • Data moved between the L2 and the cores using an
    integrated crossbar switch to provide high
    throughput

Source Sun
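
The T1 figures above imply a few derived numbers worth working out: the total hardware thread count and the number of sets in each L1 cache. The helper below is a generic cache-geometry calculation; its name is illustrative.

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


CORES = 8
THREADS_PER_CORE = 4
hardware_threads = CORES * THREADS_PER_CORE

l1i_sets = cache_sets(16 * 1024, ways=4, line_bytes=32)  # 16 KB L1-I
l1d_sets = cache_sets(8 * 1024, ways=4, line_bytes=16)   # 8 KB L1-D

print("hardware threads:", hardware_threads)
print("L1-I sets:", l1i_sets)
print("L1-D sets:", l1d_sets)
```

Both L1 caches end up with the same number of sets, since the data cache has half the capacity and half the line size of the instruction cache.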
46
SUN UltraSPARC T1
Source Sun
47
SUN UltraSPARC T1 Pipeline
  • T1's integer pipeline
  • Fetch, Thread Selection, Decode, Execute, Memory
    Access, Writeback

Source Sun
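
The Thread Selection stage is what makes the core multithreaded: each cycle it picks one of the four threads to issue from. The sketch below models that stage as round-robin among ready threads, skipping stalled ones. The round-robin policy and the readiness pattern are assumptions made for illustration, not a description of Sun's exact selection logic.

```python
def select_thread(ready, last):
    """Pick the next ready thread after `last` in round-robin order.

    Returns None when every thread is stalled in this cycle.
    """
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None


def schedule(ready_by_cycle):
    """Return the thread chosen in each cycle (None means a pipeline bubble)."""
    last = -1
    issued = []
    for ready in ready_by_cycle:
        chosen = select_thread(ready, last)
        issued.append(chosen)
        if chosen is not None:
            last = chosen
    return issued


# Hypothetical readiness of 4 threads over 5 cycles (False = stalled,
# for example waiting on a cache miss).
ready_by_cycle = [
    [True, True, True, True],
    [True, True, True, True],
    [True, True, False, True],
    [True, False, False, False],
    [False, False, False, False],
]
issued = schedule(ready_by_cycle)
print("thread issued per cycle:", issued)
```

As long as at least one thread is ready, the pipeline keeps issuing; a bubble appears only when all four threads are stalled at once.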
48
SUN UltraSPARC T2 Niagara 2
Source Sun
49
SUN UltraSPARC T2
  • UltraSPARC T2 has 8 threads/core (8 SPARC cores)
  • 8-stage integer pipeline (as opposed to 6 for
    T1)
  • Twice the performance of T1 with a transactional
    workload (under the same power envelope)
  • Clock frequency increased to 1.4 GHz from 1.2 GHz
  • One PCI Express port (x8 1.0)
  • Two 10 Gigabit Ethernet ports with packet
    classification and filtering
  • L2 cache size increased to 4 MB shared (8-banks,
    16-way associative)
  • 1 floating point unit per core
  • Eight encryption engines
  • Four dual-channel FBDIMM memory controllers
  • 711 signal I/O pins, 1831 total

50
UltraSparc T2 Core Microarchitecture
Source Realworld Tech
51
UltraSparc T2 Memory System
Source Sun
52
UltraSparc T2 Core Block Diagram
  • IFU: Instruction Fetch Unit
  • 16 KB instruction cache, 32 B lines, 8-way SA
  • 64-entry fully-associative ITLB
  • EXU0/1: Integer Execution Units
  • 4 threads share each unit
  • Executes one integer instruction/cycle
  • LSU: Load/Store Unit
  • 8 KB data cache, 16 B lines, 4-way SA
  • 128-entry fully-associative DTLB
  • FGU: Floating/Graphics Unit
  • SPU: Stream Processing Unit
  • Cryptographic acceleration
  • TLU: Trap Logic Unit
  • Updates machine state, handles exceptions and
    interrupts
  • MMU: Memory Management Unit
  • Hardware tablewalk (HWTW)
  • 8 KB, 64 KB, 4 MB, 256 MB pages
Source Sun
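
Each of the four page sizes listed for the MMU corresponds to a different number of page-offset bits, which are the address bits the TLB does not need to translate. A short calculation, with an illustrative helper name:

```python
PAGE_SIZES = {
    "8KB": 8 * 1024,
    "64KB": 64 * 1024,
    "4MB": 4 * 1024 ** 2,
    "256MB": 256 * 1024 ** 2,
}


def offset_bits(page_bytes: int) -> int:
    """Number of address bits used as the offset within a page.

    Assumes page_bytes is a power of two.
    """
    return page_bytes.bit_length() - 1


offsets = {name: offset_bits(size) for name, size in PAGE_SIZES.items()}
for name, bits in offsets.items():
    print(f"{name:>6}: {bits} offset bits")
```

Larger pages let one TLB entry cover more memory, which matters when 128 DTLB entries are shared by eight threads.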
53
UltraSparc T2 Core Pipeline
  • 8 stages for integer operations
  • Fetch, Cache, Pick, Decode, Execute, Memory,
    Bypass, Writeback
  • 3-cycle load-use latency
  • Memory (translation, tag/data access)
  • Bypass (late select, formatting)
  • 12 stages for floating-point
  • Fetch, Cache, Pick, Decode, Execute, FX1, FX2,
    FX3, FX4, FX5, FB, FW
  • 6-cycle latency for dependent FP ops
  • Longer pipeline for divide/sqrt

54
References and Links
  • http://realworldtech.com/page.cfm?ArticleID=RWT090406012516&p=4
  • http://www.opensparc.net/cgi-bin/goto.php?w/pubs/preszo/06/HotChips06_09_ppt_master.pdf
  • http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf

55
The Embedded Multicores
56
Freescale MPC8572 PowerQUICC III Processor
Source Freescale
57
Freescale MPC8572 PowerQUICC III Processor
  • Dual embedded e500 cores with 36-bit physical
    addressing
  • Double-precision floating-point
  • Integrated L1/L2 cache
  • L1 cache: 32 KB data and 32 KB instruction
  • Shared L2 cache: 1 MB with ECC
  • L2 configurable as SRAM or cache; I/O
    transactions can be stashed into L2 cache regions
  • Integrated DDR memory controller with full ECC
    support
  • Integrated security engine, Pattern Matching
    Engine, Packet Deflate Engine
  • Four on-chip triple-speed Ethernet controllers

58
References and Links
  • http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf

59
Summary
  • Multicore technology spans the product spectrum
  • The downward migration of leading edge technology
    continues
  • Architectural principles are key to:
  • Developers extracting performance
  • Designers improving performance
  • Marketing understanding new markets for
    performance
  • Research spans the spectrum of software,
    security, reliability, parallelism,
    virtualization and much more!