  • Prepared by Balkir Kayaalti

  • SPARC stands for Scalable Processor ARChitecture.
  • It is an open processor architecture (i.e., member
    companies of the SPARC community can freely
    produce the processor).
  • Sun UltraSPARC (SPARC V9) is a robust RISC architecture:
  • - 64-bit integer addresses and data
  • - Superscalar implementations
  • - Extremely fast trap handling and context switching
  • This presentation looks in detail at the Sun
    Microsystems UltraSPARC III (SPARC V9) architecture.

Major Architectural units
  • The processor's microarchitecture has six major
    functional units that operate relatively
    independently:
  • Instruction issue unit (IIU)
  • Floating point unit (FPU)
  • Integer execution unit (IEU)
  • Data cache unit (DCU)
  • External memory unit (EMU)
  • System interface unit (SIU)
  • The units communicate requests and results among
    themselves through well-defined interface
    protocols, as shown in the next figure.

Communication paths between architectural units

Instruction issue unit
  • This unit feeds the execution pipelines with
    instructions.
  • It independently predicts the control flow
    through a program and fetches the predicted path
    from the memory system.
  • Fetched instructions are staged in a queue before
    being forwarded to the two execution units (integer
    and floating point).
  • This unit includes:
  • - A 32-Kbyte, four-way associative instruction cache
  • - The instruction address translation buffer
  • - A 16K-entry branch predictor

UltraSPARC-III pipeline and physical data
  • Pipeline feature: Parameter
  • Instruction issue: 4 integer, 2 floating point,
    2 graphics
  • Level-one (L1) caches:
  • - Data: 64-Kbyte, 4-way
  • - Instructions: 32-Kbyte, 4-way
  • - Prefetch: 2-Kbyte, 4-way
  • - Write: 2-Kbyte, 4-way
  • Level-two (L2) cache: unified (data and
    instructions), 4- and 8-Mbyte, 1-way, on-chip
    tags, off-chip data

Pipeline blocks
  • Stage: Function
  • A: Generate instruction fetch addresses;
    generate pre-decoded instruction bits on
  • P: Fetch first cycle of instructions from cache;
    access first cycle of branch prediction
  • F: Fetch second cycle of instructions from cache;
    access second cycle of branch prediction;
    translate virtual to physical address
  • B: Calculate branch target addresses; decode
    first cycle of instructions
  • I: Decode second cycle of instructions; enqueue
    instructions into the queue
  • J: Steer instructions to execution units
  • R: Read integer register file operands; check
    operand dependencies
  • E: Execute integer arithmetic, logical, and
    shift instructions; first cycle of data cache
    access; read, and check dependency of, the
    floating-point register file

Pipeline blocks cont.
  • Stage: Function
  • C: Access second cycle of data cache, and forward
    load data for word and doubleword loads; execute
    first cycle of floating-point instructions
  • M: Load data alignment for half-word and byte
    loads; execute second cycle of floating-point
    instructions
  • W: Write speculative integer register file;
    execute third cycle of floating-point
    instructions
  • X: Extend integer pipeline for precise
    floating-point traps; execute fourth cycle of
    floating-point instructions
  • T: Report traps
  • D: Write architectural register file

  • Instruction issue unit: stages A-J
  • Execution unit: stages R-D
  • Data cache: stages E, C, M, and W, in parallel
    with the integer execution unit stages
  • Floating point unit: a side pipeline parallel to
    stages E through D of the integer pipeline
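
The stage-to-unit assignment above can be written down as a small lookup, which makes it easy to ask which units are busy in a given stage. This is only an illustrative sketch: the stage order comes from the two stage tables, while the helper names `span` and `units_active_in` are made up for this example.

```python
# Stage letters in pipeline order, as listed in the two stage tables.
STAGES = "APFBIJRECMWXTD"


def span(first, last):
    """Return the stage letters from `first` to `last`, inclusive."""
    return STAGES[STAGES.index(first):STAGES.index(last) + 1]


# Stage ranges per unit, taken from the summary bullets.
UNIT_STAGES = {
    "instruction issue": span("A", "J"),
    "integer execution": span("R", "D"),
    "data cache": span("E", "W"),
    "floating point": span("E", "D"),
}


def units_active_in(stage):
    """List (alphabetically) the units whose stage range covers `stage`."""
    return sorted(unit for unit, stages in UNIT_STAGES.items() if stage in stages)
```

For example, `units_active_in("C")` names the data cache, floating point, and integer execution units, matching the "in parallel" wording of the bullets.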

Instruction issue unit cont.
  • To increase performance, a high level of
    instruction-level parallelism is desired.
  • UltraSPARC is a static speculation machine.
  • - Dynamic speculation machines require very
    high fetch bandwidth to fill an instruction
    window and find instruction-level parallelism.
  • - In a static speculation machine, the compiler
    can make the speculated path sequential,
    resulting in fewer requirements on the
    instruction fetch unit.

Instruction issue unit
  • Stage A: address lines enter the instruction
    cache; all fetch address generation and
    selection occurs.
  • Stages P and F: instruction cache access, branch
    prediction access, and instruction address
    translation access.
  • By the time the instructions are available from
    the cache in the B stage, we also have the
    physical address from the translator and a
    prediction for any branch that was fetched.
  • The processor uses all this information in the B
    stage to determine whether to follow a sequential
    or taken-branch path.

Branch prediction
  • The processor also determines whether the
    instruction cache access was a hit or a miss. If
    the processor predicts a taken branch in the B
    stage, it sends the branch target address back
    to the A stage to redirect the fetch stream.
  • Waiting until the B stage to redirect the fetch
    stream lets us use a large, accurate branch
    predictor.
  • The branch predictor uses a gshare algorithm with
    16K 2-bit saturating up/down counters.
  • The predictor is pipelined because of its size.
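
A gshare predictor indexes a table of 2-bit saturating counters with the branch address XORed with a global history of recent branch outcomes. The sketch below follows the textbook scheme with the 16K-entry table size from the slide. The history length (12 bits), the initial counter value, and the class and method names are assumptions for illustration, not details of the actual UltraSPARC-III design.

```python
class GsharePredictor:
    """Textbook gshare: 2-bit counters indexed by (PC XOR global history)."""

    def __init__(self, entries=16 * 1024, history_bits=12):
        # `entries` must be a power of two; `history_bits` is an assumed value.
        self.mask = entries - 1
        self.history_mask = (1 << history_bits) - 1
        self.counters = [1] * entries   # assumed start state: weakly not taken
        self.history = 0                # global branch history register

    def _index(self, pc):
        # Drop the two low bits: SPARC instructions are 4 bytes wide.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        """Predict taken when the counter is in one of the two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Saturating up/down count, then shift the outcome into the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

Because the history is part of the index, a branch needs a few updates before its history pattern settles on a stable counter; after that, a consistently taken branch is predicted taken.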

Instruction buffer (queue)
  • There are two instruction queues: the instruction
    queue and the miss queue.
  • The 20-entry instruction queue decouples the
    fetch unit from the execution units, allowing
    each to proceed at its own rate.
  • Redirecting fetch for a taken branch takes two
    cycles before the queue is filled with the right
    instructions; the instructions held in the miss
    queue can be used immediately.

Integer execute unit
  • The execution pipelines support concurrent launch
    of up to six instructions, which can consist of:
  • - two integer operations (A0/A1 pipelines)
  • - two FP operations (FP pipelines)
  • - one memory operation (load/store; MS pipeline)
  • - one special-purpose memory operation
    (prefetch cache load only)
  • - one control transfer instruction (CTI; BR pipeline)
  • However, only four instructions per cycle (IPC)
    can be sustained.
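
The per-class limits above can be read as a simple resource check on a candidate launch group. The sketch below encodes only the numbers on this slide. The class labels and the function name `can_launch` are invented for illustration, and the real grouping rules (dependencies, ordering, the sustained four-per-cycle limit) are not modeled.

```python
from collections import Counter

# Launch slots per pipeline class, as listed on the slide (labels are ours).
SLOTS = {"integer": 2, "fp": 2, "memory": 1, "prefetch_load": 1, "branch": 1}
MAX_LAUNCH = 6


def can_launch(group):
    """Return True if `group` (a list of class labels) fits the slot limits."""
    if len(group) > MAX_LAUNCH:
        return False
    counts = Counter(group)
    return all(kind in SLOTS and n <= SLOTS[kind] for kind, n in counts.items())
```

A group of two integer, two FP, one memory, and one branch instruction passes the check; three integer instructions in one group do not.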

Working and Architectural Register File (WARF)
  • Physically it is one block, but logically it can
    be seen as two separate register files (the working
    register file and the architectural register file).
  • SPARC architectures use register files with
    register windowing.
  • The 8 global registers (g0-g7) can be reached at
    any time.
  • Global register g0 always reads as 0.
  • At any time, an instruction can access the 8
    global registers and a 24-register window into the
    register file. A register window comprises the 8 in
    and 8 local registers of a particular register
    set, together with the 8 in registers of an
    adjacent register set, which are addressable from
    the current window as out registers.
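
The window overlap is easiest to see as an address mapping from a register number (0 to 31) and the current window to a flat physical index. The sketch below is a minimal model under stated assumptions: a single set of globals, 8 windows, a flat layout chosen for this example, and the adjacent register set taken to be window `cwp + 1`. The function name `physical_index` is hypothetical.

```python
NWINDOWS = 8    # register windows, per the slides
GLOBALS = 8     # g0..g7 (only one global set is modeled here)


def physical_index(cwp, r):
    """Map register number r (0..31) in window cwp to a flat index.

    Assumed layout: indices 0..7 hold the globals, followed by
    16 registers (8 in + 8 local) for each register set.
    """
    if not 0 <= r < 32:
        raise ValueError("register number must be in 0..31")
    if r < 8:                                   # g0..g7
        return r
    if r < 16:                                  # o0..o7: ins of the adjacent set
        neighbour = (cwp + 1) % NWINDOWS
        return GLOBALS + neighbour * 16 + (r - 8)
    if r < 24:                                  # l0..l7
        return GLOBALS + cwp * 16 + 8 + (r - 16)
    return GLOBALS + cwp * 16 + (r - 24)        # i0..i7
```

With this mapping the out registers of window 0 and the in registers of window 1 land on the same physical registers, which is how arguments pass between caller and callee without copying.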

Register windows
  • The WRF consists of 32 64-bit registers with
    3 write ports and 7 read ports, plus a 1984-bit
    write port (32 × 64 = 2048 bits, minus 64) that
    transports data from the architectural register
    file.
  • The ARF has 160 entries (8 register windows in total):
  • - 8 × 8 = 64 local registers
  • - 8 × 8 = 64 registers for the 16 in/out registers
    shared between adjacent windows
  • - 32 registers for 4 sets of 8 global registers
  • The WRF holds a single window, updated as
    results are computed.

  • The processor accesses the WRF in the pipeline's
    R stage and supplies integer operands to the
    execution units.
  • Most integer operations complete in one cycle,
    so the result can be written immediately at the C stage.
  • If an exceptional event occurs, the results written
    must be undone, so the original integer registers
    are restored by a broadside copy of all integer
    registers from the appropriate ARF window.
  • The architectural register file is written at the
    end of the pipeline, once all exceptions have been
    resolved.
  • The ARF fills 16 WRF entries after a window change.
  • On an exception, all 31 nonzero WRF registers
    must be updated.
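
The split between speculative and committed state can be sketched as two arrays with a commit step and a recovery step. This is a deliberately simplified model: only the 32 registers visible in the current window are kept, window changes are ignored, and the class and method names are assumptions for illustration.

```python
class IntegerRegisters:
    """Toy model of a working (speculative) and an architectural register file."""

    def __init__(self):
        self.arf = [0] * 32     # committed state for the current window
        self.wrf = [0] * 32     # speculative working copy

    def write_speculative(self, r, value):
        """Early result write into the WRF; g0 (register 0) stays zero."""
        if r != 0:
            self.wrf[r] = value

    def commit(self, r):
        """End of pipeline: the result becomes architectural state."""
        self.arf[r] = self.wrf[r]

    def recover(self):
        """Exception: copy the 31 nonzero registers back from the ARF."""
        self.wrf[1:] = self.arf[1:]
```

A result that was written speculatively but never committed disappears on `recover()`, while committed results survive.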

On-chip memory system
  • Cache diagram used in the architecture

On-chip memory system
  • Level-one (L1) caches:
  • - Data: 64-Kbyte, 4-way
  • - Instructions: 32-Kbyte, 4-way
  • - Prefetch: 2-Kbyte, 4-way
  • - Write: 2-Kbyte, 4-way
  • Level-two (L2) cache: unified (data and
    instructions), 4- and 8-Mbyte, 1-way, on-chip
    tags, off-chip data
  • Average latency = L1 hit time + L1 miss rate ×
    L1 miss time + L2 miss rate × L2 miss time
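
Reading the average-latency expression as a hit time plus two rate-weighted miss times, it can be evaluated directly. The sketch assumes both miss rates are expressed per memory access, and the numbers in the usage note are hypothetical values chosen only to show the arithmetic, not measurements of this processor.

```python
def average_latency(l1_hit_time, l1_miss_rate, l1_miss_time,
                    l2_miss_rate, l2_miss_time):
    """Average latency in cycles; miss rates are assumed to be per access."""
    return (l1_hit_time
            + l1_miss_rate * l1_miss_time
            + l2_miss_rate * l2_miss_time)
```

For instance, `average_latency(2, 0.05, 12, 0.01, 100)` adds a 2-cycle hit time, 0.05 × 12 cycles of L1 miss time, and 0.01 × 100 cycles of L2 miss time. Lowering either miss rate or either miss time reduces the average, which is what the prefetch cache and write cache aim at.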

Prefetch cache
  • Performance is greatly improved by using a
    prefetch cache in parallel with the L1 data cache.
  • By issuing up to eight in-flight prefetches to
    main memory, the prefetch cache enables a program
    to utilize 100% of the available main memory
    bandwidth without incurring a slowdown due to
    main memory latency.

Prefetch cache
  • The prefetch cache is a 2-Kbyte SRAM organized as 32
    entries of 64 bytes, using four-way
    associativity with an LRU replacement policy.
  • A multi-port SRAM design lets us achieve very
    high throughput.
  • Data can be streamed through the prefetch cache
    in a manner similar to stream buffers.
  • On every cycle, each of two independent read
    ports supplies 8 bytes of data to the pipeline
    while a third write port fills the cache with 16
    bytes.
Prefetch cache
  • Some earlier processors, such as the UltraSPARC II,
    use software prefetch instructions.
  • The UltraSPARC III adds an autonomous stride prefetch
    engine that tracks the program counters of load
    instructions and detects when a load instruction is
    striding through memory.
  • When the prefetch engine detects a striding
    load, it issues a hardware prefetch independent
    of any software prefetch.
  • This allows the prefetch cache to be effective
    even on code that does not include prefetch
    instructions.
Write cache
  • Write caching is an excellent way to reduce the
    bandwidth consumed by store traffic.
  • A write cache is used in the UltraSPARC-III to reduce
    the store traffic bandwidth to the off-chip L2 data
    cache.
  • Its size is 2 Kbytes, four-way associative.
  • Because it is the sole source of on-chip dirty
    data, the write cache easily handles both
    multiprocessor and on-chip cache consistency.
  • Error recovery also becomes easier with the
    write cache, since the write cache keeps all
    other on-chip caches clean and simply invalidates
    them when an error is detected.

Write caching
  • A byte validate policy is used on the write
    cache. Rather than reading the data from the L2
    cache for the bytes within the line that are not
    being overwritten, we just keep an individual
    valid bit for each byte. Not performing the
    read-on-allocate saves considerable L2 cache
    bandwidth by postponing a read-modify-write until
    the write cache evicts a line. Frequently, by
    eviction time the entire line has been written so
    the write cache can eliminate the read.
  • The write cache is included in the L2 data cache, and
    write-cache data can supersede read data from the
    L2 data cache. We handle this with a byte-merging
    multiplexer on the incoming L2 cache data bus
    that can choose either write-cache data or L2
    cache data for each byte.
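
The byte-validate policy and the byte-merging multiplexer can be modeled with one valid flag per byte. In the sketch below the 64-byte line size and all names are assumptions for illustration. The point is only that a store never reads the line from L2, and that a merge picks write-cache bytes where they are valid.

```python
LINE_BYTES = 64     # assumed line size for this illustration


class WriteCacheLine:
    """One write-cache line with a valid bit per byte."""

    def __init__(self):
        self.data = bytearray(LINE_BYTES)
        self.valid = [False] * LINE_BYTES

    def store(self, offset, payload):
        """Write bytes without first reading the rest of the line from L2."""
        for i, b in enumerate(payload):
            self.data[offset + i] = b
            self.valid[offset + i] = True

    def fully_written(self):
        """True when eviction needs no read-modify-write against L2."""
        return all(self.valid)

    def merge(self, l2_line):
        """Per byte, choose write-cache data where valid, else L2 data."""
        return bytes(w if v else l
                     for w, v, l in zip(self.data, self.valid, l2_line))
```

If every byte of the line has been stored to by eviction time, `fully_written()` is true and the L2 read can be skipped entirely, which is the bandwidth saving described above.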

Floating point unit
  • This unit contains the data paths and control logic
    to execute floating-point and partitioned
    fixed-point data type instructions.
  • Three data paths concurrently execute floating-point
    or graphics instructions, one each per
    cycle from the following classes:
  • - Divide/multiply (single or double precision or
    partitioned)
  • - Add/subtract/compare (single or double
    precision or partitioned)
  • - An independent division data path, which lets a
    non-pipelined divide proceed concurrently with
    the fully pipelined multiply and add paths
  • To meet the cycle time, latency cycles must be
    added to floating-point operations.
  • With advanced circuit techniques in the
    floating-point add and multiply units, one latency
    cycle is enough.

External memory interface
  • External memory consists of a large off-chip L2
    cache and an off-chip main memory built from
    synchronous DRAMs.
  • L2 cache size: 4 or 8 Mbytes
  • Latency: 12 clock cycles to supply a 32-byte line
    to the L1 cache
  • Tags for the L2 cache are placed on-chip to detect
    an L2 miss early.
  • (The L2 cache controller accesses the on-chip tags in
    parallel with the start of the off-chip SRAM
    access and provides a way-select signal to a
    late-select address pin on the off-chip SRAMs.)

  • L2 caches are Wave-pipelined and operate at
  • The main memory DRAM controller is on chip, which
    reduces memory latency and scales the memory
    bandwidth with the number of processors.
  • The memory controller supports up to 4 Gbytes of
    SDRAM memory organized as four independent banks.

Trap stage in the pipeline
  • In this architecture the classical stall signal
    (which freezes the state of the pipeline) is
    eliminated for performance reasons.
  • Instead, a trap stage is placed at the end of the
    pipeline to restore state when an unexpected
    event occurs.
  • The event is handled like a trap: the instructions that
    are in the pipeline will be refetched from Stage

  • The Sun Microsystems UltraSPARC is one of the
    advanced RISC microprocessors. It finds many
    applications in desktops, network systems, and
    scientific computing machines.
  • The internal architecture of the UltraSPARC-III
    has been presented.
  • Various parts of the processor have been examined,
    including instruction issue, execution, and on-chip
    and external memory.

