Transcript and Presenter's Notes

Title: SUN ULTRASPARC-III ARCHITECTURE


1
SUN ULTRASPARC-III ARCHITECTURE
  • CMPE 511 PRESENTATION
  • Prepared by Balkir Kayaalti

2
Introduction
  • SPARC stands for Scalable Processor ARChitecture.
  • It is an open processor architecture (i.e., member companies of
    the SPARC community can freely produce implementations of the
    processor).
  • SUN UltraSPARC (SPARC V9) is a robust RISC architecture with:
  • - 64-bit integer addresses and data
  • - Superscalar implementations
  • - Extremely fast trap handling and context switching
  • This presentation looks in detail at the Sun Microsystems
    UltraSPARC-III implementation of the SPARC V9 architecture.

3
Major Architectural units
  • The processor's micro-architecture has six major functional units
    that operate relatively independently:
  • Instruction issue unit (IIU)
  • Floating point unit (FPU)
  • Integer execution unit (IEU)
  • Data cache unit (DCU)
  • External memory unit (EMU)
  • System interface unit (SIU)
  • The units communicate requests and results among themselves
    through well-defined interface protocols, as shown in the next
    figure.

4
Communication paths between architectural units
5
Instruction issue unit
  • This unit feeds the execution pipelines with the
    instructions.
  • It independently predicts the control flow
    through a program and fetches the predicted path
    from the memory system.
  • Fetched instructions are staged in a queue before being forwarded
    to the two execution units (integer and floating point).
  • This unit includes:
  • A 32-Kbyte, four-way associative instruction cache
  • The instruction address translation buffer
  • A 16K-entry branch predictor

6
UltraSPARC-III pipeline and physical data
  • Instruction issue: 4 integer, 2 floating point, 2 graphics
  • Level-one (L1) caches:
    - Data: 64-Kbyte, 4-way
    - Instructions: 32-Kbyte, 4-way
    - Prefetch: 2-Kbyte, 4-way
    - Write: 2-Kbyte, 4-way
  • Level-two (L2) cache: unified (data and instructions),
    4- or 8-Mbyte, 1-way, on-chip tags / off-chip data

7
Pipeline
8
Pipeline blocks
  • A: Generate instruction fetch addresses; generate pre-decoded
    instruction bits on cache fill
  • P: Fetch first cycle of instructions from cache; access first
    cycle of branch prediction
  • F: Fetch second cycle of instructions from cache; access second
    cycle of branch prediction; translate virtual to physical address
  • B: Calculate branch target addresses; decode first cycle of
    instructions
  • I: Decode second cycle of instructions; enqueue instructions into
    the queue
  • J: Steer instructions to execution units
  • R: Read integer register file operands; check operand dependencies
  • E: Execute integer arithmetic, logical, and shift instructions;
    first cycle of data cache access; read, and check dependency of,
    the floating-point register file

9
Pipeline blocks (continued)
  • C: Access second cycle of data cache and forward load data for
    word and doubleword loads; execute first cycle of floating-point
    instructions
  • M: Load data alignment for half-word and byte loads; execute
    second cycle of floating-point instructions
  • W: Write speculative integer register file; execute third cycle
    of floating-point instructions
  • X: Extend integer pipeline for precise floating-point traps;
    execute fourth cycle of floating-point instructions
  • T: Report traps
  • D: Write architectural register file (an illustrative enumeration
    of the full 14-stage sequence follows below)
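
As a compact reference, the sketch below lists the 14 pipeline stages from
the two preceding slides as a C enumeration; the one-line comments
paraphrase the slide text, and the enum itself is purely illustrative, not
part of any Sun-provided interface.

    /* Illustrative enumeration of the 14 UltraSPARC-III pipeline stages,
     * in order, with descriptions paraphrased from the slides. */
    enum pipeline_stage {
        STAGE_A,   /* generate fetch addresses, pre-decode on cache fill      */
        STAGE_P,   /* I-cache access (1st cycle), branch prediction (1st)     */
        STAGE_F,   /* I-cache access (2nd cycle), branch prediction (2nd),
                      virtual-to-physical address translation                 */
        STAGE_B,   /* branch target calculation, decode (1st cycle)           */
        STAGE_I,   /* decode (2nd cycle), enqueue into instruction queue      */
        STAGE_J,   /* steer instructions to execution units                   */
        STAGE_R,   /* read integer register file, check dependencies          */
        STAGE_E,   /* integer execute, D-cache access (1st cycle), FP RF read */
        STAGE_C,   /* D-cache access (2nd cycle), FP execute (1st cycle)      */
        STAGE_M,   /* load data alignment, FP execute (2nd cycle)             */
        STAGE_W,   /* write speculative (working) registers, FP execute (3rd) */
        STAGE_X,   /* extend for precise FP traps, FP execute (4th cycle)     */
        STAGE_T,   /* report traps                                            */
        STAGE_D,   /* write architectural register file                       */
        NUM_STAGES
    };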

10
Pipeline
  • Instruction issue unit: stages A-J
  • Execution units: stages R-D
  • Data cache unit: the E, C, M, and W stages of the pipe, in
    parallel with the integer execution unit stages
  • Floating-point unit: a side pipeline in parallel with the E
    through D stages of the integer pipeline

11
Pipeline
12
Instruction issue unit cont.
  • To increase performance, a high level of instruction-level
    parallelism is desired.
  • The UltraSPARC-III is a static speculation machine.
  • - Dynamic speculation machines require very high fetch bandwidths
    to fill an instruction window and find instruction-level
    parallelism.
  • - In a static speculation machine, the compiler can make the
    speculated path sequential, which places fewer demands on the
    instruction fetch unit.

13
Instruction issue unit
Stage A: fetch addresses enter the instruction cache; all fetch
address generation and selection occurs here.
Stages P, F: instruction cache access, branch prediction, and
instruction address translation access.
14
  • By the time the instructions are available from the cache in the
    B stage, we also have the physical address from the translator
    and a prediction for any branch that was fetched.
  • The processor uses all this information in the B stage to
    determine whether to follow the sequential or the taken-branch
    path.

15
Branch prediction
  • The processor also determines whether the instruction cache
    access was a hit or a miss. If the processor predicts a taken
    branch in the B stage, it sends the target address of the branch
    back to the A stage to redirect the fetch stream.
  • Waiting until the B stage to redirect the fetch stream lets us
    use a large, accurate branch predictor.
  • The branch predictor uses a Gshare algorithm with 16K 2-bit
    saturating up/down counters.
  • Because the predictor is large, its access is pipelined. (A
    minimal Gshare sketch follows below.)
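
The slides name a Gshare predictor with 16K 2-bit saturating up/down
counters but do not give its indexing details. The minimal C sketch below
assumes the conventional scheme (branch PC XORed with a global history
register, here 14 bits to match the 16K table); the hashing and history
length are assumptions for illustration, not the documented UltraSPARC-III
design.

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal Gshare sketch: 16K 2-bit saturating counters indexed by
     * XOR of branch PC bits and a global history register. */
    #define PHT_ENTRIES (16 * 1024)

    static uint8_t  pht[PHT_ENTRIES];  /* 2-bit counters: 0..3, >= 2 means "taken" */
    static uint16_t ghr;               /* global branch history (14 bits used)     */

    static bool predict(uint64_t pc)
    {
        uint32_t idx = ((pc >> 2) ^ ghr) & (PHT_ENTRIES - 1);
        return pht[idx] >= 2;
    }

    static void update(uint64_t pc, bool taken)
    {
        uint32_t idx = ((pc >> 2) ^ ghr) & (PHT_ENTRIES - 1);
        if (taken  && pht[idx] < 3) pht[idx]++;       /* saturate up   */
        if (!taken && pht[idx] > 0) pht[idx]--;       /* saturate down */
        ghr = ((ghr << 1) | (taken ? 1 : 0)) & (PHT_ENTRIES - 1);
    }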

16
Instruction buffer (queue)
  • Two instruction queues are used: the instruction queue and the
    miss queue.
  • The 20-entry instruction queue decouples the fetch unit from the
    execution units, allowing each to proceed at its own rate.
  • If a branch is taken during the two cycles that must pass before
    the queue is filled with instructions from the correct path,
    instructions held in the miss queue can be used immediately.
    (A sketch of the decoupling queue follows below.)
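
A minimal sketch of a 20-entry decoupling queue between fetch and issue,
modeled as a plain ring buffer. Only the depth comes from the slide; the
interface and the wait-on-full/wait-on-empty behavior are assumptions for
illustration.

    #include <stdbool.h>
    #include <stdint.h>

    /* 20-entry instruction queue decoupling the fetch and issue rates. */
    #define IQ_DEPTH 20

    typedef struct {
        uint32_t instr[IQ_DEPTH];
        unsigned head, tail, count;
    } iqueue_t;

    static bool iq_push(iqueue_t *q, uint32_t instr)    /* fetch side */
    {
        if (q->count == IQ_DEPTH)
            return false;                 /* queue full: fetch must wait  */
        q->instr[q->tail] = instr;
        q->tail = (q->tail + 1) % IQ_DEPTH;
        q->count++;
        return true;
    }

    static bool iq_pop(iqueue_t *q, uint32_t *instr)    /* issue side */
    {
        if (q->count == 0)
            return false;                 /* queue empty: issue must wait */
        *instr = q->instr[q->head];
        q->head = (q->head + 1) % IQ_DEPTH;
        q->count--;
        return true;
    }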

17
Integer execute unit
  • The execution pipelines support the concurrent launch of up to
    six instructions, which can consist of:
  • - two integer operations (A0/A1 pipelines)
  • - two FP operations (FP pipelines)
  • - one memory operation (load/store, MS pipeline)
  • - one special-purpose memory operation (prefetch cache load only)
  • - one control transfer instruction (CTI, BR pipeline)
  • However, only four instructions per cycle (IPC) can be executed
    in a sustained manner. (A slot-limit sketch follows below.)
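
A small sketch of the per-cycle launch-slot limits listed above, assuming a
simple classification of instructions. The grouping rule shown (cap each
class and cap a single launch group at six) is a simplification for
illustration; the real steering logic is not described on the slide.

    #include <stdbool.h>

    /* Check a candidate launch group against the slot limits:
     * at most 2 integer (A0/A1), 2 FP, 1 load/store (MS),
     * 1 prefetch-cache load, and 1 CTI (BR). */
    typedef enum { CLS_INT, CLS_FP, CLS_MEM, CLS_PREFETCH, CLS_CTI } iclass_t;

    static bool group_fits(const iclass_t *grp, int n)
    {
        int ints = 0, fps = 0, mems = 0, pfs = 0, ctis = 0;

        if (n > 6)                 /* at most six launch concurrently; only   */
            return false;          /* four IPC can be sustained (not modeled) */

        for (int i = 0; i < n; i++) {
            switch (grp[i]) {
            case CLS_INT:      ints++; break;
            case CLS_FP:       fps++;  break;
            case CLS_MEM:      mems++; break;
            case CLS_PREFETCH: pfs++;  break;
            case CLS_CTI:      ctis++; break;
            }
        }
        return ints <= 2 && fps <= 2 && mems <= 1 && pfs <= 1 && ctis <= 1;
    }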

18
Working and Architectural Register File (WARF)
  • Physically it is one block, but logically it can be seen as two
    separate register files: the working register file (WRF) and the
    architectural register file (ARF).
  • SPARC architectures use register files with a windowing
    technique.
  • At any time, the 8 global registers g0-g7 can be accessed.
  • Global register g0 is always 0.
  • At any time, an instruction can access the 8 globals and a
    24-register window into the registers. A register window
    comprises the 8 in and 8 local registers of a particular register
    set, together with the 8 in registers of an adjacent register
    set, which are addressable from the current window as out
    registers.

19
Register windows
20
WARF
  • The WRF consists of 32 64-bit registers, each with 3 write ports
    and 7 read ports, plus a 32 x 64 = 2048 minus 64 = 1984-bit write
    port (the 31 non-zero registers) used to transport data from the
    architectural register file.
  • The ARF has 160 entries (covering a total of 8 register windows):
  • 8 x 8 = 64 registers for the local registers of the 8 windows
  • 8 x 8 = 64 registers for the 16 shared in/out registers of each
    window (shared between adjacent windows)
  • 4 x 8 = 32 registers for the 4 sets of 8 global registers
  • The WRF holds a single window and is updated as results are
    computed. (A register-window addressing sketch follows below.)
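
To make the window arithmetic concrete, the sketch below maps a register
number r0-r31 in the current window to a flat index into a 160-entry ARF
(64 locals + 64 shared in/out registers + 32 globals, as listed above). The
index layout and the direction of the "adjacent" window are assumptions for
illustration, not the documented hardware layout.

    #include <stdint.h>

    /* Map a window-relative register number to a flat ARF index. */
    #define NWINDOWS 8

    enum { ARF_LOCAL_BASE = 0, ARF_INOUT_BASE = 64, ARF_GLOBAL_BASE = 128 };

    static unsigned arf_index(unsigned reg, unsigned cwp, unsigned global_set)
    {
        if (reg < 8)    /* %g0-%g7: one of the 4 global register sets        */
            return ARF_GLOBAL_BASE + global_set * 8 + reg;
        if (reg < 16)   /* %o0-%o7: the in registers of the adjacent window  */
            return ARF_INOUT_BASE + ((cwp + 1) % NWINDOWS) * 8 + (reg - 8);
        if (reg < 24)   /* %l0-%l7: locals owned by the current window       */
            return ARF_LOCAL_BASE + cwp * 8 + (reg - 16);
        /* %i0-%i7: the in registers owned by the current window */
        return ARF_INOUT_BASE + cwp * 8 + (reg - 24);
    }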

21
  • The processor accesses the WRF in the pipeline's R stage and
    supplies integer operands to the execution units.
  • Most integer operations complete in one cycle, so the result can
    be written immediately, in the C stage.
  • If an exceptional event occurs, results already written must be
    undone: the original copies of the integer registers are restored
    by a broadside copy of all integer registers from the appropriate
    ARF window.
  • The architectural register file is written at the end of the
    pipeline, after all exceptions have been resolved.
  • The ARF fills 16 WRF entries after a window change.
  • On an exception, the 31 non-zero registers of the WRF must be
    updated.

22
On chip memory system
Cache diagram used in the architecture
23
On chip memory system
  • Level-one (L1) caches:
    - Data: 64-Kbyte, 4-way
    - Instructions: 32-Kbyte, 4-way
    - Prefetch: 2-Kbyte, 4-way
    - Write: 2-Kbyte, 4-way
  • Level-two (L2) cache: unified (data and instructions),
    4- or 8-Mbyte, 1-way, on-chip tags / off-chip data
  • Average latency = L1 hit time
      + (L1 miss rate x L1 miss time)
      + (L2 miss rate x L2 miss time)
    (A worked example follows below.)
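
A worked example of the average-latency formula above. The 12-cycle L1 miss
(L2 hit) time matches the L2 latency given later in the deck; the other
numbers are purely hypothetical placeholders, not UltraSPARC-III
measurements.

    #include <stdio.h>

    /* average latency = L1 hit time + (L1 miss rate x L1 miss time)
     *                               + (L2 miss rate x L2 miss time) */
    int main(void)
    {
        double l1_hit_time  = 2.0;    /* cycles (assumed)                        */
        double l1_miss_rate = 0.05;   /* per access (assumed)                    */
        double l1_miss_time = 12.0;   /* cycles, L2 latency (external-memory slide) */
        double l2_miss_rate = 0.01;   /* global miss rate per access (assumed)   */
        double l2_miss_time = 100.0;  /* cycles to main memory (assumed)         */

        double avg = l1_hit_time
                   + l1_miss_rate * l1_miss_time
                   + l2_miss_rate * l2_miss_time;

        printf("average latency = %.2f cycles\n", avg);  /* 2 + 0.6 + 1 = 3.60 */
        return 0;
    }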

24
Prefetch cache
  • Performance is greatly increased by using a prefetch cache in
    parallel with the L1 data cache.
  • By issuing up to eight in-flight prefetches to main memory, the
    prefetch cache enables a program to utilize 100% of the available
    main memory bandwidth without incurring a slowdown due to main
    memory latency.

25
Prefetch cache
  • The prefetch cache is a 2-Kbyte SRAM organized as 32 entries of
    64 bytes, four-way set associative with an LRU replacement
    policy. (An indexing sketch follows after this list.)
  • A multi-port SRAM design lets it achieve very high throughput.
  • Data can be streamed through the prefetch cache in a manner
    similar to stream buffers.
  • On every cycle, each of two independent read ports supplies
    8 bytes of data to the pipeline while a third write port fills
    the cache with 16 bytes.
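
A small sketch of how an address would be split up for this organization:
2 Kbytes = 32 entries of 64 bytes, four-way set associative, hence 8 sets.
The power-of-two field split is the usual convention and is an assumption,
not the documented hardware layout.

    #include <stdint.h>

    /* Address decomposition for a 32-entry, 64-byte-line, 4-way cache. */
    #define PFC_LINE_BYTES 64u
    #define PFC_ENTRIES    32u
    #define PFC_WAYS       4u
    #define PFC_SETS       (PFC_ENTRIES / PFC_WAYS)    /* = 8 */

    typedef struct { uint64_t tag; unsigned set, offset; } pfc_addr_t;

    static pfc_addr_t pfc_decompose(uint64_t paddr)
    {
        pfc_addr_t a;
        a.offset = paddr % PFC_LINE_BYTES;              /* 6 bits   */
        a.set    = (paddr / PFC_LINE_BYTES) % PFC_SETS; /* 3 bits   */
        a.tag    = paddr / (PFC_LINE_BYTES * PFC_SETS); /* the rest */
        return a;
    }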

26
Prefetch cache
  • Some earlier processors, like the UltraSPARC-II, rely on software
    prefetch instructions.
  • An autonomous stride prefetch engine tracks the program counters
    of load instructions and detects when a load instruction is
    striding through memory.
  • When the prefetch engine detects a striding load, it issues a
    hardware prefetch independent of any software prefetch.
  • This allows the prefetch cache to be effective even on code that
    does not include prefetch instructions. (A stride-detection
    sketch follows below.)
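
A minimal sketch of the stride-detection idea described above: remember each
load PC's last address and last stride, and issue a hardware prefetch when
the same non-zero stride repeats. The table size, hashing, and issue policy
are assumptions; the slide only states that striding loads are detected.

    #include <stdint.h>

    /* PC-indexed table of recently seen loads and their strides. */
    #define STRIDE_ENTRIES 64

    typedef struct { uint64_t pc, last_addr; int64_t stride; } stride_entry_t;
    static stride_entry_t stride_table[STRIDE_ENTRIES];

    /* Called for every executed load; returns a prefetch address, or 0 if
     * no hardware prefetch should be issued this time. */
    static uint64_t observe_load(uint64_t pc, uint64_t addr)
    {
        stride_entry_t *e = &stride_table[(pc >> 2) % STRIDE_ENTRIES];
        uint64_t prefetch = 0;

        if (e->pc == pc) {
            int64_t s = (int64_t)(addr - e->last_addr);
            if (s != 0 && s == e->stride)
                prefetch = addr + (uint64_t)s;  /* stride confirmed: prefetch next */
            e->stride = s;
        } else {
            e->pc = pc;                         /* new load PC: start tracking it  */
            e->stride = 0;
        }
        e->last_addr = addr;
        return prefetch;
    }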

27
Write cache
  • Write caching is an excellent way to reduce the bandwidth
    consumed by store traffic.
  • A write cache is used in the UltraSPARC-III to reduce the store
    traffic bandwidth to the off-chip L2 data cache.
  • It is 2 Kbytes in size and 4-way set associative.
  • As the sole source of on-chip dirty data, the write cache easily
    handles both multiprocessor and on-chip cache consistency.
  • Error recovery also becomes easier with the write cache, since it
    keeps all other on-chip caches clean; they can simply be
    invalidated when an error is detected.

28
Write caching
  • A byte-validate policy is used in the write cache. Rather than
    reading the data from the L2 cache for the bytes within the line
    that are not being overwritten, we just keep an individual valid
    bit for each byte. Not performing the read-on-allocate saves
    considerable L2 cache bandwidth by postponing a read-modify-write
    until the write cache evicts a line. Frequently, by eviction time
    the entire line has been written, so the write cache can
    eliminate the read.
  • The write cache is included in the L2 data cache, and write-cache
    data can supersede read data from the L2 data cache. This is
    handled by a byte-merging multiplexer on the incoming L2 cache
    data bus that can choose either write-cache data or L2 cache data
    for each byte. (A byte-merge sketch follows below.)
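
A small sketch of the byte-merge step described above: for each byte of a
line, take the write-cache byte if its per-byte valid bit is set, otherwise
the byte returned by the L2 cache. A 64-byte line is assumed to match the
other on-chip caches; the real merge is a hardware multiplexer, not
software.

    #include <stdint.h>

    /* Write-cache line with one valid bit per byte. */
    #define LINE_BYTES 64

    typedef struct {
        uint8_t  data[LINE_BYTES];
        uint64_t byte_valid;          /* bit i set => byte i is dirty/valid */
    } wcache_line_t;

    /* Byte-merging: write-cache data supersedes L2 data where valid. */
    static void merge_line(const wcache_line_t *wc, const uint8_t *l2_data,
                           uint8_t *out)
    {
        for (int i = 0; i < LINE_BYTES; i++) {
            int valid = (int)((wc->byte_valid >> i) & 1);
            out[i] = valid ? wc->data[i] : l2_data[i];
        }
    }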

29
Floating point unit
  • This unit contains the data paths and control logic to execute
    floating-point and partitioned fixed-point (graphics)
    instructions.
  • Three data paths concurrently execute floating-point or graphics
    instructions, one per cycle from each of the following classes:
  • - Divide/multiply (single or double precision, or partitioned)
  • - Add/subtract/compare (single or double precision, or
    partitioned)
  • - An independent division data path, which lets a non-pipelined
    divide proceed concurrently with the fully pipelined multiply and
    add paths.
  • In order to meet the cycle time, latency cycles must be added to
    the floating-point operations.
  • By using advanced circuit techniques in the floating-point add
    and multiply units, a single added latency cycle is enough.

30
External memory interface
  • External memory consists of a large L2 cache built off-chip and a
    main memory, also off-chip, built from synchronous DRAMs.
  • L2 cache size: 4 or 8 Mbytes.
  • Latency: 12 clock cycles to supply a 32-byte line to L1.
  • The tags for the L2 cache are placed on-chip so that an L2 miss
    can be detected early.
  • (The L2 cache controller accesses the on-chip tags in parallel
    with the start of the off-chip SRAM access and provides a
    way-select signal to a late-select address pin on the off-chip
    SRAMs.) A sketch of early miss detection follows below.
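
A sketch of the early-miss-detection idea: because the tags are on-chip, the
tag compare can run in parallel with the start of the off-chip SRAM data
access, and a main-memory request can start as soon as a miss is known. The
direct-mapped index/tag arithmetic, the 64-byte line size, and the valid-bit
encoding are assumptions for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    /* On-chip tag array for an 8-Mbyte, direct-mapped (1-way) off-chip L2. */
    #define L2_BYTES      (8ull * 1024 * 1024)
    #define L2_LINE_BYTES 64ull
    #define L2_LINES      (L2_BYTES / L2_LINE_BYTES)

    static uint64_t l2_tags[L2_LINES];          /* valid bit (bit 63) | tag */

    static bool l2_tag_hit(uint64_t paddr)
    {
        uint64_t line = paddr / L2_LINE_BYTES;
        uint64_t idx  = line % L2_LINES;        /* direct-mapped index */
        uint64_t tag  = line / L2_LINES;
        return l2_tags[idx] == ((1ull << 63) | tag);
    }

    /* Usage idea (function names are placeholders, not hardware interfaces):
     *   start_off_chip_sram_access(paddr);      // data access starts in parallel
     *   if (!l2_tag_hit(paddr))
     *       start_main_memory_request(paddr);   // miss is known early, on-chip
     */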

31
  • The L2 cache is wave-pipelined and operates at 600 MHz.
  • The main memory DRAM controller is on-chip, which reduces memory
    latency and scales memory bandwidth with the number of
    processors.
  • The memory controller supports up to 4 Gbytes of SDRAM memory
    organized as four independent banks.

32
Trap stage in the pipeline
  • In this architecture, the classical stall signal (which freezes
    the state of the pipeline) is eliminated for performance reasons.
  • Instead, a trap stage is placed at the end of the pipeline to
    restore the state when an unexpected event occurs.
  • The event is handled like a trap: the instructions that are in
    the pipeline are refetched, starting from stage A.

33
Conclusion
  • The Sun Microsystems UltraSPARC is one of the advanced RISC
    microprocessors. It finds many applications in desktops, network
    systems, and scientific computing machines.
  • The internal architecture of the UltraSPARC-III has been
    presented.
  • Various parts of the processor were examined: instruction issue,
    execution, and the on-chip and external memory systems.

34
References
  • 1) "UltraSPARC-III: Designing Third-Generation 64-Bit
    Performance", IEEE Micro, June 1999.
  • 2) "Design Decisions Influencing the UltraSPARC's Instruction
    Fetch Architecture", 29th Annual IEEE/ACM International Symposium
    on Microarchitecture, pp. 178-190, 1996, Paris.
  • 3) UltraSPARC III (SPARC V9) Manual, Sun Microsystems.

35
  • THANK YOU