Pipelines - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Pipelines


1
Pipelines
  • The fetch-execute cycle requires a number of
    micro-operations to complete, but this is a fixed
    sequence of events where one thing cannot be done
    until another is completed
  • this means that for a large part of the
    fetch-execute cycle the various components of the
    CPU are idle e.g.
  • the ALU only does some useful work at the end of
    the execution phase, but is idle while the
    instructions, operand location information and
    the operand values themselves are being fetched
  • the instruction decode mechanism only does actual
    work at the time the instruction is first fetched,
    but after that it is idle

2
  • Analogy - imagine passing a piece of paper down a
    line of people where each person has to look at
    the paper and do something with it before passing
    it on to the next person, but only one piece of
    paper is allowed in the line at one time - so the
    first person in the line cannot start a new job
    until the last person has finished - a lot of
    wasted time - this is like the normal
    fetch-execute cycle
  • compare that with a line of people where, after
    the first person has processed the piece of paper,
    they start on the next piece of paper and so on,
    so eventually everyone in the line is busy
    processing some job - this is like the pipeline
    approach

3
  • Example - simplistic 3-stage pipeline to
    illustrate the idea
  • 3 instructions in code
  • load 100 - load into the accumulator the value at
    address 100
  • add 200 - add to the accumulator the value at
    address 200
  • store 104 - store the value in the accumulator
    into the location at address 104
  • without a pipeline each will execute in turn with
    no overlap
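  • a minimal sketch (not from the slides, Python used
    purely for illustration) of how the three
    instructions above overlap in a simple 3-stage
    fetch/decode/execute pipeline - each column is one
    clock cycle

    # 3-stage pipeline sketch: a new instruction can enter the pipeline every
    # cycle, so 3 instructions finish in 5 cycles instead of 9 without overlap.
    instructions = ["load 100", "add 200", "store 104"]
    stages = ["fetch  ", "decode ", "execute"]

    cycles = len(instructions) + len(stages) - 1   # 5 cycles with overlap
    for s, stage in enumerate(stages):
        row = []
        for cycle in range(cycles):
            i = cycle - s                          # instruction occupying this stage
            row.append(instructions[i] if 0 <= i < len(instructions) else "-")
        print(stage, " | ".join(f"{slot:9}" for slot in row))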

4
  • Example instruction execution without a pipeline

5
  • Illustration of overlap in instruction execution
    with the simplistic 3-stage pipeline - it is more
    efficient

6
  • So there is the possibility of increasing the
    work done by overlapping the various
    micro-operations required in the fetch-execute
    sequence for one instruction with the
    fetch-execute micro-operations of another
    instruction
  • there are problems with the approach which lead
    to some waste, because in a CPU we generally do
    not know for certain which instruction will need
    to be executed next after a given instruction
    until we have completed that instruction

7
  • so for the pipeline idea to work it is necessary
    for the CPU to predict which instruction will be
    executed next (done by a predictor) in order to
    start processing it before the earlier
    instructions have completed. If it turns out that
    the execution of a particular instruction is not
    going to be needed, it simply means that the work
    done processing that instruction has to be
    abandoned - however, the predictor gets it right
    much of the time - so overall there is still a
    great increase in efficiency
  • the prediction and the duplication of operations
    will require the duplication of some of the
    components on the CPU, but space on the CPU is
    comparatively cheap, whereas the performance
    benefits are great
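  • a minimal, hypothetical sketch of one common way
    such a predictor can work (a 2-bit saturating
    counter per branch) - this is a general technique,
    not the specific mechanism of any one CPU

    # 2-bit saturating-counter branch predictor sketch.
    # Counter 0-1 -> predict "not taken"; counter 2-3 -> predict "taken".
    # Each actual outcome nudges the counter, so one surprise does not flip
    # the prediction; mispredicted work is simply abandoned, as described above.
    class TwoBitPredictor:
        def __init__(self):
            self.counters = {}                    # one small counter per branch

        def predict(self, branch_addr):
            return self.counters.get(branch_addr, 1) >= 2   # True = taken

        def update(self, branch_addr, taken):
            c = self.counters.get(branch_addr, 1)
            self.counters[branch_addr] = min(3, c + 1) if taken else max(0, c - 1)

    p = TwoBitPredictor()
    hits = 0
    for taken in [True, True, True, False, True, True]:  # loop branch, mostly taken
        if p.predict(0x100) == taken:
            hits += 1                             # speculative work is kept
        p.update(0x100, taken)                    # wrong guesses are discarded
    print(hits, "of 6 predictions correct")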

8
Lecture 9 Modern CPU development (and VDUs)
  • In this lecture we will cover
  • Some aspects of development in modern CPUs
  • clock speeds
  • cache
  • bus connections
  • Schematic structure of the Pentium / Pentium III/IV
  • VDUs - totally unrelated - new topic - I just
    have to do it now

9
Chip fabrication
  • chip fabrication technology has improved over the
    years so that the number of transistors that can
    be packed onto one chip has increased
  • packing density = number of transistors per unit
    area
  • more transistors on a chip means
  • the more functions a given chip can perform
  • the faster it can operate - shorter distances
    between transistors, and greater switching speed
    of transistors

10
The Processor
  • Manufactured by Intel originally and then by
    newer competitors e.g. AMD
  • in classic machines
  • 8088/8086 - 29,000 transistors
  • 80286 - 134,000 transistors
  • 80386 - 275,000 transistors
  • 80486 - 1.2 million transistors
  • in more recent machines
  • Pentium - 3.1 million transistors
  • Pentium Pro, K6 - 5.5 million transistors
  • Pentium II and III - 8.5 million transistors
  • Pentium IV - 42 million transistors

11
Clock Speeds
  • measured in MHz, where 1 MHz (megahertz) = 1
    million pulses per second
  • Original PC XT ran at 4.77MHz
  • 80286 systems ran at 10-25MHz
  • 80386 ran at 20-40MHz
  • 80486 25-50MHz and 50-100MHz using clock
    multiplying
  • Pentium 60-200 MHz, K6 150-350MHz
  • Pentium II - 200-450MHz
  • Pentium III 400-850MHz
  • Pentium IV - up to 3.8 GHz (1 gigahertz =
    1000MHz)

12
Clock multiplying
  • clock multiplying - clock pulses internal to CPU
    are driven by a frequency multiplier - runs at
    some multiple of speed of bus clock (clock on
    motherboard)

  • Example: a 133.3 MHz bus clock with a x15
    multiplier gives a 2 GHz (2000 MHz) internal clock
    on a 2 GHz Pentium IV
13
Cache development
  • 80486 had 8K of level 1 cache
  • Pentium originally had 8K of cache for data and
    8K for instructions
  • MMX Pentium has 16K each for data and
    instructions
  • Pentium Pro / Pentium II had 512K of level 2
    cache integrated into the processor module
  • Level 2 caches were from 128K to 512K - but the
    Pentium III can have up to 2MB of L2 cache

14
BUSES
  • Communication pathway for data, address and
    control signals
  • External (I/O) buses - communicate to I/O ports
  • System (local) buses - communication local to
    processor between processor and memory and fast
    devices e.g. graphical display
  • a perennial problem with buses is that the speed
    of operation, i.e. the data transfer rate, is not
    fast enough to supply the processor with data at
    the rate the processor can request it

15
Classic bus structure
  • Traditional arrangement of buses - like the
    simple CPU we looked at
  • implies buses can work at a speed close to the
    speed at which the processor requests data

16
Bus standards
  • History of changing bus standards has been to try
    and increase speed of buses
  • I/O Buses
  • original IBM PC bus standard: 4.77MHz, 8 bit
  • Old bus standard: ISA - Industry Standard
    Architecture, 8MHz, 16 bit
  • PCI - Peripheral Component Interconnect
  • old standard - 32 bit data transfer, 33MHz - for
    a maximum of 132 MBytes/sec in burst mode
  • current PCI standard - 64 bit transfer, 66MHz -
    for 528 MBytes/sec in burst mode

17
Schematic diagram of some of the buses in a Pentium
III PC
18
  • Backside bus - connection between CPU and level 2
    cache - runs at close to speed of processor
    (typically half the speed in Pentium III, now
    integrated onto CPU chip at chip speed on Pentium
    IV)
  • Frontside bus - connection between CPU and
    motherboard chips (Pentium III - 133MHz, but
    Pentium IV up to 800MHz)

19
Address buses

  Chip      Address Bus   Addressable Memory   Data Bus
  8088      20 bits       1 MB                 8 bits
  8086      20 bits       1 MB                 16 bits
  80286     24 bits       16 MB                16 bits
  80386dx   32 bits       4 GB                 32 bits
  80486dx   32 bits       4 GB                 32 bits
  Pentium   32 bits       4 GB                 64/32 bits
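  • as a quick check on the table, addressable memory
    is 2 to the power of the address-bus width, in
    bytes (a small Python illustration, not from the
    slides)

    # addressable memory = 2 ** (address-bus width) bytes
    for chip, bits in [("8086", 20), ("80286", 24), ("80386dx", 32)]:
        size = 2 ** bits
        print(f"{chip:8} {bits}-bit address bus -> {size:,} bytes")
    # 8086     20-bit address bus -> 1,048,576 bytes      (1 MB)
    # 80286    24-bit address bus -> 16,777,216 bytes     (16 MB)
    # 80386dx  32-bit address bus -> 4,294,967,296 bytes  (4 GB)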

20
Classic Pentium Architecture
21
  • Bus Interface Unit - manages data transfers
    onto/from the bus that connects processor to rest
    of computer
  • Data Cache and Instruction Cache - level 1 cache
    - small copies of data and code from main memory
    (8K each)
  • Pre-fetch unit - pre-fetches instructions into
    Decode unit for decoding prior to their being
    needed - part of pipeline - can pre-fetch an
    instruction while an instruction is being
    executed
  • Decode unit - decodes instruction into relevant
    sets of micro-operations to be carried out

22
  • if the instruction is a branch then it is
    important that the Branch Predictor makes a
    prediction as to which instruction is most likely
    to be executed next after the branch
  • Registers - and 2 ALUs - execute instructions - 2
    ALUs so that 2 instructions can be executed at
    the same time
  • Note: the Data Cache is connected to the execution
    unit and the Instruction Cache is connected to the
    Pre-fetch and Decode units
  • Floating Point Unit - used for executing
    calculations involving floating point numbers
    (ALUs only do integer arithmetic) - note internal
    bus from data cache and decode unit

23
(No Transcript)
24
  • Changes to Pentium architecture from original
    Pentium to Pentium III
  • 1. 16K Data and Instruction caches rather than 8K
  • 2. MMX instructions added and linked to the
    Floating Point Unit - MMX instructions are SIMD
    (Single Instruction, Multiple Data) instructions
    where the same instruction is applied to more than
    one item of data at the same time e.g. adding the
    same number to 8 bytes is done by putting all 8
    bytes into a 64-bit register and adding the number
    to each byte in the register as if it were a
    separate value
  • such repetitive actions on large amounts of data
    occur often in multimedia applications - but
    actually anything that uses large arrays of data
    can benefit
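  • a minimal sketch of the SIMD idea described above
    (plain Python modelling the effect only - a real
    MMX instruction does all 8 byte lanes in a single
    hardware step)

    # Pack 8 one-byte values into one 64-bit word, then add the same number to
    # every byte lane independently (wrap-around within each byte lane).
    def pack_bytes(values):
        word = 0
        for i, v in enumerate(values):
            word |= (v & 0xFF) << (8 * i)         # lane 0 is the lowest byte
        return word

    def simd_add_byte(word, n):
        result = 0
        for i in range(8):
            lane = (word >> (8 * i)) & 0xFF       # extract one byte lane
            result |= ((lane + n) & 0xFF) << (8 * i)   # add, stay within the lane
        return result

    pixels = [10, 20, 30, 40, 50, 60, 70, 80]
    brighter = simd_add_byte(pack_bytes(pixels), 5)
    print([(brighter >> (8 * i)) & 0xFF for i in range(8)])  # [15, 25, ..., 85]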

25
  • 3. micro-ops are placed in the Instruction Pool in
    the order that the Branch Target Buffer predicts
    they are going to be needed; the execution unit,
    ALUs, etc. execute each micro-op as it becomes
    ready to be executed, e.g. when any operands that
    it needs are ready
  • the instruction pool is really a pipeline - but
    with the set of micro-ops available out of normal
    order, the execution unit can usually find
    something useful to do - this minimises time spent
    simply waiting (a waste of time)
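  • a minimal, hypothetical sketch of the instruction
    pool idea - micro-ops wait in a pool and are
    issued as soon as their operands are ready, rather
    than strictly in program order (names are
    illustrative, not the Pentium's real micro-ops)

    # Issue each micro-op once everything it needs has been produced.
    pool = [
        {"op": "load r1, [100]",  "needs": [],           "makes": "r1"},
        {"op": "load r2, [200]",  "needs": [],           "makes": "r2"},
        {"op": "add r3, r1, r2",  "needs": ["r1", "r2"], "makes": "r3"},
        {"op": "store [104], r3", "needs": ["r3"],       "makes": None},
    ]
    ready = set()                                  # operands produced so far
    while pool:
        issuable = [u for u in pool if all(r in ready for r in u["needs"])]
        if not issuable:
            break                                  # nothing ready: a real CPU stalls
        uop = issuable[0]                          # any ready micro-op will do
        print("issue:", uop["op"])
        if uop["makes"]:
            ready.add(uop["makes"])
        pool.remove(uop)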

26
  • Most significant changes in Pentium IV over
    Pentium III - PIV has
  • 1. integrated the level 2 cache onto the actual
    processor chip - no need for a processor module -
    so it has what is called a flat form connection to
    the motherboard
  • 2. doubled size of pipeline over Pentium III - so
    it can process more instructions at the same time
  • 3. increased the number of SIMD instructions
  • 4. Frontside bus speed increased to up to 800MHz
    from 133MHz

27
Pixels and colours
  • Digital images consist of a large number of
    discrete (i.e. separate) very small areas, each of
    which has a specific colour - each is called a
    pixel (Picture Element)
  • Various different colours can be represented as a
    mixture of 3 primary light colours - Red, Green
    and Blue (RGB) - what colour you get depends on
    the amounts of Red, Green and Blue mixed together

28
  • so on a VDU screen a pixel has 3 small areas of a
    phosphor coating located next to each other, one
    of which glows Red, one Green and one Blue when
    hit by electrons - they only glow for a very
    short time - the mixing of the glowing colours
    gives the impression of a single colour

29
one pixel
30
(No Transcript)
31
  • The colour to be displayed is represented by 3
    numbers, each number giving the intensity with
    which an electron gun has to fire at each of the
    RGB phosphors - the greater the intensity (number
    of electrons hitting the phosphor) the more
    intensely it glows
  • True colour has 24 bits - 1 byte (8 bits) each for
    Red, Green and Blue - approx 16 million colours
  • VGA has 8 bits that act as an index into a table
    whose entries give the intensities for each of
    RGB - 256 colours from a pre-defined palette
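  • a minimal sketch of 24-bit true colour as three
    packed bytes (the R-high, B-low ordering here is
    just a convention chosen for this illustration)

    # One byte per channel gives 2**24 = 16,777,216 possible colours.
    def pack_rgb(r, g, b):
        return (r << 16) | (g << 8) | b            # 0..255 per channel

    def unpack_rgb(colour):
        return (colour >> 16) & 0xFF, (colour >> 8) & 0xFF, colour & 0xFF

    orange = pack_rgb(255, 165, 0)
    print(hex(orange), unpack_rgb(orange))         # 0xffa500 (255, 165, 0)
    print(2 ** 24)                                 # 16777216 colours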

32
Terminology
  • resolution = total number of pixels on screen =
    horizontal pixels x vertical pixels - various
    standards - VGA (lowest standard now) 640x480,
    SVGA 800x600, 1024x768
  • dot pitch = distance between pixels - typically
    0.28mm
  • dot frequency = number of pixels output to screen
    per second = size of display x refresh rate
  • refresh rate = frequency of repainting the screen,
    e.g. 60Hz
  • aspect ratio = 4:3 = ratio of width to height
  • size of screen = diagonal of display
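  • a quick worked check of the dot frequency formula
    above, using assumed SVGA figures (800x600 at a
    60Hz refresh rate)

    # dot frequency = pixels per screen x refresh rate
    width, height, refresh_hz = 800, 600, 60       # assumed figures
    dot_frequency = width * height * refresh_hz
    print(f"{dot_frequency:,} pixels per second")  # 28,800,000 (28.8 MHz)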

33
Raster scanning and interlacing
  • The electron beam scans left to right, then the
    beam is turned off while it is re-focused to the
    left hand side but one line lower - this is called
    raster scanning
  • A VDU has a maximum rate at which it can scan a
    line - the greater the number of vertical lines
    the longer it takes to scan one screen
  • the phosphor glow diminishes quite quickly, so at
    very high resolutions the glow might fade too
    much between scans

34
  • so interlacing paints every other line on each
    pass over the whole screen - in order to support
    higher resolutions than otherwise possible
  • however, it might result in flickering as the
    phosphor glow is not kept sufficiently constant
    for the human eye

35
Memory requirements
  • Consider VGA

640 x 480 = 307,200 pixels, each pixel needing 8
bits; 8 bits = 256 colours
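  • a minimal calculation of the VGA frame-buffer size
    implied above (8 bits, i.e. one byte, per pixel)

    # VGA: 640 x 480 pixels, 8 bits per pixel, 2**8 = 256 colours
    pixels = 640 * 480                     # 307,200 pixels
    bytes_needed = pixels * 8 // 8         # 307,200 bytes at 1 byte per pixel
    print(bytes_needed, bytes_needed / 1024, "KB")   # 307200 -> 300.0 KB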