Pipelines - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Pipelines


1
Pipelines
  • The fetch-execute cycle requires a number of
    micro-operations to complete, but this is a fixed
    sequence of events where one thing cannot be done
    until another is completed
  • this means that for a large part of the
    fetch-execute cycle the various components of the
    CPU are idle e.g.
  • the ALU only does some useful work at the end of
    the execution phase, but is idle while the
    instructions, operand location information and
    the operand values themselves are being fetched
  • the instruction decode mechanism only does actual
    work at the time the instruction is first fetched,
    but after that it is idle

2
  • Analogy - imagine passing a piece of paper down a
    line of people where each person has to look at
    the paper and do something with it before passing
    it on to the next person, but only one piece of
    paper is allowed in the line at one time - so the
    first person in the line cannot start a new job
    until the last person has finished - a lot of
    wasted time - this is like the normal
    fetch-execute cycle
  • compare that with a line of people where, after
    the first person has processed the piece of paper,
    they start on the next piece of paper and so on,
    so eventually everyone in the line is busy
    processing some job - this is like the pipeline
    approach

3
  • Example - simplistic 3-stage pipeline to
    illustrate the idea
  • 3 instructions in code
  • load 100 - load into the accumulator the value at
    address 100
  • add 200 - add to the accumulator the value at
    address 200
  • store 104 - store the value in the accumulator
    into the location at address 104
  • without a pipeline each will execute in turn with
    no overlap
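  • a minimal sketch (not from the slides, Python used
    purely for illustration) of how the three
    instructions above overlap in a simple 3-stage
    fetch/decode/execute pipeline - each column is one
    clock cycle

    # 3-stage pipeline sketch: a new instruction can enter the pipeline every
    # cycle, so 3 instructions finish in 5 cycles instead of 9 without overlap.
    instructions = ["load 100", "add 200", "store 104"]
    stages = ["fetch  ", "decode ", "execute"]

    cycles = len(instructions) + len(stages) - 1   # 5 cycles with overlap
    for s, stage in enumerate(stages):
        row = []
        for cycle in range(cycles):
            i = cycle - s                          # instruction occupying this stage
            row.append(instructions[i] if 0 <= i < len(instructions) else "-")
        print(stage, " | ".join(f"{slot:9}" for slot in row))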

4
  • Example instruction execution without a pipeline

5
  • Illustration of overlap in instruction execution
    with the simplistic 3-stage pipeline - it is more
    efficient

6
  • So there is the possibility of increasing the
    work done by overlapping the various
    micro-operations required in the fetch-execute
    sequence for one instruction with the
    fetch-execute micro-operations of another
    instruction
  • there are problems with the approach which lead
    to some waste, because in a CPU we generally do
    not know for certain which instruction will need
    to be executed next after a given instruction
    until we have completed that instruction

7
  • so for the pipeline idea to work it is necessary
    for the CPU to predict which instruction will be
    executed next (done by a predictor) in order to
    start processing it before the earlier
    instructions have completed. If it turns out that
    the execution of a particular instruction is not
    going to be needed, it simply means that the work
    done processing that instruction has to be
    abandoned - however, the predictor gets it right
    much of the time - so overall there is still a
    great increase in efficiency
  • the prediction and the duplication of operations
    will require the duplication of some of the
    components on the CPU, but space on the CPU is
    comparatively cheap, whereas the performance
    benefits are great
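  • a minimal, hypothetical sketch of one common way
    such a predictor can work (a 2-bit saturating
    counter per branch) - this is a general technique,
    not the specific mechanism of any one CPU

    # 2-bit saturating-counter branch predictor sketch.
    # Counter 0-1 -> predict "not taken"; counter 2-3 -> predict "taken".
    # Each actual outcome nudges the counter, so one surprise does not flip
    # the prediction; mispredicted work is simply abandoned, as described above.
    class TwoBitPredictor:
        def __init__(self):
            self.counters = {}                    # one small counter per branch

        def predict(self, branch_addr):
            return self.counters.get(branch_addr, 1) >= 2   # True = taken

        def update(self, branch_addr, taken):
            c = self.counters.get(branch_addr, 1)
            self.counters[branch_addr] = min(3, c + 1) if taken else max(0, c - 1)

    p = TwoBitPredictor()
    hits = 0
    for taken in [True, True, True, False, True, True]:  # loop branch, mostly taken
        if p.predict(0x100) == taken:
            hits += 1                             # speculative work is kept
        p.update(0x100, taken)                    # wrong guesses are discarded
    print(hits, "of 6 predictions correct")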

8
Lecture 9 Modern CPU development (and VDUs)
  • In this lecture we will cover
  • Some aspects of development in modern CPUs
  • clock speeds
  • cache
  • bus connections
  • Schematic structure of the Pentium / Pentium III/IV
  • VDUs - totally unrelated - new topic - I just
    have to do it now

9
Chip fabrication
  • chip fabrication technology has improved over the
    years so that the number of transistors that can
    be packed onto one chip has increased
  • packing density = number of transistors per unit
    area
  • more transistors on a chip means
  • the more functions a given chip can perform
  • the faster it can operate - shorter distances
    between transistors, and greater switching speed
    of transistors

10
The Processor
  • Manufactured by Intel originally and then by
    newer competitors e.g. AMD
  • in classic machines
  • 8088/8086 - 29,000 transistors
  • 80286 - 134,000 transistors
  • 80386 - 275,000 transistors
  • 80486 - 1.2 million transistors
  • in more recent machines
  • Pentium - 3.1 million transistors
  • Pentium Pro, K6 - 5.5 million transistors
  • Pentium II and III - 8.5 million transistors
  • Pentium IV - 42 million transistors

11
Clock Speeds
  • measured in MHz, where 1 MHz (megahertz) = 1
    million pulses per second
  • Original PC XT ran at 4.77MHz
  • 80286 systems ran at 10-25MHz
  • 80386 ran at 20-40MHz
  • 80486 25-50MHz and 50-100MHz using clock
    multiplying
  • Pentium 60-200 MHz, K6 150-350MHz
  • Pentium II - 200-450MHz
  • Pentium III 400-850MHz
  • Pentium IV - up to 3.8 GHz (1 gigahertz =
    1000MHz)

12
Clock multiplying
  • clock multiplying - clock pulses internal to CPU
    are driven by a frequency multiplier - runs at
    some multiple of speed of bus clock (clock on
    motherboard)

  • Example: a 133.3 MHz bus clock with a x15
    multiplier gives a 2 GHz (2000 MHz) internal clock
    on a 2 GHz Pentium IV
13
Cache development
  • 80486 had 8K of level 1 cache
  • Pentium originally had 8K of cache for data and
    8K for instructions
  • MMX Pentium has 16K each for data and
    instructions
  • Pentium Pro / Pentium II had 512K of level 2
    cache integrated into the processor module
  • Level 2 caches were from 128K to 512K - but the
    Pentium III can have up to 2MB of L2 cache

14
BUSES
  • Communication pathway for data, address and
    control signals
  • External (I/O) buses - communicate to I/O ports
  • System (local) buses - communication local to
    processor between processor and memory and fast
    devices e.g. graphical display
  • a perennial problem with buses is that the speed
    of operation, i.e. the data transfer rate, is not
    fast enough to supply the processor with data at
    the rate the processor can request it

15
Classic bus structure
  • Traditional arrangement of buses - like the
    simple CPU we looked at
  • implies buses can work at a speed close to the
    speed at which the processor requests data

16
Bus standards
  • History of changing bus standards has been to try
    and increase speed of buses
  • I/O Buses
  • original IBM PC bus standard: 4.77MHz, 8 bit
  • Old bus standard: ISA - Industry Standard
    Architecture, 8MHz, 16 bit
  • PCI - Peripheral Component Interconnect
  • old standard - 32 bit data transfer, 33MHz - for
    a maximum of 132 MBytes/sec in burst mode
  • current PCI standard - 64 bit transfer, 66MHz -
    for 528 MBytes/sec in burst mode

17
Schematic diagram of some of the buses in a Pentium
III PC
18
  • Backside bus - connection between CPU and level 2
    cache - runs at close to speed of processor
    (typically half the speed in Pentium III, now
    integrated onto CPU chip at chip speed on Pentium
    IV)
  • Frontside bus - connection between CPU and
    motherboard chips (Pentium III - 133MHz, but
    Pentium IV up to 800MHz)

19
Address buses

  Chip      Address Bus   Addressable Memory   Data Bus
  8088      20 bits       1 MB                 8 bits
  8086      20 bits       1 MB                 16 bits
  80286     24 bits       16 MB                16 bits
  80386dx   32 bits       4 GB                 32 bits
  80486dx   32 bits       4 GB                 32 bits
  Pentium   32 bits       4 GB                 64/32 bits
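  • as a quick check on the table, addressable memory
    is 2 to the power of the address-bus width, in
    bytes (a small Python illustration, not from the
    slides)

    # addressable memory = 2 ** (address-bus width) bytes
    for chip, bits in [("8086", 20), ("80286", 24), ("80386dx", 32)]:
        size = 2 ** bits
        print(f"{chip:8} {bits}-bit address bus -> {size:,} bytes")
    # 8086     20-bit address bus -> 1,048,576 bytes      (1 MB)
    # 80286    24-bit address bus -> 16,777,216 bytes     (16 MB)
    # 80386dx  32-bit address bus -> 4,294,967,296 bytes  (4 GB)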

20
Classic Pentium Architecture
21
  • Bus Interface Unit - manages data transfers
    onto/from the bus that connects processor to rest
    of computer
  • Data Cache and Instruction Cache - level 1 cache
    - small copies of data and code from main memory
    (8K each)
  • Pre-fetch unit - pre-fetches instructions into
    Decode unit for decoding prior to their being
    needed - part of pipeline - can pre-fetch an
    instruction while an instruction is being
    executed
  • Decode unit - decodes instruction into relevant
    sets of micro-operations to be carried out

22
  • if the instruction is a branch then it is
    important that the Branch Predictor makes a
    prediction as to which instruction is most likely
    to be executed next after the branch
  • Registers - and 2 ALUs - execute instructions - 2
    ALUs so that 2 instructions can be executed at
    the same time
  • Note: the Data Cache is connected to the execution
    unit and the Instruction Cache is connected to the
    Pre-fetch and Decode units
  • Floating Point Unit - used for executing
    calculations involving floating point numbers
    (ALUs only do integer arithmetic) - note internal
    bus from data cache and decode unit

23
(No Transcript)
24
  • Changes to Pentium architecture from original
    Pentium to Pentium III
  • 1. 16K Data and Instruction caches rather than 8K
  • 2. MMX instructions added and linked to the
    Floating Point Unit - MMX instructions are SIMD
    (Single Instruction, Multiple Data) instructions
    where the same instruction is applied to more than
    one item of data at the same time e.g. adding the
    same number to 8 bytes is done by putting all 8
    bytes into a 64-bit register and adding the number
    to each byte in the register as if it were a
    separate value
  • such repetitive actions on large amounts of data
    occur often in multimedia applications - but
    actually anything that uses large arrays of data
    can benefit
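  • a minimal sketch of the SIMD idea described above
    (plain Python modelling the effect only - a real
    MMX instruction does all 8 byte lanes in a single
    hardware step)

    # Pack 8 one-byte values into one 64-bit word, then add the same number to
    # every byte lane independently (wrap-around within each byte lane).
    def pack_bytes(values):
        word = 0
        for i, v in enumerate(values):
            word |= (v & 0xFF) << (8 * i)         # lane 0 is the lowest byte
        return word

    def simd_add_byte(word, n):
        result = 0
        for i in range(8):
            lane = (word >> (8 * i)) & 0xFF       # extract one byte lane
            result |= ((lane + n) & 0xFF) << (8 * i)   # add, stay within the lane
        return result

    pixels = [10, 20, 30, 40, 50, 60, 70, 80]
    brighter = simd_add_byte(pack_bytes(pixels), 5)
    print([(brighter >> (8 * i)) & 0xFF for i in range(8)])  # [15, 25, ..., 85]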

25
  • 3. micro-ops are placed in the Instruction Pool in
    the order that the Branch Target Buffer predicts
    they are going to be needed; the execution unit,
    ALUs, etc. execute each micro-op as it becomes
    ready to be executed, e.g. when any operands that
    it needs are ready
  • the instruction pool is really a pipeline - but
    with the set of micro-ops available out of normal
    order, the execution unit can usually find
    something useful to do - this minimises time spent
    simply waiting (a waste of time)
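  • a minimal, hypothetical sketch of the instruction
    pool idea - micro-ops wait in a pool and are
    issued as soon as their operands are ready, rather
    than strictly in program order (names are
    illustrative, not the Pentium's real micro-ops)

    # Issue each micro-op once everything it needs has been produced.
    pool = [
        {"op": "load r1, [100]",  "needs": [],           "makes": "r1"},
        {"op": "load r2, [200]",  "needs": [],           "makes": "r2"},
        {"op": "add r3, r1, r2",  "needs": ["r1", "r2"], "makes": "r3"},
        {"op": "store [104], r3", "needs": ["r3"],       "makes": None},
    ]
    ready = set()                                  # operands produced so far
    while pool:
        issuable = [u for u in pool if all(r in ready for r in u["needs"])]
        if not issuable:
            break                                  # nothing ready: a real CPU stalls
        uop = issuable[0]                          # any ready micro-op will do
        print("issue:", uop["op"])
        if uop["makes"]:
            ready.add(uop["makes"])
        pool.remove(uop)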

26
  • Most significant changes in Pentium IV over
    Pentium III - PIV has
  • 1. integrated the level 2 cache onto the actual
    processor chip - no need for a processor module -
    so it has what is called a flat form connection to
    the motherboard
  • 2. doubled size of pipeline over Pentium III - so
    it can process more instructions at the same time
  • 3. increased the number of SIMD instructions
  • 4. Frontside bus speed increased to up to 800MHz
    from 133MHz

27
Pixels and colours
  • Digital images consist of a large number of
    discrete (i.e. separate) very small areas, each of
    which has a specific colour - each is called a
    pixel (Picture Element)
  • Various different colours can be represented as a
    mixture of 3 primary light colours - Red, Green
    and Blue (RGB) - what colour you get depends on
    the amounts of Red, Green and Blue mixed together

28
  • so on a VDU screen a pixel has 3 small areas of a
    phosphor coating located next to each other, one
    of which glows Red, one Green and one Blue when
    hit by electrons - they only glow for a very
    short time - the mixing of the glowing colours
    gives the impression of a single colour

29
one pixel
30
(No Transcript)
31
  • The colour to be displayed is represented by 3
    numbers, each number giving the intensity with
    which an electron gun has to fire at each of the
    RGB phosphors - the greater the intensity (number
    of electrons hitting the phosphor) the more
    intensely it glows
  • True colour has 24 bits - 1 byte (8 bits) each for
    Red, Green and Blue - approx 16 million colours
  • VGA has 8 bits that act as an index into a table
    whose entries give the intensities for each of
    RGB - 256 colours from a pre-defined palette
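  • a minimal sketch of 24-bit true colour as three
    packed bytes (the R-high, B-low ordering here is
    just a convention chosen for this illustration)

    # One byte per channel gives 2**24 = 16,777,216 possible colours.
    def pack_rgb(r, g, b):
        return (r << 16) | (g << 8) | b            # 0..255 per channel

    def unpack_rgb(colour):
        return (colour >> 16) & 0xFF, (colour >> 8) & 0xFF, colour & 0xFF

    orange = pack_rgb(255, 165, 0)
    print(hex(orange), unpack_rgb(orange))         # 0xffa500 (255, 165, 0)
    print(2 ** 24)                                 # 16777216 colours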

32
Terminology
  • resolution = total number of pixels on screen =
    horizontal pixels x vertical pixels - various
    standards - VGA (lowest standard now) 640x480,
    SVGA 800x600, 1024x768
  • dot pitch = distance between pixels - typically
    0.28mm
  • dot frequency = number of pixels output to screen
    per second = size of display x refresh rate
  • refresh rate = frequency of repainting the screen,
    e.g. 60Hz
  • aspect ratio = 4:3 = ratio of width to height
  • size of screen = diagonal of display
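  • a quick worked check of the dot frequency formula
    above, using assumed SVGA figures (800x600 at a
    60Hz refresh rate)

    # dot frequency = pixels per screen x refresh rate
    width, height, refresh_hz = 800, 600, 60       # assumed figures
    dot_frequency = width * height * refresh_hz
    print(f"{dot_frequency:,} pixels per second")  # 28,800,000 (28.8 MHz)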

33
Raster scanning and interlacing
  • The electron beam scans left to right, then the
    beam is turned off while it is re-focused to the
    left hand side but one line lower - this is called
    raster scanning
  • A VDU has a maximum rate at which it can scan a
    line - the greater the number of vertical lines
    the longer it takes to scan one screen
  • the phosphor glow diminishes quite quickly, so at
    very high resolutions the glow might fade too
    much between scans

34
  • so interlacing paints every other line on each
    pass over the whole screen - in order to support
    higher resolutions than otherwise possible
  • however, it might result in flickering as the
    phosphor glow is not kept sufficiently constant
    for the human eye

35
Memory requirements
  • Consider VGA

640 x 480 = 307,200 pixels, each pixel needing 8
bits; 8 bits = 256 colours
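  • a minimal calculation of the VGA frame-buffer size
    implied above (8 bits, i.e. one byte, per pixel)

    # VGA: 640 x 480 pixels, 8 bits per pixel, 2**8 = 256 colours
    pixels = 640 * 480                     # 307,200 pixels
    bytes_needed = pixels * 8 // 8         # 307,200 bytes at 1 byte per pixel
    print(bytes_needed, bytes_needed / 1024, "KB")   # 307200 -> 300.0 KB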