Scalable Numerical Algorithms and Methods on the ASCI Machines PowerPoint PPT Presentation

1
CS61V
Pipelining and Multi-threading
2
Searching for Parallelism
  • Goal of the computer architect: identify
    potential opportunities for parallelism at every
    possible level and exploit them, e.g.
  • Bit level
  • Instruction level
  • Processor level

3
  • More parallelism within each CPU
  • Pipelined CPUs (increase instruction throughput)
  • Superscalar CPUs (multiple functional units, so
    multiple instructions can issue per clock)
  • Superpipelined CPUs (deeper pipelines with
    shorter stages and a faster clock)
  • Multi-threaded CPUs that run multiple instruction
    streams (so when one stream stalls on memory or
    I/O, another stream can make progress)
  • More CPUs (100 to 10,000)
  • Hardware support for shared memory, and for
    locks on memory.
  • Hardware support for memory consistency (because
    a remote write can change local memory at any
    time)
  • Hardware support for data movement between
    memories

4
Pipelining
5
Instruction-Level Pipelining
  • Some CPUs divide the fetch-decode-execute cycle
    into smaller steps.
  • These smaller steps can often be executed in
    parallel to increase throughput.
  • Such parallel execution is called
    instruction-level pipelining.
  • This term is sometimes abbreviated to ILP in the
    literature.

6
Pipelining Example
  • Let's say that we have decided to go into the
    increasingly lucrative SUV manufacturing
    business. After some intense research, we
    determine that there are five stages in the SUV
    building process, as follows:
  • Stage 1: build the chassis.
  • Stage 2: drop the engine in the chassis.
  • Stage 3: put doors, a hood, and coverings on the
    chassis.
  • Stage 4: attach the wheels.
  • Stage 5: paint the SUV.

7
Pipelining
  • There are five skilled crews ready to work, one
    on each stage in the manufacturing process.
  • Our big strategy is to have the factory run as
    follows:
  • Line up all five crews in a row, and we have the
    first crew start an SUV at Stage 1.
  • After Stage 1 is complete, the SUV moves down the
    line to the next stage and the next crew drops
    the engine in.
  • While the Stage 2 Crew is installing the engine
    in the chassis that the Stage 1 Crew just built,
    the Stage 1 Crew (along with all of the rest of
    the crews) is free to go play football, watch the
    big-screen plasma TV in the break room, surf the
    'net, etc.
  • Once the Stage 2 Crew is done, the SUV moves down
    to Stage 3 and the Stage 3 Crew takes over while
    the Stage 2 Crew hits the break room to party
    with everyone else.

8
Pipelining
  • The SUV moves on down the line through all five
    stages this way, with only one crew working on
    one stage at any given time while the rest of the
    crews are idle.
  • Once the completed SUV finishes Stage 5, the crew
    at Stage 1 then starts on another SUV.
  • At this rate, it takes exactly five hours to
    finish a single SUV, and our factory puts out one
    SUV every five hours (assuming 1 hr per stage)

SUV at Stage 2
9
Pipelining
  • How can we improve the production?
  • Add a second production line using 5 additional
    skilled crews.
  • This increases throughput to two SUVs every 5
    hours.
  • Problems ?
  • Requires a lot more money to pay for extra crews
  • Double the inefficiency with twice the number of
    crews in the break room at one time.

10
Pipelining
  • Finally a smart consultant hits upon a clever
    idea to improve productivity:
  • Why let workers spend four-fifths of their day
    in the break room when they could be doing
    useful work during that time?
  • The revised workflow is now as follows:
  • The Stage 1 crew builds a chassis. Once the
    chassis is complete, they send it on to the Stage
    2 crew.
  • The Stage 2 crew receives the chassis and begins
    dropping the engine in, while the Stage 1 crew
    starts on a new chassis.
  • When both Stage 1 and Stage 2 crews are finished,
    the Stage 2 crew's work advances to Stage 3, the
    Stage 1 crew's work advances to Stage 2, and the
    Stage 1 crew starts on a new chassis.

11
Pipelining
  • As the assembly line begins to fill up with SUVs
    in various stages of production, more of the
    crews are put to work simultaneously until all of
    the crews are working on a different vehicle in a
    different stage of production.
  • If we can keep the assembly line full, and keep
    all five crews working at once, then we can
    produce one SUV every hour: a five-fold
    improvement over the previous completion rate of
    one SUV every five hours.
  • That, in a nutshell, is pipelining.
  • While the total amount of time that each
    individual SUV spends in production has not
    changed from the original 5 hours, the rate at
    which the factory as a whole completes SUVs has
    increased drastically.
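The factory arithmetic above can be sketched in a few lines of Python (a rough illustration; the function name and the one-hour-per-stage assumption are ours, not the slides'):

```python
def completion_times(n_items, n_stages, stage_time=1.0, pipelined=True):
    """Hour at which each item leaves the final stage."""
    if pipelined:
        # Item i enters the line at hour i and exits n_stages hours later.
        return [(i + n_stages) * stage_time for i in range(n_items)]
    # Non-pipelined: each item occupies the whole line for n_stages hours.
    return [(i + 1) * n_stages * stage_time for i in range(n_items)]

# Ten SUVs through the five-stage line, one hour per stage:
print(completion_times(10, 5, pipelined=False)[-1])  # 50.0 hours total
print(completion_times(10, 5, pipelined=True)[-1])   # 14.0 hours total
```

Each SUV still spends five hours in production (latency is unchanged), but once the line is full one SUV rolls off per hour.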

12
Pipelining
All stages in the pipeline are working
simultaneously
13
Instruction Scheduling
  • In the von Neumann model of execution an
    instruction starts only after its predecessor
    completes.
  • This is not a very efficient model of execution,
    due to the von Neumann bottleneck, also known as
    the memory wall.

14
Instruction Pipelines
  • Almost all processors today use instruction
    pipelines to allow overlap of instructions
    (the Pentium 4 has a 20-stage pipeline!).
  • The execution of an instruction is divided into
    stages; each stage is performed by a separate
    part of the processor.
  • Each of these stages completes its operation in
    one cycle (shorter than the cycle in the von
    Neumann model).
  • An instruction still takes the same total time to
    execute.

Stage legend: F = fetch instruction from cache or
memory; D = decode instruction; E = execute (ALU
operation or address calculation); M = memory
access; W = write result back into a register.
15
4 Stage Pipeline
16
Single Cycle Pipeline
White space = hardware sitting idle
2 instructions processed after 9 ns
17
4-stage pipeline
5 instructions processed after 9 ns
18
Pipelining
  • The latency of the slowest stage determines the
    clock period for every stage in the pipeline.
  • If one stage takes considerably longer than the
    others, then many cycles are wasted as the other
    functional units remain idle.
  • The shorter the pipeline stage, the faster the
    clock can run. Hence deeper pipelines allow a
    higher overall clock frequency.
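As a sketch of the slowest-stage rule (the helper below is our own, not from the slides), the cost of an unbalanced stage can be computed directly:

```python
def pipeline_timing(stage_latencies_ns, n_instructions):
    """Return (clock period, total time in ns) for a simple in-order
    pipeline.  The clock period is set by the slowest stage; every
    other stage idles for the remainder of each cycle."""
    period = max(stage_latencies_ns)
    n_stages = len(stage_latencies_ns)
    # First instruction takes n_stages cycles; each later one adds 1.
    total = (n_stages + n_instructions - 1) * period
    return period, total

# A balanced 4-stage pipeline, 1 ns per stage:
print(pipeline_timing([1, 1, 1, 1], 100))  # (1, 103)
# The same pipeline with one slow 4 ns stage:
print(pipeline_timing([1, 4, 1, 1], 100))  # (4, 412)
```

One 4 ns stage quadruples the clock period, and hence the total time, even though the other three stages are unchanged.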

19
10 stage pipeline
  • Stage 1: build the chassis.
  • Crew 1a: fit the parts of the chassis together
    and spot-weld the joins.
  • Crew 1b: fully weld all the parts of the chassis.
  • Stage 2: drop the engine in the chassis.
  • Crew 2a: place the engine in the chassis and
    mount it in place.
  • Crew 2b: connect the engine to the moving parts
    of the car.
  • Stage 3: put doors, a hood, and coverings on the
    chassis.
  • Crew 3a: put the doors and hood on the chassis.
  • Crew 3b: put the other coverings on the chassis.
  • Stage 4: attach the wheels.
  • Crew 4a: attach the two front wheels.
  • Crew 4b: attach the two rear wheels.
  • Stage 5: paint the SUV.
  • Crew 5a: paint the sides of the SUV.
  • Crew 5b: paint the top of the SUV.

20
Pipelining
21
Pipelining
Deeper Pipeline
22
Pipeline Fills
  • A pipeline will only work at peak efficiency when
    all stages are filled. The initial filling of a
    pipeline can hurt performance in the early
    stages of a program's execution.
  • Frequent pipeline flushes will have a negative
    impact on performance.

Pipeline being filled
23
Pipeline Bubbles
  • In reality pipelining isn't totally free.
  • Sometimes instructions get hung up in one
    pipeline stage for multiple cycles.
  • When this happens the pipeline is said to have
    stalled.
  • When an instruction stalls, it backs up all the
    instructions behind it in the pipeline.
  • When it eventually exits the stalled stage, the
    gap (called a bubble) created by the stall
    remains in the pipeline until the instruction
    finishes executing.
  • Pipeline bubbles reduce the processor's overall
    instruction throughput.

24
Pipeline Bubbles
Two instructions behind schedule
25
Pipeline Bubbles
  • Many of the architectural features of modern
    processors are designed to avoid pipeline stalls
    due to:
  • Resource conflicts: two instructions requiring
    the same resource at the same time.
  • Data dependencies.
  • Conditional branching: the branch target address
    is unknown.
  • They include:
  • OOE: out-of-order execution
  • Branch prediction
  • Speculative execution

26
Pentium Pipelines
  • Pentium (P5): 5 stages
  • Pentium Pro, II, III (P6): 10 stages (1-cycle
    execute)
  • Pentium 4 (NetBurst): 20 stages (not counting
    decode)
  • From "Pentium 4 (Partially) Previewed,"
    Microprocessor Report, 8/28/00

27
Superscalar Architectures
28
Superscalar Computing
  • Almost all modern processors are superscalar,
    i.e. they allow more than one instruction to be
    completed per clock cycle.
  • Superscalar computing is achieved by having
    multiple functional units.
  • With the increase of transistors per die, more
    functional units can be included, e.g. two ALUs
    working in parallel as in the Pentium processor.
  • Hence more than one scalar (integer) operation
    can be performed per clock cycle, which is where
    the term superscalar comes from.

29
Superscalar Computing
Let's assume we add two additional crews to Stage
2, each building a different engine.
30
Superscalar Computing
To illustrate both pipelining and superscalar
execution in action, consider the following
sequence of three SUV orders sent out to the empty
factory floor, right when the shop opens up:
1. Extinction Turbo
2. Extinction Turbo
3. Extinction LE
Now let's follow these three cars through the
assembly line during the first four hours of the
day.
31
Superscalar Computing
Hour 1: The line is empty when the first Turbo
enters it and the Stage 1 Crew kicks into action.
32
Superscalar Computing
Hour 2: The first Turbo moves on to Stage 2a,
while the second Turbo enters the line.
33
Superscalar Computing
Hour 3: Both of the Turbos are in the line being
worked on when the LE enters the line.
34
Superscalar Computing
Hour 4: Now all three cars are in the assembly
line at different stages. Notice that there are
actually three cars in various versions and
stages of "Stage 2," all at the same time.
35
Superscalar Computing
Single stage ALU
Multi-stage FPU unit
36
Superscalar Computing
Dual execution units per stage
37
Pipeline Hazards
  • How can we guarantee no dependencies between
    instructions in a pipeline (and reduce pipeline
    stalls or bubbles)?
  • One way is to interleave the execution of
    instructions from different program threads on
    the same pipeline. This is called multithreading.

38
Threads
39
What Is a Thread ?
  • A thread
  • Is an independent flow of control
  • Operates within a process with other threads

[Diagram: a mono-threaded process (a single Thread
1) vs. a multi-threaded process (Threads 1-4)]
40
What Is A Thread ?
A thread is sometimes called a lightweight process.
  • Threads vs. processes:
  • Threads exist within a process and use its
    resources.
  • Each thread maintains its own stack and
    registers, scheduling properties, and set of
    pending and blocked signals.
  • Secondary threads vs. the initial thread:
  • An initial thread is created automatically when a
    process is created.
  • Secondary threads are peers of one another.
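A small Python sketch of these ideas (the worker function is our own illustration): the interpreter's main thread plays the role of the initial thread, and the secondary threads it spawns all share the process's address space.

```python
import threading

results = []  # shared: every thread sees the same list in one address space

def worker(i):
    # Each thread is an independent flow of control within the process.
    results.append(i)

# Secondary threads, created from the initial (main) thread:
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all secondary threads to finish
print(sorted(results))  # [0, 1, 2, 3]
```

No data is copied between threads: all four workers appended to the very same list object, which is what makes inter-thread communication cheap.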

41
Multithreaded Programming
  • To realize potential program performance gains:
  • On a uniprocessor, multi-threaded processes
    provide for concurrent execution.
  • On a multiprocessor system, a process with
    multiple threads provides potential parallelism.
  • Benefits of multithreaded programming:
  • Compared to the cost of creating and managing a
    process, a thread can be created and managed with
    much less operating system overhead.
  • All threads within a process share the same
    address space, so inter-thread communication is
    more efficient than inter-process communication.

42
Threads
  • Can be context-switched more easily
  • Only registers and the PC need to be swapped,
    not memory-management state
  • Can run concurrently on different processors in
    an SMP (symmetric multiprocessor) system
  • Share the CPU on a uniprocessor
  • May (will) require concurrency-control
    programming such as mutex locks.
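A short sketch of the mutex-lock point (the names are our own): without the lock, the read-modify-write of the shared counter could interleave across threads and lose updates.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with counter_lock:  # mutex: only one thread in this section at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: no updates lost
```

The `with counter_lock:` block acquires the mutex on entry and releases it on exit, making the increment atomic with respect to the other threads.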

43
Single-Threaded Processing
Single thread of execution per task; other tasks
wait
4 processes
Ineffective instruction scheduling
Pipeline bubble
44
Context Switching
  • Threads belonging to each process are given a
    fixed time-slice to execute.
  • When a thread's time-slice is up, its context is
    saved to memory.
  • When the thread or process gains a new time-slice
    its context is reloaded and it can continue
    execution from the exact point it was at when it
    was flushed from the CPU.
  • This is called context switching.
  • Context switching for a process is more expensive
    than for a lightweight thread.
  • So to improve performance, cut down on context
    switches, or at least constrain them to
    lightweight threads.

45
SMP
  • A solution to this problem is Symmetric
    Multi-Processing (SMP), i.e. two processors
    attached to a global shared memory.
  • Two processes can then be executing at the same
    time on two different processors.
  • Problem:
  • Twice as much execution capacity, but equally
    twice as many empty issue and execution slots.

46
Empty issue slots
Twice the number of pipeline bubbles
47
SuperThreading
  • A technique employed in high-performance
    architectures to reduce the amount of wasted
    resources is time-slice multithreading, or
    superthreading.
  • Processors that exploit this technique are known
    as multi-threaded processors.
  • Multithreaded processors can execute more than
    one thread at a time.

48
Only the instructions belonging to one thread can
be in a given pipeline stage at one time.
Fewer wasted slots
Fewer pipeline bubbles (due to memory latency or
data dependences)
Still a waste of execution slots
49
  • An improvement on superthreading is to remove
    the restriction that only one thread can have
    access to a pipeline stage during a clock cycle.
  • This is called Simultaneous Multithreading (SMT),
    or HyperThreading.

50
Fewer execution slots remain empty
Mixed thread instructions per stage
51
Compare with
52
  • A hyperthreaded processor is similar to a
    single-chip SMP system.
  • Instead of physical dual processing units, a
    hyperthreaded processor has access to dual
    logical processing units.
  • Threads are scheduled to execute on either of the
    logical processors.
  • The main advantages of hyperthreading are:
  • Increased flexibility to fill execution slots
  • The cost of adding hyperthreading logic to the
    die is small, e.g. about 5% of the die surface
    for the Intel Xeon processor.
  • Fewer cache-coherency problems than SMP, but an
    increased chance of cache conflicts.