Title: Scalable Numerical Algorithms and Methods on the ASCI Machines
CS61V
Pipelining and Multi-threading
Searching for Parallelism
- The goal of the computer architect is to identify potential opportunities for parallelism at every possible level and exploit them, e.g.:
  - Bit level
  - Instruction level
  - Processor level
- More parallelism within each CPU:
  - Pipelined CPUs (increased instruction throughput)
  - Superscalar CPUs (multiple functional units, multiple instruction issues per clock)
  - Superpipelined CPUs (deeper pipelines with shorter clock cycles)
  - Multi-threaded CPUs that run multiple instruction streams (so when one stream stalls on memory or I/O, another stream can make progress)
- More CPUs (100 to 10,000):
  - Hardware support for shared memory, and for locks on memory
  - Hardware support for memory consistency (because a remote write can change local memory at any time)
  - Hardware support for data movement between memories
Pipelining
Instruction-Level Pipelining
- Some CPUs divide the fetch-decode-execute cycle into smaller steps.
- These smaller steps can often be executed in parallel to increase throughput.
- Such parallel execution is called instruction-level pipelining.
- This term is sometimes abbreviated to ILP in the literature.
Pipelining Example
- Let's say that we have decided to go into the increasingly lucrative SUV manufacturing business. After some intense research, we determine that there are five stages in the SUV building process, as follows:
  - Stage 1: build the chassis.
  - Stage 2: drop the engine into the chassis.
  - Stage 3: put doors, a hood, and coverings on the chassis.
  - Stage 4: attach the wheels.
  - Stage 5: paint the SUV.
Pipelining
- There are five skilled crews, one ready to work on each stage of the manufacturing process.
- Our big strategy is to have the factory run as follows:
  - Line up all five crews in a row, and have the first crew start an SUV at Stage 1.
  - After Stage 1 is complete, the SUV moves down the line to the next stage, and the next crew drops the engine in.
  - While the Stage 2 crew is installing the engine in the chassis that the Stage 1 crew just built, the Stage 1 crew (along with all of the rest of the crews) is free to go play football, watch the big-screen plasma TV in the break room, surf the net, etc.
  - Once the Stage 2 crew is done, the SUV moves down to Stage 3 and the Stage 3 crew takes over, while the Stage 2 crew hits the break room to party with everyone else.
Pipelining
- The SUV moves on down the line through all five stages this way, with only one crew working on one stage at any given time while the rest of the crews are idle.
- Once the completed SUV finishes Stage 5, the crew at Stage 1 starts on another SUV.
- At this rate, it takes exactly five hours to finish a single SUV, and our factory puts out one SUV every five hours (assuming 1 hour per stage).
[Figure: SUV at Stage 2]
Pipelining
- How can we improve production?
- Add a second production line using five additional skilled crews.
  - This increases throughput to two SUVs every 5 hours.
- Problems?
  - Requires a lot more money to pay for the extra crews.
  - Double the inefficiency, with twice the number of crews in the break room at one time.
Pipelining
- Finally, a smart consultant hits upon a clever idea to improve productivity:
- Why let workers spend four-fifths of their day in the break room when they could be doing useful work during that time?
- The revised workflow is now as follows:
  - The Stage 1 crew builds a chassis. Once the chassis is complete, they send it on to the Stage 2 crew.
  - The Stage 2 crew receives the chassis and begins dropping the engine in, while the Stage 1 crew starts on a new chassis.
  - When both the Stage 1 and Stage 2 crews are finished, the Stage 2 crew's work advances to Stage 3, the Stage 1 crew's work advances to Stage 2, and the Stage 1 crew starts on a new chassis.
Pipelining
- As the assembly line begins to fill up with SUVs in various stages of production, more of the crews are put to work simultaneously, until all of the crews are working on different vehicles in different stages of production.
- If we can keep the assembly line full, and keep all five crews working at once, then we can produce one SUV every hour: a five-fold improvement over the previous completion rate of one SUV every five hours.
- That, in a nutshell, is pipelining.
- While the total amount of time each individual SUV spends in production has not changed from the original 5 hours, the rate at which the factory as a whole completes SUVs has increased drastically.
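The arithmetic behind the five-fold claim can be sketched in a few lines of Python (an illustrative sketch; the five one-hour stages are the example's assumption, not real manufacturing figures):

```python
STAGES = 5       # pipeline depth from the SUV example
STAGE_TIME = 1   # hours per stage (the example's assumption)

def sequential_time(n_suvs):
    """One SUV occupies the whole line at a time: 5 hours apiece."""
    return n_suvs * STAGES * STAGE_TIME

def pipelined_time(n_suvs):
    """The first SUV takes 5 hours to fill the line; after that,
    one finished SUV rolls out every hour."""
    return (STAGES + (n_suvs - 1)) * STAGE_TIME

print(sequential_time(10), "hours sequentially")  # 50 hours
print(pipelined_time(10), "hours pipelined")      # 14 hours
```

Note that the latency of one SUV is unchanged (`pipelined_time(1)` is still 5 hours); only the throughput improves.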
Pipelining
[Figure: all stages in the pipeline working simultaneously]
Instruction Scheduling
- In the von Neumann model of execution, an instruction starts only after its predecessor completes.
- This is not a very efficient model of execution, due to the von Neumann bottleneck (the memory wall).
Instruction Pipelines
- Almost all processors today use instruction pipelines to allow overlap of instructions (the Pentium 4 has a 20-stage pipeline!).
- The execution of an instruction is divided into stages; each stage is performed by a separate part of the processor.
- Each of these stages completes its operation in one cycle (shorter than the cycle in the von Neumann model).
- An instruction still takes the same time to execute.

Pipeline stages:
- F: fetch instruction from cache or memory.
- D: decode instruction.
- E: execute (ALU operation or address calculation).
- M: memory access.
- W: write back result into register.
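To see how the F/D/E/M/W stages overlap, the sketch below prints one row per instruction, assuming each instruction enters the pipeline one cycle after its predecessor and no stalls occur:

```python
STAGES = "FDEMW"  # fetch, decode, execute, memory access, write-back

def pipeline_schedule(n_instructions):
    """One row per instruction; column i is clock cycle i.
    Instruction k enters the pipeline at cycle k and advances
    one stage per cycle (no stalls assumed)."""
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for k in range(n_instructions):
        row = ["."] * total_cycles
        for s, stage_name in enumerate(STAGES):
            row[k + s] = stage_name
        rows.append("".join(row))
    return rows

for row in pipeline_schedule(3):
    print(row)
# FDEMW..
# .FDEMW.
# ..FDEMW
```

Reading down any column shows up to five different instructions in five different stages during the same clock cycle.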
4-Stage Pipeline
Single-Cycle Pipeline
[Figure: white space is hardware sitting idle; 2 instructions processed after 9 ns]
4-Stage Pipeline
[Figure: 5 instructions processed after 9 ns]
Pipelining
- The slowest stage determines the length of every stage in the pipeline.
- If one stage takes considerably longer than the others, then many cycles are wasted as the other functional units remain idle.
- The less work per pipeline stage, the faster the clock can run for each stage. Hence deeper pipelines allow a higher overall clock frequency.
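A quick numeric sketch of the "slowest stage sets the clock" rule (the stage latencies below are made up for illustration, not taken from any real CPU):

```python
# Hypothetical stage latencies in nanoseconds; E is the bottleneck.
stages = {"F": 1.0, "D": 0.5, "E": 2.0, "M": 1.5, "W": 0.5}

# The clock period must accommodate the slowest stage.
clock_period = max(stages.values())
print("clock period:", clock_period, "ns")  # 2.0 ns: every stage waits for E

# Splitting E into two 1.0 ns sub-stages deepens the pipeline and
# shortens the critical stage, so the clock can run faster.
deeper = {"F": 1.0, "D": 0.5, "E1": 1.0, "E2": 1.0, "M": 1.5, "W": 0.5}
print("deeper pipeline clock period:", max(deeper.values()), "ns")  # 1.5 ns
```

Note how after splitting E, the memory stage M becomes the new bottleneck: deepening one stage only helps until the next-slowest stage limits the clock.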
10-Stage Pipeline
- Stage 1: build the chassis.
  - Crew 1a: fit the parts of the chassis together and spot-weld the joins.
  - Crew 1b: fully weld all the parts of the chassis.
- Stage 2: drop the engine into the chassis.
  - Crew 2a: place the engine in the chassis and mount it in place.
  - Crew 2b: connect the engine to the moving parts of the car.
- Stage 3: put doors, a hood, and coverings on the chassis.
  - Crew 3a: put the doors and hood on the chassis.
  - Crew 3b: put the other coverings on the chassis.
- Stage 4: attach the wheels.
  - Crew 4a: attach the two front wheels.
  - Crew 4b: attach the two rear wheels.
- Stage 5: paint the SUV.
  - Crew 5a: paint the sides of the SUV.
  - Crew 5b: paint the top of the SUV.
Pipelining
[Figure: deeper pipeline]
Pipeline Fills
- A pipeline works at peak efficiency only when all stages are filled. The initial filling of the pipeline can hurt performance in the early stages of a program's execution.
- Many pipeline flushes will have a negative impact on performance.
[Figure: pipeline being filled]
Pipeline Bubbles
- In reality, pipelining isn't totally free.
- Sometimes instructions get hung up in one pipeline stage for multiple cycles.
- When this happens, the pipeline is said to have stalled.
- When an instruction stalls, it backs up all the instructions coming behind it.
- When it eventually exits the stalled stage, the gap (called a bubble) created by the stall remains in the pipeline until the instruction is fully executed.
- Pipeline bubbles reduce the pipeline's overall instruction throughput.
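Under the simple no-stall model, a program of n instructions drains a 5-stage pipeline in 5 + (n − 1) cycles; a stall of k cycles inserts a k-cycle bubble that delays every instruction behind it. A minimal sketch of the cost:

```python
def total_cycles(n_instructions, depth=5, stall_cycles=0):
    """Cycles to complete all instructions in a `depth`-stage pipeline
    when one instruction stalls for `stall_cycles` extra cycles; the
    bubble pushes every later instruction back by the same amount."""
    return depth + (n_instructions - 1) + stall_cycles

print(total_cycles(5))                  # 9 cycles, bubble-free
print(total_cycles(5, stall_cycles=2))  # 11 cycles: a 2-cycle bubble
```

The model is deliberately simplified (one stall, in-order issue), but it shows why architects work hard to keep bubbles out of the pipeline.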
Pipeline Bubbles
[Figure: two instructions behind schedule]
Pipeline Bubbles
- Many of the architectural features associated with modern processors are designed to avoid pipeline stalls due to:
  - Resource conflicts: two instructions requiring the same resource at the same time.
  - Data dependencies.
  - Conditional branching: the branch target address is unknown.
- They include:
  - OOE: out-of-order execution
  - Branch prediction
  - Speculative execution
Pentium Pipelines
- Pentium (P5): 5 stages
- Pentium Pro, II, III (P6): 10 stages (1-cycle execute)
- Pentium 4 (NetBurst): 20 stages (not counting decode)
- From "Pentium 4 (Partially) Previewed," Microprocessor Report, 8/28/00
Superscalar Architectures
Superscalar Computing
- Almost all modern processors are superscalar, i.e. they allow more than one instruction to be completed per clock cycle.
- Superscalar computing is achieved by having multiple functional units.
- As the number of transistors per die increases, more functional units can be included, e.g. two ALUs working in parallel, as in the Pentium processor.
- Hence more than one scalar (integer) operation can be performed per clock cycle, which is why the term superscalar was introduced.
Superscalar Computing
Let's assume we add two additional crews to Stage 2, each building different engines.
Superscalar Computing
To illustrate both pipelining and superscalar parallel execution in action, consider the following sequence of three SUV orders sent out to the empty factory floor, right when the shop opens up:
1. Extinction Turbo
2. Extinction Turbo
3. Extinction LE
Now let's follow these three cars through the assembly line during the first four hours of the day.
Superscalar Computing
Hour 1: The line is empty when the first Turbo enters it, and the Stage 1 crew kicks into action.
Superscalar Computing
Hour 2: The first Turbo moves on to Stage 2a, while the second Turbo enters the line.
Superscalar Computing
Hour 3: Both of the Turbos are in the line being worked on when the LE enters the line.
Superscalar Computing
Hour 4: Now all three cars are in the assembly line at different stages. Notice that there are actually three cars in various versions and stages of "Stage 2," all at the same time.
Superscalar Computing
[Figure: single-stage ALU alongside a multi-stage FPU]

Superscalar Computing
[Figure: dual execution units per stage]
Pipeline Hazards
- How can we guarantee no dependencies between instructions in a pipeline (and reduce pipeline stalls or bubbles)?
- One way is to interleave the execution of instructions from different program threads on the same pipeline. This is called multithreading.
Threads
What Is a Thread?
- A thread:
  - Is an independent flow of control.
  - Operates within a process alongside other threads.
[Figure: a mono-threaded process (Process A, Thread 1) versus a multi-threaded process (Process B, Threads 1-4)]
What Is a Thread?
- A thread is a lightweight process.
- Threads vs. processes:
  - Threads use and exist within the process's resources.
  - A thread maintains its own stack and registers, scheduling properties, and set of pending and blocked signals.
- Secondary threads vs. initial threads:
  - An initial thread is created automatically when a process is created.
  - Secondary threads are peers.
Multithreaded Programming
- To realize potential program performance gains:
  - On a uniprocessor, multi-threaded processes provide for concurrent execution.
  - On a multiprocessor system, a process with multiple threads provides potential parallelism.
- Benefits of multithreaded programming:
  - Compared to the cost of creating and managing a process, a thread can be created and managed with much less operating system overhead.
  - All threads within a process share the same address space, so inter-thread communication is more efficient than inter-process communication.
Threads
- Can be context-switched more easily:
  - Only registers and the PC need to be saved, not memory-management state.
- Can run concurrently on different processors in an SMP (symmetric multiprocessor).
- Share the CPU on a uniprocessor.
- May (will) require concurrency-control programming, such as mutex locks.
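The need for mutex locks can be demonstrated with Python's `threading.Lock`: without the lock, the read-modify-write of a shared counter can interleave across threads and lose updates.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:       # serialize the read-modify-write on `counter`
            counter += 1

# Four threads each add 50,000 to the shared counter.
threads = [threading.Thread(target=increment, args=(50_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 200000: no updates lost
```

Removing the `with lock:` line makes the result nondeterministic in implementations where `counter += 1` is not atomic, which is exactly the hazard the slide warns about.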
Single-Threaded Processing
[Figure: four processes, a single thread of execution per task while other tasks wait; ineffective instruction scheduling leads to pipeline bubbles]
Context Switching
- Threads belonging to each process are given a fixed time-slice to execute.
- When a time-slice is up, the thread's context is saved to memory.
- When the thread or process gains a new time-slice, its context is reloaded, and it continues execution from the exact point at which it was flushed from the CPU.
- This is called context switching.
- Context switching for a process is more expensive than for a lightweight thread.
- So, to improve performance, cut down on context switches, or at least constrain them to lightweight threads.
SMP
- A solution to this problem is Symmetric Multi-Processing (SMP): attach two processors to a global shared memory.
- Two processes can then execute concurrently on two different processors.
- Problem:
  - Twice as much execution, but equally twice as many empty issue and execution slots.
Empty Issue Slots
[Figure: twice the number of pipeline bubbles]
Superthreading
- A technique employed in high-performance architectures to reduce the amount of wasted resources is time-slice multithreading, or superthreading.
- Processors that exploit this technique are known as multi-threaded processors.
- Multithreaded processors can execute more than one thread at a time.
- With superthreading, only the instructions belonging to one thread can be in a pipeline stage at one time.
[Figure: fewer wasted slots and fewer pipeline bubbles (due to memory latency or data dependence), but execution slots are still wasted]
- An improvement on superthreading is to remove the restriction that only one thread can have access to a pipeline stage during a clock cycle.
- This is called Simultaneous Multithreading (SMT), or HyperThreading.
[Figure: fewer execution slots remain empty; instructions from different threads are mixed within each stage. Compare with the superthreading figure.]
- Hyperthreading is similar to a single-threaded SMP system.
- Instead of physical dual processing units, a hyperthreaded processor has access to dual logical processing units.
- Threads are scheduled to execute on either of the logical processors.
- The main advantages of hyperthreading are:
  - Increased flexibility to fill execution slots.
  - The cost of adding hyperthreading logic to the die is small, e.g. about 5% of die area for the Intel Xeon processor.
  - Fewer cache-coherency problems than SMP, though there is an increased chance of cache conflicts.