Title: Scalable Numerical Algorithms and Methods on the ASCI Machines
CS61V
Pipelining and Multi-threading
Searching for Parallelism
- The goal of the computer architect is to identify potential opportunities for parallelism at every possible level and exploit them, e.g.:
  - Bit level
  - Instruction level
  - Processor level
- More parallelism within each CPU:
  - Pipelined CPUs (increased instruction throughput)
  - Superscalar CPUs (multiple functional units, multiple instruction issues per clock)
  - Superpipelined CPUs (deeper pipelines with shorter clock cycles)
  - Multi-threaded CPUs that run multiple instruction streams (so when one stream stalls on memory or I/O, another stream can make progress)
- More CPUs (100 to 10,000):
  - Hardware support for shared memory, and for locks on memory
  - Hardware support for memory consistency (because a remote write can change local memory at any time)
  - Hardware support for data movement between memories
Pipelining
Instruction-Level Pipelining
- Some CPUs divide the fetch-decode-execute cycle into smaller steps.
- These smaller steps can often be executed in parallel to increase throughput.
- Such parallel execution is called instruction-level pipelining.
- This term is sometimes abbreviated to ILP in the literature.
Pipelining Example
- Let's say that we have decided to go into the increasingly lucrative SUV manufacturing business. After some intense research, we determine that there are five stages in the SUV building process, as follows:
  - Stage 1: build the chassis.
  - Stage 2: drop the engine into the chassis.
  - Stage 3: put doors, a hood, and coverings on the chassis.
  - Stage 4: attach the wheels.
  - Stage 5: paint the SUV.
Pipelining
- There are five skilled crews, one ready to work on each stage of the manufacturing process.
- Our big strategy is to have the factory run as follows:
  - Line up all five crews in a row, and have the first crew start an SUV at Stage 1.
  - After Stage 1 is complete, the SUV moves down the line to the next stage, and the next crew drops the engine in.
  - While the Stage 2 crew is installing the engine in the chassis that the Stage 1 crew just built, the Stage 1 crew (along with all of the rest of the crews) is free to go play football, watch the big-screen plasma TV in the break room, surf the net, etc.
  - Once the Stage 2 crew is done, the SUV moves down to Stage 3 and the Stage 3 crew takes over, while the Stage 2 crew hits the break room to party with everyone else.
Pipelining
- The SUV moves on down the line through all five stages this way, with only one crew working on one stage at any given time while the rest of the crews are idle.
- Once the completed SUV finishes Stage 5, the crew at Stage 1 starts on another SUV.
- At this rate, it takes exactly five hours to finish a single SUV, and our factory puts out one SUV every five hours (assuming 1 hour per stage).
[Figure: SUV at Stage 2]
Pipelining
- How can we improve production?
- Add a second production line using five additional skilled crews.
  - This increases throughput to two SUVs every 5 hours.
- Problems?
  - Requires a lot more money to pay for the extra crews.
  - Double the inefficiency, with twice the number of crews in the break room at one time.
Pipelining
- Finally, a smart consultant hits upon a clever idea to improve productivity:
- Why let workers spend four-fifths of their day in the break room when they could be doing useful work during that time?
- The revised workflow is now as follows:
  - The Stage 1 crew builds a chassis. Once the chassis is complete, they send it on to the Stage 2 crew.
  - The Stage 2 crew receives the chassis and begins dropping the engine in, while the Stage 1 crew starts on a new chassis.
  - When both the Stage 1 and Stage 2 crews are finished, the Stage 2 crew's work advances to Stage 3, the Stage 1 crew's work advances to Stage 2, and the Stage 1 crew starts on a new chassis.
Pipelining
- As the assembly line begins to fill up with SUVs in various stages of production, more of the crews are put to work simultaneously, until all of the crews are working on different vehicles in different stages of production.
- If we can keep the assembly line full, and keep all five crews working at once, then we can produce one SUV every hour: a five-fold improvement over the previous completion rate of one SUV every five hours.
- That, in a nutshell, is pipelining.
- While the total amount of time each individual SUV spends in production has not changed from the original 5 hours, the rate at which the factory as a whole completes SUVs has increased drastically.
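The arithmetic behind the five-fold claim can be sketched in a few lines of Python (an illustrative sketch; the five one-hour stages are the example's assumption, not real manufacturing figures):

```python
STAGES = 5       # pipeline depth from the SUV example
STAGE_TIME = 1   # hours per stage (the example's assumption)

def sequential_time(n_suvs):
    """One SUV occupies the whole line at a time: 5 hours apiece."""
    return n_suvs * STAGES * STAGE_TIME

def pipelined_time(n_suvs):
    """The first SUV takes 5 hours to fill the line; after that,
    one finished SUV rolls out every hour."""
    return (STAGES + (n_suvs - 1)) * STAGE_TIME

print(sequential_time(10), "hours sequentially")  # 50 hours
print(pipelined_time(10), "hours pipelined")      # 14 hours
```

Note that the latency of one SUV is unchanged (`pipelined_time(1)` is still 5 hours); only the throughput improves.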
Pipelining
[Figure: all stages in the pipeline working simultaneously]
Instruction Scheduling
- In the von Neumann model of execution, an instruction starts only after its predecessor completes.
- This is not a very efficient model of execution, due to the von Neumann bottleneck (the memory wall).
Instruction Pipelines
- Almost all processors today use instruction pipelines to allow overlap of instructions (the Pentium 4 has a 20-stage pipeline!).
- The execution of an instruction is divided into stages; each stage is performed by a separate part of the processor.
- Each of these stages completes its operation in one cycle (shorter than the cycle in the von Neumann model).
- An instruction still takes the same time to execute.

Pipeline stages:
- F: fetch instruction from cache or memory.
- D: decode instruction.
- E: execute (ALU operation or address calculation).
- M: memory access.
- W: write back result into register.
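To see how the F/D/E/M/W stages overlap, the sketch below prints one row per instruction, assuming each instruction enters the pipeline one cycle after its predecessor and no stalls occur:

```python
STAGES = "FDEMW"  # fetch, decode, execute, memory access, write-back

def pipeline_schedule(n_instructions):
    """One row per instruction; column i is clock cycle i.
    Instruction k enters the pipeline at cycle k and advances
    one stage per cycle (no stalls assumed)."""
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for k in range(n_instructions):
        row = ["."] * total_cycles
        for s, stage_name in enumerate(STAGES):
            row[k + s] = stage_name
        rows.append("".join(row))
    return rows

for row in pipeline_schedule(3):
    print(row)
# FDEMW..
# .FDEMW.
# ..FDEMW
```

Reading down any column shows up to five different instructions in five different stages during the same clock cycle.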
4-Stage Pipeline
Single-Cycle Pipeline
[Figure: white space is hardware sitting idle; 2 instructions processed after 9 ns]
4-Stage Pipeline
[Figure: 5 instructions processed after 9 ns]
Pipelining
- The slowest stage determines the length of every stage in the pipeline.
- If one stage takes considerably longer than the others, then many cycles are wasted as the other functional units remain idle.
- The less work per pipeline stage, the faster the clock can run for each stage. Hence deeper pipelines allow a higher overall clock frequency.
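A quick numeric sketch of the "slowest stage sets the clock" rule (the stage latencies below are made up for illustration, not taken from any real CPU):

```python
# Hypothetical stage latencies in nanoseconds; E is the bottleneck.
stages = {"F": 1.0, "D": 0.5, "E": 2.0, "M": 1.5, "W": 0.5}

# The clock period must accommodate the slowest stage.
clock_period = max(stages.values())
print("clock period:", clock_period, "ns")  # 2.0 ns: every stage waits for E

# Splitting E into two 1.0 ns sub-stages deepens the pipeline and
# shortens the critical stage, so the clock can run faster.
deeper = {"F": 1.0, "D": 0.5, "E1": 1.0, "E2": 1.0, "M": 1.5, "W": 0.5}
print("deeper pipeline clock period:", max(deeper.values()), "ns")  # 1.5 ns
```

Note how after splitting E, the memory stage M becomes the new bottleneck: deepening one stage only helps until the next-slowest stage limits the clock.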
10-Stage Pipeline
- Stage 1: build the chassis.
  - Crew 1a: fit the parts of the chassis together and spot-weld the joins.
  - Crew 1b: fully weld all the parts of the chassis.
- Stage 2: drop the engine into the chassis.
  - Crew 2a: place the engine in the chassis and mount it in place.
  - Crew 2b: connect the engine to the moving parts of the car.
- Stage 3: put doors, a hood, and coverings on the chassis.
  - Crew 3a: put the doors and hood on the chassis.
  - Crew 3b: put the other coverings on the chassis.
- Stage 4: attach the wheels.
  - Crew 4a: attach the two front wheels.
  - Crew 4b: attach the two rear wheels.
- Stage 5: paint the SUV.
  - Crew 5a: paint the sides of the SUV.
  - Crew 5b: paint the top of the SUV.
Pipelining
[Figure: deeper pipeline]
Pipeline Fills
- A pipeline works at peak efficiency only when all stages are filled. The initial filling of the pipeline can hurt performance in the early stages of a program's execution.
- Many pipeline flushes will have a negative impact on performance.
[Figure: pipeline being filled]
Pipeline Bubbles
- In reality, pipelining isn't totally free.
- Sometimes instructions get hung up in one pipeline stage for multiple cycles.
- When this happens, the pipeline is said to have stalled.
- When an instruction stalls, it backs up all the instructions coming behind it.
- When it eventually exits the stalled stage, the gap (called a bubble) created by the stall remains in the pipeline until the instruction is fully executed.
- Pipeline bubbles reduce the pipeline's overall instruction throughput.
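Under the simple no-stall model, a program of n instructions drains a 5-stage pipeline in 5 + (n − 1) cycles; a stall of k cycles inserts a k-cycle bubble that delays every instruction behind it. A minimal sketch of the cost:

```python
def total_cycles(n_instructions, depth=5, stall_cycles=0):
    """Cycles to complete all instructions in a `depth`-stage pipeline
    when one instruction stalls for `stall_cycles` extra cycles; the
    bubble pushes every later instruction back by the same amount."""
    return depth + (n_instructions - 1) + stall_cycles

print(total_cycles(5))                  # 9 cycles, bubble-free
print(total_cycles(5, stall_cycles=2))  # 11 cycles: a 2-cycle bubble
```

The model is deliberately simplified (one stall, in-order issue), but it shows why architects work hard to keep bubbles out of the pipeline.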
Pipeline Bubbles
[Figure: two instructions behind schedule]
Pipeline Bubbles
- Many of the architectural features associated with modern processors are designed to avoid pipeline stalls due to:
  - Resource conflicts: two instructions requiring the same resource at the same time.
  - Data dependencies.
  - Conditional branching: the branch target address is unknown.
- They include:
  - OOE: out-of-order execution
  - Branch prediction
  - Speculative execution
Pentium Pipelines
- Pentium (P5): 5 stages
- Pentium Pro, II, III (P6): 10 stages (1-cycle execute)
- Pentium 4 (NetBurst): 20 stages (not counting decode)
- From "Pentium 4 (Partially) Previewed," Microprocessor Report, 8/28/00
Superscalar Architectures
Superscalar Computing
- Almost all modern processors are superscalar, i.e. they allow more than one instruction to be completed per clock cycle.
- Superscalar computing is achieved by having multiple functional units.
- As the number of transistors per die increases, more functional units can be included, e.g. two ALUs working in parallel, as in the Pentium processor.
- Hence more than one scalar (integer) operation can be performed per clock cycle, which is why the term superscalar was introduced.
Superscalar Computing
Let's assume we add two additional crews to Stage 2, each building different engines.
Superscalar Computing
To illustrate both pipelining and superscalar parallel execution in action, consider the following sequence of three SUV orders sent out to the empty factory floor, right when the shop opens up:
1. Extinction Turbo
2. Extinction Turbo
3. Extinction LE
Now let's follow these three cars through the assembly line during the first four hours of the day.
Superscalar Computing
Hour 1: The line is empty when the first Turbo enters it, and the Stage 1 crew kicks into action.
Superscalar Computing
Hour 2: The first Turbo moves on to Stage 2a, while the second Turbo enters the line.
Superscalar Computing
Hour 3: Both of the Turbos are in the line being worked on when the LE enters the line.
Superscalar Computing
Hour 4: Now all three cars are in the assembly line at different stages. Notice that there are actually three cars in various versions and stages of "Stage 2," all at the same time.
Superscalar Computing
[Figure: single-stage ALU alongside a multi-stage FPU]

Superscalar Computing
[Figure: dual execution units per stage]
Pipeline Hazards
- How can we guarantee no dependencies between instructions in a pipeline (and reduce pipeline stalls or bubbles)?
- One way is to interleave the execution of instructions from different program threads on the same pipeline. This is called multithreading.
Threads
What Is a Thread?
- A thread:
  - Is an independent flow of control.
  - Operates within a process alongside other threads.
[Figure: a mono-threaded process (Process A, Thread 1) versus a multi-threaded process (Process B, Threads 1-4)]
What Is a Thread?
- A thread is a lightweight process.
- Threads vs. processes:
  - Threads use and exist within the process's resources.
  - A thread maintains its own stack and registers, scheduling properties, and set of pending and blocked signals.
- Secondary threads vs. initial threads:
  - An initial thread is created automatically when a process is created.
  - Secondary threads are peers.
Multithreaded Programming
- To realize potential program performance gains:
  - On a uniprocessor, multi-threaded processes provide for concurrent execution.
  - On a multiprocessor system, a process with multiple threads provides potential parallelism.
- Benefits of multithreaded programming:
  - Compared to the cost of creating and managing a process, a thread can be created and managed with much less operating system overhead.
  - All threads within a process share the same address space, so inter-thread communication is more efficient than inter-process communication.
Threads
- Can be context-switched more easily:
  - Only registers and the PC need to be saved, not memory-management state.
- Can run concurrently on different processors in an SMP (symmetric multiprocessor).
- Share the CPU on a uniprocessor.
- May (will) require concurrency-control programming, such as mutex locks.
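The need for mutex locks can be demonstrated with Python's `threading.Lock`: without the lock, the read-modify-write of a shared counter can interleave across threads and lose updates.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:       # serialize the read-modify-write on `counter`
            counter += 1

# Four threads each add 50,000 to the shared counter.
threads = [threading.Thread(target=increment, args=(50_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 200000: no updates lost
```

Removing the `with lock:` line makes the result nondeterministic in implementations where `counter += 1` is not atomic, which is exactly the hazard the slide warns about.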
Single-Threaded Processing
[Figure: four processes, a single thread of execution per task while other tasks wait; ineffective instruction scheduling leads to pipeline bubbles]
Context Switching
- Threads belonging to each process are given a fixed time-slice to execute.
- When a time-slice is up, the thread's context is saved to memory.
- When the thread or process gains a new time-slice, its context is reloaded, and it continues execution from the exact point at which it was flushed from the CPU.
- This is called context switching.
- Context switching for a process is more expensive than for a lightweight thread.
- So, to improve performance, cut down on context switches, or at least constrain them to lightweight threads.
SMP
- A solution to this problem is Symmetric Multi-Processing (SMP): attach two processors to a global shared memory.
- Two processes can then execute concurrently on two different processors.
- Problem:
  - Twice as much execution, but equally twice as many empty issue and execution slots.
Empty Issue Slots
[Figure: twice the number of pipeline bubbles]
Superthreading
- A technique employed in high-performance architectures to reduce the amount of wasted resources is time-slice multithreading, or superthreading.
- Processors that exploit this technique are known as multi-threaded processors.
- Multithreaded processors can execute more than one thread at a time.
- With superthreading, only the instructions belonging to one thread can be in a pipeline stage at one time.
[Figure: fewer wasted slots and fewer pipeline bubbles (due to memory latency or data dependence), but execution slots are still wasted]
- An improvement on superthreading is to remove the restriction that only one thread can have access to a pipeline stage during a clock cycle.
- This is called Simultaneous Multithreading (SMT), or HyperThreading.
[Figure: fewer execution slots remain empty; instructions from different threads are mixed within each stage. Compare with the superthreading figure.]
- Hyperthreading is similar to a single-threaded SMP system.
- Instead of physical dual processing units, a hyperthreaded processor has access to dual logical processing units.
- Threads are scheduled to execute on either of the logical processors.
- The main advantages of hyperthreading are:
  - Increased flexibility to fill execution slots.
  - The cost of adding hyperthreading logic to the die is small, e.g. about 5% of die area for the Intel Xeon processor.
  - Fewer cache-coherency problems than SMP, though there is an increased chance of cache conflicts.