Transcript and Presenter's Notes

Title: High Performance Computer


1
High Performance Computer Architecture: Challenges
Rajeev Balasubramonian
School of Computing, University of Utah
2
Dramatic Clock Speed Improvements!!
Intel Pentium 4: 3.2 GHz
The 1st Intel processor: 108 kHz
3
Clock Speed = Performance?
  • The Intel Pentium4 has a higher clock speed than the IBM Power4; does the Pentium4 execute your program faster?

4
Clock Speed = Performance?
  • The Intel Pentium4 has a higher clock speed than the IBM Power4; does the Pentium4 execute your program faster?

[Timing diagram: Case 1 vs. Case 2, showing instruction completions per clock tick over time]
5
Performance = Clock Speed x Parallelism
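A minimal sketch of the relation on this slide, reading parallelism as instructions completed per cycle (IPC); the clock speeds and IPC values below are hypothetical, not figures from the talk.

```python
# Instructions completed per second = clock speed x parallelism (IPC).
def throughput(clock_hz, instructions_per_cycle):
    return clock_hz * instructions_per_cycle

# Hypothetical numbers: a lower-clocked chip with more parallelism
# can finish more instructions per second than a higher-clocked one.
high_clock = throughput(3.2e9, 1.0)   # 3.2e9 instructions/sec
high_ipc   = throughput(1.3e9, 3.0)   # 3.9e9 instructions/sec
print(high_clock, high_ipc)
```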
6
What About Parallelism?
7
Dramatic Clock Speed Improvements!!
Intel Pentium 4: 3.2 GHz
The 1st Intel processor: 108 kHz
8
The Basic Pipeline
  • Consider an automobile assembly line

[Diagram 1: 4 stages of 1 day each; a new car rolls out every day]
[Diagram 2: the same work split into more, shorter (half-day) stages; a new car rolls out every half day]

In each case it takes 4 days to build a car, but more stages → more parallelism and less time between cars
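A small sketch of the arithmetic behind the assembly-line analogy, assuming the second case splits the same 4 days of work into half-day stages (the slide states the resulting rate, not the stage count).

```python
# Latency (time to build one car) stays fixed; throughput grows with the
# number of stages, i.e. with how much work is overlapped.
def pipeline(total_work_days, num_stages):
    stage_time = total_work_days / num_stages   # time between finished cars
    latency = stage_time * num_stages           # still the full build time
    throughput = 1.0 / stage_time               # cars rolling out per day
    return stage_time, latency, throughput

print(pipeline(4, 4))   # (1.0, 4.0, 1.0) -> one car per day
print(pipeline(4, 8))   # (0.5, 4.0, 2.0) -> one car every half day
```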
9
What Determines Clock Speed?
  • Clock speed is a function of work done in each stage; in the earlier examples, the clock speeds were 1 car/day and 2 cars/day
  • Similarly, it takes plenty of work to execute an instruction, and this work is broken into stages

[Diagram: execution of a single instruction divided into pipeline stages]
10
What Determines Clock Speed?
  • Clock speed is a function of work done in each stage; in the earlier examples, the clock speeds were 1 car/day and 2 cars/day
  • Similarly, it takes plenty of work to execute an instruction, and this work is broken into stages

[Diagram: execution of a single instruction divided into 250 ps pipeline stages → 4 GHz clock speed]
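The arithmetic behind the 250 ps figure on this slide: clock speed is the reciprocal of the per-stage delay.

```python
# 250 ps of work per pipeline stage -> 4 GHz clock.
stage_delay_s = 250e-12              # 250 ps per stage (from the slide)
clock_hz = 1.0 / stage_delay_s
print(clock_hz / 1e9, "GHz")         # 4.0 GHz
```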
11
Clock Speed Improvements
  • Why have we seen such dramatic improvements in clock speed?
  • Work has been broken up into more stages: early Intel chips executed work equivalent to approximately 56 logic gates in each stage; today's chips execute 12 logic gates' worth of work
  • Transistors have been becoming faster: as technology improves, we can draw smaller and smaller transistors/gates on a chip, and that improves their speed (doubles every 5-6 years)

12
Will these Improvements Continue?
  • Transistors will continue to shrink and become faster for at least 10 more years
  • Each pipeline stage is already pretty small; improvements from this factor will cease
  • If clock speed improvements stagnate, should we turn our focus to parallelism?

13
Microprocessor Blocks
[Block diagram: Branch Predictor, L1 Instr Cache, Decode/Rename, Issue Logic, four ALUs, Register File, L1 Data Cache, L2 Cache]
14
Innovations: Branch Predictor
Improve prediction accuracy by detecting frequent patterns (sketched below)

[Same block diagram as slide 13]
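The slide does not name a prediction scheme; a common one that captures frequent patterns is a table of 2-bit saturating counters, sketched here purely as an illustration.

```python
# Illustrative 2-bit saturating-counter branch predictor (an assumption,
# not the specific mechanism described in the talk).
class TwoBitPredictor:
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.counters = [1] * (1 << table_bits)    # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        self.counters[i] = min(3, self.counters[i] + 1) if taken \
            else max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
correct = 0
for taken in ([True] * 9 + [False]) * 3:     # a loop branch: taken 9 times, then not
    correct += (bp.predict(0x40) == taken)
    bp.update(0x40, taken)
print(correct, "of 30 predicted correctly")  # the frequent pattern is learned quickly
```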
15
Innovations: Out-of-order Issue
Out-of-order issue: if later instructions do not depend on earlier ones, execute them first (sketched below)

[Same block diagram as slide 13]
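A toy cycle-by-cycle model of the idea (an illustrative assumption; the slide does not give the hardware algorithm): an instruction may issue once all of the registers it reads are available, so an independent instruction can overtake one stalled behind a long-latency load.

```python
# Each entry: (name, source registers, destination register, latency in cycles).
program = [
    ("load r1,[a]", set(),  "r1", 3),   # long-latency load
    ("add  r2,r1",  {"r1"}, "r2", 1),   # depends on the load
    ("mul  r4,r3",  {"r3"}, "r4", 1),   # independent of the load
]
ready = {"r3"}                          # values available at the start
in_flight, waiting, cycle = [], list(program), 0

while waiting or in_flight:
    cycle += 1
    for done, dest in [x for x in in_flight if x[0] == cycle]:
        ready.add(dest)                 # results that finish this cycle
        in_flight.remove((done, dest))
    for i, (name, srcs, dest, lat) in enumerate(waiting):
        if srcs <= ready:               # all sources ready? issue out of order
            print(f"cycle {cycle}: issue {name}")
            in_flight.append((cycle + lat, dest))
            waiting.pop(i)
            break
# The mul issues in cycle 2, ahead of the add that waits for the load.
```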
16
Innovations: Superscalar Architectures
Multiple ALUs increase execution bandwidth

[Same block diagram as slide 13]
17
Innovations: Data Caches
2K papers on caches: efficient data layout, stride prefetching (sketched below)

[Same block diagram as slide 13]
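The slide only names stride prefetching; a minimal sketch of the common formulation follows: after seeing the same stride twice, fetch the next address ahead of the demand access.

```python
# Toy stride prefetcher (illustrative; not a specific design from the talk).
last_addr, last_stride = None, None

def on_access(addr, prefetch):
    global last_addr, last_stride
    if last_addr is not None:
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:
            prefetch(addr + stride)      # bring the next line in early
        last_stride = stride
    last_addr = addr

# A streaming pattern over 64-byte cache lines triggers prefetches.
for a in range(0, 64 * 6, 64):
    on_access(a, lambda nxt: print("prefetch", hex(nxt)))
```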
18
Summary
  • Historically, computer engineers have focused on performance
  • Performance is a function of clock speed and parallelism
  • As technology improves, clock speeds will improve, although at a slower rate
  • Parallelism has been gradually improving, and plenty of the low-hanging fruit has been picked

19
Outline
  • Recent Microprocessor History
  • Current Trends and Challenges
  • Solutions to Handling these Challenges

20
Trend I: An Opportunity
  • Transistors on a chip have been doubling every two years (Moore's Law)
  • In the past, transistors have been used for out-of-order logic, large caches, etc.
  • In the future, transistors can be employed for multiple processors on a single chip

21
Chip Multiprocessors (CMP)
  • The IBM Power4 has two processors on a die
  • Sun has announced the 8-processor Niagara

[Diagram: a CMP with processors P1-P4 sharing an L2 cache]
22
The Challenge
  • Nearly every chip will have multiple processors, but where are the threads?
  • Some applications will truly benefit: they can be easily decomposed into threads
  • Some applications are inherently sequential; can we execute speculative threads to speed up these programs? (open problem!)

23
Trend II: Power Consumption
  • Power ∝ a × f × C × V², where a is the activity factor, f is frequency, C is capacitance, and V is voltage (worked example below)
  • Every new chip has higher frequency, more transistors (higher C), and slightly lower voltage; the net result is an increase in power consumption
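A worked example of the relation in the first bullet; the constants below are hypothetical and chosen only to show how each factor moves the total.

```python
# Dynamic power: P ∝ a * f * C * V^2 (activity, frequency, capacitance, voltage).
def dynamic_power(a, f_hz, c_farads, v_volts):
    return a * f_hz * c_farads * v_volts ** 2

old_chip = dynamic_power(0.1, 2.0e9, 40e-9, 1.3)   # ~13.5 W (hypothetical values)
new_chip = dynamic_power(0.1, 3.2e9, 60e-9, 1.2)   # higher f and C, slightly lower V
print(old_chip, new_chip, new_chip / old_chip)     # power still roughly doubles
```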

24
Scary Slide!
  • Power density cannot be allowed to increase at current rates (Source: Borkar et al., Intel)

25
Impact of Power Increases
  • Well, UtahPower sends you fatter bills every month
  • To maintain a constant chip temperature, the heat produced on a chip has to be dissipated away; every additional watt increases the cooling cost of a chip by approximately $4!!
  • If the temperature of a chip rises, the power dissipated also increases (almost exponentially) → a vicious cycle!

26
Trend III: Wire Delays
  • As technology improves, logic gates shrink → their speed increases and clock speeds improve
  • As logic gates shrink, wires shrink too; unfortunately, their speed improves only marginally
  • In relative terms, future chips will have fast transistors/gates and slow wires
  • Computation is cheap, communication is expensive!

27
Impact of Wire Delays
  • Crossing the chip used to take one cycle
  • In the future, crossing the chip can take up to 30 cycles
  • Many structures on a chip are wire-constrained (register file, cache); their access times slow down → throughput decreases as instructions sit around waiting for values
  • Long wires also consume power

28
Trend IV: Soft Errors
  • High-energy particles constantly collide with objects and deposit charge
  • Transistors are becoming smaller and on-chip voltages are being lowered → it doesn't take much to toggle the state of a transistor
  • The frequency of this occurrence is projected to increase by nine orders of magnitude over a 20-year period

29
Impact of Soft Errors
  • When a particle strike occurs, the component is not rendered permanently faulty; only the value it contains is erroneous
  • Hence, this is termed a transient fault or soft error
  • The error propagates when other instructions read this faulty value
  • This is already a problem for mission-critical apps (space, defense, highly-available servers) and may soon be a problem in other domains

30
Summary of Trends
  • More transistors, more processors on a single chip
  • High power consumption
  • Long wire delays
  • Frequent soft errors
  • We are attempting to exploit transistors to increase parallelism; in light of the above challenges, we'd be happy to even preserve parallelism

31
Transistors + Wire Delays
  • Bring in a large window of instructions so you can find high parallelism
  • Distribute instructions across processors so that communication is minimized

[Diagram: a large instruction window distributed across processors]
32
Difficult Branches
  • Mispredicted branches result in poor parallelism and wasted work (power)
  • Solution: when you arrive at a fork, take both directions; execute on low-frequency units to control power dissipation levels

[Diagram: a large instruction window distributed across processors]
33
Thermal Emergencies
  • Heterogeneous units allow you to reduce cooling costs
  • If a chip's peak power is 110W, allow enough cooling to handle 100W average → save $40/chip! (arithmetic sketched below)
  • If the application starts consuming more than 100W and temperature starts to rise, start favoring the low-power processor cores; intelligent management allows you to make forward progress even in a thermal emergency
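The arithmetic behind the savings claim, combining this slide's 110W/100W figures with the roughly $4-per-watt cooling cost quoted earlier (the currency symbols are assumed; they appear to have been dropped in the transcript).

```python
# Provision cooling for the 100W average instead of the 110W peak.
peak_power_w = 110
provisioned_w = 100
cooling_cost_per_watt = 4                      # ~$4 per watt of cooling capacity
savings = (peak_power_w - provisioned_w) * cooling_cost_per_watt
print(savings, "dollars saved per chip")       # 40
```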

34
Handling Long Wire Delays
  • Wires can be designed to have different properties
  • Knob 1: wire width and spacing; fat wires are faster, but have low bandwidth

35
Handling Wire Capacitance
  • Knob 2: wires have repeaters/buffers; many, large buffers → low delay, high power consumption

36
Mapping Data to Wires
  • We can optimize wires for delay, bandwidth, or power
  • Different data transfers on a chip have different latency and bandwidth needs; an intelligent mapping of data to wires can improve performance and lower power consumption

37
Handling Soft Errors
  • Errors can be detected and corrected by providing redundancy: execute two copies of a program (perhaps on a CMP) and compare results (sketched below)
  • Note that this doubles power consumption!

[Diagram: leading thread and trailing thread]
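A minimal software sketch of the redundancy idea in the first bullet (run two copies and compare before committing); this is an illustration, not the hardware mechanism of the leading/trailing thread design.

```python
# Execute a computation twice and compare the results; a mismatch indicates
# a transient (soft) error before the value is committed.
def redundant_execute(fn, *args):
    leading = fn(*args)       # leading copy
    trailing = fn(*args)      # trailing (redundant) copy
    if leading != trailing:
        raise RuntimeError("soft error detected: results disagree")
    return leading            # results agree, safe to commit

print(redundant_execute(sum, [1, 2, 3]))   # 6
```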
38
Handling Soft Errors
  • The trailing thread is capable of higher performance than the leading thread, but there's no point catching up; hence, artificially slow the trailing thread by lowering its frequency → lower power dissipation

[Diagram: leading thread (peak throughput 1 BIPS) feeding a trailing thread (peak throughput 2 BIPS); the trailing thread never fetches data from memory and never guesses at branches]
39
Summary of Solutions
  • Heterogeneous wires and processors
  • Instructions and data have different needs: map them to appropriate wires and processors
  • Note how these solutions target multiple issues simultaneously: slow wires, many transistors, soft errors, power/thermal emergencies

40
Conclusions
  • Performance has improved because of clock speed and parallelism advances
  • Clock speed improvements will continue at a slower rate
  • Parallelism is on a downward trend because of technology trends and because the low-hanging fruit has been picked
  • We must find creative ways to preserve or even improve parallelism in the future

41
(No Transcript)