Title: greg astfalk
2 high-end computing technology: where is it heading?
greg astfalk
woon yung chung, woon-yung_chung_at_hp.com
3 prologue
- this is not a talk about hewlett-packard's product offering(s)
- the context is hpc (high performance computing)
- somewhat biased to scientific computing
- also applies to commercial computing
4 backdrop
- end-users of hpc systems have needs and wants from hpc systems
- the computer industry delivers the hpc systems
- there exists a gap between the two wrt
- programming
- processors
- architectures
- interconnects/storage
- in this talk we (weakly) quantify the gaps in these 4 areas
5 end-users' programming wants
- end-users of hpc machines would ideally like to think and code sequentially
- have a compiler and run-time system that produces portable and (nearly) optimal parallel code
- regardless of processor count
- regardless of architecture type
- yes, i am being a bit facetious but the idea remains true
6 parallelism methodologies
- there exist 5 methodologies to achieve parallelism (a small mpi/openmp sketch follows this list)
- automatic parallelization via compilers
- explicit threading
- pthreads
- message-passing
- mpi
- pragma/directive
- openmp
- explicitly parallel languages
- upc, et al.
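As a minimal sketch (not from the talk), the dot product below mixes two of these methodologies: an openmp pragma parallelizes the loop within each process, and an mpi reduction combines the per-rank partial sums. The sizes and values are arbitrary; compile with something like mpicc -fopenmp.

/* illustrative sketch: a dot product parallelized two ways at once --
 * openmp (pragma/directive) within a node, mpi (message-passing) across nodes */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long n = 1000000;                  /* local chunk size per rank (assumed) */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* openmp: threads share the loop within this rank's address space */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n; i++)
        local += x[i] * y[i];

    /* mpi: explicit communication combines the per-rank partial sums */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product over %ld elements on %d ranks = %g\n",
               (long)n * nprocs, nprocs, global);

    free(x); free(y);
    MPI_Finalize();
    return 0;
}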
7 parallel programming
- parallel programming is a cerebral effort
- if lots of neurons plus mpi constitutes prime-time, then parallel programming has arrived
- no major technologies on the horizon to change this status quo
8 discontinuities
- the ease of parallel programming has not progressed at the same rate at which parallel systems have become available
- performance gains require compiler optimization or pbo (profile-based optimization)
- most parallelism requires hand-coding
- in the real world many users don't use any compiler optimizations
9 parallel efficiency
- be mindful that the bounds on parallel efficiency are, in general, far apart
- 50% efficiency on 32 processors is good
- 10% efficiency on O(100) processors is excellent
- >2% efficiency on O(1000) processors is heroic
- a little communication can knee over the efficiency vs. processor count curve (see the toy model below)
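A toy model (my numbers, not the talk's) makes the knee concrete: take perfectly parallel work plus a small tree-reduction communication term, t_p = t1/p + c*log2(p), and look at the efficiency e(p) = t1 / (p * t_p) as the processor count grows.

/* toy efficiency model with assumed numbers; even a small communication
 * term knees the curve over as p grows.  link with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double t1 = 100.0;  /* total work, arbitrary time units (assumed) */
    const double c  = 0.05;   /* cost per reduction step (assumed)          */

    printf("%6s %12s %12s\n", "procs", "time", "efficiency");
    for (int p = 1; p <= 4096; p *= 2) {
        double tp = t1 / p + c * log2((double)p);   /* compute + communication */
        double e  = t1 / (p * tp);                  /* parallel efficiency     */
        printf("%6d %12.4f %12.3f\n", p, tp, e);
    }
    return 0;
}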
10 apps with sufficient parallelism
- few existing applications can utilize O(1000), or even O(100), processors with any reasonable degree of efficiency
- to date this has generally required heroic effort
- new algorithms (i.e., data and control decompositions) or nearly complete rewrites are necessary
- such large-scale parallelism will have arrived when msc/nastran and oracle exist on such systems and utilize the processors
11 latency-tolerant algorithms
- latency tolerance will be an increasingly important theme for the future
- hardware will not solve this problem
- more on this point later
- developing algorithms that have significant latency tolerance will be necessary (see the overlap sketch below)
- this means thinking outside the box about the algorithms
- simple modifications to existing algorithms generally won't suffice
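One common latency-tolerance idiom, sketched here with placeholder buffers and a 1-d halo exchange, is to post nonblocking mpi sends and receives, compute on the interior while the messages are in flight, and only then finish the points that depend on the incoming data.

/* sketch of hiding interconnect latency behind interior computation;
 * domain size and data are placeholders. */
#include <mpi.h>
#include <stdio.h>

#define N 1024                       /* local interior size (assumed) */

static double u[N + 2];              /* 1-d domain with one ghost cell per side */
static double unew[N + 2];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    for (int i = 0; i < N + 2; i++) u[i] = rank;   /* arbitrary initial data */

    MPI_Request req[4];

    /* start the halo exchange, but do not wait for it yet */
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* interior points depend only on local data, so compute them while
     * the messages are in flight -- this hides the interconnect latency */
    for (int i = 2; i <= N - 1; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* only now are the ghost cells needed: wait, then do the two boundary points */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    unew[1] = 0.5 * (u[0] + u[2]);
    unew[N] = 0.5 * (u[N - 1] + u[N + 1]);

    if (rank == 0) printf("step done on %d ranks\n", nprocs);
    MPI_Finalize();
    return 0;
}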
12 operating systems
- development environments will move to nt
- heavy lifting will remain with unix
- four unixes to survive (alphabetically)
- hp-ux
- linux
- aix 5l
- solaris
- linux will be important at the lower end but will not significantly encroach on the high-end
13 end-users' proc/arch wants
- all things being equal, high-end users would likely want a classic cray vector supercomputer
- no caches
- multiple pipes to memory
- single-word access
- hardware support for gather/scatter (illustrated below)
- etc.
- it is true, however, that for some applications contemporary risc processors perform better
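For readers who have not met the term, the tiny loop below shows what gather (indexed loads) and scatter (indexed stores) mean; array names and sizes are placeholders. classic vector machines supported these access patterns directly in hardware, whereas cache-based processors handle the resulting irregular strides less gracefully.

/* gather and scatter in miniature (placeholder data) */
#include <stdio.h>

#define N 8

int main(void)
{
    double x[N]   = {10, 11, 12, 13, 14, 15, 16, 17};
    double y[N]   = {0};
    int    idx[N] = {7, 3, 5, 1, 6, 0, 2, 4};   /* arbitrary permutation */

    /* gather:  y[i] = x[idx[i]]  -- indexed loads  */
    for (int i = 0; i < N; i++)
        y[i] = x[idx[i]];

    /* scatter: x[idx[i]] = 2*y[i]  -- indexed stores */
    for (int i = 0; i < N; i++)
        x[idx[i]] = 2.0 * y[i];

    for (int i = 0; i < N; i++)
        printf("x[%d] = %g\n", i, x[i]);
    return 0;
}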
14 processors
- the processor of choice is now, and will be for some time to come, the risc processor
- risc processors have caches
- caches are good
- caches are bad
- if your code fits in cache, you aren't supercomputing!
15 risc processor performance
- a rule of thumb is that a risc processor, any risc processor, gets on average, on a sustained basis,
- 10% of its peak performance
- the 3σ on this is large
- achieved performance varies with
- architecture
- application
- algorithm
- coding (see the blocking sketch below)
- dataset size
- anything else you can think of
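One large source of that variation is whether the working set fits the cache hierarchy. The blocked (tiled) matrix multiply below is the standard illustration; the matrix size and block factor are assumptions to be tuned, not figures from the talk.

/* cache-blocking sketch: same arithmetic as the naive triple loop, but
 * restructured so each bs-by-bs tile is reused from cache rather than
 * refetched from memory. */
#include <stdio.h>

#define N  512     /* matrix dimension (assumed) */
#define BS 64      /* block size; tune to the cache (assumed) */

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0;
            b[i][j] = 2.0;
            c[i][j] = 0.0;
        }

    /* blocked triple loop: the tiles of a, b and c stay resident in cache */
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            c[i][j] += a[i][k] * b[k][j];

    printf("c[0][0] = %g (expect %g)\n", c[0][0], 2.0 * N);
    return 0;
}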
16 semiconductor processes
- semiconductor processes change every 2-3 years
- assuming that technology scaling applies to subsequent generations, then per generation (compounded in the sketch below)
- frequency increase of 40%
- transistor density increase of 100%
- energy per transition decrease of 60%
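Compounding those per-generation figures gives a feel for where a few process turns lead; the 5-generation horizon below is only an assumption for illustration.

/* compound the per-generation scaling figures quoted above */
#include <stdio.h>

int main(void)
{
    double freq = 1.0, density = 1.0, energy = 1.0;

    printf("gen   frequency    density   energy/transition\n");
    for (int g = 0; g <= 5; g++) {
        printf("%3d   %8.2fx   %7.1fx   %12.3fx\n", g, freq, density, energy);
        freq    *= 1.40;   /* +40% clock per generation   */
        density *= 2.00;   /* +100% transistors per area  */
        energy  *= 0.40;   /* -60% energy per transition  */
    }
    return 0;
}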
17 semiconductor processes
18 what to do with gates
- it is not a simple question of what the best use of the gates is
- larger caches
- multiple cores
- specialized functional units
- etc.
- the impact of soft errors with decreasing design rule size will be an important topic
- what happens if an alpha particle flips a bit in a register?
19 processor futures
- you can expect, for the short term, moore's-law-like gains in processors' peak performance
- doubling of performance every 18-24 months
- does not necessarily apply to application performance
- moore's law will not last forever
- 4-5 more turns (maybe?)
20 processor evolution
[chart: processor performance vs. time. cisc at roughly 0.3 instructions/cycle, risc at under 1 instruction/cycle, superscalar risc at 2 instructions/cycle, and ia-64/epic as the next generation; process shrinks run 1 micron -> .5 -> .35 -> .25 -> .18 -> .13 micron, with a 20-30% increase per year due to advances in the underlying semiconductor technology]
21 customer spending ($m)
[chart: customer spending, $0 to $40,000m; source: idc, february 2000]
- technology disruptions
- risc crossed over cisc in 1996
- itanium will cross over risc in 2004
22 present high-end architectures
- today's high-end architecture is either
- smp
- ccnuma
- cluster of smp nodes
- cluster of ccnuma nodes
- japanese vector system
- all of these architectures work
- efficiency varies with application type
23 architectural issues
- of the choices available, the smp is preferred, however
- smp processor count is limited
- cost of scalability is prohibitive
- ccnuma addresses these limitations but induces its own
- disparate latencies
- better, but still limited, scalability
- ras limitations
- clusters too have pros and cons
- huge latencies
- low cost
- etc.
24 physics
- limitations imposed by physics have led us to architectures that have a deep memory hierarchy
- the algorithmist and programmer must deal with, and exploit, the hierarchy to achieve good performance
- this is part of the cerebral effort of parallel programming we mentioned earlier
25 memory hierarchy
- typical latencies for today's technology (a small measurement sketch follows)
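One rough way to see the hierarchy for yourself is the pointer-chasing loop below: every load depends on the previous one, so the measured time per access tracks the latency of whichever level the working set fits in. The working-set sizes and iteration count are assumptions.

/* pointer-chase sketch of memory-hierarchy latency (assumed sizes) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static long chase(size_t n, long iters)
{
    size_t *next  = malloc(n * sizeof *next);
    size_t *order = malloc(n * sizeof *order);

    /* link the elements into one big cycle in shuffled order, so every
     * load depends on the previous one and prefetching cannot help */
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n - 1; i++) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    size_t p = order[0];
    clock_t t0 = clock();
    for (long k = 0; k < iters; k++)
        p = next[p];                  /* serialized, dependent loads */
    clock_t t1 = clock();

    printf("%10zu elements (%8zu kb): %6.1f ns/access\n",
           n, n * sizeof *next / 1024,
           1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)iters);

    free(order);
    free(next);
    return (long)p;                   /* return the result so the loop is not optimized away */
}

int main(void)
{
    const long iters = 10 * 1000 * 1000;
    long sink = 0;
    for (size_t n = (size_t)1 << 10; n <= (size_t)1 << 23; n <<= 2)
        sink += chase(n, iters);
    return (int)(sink & 1);
}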
26 balanced system ratios
- an ideal high-end system should be balanced wrt its performance metrics
- for each peak flop/second
- 0.5-1 byte of physical memory
- 10-100 bytes of disk capacity
- 4-16 bytes/sec of cache bandwidth
- 1-3 bytes/sec of memory bandwidth
- 0.1-1 bit/sec of interconnect bandwidth
- 0.02-0.2 bytes/sec of disk bandwidth
27 balanced system
- applying the balanced system ratios to an unnamed contemporary 16-processor smp (a worked sketch with an assumed peak follows)
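As a worked sketch, the snippet below applies the ratios from the previous slide to a hypothetical 16-processor smp with an assumed peak of 1.5 gflop/s per processor; the peak figure is my assumption, not a number from the talk.

/* apply the balanced-system ratios to a hypothetical 16-processor smp */
#include <stdio.h>

int main(void)
{
    const double procs      = 16;
    const double peak_flops = procs * 1.5e9;   /* assumed aggregate peak, flop/s */

    printf("aggregate peak:          %7.1f gflop/s\n", peak_flops / 1e9);
    printf("physical memory:         %7.1f - %7.1f gb\n",
           0.5 * peak_flops / 1e9, 1.0 * peak_flops / 1e9);
    printf("disk capacity:           %7.2f - %7.2f tb\n",
           10.0 * peak_flops / 1e12, 100.0 * peak_flops / 1e12);
    printf("cache bandwidth:         %7.1f - %7.1f gb/s\n",
           4.0 * peak_flops / 1e9, 16.0 * peak_flops / 1e9);
    printf("memory bandwidth:        %7.1f - %7.1f gb/s\n",
           1.0 * peak_flops / 1e9, 3.0 * peak_flops / 1e9);
    printf("interconnect bandwidth:  %7.1f - %7.1f gbit/s\n",
           0.1 * peak_flops / 1e9, 1.0 * peak_flops / 1e9);
    printf("disk bandwidth:          %7.2f - %7.2f gb/s\n",
           0.02 * peak_flops / 1e9, 0.2 * peak_flops / 1e9);
    return 0;
}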
28 storage
- data volumes are growing at an extremely rapid pace
- disk capacity sold doubled from 1997 to 1998
- storage is an increasingly large percentage of the total server sale
- disk technology is advancing too slowly
- per generation of 1-1.5 years
- access time decreases 10%
- spindle bandwidth increases 30%
- capacity increases 50%
29 networks
- only the standards will be widely deployed
- gigabit ethernet
- gigabyte ethernet
- fibre channel (2x and 10x later)
- sio
- atm
- dwdm backbones
- the last-mile problem remains with us
- inter-system interconnect for clustering will not keep pace with the demands (for latency and bandwidth)
30 vendors' constraints
- rule 1: be profitable to return value to the shareholders
- you don't control the market size
- you can only spend 10% of your revenue on r&d
- don't fab your own silicon (hopefully)
- you must be more than just a technical computing company
- to not do this is to fail to meet rule 1 (see above)
31 market sizes
- according to the industry analysts the technical market is, depending on where you draw the cut-line, $4-5 billion annually
- the bulk of the market is small-ish systems (data from forest baskett at sgi)
32 a perspective
- commercial computing is not an enemy
- without the commercial market's revenue our ability to build hpc-like systems would be limited
- the commercial market benefits from the technology innovation in the hpc market
- is performance left on the table in designing a system to serve both the commercial and technical markets? yes
33 why?
- lack of a cold war
- performance of hpc systems has been marginalized
- in the mid-70s how many applications ran faster on a vax 11/780 than the cray-1? none
- how many applications today run faster on a pentium than the cray t90? some
- current demand for hpc systems is elastic
34 future prognostication
- computing in the future will be all about data and moving data
- the growth in data volumes is incredible
- richer media types (i.e., video) mean more data
- distributed collaborations imply moving data
- e-whatever requires large, rapid data movement
- more flops → more data
35 data movement
- the scope of data movement encompasses
- register to functional unit
- cache to register
- cache to cache
- memory to cache
- disk to memory
- tape to disk
- system to system
- pda to client to server
- continent to continent
- all of these are going to be important
36 epilogue
- for hpc in the future
- it is going to be risc processors
- smp and ccnuma architectures
- smp processor count relatively constant
- technology trends are reasonably predictable
- mpi, pthreads and openmp for parallelism
- latency management will be crucial
- it will be all about data
37 epilogue (cont'd)
- for the computer industry in the future
- trending toward e-everything
- e-commerce
- apps-on-tap
- brokered services
- remote data
- virtual data centers
- visualization
- nt for development
- vectors are dying
- for hpc vendors in the future
- there will be fewer of them
38 conclusion
- hpc users will need to yield more to what the industry can provide, rather than vice versa
- vendors' rule 1 is a cruel master