Transcript and Presenter's Notes

Title: Today


1
Today
  • About the class
  • Introductions
  • Any new people?
  • Start of first module: Parallel Computing

2
Goals for This Module
  • Overview of Parallel Architecture and Programming
    Models
  • Drivers of Parallel Computing (applications,
    technology trends, architecture, economics)
  • Trends in Supercomputers for Scientific
    Computing
  • Evolution and Convergence of Parallel
    Architectures
  • Fundamental Issues in Programming Models and
    Architecture
  • Parallel programs
  • Process of parallelization
  • What parallel programs look like in major
    programming models
  • Programming for performance
  • Key performance issues and architectural
    interactions

3
Overview of Parallel Architecture and
Programming Models
4
What is a Parallel Computer?
  • A collection of processing elements that
    cooperate to solve large problems fast
  • Some broad issues that distinguish parallel
    computers:
  • Resource Allocation
  • how large a collection?
  • how powerful are the elements?
  • how much memory?
  • Data access, Communication and Synchronization
  • how do the elements cooperate and communicate?
  • how are data transmitted between processors?
  • what are the abstractions and primitives for
    cooperation?
  • Performance and Scalability
  • how does it all translate into performance?
  • how does it scale?

5
Why Parallelism?
  • Provides alternative to faster clock for
    performance
  • Assuming effective per-node performance doubles
    every 2 years, a 1024-CPU system can deliver
    today the performance a single-CPU system would
    take 20 years to reach (1024 = 2^10 doublings,
    at 2 years per doubling; see the sketch below)
  • Applies at all levels of system design
  • Is increasingly central in information processing
  • Scientific computing: simulation, data analysis,
    data storage and management, etc.
  • Commercial computing: transaction processing,
    databases
  • Internet applications: search. Google operates
    at least 50,000 CPUs, many as part of large
    parallel systems
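
A quick check of the 20-year arithmetic above, as a minimal C sketch (added for illustration; it assumes ideal linear speedup, which real systems do not achieve):

```c
/* If per-node performance doubles every 2 years, a p-CPU machine
 * (assuming ideal speedup) matches what one CPU reaches only after
 * 2 * log2(p) years. Illustrative only. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double p = 1024.0;               /* processor count from the slide */
    double years = 2.0 * log2(p);    /* 10 doublings x 2 years = 20    */
    printf("%.0f CPUs ~ %.0f years of single-CPU progress\n", p, years);
    return 0;
}
```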

6
How to Study Parallel Systems
  • History: diverse and innovative organizational
    structures, often tied to novel programming
    models
  • Rapidly matured under strong technological
    constraints
  • The microprocessor is ubiquitous
  • Laptops and supercomputers are fundamentally
    similar!
  • Technological trends cause diverse approaches to
    converge
  • Technological trends make parallel computing
    inevitable
  • In the mainstream
  • Need to understand fundamental principles and
    design tradeoffs, not just taxonomies
  • Naming, Ordering, Replication, Communication
    performance

7
Outline
  • Drivers of Parallel Computing
  • Trends in Supercomputers for Scientific
    Computing
  • Evolution and Convergence of Parallel
    Architectures
  • Fundamental Issues in Programming Models and
    Architecture

8
Drivers of Parallel Computing
  • Application Needs: our insatiable need for
    computing cycles
  • Scientific computing: CFD, biology, chemistry,
    physics, ...
  • General-purpose computing: video, graphics, CAD,
    databases, TP...
  • Internet applications: search, e-commerce,
    clustering ...
  • Technology Trends
  • Architecture Trends
  • Economics
  • Current trends
  • All microprocessors have multiprocessor support
  • Servers and workstations are often MP: Sun, SGI,
    Dell, COMPAQ...
  • Microprocessors are multiprocessors: SMP on a chip

9
Application Trends
  • Demand for cycles fuels advances in hardware, and
    vice-versa
  • This cycle drives the exponential increase in
    microprocessor performance
  • Drives parallel architecture harder: the most
    demanding applications
  • Range of performance demands
  • Need range of system performance with
    progressively increasing cost
  • Platform pyramid
  • Goal of applications in using parallel machines:
    Speedup
  • Speedup(p processors) =
    Performance(p processors) / Performance(1 processor)
  • For a fixed problem size (input data set),
    performance = 1/time
  • Speedup_fixed-problem(p processors) =
    Time(1 processor) / Time(p processors)
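
A worked instance of the fixed-problem definition above (the timings are made-up illustrative numbers, not measurements from the slides):

```c
/* Fixed-problem speedup and efficiency from the definitions above. */
#include <stdio.h>

int main(void) {
    double t1 = 100.0;   /* hypothetical time on 1 processor (s)  */
    double tp = 8.0;     /* hypothetical time on p processors (s) */
    int    p  = 16;

    double speedup    = t1 / tp;       /* = 12.5                    */
    double efficiency = speedup / p;   /* = 0.78, fraction of ideal */

    printf("speedup = %.2f, efficiency = %.0f%%\n",
           speedup, 100.0 * efficiency);
    return 0;
}
```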
10
Scientific Computing Demand
11
Engineering Computing Demand
  • Large parallel machines a mainstay in many
    industries
  • Petroleum (reservoir analysis)
  • Automotive (crash simulation, drag analysis,
    combustion efficiency),
  • Aeronautics (airflow analysis, engine efficiency,
    structural mechanics, electromagnetism),
  • Computer-aided design
  • Pharmaceuticals (molecular modeling)
  • Visualization
  • in all of the above
  • entertainment (movies), architecture
    (walk-throughs, rendering)
  • Financial modeling (yield and derivative
    analysis)
  • etc.

12
Learning Curve for Parallel Applications
  • AMBER molecular dynamics simulation program
  • Starting point was vector code for Cray-1
  • 145 MFLOPS on Cray C90; 406 MFLOPS for the final
    version on a 128-processor Paragon; 891 MFLOPS on
    a 128-processor Cray T3D

13
Commercial Computing
  • Also relies on parallelism for high end
  • Scale not so large, but use much more widespread
  • Computational power determines scale of business
    that can be handled
  • Databases, online-transaction processing,
    decision support, data mining, data warehousing
    ...
  • TPC benchmarks (TPC-C: order entry; TPC-D:
    decision support)
  • Explicit scaling criteria provided
  • Size of enterprise scales with size of system
  • Problem size no longer fixed as p increases, so
    throughput is used as a performance measure
    (transactions per minute or tpm)

14
TPC-C Results for Wintel Systems
  6-way   Unisys AQ HS6    Pentium Pro 200 MHz   12,026 tpmC  $39.38/tpmC  Avail 11-30-97  TPC-C v3.3 (withdrawn)
  4-way   Cpq PL 5000      Pentium Pro 200 MHz    6,751 tpmC  $89.62/tpmC  Avail 12-1-96   TPC-C v3.2 (withdrawn)
  4-way   IBM NF 7000      PII Xeon 400 MHz      18,893 tpmC  $29.09/tpmC  Avail 12-29-98  TPC-C v3.3 (withdrawn)
  8-way   Cpq PL 8500      PIII Xeon 550 MHz     40,369 tpmC  $18.46/tpmC  Avail 12-31-99  TPC-C v3.5 (withdrawn)
  8-way   Dell PE 8450     PIII Xeon 700 MHz     57,015 tpmC  $14.99/tpmC  Avail 1-15-01   TPC-C v3.5 (withdrawn)
  32-way  Unisys ES7000    PIII Xeon 900 MHz    165,218 tpmC  $21.33/tpmC  Avail 3-10-02   TPC-C v5.0
  32-way  NEC Express5800  Itanium2 1 GHz       342,746 tpmC  $12.86/tpmC  Avail 3-31-03   TPC-C v5.0
  32-way  Unisys ES7000    Xeon MP 2 GHz        234,325 tpmC  $11.59/tpmC  Avail 3-31-03   TPC-C v5.0
  • Parallelism is pervasive
  • Small to moderate scale parallelism very
    important
  • Difficult to obtain snapshot to compare across
    vendor platforms

15
Summary of Application Trends
  • Transition to parallel computing has occurred for
    scientific and engineering computing
  • Transition is in rapid progress for commercial
    computing
  • Database and transactions as well as financial
  • Usually smaller-scale, but large-scale systems
    also used
  • Desktop also uses multithreaded programs, which
    are a lot like parallel programs
  • Demand for improving throughput on sequential
    workloads
  • Greatest use of small-scale multiprocessors
  • Solid application demand, keeps increasing with
    time

16
Drivers of Parallel Computing
  • Application Needs
  • Technology Trends
  • Architecture Trends
  • Economics

17
Technology Trends: Rise of the Micro
The natural building block for multiprocessors is
now also about the fastest!
18
General Technology Trends
  • Microprocessor performance increases 50-100%
    per year
  • Transistor count doubles every 3 years
  • DRAM size quadruples every 3 years
  • Huge investment per generation is carried by huge
    commodity market
  • The point is not that single-processor performance
    is plateauing, but that parallelism is a natural
    way to improve it (see the growth sketch below)
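
A minimal sketch of what these compound rates imply over a decade (the rates are the slide's figures; the projection itself is added only for illustration):

```c
/* Decade-scale implications of the quoted compound growth rates. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double years = 10.0;
    /* 50%/year performance: 1.5^10 ~ 58x per decade */
    printf("performance: %.0fx per decade\n", pow(1.5, years));
    /* doubling every 3 years: 2^(10/3) ~ 10x */
    printf("transistors: %.0fx per decade\n", pow(2.0, years / 3.0));
    /* quadrupling every 3 years: 4^(10/3) ~ 100x */
    printf("DRAM size:   %.0fx per decade\n", pow(4.0, years / 3.0));
    return 0;
}
```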

19
Clock Frequency Growth Rate (Intel family)
  • 30% per year

20
Transistor Count Growth Rate (Intel family)
  • 100 million transistors on chip by early 2000s
    A.D.
  • Transistor count grows much faster than clock
    rate
  • ~40% per year; an order of magnitude more
    contribution over two decades

21
Technology: A Closer Look
  • Basic advance is decreasing feature size (λ)
  • Circuits become either faster or lower in power
  • Die size is growing too
  • Clock rate improves roughly proportional to
    improvement in λ
  • Number of transistors improves like λ² (or
    faster)
  • Performance > 100x per decade: clock rate
    accounts for ~10x, the rest from transistor count
  • How to use more transistors?
  • Parallelism in processing
  • multiple operations per cycle reduces CPI (see
    the sketch below)
  • Locality in data access
  • avoids latency and reduces CPI
  • also improves processor utilization
  • Both need resources, so tradeoff
  • Fundamental issue is resource distribution, as in
    uniprocessors
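
One way to see the CPI remark concretely is the standard execution-time identity, time = instruction count x CPI / clock rate. The numbers below are hypothetical, chosen only to show the effect of issuing multiple operations per cycle:

```c
/* Execution time = instruction count x CPI / clock rate.
 * All values are hypothetical. */
#include <stdio.h>

int main(void) {
    double insts = 1e9;          /* hypothetical instruction count */
    double freq  = 1e9;          /* 1 GHz clock                    */
    double cpi_scalar = 1.5;     /* one op per cycle, plus stalls  */
    double cpi_super  = 0.6;     /* multiple ops per cycle         */

    printf("scalar:      %.2f s\n", insts * cpi_scalar / freq);
    printf("superscalar: %.2f s\n", insts * cpi_super  / freq);
    return 0;
}
```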

22
Similar Story for Storage
  • Divergence between memory capacity and speed more
    pronounced
  • Capacity increased by 1000x from 1980-95, and
    increases 50% per year
  • Latency reduces only 3% per year (only 2x from
    1980-95)
  • Bandwidth per memory chip increases 2x as fast as
    latency reduces
  • Larger memories are slower, while processors get
    faster
  • Need to transfer more data in parallel
  • Need deeper cache hierarchies
  • How to organize caches?
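
One standard way to reason about the cache-organization question (not from the slides) is average memory access time: AMAT = hit time + miss rate x miss penalty, applied per level. All latencies and miss rates below are hypothetical:

```c
/* AMAT = hit time + miss rate x miss penalty, applied recursively.
 * Latencies (in cycles) and miss rates are hypothetical. */
#include <stdio.h>

int main(void) {
    double l1_hit = 1.0,  l1_miss = 0.05;
    double l2_hit = 10.0, l2_miss = 0.20;
    double mem    = 200.0;

    double amat_l1   = l1_hit + l1_miss * mem;                       /* 11.0 */
    double amat_l1l2 = l1_hit + l1_miss * (l2_hit + l2_miss * mem);  /*  3.5 */

    printf("L1 only: %.1f cycles\n", amat_l1);
    printf("L1 + L2: %.1f cycles\n", amat_l1l2);
    return 0;
}
```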

23
Similar Story for Storage
  • Parallelism increases effective size of each
    level of hierarchy, without increasing access
    time
  • Parallelism and locality within memory systems
    too
  • New designs fetch many bits within the memory
    chip, then follow with fast pipelined transfer
    across a narrower interface
  • Buffer caches most recently accessed data
  • Disks too: parallel disks plus caching
  • Overall, the dramatic growth of processor speed,
    storage capacity, and bandwidth relative to
    latency (especially) and clock speed points
    toward parallelism as the desirable architectural
    direction

24
Drivers of Parallel Computing
  • Application Needs
  • Technology Trends
  • Architecture Trends
  • Economics

25
Architectural Trends
  • Architecture translates technology's gifts into
    performance and capability
  • Resolves the tradeoff between parallelism and
    locality
  • Recent microprocessors: 1/3 compute, 1/3 cache,
    1/3 off-chip connect
  • Tradeoffs may change with scale and technology
    advances
  • Four generations of architectural history: tube,
    transistor, IC, VLSI
  • Here we focus only on the VLSI generation
  • Greatest delineation in VLSI has been in type of
    parallelism exploited

26
Architectural Trends in Parallelism
  • Greatest trend in VLSI generation is increase in
    parallelism
  • Up to 1985: bit-level parallelism: 4-bit → 8-bit
    → 16-bit
  • slows after 32-bit
  • adoption of 64-bit well under way, 128-bit is far
    off (not a performance issue)
  • great inflection point when 32-bit micro and
    cache fit on a chip
  • Mid-80s to mid-90s: instruction-level parallelism
  • pipelining and simple instruction sets, plus
    compiler advances (RISC)
  • on-chip caches and functional units ⇒ superscalar
    execution
  • greater sophistication: out-of-order execution,
    speculation, prediction
  • to deal with control transfer and latency
    problems
  • Next step: thread-level parallelism

27
Phases in VLSI Generation
  • How good is instruction-level parallelism (ILP)?
  • Thread-level needed in microprocessors?
  • SMT, Intel Hyperthreading

28
Can ILP get us there?
  • Reported speedups for superscalar processors:
  • Horst, Harris, and Jardine [1990] .......... 1.37
  • Wang and Wu [1988] ......................... 1.70
  • Smith, Johnson, and Horowitz [1989] ........ 2.30
  • Murakami et al. [1989] ..................... 2.55
  • Chang et al. [1991] ........................ 2.90
  • Jouppi and Wall [1989] ..................... 3.20
  • Lee, Kwok, and Briggs [1991] ............... 3.50
  • Wall [1991] ................................ 5
  • Melvin and Patt [1991] ..................... 8
  • Butler et al. [1991] ....................... 17
  • Large variance due to differences in:
  • application domain investigated (numerical
    versus non-numerical)
  • capabilities of the processor modeled

29
ILP Ideal Potential
  • Assumes infinite resources and fetch bandwidth,
    perfect branch prediction and renaming
  • but real caches and non-zero miss latencies

30
Results of ILP Studies
  • Concentrate on parallelism for 4-issue machines
  • Realistic studies show only 2-fold speedup
  • More recent work examines ILP that looks across
    threads for parallelism

31
Architectural Trends: Bus-based MPs
  • Micro on a chip makes it natural to connect many
    to shared memory
  • dominates server and enterprise market, moving
    down to desktop
  • Faster processors began to saturate bus, then bus
    technology advanced
  • today, range of sizes for bus-based systems,
    desktop to large servers

No. of processors in fully configured commercial
shared-memory systems
32
Bus Bandwidth
33
Do Buses Scale?
  • Buses are a convenient way to extend architecture
    to parallelism, but they do not scale
  • bandwidth doesn't grow as CPUs are added (see
    the sketch below)
  • Scalable systems use physically distributed memory
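
The scaling problem is simple arithmetic: aggregate bus bandwidth is fixed, so the per-processor share falls as 1/p. The bandwidth figure below is hypothetical:

```c
/* Per-processor share of a fixed shared-bus bandwidth falls as 1/p. */
#include <stdio.h>

int main(void) {
    double bus_bw = 1600.0;   /* hypothetical bus bandwidth, MB/s */
    for (int p = 1; p <= 64; p *= 4)
        printf("%2d CPUs: %7.1f MB/s each\n", p, bus_bw / p);
    return 0;
}
```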

34
Drivers of Parallel Computing
  • Application Needs
  • Technology Trends
  • Architecture Trends
  • Economics

35
Finally, Economics
  • Commodity microprocessors not only fast but CHEAP
  • Development cost is tens of millions of dollars
    ($5-100M typical)
  • BUT, many more are sold compared to
    supercomputers
  • Crucial to take advantage of the investment, and
    use the commodity building block
  • Exotic parallel architectures are now no more
    than special-purpose machines
  • Multiprocessors being pushed by software vendors
    (e.g. database) as well as hardware vendors
  • Standardization by Intel makes small, bus-based
    SMPs commodity
  • Desktop: a few smaller processors versus one
    larger one?
  • Multiprocessor on a chip

36
Summary: Why Parallel Architecture?
  • Increasingly attractive
  • Economics, technology, architecture, application
    demand
  • Increasingly central and mainstream
  • Parallelism exploited at many levels
  • Instruction-level parallelism
  • Multiprocessor servers
  • Large-scale multiprocessors (MPPs)
  • Focus of this class: multiprocessor level of
    parallelism
  • Same story from memory (and storage) system
    perspective
  • Increase bandwidth, reduce average latency with
    many local memories
  • A wide range of parallel architectures makes
    sense
  • Different cost, performance, and scalability

37
Outline
  • Drivers of Parallel Computing
  • Trends in Supercomputers for Scientific
    Computing
  • Evolution and Convergence of Parallel
    Architectures
  • Fundamental Issues in Programming Models and
    Architecture

38
Scientific Supercomputing
  • Proving ground and driver for innovative
    architecture and techniques
  • Market is smaller relative to commercial
    computing as MPs become mainstream
  • Dominated by vector machines starting in the 70s
  • Microprocessors have made huge gains in
    floating-point performance
  • high clock rates
  • pipelined floating point units (e.g. mult-add)
  • instruction-level parallelism
  • effective use of caches
  • Plus economics
  • Large-scale multiprocessors replace vector
    supercomputers

39
Raw Uniprocessor Performance: LINPACK
40
Raw Parallel Performance: LINPACK
  • Even vector Crays became parallel: X-MP (2-4),
    Y-MP (8), C-90 (16), T94 (32)
  • Since 1993, Cray produces MPPs too (T3D, T3E)

41
500 Fastest Computers
42
Top 500 as of 2003
  • Earth Simulator, built by NEC, remains the
    unchallenged #1, at 38 TFlop/s
  • ASCI Q at Los Alamos is #2 at 13.88 TFlop/s
  • The third system ever to exceed the 10 TFlop/s
    mark is Virginia Tech's X, a cluster built from
    Apple G5 nodes with the new InfiniBand
    interconnect
  • #4 is also a cluster, at NCSA: a Dell PowerEdge
    system with Myrinet interconnect
  • #5 is also a cluster, an upgraded Itanium2-based
    HP system at DOE's Pacific Northwest National
    Lab, with Quadrics interconnect
  • #6 is based on AMD's Opteron chip. It was
    installed by Linux Networx at Los Alamos National
    Laboratory and also uses a Myrinet interconnect
  • The list of clusters in the TOP10 has grown to
    seven systems; the Earth Simulator and two IBM SP
    systems at Livermore and LBL are the non-clusters
  • The performance of the #10 system is 6.6 TFlop/s

43
(No transcript for this slide)
44
Another View of Performance Growth
45
Another View of Performance Growth
46
Another View of Performance Growth
47
Another View of Performance Growth
48
Outline
  • Drivers of Parallel Computing
  • Trends in Supercomputers for Scientific
    Computing
  • Evolution and Convergence of Parallel
    Architectures
  • Fundamental Issues in Programming Models and
    Architecture

49
History
  • Historically, parallel architectures tied to
    programming models
  • Divergent architectures, with no predictable
    pattern of growth.

(Diagram: application software and system software
atop divergent architectures: Systolic Arrays, SIMD,
Message Passing, Dataflow, Shared Memory)
  • Uncertainty of direction paralyzed parallel
    software development!

50
Today
  • Extension of computer architecture to support
    communication and cooperation
  • OLD: Instruction Set Architecture
  • NEW: Communication Architecture
  • Defines:
  • Critical abstractions, boundaries, and primitives
    (interfaces)
  • Organizational structures that implement
    interfaces (hw or sw)
  • Compilers, libraries and OS are important bridges
    between application and architecture today

51
Modern Layered Framework
52
Parallel Programming Model
  • What the programmer uses in writing applications
  • Specifies communication and synchronization
  • Examples
  • Multiprogramming: no communication or synch. at
    program level
  • Shared address space: like a bulletin board
  • Message passing: like letters or phone calls,
    explicit point-to-point
  • Data parallel: more regimented, global actions on
    data
  • Implemented with shared address space or message
    passing (a minimal sketch of the shared address
    space model follows below)
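
A minimal C sketch of the shared address space model above, using POSIX threads (added for illustration; the counter, loop count, and thread count are made up). Threads communicate implicitly through shared memory, with a lock for synchronization; in the message-passing model the same exchange would instead be explicit sends and receives (e.g. MPI_Send/MPI_Recv):

```c
/* Shared address space model: implicit communication through shared
 * memory, explicit synchronization with a mutex.
 * Compile with: cc sketch.c -o sketch -lpthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;    /* shared data: the "bulletin board" */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* synchronization primitive */
        counter++;                     /* implicit communication    */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* prints 400000 */
    return 0;
}
```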

53
Communication Abstraction
  • User level communication primitives provided by
    system
  • Realizes the programming model
  • Mapping exists between language primitives of
    programming model and these primitives
  • Supported directly by hw, or via OS, or via user
    sw
  • Much debate about what to support in sw and
    where the gap between layers should lie
  • Today:
  • Hw/sw interface tends to be flat, i.e. complexity
    roughly uniform
  • Compilers and software play important roles as
    bridges today
  • Technology trends exert strong influence
  • Result is convergence in organizational structure
  • Relatively simple, general purpose communication
    primitives

54
Communication Architecture
  • User/System Interface + Implementation
  • User/System Interface
  • Comm. primitives exposed to user-level by hw and
    system-level sw
  • (May be additional user-level software between
    this and prog model)
  • Implementation
  • Organizational structures that implement the
    primitives: hw or OS
  • How optimized are they? How integrated into
    processing node?
  • Structure of network
  • Goals
  • Performance
  • Broad applicability
  • Programmability
  • Scalability
  • Low Cost