WELCOME TO ADVANCED COMPUTER ARCHITECTURE CS622

1
WELCOME TO ADVANCED COMPUTER ARCHITECTURE CS622
2
Course info
  • http://www.cse.iitk.ac.in/mainakc/lectures622.html
  • Parallel Computer Architecture: A
    Hardware/Software Approach, Culler and Singh with
    Gupta.
  • I am located in CS211; stop by and we can have a
    chat. I like talking about this stuff (I mean
    it)!
  • Or send me mail: mainakc_at_cse.iitk.ac.in

3
Course info
  • Grading: two exams (10% + 30%), three designs
    (10% + 10% + 25%), three homeworks (5% + 5% + 5%)
  • Designs will be done in groups
  • Exams are open books/open notes

4
  • Parallel Computer Architecture
  • Today and Tomorrow

5
What is computer architecture?
  • Amdahl, Blaauw and Brooks, 1964 (IBM 360 team)
  • The structure of a computer that a machine
    language programmer must understand to write a
    correct (timing independent) program for that
    machine
  • Loosely speaking, it is the science of designing
    computers leading to glorious failures and some
    notable successes

6
Architect's job
  • Design and engineer various parts of a computer
    system to maximize performance and
    programmability within the technology limits and
    cost budget
  • Technology limit could mean process/circuit
    technology in case of microprocessor architecture
  • For bigger systems technology limit could mean
    interconnect technology (how one component talks
    to another at macro level)

7
Architect's job
Slightly outdated data
8
58% growth rate
  • Two major architectural reasons:
  • Advent of RISC (Reduced Instruction Set Computer)
    made it easy to implement many aggressive
    architectural techniques for extracting
    parallelism
  • Introduction of caches
  • Made easy by Moore's law
  • Two major impacts:
  • Highest performance microprocessors today
    outperform supercomputers designed less than 10
    years ago
  • Microprocessor-based products have dominated all
    sectors of computing: desktops, workstations;
    minicomputers are replaced by servers, mainframes
    are replaced by multiprocessors, supercomputers
    are built out of commodity microprocessors (a
    cost factor also dictated this trend)

9
The computer market
  • Three major sectors:
  • Desktop: ranges from low-end PCs to high-end
    workstations; market trend is very sensitive to
    price-performance ratio
  • Server: used in large-scale computing or
    service-oriented markets such as heavy-weight
    scientific computing, databases, web services,
    etc.; reliability, availability and scalability
    are very important; servers are normally designed
    for high throughput
  • Embedded: fast-growing sector; very
    price-sensitive; present in most day-to-day
    appliances such as microwave ovens, washing
    machines, printers, network switches, palmtops,
    cell phones, smart cards, game engines; software
    is usually specialized/tuned for one particular
    system

10
The applications
  • Very different in three sectors
  • This difference is the main reason for different
    design styles in these three areas
  • Desktop market demands leading-edge
    microprocessors and high-performance graphics
    engines; must offer balanced performance for a
    wide range of applications; customers are happy
    to spend a reasonable amount of money for high
    performance, i.e. the metric is price-performance
  • Server market integrates high-end microprocessors
    into scalable multiprocessors; throughput is very
    important; could be floating-point or graphics or
    transaction throughput
  • Embedded market adopts high-end microprocessor
    techniques, paying immense attention to low price
    and low power; processors are either general
    purpose (to some extent) or application-specific

11
Parallel architecture
  • Collection of processing elements that co-operate
    to solve large problems fast
  • Design questions that need to be answered
  • How many processing elements (scalability)?
  • How capable is each processor (computing power)?
  • How to address memory (shared or distributed)?
  • How much addressable memory (address bit
    allocation)?
  • How do the processors communicate (through memory
    or by messages)?
  • How do the processors avoid data races
    (synchronization)?
  • How do you answer all these to achieve highest
    performance within your cost envelope?
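
The synchronization question above can be made concrete with a small sketch. This is only an illustration in Python, not something from the course material: threads stand in for processors, one shared variable stands in for shared memory, and `run_counter` is a hypothetical helper name. The lock makes the read-modify-write on the counter a critical section, which is one common way to avoid a data race.

```python
import threading


def run_counter(num_threads: int, increments: int) -> int:
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # The read-modify-write below is the critical section;
            # holding the lock lets only one thread execute it at a time.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = run_counter(num_threads=4, increments=10_000)
print(total)
```

With the lock in place the final count always equals threads times increments; without it, updates from different threads could be lost.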

12
Why parallel arch.?
  • Parallelism helps
  • There are applications that can be parallelized
    easily
  • There are important applications that require an
    enormous amount of computation (10 GFLOPS to 1
    TFLOPS)
  • NASA taps SGI, Intel for supercomputers: 20 512p
    SGI Altix using Itanium 2
    (http://zdnet.com.com/2100-1103_2-5286156.html),
    27th July, 2004
  • There are important applications that need to
    deliver high throughput

13
Why study it?
  • Parallelism is ubiquitous
  • Need to understand the design trade-offs
  • Microprocessors are now multiprocessors (more
    later)
  • Today a computer architect's primary job is to
    find out how to efficiently extract parallelism
  • Get involved in interesting research projects
  • Make an impact
  • Shape the future development
  • Have fun

14
Performance metrics
  • Need benchmark applications
  • SPLASH (Stanford ParalleL Applications for SHared
    memory)
  • SPEC (Standard Performance Evaluation Corp.) OMP
  • ScaLAPACK (Scalable Linear Algebra PACKage) for
    message-passing machines
  • TPC (Transaction Processing Performance Council)
    for database/transaction processing performance
  • NAS (Numerical Aerodynamic Simulation) for
    aerophysics applications
  • NPB2 (port to MPI) for message-passing only
  • PARKBENCH (PARallel Kernels and BENCHmarks) for
    message-passing only

15
Performance metrics
  • Comparing two different parallel computers
  • Execution time is the most reliable metric
  • Sometimes MFLOPS, GFLOPS, TFLOPS are used, but
    could be misleading
  • Evaluating a particular machine
  • Use speedup to gauge scalability of the machine
    (provided the application itself scales)
  • Speedup(P) = Uniprocessor time / Time on P
    processors
  • Normally the input data set is kept constant when
    measuring speedup
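
The speedup definition on this slide can be sketched as a short calculation. The timings below are made-up numbers for illustration, not measurements from any machine, and `efficiency` (speedup divided by P) is an extra, commonly used figure that the slide itself does not mention.

```python
def speedup(uniprocessor_time: float, parallel_time: float) -> float:
    """Speedup(P) = uniprocessor time / time on P processors."""
    if uniprocessor_time <= 0 or parallel_time <= 0:
        raise ValueError("times must be positive")
    return uniprocessor_time / parallel_time


def efficiency(uniprocessor_time: float, parallel_time: float,
               processors: int) -> float:
    """Speedup per processor; 1.0 would mean perfect scaling."""
    return speedup(uniprocessor_time, parallel_time) / processors


# Hypothetical execution times in seconds, same input data set throughout.
measured = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}
t1 = measured[1]
for p, tp in measured.items():
    print(f"P={p}: speedup={speedup(t1, tp):.2f}, "
          f"efficiency={efficiency(t1, tp, p):.2f}")
```

Keeping the input data set fixed, as the slide notes, is what makes the times for different P comparable.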

16
Throughput metrics
  • Sometimes metrics like jobs/hour may be more
    important than just the turn-around time of a job
  • This is the case for transaction processing (the
    biggest commercial application for servers)
  • Needs to serve as many transactions as possible
    in a given time provided time per transaction is
    reasonable
  • Transactions are largely independent so throw in
    as many hardware threads as possible
  • Known as throughput computing (more later)
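
A rough way to see why independent transactions favor more hardware threads is the idealized model below. It assumes every job takes the same time and that threads never contend for memory, locks or I/O, which real servers do not guarantee; the function name and the numbers are hypothetical.

```python
def jobs_per_hour(hardware_threads: int, seconds_per_job: float) -> float:
    """Idealized throughput: independent jobs, no contention between threads."""
    if hardware_threads < 1 or seconds_per_job <= 0:
        raise ValueError("need at least one thread and a positive job time")
    return hardware_threads * 3600.0 / seconds_per_job


# Turn-around time per job stays at 2 seconds in this model;
# only the number of jobs finished per hour grows with the thread count.
for threads in (1, 4, 32):
    print(threads, jobs_per_hour(threads, seconds_per_job=2.0))
```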

17
Application trends
  • Equal to or below 1 GFLOPS requirements
  • 2D airfoil, oil reservoir modeling, 3D plasma
    modeling, 48-hour weather
  • Below 100 GFLOPS requirements
  • Chemical dynamics, structural biology, 72-hour
    weather
  • Tomorrow's applications (beyond 100 GFLOPS)
  • Human genome, protein folding, superconductor
    modeling, quantum chromodynamics, molecular
    geometry, real-time vision and speech
    recognition, graphics, CAD, space exploration,
    global warming, etc.
  • Insatiable demand for CPU cycles (need
    large-scale supercomputers)

18
Commercial sector
  • Slightly different story
  • Transactions per minute (tpm)
  • Scale of computers is much smaller
  • 4P machines to maybe 32P servers
  • But use of parallelism is tremendous
  • Need to serve as many transaction threads as
    possible (maximize the number of database users)
  • Need to handle large data footprint and offer
    massive parallelism (economics also kicks in:
    should be low-cost)

19
Desktop market
  • Demand to improve throughput for sequential
    multi-programmed workload
  • I want to run as many simulations as I can and
    want them to finish before I come back next
    morning
  • Possibly the biggest application for small-scale
    multiprocessors (e.g. 2 or 4-way SMPs)
  • Even on a uniprocessor machine I would be happy
    if I could play AOE without affecting the
    performance of my simulation running in the
    background (simultaneous multi-threading and chip
    multi-processing; more later)

20
Technology trends
  • The natural building block for multiprocessors is
    the microprocessor
  • Microprocessor performance increases 50% every
    year
  • Transistor count doubles every 18 months
  • Intel Pentium 4 EE 3.4 GHz has 178 M transistors
    on a 237 mm² die
  • 130 nm Itanium 2 has 410 M transistors on a 374
    mm² die
  • 90 nm Intel Montecito has 1.7 B transistors on a
    596 mm² die
  • Die area is also growing
  • Intel Prescott had 125 M transistors on a 112 mm²
    die
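
The figures on this slide can be turned into two quick calculations: the yearly growth implied by doubling every 18 months, and the transistor density of each listed chip. The transistor counts and die areas are the ones quoted above; the helper names are only illustrative.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Transistor-count multiplier after `years` if the count doubles
    every `doubling_months` months."""
    return 2.0 ** (years * 12.0 / doubling_months)


def density_per_mm2(transistors: float, die_area_mm2: float) -> float:
    """Average number of transistors per square millimetre of die."""
    return transistors / die_area_mm2


# (transistor count, die area in mm^2), as quoted on the slide.
chips = {
    "Pentium 4 EE 3.4 GHz": (178e6, 237.0),
    "Itanium 2 (130 nm)": (410e6, 374.0),
    "Montecito (90 nm)": (1.7e9, 596.0),
    "Prescott": (125e6, 112.0),
}

print(f"growth per year: {growth_factor(1.0):.3f}x")
for name, (count, area) in chips.items():
    print(f"{name}: {density_per_mm2(count, area) / 1e6:.2f} M transistors/mm^2")
```

Doubling every 18 months works out to a factor of about 1.59 per year, and to exactly 4x over three years.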

21
Technology trends
  • Ever-shrinking process technology
  • Shorter gate length of transistors
  • Can afford to sweep electrons through channel
    faster
  • Transistors can be clocked at faster rate
  • Transistors also get smaller
  • Can afford to pack more on the die
  • And die size is also increasing
  • What to do with so many transistors?

22
Technology trends
  • Could increase L2 or L3 cache size
  • Does not help much beyond a certain point
  • Burns more power
  • Could improve microarchitecture
  • Better branch predictor or novel designs to
    improve instruction-level parallelism (ILP)
  • If single-thread performance cannot be improved,
    have to look for thread-level parallelism (TLP)
  • Multiple cores on the die (chip multiprocessors):
    IBM POWER4, POWER5, Intel Montecito, Intel
    Pentium 4, AMD Opteron, Sun UltraSPARC IV

23
Technology trends
  • TLP on chip
  • Instead of putting multiple cores, could put extra
    resources and logic to run multiple threads
    simultaneously (simultaneous multi-threading):
    Alpha 21464 (cancelled), Intel Pentium 4, IBM
    POWER5, Intel Montecito
  • Today's microprocessors are small-scale
    multiprocessors (dual-core, 2-way SMT)
  • Tomorrow's microprocessors will be larger-scale
    multiprocessors or highly multi-threaded
  • Sun Niagara is an 8-core (each 4-way threaded)
    chip: 32 threads on a single chip

24
Architectural trends
  • Circuits: bit-level parallelism
  • Started with 4 bits (Intel 4004)
    http://www.intel4004.com/
  • Now 32-bit processor is the norm
  • 64-bit processors are taking over (AMD Opteron,
    Intel Itanium, Pentium 4 family); started with
    Alpha, MIPS, Sun families
  • Architecture: instruction-level parallelism (ILP)
  • Extract independent instruction stream
  • Key to advanced microprocessor design
  • Gradually hitting a limit: memory wall
  • Memory operations are the bottleneck
  • Need memory-level parallelism (MLP)
  • Also technology limits such as wire delay are
    pushing for a more distributed control rather
    than the centralized control in today's processors

25
Architectural trends
  • If ILP cannot be boosted, what can be done?
  • Thread-level parallelism (TLP)
  • Explicit parallel programs already have TLP
    (inherent)
  • Sequential programs that are hard to parallelize
    or ILP-limited can be speculatively parallelized
    in hardware
  • Thread-level speculation (TLS)
  • Today's trend: if nothing can be done to boost
    single-thread performance, invest transistors and
    resources to exploit TLP

26
Exploiting TLP: NOW
  • Simplest solution: take the commodity boxes,
    connect them over Gigabit Ethernet and let them
    talk via messages
  • The simplest possible message-passing machine
  • Also known as Network of Workstations (NOW)
  • Normally PVM (Parallel Virtual Machine) or MPI
    (Message Passing Interface) is used for
    programming
  • Each processor sees only local memory
  • Any remote data access must happen through
    explicit messages (send/recv calls trapping into
    kernel)
  • Optimizations in the messaging layer are possible
    (user level messages, active messages)
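
The message-passing style described on this slide can be sketched as follows. This is a stand-in, not PVM or MPI code: Python threads play the role of separate workstations, a `queue.Queue` plays the role of the network, and `message_passing_sum` is a hypothetical name. By convention each worker touches only the data handed to it (its "local memory"), and partial results reach rank 0 only through explicit send (`put`) and receive (`get`) calls.

```python
import queue
import threading


def message_passing_sum(data_per_node: list[list[int]]) -> int:
    """Sum distributed data; every node sends its partial sum to rank 0."""
    mailbox: queue.Queue = queue.Queue()   # rank 0's inbox (the "network")
    result: dict[str, int] = {}            # test harness only, not node state

    def node(rank: int, local_data: list[int]) -> None:
        partial = sum(local_data)          # uses local memory only
        if rank == 0:
            total = partial
            # Blocking receive, one message per remote node.
            for _ in range(len(data_per_node) - 1):
                _sender, value = mailbox.get()
                total += value
            result["total"] = total
        else:
            mailbox.put((rank, partial))   # explicit send to rank 0

    threads = [threading.Thread(target=node, args=(rank, data))
               for rank, data in enumerate(data_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


print(message_passing_sum([[1, 2, 3], [4, 5], [6]]))
```

On a real NOW the same structure would be written with the send/receive calls of PVM or MPI, and each message would cross the network rather than an in-process queue.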

27
Supercomputers
  • Historically used for scientific computing
  • Initially used vector processors
  • But the uniprocessor performance gap between
    vector processors and microprocessors is narrowing
  • Microprocessors now have heavily pipelined
    floating-point units, large on-chip caches,
    modern techniques to extract ILP
  • Microprocessor-based supercomputers come in
    large scale: 100 to 1000 processors (called
    massively parallel processors or MPPs)
  • However, vector-processor-based supercomputers
    are much smaller scale due to cost disadvantage
  • Cray finally decided to use Alpha µP in T3D

28
Exploiting TLP: Shared memory
  • Hard to build, but offers better programmability
    compared to message-passing clusters
  • The conventional load/store architecture
    continues to work
  • Communication takes place through load/store
    instructions
  • Designing a cache coherence protocol is central
  • Handling data coherency among different caches
  • Special care needed for synchronization

29
Shared memory MPs
  • What is the communication protocol?
  • Could be bus-based
  • Processors share a bus and snoop every
    transaction on the bus
  • The most common design in server and enterprise
    market

[Figure: four processors (P) and a memory (MEM) on a shared bus]
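
To make "snoop every transaction on the bus" concrete, here is a minimal sketch of a snooping protocol for a single cache line. The slide does not name a protocol; the three-state MSI scheme (Modified, Shared, Invalid) used here is a textbook choice picked for illustration, and all class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


class Bus:
    """Shared bus in front of one memory word."""

    def __init__(self, memory_value: int = 0) -> None:
        self.memory = memory_value
        self.caches: list["Cache"] = []

    def broadcast(self, sender: "Cache", op: str) -> None:
        # Every cache except the requester snoops the transaction.
        for cache in self.caches:
            if cache is not sender:
                cache.snoop(op)


class Cache:
    def __init__(self, name: str, bus: Bus) -> None:
        self.name, self.bus = name, bus
        self.state, self.value = State.INVALID, None
        bus.caches.append(self)

    def read(self) -> int:
        if self.state is State.INVALID:
            self.bus.broadcast(self, "BusRd")   # a dirty owner writes back
            self.value = self.bus.memory
            self.state = State.SHARED
        return self.value

    def write(self, value: int) -> None:
        if self.state is not State.MODIFIED:
            self.bus.broadcast(self, "BusRdX")  # invalidate other copies
            self.state = State.MODIFIED
        self.value = value

    def snoop(self, op: str) -> None:
        if self.state is State.MODIFIED:
            self.bus.memory = self.value        # flush dirty data to memory
            self.state = State.SHARED
        if op == "BusRdX":
            self.state = State.INVALID


bus = Bus(memory_value=0)
p0, p1 = Cache("P0", bus), Cache("P1", bus)
p0.read(); p1.read()          # both caches hold the line in SHARED
p0.write(7)                   # P1's copy is invalidated
print(p1.read(), p0.state.value, p1.state.value)
```

Real protocols track many lines, handle races between bus requests and add states (for example an Exclusive state), so this is a sketch of the idea only.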
30
Bus-based MPs
  • The memory is equidistant from all processors
  • Normally called symmetric multiprocessors (SMPs)
  • Fast processors can easily saturate the bus
  • Bus bandwidth becomes a scalability bottleneck
  • In the 90s, when processors were slow, 32P SMPs
    could be seen
  • Now mostly Sun pushes for large-scale SMPs with
    advanced bus architecture/technology
  • The bus speed and width have also increased
    dramatically: Intel Pentium 4 boxes normally come
    with 400 MHz front-side bus, Xeons have 533 MHz
    or 800 MHz FSB, PowerPC G5 can clock the bus up
    to 1.25 GHz
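
Why bus bandwidth limits SMP size can be seen with a back-of-the-envelope estimate. The 800 MHz figure comes from the slide; the 8-byte bus width and the 1 GB/s of bus traffic per processor are assumptions chosen only to make the arithmetic concrete, and the model ignores arbitration and protocol overhead.

```python
def bus_bandwidth(transfers_per_second: float, width_bytes: int) -> float:
    """Peak bus bandwidth in bytes per second."""
    return transfers_per_second * width_bytes


def max_processors(bandwidth: float, demand_per_processor: float) -> int:
    """How many processors fit before their combined traffic exceeds the bus."""
    if demand_per_processor <= 0:
        raise ValueError("demand must be positive")
    return int(bandwidth // demand_per_processor)


peak = bus_bandwidth(800e6, width_bytes=8)   # assumed 64-bit wide bus
print(f"peak bandwidth: {peak / 1e9:.1f} GB/s")
# Assumed demand: each processor generates 1 GB/s of bus traffic.
print("processors before saturation:", max_processors(peak, 1.0e9))
```

As processors get faster, the per-processor demand grows and the number that one bus can feed shrinks, which is the scalability bottleneck named on the slide.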

31
Scaling DSMs
  • Large-scale shared memory MPs are normally built
    over a scalable switch-based network
  • Now each node has its local memory
  • Access to remote memory happens through
    load/store, but may take longer
  • Non-Uniform Memory Access (NUMA)
  • Distributed Shared Memory (DSM)
  • The underlying coherence protocol is quite
    different compared to a bus-based SMP
  • Need specialized memory controller to handle
    coherence requests and a router to connect to the
    network
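
The effect of non-uniform access time can be estimated with a weighted average of local and remote latency. The latencies below are hypothetical round numbers, not figures for any particular DSM, and the model ignores caching and network contention.

```python
def average_access_ns(local_fraction: float, local_ns: float,
                      remote_ns: float) -> float:
    """Mean memory latency when `local_fraction` of accesses hit local memory
    and the rest go over the network to a remote node."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Assumed latencies: 100 ns local, 300 ns remote.
for fraction in (1.0, 0.9, 0.5):
    print(fraction, average_access_ns(fraction, 100.0, 300.0))
```

The calculation shows why data placement matters on a NUMA machine: the more accesses stay on the local node, the closer the average gets to the local latency.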

32
On-chip TLP
  • Current trend
  • Tight integration
  • Minimize communication latency (data
    communication is the bottleneck)
  • Since we have transistors
  • Put multiple cores on chip (Chip multiprocessing)
  • They can communicate via either a shared bus or
    switch-based fabric on-chip (can be custom
    designed and clocked faster)
  • Or put support for multiple threads without
    replicating cores (Simultaneous multi-threading)
  • Both choices provide a good cost/performance
    trade-off

33
Economics
  • Ultimately who controls what gets built?
  • It is a cost vs. performance trade-off
  • Given a time budget (to market) and a revenue
    projection, how much performance can be afforded?
  • Normal trend is to use commodity microprocessors
    as building blocks unless there is a very good
    reason
  • Reuse existing technology as much as possible
  • Large-scale scientific computing mostly exploits
    message-passing machines (easy to build, less
    costly); even Google uses the same kind of
    architecture (use commodity parts)
  • Small to medium-scale shared memory
    multiprocessors are needed in the commercial
    market (databases)
  • Although large-scale DSMs (256 or 512 nodes) are
    built by SGI, demand is less

34
Summary
  • Parallel architectures will be ubiquitous soon
  • Even on desktop (already we have SMT/HT,
    multi-core)
  • Economically attractive: can build with COTS
    (commodity-off-the-shelf) parts
  • Enormous application demand (scientific as well
    as commercial)
  • More attractive today with positive technology
    and architectural trends
  • Wide range of parallel architectures: SMP
    servers, DSMs, large clusters, CMP, SMT, CMT, ...
  • Today's microprocessors are, in fact, complex
    parallel machines trying to extract ILP as well
    as TLP