Advanced Architectures CSE 190

Transcript and Presenter's Notes

1
Advanced Architectures CSE 190
  • Reagan W. Moore
  • San Diego Supercomputer Center
  • moore@sdsc.edu
  • http://www.npaci.edu/DICE

2
Course Organization
  • Professors / TA
  • Sid Karin - Director, San Diego Supercomputer
    Center, <skarin@sdsc.edu>
  • Reagan Moore - Associate Director, SDSC
    <moore@sdsc.edu>
  • Holly Dail - UCSD TA <hdail@cs.ucsd.edu>
  • Seminars
  • State-of-the-art computer architectures
  • Mid-term / SDSC tour
  • Final exam

3
Seminars
  • 4/3 Reagan Moore - Performance evaluation,
    heuristics, modeling
  • 4/10 Sid Karin - Historical perspective
  • 4/17 Richard Kaufmann, Compaq - Teraflops
    systems
  • 4/24 IBM or Sun
  • 5/1 Mark Seager, LLNL - ASCI 10 Tflops
    computer
  • 5/8 Midterm / SDSC Tour
  • 5/15 John Feo, Tera - Multi-threaded
    architectures
  • 5/22 Peter Beckman, LANL - Clusters
  • 5/29 Holiday / no class
  • 6/5 Thomas Sterling, Caltech - Petaflops
    computers
  • 6/12 Final exam

4
Supercomputers for Simulation and Data Mining
[Diagram with components: Data Mining, Distributed Archives, Application,
Collection Building, Information Discovery, Digital Library]
5
Heuristics for Characterizing Supercomputers
  • Generators of data - numerically intensive
    computing
  • Usage models for the rate at which supercomputers
    move data between memory, disk, and archives
  • Usage models for capacity of the data caches
    (memory size, local disk, and archival storage)
  • Analyzers of data - data intensive computing
  • Performance models for combining data analysis
    with data movement (between caches, disks,
    archives)

6
Heuristics
  • Experience-based models of computer usage
  • Dependent on computer architecture
  • Presence of data caches, memory-mapped I/O
  • Architectures used at SDSC
  • CRAY vector computers
  • X/MP, Y/MP, C-90, T-90
  • Parallel computers
  • MPPs - iPSC/860, Paragon, T3D, T3E
  • Clusters - SP

7
Supercomputer Data Flow Model
8
Y-MP Heuristics
  • Utilization measured on Cray Y-MP
  • Real memory architecture - entire job context is
    in memory, no paging of data
  • Exceptional memory bandwidth
  • I/O rate from CPU to memory was 28 Bytes per
    cycle
  • Maximum execution rate was 2 Flops per cycle
  • Scaled memory on C-90 to test heuristics
  • Noted that increasing memory from 1 GB to 2 GB
    decreased idle time from 10% to 2%
  • Sustained execution rate was 1.8 GFlops
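A derived figure, not stated on the slide: the peak hardware balance implied by the per-cycle rates above is

\[
\frac{28\ \text{bytes/cycle}}{2\ \text{flops/cycle}} \;=\; 14\ \text{bytes per flop (peak)}
\]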

9
Data Generation Metrics
[Diagram: data flows from the CPU through memory and local disk to the
archive disk and archive tape]

  • CPU to memory: 7 Bytes/Flop
  • Memory: 1 Byte of storage per Flops
  • Memory to local disk: 1 Byte/60 Flop
  • Local Disk: holds data for 1 day; 1/7 of the data
    persists for a day; 1/7 of the data is sent to the archive
  • Archive Disk: holds data for 1 week; all data is sent to tape
  • Archive tape: holds data forever
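A minimal sketch of how the generation-rate heuristics above translate into bandwidth estimates for a given sustained execution rate; the function name and the example rate (the roughly 1.8 GFlops sustained figure from the previous slide) are illustrative, and the pairing of each ratio with a link follows the diagram above.

# Sketch: turn the rule-of-thumb generation rates above into bandwidth
# estimates for a machine with a given sustained execution rate.
# The ratios (7 bytes/flop to memory, 1 byte per 60 flops to local disk,
# 1/7 of disk traffic sent on to the archive) come from the slide; the
# function name and the example rate are illustrative only.

def data_rates(sustained_flops):
    """Estimated data rates in bytes/sec implied by the heuristics."""
    memory_traffic = 7.0 * sustained_flops     # CPU <-> memory
    disk_traffic = sustained_flops / 60.0      # memory -> local disk
    archive_traffic = disk_traffic / 7.0       # local disk -> archive
    return memory_traffic, disk_traffic, archive_traffic

# Example: the ~1.8 GFlops sustained rate quoted for the C-90 above.
mem, disk, arch = data_rates(1.8e9)
print(f"memory  {mem/1e9:.1f} GB/s")
print(f"disk    {disk/1e6:.1f} MB/s")
print(f"archive {arch/1e6:.2f} MB/s")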
10
Peak Teraflops System
  • TeraFlops System compute engine: 0.5-1 TB memory, sustain ? GF
  • Compute engine to local disk: ? GB/sec
  • Local Disk (1 day cache): ? TB
  • Local disk to archive disk: ? MB/sec
  • Archive Disk (1 week cache): ? TB
  • Archive disk to archive tape: ? MB/sec
  • Archive Tape: ? PB
11
Data Sizes on Disk
  • How much scratch space is used by each job?
  • Disk space is 20 - 40 times the memory size.
  • Data lasts for about one day
  • Average execution time for long-running jobs
  • 30 minutes to 1 hour
  • For jobs using all of memory
  • Between 24 and 48 jobs per day
  • Each job uses (Disk space) / (Number of jobs)
  • Or (40 / 48) × Memory ≈ 80% of memory (worked out below)
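Spelling out the last bullet, using the upper bounds above (40 times memory of scratch disk, 48 jobs per day), with M the memory size:

\[
\frac{\text{disk space}}{\text{jobs per day}} \;=\; \frac{40\,M}{48} \;\approx\; 0.83\,M \;\approx\; 80\%\ \text{of memory}
\]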

12
Peak Teraflops Data Flow Model
  • TeraFlops System compute engine: 0.5-1 TB memory, sustain 150 GF
  • Compute engine to local disk: 1 GB/sec
  • Local Disk (1 day cache): 10 TB
  • Local disk to archive disk: 40 MB/sec
  • Archive Disk (1 week cache): 5 TB
  • Archive disk to archive tape: 40 MB/sec
  • Archive Tape: 0.5-1 PB
13
HPSS Archival Storage System
14
Equivalent of Ohm's Law for Computer Science
  • How does one relate application requirements to
    computation rates and I/O bandwidths?
  • Use prototype data movement problem to derive
    physical parameters that characterize
    applications.

15
Data Distribution Comparison
Reduce the size of the data from S bytes to s bytes and
analyze it.

[Diagram: the Data feeds a Data Handling Platform, which is
linked over a network to a Supercomputer]

  • Execution rates: r (data handling platform) < R
    (supercomputer)
  • Bandwidths linking the systems are B and b
  • Operations per bit for analysis is C
  • Operations per bit for data transfer is c

Should the data reduction be done before transmission?
16
Distributing Services
Compare the times for analyzing the data with size
reduction from S to s.

Reduce at the Data Handling Platform, then transmit:
  • Read Data: S / B
  • Reduce Data: C S / r
  • Transmit Data: c s / r
  • Network: s / b
  • Receive Data: c s / R

Transmit all of the data to the Supercomputer, then reduce:
  • Read Data: S / B
  • Transmit Data: c S / r
  • Network: S / b
  • Receive Data: c S / R
  • Reduce Data: C S / R
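A minimal sketch of the two time models above as code, assuming consistent units for sizes, bandwidths, and rates; the function names and example parameter values are illustrative only.

# Sketch: total time for each strategy, summing the per-stage terms above.
# S, s are data sizes, B, b are bandwidths, R, r are execution rates, and
# C, c are operations per unit of data for analysis and for data transfer.
# Names and example values are illustrative only.

def t_archive(S, s, B, b, R, r, C, c):
    """Reduce at the data handling platform, then transmit the result."""
    return S/B + C*S/r + c*s/r + s/b + c*s/R

def t_super(S, s, B, b, R, r, C, c):
    """Transmit all of the data to the supercomputer, then reduce there."""
    return S/B + c*S/r + S/b + c*S/R + C*S/R

# Example: is it faster to move all of the data to the supercomputer?
params = dict(S=1e12, s=1e9, B=1e9, b=1e8, R=1e12, r=1e10, C=100.0, c=1.0)
print("T(Archive) =", t_archive(**params))
print("T(Super)   =", t_super(**params))
print("Move data to supercomputer?", t_super(**params) < t_archive(**params))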
17
Comparison of Time
18
Optimization Parameter Selection
We have an algebraic inequality with eight independent
variables (S, s, B, b, C, c, R, r):

T(Super) < T(Archive), i.e.,

S/B + cS/r + S/b + cS/R + CS/R < S/B + CS/r + cs/r + s/b + cs/R

Which variable provides the simplest optimization criterion?
19
Scaling Parameters
  • Data size reduction ratio: s/S
  • Execution slowdown ratio: r/R
  • Problem complexity ratio: c/C
  • Communication/execution balance: r/(cb)

Note that (r/c) is the number of bits/sec that can be
processed. When r/(cb) = 1, the data processing rate is the
same as the data transmission rate.

Optimal designs have r/(cb) = 1.
20
Bandwidth Optimization
Moving all of the data is faster, T(Super) < T(Archive),
when the network is sufficiently fast (b large enough); a
reconstruction of the criterion follows.
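The condition below is a reconstruction, derived by solving T(Super) < T(Archive) for the bandwidth b from the time models on the "Distributing Services" slide; it is not taken verbatim from the original slide.

\[
\frac{S - s}{b} \;<\; \frac{S(C - c) + c\,s}{r} \;-\; \frac{c\,(S - s) + C\,S}{R}
\]

or, rewritten with the scaling parameters of the previous slide,

\[
\bigl(1 - \tfrac{s}{S}\bigr)\,\frac{r}{c\,b} \;<\; \frac{C}{c}\Bigl(1 - \frac{r}{R}\Bigr) \;-\; \bigl(1 - \tfrac{s}{S}\bigr)\Bigl(1 + \frac{r}{R}\Bigr)
\]

A sufficiently large b (equivalently, a small enough r/(cb)) makes the left side arbitrarily small, provided the right side is positive.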
21
Execution Rate Optimization
Moving all of the data is faster, T(Super) < T(Archive),
when the supercomputer is sufficiently fast:

R > r [1 + (c/C)(1 - s/S)] / [1 - (c/C)(1 - s/S)(1 + r/(cb))]

Note that the denominator changes sign when
C < c (1 - s/S) (1 + r/(cb)). Even with an infinitely fast
supercomputer, it is better to process at the archive if the
complexity is too small.
22
Data Reduction Optimization
Moving all of the data is faster, T(Super) < T(Archive),
when the data reduction is small enough:

s > S [1 - (C/c)(1 - r/R) / (1 + r/R + r/(cb))]

Note that the criterion changes sign when
C > c (1 + r/R + r/(cb)) / (1 - r/R). When the complexity is
sufficiently large, it is faster to process on the
supercomputer even when the data can be reduced to one bit.
23
Complexity Analysis
Moving all of the data is faster, T(Super) < T(Archive),
when the analysis is sufficiently complex; a reconstruction
of the criterion follows.
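The complexity criterion below is a reconstruction, obtained by solving T(Super) < T(Archive) for C in terms of the scaling parameters; it is not taken verbatim from the original slide, but it reduces to the sign-change conditions quoted on the two preceding slides.

\[
C \;>\; c\,\bigl(1 - \tfrac{s}{S}\bigr)\,\frac{1 + \tfrac{r}{R} + \tfrac{r}{c\,b}}{1 - \tfrac{r}{R}}
\]

Setting r/R to 0 (an infinitely fast supercomputer) recovers the condition on the "Execution Rate Optimization" slide, and setting s/S to 0 (reduction to almost nothing) recovers the one on the "Data Reduction Optimization" slide.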
24
Characterization of Supercomputer Systems
  • Sufficiently high complexity
  • Move data to processing engine
  • Digital Library execution of remote services
  • Traditional supercomputer processing of
    applications
  • Sufficiently low complexity
  • Move process to the data source
  • Metacomputing execution of remote applications
  • Traditional digital library service

25
Computer Architectures
  • Processor in memory
  • Do computations within memory
  • Complexity of supported operations
  • Commodity processors
  • L2 caches
  • L3 caches
  • Parallel computers
  • Memory bandwidth between nodes
  • MPP - shared memory
  • Cluster - distributed memory

26
Characterization Metric
  • Describe systems in terms of their balance
  • Optimal designs have r/(cb) = 1
  • Equivalent of Ohm's law
  • R = C B
  • Characterize applications in terms of their
    complexity
  • Operations per byte of data
  • C = R / B (see the sketch below)
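A minimal sketch of the balance metric, assuming R is measured in operations/sec and B in bytes/sec; the function name is illustrative, and the example reuses the 150 GF and 1 GB/sec figures from the "Peak Teraflops Data Flow Model" slide.

# Sketch: the "Ohm's law" balance metric R = C * B.  Given a machine's
# execution rate R (ops/sec) and a bandwidth B (bytes/sec), C = R / B is
# the application complexity (ops/byte) at which the machine is balanced.
# Higher complexity is compute-bound on that link; lower is bandwidth-bound.

def balanced_complexity(R_ops_per_sec, B_bytes_per_sec):
    """Operations per byte at which compute and data movement balance."""
    return R_ops_per_sec / B_bytes_per_sec

# Example: a 150 GFlops engine fed by a 1 GB/sec local disk.
C_balance = balanced_complexity(150e9, 1e9)
print(f"Balanced at about {C_balance:.0f} operations per byte")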

27
Second Example
  • Inclusion of latency (time for process to start)
    and overhead (time to execute communication
    protocol)
  • Illustrate with combined optimization of use of
    network and CPU

28
Optimizing Use of Resources
  • Compare time needed to do calculations with time
    needed to access data over a network
  • Time spent using a CPU
  • Execution time + protocol processing time
  • = Cc Sc / Rc + Cp St / Rp
  • Where
  • St = size of transmitted data (bytes)
  • Sc = size of application data (bytes)
  • Cc = number of operations per byte of application
    data
  • Cp = number of operations per byte to process the
    protocol
  • Rc = execution rate of the application
  • Rp = execution rate of the protocol

29
Characterizing Latency
  • Time during which a network transmits data
  • Latency for initiating the transfer + transmission
    time
  • = L + St / B
  • Where
  • L is the round-trip latency at the speed of light
    (sec)
  • B is the bandwidth (bytes/sec)

30
Solve for Balanced System
  • CPU utilization time = Network utilization time
  • Solve for the transmission size as a function of
    Sc/St:
  • St = L B / [B Cp / Rp + (B Cc / Rc) (Sc / St) - 1]
  • A solution exists when
    Sc/St > (Rc / (B Cc)) (1 - B Cp / Rp)
  • and B Cp / Rp < 1 (see the sketch below)
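A minimal sketch that solves the balance condition above for St; since the condition is linear in St, it can be rearranged directly. The parameter names follow the definitions two slides back, and the example values are illustrative only.

# Sketch: solve the balance condition
#   Cc*Sc/Rc + Cp*St/Rp = L + St/B
# for the transmitted size St.  All names and example values are
# illustrative only.

def balanced_transfer_size(Sc, Cc, Rc, Cp, Rp, L, B):
    """St that makes CPU time equal network time, or None if impossible."""
    denom = Cp / Rp - 1.0 / B          # coefficient of St after rearranging
    rhs = L - Cc * Sc / Rc             # remaining constant terms
    if denom == 0.0:
        return None                     # protocol cost exactly offsets bandwidth
    St = rhs / denom
    return St if St > 0 else None       # a negative size means no balance point

# Example (illustrative): 1 MB of application data, 100 ops/byte at
# 1 GFlops, 5 ops/byte of protocol cost at 1 GFlops, 10 ms round-trip
# latency, 100 MB/s network.
St = balanced_transfer_size(Sc=1e6, Cc=100, Rc=1e9, Cp=5, Rp=1e9,
                            L=10e-3, B=100e6)
print("Balanced transfer size:", St, "bytes")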

31
Comparing Utilization of Resources
  • Network utilization
  • Un = Transmission time / (Transmission + latency)
  • = 1 / [1 + (L B / St)]
  • CPU utilization
  • Uc = Execution time / (Execution + Protocol
    processing)
  • = 1 / [1 + ((Cp Rc) / (Cc Rp)) (St / Sc)]
  • Define h = Sc / St

32
Comparing Efficiencies
[Plot: utilization U-cpu and U-network versus h = S-compute / S-transmit]
33
Crossover Point
  • When the utilization of bandwidth and execution
    resources is balanced:
  • 1 / [1 + (L B / St)] = 1 / [1 + ((Cp Rc) / (Cc Rp)) / h]
  • For the optimal St, solve for h = Sc/St, and find
  • h = (Rc Cp / (2 Rp Cc)) [sqrt(1 + 4 Rp / (Cp B)) - 1]
  • For small B Cp / Rp
  • h ≈ Rc / (Cc B), or St / B ≈ Sc Cc / Rc
  • i.e., transmission time ≈ execution time (see the
    sketch below)
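A numerical sketch of the crossover, using the utilization definitions from the "Comparing Utilization of Resources" slide with St = Sc / h, so that both utilizations are functions of h; parameter names and example values are illustrative only.

# Sketch: locate the crossover of the two utilizations numerically, using
# the definitions above with St = Sc / h:
#   U_network = 1 / (1 + L*B/St)
#   U_cpu     = 1 / (1 + (Cp*Rc)/(Cc*Rp) * (St/Sc))
# All parameter names and example values are illustrative.

def utilizations(h, Sc, L, B, Cc, Rc, Cp, Rp):
    St = Sc / h
    u_net = 1.0 / (1.0 + L * B / St)
    u_cpu = 1.0 / (1.0 + (Cp * Rc) / (Cc * Rp) * (St / Sc))
    return u_net, u_cpu

params = dict(Sc=1e6, L=10e-3, B=100e6, Cc=100, Rc=1e9, Cp=5, Rp=1e9)

# Scan h and report where the curves cross (U_cpu rises, U_network falls).
crossover = None
for i in range(1, 100000):
    h = i / 1000.0
    u_net, u_cpu = utilizations(h, **params)
    if u_cpu >= u_net:
        crossover = h
        break
print("Crossover near h =", crossover)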

34
Application Summary
  • Optimal application for a given architecture:
  • B Cc / Rc = 1
  • (Bytes/sec) (Operations/byte) / (Operations/sec) = 1
  • Cc = Rc / B
  • Also need the cost of network utilization to be
    small:
  • B Cp / Rp < 1
  • And the amount of data transmitted proportional to
    the latency:
  • St = L B / [B Cp / Rp + (B Cc / Rc) (Sc / St) - 1]

35
Further Information
http://www.npaci.edu/DICE