1
Benchmarks for Parallel Systems
  • Sources/Credits:
  • Performance of Various Computers Using Standard Linear Equations Software, Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004. http://www.netlib.org/benchmark/performance.ps
  • Top500: http://www.top500.org (courtesy Jack Dongarra)
  • LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
  • The LINPACK Benchmark: Past, Present, and Future, Jack Dongarra, Piotr Luszczek, and Antoine Petitet
  • NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/

2
LINPACK (Dongarra, 1979)
  • Dense systems of linear equations
  • Initially distributed as part of the user's guide for the LINPACK package
  • LINPACK package released in 1979
  • Three variants: the N=100 benchmark, the N=1000 benchmark, and the Highly Parallel Computing benchmark

3
LINPACK benchmark
  • Implemented on top of BLAS 1
  • Two main operations: DGEFA (Gaussian elimination, O(n³)) and DGESL (solving Ax = b using the factorization, O(n²))
  • The major operation (97% of the time) is DAXPY: y = y + a·x
  • DAXPY is called n³/3 + n² times, giving approximately 2n³/3 + 2n² flops in total (see the sketch below)
  • 64-bit floating point arithmetic
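
A minimal sketch in C of the dominant kernel and the fixed operation count (illustrative only; the benchmark itself is Fortran, and the names here are not from its source):

    #include <stdio.h>

    /* DAXPY: y = y + a*x, the double-precision kernel that accounts
       for roughly 97% of the LINPACK benchmark's run time. */
    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    int main(void) {
        double x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
        daxpy(4, 2.0, x, y);                 /* y becomes {2, 4, 6, 8} */

        int n = 100;                         /* the N = 100 case */
        /* the fixed operation count credited by the benchmark */
        double flops = 2.0 * n * n * n / 3.0 + 2.0 * n * n;
        printf("y[3] = %.0f, approx. flops for n = %d: %.0f\n",
               y[3], n, flops);
        return 0;
    }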

4
LINPACK
  • N=100: a 100×100 system of equations. No changes to the code are allowed; the user is only asked to supply a timing routine called SECOND. No hand optimization, compiler optimizations only
  • N=1000: a 1000×1000 system. The user may implement any code, as long as it achieves the required accuracy: Towards Peak Performance (TPP). The driver program always credits 2n³/3 + 2n² operations
  • Highly Parallel Computing benchmark: any software may be used and the matrix size can be chosen. Used in the Top500
  • All based on 64-bit floating point arithmetic

5
LINPACK
  • 100×100: inner-loop optimization
  • 1000×1000: three-loop / whole-program optimization
  • Scalable parallel program: the largest problem that can fit in memory
  • Template of the LINPACK code (sketched below):
  • Generate
  • Solve
  • Check
  • Time
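
A compact, self-contained sketch of that four-step template, using naive Gaussian elimination in C rather than the actual LINPACK Fortran; all names and the random matrix are illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>
    #include <time.h>

    #define N 100

    static double a[N][N], a0[N][N], b[N], b0[N];

    int main(void) {
        /* Generate: random A and b, chosen so the exact solution is all ones */
        srand(1);
        for (int i = 0; i < N; i++) {
            b[i] = 0.0;
            for (int j = 0; j < N; j++) {
                a[i][j] = (double)rand() / RAND_MAX - 0.5;
                b[i] += a[i][j];
            }
        }
        memcpy(a0, a, sizeof a);   /* keep copies for the residual check */
        memcpy(b0, b, sizeof b);

        /* Solve: Gaussian elimination with partial pivoting */
        clock_t t0 = clock();
        for (int k = 0; k < N; k++) {
            int p = k;
            for (int i = k + 1; i < N; i++)
                if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
            for (int j = 0; j < N; j++) {   /* swap rows k and p */
                double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
            for (int i = k + 1; i < N; i++) {
                double m = a[i][k] / a[k][k];
                for (int j = k; j < N; j++)
                    a[i][j] -= m * a[k][j]; /* DAXPY-style update */
                b[i] -= m * b[k];
            }
        }
        for (int i = N - 1; i >= 0; i--) {  /* back substitution into b */
            for (int j = i + 1; j < N; j++) b[i] -= a[i][j] * b[j];
            b[i] /= a[i][i];
        }
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* Check: residual max |A0*x - b0| should be tiny */
        double res = 0.0;
        for (int i = 0; i < N; i++) {
            double r = -b0[i];
            for (int j = 0; j < N; j++) r += a0[i][j] * b[j];
            if (fabs(r) > res) res = fabs(r);
        }

        /* Time: report MFLOPS using the fixed 2n^3/3 + 2n^2 count
           (timing is coarse at this small problem size) */
        double flops = 2.0 * N * N * N / 3.0 + 2.0 * N * N;
        double mflops = secs > 0.0 ? flops / secs / 1.0e6 : 0.0;
        printf("residual %.2e, time %.4f s, %.1f MFLOPS\n", res, secs, mflops);
        return 0;
    }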

6
HPL (Implementation of HPLinpack Benchmark)
7
HPL Algorithm
  • 2-D block-cyclic data distribution (see the sketch below)
  • Right-looking LU factorization
  • Panel factorization: various options
  • - Crout, left-looking, or right-looking recursive variants based on matrix multiply
  • - number of sub-panels
  • - recursive stopping criteria
  • - pivot search and broadcast by binary exchange
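
A small illustration of the ownership rule behind a 2-D block-cyclic layout: blocks of nb consecutive rows/columns are dealt out cyclically over a P x Q process grid. The `owner` helper and the tiny grid below are illustrative, not HPL's actual indexing code:

    #include <stdio.h>

    /* Owner of global matrix entry (i, j) under a 2-D block-cyclic
       distribution with block size nb on a P x Q process grid. */
    void owner(int i, int j, int nb, int P, int Q, int *pr, int *pc) {
        *pr = (i / nb) % P;   /* process row: block row modulo P */
        *pc = (j / nb) % Q;   /* process col: block col modulo Q */
    }

    int main(void) {
        int nb = 2, P = 2, Q = 3;   /* tiny example grid */
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 12; j++) {
                int pr, pc;
                owner(i, j, nb, P, Q, &pr, &pc);
                printf("(%d,%d) ", pr, pc);   /* cyclic wrap is visible */
            }
            printf("\n");
        }
        return 0;
    }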

8
HPL algorithm
  • Panel broadcast: various options
  • Update of the trailing matrix
  • - look-ahead pipeline
  • Validity check
  • - the scaled residual should be O(1) (see the sketch below)
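
A sketch of one common form of the scaled residual used for such a check; this is an assumed formulation for illustration, not necessarily HPL's exact formula or code:

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    /* Scaled residual of a computed solution x for A x = b; the run
       is accepted if this quantity is O(1). A is n x n, row-major. */
    double scaled_residual(int n, const double *A, const double *x,
                           const double *b) {
        double rnorm = 0.0, anorm = 0.0, xnorm = 0.0, bnorm = 0.0;
        for (int i = 0; i < n; i++) {
            double r = -b[i], rowsum = 0.0;
            for (int j = 0; j < n; j++) {
                r += A[i * n + j] * x[j];        /* (A x - b)_i */
                rowsum += fabs(A[i * n + j]);
            }
            if (fabs(r) > rnorm) rnorm = fabs(r);
            if (rowsum > anorm) anorm = rowsum;  /* ||A||_inf */
            if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
            if (fabs(b[i]) > bnorm) bnorm = fabs(b[i]);
        }
        return rnorm / (DBL_EPSILON * (anorm * xnorm + bnorm) * n);
    }

    int main(void) {
        double A[4] = {2, 1, 1, 3}, b[2] = {3, 4}, x[2] = {1, 1};
        printf("scaled residual: %.3f\n", scaled_residual(2, A, x, b));
        return 0;
    }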

9
Top500 (www.top500.org)
  • Top500: started in 1993
  • Updated twice a year, in June and November
  • Top500 reports Nmax, Rmax, N1/2, and Rpeak for each system (see the sketch below)
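
These quantities are tied together by the standard LINPACK operation count; a small illustration (the Nmax and run time below are made-up numbers, purely for the arithmetic):

    #include <stdio.h>

    /* Rmax is the rate achieved at problem size Nmax using the fixed
       2n^3/3 + 2n^2 operation count; N1/2 is the problem size at
       which half of Rmax is reached. Values here are hypothetical. */
    int main(void) {
        double nmax = 1.0e6;       /* hypothetical Nmax */
        double seconds = 3600.0;   /* hypothetical measured run time */
        double flops = 2.0 * nmax * nmax * nmax / 3.0
                     + 2.0 * nmax * nmax;
        printf("Rmax ~ %.0f GFlops\n", flops / seconds / 1e9);
        return 0;
    }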

10
TOP500 list: data shown
  • Manufacturer: manufacturer or vendor
  • Computer: type indicated by manufacturer or vendor
  • Installation Site: customer
  • Location: location and country
  • Year: year of installation / last major update
  • Installation Type: academic, research, industry, vendor, classified, government
  • Installation Area: e.g. Research: Energy / Industry: Finance
  • Processors: number of processors
  • Rmax: maximal LINPACK performance achieved
  • Rpeak: theoretical peak performance
  • Nmax: problem size for achieving Rmax
  • N1/2: problem size for achieving half of Rmax
  • Nworld: position within the TOP500 ranking

12
India and Top 500 (Rmax and Rpeak in GFlops)

  Rank  Site                                           System / Vendor                                                   Processors  Rmax     Rpeak
  111   Geoscience (B), India                          BladeCenter HS20 Cluster, Xeon EM64T 3.4 GHz, Gig-Ethernet / IBM  1024        3755     6963.2
  204   Semiconductor Company (L), India               eServer, Opteron 2.6 GHz, Gig-Ethernet / IBM                      1024        2791     5324.8
  231   Semiconductor Company (K), India               xSeries x336 Cluster, Xeon EM64T 3.6 GHz, Gig-Ethernet / IBM      730         2676.88  5256
  293   Institute of Genomics and Integrative          Cluster Platform 3000 DL140G2, Xeon 3.6 GHz, Infiniband /         576         2156     4147.2
        Biology, India                                 Hewlett-Packard
15
NAS Parallel Benchmarks - NPB
  • Also used for the evaluation of supercomputers
  • A set of 8 programs drawn from computational fluid dynamics (CFD)
  • 5 kernels, 3 pseudo-applications
  • NPB 1: the original benchmarks
  • NPB 2: NAS's MPI implementation; NPB 2.4 Class D has more work and more I/O
  • NPB 3: OpenMP, HPF, and Java versions
  • GridNPB3: for computational grids
  • NPB 3 multi-zone: for hybrid parallelism

16
NPB 1.0 (March 1994)
  • Defines Class A and Class B versions
  • "Paper and pencil" algorithmic specifications
  • Generic benchmarks, as compared to the MPI-based LINPACK
  • General rules for implementations: Fortran 90 or C, 64-bit arithmetic, etc.
  • Sample implementations provided

17
Kernel Benchmarks
  • EP: embarrassingly parallel (see the sketch below)
  • MG: multigrid; regular communication
  • CG: conjugate gradient; irregular long-distance communication
  • FT: a 3-D PDE solved with FFTs; a rigorous test of long-distance communication
  • IS: large integer sort
  • Detailed rules regarding:
  • - a brief statement of the problem
  • - the algorithm to be used
  • - validation of results
  • - where to insert timing calls
  • - the method for generating random numbers
  • - submission of results
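
An EP-style skeleton in C with OpenMP: fully independent work on per-thread random streams, with communication only in one final reduction. The real EP kernel generates pairs of Gaussian deviates and tallies them in annuli; this simplified Monte Carlo pi estimate merely mirrors that structure (rand_r is POSIX):

    #define _POSIX_C_SOURCE 200112L   /* for rand_r */
    #include <stdio.h>
    #include <stdlib.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(void) {
        const long n = 10000000;      /* samples; all independent */
        long hits = 0;

    #pragma omp parallel reduction(+:hits)
        {
            unsigned int seed = 271828;
    #ifdef _OPENMP
            /* give each thread its own stream */
            seed += 997u * (unsigned)omp_get_thread_num();
    #endif
    #pragma omp for
            for (long i = 0; i < n; i++) {
                double x = (double)rand_r(&seed) / RAND_MAX;
                double y = (double)rand_r(&seed) / RAND_MAX;
                if (x * x + y * y <= 1.0)
                    hits++;           /* only shared result: one reduction */
            }
        }
        printf("pi ~ %.5f\n", 4.0 * (double)hits / (double)n);
        return 0;
    }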

18
Pseudo applications / Synthetic CFDs
  • Benchmark 1: perform a few iterations of the approximate factorization algorithm (SP)
  • Benchmark 2: perform a few iterations of the diagonal form of the approximate factorization algorithm (BT)
  • Benchmark 3: perform a few iterations of SSOR (LU)

19
Class A and Class B
(Table of problem sizes for the Sample Code, Class A, and Class B versions; not reproduced in the transcript)
20
NPB 2.0 (1995)
  • MPI and Fortran 77 implementations
  • 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT)
  • Class C: bigger problem size
  • Benchmark rules: results classified by the amount of change to the source code (0%, ≤5%, >5%)

21
NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003)
  • EP and IS added
  • FT rewritten
  • NPB 2.4: Class D, with a rationale for the Class D sizes
  • 2.4 I/O: a new benchmark problem based on BT (BTIO) to test output capabilities
  • An MPI implementation of the same (MPI-IO), with different options: using collective buffering or not, etc. (see the sketch below)
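
A minimal sketch of the two I/O strategies such a benchmark compares: a collective write, which the MPI-IO layer may service with collective buffering, versus an independent write. The file name, buffer size, and offsets below are illustrative; compile with an MPI compiler such as mpicc:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1024;
        double buf[1024];
        for (int i = 0; i < n; i++) buf[i] = rank + i * 1e-6;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "btio.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_Offset off = (MPI_Offset)rank * n * sizeof(double);

        /* Collective: all ranks call together, so the library may
           merge requests into large contiguous writes. */
        MPI_File_write_at_all(fh, off, buf, n, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        /* Independent alternative: each rank writes on its own.
        MPI_File_write_at(fh, off, buf, n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
        */

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }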

22
Thank You!