Title: Benchmarks for Parallel Systems
1. Benchmarks for Parallel Systems
- Sources/Credits
- Performance of Various Computers Using Standard Linear Equations Software, Jack Dongarra, University of Tennessee, Knoxville TN, 37996, Computer Science Technical Report Number CS-89-85, April 8, 2004, http://www.netlib.org/benchmark/performance.ps
- http://www.top500.org
- FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
- Courtesy: Jack Dongarra (Top500), http://www.top500.org
- The LINPACK Benchmark: Past, Present, and Future, Jack Dongarra, Piotr Luszczek, and Antoine Petitet
- NAS Parallel Benchmarks, http://www.nas.nasa.gov/Software/NPB/
2. LINPACK (Dongarra 1979)
- Solves a dense system of linear equations
- The benchmark first appeared in the users' guide for the LINPACK package
- LINPACK: 1979
- Three variants: N = 100 benchmark, N = 1000 benchmark, Highly Parallel Computing benchmark
3. LINPACK benchmark
- Implemented on top of BLAS1
- 2 main operations: DGEFA (Gaussian elimination, O(n^3)) and DGESL (solves Ax = b, O(n^2))
- Major operation (97%): DAXPY, y = y + a*x
- About n^3/3 + n^2 multiply-add steps of this form, 2 flops each. Hence 2n^3/3 + 2n^2 flops (approx.)
- 64-bit floating point arithmetic
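The operation count above can be checked with a small sketch. This is a pure-Python illustration of column-oriented Gaussian elimination built on a DAXPY-style update, not the Fortran LINPACK source; the function names mirror DGEFA/DGESL only for readability, and the `demo` helper and its random test matrix are assumptions added here.

```python
import random

def daxpy(a, x, y, lo, hi):
    """BLAS-1 style update y[i] = y[i] + a*x[i] for lo <= i < hi.

    Returns the number of multiply-adds performed (used only for counting).
    """
    for i in range(lo, hi):
        y[i] += a * x[i]
    return hi - lo

def dgefa(cols):
    """Factor a matrix in place by Gaussian elimination with partial pivoting.

    `cols` is a list of columns (column-major storage).
    Returns (pivots, multiply_adds). Assumes the matrix is nonsingular.
    """
    n = len(cols)
    pivots, madds = [], 0
    for k in range(n - 1):
        ck = cols[k]
        p = max(range(k, n), key=lambda i: abs(ck[i]))  # pivot row
        pivots.append(p)
        ck[k], ck[p] = ck[p], ck[k]
        for i in range(k + 1, n):
            ck[i] = -ck[i] / ck[k]            # store negated multipliers
        for j in range(k + 1, n):             # update each trailing column
            cj = cols[j]
            cj[k], cj[p] = cj[p], cj[k]
            madds += daxpy(cj[k], ck, cj, k + 1, n)
    return pivots, madds

def dgesl(cols, pivots, b):
    """Solve A x = b in place using the factors from dgefa.

    On return b holds x; the return value is the multiply-add count.
    """
    n = len(cols)
    madds = 0
    for k in range(n - 1):                    # forward elimination
        p = pivots[k]
        b[k], b[p] = b[p], b[k]
        madds += daxpy(b[k], cols[k], b, k + 1, n)
    for k in range(n - 1, -1, -1):            # back substitution
        b[k] /= cols[k][k]
        madds += daxpy(-b[k], cols[k], b, 0, k)
    return madds

def demo(n=100, seed=1):
    """Return (max error, counted multiply-adds, n^3/3 + n^2)."""
    rng = random.Random(seed)
    cols = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    # Right-hand side = row sums, so the exact solution is all ones.
    b = [sum(cols[j][i] for j in range(n)) for i in range(n)]
    pivots, m1 = dgefa(cols)
    m2 = dgesl(cols, pivots, b)
    err = max(abs(v - 1.0) for v in b)
    return err, m1 + m2, n**3 / 3 + n**2
```

Calling `demo(100)` returns the solution error together with the counted multiply-adds and the approximate formula, which agree to within a few percent at this size. Divisions are not counted here.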
4. LINPACK
- N = 100: 100x100 system of equations. No change in code. The user supplies a timing routine called SECOND; only compiler optimizations are allowed
- N = 1000: 1000x1000 system. The user can implement any code, but it should provide the required accuracy: Towards Peak Performance (TPP). The driver program always uses 2n^3/3 + 2n^2 as the operation count
- Highly Parallel Computing benchmark: any software, and the matrix size can be chosen. Used in Top500
- Based on 64-bit floating point arithmetic
5. LINPACK
- 100x100: inner loop optimization
- 1000x1000: three-loop/whole program optimization
- Scalable parallel program: largest problem that can fit in memory
- Template of Linpack code
- - Generate
- - Solve
- - Check
- - Time
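The generate/solve/check/time template can be sketched as a tiny driver. This is a minimal pure-Python illustration, not the actual benchmark driver: the helper names, the random matrix, the choice of right-hand side (row sums, so the exact solution is all ones) and the error check are assumptions made for this example. The rate uses the nominal 2n^3/3 + 2n^2 operation count regardless of what the solver really executes.

```python
import random
import time

def generate(n, seed=0):
    """Random dense matrix A and right-hand side b chosen so that x = (1, ..., 1)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]
    return a, b

def solve(a, b):
    """Gaussian elimination with partial pivoting; overwrites a and b, returns x."""
    n = len(a)
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]               # row interchange
        b[k], b[p] = b[p], b[k]
        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            m = row_i[k] / row_k[k]
            for j in range(k + 1, n):
                row_i[j] -= m * row_k[j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):            # back substitution
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

def check(x):
    """Largest deviation from the known solution of all ones."""
    return max(abs(v - 1.0) for v in x)

def run(n=100):
    """Generate, solve, check and time one problem of size n."""
    a, b = generate(n)
    t0 = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - t0
    flops = 2 * n**3 / 3 + 2 * n**2           # nominal operation count
    return {"n": n, "error": check(x), "seconds": elapsed,
            "mflops": flops / elapsed / 1e6}
```

`run(100)` returns a dictionary with the problem size, the solution error, the elapsed time and the derived Mflop/s figure; only the solve step is timed.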
6. HPL (Implementation of HPLinpack Benchmark)
7. HPL Algorithm
- 2-D block-cyclic data distribution
- Right-looking LU
- Panel factorization: various options
- - Crout, left or right-looking recursive variants based on matrix multiply
- - Number of sub-panels
- - recursive stopping criteria
- - pivot search and broadcast by binary-exchange
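The 2-D block-cyclic distribution listed above maps each matrix entry to a process in a P x Q grid. A minimal sketch of the usual index arithmetic follows; the block size `nb`, the grid shape and the row-major rank numbering are illustrative assumptions, not values taken from HPL.

```python
def owner(i, j, nb, P, Q):
    """Process-grid coordinates (p, q) that own global entry (i, j).

    Blocks of size nb x nb are dealt out cyclically along both dimensions.
    """
    return (i // nb) % P, (j // nb) % Q

def local_index(g, nb, nprocs):
    """Local index of global index g along one dimension of the grid."""
    block = g // nb                           # global block number
    return (block // nprocs) * nb + g % nb

def layout(n, nb, P, Q):
    """Owner rank (p * Q + q, an assumed numbering) of every entry of an n x n matrix."""
    grid = []
    for i in range(n):
        row = []
        for j in range(n):
            p, q = owner(i, j, nb, P, Q)
            row.append(p * Q + q)
        grid.append(row)
    return grid
```

Printing `layout(8, 2, 2, 2)` row by row shows the repeating 2 x 2 pattern of blocks. The cyclic dealing is what keeps the work balanced as the trailing matrix shrinks during a right-looking factorization.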
8. HPL algorithm
- Panel broadcast
- -
- Update of trailing matrix
- - look-ahead pipeline
- Validity check
- - scaled residual should be O(1)
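The O(1) validity check is usually phrased as a scaled residual. A minimal sketch, assuming the form ||Ax - b||_inf / (eps * (||A||_inf * ||x||_inf + ||b||_inf) * n); the exact norms, the definition of eps and the pass threshold used by a particular HPL version are not given in these slides, so the default `threshold=16.0` below is an assumption.

```python
import sys

def inf_norm_vec(v):
    """Infinity norm of a vector: largest absolute entry."""
    return max(abs(t) for t in v)

def inf_norm_mat(a):
    """Infinity norm of a matrix: largest absolute row sum."""
    return max(sum(abs(t) for t in row) for row in a)

def scaled_residual(a, x, b):
    """Residual of A x = b scaled by machine epsilon, the norms and n."""
    n = len(b)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    eps = sys.float_info.epsilon
    denom = eps * (inf_norm_mat(a) * inf_norm_vec(x) + inf_norm_vec(b)) * n
    return inf_norm_vec(r) / denom

def passes(a, x, b, threshold=16.0):
    """True when the scaled residual is below the (assumed) threshold."""
    return scaled_residual(a, x, b) < threshold
```

A correctly computed solution gives a value of order one or smaller, while a wrong answer gives a value many orders of magnitude larger, which is why a fixed threshold works across problem sizes.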
9. Top500 (www.top500.org)
- Top500 list started in 1993
- Published twice a year: June and November
- Top500 gives Nmax, Rmax, N1/2, Rpeak
10. TOP500 list: Data shown
- Manufacturer: Manufacturer or vendor
- Computer: Type indicated by manufacturer or vendor
- Installation Site: Customer
- Location: Location and country
- Year: Year of installation/last major update
- Installation Type: Academic, Research, Industry, Vendor, Classified, Government
- Installation Area: e.g. Research Energy / Industry Finance
- Processors: Number of processors
- Rmax: Maximal LINPACK performance achieved
- Rpeak: Theoretical peak performance
- Nmax: Problem size for achieving Rmax
- N1/2: Problem size for achieving half of Rmax
- Nworld: Position within the TOP500 ranking
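Rmax, Nmax and N1/2 come from a sweep over problem sizes. The sketch below shows one way to read them off a list of measured (n, rate) pairs; the function name is hypothetical, the numbers in the usage example are made up purely for illustration, and taking the smallest measured size that reaches Rmax/2 is only an approximation of N1/2.

```python
def summarize(runs):
    """Extract Nmax, Rmax and an approximate N1/2 from (n, rate) measurements.

    runs: list of (problem size, achieved rate) pairs.
    """
    nmax, rmax = max(runs, key=lambda r: r[1])          # best run
    n_half = min(n for n, rate in runs if rate >= rmax / 2)
    return {"Nmax": nmax, "Rmax": rmax, "N1/2": n_half}

# Synthetic numbers, not measurements from any real system.
example = [(1000, 1.0), (2000, 2.5), (4000, 4.0), (8000, 5.0), (16000, 4.8)]
summary = summarize(example)
```

For the synthetic sweep above, `summary` reports the size and rate of the best run and the smallest size whose rate is at least half of it.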
12. India and Top 500
Rank | Site | System | Vendor | Processors | Rmax | Rpeak
111 | Geoscience (B), India | BladeCenter HS20 Cluster, Xeon EM64T 3.4 GHz, Gig-Ethernet | IBM | 1024 | 3755 | 6963.2
204 | Semiconductor Company (L), India | eServer, Opteron 2.6 GHz, GigEthernet | IBM | 1024 | 2791 | 5324.8
231 | Semiconductor Company (K), India | xSeries x336 Cluster, Xeon EM64T 3.6 GHz, Gig-Ethernet | IBM | 730 | 2676.88 | 5256
293 | Institute of Genomics and Integrative Biology, India | Cluster Platform 3000 DL140G2, Xeon 3.6 GHz, Infiniband | Hewlett-Packard | 576 | 2156 | 4147.2
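The Rpeak column of the table above can be reproduced from the processor count and clock rate. The sketch assumes 2 floating-point operations per cycle per processor and rates in GFlop/s; neither is stated in the slides, but under that assumption the products match the listed Rpeak values. The ratio Rmax/Rpeak is the LINPACK efficiency.

```python
# (rank, processors, clock in GHz, Rmax, listed Rpeak), copied from the table.
systems = [
    (111, 1024, 3.4, 3755.0, 6963.2),
    (204, 1024, 2.6, 2791.0, 5324.8),
    (231, 730, 3.6, 2676.88, 5256.0),
    (293, 576, 3.6, 2156.0, 4147.2),
]

FLOPS_PER_CYCLE = 2  # assumption, chosen because it reproduces the listed Rpeak

def rpeak(processors, clock_ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak: processors x clock rate x flops per cycle."""
    return processors * clock_ghz * flops_per_cycle

def efficiency(rmax, rpeak_value):
    """Fraction of theoretical peak achieved on LINPACK."""
    return rmax / rpeak_value

def report():
    """Return (rank, computed Rpeak, listed Rpeak, efficiency) per system."""
    rows = []
    for rank, procs, ghz, rmax, listed in systems:
        rows.append((rank, rpeak(procs, ghz), listed, efficiency(rmax, listed)))
    return rows
```

For these four systems the efficiency comes out between roughly 51% and 54% of peak.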
15. NAS Parallel Benchmarks - NPB
- Also for evaluation of supercomputers
- A set of 8 programs from CFD
- 5 kernels, 3 pseudo applications
- NPB 1: Original benchmarks
- NPB 2: NAS's MPI implementation. NPB 2.4 Class D has more work and more I/O
- NPB 3: based on OpenMP, HPF, Java
- GridNPB3: for computational grids
- NPB 3 multi-zone: for hybrid parallelism
16. NPB 1.0 (March 1994)
- Defines class A and class B versions
- Paper-and-pencil algorithmic specifications
- Generic benchmarks, as compared to the MPI-based LINPACK
- General rules for implementations: Fortran 90 or C, 64-bit arithmetic, etc.
- Sample implementations provided
17. Kernel Benchmarks
- EP: embarrassingly parallel
- MG: multigrid. Regular communication
- CG: conjugate gradient. Irregular long-distance communication
- FT: a 3-D PDE solved using FFTs. Rigorous test of long-distance communication
- IS: large integer sort
- Detailed rules regarding
- - brief statement of the problem
- - algorithm to be used
- - validation of results
- - where to insert timing calls
- - method for generating random numbers
- - submission of results
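Because the rules fix the random number method, every implementation generates the same data. The sketch below assumes the linear congruential generator usually quoted for NPB (multiplier 5^13, modulus 2^46) and the seed 271828183 usually quoted for EP; these constants are recalled from the specification rather than taken from these slides. The tally loosely follows the EP idea of turning uniform pairs into Gaussian pairs and counting them in square annuli; the function names are hypothetical.

```python
import math

A = 5 ** 13          # multiplier (assumed from the NPB specification)
MOD = 2 ** 46        # modulus (assumed)
SEED = 271828183     # seed commonly quoted for EP (assumed)

def lcg_next(x):
    """Advance the generator; returns (new state, uniform deviate in (0, 1))."""
    x = (A * x) % MOD
    return x, x / MOD

def ep_tally(pairs, seed=SEED, nbins=10):
    """Count accepted Gaussian pairs by l = floor(max(|X|, |Y|)).

    Each uniform pair is mapped to (-1, 1)^2, kept if it lies inside the
    unit circle, and transformed with the polar method.
    """
    counts = [0] * nbins
    x = seed
    for _ in range(pairs):
        x, u1 = lcg_next(x)
        x, u2 = lcg_next(x)
        xj, yj = 2.0 * u1 - 1.0, 2.0 * u2 - 1.0
        t = xj * xj + yj * yj
        if 0.0 < t <= 1.0:
            f = math.sqrt(-2.0 * math.log(t) / t)
            l = int(max(abs(xj * f), abs(yj * f)))
            if l < nbins:
                counts[l] += 1
    return counts
```

Since each pair depends only on the generator state, the stream can be split into independent chunks, which is what makes the kernel embarrassingly parallel. This serial sketch does not show the splitting.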
18. Pseudo applications / Synthetic CFDs
- Benchmark 1: perform a few iterations of the approximate factorization algorithm (BT)
- Benchmark 2: perform a few iterations of the diagonal form of the approximate factorization algorithm (SP)
- Benchmark 3: perform a few iterations of SSOR (LU)
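To make the SSOR iteration named above concrete, here is a minimal sketch of symmetric successive over-relaxation on a small dense system. It is not the NPB LU code: the benchmark applies the scheme to a much larger system from a CFD discretization, whereas this example uses a 1-D Poisson matrix as a stand-in, and the relaxation factor is an arbitrary illustrative choice.

```python
def ssor_sweep(a, b, x, omega):
    """One SSOR iteration on A x = b: a forward SOR sweep, then a backward one."""
    n = len(b)
    for order in (range(n), range(n - 1, -1, -1)):
        for i in order:
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (1.0 - omega) * x[i] + omega * (b[i] - s) / a[i][i]
    return x

def residual_norm(a, b, x):
    """Infinity norm of A x - b."""
    n = len(b)
    return max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i])
               for i in range(n))

def poisson_1d(n):
    """Tridiagonal matrix with 2 on the diagonal and -1 beside it (stand-in problem)."""
    return [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]

def solve_ssor(n=8, omega=1.2, sweeps=400):
    """Run a fixed number of SSOR sweeps from a zero initial guess."""
    a = poisson_1d(n)
    b = [1.0] * n
    x = [0.0] * n
    for _ in range(sweeps):
        ssor_sweep(a, b, x, omega)
    return x, residual_norm(a, b, x)
```

The benchmark itself runs only a few iterations, since its purpose is to time the sweeps rather than to converge; the sketch runs many so that the residual visibly goes to zero.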
19. Class A and Class B
[Figure panels: Class A, Sample Code, Class B]
20. NPB 2.0 (1995)
- MPI and Fortran 77 implementations
- 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT)
- Class C: bigger size
- Benchmark rules: 0%, up to 5%, or more than 5% change in source code
21. NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003)
- EP and IS added
- FT rewritten
- NPB 2.4: class D and rationale for class D sizes
- 2.4 I/O: a new benchmark problem based on BT (BTIO) to test the output capabilities
- An MPI implementation of the same (MPI-IO): different options, such as using collective buffering or not
22. Thank You!