Title: HPCS OSC Slides
1. HPCS Benchmarks and Test and Specification Environments
Ashok Krishnamurthy, Ohio Supercomputer Center
HPCS Workshop, SC05, 14 Nov. 2005
2. Outline of Talk
- HPCS Benchmarks
- HPC Challenge
- Synthetic Scalable Compact Applications (SSCAs)
- Overview of Benchmarks and Implementations
- HPCS Test Environment
- QMTest features
- HPCS specific additions
3. Contributors
- David Bader, Kamesh Madduri
- Piotr Luszczek
- John Gilbert, Viral Shah
4. HPCS Benchmark Spectrum
[Figure: spectrum of HPCS benchmarks; entries include SSCA 3, Signal Processing / Knowledge Formation]
5. HPC Challenge Benchmarks
- http://icl.cs.utk.edu/hpcc/
- Benchmarks
- HPL
- DGEMM
- STREAM
- PTRANS
- RandomAccess
- FFTE
- Comm. bandwidth and latency
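To make the flavor of these kernels concrete, here is a minimal sketch of two of them in Python. This is illustrative only, not the official C benchmark code: the array sizes are tiny, and the LCG constants in the RandomAccess-style updater are assumptions, not the generator the real benchmark specifies.

```python
# Illustrative sketches of two HPC Challenge kernels (not the official code).

def stream_triad(b, c, scalar):
    """STREAM triad: a[i] = b[i] + scalar * c[i] (memory-bandwidth bound)."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]

def random_access(table_bits, n_updates, seed=1):
    """RandomAccess-style kernel: XOR pseudo-random values into a table at
    pseudo-random locations (the real benchmark measures GUPS; the LCG
    constants here are placeholders, not the spec's generator)."""
    size = 1 << table_bits
    table = list(range(size))
    x = seed
    for _ in range(n_updates):
        x = (x * 1103515245 + 12345) & 0xFFFFFFFF  # simple LCG (an assumption)
        table[x & (size - 1)] ^= x
    return table
```

The two kernels stress opposite ends of the memory system: the triad streams contiguously, while RandomAccess defeats caching with scattered single-word updates.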
6. HPC Challenge
7. HPC Challenge Awards
- Class 1: Best Performance (4 awards)
- $500 Certificate
- Class 2: Most Productivity
- $1,500 Certificate
- 2005 HPCchallenge Award BOF: Tuesday, 11/15/05, 11:15 AM-12:15 PM, Room 602-604
- Sponsored by HPCWire
8. Synthetic Scalable Compact Applications: The Vision
- To bridge the gap between scalable synthetic kernel benchmarks and (non-scalable) real applications: an important future benchmarking tool
- Must be representative of actual workloads within an application while not being numerically rigorous
- memory access characteristics
- communications characteristics
- I/O characteristics
- etc.
- Will have no limits on distribution to vendors and universities
- Scalable Synthetic Compact Applications (SSCAs) will try to represent the wide spectrum of potential HPCS Mission Partner applications
9. SSCAs: The Goal
- Building on a motivation slide from Fred Johnson (15 January 2004)
[Figure: applications plotted by APP SIZE/COMPLEXITY vs. SYSTEM SIZE/COMPLEXITY, ranging from Micro BMKs through HPCS Compact Apps and Full Apps to NextGen Apps]
Identify which dimensions must be examined at full complexity and which can be examined at reduced scale, while providing understanding of both full applications today and future applications.
10. SSCA 1: Bioinformatics
Intent
- To develop a scalable synthetic compact application that has multiple analysis techniques (multiple kernels) identifying similarities between sequences of symbols
- Symbols can be identical, closely related, or entirely different
- A symbol in one sequence can match a gap in another
- Each of the five kernels is based on an application from bioinformatics, including
- Local alignment
- Searching for similarities
- Global alignment
- Multiple alignment
- Each kernel operates on either the original sequences, the results of the previous kernel, or both
- To be entirely integer and character based
- Except for incidental statistics
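The local-alignment kernel is in the spirit of Smith-Waterman dynamic programming. A minimal, integer-only score computation can be sketched as below; the scoring parameters (match, mismatch, gap) are placeholder assumptions, not values taken from the SSCA 1 specification.

```python
def local_alignment_score(s1, s2, match=5, mismatch=-3, gap=-4):
    """Smith-Waterman-style best local alignment score, integer arithmetic
    only. Scoring parameters are illustrative, not the SSCA 1 spec's values."""
    cols = len(s2) + 1
    prev = [0] * cols          # previous DP row
    best = 0
    for i in range(1, len(s1) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            # Local alignment: scores are floored at zero
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Note the kernel stays entirely integer-based, matching the slide's "integer and character based" constraint.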
11. SSCA 1 Status
Protein Alignment
- Written and Serial Executable Specification v0.6 has been released
- Components
- Scalable Data Generator
- Kernel 1: Local Alignment
- Kernel 2: Sequence Extraction
- Kernel 3: Sequence Matching
- Kernel 4: Global Alignment
- Kernel 5: Multiple Alignment
- Seeking comment/feedback from the community
[Figure: protein in 3D]
12. SSCA 2: Graph Analysis
Intent
- To develop a scalable synthetic compact application that has multiple analysis techniques (multiple kernels) accessing a single data structure representing a directed, asymmetric, weighted multigraph with no self loops
- In addition to a kernel to construct the graph from the input tuple list, there will be three computational kernels to operate on the graph
- Each of the kernels will require irregular access to the graph's data structures
- No single data layout will be optimal for all three computational kernels
- To be entirely integer and character based
- Except for statistics
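The graph-construction kernel described above can be sketched in a few lines. This is a minimal illustration under stated assumptions (tuples of the form (source, destination, weight)), not the specification's data layout, which is deliberately left open because no single layout suits all three computational kernels.

```python
from collections import defaultdict

def build_multigraph(tuples):
    """Kernel-1-style construction: build a directed, weighted multigraph
    from an input tuple list, discarding self loops. Parallel edges between
    the same vertex pair are kept, since the structure is a multigraph."""
    adj = defaultdict(list)        # vertex -> list of (neighbor, weight)
    for src, dst, weight in tuples:
        if src != dst:             # the spec excludes self loops
            adj[src].append((dst, weight))
    return adj
```

An adjacency-list layout like this favors per-vertex traversal; kernels that scan edges by weight or extract subgraphs may prefer an edge list or compressed sparse rows instead, which is exactly the tension the slide notes.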
13. SSCA 2 Status
- Written and Serial Executable Specification v1.0 has been released
- Over 1200 lines of well-commented MATLAB code
- Also works with Octave 2.9.0
- Carefully chosen functional breakdown, data structures, variable names, and comments
- Accompanying documentation
- Written specification and slides
- MANIFEST.txt: list of files with brief descriptions
- README.txt: installation and run-time instructions, code overview
- KNOWN_ISSUES.txt: known issues in the current release
- parallelization.txt: design notes on parallelization issues
14. Georgia Tech's Shared-Memory C (David A. Bader, Kamesh Madduri)
[Figure: execution times of the various kernels, and relative speedup plots]
15. Cray MTA-2 (John Feo)
- Shared-memory, multithreaded
- 1039 source lines (35 fewer than Bader's shared-memory version)
- Execution time
- 1M vertices, 1 processor: 32.47 s
- 1M vertices, 8 processors: 4.48 s
- Speedup of 7.24
- 3X faster on 1P than Bader's shared-memory version
- 7X faster on 8P than Bader's shared-memory version
16. More SSCA 2 Implementations
- MATLAB*P implementation
- Alan Edelman, John Gilbert, et al.
- Largest problem size solved: 67M vertices, 500M edges
- IBM X10 implementation (in progress)
- David A. Bader and Mehmet F. Su
- Uses a Java port of the SPRNG 2.0 pseudo-random number generation suite
- Enhanced X10 array-based adjacency list storage for graph structures
- Generic implementation for clusters
- Bader and Madduri
17. SSCA 2 Movie, from John Gilbert and Viral Shah
- URL: http://csc.cs.ucsb.edu/ssca/ssca.mov
18. OpenMP Contest
- The OpenMP ARB chose SSCA 2 as the basis for a programming contest because it was reasonably well specified and had two independent shared-memory implementations (Bader/Madduri and Feo)
- URL: http://www.openmp.org/drupal/sc05/omp-contest.htm
- First prize: $1000 plus a 60GB iPod
- Second prize: $500 plus a 4GB iPod nano
- Third prize: $250 plus a 1GB iPod shuffle
- Larry Meadows, for the OpenMP ARB (lawrence.f.meadows_at_intel.com)
- Prizes will be announced at the OpenMP BOF
- Wednesday 11/16/05, 5:15 PM - 6:45 PM
- Room 6A
19. SSCA 3: Sensor Processing, Knowledge Formation, and File I/O
Intent
- SSCA 3 focuses on two stages
- Front end: image processing and storage
- Back end: image retrieval and knowledge formation
- The two-stage structure is representative of many areas
- Medical imaging (e.g., tumor growth)
- Image many patients daily
- Later compare images of the same patient over time
- Astronomical image processing (e.g., monitoring supernovae)
- Image many regions of the sky daily
- Later compare images of a region over time
- Reconnaissance monitoring (e.g., enemy movement)
- Image many areas daily
- Later compare images of a given region over time
- The benchmark has a significant file I/O component
20. SSCA 3 Status
[Figure: SAR image from Kernel 1; system diagram]
- Written and Serial Executable Specification v0.7 has been released
- Components
- Synthetic Scalable Data Generator
- Kernel 1: SAR Image Formation
- Kernel 2: Image Storage
- Kernel 3: Image Retrieval
- Kernel 4: Detection
- Validation
- Seeking comment/feedback from the community
21. Benchmark Summary and Computational Challenges
Front-End Sensor Processing
- Pulse compression
- Polar interpolation
- FFT, IFFT (corner turn)
- Sequential store
Back-End Knowledge Formation
- Non-sequential retrieve
- Large and small I/O
- Large image difference
- Threshold
- Many small correlations on random pieces of a large image
- Scalable synthetic data generation
ISC has finished the v0.7 C version of SSCA 3.
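The back-end "difference and threshold" step can be sketched simply. This is an illustration of the idea only, not SSCA 3's Kernel 4: the real benchmark also runs many small correlations on random pieces of the large image, which is omitted here.

```python
def detect_changes(image_a, image_b, threshold):
    """Difference two registered images (given as nested lists of pixel
    values) and report coordinates whose absolute change exceeds a
    threshold. A toy stand-in for a detection step, not the SSCA 3 kernel."""
    hits = []
    for r, (row_a, row_b) in enumerate(zip(image_a, image_b)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if abs(a - b) > threshold:
                hits.append((r, c))
    return hits
```

The access pattern explains the I/O challenges listed above: the differencing pass streams whole large images, while follow-up correlations touch many small random pieces.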
22. Benchmark Implementations
- http://www.highproductivity.org
23. Test Environment Goals
- Model the relationship between application performance and developer effort for various HPC programming languages
- Develop an automated method of collecting software metrics and runtime data for various HPC benchmark codes
- Automatically generate a large number of tests based on relatively few user-specified parameters
[Figure: roles of the test environment: HPCS workflow model, QMTest HPC benchmark analysis, existing codes analysis, development time experiments]
24. Test Environment
- QMTest from CodeSourcery
- Open source software
- Work done as part of the HPCS effort is being folded into the mainstream QMTest release
25. QMTest Reporting
[Diagram: a benchmark code (e.g., the NAS Parallel Benchmarks) and a platform-specific context file feed QMTest, extended with an NPB test database, regexp parsing rules, and benchmark-database/NPB QMTest extensions; QMTest emits an XML test report, which xml2csv.py parses and collates into results that an R script plots]
26. Using QMTest for HPCS
- Step 1 (benchmark integrators): create a benchmark test database, including benchmark output parsing rules
- Step 2 (HPC system users): specify the platform configuration in a context file (compilers, libraries, etc.)
- Step 3: run the tests with QMTest, which generates an XML test report
- Step 4: automated analysis; a parsing script turns the report into CSV benchmark results for Excel, Matlab, R, etc.
27. QMTest Reporting
- QMTest report (excerpt from a 2500-line report)

  <result id="npb.ser.cg.s" kind="test" outcome="FAIL">
    <annotation name="ExecTest.stdout">
      Lines  Blank  Cmnts  NCSL  TPtoks
       1031    194    341   501    3252  /home/afunk/Projects/HPCS/NPB3.2/NPB3.2-SER/CG/cg.f (FORTRAN)
         97     11      0    86     495  /home/afunk/Projects/HPCS/NPB3.2/NPB3.2-SER/CG/globals.h (C)
         34      0      0    34     180  /home/afunk/Projects/HPCS/NPB3.2/NPB3.2-SER/CG/npbparams.h (C)
        131     11      0   120     675  ----- C ----- (2 files)
       1031    194    341   501    3252  ----- FORTRAN ----- (1 file)
       1162    205    341   621    3927  TOTAL (3 files)

- xml2csv output
  Test, NCSL
  npb.ser.bt.s, 2576
  npb.ser.cg.s, 621
  npb.ser.ep.s, 179
  npb.ser.ft.s, 625
  npb.ser.is.s,
  npb.ser.lu.s, 2612
  npb.ser.mg.s, 906
  npb.ser.sp.s, 2168
28. QMTest Reporting
- qmtest run npb.ser.s  (run tests to count NCSL)
- qmtest report -o npb.ser.s.xml results.qmr  (generate report)
- xml2csv npb.ser.s.xml > npb.ser.s.csv  (parse report to cull data)
- NCSL_benchmark.R  (plot data using an R script)
29. Test Environment Status
- QMTest can run multiple tests with varying parameters, e.g., all permutations of NPB (500 tests)
- QMTest can automate collection of software metrics and performance data: SLOC, complexity, runtime, profiling
- QMTest has been used successfully to test various HPC benchmark suites: NPB, HPC Challenge, SSCA 2
- Example parameter sets for the above are available
- The sclc tool has been modified to recognize and count SLOC for Ada, Assembly, Awk, C, C++, Eiffel, FORTRAN, Java, Lisp, MATLAB, Octave, Pascal, Perl, Tcl, ZPL, shell, make, ...
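The core of an NCSL count is easy to illustrate. The toy counter below handles only blank lines and full-line comments for a single comment prefix; the real sclc tool additionally handles per-language comment syntax, block comments, and token counting, none of which is reproduced here.

```python
def count_ncsl(source, comment_prefix="#"):
    """Toy NCSL (non-comment source lines) counter: skip blank lines and
    full-line comments. A simplified stand-in for what sclc computes."""
    ncsl = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefix):
            ncsl += 1
    return ncsl
```

For the cg.f example on slide 27, this kind of count is what produces the NCSL column (501 of 1031 total lines), though sclc's Fortran rules differ from the single-prefix rule used here.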
30. Test Environment: Planned Additions
- QMTest parallel enhancements
- GUI for parameterized tests
- Output processing to ease further analysis
- Token counting for selected languages