Title: OpenWLC: A Workload Characterization System
1. OpenWLC: A Workload Characterization System
- Hong Ong,
- Rajagopal Subramaniyan,
- R. Scott Studham
- ORNL
- July 19, 2005
2. Outline
- Background and Motivation
- Existing Tools and Services
- NWPerf
- OpenWLC
- Overview
- Goals
- Preliminary Design
- Timeline
3. Typical Machine Evaluation
- Determine a machine's applicability to a range of applications.
- Use synthetic codes and application kernels to capture the performance properties of:
  - Hardware architecture (e.g., NAS, SPEC, SPLASH).
  - Operating system components (e.g., lmbench).
  - Software systems and applications (e.g., NPB, HPCToolkit, Genesis, STSWM).
  - Various kinds of middleware.
- Work with vendors and system architects to improve the systems.
- Focus: Evaluation of emerging platforms to understand their benefits.
4. What do we want to know?
- Methods to answer the following:
  - What percentage of jobs use all that memory/disk?
  - What is the average job size?
  - How does job size impact the efficiency of the calculation?
  - Where is the bottleneck for the average run of a job?
  - What is the sustained performance for the average job?
  - What are the impacts of slow memory on the average user?
  - Could we have less storage on the next system if we had a shared memory system or a shared disk pool?
  - ...and more.
- What is the average user doing vs. what the box was designed for?
5. Workload Characterization
- Focus: Evaluation of deployed platforms to understand how they are used by average users.
  - How effectively is the box used, as opposed to what the box was designed for?
- Determine how existing platforms are utilized by applications and general users:
  - Efficiency of applications,
  - System utilization,
  - Average job sizes.
- Work with application developers, users, and manufacturers to decrease application runtime and increase resource utilization.
6. Most users do not utilize the full capability of the system.
- Most jobs use <25% of available memory (maximum available is 6-8 GB); large jobs use more memory.
- [Figure: memory footprint per node during FY04 on the 11.4 TF HPCS2 system at PNNL; percent of jobs vs. CPUs used.]
7. Aggregate Results
- 10% of the >256-CPU jobs have the CPU scheduled for idle >50% of the time.
- The median sustained performance for jobs over 256 CPUs is 3%.
- [Figures: sustained performance as a function of CPU count; busy cycles as a function of job size.]
8. Aggregate Results
- [Figure: stalled cycles as a function of job size.]
- Sum of all stalls due to:
  - BE_FLUSH_BUBBLE_ALL: branch misprediction flush or exception.
  - BE_EXE_BUBBLE_ALL: execution unit stalls.
  - BE_L1D_FPU_BUBBLE_ALL: stalls due to the L1D (L1 data cache) micropipeline or FPU (floating-point unit) micropipeline.
  - BE_RSE_BUBBLE_ALL: register stack engine (RSE) stalls.
  - BACK_END_BUBBLE_FE: front-end stalls.
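To make the counter list concrete, here is a minimal Python sketch of how a stalled-cycle breakdown can be derived from raw counter readings over one interval. The counter values and total-cycle count are illustrative, not HPCS2 measurements.

```python
# Illustrative breakdown of back-end stall counters into percentages of
# total cycles; all values here are made up, not measured data.
STALL_COUNTERS = {
    "BE_FLUSH_BUBBLE_ALL":   1.2e9,  # branch misprediction flush / exception
    "BE_EXE_BUBBLE_ALL":     3.8e9,  # execution unit stalls
    "BE_L1D_FPU_BUBBLE_ALL": 2.1e9,  # L1D / FPU micropipeline stalls
    "BE_RSE_BUBBLE_ALL":     0.3e9,  # register stack engine stalls
    "BACK_END_BUBBLE_FE":    1.6e9,  # front-end stalls
}
TOTAL_CYCLES = 12.0e9                # cycle count over the same interval

for name, bubbles in STALL_COUNTERS.items():
    print(f"{name:24s} {100.0 * bubbles / TOTAL_CYCLES:5.1f}% of cycles")
stalled = sum(STALL_COUNTERS.values())
print(f"{'total stalled':24s} {100.0 * stalled / TOTAL_CYCLES:5.1f}%")
```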
9. Job size: HPCS2 is primarily focused on small capacity jobs.
10. Some Observations
- After analyzing 19,883 jobs:
  - Median sustained FLOP performance for jobs over 256 CPUs is 3%.
  - Less than 20% of the memory is in use at any given point in time.
  - Less than 5% of the jobs use heavy I/O (>25 MB/s per GF of DGEMM).
  - Over 60% of the cycles are for jobs that use less than 1/8 of the system.
  - 10% of the >256-CPU jobs have the CPUs scheduled for idle >50% of the time.
- ...we have determined that most users do not use the system as designed.
11. Existing Tools
- System monitoring
  - Provides continuous collection and aggregation of system performance data.
  - Examples:
    - PerfMiner
    - Ganglia
    - NWPerf
    - SuperMon
    - CluMon
    - Nagios
    - PCP
    - MonALISA
- Application monitoring (our main focus)
  - Measures actual application performance via a batch system.
  - Examples:
    - PerfMiner
    - NWPerf
- Profiling
  - Provides static instrumentation tools that focus on source code over which users have direct control.
  - Examples:
    - PerfSuite
    - HPCToolkit
    - SvPablo
    - TAU
    - Kojak
    - Vampir
    - IPM
12. Existing Tools: Limitations
- Ganglia: Great tool, easy to use.
  - Difficult to extend in a low-system-impact fashion.
  - Unpredictable system performance due to unsynchronized collection (cf. Petrini).
  - RRD storage works well for fixed time-series graphs, but is not suitable for ad-hoc queries.
- Supermon: Good per-node performance.
  - Polling-based systems are almost implicitly unsynchronized.
  - Yet to provide a storage solution.
  - Node management seemed excessively cumbersome and manual.
  - Kernel module reliance reduces portability and increases the administrative cost of deployment (similar to dproc).
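One way around the unsynchronized-collection problem noted above is to align every node's sampling to a shared wall-clock boundary, so the monitoring perturbation hits a parallel job on all nodes at once rather than staggering across them. A minimal sketch, assuming NTP-synchronized clocks and a hypothetical collect_metrics() routine:

```python
import time

SAMPLE_PERIOD = 60.0  # seconds; one sample per minute, as NWPerf does

def sleep_until_next_boundary(period=SAMPLE_PERIOD):
    """Sleep so the next sample lands on a wall-clock period boundary.

    Because every node computes the boundary from the same (NTP-synced)
    clock, all nodes sample at the same instant, and the collection
    noise hits a parallel job simultaneously everywhere.
    """
    now = time.time()
    time.sleep(period - (now % period))

def collect_metrics():
    # Hypothetical placeholder for the real per-node collection routine.
    return {"timestamp": time.time()}

if __name__ == "__main__":
    while True:
        sleep_until_next_boundary()
        sample = collect_metrics()
        # ... ship `sample` toward the collection hierarchy ...
```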
13. Existing Tools: Limitations
- Other solutions, such as dproc, CARD, and PARAMON, have not proved able to scale to large clusters.
- Clumon/PCP are oriented towards point-in-time data, not long-term systematic analysis (although potential exists for extension).
14. NWPerf
- 27 metrics are collected on all nodes once per minute:
  - Hardware performance counters, including FLOPs, memory bytes/cycle, and total stalls.
  - Local scratch usage (obtained via fstat()).
  - Memory swapped out (total), swap blocks in and out.
  - Memory free, used, and used as system buffers.
  - Block I/O in and out.
  - Kernel scheduler CPU allocation to user, kernel, and idle time.
  - Processes running and blocked.
  - Interrupts and context switches per second.
  - Lustre I/O (shared global filesystem).
- [Figure: three graphs from the same 3-day, 600-CPU run.]
- NWPerf: Ryan Mooney, Scott Studham, Ken Schmidt.
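Most of the OS-level metrics above come from the Linux /proc interface. Below is a minimal sketch of a per-node reader for a few of them (free memory, context switches, interrupts, running/blocked process counts). This is not NWPerf's actual code; the hardware counters and Lustre statistics need separate interfaces and are omitted.

```python
def read_meminfo():
    """Parse /proc/meminfo into a dict of kB values (Linux-specific)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key.strip()] = int(rest.split()[0])  # value in kB
    return info

def read_proc_stat():
    """Pull a few scheduler counters out of /proc/stat."""
    wanted = ("ctxt", "intr", "procs_running", "procs_blocked")
    counters = {}
    with open("/proc/stat") as f:
        for line in f:
            fields = line.split()
            if fields[0] in wanted:
                counters[fields[0]] = int(fields[1])
    return counters

if __name__ == "__main__":
    mem, stat = read_meminfo(), read_proc_stat()
    print("MemFree kB:", mem["MemFree"], " context switches:", stat["ctxt"])
```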
15. NWPerf: Unresolved Issues
- Deployment and configuration
  - Difficult to deploy and tune.
- User interface
  - Requires work to standardize and extend the existing APIs.
- Software architecture
  - Centralized scheduler and collection server constitute a single point of failure.
  - Unable to scale to O(1000) nodes.
- Data management
  - Collection and storage may not be able to scale.
- Visualization
  - Simple.
- Portability
  - Specific to Linux clusters.
- PROJECT IS HALTED!
16. What is next?
17. OpenWLC Goals
- Resolve the issues in NWPerf, i.e.,
  - Cross-system portability.
  - Standardized system interface.
  - Better visualization.
  - Improved scalability.
- Perform cross-node and cross-job event correlation (see the sketch after this list).
  - For example, we want to be able to say "job one is using the shared file system, so jobs ten and twenty were slowed down."
- Perform validation tests at different sampling rates to determine preferred sample rates for different data points.
- Detect and describe event anomalies based on gathered profiles.
- Perform (potentially) hardware fault analysis.
- More importantly, we want to characterize more accurately what parts of the system are important for efficient job performance from an average user's perspective.
- Deploy at DoD Modernization and DOE test locations.
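As a concrete (and hypothetical) form of cross-job correlation: compare one job's shared-filesystem traffic against another job's stall fraction over the same sampling window, and flag high correlation as likely interference. The data and the 0.8 threshold below are illustrative assumptions, not OpenWLC's actual method.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sample series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical one-minute samples over the same wall-clock window:
job1_lustre_mb_s = [5, 220, 240, 230, 10, 8]              # job 1: shared-FS writes
job10_stall_frac = [0.10, 0.60, 0.70, 0.65, 0.12, 0.10]   # job 10: stalled cycles

r = pearson(job1_lustre_mb_s, job10_stall_frac)
if r > 0.8:  # illustrative threshold
    print(f"r = {r:.2f}: job 10's stalls track job 1's shared-FS traffic")
```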
18. OpenWLC Design Constraints
- Load > O(40) metrics (e.g., FLOPs, integer ops, memory BW, network BW, disk BW) for all jobs run on the system into a (central) database (see the back-of-the-envelope sizing below).
- Cannot impact job performance by more than 1%.
- Data granularity fine enough to see the need to use different algorithms.
- Keep all data to develop mathematical center profiles.
- Cannot be architecture-specific, for portability.
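A back-of-the-envelope check that the database constraint is tractable, assuming 40 metrics, 1000 nodes, one sample per minute, and roughly 16 bytes per stored value (the record size is an assumption):

```python
metrics_per_node = 40          # > O(40) metrics per node
nodes            = 1000        # O(1000)-node target scale
samples_per_day  = 24 * 60     # one sample per minute
bytes_per_value  = 16          # assumed packed record size

values_per_day = metrics_per_node * nodes * samples_per_day
print(f"{values_per_day:,} values/day")                           # 57,600,000
print(f"{values_per_day * bytes_per_value / 2**30:.2f} GiB/day")  # ~0.86 GiB
```

Under these assumptions the raw volume is modest; the harder problems are the insert rate and ad-hoc query performance, which is why a later slide calls the existing SQL options less than ideal for this shape of data.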
19. Foreseen Critical Challenges
- Scalability
  - Single node
    - Software scalability and features.
    - Doing more without breaking or rewriting code.
    - Number of metrics.
  - System-wide (due to the increasing number of nodes)
    - Communication management.
    - Network load.
    - Data volume at the collection node.
    - Data management.
    - Visualization.
    - Storage.
- A sensible workload model to characterize what parts of the system are important for efficient job performance from an average user's perspective.
20. Preliminary OpenWLC Framework
21. Solutions to Scalability
- Solving single-node scalability:
  - Variable sampling frequency, e.g., collect a set of 10 parameters.
  - Modules/drivers may run as a kernel thread.
  - Encode collected data for transmission to the collection node, to reduce transmission overhead and network traffic (a sketch follows this list).
- Solving system-wide scalability:
  - Hierarchically layered architecture.
  - Scheduled communication with sub-collectors.
  - Identify a medium-term data store for highly structured time-series data. Existing solutions, MySQL and PostgreSQL, are less than ideal.
  - Better data aggregation methods.
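To illustrate the encoding idea: a compact, fixed-size binary record per sample keeps both parsing cost and network traffic low compared with shipping text. The wire format below is an assumption made for the sketch, not OpenWLC's actual protocol.

```python
import struct

# Assumed wire format, one packed record per (metric, sample):
#   uint32 node_id | uint32 metric_id | float64 timestamp | float64 value
RECORD = struct.Struct("!IIdd")  # network byte order, 24 bytes per record

def encode(node_id, samples):
    """samples: iterable of (metric_id, timestamp, value) tuples."""
    return b"".join(RECORD.pack(node_id, m, t, v) for m, t, v in samples)

def decode(payload):
    return list(RECORD.iter_unpack(payload))

batch = encode(42, [(1, 1121800000.0, 3.2e9), (2, 1121800000.0, 0.71)])
assert decode(batch)[0] == (42, 1, 1121800000.0, 3.2e9)
```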
22. Data Aggregation and Management
- We want a high-throughput, low-overhead, highly synchronized, massively scalable data collection subsystem for collecting arbitrary data.
- NWPerf does a pretty good job of most of this, except for the arbitrary-data part and ease of use.
- Supermon has some interesting capabilities in its data encapsulation, but its collection infrastructure has some weaknesses.
- Maybe a hybrid approach (sketched below).
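One shape such a hybrid could take: per-node collectors push encoded batches to a layer of sub-collectors, each of which merges its fan-in and forwards a single batch upstream once per sampling period. Everything here (names, transport, threading) is assumed for illustration only.

```python
import queue
import threading
import time

FORWARD_PERIOD = 60.0  # forward upstream once per sampling period

inbox = queue.Queue()  # encoded payloads arriving from child nodes

def forwarder(send_upstream):
    """Once per period, drain the inbox and push a single merged batch.

    Merging at each layer keeps message rates flat as the node count
    grows: the top-level collector sees O(sub-collectors) messages per
    period, not O(nodes).
    """
    while True:
        time.sleep(FORWARD_PERIOD)
        batch = []
        while not inbox.empty():
            batch.append(inbox.get())
        if batch:
            send_upstream(b"".join(batch))

# threading.Thread(target=forwarder, args=(send_fn,), daemon=True).start()
```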
23. Analytical/Modeling Subsystem
- Could use variations of clustering algorithms (see the toy example below).
- Need to identify workload models for different platforms.
- Problem: this is a huge project in itself.
  - Potential future work.
  - We do not have to re-engineer our current (preliminary) design.
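To illustrate the clustering idea, here is a toy k-means over per-job feature vectors; the choice of features, their scaling, and the data are invented for the example and are not a proposed workload model.

```python
import random

def kmeans(points, k, iters=20):
    """Toy Lloyd's-algorithm k-means over tuples of floats."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                centers[i] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers, clusters

# One feature vector per job: (cpus/1000, memory fraction, sustained-FLOP
# fraction, I/O MB/s / 100) -- all values invented for the example.
jobs = [(0.008, 0.15, 0.03, 0.01), (0.512, 0.60, 0.12, 0.30),
        (0.016, 0.10, 0.02, 0.02), (0.256, 0.55, 0.10, 0.25)]
centers, clusters = kmeans(jobs, k=2)
print("workload-class centers:", centers)
```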
24. Summary
Work with the application community to raise users' awareness of how to use the systems efficiently, and to ensure that users' applications scale on the given NLCF platforms.
25. Benefits
- Benefits to system designers
  - What is really important (memory, FLOPs, something else, all of the above)?
- Benefits to users
  - Is my application sized properly? Is it having obvious problems?
- Benefits to code developers
  - What resources is the application using? Does it have good performance?
- Stop the hype
  - What is "performance"? We need a simple way to compare performance characteristics for a large set of real-world applications.
26. Status
- Collaborators: ANL, Louisiana Tech.
- Kick-off meeting in June 2005.
- Presently focusing on the OpenWLC system design and APIs.
- Timeline:
  - October 2005: finalized design and APIs.
  - Early 2006: beta version.
27. Acknowledgements
Funding for OpenWLC is from the Department of Defense High Performance Computing Modernization Office.

The NWPerf research described in this presentation was performed using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the U.S. Department of Energy's Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory. PNNL is operated for the Department of Energy by Battelle.

Thanks to Ryan Mooney and Ken Schmidt for their tireless work to develop NWPerf, and to Jarek Nieplocha for his guidance on how to quantify system impacts.

Experiments and data collection were performed on the Pacific Northwest National Laboratory (PNNL) 977-node, 11.8 TFLOPs Linux cluster (HPCS2) with 1954 Itanium-2 processors. The data collection server is a dual-Xeon Dell system with a 1 TB ACNC IDE-to-SCSI RAID array.

This research is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy. The work was performed at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725.