Title: OpenWLC: A Workload Characterization System
1. OpenWLC: A Workload Characterization System
- Hong Ong,
- Rajagopal Subramaniyan,
- R. Scott Studham
- ORNL
- July 19, 2005
2. Outline
- Background and Motivation
- Existing Tools and Services
- NWPerf
- OpenWLC
- Overview
- Goals
- Preliminary Design
- Timeline
3. Typical Machine Evaluation
- Determine a machine's applicability to a range of applications.
- Use synthetic codes and application kernels to capture the performance properties of:
  - Hardware architecture (e.g., NAS, SPEC, SPLASH).
  - Operating system components (e.g., lmbench).
  - Software systems and applications (e.g., NPB, HPCToolkit, Genesis, STSWM).
  - Various kinds of middleware.
- Work with vendors and system architects to improve the systems.
- Focus: Evaluation of emerging platforms to understand their benefits.
4. What do we want to know?
- Methods to answer the following:
  - What percentage of jobs use all that memory/disk?
  - What is the average job size?
  - How does job size impact the efficiency of the calculation?
  - Where is the bottleneck for the average run of a job?
  - What is the sustained performance for the average job?
  - What are the impacts of slow memory on the average user?
  - Could we have less storage on the next system if we had a shared memory system or a shared disk pool?
  - ...and more.
- What is the average user doing vs. what the box was designed for?
5. Workload Characterization
- Focus: Evaluation of deployed platforms to understand how they are used by average users.
  - How effectively is the box used, as opposed to what the box was designed for?
- Determine how existing platforms are utilized by applications and general users:
  - Efficiency of applications,
  - System utilization,
  - Average job sizes.
- Work with application developers, users, and manufacturers to decrease application runtime and increase resource utilization.
6. Most users do not utilize the full capability of the system.
- Most jobs use <25% of available memory (maximum available is 6-8 GB); large jobs use more memory.
- [Figure: memory footprint per node during FY04 on the 11.4 TF HPCS2 system at PNNL; percent of jobs vs. CPUs used.]
7. Aggregate Results
- 10% of the >256-CPU jobs have the CPU scheduled for idle >50% of the time.
- The median sustained performance for jobs over 256 CPUs is 3%.
- [Figures: sustained performance as a function of CPU count; busy cycles as a function of job size.]
8. Aggregate Results
- [Figure: stalled cycles as a function of job size.]
- Sum of all stalls due to:
  - BE_FLUSH_BUBBLE_ALL: branch misprediction flush or exception.
  - BE_EXE_BUBBLE_ALL: execution unit stalls.
  - BE_L1D_FPU_BUBBLE_ALL: stalls due to the L1D (L1 data cache) micropipeline or FPU (floating-point unit) micropipeline.
  - BE_RSE_BUBBLE_ALL: register stack engine (RSE) stalls.
  - BACK_END_BUBBLE_FE: front-end stalls.
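To make the counter list concrete, here is a minimal Python sketch of how a stalled-cycle breakdown can be derived from raw counter readings over one interval. The counter values and total-cycle count are illustrative, not HPCS2 measurements.

```python
# Illustrative breakdown of back-end stall counters into percentages of
# total cycles; all values here are made up, not measured data.
STALL_COUNTERS = {
    "BE_FLUSH_BUBBLE_ALL":   1.2e9,  # branch misprediction flush / exception
    "BE_EXE_BUBBLE_ALL":     3.8e9,  # execution unit stalls
    "BE_L1D_FPU_BUBBLE_ALL": 2.1e9,  # L1D / FPU micropipeline stalls
    "BE_RSE_BUBBLE_ALL":     0.3e9,  # register stack engine stalls
    "BACK_END_BUBBLE_FE":    1.6e9,  # front-end stalls
}
TOTAL_CYCLES = 12.0e9                # cycle count over the same interval

for name, bubbles in STALL_COUNTERS.items():
    print(f"{name:24s} {100.0 * bubbles / TOTAL_CYCLES:5.1f}% of cycles")
stalled = sum(STALL_COUNTERS.values())
print(f"{'total stalled':24s} {100.0 * stalled / TOTAL_CYCLES:5.1f}%")
```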
9. Job size: HPCS2 is primarily focused on small capacity jobs.
10. Some Observations
- After analyzing 19,883 jobs:
  - Median sustained FLOP performance for jobs over 256 CPUs is 3%.
  - Less than 20% of the memory is in use at any given point in time.
  - Less than 5% of the jobs use heavy I/O (>25 MB/s per GF of DGEMM).
  - Over 60% of the cycles are for jobs that use less than 1/8 of the system.
  - 10% of the >256-CPU jobs have the CPUs scheduled for idle >50% of the time.
- ...we have determined that most users do not use the system as designed.
11. Existing Tools
- System monitoring
  - Provides continuous collection and aggregation of system performance data.
  - Examples:
    - PerfMiner
    - Ganglia
    - NWPerf
    - SuperMon
    - CluMon
    - Nagios
    - PCP
    - MonALISA
- Application monitoring (our main focus)
  - Measures actual application performance via a batch system.
  - Examples:
    - PerfMiner
    - NWPerf
- Profiling
  - Provides static instrumentation tools that focus on source code over which users have direct control.
  - Examples:
    - PerfSuite
    - HPCToolkit
    - SvPablo
    - TAU
    - Kojak
    - Vampir
    - IPM
12. Existing Tools: Limitations
- Ganglia: Great tool, easy to use.
  - Difficult to extend in a low-system-impact fashion.
  - Unpredictable system performance due to unsynchronized collection (cf. Petrini).
  - RRD storage works well for fixed time-series graphs, but is not suitable for ad-hoc queries.
- Supermon: Good per-node performance.
  - Polling-based systems are almost implicitly unsynchronized.
  - Yet to provide a storage solution.
  - Node management seemed excessively cumbersome and manual.
  - Kernel module reliance reduces portability and increases the administrative cost of deployment (similar to dproc).
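One way around the unsynchronized-collection problem noted above is to align every node's sampling to a shared wall-clock boundary, so the monitoring perturbation hits a parallel job on all nodes at once rather than staggering across them. A minimal sketch, assuming NTP-synchronized clocks and a hypothetical collect_metrics() routine:

```python
import time

SAMPLE_PERIOD = 60.0  # seconds; one sample per minute, as NWPerf does

def sleep_until_next_boundary(period=SAMPLE_PERIOD):
    """Sleep so the next sample lands on a wall-clock period boundary.

    Because every node computes the boundary from the same (NTP-synced)
    clock, all nodes sample at the same instant, and the collection
    noise hits a parallel job simultaneously everywhere.
    """
    now = time.time()
    time.sleep(period - (now % period))

def collect_metrics():
    # Hypothetical placeholder for the real per-node collection routine.
    return {"timestamp": time.time()}

if __name__ == "__main__":
    while True:
        sleep_until_next_boundary()
        sample = collect_metrics()
        # ... ship `sample` toward the collection hierarchy ...
```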
13. Existing Tools: Limitations
- Other solutions, such as dproc, CARD, and PARAMON, have not proved able to scale to large clusters.
- Clumon/PCP are oriented towards point-in-time data, not long-term systematic analysis (although potential exists for extension).
14. NWPerf
- 27 metrics are collected on all nodes once per minute:
  - Hardware performance counters, including FLOPs, memory bytes/cycle, and total stalls.
  - Local scratch usage (obtained via fstat()).
  - Memory swapped out (total), swap blocks in and out.
  - Memory free, used, and used as system buffers.
  - Block I/O in and out.
  - Kernel scheduler CPU allocation to user, kernel, and idle time.
  - Processes running and blocked.
  - Interrupts and context switches per second.
  - Lustre I/O (shared global filesystem).
- [Figure: three graphs from the same 3-day, 600-CPU run.]
- NWPerf: Ryan Mooney, Scott Studham, Ken Schmidt.
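Most of the OS-level metrics above come from the Linux /proc interface. Below is a minimal sketch of a per-node reader for a few of them (free memory, context switches, interrupts, running/blocked process counts). This is not NWPerf's actual code; the hardware counters and Lustre statistics need separate interfaces and are omitted.

```python
def read_meminfo():
    """Parse /proc/meminfo into a dict of kB values (Linux-specific)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key.strip()] = int(rest.split()[0])  # value in kB
    return info

def read_proc_stat():
    """Pull a few scheduler counters out of /proc/stat."""
    wanted = ("ctxt", "intr", "procs_running", "procs_blocked")
    counters = {}
    with open("/proc/stat") as f:
        for line in f:
            fields = line.split()
            if fields[0] in wanted:
                counters[fields[0]] = int(fields[1])
    return counters

if __name__ == "__main__":
    mem, stat = read_meminfo(), read_proc_stat()
    print("MemFree kB:", mem["MemFree"], " context switches:", stat["ctxt"])
```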
15. NWPerf: Unresolved Issues
- Deployment and configuration
  - Difficult to deploy and tune.
- User interface
  - Requires work to standardize and extend the existing APIs.
- Software architecture
  - Centralized scheduler and collection server constitute a single point of failure.
  - Unable to scale to O(1000) nodes.
- Data management
  - Collection and storage may not be able to scale.
- Visualization
  - Simple.
- Portability
  - Specific to Linux clusters.
- PROJECT IS HALTED!
16. What is next?
17. OpenWLC Goals
- Resolve the issues in NWPerf, i.e.,
  - Cross-system portability.
  - Standardized system interface.
  - Better visualization.
  - Improved scalability.
- Perform cross-node and cross-job event correlation (see the sketch after this list).
  - For example, we want to be able to say "job one is using the shared file system, so jobs ten and twenty were slowed down."
- Perform validation tests at different sampling rates to determine preferred sample rates for different data points.
- Detect and describe event anomalies based on gathered profiles.
- Perform (potentially) hardware fault analysis.
- More importantly, we want to characterize more accurately what parts of the system are important for efficient job performance from an average user's perspective.
- Deploy at DoD Modernization and DOE test locations.
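As a concrete (and hypothetical) form of cross-job correlation: compare one job's shared-filesystem traffic against another job's stall fraction over the same sampling window, and flag high correlation as likely interference. The data and the 0.8 threshold below are illustrative assumptions, not OpenWLC's actual method.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sample series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical one-minute samples over the same wall-clock window:
job1_lustre_mb_s = [5, 220, 240, 230, 10, 8]              # job 1: shared-FS writes
job10_stall_frac = [0.10, 0.60, 0.70, 0.65, 0.12, 0.10]   # job 10: stalled cycles

r = pearson(job1_lustre_mb_s, job10_stall_frac)
if r > 0.8:  # illustrative threshold
    print(f"r = {r:.2f}: job 10's stalls track job 1's shared-FS traffic")
```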
18. OpenWLC Design Constraints
- Load > O(40) metrics (e.g., FLOPs, integer ops, memory BW, network BW, disk BW) for all jobs run on the system into a (central) database (see the back-of-the-envelope sizing below).
- Cannot impact job performance by more than 1%.
- Data granularity fine enough to see the need to use different algorithms.
- Keep all data to develop mathematical center profiles.
- Cannot be architecture-specific, for portability.
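A back-of-the-envelope check that the database constraint is tractable, assuming 40 metrics, 1000 nodes, one sample per minute, and roughly 16 bytes per stored value (the record size is an assumption):

```python
metrics_per_node = 40          # > O(40) metrics per node
nodes            = 1000        # O(1000)-node target scale
samples_per_day  = 24 * 60     # one sample per minute
bytes_per_value  = 16          # assumed packed record size

values_per_day = metrics_per_node * nodes * samples_per_day
print(f"{values_per_day:,} values/day")                           # 57,600,000
print(f"{values_per_day * bytes_per_value / 2**30:.2f} GiB/day")  # ~0.86 GiB
```

Under these assumptions the raw volume is modest; the harder problems are the insert rate and ad-hoc query performance, which is why a later slide calls the existing SQL options less than ideal for this shape of data.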
19. Foreseen Critical Challenges
- Scalability
  - Single node
    - Software scalability and features.
    - Doing more without breaking or rewriting code.
    - Number of metrics.
  - System-wide (due to the increasing number of nodes)
    - Communication management.
    - Network load.
    - Data volume at the collection node.
    - Data management.
    - Visualization.
    - Storage.
- A sensible workload model to characterize what parts of the system are important for efficient job performance from an average user's perspective.
20. Preliminary OpenWLC Framework
21. Solutions to Scalability
- Solving single-node scalability:
  - Variable sampling frequency, e.g., collect a set of 10 parameters.
  - Modules/drivers may run as a kernel thread.
  - Encode collected data for transmission to the collection node, to reduce transmission overhead and network traffic (a sketch follows this list).
- Solving system-wide scalability:
  - Hierarchically layered architecture.
  - Scheduled communication with sub-collectors.
  - Identify a medium-term data store for highly structured time-series data. Existing solutions, MySQL and PostgreSQL, are less than ideal.
  - Better data aggregation methods.
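To illustrate the encoding idea: a compact, fixed-size binary record per sample keeps both parsing cost and network traffic low compared with shipping text. The wire format below is an assumption made for the sketch, not OpenWLC's actual protocol.

```python
import struct

# Assumed wire format, one packed record per (metric, sample):
#   uint32 node_id | uint32 metric_id | float64 timestamp | float64 value
RECORD = struct.Struct("!IIdd")  # network byte order, 24 bytes per record

def encode(node_id, samples):
    """samples: iterable of (metric_id, timestamp, value) tuples."""
    return b"".join(RECORD.pack(node_id, m, t, v) for m, t, v in samples)

def decode(payload):
    return list(RECORD.iter_unpack(payload))

batch = encode(42, [(1, 1121800000.0, 3.2e9), (2, 1121800000.0, 0.71)])
assert decode(batch)[0] == (42, 1, 1121800000.0, 3.2e9)
```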
22. Data Aggregation and Management
- We want a high-throughput, low-overhead, highly synchronized, massively scalable data collection subsystem for collecting arbitrary data.
- NWPerf does a pretty good job of most of this, except for the arbitrary-data part and ease of use.
- Supermon has some interesting capabilities in its data encapsulation, but its collection infrastructure has some weaknesses.
- Maybe a hybrid approach (sketched below).
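One shape such a hybrid could take: per-node collectors push encoded batches to a layer of sub-collectors, each of which merges its fan-in and forwards a single batch upstream once per sampling period. Everything here (names, transport, threading) is assumed for illustration only.

```python
import queue
import threading
import time

FORWARD_PERIOD = 60.0  # forward upstream once per sampling period

inbox = queue.Queue()  # encoded payloads arriving from child nodes

def forwarder(send_upstream):
    """Once per period, drain the inbox and push a single merged batch.

    Merging at each layer keeps message rates flat as the node count
    grows: the top-level collector sees O(sub-collectors) messages per
    period, not O(nodes).
    """
    while True:
        time.sleep(FORWARD_PERIOD)
        batch = []
        while not inbox.empty():
            batch.append(inbox.get())
        if batch:
            send_upstream(b"".join(batch))

# threading.Thread(target=forwarder, args=(send_fn,), daemon=True).start()
```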
23. Analytical/Modeling Subsystem
- Could use variations of clustering algorithms (see the toy example below).
- Need to identify workload models for different platforms.
- Problem: this is a huge project in itself.
  - Potential future work.
  - We do not have to re-engineer our current (preliminary) design.
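To illustrate the clustering idea, here is a toy k-means over per-job feature vectors; the choice of features, their scaling, and the data are invented for the example and are not a proposed workload model.

```python
import random

def kmeans(points, k, iters=20):
    """Toy Lloyd's-algorithm k-means over tuples of floats."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                centers[i] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers, clusters

# One feature vector per job: (cpus/1000, memory fraction, sustained-FLOP
# fraction, I/O MB/s / 100) -- all values invented for the example.
jobs = [(0.008, 0.15, 0.03, 0.01), (0.512, 0.60, 0.12, 0.30),
        (0.016, 0.10, 0.02, 0.02), (0.256, 0.55, 0.10, 0.25)]
centers, clusters = kmeans(jobs, k=2)
print("workload-class centers:", centers)
```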
24. Summary
Work with the application community to raise users' awareness of how to use the systems efficiently, and to ensure that users' applications scale on the given NLCF platforms.
25. Benefits
- Benefits to system designers
  - What is really important (memory, FLOPs, something else, all of the above)?
- Benefits to users
  - Is my application sized properly? Is it having obvious problems?
- Benefits to code developers
  - What resources is the application using? Does it have good performance?
- Stop the hype
  - What is "performance"? We need a simple way to compare performance characteristics for a large set of real-world applications.
26. Status
- Collaborators: ANL, Louisiana Tech.
- Kick-off meeting in June 2005.
- Presently focusing on the OpenWLC system design and APIs.
- Timeline:
  - October 2005: finalized design and APIs.
  - Early 2006: beta version.
27. Acknowledgements
Funding for OpenWLC is from the Department of Defense High Performance Computing Modernization Office.

The NWPerf research described in this presentation was performed using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the U.S. Department of Energy's Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory. PNNL is operated for the Department of Energy by Battelle.

Thanks to Ryan Mooney and Ken Schmidt for their tireless work to develop NWPerf, and to Jarek Nieplocha for his guidance on how to quantify system impacts.

Experiments and data collection were performed on the Pacific Northwest National Laboratory (PNNL) 977-node, 11.8 TFLOPs Linux cluster (HPCS2) with 1954 Itanium-2 processors. The data collection server is a dual-Xeon Dell system with a 1 TB ACNC IDE-to-SCSI RAID array.

This research is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy. The work was performed at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725.