1
OpenWLC: A Workload Characterization System
  • Hong Ong,
  • Rajagopal Subramaniyan,
  • R. Scott Studham
  • ORNL
  • July 19, 2005

2
Outline
  • Background and Motivation
  • Existing Tools and Services
  • NWPerf
  • OpenWLC
  • Overview
  • Goals
  • Preliminary Design
  • Timeline

3
Typical Machine Evaluation
  • Determine a machine's applicability to a range of
    applications.
  • Use synthetic codes and application kernels to
    capture performance properties of:
  • Hardware architecture (e.g., NAS, SPEC, SPLASH).
  • Operating System components (e.g., lmbench).
  • Software systems and applications (e.g., NPB,
    HPCToolkit, Genesis, STSWM).
  • Various kinds of middleware.
  • Work with vendors and system architects to
    improve the systems.
  • Focus: Evaluation of emerging platforms to
    understand their benefits.

4
What do we want to know?
  • Methods to answer the following:
  • What percentage of jobs use all that memory/disk?
  • What is the average job size?
  • How does job size impact efficiency of the
    calculation?
  • Where is the bottleneck for the average run of a
    job?
  • What is the sustained performance for the average
    job?
  • What are the impacts of slow memory on the
    average user?
  • Could we have less storage on the next system if
    we had a shared memory system or a shared disk
    pool?
  • and more.

What is the average user doing vs. what the box
was designed for?
5
Workload Characterization
  • Focus: Evaluation of deployed platforms to
    understand how they are used by average users.
  • How effectively is the box used as opposed to
    what the box was designed for?
  • Determine how existing platforms are utilized by
    applications and general users:
  • Efficiency of applications,
  • System utilization,
  • Average job sizes.
  • Work with application developers, users and
    manufacturers to decrease application runtime and
    increase resource utilization.

6
Most users do not utilize the full capability of
the system.
Most jobs use <25% of available memory (max
available is 6-8 GB). Large jobs use more memory.
[Figure: memory footprint per node during FY04 on
the 11.4 TF HPCS2 system at PNNL; percent of jobs
vs. CPUs used]
7
Aggregate Results
10% of the >256-CPU jobs have the CPU scheduled
for idle >50% of the time.
The median sustained performance for jobs over
256 CPUs is 3%.
[Figures: sustained performance as a function of
CPU count; busy cycles as a function of job size]
8
Aggregate Results
Stalled cycles as a function of job size: the sum
of all stalls due to the following counters.
  • BE_FLUSH_BUBBLE_ALL: branch misprediction flush
    or exception.
  • BE_EXE_BUBBLE_ALL: execution unit stalls.
  • BE_L1D_FPU_BUBBLE_ALL: stalls due to the L1D (L1
    data cache) micropipeline or FPU (floating-point
    unit) micropipeline.
  • BE_RSE_BUBBLE_ALL: register stack engine (RSE)
    stalls.
  • BACK_END_BUBBLE_FE: front-end stalls.
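For illustration, a minimal Python sketch of how per-job
bubble-counter totals like these can be combined into a
stall breakdown; the counter values and the total cycle
count are hypothetical placeholders, not measured data.

    # Sketch: combine Itanium-2 backend bubble counters into a per-job stall
    # breakdown. All numbers below are hypothetical placeholders.
    counters = {
        "BE_FLUSH_BUBBLE_ALL": 1.2e11,    # branch misprediction flush or exception
        "BE_EXE_BUBBLE_ALL": 3.4e11,      # execution unit stalls
        "BE_L1D_FPU_BUBBLE_ALL": 2.1e11,  # L1D / FPU micropipeline stalls
        "BE_RSE_BUBBLE_ALL": 0.3e11,      # register stack engine stalls
        "BACK_END_BUBBLE_FE": 1.8e11,     # front-end stalls
    }
    total_cycles = 1.5e12                 # hypothetical CPU-cycle total for the job

    total_stalls = sum(counters.values())
    print(f"stalled fraction of cycles: {total_stalls / total_cycles:.1%}")
    for name, count in counters.items():
        print(f"  {name:24s} {count / total_stalls:6.1%} of all stalls")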
9
Job size: HPCS2 primarily runs small capacity
jobs.
10
Some Observations
  • After analyzing 19,883 jobs:
  • Median sustained FLOP performance for jobs over
    256 CPUs is 3%.
  • Less than 20% of the memory is in use at any
    given point in time.
  • Less than 5% of the jobs use heavy I/O (>25 MB/s
    per GF of DGEMM).
  • Over 60% of the cycles are for jobs that use less
    than 1/8 of the system.
  • 10% of the >256-CPU jobs have the CPUs scheduled
    for idle >50% of the time.

We have determined that most users do not use
the system as designed.
11
Existing Tools
  • System monitoring
  • Provide a continuous collection and aggregation
    of system performance data.
  • Examples
  • PerfMiner
  • Ganglia
  • NWPerf
  • SuperMon
  • CluMon
  • Nagios
  • PCP
  • MonALISA
  • Application monitoring
  • Measure actual application performance via a
    batch system.
  • Examples
  • PerfMiner
  • NWPerf
  • Profiling
  • Provide static instrumentation tools, which focus
    on source code that users directly control.
  • Examples
  • PerfSuite
  • HPCToolkit
  • SvPablo
  • TAU
  • Kojak
  • Vampir
  • IPM

Our main focus
12
Existing Tools: Limitations
  • Ganglia: Great tool, easy to use.
  • Difficult to extend in a low system impact
    fashion.
  • Unpredictable system performance due to
    unsynchronized collection (ref Petrini).
  • RRD storage works well for fixed time-series
    graphs but is not suitable for ad-hoc queries.
  • Supermon: Good per-node performance.
  • Polling-based systems are almost implicitly
    unsynchronized.
  • Yet to provide a storage solution.
  • Node management seemed excessively cumbersome and
    manual.
  • Kernel module reliance reduces portability and
    increases administrative cost of deployment
    (similar to dproc).

13
Existing Tools: Limitations
  • Other solutions, such as dproc, CARD, and
    PARAMON, have not been shown to scale to large
    clusters.
  • Clumon/PCP are oriented towards point-in-time
    data, not long-term systematic analysis (although
    potential exists for extension).

14
NWPerf
  • 27 metrics are collected on all nodes once per
    minute (a minimal collection sketch follows below):
  • Hardware performance counters including flops,
    memory bytes/cycle, total stalls
  • Local scratch usage (obtained via fstat())
  • Memory swapped out (total), swap blocks in and
    out
  • Memory free, used, and used as system buffers
  • Block I/O in and out
  • Kernel scheduler CPU allocation to user, kernel,
    and idle time
  • Processes running and blocked
  • Interrupts and context switches per second.
  • Lustre I/O (Shared global filesystem)

The three graphs are from the same 3-day, 600-CPU run.
NWPerf: Ryan Mooney, Scott Studham, Ken Schmidt
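For illustration only, a minimal Linux-only sketch of a
once-per-minute node sampler in the spirit of the metric
list above. This is not NWPerf code; the particular /proc
fields and the print-instead-of-ship behavior are
assumptions.

    # Sketch of a once-per-minute node metric sampler (Linux /proc); not NWPerf code.
    import time

    def read_meminfo():
        """Return selected /proc/meminfo fields (in kB) as a dict."""
        wanted = {"MemTotal", "MemFree", "Buffers"}
        values = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                if key in wanted:
                    values[key] = int(rest.split()[0])
        return values

    def read_ctxt_switches():
        """Return the cumulative context-switch count from /proc/stat."""
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("ctxt"):
                    return int(line.split()[1])
        return 0

    while True:
        sample = {"time": time.time(), "ctxt": read_ctxt_switches(), **read_meminfo()}
        print(sample)   # a real collector would ship this to a collection node
        time.sleep(60)  # once per minute, matching the NWPerf sampling interval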
15
NWPerf Unresolved Issues
  • Deployment and configuration
  • Difficult to deploy and tune.
  • User interface
  • Requires work to standardize and extend
    existing APIs.
  • Software architecture
  • Centralized scheduler and collection server
    constitute a single point of failure.
  • Unable to scale to O(1000) nodes
  • Data management
  • Collection and storage may not be able to scale.
  • Visualization
  • Too simple.
  • Portability
  • Specific to Linux clusters.
  • PROJECT IS HALTED!

16
What is next?
17
OpenWLC Goals
  • Resolve issues in NWPerf, i.e.,
  • Cross-system portability.
  • Standardized system interface.
  • Better visualization.
  • Improved scalability.
  • Perform cross-node and cross-job event
    correlation (see the sketch after this list).
  • For example, we want to be able to say "job one
    is using the shared file system, so jobs ten and
    twenty were slowed down."
  • Perform validation tests at different sampling
    rates to determine preferred sample rates for
    different data points.
  • Detect and describe event anomalies based on
    gathered profiles.
  • Perform (potentially) hardware fault analysis.
  • More importantly, we want to characterize more
    accurately what parts of the system are important
    for efficient job performance from an average
    user's perspective.
  • Deployed at DOD MOD and DOE test locations.
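As a hypothetical illustration of the cross-job correlation
goal: one could flag intervals where one job's heavy
shared-filesystem traffic overlaps another job's slow
periods. The job IDs, sample records, and overlap test
below are invented for the example.

    # Sketch: flag intervals where one job's heavy shared-FS writes overlap
    # another job's low sustained performance. All records below are made up.
    fs_heavy = [(1, 100, 160), (1, 300, 360)]    # (job_id, start_s, end_s) of heavy Lustre I/O
    slow =     [(10, 120, 180), (20, 500, 560)]  # (job_id, start_s, end_s) of low FLOP rate

    def overlaps(a_start, a_end, b_start, b_end):
        return a_start < b_end and b_start < a_end

    for io_job, io_s, io_e in fs_heavy:
        for slow_job, s_s, s_e in slow:
            if io_job != slow_job and overlaps(io_s, io_e, s_s, s_e):
                print(f"job {slow_job} slowed while job {io_job} hit the shared file system")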

18
OpenWLC Design constraints
  • Load > O(40) metrics (e.g., FLOPs, integer ops,
    memory BW, network BW, disk BW) for all jobs run
    on the system into a (central) database.
  • Cannot impact job performance by more than 1%
    (see the worked example below).
  • Fine data granularity, to see the need to use
    different algorithms.
  • Keep all data to develop mathematical center
    profiles.
  • Cannot be architecture-specific, for portability.
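A back-of-the-envelope check of what the 1% constraint
implies. The 40-metric count comes from the slide; the
one-minute sweep interval is borrowed from NWPerf, and the
even per-metric split is only an assumption.

    # Sketch: per-sweep time budget implied by a <1% perturbation target.
    interval_s = 60.0                  # assume one collection sweep per minute
    budget_s = 0.01 * interval_s       # 1% of each interval may go to collection
    metrics = 40                       # O(40) metrics per sweep, from the slide
    per_metric_ms = 1000.0 * budget_s / metrics
    print(f"{budget_s:.2f} s per sweep, about {per_metric_ms:.0f} ms per metric")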

19
Foreseen Critical Challenges
  • Scalability
  • Single node
  • Software scalability and features.
  • Doing more without breaking or rewriting code.
  • Number of metrics.
  • System-wide (due to increasing number of nodes)
  • Communication management.
  • Network load.
  • Data volume at the collection node.
  • Data management.
  • Visualization.
  • Storage.
  • A sensible workload model to characterize what
    parts of the system are important for efficient
    job performance from an average user's
    perspective.

20
Preliminary OpenWLC Framework
21
Solutions to Scalability
  • Solving single node scalability
  • Variable sampling frequency, e.g., collect a set
    of 10 parameters.
  • Modules/drivers may run as a kernel thread.
  • Encode collected data for transmission to the
    collection node, to reduce transmission overhead
    and network traffic (see the sketch below).
  • Solving system-wide scalability
  • Hierarchically layered architecture.
  • Scheduled communication with sub-collectors.
  • Identify a medium-term data store for highly
    structured time-series data. Existing solutions,
    MySQL and PostgreSQL, are less than ideal.
  • Better data aggregation methods.
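A hedged sketch of the "encode before transmission" idea:
packing one sample into a fixed-size binary record before
shipping it to a sub-collector. The field layout, types,
and byte order are invented for illustration and are not an
OpenWLC wire format.

    # Sketch: pack one metric sample into a fixed binary record before shipping
    # it to a sub-collector. The field layout is illustrative only.
    import struct
    import time

    RECORD = struct.Struct("<dIff")   # timestamp, node id, FLOP rate, memory-used fraction

    def encode(node_id, flops, mem_frac):
        return RECORD.pack(time.time(), node_id, flops, mem_frac)

    def decode(blob):
        ts, node_id, flops, mem_frac = RECORD.unpack(blob)
        return {"time": ts, "node": node_id, "flops": flops, "mem_frac": mem_frac}

    blob = encode(17, 2.3e9, 0.21)
    print(len(blob), "bytes:", decode(blob))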

22
Data Aggregation and Management
  • Want a high throughput, low overhead, highly
    synchronized, massively scalable data collection
    subsystem for collecting arbitrary data.
  • NWPerf does a pretty good job of most of this
    except the arbitrary data part and ease of use.
  • Supermon has some interesting capabilities in its
    data encapsulation, but its collection
    infrastructure has some weaknesses.
  • Maybe a hybrid approach.

23
Analytical/Modeling Subsystem
  • Could use variations of clustering algorithms
    (see the toy sketch after this list).
  • Need to identify a workload model for different
    platforms.
  • Problem: This is a huge project in itself.
  • Potential future work.
  • Do not have to re-engineer our current
    (preliminary) design.
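A toy, standard-library-only illustration of the clustering
idea. The job feature vectors (CPU count, memory fraction,
sustained-FLOP fraction), the made-up records, and the
choice of plain k-means are assumptions for the example,
not the project's workload model.

    # Sketch: group jobs into k rough workload classes with plain k-means.
    # Job records are made up, and features are deliberately not normalized.
    import random

    jobs = [(16, 0.10, 0.05), (32, 0.15, 0.08), (512, 0.60, 0.03), (1024, 0.70, 0.02)]

    def kmeans(points, k, iters=20):
        centers = random.sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                groups[nearest].append(p)
            centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
                       for i, g in enumerate(groups)]
        return centers, groups

    centers, groups = kmeans(jobs, k=2)
    for center, group in zip(centers, groups):
        print(f"class around {center}: {len(group)} jobs")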

24
Summary
Work with the application community to raise
users' awareness of how to use the systems
efficiently, and to ensure that user applications
scale on the given NLCF platforms.
25
Benefits
  • Benefit system designers:
  • What is really important (memory, flops,
    something else, all of the above)?
  • Benefit users:
  • Is my application sized properly? Is it having
    obvious problems?
  • Benefit code developers:
  • What resources is the application using? Does it
    have good performance?
  • Stop the hype:
  • What is performance? We need a simple way to
    compare performance characteristics for a large
    set of real-world applications.

26
Status
  • Collaborators: ANL, Louisiana Tech.
  • Kick-off meeting in June 2005.
  • Presently focusing on OpenWLC system design and
    APIs.
  • Timeline
  • October 2005: Finalized design and APIs.
  • Early 2006: Beta version.

27
Acknowledgements
Funding for OpenWLC is from the Department of
Defense High Performance Computing Modernization
Office. The NWPerf research described in this
presentation was performed using the Molecular
Science Computing Facility (MSCF) in the William
R. Wiley Environmental Molecular Sciences
Laboratory, a national scientific user facility
sponsored by the U.S. Department of Energy's
Office of Biological and Environmental Research
and located at the Pacific Northwest National
Laboratory. PNNL is operated for the Department
of Energy by Battelle.
We thank Ryan Mooney and Ken Schmidt for their
tireless work to develop NWPerf, and Jarek
Nieplocha for his guidance on how to quantify
system impacts.
Experiments and data collection were performed on
the Pacific Northwest National Laboratory (PNNL)
977-node, 11.8 TFLOPS Linux cluster (HPCS2) with
1,954 Itanium-2 processors. The data collection
server is a dual-Xeon Dell system with a 1 TB
ACNC IDE-to-SCSI RAID array.
This research is sponsored by the Office of
Advanced Scientific Computing Research, U.S.
Department of Energy. The work was performed at
the Oak Ridge National Laboratory, which is
managed by UT-Battelle, LLC under Contract No.
DE-AC05-00OR22725.