Wake Up and Smell the Coffee: Performance Analysis Methodologies for the 21st Century
- Kathryn S. McKinley
- Department of Computer Sciences
- University of Texas at Austin
Shocking News!
- In 2000, Java overtook C and C++ as the most popular programming language (TIOBE, 2000--2008)
Systems Research in Industry and Academia
- ISCA 2006:
  - 20 papers use C and/or C++
  - 5 papers are orthogonal to the programming language
  - 2 papers use specialized programming languages
  - 2 papers use Java and C from SPEC
  - 1 paper uses only Java from SPEC
What is Experimental Computer Science?
- An idea
- An implementation in some system
- An evaluation
The success of most systems innovation hinges on experimental methodologies.
- Benchmarks reflect current and, ideally, future reality (DaCapo Benchmarks, 2006)
- Experimental design is appropriate
- Statistical data analysis (Georges et al., 2006; see the sketch below)
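Georges et al. advocate statistically rigorous data analysis: report a mean with a confidence interval over multiple runs rather than a single best run. Below is a minimal Java sketch of that idea, assuming five independent timings; the timing values are made up, and the t-value of 2.776 is Student's t for four degrees of freedom at 95% (with 30 or more runs, z = 1.96 is the usual choice).

```java
import java.util.Arrays;

// Minimal sketch: summarize repeated benchmark timings with a mean and a
// 95% confidence interval, in the spirit of Georges et al.'s statistically
// rigorous methodology. Timings below are illustrative, not measured data.
public class Stats {
    public static void main(String[] args) {
        double[] millis = {612.0, 598.4, 605.1, 601.7, 617.3};

        double mean = Arrays.stream(millis).average().orElse(Double.NaN);
        double var = Arrays.stream(millis)
                           .map(t -> (t - mean) * (t - mean))
                           .sum() / (millis.length - 1);       // sample variance
        // Student's t for df = 4 at 95%; with 30+ runs use z = 1.96 instead.
        double ci = 2.776 * Math.sqrt(var / millis.length);

        System.out.printf("mean = %.1f ms +/- %.1f ms (95%% CI)%n", mean, ci);
    }
}
```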
Experimental Design
- We're not in Kansas anymore!
  - JIT compilation, GC, dynamic checks, etc.
- Methodology has not adapted
  - Needs to be updated and institutionalized
- "This sophistication provides a significant challenge to understanding complete system performance, not found in traditional languages such as C or C++." (Hauswirth et al., OOPSLA '04)
Experimental Design
- Comprehensive comparison (see the harness sketch below)
  - 3 state-of-the-art JVMs
  - Best of 5 executions
  - 19 benchmarks
  - Platform: 2 GHz Pentium M, 1 GB RAM, Linux 2.6.15
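A minimal sketch of what such a measurement harness might look like: it times each in-process iteration separately, so first-iteration numbers (start-up, heavy JIT activity) and later-iteration numbers can be reported side by side instead of collapsing everything into a single "best" figure. The Benchmark interface and the placeholder workload are hypothetical stand-ins for a DaCapo-style benchmark.

```java
// Sketch of a per-iteration timing harness. Benchmark is a hypothetical
// stand-in for a real workload such as a DaCapo benchmark.
public class Harness {
    interface Benchmark { void run(); }

    public static void main(String[] args) {
        Benchmark bench = () -> {                  // placeholder workload
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) sum += i;
            if (sum == 42) System.out.println("never true; defeats dead-code elimination");
        };

        for (int i = 1; i <= 5; i++) {
            long start = System.nanoTime();
            bench.run();
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("iteration %d: %d ms%n", i, ms);
        }
        // Report every iteration (or the distribution), not only the best:
        // "best of 5" hides the variance that statistical analysis needs.
    }
}
```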
Experimental Design
[figure slides: benchmark performance for the first, second, and third iterations]
Experimental Design
- Another experiment: compare two garbage collectors (see the sketch below)
  - Semispace full-heap garbage collector
  - Marksweep full-heap garbage collector
- Experimental design
  - Same JVM, same compiler settings
  - Second iteration for both
  - Best of 5 executions
  - One benchmark: SPEC _209_db
  - Platform: 2 GHz Pentium M, 1 GB RAM, Linux 2.6.15
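A sketch of how one might hold everything constant except the collector. The talk's semispace and marksweep collectors are Jikes RVM/MMTk configurations; the sketch below substitutes two real HotSpot collector flags (-XX:+UseSerialGC and -XX:+UseParallelGC) purely for illustration, and the Benchmark class name and fixed 256 MB heap are assumptions.

```java
import java.io.IOException;

// Sketch: run the same (hypothetical) Benchmark class twice, changing only
// the collector. -Xms/-Xmx pin the heap so both collectors see the same
// heap size; the HotSpot collector flags stand in for the talk's semispace
// and marksweep configurations.
public class CompareGCs {
    public static void main(String[] args) throws IOException, InterruptedException {
        String[] collectors = {"-XX:+UseSerialGC", "-XX:+UseParallelGC"};
        for (String gc : collectors) {
            Process p = new ProcessBuilder("java", gc, "-Xms256m", "-Xmx256m", "Benchmark")
                    .inheritIO()
                    .start();
            p.waitFor();
        }
    }
}
```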
Marksweep vs. Semispace
[figure slides: marksweep vs. semispace performance comparison]
Experimental Design: Best Practices
- Measuring JVM innovations
- Measuring JIT innovations
- Measuring GC innovations
- Measuring architecture innovations
JVM Innovation: Best Practices
- Examples
  - Thread scheduling
  - Performance monitoring
- Workload triggers the differences
  - Real workloads, and perhaps microbenchmarks
  - e.g., force the frequency of thread switching
- Measure and report multiple iterations (see the sketch after this list)
  - Start-up
  - Steady state (aka server mode)
  - Never configure the VM to use completely unoptimized code!
- Use a modest heap size, or multiple heap sizes, computed as a function of the maximum live size of the application
- Use and report multiple architectures
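One way to operationalize the start-up versus steady-state distinction is to keep iterating until recent iteration times stabilize. The sketch below stops once the coefficient of variation of the last four iterations falls under 2%; the window size, the threshold, and the placeholder workload are illustrative assumptions, not prescribed values.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: iterate until the coefficient of variation (stddev / mean) of a
// sliding window of recent iteration times drops below a threshold, then
// treat subsequent iterations as steady state.
public class SteadyState {
    static double cov(List<Long> w) {
        double mean = w.stream().mapToLong(Long::longValue).average().orElse(0);
        double var = w.stream()
                      .mapToDouble(t -> (t - mean) * (t - mean))
                      .sum() / (w.size() - 1);
        return Math.sqrt(var) / mean;
    }

    public static void main(String[] args) {
        final int window = 4;              // illustrative choices,
        final double threshold = 0.02;     // not prescribed values
        List<Long> times = new ArrayList<>();
        for (int i = 1; i <= 30; i++) {
            long start = System.nanoTime();
            workload();
            times.add((System.nanoTime() - start) / 1_000_000);
            if (times.size() >= window
                    && cov(times.subList(times.size() - window, times.size())) < threshold) {
                System.out.printf("steady state reached after %d iterations%n", i);
                break;
            }
        }
    }

    static void workload() {               // placeholder for a real benchmark
        long sum = 0;
        for (int i = 0; i < 5_000_000; i++) sum += i;
    }
}
```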
JIT Innovation: Best Practices
- Example: a new compiler optimization
  - Code quality: does it improve the application code?
  - Compile time: how much compile time does it add?
  - Total time: compiler and application time together
- Problem: adaptive compilation responds to compilation load
- Question: how do we tease all these effects apart?
JIT Innovation: Best Practices
- Teasing apart compile time and code quality requires multiple experiments
- Total time: mix methodology
  - Run the adaptive system as intended
  - Result: a mixture of optimized and unoptimized code
  - First and second iterations (which include compile time)
  - Set and/or report the heap size as a function of the maximum live size of the application
  - Report averages and show statistical error
- Code quality
  - OK: run iterations until performance stabilizes and report the best, or
  - Better: run several iterations of the benchmark, turn off the compiler, and measure a run guaranteed to have no compilation
  - Best: replay mix compilation
- Compile time
  - Requires the compiler to be deterministic
  - Replay mix compilation
Replay Compilation
- Force the JIT to produce a deterministic result
- Make a compilation profiler and replayer (sketched below)
- Profiler
  - Profile first or later iterations with the adaptive JIT; pick the best or the average
  - Record the profiling information used in compilation decisions, e.g., dynamic profiles of edges, paths, and/or the dynamic call graph
  - Record compilation decisions, e.g., compile method bar at level two, inline method foo into bar
  - Mix of optimized and unoptimized code, or all optimized/unoptimized
- Replayer
  - Reads in the profile
  - As the system loads each class, applies the profile with or without the innovation
- Result
  - Controlled experiments with deterministic compiler behavior
  - Reduced statistical variance in measurements
  - Still not a perfect methodology for inlining
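A sketch of the kind of data a replay profile might carry: per-method compilation decisions recorded during a profiling run and re-applied deterministically in later runs. The class, method names, and integer "level" encoding are hypothetical; real implementations also record richer inputs such as edge profiles, path profiles, and inlining decisions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a replay-compilation profile: record what the
// adaptive JIT decided during profiling, then answer the same decisions
// deterministically during replay runs.
public class ReplayProfile {
    // method signature -> optimization level the adaptive system chose
    private final Map<String, Integer> optLevel = new HashMap<>();

    // Profiling run: record each compilation decision.
    void record(String method, int level) {
        optLevel.merge(method, level, Math::max);   // keep the highest level seen
    }

    // Replay run: as each method is loaded, apply the recorded decision,
    // so every experiment compiles exactly the same code the same way.
    int decisionFor(String method) {
        return optLevel.getOrDefault(method, 0);    // 0 = leave unoptimized
    }

    public static void main(String[] args) {
        ReplayProfile p = new ReplayProfile();
        p.record("Foo.bar()", 2);                   // e.g., compile bar at level two
        System.out.println("Foo.bar() -> level " + p.decisionFor("Foo.bar()"));
    }
}
```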
GC Innovation: Best Practices
- Requires more than one experiment...
- Use and report a range of fixed heap sizes (see the sketch below)
  - Explore the space-time tradeoff
  - Measure heap size with respect to the maximum live size of the application
  - VMs should report total memory, not just application memory
    - Different GC algorithms vary in the metadata they require
    - The JIT and VM use memory too...
- Measure time with a constant workload
  - Do not measure throughput
- Best: run two experiments
  - Mix with the adaptive methodology: what users are likely to see in practice
  - Replay: hold compiler activity constant
    - Choose a profile with the best application performance, to keep from hiding mutator overheads in bad code.
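A sketch of heap sizing as a function of live size: pick a set of multipliers of the application's maximum live size and pin the heap at each point with HotSpot's -Xms/-Xmx flags (which are real flags; the 50 MB live size, the multipliers, and the Benchmark class name are illustrative assumptions).

```java
// Sketch: derive a range of fixed heap sizes as multiples of the maximum
// live size and print the corresponding JVM invocations. The live size and
// multipliers below are assumptions for illustration.
public class HeapSizes {
    public static void main(String[] args) {
        int maxLiveMB = 50;                                   // assumed max live size
        double[] multipliers = {1.0, 1.5, 2.0, 3.0, 4.0, 6.0};

        for (double m : multipliers) {
            int heapMB = (int) Math.ceil(maxLiveMB * m);
            System.out.printf("java -Xms%dm -Xmx%dm Benchmark   # %.1fx live size%n",
                              heapMB, heapMB, m);
        }
    }
}
```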
Architecture Innovation: Best Practices
- Requires more than one experiment...
- Use more than one VM
- Set a modest heap size and/or report the heap size as a function of the maximum live size
- Use a mixture of optimized and uncompiled code
  - The simulator needs the same code in many cases to perform comparisons
- Best for microarchitecture-only changes:
  - Multiple traces from a live system with the adaptive methodology
  - Start-up, and steady state with the compiler turned off
  - What users are likely to see in practice
  - Won't work if the architecture change requires recompilation, e.g., a new sampling mechanism
- Use replay to make the code as similar as possible
There are lies, damn lies, and benchmarks.
Conclusions
- Methodology includes
  - Benchmarks
  - Experimental design
  - Statistical analysis (OOPSLA 2007)
- Poor methodology can focus or misdirect innovation and energy
- We have a unique opportunity
  - Transactional memory, multicore performance, dynamic languages, ...
- What we can do
  - Enlist VM builders to include replay
  - Fund and broaden participation in benchmarking
    - Research and industrial partnerships
    - Funding through NSF, ACM, SPEC, industry, or ??
  - Participate in building community workloads
Thank you!
www.dacapobench.org