Title: Simulating a $2M Commercial Server on a $2K PC
1. Simulating a $2M Commercial Server on a $2K PC
- Alaa Alameldeen, Milo Martin, Carl Mauer, Kevin Moore, Min Xu, Daniel Sorin, Mark D. Hill, David A. Wood
- Multifacet Project (www.cs.wisc.edu/multifacet)
- Computer Sciences Department
- University of Wisconsin-Madison
- February 2003
2. Summary
- Context
- Commercial server design is important
- Multifacet project seeks improved designs
- Must evaluate alternatives
- Commercial Servers
- Processors, memory, disks ≈ $2M
- Run large multithreaded transaction-oriented workloads
- Use commercial applications on commercial OS
- To Simulate on $2K PC
- Scale & tune workloads (keep L2 miss rates, etc.)
- Manage simulation complexity (separate timing & functional simulation)
- Cope with workload variability (use randomness & statistics)
3. Outline
- Context
- Commercial Servers
- Multifacet Project
- Workload Simulation Methods
- Separate Timing & Functional Simulation
- Cope with Workload Variability
- Summary
4. Why Commercial Servers?
- Many (Academic) Architects
- Desktop computing
- Wireless appliances
- We focus on servers
- (Important Market)
- Performance Challenges
- Robustness Challenges
- Methodological Challenges
5. 3-Tier Internet Service
[Diagram: clients (PCs w/ soft state) connect over a LAN/SAN to servers running applications for business rules, which connect over a LAN/SAN to servers running databases for hard state]
6. Multifacet Commercial Server Design
- Wisconsin Multifacet Project
- Directed by Mark D. Hill & David A. Wood
- Sponsors: NSF, WI, Compaq, IBM, Intel, Sun
- Current Contributors: Alaa Alameldeen, Brad Beckmann, Nikhil Gupta, Pacia Harper, Jarrod Lewis, Milo Martin, Carl Mauer, Kevin Moore, Daniel Sorin, Min Xu
- Past Contributors: Anastassia Ailamaki, Ender Bilir, Ross Dickson, Ying Hu, Manoj Plakal, Anne Condon
- Analysis
- Want 4-64 processors
- Many cache-to-cache misses
- Neither snooping nor directories ideal
- Multifacet Designs
- Snooping w/ multicast [ISCA '99] or unordered network [ASPLOS '00]
- Bandwidth-adaptive [HPCA '02], token coherence [ISCA '03]
7. Outline
- Context
- Workload Simulation Methods
- Select, scale, tune workloads
- Transition workload to simulator
- Specify & test the proposed design
- Evaluate design with simple/detailed processor models
- Separate Timing & Functional Simulation
- Cope with Workload Variability
- Summary
8. Multifacet Simulation Overview
[Diagram: workload development runs Full and Scaled Workloads on a Commercial Server (Sun E6000); Scaled Workloads feed the Full System Functional Simulator (Simics), which couples to the Memory Timing Simulator (Ruby), generated by the Memory Protocol Generator (SLICC) and verified by a Pseudo-Random Protocol Checker, and to the Processor Timing Simulator (Opal)]
- Virtutech Simics (www.virtutech.com)
- Rest is Multifacet software
9. Select Important Workloads
Full Workloads
- Online Transaction Processing: DB2 w/ TPC-C-like workload
- Java Server Workload: SPECjbb
- Static web content serving: Apache
- Dynamic web content serving: Slashcode
- Java-based Middleware (soon)
10. Set Up & Tune Workloads (on real hardware)
Commercial Server (Sun E6000)
Full Workloads
- Tune workload & OS parameters
- Measure transaction rate, speed-up, miss rates, I/O
- Compare to published results
11. Scale & Re-tune Workloads
Commercial Server (Sun E6000)
Scaled Workloads
- Scale down for PC memory limits
- Retain similar behavior (e.g., L2 cache miss rate)
- Re-tune to achieve higher transaction rates (OLTP: raw disks, multiple disks, more users, etc.)
12. Transition Workloads to Simulation
Scaled Workloads
Full System Functional Simulator (Simics)
- Create disk dumps of tuned workloads
- In simulator: boot OS, start & warm application
- Create Simics checkpoint (snapshot)
13. Specify Proposed Computer Design
Memory Protocol Generator (SLICC)
Memory Timing Simulator (Ruby)
- Coherence Protocol (control tables: states × events)
- Cache Hierarchy (parameters & queues)
- Interconnect (switches & queues)
- Processor (later)
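SLICC's table-driven style can be illustrated with a minimal sketch: protocol behavior expressed as a (state, event) table mapping to actions and a next state. The MSI-style states, events, and action names below are hypothetical illustrations, not Multifacet's actual protocols or the SLICC language itself.

```python
# Hypothetical sketch of a table-driven coherence controller in the spirit
# of SLICC: behavior as (state, event) -> (actions, next state).
# States/events are an illustrative MSI subset, not Multifacet's protocol.

TRANSITIONS = {
    ("I", "Load"):      (["issue_GETS"], "IS"),   # miss: request shared copy
    ("I", "Store"):     (["issue_GETX"], "IM"),   # miss: request exclusive copy
    ("IS", "Data"):     (["fill_cache"], "S"),
    ("IM", "Data"):     (["fill_cache"], "M"),
    ("S", "Load"):      ([], "S"),                # hit
    ("S", "Store"):     (["issue_GETX"], "SM"),   # upgrade
    ("SM", "Data"):     (["fill_cache"], "M"),
    ("M", "Load"):      ([], "M"),
    ("M", "Store"):     ([], "M"),
    ("M", "OtherGETS"): (["send_data"], "S"),     # downgrade on remote read
    ("M", "OtherGETX"): (["send_data"], "I"),     # invalidate on remote write
}

def step(state, event):
    """Apply one transition; an unlisted pair is a protocol error."""
    if (state, event) not in TRANSITIONS:
        raise ValueError(f"illegal transition: {event} in state {state}")
    actions, next_state = TRANSITIONS[(state, event)]
    return actions, next_state
```

The appeal of the table form is that every reachable (state, event) pair is explicit, which is what makes both code generation and exhaustive checking tractable.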
14. Test Proposed Computer Design
Memory Timing Simulator (Ruby)
Pseudo-Random Protocol Checker
- Randomly select write action; later read & check
- Massive false sharing for interaction
- Perverse network stresses design
- Transient error & deadlock detection
- Sound but not complete
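The write-then-check idea above can be sketched as follows. The dict-backed "memory" is a stand-in for the simulated memory system under test; a real checker drives many processors at once to force false sharing, which this single-threaded sketch does not show. Like the slide says, this is sound but not complete: a failed assertion is a real bug, but passing proves nothing.

```python
import random

# Sketch of pseudo-random protocol checking: pick a random address, write a
# unique value, and later read it back to verify the memory system returned
# the most recent write. Concentrating on a few addresses maximizes sharing.

def random_write_read_check(memory, num_ops=1000, num_addrs=16, seed=42):
    rng = random.Random(seed)
    expected = {}                          # checker's golden copy
    for op in range(num_ops):
        addr = rng.randrange(num_addrs)    # few addresses -> heavy sharing
        if rng.random() < 0.5 or addr not in expected:
            value = op                     # unique value per write action
            memory[addr] = value
            expected[addr] = value
        else:                              # read check against golden copy
            assert memory[addr] == expected[addr], \
                f"stale value at address {addr}"
    return True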
15. Simulate with Simple Blocking Processor
Scaled Workloads
Full System Functional Simulator (Simics)
Memory Timing Simulator (Ruby)
- Warm up caches (sometimes sufficient by itself, e.g., for SafetyNet)
- Run for fixed number of transactions
- Some transactions partially done at start
- Other transactions partially done at end
- Cope with workload variability (later)
16. Simulate with Detailed Processor
Scaled Workloads
Full System Functional Simulator (Simics)
Memory Timing Simulator (Ruby)
Processor Timing Simulator (Opal)
- Accurate (future) timing & (current) function
- Simulation complexity decoupled (discussed soon)
- Same transaction methodology & workload variability issues
17. Simulation Infrastructure & Workload Process
Commercial Server (Sun E6000)
Scaled Workloads
Full Workloads
Full System Functional Simulator (Simics)
Memory Protocol Generator (SLICC)
Memory Timing Simulator (Ruby)
Processor Timing Simulator (Opal)
Pseudo-Random Protocol Checker
- Select important workloads; run, tune, scale, re-tune
- Specify system & pseudo-randomly test
- Create warm workload checkpoint
- Simulate with simple or detailed processor
- Fixed transactions; manage simulation complexity (next), cope with workload variability (next next)
18. Outline
- Context
- Simulation Infrastructure & Workload Process
- Separate Timing & Functional Simulation
- Simulation Challenges
- Managing Simulation Complexity
- Timing-First Simulation
- Evaluation
- Cope with Workload Variability
- Summary
19. Challenges to Timing Simulation
- Execution-driven simulation is getting harder
- Micro-architecture complexity
- Multiple in-flight instructions
- Speculative execution
- Out-of-order execution
- Thread-level parallelism
- Hardware Multi-threading
- Traditional Multi-processing
20. Challenges to Functional Simulation
- Commercial workloads have high functional fidelity demands
[Diagram: spectrum of fidelity demands, from Kernels and SPEC Benchmarks up through the Operating System, Web Server, and Database]
21. Managing Simulator Complexity
- Tight coupling w/ timing feedback: what about performance?
- Timing feedback while using existing simulators
- Software development advantages
22. Timing-First Simulation
- Timing Simulator
- Does functional execution of user and privileged operations
- Does speculative, out-of-order multiprocessor timing simulation
- Does NOT implement functionality of full instruction set or any devices
- Functional Simulator
- Does full-system multiprocessor simulation
- Does NOT model detailed micro-architectural timing
[Diagram: Timing Simulator (CPU) paired with Functional Simulator modeling the full system (CPU, Network, RAM)]
23. Timing-First Operation
- As instruction retires, step CPU in functional simulator
- Verify instruction's execution
- Reload state if timing simulator deviates from functional
- Loads in multiprocessors
- Instructions with unidentified side effects
- NOT loads/stores to I/O devices
[Diagram: Timing Simulator (CPU) verified against Functional Simulator modeling the full system (CPU, Network, RAM)]
24. Benefits of Timing-First
- Supports speculative multiprocessor timing models
- Leverages existing simulators
- Software development advantages
- Increases flexibility and reduces code complexity
- Immediate, precise check on timing simulator
- However:
- How much performance error is introduced by this approach?
- Are there simulation performance penalties?
25. Evaluation
- Our implementation, TFsim, uses
- Functional Simulator: Virtutech Simics
- Timing simulator: implemented in less than one person-year
- Evaluated using OS-intensive commercial workloads
- OS Boot: > 1 billion instructions of Solaris 8 startup
- OLTP: TPC-C-like benchmark using a 1 GB database
- Dynamic Web: Apache serving a message board, using code and data similar to slashdot.org
- Static Web: Apache web server serving static web pages
- Barnes-Hut: scientific SPLASH-2 benchmark
26. Measured Deviations
- Less than 20 deviations per 100,000 instructions (0.02%)
27. If the Timing Simulator Modeled Fewer Events
28. Sensitivity Results
29. Analysis of Results
- Runs full-system workloads!
- Timing performance impact of deviations
- Worst case: less than 3% performance error
- Overhead of redundant execution
- 18% on average for uniprocessors
- 18% (2 processors) up to 36% (16 processors)
30. Performance Comparison
- Absolute simulation performance comparison
- In kilo-instructions committed per second (KIPS)
- RSIM (scaled): 107 KIPS
- Uniprocessor TFsim: 119 KIPS
[Table: feature-by-feature comparison of TFsim vs. RSIM, entries marked match / close / different]
31. Bundled Retires
32. Timing-First Conclusions
- Execution-driven simulators are increasingly complex
- How to manage complexity?
- Our answer
- Introduces relatively little performance error (worst case 3%)
- Has low overhead (18% uniprocessor average)
- Rapid development time
33. Outline
- Context
- Workload Process & Infrastructure
- Separate Timing & Functional Simulation
- Cope with Workload Variability
- Variability in Multithreaded Workloads
- Coping in Simulation
- Examples & Statistics
- Summary
34. What is Happening Here?
OLTP
35. What is Happening Here?
- How can slower memory lead to a faster workload?
- Answer: a multithreaded workload takes a different path
- Different lock race outcomes
- Different scheduling decisions
- (1) Does this happen on real hardware?
- (2) If so, what should we do about it?
36. One-Second Intervals (on real hardware)
OLTP
37. 60-Second Intervals (on real hardware)
16-day simulation
OLTP
38. Coping with Workload Variability
- Running (simulating) long enough is not appealing
- Need to separate coincidental & real effects
- Standard statistics on real hardware
- Variation within base-system runs
- vs. variation between base & enhanced system runs
- But deterministic simulation has no within-run variation
- Solution with deterministic simulation
- Add pseudo-random delay on L2 misses
- Simulate base (& enhanced) system many times
- Use simple or complex statistics
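The pseudo-random perturbation idea can be sketched as follows: jitter every L2 miss by a few cycles so that a deterministic simulator, run once per seed, yields a distribution of runtimes instead of a single point. The synthetic miss "workload" and jitter range below are illustrative assumptions; in practice the delays are injected inside the memory timing simulator.

```python
import random
import statistics

# Sketch of the space-variability method: add a small pseudo-random delay
# to each L2 miss so repeated deterministic runs differ.

def simulate_once(base_miss_latency, num_misses, seed):
    rng = random.Random(seed)
    cycles = 0
    for _ in range(num_misses):
        jitter = rng.randrange(0, 5)       # few-cycle random perturbation
        cycles += base_miss_latency + jitter
    return cycles

def simulate_many(base_miss_latency, num_misses=1000, runs=10):
    """One deterministic run per seed: within-configuration variation."""
    return [simulate_once(base_miss_latency, num_misses, seed)
            for seed in range(runs)]
```

The resulting per-configuration samples are what feed the confidence-interval analysis on the following slides.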
39. Coincidental (Space) Variability
40. Wrong Conclusion Ratio
- WCR(16, 32) = 18%
- WCR(16, 64) = 7.5%
- WCR(32, 64) = 26%
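The wrong-conclusion ratio can be illustrated with a small Monte Carlo sketch: draw one noisy run per system, compare them, and count how often noise flips the true ordering. The means and noise level below are illustrative assumptions, not the measured values behind the WCR numbers above.

```python
import random

# Monte Carlo sketch of a wrong-conclusion ratio: comparing two systems
# from ONE run each, how often does run-to-run noise reverse the true
# ordering of their means?

def wrong_conclusion_ratio(mean_a, mean_b, noise, trials=10000, seed=1):
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        sample_a = rng.gauss(mean_a, noise)   # one run of system A
        sample_b = rng.gauss(mean_b, noise)   # one run of system B
        if (sample_a < sample_b) != (mean_a < mean_b):
            wrong += 1                        # single-run comparison lied
    return wrong / trials
```

When the true difference is small relative to the noise, a single-run comparison is wrong a substantial fraction of the time, which is exactly why the next slide turns to multiple runs and standard statistics.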
41. More Generally: Use Standard Statistics
- As one would for measurements of a live system
- Confidence Intervals
- 95% confidence intervals contain the true value 95% of the time
- Non-overlapping confidence intervals give statistically significant conclusions
- Use ANOVA or hypothesis testing: even better!
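The interval test described above can be sketched as follows: compute a 95% confidence interval for each configuration's mean and treat non-overlapping intervals as a significant difference. The normal-approximation critical value 1.96 is an assumption of this sketch; for the handful of runs typical in simulation, Student's t (or ANOVA, as the slide suggests) is the better tool.

```python
import statistics

# Sketch of the multiple-runs methodology: per-configuration 95%
# confidence intervals on the mean, compared for overlap.

def confidence_interval(samples, z=1.96):
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5  # std. error of mean
    return (mean - z * sem, mean + z * sem)

def significantly_different(samples_a, samples_b):
    lo_a, hi_a = confidence_interval(samples_a)
    lo_b, hi_b = confidence_interval(samples_b)
    return hi_a < lo_b or hi_b < lo_a      # intervals do not overlap
```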
42. Confidence Interval Example
ROB
- Estimate runs needed to get non-overlapping confidence intervals
43. Also: Time Variability (on real hardware)
OLTP
- Therefore, select checkpoint(s) carefully
44. Workload Variability Summary
- Variability is a real phenomenon for multithreaded workloads
- Runs from the same initial conditions differ
- Variability is a challenge for simulations
- Simulations are short
- Wrong conclusions may be drawn
- Our solution accounts for variability
- Multiple runs, confidence intervals
- Reduces wrong-conclusion probability
45. Talk Summary
- Simulations of $2M Commercial Servers must
- Complete in reasonable time (on $2K PCs)
- Handle OS, devices, multithreaded hardware
- Cope with variability of multithreaded software
- Multifacet
- Scale & tune transactional workloads
- Separate timing & functional simulation
- Cope w/ workload variability via randomness & statistics
- References (www.cs.wisc.edu/multifacet/papers)
- Simulating a $2M Commercial Server on a $2K PC [Computer '03]
- Full-System Timing-First Simulation [Sigmetrics '02]
- Variability in Architectural Simulations [HPCA '03]
46. Other Multifacet Methods Work
- Specifying & Verifying Coherence Protocols
- [SPAA '98, HPCA '99, SPAA '99, TPDS '02]
- Workload Analysis & Improvement
- Database systems [VLDB '99, VLDB '01]
- Pointer-based [PLDI '99, Computer '00]
- Middleware [HPCA '03]
- Modeling & Simulation
- Commercial workloads [Computer '02, HPCA '03]
- Decoupling timing/functional simulation [Sigmetrics '02]
- Simulation generation [PLDI '01]
- Analytic modeling [Sigmetrics '00, TPDS TBA]
- Micro-architectural slack [ISCA '02]
47. Backup Slides
48. One Ongoing/Future Methods Direction
- Middleware Applications
- Memory system behavior of Java middleware [HPCA '03]
- Machine measurements
- Full-system simulation
- Future Work: Multi-Machine Simulation
- Isolate middle tier from client emulators and database
- Understand fundamental workload behaviors
- Drive future system design
49. ECperf vs. SPECjbb
- Different cache-to-cache transfer ratios!
50. Online Transaction Processing (OLTP)
- DB2 with a TPC-C-like workload. The TPC-C benchmark is widely used to evaluate system performance for the on-line transaction processing market. The benchmark itself is a specification that describes the schema, scaling rules, transaction types, and transaction mix, but not the exact implementation of the database. TPC-C transactions are of five transaction types, all related to an order-processing environment. Performance is measured by the number of New Order transactions performed per minute (tpmC).
- Our OLTP workload is based on the TPC-C v3.0 benchmark. We use IBM's DB2 V7.2 EEE database management system and an IBM benchmark kit to build the database and emulate users. We build an 800 MB, 4000-warehouse database on five raw disks and an additional dedicated database log disk. We scaled down the size of each warehouse while maintaining the reduced ratios of 3 sales districts per warehouse, 30 customers per district, and 100 items per warehouse (compared to the 10, 3,000, and 100,000 required by the TPC-C specification). Each user randomly executes transactions according to the TPC-C transaction mix specifications, and we set the think and keying times for users to zero. A different database thread is started for each user. We measure all completed transactions, even those that do not satisfy the timing constraints of the TPC-C benchmark specification.
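The user-emulation loop described above can be sketched as follows: each emulated user repeatedly draws a transaction type from the TPC-C mix with zero think and keying time. The mix percentages below are the TPC-C specification's standard minimums for the non-New-Order types (with New Order taking the remainder); the function itself is an illustrative sketch, not the IBM benchmark kit.

```python
import random

# Sketch of one emulated TPC-C user: pick transactions per the mix,
# back-to-back (zero think/keying time), and tally what ran.

TPCC_MIX = [
    ("NewOrder",    45),   # the measured transaction (tpmC)
    ("Payment",     43),
    ("OrderStatus",  4),
    ("Delivery",     4),
    ("StockLevel",   4),
]

def emulate_user(num_transactions, seed):
    rng = random.Random(seed)
    names = [name for name, _ in TPCC_MIX]
    weights = [w for _, w in TPCC_MIX]
    counts = {name: 0 for name in names}
    for _ in range(num_transactions):      # zero think time: no delays
        txn = rng.choices(names, weights=weights)[0]
        counts[txn] += 1                   # a real kit would execute it here
    return counts
```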
51. Java Server Workload (SPECjbb)
- Java-based middleware applications are increasingly used in modern e-business settings. SPECjbb is a Java benchmark emulating a 3-tier system with emphasis on the middle-tier server business logic. SPECjbb runs in a single Java Virtual Machine (JVM) in which threads represent terminals in a warehouse. Each thread independently generates random input (tier 1 emulation) before calling transaction-specific business logic. The business logic operates on data held in binary trees of Java objects (tier 3 emulation). The specification states that the benchmark does no disk or network I/O.
- We used Sun's HotSpot 1.4.0 Server JVM and Solaris's native thread implementation. The benchmark includes driver threads to generate transactions. We set the system heap size to 1.8 GB and the new-object heap size to 256 MB to reduce the frequency of garbage collection. Our experiments used 24 warehouses, with a data size of approximately 500 MB.
52. Static Web Content Serving: Apache
- Web servers such as Apache represent an important enterprise server application. Apache is a popular open-source web server used in many internet/intranet settings. In this benchmark, we focus on static web content serving.
- We use Apache 2.0.39 for SPARC/Solaris 8, configured to use pthread locks and minimal logging at the web server. We use the Scalable URL Request Generator (SURGE) as the client. SURGE generates a sequence of static URL requests that exhibit representative distributions for document popularity, document sizes, request sizes, temporal and spatial locality, and embedded document count. We use a repository of 20,000 files (totaling 500 MB) and clients with zero think time. We compiled both Apache and SURGE using Sun's WorkShop C 6.1 with aggressive optimization.
53. Dynamic Web Content Serving: Slashcode
- Dynamic web content serving has become increasingly important for web sites that serve large amounts of information. Dynamic content is used by online stores, instant news, and community message board systems. Slashcode is an open-source dynamic web message posting system used by the popular slashdot.org message board system.
- We used Slashcode 2.0, Apache 1.3.20, and Apache's mod_perl module 1.25 (with Perl 5.6) on the server side. We used MySQL 3.23.39 as the database engine. The server content is a snapshot from the slashcode.com site, containing approximately 3,000 messages with a total size of 5 MB. Most of the run time is spent on dynamic web page generation. We use a multithreaded user emulation program to emulate user browsing and posting behavior. Each user independently and randomly generates browsing and posting requests to the server according to a transaction mix specification. We compiled both server and client programs using Sun's WorkShop C 6.1 with aggressive optimization.