Title: Simulating a $2M Commercial Server on a $2K PC
1. Simulating a $2M Commercial Server on a $2K PC
- Alaa Alameldeen, Milo Martin, Carl Mauer, Kevin Moore, Min Xu, Daniel Sorin, Mark D. Hill, David A. Wood
- Multifacet Project (www.cs.wisc.edu/multifacet)
- Computer Sciences Department
- University of Wisconsin-Madison
- February 2003
2. Summary
- Context
- Commercial server design is important
- Multifacet project seeks improved designs
- Must evaluate alternatives
- Commercial Servers
- Processors, memory, disks ≈ $2M
- Run large multithreaded transaction-oriented workloads
- Use commercial applications on commercial OS
- To Simulate on $2K PC
- Scale & tune workloads (keep L2 miss rates, etc.)
- Manage simulation complexity (separate timing & functional simulation)
- Cope with workload variability (use randomness & statistics)
3. Outline
- Context
- Commercial Servers
- Multifacet Project
- Workload Simulation Methods
- Separate Timing & Functional Simulation
- Cope with Workload Variability
- Summary
4. Why Commercial Servers?
- Many (Academic) Architects
- Desktop computing
- Wireless appliances
- We focus on servers
- (Important Market)
- Performance Challenges
- Robustness Challenges
- Methodological Challenges
5. 3-Tier Internet Service
[Diagram: clients (PCs w/ soft state) connect over a LAN/SAN to servers running applications for business rules, which connect over a LAN/SAN to servers running databases for hard state]
6. Multifacet Commercial Server Design
- Wisconsin Multifacet Project
- Directed by Mark D. Hill & David A. Wood
- Sponsors: NSF, WI, Compaq, IBM, Intel, Sun
- Current Contributors: Alaa Alameldeen, Brad Beckmann, Nikhil Gupta, Pacia Harper, Jarrod Lewis, Milo Martin, Carl Mauer, Kevin Moore, Daniel Sorin, Min Xu
- Past Contributors: Anastassia Ailamaki, Ender Bilir, Ross Dickson, Ying Hu, Manoj Plakal, Anne Condon
- Analysis
- Want 4-64 processors
- Many cache-to-cache misses
- Neither snooping nor directories ideal
- Multifacet Designs
- Snooping w/ multicast [ISCA '99] or unordered network [ASPLOS '00]
- Bandwidth-adaptive [HPCA '02], token coherence [ISCA '03]
7. Outline
- Context
- Workload Simulation Methods
- Select, scale, tune workloads
- Transition workload to simulator
- Specify & test the proposed design
- Evaluate design with simple/detailed processor models
- Separate Timing & Functional Simulation
- Cope with Workload Variability
- Summary
8. Multifacet Simulation Overview
[Diagram: workload development runs Full and Scaled Workloads on a Commercial Server (Sun E6000); Scaled Workloads feed the Full System Functional Simulator (Simics), which couples to the Memory Timing Simulator (Ruby), generated by the Memory Protocol Generator (SLICC) and verified by a Pseudo-Random Protocol Checker, and to the Processor Timing Simulator (Opal)]
- Virtutech Simics (www.virtutech.com)
- Rest is Multifacet software
9. Select Important Workloads
Full Workloads
- Online Transaction Processing: DB2 w/ TPC-C-like workload
- Java Server Workload: SPECjbb
- Static web content serving: Apache
- Dynamic web content serving: Slashcode
- Java-based Middleware (soon)
10. Set Up & Tune Workloads (on real hardware)
Commercial Server (Sun E6000)
Full Workloads
- Tune workload & OS parameters
- Measure transaction rate, speed-up, miss rates, I/O
- Compare to published results
11. Scale & Re-tune Workloads
Commercial Server (Sun E6000)
Scaled Workloads
- Scale down for PC memory limits
- Retain similar behavior (e.g., L2 cache miss rate)
- Re-tune to achieve higher transaction rates (OLTP: raw disks, multiple disks, more users, etc.)
12. Transition Workloads to Simulation
Scaled Workloads
Full System Functional Simulator (Simics)
- Create disk dumps of tuned workloads
- In simulator: boot OS, start & warm application
- Create Simics checkpoint (snapshot)
13. Specify Proposed Computer Design
Memory Protocol Generator (SLICC)
Memory Timing Simulator (Ruby)
- Coherence Protocol (control tables: states × events)
- Cache Hierarchy (parameters & queues)
- Interconnect (switches & queues)
- Processor (later)
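SLICC's table-driven style can be illustrated with a minimal sketch: protocol behavior expressed as a (state, event) table mapping to actions and a next state. The MSI-style states, events, and action names below are hypothetical illustrations, not Multifacet's actual protocols or the SLICC language itself.

```python
# Hypothetical sketch of a table-driven coherence controller in the spirit
# of SLICC: behavior as (state, event) -> (actions, next state).
# States/events are an illustrative MSI subset, not Multifacet's protocol.

TRANSITIONS = {
    ("I", "Load"):      (["issue_GETS"], "IS"),   # miss: request shared copy
    ("I", "Store"):     (["issue_GETX"], "IM"),   # miss: request exclusive copy
    ("IS", "Data"):     (["fill_cache"], "S"),
    ("IM", "Data"):     (["fill_cache"], "M"),
    ("S", "Load"):      ([], "S"),                # hit
    ("S", "Store"):     (["issue_GETX"], "SM"),   # upgrade
    ("SM", "Data"):     (["fill_cache"], "M"),
    ("M", "Load"):      ([], "M"),
    ("M", "Store"):     ([], "M"),
    ("M", "OtherGETS"): (["send_data"], "S"),     # downgrade on remote read
    ("M", "OtherGETX"): (["send_data"], "I"),     # invalidate on remote write
}

def step(state, event):
    """Apply one transition; an unlisted pair is a protocol error."""
    if (state, event) not in TRANSITIONS:
        raise ValueError(f"illegal transition: {event} in state {state}")
    actions, next_state = TRANSITIONS[(state, event)]
    return actions, next_state
```

The appeal of the table form is that every reachable (state, event) pair is explicit, which is what makes both code generation and exhaustive checking tractable.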
14. Test Proposed Computer Design
Memory Timing Simulator (Ruby)
Pseudo-Random Protocol Checker
- Randomly select write action; later read & check
- Massive false sharing for interaction
- Perverse network stresses design
- Transient error & deadlock detection
- Sound but not complete
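The write-then-check idea above can be sketched as follows. The dict-backed "memory" is a stand-in for the simulated memory system under test; a real checker drives many processors at once to force false sharing, which this single-threaded sketch does not show. Like the slide says, this is sound but not complete: a failed assertion is a real bug, but passing proves nothing.

```python
import random

# Sketch of pseudo-random protocol checking: pick a random address, write a
# unique value, and later read it back to verify the memory system returned
# the most recent write. Concentrating on a few addresses maximizes sharing.

def random_write_read_check(memory, num_ops=1000, num_addrs=16, seed=42):
    rng = random.Random(seed)
    expected = {}                          # checker's golden copy
    for op in range(num_ops):
        addr = rng.randrange(num_addrs)    # few addresses -> heavy sharing
        if rng.random() < 0.5 or addr not in expected:
            value = op                     # unique value per write action
            memory[addr] = value
            expected[addr] = value
        else:                              # read check against golden copy
            assert memory[addr] == expected[addr], \
                f"stale value at address {addr}"
    return True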
15. Simulate with Simple Blocking Processor
Scaled Workloads
Full System Functional Simulator (Simics)
Memory Timing Simulator (Ruby)
- Warm up caches (sometimes sufficient by itself, e.g., for SafetyNet)
- Run for fixed number of transactions
- Some transactions partially done at start
- Other transactions partially done at end
- Cope with workload variability (later)
16. Simulate with Detailed Processor
Scaled Workloads
Full System Functional Simulator (Simics)
Memory Timing Simulator (Ruby)
Processor Timing Simulator (Opal)
- Accurate (future) timing & (current) function
- Simulation complexity decoupled (discussed soon)
- Same transaction methodology & workload variability issues
17. Simulation Infrastructure & Workload Process
Commercial Server (Sun E6000)
Scaled Workloads
Full Workloads
Full System Functional Simulator (Simics)
Memory Protocol Generator (SLICC)
Memory Timing Simulator (Ruby)
Processor Timing Simulator (Opal)
Pseudo-Random Protocol Checker
- Select important workloads; run, tune, scale, re-tune
- Specify system & pseudo-randomly test
- Create warm workload checkpoint
- Simulate with simple or detailed processor
- Fixed transactions; manage simulation complexity (next), cope with workload variability (next next)
18. Outline
- Context
- Simulation Infrastructure & Workload Process
- Separate Timing & Functional Simulation
- Simulation Challenges
- Managing Simulation Complexity
- Timing-First Simulation
- Evaluation
- Cope with Workload Variability
- Summary
19. Challenges to Timing Simulation
- Execution-driven simulation is getting harder
- Micro-architecture complexity
- Multiple in-flight instructions
- Speculative execution
- Out-of-order execution
- Thread-level parallelism
- Hardware Multi-threading
- Traditional Multi-processing
20. Challenges to Functional Simulation
- Commercial workloads have high functional fidelity demands
[Diagram: spectrum of fidelity demands, from Kernels and SPEC Benchmarks up through the Operating System, Web Server, and Database]
21. Managing Simulator Complexity
- Tight coupling w/ timing feedback: what about performance?
- Timing feedback while using existing simulators
- Software development advantages
22. Timing-First Simulation
- Timing Simulator
- Does functional execution of user and privileged operations
- Does speculative, out-of-order multiprocessor timing simulation
- Does NOT implement functionality of full instruction set or any devices
- Functional Simulator
- Does full-system multiprocessor simulation
- Does NOT model detailed micro-architectural timing
[Diagram: Timing Simulator (CPU) paired with Functional Simulator modeling the full system (CPU, Network, RAM)]
23. Timing-First Operation
- As instruction retires, step CPU in functional simulator
- Verify instruction's execution
- Reload state if timing simulator deviates from functional
- Loads in multiprocessors
- Instructions with unidentified side effects
- NOT loads/stores to I/O devices
[Diagram: Timing Simulator (CPU) verified against Functional Simulator modeling the full system (CPU, Network, RAM)]
24. Benefits of Timing-First
- Supports speculative multiprocessor timing models
- Leverages existing simulators
- Software development advantages
- Increases flexibility and reduces code complexity
- Immediate, precise check on timing simulator
- However:
- How much performance error is introduced by this approach?
- Are there simulation performance penalties?
25. Evaluation
- Our implementation, TFsim, uses
- Functional Simulator: Virtutech Simics
- Timing simulator: implemented in less than one person-year
- Evaluated using OS-intensive commercial workloads
- OS Boot: > 1 billion instructions of Solaris 8 startup
- OLTP: TPC-C-like benchmark using a 1 GB database
- Dynamic Web: Apache serving a message board, using code and data similar to slashdot.org
- Static Web: Apache web server serving static web pages
- Barnes-Hut: scientific SPLASH-2 benchmark
26. Measured Deviations
- Less than 20 deviations per 100,000 instructions (0.02%)
27. If the Timing Simulator Modeled Fewer Events
28. Sensitivity Results
29. Analysis of Results
- Runs full-system workloads!
- Timing performance impact of deviations
- Worst case: less than 3% performance error
- Overhead of redundant execution
- 18% on average for uniprocessors
- 18% (2 processors) up to 36% (16 processors)
30. Performance Comparison
- Absolute simulation performance comparison
- In kilo-instructions committed per second (KIPS)
- RSIM (scaled): 107 KIPS
- Uniprocessor TFsim: 119 KIPS
[Table: feature-by-feature comparison of TFsim vs. RSIM, entries marked match / close / different]
31. Bundled Retires
32. Timing-First Conclusions
- Execution-driven simulators are increasingly complex
- How to manage complexity?
- Our answer
- Introduces relatively little performance error (worst case 3%)
- Has low overhead (18% uniprocessor average)
- Rapid development time
33. Outline
- Context
- Workload Process & Infrastructure
- Separate Timing & Functional Simulation
- Cope with Workload Variability
- Variability in Multithreaded Workloads
- Coping in Simulation
- Examples & Statistics
- Summary
34. What is Happening Here?
OLTP
35. What is Happening Here?
- How can slower memory lead to a faster workload?
- Answer: a multithreaded workload takes a different path
- Different lock race outcomes
- Different scheduling decisions
- (1) Does this happen on real hardware?
- (2) If so, what should we do about it?
36. One-Second Intervals (on real hardware)
OLTP
37. 60-Second Intervals (on real hardware)
16-day simulation
OLTP
38. Coping with Workload Variability
- Running (simulating) long enough is not appealing
- Need to separate coincidental & real effects
- Standard statistics on real hardware
- Variation within base-system runs
- vs. variation between base & enhanced system runs
- But deterministic simulation has no within-run variation
- Solution with deterministic simulation
- Add pseudo-random delay on L2 misses
- Simulate base (& enhanced) system many times
- Use simple or complex statistics
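The pseudo-random perturbation idea can be sketched as follows: jitter every L2 miss by a few cycles so that a deterministic simulator, run once per seed, yields a distribution of runtimes instead of a single point. The synthetic miss "workload" and jitter range below are illustrative assumptions; in practice the delays are injected inside the memory timing simulator.

```python
import random
import statistics

# Sketch of the space-variability method: add a small pseudo-random delay
# to each L2 miss so repeated deterministic runs differ.

def simulate_once(base_miss_latency, num_misses, seed):
    rng = random.Random(seed)
    cycles = 0
    for _ in range(num_misses):
        jitter = rng.randrange(0, 5)       # few-cycle random perturbation
        cycles += base_miss_latency + jitter
    return cycles

def simulate_many(base_miss_latency, num_misses=1000, runs=10):
    """One deterministic run per seed: within-configuration variation."""
    return [simulate_once(base_miss_latency, num_misses, seed)
            for seed in range(runs)]
```

The resulting per-configuration samples are what feed the confidence-interval analysis on the following slides.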
39. Coincidental (Space) Variability
40. Wrong Conclusion Ratio
- WCR(16, 32) = 18%
- WCR(16, 64) = 7.5%
- WCR(32, 64) = 26%
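The wrong-conclusion ratio can be illustrated with a small Monte Carlo sketch: draw one noisy run per system, compare them, and count how often noise flips the true ordering. The means and noise level below are illustrative assumptions, not the measured values behind the WCR numbers above.

```python
import random

# Monte Carlo sketch of a wrong-conclusion ratio: comparing two systems
# from ONE run each, how often does run-to-run noise reverse the true
# ordering of their means?

def wrong_conclusion_ratio(mean_a, mean_b, noise, trials=10000, seed=1):
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        sample_a = rng.gauss(mean_a, noise)   # one run of system A
        sample_b = rng.gauss(mean_b, noise)   # one run of system B
        if (sample_a < sample_b) != (mean_a < mean_b):
            wrong += 1                        # single-run comparison lied
    return wrong / trials
```

When the true difference is small relative to the noise, a single-run comparison is wrong a substantial fraction of the time, which is exactly why the next slide turns to multiple runs and standard statistics.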
41. More Generally: Use Standard Statistics
- As one would for measurements of a live system
- Confidence Intervals
- 95% confidence intervals contain the true value 95% of the time
- Non-overlapping confidence intervals give statistically significant conclusions
- Use ANOVA or hypothesis testing: even better!
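The interval test described above can be sketched as follows: compute a 95% confidence interval for each configuration's mean and treat non-overlapping intervals as a significant difference. The normal-approximation critical value 1.96 is an assumption of this sketch; for the handful of runs typical in simulation, Student's t (or ANOVA, as the slide suggests) is the better tool.

```python
import statistics

# Sketch of the multiple-runs methodology: per-configuration 95%
# confidence intervals on the mean, compared for overlap.

def confidence_interval(samples, z=1.96):
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5  # std. error of mean
    return (mean - z * sem, mean + z * sem)

def significantly_different(samples_a, samples_b):
    lo_a, hi_a = confidence_interval(samples_a)
    lo_b, hi_b = confidence_interval(samples_b)
    return hi_a < lo_b or hi_b < lo_a      # intervals do not overlap
```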
42. Confidence Interval Example
ROB
- Estimate runs needed to get non-overlapping confidence intervals
43. Also: Time Variability (on real hardware)
OLTP
- Therefore, select checkpoint(s) carefully
44. Workload Variability Summary
- Variability is a real phenomenon for multithreaded workloads
- Runs from the same initial conditions differ
- Variability is a challenge for simulations
- Simulations are short
- Wrong conclusions may be drawn
- Our solution accounts for variability
- Multiple runs, confidence intervals
- Reduces wrong-conclusion probability
45. Talk Summary
- Simulations of $2M Commercial Servers must
- Complete in reasonable time (on $2K PCs)
- Handle OS, devices, multithreaded hardware
- Cope with variability of multithreaded software
- Multifacet
- Scale & tune transactional workloads
- Separate timing & functional simulation
- Cope w/ workload variability via randomness & statistics
- References (www.cs.wisc.edu/multifacet/papers)
- Simulating a $2M Commercial Server on a $2K PC [Computer '03]
- Full-System Timing-First Simulation [Sigmetrics '02]
- Variability in Architectural Simulations [HPCA '03]
46. Other Multifacet Methods Work
- Specifying & Verifying Coherence Protocols
- [SPAA '98, HPCA '99, SPAA '99, TPDS '02]
- Workload Analysis & Improvement
- Database systems [VLDB '99, VLDB '01]
- Pointer-based [PLDI '99, Computer '00]
- Middleware [HPCA '03]
- Modeling & Simulation
- Commercial workloads [Computer '02, HPCA '03]
- Decoupling timing/functional simulation [Sigmetrics '02]
- Simulation generation [PLDI '01]
- Analytic modeling [Sigmetrics '00, TPDS TBA]
- Micro-architectural slack [ISCA '02]
47. Backup Slides
48. One Ongoing/Future Methods Direction
- Middleware Applications
- Memory system behavior of Java middleware [HPCA '03]
- Machine measurements
- Full-system simulation
- Future Work: Multi-Machine Simulation
- Isolate middle tier from client emulators and database
- Understand fundamental workload behaviors
- Drive future system design
49. ECperf vs. SPECjbb
- Different cache-to-cache transfer ratios!
50. Online Transaction Processing (OLTP)
- DB2 with a TPC-C-like workload. The TPC-C benchmark is widely used to evaluate system performance for the on-line transaction processing market. The benchmark itself is a specification that describes the schema, scaling rules, transaction types, and transaction mix, but not the exact implementation of the database. TPC-C transactions are of five transaction types, all related to an order-processing environment. Performance is measured by the number of New Order transactions performed per minute (tpmC).
- Our OLTP workload is based on the TPC-C v3.0 benchmark. We use IBM's DB2 V7.2 EEE database management system and an IBM benchmark kit to build the database and emulate users. We build an 800 MB, 4000-warehouse database on five raw disks and an additional dedicated database log disk. We scaled down the size of each warehouse while maintaining the reduced ratios of 3 sales districts per warehouse, 30 customers per district, and 100 items per warehouse (compared to the 10, 3,000, and 100,000 required by the TPC-C specification). Each user randomly executes transactions according to the TPC-C transaction mix specifications, and we set the think and keying times for users to zero. A different database thread is started for each user. We measure all completed transactions, even those that do not satisfy the timing constraints of the TPC-C benchmark specification.
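The user-emulation loop described above can be sketched as follows: each emulated user repeatedly draws a transaction type from the TPC-C mix with zero think and keying time. The mix percentages below are the TPC-C specification's standard minimums for the non-New-Order types (with New Order taking the remainder); the function itself is an illustrative sketch, not the IBM benchmark kit.

```python
import random

# Sketch of one emulated TPC-C user: pick transactions per the mix,
# back-to-back (zero think/keying time), and tally what ran.

TPCC_MIX = [
    ("NewOrder",    45),   # the measured transaction (tpmC)
    ("Payment",     43),
    ("OrderStatus",  4),
    ("Delivery",     4),
    ("StockLevel",   4),
]

def emulate_user(num_transactions, seed):
    rng = random.Random(seed)
    names = [name for name, _ in TPCC_MIX]
    weights = [w for _, w in TPCC_MIX]
    counts = {name: 0 for name in names}
    for _ in range(num_transactions):      # zero think time: no delays
        txn = rng.choices(names, weights=weights)[0]
        counts[txn] += 1                   # a real kit would execute it here
    return counts
```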
51. Java Server Workload (SPECjbb)
- Java-based middleware applications are increasingly used in modern e-business settings. SPECjbb is a Java benchmark emulating a 3-tier system with emphasis on the middle-tier server business logic. SPECjbb runs in a single Java Virtual Machine (JVM) in which threads represent terminals in a warehouse. Each thread independently generates random input (tier 1 emulation) before calling transaction-specific business logic. The business logic operates on data held in binary trees of Java objects (tier 3 emulation). The specification states that the benchmark does no disk or network I/O.
- We used Sun's HotSpot 1.4.0 Server JVM and Solaris's native thread implementation. The benchmark includes driver threads to generate transactions. We set the system heap size to 1.8 GB and the new-object heap size to 256 MB to reduce the frequency of garbage collection. Our experiments used 24 warehouses, with a data size of approximately 500 MB.
52. Static Web Content Serving: Apache
- Web servers such as Apache represent an important enterprise server application. Apache is a popular open-source web server used in many internet/intranet settings. In this benchmark, we focus on static web content serving.
- We use Apache 2.0.39 for SPARC/Solaris 8, configured to use pthread locks and minimal logging at the web server. We use the Scalable URL Request Generator (SURGE) as the client. SURGE generates a sequence of static URL requests that exhibit representative distributions for document popularity, document sizes, request sizes, temporal and spatial locality, and embedded document count. We use a repository of 20,000 files (totaling 500 MB) and clients with zero think time. We compiled both Apache and SURGE using Sun's WorkShop C 6.1 with aggressive optimization.
53. Dynamic Web Content Serving: Slashcode
- Dynamic web content serving has become increasingly important for web sites that serve large amounts of information. Dynamic content is used by online stores, instant news, and community message board systems. Slashcode is an open-source dynamic web message posting system used by the popular slashdot.org message board system.
- We used Slashcode 2.0, Apache 1.3.20, and Apache's mod_perl module 1.25 (with Perl 5.6) on the server side. We used MySQL 3.23.39 as the database engine. The server content is a snapshot from the slashcode.com site, containing approximately 3,000 messages with a total size of 5 MB. Most of the run time is spent on dynamic web page generation. We use a multithreaded user emulation program to emulate user browsing and posting behavior. Each user independently and randomly generates browsing and posting requests to the server according to a transaction mix specification. We compiled both server and client programs using Sun's WorkShop C 6.1 with aggressive optimization.