Title: Design and Evaluation of Architectures for Commercial Applications
Design and Evaluation of Architectures for Commercial Applications
Part III: Architecture Studies
Overview (3)
- Day III: architecture studies
- Memory system characterization
- Impact of out-of-order processors
- Simultaneous multithreading
- Final remarks
Memory system performance studies
- Collaboration with Kourosh Gharachorloo and Edouard Bugnion
- Presented at ISCA '98
Motivations
- Market shift for high-performance systems
  - yesterday: technical/numerical applications
  - today: databases, Web servers, e-mail services, etc.
- Bottleneck shift in commercial applications
  - yesterday: I/O
  - today: the memory system
- Lack of data on the behavior of commercial workloads
- Re-evaluate memory system design trade-offs
Bottleneck Shift
- Just a few years back [Thakkar & Sweiger '90], I/O was the only important bottleneck
- Since then, several improvements
  - better DB engines can tolerate I/O latencies
  - better OSs do more efficient I/O operations and are more scalable
  - better parallelism in the disk subsystem (RAIDs) provides more bandwidth
- and memory keeps getting slower
  - faster processors
  - bigger machines
- Result: the memory system is a primary factor today
Workloads
- OLTP (on-line transaction processing)
  - modeled after TPC-B, using the Oracle7 DB engine
  - short transactions; intense process communication and context switching
  - multiple transactions in transit
- DSS (decision support systems)
  - modeled after TPC-D, using Oracle7
  - long-running transactions, low process communication
  - parallelized queries
- AltaVista
  - Web index search application using a custom threads package
  - medium-sized transactions, low process communication
  - multiple transactions in transit
Methodology: Platform
- AlphaServer 4100 5/300
  - 4x 300 MHz processors (8KB/8KB I/D caches, 96KB L2 cache)
  - 2MB board-level cache
  - 2GB main memory
  - latencies: 172180/125 cycles
  - 3-channel HSZ disk array controller
- Digital Unix 4.0B
Methodology: Tools
- Monitoring tools
  - IPROBE
  - DCPI
  - ATOM
- Simulation tools
  - tracing: preliminary user-level studies
  - SimOS-Alpha: full-system simulation, including the OS
Scaling
- Workload sizes make them difficult to study
- Scaling the problem size is critical
- Validation criterion: similar memory system behavior to larger runs
- Requires good understanding of the workload
  - make sure the system is well tuned
  - keep the SGA many times larger than the hardware caches (1GB)
  - use the same number of servers per processor as audit-sized runs (4-8/CPU)
CPU Cycle Breakdown
- Instruction- and data-related stalls are equally important
Cache behavior
Stall Cycle Breakdown
- OLTP is dominated by non-primary cache and memory stalls
- DSS and AltaVista stalls are mostly Scache hits
Impact of On-Chip Cache Size
(P4; 2MB, 2-way off-chip cache)
- 64KB on-chip caches are enough for DSS
OLTP: Effect of Off-Chip Cache Organization
(P4)
- Significant benefits from large off-chip caches (up to 8MB)
OLTP: Impact of System Size
(P4; 2MB, 2-way off-chip cache)
- Communication misses become dominant for larger systems
OLTP: Contribution of Dirty Misses
(P4, 8MB Bcache)
- Shared metadata is the important region
  - 80% of off-chip misses
  - 95% of dirty misses
- Fraction of dirty misses increases with cache and system size
OLTP: Impact of Off-Chip Cache Line Size
(P4; 2MB, 2-way off-chip cache)
- Good spatial locality on communication for OLTP
- Very little false sharing in Oracle itself (illustrated below)
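For readers unfamiliar with the term, here is a minimal C sketch of false sharing (hypothetical code, not Oracle's): two threads update counters that share one 64-byte cache line, and padding each counter to its own line removes the coherence ping-ponging that the slide says Oracle largely avoids.

    #include <pthread.h>
    #include <stdio.h>

    #define LINE 64                      /* assumed cache line size */
    #define ITERS 10000000L

    /* Padding gives each counter its own cache line; removing pad[]
     * would put both counters in one line and cause false sharing. */
    struct padded { volatile long count; char pad[LINE - sizeof(long)]; };
    static struct padded counters[2];

    static void *worker(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counters[id].count++;        /* private line: no invalidations */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (long id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("%ld %ld\n", counters[0].count, counters[1].count);
        return 0;
    }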
Summary of Results
- On-chip cache
  - 64KB I/D sufficient for DSS and AltaVista
- Off-chip cache
  - OLTP benefits from larger caches (up to 8MB)
- Dirty misses
  - can become dominant for OLTP
Conclusion
- The memory system is the current challenge in DB performance
- Careful scaling enables detailed studies
- The combination of monitoring and simulation is very powerful
- Diverging memory system designs
  - OLTP benefits from large off-chip caches and fast communication
  - DSS and AltaVista may perform better without an off-chip cache
Impact of out-of-order processors
- Collaboration with
  - Kourosh Gharachorloo (Compaq)
  - Parthasarathy Ranganathan and Sarita Adve (Rice)
- Presented at ASPLOS '98
Motivation
- Databases are the fastest-growing market for shared-memory servers
  - online transaction processing (OLTP)
  - decision-support systems (DSS)
- But current systems are optimized for engineering/scientific workloads
  - aggressive use of instruction-level parallelism (ILP)
  - multiple issue, out-of-order issue, non-blocking loads, speculative execution
- Need to re-evaluate system design for database workloads
Contributions
- Detailed simulation study of Oracle with ILP processors
- Is ILP design complexity warranted for database workloads?
  - improves performance (1.5X OLTP, 2.6X DSS)
  - reduces the performance gap between consistency models
- How can we improve performance for OLTP workloads?
  - OLTP is limited by instruction and migratory data misses
  - a small stream buffer comes close to a perfect instruction cache
  - prefetching/flush appear promising
Simulation Environment - Workloads
- Oracle 7.3.2 commercial DBMS engine
- Database workloads
  - online transaction processing (OLTP): TPC-B-like
    - day-to-day business operations
  - decision-support system (DSS): TPC-D, Query 6
    - offline business analysis
Simulation Environment - Methodology
- Used RSIM, the Rice Simulator for ILP Multiprocessors
  - detailed simulation of processor, memory, and network
- But simulating a commercial-grade database engine is hard
- Some simplifications
  - similar to Lo et al. and Barroso et al., ISCA '98
Simulation Methodology - Simplifications
- Trace-driven simulation
- OS/system-call simulation
  - OS not a large component
  - model only key effects
    - page mapping, TLB misses, process scheduling
    - system-call and I/O time dilation effects
- Multiple processes per processor to hide I/O latency
- Database scaling
Simulated Environment - Hardware
- 4-processor shared-memory system, 8 processes per processor
- Directory-based MESI protocol with invalidations
- Next-generation processing nodes
  - aggressive ILP processor
  - 128KB 2-way separate instruction and data L1 caches
  - 8MB 4-way unified L2 cache
  - representative miss latencies
Outline
- Motivation
- Simulation Environment
- Impact of ILP on Database Workloads
  - multiple issue and OOO issue for OLTP
  - multiple outstanding misses for OLTP
  - ILP techniques for DSS
  - ILP-enabled consistency optimizations
- Improving Performance of OLTP
- Conclusions
Multiple Issue and OOO Issue for OLTP
[Chart: normalized execution times, in-order vs. out-of-order processors: 100.0, 92.1, 90.1, 88.8, 86.8, 74.3, 68.4, 67.8]
- Multiple issue and OOO improve performance by 1.5X
- But 4-way issue with a 64-element window is enough
- Instruction misses and dirty misses are the key bottlenecks
Multiple Outstanding Misses for OLTP
[Chart: normalized execution times: 100.0, 83.2, 79.4, 79.4]
- Support for two distinct outstanding misses is enough
  - data-dependent computation
Impact of ILP Techniques for DSS
[Chart: normalized execution times, in-order vs. out-of-order processors: 100.0, 89.2, 74.1, 68.4, 68.1, 52.1, 39.7, 39.0]
- Multiple issue and OOO improve performance by 2.6X
- 4-way issue, a 64-element window, and 4 outstanding misses are enough
- Memory is not a bottleneck
ILP-Enabled Consistency Optimizations
- Memory consistency model of a shared-memory system
  - specifies ordering and overlap of memory operations
  - performance/programmability tradeoff
    - sequential consistency (SC)
    - processor consistency (PC)
    - release consistency (RC)
- ILP-enabled consistency optimizations
  - hardware prefetching, speculative loads
- Impact on database workloads? (see the sketch below)
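As background for the tradeoff above, a minimal C11-atomics sketch (my illustration, not the study's code) of what release consistency permits: only the flag accesses carry ordering, so the hardware may overlap and reorder the plain data stores, whereas SC would order every access.

    #include <stdatomic.h>

    int data[4];
    atomic_int flag;

    void producer(void) {
        for (int i = 0; i < 4; i++)
            data[i] = i;      /* plain stores: may be overlapped/reordered */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* release */
    }

    int consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                 /* acquire: orders the reads below */
        return data[0] + data[1] + data[2] + data[3];
    }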
ILP-Enabled Consistency Optimizations
(SC = sequential consistency, PC = processor consistency, RC = release consistency)
- With ILP-enabled optimizations
  - OLTP: RC only 1.1X better than SC (was 1.4X)
  - DSS: RC only 1.18X better than SC (was 1.85X)
- Consistency model choice in hardware is less important
Outline
- Motivation
- Simulation Environment
- Impact of ILP on Database Workloads
- Improving Performance of OLTP
  - improving OLTP - instruction misses
  - improving OLTP - dirty misses
- Conclusions
Improving OLTP - Instruction Misses
[Chart: normalized execution times: 100, 83, 71]
- 4-element instruction-cache stream buffer
  - hardware prefetching of instructions
  - 1.21X performance improvement
- Simple and effective for database servers (see the sketch below)
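A toy C model of such a stream buffer (my sketch; the 64-byte line size and the full-scan lookup, rather than a FIFO head check, are simplifying assumptions): on an I-cache miss the buffer is probed, and any access restarts prefetching of the next sequential instruction lines.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SHIFT 6     /* assumed 64-byte instruction lines */
    #define DEPTH 4          /* 4-element buffer, as on the slide */

    static uint64_t buf[DEPTH];          /* prefetched line addresses */

    static void refill(uint64_t line) {
        for (int i = 0; i < DEPTH; i++)
            buf[i] = line + 1 + i;       /* prefetch next sequential lines */
    }

    /* Called on an I-cache miss; returns true if the buffer has the line. */
    bool stream_buffer_probe(uint64_t pc) {
        uint64_t line = pc >> LINE_SHIFT;
        for (int i = 0; i < DEPTH; i++) {
            if (buf[i] == line) {        /* sequential stream continues */
                refill(line);
                return true;
            }
        }
        refill(line);                    /* miss: restart the stream here */
        return false;
    }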
Improving OLTP - Dirty Misses
- Dirty misses
  - mostly to migratory data
  - due to few instructions in critical sections
- Solutions for migratory reads
  - software prefetching and producer-initiated flushes (sketched below)
  - preliminary results without access to source code
  - 1.14X performance improvement
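A hedged C sketch of the two ideas (my illustration; flush_line is a hypothetical stand-in for a producer-initiated write-back, which plain C does not expose, while __builtin_prefetch is a real GCC/Clang builtin): prefetch the migratory data for writing while acquiring the lock, then push the dirty line back toward the next reader after releasing it.

    #include <pthread.h>

    struct record { pthread_mutex_t lock; long balance; };

    /* Hypothetical placeholder for a producer-initiated flush, e.g. a
     * write-back instruction that sends the dirty line home early. */
    static void flush_line(void *p) { (void)p; }

    void update(struct record *r, long delta) {
        /* Prefetch for writing, overlapping the migratory (dirty) miss
         * with the lock acquisition. */
        __builtin_prefetch(&r->balance, /*rw=*/1);
        pthread_mutex_lock(&r->lock);
        r->balance += delta;             /* short critical section */
        pthread_mutex_unlock(&r->lock);
        flush_line(&r->balance);         /* hand the line to the next reader */
    }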
Summary
- Detailed simulation study of Oracle with out-of-order processors
- Impact of ILP techniques on database workloads
  - improves performance (1.5X OLTP, 2.6X DSS)
  - reduces the performance gap between consistency models
- Improving performance of OLTP
  - OLTP is limited by instruction and migratory data misses
  - a small stream buffer comes close to a perfect instruction cache
  - prefetching/flush appear promising
Simultaneous Multithreading (SMT)
- Collaboration with
  - Kourosh Gharachorloo (Compaq)
  - Jack Lo, Susan Eggers, Hank Levy, Sujay Parekh (U. Washington)
- Exploit the multithreaded nature of commercial applications
  - aggressive wide-issue OOO superscalars saturate at 4 issue slots
  - potential to increase utilization of issue slots
  - potential to exploit parallelism in the memory system
SMT: what is it?
- SMT enables multiple threads to issue instructions to multiple functional units in a single cycle
- SMT exploits instruction-level and thread-level parallelism
  - hides long latencies
  - increases resource utilization and instruction throughput
[Diagram: issue-slot utilization of a superscalar, fine-grain multithreading, and SMT across four threads]
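A toy C loop (the slot and thread counts are my assumptions) contrasting two of the diagram's schemes: a superscalar issues only from one thread, while SMT fills the same issue slots from all ready threads within a single cycle.

    #include <stdio.h>

    #define SLOTS 8      /* assumed issue width */
    #define THREADS 4

    /* ready[t] = independent instructions thread t could issue this cycle */
    static int issue_smt(const int ready[THREADS]) {
        int used = 0;
        for (int t = 0; t < THREADS && used < SLOTS; t++) {
            int take = ready[t] < SLOTS - used ? ready[t] : SLOTS - used;
            used += take;            /* threads share one cycle's slots */
        }
        return used;
    }

    int main(void) {
        int ready[THREADS] = {3, 2, 4, 1};   /* example per-thread ILP */
        printf("superscalar (thread 0 only): %d/%d slots\n", ready[0], SLOTS);
        printf("SMT (all threads):           %d/%d slots\n",
               issue_smt(ready), SLOTS);
        return 0;
    }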
SMT and database workloads
- Pro
  - SMT is a good match; it can take advantage of SMT's multithreading HW
    - low throughput
    - high cache miss rates
- Con
  - fine-grain interleaving can cause cache interference
- What software techniques can help avoid interference?
SMT studies: methodology
- Trace-driven simulation
  - same traces used in the previous ILP study
  - new front-end to the SMT simulator
- Used OLTP and DSS workloads
SMT Configuration
- 21264-like superscalar base, augmented with up to 8 hardware contexts
  - 8-wide superscalar
  - 128KB, 2-way I and D L1 caches, 2-cycle access
  - 16MB, direct-mapped L2 cache, 12-cycle access
  - 80-cycle memory latency
  - 10 functional units (6 integer (4 ld/st), 4 FP)
  - 100 additional integer and FP renaming registers
  - integer and FP instruction queues, 32 entries each
OLTP Characterization
- Memory behavior (1 context, 16 server processes)
- High miss rates; large footprints
Cache interference (16 server processes)
- With 8-context SMT, many conflict misses
- DSS data set fits in L2
Where are the misses?
[Charts: miss breakdown, 16 server processes, 8-context SMT - misses to the PGA; instruction misses]
- L1 and L2 misses are dominated by PGA references
- Misses result from unnecessary address conflicts
L2 conflicts: page mapping
- Page coloring can be augmented with a random first seed (sketched below)
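A minimal sketch of the idea in C (cache and page sizes follow the study's configuration; the exact hash is my assumption of the scheme's shape): standard page coloring maps a virtual page to a matching physical color, so the server processes' identical virtual address streams collide in the direct-mapped L2, while a random per-process first seed rotates each process onto different colors.

    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_BYTES (16u << 20)   /* 16MB direct-mapped L2 (study config) */
    #define PAGE_BYTES  (8u << 10)    /* assumed 8KB pages */
    #define COLORS      (CACHE_BYTES / PAGE_BYTES)

    /* Classic coloring would use vpn % COLORS, identical across processes;
     * adding the per-process seed spreads identical virtual streams over
     * different physical colors while preserving coloring within a process. */
    unsigned color_for(uint64_t vpn, unsigned proc_seed) {
        return (unsigned)((vpn + proc_seed) % COLORS);
    }

    unsigned new_process_seed(void) {
        return (unsigned)rand() % COLORS;   /* random first seed */
    }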
Results for different page mapping schemes
[Chart: global L2 cache miss rate under different page mapping schemes; 16MB direct-mapped L2 cache, 16 server processes, DSS]
Why the steady L2 miss rates?
- Not all of the footprint has temporal locality
- The critical working sets are being cached
  - 87% of instruction refs are to 31% of the I-footprint
  - 41% of metadata refs are to 26KB
- SMT and superscalar cache misses are comparable
  - SMT changes the interleaving, not the total footprint
- With proper global policies the working sets still fit in caches, so SMT is effective
L1 conflicts: application-level offsetting
- The base of each thread's PGA is at the same virtual address
- Causes unnecessary conflicts in a virtually-indexed cache
- Address offsets can avoid the interference (sketched below)
  - offset by thread id × 8KB
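A minimal user-level sketch in C (the allocation size and interface are hypothetical; the 8KB-per-thread offset is the slide's): staggering each thread's PGA base by its thread id makes the hot regions index into different sets of the virtually-indexed L1.

    #include <stdlib.h>

    #define PGA_BYTES (1u << 20)   /* assumed per-thread PGA size */
    #define OFFSET    (8u << 10)   /* 8KB per thread id, from the slide */

    /* Without the offset, every thread's PGA starts at the same virtual
     * address and maps to the same L1 sets; the stagger breaks the tie. */
    char *alloc_pga(int thread_id) {
        char *base = malloc(PGA_BYTES + (size_t)thread_id * OFFSET);
        return base ? base + (size_t)thread_id * OFFSET : NULL;
    }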
Offsetting results
[Chart: L1 miss rates, 128KB 2-way set-associative L1 cache - bin hopping without offset vs. bin hopping with offset]
SMT: constructive interference
- Cache interference can also be beneficial
  - the instruction segment is shared
- SMT exploits instruction sharing
  - improves I-cache locality
  - reduces the I-cache miss rate (OLTP): 14% with superscalar → 9% with 8-context SMT
SMT overall performance
[Chart: instructions per cycle for OLTP, 16 server processes]
Why SMT is effective
- Exploits memory-system concurrency
- Improves instruction fetch
- Improves instruction issue
Exploiting memory system concurrency
- OLTP has lots of pointer chasing (illustrated below)
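An illustration in C (not the workload's code): each load in a chain depends on the previous miss, so a single thread exposes no memory-level parallelism; this is exactly the concurrency SMT recovers by overlapping the chains of several threads.

    struct node { struct node *next; long key; };

    /* One thread: every iteration stalls on the miss for p->next, so the
     * misses are serialized. Several SMT contexts running this loop on
     * different lists keep multiple misses outstanding at once. */
    long chase(const struct node *p) {
        long sum = 0;
        while (p) {
            sum += p->key;
            p = p->next;     /* address depends on the previous load */
        }
        return sum;
    }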
Improving instruction fetch
- SMT can fetch from multiple threads
  - tolerates I-cache misses and branch mispredictions
  - fetches fewer speculative instructions
Improving instruction issue
- SMT exposes more parallelism
  - uses both instruction-level and thread-level parallelism
SMT Performance
Summary
- The critical working sets for DB workloads
  - can still fit in caches, even with SMT
  - fine-granularity interleaving can be accommodated
- Cache interference can be avoided with simple policies
  - page mapping and application-level offsetting
- SMT miss rates are comparable to superscalar
- SMT is effective
  - 4.6x speedup on OLTP, 1.5x on DSS
Final remarks
- We understand the architectural requirements of commercial applications better than we did a couple of years ago
- But both technology and applications are moving targets
- Lots to be done!
Final remarks (2)
- Important emerging workloads
  - ERP benchmarks from software vendors
    - more representative of end-user OLTP performance
  - better decision support system algorithms
  - many Web-based applications
    - very young field
    - good benchmarks are still to come; TPC-W may be a good start
  - enterprise-scale mail servers
  - packet-switching servers for high-bandwidth subscriber connections (e.g., ADSL)
Final remarks (3)
- New technological/architectural challenges
  - large-scale NUMA architectures
    - worsen the dirty-miss problem
    - reliability and fault containment
  - increased integration
    - what's the next subsystem to move on-chip?
  - explicitly parallel ISAs?
  - impact of next-generation Direct Rambus DRAMs
    - very low latency
  - logic/DRAM integration
    - what if memory were non-volatile?
Final remarks (4)
- More short-term issues
  - How to reduce I-stream-related stalls for OLTP?
  - How to reduce communication penalties in OLTP?
    - prefetch/post-store?
    - smarter coherence protocols?
  - How to deal with 100s of threads per processor?
  - Innovative ways to reduce the latency of pointer-based access patterns?
  - Can clusters become competitive in REAL OLTP environments?