Title: Design and Evaluation of Architectures for Commercial Applications
Design and Evaluation of Architectures for Commercial Applications
Part III: Architecture Studies
Overview (3)
- Day III: architecture studies
- Memory system characterization
- Impact of out-of-order processors
- Simultaneous multithreading
- Final remarks
Memory system performance studies
- Collaboration with Kourosh Gharachorloo and Edouard Bugnion
- Presented at ISCA '98
Motivations
- Market shift for high-performance systems
  - yesterday: technical/numerical applications
  - today: databases, Web servers, e-mail services, etc.
- Bottleneck shift in commercial applications
  - yesterday: I/O
  - today: the memory system
- Lack of data on the behavior of commercial workloads
- Re-evaluate memory system design trade-offs
Bottleneck Shift
- Just a few years back [Thakkar & Sweiger '90], I/O was the only important bottleneck
- Since then, several improvements
  - better DB engines can tolerate I/O latencies
  - better OSs do more efficient I/O operations and are more scalable
  - better parallelism in the disk subsystem (RAIDs) provides more bandwidth
- and memory keeps getting slower
  - faster processors
  - bigger machines
- Result: the memory system is a primary factor today
Workloads
- OLTP (on-line transaction processing)
  - modeled after TPC-B, using the Oracle7 DB engine
  - short transactions; intense process communication and context switching
  - multiple transactions in transit
- DSS (decision support systems)
  - modeled after TPC-D, using Oracle7
  - long-running transactions, low process communication
  - parallelized queries
- AltaVista
  - Web index search application using a custom threads package
  - medium-sized transactions, low process communication
  - multiple transactions in transit
Methodology: Platform
- AlphaServer 4100 5/300
  - 4x 300 MHz processors (8KB/8KB I/D caches, 96KB L2 cache)
  - 2MB board-level cache
  - 2GB main memory
  - latencies: 172180/125 cycles
  - 3-channel HSZ disk array controller
- Digital Unix 4.0B
Methodology: Tools
- Monitoring tools
  - IPROBE
  - DCPI
  - ATOM
- Simulation tools
  - tracing: preliminary user-level studies
  - SimOS-Alpha: full-system simulation, including the OS
Scaling
- Workload sizes make them difficult to study
- Scaling the problem size is critical
- Validation criterion: similar memory system behavior to larger runs
- Requires good understanding of the workload
  - make sure the system is well tuned
  - keep the SGA many times larger than the hardware caches (1GB)
  - use the same number of servers per processor as audit-sized runs (4-8/CPU)
CPU Cycle Breakdown
- Instruction- and data-related stalls are equally important
Cache behavior
Stall Cycle Breakdown
- OLTP is dominated by non-primary cache and memory stalls
- DSS and AltaVista stalls are mostly Scache hits
Impact of On-Chip Cache Size
(P4; 2MB, 2-way off-chip cache)
- 64KB on-chip caches are enough for DSS
OLTP: Effect of Off-Chip Cache Organization
(P4)
- Significant benefits from large off-chip caches (up to 8MB)
OLTP: Impact of System Size
(P4; 2MB, 2-way off-chip cache)
- Communication misses become dominant for larger systems
OLTP: Contribution of Dirty Misses
(P4, 8MB Bcache)
- Shared metadata is the important region
  - 80% of off-chip misses
  - 95% of dirty misses
- Fraction of dirty misses increases with cache and system size
OLTP: Impact of Off-Chip Cache Line Size
(P4; 2MB, 2-way off-chip cache)
- Good spatial locality on communication for OLTP
- Very little false sharing in Oracle itself (illustrated below)
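For readers unfamiliar with the term, here is a minimal C sketch of false sharing (hypothetical code, not Oracle's): two threads update counters that share one 64-byte cache line, and padding each counter to its own line removes the coherence ping-ponging that the slide says Oracle largely avoids.

    #include <pthread.h>
    #include <stdio.h>

    #define LINE 64                      /* assumed cache line size */
    #define ITERS 10000000L

    /* Padding gives each counter its own cache line; removing pad[]
     * would put both counters in one line and cause false sharing. */
    struct padded { volatile long count; char pad[LINE - sizeof(long)]; };
    static struct padded counters[2];

    static void *worker(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counters[id].count++;        /* private line: no invalidations */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (long id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("%ld %ld\n", counters[0].count, counters[1].count);
        return 0;
    }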
Summary of Results
- On-chip cache
  - 64KB I/D sufficient for DSS and AltaVista
- Off-chip cache
  - OLTP benefits from larger caches (up to 8MB)
- Dirty misses
  - can become dominant for OLTP
Conclusion
- The memory system is the current challenge in DB performance
- Careful scaling enables detailed studies
- The combination of monitoring and simulation is very powerful
- Diverging memory system designs
  - OLTP benefits from large off-chip caches and fast communication
  - DSS and AltaVista may perform better without an off-chip cache
Impact of out-of-order processors
- Collaboration with
  - Kourosh Gharachorloo (Compaq)
  - Parthasarathy Ranganathan and Sarita Adve (Rice)
- Presented at ASPLOS '98
Motivation
- Databases are the fastest-growing market for shared-memory servers
  - online transaction processing (OLTP)
  - decision-support systems (DSS)
- But current systems are optimized for engineering/scientific workloads
  - aggressive use of instruction-level parallelism (ILP)
  - multiple issue, out-of-order issue, non-blocking loads, speculative execution
- Need to re-evaluate system design for database workloads
Contributions
- Detailed simulation study of Oracle with ILP processors
- Is ILP design complexity warranted for database workloads?
  - improves performance (1.5X OLTP, 2.6X DSS)
  - reduces the performance gap between consistency models
- How can we improve performance for OLTP workloads?
  - OLTP is limited by instruction and migratory data misses
  - a small stream buffer comes close to a perfect instruction cache
  - prefetching/flush appear promising
Simulation Environment - Workloads
- Oracle 7.3.2 commercial DBMS engine
- Database workloads
  - online transaction processing (OLTP): TPC-B-like
    - day-to-day business operations
  - decision-support system (DSS): TPC-D, Query 6
    - offline business analysis
Simulation Environment - Methodology
- Used RSIM, the Rice Simulator for ILP Multiprocessors
  - detailed simulation of processor, memory, and network
- But simulating a commercial-grade database engine is hard
- Some simplifications
  - similar to Lo et al. and Barroso et al., ISCA '98
Simulation Methodology - Simplifications
- Trace-driven simulation
- OS/system-call simulation
  - OS not a large component
  - model only key effects
    - page mapping, TLB misses, process scheduling
    - system-call and I/O time dilation effects
- Multiple processes per processor to hide I/O latency
- Database scaling
Simulated Environment - Hardware
- 4-processor shared-memory system, 8 processes per processor
- Directory-based MESI protocol with invalidations
- Next-generation processing nodes
  - aggressive ILP processor
  - 128KB 2-way separate instruction and data L1 caches
  - 8MB 4-way unified L2 cache
  - representative miss latencies
Outline
- Motivation
- Simulation Environment
- Impact of ILP on Database Workloads
  - multiple issue and OOO issue for OLTP
  - multiple outstanding misses for OLTP
  - ILP techniques for DSS
  - ILP-enabled consistency optimizations
- Improving Performance of OLTP
- Conclusions
Multiple Issue and OOO Issue for OLTP
[Chart: normalized execution times, in-order vs. out-of-order processors: 100.0, 92.1, 90.1, 88.8, 86.8, 74.3, 68.4, 67.8]
- Multiple issue and OOO improve performance by 1.5X
- But 4-way issue with a 64-element window is enough
- Instruction misses and dirty misses are the key bottlenecks
Multiple Outstanding Misses for OLTP
[Chart: normalized execution times: 100.0, 83.2, 79.4, 79.4]
- Support for two distinct outstanding misses is enough
  - data-dependent computation
Impact of ILP Techniques for DSS
[Chart: normalized execution times, in-order vs. out-of-order processors: 100.0, 89.2, 74.1, 68.4, 68.1, 52.1, 39.7, 39.0]
- Multiple issue and OOO improve performance by 2.6X
- 4-way issue, a 64-element window, and 4 outstanding misses are enough
- Memory is not a bottleneck
ILP-Enabled Consistency Optimizations
- Memory consistency model of a shared-memory system
  - specifies ordering and overlap of memory operations
  - performance/programmability tradeoff
    - sequential consistency (SC)
    - processor consistency (PC)
    - release consistency (RC)
- ILP-enabled consistency optimizations
  - hardware prefetching, speculative loads
- Impact on database workloads? (see the sketch below)
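As background for the tradeoff above, a minimal C11-atomics sketch (my illustration, not the study's code) of what release consistency permits: only the flag accesses carry ordering, so the hardware may overlap and reorder the plain data stores, whereas SC would order every access.

    #include <stdatomic.h>

    int data[4];
    atomic_int flag;

    void producer(void) {
        for (int i = 0; i < 4; i++)
            data[i] = i;      /* plain stores: may be overlapped/reordered */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* release */
    }

    int consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                 /* acquire: orders the reads below */
        return data[0] + data[1] + data[2] + data[3];
    }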
ILP-Enabled Consistency Optimizations
(SC = sequential consistency, PC = processor consistency, RC = release consistency)
- With ILP-enabled optimizations
  - OLTP: RC only 1.1X better than SC (was 1.4X)
  - DSS: RC only 1.18X better than SC (was 1.85X)
- Consistency model choice in hardware is less important
Outline
- Motivation
- Simulation Environment
- Impact of ILP on Database Workloads
- Improving Performance of OLTP
  - improving OLTP - instruction misses
  - improving OLTP - dirty misses
- Conclusions
Improving OLTP - Instruction Misses
[Chart: normalized execution times: 100, 83, 71]
- 4-element instruction-cache stream buffer
  - hardware prefetching of instructions
  - 1.21X performance improvement
- Simple and effective for database servers (see the sketch below)
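A toy C model of such a stream buffer (my sketch; the 64-byte line size and the full-scan lookup, rather than a FIFO head check, are simplifying assumptions): on an I-cache miss the buffer is probed, and any access restarts prefetching of the next sequential instruction lines.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SHIFT 6     /* assumed 64-byte instruction lines */
    #define DEPTH 4          /* 4-element buffer, as on the slide */

    static uint64_t buf[DEPTH];          /* prefetched line addresses */

    static void refill(uint64_t line) {
        for (int i = 0; i < DEPTH; i++)
            buf[i] = line + 1 + i;       /* prefetch next sequential lines */
    }

    /* Called on an I-cache miss; returns true if the buffer has the line. */
    bool stream_buffer_probe(uint64_t pc) {
        uint64_t line = pc >> LINE_SHIFT;
        for (int i = 0; i < DEPTH; i++) {
            if (buf[i] == line) {        /* sequential stream continues */
                refill(line);
                return true;
            }
        }
        refill(line);                    /* miss: restart the stream here */
        return false;
    }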
Improving OLTP - Dirty Misses
- Dirty misses
  - mostly to migratory data
  - due to few instructions in critical sections
- Solutions for migratory reads
  - software prefetching and producer-initiated flushes (sketched below)
  - preliminary results without access to source code
  - 1.14X performance improvement
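A hedged C sketch of the two ideas (my illustration; flush_line is a hypothetical stand-in for a producer-initiated write-back, which plain C does not expose, while __builtin_prefetch is a real GCC/Clang builtin): prefetch the migratory data for writing while acquiring the lock, then push the dirty line back toward the next reader after releasing it.

    #include <pthread.h>

    struct record { pthread_mutex_t lock; long balance; };

    /* Hypothetical placeholder for a producer-initiated flush, e.g. a
     * write-back instruction that sends the dirty line home early. */
    static void flush_line(void *p) { (void)p; }

    void update(struct record *r, long delta) {
        /* Prefetch for writing, overlapping the migratory (dirty) miss
         * with the lock acquisition. */
        __builtin_prefetch(&r->balance, /*rw=*/1);
        pthread_mutex_lock(&r->lock);
        r->balance += delta;             /* short critical section */
        pthread_mutex_unlock(&r->lock);
        flush_line(&r->balance);         /* hand the line to the next reader */
    }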
Summary
- Detailed simulation study of Oracle with out-of-order processors
- Impact of ILP techniques on database workloads
  - improves performance (1.5X OLTP, 2.6X DSS)
  - reduces the performance gap between consistency models
- Improving performance of OLTP
  - OLTP is limited by instruction and migratory data misses
  - a small stream buffer comes close to a perfect instruction cache
  - prefetching/flush appear promising
Simultaneous Multithreading (SMT)
- Collaboration with
  - Kourosh Gharachorloo (Compaq)
  - Jack Lo, Susan Eggers, Hank Levy, Sujay Parekh (U. Washington)
- Exploit the multithreaded nature of commercial applications
  - aggressive wide-issue OOO superscalars saturate at 4 issue slots
  - potential to increase utilization of issue slots
  - potential to exploit parallelism in the memory system
SMT: what is it?
- SMT enables multiple threads to issue instructions to multiple functional units in a single cycle
- SMT exploits instruction-level and thread-level parallelism
  - hides long latencies
  - increases resource utilization and instruction throughput
[Diagram: issue-slot utilization of a superscalar, fine-grain multithreading, and SMT across four threads]
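A toy C loop (the slot and thread counts are my assumptions) contrasting two of the diagram's schemes: a superscalar issues only from one thread, while SMT fills the same issue slots from all ready threads within a single cycle.

    #include <stdio.h>

    #define SLOTS 8      /* assumed issue width */
    #define THREADS 4

    /* ready[t] = independent instructions thread t could issue this cycle */
    static int issue_smt(const int ready[THREADS]) {
        int used = 0;
        for (int t = 0; t < THREADS && used < SLOTS; t++) {
            int take = ready[t] < SLOTS - used ? ready[t] : SLOTS - used;
            used += take;            /* threads share one cycle's slots */
        }
        return used;
    }

    int main(void) {
        int ready[THREADS] = {3, 2, 4, 1};   /* example per-thread ILP */
        printf("superscalar (thread 0 only): %d/%d slots\n", ready[0], SLOTS);
        printf("SMT (all threads):           %d/%d slots\n",
               issue_smt(ready), SLOTS);
        return 0;
    }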
SMT and database workloads
- Pro
  - SMT is a good match; it can take advantage of SMT's multithreading HW
    - low throughput
    - high cache miss rates
- Con
  - fine-grain interleaving can cause cache interference
- What software techniques can help avoid interference?
SMT studies: methodology
- Trace-driven simulation
  - same traces used in the previous ILP study
  - new front-end to the SMT simulator
- Used OLTP and DSS workloads
SMT Configuration
- 21264-like superscalar base, augmented with up to 8 hardware contexts
  - 8-wide superscalar
  - 128KB, 2-way I and D L1 caches, 2-cycle access
  - 16MB, direct-mapped L2 cache, 12-cycle access
  - 80-cycle memory latency
  - 10 functional units (6 integer (4 ld/st), 4 FP)
  - 100 additional integer and FP renaming registers
  - integer and FP instruction queues, 32 entries each
OLTP Characterization
- Memory behavior (1 context, 16 server processes)
- High miss rates; large footprints
Cache interference (16 server processes)
- With 8-context SMT, many conflict misses
- DSS data set fits in L2
Where are the misses?
[Charts: miss breakdown, 16 server processes, 8-context SMT - misses to the PGA; instruction misses]
- L1 and L2 misses are dominated by PGA references
- Misses result from unnecessary address conflicts
L2 conflicts: page mapping
- Page coloring can be augmented with a random first seed (sketched below)
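A minimal sketch of the idea in C (cache and page sizes follow the study's configuration; the exact hash is my assumption of the scheme's shape): standard page coloring maps a virtual page to a matching physical color, so the server processes' identical virtual address streams collide in the direct-mapped L2, while a random per-process first seed rotates each process onto different colors.

    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_BYTES (16u << 20)   /* 16MB direct-mapped L2 (study config) */
    #define PAGE_BYTES  (8u << 10)    /* assumed 8KB pages */
    #define COLORS      (CACHE_BYTES / PAGE_BYTES)

    /* Classic coloring would use vpn % COLORS, identical across processes;
     * adding the per-process seed spreads identical virtual streams over
     * different physical colors while preserving coloring within a process. */
    unsigned color_for(uint64_t vpn, unsigned proc_seed) {
        return (unsigned)((vpn + proc_seed) % COLORS);
    }

    unsigned new_process_seed(void) {
        return (unsigned)rand() % COLORS;   /* random first seed */
    }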
Results for different page mapping schemes
[Chart: global L2 cache miss rate under different page mapping schemes; 16MB direct-mapped L2 cache, 16 server processes, DSS]
Why the steady L2 miss rates?
- Not all of the footprint has temporal locality
- The critical working sets are being cached
  - 87% of instruction refs are to 31% of the I-footprint
  - 41% of metadata refs are to 26KB
- SMT and superscalar cache misses are comparable
  - SMT changes the interleaving, not the total footprint
- With proper global policies the working sets still fit in caches, so SMT is effective
L1 conflicts: application-level offsetting
- The base of each thread's PGA is at the same virtual address
- Causes unnecessary conflicts in a virtually-indexed cache
- Address offsets can avoid the interference (sketched below)
  - offset by thread id × 8KB
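A minimal user-level sketch in C (the allocation size and interface are hypothetical; the 8KB-per-thread offset is the slide's): staggering each thread's PGA base by its thread id makes the hot regions index into different sets of the virtually-indexed L1.

    #include <stdlib.h>

    #define PGA_BYTES (1u << 20)   /* assumed per-thread PGA size */
    #define OFFSET    (8u << 10)   /* 8KB per thread id, from the slide */

    /* Without the offset, every thread's PGA starts at the same virtual
     * address and maps to the same L1 sets; the stagger breaks the tie. */
    char *alloc_pga(int thread_id) {
        char *base = malloc(PGA_BYTES + (size_t)thread_id * OFFSET);
        return base ? base + (size_t)thread_id * OFFSET : NULL;
    }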
Offsetting results
[Chart: L1 miss rates, 128KB 2-way set-associative L1 cache - bin hopping without offset vs. bin hopping with offset]
SMT: constructive interference
- Cache interference can also be beneficial
  - the instruction segment is shared
- SMT exploits instruction sharing
  - improves I-cache locality
  - reduces the I-cache miss rate (OLTP): 14% with superscalar → 9% with 8-context SMT
SMT overall performance
[Chart: instructions per cycle for OLTP, 16 server processes]
Why SMT is effective
- Exploits memory-system concurrency
- Improves instruction fetch
- Improves instruction issue
Exploiting memory system concurrency
- OLTP has lots of pointer chasing (illustrated below)
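An illustration in C (not the workload's code): each load in a chain depends on the previous miss, so a single thread exposes no memory-level parallelism; this is exactly the concurrency SMT recovers by overlapping the chains of several threads.

    struct node { struct node *next; long key; };

    /* One thread: every iteration stalls on the miss for p->next, so the
     * misses are serialized. Several SMT contexts running this loop on
     * different lists keep multiple misses outstanding at once. */
    long chase(const struct node *p) {
        long sum = 0;
        while (p) {
            sum += p->key;
            p = p->next;     /* address depends on the previous load */
        }
        return sum;
    }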
Improving instruction fetch
- SMT can fetch from multiple threads
  - tolerates I-cache misses and branch mispredictions
  - fetches fewer speculative instructions
Improving instruction issue
- SMT exposes more parallelism
  - uses both instruction-level and thread-level parallelism
SMT Performance
Summary
- The critical working sets for DB workloads
  - can still fit in caches, even with SMT
  - fine-granularity interleaving can be accommodated
- Cache interference can be avoided with simple policies
  - page mapping and application-level offsetting
- SMT miss rates are comparable to superscalar
- SMT is effective
  - 4.6x speedup on OLTP, 1.5x on DSS
Final remarks
- We understand the architectural requirements of commercial applications better than we did a couple of years ago
- But both technology and applications are moving targets
- Lots to be done!
Final remarks (2)
- Important emerging workloads
  - ERP benchmarks from software vendors
    - more representative of end-user OLTP performance
  - better decision support system algorithms
  - many Web-based applications
    - very young field
    - good benchmarks are still to come; TPC-W may be a good start
  - enterprise-scale mail servers
  - packet-switching servers for high-bandwidth subscriber connections (e.g., ADSL)
Final remarks (3)
- New technological/architectural challenges
  - large-scale NUMA architectures
    - worsen the dirty-miss problem
    - reliability and fault containment
  - increased integration
    - what's the next subsystem to move on-chip?
  - explicitly parallel ISAs?
  - impact of next-generation Direct Rambus DRAMs
    - very low latency
  - logic/DRAM integration
    - what if memory were non-volatile?
Final remarks (4)
- More short-term issues
  - How to reduce I-stream-related stalls for OLTP?
  - How to reduce communication penalties in OLTP?
    - prefetch/post-store?
    - smarter coherence protocols?
  - How to deal with 100s of threads per processor?
  - Innovative ways to reduce the latency of pointer-based access patterns?
  - Can clusters become competitive in REAL OLTP environments?