
1
Design and Evaluation of Architectures for
Commercial Applications
Part III: architecture studies
  • Luiz André Barroso

2
Overview (3)
  • Day III: architecture studies
  • Memory system characterization
  • Impact of out-of-order processors
  • Simultaneous multithreading
  • Final remarks

3
Memory system performance studies
  • Collaboration with Kourosh Gharachorloo and
    Edouard Bugnion
  • Presented at ISCA '98

4
Motivations
  • Market shift for high-performance systems
  • yesterday: technical/numerical applications
  • today: databases, Web servers, e-mail services,
    etc.
  • Bottleneck shift in commercial applications
  • yesterday: I/O
  • today: the memory system
  • Lack of data on the behavior of commercial workloads
  • Re-evaluate memory system design trade-offs

5
Bottleneck Shift
  • Just a few years back ThakkarSweiger90 I/O was
    the only important bottleneck
  • Since then, several improvements
  • better DB engines can tolerate I/O latencies
  • better OSs do more efficient I/O operations and
    are more scalable
  • better parallelism in the disk subsystem (RAIDs)
    provide more bandwidth
  • and memory keeps getting slower
  • faster processors
  • bigger machines
  • Result memory system is a primary factor today

6
Workloads
  • OLTP (on-line transaction processing)
  • modeled after TPC-B, using Oracle7 DB engine
  • short transactions, intense process communication
    and context switching
  • multiple transactions in-transit
  • DSS (decision support systems)
  • modeled after TPC-D, using Oracle7
  • long running transactions, low process
    communication
  • parallelized queries
  • AltaVista
  • Web index search application using custom threads
    package
  • medium sized transactions, low process
    communication
  • multiple transactions in-transit

7
Methodology: Platform
  • AlphaServer4100 5/300
  • 4x 300 MHz processors (8KB/8KB I/D caches, 96KB
    L2 cache)
  • 2MB board-level cache
  • 2GB main memory
  • latencies: 1:7:21:80/125 cycles
  • 3-channel HSZ disk array controller
  • Digital Unix 4.0B

8
Methodology: Tools
  • Monitoring tools
  • IPROBE
  • DCPI
  • ATOM
  • Simulation tools
  • tracing: preliminary user-level studies
  • SimOS-Alpha: full-system simulation, including the OS

9
Scaling
  • Workload sizes make them difficult to study
  • Scaling the problem size is critical
  • Validation criterion: memory system behavior
    similar to that of larger runs
  • Requires good understanding of workload
  • make sure system is well tuned
  • keep SGA many times larger than hardware caches
    (1GB)
  • use the same number of servers/processor as
    audit-sized runs (4-8/CPU)

10
CPU Cycle Breakdown
  • Very high CPI for OLTP
  • Instruction- and data-related stalls are equally
    important

11
Cache behavior
12
Stall Cycle Breakdown
  • OLTP dominated by non-primary cache and memory
    stalls
  • DSS and AltaVista stalls are mostly Scache hits

13
Impact of On-Chip Cache Size
P4 2MB, 2-way off-chip cache
  • 64KB on-chip caches are enough for DSS

14
OLTP: Effect of Off-Chip Cache Organization
P4
  • Significant benefits from large off-chip caches
    (up to 8MB)

15
OLTP: Impact of system size
P4 2MB, 2-way off-chip cache
  • Communication misses become dominant for larger
    systems

16
OLTP: Contribution of Dirty Misses
P4, 8MB Bcache
  • Shared metadata is the important region
  • 80% of off-chip misses
  • 95% of dirty misses
  • Fraction of dirty misses increases with cache
    and system size

17
OLTP: Impact of Off-Chip Cache Line Size
P4 2MB, 2-way off-chip cache
  • Good spatial locality on communication for OLTP
  • Very little false sharing in Oracle itself

18
Summary of Results
  • On-chip cache
  • 64KB I/D sufficient for DSS and AltaVista
  • Off-chip cache
  • OLTP benefits from larger caches (up to 8MB)
  • Dirty misses
  • Can become dominant for OLTP

19
Conclusion
  • Memory system is the current challenge in DB
    performance
  • Careful scaling enables detailed studies
  • Combination of monitoring and simulation is very
    powerful
  • Diverging memory system designs
  • OLTP benefits from large off-chip caches, fast
    communication
  • DSS and AltaVista may perform better without an
    off-chip cache

20
Impact of out-of-order processors
  • Collaboration with
  • Kourosh Gharachorloo (Compaq)
  • Parthasarathy Ranganathan and Sarita Adve (Rice)
  • Presented at ASPLOS '98

21
Motivation
  • Databases: the fastest-growing market for
    shared-memory servers
  • Online transaction processing (OLTP)
  • Decision-support systems (DSS)
  • But current systems optimized for
    engineering/scientific workloads
  • Aggressive use of Instruction-Level Parallelism
    (ILP)
  • Multiple issue, out-of-order issue,
  • non-blocking loads, speculative execution
  • Need to re-evaluate system design for database
    workloads

22
Contributions
  • Detailed simulation study of Oracle with ILP
    processors
  • Is ILP design complexity warranted for database
    workloads?
  • Improve performance (1.5X OLTP, 2.6X DSS)
  • Reduce performance gap between consistency models
  • How can we improve performance for OLTP
    workloads?
  • OLTP limited by instruction and migratory data
    misses
  • Small stream buffer close to perfect instruction
    cache
  • Prefetching/flush appear promising

23
Simulation Environment - Workloads
  • Oracle 7.3.2 commercial DBMS engine
  • Database workloads
  • Online transaction processing (OLTP) - TPC-B-like
  • Day-to-day business operations
  • Decision-support System (DSS) - TPC-D/Query 6
  • Offline business analysis

24
Simulation Environment - Methodology
  • Used RSIM - Rice Simulator for ILP
    Multiprocessors
  • Detailed simulation of processor, memory, and
    network
  • But simulating commercial-grade database engine
    hard
  • Some simplifications
  • Similar to Lo et al. and Barroso et al., ISCA '98

25
Simulation Methodology - Simplifications
  • Trace-driven simulation
  • OS/system-call simulation
  • OS not a large component
  • Model only key effects
  • Page-mapping, TLB misses, process scheduling
  • System-call and I/O time dilation effects
  • Multiple processes per processor to hide I/O
    latency
  • Database scaling

26
Simulated Environment - Hardware
  • 4-processor shared-memory system - 8 processes
    per processor
  • Directory-based MESI protocol with invalidations
  • Next-generation processing nodes
  • Aggressive ILP processor
  • 128 KB 2-way separate instruction and data L1
    caches
  • 8MB 4-way unified L2 cache
  • Representative miss latencies

27
Outline
  • Motivation
  • Simulation Environment
  • Impact of ILP on Database Workloads
  • Multiple issue and OOO issue for OLTP
  • Multiple outstanding misses for OLTP
  • ILP techniques for DSS
  • ILP-enabled consistency optimizations
  • Improving Performance of OLTP
  • Conclusions

28
Multiple Issue and OOO Issue for OLTP
[Bar chart: normalized execution time. In-order processors: 100.0, 92.1, 90.1, 88.8; out-of-order processors: 86.8, 74.3, 68.4, 67.8]
  • Multiple issue and OOO improve performance by
    1.5X
  • But 4-way issue with a 64-element window is enough
  • Instruction misses and dirty misses are the key
    bottlenecks

29
Multiple Outstanding Misses for OLTP
[Bar chart: normalized execution time 100.0, 83.2, 79.4, 79.4 as outstanding-miss support increases]
  • Support for two distinct outstanding misses is
    enough
  • Data-dependent computation limits further overlap

30
Impact of ILP Techniques for DSS
[Bar chart: normalized execution time. In-order processors: 100.0, 89.2, 74.1, 68.4; out-of-order processors: 68.1, 52.1, 39.7, 39.0]
  • Multiple issue and OOO improve performance by
    2.6X
  • 4-way issue, a 64-element window, and 4 outstanding
    misses are enough
  • Memory is not a bottleneck

31
ILP-Enabled Consistency Optimizations
  • Memory consistency model of shared-memory system
  • Specifies ordering and overlap of memory
    operations
  • Performance/programmability tradeoff
  • Sequential consistency (SC)
  • Processor consistency (PC)
  • Release consistency (RC)
  • ILP-enabled consistency optimizations
  • Hardware prefetching, Speculative loads
  • Impact on database workloads?

32
ILP-Enabled Consistency Optimizations
SC: sequential consistency; PC: processor
consistency; RC: release consistency
  • ILP-enabled optimizations
  • OLTP RC only 1.1X better than SC (was 1.4X)
  • DSS RC only 1.18X better than SC (was 1.85X)
  • Consistency model choice in hardware less
    important

33
Outline
  • Motivation
  • Simulation Environment
  • Impact of ILP on Database Workloads
  • Improving Performance of OLTP
  • Improving OLTP - Instruction Misses
  • Improving OLTP - Dirty misses
  • Conclusions

34
Improving OLTP - Instruction Misses
[Bar chart: normalized execution time 100, 83, 71]
  • 4-element instruction cache stream buffer
  • hardware prefetching of instructions
  • 1.21X performance improvement
  • Simple and effective for database servers
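The stream-buffer idea can be sketched as a toy model (a hypothetical Python simulation, not the hardware studied; the cache is just a set of line addresses and only sequential prefetch behavior is modeled):

```python
BUFFER_DEPTH = 4  # 4-element stream buffer, as on the slide

class StreamBufferICache:
    """Toy I-cache front end with a sequential-prefetch stream buffer."""
    def __init__(self):
        self.cache = set()    # line addresses already on chip
        self.buffer = []      # FIFO of prefetched line addresses
        self.misses = 0       # fetches that had to go off-chip

    def fetch(self, line):
        if line in self.cache:
            return
        if line in self.buffer:
            # Stream-buffer hit: promote the line and prefetch one more
            # sequential line to keep the buffer full.
            self.buffer.remove(line)
            self.buffer.append(line + BUFFER_DEPTH)
        else:
            # Miss everywhere: pay the full penalty and restart the
            # buffer at the next sequential lines.
            self.misses += 1
            self.buffer = [line + i for i in range(1, BUFFER_DEPTH + 1)]
        self.cache.add(line)

sb = StreamBufferICache()
for line in range(20):   # a straight-line code stream of 20 I-cache lines
    sb.fetch(line)
print(sb.misses)         # only the first fetch misses off-chip
```

On a straight-line stream of 20 cold instruction lines, only the first fetch goes off-chip; without the buffer, all 20 would.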

35
Improving OLTP - Dirty Misses
  • Dirty misses
  • Mostly to migratory data
  • Due to few instructions in critical sections
  • Solutions for migratory reads
  • Software prefetching and producer-initiated flushes
  • Preliminary results without access to source code
  • 1.14X performance improvement

36
Summary
  • Detailed simulation study of Oracle with
    out-of-order processors
  • Impact of ILP techniques on database workloads
  • Improve performance (1.5X OLTP, 2.6X DSS)
  • Reduce performance gap between consistency models
  • Improving performance of OLTP
  • OLTP limited by instruction and migratory data
    misses
  • Small stream buffer close to perfect instruction
    cache
  • Prefetching/flush appear promising

37
Simultaneous Multithreading (SMT)
  • Collaboration with
  • Kourosh Gharachorloo (Compaq)
  • Jack Lo, Susan Eggers, Hank Levy, Sujay Parekh
    (U. Washington)
  • Exploit multithreaded nature of commercial
    applications
  • Aggressive wide-issue OOO superscalars saturate
    at 4 issue slots
  • Potential to increase utilization of issue slots
  • Potential to exploit parallelism in the memory
    system

38
SMT: what is it?
  • SMT enables multiple threads to issue
    instructions to multiple functional units in a
    single cycle
  • SMT exploits instruction-level and thread-level
    parallelism
  • Hides long latencies
  • Increases resource utilization and instruction
    throughput

[Diagram: issue-slot utilization in superscalar, fine-grain multithreading, and SMT, interleaving instructions from four threads]
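The difference the slide's diagram illustrates can be sketched as a toy issue model (hypothetical Python; only the 8-wide width comes from the configuration slide, and the per-cycle ready-instruction counts are made up):

```python
WIDTH = 8  # issue slots per cycle, matching the 8-wide SMT configuration

def issue_superscalar(ready_per_thread):
    # A conventional superscalar can fill its slots only from the one
    # running thread; the remaining slots go idle.
    return min(ready_per_thread[0], WIDTH)

def issue_smt(ready_per_thread):
    # SMT fills the same slots from any hardware context with ready
    # instructions (simple priority order here).
    slots = WIDTH
    issued = 0
    for ready in ready_per_thread:
        take = min(ready, slots)
        issued += take
        slots -= take
    return issued

ready = [2, 3, 4, 1]  # hypothetical ready counts for 4 contexts this cycle
print(issue_superscalar(ready), issue_smt(ready))
```

With these ready counts, the single-threaded machine issues 2 instructions in the cycle while SMT fills all 8 slots.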
39
SMT and database workloads
  • Pro
  • SMT is a good match: DB workloads can take
    advantage of SMT's multithreading HW
  • Low throughput
  • High cache miss rates
  • Con
  • Fine-grain interleaving can cause cache
    interference
  • What software techniques can help avoid
    interference?

40
SMT studies: methodology
  • Trace-driven simulation
  • Same traces used in previous ILP study
  • New front-end to SMT simulator
  • Used OLTP and DSS workloads

41
SMT Configuration
  • 21264-like superscalar base, augmented with
  • up to 8 hardware contexts
  • 8-wide superscalar
  • 128KB, 2-way I and D L1 caches, 2-cycle access
  • 16MB, direct-mapped L2 cache, 12 cycle access
  • 80 cycle memory latency
  • 10 functional units (6 integer (4 ld/st), 4 FP)
  • 100 additional integer and FP renaming registers
  • integer and FP instruction queues, 32 entries each

42
OLTP Characterization
  • Memory behavior (1 context, 16 server processes)
  • High miss rates and large footprints

43
Cache interference (16 server processes)
  • With 8-context SMT, many conflict misses
  • DSS data set fits in L2

44
Where are the misses?
16 server processes,
8-context SMT
[Charts: miss breakdown by region, PGA vs. instructions]
  • L1 and L2 misses dominated by PGA references
  • Misses result from unnecessary address conflicts

45
L2 conflicts: page mapping
  • Page coloring can be augmented with random first
    seed
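The scheme can be sketched in a few lines (hypothetical Python, not the Digital Unix implementation; the color count assumes a 16MB direct-mapped L2 with 8KB pages, as in the SMT configuration):

```python
CACHE_COLORS = 2048   # e.g. a 16MB direct-mapped L2 with 8KB pages

def page_color(vpn, process_seed):
    # Plain page coloring keys a page's cache color off its virtual page
    # number, so identical virtual layouts in different processes collide
    # in a physically-indexed cache.  XORing in a random first seed per
    # process spreads those identical layouts across the cache.
    return (vpn ^ process_seed) % CACHE_COLORS

# The same virtual page (vpn 100) in 16 processes with distinct seeds
# gets 16 distinct colors instead of one shared hot spot.
colors = {page_color(100, seed) for seed in range(16)}
print(len(colors))
```

With seed 0 this degenerates to ordinary page coloring; distinct seeds are what break the cross-process conflicts.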

46
Results for different page mapping schemes
16 MB, direct-mapped L2 cache, 16 server processes
DSS
[Chart: global L2 cache miss rate (y-axis 8.0 to 10.0) under different page-mapping schemes]
47
Why the steady L2 miss rates?
  • Not all of the footprint has temporal locality
  • Critical working sets are being cached
  • 87% of instruction refs are to 31% of the
    I-footprint
  • 41% of metadata refs are to 26KB
  • SMT and superscalar cache misses are comparable
  • SMT changes the interleaving, not the total
    footprint
  • With proper global policies, working sets still
    fit in caches; SMT is effective

48
L1 conflicts: application-level offsetting
  • Base of each thread's PGA is at the same virtual
    address
  • Causes unnecessary conflicts in a virtually-indexed
    cache
  • Address offsets can avoid interference
  • Offset by thread id × 8KB
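The offsetting trick can be illustrated with a toy index calculation (hypothetical Python; the 128KB 2-way L1 comes from the configuration slide, while the 64-byte line size and the PGA base address are assumptions for illustration):

```python
WAY_SIZE = 64 * 1024   # bytes indexed per way of a 128KB 2-way L1
LINE = 64              # cache line size in bytes (assumed)
OFFSET = 8 * 1024      # per-thread offset: thread id * 8KB, as on the slide

def l1_index(vaddr):
    # Virtually-indexed cache: the set index comes straight from the
    # virtual address, so equal virtual addresses always conflict.
    return (vaddr % WAY_SIZE) // LINE

PGA_BASE = 0x4000_0000  # hypothetical common base of every thread's PGA

# Without offsetting, all 8 threads' PGA bases map to the same set...
same = {l1_index(PGA_BASE) for _ in range(8)}
# ...with per-thread offsets they land in 8 distinct sets.
spread = {l1_index(PGA_BASE + tid * OFFSET) for tid in range(8)}
print(len(same), len(spread))
```

The 8KB stride is what pushes each thread's hot PGA base into a different region of the virtually-indexed L1.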

49
Offsetting results
128KB, 2-way set associative L1 cache
[Chart: L1 miss rates for bin hopping with and without offsetting]
50
SMT: constructive interference
  • Cache interference can also be beneficial
  • Instruction segment is shared
  • SMT exploits instruction sharing
  • Improves I-cache locality
  • Reduces I-cache miss rate (OLTP)
  • 14% with superscalar → 9% with 8-context SMT

51
SMT overall performance
16 server processes
OLTP
[Chart: instructions per cycle (0.0 to 4.0) for superscalar vs. SMT]
52
Why SMT is effective
  • Exploits memory-system concurrency
  • Improves instruction fetch
  • Improves instruction issue

53
Exploiting memory system concurrency
  • OLTP has lots of pointer chasing
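A back-of-the-envelope latency model (a hypothetical Python sketch, not from the study; only the 80-cycle miss latency comes from the configuration slide) shows why dependent loads hurt and how independent SMT threads help:

```python
import math

MISS = 80  # cycles to memory, taken from the SMT configuration slide

def traversal_cycles(num_misses, overlap):
    # With pointer chasing, each load's address depends on the previous
    # load's data, forcing overlap = 1: the misses fully serialize.
    # Independent miss streams (e.g. other SMT threads' traversals)
    # raise the effective overlap and hide the latency.
    rounds = math.ceil(num_misses / overlap)
    return rounds * MISS

serial = traversal_cycles(8, overlap=1)      # one dependent chain of 8 misses
overlapped = traversal_cycles(8, overlap=4)  # 4 independent miss streams
print(serial, overlapped)
```

A single dependent chain pays the full miss latency on every hop; four independent chains finish the same number of misses in a quarter of the time.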

54
Improving instruction fetch
  • SMT can fetch from multiple threads
  • Tolerate I-cache misses and branch mispredictions
  • Fetch fewer speculative instructions

55
Improving instruction issue
  • SMT exposes more parallelism
  • use instruction-level and thread-level parallelism

56
SMT Performance
57
Summary
  • Critical working sets for DB workloads
  • can still fit in caches even for SMT
  • fine-granularity interleaving can be accommodated
  • Cache interference can be avoided with simple
    policies
  • page mapping and application level offsetting
  • SMT miss rates comparable to superscalar
  • SMT is effective
  • 4.6x speedup on OLTP, 1.5x on DSS

58
Final remarks
  • We understand the architectural requirements of
    commercial applications better than we did a
    couple of years ago
  • But both technology and applications are moving
    targets
  • Lots to be done!

59
Final remarks (2)
  • Important emerging workloads
  • ERP benchmarks from software vendors
  • more representative of end-user OLTP performance
  • Better decision support system algorithms
  • Many Web-based applications
  • very young field
  • good benchmarks are still to come
  • TPC-W may be a good start
  • Enterprise-scale mail servers
  • Packet-switching servers for high-bandwidth
    subscriber connections (e.g., ADSL)

60
Final remarks (3)
  • New technological/architectural challenges
  • Large-scale NUMA architectures
  • worsen dirty miss problem
  • reliability and fault-containment
  • Increased integration
  • what's the next subsystem to move on-chip?
  • Explicitly parallel ISA?
  • Impact of next-generation Direct Rambus DRAMs
  • very low latency
  • Logic/DRAM integration
  • What if memory were non-volatile?

61
Final remarks (4)
  • More short term issues
  • How to reduce I-stream related stalls for OLTP?
  • How to reduce communication penalties in OLTP?
  • Prefetch/post-store?
  • Smarter coherence protocols?
  • How to deal with 100s of threads per processor?
  • Innovative ways to reduce latency of
    pointer-based access patterns?
  • Can clusters become competitive in REAL OLTP
    environments?