Title: The Potential for Variable-Granularity Access Tracking for Optimistic Parallelism
1. The Potential for Variable-Granularity Access Tracking for Optimistic Parallelism
- Mihai Burcea, J. Gregory Steffan, Cristiana Amza
- University of Toronto
- MSPC 2008
2. Getting the Most Out of Your CPUs
- Ubiquitous CMPs
- How do we exploit all this parallelism?
- How do we improve sequential applications?
[Figures: AMD Barcelona quad-core; Intel Kentsfield quad-core]
3. Optimistic Parallelism
- Flavors
- Transactional Memory (TM)
- Thread-Level Speculation (TLS)
- Implementations: hardware, software, hybrid
- Common required support
- Buffering speculative memory changes
- Tracking and detecting memory access conflicts
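The two kinds of required support can be illustrated with a minimal sketch. It is not taken from any of the systems discussed; the names `SpeculativeTask`, `load`, `store`, and `commit` are hypothetical, and memory is modeled as a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class SpeculativeTask:
    """One speculative thread or transaction (hypothetical model)."""
    name: str
    read_set: set = field(default_factory=set)         # tracked reads
    write_buffer: dict = field(default_factory=dict)   # buffered speculative writes

    def load(self, memory, addr):
        # Forward from our own buffered write if there is one.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        # Otherwise read committed memory and remember that we did.
        self.read_set.add(addr)
        return memory.get(addr, 0)

    def store(self, addr, value):
        # Buffer the change; committed memory is untouched until commit.
        self.write_buffer[addr] = value


def commit(task, memory, later_tasks):
    """Apply buffered writes and return the names of later tasks that
    already read a location this task wrote (a read-after-write conflict)."""
    memory.update(task.write_buffer)
    written = set(task.write_buffer)
    return [t.name for t in later_tasks if t.read_set & written]
```

Here every address is tracked individually; the rest of the talk is about how coarse that tracking unit should be.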
4. Traditional Access Tracking
- Most approaches use some fixed granularity
- Hardware TM/TLS: cache-line size
- Typically 32/64/128 bytes
- Software TLS: word- or object-level
- Software TM: word, page, or object granularity
- Hybrid TM: mixture of the above (in HW/SW)
Is fixed granularity the best approach?
5. Can We Reduce the Overhead of Dependence Tracking?
[Figure: granularity axis from fine to coarse. Fine granularity: too much overhead. Coarse granularity: too many false conflicts.]
- Key intuition: the best granularity likely varies within and across benchmarks
6. False Conflicts when Using Uniform Coarse Granularity
Measured in a TLS simulator with cache-line sizes of 32/64/128 bytes
The uniform coarse-grain approach suffers false conflicts
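A small sketch of how a false conflict arises, assuming conflicts are detected by comparing block numbers (address divided by the tracking granularity). The helper names and the addresses are made up for illustration and are not from the simulator.

```python
def block_id(addr, granularity):
    """Map a byte address to the tracking unit (block) that contains it."""
    return addr // granularity


def count_conflicts(writes, reads, granularity):
    """Count reads that fall in a block written by an earlier thread."""
    written_blocks = {block_id(a, granularity) for a in writes}
    return sum(1 for a in reads if block_id(a, granularity) in written_blocks)


writes = [0x1000]          # earlier thread writes one word
reads = [0x1000, 0x1010]   # later thread reads that word and an unrelated word

for g in (4, 32, 64, 128):
    print(g, count_conflicts(writes, reads, g))
```

At 4 bytes only the read of `0x1000` conflicts. At 32 bytes and above, `0x1010` shares a block with the written word, so it is flagged as well even though no data actually flows between the two accesses.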
7. Is there potential for a variable-granularity approach?
8. Goals Of Our Work
- Show potential for Variable-Granularity Access Tracking (VGAT)
- Finest grain too expensive; which coarse grain?
- Show that ideal granularity varies across and within applications
- Suggests need for dynamic, adaptive scheme
- Show significant reduction in number of tracked memory ranges when using VGAT
9. Related Work
- Hardware TLS / TM: track accesses at cache-line size (32/64/128 bytes)
- Stampede (Steffan et al., ACM Trans. 2005), Speculative Versioning Cache (Vijaykumar et al., HPCA 1998)
- Unbounded TM (Ananian et al., HPCA 2005), LogTM (Moore et al., HPCA 2006)
- Software TLS
- Word (Cintra et al., PPoPP 2003)
- Object (Pickett et al., LCPC 2005)
- Software TM
- Word (McRT-STM, Saha et al., PPoPP 2006)
- Page (Manassiev et al., PPoPP 2006)
- Object: RSTM (Marathe et al., PLDI 2006), DSTM (Herlihy et al., PODC 2003)
Most systems use fixed or object grain, but that is not necessarily the best
10. Related Work: Bulk Disambiguation
- Ceze et al., ISCA 2006
- Encode read/write sets into signatures
- Detect conflicts by performing operations on signatures (fast)
- Design of hashing (encoding) addresses into signatures includes false positives
- Reduce conflict-detection traffic, but increase false conflicts
Our goal: minimize false conflicts
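To see why signatures trade traffic for false positives, here is a toy sketch. It is not the hashing used by Bulk; the 16-bit signature width and the single hash `(addr >> 2) % SIG_BITS` are assumptions chosen only to make aliasing easy to see.

```python
SIG_BITS = 16  # assumed signature width, far smaller than a real design


def signature(addrs):
    """Fold a set of word addresses into a fixed-size bit vector."""
    sig = 0
    for a in addrs:
        sig |= 1 << ((a >> 2) % SIG_BITS)
    return sig


def may_conflict(sig_a, sig_b):
    """Signature intersection: fast, but can report conflicts that are not real."""
    return (sig_a & sig_b) != 0


write_sig = signature({0x1000})
print(may_conflict(write_sig, signature({0x1004})))  # different bits
print(may_conflict(write_sig, signature({0x2000})))  # aliases to the same bit
```

`0x1000` and `0x2000` are different words but hash to the same bit, so the intersection reports a conflict that an exact set comparison would not.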
11. Variable Granularity Access Tracking
- Approaches: vary granularity across
- Time: parts of applications (speculative code regions)
- Space: ranges of memory
- Can potentially reduce
- Tracking storage
- Tracking traffic
- Commit latency
- False conflicts
12. Impact On Conflicts Of Increasing Granularity
Granularity (bytes)   Number of conflicts   Note
4                     100                   True (actual) conflicts
8                     100                   Same number of conflicts, still OK
16                    103                   Extra (false) conflicts!
32                    120                   Extra (false) conflicts!
Ideal granularity: the coarsest granularity that incurs no false conflicts
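The definition can be written down directly. The sketch below assumes that the conflict count at the finest measured granularity equals the number of true conflicts, and that granularities are nested (each coarser block is a union of finer ones), so once false conflicts appear they do not disappear again. The function name is hypothetical.

```python
def ideal_granularity(conflicts_by_granularity):
    """Return the coarsest granularity whose conflict count still equals
    the count at the finest granularity (taken as the true conflicts)."""
    finest = min(conflicts_by_granularity)
    true_conflicts = conflicts_by_granularity[finest]
    ideal = finest
    for g in sorted(conflicts_by_granularity):
        if conflicts_by_granularity[g] > true_conflicts:
            break  # false conflicts start here
        ideal = g
    return ideal


# Conflict counts from the slide's example.
slide_example = {4: 100, 8: 100, 16: 103, 32: 120}
print(ideal_granularity(slide_example))  # prints 8
```

For the numbers on this slide, 8 bytes is the ideal granularity: it halves the number of tracked units relative to 4 bytes without adding a single conflict.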
13. Measuring the Potential for VGAT
14. Experimental Framework
- TLS simulator (CMU)
- Subset of SPECint2000 benchmarks
- Instrumented for TLS
- TLS regions mostly loop-based
- TLS regions pre-selected based on 32-byte reading and 4-byte writing granularity
- Focus on specific aspects
- Simulate first billion instructions
- Track only Read-After-Write dependences
Speculative code regions pre-selected for 32 bytes -> our results are conservative!
15. Variable Granularity at Code Region Level
[Figure: three speculative code regions, each delimited by fork and join, with the memory accessed by each region tracked at a single per-region granularity: Region 1 at 4 bytes, Region 2 at 32 bytes, Region 3 at 8 bytes.]
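One way to picture region-level variable granularity is a lookup performed when a speculative region is forked. The sketch below is an illustration only: the table reuses the three granularities shown in the figure, while `RegionTracker` and `run_region` are hypothetical names.

```python
# Per-region tracking granularity in bytes (values from the figure).
REGION_GRANULARITY = {1: 4, 2: 32, 3: 8}


class RegionTracker:
    """Tracks accessed blocks at the granularity chosen for one region."""

    def __init__(self, region_id):
        self.granularity = REGION_GRANULARITY[region_id]
        self.tracked_blocks = set()

    def track(self, addr):
        self.tracked_blocks.add(addr // self.granularity)


def run_region(region_id, addrs):
    tracker = RegionTracker(region_id)   # fork: select the region's granularity
    for a in addrs:
        tracker.track(a)
    return len(tracker.tracked_blocks)   # join: number of tracked units


accesses = range(0x1000, 0x1020, 4)      # eight consecutive words
for region in (1, 2, 3):
    print(region, run_region(region, accesses))
```

The same eight word accesses cost eight tracked units in Region 1, one in Region 2, and four in Region 3.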
16. Ideal Granularity at Code Region Level
[Figure: ideal granularity per code region, with reference levels marked at page level (4 KB), cache-line level, and word level]
Code regions with no conflicts are not shown in the figure (in parentheses)
Ideal granularity varies significantly between code regions
17. Variable Granularity Across Memory Ranges
[Figure: three speculative code regions, each delimited by fork and join; within the memory accessed by each region, different ranges are tracked at different granularities (4, 8, or 32 bytes).]
18. Ideal Granularity Across Memory Ranges
Cache-line size sometimes good, sometimes not
Word-level rarely necessary
Page-level often sufficient
Ideal Granularity varies widely across memory
ranges
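A sketch of how a per-range ideal granularity could be computed from an access trace. This is not the methodology of the study: splitting memory into page-sized ranges, the power-of-two search, and all names and addresses below are assumptions for illustration.

```python
WORD, PAGE = 4, 4096


def count_conflicts(writes, reads, g):
    """Reads that land in a block of size g written by an earlier thread."""
    written = {a // g for a in writes}
    return sum(1 for a in reads if a // g in written)


def ideal_per_range(writes, reads, range_size=PAGE):
    """For each range, find the coarsest power-of-two granularity that
    reports no more conflicts than word-level tracking does."""
    result = {}
    bases = sorted({a // range_size * range_size for a in writes | reads})
    for base in bases:
        w = {a for a in writes if base <= a < base + range_size}
        r = {a for a in reads if base <= a < base + range_size}
        true_conflicts = count_conflicts(w, r, WORD)
        g = WORD
        while g * 2 <= range_size and count_conflicts(w, r, g * 2) == true_conflicts:
            g *= 2
        result[base] = g
    return result


writes = {0x1000, 0x2000, 0x3000}
reads = {0x1000, 0x2004, 0x3020}
for base, g in ideal_per_range(writes, reads).items():
    print(hex(base), g)
```

In this made-up trace the first page can be tracked as a whole, the second needs word-level tracking because a neighboring word is read, and the third tolerates 32 bytes but not 64.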
19. Can VGAT improve performance?
20. Reducing the Number of Tracked Elements by Using Variable Granularity
[Figure: data labels 51, 50, 35, 458, 61, 31, 9, 5, 3]
VGAT can reduce the number of tracked elements by more than 3x!
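The source of the reduction can be shown with a toy count. The sketch compares word-level tracking against a hypothetical per-range granularity table; the table contents, the access pattern, and the resulting ratio are invented for illustration and are not the measured results.

```python
import bisect

WORD = 4

# Hypothetical table: (range start address, granularity in bytes), sorted by start.
RANGE_GRANULARITY = [(0x1000, 4096), (0x2000, 4)]
RANGE_STARTS = [start for start, _ in RANGE_GRANULARITY]


def granularity_for(addr):
    """Look up the granularity of the range containing addr."""
    i = bisect.bisect_right(RANGE_STARTS, addr) - 1
    if i < 0:
        return WORD  # below every known range: fall back to word level
    return RANGE_GRANULARITY[i][1]


def tracked_elements(addrs, granularity_of):
    """Number of distinct tracking units touched by the accesses."""
    return len({(granularity_of(a), a // granularity_of(a)) for a in addrs})


accesses = list(range(0x1000, 0x1100, 4)) + [0x2000, 0x2004]
fixed = tracked_elements(accesses, lambda a: WORD)
variable = tracked_elements(accesses, granularity_for)
print(fixed, variable)
```

Sixty-four words in a range that tolerates page-level tracking collapse into one element, while the two words that need word-level tracking stay separate.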
21. Ongoing Work
- Should memory-centric or code-centric accesses determine granularity?
- Dynamic, adaptive system for deciding granularity based on iterative sampling
- How best to use and store profile information
- May tolerate some percentage of false conflicts
- Hardware TLS
- Reduce conflict-detection traffic, possibly power
- Software TM (lock-based)
- Reduce number of locks: save space and time
- Reduce lock contention
22. Conclusions (for Stampede TLS)
- TM/TLS systems with only fixed coarse granularity may suffer many false conflicts
- 2x to 4x on average
- Variable granularity can reduce false conflicts and tracking overhead
- 3x to 35x reduction in tracked ranges
- Ideal granularity varies widely across memory ranges and speculative code regions