Title: Recursive Data Structure Profiling
1. Recursive Data Structure Profiling
- Easwaran Raman
- David I. August
- Princeton University
2. Motivation
- Huge processor-memory performance gap
- Latency > 100 cycles
- Significant fraction of memory operations in typical programs
- In many applications, Recursive Data Structures (RDS) constitute a large fraction of memory usage
[Figure: chart illustrating the gap, plotted against Year on a logarithmic axis (ticks at 10, 100 and 1000)]
3. Motivation
- Techniques to minimize the performance impact of this gap
- Caching, prefetching, out-of-order execution
- Not very successful for RDS
- Difficult to statically determine many RDS properties
- Accesses are irregular and usually lie on the critical path of execution
Traversal code (short loop body prevents efficient OoO execution):
    while (valid(node)) {
        // do something with node->data
        node = next(node);
    }
An RDS layout example (non-contiguous layout results in irregular access patterns): nodes at addresses 0x1000, 0x2000, 0x3000 and 0x4000.
4. Motivation
- Linearization [Clark76, Luk99]
- Speculation recovery costs outweigh the benefits if the next pointer field gets overwritten frequently
- Information on the dynamic behavior of the entire RDS is important
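[Figure: a linked list reached through head, with nodes at addresses 1000, 1004, 1008, 1012 and 1016, and the pos array. Placement of the nodes in the figure corresponds to their placement in memory.]
    index = 0;
    head = pos[index];
    while (head) {
        foo(head);
        head = pos[++index];   // speculate on the next node
        check(head);           // verify the speculation
    }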
5. RDS Profile
- RDS profiling gives a logical understanding of runtime behavior
- "Application creates 100 trees" instead of "application allocates 2 MB in heap"
- "Linked list traversed 10 times" instead of "address 0x10004000 accessed 200 times"
- Profile for linearization: "next pointer field in list L is modified n times"
6. RDS Discovery
C function for creating a tree:
    node *tree_create() {
        node *n = (node *)malloc(...);
        ...
        n->left = tree_create();
        n->right = tree_create();
        ...
    }
Execution trace in (pseudo) assembly:
    call malloc                    // id 1
    mov r10 = r8
    ...
    call tree_create
    call malloc                    // id 2
    mov r11 = r8
    store [r10 + offset1] = r11    // create 1->2
    call tree_create
    call malloc                    // id 3
    mov r12 = r8
    store [r10 + offset2] = r12    // create 1->3
[Figure: Dynamic Shape Graph (DSG) with nodes 1, 2 and 3 and edges 1->2 and 1->3]
- Assign a unique id to each value returned by malloc and create a node labeled by that id
- Connect nodes by a directed edge if both the address and the value of a store have valid ids
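The two rules above can be sketched in C as follows. This is only an illustration under stated assumptions, not the profiler's actual code: the names (`dsg_on_malloc`, `dsg_id_of`, `dsg_on_store`), the fixed-size tables, and the linear scan over allocations are all hypothetical, and frees are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64    /* hypothetical capacity, for illustration only */
#define MAX_EDGES 256   /* hypothetical capacity, for illustration only */

/* One heap block observed at run time; its id is its index + 1. */
typedef struct { uintptr_t start; size_t size; } alloc_rec;

typedef struct { int from, to; } dsg_edge;

typedef struct {
    alloc_rec allocs[MAX_NODES];
    int       n_allocs;
    dsg_edge  edges[MAX_EDGES];
    int       n_edges;
} dsg;

/* Rule 1: give the block returned by malloc a fresh id (a new DSG node).
 * Returns the id, or 0 if the table is full. */
int dsg_on_malloc(dsg *g, uintptr_t addr, size_t size)
{
    assert(size > 0);
    if (g->n_allocs == MAX_NODES) return 0;
    g->allocs[g->n_allocs].start = addr;
    g->allocs[g->n_allocs].size  = size;
    return ++g->n_allocs;
}

/* Map an address to the id of the block that contains it, or 0 if none. */
int dsg_id_of(const dsg *g, uintptr_t addr)
{
    for (int i = 0; i < g->n_allocs; i++) {
        if (addr >= g->allocs[i].start &&
            addr - g->allocs[i].start < g->allocs[i].size)
            return i + 1;
    }
    return 0;
}

/* Rule 2: on a store, add a directed edge only if both the target address
 * and the stored value fall inside tracked blocks.
 * Returns 1 if an edge was added, 0 otherwise. */
int dsg_on_store(dsg *g, uintptr_t target, uintptr_t value)
{
    int from = dsg_id_of(g, target);
    int to   = dsg_id_of(g, value);
    if (from == 0 || to == 0 || g->n_edges == MAX_EDGES) return 0;
    g->edges[g->n_edges].from = from;
    g->edges[g->n_edges].to   = to;
    g->n_edges++;
    return 1;
}
```

Replaying the trace from this slide (three allocations followed by the two stores into the first block) with a zero-initialized `dsg` would record the edges 1->2 and 1->3; a store whose target lies outside every tracked block adds nothing.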
7. RDS Discovery
- Multiple RDS instances can be connected together in the DSG!
- To separate them, we use properties of the static code
- Use another graph called Static Shape Graph (SSG)
8. RDS Discovery
Execution trace in (pseudo) assembly:
    call malloc                    // id 1
    mov r20 = r8
    call malloc                    // id 2
    mov r10 = r8
    ...
    call tree_create
    call malloc                    // id 3
    mov r11 = r8
    store [r10 + offset1] = r11    // create 2->3
    call tree_create
    call malloc                    // id 4
    mov r12 = r8
    store [r10 + offset2] = r12    // create 2->4
    store [r20 + 0] = r10          // create 1->2
- For every static call to malloc, create a node with a unique id in the Static Shape Graph (SSG)
- If a store creates an edge, connect the corresponding static nodes
- Check for strongly connected components (SCCs) in the SSG
- Connect two dynamic nodes only if their corresponding static nodes are in the same SCC
[Figure: SSG with static nodes A and T, and DSG with dynamic nodes 1 to 7]
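The SCC filter on this slide can be sketched as below. The sketch makes several assumptions: static malloc sites are numbered from 0, the SSG is complete before the filter is queried, and SCC membership is decided by mutual reachability computed with Warshall's transitive closure rather than by a linear-time SCC algorithm such as Tarjan's. All names (`ssg_init`, `ssg_add_edge`, `ssg_close`, `ssg_same_scc`) are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_SITES 16   /* hypothetical bound on static malloc call sites */

typedef struct {
    int n;                                      /* number of static nodes */
    unsigned char reach[MAX_SITES][MAX_SITES];  /* edges, later reachability */
} ssg;

void ssg_init(ssg *g, int n)
{
    assert(n >= 0 && n <= MAX_SITES);
    memset(g, 0, sizeof *g);
    g->n = n;
}

/* Record that a store linked a block from site `from` to a block from `to`. */
void ssg_add_edge(ssg *g, int from, int to)
{
    assert(from >= 0 && from < g->n && to >= 0 && to < g->n);
    g->reach[from][to] = 1;
}

/* Warshall's algorithm: afterwards reach[i][j] is 1 exactly when a path
 * of at least one edge leads from i to j. */
void ssg_close(ssg *g)
{
    for (int k = 0; k < g->n; k++)
        for (int i = 0; i < g->n; i++)
            if (g->reach[i][k])
                for (int j = 0; j < g->n; j++)
                    if (g->reach[k][j]) g->reach[i][j] = 1;
}

/* Valid after ssg_close: two sites share an SCC if each reaches the other.
 * A site is always treated as sharing an SCC with itself. */
int ssg_same_scc(const ssg *g, int a, int b)
{
    return a == b || (g->reach[a][b] && g->reach[b][a]);
}
```

For the trace on this slide, let site 0 be the first malloc and site 1 the malloc inside tree_create. The stores add the static edges 1->1 and 0->1, so dynamic nodes allocated at site 1 may be linked to each other, while the edge 1->2 between a site-0 block and a site-1 block is rejected.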
9. Experimental setup
- Uses Pin, a dynamic instrumentation tool for Itanium
- The mapping between address ranges and dynamic ids is stored in an AVL tree
- Most recent mapping is cached
- A mix of benchmarks from SPEC, Olden and other pointer-intensive applications
- Dynamic instruction count varies from a few million (ks) to over 300 billion (mesa)
- All experiments run on a 900 MHz Itanium 2 with 2 GB RAM running RH 7.1
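The address-range lookup with a cached most-recent mapping might look roughly like the sketch below. Note the substitution: the slides state that an AVL tree is used, whereas this sketch keeps the ranges in a sorted array searched by binary search, purely to stay short. The names (`range_map`, `map_insert`, `map_lookup`) and the `cache_hits` counter are hypothetical, and ranges are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 1024   /* hypothetical capacity */

typedef struct { uintptr_t start; size_t size; int id; } range;

typedef struct {
    range r[MAX_RANGES];  /* sorted by start; ranges do not overlap */
    int   n;
    int   last;           /* index of the most recently matched range, or -1 */
    long  cache_hits;     /* bookkeeping for this example only */
} range_map;

void map_init(range_map *m) { m->n = 0; m->last = -1; m->cache_hits = 0; }

static int contains(const range *r, uintptr_t addr)
{
    return addr >= r->start && addr - r->start < r->size;
}

/* Insert a range, shifting later entries so the array stays sorted.
 * Returns 1 on success, 0 if the table is full. */
int map_insert(range_map *m, uintptr_t start, size_t size, int id)
{
    assert(size > 0);
    if (m->n == MAX_RANGES) return 0;
    int i = m->n;
    while (i > 0 && m->r[i - 1].start > start) {
        m->r[i] = m->r[i - 1];
        i--;
    }
    m->r[i].start = start;
    m->r[i].size  = size;
    m->r[i].id    = id;
    m->n++;
    m->last = -1;   /* indices may have shifted, so drop the cached entry */
    return 1;
}

/* Return the id of the range containing addr, or 0 if there is none. */
int map_lookup(range_map *m, uintptr_t addr)
{
    /* Fast path: consecutive accesses often hit the same block. */
    if (m->last >= 0 && contains(&m->r[m->last], addr)) {
        m->cache_hits++;
        return m->r[m->last].id;
    }
    int lo = 0, hi = m->n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (contains(&m->r[mid], addr)) {
            m->last = mid;
            return m->r[mid].id;
        }
        if (addr < m->r[mid].start) hi = mid - 1;
        else lo = mid + 1;
    }
    return 0;
}
```

The one-entry cache pays off when successive memory operations touch the same heap block, which is common while a single node is being processed; a balanced tree such as the AVL tree named on the slide additionally keeps insertions logarithmic, which the array version does not.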
10. Profiler Performance
- Profile RDS size, lifetime, access count
- Memory < 16 MB for all but 3 applications
Baseline: execution using Pin (roughly 10 times slower than native)
11. RDS usage statistics
- SCCs in static shape graph (RDS types)
  - Usually a few (< 5) per benchmark, a maximum of 31 in parser
- RDS instances (connected components in DSG)
  - Exhibit a wide range (1 in mcf to around a million in parser)
  - Tend to be live for long if the program creates only a few of them
- Sizes of RDS instances
  - Vary from a single-node self-loop (parser) to a few hundred thousand nodes (mcf, parser)
- Pointer-chasing loads
  - Significant in many benchmarks
- Applications show vast diversity in RDS usage
  - A good reason for profiling them!
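Counting RDS instances as connected components of the DSG can be sketched with a union-find structure. This is an assumed approach, not necessarily what the profiler does: edge direction is ignored, dynamic nodes are numbered from 0, and the names (`uf_init`, `uf_find`, `uf_union`, `uf_count`) are hypothetical. Isolated nodes count as components here; a real profiler might discard them.

```c
#include <assert.h>

#define MAX_DYN 1024   /* hypothetical bound on the number of dynamic nodes */

typedef struct { int parent[MAX_DYN]; int n; } uf;

/* Start with every dynamic node in its own component. */
void uf_init(uf *u, int n)
{
    assert(n >= 0 && n <= MAX_DYN);
    u->n = n;
    for (int i = 0; i < n; i++) u->parent[i] = i;
}

/* Find the representative of x, shortening the path along the way. */
int uf_find(uf *u, int x)
{
    while (u->parent[x] != x) {
        u->parent[x] = u->parent[u->parent[x]];   /* path halving */
        x = u->parent[x];
    }
    return x;
}

/* Merge the components of a and b; call once per accepted DSG edge. */
void uf_union(uf *u, int a, int b)
{
    int ra = uf_find(u, a), rb = uf_find(u, b);
    if (ra != rb) u->parent[ra] = rb;
}

/* Number of components, i.e. nodes that are their own representative. */
int uf_count(uf *u)
{
    int c = 0;
    for (int i = 0; i < u->n; i++)
        if (uf_find(u, i) == i) c++;
    return c;
}
```

With four dynamic nodes and only the two tree edges accepted by the SCC filter, the three tree nodes end up in one component and the remaining node stays on its own.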
12. Temporal distribution
13. Cumulative distribution of RDS lifetimes
14. RDS Stability
- Stability of an RDS: a notion of how 'array-like' an RDS is
- Stability index: an attempt to quantify this notion
- Identify the time instances (alteration points) when changes occur to the RDS structure (by stores that replace existing pointers)
- Count the traversals between successive alteration points
- Stability index: number of intervals that account for most of the traversals
- Lower index means higher stability
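One possible reading of this definition is sketched below. The slides only say "most of the traversals", so the `percent` threshold is an assumption, as are the function name `stability_index` and the choice to take the busiest intervals first. The sketch also assumes the counts are small enough that multiplying them by 100 does not overflow.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: orders unsigned long values from largest to smallest. */
static int cmp_desc(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/* counts[i] = traversals seen in the i-th interval between successive
 * alteration points. Returns the smallest number of intervals whose
 * traversals add up to at least `percent` percent of all traversals,
 * or -1 if memory cannot be allocated. */
int stability_index(const unsigned long *counts, size_t n, unsigned percent)
{
    assert(percent <= 100);
    if (n == 0) return 0;

    unsigned long *sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL) return -1;
    memcpy(sorted, counts, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_desc);

    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) total += sorted[i];

    unsigned long covered = 0;
    int used = 0;
    /* Take the busiest intervals first until the threshold is reached. */
    for (size_t i = 0; i < n && covered * 100 < total * percent; i++) {
        covered += sorted[i];
        used++;
    }
    free(sorted);
    return used;
}
```

Under this reading, an RDS whose traversals almost all fall into one long interval gets an index of 1 (highly stable, 'array-like'), while one whose traversals are spread evenly over many short intervals gets an index close to the number of intervals.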
15. Cumulative distribution of stability index
16. Conclusion
- Aggressive data-structure-level optimization techniques for RDS need profile information for improved performance
- RDS profiling gives a better understanding of the runtime behavior of RDS
- RDS usage varies widely across benchmarks
17. Extra Slides
18. RDS Profiling Definitions
- RDS type: the abstract form of the logical data structure that is manipulated by the program
- Examples: list, binary tree, graph, etc.
- Can be mutually recursive (nodes point to their incident edges and vice versa to form a graph)
- RDS instance: a concrete realization of the RDS type
- Examples: the tree created in function foo, the list pointed to by the first entry of the hash table