Title: Performance Visualizations using XML Representations
1 Performance Visualizations using XML Representations
- Presented by Kristof Beyls, Yijun Yu, Erik H. D'Hollander
2 Overview
- Background: program optimization research
- XML representations
- Visualizations
- Conclusion
3 Program optimization research
- What slows down a program's execution? We need to pinpoint the performance bottlenecks (by analyzing the program).
- How to improve the performance? By program transformations, based on the pinpointed bottlenecks.
- How to transform the program?
  - Compiler. Advantage: automatic optimization. Disadvantage: it is sometimes hard to understand what the program does.
  - Programmer. Advantage: has a good understanding of program functionality. Disadvantage: requires human effort.
  - This raises the question: how to present performance bottlenecks best?
- How to construct a research infrastructure that supports all the above in a common framework? (→ XML)
4 Two main performance factors
- Parallelism: performing computation in parallel reduces execution time.
- Data locality: fetching data from fast CPU caches reduces execution time.
6 Why XML representations?
- Extensible and versatile
- Standard and interoperable
- Language independent
Tools:
- yaxx: YACC extension to XML
- oc: Omega calculator
- isv: iteration space visualizer
- cv: cache (trace) visualizer
- distv: (cache reuse) distance visualizer
XML namespace (tool): what it represents
1. ast (yaxx): abstract syntax tree
2. par (oc): identified parallel or sequential loops
3. trace (isv, cv): execution trace of memory instructions
4. hotspot (isv, cv): performance bottleneck locations
5. isdg (isv): iteration space dependence graph
6. rdv (distv): a reuse distance vector
7 1. AST (Abstract Syntax Tree) (ast)
- XML is a good representation for an AST because of its hierarchical nature.
- The ast namespace captures the syntactical information of a program.
- We can construct the AST from source code through YAXX and regenerate source code through XSLT.
Example: the Fortran loop
    DO I=1,10,1
    ENDDO
is represented as
    <ast:DO_Loop>
      <var name="I"/>
      <lb><const value="1"/></lb>
      <ub><const value="10"/></ub>
      <st><const value="1"/></st>
      <body></body>
    </ast:DO_Loop>
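The round trip from the XML form back to source can be sketched in a few lines. The slides use XSLT for this step; the Python below is only a stand-in to show the idea, and the namespace URI and the function name `unparse_do_loop` are assumptions made for illustration, since the slides give only the `ast` prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the slides only name the prefix "ast".
AST_NS = "http://example.org/ast"

DO_LOOP_XML = f"""
<ast:DO_Loop xmlns:ast="{AST_NS}">
  <var name="I"/>
  <lb><const value="1"/></lb>
  <ub><const value="10"/></ub>
  <st><const value="1"/></st>
  <body></body>
</ast:DO_Loop>
"""


def unparse_do_loop(elem):
    """Regenerate Fortran text for one DO_Loop element with constant bounds."""
    var = elem.find("var").get("name")
    lb = elem.find("lb/const").get("value")   # lower bound
    ub = elem.find("ub/const").get("value")   # upper bound
    st = elem.find("st/const").get("value")   # stride
    return f"DO {var}={lb},{ub},{st}\nENDDO"


loop = ET.fromstring(DO_LOOP_XML)
fortran = unparse_do_loop(loop)
print(fortran)
```

The sketch handles only constant bounds and an empty body; a real unparser would recurse into `body` and into non-constant bound expressions.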
9 2. Parallel loops (par)
- Identified parallel loops are annotated with a <par:true/> element in the par namespace.
    <ast:DO_Loop>
      <par:true/>
      ...
    </ast:DO_Loop>
- In this way, semantic and syntactic information are kept in orthogonal namespaces. Syntax-based tools (e.g. the unparser) can still ignore the annotation, or translate it into directive comments, e.g. Fortran CDOALL.
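Why orthogonal namespaces help can be shown with a small sketch: one function looks only at the `par` annotation, another sees only the un-namespaced syntax children and never notices the annotation. The namespace URIs and the function names are hypothetical, chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs: the slides only give the prefixes.
AST_NS = "http://example.org/ast"
PAR_NS = "http://example.org/par"

ANNOTATED = f"""
<ast:DO_Loop xmlns:ast="{AST_NS}" xmlns:par="{PAR_NS}">
  <par:true/>
  <var name="I"/>
</ast:DO_Loop>
"""


def is_parallel(loop):
    """True when the loop carries a <par:true/> annotation."""
    return loop.find(f"{{{PAR_NS}}}true") is not None


def syntax_children(loop):
    """Child tags a syntax-only tool would see: namespaced annotations are skipped."""
    return [child.tag for child in loop if not child.tag.startswith("{")]


def directive_comment(loop):
    """Directive an unparser could emit in front of the loop (empty if sequential)."""
    return "CDOALL" if is_parallel(loop) else ""


annotated_loop = ET.fromstring(ANNOTATED)
print(is_parallel(annotated_loop), syntax_children(annotated_loop))
```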
10 XFPT: an extended optimizing compiler
12 3. Traces (trace)
- A trace records a sequence of memory address accesses.
    <trace:seq>
      <access addr="0x00ffe8" bytes="8"/>
      <access addr="0x00fff0" bytes="16"/>
      ...
    </trace:seq>
- A trace alone can be used to identify run-time data dependences, and to identify cache misses through a cache simulator.
- By associating each address with the array reference number or loop iteration index on the program's AST, the trace can be used for advanced loop dependence analysis and cache reuse distance analysis.
    <trace:seq>
      <access addr="0x00ffe8" bytes="8" hotspot:id="1">  <!-- The 1st reference -->
        <do_loop hotspot:id="1" vector="1 2"/>  <!-- The 1st DO loop, (I,J)=(1,2) -->
        <array hotspot:id="1" vector="1"/>  <!-- Reference to array element X(1) -->
      </access>
      ...
    </trace:seq>
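A consumer of such a trace might read it as follows. This is a minimal sketch, assuming the attribute spellings `addr`, `bytes`, `vector` and a namespaced `hotspot:id`; the namespace URIs and the record layout returned by `read_trace` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs: the slides only give the prefixes.
TRACE_NS = "http://example.org/trace"
HOTSPOT_NS = "http://example.org/hotspot"

TRACE_XML = f"""
<trace:seq xmlns:trace="{TRACE_NS}" xmlns:hotspot="{HOTSPOT_NS}">
  <access addr="0x00ffe8" bytes="8" hotspot:id="1">
    <do_loop hotspot:id="1" vector="1 2"/>
    <array hotspot:id="1" vector="1"/>
  </access>
  <access addr="0x00fff0" bytes="16"/>
</trace:seq>
"""


def read_trace(xml_text):
    """Return one dict per access: address, size, reference id, iteration vector."""
    id_attr = f"{{{HOTSPOT_NS}}}id"
    records = []
    for access in ET.fromstring(xml_text).iter("access"):
        loop = access.find("do_loop")
        iteration = None
        if loop is not None:
            iteration = tuple(int(v) for v in loop.get("vector").split())
        records.append({
            "addr": int(access.get("addr"), 16),
            "bytes": int(access.get("bytes")),
            "reference": access.get(id_attr),   # None for un-annotated accesses
            "iteration": iteration,
        })
    return records


records = read_trace(TRACE_XML)
print(records)
```

Plain accesses and annotated accesses go through the same reader, so a cache simulator can use only `addr` and `bytes` while a dependence analyzer also uses `reference` and `iteration`.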
13 4. Hotspots (hotspot)
- Hot spots are the identified bottlenecks of the program.
- Two types are used:
  - Bottleneck loops tell which loops are the performance bottlenecks.
  - Bottleneck references tell which references are the performance bottlenecks.
Example source:
    DIM T(3), X(10)
    REAL S, X
    DO I = 1, 10
      DO J = 1, 10
        S = S + X(I)*J
      ENDDO
    ENDDO
Corresponding hotspot list:
    <hotspot:list>
      <do_loop id="1">
        <index vector="I J"/>
        <start lineno="3" colno="1"/>
        <end lineno="7" colno="12"/>
      </do_loop>
      <array id="2" name="X">
        <dim><lb>1</lb><ub>10</ub></dim>
      </array>
      <reference id="1" type="R">
        <start lineno="5" colno="9"/>
        <end lineno="5" colno="14"/>
      </reference>
    </hotspot:list>
16 Performance Visualizations
- XML plays an important role in gluing the visualizers to an optimizing compiler.
- Loop dependence visualization
- Reuse distance visualization
- Cache behavior visualization
17 Visualization 1: ISDG (iteration space dependence graph)
- An iteration is an instance of the loop body statements. An iteration space is the set of integer vector values of the DO loop index variables for the traversed iterations.
- A loop-carried dependence is a dependence caused by two references R1 and R2 that access the same memory address, while
  - at least one of R1, R2 is a write;
  - R1 belongs to loop iteration (i1, j1) and R2 belongs to loop iteration (i2, j2) ≠ (i1, j1).
- An ISDG is a graph with nodes representing the iteration space and edges representing loop-carried dependences.
Example:
    DO i = 1,5
      DO j = 1,5
        A(i,j) = A(i,j+1)
      ENDDO
    ENDDO
[Figure: ISDG of the loop nest above; axes i and j, each ranging from 1 to 5]
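The ISDG of the small example can be built by brute force: run through the iteration space in execution order, remember which iteration touched which array element, and add an edge whenever two different iterations touch the same element and at least one of them writes. The sketch assumes the loop body is `A(i,j) = A(i,j+1)` (the operators are lost in the transcript) and all function names are made up for this illustration.

```python
from itertools import product


def iteration_space(n=5):
    """All (i, j) index vectors, in lexicographic (execution) order."""
    return list(product(range(1, n + 1), repeat=2))


def accesses(i, j):
    """(array element, is_write) pairs for A(i,j) = A(i,j+1): the read comes first."""
    return [((i, j + 1), False), ((i, j), True)]


def build_isdg(n=5):
    """Edges (earlier iteration, later iteration) for every loop-carried dependence."""
    edges = set()
    touched = {}  # array element -> list of (iteration, is_write) seen so far
    for it in iteration_space(n):
        for elem, is_write in accesses(*it):
            for prev_it, prev_write in touched.get(elem, []):
                # Different iterations, same address, at least one write.
                if prev_it != it and (is_write or prev_write):
                    edges.add((prev_it, it))
            touched.setdefault(elem, []).append((it, is_write))
    return edges


edges = build_isdg()
distances = {(b[0] - a[0], b[1] - a[1]) for a, b in edges}
print(len(edges), distances)
```

Under that assumption every edge has distance vector (0, 1): iteration (i, j) reads the element that iteration (i, j+1) later overwrites, so the dependences run along j only and the i loop carries none.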
18 The WTCM CFD application
- WTCM has a Computational Fluid Dynamics simulator which involves solving partial differential equations (PDEs) through a Gauss-Seidel solver.
[Figure labels: 3D geometry + 1D time; temperature]
19 The visualized dependences
20 The loop transformation
A 3-D unimodular transformation is found after visualizing the 4-D loop nest, which has 177 array references at run time in each iteration. Here we use a regular shape. The transformation makes it possible to speed up the program by around N^2/6 times, where N is the diameter of the geometry.
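The slides do not show the transformation matrix itself, so the following is only a generic 2-D illustration of what a unimodular loop transformation does, not the 3-D transformation found for the WTCM code. The matrix `T` (a loop skew) and the dependence distance vectors are hypothetical. A transformation is legal when every transformed distance vector stays lexicographically positive, and the inner loop becomes parallel when the outer loop carries every dependence.

```python
def det2(m):
    """Determinant of a 2x2 integer matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def apply(m, v):
    """Matrix-vector product for a 2x2 matrix and a length-2 vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(2)) for r in range(2))


def lex_positive(v):
    """True when the first non-zero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False


# Hypothetical example: a stencil with dependences along both loops,
# and a skewing matrix that turns the nest into a wavefront.
T = ((1, 1),
     (0, 1))
dependences = [(1, 0), (0, 1)]

transformed = [apply(T, d) for d in dependences]
is_unimodular = abs(det2(T)) == 1
is_legal = all(lex_positive(d) for d in transformed)
inner_parallel = all(d[0] > 0 for d in transformed)
print(transformed, is_unimodular, is_legal, inner_parallel)
```

In this toy case both dependences end up carried by the new outer loop, so all iterations on one wavefront can run in parallel.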
21 Visualization 2: Reuse distances
- Reuse distance is the amount of data accessed before a memory address is reused.
- reuse distance > cache size → cache miss
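A minimal sketch of how reuse distances can be computed from a trace, under one stated assumption: distance is counted as the number of distinct other addresses touched since the previous use of the same address (the slide does not spell out the unit). With that convention, a fully associative LRU cache that holds `capacity` addresses misses exactly when the address is new or its distance is at least `capacity`. The function names are illustrative.

```python
def reuse_distances(trace):
    """Per access: distinct other addresses since the previous use, None on first use."""
    stack = []      # LRU stack, most recently used address at the end
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            distances.append(len(stack) - 1 - pos)  # entries above addr in the stack
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def lru_misses(trace, capacity):
    """Misses of a fully associative LRU cache holding `capacity` addresses."""
    return sum(1 for d in reuse_distances(trace) if d is None or d >= capacity)


example = ["a", "b", "c", "a", "b", "b"]
print(reuse_distances(example))
print(lru_misses(example, 2), lru_misses(example, 3))
```

The list-based stack is quadratic in the worst case; it is kept here for readability, whereas a production tool would use a tree-based structure.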
23 Execution time reduction on an Itanium processor (SPEC2000 programs).
24 Visualization 3: Cache miss traces (Tomcatv/SPEC95)
Legend: white = hit, blue = compulsory, green = capacity, red = conflict
[Figure label: 56.7]
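The four colours correspond to the usual hit / compulsory / capacity / conflict classification. One common way to obtain it, sketched here with no claim that it is how the visualizer works internally, is to simulate the real set-associative cache next to a fully associative LRU cache of the same capacity: a first touch is compulsory, a miss that the fully associative cache would have avoided is a conflict miss, and any other miss is a capacity miss. All names and parameters are assumptions.

```python
from collections import OrderedDict

COLOURS = {"hit": "white", "compulsory": "blue", "capacity": "green", "conflict": "red"}


def classify(trace, size, line, assoc):
    """Classify each byte address in `trace` for an LRU cache of `size` bytes."""
    n_lines = size // line
    n_sets = n_lines // assoc
    sets = [OrderedDict() for _ in range(n_sets)]  # the simulated cache, LRU per set
    full = OrderedDict()                           # fully associative reference cache
    seen = set()
    result = []
    for addr in trace:
        block = addr // line
        # Update the fully associative reference cache.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) >= n_lines:
                full.popitem(last=False)           # evict least recently used
            full[block] = True
        # Update the simulated cache and classify the access.
        cache_set = sets[block % n_sets]
        if block in cache_set:
            cache_set.move_to_end(block)
            result.append("hit")
        else:
            if len(cache_set) >= assoc:
                cache_set.popitem(last=False)
            cache_set[block] = True
            if block not in seen:
                result.append("compulsory")
            elif full_hit:
                result.append("conflict")
            else:
                result.append("capacity")
        seen.add(block)
    return result


# Tiny direct-mapped cache: 64 bytes, 16-byte lines, 4 sets.
print(classify([0, 64, 0, 4], size=64, line=16, assoc=1))
```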
25 4.2 Visualizing hotspots of conflict cache misses
X(I,J+1) and X(I,J) conflict if X has dimension (512,512). The conflict is resolved by changing the dimension to (524,524), a technique also known as array padding.
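The effect of padding can be checked with simple address arithmetic. The sketch below assumes a column-major array of 8-byte elements starting at address 0, 32-byte cache lines and 128 sets (so addresses 4096 bytes apart share a set); these parameters are illustrative and are not taken from the slides. With a leading dimension of 512, neighbouring columns are exactly 4096 bytes apart and always land in the same set; with 524 they never do.

```python
def set_index(i, j, leading_dim, elem_bytes=8, line=32, n_sets=128):
    """Cache set of X(i,j) for a column-major array that starts at address 0."""
    offset = ((j - 1) * leading_dim + (i - 1)) * elem_bytes
    return (offset // line) % n_sets


def same_set(leading_dim, i, j):
    """True when X(i,j) and X(i,j+1) map to the same cache set."""
    return set_index(i, j, leading_dim) == set_index(i, j + 1, leading_dim)


points = [(i, j) for i in range(1, 101) for j in range(1, 11)]
conflicts_512 = sum(same_set(512, i, j) for i, j in points)
conflicts_524 = sum(same_set(524, i, j) for i, j in points)
print(conflicts_512, conflicts_524)
```

Whether two references in the same set actually evict each other also depends on the associativity and on the other arrays in the loop, which this sketch ignores.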
26 4.2 Cache miss trace after array padding: most spatial locality is exploited and the conflict misses are resolved
On an Intel 550 MHz Pentium III (single CPU), the speedup measured with VTune is >50
[Figure label: 17.2]
28 Conclusion
- An existing optimizing compiler, FPT, was extended with an extensible XML interface.
- The performance factors, in particular loop parallelism and data locality, were exported from FPT.
- These factors were visualized through:
  - Loop dependence visualizer ISV
  - Execution trace visualizer CacheVis
  - Reuse distance visualizer ReuseVis
- The programmer can use the visualized feedback to improve the performance.
29 The End.
30 Program semantics (software) vs. architecture capabilities (hardware)
Research area | Program | Architecture
Parallel computing | Parallelism at task, loop and instruction levels through data dependence analysis | Multi-processors (MIMD), pipeline (SIMD), multi-threads, network of workstations (NOW, Grid computing)
Memory hierarchy | Temporal and spatial data locality, data layout, stack reuse distances | Cache at level 1, 2, 3, TLB, set associativity, data replacement policy
Visualize them!
31 2. Major performance factors
- Parallelism
  - Loop dependences
  - Loop-level parallelism
  - Instruction-level parallelism
  - Partition load balance
- Data locality
  - Temporal locality
  - Spatial locality
  - CCC (Compulsory, Capacity, Conflict) cache misses
  - Reuse distances
32 3.6 Cache parameters
- To tune for different architectural cache configurations, we represent the cache parameters (cache size, cache line size and set associativity) in an XML configuration file. For example, a 2-level cache is specified as follows:
    <cache:hierarchy>
      <parameters level="1">
        <size>1024</size>
        <line>32</line>
        <associativity>32</associativity>
      </parameters>
      <parameters level="2">
        <size>65536</size>
        <line>32</line>
        <associativity>1</associativity>
      </parameters>
    </cache:hierarchy>
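A tool that consumes this configuration could load it as follows. This is a sketch under assumptions: the sizes are taken to be in bytes, and the namespace URI, the `CacheLevel` class and the `load_levels` function are hypothetical. The reader looks only at the `parameters` elements, so it does not depend on how the root element is spelled.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical namespace URI for the root element's prefix.
CACHE_NS = "http://example.org/cache"

CONFIG = f"""
<cache:hierarchy xmlns:cache="{CACHE_NS}">
  <parameters level="1">
    <size>1024</size>
    <line>32</line>
    <associativity>32</associativity>
  </parameters>
  <parameters level="2">
    <size>65536</size>
    <line>32</line>
    <associativity>1</associativity>
  </parameters>
</cache:hierarchy>
"""


@dataclass
class CacheLevel:
    level: int
    size: int            # total capacity, assumed to be in bytes
    line: int            # line size, assumed to be in bytes
    associativity: int   # lines per set

    @property
    def sets(self):
        """Number of sets implied by the three parameters."""
        return self.size // (self.line * self.associativity)


def load_levels(xml_text):
    """Read every <parameters> element, ordered by cache level."""
    levels = [
        CacheLevel(
            level=int(p.get("level")),
            size=int(p.findtext("size")),
            line=int(p.findtext("line")),
            associativity=int(p.findtext("associativity")),
        )
        for p in ET.fromstring(xml_text).iter("parameters")
    ]
    return sorted(levels, key=lambda c: c.level)


levels = load_levels(CONFIG)
for c in levels:
    print(c.level, c.sets)
```

Read this way, the level-1 cache in the example has a single set of 32 lines (fully associative) and the level-2 cache is direct-mapped with 2048 sets.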
33 4.2 Visualizing data locality histogram distributed over reuse distances