Performance Visualizations using XML Representations - PowerPoint PPT Presentation

About This Presentation
Slides: 34
Provided by: yij5

Transcript and Presenter's Notes

1
Performance Visualizations using XML
Representations
  • Presented by Kristof Beyls, Yijun Yu, Erik H.
    D'Hollander

2
Overview
  • Background: program optimization research
  • XML representations
  • Visualizations
  • Conclusion

3
Program optimization research
  • What slows down a program's execution? We need to
    pinpoint the performance bottlenecks (by
    analyzing the program).
  • How to improve the performance? By program
    transformations, based on the pinpointed bottlenecks.
  • Who transforms the program?
    • Compiler. Advantage: automatic optimization.
      Disadvantage: it is sometimes hard to understand
      what the program does.
    • Programmer. Advantage: has a good understanding
      of the program's functionality. Disadvantage:
      requires human effort, which raises the question:
      how to present performance bottlenecks best?
  • How to construct a research infrastructure that
    supports all of the above in a common framework?
    (→ XML)

4
Two main performance factors
  • Parallelism: performing computation in parallel
    reduces execution time
  • Data locality: fetching data from fast CPU caches
    reduces execution time

6
Why XML representations?
  • Extensible and versatile
  • Standard and interoperable
  • Language independent

XML namespace (tool)    Representing
1. ast (yaxx)           abstract syntax tree
2. par (oc)             identified parallel or sequential loops
3. trace (isv, cv)      execution trace of memory instructions
4. hotspot (isv, cv)    performance bottleneck locations
5. isdg (isv)           iteration space dependence graph
6. rdv (distv)          a reuse distance vector

Tools:
  • yaxx: YACC extension to XML
  • oc: Omega calculator
  • isv: iteration space visualizer
  • cv: cache (trace) visualizer
  • distv: (cache reuse) distance visualizer
7
1. AST (Abstract Syntax Tree) (ast)
  • XML is a good representation for an AST because of
    its hierarchical nature.
  • The ast namespace captures the syntactic information
    of a program.
  • We can construct the AST from source code through
    YAXX and regenerate source code through XSLT.

Example: the Fortran loop

    DO I=1,10,1
    ENDDO

is represented as

    <ast:DO_Loop>
      <var name="I"/>
      <lb><const value="1"/></lb>
      <ub><const value="10"/></ub>
      <st><const value="1"/></st>
      <body></body>
    </ast:DO_Loop>
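
The regeneration step can be sketched in Python, used here only as a stand-in for the XSLT stylesheet the slide mentions. The namespace URI and the function name `unparse_do_loop` are assumptions made for illustration; the slides give only the `ast` prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the slides only give the "ast" prefix.
AST_NS = "http://example.org/ast"

AST_XML = f"""
<ast:DO_Loop xmlns:ast="{AST_NS}">
  <var name="I"/>
  <lb><const value="1"/></lb>
  <ub><const value="10"/></ub>
  <st><const value="1"/></st>
  <body></body>
</ast:DO_Loop>
"""


def unparse_do_loop(loop: ET.Element) -> str:
    """Regenerate Fortran text for one DO loop element (empty body only)."""
    var = loop.find("var").get("name")
    # Lower bound, upper bound and stride, in that order.
    bounds = [loop.find(f"{tag}/const").get("value") for tag in ("lb", "ub", "st")]
    return f"DO {var}={','.join(bounds)}\nENDDO"


root = ET.fromstring(AST_XML)
fortran = unparse_do_loop(root)
print(fortran)
```

A real unparser would also recurse into the body element; this sketch handles only the empty loop shown on the slide.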
9
2. Parallel loops (par)
  • Identified parallel loops are annotated with a
    <par:true/> element in the par namespace:

      <ast:DO_Loop>
        <par:true/>
      </ast:DO_Loop>

  • In this way, semantic and syntactic information are
    kept in orthogonal namespaces. Syntax-based tools
    (e.g. the unparser) can still ignore the annotation,
    or translate it into directive comments, e.g.
    Fortran CDOALL.
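
A minimal sketch of how a tool might treat the annotation, either translating it into a directive or dropping it. The namespace URIs and both helper names are hypothetical; only the element names and the CDOALL directive come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; the slides only name the prefixes.
NS = {"ast": "http://example.org/ast", "par": "http://example.org/par"}

LOOP_XML = """
<ast:DO_Loop xmlns:ast="http://example.org/ast"
             xmlns:par="http://example.org/par">
  <par:true/>
</ast:DO_Loop>
"""


def is_parallel(loop: ET.Element) -> bool:
    """A loop counts as parallel when it carries a par:true child."""
    return loop.find("par:true", NS) is not None


def strip_annotations(loop: ET.Element) -> None:
    """Drop every child in the par namespace, as a syntax-only tool might."""
    prefix = "{" + NS["par"] + "}"
    for child in list(loop):
        if child.tag.startswith(prefix):
            loop.remove(child)


loop = ET.fromstring(LOOP_XML)
directive = "CDOALL" if is_parallel(loop) else ""   # emit before the loop
strip_annotations(loop)                             # or simply ignore it
```

Because the annotation lives in its own namespace, removing it leaves the ast elements untouched.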

10
XFPT: an extended optimizing compiler
12
3. Traces (trace)
  • A trace records a sequence of memory address
    accesses:

      <trace:seq>
        <access addr="0x00ffe8" bytes="8"/>
        <access addr="0x00fff0" bytes="16"/>
      </trace:seq>

  • A trace alone can be used to identify run-time data
    dependences and, through a cache simulator, to
    identify cache misses.
  • By associating an address with the array reference
    number or the loop iteration index in the program's
    AST, the trace can be used for advanced loop
    dependence analysis and cache reuse distance
    analysis:

      <trace:seq>
        <access addr="0x00ffe8" bytes="8" hotspot:id="1">
          <!-- The 1st reference -->
          <do_loop hotspot:id="1" vector="1 2"/>
          <!-- The 1st DO loop, (I,J)=(1,2) -->
          <array hotspot:id="1" vector="1"/>
          <!-- Reference to array element X(1) -->
        </access>
      </trace:seq>
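
As an illustration of feeding such a trace to a cache simulator, here is a small sketch of a direct-mapped cache. The namespace URI, the cache geometry (1 KiB, 32-byte lines) and the third access are assumptions added for the example, and the simulation ignores the `bytes` attribute.

```python
import xml.etree.ElementTree as ET

TRACE_NS = "http://example.org/trace"  # hypothetical URI

TRACE_XML = f"""
<trace:seq xmlns:trace="{TRACE_NS}">
  <access addr="0x00ffe8" bytes="8"/>
  <access addr="0x00fff0" bytes="16"/>
  <access addr="0x00ffe8" bytes="8"/>
</trace:seq>
"""


def count_misses(addresses, cache_bytes=1024, line_bytes=32):
    """Simulate a direct-mapped cache and return the number of misses."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # memory block held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this address
        index = block % num_lines      # cache line the block maps to
        if tags[index] != block:
            misses += 1
            tags[index] = block
    return misses


seq = ET.fromstring(TRACE_XML)
addresses = [int(a.get("addr"), 16) for a in seq.iter("access")]
misses = count_misses(addresses)
```

All three addresses fall in the same 32-byte block, so only the first access misses.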

13
4. Hotspots (hotspot)
  • Hot spots are the identified bottlenecks of the
    program.
  • Two types are used:
    • Bottleneck loops tell which loops are the
      performance bottlenecks.
    • Bottleneck references tell which references are
      the performance bottlenecks.

Example program:

  1. DIM T(3), X(10)
  2. REAL S, X
  3. DO I = 1, 10
  4. DO J = 1, 10
  5. S = S + X(I)*J
  6. ENDDO
  7. ENDDO

Its hotspot list:

    <hotspot:list>
      <do_loop id="1">
        <index vector="I J"/>
        <start lineno="3" colno="1"/>
        <end lineno="7" colno="12"/>
      </do_loop>
      <array id="2" name="X">
        <dim><lb>1</lb><ub>10</ub></dim>
      </array>
      <reference id="1" type="R">
        <start lineno="5" colno="9"/>
        <end lineno="5" colno="14"/>
      </reference>
    </hotspot:list>
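
A visualizer needs to map each hotspot back to a source region. The sketch below shows one way to read those regions from a hotspot list; the namespace URI and the function name `source_span` are hypothetical.

```python
import xml.etree.ElementTree as ET

HOTSPOT_XML = """
<hotspot:list xmlns:hotspot="http://example.org/hotspot">
  <do_loop id="1">
    <index vector="I J"/>
    <start lineno="3" colno="1"/>
    <end lineno="7" colno="12"/>
  </do_loop>
  <array id="2" name="X">
    <dim><lb>1</lb><ub>10</ub></dim>
  </array>
  <reference id="1" type="R">
    <start lineno="5" colno="9"/>
    <end lineno="5" colno="14"/>
  </reference>
</hotspot:list>
"""


def source_span(elem):
    """Return ((start line, start col), (end line, end col)) of a hotspot."""
    start, end = elem.find("start"), elem.find("end")
    return ((int(start.get("lineno")), int(start.get("colno"))),
            (int(end.get("lineno")), int(end.get("colno"))))


root = ET.fromstring(HOTSPOT_XML)
# Only loops and references carry a source position; arrays do not.
spans = {f"{child.tag} {child.get('id')}": source_span(child)
         for child in root if child.find("start") is not None}
```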

16
Performance Visualizations
  • XML plays an important role in gluing the
    visualizers to an optimizing compiler
  • Loop dependence visualization
  • Reuse distance visualization
  • Cache behavior visualization

17
Visualization 1: ISDG (iteration space dependence
graph)
  • An iteration is an instance of the loop body
    statements. An iteration space is the set of
    integer vector values of the DO loop index
    variables for the traversed iterations.
  • A loop-carried dependence is a dependence caused by
    two references R1 and R2 that access the same
    memory address, where
    • at least one of R1, R2 is a write, and
    • R1 belongs to loop iteration (i1, j1) and R2
      belongs to loop iteration (i2, j2) ≠ (i1, j1).
  • An ISDG is a graph with nodes representing the
    iteration space and edges representing
    loop-carried dependences.
  • Example:

      DO i=1,5
      DO j=1,5
      A(i,j) = A(i,j+1)
      ENDDO
      ENDDO

[Figure: ISDG of the example loop nest; axes i and j, each from 1 to 5]
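
The edges of such a graph can be derived by replaying the loop nest and recording which iterations touch the same array element. This is a sketch under the assumption that the loop body is A(i,j) = A(i,j+1); the function name `isdg_edges` is made up for the example.

```python
from itertools import product


def isdg_edges(n=5):
    """Loop-carried dependences of: DO i=1,n; DO j=1,n; A(i,j) = A(i,j+1)."""
    last_writer = {}   # array element -> iteration that last wrote it
    readers = {}       # array element -> iterations that have read it
    edges = set()      # (source iteration, sink iteration)
    for it in product(range(1, n + 1), repeat=2):   # execution order
        i, j = it
        read, written = (i, j + 1), (i, j)
        # The right-hand side is read before the left-hand side is written.
        if read in last_writer and last_writer[read] != it:
            edges.add((last_writer[read], it))       # flow dependence
        readers.setdefault(read, []).append(it)
        for r in readers.get(written, []):
            if r != it:
                edges.add((r, it))                   # anti dependence
        if written in last_writer and last_writer[written] != it:
            edges.add((last_writer[written], it))    # output dependence
        last_writer[written] = it
    return edges


edges = isdg_edges()
```

Under the stated assumption every edge is an anti dependence from iteration (i, j) to (i, j+1), so rows with different i are independent of each other.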
18
The WTCM CFD application
  • WTCM has a Computational Fluid Dynamics simulator
    which involves solving partial differential
    equations (PDEs) with a Gauss-Seidel solver

[Figure labels: 3D geometry + 1D time; temperature]
19
The visualized dependences
20
The loop transformation
A 3-D unimodular transformation was found after
visualizing the 4-D loop nest, which has 177 array
references at run time in each iteration. Here
we use a regular shape. The transformation makes
it possible to speed up the program by a factor of
around N^2/6, where N is the diameter of the geometry.
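
The slides do not give the transformation matrix itself. As a purely illustrative 2-D stand-in, the sketch below checks that a skewing matrix is unimodular (determinant ±1), that it keeps every dependence distance vector lexicographically positive, and that the inner loop becomes parallel. The matrix `T` and the vectors in `deps` are hypothetical.

```python
def apply(T, d):
    """Multiply the integer matrix T by the dependence distance vector d."""
    return tuple(sum(t * x for t, x in zip(row, d)) for row in T)


def lex_positive(v):
    """True when the first non-zero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False


T = [[1, 1],
     [0, 1]]                       # hypothetical 2-D skewing matrix
det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
deps = [(1, 0), (0, 1)]            # hypothetical distance vectors
new_deps = [apply(T, d) for d in deps]

# Legal: unimodular and every dependence still points forward in time.
legal = abs(det) == 1 and all(lex_positive(d) for d in new_deps)
# Inner loop is parallel when the outer loop carries every dependence.
inner_parallel = all(d[0] > 0 for d in new_deps)
```

After skewing, both dependences are carried by the outer loop, so the iterations of the inner loop can run in parallel.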
21
Visualization 2: Reuse distances
  • Reuse distance is the amount of data accessed
    before a memory address is reused.
  • reuse distance > cache size ⇒ cache miss
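
A minimal sketch of the measurement, assuming that distance is counted as the number of distinct other addresses touched between two accesses to the same address, and that the cache is fully associative with LRU replacement. Under this counting convention such a cache misses as soon as the distance reaches the cache size, which covers the criterion on the slide. Both function names are made up for the example.

```python
def reuse_distances(trace):
    """For each access, count the distinct other addresses touched since the
    previous access to the same address (None for a first access)."""
    last_pos = {}
    distances = []
    for pos, addr in enumerate(trace):
        if addr in last_pos:
            between = set(trace[last_pos[addr] + 1:pos])
            distances.append(len(between))
        else:
            distances.append(None)
        last_pos[addr] = pos
    return distances


def lru_misses(trace, cache_size):
    """Misses of a fully associative LRU cache holding cache_size addresses."""
    return sum(1 for d in reuse_distances(trace)
               if d is None or d >= cache_size)


trace = ["a", "b", "c", "a", "b"]
distances = reuse_distances(trace)
```

With this trace, a cache of two addresses misses on every access, while a cache of three only takes the three first-time misses.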

23
Execution time reduction on an Itanium processor
(SPEC2000 programs).
24
Visualization 3: Cache miss traces
(Tomcatv/SPEC95)
Color legend:
  • White: hit
  • Blue: compulsory miss
  • Green: capacity miss
  • Red: conflict miss
[Figure annotation: 56.7]
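
The four categories can be told apart by simulating the real cache next to a fully associative LRU cache of the same capacity: a miss on a block never seen before is compulsory, a miss that the fully associative cache also suffers is capacity, and the rest are conflict misses. This is a sketch of that textbook classification, not the visualizer's actual code; the class and function names are hypothetical.

```python
from collections import OrderedDict

COLORS = {"hit": "white", "compulsory": "blue",
          "capacity": "green", "conflict": "red"}   # legend from the slide


class LRUSet:
    """One set of an LRU-managed cache holding up to `ways` memory blocks."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()

    def access(self, block):
        """Return True on a hit; update the LRU state either way."""
        hit = block in self.blocks
        if hit:
            self.blocks.move_to_end(block)
        else:
            if len(self.blocks) >= self.ways:
                self.blocks.popitem(last=False)   # evict least recently used
            self.blocks[block] = True
        return hit


def classify(blocks, num_sets, ways):
    """Label each access as hit, compulsory, capacity or conflict."""
    cache = [LRUSet(ways) for _ in range(num_sets)]
    fully_assoc = LRUSet(num_sets * ways)         # same capacity, one set
    seen, labels = set(), []
    for b in blocks:
        hit = cache[b % num_sets].access(b)
        fa_hit = fully_assoc.access(b)
        if hit:
            labels.append("hit")
        elif b not in seen:
            labels.append("compulsory")
        elif not fa_hit:
            labels.append("capacity")
        else:
            labels.append("conflict")
        seen.add(b)
    return labels


labels = classify([0, 2, 0, 1, 1], num_sets=2, ways=1)
```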
25
4.2 Visualizing hotspots of conflict cache misses
X(I,J+1) and X(I,J) conflict if X has dimensions
(512,512). The conflict is resolved by changing
the dimensions to (524,524), a technique known as
array padding.
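
Why padding helps can be checked with a little address arithmetic. The sketch below assumes 8-byte elements, column-major (Fortran) storage starting at a line-aligned address, and a direct-mapped 4 KiB cache with 32-byte lines; none of these parameters are stated on the slide.

```python
def cache_set(i, j, leading_dim, elem_bytes=8, line_bytes=32, num_sets=128):
    """Cache set of Fortran element X(i, j) stored in column-major order,
    for a direct-mapped cache with `num_sets` lines of `line_bytes` bytes."""
    offset = ((j - 1) * leading_dim + (i - 1)) * elem_bytes
    return (offset // line_bytes) % num_sets


def conflicts(leading_dim, i=1, j=1):
    """True when X(i, j) and X(i, j+1) map to the same cache set."""
    return cache_set(i, j, leading_dim) == cache_set(i, j + 1, leading_dim)


before = conflicts(512)   # column stride 4096 bytes, a multiple of the cache size
after = conflicts(524)    # column stride 4192 bytes, neighbours land in other sets
```

With a leading dimension of 512 the two neighbouring columns are exactly one cache size apart and evict each other; with 524 they map to different sets.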
26
4.2 Cache miss trace after array padding: most
spatial locality is exploited and the conflict misses
are resolved
On an Intel 550 MHz Pentium III (single CPU), the
measured speedup with VTune: >50
[Figure annotation: 17.2]
28
Conclusion
  • An existing optimizing compiler, FPT, was extended
    with an extensible XML interface.
  • The performance factors, in particular loop
    parallelism and data locality, were exported from
    FPT.
  • These factors were visualized through:
    • Loop dependence visualizer: ISV
    • Execution trace visualizer: CacheVis
    • Reuse distance visualizer: ReuseVis
  • The programmer can use the visualized feedback to
    improve the performance.

29
The End.
  • Any questions?

30
Program semantics (Software) vs. Architecture
capabilities (Hardware)
Research area: Parallel computing
  • Program: parallelism at task, loop and instruction
    levels through data dependence analysis
  • Architecture: multi-processors (MIMD), pipeline
    (SIMD), multi-threads, network of workstations
    (NOW, Grid computing)
Research area: Memory hierarchy
  • Program: temporal and spatial data locality, data
    layout, stack reuse distances
  • Architecture: caches at levels 1, 2, 3, TLB, set
    associativity, data replacement policy
Visualize them!
31
2. Major Performance factors
  • Parallelism
    • Loop dependences
    • Loop-level parallelism
    • Instruction-level parallelism
    • Partition load balance
  • Data locality
    • Temporal locality
    • Spatial locality
    • CCC (Compulsory, Capacity, Conflict) cache misses
    • Reuse distances

32
3.6 Cache parameters
  • To tune for different architectural cache
    configurations, we store the cache parameters
    (cache size, cache line size and set
    associativity) in an XML configuration file.
    For example, a 2-level cache is specified as
    follows:

      <cachehierarchy>
        <parameters level="1">
          <size>1024</size>
          <line>32</line>
          <associativity>32</associativity>
        </parameters>
        <parameters level="2">
          <size>65536</size>
          <line>32</line>
          <associativity>1</associativity>
        </parameters>
      </cachehierarchy>
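
A sketch of how a simulator might load this configuration. It assumes that the sizes are given in bytes and derives the number of sets as size / (line size × associativity); the function name `load_levels` is hypothetical.

```python
import xml.etree.ElementTree as ET

CONFIG_XML = """
<cachehierarchy>
  <parameters level="1">
    <size>1024</size>
    <line>32</line>
    <associativity>32</associativity>
  </parameters>
  <parameters level="2">
    <size>65536</size>
    <line>32</line>
    <associativity>1</associativity>
  </parameters>
</cachehierarchy>
"""


def load_levels(xml_text):
    """Map cache level -> dict with size, line, associativity and set count."""
    levels = {}
    for params in ET.fromstring(xml_text).iter("parameters"):
        cfg = {tag: int(params.findtext(tag))
               for tag in ("size", "line", "associativity")}
        # Assumes sizes are in bytes, so sets = size / (line * ways).
        cfg["sets"] = cfg["size"] // (cfg["line"] * cfg["associativity"])
        levels[int(params.get("level"))] = cfg
    return levels


levels = load_levels(CONFIG_XML)
```

Under the bytes assumption, level 1 works out to a single set (fully associative) and level 2 to a direct-mapped cache with 2048 sets.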

33
4.2 Visualizing data locality histogram
distributed over reuse distances