ATLAS: Automatically Tuned Linear Algebra Software

1
ATLAS: Automatically Tuned Linear Algebra Software
  • Background
  • A high-level view of the Automated Empirical
    Optimization of Software (AEOS) technique
  • The ATLAS system

2
Background
  • Computers are getting more complex:
  • Deeply pipelined
  • Complicated memory hierarchies
  • Many functional units
  • How can software keep up with the hardware?
  • What do we do if we really want our software to
    take full advantage of the hardware
    (achieving near-peak performance)?
  • In many cases, performance is no longer critical.
  • In other cases, it still is.
  • One such area is the kernel linear algebra
    routines, which are used in many HPC applications.

3
Basic linear algebra subprograms (BLAS)
  • Identified by the community as important; a
    standard API has been defined.
  • Used in many high-performance computing
    applications.
  • Highly optimizable
  • A highly tuned implementation can be 10-100 times
    faster than a naïve implementation on a typical
    processor today.

4
How to achieve near peak performance?
  • Historical approach
  • The compiler approach
  • Compilers promise performance close to
    hand-crafted assembly.
  • Many compiler optimizations have been developed to
    improve performance:
  • loop optimizations, software pipelining,
    instruction scheduling, instruction selection.
  • In theory, the compiler can do its job and
    generate efficient code.
  • Practical issue: efficient code depends on MANY
    architectural parameters:
  • the number of caches, the cache sizes, the
    availability of special instructions, the number
    of functional units, the depth of the functional
    units' pipelines, etc.
  • The impact of these parameters and their
    interactions is NOT always well understood; the
    compiler may not be able to make the best choice
    even if it knows all the parameters.
  • Architectures evolve faster than compilers.

5
How to achieve near peak performance?
  • Historical approach
  • Hand-craft highly optimized BLAS libraries.
  • This is still a current approach: Intel has its
    own BLAS library.
  • Doable, but with a lot of effort:
  • needs real experts, and
  • the rate of processor evolution is difficult for
    programmers to keep up with.

6
Challenges to achieve near peak performance
  • Need to know the architecture's features and use
    architecture-specific optimizations.
  • The impact of architecture features is usually
    unclear:
  • accurate analytical modeling of systems is very
    difficult today.
  • This is the major problem for the compiler
    approach.
  • Architectures evolve quickly.
  • This is the major problem for the hand-crafted
    approach.

7
AEOS approach
  • Automated Empirical Optimization of Software
    (AEOS): a new paradigm
  • Adaptive software:
  • the software package supports many implementations
    of each routine.
  • In theory, it can include all architecture-specific
    optimizations for current-generation processors.
  • It will likely contain some optimizations that
    remain important even for future generations of
    hardware.
  • Automatic selection of the best-performing scheme:
  • empirical measurement results drive the selection.

8
AEOS's answer to the challenges
  • Need to know the architecture's features and use
    architecture-specific optimizations.
  • Architectures evolve quickly, but most of the time
    it is still evolution, not revolution:
  • parameters change rather than entirely new
    features appearing.
  • The set of important features is small and
    enumerable, so a manageable number of
    implementations can cover them all.

9
AEOS's answer to the challenges
  • The impact of architecture features is usually
    unclear (accurate analytical modeling of systems
    is very difficult today).
  • This is the major problem for the compiler, but
    the empirical approach largely sidesteps it:
    measure instead of model.

10
AEOS requirement
  • Isolation of performance-critical routines
  • Different implementations for such routines that
    are optimized for different architecture
    features.
  • A method of adapting software to different
    environments.
  • Robust, context-sensitive timing mechanisms
  • Appropriate search heuristics
  • The importance of this depends on the number of
    candidate implementations.

11
AEOS software adaptation methods
  • Need a way to provide different implementations:
  • Parameterized adaptation: one copy of the code
    with tunable parameters (see the sketch after this
    list).
  • Source-code adaptation:
  • multiple hand-written implementations.
  • Code generation: use tools to automatically
    generate implementations.
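
A minimal sketch of parameterized adaptation, assuming a simple row-major
square matrix multiply; NB and the function name are illustrative, not
ATLAS code:

    /* mm_blocked.c -- one copy of the code; the blocking factor NB is a
       compile-time parameter the empirical search re-tunes per machine,
       e.g. by rebuilding with -DNB=56. */
    #ifndef NB
    #define NB 40
    #endif

    /* C += A * B for N x N row-major matrices, N assumed a multiple of NB. */
    void mm_blocked(int N, const double *A, const double *B, double *C)
    {
        for (int i0 = 0; i0 < N; i0 += NB)
            for (int j0 = 0; j0 < N; j0 += NB)
                for (int k0 = 0; k0 < N; k0 += NB)
                    for (int i = i0; i < i0 + NB; i++)
                        for (int k = k0; k < k0 + NB; k++)
                            for (int j = j0; j < j0 + NB; j++)
                                C[i*N + j] += A[i*N + k] * B[k*N + j];
    }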

12
AEOS timer requirements
  • Timers are central to any empirical approach.
  • The fundamental issue is making sure that the
    timing results reflect the actual performance:
  • cold cache vs. hot cache,
  • heavily loaded vs. unloaded systems,
  • wall time vs. CPU time.
  • Timers are even harder to design for distributed
    systems. (A sketch of a simple harness follows.)
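
A sketch of such a timing harness, assuming a POSIX clock_gettime and a
flush buffer larger than every cache level; this is illustrative, not
ATLAS's actual timer:

    #include <time.h>

    #define FLUSH_BYTES (8 * 1024 * 1024)    /* assumed bigger than any cache */

    static void flush_cache(void)
    {
        static volatile char junk[FLUSH_BYTES];
        for (int i = 0; i < FLUSH_BYTES; i++)
            junk[i]++;                       /* touch every byte to evict old data */
    }

    /* Wall-clock seconds for one cold-cache invocation of kernel().
       Skip flush_cache() to measure the hot-cache case instead. */
    double time_once(void (*kernel)(void))
    {
        struct timespec t0, t1;
        flush_cache();
        clock_gettime(CLOCK_MONOTONIC, &t0); /* wall time, not CPU time */
        kernel();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }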

13
ATLAS
  • Automatically Tuned Linear Algebra Software
  • Mainly done by Clint Whaley at the Univ. of
    Tennessee.
  • Clint got his PhD at FSU and is now at UTSA.
  • ATLAS is one of the most successful applications
    of the AEOS technique.
  • Used by many systems such as MATLAB, Maple, and
    Mathematica.
  • It optimizes the BLAS.

14
More on BLAS
  • Level 1 BLAS: vector-vector operations
  • Level 2 BLAS: matrix-vector operations
  • Level 3 BLAS: matrix-matrix operations
  • The paper focuses on the Level 3 BLAS (matrix
    multiply), which offers the most optimization
    opportunities. (A representative Level 3 call is
    shown below.)
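
For concreteness, a call to the Level 3 routine GEMM through the standard
C interface (cblas), which ATLAS also provides; the wrapper function here
is only an illustration:

    /* Computes C = alpha*A*B + beta*C with alpha = 1, beta = 0. */
    #include <cblas.h>

    void gemm_example(int M, int N, int K,
                      const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0, A, K,      /* A is M x K, lda = K */
                         B, N,      /* B is K x N, ldb = N */
                    0.0, C, N);     /* C is M x N, ldc = N */
    }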

15
Level 3 BLAS support
  • Level 3 BLAS routines are all built on top of
    matrix multiply:
  • for (i = 0; i < N; i++)
  •   for (j = 0; j < N; j++)
  •     for (k = 0; k < N; k++)
  •       C(i,j) = C(i,j) + A(i,k) * B(k,j)
16
ATLAS MM implementation
  • Two levels:
  • L1 cache-contained MM (on-chip MM)
  • Its sizes and shapes are known.
  • General MM, built on top of the L1 cache-contained
    MM.

17
Optimizations in General MM
  • May or may not copy A and B into block-major
    format: all elements of each block are contiguous
    in memory.
  • The block size N_B is fixed, chosen according to
    the L1 cache size.
  • This eliminates TLB problems, minimizes cache
    thrashing, and maximizes cache-line reuse.
  • Within a block, A is stored in transposed form and
    B in non-transposed form. (A copy routine is
    sketched below.)
  • When to copy: the effectiveness depends on the
    shapes and sizes of the arrays.
  • Measured and compared.
  • The paper only discusses the copied case.
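
A minimal sketch of the copy step for A, assuming row-major storage with
leading dimension lda; the routine name and layout details are
illustrative, not ATLAS's generated copy code:

    /* Copy the NB x NB block of A starting at (i0, k0) into a contiguous
       buffer Ablk, transposed within the block, so the on-chip kernel can
       read it with stride 1: Ablk[k*NB + i] = A(i0 + i, k0 + k). */
    void copy_block_A(int NB, const double *A, int lda,
                      int i0, int k0, double *Ablk)
    {
        for (int k = 0; k < NB; k++)
            for (int i = 0; i < NB; i++)
                Ablk[k*NB + i] = A[(i0 + i)*lda + (k0 + k)];
    }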

18
Optimizations in General MM
  • May or may not use a copy of C:
  • the result of the L1-contained MM may be written
    to C directly or to a contiguous buffer C'.
  • With C':
  • the address alignment can be controlled (reducing
    cache interference), and
  • the data is contiguous, eliminating cache
    thrashing;
  • but the number of writes for C doubles.
  • A heuristic decides whether to use C'. (The
    write-back step is sketched below.)
19
Optimizations in General MM
  • Two loop orderings:
  • either the I or the J loop can be the outermost
    loop.
  • Choose the loop structure that best fits the L2
    cache.
  • Blocking for higher-level caches:
  • needed only when the panels of A and B overflow a
    particular level of cache.
  • The memory footprint for computing one N_B x N_B
    section of C is 2*K*N_B + N_B^2 elements (a worked
    example follows).
  • The number of variables is small enough that the
    general MM does not require the routine generator.
20
L1 Cache-contained MM
  • Obtained using a code generator after the L1 cache
    size is determined.
  • Once the cache size is known, the shape and size
    of the MM can be decided.
  • Register blocking for A, B, and C can be decided.
  • Reducing loop overhead:
  • all loops could be completely unrolled in theory;
    in ATLAS, only the K loop is ever completely
    unrolled (see the sketch after this list).
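
A hand-written illustration (not generator output) of what complete
K-loop unrolling looks like for a tiny fixed N_B = 4, with A stored
transposed within the block as described earlier:

    /* One element of the 4x4 product with the K loop fully unrolled.
       At[k*4 + i] = A(i,k); B is non-transposed block-major. */
    static inline double dot4(const double *At, const double *B, int i, int j)
    {
        return At[0*4 + i] * B[0*4 + j]
             + At[1*4 + i] * B[1*4 + j]
             + At[2*4 + i] * B[2*4 + j]
             + At[3*4 + i] * B[3*4 + j];
    }
    /* The kernel would then do: C[i*4 + j] += dot4(At, B, i, j); */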

21
L1 Cache-contained MM
  • Floating-point instruction ordering
  • C_x = C_x + A_y * B_z
  • Some architectures have a fused multiply-add unit:
    this form is ideal.
  • Some architectures have a pipelined floating-point
    unit with separate multiply and add: this form is
    the worst, since each add stalls on the multiply
    that feeds it.
  • Rearrange the order of instructions: separate the
    multiplies from the dependent adds, with unrelated
    instructions in between to hide the latency (see
    the sketch after this list).
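
A minimal sketch of the reordering idea on a dot product; the split into
two routines is illustrative only:

    /* Dependent form: each add must wait for the multiply that feeds it. */
    double dot_dependent(int K, const double *a, const double *b)
    {
        double c = 0.0;
        for (int k = 0; k < K; k++)
            c += a[k] * b[k];
        return c;
    }

    /* Pipelined form: the multiply for iteration k+1 is issued before the
       add that consumes product k, hiding the multiply latency on a
       non-fused, pipelined FP unit.  Assumes K >= 1. */
    double dot_pipelined(int K, const double *a, const double *b)
    {
        double c = 0.0, m = a[0] * b[0];
        for (int k = 1; k < K; k++) {
            double m_next = a[k] * b[k];   /* independent multiply        */
            c += m;                        /* add of the previous product */
            m = m_next;
        }
        return c + m;                      /* drain the last product      */
    }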

22
L1 Cache-contained MM
  • Exposing parallelism
  • C_x = C_x + A_y * B_z
  • Some architectures have multiple floating-point
    functional units; all of them must be kept busy to
    maximize performance.
  • Feed the processor more independent instruction
    streams: unroll the M and/or N loops (see the
    sketch after this list).
  • Finding the correct number of outstanding cache
    misses:
  • modern processors have miss buffers that allow
    several outstanding cache misses before the
    program stalls; the code should issue either the
    maximum number of misses per cycle or none at all.
  • ATLAS controls both effects only through M/N loop
    unrolling.
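
A sketch of 2x2 M/N unrolling: the four accumulators c00..c11 form four
independent dependence chains that can be interleaved across multiple FP
units; this is illustrative, not generated ATLAS code:

    /* C_blk += A_blk * B_blk for one NB x NB block; At holds A_blk stored
       transposed (At[k*NB + i] = A(i,k)), B and C are row-major with
       stride NB.  NB is assumed even. */
    void kernel_2x2(int NB, const double *At, const double *B, double *C)
    {
        for (int i = 0; i < NB; i += 2)
            for (int j = 0; j < NB; j += 2) {
                double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
                for (int k = 0; k < NB; k++) {
                    double a0 = At[k*NB + i], a1 = At[k*NB + i + 1];
                    double b0 = B[k*NB + j],  b1 = B[k*NB + j + 1];
                    c00 += a0 * b0;  c01 += a0 * b1;   /* four independent */
                    c10 += a1 * b0;  c11 += a1 * b1;   /* accumulations    */
                }
                C[i*NB + j]         += c00;  C[i*NB + j + 1]       += c01;
                C[(i + 1)*NB + j]   += c10;  C[(i + 1)*NB + j + 1] += c11;
            }
    }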

23
L1 Cache-contained MM code generator parameters
  • The on-chip MM is generated automatically: there
    are too many options to write by hand.
  • A and/or B in standard or transposed form
  • Loop unrolling factors (for M, N, and K)
  • Choice of floating-point instructions
  • Choice of generation-time constants or run-time
    variables for the loop dimensions
  • Choice of generation-time constants or run-time
    variables for array indexing

24
Putting it all together
  • Probing the system
  • L1 cache size: perform a fixed number of memory
    references while shrinking the range of addresses
    touched; a significant jump in the timings marks
    the L1 cache size (see the sketch after this
    list).
  • Floating-point units: detect a fused multiply-add
    unit versus independent multiply and add pipelines
    (and the pipeline depth) with simple
    register-to-register code.
  • Number of floating-point registers:
  • run code that requires more and more registers
    until the performance drops significantly.
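
A rough sketch of the L1 probe; the constants, stride, and output format
are assumptions, and a real probe must repeat and average measurements:

    #include <stdio.h>
    #include <time.h>

    #define NREFS (1 << 24)                  /* fixed number of references */

    static double probe(volatile double *buf, int nelem)
    {
        struct timespec t0, t1;
        double s = 0.0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0, i = 0; r < NREFS; r++) {
            s += buf[i];
            i += 8;                          /* one 64-byte line per access */
            if (i >= nelem) i = 0;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (s < 0) printf("impossible\n");   /* keep the sum live */
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void)
    {
        static double buf[1 << 20];          /* 8 MB working buffer */
        /* Same number of references over ever-smaller working sets; the
           size where the time drops sharply is taken as the L1 size. */
        for (int kb = 4096; kb >= 4; kb /= 2)
            printf("%5d KB : %.3f s\n", kb, probe(buf, kb * 1024 / 8));
        return 0;
    }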

25
Putting it all together
  • Generating the on-chip MM
  • With the probe results, the search space for the
    generator is greatly reduced:
  • the number of registers determines the feasible
    register blocking, i.e., the M/N unrolling
    factors;
  • the cache size determines the K blocking factor;
  • the type of floating-point unit further reduces
    the search space.
  • ATLAS searches through all feasible M/N unrolling
    factors (given the number of registers), and after
    that all blocking factors and K-loop unrolling
    factors (a schematic of the search follows).
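
A schematic of that two-phase search; measure() is a hypothetical
stand-in for "generate the kernel with these parameters, compile it,
time it, and return MFLOPS", and the parameter ranges are illustrative:

    typedef double (*measure_fn)(int nb, int mu, int nu, int ku);

    /* Phase 1: pick the M/N unrolling (register tile) at a fixed blocking
       factor.  Phase 2: with that tile, pick the blocking factor and the
       K-loop unrolling factor. */
    void search(measure_fn measure, int nb_max, int max_unroll,
                int *best_nb, int *best_mu, int *best_nu, int *best_ku)
    {
        double best = -1.0;
        *best_nb = nb_max; *best_mu = *best_nu = *best_ku = 1;

        for (int mu = 1; mu <= max_unroll; mu++)
            for (int nu = 1; nu <= max_unroll; nu++) {
                double mf = measure(nb_max, mu, nu, 1);
                if (mf > best) { best = mf; *best_mu = mu; *best_nu = nu; }
            }

        best = -1.0;
        for (int nb = 16; nb <= nb_max; nb += 4)
            for (int ku = 1; ku <= nb; ku *= 2) {
                double mf = measure(nb, *best_mu, *best_nu, ku);
                if (mf > best) { best = mf; *best_nb = nb; *best_ku = ku; }
            }
    }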