Title: ATLAS: Automatically Tuned Linear Algebra Software
Slide 1: ATLAS: Automatically Tuned Linear Algebra Software
- Background
- The high-level view of the Automated Empirical Optimization of Software (AEOS) technique
- ATLAS systems
Slide 2: Background
- Computers are getting more complex
- Deeply pipelined
- Complicated memory hierarchies
- Many functional units
- How can the software keep up with the hardware?
- What should we do if we really want our software to take full advantage of the hardware's capacity (achieving near-peak performance)?
- In many cases, performance is no longer important.
- In other cases, performance is still important.
- One area is the kernel linear algebra routines,
which are used in many HPC applications.
Slide 3: Basic Linear Algebra Subprograms (BLAS)
- Identified by the community as important; a standard API has been defined.
- Used in many high-performance computing applications.
- Highly optimizable: a highly tuned implementation can be 10-100 times faster than a naïve implementation on a typical processor today.
Slide 4: How to achieve near-peak performance?
- Historical approach: the compiler approach.
- Compilers promise performance close to hand-crafted assembly.
- Many compiler optimizations have been developed to improve performance: loop optimizations, software pipelining, instruction scheduling, instruction selection.
- In theory, the compiler can do its job and generate efficient code.
- Practical issue: efficient code depends on MANY architectural parameters: the number of caches, the cache sizes, the availability of special instructions, the number of functional units, the depth of the pipeline in each functional unit, etc.
- The impact of these parameters and their interactions is NOT always well understood; the compiler may not be able to make the best choice even if it knows all the parameters.
- Architectures evolve faster than compilers.
Slide 5: How to achieve near-peak performance?
- Historical approach
- Hand craft highly optimized BLAS libraries.
- This is still the current approach Intel has its
own BLAS library. - Doable with a lot of efforts.
- Need real experts
- The rate for processor evolution is difficult for
programmers to keep up.
Slide 6: Challenges in achieving near-peak performance
- Need to know the architectural features and use architecture-specific optimizations.
- The impacts of architectural features are usually unclear: accurate analytical modeling of systems is very difficult today.
- This is the major problem for the compiler approach.
- Architectures evolve quickly.
- This is the major problem for the hand-crafted approach.
Slide 7: The AEOS approach
- Automated Empirical Optimization of Software (AEOS): a new paradigm.
- Adaptive software: the software package supports many implementations of each routine.
- In theory, it can include all architecture-specific optimizations for current-generation processors.
- It will likely include some optimizations that remain important even for future-generation hardware.
- Automatic selection of the best-performing scheme: use empirical measurement results to make the selection.
Slide 8: AEOS's answer to the challenges
- Need to know the architectural features and use architecture-specific optimizations.
- Architectures evolve quickly, but most of the time it is still evolution, not revolution: parameters change rather than new features appearing.
- The set of important features is countable, which allows a countable number of implementations to optimize for all of them.
Slide 9: AEOS's answer to the challenges
- The impacts of architectural features are usually unclear (accurate analytical modeling of systems is very difficult today).
- This is the major problem for the compiler approach, but the empirical approach somewhat counters it.
Slide 10: AEOS requirements
- Isolation of performance-critical routines.
- Different implementations of such routines, optimized for different architectural features.
- A method of adapting the software to different environments.
- Robust, context-sensitive timing mechanisms.
- Appropriate search heuristics; the importance of this depends on the number of different implementations.
Slide 11: AEOS software adaptation methods
- Needs a way to provide different implementations:
- Parameterized adaptation: one copy of the code with tunable parameters (see the sketch after this list).
- Source code adaptation:
- Multiple implementations
- Code generation: use tools to automatically generate implementations.
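A minimal sketch of what parameterized adaptation might look like, assuming a square row-major matrix multiply; the macro NB and the function name are illustrative, not ATLAS's actual identifiers. The install-time search would rebuild this single source file with different -DNB=... values and keep the fastest:

    /* One copy of the code; NB is the tunable parameter.
     * (Illustrative sketch, not ATLAS's actual routine.) */
    #ifndef NB
    #define NB 40                /* block size chosen empirically at install time */
    #endif

    /* Blocked C += A*B for row-major n x n matrices. */
    void mm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int i0 = 0; i0 < n; i0 += NB)
            for (int j0 = 0; j0 < n; j0 += NB)
                for (int k0 = 0; k0 < n; k0 += NB)
                    /* multiply one NB x NB block */
                    for (int i = i0; i < i0 + NB && i < n; i++)
                        for (int j = j0; j < j0 + NB && j < n; j++)
                            for (int k = k0; k < k0 + NB && k < n; k++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }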
Slide 12: AEOS timer requirements
- Timers are central to any empirical approach.
- The fundamental issue is making sure that the timing results reflect the actual performance (one possible approach is sketched below):
- Cold cache vs. hot cache
- Heavily loaded vs. unloaded systems
- Wall time vs. CPU time
- Timers are even harder to design for distributed systems.
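A minimal sketch of a timer that addresses these issues, assuming POSIX clock_gettime; the flush-buffer size, best-of-N policy, and names are illustrative choices, not AEOS requirements:

    #include <stdlib.h>
    #include <time.h>

    /* Wall-clock seconds; CPU time would hide time lost to other processes. */
    static double wall_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Time a kernel with a cold cache: sweep a large buffer between trials
     * to evict the kernel's data, and keep the best of N trials so that
     * load from other processes only inflates discarded measurements. */
    double time_cold(void (*kernel)(void), int trials)
    {
        size_t flush_size = 8u << 20;          /* larger than any cache level */
        volatile char *flush = malloc(flush_size);
        double best = 1e30;
        for (int t = 0; t < trials; t++) {
            for (size_t i = 0; i < flush_size; i++)
                flush[i] = (char)i;            /* cold cache for the kernel */
            double t0 = wall_seconds();
            kernel();
            double t1 = wall_seconds();
            if (t1 - t0 < best)
                best = t1 - t0;
        }
        free((void *)flush);
        return best;
    }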
Slide 13: ATLAS
- Automatically Tuned Linear Algebra Software.
- Mainly done by Clint Whaley at the Univ. of Tennessee.
- Clint got his PhD at FSU and is now at UTSA.
- ATLAS is one of the most successful applications of the AEOS technique.
- Used in many systems, such as MATLAB, Maple, and Mathematica.
- Optimizes BLAS.
Slide 14: More on BLAS
- Level 1 BLAS: vector-vector operations
- Level 2 BLAS: matrix-vector operations
- Level 3 BLAS: matrix-matrix operations
- This paper focuses on the Level 3 BLAS (matrix multiply), which has the most optimization opportunities.
Slide 15: Level 3 BLAS support
- Level 3 BLAS routines are all based on matrix multiply:

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C(i, j) = C(i, j) + A(i, k) * B(k, j);
Slide 16: ATLAS MM implementation
- Two levels:
- L1 cache-contained MM (on-chip MM): sizes and shapes are known.
- General MM, built on top of the L1 cache-contained MM.
Slide 17: Optimizations in General MM
- May or may not copy A and B into block-major format: all elements of each block in contiguous memory.
- The block size is a fixed N_B, which depends on the L1 cache size.
- This eliminates TLB problems, minimizes cache thrashing, and maximizes cache-line reuse.
- Within a block, A is stored in transposed form and B in non-transposed form.
- When to copy: the effectiveness depends on the shapes and sizes of the arrays, so measure and compare.
- The paper only discusses the copied case (a sketch of the block-major copy follows).
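A minimal sketch of the block-major copy for A, assuming N is a multiple of NB and A is row-major; the function name is illustrative, not ATLAS's actual copy routine:

    /* Copy A (row-major, N x N) into block-major form: each NB x NB block
     * is contiguous and stored transposed, as the slide describes for A. */
    void copy_A_blockmajor(int N, int NB, const double *A, double *Abuf)
    {
        double *dst = Abuf;
        for (int i0 = 0; i0 < N; i0 += NB)          /* block row of A */
            for (int k0 = 0; k0 < N; k0 += NB)      /* block column of A */
                /* transpose within the block so the kernel reads A
                 * with unit stride */
                for (int k = k0; k < k0 + NB; k++)
                    for (int i = i0; i < i0 + NB; i++)
                        *dst++ = A[i*N + k];
    }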
Slide 18: Optimizations in General MM
- May or may not use a copy for C:
- The result of the L1 cache-contained MM may be written directly to C or to a temporary buffer.
- With the temporary buffer:
- The address alignment can be controlled (reducing cache interference).
- The data is contiguous, eliminating cache thrashing.
- But the number of writes for C doubles.
- A heuristic decides whether to use the buffer.
Slide 19: Optimizations in General MM
- Two loop orderings: either the I or the J loop can be the outer loop.
- Choose the looping structure that fits the L2 cache.
- Blocking for higher-level caches is needed only when the panels of A and B overflow a particular level of cache.
- The memory footprint for computing one N_B x N_B section of C is 2*K*N_B + N_B^2 (the two K x N_B panels of A and B plus the N_B x N_B block of C); for example, with N_B = 40 and K = 1000, that is 81,600 elements.
- The number of variables is small enough that the general MM does not require the use of a routine generator.
Slide 20: L1 cache-contained MM
- Obtained using a code generator after the L1 cache size is determined.
- Once the cache size is determined, the shape and size of the MM can be decided.
- Register blocking for A, B, and C can then be decided.
- Reduce the loop overhead: in theory, all iterations could be completely unrolled; in ATLAS, only the K loop is completely unrolled (sketched below).
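A minimal sketch of complete K-loop unrolling when K is a generation-time constant (here K = 4, purely illustrative); the generator emits straight-line code with no loop overhead:

    /* Generated kernel for fixed K = 4: the K loop has disappeared. */
    void kdot_k4(const double *a, const double *b, double *c)
    {
        *c += a[0] * b[0];
        *c += a[1] * b[1];
        *c += a[2] * b[2];
        *c += a[3] * b[3];
    }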
Slide 21: L1 cache-contained MM
- Floating-point instruction ordering:
- c_x += a_y * b_z
- Some architectures have a fused multiply-add unit; for these, this form is ideal.
- Some architectures have a pipelined (non-fused) floating-point unit; for these, this form is the worst, since each add waits on its own multiply.
- Rearrange the order of instructions: lots of multiplies followed by lots of adds, or with other unrelated instructions in between (both forms are contrasted below).
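A minimal sketch contrasting the two orderings for c += a*b; the function names are illustrative, not ATLAS's generated code:

    /* Fused form: ideal when the hardware has a multiply-add unit, since
     * each statement maps to a single FMA instruction. */
    void fma_form(double *c, const double *a, const double *b)
    {
        c[0] += a[0] * b[0];
        c[1] += a[1] * b[1];
    }

    /* Separated form: better for a pipelined, non-fused FP unit.  The
     * independent multiplies are issued first, so by the time each add
     * executes, its multiply result has already left the pipeline. */
    void pipelined_form(double *c, const double *a, const double *b)
    {
        double t0 = a[0] * b[0];
        double t1 = a[1] * b[1];
        c[0] += t0;
        c[1] += t1;
    }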
Slide 22: L1 cache-contained MM
- Exposing parallelism:
- c_x += a_y * b_z
- Some architectures have multiple floating-point functional units; all functional units must be kept busy to maximize performance.
- Feed the processor more independent instruction streams: unroll the M and/or N loops (see the sketch after this list).
- Finding the correct number of cache misses:
- Modern processors have cache-miss buffers that allow multiple outstanding cache misses before the program blocks. Each cycle should issue either the maximum number of cache misses or none.
- Only M/N loop unrolling is used in ATLAS.
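A minimal sketch of M/N unrolling by a factor of 2 each (2x2 register blocking), assuming M and N are even, A is a K x M transposed block, B is K x N, and C is M x N row-major; the names and factors are illustrative:

    /* Four independent accumulators give the scheduler independent
     * instruction streams for multiple FP units. */
    void mm_2x2(int M, int N, int K,
                const double *A, const double *B, double *C)
    {
        for (int i = 0; i < M; i += 2)
            for (int j = 0; j < N; j += 2) {
                double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
                for (int k = 0; k < K; k++) {
                    double a0 = A[k*M + i], a1 = A[k*M + i + 1];
                    double b0 = B[k*N + j], b1 = B[k*N + j + 1];
                    c00 += a0 * b0;  c01 += a0 * b1;
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[i*N + j]         += c00;  C[i*N + j + 1]       += c01;
                C[(i + 1)*N + j]   += c10;  C[(i + 1)*N + j + 1] += c11;
            }
    }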
Slide 23: L1 cache-contained MM code generator parameters
- The on-chip MM is generated automatically: there are too many options to write by hand.
- A and/or B in standard or transposed form.
- Loop unrolling factors (for M, N, and K).
- Choice of floating-point instructions.
- Choice of generation-time constants or run-time variables for the loop dimensions.
- Choice of generation-time constants or run-time variables for array indexing.
Slide 24: Putting it all together
- Probing the system:
- L1 cache size: perform a fixed number of memory references while shrinking the range of memory addresses; a significant gap in the timings marks the L1 cache size (a sketch follows this list).
- Floating-point units: muladd or independent multiply and add pipelines (and the pipeline depth), detected with simple register-to-register code.
- Number of floating-point registers: run code that requires more and more registers until the performance drops significantly.
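A minimal sketch of the L1 cache-size probe: the reference count is fixed while the working set shrinks, and the size just below the first big timing jump approximates L1. The constants are illustrative, and on modern hardware prefetching can blur the gap:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const long refs = 1L << 24;      /* fixed number of memory references */
        const size_t stride = 64 / sizeof(double); /* one cache line apart */
        for (size_t bytes = 4u << 20; bytes >= 4u << 10; bytes /= 2) {
            size_t n = bytes / sizeof(double), idx = 0;
            double *buf = calloc(n, sizeof(double));
            double sum = 0;
            clock_t t0 = clock();
            for (long r = 0; r < refs; r++) {
                sum += buf[idx];         /* touch one line per reference */
                idx += stride;
                if (idx >= n) idx -= n;  /* wrap within the working set */
            }
            printf("%7zu KB: %.3f s  (sum=%g)\n", bytes / 1024,
                   (double)(clock() - t0) / CLOCKS_PER_SEC, sum);
            free(buf);
        }
        return 0;
    }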
Slide 25: Putting it all together
- Generate the on-chip MM:
- With the probing results, the algorithm search space is greatly reduced.
- The number of registers decides the feasible register blocking, i.e., the M/N blocking factors.
- The cache size determines the blocking factor for K.
- The type of floating-point unit further reduces the search space.
- ATLAS searches through all M/N unrolling factors possible for the number of registers; after that, all blocking factors and the K-loop unrolling factor (a sketch of this two-phase search follows).
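A minimal sketch of the two-phase search driver under stated assumptions: gen_and_time is a hypothetical stand-in for "emit the kernel with these factors, compile it, and time it", and the register-pressure estimate, bounds, and step sizes are all illustrative, not ATLAS's actual heuristics:

    /* Hypothetical stand-in: generate, compile, and time one kernel,
     * returning its measured MFLOPS. */
    double gen_and_time(int mu, int nu, int nb, int ku);

    void search(int nregs, int l1_doubles)
    {
        double best = 0;
        int bmu = 1, bnu = 1;
        /* Phase 1: all M/N unrolling factors that fit the register file
         * (mu*nu accumulators plus mu + nu operand registers). */
        for (int mu = 1; mu <= nregs; mu++)
            for (int nu = 1; mu*nu + mu + nu <= nregs; nu++) {
                double perf = gen_and_time(mu, nu, 40, 1); /* provisional NB */
                if (perf > best) { best = perf; bmu = mu; bnu = nu; }
            }
        /* Phase 2: blocking factor and K-loop unrolling, mu/nu fixed. */
        int bnb = 16, bku = 1;
        best = 0;
        for (int nb = 16; 2*nb*nb <= l1_doubles; nb += 4)
            for (int ku = 1; ku <= nb; ku *= 2) {
                double perf = gen_and_time(bmu, bnu, nb, ku);
                if (perf > best) { best = perf; bnb = nb; bku = ku; }
            }
        /* (bmu, bnu, bnb, bku) now describe the kernel to install. */
    }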