Title: ATLAS: Automatically Tuned Linear Algebra Software
Slide 1: ATLAS: Automatically Tuned Linear Algebra Software
- Background
- The high-level view of the Automated Empirical Optimization of Software (AEOS) technique
- ATLAS systems
Slide 2: Background
- Computers are getting more complex
- Deeply pipelined
- Complicated memory hierarchies
- Many functional units
- How can the software keep up with the hardware?
- What should we do if we really want our software to take full advantage of the hardware's capacity (achieving near-peak performance)?
- In many cases, performance is no longer important.
- In other cases, performance is still important.
- One area is the kernel linear algebra routines,
which are used in many HPC applications.
Slide 3: Basic Linear Algebra Subprograms (BLAS)
- Identified by the community as important; a standard API has been defined.
- Used in many high-performance computing applications.
- Highly optimizable: a highly tuned implementation can be 10-100 times faster than a naïve implementation on a typical processor today.
Slide 4: How to achieve near-peak performance?
- Historical approach: the compiler approach.
- Compilers promise performance close to hand-crafted assembly.
- Many compiler optimizations have been developed to improve performance: loop optimizations, software pipelining, instruction scheduling, instruction selection.
- In theory, the compiler can do its job and generate efficient code.
- Practical issue: efficient code depends on MANY architectural parameters: the number of caches, the cache sizes, the availability of special instructions, the number of functional units, the depth of the pipeline in each functional unit, etc.
- The impact of these parameters and their interactions is NOT always well understood; the compiler may not be able to make the best choice even if it knows all the parameters.
- Architectures evolve faster than compilers.
Slide 5: How to achieve near-peak performance?
- Historical approach
- Hand craft highly optimized BLAS libraries.
- This is still the current approach Intel has its
own BLAS library. - Doable with a lot of efforts.
- Need real experts
- The rate for processor evolution is difficult for
programmers to keep up.
Slide 6: Challenges in achieving near-peak performance
- Need to know the architectural features and use architecture-specific optimizations.
- The impacts of architectural features are usually unclear: accurate analytical modeling of systems is very difficult today.
- This is the major problem for the compiler approach.
- Architectures evolve quickly.
- This is the major problem for the hand-crafted approach.
Slide 7: The AEOS approach
- Automated Empirical Optimization of Software (AEOS): a new paradigm.
- Adaptive software: the software package supports many implementations of each routine.
- In theory, it can include all architecture-specific optimizations for current-generation processors.
- It will likely include some optimizations that remain important even for future-generation hardware.
- Automatic selection of the best-performing scheme: use empirical measurement results to make the selection.
Slide 8: AEOS's answer to the challenges
- Need to know the architectural features and use architecture-specific optimizations.
- Architectures evolve quickly, but most of the time it is still evolution, not revolution: parameters change rather than new features appearing.
- The set of important features is countable, which allows a countable number of implementations to optimize for all of them.
Slide 9: AEOS's answer to the challenges
- The impacts of architectural features are usually unclear (accurate analytical modeling of systems is very difficult today).
- This is the major problem for the compiler approach, but the empirical approach somewhat counters it.
Slide 10: AEOS requirements
- Isolation of performance-critical routines.
- Different implementations of such routines, optimized for different architectural features.
- A method of adapting the software to different environments.
- Robust, context-sensitive timing mechanisms.
- Appropriate search heuristics; the importance of this depends on the number of different implementations.
Slide 11: AEOS software adaptation methods
- Needs a way to provide different implementations:
- Parameterized adaptation: one copy of the code with tunable parameters (see the sketch after this list).
- Source code adaptation:
- Multiple implementations
- Code generation: use tools to automatically generate implementations.
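A minimal sketch of what parameterized adaptation might look like, assuming a square row-major matrix multiply; the macro NB and the function name are illustrative, not ATLAS's actual identifiers. The install-time search would rebuild this single source file with different -DNB=... values and keep the fastest:

    /* One copy of the code; NB is the tunable parameter.
     * (Illustrative sketch, not ATLAS's actual routine.) */
    #ifndef NB
    #define NB 40                /* block size chosen empirically at install time */
    #endif

    /* Blocked C += A*B for row-major n x n matrices. */
    void mm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int i0 = 0; i0 < n; i0 += NB)
            for (int j0 = 0; j0 < n; j0 += NB)
                for (int k0 = 0; k0 < n; k0 += NB)
                    /* multiply one NB x NB block */
                    for (int i = i0; i < i0 + NB && i < n; i++)
                        for (int j = j0; j < j0 + NB && j < n; j++)
                            for (int k = k0; k < k0 + NB && k < n; k++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }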
Slide 12: AEOS timer requirements
- Timers are central to any empirical approach.
- The fundamental issue is making sure that the timing results reflect the actual performance (one possible approach is sketched below):
- Cold cache vs. hot cache
- Heavily loaded vs. unloaded systems
- Wall time vs. CPU time
- Timers are even harder to design for distributed systems.
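A minimal sketch of a timer that addresses these issues, assuming POSIX clock_gettime; the flush-buffer size, best-of-N policy, and names are illustrative choices, not AEOS requirements:

    #include <stdlib.h>
    #include <time.h>

    /* Wall-clock seconds; CPU time would hide time lost to other processes. */
    static double wall_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Time a kernel with a cold cache: sweep a large buffer between trials
     * to evict the kernel's data, and keep the best of N trials so that
     * load from other processes only inflates discarded measurements. */
    double time_cold(void (*kernel)(void), int trials)
    {
        size_t flush_size = 8u << 20;          /* larger than any cache level */
        volatile char *flush = malloc(flush_size);
        double best = 1e30;
        for (int t = 0; t < trials; t++) {
            for (size_t i = 0; i < flush_size; i++)
                flush[i] = (char)i;            /* cold cache for the kernel */
            double t0 = wall_seconds();
            kernel();
            double t1 = wall_seconds();
            if (t1 - t0 < best)
                best = t1 - t0;
        }
        free((void *)flush);
        return best;
    }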
Slide 13: ATLAS
- Automatically Tuned Linear Algebra Software.
- Mainly done by Clint Whaley at the Univ. of Tennessee.
- Clint got his PhD at FSU and is now at UTSA.
- ATLAS is one of the most successful applications of the AEOS technique.
- Used in many systems, such as MATLAB, Maple, and Mathematica.
- Optimizes BLAS.
Slide 14: More on BLAS
- Level 1 BLAS: vector-vector operations
- Level 2 BLAS: matrix-vector operations
- Level 3 BLAS: matrix-matrix operations
- This paper focuses on the Level 3 BLAS (matrix multiply), which has the most optimization opportunities.
Slide 15: Level 3 BLAS support
- Level 3 BLAS routines are all based on matrix multiply:

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C(i, j) = C(i, j) + A(i, k) * B(k, j);
Slide 16: ATLAS MM implementation
- Two levels:
- L1 cache-contained MM (on-chip MM): sizes and shapes are known.
- General MM, built on top of the L1 cache-contained MM.
Slide 17: Optimizations in General MM
- May or may not copy A and B into block-major format: all elements of each block in contiguous memory.
- The block size is a fixed N_B, which depends on the L1 cache size.
- This eliminates TLB problems, minimizes cache thrashing, and maximizes cache-line reuse.
- Within a block, A is stored in transposed form and B in non-transposed form.
- When to copy: the effectiveness depends on the shapes and sizes of the arrays, so measure and compare.
- The paper only discusses the copied case (a sketch of the block-major copy follows).
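A minimal sketch of the block-major copy for A, assuming N is a multiple of NB and A is row-major; the function name is illustrative, not ATLAS's actual copy routine:

    /* Copy A (row-major, N x N) into block-major form: each NB x NB block
     * is contiguous and stored transposed, as the slide describes for A. */
    void copy_A_blockmajor(int N, int NB, const double *A, double *Abuf)
    {
        double *dst = Abuf;
        for (int i0 = 0; i0 < N; i0 += NB)          /* block row of A */
            for (int k0 = 0; k0 < N; k0 += NB)      /* block column of A */
                /* transpose within the block so the kernel reads A
                 * with unit stride */
                for (int k = k0; k < k0 + NB; k++)
                    for (int i = i0; i < i0 + NB; i++)
                        *dst++ = A[i*N + k];
    }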
Slide 18: Optimizations in General MM
- May or may not use a copy for C:
- The result of the L1 cache-contained MM may be written directly to C or to a temporary buffer.
- With the temporary buffer:
- The address alignment can be controlled (reducing cache interference).
- The data is contiguous, eliminating cache thrashing.
- But the number of writes for C doubles.
- A heuristic decides whether to use the buffer.
Slide 19: Optimizations in General MM
- Two loop orderings: either the I or the J loop can be the outer loop.
- Choose the looping structure that fits the L2 cache.
- Blocking for higher-level caches is needed only when the panels of A and B overflow a particular level of cache.
- The memory footprint for computing one N_B x N_B section of C is 2*K*N_B + N_B^2 (the two K x N_B panels of A and B plus the N_B x N_B block of C); for example, with N_B = 40 and K = 1000, that is 81,600 elements.
- The number of variables is small enough that the general MM does not require the use of a routine generator.
Slide 20: L1 cache-contained MM
- Obtained using a code generator after the L1 cache size is determined.
- Once the cache size is determined, the shape and size of the MM can be decided.
- Register blocking for A, B, and C can then be decided.
- Reduce the loop overhead: in theory, all iterations could be completely unrolled; in ATLAS, only the K loop is completely unrolled (sketched below).
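A minimal sketch of complete K-loop unrolling when K is a generation-time constant (here K = 4, purely illustrative); the generator emits straight-line code with no loop overhead:

    /* Generated kernel for fixed K = 4: the K loop has disappeared. */
    void kdot_k4(const double *a, const double *b, double *c)
    {
        *c += a[0] * b[0];
        *c += a[1] * b[1];
        *c += a[2] * b[2];
        *c += a[3] * b[3];
    }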
Slide 21: L1 cache-contained MM
- Floating-point instruction ordering:
- c_x += a_y * b_z
- Some architectures have a fused multiply-add unit; for these, this form is ideal.
- Some architectures have a pipelined (non-fused) floating-point unit; for these, this form is the worst, since each add waits on its own multiply.
- Rearrange the order of instructions: lots of multiplies followed by lots of adds, or with other unrelated instructions in between (both forms are contrasted below).
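A minimal sketch contrasting the two orderings for c += a*b; the function names are illustrative, not ATLAS's generated code:

    /* Fused form: ideal when the hardware has a multiply-add unit, since
     * each statement maps to a single FMA instruction. */
    void fma_form(double *c, const double *a, const double *b)
    {
        c[0] += a[0] * b[0];
        c[1] += a[1] * b[1];
    }

    /* Separated form: better for a pipelined, non-fused FP unit.  The
     * independent multiplies are issued first, so by the time each add
     * executes, its multiply result has already left the pipeline. */
    void pipelined_form(double *c, const double *a, const double *b)
    {
        double t0 = a[0] * b[0];
        double t1 = a[1] * b[1];
        c[0] += t0;
        c[1] += t1;
    }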
Slide 22: L1 cache-contained MM
- Exposing parallelism:
- c_x += a_y * b_z
- Some architectures have multiple floating-point functional units; all functional units must be kept busy to maximize performance.
- Feed the processor more independent instruction streams: unroll the M and/or N loops (see the sketch after this list).
- Finding the correct number of cache misses:
- Modern processors have cache-miss buffers that allow multiple outstanding cache misses before the program blocks. Each cycle should issue either the maximum number of cache misses or none.
- Only M/N loop unrolling is used in ATLAS.
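A minimal sketch of M/N unrolling by a factor of 2 each (2x2 register blocking), assuming M and N are even, A is a K x M transposed block, B is K x N, and C is M x N row-major; the names and factors are illustrative:

    /* Four independent accumulators give the scheduler independent
     * instruction streams for multiple FP units. */
    void mm_2x2(int M, int N, int K,
                const double *A, const double *B, double *C)
    {
        for (int i = 0; i < M; i += 2)
            for (int j = 0; j < N; j += 2) {
                double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
                for (int k = 0; k < K; k++) {
                    double a0 = A[k*M + i], a1 = A[k*M + i + 1];
                    double b0 = B[k*N + j], b1 = B[k*N + j + 1];
                    c00 += a0 * b0;  c01 += a0 * b1;
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[i*N + j]         += c00;  C[i*N + j + 1]       += c01;
                C[(i + 1)*N + j]   += c10;  C[(i + 1)*N + j + 1] += c11;
            }
    }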
Slide 23: L1 cache-contained MM code generator parameters
- The on-chip MM is generated automatically: there are too many options to write by hand.
- A and/or B in standard or transposed form.
- Loop unrolling factors (for M, N, and K).
- Choice of floating-point instructions.
- Choice of generation-time constants or run-time variables for the loop dimensions.
- Choice of generation-time constants or run-time variables for array indexing.
Slide 24: Putting it all together
- Probing the system:
- L1 cache size: perform a fixed number of memory references while shrinking the range of memory addresses; a significant gap in the timings marks the L1 cache size (a sketch follows this list).
- Floating-point units: muladd or independent multiply and add pipelines (and the pipeline depth), detected with simple register-to-register code.
- Number of floating-point registers: run code that requires more and more registers until the performance drops significantly.
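A minimal sketch of the L1 cache-size probe: the reference count is fixed while the working set shrinks, and the size just below the first big timing jump approximates L1. The constants are illustrative, and on modern hardware prefetching can blur the gap:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const long refs = 1L << 24;      /* fixed number of memory references */
        const size_t stride = 64 / sizeof(double); /* one cache line apart */
        for (size_t bytes = 4u << 20; bytes >= 4u << 10; bytes /= 2) {
            size_t n = bytes / sizeof(double), idx = 0;
            double *buf = calloc(n, sizeof(double));
            double sum = 0;
            clock_t t0 = clock();
            for (long r = 0; r < refs; r++) {
                sum += buf[idx];         /* touch one line per reference */
                idx += stride;
                if (idx >= n) idx -= n;  /* wrap within the working set */
            }
            printf("%7zu KB: %.3f s  (sum=%g)\n", bytes / 1024,
                   (double)(clock() - t0) / CLOCKS_PER_SEC, sum);
            free(buf);
        }
        return 0;
    }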
Slide 25: Putting it all together
- Generate the on-chip MM:
- With the probing results, the algorithm search space is greatly reduced.
- The number of registers decides the feasible register blocking, i.e., the M/N blocking factors.
- The cache size determines the blocking factor for K.
- The type of floating-point unit further reduces the search space.
- ATLAS searches through all M/N unrolling factors possible for the number of registers; after that, all blocking factors and the K-loop unrolling factor (a sketch of this two-phase search follows).
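A minimal sketch of the two-phase search driver under stated assumptions: gen_and_time is a hypothetical stand-in for "emit the kernel with these factors, compile it, and time it", and the register-pressure estimate, bounds, and step sizes are all illustrative, not ATLAS's actual heuristics:

    /* Hypothetical stand-in: generate, compile, and time one kernel,
     * returning its measured MFLOPS. */
    double gen_and_time(int mu, int nu, int nb, int ku);

    void search(int nregs, int l1_doubles)
    {
        double best = 0;
        int bmu = 1, bnu = 1;
        /* Phase 1: all M/N unrolling factors that fit the register file
         * (mu*nu accumulators plus mu + nu operand registers). */
        for (int mu = 1; mu <= nregs; mu++)
            for (int nu = 1; mu*nu + mu + nu <= nregs; nu++) {
                double perf = gen_and_time(mu, nu, 40, 1); /* provisional NB */
                if (perf > best) { best = perf; bmu = mu; bnu = nu; }
            }
        /* Phase 2: blocking factor and K-loop unrolling, mu/nu fixed. */
        int bnb = 16, bku = 1;
        best = 0;
        for (int nb = 16; 2*nb*nb <= l1_doubles; nb += 4)
            for (int ku = 1; ku <= nb; ku *= 2) {
                double perf = gen_and_time(bmu, bnu, nb, ku);
                if (perf > best) { best = perf; bnb = nb; bku = ku; }
            }
        /* (bmu, bnu, bnb, bku) now describe the kernel to install. */
    }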