Title: Optimizing Matrix Multiplication with a Classifier Learning System
Slide 1: Optimizing Matrix Multiplication with a Classifier Learning System
- Xiaoming Li (presenter)
- María Jesús Garzarán
- University of Illinois at Urbana-Champaign
Slide 2: Tuning library for recursive matrix multiplication
- Use cache-aware algorithms that take architectural features into account
  - Memory hierarchy
  - Register file
- Take input characteristics into account
  - Matrix sizes
- The tuning process is automatic.
Slide 3: Recursive Matrix Partitioning
- Previous approaches
  - Multiple recursive steps
  - Only divide by half
[Figures, Slides 3-5: matrices A and B are each divided by half, one recursive step per slide (Step 1, Step 2)]
Slide 6: Recursive Matrix Partitioning
- Our approach is more general
  - No need to divide by half
  - May use a single step to reach the same partition
  - Faster and more general
[Figure: matrices A and B reach the same partition in a single step (Step 1)]
Slide 7: Our approach
- A general framework that describes a family of recursive matrix multiplication algorithms; given the input dimensions of the matrices, we determine
  - The number of partition levels
  - How to partition at each level
- An intelligent search method based on a classifier learning system
  - Searches for the best partitioning strategy in a huge search space
Slide 8: Outline
- Background
- Partition Methods
- Classifier Learning System
- Experimental Results
Slide 9: Recursive layout framework
- Multiple levels of recursion
- Takes into account the cache hierarchy
[Figures, Slides 9-13: the matrix is tiled recursively; after one level the four tiles are numbered in recursive order, and after two levels the sixteen tiles are numbered 1, 2, 5, 6, 3, 4, 7, 8, 9, 10, 13, 14, 11, 12, 15, 16]
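The tile numbering in the figures above follows a recursive (Morton/Z-order-like) layout. A minimal sketch of computing that index for a power-of-two grid, assuming divide-by-half at every level (the deck's framework also supports other partition factors); the function name is mine:

```python
def tile_index(row, col, levels):
    """Index of tile (row, col) in a 2**levels x 2**levels grid laid out
    recursively: at each level, pick the quadrant, then recurse inside it."""
    idx = 0
    for level in reversed(range(levels)):
        # Append the two bits selecting the quadrant at this level.
        idx = (idx << 2) | (((row >> level) & 1) << 1) | ((col >> level) & 1)
    return idx

# Two levels of recursion give the 4 x 4 numbering shown on the slide:
# row 0 -> tiles 1, 2, 5, 6; row 1 -> 3, 4, 7, 8; and so on (1-based).
```

This ordering keeps each quadrant's tiles contiguous in memory, which is what lets the layout mirror the cache hierarchy.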
Slide 14: Padding
- Necessary when the partition factor is not a divisor of the matrix dimension.
[Figures, Slides 14-17: a 2000-wide matrix divided by 3 is padded to 2001 (tile size 667); dividing the 667-wide tiles by 4 pads them to 668, for a total padded dimension of 2004]
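The padded sizes in the figures follow from rounding each dimension up to the next multiple of the partition factor. A small sketch of that computation (the function name is mine, not from the deck):

```python
import math

def pad_dim(n, p):
    """Round dimension n up to the next multiple of partition factor p.
    Returns (padded dimension, tile size along that dimension)."""
    tile = math.ceil(n / p)  # smallest tile that covers n in p pieces
    return tile * p, tile

# Slide example: dividing 2000 by 3 pads to 2001 with tiles of 667;
# dividing those 667-wide tiles by 4 pads them to 668 (167 per sub-tile).
```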
Slide 18: Recursive layout in our framework
- Multiple levels of recursion
- Supports the cache hierarchy
- Square tiles → rectangular tiles
  - Fit non-square matrices
[Figures, Slides 18-21: a non-square 8 x 9 matrix is padded to 8 x 10 and covered with rectangular tiles; the final figure is labeled 4 and 3]
Slide 22: Outline
- Background
- Partition Methods
- Classifier Learning System
- Experimental Results
Slide 23: Two methods to partition matrices
- Partition by Block (PB)
  - Specify the size of each tile
  - Example:
    - Dimensions (M, N, K) = (100, 100, 40)
    - Tile sizes (bm, bn, bk) = (50, 50, 20)
    - Partition factors (pm, pn, pk) = (2, 2, 2)
  - Tiles need not be square
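Partition by Block fixes the tile sizes, so the partition factors are just the number of tiles needed to cover each dimension. A sketch (function name mine):

```python
import math

def partition_by_block(dims, tiles):
    """Partition-by-Block (PB): the factor in each dimension is how many
    tiles of the given size cover it; rounding up is where padding enters."""
    return tuple(math.ceil(d / t) for d, t in zip(dims, tiles))

# Slide example: (M, N, K) = (100, 100, 40) with tiles (50, 50, 20)
# yields partition factors (2, 2, 2).
```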
Slide 24: Two methods to partition matrices
- Partition by Size (PS)
  - Specify the maximum size of the three tiles
  - Keep the ratios between dimensions constant
  - Example:
    - (M, N, K) = (100, 100, 50)
    - Maximum tile size for M, N = 1250
    - (pm, pn, pk) = (2, 2, 1)
  - Generalization of the divide-by-half approach
    - Tile size = 1/4 of the matrix size
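The transcript does not fully specify how PS turns a size cap into factors; one plausible reading, sketched here under that assumption, is to grow a common factor until the tile fits under the element cap, which preserves the dimension ratio. The cap used below (2500 = 1/4 of a 100 x 100 matrix, the divide-by-half case) is illustrative; the slide's own 1250 figure is left as stated.

```python
import math

def partition_by_size(m, n, max_tile_elems):
    """Partition-by-Size (PS) sketch: smallest common factor p such that an
    (m/p) x (n/p) tile holds at most max_tile_elems elements. Using one p for
    both dimensions keeps the m:n ratio constant."""
    p = 1
    while math.ceil(m / p) * math.ceil(n / p) > max_tile_elems:
        p += 1
    return p

# Divide-by-half as a special case: capping tiles at 1/4 of a
# 100 x 100 matrix gives factor 2 in each dimension.
```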
Slide 25: Outline
- Background
- Partition Methods
- Classifier Learning System
- Experimental Results
Slide 26: Classifier Learning System
- Uses the two partition primitives to determine how the input matrices are partitioned
  - Determines the partition factors at each level
  - f(M, N, K) → (pm_i, pn_i, pk_i), i = 0, 1, 2 (only three levels are considered)
- The partition factors depend on the matrix sizes
  - E.g., the partition factors of a (1000 x 1000) matrix should differ from those of a (50 x 1000) matrix.
- The partition factors also depend on architectural characteristics, such as cache size.
Slide 27: Determining the best partition factors
- The search space is huge → exhaustive search is impossible
- Our proposal: use a multi-step classifier learning system
  - Creates a table that, given the matrix dimensions, determines the partition factors
Slide 28: Classifier Learning System
- The result of the classifier learning system is a table with two columns
  - Column 1 (Pattern): a string of 0, 1, and # (don't care) that encodes the dimensions of the matrices
  - Column 2 (Action): the partition method for one step
    - Built from the partition-by-block and partition-by-size primitives with different parameters
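Condition matching in such a table works the way learning classifier systems conventionally define it: a rule fires when every non-wildcard position agrees with the encoded dimensions. A sketch under that assumption (the action strings below are hypothetical placeholders, not the deck's actual parameters):

```python
def matches(pattern, bits):
    """A 0/1/# condition matches a bit string when every non-'#'
    position equals the corresponding bit ('#' = don't care)."""
    return len(pattern) == len(bits) and all(p in ('#', b)
                                             for p, b in zip(pattern, bits))

def lookup(table, bits):
    """Return the action (one partition step) of the first matching rule."""
    for pattern, action in table:
        if matches(pattern, bits):
            return action
    return None

# Hypothetical two-rule table over a 5-bit encoded dimension:
table = [("0####", "PS(cap)"), ("1####", "PB(50,50,20)")]
```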
Slide 29: Learn with Classifier System
[Figures, Slides 29-36: a worked run of the classifier system; each dimension is encoded with 5 bits, and the dimension labels 24 x 16, 12 x 8, and 4 x 4 appear across the successive steps]
Slide 37: How does the classifier learning algorithm work?
- Changes the table based on feedback about performance and accuracy from previous runs.
- Mutates the condition part of the table to adjust the range of matching matrix dimensions.
- Mutates the action part to find the best partition method for the matching matrices.
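Condition mutation of a 0/1/# string can be sketched as follows; this is an illustrative sketch of the mechanism named on the slide, not the deck's actual operator, and the function name and mutation rate are mine:

```python
import random

def mutate_condition(pattern, rate=0.1, rng=random):
    """Each position of a 0/1/# condition mutates with probability `rate`
    to one of the other two symbols, widening (toward '#') or narrowing
    (toward 0/1) the range of matrix dimensions the rule matches."""
    symbols = '01#'
    out = []
    for c in pattern:
        if rng.random() < rate:
            out.append(rng.choice(symbols.replace(c, '')))
        else:
            out.append(c)
    return ''.join(out)
```

The action part can be mutated the same way, by perturbing the parameters of the PB/PS primitive a rule applies.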
Slide 38: Outline
- Background
- Partition Methods
- Classifier Learning System
- Experimental Results
Slide 39: Experimental Results
- Experiments on three platforms:
  - Sun UltraSPARC III
  - Intel Pentium 4 Xeon
  - Intel Itanium 2
- Matrices of sizes from 1000 x 1000 to 5000 x 5000
Slide 40: Algorithms
- Classifier MMM: our approach
  - Includes the overhead of copying into and out of the recursive layout
- ATLAS: library generated by ATLAS using its search procedure, without the hand-written codes
  - Has some form of blocking for L2
- L1: one level of tiling
  - Tile size the same as the one ATLAS uses for L1
- L2: two levels of tiling
  - L1 tile and L2 tile the same as the one ATLAS uses for L1
Slide 41: (no transcript; figure only)
Slide 42: (no transcript; figure only)
Slide 43: Conclusion and Future Work
- Preliminary results show the effectiveness of our approach
  - Sun UltraSPARC III and Xeon: 18% and 5% improvement, respectively
  - Itanium: -14%
- The padding mechanism needs improvement
  - Reduce the amount of padding
  - Avoid unnecessary computation on the padding
Slide 44: Thank you!