Custom Fast Binary Matrix Multiplication Processor Design - PowerPoint PPT Presentation

1
Custom Fast Binary Matrix Multiplication
Processor Design
  • Tim Pevzner (tpevzner@cs.ucsd.edu)
  • CSE237A
  • 6/12/2004

2
Outline
  • Introduction
  • Use case example
  • Design
  • Design Process
  • Future Work
  • Conclusion

3
Introduction
  • Fast binary matrix multiplication custom
    processor
  • An exercise in ISA design
  • Custom instructions
  • Custom design

4
Outline
  • Introduction
  • Use case example
  • Design
  • Design Process
  • Future Work
  • Conclusion

5
Use Case Example
  • Rows represent selectors
  • Columns represent data items
  • We want to find the correlation between data
    items, expressed as a binary value:
  • 1 = there is a correlation
  • 0 = there is no correlation
  • In this example there are only nine selectors
    and 11 data items, a very small example!
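The correlation test described above can be sketched in a few lines of Python. This is an illustration only: the function name and the bit values are made up, and the slide's actual 9 x 11 matrix is not reproduced here.

```python
def correlated(row_a, row_b):
    """Return 1 if the two binary rows share a 1 in any common position."""
    return 1 if any(a & b for a, b in zip(row_a, row_b)) else 0

# Three hypothetical selector rows over 11 data items
row_3 = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
row_7 = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
row_8 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(correlated(row_3, row_7))  # shared 1 at position 4 -> 1
print(correlated(row_3, row_8))  # no shared position     -> 0
```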

6
Use Case Example (contd)
7
Use Case Example Analysis
  • The matrix was very small; therefore, the
    results are not very interesting.
  • It becomes interesting when the matrix is on the
    order of 1-10 million data items and 100
    thousand selectors.
  • In a 100 K x 10 M matrix, there are one trillion
    possible intersections.

8
Motivation
  • As can be seen from the example, a very large
    matrix can emerge.
  • One trillion intersections would require up to
    31 billion 32-bit instructions in the worst
    case.
  • Storing that matrix would also require a lot of
    memory and a lot of disk accesses.
  • Even with a lot of main memory, the memory is
    not dedicated wholly to this task.
  • A co-processor would be built on an add-on
    board: the whole database is loaded onto the
    board at startup, and thereafter queries are
    sent to the board and results are returned,
    requiring far fewer accesses to the machine's
    main memory.

9
Outline
  • Introduction
  • Use case example
  • Design
  • Design Process
  • Future Work
  • Conclusion

10
Design
  • A new design is proposed as a co-processor model
  • Allow for a very large, directly addressable
    local, on-board memory.
  • 64-bit addressing a lot of potential for
    scaling.
  • 64-bit operations means it would require about 16
    Billion operations, or half of the operations
    required for 32 bit instructions.
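A quick back-of-the-envelope check of the operation counts quoted on these slides: a 100 K x 10 M matrix has one trillion intersections, which packs into roughly 31 billion 32-bit words or 16 billion 64-bit words.

```python
rows, cols = 100_000, 10_000_000
intersections = rows * cols        # 10**12, one trillion

ops_32 = intersections // 32       # 31_250_000_000  (~31 billion)
ops_64 = intersections // 64       # 15_625_000_000  (~16 billion)

print(intersections, ops_32, ops_64)
assert ops_64 * 2 == ops_32        # 64-bit words halve the operation count
```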

11
Outline
  • Introduction
  • Use case example
  • Design
  • Design Process
  • Future Work
  • Conclusion

12
Design Process
  • A design process is the process by which an
    idea is transformed into a design and
    implemented to accomplish the stated goal.
  • In the next few slides I will present the design
    process I went through, with some examples to
    illustrate the steps.

13
Design Process (as I see it)
  • A custom design requires custom thinking.
  • Start with an algorithm design.
  • Translate it to a control-flow flowchart.
  • Co-design the new ISA and the program that
    implements the algorithm using the new
    instructions.
  • Decide on instruction and register bit
    assignments.
  • Design the datapath for the overall processor.
  • Design the appropriate components.

14
Design Process (algorithm)
  • The basic fast binary matrix multiplication
    algorithm:
  • Input: a matrix (selector rows x data item
    columns) and a row of interest (a number between
    0 and the number of rows).
  • Output: a vector (its length is the number of
    rows) that contains a zero at position j if
    there is no correlation between the input row
    and row j on any data item, and a one if there
    is a correlation.
  • The algorithm:
  • Select a row to be used.
  • For each row j in the matrix, perform a bit-wise
    AND operation such that if the two rows have the
    value 1 in any (same) position, the result
    vector will have a 1 at the jth position.
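The algorithm above can be sketched in Python, packing each row into a single integer so one bitwise AND covers the whole row (a software stand-in for the hardware's 64-bit words; the function name and toy values are illustrative).

```python
def correlate_row(matrix, i):
    """Return a vector with 1 at position j iff row i and row j share a 1."""
    selected = matrix[i]
    return [1 if (selected & row) != 0 else 0 for row in matrix]

# Toy 4 x 8 matrix, each row stored as a bit mask
matrix = [
    0b00010010,
    0b01000000,
    0b00010000,
    0b00000001,
]
print(correlate_row(matrix, 0))  # -> [1, 0, 1, 0]
```

Note that a row always correlates with itself (as long as it contains at least one 1), which is why position 0 of the result is 1.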

15
Design Process (new ISA)
  • As an example, the new custom ISA has the
    following interesting instructions:
  • cmovpp: copies data from one memory block (main
    memory) to a row cache (temporary memory) and
    increments two memory pointers.
  • memAND: takes two values, one in memory and one
    in the row cache, performs the bit-wise AND
    operation on them, and increments two memory
    pointers.
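A hedged software model of these two instructions. The slide only states their behavior, so the class name, the operand encoding, where memAND's result goes, and the cache-pointer rewind step are all assumptions for illustration.

```python
class RowEngine:
    def __init__(self, mem):
        self.mem = list(mem)    # main on-board memory (word-sized entries)
        self.cache = []         # row cache (temporary memory)
        self.mp = 0             # main-memory pointer
        self.cp = 0             # row-cache pointer

    def cmovpp(self):
        """Copy mem[mp] into the row cache, then post-increment both pointers."""
        self.cache.append(self.mem[self.mp])
        self.mp += 1
        self.cp += 1

    def memAND(self):
        """AND mem[mp] with cache[cp], then post-increment both pointers."""
        result = self.mem[self.mp] & self.cache[self.cp]
        self.mp += 1
        self.cp += 1
        return result

eng = RowEngine([0b1010, 0b0110, 0b1100, 0b0101])
eng.cmovpp(); eng.cmovpp()         # load the first row (two words) into the cache
eng.cp = 0                         # rewind the cache pointer (assumed step)
print(eng.memAND(), eng.memAND())  # prints: 8 4
```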

16
Design Process (optimizations)
  • The new implementation has a number of
    optimizations:
  • Results are cached in a special memory, so any
    subsequent request for the same row simply
    returns what is in the cache.
  • A row cache is used as a temporary location to
    store the row being processed, making it
    possible to have parallel memory reads without
    the use of a dual-ported memory.
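The result-cache optimization described above can be sketched as a memoized query: the first request for a row computes its result vector, and repeats return the cached copy. The dict stands in for the special result memory; names and values are illustrative.

```python
result_cache = {}

def query(matrix, i):
    """Return the correlation vector for row i, computing it at most once."""
    if i not in result_cache:                        # first request: compute
        sel = matrix[i]
        result_cache[i] = [1 if (sel & r) else 0 for r in matrix]
    return result_cache[i]                           # repeats: cache hit

matrix = [0b0011, 0b0110, 0b1000]
print(query(matrix, 0))  # computed:  [1, 1, 0]
print(query(matrix, 0))  # served from the result cache
```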

17
Results Analysis
  • In the best case, when the result has already
    been calculated, it takes only
    9 + (3 × rows / 64) instructions to return the
    result.
  • In the worst case, when the matrix is completely
    empty, so all columns must be ANDed, it takes
    22 + (3 × columns / 64)
    + (((6 × columns / 64) + 16) × rows)
    + (3 × rows / 64) instructions.
  • For our 100 K x 10 M example:
  • Best case: about 5 thousand cycles.
  • Worst case: about 100 billion cycles (this is a
    very bad case, and is, for all practical
    purposes, impossible)
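Plugging the slide's 100 K x 10 M example into the cycle formulas (reading the operators, which were lost in the transcript, as + and ×, since that reproduces the quoted totals):

```python
rows, cols = 100_000, 10_000_000

best = 9 + 3 * rows // 64
worst = (22 + 3 * cols // 64
            + (6 * cols // 64 + 16) * rows
            + 3 * rows // 64)

print(best)   # 4696 -> "about 5 thousand cycles"
print(worst)  # ~9.4e10 -> "about 100 billion cycles"
```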

18
Design Process (tradeoffs)
  • I chose to make memory access word-serial.
  • Pros:
  • Makes the design very scalable; it simply reads
    whatever memory locations are available in any
    matrix configuration.
  • Cons:
  • It takes O(columns) to process one row, which
    makes the overall algorithm O(rows × columns).

19
Outline
  • Introduction
  • Use case example
  • Design
  • Design Process
  • Future Work
  • Conclusion

20
Future Work
  • Implement some optimizations:
  • It is possible to store a sparse matrix using a
    binary matrix compression algorithm that allows
    direct addressing of sparse bits.
  • Pros:
  • Uses less memory, and would require fewer cycles
    in the worst case.
  • Cons:
  • Makes the implementation somewhat more complex.
  • Sort the input matrix columns by their
    population density.
  • Pros:
  • Reduces the number of cycles needed to compute a
    result.
  • Cons:
  • Requires more work from the main processor.

21
Future work (contd)
  • More optimizations:
  • It is possible to give the design bit-wise AND
    units allocated to banks of memory.
  • Pros:
  • Row processing becomes O(1), and the overall
    performance becomes O(rows).
  • Cons:
  • Restricts the possible matrix geometries, since
    each AND unit would be tied to a bank of memory.
  • It would also be harder to scale the design, as
    it would require adding not only memory but also
    logic units.

22
Outline
  • Introduction
  • Use case example
  • Design
  • Design Process
  • Future Work
  • Conclusion

23
Conclusion
  • I have demonstrated that a custom design can be
    useful for certain applications.
  • I learned that schematic entry is not the most
    efficient way of building a complex design.
  • As of now, not all of the parts are completed
    and connected together.
  • Since the parts are not yet connected, they have
    also not been tested to work together. A big
    problem!
  • My design is well suited as a co-processor
    implementation for a dedicated DB server.

24
The End!
  • Q&A