Parallelizing the SVD Computation for Latent Semantic Analysis

1
Parallelizing the SVD Computation for Latent
Semantic Analysis
  • Jorge Civera Saiz
  • Borislav Stoyanov

2
Introduction: Information Retrieval Methods
  • Term-matching retrieval techniques.
  • Based on syntax.
  • Typical in search engines.
  • Latent Semantic Analysis.
  • Allows document-document, term-document and
    term-term comparison.
  • Idea: model the relationship between terms and
    documents using a frequency term-by-document
    matrix.
  • Element aij counts how many times word i occurs
    in document j.
  • We apply the SVD to this matrix to remove noisy
    information and reveal semantic structure.

3
SVD decomposition
  • Matrix A = T x S x D^T (term, singular value and
    document matrices).
  • We use the SVDPACKC library to carry out the SVD
    of sparse matrices stored in compact row format,
    described by three vectors:
  • pointr vector
  • rowind vector
  • value vector

4
How did we parallelize the SVD computation?
  • Parallelize the basic matrix and vector
    operations, because they account for 90% of the
    execution time.
  • This approach was chosen for simplicity, since
    parallelizing the SVD computation as a whole was
    not feasible.
  • Application of the master-worker paradigm to our
    case:
  • The master runs the main program.
  • Workers wait in a barrier.
  • The master opens the barrier and sends the
    operation code to the workers together with the
    data; at the end of the computation,
    synchronization uses MPI_Reduce or MPI_Gatherv.

5
How did we parallelize the SVD computation? (2)
  • How do we parallelize Matrix x Matrix x Vector
    and Matrix x Vector taking the compact row format
    into account?
  • The distribution of non-zero values may not be
    uniform, i.e. some columns have few non-zero
    values and others have many.
  • Splitting the matrix just by columns may result
    in an unbalanced distribution of the
    computational effort.
  • The solution is simple: split the data into
    chunks with an equal number of non-zero values,
    since these drive the real computation. Based on
    the value vector.

6
Advantages and disadvantages of this solution
  • Good news:
  • The matrix is sent only once, at the beginning;
    communication cost is reduced to the minimum.
  • Computation is balanced across all nodes.
  • We only need to send each node its piece of the
    vector every time we perform the operation.
  • Bad news:
  • Sometimes we need to split in the middle of a
    column (even in the best case).
  • pointr must be recalculated for each node, and
    likewise the piece of the vector.

7
Vector operations
  • Basic operations:
  • Constant x Vector + Vector.
  • Dot product.
  • Parallelization is simple.
  • Problem:
  • These operations are called around one million
    times per execution, and every call implies
    synchronization and communication cost.
  • Communication cost plus computation cost on the
    remote nodes is higher than the computation cost
    on the local node for medium matrices (a few
    thousand by a few thousand).
  • Only useful for big vectors; in our tests it
    decreased performance.

8
Performance Measurements
  • Speed-up curve for 2, 4, 6 and 8 processors

[Figure: speed-up vs. size of the sparse matrix (3% dense)]
9
Problems
  • Speed-up worse than expected a priori, maybe
    because we could not test with larger matrices.
  • Problems with MPI_Broadcast.
  • Parallelization is not always beneficial: small
    vector operations.
  • Difficulties in the matrix operations because of
    the compact row format:
  • Recalculating the pointr vector for each node is
    not so easy.
  • A similar problem arises with the vector that we
    send for the matrix operations.

10
Conclusion
  • Speed-up of about 3, not as high as we expected.
    Possible reasons:
  • We are using medium-size matrices, not large
    ones.
  • Communication cost is higher because of the
    compact row format: the rowind and pointr
    vectors.
  • Not possible to take advantage of parallel
    vector operations, since the problem is medium
    scale.
  • Synchronization overhead: we call an operation a
    few million times.
  • Questions?