Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce
1

Distributed Nonnegative Matrix Factorization for
Web-Scale Dyadic Data Analysis on MapReduce
Definition: In general, dyadic data are measurements on dyads, i.e., pairs of elements coming from two sets.
Features: high-dimensional, sparse, nonnegative, dynamic.
Challenge: the scalability of available tools.
2
What we do
In this paper, by observing that a variety of Web dyadic data conform to different probabilistic distributions, we put forward a probabilistic NMF framework that not only encompasses the two classic variants, Gaussian NMF (GNMF) and Poisson NMF (PNMF), but also introduces Exponential NMF (ENMF) for modeling Web lifetime dyadic data. Furthermore, we scale NMF up to arbitrarily large matrices on MapReduce clusters.
3

4
GNMF
Maximizing the likelihood of observing A w.r.t. W and H under the i.i.d. assumption is equivalent to minimizing:
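The equation itself was not transcribed; the standard reconstruction, assuming i.i.d. Gaussian noise with mean $(WH)_{ij}$, is the least-squares objective:

$$\min_{W \ge 0,\; H \ge 0} \; \|A - WH\|_F^2 \;=\; \sum_{i,j} \bigl(A_{ij} - (WH)_{ij}\bigr)^2$$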
5
(No Transcript)
6
PNMF
Then maximizing the likelihood of observing A reduces to minimizing:
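The formula is likewise untranscribed; under a Poisson model with mean $(WH)_{ij}$, dropping terms independent of W and H, the standard objective is the generalized KL divergence:

$$\min_{W \ge 0,\; H \ge 0} \; \sum_{i,j} \Bigl[ (WH)_{ij} - A_{ij} \log (WH)_{ij} \Bigr]$$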
7
(No Transcript)
8

ENMF
Maximizing the likelihood of observing A w.r.t. W and H reduces to minimizing:
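The slide's formula is untranscribed; for an exponential distribution with mean $(WH)_{ij}$, the negative log-likelihood (up to constants) yields:

$$\min_{W \ge 0,\; H \ge 0} \; \sum_{i,j} \Bigl[ \log (WH)_{ij} + \frac{A_{ij}}{(WH)_{ij}} \Bigr]$$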
9

10

Scheme (a) resembles the CUDA programming model, while (b) is MapReduce-style programming. Scheme (a) places a full row or column in shared memory and proceeds with the computation on a many-core GPU. So if a column is too long to fit in shared memory, (a) cannot work efficiently, because the memory-access overhead becomes large.
11
The updating formula for H (Eqn. 1) is composed of three components, as the sketch below illustrates.
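A minimal NumPy sketch, assuming the standard multiplicative GNMF update H ← H ∘ (WᵀA) / (WᵀWH) ("Eqn. 1" itself was not transcribed): the three components are X = WᵀA, Y = WᵀW, and the product YH. Names here are illustrative, not the authors' implementation:

```python
import numpy as np

def update_H(A, W, H, eps=1e-9):
    """One multiplicative GNMF update of H, split into three components."""
    X = W.T @ A                # component 1: X = W^T A, touches the big matrix A
    Y = W.T @ W                # component 2: Y = W^T W, a small k-by-k matrix
    YH = Y @ H                 # component 3: Y H, a dense product independent of A
    return H * X / (YH + eps)  # elementwise update; eps avoids division by zero
```

Plausibly the point of this decomposition: only X touches the huge sparse matrix A and needs a distributed pass, whereas Y is merely k-by-k, so YH is cheap once Y has been gathered.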
12
(No Transcript)
13
(No Transcript)
14
Then computing YH (i.e., multiplying Y = WᵀW by H).
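To make the map/reduce structure concrete, here is a toy single-process simulation of the pass that computes X = WᵀA from row-partitioned data; the key layout and function names are illustrative assumptions, not the authors' code (slides 12-14 were not transcribed):

```python
import numpy as np
from collections import defaultdict

def simulate_WtA(rows_of_A, rows_of_W, k, n):
    """Toy, single-process simulation of X = W^T A over row-partitioned data."""
    partials = defaultdict(float)
    for a_i, w_i in zip(rows_of_A, rows_of_W):          # "map": each row i emits
        for j, a_ij in enumerate(a_i):                  # rank-1 terms w_i[r] * a_ij
            if a_ij:                                    # exploit sparsity of A
                for r in range(k):
                    partials[(r, j)] += w_i[r] * a_ij   # emit ((r, j), value)
    X = np.zeros((k, n))
    for (r, j), v in partials.items():                  # "reduce": sum by key
        X[r, j] = v
    return X
```

On a real cluster each mapper would hold only its shard of A's and W's rows and the reducers would sum partial values per (r, j) key.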
15

16

17
Experiment
The experiments are based on GNMF.
  • Sandbox experiment
  • This is designed for a clear understanding of the algorithm, not for showcasing scalability. All reported times are the time taken by a single iteration.
  • Experiment on real Web data
  • This experiment investigates the effectiveness of the approach, with a particular focus on its scalability, because a method that does not scale well is of limited utility in practice.

18

Sandbox Experiment
We generate sample matrices with m rows and n columns, where m = 2^17, n = 2^16, and the sparsity (fraction of nonzero cells) is 2^-10 or 2^-7.
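A scaled-down sketch of generating such a matrix with SciPy (illustrative only; the dimensions here are reduced so the example runs quickly):

```python
import scipy.sparse as sp

# Scaled-down illustration; the slide's actual setup is m = 2**17,
# n = 2**16, with sparsity (nonzero fraction) 2**-10 or 2**-7.
m, n, sparsity = 2**12, 2**11, 2**-10
A = sp.random(m, n, density=sparsity, format="csr", random_state=0)
print(A.nnz)  # expected number of nonzero cells: m * n * sparsity = 8192
```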
19

Performance
The time taken for one iteration increases as the number of nonzero cells in A increases.
20

Performance
The time taken increases with the latent dimensionality k. The slope for sparsity 2^-10 is much smaller than that for sparsity 2^-7, and both slopes are smaller than 1.
21

Performance
The ideal speedup is along the diagonal, which upper-bounds the practical speedup because of Amdahl's Law. On the matrix with sparsity 2^-7, the speedup is nearly 5 when 8 workers are enabled. As the sparsity decreases, the dominant factor becomes extra overhead such as the shuffle.
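As a back-of-the-envelope check (my arithmetic, not the slide's): Amdahl's Law gives speedup $S(N) = 1/\bigl((1-p) + p/N\bigr)$ for parallelizable fraction $p$, so an observed $S(8) \approx 5$ implies

$$(1-p) + \frac{p}{8} = \frac{1}{5} \quad\Longrightarrow\quad p = \frac{0.8}{0.875} \approx 0.91,$$

i.e., roughly 9% of the per-iteration work (shuffle and other overhead) behaves as serial.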
22
Experiment on real Web data
We record a UID-by-website count matrix involving 17.8 million distinct UIDs and 7.24 million distinct websites, which serves as the training set. For the effectiveness test, we take the top 1,000 UIDs with the largest number of associated websites, which gives us 346,040 distinct (UID, website) pairs. To measure effectiveness, we randomly hold out one visited website for each UID and mix it with another 99 unvisited sites. These 100 websites then constitute the test case for that UID. The goal is to check the rank of the held-out visited website among the 100 websites for each UID, and the overall effectiveness is measured by the average rank across the 1,000 test cases: the smaller, the better.
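A small sketch of this average-rank metric, under assumed array shapes (the slide gives no code):

```python
import numpy as np

def average_rank(scores, heldout_idx):
    """Average rank of the held-out site, as described above.

    scores      -- (num_uids, 100) predicted affinities for each UID's
                   100 candidates (1 held-out visited + 99 unvisited sites)
    heldout_idx -- (num_uids,) column index of the held-out site per UID
    Returns the mean rank (1 is best, 100 is worst); smaller is better.
    """
    held = scores[np.arange(len(scores)), heldout_idx][:, None]
    ranks = 1 + (scores > held).sum(axis=1)  # 1 + number of higher-scored sites
    return ranks.mean()
```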
23

Scalability
On the one hand, it shows that the elapsed time increases linearly with the number of iterations; on the other hand, the average time per iteration becomes smaller when more iterations are executed in one job, presumably because the fixed per-job overhead (launch and data loading) is amortized over more iterations.
24

Scalability
It reveals the linearity between the elapsed time and the dimensionality k.
25

Scalability
It lists how the algorithm scales with increasingly large data sampled from increasingly long time periods.
26

Conclusion