Content Based Recommendation and Summarization in the Blogosphere


1
Content Based Recommendation and Summarization in
the Blogosphere
CLAIR

Ahmed Hassan, University of Michigan, Ann Arbor
Dragomir Radev, University of Michigan, Ann Arbor
Jungho Cho, University of California, Los Angeles
Amruta Joshi, University of California, Los Angeles

2
Outline

Introduction
Approach
Experiments and Results
Conclusions

Introduction Approach Experiments
and Results Conclusions
3
The Blogosphere

Blogs are now one of the main means for the spread
of ideas and information.
The size of the Blogosphere has been growing
exponentially.
How can we decide which blogs are more
important/influential?

6
Problem Definition

Given a set of blogs related to a particular
topic,
find a subset of blog feeds worth reading for
someone interested in that topic.
Similar to the Blog Distillation task in the
TREC Blog Track.

9
Ranking Web Pages

This problem is similar to the problem of ranking
web pages.
How is that problem solved?
Link popularity algorithms.
The most popular link-popularity-based algorithms
are:
PageRank
Hypertext Induced Topic Selection (HITS)

12
Ranking Web Pages

Can we use link popularity based algorithms for
speeches and blogs?
Yes, but they might not work very well.
Weakly linked nature of blog pages.
Bloggers try to exploit the system to boost the
rank of their blogs.

14
From Hyperlinks to Similarity

Use textual similarity to link posts instead of
hyperlinks.
Textual similarity between posts suggests some
kind of interaction:
one of them affecting the other.
Textual similarity is a way of conferring
authority.

16
Problem Statement

Given a set of blogs
Build a graph where
Nodes represent posts/feeds.
Edges link posts/feeds with similar text.
Edge weight reflects how similar the text is.
Find a subset of nodes that are more important
than others.

18
Text Salience Scores

We define the importance score of a blog
recursively in terms of the scores of its
neighbors as follows:
$$S(p) = \sum_{q \in adj[p]} \frac{S(q)}{deg(q)}$$
where deg(q) is the degree of node q, and adj[p]
is the set of all nodes adjacent to p in the network.
This can be rewritten in matrix notation as
$$S = S \cdot B$$
where S = (S(p1), S(p2), ..., S(pN))
and the matrix B is the row-normalized similarity
matrix of the graph.
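
One common way to compute such recursively defined scores is power iteration. The sketch below assumes the matrix form is the fixed point S = S · B, with S a row vector and B the row-normalized similarity matrix. The function name, the tolerance, the uniform fallback for rows with no similarity, and the toy matrix are illustrative assumptions, not taken from the slides.

```python
def salience_scores(sim, iterations=1000, tol=1e-10):
    """Power iteration for S = S * B, where B is the row-normalized
    version of the similarity matrix `sim` (a list of lists)."""
    n = len(sim)
    # Row-normalize: B[i][j] = sim[i][j] / sum_k sim[i][k].
    # Assumption: a row with no similarity mass falls back to uniform.
    B = []
    for row in sim:
        total = sum(row)
        B.append([w / total if total > 0 else 1.0 / n for w in row])

    S = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        new_S = [sum(S[i] * B[i][j] for i in range(n)) for j in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_S, S))
        S = new_S
        if delta < tol:
            break
    return S


# Hypothetical symmetric similarity matrix for three posts.
sim = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
scores = salience_scores(sim)
```

For a symmetric similarity matrix like this one, the scores converge to values proportional to each node's total similarity to its neighbors, so the second post ranks highest here.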

20
Measuring Textual Similarity

Post similarity is estimated by the cosine
similarity between tf-idf vector representations
of the posts.
tf: the term frequency in some document.
idf: the inverse document frequency.
tf-idf = tf × idf
Cosine similarity: the cosine of the angle
between the tf-idf vectors.
Other possible similarity measures:
Edit distance.
Language models.
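
A minimal sketch of tf-idf cosine similarity follows. The slides do not say which tf or idf variant is used, so raw term counts and idf = log(N / df) are assumptions here. The function names and the three toy posts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N / df).
    idf = {t: math.log(n / df[t]) for t in df}
    # tf-idf = raw term count times idf.
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


posts = [
    "global warming melts arctic ice".split(),
    "arctic ice melts as global warming continues".split(),
    "new iphone release date announced".split(),
]
vecs = tfidf_vectors(posts)
```

The first two posts share most of their terms and get a positive similarity, while the first and third share none and get zero.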

22
Diversity Ranking

The second important node (B) is quite similar to
the first.
Node C is also important but more diverse w.r.t A
than B.
How can we select the second node such that it is
important and as diverse as possible from the
first selected node?

24
Diversity Ranking

Possible solution: Remove selected nodes or nodes
similar to them from the network.
This may disconnect the network.
Better solution: Discount nodes that are too
similar to previously selected nodes, using a
discounting factor d(p) that is inversely
proportional to the similarity between p and the
selected nodes.
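
The exact form of d(p) is not preserved in this transcript, so the sketch below uses an assumed stand-in: d(p) = 1 / (1 + total similarity between p and the nodes already selected). It also applies the discount to fixed precomputed scores rather than recomputing them, which is a simplification. All names and numbers are hypothetical.

```python
def diverse_ranking(scores, sim, k):
    """Greedily pick k nodes, discounting candidates that are similar
    to nodes already selected."""
    selected = []
    candidates = set(range(len(scores)))
    while candidates and len(selected) < k:

        def discounted(p):
            # Assumed discount: shrinks as similarity to selected nodes grows.
            d = 1.0 / (1.0 + sum(sim[p][s] for s in selected))
            return scores[p] * d

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=discounted)
        selected.append(best)
        candidates.remove(best)
    return selected


# Nodes 0, 1, 2 play the roles of A, B, C: B scores higher than C
# but is very similar to A.
scores = [0.40, 0.35, 0.25]
sim = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
order = diverse_ranking(scores, sim, 3)
```

With these numbers the discount moves C ahead of B, because B's score is cut almost in half by its similarity to A.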

26
Adding Priors

Relation to other nodes might not be the only
parameter affecting a node's importance.
Other attributes that are related to the node
itself may be involved.
How can we incorporate this into the
approach? Combine the neighbor-based score with
Q(p), some node quality measure, using a
trade-off factor β.
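
The formula that combines the two terms is not preserved in this transcript. One plausible form, assumed here by analogy with personalized PageRank, is S(p) = (1 - β) · (neighbor-based score) + β · Q(p) / Σ Q. The sketch below iterates that assumed update; the function name, the toy graph, and the quality values are hypothetical.

```python
def salience_with_prior(B, quality, beta=0.2, iterations=200):
    """Iterate S = (1 - beta) * S * B + beta * prior, where B is a
    row-normalized similarity matrix and prior is the normalized quality."""
    n = len(B)
    total_q = sum(quality)
    prior = [q / total_q for q in quality]
    S = [1.0 / n] * n
    for _ in range(iterations):
        S = [
            (1.0 - beta) * sum(S[i] * B[i][j] for i in range(n)) + beta * prior[j]
            for j in range(n)
        ]
    return S


# Three equally similar nodes: without a prior they would tie.
B = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
quality = [10, 1, 1]  # e.g. number of posts per blog
S = salience_with_prior(B, quality, beta=0.2)
```

Setting beta to 0 recovers the similarity-only ranking, and raising it pulls the scores toward the quality measure.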

29
Experimental Setup

Data: The TREC BLOG06 dataset (Macdonald and
Ounis, 2006)
100,649 feeds and 3,215,171 posts covering an
11-week period
A list of 17,969 spam blogs
Objective: Rank blogs according to their coverage
of other blogs.

31
Evaluation

Given a ranking of blogs, how can we measure how
good it is?
Using diffusion models.
Diffusion models were originally used in social
networks to model the spread of influence in a
network.
The Linear Threshold Model is one of the most
popular diffusion models.

33
Linear Threshold Model

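Each node has a threshold θ in [0, 1] selected
uniformly at random.
The diffusion process starts with a set of active
nodes.
An inactive node v becomes active if the total
weight of its active neighbors reaches its threshold:
$$\sum_{w \in \mathrm{active\ neighbors\ of}\ v} b_{v,w} \ge \theta_v$$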

34
Linear Threshold Model

[Figure: worked example of the linear threshold model on a small weighted network. Example from (Kempe et al., 2003).]

39
Evaluation

The quality of a ranking of blogs can be measured
by:
How many blogs are covered by the first blog in
the ranking?
How many blogs are covered by the first 2 blogs
in the ranking?
And so on.
The coverage of a set of blogs is simply the
number of other blogs it activates using the
linear threshold model:
Initially mark the nodes in the set as active.
Apply the linear threshold model.
Count the number of activated nodes.
Repeat M times and take the average.
To compare two blog rankings:
Compare the coverage of the first 1, 2, ..., N blogs in
each ranking.
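
The coverage procedure above can be sketched as a small Monte Carlo simulation. This assumes the standard linear threshold rule from Kempe et al. (2003), where an inactive node activates once the summed weight of its active neighbors reaches its threshold. The graph, its weights, the function names, and the fixed random seed are illustrative assumptions.

```python
import random


def run_linear_threshold(weights, seeds, rng):
    """weights[v][u] is the influence of neighbor u on node v.
    Every node must appear as a key. Returns the final set of active nodes."""
    # Each node draws a fresh threshold uniformly at random.
    thresholds = {v: rng.random() for v in weights}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in weights:
            if v in active:
                continue
            influence = sum(w for u, w in weights[v].items() if u in active)
            if influence >= thresholds[v]:
                active.add(v)
                changed = True
    return active


def coverage(weights, seeds, runs=1000, seed=0):
    """Average number of non-seed nodes activated over `runs` simulations."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        active = run_linear_threshold(weights, seeds, rng)
        total += len(active) - len(set(seeds))
    return total / runs


# Hypothetical network; incoming weights per node sum to at most 1.
weights = {
    "a": {"b": 0.5, "c": 0.2},
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {},
}
cov = coverage(weights, ["a"])
```

Comparing two rankings then amounts to calling `coverage` on the first 1, 2, ..., N entries of each and comparing the resulting curves.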

42
Baselines

Random:
Select blogs uniformly at random.
Heuristic:
Find the most popular blogs by using a fixed
goodness criterion (e.g. number of posts).
Greedy:
Start with an empty set M0.
Add the node that maximizes the marginal gain in
subset quality.
Quality of a subset is measured by the number of
nodes it covers in the blog network.
Very computationally costly: O(V^4).
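
The greedy baseline can be sketched as follows. In the slides the subset quality is the coverage under the diffusion model; to keep the example deterministic, a toy quality function (the number of nodes in a fixed reachability table, counting the selected nodes themselves) stands in for it. All names and data here are hypothetical.

```python
def greedy_select(nodes, quality, k):
    """Repeatedly add the node with the largest marginal gain in quality."""
    selected = []
    remaining = list(nodes)
    while remaining and len(selected) < k:
        current = quality(selected)
        # max() returns the first node among those with the largest gain.
        best = max(remaining, key=lambda v: quality(selected + [v]) - current)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-in for coverage: which nodes each node reaches.
covers = {"a": {"b", "c"}, "b": {"c"}, "c": {"d"}, "d": set()}


def covered_count(subset):
    """Number of nodes covered by the subset, including the subset itself."""
    reached = set(subset)
    for v in subset:
        reached |= covers[v]
    return len(reached)


picked = greedy_select(list(covers), covered_count, 2)
```

Each round re-evaluates the quality of every remaining candidate, which is why the cost grows quickly when quality itself requires repeated simulation.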

45
Results

Percentage of covered blogs vs. percentage of
selected blogs for BlogRank order, greedy
algorithm order, number of posts order, and
random order - Topic Global Warming.

46
Results

Percentage of covered blogs vs. percentage of
selected blogs for BlogRank order, greedy
algorithm order, number of posts order, and
random order - Topic iPhone.

47
Generalization to Future Data

Percentage of activated blogs vs. percentage of
selected blogs for BlogRank order (all data -
future data), BlogRank-Future order (history data
- future data), and Greedy-Future order (history
data - future data) - Topic Global Warming.

48
Generalization to Future Data

Percentage of activated blogs vs. percentage of
selected blogs for BlogRank order (all data -
future data), BlogRank-Future order (history data
- future data), and Greedy-Future order (history
data - future data) - Topic iPhone.

49
Priors

Average normalized number of posts vs. percentage
of selected blogs for BlogRank order and BlogRank
with priors order - Topic Global Warming.

50
Priors

Percentage of covered blogs vs. percentage of
selected blogs for BlogRank order and BlogRank
with priors order - Topic Global Warming.

51
Similarity vs. Links
53
Conclusions

This work presents a stochastic graph-based
method for recommending blogs to read for users
interested in some topic.
The proposed method uses content similarity
instead of hyperlinks to link blogs.
It addresses issues like diversity ranking and
biased ranking using quality priors.
The method was evaluated using a large real-world
dataset.

54
Acknowledgments
This work is based upon work supported
by the National Science Foundation under
Grant No. 0534323, "Collaborative Research:
BlogoCenter - Infrastructure for Collecting,
Mining and Accessing Blogs." Any opinions,
findings, and conclusions or recommendations
expressed in this paper are those of the
authors and do not necessarily reflect the views
of the National Science Foundation.