Title: Content Based Recommendation and Summarization in the Blogosphere
1Content Based Recommendation and Summarization in the Blogosphere
CLAIR
- Ahmed Hassan, University of Michigan, Ann Arbor
- Dragomir Radev, University of Michigan, Ann Arbor
- Jungho Cho, University of California, Los Angeles
- Amruta Joshi, University of California, Los Angeles
2Outline
- Introduction
- Approach
- Experiments and Results
- Conclusions
Introduction Approach Experiments
and Results Conclusions
3The Blogosphere
- Blogs are now one of the main means for the spread of ideas and information.
- The size of the Blogosphere has been increasing exponentially.
- How can we decide which blogs are more important/influential?
6Problem Definition
- Given a set of blogs related to a particular topic:
- Find a subset of blog feeds to read that have an interest in that topic.
- Similar to the Blog Distillation task in the TREC Blog Track.
9Ranking Web Pages
- This problem is similar to the problem of ranking web pages.
- How is that problem solved?
- Link popularity based algorithms.
- The most popular link popularity based algorithms are:
- PageRank
- Hypertext Induced Topic Selection (HITS)
12Ranking Web Pages
- Can we use link popularity based algorithms for speeches and blogs?
- Yes, but they might not work very well.
- Weakly linked nature of blog pages.
- Bloggers try to exploit the system to boost the rank of their blogs.
14From Hyperlinks to Similarity
- Use textual similarity to link posts instead of hyperlinks.
- Textual similarity between posts suggests some kind of interaction: one of them affecting the other.
- Textual similarity is a way of conferring authority.
16Problem Statement
- Given a set of blogs
- Build a graph where
- Nodes represent posts/feeds.
- Edges link posts/feeds with similar text.
- Edge weight reflects how similar the text is.
- Find a subset of nodes that are more important
than others.
18Text Salience Scores
- We define the importance score of a blog recursively in terms of the scores of its neighbors as follows:
  $$S(p) = \sum_{q \in \mathrm{adj}[p]} \frac{S(q)}{\deg(q)}$$
  where deg(q) is the degree of node q, and adj[p] is the set of all nodes adjacent to p in the network.
- This can be rewritten in matrix notation as:
  $$S = S B$$
  where S = (S(p_1), S(p_2), ..., S(p_N)) and the matrix B is the row normalized similarity matrix of the graph.
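The matrix form suggests computing the scores by power iteration. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `salience_scores`, the toy similarity matrix, the stopping tolerance, and the uniform row used for isolated nodes are all illustrative choices.

```python
def salience_scores(sim, iterations=500, tol=1e-10):
    """Power iteration for S = S B, with B the row-normalized similarity matrix."""
    n = len(sim)
    B = []
    for row in sim:
        total = sum(row)
        # Assumption: an isolated node gets a uniform row so B stays stochastic.
        B.append([x / total for x in row] if total > 0 else [1.0 / n] * n)
    S = [1.0 / n] * n  # start from a uniform score vector
    for _ in range(iterations):
        # S is a row vector, so the update is (S B)[p] = sum_q S[q] * B[q][p].
        nxt = [sum(S[q] * B[q][p] for q in range(n)) for p in range(n)]
        converged = max(abs(a - b) for a, b in zip(nxt, S)) < tol
        S = nxt
        if converged:
            break
    return S

# Hypothetical 3-node similarity graph (symmetric, zero diagonal).
sim = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
scores = salience_scores(sim)
```

For a symmetric similarity matrix like this one, the scores converge to values proportional to each node's total similarity to its neighbors, so node 1 ranks first and node 2 last.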
20Measuring Textual Similarity
- Post similarity is estimated by the cosine similarity between tf-idf vector representations of the posts.
- tf: the term frequency in some document.
- idf: the inverse document frequency.
- tf-idf: tf × idf
- Cosine similarity: the cosine of the angle between the tf-idf vectors.
- Other possible similarity measures:
- Edit distance.
- Language models.
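As an illustration of the similarity measure, here is a small pure-Python sketch. The idf variant log(N / df), the whitespace tokenization, and the three toy posts are assumptions made for the example; the slides do not specify them.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized doc."""
    n = len(docs)
    # Document frequency: in how many docs each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N / df).
    idf = {t: math.log(n / c) for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical posts, tokenized on whitespace.
posts = [
    "global warming melts arctic ice".split(),
    "arctic ice melts faster as warming continues".split(),
    "new phone released this week".split(),
]
vecs = tfidf_vectors(posts)
sim = [[cosine(a, b) for b in vecs] for a in vecs]
```

The resulting `sim` matrix can serve as the edge weights of the similarity graph: the first two posts share several terms and get a positive weight, while the third shares none with the first and gets zero.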
22Diversity Ranking
- The second most important node (B) is quite similar to the first.
- Node C is also important but more diverse w.r.t. A than B.
- How can we select the second node such that it is important and as diverse as possible from the first selected node?
24Diversity Ranking
- Possible solution: remove selected nodes, or nodes similar to them, from the network.
- May disconnect the network.
- Better solution: discount nodes that are too similar to previously selected nodes, where d(p) is a discounting factor that is inversely proportional to the similarity between p and the selected nodes.
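The discounting idea can be sketched as a greedy re-ranking. The exact form of d(p) is not recoverable from the slides, so this example assumes d(p) = 1 - (maximum similarity between p and any selected node), which decreases with similarity but is only one possible choice; the function name, scores, and similarity values are hypothetical.

```python
def diverse_ranking(scores, sim, k):
    """Greedy re-ranking: repeatedly pick the node with the best discounted score."""
    selected = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < k:
        def discounted(p):
            # Assumed discount: d(p) = 1 - max similarity to any selected node.
            d = 1.0 - max((sim[p][s] for s in selected), default=0.0)
            return scores[p] * d
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=discounted)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical nodes A=0, B=1, C=2: B is nearly a copy of A, C is different.
scores = [0.40, 0.35, 0.25]
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]
order = diverse_ranking(scores, sim, k=3)
```

With these numbers, A is picked first, then C is preferred over the higher-scoring B because B's score is heavily discounted by its similarity to A.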
26Adding Priors
- Relation to other nodes might not be the only parameter affecting a node's importance.
- Other attributes that are related to the node itself may be involved.
- How can we incorporate this into the approach? By adding a prior term, where Q(p) is some node quality measure and β is a trade-off factor.
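One common way to blend a quality prior into a graph score is a linear mix of the propagated score and the normalized prior. The sketch below assumes the update S = (1 - β) S B + β Q / ΣQ; this specific form is an assumption for illustration and may differ from the formula on the original slide. The function name and the toy data are hypothetical.

```python
def salience_with_prior(sim, quality, beta=0.3, iterations=200):
    """Iterate S = (1 - beta) * S B + beta * prior (assumed blending form)."""
    n = len(sim)
    # Row-normalize the similarity matrix; assumes no all-zero row.
    B = [[x / sum(row) for x in row] for row in sim]
    total_q = sum(quality)
    prior = [q / total_q for q in quality]  # normalized quality Q(p)
    S = [1.0 / n] * n
    for _ in range(iterations):
        S = [(1 - beta) * sum(S[q] * B[q][p] for q in range(n)) + beta * prior[p]
             for p in range(n)]
    return S

# Hypothetical similarity graph and quality measure (e.g. number of posts).
sim = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
num_posts = [10, 20, 300]
plain = salience_with_prior(sim, num_posts, beta=0.0)
biased = salience_with_prior(sim, num_posts, beta=0.3)
prior_only = salience_with_prior(sim, num_posts, beta=1.0)
```

Setting β to 0 recovers the purely graph-based score, and setting it to 1 ranks by the quality measure alone; intermediate values trade one off against the other.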
29Experimental Setup
- Data: the TREC BLOG06 dataset (Macdonald and Ounis, 2006)
- 100,649 feeds and 3,215,171 posts covering an 11-week period
- a list of 17,969 spam blogs
- Objective: rank blogs according to their coverage of other blogs.
31Evaluation
- Given a ranking of blogs, how can we measure how good it is?
- Using diffusion models.
- Diffusion models were originally used in social networks to model the spread of influence in a network.
- The Linear Threshold Model is one of the most popular diffusion models.
33Linear Threshold Model
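- Each node v has a threshold θ_v in [0, 1] selected uniformly at random.
- The diffusion process starts with a set of active nodes.
- An inactive node v becomes active if the total weight of its active neighbors reaches its threshold:
  $$\sum_{w \,\in\, \text{active neighbors of } v} b_{v,w} \ge \theta_v$$
  where b_{v,w} is the weight of the influence of neighbor w on v.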
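A single run of the linear threshold model can be sketched as follows. This is a minimal illustration, assuming a dense weight matrix where `weights[v][w]` is the influence of w on v and each row sums to at most 1 (as in Kempe et al., 2003); the function name and the toy network are hypothetical.

```python
import random

def linear_threshold(weights, seeds, rng):
    """Run one linear threshold diffusion and return the final set of active nodes."""
    n = len(weights)
    # Each node draws its threshold uniformly at random.
    thresholds = [rng.random() for _ in range(n)]
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in range(n):
            if v in active:
                continue
            # Total incoming weight from currently active nodes.
            influence = sum(weights[v][w] for w in active)
            if influence >= thresholds[v]:
                active.add(v)
                changed = True
    return active

# Hypothetical network: node 1 is fully influenced by node 0, node 2 is isolated.
weights = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
result = linear_threshold(weights, seeds={0}, rng=random.Random(0))
```

Nodes are activated as soon as they qualify rather than in synchronized rounds; because activation is never undone, the final active set is the same either way.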
34Linear Threshold Model
[Figure: step-by-step example of the linear threshold model on a small network annotated with numeric values from 0.1 to 0.7. Example from (Kempe et al., 2003).]
39Evaluation
- The quality of a ranking of blogs can be measured by:
- How many blogs are covered by the first blog in the ranking?
- How many blogs are covered by the first 2 blogs in the ranking?
- And so on.
- The coverage of a set of blogs is simply the number of other blogs it activates using the linear threshold model:
- Initially mark the nodes in the set as active.
- Apply the linear threshold model.
- Count the number of activated nodes.
- Repeat M times and take the average.
- To compare two blog rankings:
- Compare the coverage of the first 1, 2, ..., N blogs in each ranking.
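The evaluation procedure above can be sketched as a coverage curve: for each prefix of a ranking, average the number of newly activated nodes over M diffusion runs. The function names, the fixed random seed, and the toy network below are assumptions made for the example.

```python
import random

def spread(weights, seeds, rng):
    """One linear threshold run; returns the number of newly activated nodes."""
    n = len(weights)
    theta = [rng.random() for _ in range(n)]  # fresh thresholds for every run
    active = set(seeds)
    while True:
        newly = [v for v in range(n) if v not in active
                 and sum(weights[v][w] for w in active) >= theta[v]]
        if not newly:
            break
        active.update(newly)
    return len(active) - len(set(seeds))

def coverage(weights, seeds, runs=200, seed=0):
    """Average spread over `runs` simulations (M in the slides)."""
    rng = random.Random(seed)
    return sum(spread(weights, seeds, rng) for _ in range(runs)) / runs

def coverage_curve(weights, ranking, runs=200):
    """Coverage of the first 1, 2, ..., N nodes of a ranking."""
    return [coverage(weights, ranking[:k], runs) for k in range(1, len(ranking) + 1)]

# Hypothetical chain 0 -> 1 -> 2 with full influence; node 3 is isolated.
weights = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
good = coverage_curve(weights, [0, 3])
bad = coverage_curve(weights, [3, 0])
```

Comparing the two curves position by position shows that the ranking starting with node 0 covers more nodes at the first position, which is the sense in which one ranking is better than another here.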
42Baselines
- Random
- Select blogs uniformly at random.
- Heuristic
- Find the most popular blogs by using a fixed goodness criterion (e.g. number of posts).
- Greedy
- Start with an empty set M_0.
- Add the node that maximizes the marginal gain in subset quality.
- Quality of a subset is measured by the number of nodes it covers in the blog network.
- Very computationally costly: O(V^4).
45Results
- Percentage of covered blogs vs. percentage of selected blogs for BlogRank order, greedy algorithm order, number of posts order, and random order. Topic: Global Warming.
46Results
- Percentage of covered blogs vs. percentage of selected blogs for BlogRank order, greedy algorithm order, number of posts order, and random order. Topic: iPhone.
47Generalization to Future Data
- Percentage of activated blogs vs. percentage of selected blogs for BlogRank order (all data - future data), BlogRank-Future order (history data - future data), and Greedy-Future order (history data - future data). Topic: Global Warming.
48Generalization to Future Data
- Percentage of activated blogs vs. percentage of selected blogs for BlogRank order (all data - future data), BlogRank-Future order (history data - future data), and Greedy-Future order (history data - future data). Topic: iPhone.
49Priors
- Average normalized number of posts vs. percentage of selected blogs for BlogRank order and BlogRank with priors order. Topic: Global Warming.
50Priors
- Percentage of covered blogs vs. percentage of selected blogs for BlogRank order and BlogRank with priors order. Topic: Global Warming.
51Similarity vs. Links
53Conclusions
- This work presents a stochastic graph based method for recommending blogs to readers interested in some topic.
- The proposed method uses content similarity instead of hyperlinks to link blogs.
- It addresses issues like diversity ranking and biased ranking using some quality priors.
- The method was evaluated using a large real-world dataset.
54Acknowledgments
This work is based upon work supported
by the National Science Foundation under
Grant No. 0534323, Collaborative Research:
BlogoCenter - Infrastructure for Collecting,
Mining and Accessing Blogs. Any opinions,
findings, and conclusions or recommendations
expressed in this paper are those of the
authors and do not necessarily reflect the views
of the National Science Foundation.
55Ahmed Hassan hassanam_at_umich.edu