Content Based Recommendation and Summarization in the Blogosphere

1
Content Based Recommendation and Summarization in
the Blogosphere
CLAIR
  • Ahmed Hassan, University of Michigan, Ann Arbor
  • Dragomir Radev, University of Michigan, Ann Arbor
  • Jungho Cho, University of California, Los Angeles
  • Amruta Joshi, University of California, Los Angeles

2
Outline
  • Introduction
  • Approach
  • Experiments and Results
  • Conclusions

Introduction Approach Experiments
and Results Conclusions
3
The Blogosphere
  • Blogs are now one of the main means for spread of
    ideas and information.
  • The size of the Blogosphere has been exhibiting
    an exponential increase.
  • How can we decide which blogs are more
    important/influential?

6
Problem Definition
  • Given a set of blogs related to a particular
    topic
  • Find a subset of blog feeds worth reading for
    someone interested in that topic
  • Similar to the Blog Distillation task in the
    TREC Blog Track.

9
Ranking Web Pages
  • This problem is similar to the problem of ranking
    web pages.
  • How is that problem solved?
      • Link popularity based algorithms.
  • The most popular link popularity based algorithms
    are:
      • PageRank
      • Hypertext Induced Topic Selection (HITS)

12
Ranking Web Pages
  • Can we use link popularity based algorithms for
    speeches and blogs?
  • Yes, but they might not work very well:
      • Blog pages are only weakly linked.
      • Bloggers try to exploit the system to boost the
        rank of their blogs.

14
From Hyperlinks to Similarity
  • Use textual similarity to link posts instead of
    hyperlinks.
  • Textual similarity between posts suggests some
    kind of interaction: one of them affecting the
    other.
  • Textual similarity is a way of conferring
    authority.

16
Problem Statement
  • Given a set of blogs
  • Build a graph where
      • Nodes represent posts/feeds.
      • Edges link posts/feeds with similar text.
      • Edge weight reflects how similar the text is.
  • Find a subset of nodes that are more important
    than others.

18
Text Salience Scores
  • We define the importance score of a blog
    recursively in terms of the scores of its
    neighbors as follows:
    $$S(p) = \sum_{q \in adj[p]} \frac{S(q)}{\deg(q)}$$
    where deg(q) is the degree of node q, and adj[p]
    is the set of all nodes adjacent to p in the
    network.
  • This can be rewritten in matrix notation as
    $$S = S \cdot B$$
    where S = (S(p_1), S(p_2), ..., S(p_N)) and the
    matrix B is the row normalized similarity matrix
    of the graph.
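
The recursive definition above can be solved by power iteration. The following is a minimal sketch, assuming the matrix form is S = S·B with S treated as a row vector; the function name, the convergence tolerance, and the uniform fallback for all-zero rows are illustrative assumptions, not details from the slides.

```python
def salience_scores(sim, tol=1e-10, max_iter=1000):
    """Power iteration for S = S B, with B the row-normalized similarity matrix."""
    n = len(sim)
    # Row-normalize the similarity matrix so that every row of B sums to 1.
    B = []
    for row in sim:
        total = sum(row)
        # Assumption: a node with no similar neighbors spreads its score uniformly.
        B.append([w / total for w in row] if total > 0 else [1.0 / n] * n)
    S = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of S <- S B (row vector times matrix).
        nxt = [sum(S[q] * B[q][p] for q in range(n)) for p in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, S)) < tol:
            return nxt
        S = nxt
    return S


# Toy symmetric similarity matrix for three posts (self-similarity on the diagonal).
toy_sim = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
toy_scores = salience_scores(toy_sim)
```

For a symmetric similarity matrix the scores end up proportional to each node's total similarity to the others, so the middle post in the toy example receives the highest score.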

20
Measuring Textual Similarity
  • Post similarity is estimated by the cosine
    similarity between tf-idf vector representations
    of the posts.
      • tf: the term frequency in some document.
      • idf: the inverse document frequency.
      • tf-idf = tf × idf
      • Cosine similarity: the cosine of the angle
        between the tf-idf vectors.
  • Other possible similarity measures:
      • Edit distance.
      • Language models.
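
As a rough illustration of the tf-idf cosine measure described above, here is a small sketch in plain Python. It assumes documents are already tokenized and uses idf = log(N / df), which is only one common variant; the slides do not say which weighting was used, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N / df).
    idf = {term: math.log(n / count) for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Posts that share no terms get similarity 0, and a term that appears in every post contributes nothing because its idf is 0 under this variant.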

22
Diversity Ranking
  • The second important node (B) is quite similar to
    the first.
  • Node C is also important but more diverse w.r.t A
    than B.
  • How can we select the second node such that it is
    important and as diverse as possible from the
    first selected node?

24
Diversity Ranking
  • Possible solution: remove selected nodes, or nodes
    similar to them, from the network.
      • This may disconnect the network.
  • Better solution: discount nodes that are too
    similar to previously selected nodes, using a
    discounting factor d(p) that is inversely
    proportional to the similarity between p and the
    selected nodes.
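
One way to picture the discounting idea is a greedy re-ranking over fixed scores. The sketch below is only an illustration: the slides do not give the exact form of d(p), so it assumes d(p) = 1 - (maximum similarity between p and any selected node), with similarities in [0, 1]. The function name and the tie-breaking by lowest index are also assumptions.

```python
def diverse_ranking(scores, sim, k):
    """Greedily pick k nodes, discounting candidates similar to those already picked."""
    selected = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < k:

        def discounted(p):
            # Assumed discount: d(p) = 1 - max similarity to any selected node.
            d = 1.0 - max((sim[p][s] for s in selected), default=0.0)
            return d * scores[p]

        # sorted() makes ties resolve to the lowest node index.
        best = max(sorted(remaining), key=discounted)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With three nodes A, B, C where B scores higher than C but is nearly identical to A, this procedure picks A, then C, then B.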

26
Adding Priors
  • A node's relation to other nodes might not be the
    only parameter affecting its importance.
  • Other attributes that are related to the node
    itself may be involved.
  • How can we incorporate this into the approach?
    Combine the neighbor-based score with Q(p), some
    node quality measure, using a trade-off factor β.
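
A possible way to combine the two signals is a linear interpolation in the style of personalized PageRank. This is an assumption for illustration, not necessarily the formula on the slide: each update takes (1 - beta) times the neighbor-based score plus beta times the normalized quality prior. The function and parameter names are hypothetical.

```python
def salience_with_priors(sim, quality, beta=0.2, iters=200):
    """Iterate S <- (1 - beta) * S B + beta * Q / sum(Q)."""
    n = len(sim)
    # Row-normalize the similarity matrix (assumes every row has a positive sum).
    B = [[w / sum(row) for w in row] for row in sim]
    q_total = sum(quality)
    prior = [q / q_total for q in quality]
    S = [1.0 / n] * n
    for _ in range(iters):
        S = [
            (1 - beta) * sum(S[q] * B[q][p] for q in range(n)) + beta * prior[p]
            for p in range(n)
        ]
    return S
```

Setting beta to 0 ignores the prior entirely, while beta equal to 1 ranks nodes by quality alone; values in between trade one off against the other.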

29
Experimental Setup
  • Data: the TREC BLOG06 dataset (Macdonald and
    Ounis, 2006)
      • 100,649 feeds and 3,215,171 posts covering an
        11-week period
      • a list of 17,969 spam blogs
  • Objective: rank blogs according to their coverage
    of other blogs.

31
Evaluation
  • Given a ranking of blogs, how can we measure how
    good it is?
  • Using diffusion models.
  • Diffusion models were originally used in social
    networks to model the spread of influence in a
    network.
  • The Linear Threshold Model is one of the most
    popular diffusion models.

33
Linear Threshold Model
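  • Each node v has a threshold θ_v in [0, 1] selected
    uniformly at random.
  • The diffusion process starts with a set of active
    nodes.
  • An inactive node v becomes active if the total
    weight of its active neighbors reaches its
    threshold:
    $$\sum_{w \in \text{active neighbors of } v} b_{v,w} \ge \theta_v$$
    where b_{v,w} is the weight of the influence of
    neighbor w on v.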
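
A single run of the linear threshold process can be sketched as follows, using the standard activation rule from Kempe et al. (2003): an inactive node becomes active once the summed weight of its active neighbors reaches its threshold. The data layout (`weights[v][w]` as the influence of neighbor w on v) and the function name are assumptions made for this example, and thresholds are passed in explicitly so that the run is deterministic.

```python
def linear_threshold(weights, thresholds, seeds):
    """Run the cascade to a fixed point and return the set of active nodes.

    weights[v][w] is the influence weight that neighbor w exerts on node v.
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in weights:
            if v in active:
                continue
            # Total incoming weight from neighbors that are already active.
            total = sum(b for w, b in weights[v].items() if w in active)
            if total >= thresholds[v]:
                active.add(v)
                changed = True
    return active
```

Because activation is monotone (nodes never deactivate), updating nodes one at a time reaches the same final set as updating them in synchronized rounds.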

34
Linear Threshold Model
  • [Figure: example network labeled with numeric
    values between 0.1 and 0.7]
  • Example from (Kempe et al., 2003)

39
Evaluation
  • The quality of a ranking of blogs can be measured
    by:
      • How many blogs are covered by the first blog in
        the ranking?
      • How many blogs are covered by the first 2 blogs
        in the ranking?
      • And so on.
  • The coverage of a set of blogs is simply the
    number of other blogs it activates using the
    linear threshold model:
      • Initially mark the nodes in the set as active.
      • Apply the linear threshold model.
      • Count the number of activated nodes.
      • Repeat M times and take the average.
  • To compare two blog rankings:
      • Compare the coverage of the first 1, 2, ..., N
        blogs in each ranking.
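
The evaluation procedure above can be sketched as a small Monte Carlo estimate. This is an illustrative implementation under assumptions: `weights[v][w]` is taken to be the influence of neighbor w on node v, thresholds are redrawn uniformly for each of the M runs, and a fixed random seed is used only to make the example reproducible. The function names are hypothetical.

```python
import random


def coverage(weights, seeds, runs=100, rng=None):
    """Average number of non-seed nodes activated over `runs` random-threshold cascades."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    seed_set = set(seeds)
    total = 0
    for _ in range(runs):
        # Draw a fresh threshold for every node in each run.
        thresholds = {v: rng.random() for v in weights}
        active = set(seed_set)
        changed = True
        while changed:
            changed = False
            for v in weights:
                if v in active:
                    continue
                incoming = sum(b for w, b in weights[v].items() if w in active)
                if incoming >= thresholds[v]:
                    active.add(v)
                    changed = True
        total += len(active) - len(seed_set)
    return total / runs


def coverage_curve(weights, ranking, runs=100):
    """Coverage of the first 1, 2, ..., N entries of a ranking."""
    return [coverage(weights, ranking[:k], runs) for k in range(1, len(ranking) + 1)]
```

Two rankings can then be compared by computing `coverage_curve` for each and comparing the curves position by position.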

42
Baselines
  • Random
      • Select blogs uniformly at random.
  • Heuristic
      • Find the most popular blogs by using a fixed
        goodness criterion (e.g. number of posts).
  • Greedy
      • Start with an empty set M_0.
      • Add the node that maximizes the marginal gain in
        subset quality.
      • Quality of a subset is measured by the number of
        nodes it covers in the blogs network.
      • Very computationally costly: O(V^4).
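
The greedy baseline can be sketched generically. In the snippet below, `quality` is a hypothetical callable that maps a list of nodes to a number; in the setting of these slides it would be the coverage of that subset in the blog network, but any set-quality function can be plugged in.

```python
def greedy_select(nodes, quality, k):
    """Pick k nodes, each time adding the one with the largest marginal gain in quality."""
    chosen = []  # starts as the empty set
    for _ in range(min(k, len(nodes))):
        current = quality(chosen)
        candidates = [v for v in nodes if v not in chosen]
        # Marginal gain of v = quality with v added minus current quality.
        best = max(candidates, key=lambda v: quality(chosen + [v]) - current)
        chosen.append(best)
    return chosen
```

Every step re-evaluates the quality function for each remaining candidate, which is what makes this baseline so expensive when quality itself requires simulating diffusion over the whole network.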

45
Results
  • Percentage of covered blogs vs. percentage of
    selected blogs for BlogRank order, greedy
    algorithm order, number of posts order, and
    random order - Topic Global Warming.

46
Results
  • Percentage of covered blogs vs. percentage of
    selected blogs for BlogRank order, greedy
    algorithm order, number of posts order, and
    random order - Topic iPhone.

47
Generalization to Future Data
  • Percentage of activated blogs vs. percentage of
    selected blogs for BlogRank order (all data -
    future data), BlogRank-Future order (history data
    - future data), and Greedy-Future order (history
    data - future data) - Topic Global Warming.

48
Generalization to Future Data
  • Percentage of activated blogs vs. percentage of
    selected blogs for BlogRank order (all data -
    future data), BlogRank-Future order (history data
    - future data), and Greedy-Future order (history
    data - future data) - Topic iPhone.

49
Priors
  • Average normalized number of posts vs. percentage
    of selected blogs for BlogRank order and BlogRank
    with priors order - Topic Global Warming.

50
Priors
  • Percentage of covered blogs vs percentage of
    selected blogs for BlogRank order and BlogRank
    with priors order - Topic Global Warming.

51
Similarity vs. Links
53
Conclusions
  • This work presents a stochastic graph based
    method for recommending blogs to read for readers
    interested in some topic.
  • The proposed method uses content similarity
    instead of hyperlinks to link blogs.
  • It addresses issues like diversity ranking and
    biased ranking using some quality priors.
  • The method was evaluated using a large real-world
    dataset.

54
Acknowledgments
This work is based upon work supported
by the National Science Foundation under
Grant No. 0534323, Collaborative Research:
BlogoCenter - Infrastructure for Collecting,
Mining and Accessing Blogs. Any opinions,
findings, and conclusions or recommendations
expressed in this paper are those of the
authors and do not necessarily reflect the views
of the National Science Foundation.
55
  • Thanks

Ahmed Hassan hassanam_at_umich.edu