Graph-based Text Classification: Learn from Your Neighbors

1
Graph-based Text Classification: Learn from Your Neighbors
Ralitsa Angelova, Gerhard Weikum
Max Planck Institute for Informatics
Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany
Presented by Chia-Hao Lee
2
Outline
  • Introduction
  • Graph-based Classification
  • Incorporating Metric Label Distance
  • Experiments
  • Conclusion

3
Introduction
  • Automatic classification is a supervised learning
    technique for assigning thematic categories to
    data items such as customer records,
    gene-expression data records, Web pages, or text
    documents.
  • The standard approach is to represent each data
    item by a feature vector and learn parameters of
    mathematical decision models.
  • This approach is context-free: the decision is
    based only on the feature vector of a given data
    item, disregarding the other data items in the
    test set.

4
Introduction
  • In many settings, this context-free approach
    does not exploit the available information about
    relationships between data items.
  • Using the relationship information, we can
    construct a graph G in which each data item is a
    node and each relationship instance forms an edge
    between the corresponding nodes.
  • In the following we will mostly focus on text
    documents with links to and from other documents.
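
The graph construction described above can be sketched as follows. This is only an illustration: the function name `build_link_graph`, the document ids, and the choice to treat links as undirected are assumptions made for the example, not details taken from the slides.

```python
from collections import defaultdict


def build_link_graph(links):
    """Build an undirected adjacency map from (source, target) link pairs.

    Each document becomes a node; each link becomes an edge between
    the two documents it connects (direction is ignored here).
    """
    neighbors = defaultdict(set)
    for src, dst in links:
        if src == dst:
            continue  # a self-link carries no neighborhood information
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    return dict(neighbors)


# Hypothetical document ids and links, for illustration only.
links = [("d1", "d2"), ("d2", "d3"), ("d3", "d1"), ("d4", "d2")]
graph = build_link_graph(links)
```

Keeping neighbors in sets makes repeated links between the same two documents collapse into a single edge.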

5
Introduction
  • A straightforward approach to capturing a
    document's neighbors would be to incorporate the
    features and feature weights of the neighbors
    into the feature vector of the given document
    itself.
  • A more advanced approach is to model the mutual
    influence between neighboring documents, aiming
    to estimate the class labels of all test
    documents simultaneously.
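
A minimal sketch of the straightforward approach, assuming sparse term-weight dictionaries as feature vectors. The helper `augment_with_neighbors` and the damping parameter `neighbor_weight` are hypothetical; the slides do not say how neighbor features are weighted.

```python
from collections import Counter


def augment_with_neighbors(features, graph, doc, neighbor_weight=0.5):
    """Add the (down-weighted) features of doc's neighbors to doc's own vector.

    features: {doc_id: {term: weight}}
    graph:    {doc_id: set of neighboring doc_ids}
    neighbor_weight: assumed damping factor for neighbor terms.
    """
    combined = Counter(features[doc])
    for nb in graph.get(doc, ()):
        for term, weight in features[nb].items():
            combined[term] += neighbor_weight * weight
    return dict(combined)


# Hypothetical toy data.
features = {"d1": {"graph": 1.0}, "d2": {"graph": 2.0, "movie": 1.0}}
graph = {"d1": {"d2"}, "d2": {"d1"}}
augmented = augment_with_neighbors(features, graph, "d1")
```

The result is still classified in a context-free way; only the input vector changes, which is what distinguishes this from the joint labeling approach.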

6
Introduction
  • A simple example of relaxation labeling (RL) is
    shown in Figure 1.
  • Given a set of classes, we wish to assign to
    every document marked "?" its most probable
    label.
  • Let the contingency matrix in Figure 1(b) be
    estimated from the training data.

7
Introduction
  • The theory paper by Kleinberg and Tardos views
    the classification problem for nodes in an
    undirected graph as a metric labeling problem
    where we aim to optimize a combinatorial function
    consisting of assignment costs and separation
    costs.
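
For reference, the metric labeling objective is commonly written as below. The symbols are the usual ones for this problem and are assumptions here, since the slide gives no formula: P is the set of objects (nodes), E the edge set, f a labeling, c(p, a) the cost of assigning label a to p, w_e a non-negative edge weight, and d a metric on the label set.

```latex
% Metric labeling objective (assumed notation, see lead-in):
% assignment costs plus separation costs, minimized over labelings f.
\begin{equation}
Q(f) \;=\; \sum_{p \in P} c\bigl(p, f(p)\bigr)
      \;+\; \sum_{e = (p,q) \in E} w_e \, d\bigl(f(p), f(q)\bigr)
\end{equation}
```

The first sum penalizes labels that fit a node's own evidence poorly; the second penalizes edges whose endpoints receive labels that are far apart under d.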

8
Graph-Based Classification
  • Our approach is based on the probabilistic
    formulation of the classification problem and
    uses a relaxation labeling technique to derive
    two major approaches for finding the most
    likely labeling of the given test graph: hard
    and soft labeling.
  • D: a set of documents.
  • G: a graph whose vertices correspond to the
    documents in D and whose edges represent the
    link structure of D.
  • Each node u carries a label, and each document d
    has a feature vector that locally captures its
    content.

9
Graph-Based Classification
  • Taking into account the underlying link structure
    and document d's content-based feature vector,
    the probability of a label being assigned to d
    is conditioned on both the feature vector of d
    and the graph G.
  • In the spirit of the introduction's discussion of
    the influence of each document's immediate
    neighbors, we restrict this conditioning to the
    neighborhood of d.
  • That is, the label of d is assumed to be
    independent of the labels of other nodes in the
    graph given the labels of its immediate
    neighbors.

10
Graph-Based Classification
  • The graph-unaware probability is the probability
    of a label based only on d's local content.
  • Under the additional independence assumption that
    there is no direct coupling between the content
    of a document and the labels of its neighbors, a
    central equation holds for the total probability:
    it sums the posterior probabilities over all
    possible labelings of the neighborhood.
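
One way to write this down is sketched here. All symbols are assumed notation chosen for the sketch and are not necessarily those of the original slides: λ(d) is the label of d, τ(d) its feature vector, N(d) its immediate neighbors, λ(N(d)) a joint labeling of those neighbors, and 𝓔 a hypothetical shorthand for everything observed in the graph.

```latex
% Assumed notation (see lead-in). Line 1: law of total probability over
% neighborhood labelings, using the assumption that d's label depends on
% the rest of the graph only through its neighbors' labels.
% Line 2: Bayes' rule under the assumption that d's content and its
% neighbors' labels are independent given d's own label.
\begin{align}
\Pr[\lambda(d)=c \mid \mathcal{E}]
  &= \sum_{\lambda(N(d))}
     \Pr[\lambda(d)=c \mid \tau(d), \lambda(N(d))] \;
     \Pr[\lambda(N(d)) \mid \mathcal{E}] \\
\Pr[\lambda(d)=c \mid \tau(d), \lambda(N(d))]
  &\propto \frac{\Pr[\lambda(d)=c \mid \tau(d)] \;
                 \Pr[\lambda(d)=c \mid \lambda(N(d))]}
                {\Pr[\lambda(d)=c]}
\end{align}
```

The factor Pr[λ(d)=c | τ(d)] in the second line is the graph-unaware probability; the remaining factors carry the influence of the neighborhood. The proportionality is over the candidate labels c.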

11
Graph-Based Classification
  • In the same vein, if we further assume
    independence among all neighbor labels of the
    same node, we reach a formulation of our
    neighborhood-conscious classification problem
    that factorizes over the individual neighbors.
  • This formulation can be computed in an iterative
    manner.
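
The iteration can be sketched as a generic relaxation labeling loop. This is a simplified variant under stated assumptions, not the exact update from the slides: `compat[c][c2]` is a hypothetical compatibility weight for a node labeled `c` having a neighbor labeled `c2` (for example, estimated from labeled training links), documents with known labels are held fixed, and scores are renormalized every round. Because neighbor labels are treated as independent, the sum over joint neighborhood labelings turns into a product of per-neighbor sums.

```python
def normalize(scores):
    """Scale non-negative scores to sum to 1 (uniform if all are zero)."""
    total = sum(scores.values())
    if total <= 0.0:
        return {c: 1.0 / len(scores) for c in scores}
    return {c: s / total for c, s in scores.items()}


def soft_labeling(local, graph, compat, fixed=None, rounds=10):
    """Iteratively re-estimate label probabilities for unlabeled documents.

    local:  {doc: {label: content-only probability}} for unlabeled docs
    graph:  {doc: set of neighbor docs}
    compat: {label: {neighbor_label: compatibility weight}} (assumed input)
    fixed:  {doc: known label}, kept constant during the iterations
    """
    fixed = fixed or {}
    labels = list(compat)
    belief = {d: normalize(p) for d, p in local.items()}
    for d, lab in fixed.items():
        belief[d] = {c: 1.0 if c == lab else 0.0 for c in labels}
    for _ in range(rounds):
        updated = {}
        for d in belief:
            if d in fixed:
                updated[d] = belief[d]
                continue
            scores = {}
            for c in labels:
                score = local[d][c]  # graph-unaware evidence
                for nb in graph.get(d, ()):
                    # every neighbor label contributes, weighted by its
                    # current probability
                    score *= sum(compat[c][c2] * belief[nb][c2] for c2 in labels)
                scores[c] = score
            updated[d] = normalize(scores)
        belief = updated
    return belief


# Hypothetical toy example: content alone favors "B" for x,
# but both labeled neighbors are "A".
compat = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
local = {"x": {"A": 0.4, "B": 0.6}}
graph = {"x": {"t1", "t2"}, "t1": {"x"}, "t2": {"x"}}
belief = soft_labeling(local, graph, compat, fixed={"t1": "A", "t2": "A"})
```

In this toy case the neighborhood evidence outweighs the content-only estimate, so x ends up with most of its probability mass on "A".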

12
Graph-Based Classification
  • Hard labeling
  • In contrast to the soft labeling approach, we
    also consider a method that takes into account
    only the most probable label assignments in the
    test document's neighborhood as significant for
    the computation.
  • For each neighbor, only its most probable label
    is used.
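
A minimal sketch of hard labeling, under the same kind of assumptions as any generic relaxation labeling loop: `compat[c][c2]` is a hypothetical compatibility weight between a node's label and a neighbor's label, and known labels are held fixed. The only point illustrated is that each neighbor is reduced to its single most probable label before it influences a document.

```python
def hard_labeling(local, graph, compat, fixed=None, rounds=10):
    """Iteratively assign one label per document using neighbors' current labels.

    local:  {doc: {label: content-only probability}} for unlabeled docs
    graph:  {doc: set of neighbor docs}
    compat: {label: {neighbor_label: compatibility weight}} (assumed input)
    fixed:  {doc: known label}, kept constant during the iterations
    """
    fixed = fixed or {}
    labels = list(compat)
    # Start from the content-only decision for every unlabeled document.
    assign = {d: max(labels, key=local[d].get) for d in local}
    assign.update(fixed)
    for _ in range(rounds):
        updated = dict(assign)
        for d in local:
            if d in fixed:
                continue
            scores = {}
            for c in labels:
                score = local[d][c]
                for nb in graph.get(d, ()):
                    # only the neighbor's current (rounded) label counts
                    score *= compat[c][assign[nb]]
                scores[c] = score
            updated[d] = max(labels, key=scores.get)
        if updated == assign:
            break  # no label changed, so further rounds would not either
        assign = updated
    return assign


# Hypothetical toy example.
compat = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
local = {"x": {"A": 0.4, "B": 0.6}}
graph = {"x": {"t1", "t2"}, "t1": {"x"}, "t2": {"x"}}
assignment = hard_labeling(local, graph, compat, fixed={"t1": "A", "t2": "A"})
```

Rounding every neighbor to one label makes each round cheap, at the price of discarding how confident the neighbor's label actually is.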

13
Graph-Based Classification
  • Soft Labeling
  • The soft labeling approach aims to achieve
    better accuracy of the classification by avoiding
    the overly eager rounding that the hard
    labeling approach does.

14
Incorporating Metric Label Distance
  • Intuitively, neighboring documents should receive
    similar class labels.
  • For example, suppose we have a set of classes
    {C, E, S} and we wish to find the most probable
    label for a test document d.
  • A document discussing scientific problems (S)
    would be much farther away from both C and E
    than C and E are from each other.
  • So, a similarity metric imposed on the set of
    labels would have high values for the pair (C,E)
    and small values for the class pairs (C,S) and
    (E,S).

15
Incorporating Metric Label Distance
  • This is why introducing a metric on the labels
    should help improve the classification result.
    In this metric, similar classes are separated by
    a shorter distance and impose a smaller
    separation cost on an edge labeling.
  • Our approach is general: we construct the metric
    automatically from the training data.
  • We incorporate the label metric into the
    iterations for computing the probability of an
    edge labeling by treating the metric value for
    the two labels on an edge as a scaling factor.

16
Incorporating Metric Label Distance
  • This way, we magnify the impact of edges between
    nodes with similar labels and scale down the
    impact of edges between dissimilar ones.
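
The scaling idea can be illustrated with a small sketch. Everything named here is an assumption for the example: `neighbor_factor` is a hypothetical helper computing one neighbor's contribution to a candidate label, `compat` is a compatibility table, and the similarity values for the classes C, E and S are made up rather than learned from training data.

```python
def neighbor_factor(label, neighbor_belief, compat, similarity=None):
    """Contribution of one neighbor to the score of a candidate label.

    neighbor_belief: {neighbor_label: probability} for that neighbor
    compat:          {label: {neighbor_label: compatibility weight}}
    similarity:      optional {label: {neighbor_label: similarity}} used
                     as a multiplicative scaling factor (assumed values)
    """
    total = 0.0
    for nb_label, prob in neighbor_belief.items():
        weight = compat[label][nb_label]
        if similarity is not None:
            weight *= similarity[label][nb_label]  # scale by label similarity
        total += weight * prob
    return total


labels = ["C", "E", "S"]
# Uniform compatibility, so only the similarity scaling has an effect here.
compat = {a: {b: 1.0 for b in labels} for a in labels}
# Made-up similarities: C and E are close, S is far from both.
similarity = {
    "C": {"C": 1.0, "E": 0.8, "S": 0.1},
    "E": {"C": 0.8, "E": 1.0, "S": 0.1},
    "S": {"C": 0.1, "E": 0.1, "S": 1.0},
}
neighbor = {"C": 0.0, "E": 1.0, "S": 0.0}  # a neighbor known to be in class E

factor_c = neighbor_factor("C", neighbor, compat, similarity)
factor_s = neighbor_factor("S", neighbor, compat, similarity)
```

With these values, an E-labeled neighbor supports the similar label C far more strongly than the dissimilar label S, which is the magnify and scale-down effect described above.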

17
Experiments
  • We have tested our graph-based classifier on
    three different data sets.
  • The first one includes approximately
    16,000 scientific publications chosen from the
    DBLP database.
  • The second dataset has been selected from the
    Internet Movie Database (IMDB).
  • The third dataset used in the experiments was the
    online encyclopedia Wikipedia.

23
Conclusion
  • The presented GC method for graph-based
    classification is a way of exploiting context
    relationships of data items.
  • Incorporating metric distances among different
    labels contributed to the very good performance
    of the GC method.
  • This is one new form of exploiting knowledge
    about the relationships among category labels and
    thus the structure of the classifier's target
    space.