Title: Graph-based Text Classification: Learn from Your Neighbors
1. Graph-based Text Classification: Learn from Your Neighbors
Ralitsa Angelova, Gerhard Weikum
Max Planck Institute for Informatics
Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany
Presented by Chia-Hao Lee
2. Outline
- Introduction
- Graph-based Classification
- Incorporating Metric Label Distances
- Experiments
- Conclusion
3. Introduction
- Automatic classification is a supervised learning technique for assigning thematic categories to data items such as customer records, gene-expression data records, Web pages, or text documents.
- The standard approach is to represent each data item by a feature vector and learn the parameters of a mathematical decision model.
- Context-free: the decision is based only on the feature vector of a given data item, disregarding the other data items in the test set.
4. Introduction
- In many settings, this context-free approach does not exploit the available information about relationships between data items.
- Using the relationship information, we can construct a graph G in which each data item is a node and each relationship instance forms an edge between the corresponding nodes.
- In the following we will mostly focus on text documents with links to and from other documents.
5. Introduction
- A straightforward approach to capturing a document's neighbors would be to incorporate the features and feature weights of the neighbors into the feature vector of the given document itself (see the sketch below).
- A more advanced approach is to model the mutual influence between neighboring documents, aiming to estimate the class labels of all test documents simultaneously.
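The first, straightforward approach can be pictured in a few lines. The following is a minimal sketch, not taken from the paper: it assumes bag-of-words feature vectors stored as dictionaries and an illustrative damping factor alpha for the neighbor contributions.

```python
from collections import defaultdict

def augment_with_neighbors(features, graph, alpha=0.5):
    """Fold each document's neighbor features into its own feature vector.

    features: dict doc_id -> dict term -> weight (e.g. tf-idf)
    graph:    dict doc_id -> set of neighboring doc_ids
    alpha:    illustrative damping factor for neighbor contributions (assumed)
    """
    augmented = {}
    for d, vec in features.items():
        new_vec = defaultdict(float, vec)
        for n in graph.get(d, ()):
            for term, w in features.get(n, {}).items():
                new_vec[term] += alpha * w   # add down-weighted neighbor terms
        augmented[d] = dict(new_vec)
    return augmented

# Toy usage: d2 inherits damped terms from its neighbor d1.
features = {"d1": {"graph": 1.0, "label": 0.5}, "d2": {"movie": 1.0}}
graph = {"d1": {"d2"}, "d2": {"d1"}}
print(augment_with_neighbors(features, graph)["d2"])
```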
6. Introduction
- A simple example for relaxation labeling (RL) is shown in Figure 1.
- Let C be our set of class labels.
- We wish to assign to every document marked "?" its most probable label.
- Let the contingency matrix in Figure 1(b) be estimated from the training data.
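As a rough illustration of how a contingency matrix like the one in Figure 1(b) could be estimated, the sketch below counts the class pairs observed on the training edges and row-normalizes the counts. The function name and the symmetric treatment of links are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def contingency_matrix(edges, labels, classes):
    """Estimate P[neighbor has class c_j | node has class c_i] from training edges.

    edges:   iterable of (u, v) pairs between labeled training documents
    labels:  dict doc_id -> class label
    classes: ordered list of class labels
    (All names here are illustrative, not the paper's notation.)
    """
    idx = {c: k for k, c in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)))
    for u, v in edges:
        i, j = idx[labels[u]], idx[labels[v]]
        counts[i, j] += 1
        counts[j, i] += 1                     # treat each link as undirected
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)   # row-normalize to conditionals

classes = ["A", "B"]
labels = {"d1": "A", "d2": "A", "d3": "B"}
edges = [("d1", "d2"), ("d2", "d3")]
print(contingency_matrix(edges, labels, classes))
```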
7. Introduction
- The theory paper by Kleinberg and Tardos views the classification problem for nodes in an undirected graph as a metric labeling problem, where the goal is to minimize a combinatorial objective consisting of assignment costs and separation costs.
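For reference, the metric labeling objective can be written as follows, where c(u, f(u)) is the cost of assigning label f(u) to node u, w_uv is the weight of edge (u, v), and δ is a metric on the label set:

```latex
% Metric labeling: minimize assignment costs plus separation costs
\min_{f:\,V \to L} \;\; \sum_{u \in V} c\bigl(u, f(u)\bigr)
\;+\; \sum_{(u,v) \in E} w_{uv}\, \delta\bigl(f(u), f(v)\bigr)
```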
8. Graph-Based Classification
- Our approach is based on a probabilistic formulation of the classification problem and uses a relaxation labeling technique to derive two major approaches for finding the most likely labeling of the given test graph: hard and soft labeling.
- D: a set of documents.
- G: a graph whose vertices correspond to the documents in D and whose edges represent the link structure of D.
- c(u): the label of node u.
- τ(d): the feature vector that locally captures the content of document d.
9. Graph-Based Classification
- Taking into account the underlying link structure and document d's content-based feature vector, the probability of a label c_i being assigned to d is P[c(d) = c_i | G, τ(d)].
- In the spirit of the introduction's discussion on emphasizing the influence of the immediate neighbors of each document, we restrict attention to d's immediate neighborhood and denote it by N(d).
- The label of d is assumed to be independent of the labels of all other nodes in the graph given the labels of its immediate neighbors. We therefore abbreviate P[c(d) = c_i | G, τ(d)] into P[c(d) = c_i | N(d), τ(d)].
10. Graph-Based Classification
- We abbreviate P[c(d) = c_i | τ(d)], the graph-unaware probability based only on d's local content, by P_i(d).
- Under the additional independence assumption that there is no direct coupling between the content of a document and the labels of its neighbors, the following central equation holds for the total probability P[c(d) = c_i], summing up the posterior probabilities over all possible labelings L of the neighborhood N(d):

  P[c(d) = c_i] = Σ_L P[c(d) = c_i | τ(d), L] · P[L]
11. Graph-Based Classification
- In the same vein, if we further assume independence among all neighbor labels of the same node, we reach the following formulation of our neighborhood-conscious classification problem:

  P[c(d) = c_i] = Σ_L P[c(d) = c_i | τ(d), L] · ∏_{u ∈ N(d)} P[c(u) = L(u)]

- This can be computed in an iterative manner as follows:

  P^(r+1)[c(d) = c_i] = Σ_L P[c(d) = c_i | τ(d), L] · ∏_{u ∈ N(d)} P^(r)[c(u) = L(u)]
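The sum over all labelings L of N(d) grows exponentially with the neighborhood size, but under the neighbor-independence assumption it factorizes per neighbor when the coupling between a node's label and each neighbor's label is expressed through a class-class compatibility matrix (such as the contingency matrix sketched earlier). The following is a minimal sketch of that factorized iterative update under these assumptions; it is an illustration, not the paper's exact estimator.

```python
import numpy as np

def soft_label_iteration(P_local, graph, A, iterations=10):
    """Relaxation-labeling style update of per-document label distributions.

    P_local: dict doc_id -> np.array with content-only probabilities P_i(d)
    graph:   dict doc_id -> list of neighboring doc_ids
    A:       class-class compatibility matrix, A[i, j] ~ P[neighbor c_j | node c_i]
    (Illustrative, factorized variant of the update; assumed, not from the paper.)
    """
    P = {d: p.copy() for d, p in P_local.items()}
    for _ in range(iterations):
        P_new = {}
        for d, p0 in P_local.items():
            # For each candidate class i, multiply the expected compatibility
            # with every neighbor's current label distribution.
            neigh = np.ones_like(p0)
            for n in graph.get(d, ()):
                neigh *= A @ P[n]          # sum_j A[i, j] * P^(r)[c(n) = c_j]
            scores = p0 * neigh            # combine with the content-only prior
            P_new[d] = scores / scores.sum()
        P = P_new
    return P

# Toy usage: two linked test documents, two classes that tend to link to themselves.
P_local = {"d1": np.array([0.7, 0.3]), "d2": np.array([0.4, 0.6])}
graph = {"d1": ["d2"], "d2": ["d1"]}
A = np.array([[0.8, 0.2], [0.2, 0.8]])
print(soft_label_iteration(P_local, graph, A))
```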
12. Graph-Based Classification
- Hard labeling
- In contrast to the soft labeling approach, we also consider a method that takes only the most probable label assignments in the test document's neighborhood to be significant for the computation.
- Let c*(u) = argmax_j P[c(u) = c_j] be the most probable label of a neighbor u; only these hard assignments enter the update.
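A minimal sketch of this hard labeling variant, reusing the same assumed inputs as the soft labeling sketch above: at each iteration every document is frozen at its currently most probable label, and only those hard assignments influence its neighbors.

```python
import numpy as np

def hard_label_iteration(P_local, graph, A, iterations=10):
    """Hard-labeling variant: only each neighbor's single most probable label
    (its current argmax) contributes to the update.

    Arguments as in the soft-labeling sketch above (illustrative, assumed).
    """
    P = {d: p.copy() for d, p in P_local.items()}
    for _ in range(iterations):
        # Freeze every document at its currently most probable label.
        best = {d: int(np.argmax(p)) for d, p in P.items()}
        P_new = {}
        for d, p0 in P_local.items():
            neigh = np.ones_like(p0)
            for n in graph.get(d, ()):
                neigh *= A[:, best[n]]     # compatibility with the hard label only
            scores = p0 * neigh
            P_new[d] = scores / scores.sum()
        P = P_new
    return P
```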
13. Graph-Based Classification
- Soft labeling
- The soft labeling approach aims to achieve better classification accuracy by avoiding the overly eager "rounding" performed by the hard labeling approach.
14. Incorporating Metric Label Distance
- Intuitively, neighboring documents should receive similar class labels.
- For example, suppose we have a set of classes {C, E, S} and we wish to find the most probable label for a test document d.
- Classes C and E are thematically close to each other, whereas a document discussing scientific problems (S) would be much farther away from both C and E.
- So, a similarity metric imposed on the set of labels would have a high value for the pair (C, E) and small values for the class pairs (C, S) and (E, S).
15. Incorporating Metric Label Distance
- This is why introducing a label metric should help improve the classification result. In this metric, similar classes are separated by a shorter distance and impose a smaller separation cost on an edge labeling.
- Our approach, in contrast, is general: we construct the metric automatically from the training data.
- We incorporate the label metric into the iterations for computing the probability of an edge labeling by treating the similarity between the two endpoint labels as a scaling factor.
16. Incorporating Metric Label Distance
- This way, we magnify the impact of edges between nodes with similar labels and scale down the impact of edges between dissimilar ones (a sketch follows below).
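One plausible way to realize this scaling, sketched under illustrative assumptions: derive a label-to-label similarity from the training data (here via cosine similarity of class centroid vectors, an assumed choice rather than necessarily the paper's construction) and multiply it into the class compatibility matrix used by the iterative updates, so that agreement between similar labels is magnified and agreement between dissimilar labels is damped.

```python
import numpy as np

def label_similarity_from_centroids(centroids):
    """Derive a label-to-label similarity matrix from training data.

    centroids: np.array of shape (num_classes, num_terms), e.g. the mean
    tf-idf vector of the training documents in each class (an illustrative
    choice; the paper also builds its metric from training data, but not
    necessarily via class centroids).
    """
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    unit = centroids / np.maximum(norms, 1e-12)
    return unit @ unit.T              # cosine similarity: high for similar classes

def scaled_compatibility(A, sim):
    """Scale the class compatibility matrix by the label similarity so that
    edges between similar labels count more and dissimilar ones count less."""
    scaled = A * sim
    return scaled / scaled.sum(axis=1, keepdims=True)
```

The rescaled matrix returned by scaled_compatibility can then stand in for A in the soft and hard labeling sketches above.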
17. Experiments
- We have tested our graph-based classifier on three different datasets.
- The first one includes approximately 16,000 scientific publications chosen from the DBLP database.
- The second dataset has been selected from the Internet Movie Database (IMDB).
- The third dataset used in the experiments was the online encyclopedia Wikipedia.
18-22. Experiments (result figures and tables from the original slides)
23. Conclusion
- The presented GC method for graph-based classification is a way of exploiting context relationships of data items.
- Incorporating metric distances among different labels contributed to the very good performance of the GC method.
- This is a new form of exploiting knowledge about the relationships among category labels and thus the structure of the classifier's target space.