Title: A Neural Network Approach to Topic Spotting
1A Neural Network Approach to Topic Spotting
- Presented by Loulwah AlSumait
- INFS 795 Spec. Topics in Data Mining
- 4.14.2005
2Article Information
- Published in
- Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995
- Authors
- Wiener, E.
- Pedersen, J.O.
- Weigend, A.S.
- 54 citations
3Summary
- Introduction
- Related Work
- The Corpus
- Representation
- Term Selection
- Latent Semantic Indexing
- Generic LSI
- Local LSI
- Cluster-Directed LSI
- Topic-Directed LSI
- Relevancy Weighting LSI
4Summary
- Neural Network Classifier
- Neural Networks for Topic Spotting
- Linear vs. Non-Linear Networks
- Flat Architecture vs. Modular Architecture
- Experiment Results
- Evaluating Performance
- Results Discussion
5Introduction
- Topic Spotting = Text Categorization = Text Classification
- The problem of identifying which of a set of predefined topics are present in a natural language document.
(Diagram: a document is mapped to one or more of Topic 1, Topic 2, ..., Topic n.)
6Introduction
- Classification Approaches
- Expert system approach
- manually construct a system of inference rules on top of a large body of linguistic and domain knowledge
- could be extremely accurate
- very time consuming
- brittle to changes in the data environment
- Data driven approach
- induce a set of rules from a corpus of labeled training documents
- works better in practice
7Introduction Related Work
- The major remarks regarding the related work:
- A separate classifier was constructed for each topic.
- A different set of terms was used to train each classifier.
8Introduction The Corpus
- Reuters-22173 corpus of Reuters newswire stories from 1987
- 21,450 stories
- 9,610 for training
- 3,662 for testing
- mean length 90.6 words, SD 91.6
- 92 topics appeared at least once in the training set
- The mean is 1.24 topics/doc. (up to 14 topics for some documents)
- 11,161 unique terms after preprocessing:
- inflectional stemming
- stop word removal
- conversion to lower case
- elimination of words that appeared in fewer than three documents
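This preprocessing pipeline can be approximated with standard tooling. A minimal sketch, assuming NLTK and scikit-learn (the exact stemmer and stop list used in the paper are not stated):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# one-time setup: nltk.download('punkt'); nltk.download('stopwords')
stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def analyze(doc):
    # conversion to lower case, stop word removal, inflectional stemming
    tokens = nltk.word_tokenize(doc.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

# min_df=3 eliminates words that appear in fewer than three documents
vectorizer = CountVectorizer(analyzer=analyze, min_df=3)

# X = vectorizer.fit_transform(training_documents)  # word-frequency counts, one row per document
```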
9Representations
- starting point
- Document Profile: a term-by-document matrix containing word frequency entries
10Representation
Thorsten Joachims. 1997. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. http://citeseer.ist.psu.edu/joachims97text.html
11Representation - Term Selection
- the subset of the original terms that are most useful for the classification task
- Difficult to select terms that discriminate between 92 classes while being small enough to serve as the feature set for a neural network
- Divide the problem into 92 independent classification tasks
- Search for the best discriminator terms between documents with the topic and those without
12Representation - Term Selection
- Relevancy Score
- measures how unbalanced the term is across documents with or without the topic
- based on the fraction (no. of documents with topic t that contain term k) / (total no. of documents with topic t), compared against the corresponding fraction for documents without the topic
- Highly +ve and highly -ve scores indicate useful terms for discrimination
- using about 20 terms yielded the best classification performance
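The slide does not reproduce the score's formula, only the fraction it is built from. A sketch of one plausible reading, using a smoothed log-odds comparison of the two document fractions (the exact form in the paper may differ):

```python
import numpy as np

def relevancy_score(term_in_doc, has_topic, eps=0.01):
    """term_in_doc, has_topic: boolean arrays over documents.
    Compares P(term | topic) against P(term | not topic)."""
    p_pos = (term_in_doc & has_topic).sum() / max(has_topic.sum(), 1)      # fraction of topic docs containing the term
    p_neg = (term_in_doc & ~has_topic).sum() / max((~has_topic).sum(), 1)  # same fraction for non-topic docs
    return np.log((p_pos + eps) / (p_neg + eps))  # highly +ve or -ve -> useful discriminator

def select_terms(X_bool, has_topic, n_terms=20):
    """X_bool: (docs x terms) boolean occurrence matrix. Keep the ~20 terms with the largest |score|."""
    scores = np.array([relevancy_score(X_bool[:, k], has_topic) for k in range(X_bool.shape[1])])
    return np.argsort(-np.abs(scores))[:n_terms]
```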
13Representation - Term Selection
14Representation - Term Selection
- Advantages
- little computation is required
- resulting features have direct interpretability
- Drawbacks
- many of the best individual predictors contain redundant information
- a term which may appear to be a very poor predictor on its own may turn out to have great discriminative power in combination with other terms, and vice versa
- e.g. Apple vs. Apple Computers
- Selected Term Representation (TERMS) with 20 features
15Representation LSI
- Transform the original documents to a lower-dimensional space by analyzing the correlational structure of terms in the document collection
- (Training set) apply a singular-value decomposition (SVD) to the original term-by-document matrix → get U, Σ, V
- (Test set) transform document vectors by projecting them into the LSI space
- Property of LSI: higher dimensions capture less of the variance of the original data → they can be dropped with minimal loss
- Found that performance continues to improve up to at least 250 dimensions
- Improvement slows down rapidly after about 100 dimensions
- Generic LSI Representation (LSI) with 200 features
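A minimal sketch of the generic LSI step on a toy term-by-document matrix (the sizes and the number of dimensions kept are placeholders; the paper keeps 200):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((300, 120))     # toy term-by-document matrix (terms x training docs)
k = 10                         # LSI dimensions kept (200 in the paper)

# SVD of the training matrix: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]                 # top-k left singular vectors (term space)

# LSI representation of the training documents (k x n_docs)
train_lsi = U_k.T @ X

# a test document (raw term-count vector) is projected into the same LSI space
d_test = rng.random(300)
d_lsi = U_k.T @ d_test         # k-dimensional feature vector for the classifier
```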
16Representation LSI
(Diagram: a single SVD over the full Reuters corpus yields the generic LSI representation with 200 features, shared by all topic classifiers, e.g. wool, wheat, money-supply, gold, barley, zinc.)
17Representation Local LSI
- Global LSI performs worse as topic frequency decreases
- infrequent topics are usually indicated by infrequent terms, and infrequent terms may be projected out of the LSI representation as mere noise
- Proposed two task-directed methods that make use of prior knowledge of the classification task
18Representation Local LSI
- What is Local LSI?
- model only the local portion of the corpus related to a group of topics
- the local region includes documents that use terminology related to the topics (they need not have any of the topics assigned)
- perform SVD over only the local set of documents
- the representation is more sensitive to small, localized effects of infrequent terms
- the representation is more effective for classification of topics related to that local structure
19Representation Local LSI
- Types of Local LSI
- Cluster-Directed representation
- 5 meta-topics (clusters): Agriculture, Energy, Foreign Exchange, Government, and Metals
- How to construct the local region?
- Break the corpus into 5 clusters → each contains all documents on the corresponding meta-topic
- Perform SVD for each meta-topic region
- Cluster-Directed LSI Representation (CD/LSI) with 200 features
20Representation Local LSI
(Diagram: the single global SVD over the full Reuters corpus, repeated for comparison with the cluster-directed approach on the next slide.)
21Representation Local LSI
(Diagram: the Reuters corpus is split into five meta-topic regions: Agriculture (e.g. wool, wheat, barley), Foreign Exchange (e.g. money-supply), Metals (e.g. zinc, gold), Energy, and Government. A separate SVD is performed over each region, giving the Cluster-Directed LSI Representation (CD/LSI) with 200 features.)
22Representation Local LSI
- Types of Local LSI
- Topic-Directed representation
- a more fine-grained approach to local LSI
- separate representation for each topic
- How to construct the local region?
- Use the 100 most predictive terms for the topic
- Pick the N most similar documents, N = 5 × (no. of documents containing the topic), with 350 ≥ N ≥ 110
- Final documents in the topic region = the N documents + 150 random documents
- Topic-Directed LSI Representation (TD/LSI) with 200 features
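A sketch of how the topic-directed local region might be assembled; the similarity measure and variable names are assumptions, while the N bounds and the 150 random documents follow the slide:

```python
import numpy as np

def topic_directed_region(X, top_term_idx, topic_doc_idx, rng):
    """X: term-by-document count matrix; top_term_idx: indices of the ~100 most
    predictive terms; topic_doc_idx: indices of documents tagged with the topic."""
    # score every document by how much of the topic's predictive vocabulary it uses
    sims = X[top_term_idx, :].sum(axis=0)

    # N = 5 x (number of documents containing the topic), clipped to 110..350
    N = int(np.clip(5 * len(topic_doc_idx), 110, 350))
    local = np.argsort(-sims)[:N]

    # plus 150 random documents from the rest of the corpus
    rest = np.setdiff1d(np.arange(X.shape[1]), local)
    random_extra = rng.choice(rest, size=min(150, len(rest)), replace=False)

    region = np.concatenate([local, random_extra])
    return region   # the SVD is then performed over X[:, region] only
```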
23Representation Local LSI
(Diagram: the single global SVD over the full Reuters corpus, repeated for comparison with the topic-directed approach on the next slide.)
24Representation Local LSI
(Diagram: Topic-Directed LSI Representation (TD/LSI) with 200 features: a separate SVD is performed over a local document region for each individual topic, e.g. wool, wheat, money-supply, zinc, barley, gold.)
25Representation Local LSI
- Drawbacks of Local LSI
- the narrower the region, the lower the flexibility of the representation for modeling the classification of multiple topics
- high computational overhead
26Representation - Relevancy Weighting LSI
- Use term weights to emphasize the importance of particular terms before applying SVD
- IDF weighting
- increases the importance of low frequency terms
- decreases the importance of high frequency terms
- Assumes low frequency terms are better discriminators than high frequency terms
27Representation - Relevancy Weighting LSI
- Relevancy Weighting
- tunes the IDF assumption
- emphasizes terms in proportion to their estimated topic discrimination power
- Global Relevancy Weighting of term k (GRW_k)
- Final weighting of term k = IDF² × GRW_k
- all low frequency terms are pulled up by IDF
- poor predictors are pushed down by GRW
- leaving only the relevant low frequency terms with high weights
- Relevancy Weighted LSI Representation (REL/LSI) with 200 features
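A sketch of the relevancy-weighting step: term rows are scaled by IDF squared times a global relevancy weight before the SVD. The slide does not give the GRW formula, so it appears here only as a placeholder vector:

```python
import numpy as np

rng = np.random.default_rng(1)
X = (rng.random((300, 120)) < 0.05).astype(float)  # toy term-by-document matrix

n_docs = X.shape[1]
df = np.maximum((X > 0).sum(axis=1), 1)            # document frequency of each term
idf = np.log(n_docs / df)                          # standard IDF; the exact variant may differ

# GRW_k: estimated topic-discrimination power of term k -- placeholder values here;
# in the paper it is derived from the per-topic relevancy scores
grw = rng.random(X.shape[0])

weights = idf**2 * grw                             # final weighting of term k = IDF^2 * GRW_k
X_weighted = X * weights[:, None]                  # applied to the matrix before the SVD
```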
28Neural Network Classifier (NN)
- NN consists of
- processing units (Neurons)
- weighted links connecting neurons
29Neural Network Classifier (NN)
- Major components of an NN model
- architecture: defines the functional form relating input to output
- network topology
- unit connectivity
- activation functions, e.g. the logistic regression function
30Neural Network Classifier (NN)
- Logistic regression function: p = 1 / (1 + e^(-z))
- z is a linear combination of the input features
- p ∈ (0, 1)
- can be converted to a binary classification method by thresholding the output probability
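A minimal sketch of the logistic output unit and the thresholding step (weights and features are arbitrary placeholder values):

```python
import numpy as np

w = np.array([0.8, -1.2, 0.3])   # learned weights (placeholder values)
b = -0.1                         # bias
x = np.array([0.5, 0.2, 1.0])    # feature vector of a document

z = w @ x + b                    # z: a linear combination of the input features
p = 1.0 / (1.0 + np.exp(-z))     # logistic function, p in (0, 1)

assign_topic = p >= 0.5          # binary decision by thresholding the output probability
```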
31Neural Network Classifier (NN)
- Major components of an NN model (cont.)
- search algorithm: the search in weight space for a set of weights which minimizes the error between the actual output and the expected output (the TRAINING PROCESS)
- Backpropagation method
- Mean squared error
- Cross-entropy error performance function:
- C = - Σ over all cases and outputs of ( d·log(y) + (1 - d)·log(1 - y) )
- d = desired output, y = actual output
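A sketch of the cross-entropy error and the gradient step that training minimizes (toy data, a single logistic output unit, plain gradient descent rather than full backpropagation):

```python
import numpy as np

def cross_entropy(d, y, eps=1e-12):
    # C = -sum over cases of [ d*log(y) + (1-d)*log(1-y) ]
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))

rng = np.random.default_rng(0)
X = rng.random((50, 20))                    # 50 documents, 20 features (e.g. TERMS)
d = (rng.random(50) < 0.2).astype(float)    # desired outputs: topic present / absent
w = np.zeros(20)

for _ in range(100):
    y = 1.0 / (1.0 + np.exp(-(X @ w)))      # network outputs
    loss = cross_entropy(d, y)              # error being minimized
    grad = X.T @ (y - d)                    # gradient of the cross-entropy w.r.t. w
    w -= 0.01 * grad                        # gradient-descent update
```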
32NN for Topic Spotting
- Network outputs are estimates of the probability of topic presence given the feature vector of a document
- Generic LSI representation
- each network uses the same representation
- Local LSI representation
- a different representation for each network
33NN for Topic Spotting
- Linear NN
- Output units with logistic activation and no hidden layer
34NN for Topic Spotting
- Non Linear NN
- Simple networks with a single hidden layer of logistic sigmoid units (6 to 15 units)
35NN for Topic Spotting
- Flat Architecture
- Separate network for each topic
- use the entire training set to train each topic network
- Avoid the overfitting problem by
- adding a penalty term to the cross-entropy cost function to encourage the elimination of small weights
- early stopping based on cross-validation
36NN for Topic Spotting
- Modular Architecture
- decompose the learning problem into smaller problems
- Meta-topic network trained on the full training set
- estimates the presence probability of the five meta-topics in a document
- uses 15 hidden units
37NN for Topic Spotting
- Modular Architecture
- five groups of local topic networks
- consist of a local topic network for each topic in the meta-topic
- each network is trained only on the meta-topic region
38NN for Topic Spotting
- Modular Architecture
- five groups of local topic networks (cont.)
- Example: the wheat network is trained on the Agriculture meta-topic region.
- Focus on finer distinctions, e.g. wheat vs. grain
- Don't waste time on easier distinctions, e.g. wheat vs. gold
- Each local topic network uses 6 hidden units.
39NN for Topic Spotting
- Modular Architecture
- To compute topic predictions for a given document
- Present the document to the meta-topic network
- Present the document to each of the local topic networks
- The outputs of the meta-topic network modulate the topic networks' outputs to give the final topic estimates (see the sketch below)
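A sketch of how the modular prediction might be combined; the slide says the meta-topic network's outputs feed into the final topic estimates, and a simple product of meta-topic and local-topic probabilities is assumed here:

```python
# outputs of the meta-topic network for one document: P(meta-topic | doc)
meta_probs = {"agriculture": 0.9, "energy": 0.05, "metals": 0.1}

# outputs of the local topic networks, grouped by meta-topic
local_probs = {
    "agriculture": {"wheat": 0.8, "wool": 0.1, "barley": 0.3},
    "energy": {"crude": 0.6},
    "metals": {"gold": 0.2, "zinc": 0.05},
}

# assumed combination rule: final topic estimate = meta-topic probability x local estimate
final = {topic: meta_probs[m] * p
         for m, topics in local_probs.items()
         for topic, p in topics.items()}
```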
40Experimental Results
- Evaluating Performance
- Mean squared error between actual and predicted values is an insufficient performance measure
- Compute precision and recall based on a contingency table constructed over a range of decision thresholds
- How to get the decision thresholds?
41Experimental Results
- Evaluating Performance
- How to get the decision Thresholds?
- Proportional assignment
- Contingency table columns: Topic = wool vs. Topic ≠ wool
- Predict Topic = wool iff the output probability ≥ θ, where θ is the output probability of the (k·p)-th highest-ranked document; k is an integer and p is the prior probability of the wool topic
- Predict Topic ≠ wool iff the output probability < θ
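A sketch of proportional assignment for one topic; interpreting the (k·p)-th highest-ranked document as a count of k × p × (number of test documents) is an assumption:

```python
import numpy as np

def proportional_assignment(scores, prior, k):
    """scores: network output probabilities for one topic over all test documents.
    prior: the topic's prior probability (training-set frequency).
    k: integer multiplier used to sweep out the precision/recall trade-off."""
    n_assign = int(np.clip(round(k * prior * len(scores)), 1, len(scores)))
    ranked = np.sort(scores)[::-1]
    threshold = ranked[n_assign - 1]       # probability of the (k*p)-th highest-ranked doc
    return scores >= threshold             # predicted "topic = wool" decisions
```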
42Experimental Results
- Evaluating Performance
- How to get the decision thresholds?
- Fixed recall level approach
- determine a set of recall levels
- analyze the ranked documents to determine which decision thresholds lead to the desired set of recall levels
- Contingency table columns: Topic = wool vs. Topic ≠ wool
- Predict Topic = wool iff the output probability ≥ θ, where θ is the output probability of the document at which the number of documents with higher output probability leads to the desired (target) recall level
- Predict Topic ≠ wool iff the output probability < θ
43Experimental Results
- Performance by Microaveraging
- add all contingency tables together across topics at a certain threshold
- compute precision and recall
- used proportional assignment for picking decision thresholds
- does not weight the topics evenly
- used for comparisons to previously reported results
- The breakeven point is used as a summary value
44Experimental Results
- Performance by Macroaveraging
- compute precision and recall for each topic
- take the average across topics
- used a fixed set of recall levels
- summary values are obtained for particular topics by averaging precision over the 19 evenly spaced recall levels between 0.05 and 0.95
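A corresponding sketch of the macroaveraged summary: per-topic precision is taken at each of the 19 recall levels, averaged, and then averaged across topics:

```python
import numpy as np

def precision_at_recall(scores, truth, target_recall):
    """Precision for one topic when the threshold is set to reach target_recall."""
    order = np.argsort(-scores)
    truth = truth[order]
    tp = np.cumsum(truth)
    recall = tp / max(truth.sum(), 1)
    precision = tp / (np.arange(len(truth)) + 1)
    idx = np.searchsorted(recall, target_recall)   # first rank reaching the target recall
    return precision[min(idx, len(precision) - 1)]

def macro_average_precision(score_matrix, truth_matrix):
    """score_matrix, truth_matrix: (docs x topics). Average precision over the 19
    recall levels 0.05, 0.10, ..., 0.95, then average across topics."""
    levels = np.linspace(0.05, 0.95, 19)
    per_topic = [np.mean([precision_at_recall(score_matrix[:, t], truth_matrix[:, t], r)
                          for r in levels])
                 for t in range(score_matrix.shape[1])]
    return float(np.mean(per_topic))
```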
45Experimental Results
- Microaveraged performance
- Breakeven points compared to the best previously reported algorithm
- a rule induction method based on heuristic search, with a breakeven point of 0.789
- (Chart: breakeven points of 0.82, 0.801, 0.795, and 0.775 for the compared configurations.)
46Experimental Results
- Macroaveraged performance
- TERMS appears much closer to the other three representations
- The relative effectiveness of the representations at low recall levels is reversed at high recall levels
47 - Six techniques' performance on the 54 most frequent topics
- considerable variation of performance across topics
- relative ups and downs are mirrored in both plots
- slight improvement from nonlinear networks
- LSI performance degrades compared to TERMS as the topic frequency decreases
48Experimental Results
- Performance of combinations of techniques and the resulting improvements
(Table: each experiment pairs a document representation (TERMS, LSI, CD-LSI, TD-LSI, REL-LSI, or a Hybrid of CD-LSI + TERMS) with an NN architecture: flat linear, flat non-linear, modular linear, or modular non-linear, with the modular meta-topic network trained on the LSI representation; matching colors and shapes in the original slide indicate which combinations form an experiment.)
49Experimental Results
50Experimental Results
- Modular Networks
- only 4 of the clusters were used
- Recomputed average precision for the flat networks
52 The LSI representation is able to equal or exceed TERMS performance for high frequency topics, but performs poorly for low frequency topics.
53 Task-directed LSI representations improve performance in the low frequency domain. The TD/LSI trade-off is computational cost; the REL/LSI trade-off is lower performance on medium/high frequency topics.
54 Modular CD/LSI improves performance further for low frequency topics, because the individual networks are trained only on the domain over which the LSI was performed.
55 TERMS proves to be competitive with the more sophisticated LSI techniques → most topics are predictable from a small set of terms.
56Discussion
- A rich solution: many representations and many models
- A fully supervised approach
- Results are lower than expected
- Is the dataset responsible?
- High computational overhead
- Do NNs deserve a place in DM tool boxes?
- Questions?