1
Topics from Text
CS698 Course Project
  • by
  • Amit Awekar Y3111003
  • Niraj Kalamkar Y3111020

2
Contents
  • Problem Definition
  • Motivation
  • Naïve Bayesian Classifier
  • Methods used by others
  • TF-IDF
  • LSI
  • Unigram
  • Mixture of Unigrams
  • PLSI
  • Latent Dirichlet Allocation (LDA) and inference
    on LDA
  • Examples
  • Data Set

3
What is it?
  • Problem Definition
  • Given a set of documents, identify the set of
    topics they address and classify the documents
    according to those topics (possibly multiple per
    document)
  • Also analyze other aspects, such as changes in
    topics over time and the relation to a particular
    author (secondary task)
  • Type of Learning
  • Can be supervised or unsupervised
  • Statistical Language Model (Unigram method)

4
Motivation
  • Identifying Hot Topics
  • How topics have changed over time.
  • Assigning papers to reviewers
  • Can be related to authors to determine
  • area of expertise
  • similar authors
  • how unusual a paper is for a given author
  • Can be applied to other domains like videos, etc.

5
How to define topic?
  • Template
  • Define templates or samples for different topics
    and match them against a given document.
  • Syntax
  • Define syntax or primitives for different topics
    and check whether a given document follows them.
  • Statistical
  • Define different features for a topic and check
    to what degree a given document has those
    features.
  • The basic problem with template matching and
    syntactic techniques is that defining templates
    and syntax for a topic is itself difficult. The
    problem becomes even harder with synonyms and
    domain-specific details.

6
Naïve Bayes Classifier
  • Highly practical method
  • Performance shown to be comparable to other
    methods such as ANNs and decision tree learning.

7
General strategy
  • Let a1, a2, ..., an be the attributes and
    c1, c2, ..., cm be the classes
  • Task is to find
  • c_NB = argmax_cj P(cj | a1, a2, ..., an)
  • Using Bayes' rule
  • c_NB = argmax_cj P(a1, a2, ..., an | cj) P(cj) / P(a1, a2, ..., an)
  •      = argmax_cj P(a1, a2, ..., an | cj) P(cj)
  • If the attributes are assumed to be conditionally
    independent given the class
  • c_NB = argmax_cj P(cj) ∏i P(ai | cj)
    (a computational sketch follows below)
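A minimal Python sketch of the final argmax above, assuming the class priors P(cj) and conditional probabilities P(ai | cj) have already been estimated and are supplied as dictionaries (the function and parameter names are illustrative, not from the slides):

```python
import math

def naive_bayes_classify(attributes, classes, prior, cond_prob):
    """Return c_NB = argmax_cj P(cj) * prod_i P(ai | cj).

    prior:     dict mapping class -> P(cj)
    cond_prob: dict mapping (attribute_value, class) -> P(ai | cj)
    Assumes smoothed (non-zero) probabilities; logs are summed to
    avoid numerical underflow."""
    best_class, best_score = None, float("-inf")
    for c in classes:
        score = math.log(prior[c])
        for a in attributes:
            score += math.log(cond_prob[(a, c)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```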

8
Naïve Bayes for Classifying Text
  • Set of topics T = {t1, t2, ..., tm}
  • Made up of a set of documents D = {d1, d2, ..., dn}
  • Each document is made up of a set of words W = {w1, w2, ..., wk}
  • Need to calculate P(T | D)
  • We use Bayes' rule for this.

9
Naïve Bayes for Classifying Text
  • Training data is of the form (di, ti)
  • Preprocessing involves extracting the tokens:
  • Remove the most commonly occurring words.
  • Stem each word to get its root.
  • Parse the training corpus to build the dictionary.
  • Dictionary entries have the following form:
  • (wi, t1_cnt, t2_cnt, t3_cnt, ..., tm_cnt)
  • wi is the word and ti_cnt is its count of
    occurrences in topic ti
    (a preprocessing sketch follows below)
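A minimal sketch of this preprocessing step, assuming whitespace tokenization, a small illustrative stop-word list, and a placeholder suffix-stripping stemmer (a real system would likely use something like the Porter stemmer); none of these names come from the original slides:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}  # illustrative subset

def naive_stem(word):
    # Placeholder stemmer; a real system would likely use e.g. the Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_dictionary(training_data, topics):
    """training_data: list of (document_text, topic) pairs.
    Returns dict word -> per-topic counts, i.e. (wi, t1_cnt, ..., tm_cnt)."""
    dictionary = defaultdict(lambda: {t: 0 for t in topics})
    for text, topic in training_data:
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue
            dictionary[naive_stem(token)][topic] += 1
    return dictionary
```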

10
  • Algorithm
  • Training
  • Collect all tokens: parse the training files and stem the words
  • Prepare the dictionary of these words
  • Vocabulary = set of all distinct (non-common) words
    occurring in any text document
  • Calculate the required P(tj) and P(wk | tj)
    probability terms
  • For each target value tj in T do
  • Docsj = subset of the training documents with
    target value tj
  • P(tj) = |Docsj| / (number of training documents)
  • Textj = single document created by concatenating
    all members of Docsj
  • n = total number of word positions in Textj
  • For each word wk in the vocabulary
  • nk = number of times word wk occurs in Textj
  • P(wk | tj) = (nk + 1) / (n + |Vocabulary|)
    (see the sketch after this list)
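A sketch of the training procedure above, assuming the documents have already been tokenized, stemmed, and stop-word filtered; the function and variable names are illustrative:

```python
from collections import Counter

def train_naive_bayes(training_data, topics, vocabulary):
    """training_data: list of (token_list, topic) pairs, with tokens
    already stemmed and stop-word filtered.  Returns (priors, cond_probs),
    where cond_probs[tj][wk] = P(wk | tj) with add-one (Laplace) smoothing."""
    priors, cond_probs = {}, {}
    for t in topics:
        docs_t = [tokens for tokens, topic in training_data if topic == t]
        priors[t] = len(docs_t) / len(training_data)          # P(tj)
        text_t = [w for tokens in docs_t for w in tokens]     # Textj (concatenation)
        n = len(text_t)                                       # word positions in Textj
        counts = Counter(text_t)
        cond_probs[t] = {w: (counts[w] + 1) / (n + len(vocabulary))  # P(wk | tj)
                         for w in vocabulary}
    return priors, cond_probs
```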

11
  • Classification
  • Return the estimated target value for the
    document Doc
  • Positions = all word positions in Doc whose tokens
    are found in the vocabulary
  • Return t_NB, where
  • t_NB = argmax_{tj ∈ T} P(tj) ∏_{i ∈ Positions} P(wi | tj)
  • where wi denotes the word found at the ith
    position
    (a sketch follows below)
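The corresponding classification step, continuing the training sketch above; word positions whose tokens are not in the vocabulary are skipped, and log probabilities are summed for numerical stability:

```python
import math

def classify(doc_tokens, priors, cond_probs):
    """doc_tokens: preprocessed tokens of the document Doc.
    priors, cond_probs: output of train_naive_bayes above.
    Returns t_NB = argmax_tj P(tj) * prod_{i in Positions} P(wi | tj)."""
    best_topic, best_log_prob = None, float("-inf")
    for t, prior in priors.items():
        log_prob = math.log(prior)
        for w in doc_tokens:
            if w in cond_probs[t]:     # keep only positions found in the vocabulary
                log_prob += math.log(cond_probs[t][w])
        if log_prob > best_log_prob:
            best_topic, best_log_prob = t, log_prob
    return best_topic
```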

12
Naïve Bayes Classifier
  • Advantages.
  • Simple and elegant.
  • Very good results can be obtained.
  • Disadvantages.
  • Not Generative.
  • Training data is needed.
  • No. of Topics should be known in advance.

13
Methods Used by Others
  • TF-IDF
  • LSI
  • PLSI

14
TF-IDF
  • Term Frequency (TF)
  • n(d,t) = number of times term t occurs in document d
  • TF(d,t) = 0 if n(d,t) = 0
  •         = 1 + log(1 + log(n(d,t))) otherwise
  • Can be normalized
  • Inverse Document Frequency (IDF)
  • |D| = total number of documents
  • |Dt| = number of documents containing t
  • IDF(t) = log((1 + |D|) / |Dt|)
  • The coordinate of document d along term t is TF(d,t) · IDF(t)
  • Define a topic as a vector Q in this space and a
    document as a vector D
  • Find the cosine of the angle between the two
  • A higher cosine value indicates more relevance
    (a sketch follows below)
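A small sketch of the TF, IDF, and cosine formulas on this slide, with sparse vectors represented as term-to-weight dictionaries (an illustrative choice, not prescribed by the slides):

```python
import math

def tf(n_dt):
    # TF(d,t) = 0 if the term is absent, else 1 + log(1 + log(n(d,t)))
    return 0.0 if n_dt == 0 else 1.0 + math.log(1.0 + math.log(n_dt))

def idf(num_docs, num_docs_with_term):
    # IDF(t) = log((1 + |D|) / |Dt|)
    return math.log((1.0 + num_docs) / num_docs_with_term)

def cosine(q, d):
    # Cosine of the angle between two sparse vectors (dicts: term -> weight).
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```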

15
Problems with TF-IDF
  • High dimensionality
  • Its main purpose is to answer short queries

16
LSI
  • Calculate TF-IDF coordinates
  • Compute the Singular Value Decomposition (SVD):
    Ak = Uk Σk Vk^T
  • Thus dimensionality is reduced to k
  • Singular values = square roots of the eigenvalues of
    A^H A
    (sketch below)
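A minimal sketch of the rank-k truncation using NumPy's SVD, assuming A is a term-document TF-IDF matrix; the function name is illustrative:

```python
import numpy as np

def lsi_reduce(A, k):
    """Rank-k LSI approximation of a term-document TF-IDF matrix A.

    Uses the truncated SVD A_k = U_k Sigma_k V_k^T and returns the
    k-dimensional coordinates of each document (one column per document)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_coords = np.diag(s_k) @ Vt_k      # k x num_documents
    return U_k, s_k, doc_coords
```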

17
Problems with LSI
  • Storage
  • Need to store k-dimensional representation of
    each document.
  • Efficiency
  • Need to compare every topic with every document.

18
pLSI
  • Hofmann (1999) presented the probabilistic LSI
    (pLSI) model, also known as the aspect model, as
    an alternative to LSI.
  • Each word in a document is modeled as a sample from
    a mixture model, where the mixture components are
    multinomial random variables that can be viewed as
    representations of topics:
    P(d, w) = P(d) Σz P(w | z) P(z | d)
  • Thus each word is generated from a single topic,
    and different words in a document may be generated
    from different topics.
  • Each document is represented as a list of mixing
    proportions for these mixture components and is
    thereby reduced to a probability distribution over
    a fixed set of topics.
  • This distribution is the reduced description
    associated with the document.

19
Problems with pLSI
  • The number of parameters in the model grows
    linearly with the size of the corpus: kV + kD for
    k topics, a vocabulary of V words, and D documents
    (e.g. k = 100, V = 10,000, D = 1,000 gives about
    1.1 million parameters).
  • It is not clear how to assign probability to a
    document outside of the training set.

20
Latent Dirichlet Allocation
  • Assumes a document is a bag of words.
  • This is the assumption of exchangeability for the
    words in a document.
  • Also assumes that the documents are exchangeable,
    i.e. the ordering of the documents in the corpus
    can also be neglected.
  • Thus if we want to capture the exchangeability of
    both documents and words, we need to consider a
    mixture model that captures this exchangeability.
  • This line of thinking leads to the Latent
    Dirichlet Allocation (LDA) model.

21
LDA
  • Notation and terminology
  • A word is the basic unit, defined to be an item
    from a vocabulary indexed by {1, 2, ..., V}.
  • The ith word in the vocabulary is represented as a
    V-vector w such that w^i = 1 and w^j = 0 for j ≠ i.
  • A document is a sequence of N words denoted by
    w = (w1, w2, ..., wN), where wn is the nth word in
    the sequence.
  • A corpus is a collection of M documents
    D = {w1, w2, ..., wM}.

22
LDA
  • K = number of topics.
  • There are K underlying latent topics.
  • D = number of documents.
  • Each document is a mixture of the K topics.
  • V = number of words in the vocabulary.
  • Each topic is a probability distribution over the V
    words.
  • A document is generated by first sampling a topic
    and then sampling a word from the selected topic.

23
LDA
  • Assumes the following generative process for each
    document w in D:
  • Choose θ ~ Dir(α)
  • For each of the N words wn:
  • Choose a topic zn ~ Multinomial(θ)
  • Choose a word wn from p(wn | zn, β), a multinomial
    probability distribution conditioned on the topic
    zn, where βij = p(w^j = 1 | z^i = 1)
    (a sampling sketch follows below)
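A sketch of this generative process using NumPy, assuming alpha is a length-K Dirichlet parameter vector and beta is a K x V topic-word matrix; the function name is illustrative:

```python
import numpy as np

def generate_document(N, alpha, beta, rng=None):
    """Sample one document of N words from the LDA generative process.

    alpha: length-K Dirichlet parameter vector.
    beta:  K x V matrix with beta[i, j] = p(word j | topic i), rows summing to 1.
    Returns (theta, z, words)."""
    rng = rng or np.random.default_rng()
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                    # theta ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)              # z_n ~ Multinomial(theta)
    words = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ p(w | z_n, beta)
    return theta, z, words
```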

24
LDA (contd.)
  • P(θ | α) = (Γ(Σ_{i=1..k} αi) / ∏_{i=1..k} Γ(αi))
    θ1^(α1-1) ... θk^(αk-1)
  • P(θ, z, w | α, β) = P(θ | α) ∏_{n=1..N} P(zn | θ) P(wn | zn, β)
  • P(w | α, β) = ∫ P(θ | α) (∏_{n=1..N} Σ_{zn} P(zn | θ) P(wn | zn, β)) dθ
  • P(D | α, β) = ∏_{d=1..M} ∫ P(θd | α) (∏_{n=1..Nd} Σ_{z_dn}
    P(z_dn | θd) P(w_dn | z_dn, β)) dθd

25
Graphical Representation of LDA
26
Relation of LDA to other latent variable models
27
Griffiths Algorithm
  • T = number of topics
  • D = number of documents
  • V = number of words in the vocabulary
  • T is chosen such that log P(w | T) is maximized
  • Θ ~ Dir(α): a T-dimensional vector of topic
    proportions for each document
  • Φ ~ Dir(β): a V-dimensional word distribution for
    each topic

28
Griffiths Algorithm
  • Does not treat Θ and Φ as explicit parameters to
    be estimated.
  • Instead considers the posterior P(z | w) and then
    obtains estimates of Θ and Φ by examining this
    posterior distribution.
  • Computing P(z | w) is a problem of computing a
    probability distribution over a large discrete
    space.
  • Assumes symmetric Dirichlet priors
    (α and β are scalars)

29
Griffiths Algorithm
  • Lexical scan of all documents
  • Collect word frequencies and the number of words in
    each document
  • Repeat the following sampling update for all words
    (a sketch follows below)
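Griffiths and Steyvers estimate the posterior P(z | w) by collapsed Gibbs sampling over the topic assignments. Below is a minimal sketch of one sampling sweep, assuming symmetric scalar priors alpha and beta and incrementally maintained count matrices; the exact data structures shown are an assumption, not taken from the slides:

```python
import numpy as np

def gibbs_sweep(docs, z, n_wt, n_dt, n_t, alpha, beta, rng=None):
    """One collapsed Gibbs sampling sweep over every word position.

    docs: list of documents, each a list of word indices.
    z:    current topic assignment for each word position (same shape as docs).
    n_wt: V x T word-topic count matrix.
    n_dt: D x T document-topic count matrix.
    n_t:  length-T vector of words currently assigned to each topic."""
    rng = rng or np.random.default_rng()
    V, T = n_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # Remove the current assignment from the counts.
            n_wt[w, t_old] -= 1; n_dt[d, t_old] -= 1; n_t[t_old] -= 1
            # P(z = t | rest) proportional to
            # (n_wt[w, t] + beta) / (n_t[t] + V*beta) * (n_dt[d, t] + alpha)
            p = (n_wt[w] + beta) / (n_t + V * beta) * (n_dt[d] + alpha)
            t_new = rng.choice(T, p=p / p.sum())
            # Record the new assignment.
            n_wt[w, t_new] += 1; n_dt[d, t_new] += 1; n_t[t_new] += 1
            z[d][i] = t_new
    return z
```

After the chain has mixed, estimates of Θ and Φ can be read off the count matrices, which is how the posterior is examined in practice.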

30
Example
31
After Application of the algorithm
32
Perplexity
33
Data Set
  • NAPS Archives
  • (www.websciences.org/bibliosleep/naps)
  • Arranged by Authors
  • Keywords
  • Category (25)
  • Total 1115 abstracts

34
What to do next?
  • Actually implement!!

35
What we referred to?
  • Initial References
  • [01] Thomas L. Griffiths and Mark Steyvers, "Finding
    Scientific Topics", Computing & Control Engineering
    Journal, 11(6):295-303, December 2000.
  • [02] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent
    Dirichlet Allocation", Advances in Neural Information
    Processing Systems (NIPS) 14, 2002.
  • [03] Wai Lam and Kon-Fan Low, "Automatic document
    classification based on probabilistic reasoning:
    model and performance analysis", 1997 IEEE
    International Conference on Systems, Man, and
    Cybernetics ('Computational Cybernetics and
    Simulation'), Volume 3, 12-15 Oct. 1997,
    pages 2719-2723.
  • [04] M. Steyvers, M. Rosen-Zvi, T. Griffiths, and
    P. Smyth (in progress), "The Author-Topic Model:
    a Generative Model for Authors and Documents".
  • [05] Andrew McCallum and Kamal Nigam, "A Comparison
    of Event Models for Naive Bayes Text Classification",
    AAAI/ICML-98 Workshop on Learning for Text
    Categorization, pages 41-48, Technical Report
    WS-98-05, AAAI Press, 1998.
  • [06] Thomas Hofmann, "The Cluster-Abstraction Model:
    Unsupervised Learning of Topic Hierarchies from Text
    Data", Proceedings of the Sixteenth International
    Joint Conference on Artificial Intelligence, 1999,
    pages 682-687, Morgan Kaufmann Publishers Inc.,
    ISBN 1-55860-613-0.
  • [07] Ronald Rosenfeld, "Two decades of statistical
    language modeling: where do we go from here?",
    Proceedings of the IEEE, Volume 88, Issue 8,
    Aug. 2000, pages 1270-1278.