1
Topics from Text
CS698 Course Project
  • by
  • Amit Awekar Y3111003
  • Niraj Kalamkar Y3111020

2
Contents
  • Problem Definition
  • Motivation
  • Naïve Bayesian Classifier
  • Methods used by others
  • TF-IDF
  • LSI
  • Unigram
  • Mixture of Unigrams
  • PLSI
  • Latent Dirichlet Allocation (LDA) and inference
    on LDA
  • Examples
  • Data Set

3
What is it?
  • Problem Definition
  • Given a set of documents, identify the set of
    topics they address and classify the documents
    according to those topics (possibly multiple per
    document)
  • Also analyze other aspects, such as changes in
    topics over time and the relation to a particular
    author (secondary task)
  • Type of Learning
  • Can be supervised or unsupervised
  • Statistical Language Model (Unigram method)

4
Motivation
  • Identifying Hot Topics
  • How topics have changed over time.
  • Assigning papers to reviewers
  • Can be related to authors to determine
  • area of expertise
  • similar authors
  • how unusual a paper is for a given author
  • Can be applied to other domains like videos, etc.

5
How to define topic?
  • Template
  • Define templates or samples for different topics
    and match them against a given document.
  • Syntax
  • Define syntax or primitives for different topics
    and check whether a given document follows them.
  • Statistical
  • Define different features for a topic and check
    to what degree a given document has those
    features.
  • The basic problem with template matching and
    syntactic techniques is that defining templates
    and syntax for a topic is itself difficult. The
    problem becomes even harder with synonyms and
    domain-specific details.

6
Naïve Bayes Classifier
  • Highly practical method
  • Performance shown to be comparable to other
    methods such as ANNs and decision tree learning.

7
General strategy
  • Let a1, a2, ..., an be the attributes and
    c1, c2, ..., cm be the classes
  • Task is to find
  • c_NB = argmax_cj P(cj | a1, a2, ..., an)
  • Using Bayes' rule
  • c_NB = argmax_cj P(a1, a2, ..., an | cj) P(cj) / P(a1, a2, ..., an)
  •      = argmax_cj P(a1, a2, ..., an | cj) P(cj)
  • If the attributes are assumed to be conditionally
    independent given the class
  • c_NB = argmax_cj P(cj) ∏i P(ai | cj)
    (a computational sketch follows below)
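A minimal Python sketch of the final argmax above, assuming the class priors P(cj) and conditional probabilities P(ai | cj) have already been estimated and are supplied as dictionaries (the function and parameter names are illustrative, not from the slides):

```python
import math

def naive_bayes_classify(attributes, classes, prior, cond_prob):
    """Return c_NB = argmax_cj P(cj) * prod_i P(ai | cj).

    prior:     dict mapping class -> P(cj)
    cond_prob: dict mapping (attribute_value, class) -> P(ai | cj)
    Assumes smoothed (non-zero) probabilities; logs are summed to
    avoid numerical underflow."""
    best_class, best_score = None, float("-inf")
    for c in classes:
        score = math.log(prior[c])
        for a in attributes:
            score += math.log(cond_prob[(a, c)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```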

8
Naïve Bayes for Classifying Text
  • Set of topics T = {t1, t2, ..., tm}
  • Made up of a set of documents D = {d1, d2, ..., dn}
  • Each document is made up of a set of words W = {w1, w2, ..., wk}
  • Need to calculate P(T | D)
  • We use Bayes' rule for this.

9
Naïve Bayes for Classifying Text
  • Training data is of the form (di, ti)
  • Preprocessing involves extracting the tokens:
  • Remove the most commonly occurring words.
  • Stem each word to get its root.
  • Parse the training corpus to build the dictionary.
  • Dictionary entries have the following form:
  • (wi, t1_cnt, t2_cnt, t3_cnt, ..., tm_cnt)
  • wi is the word and ti_cnt is its count of
    occurrences in topic ti
    (a preprocessing sketch follows below)
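A minimal sketch of this preprocessing step, assuming whitespace tokenization, a small illustrative stop-word list, and a placeholder suffix-stripping stemmer (a real system would likely use something like the Porter stemmer); none of these names come from the original slides:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}  # illustrative subset

def naive_stem(word):
    # Placeholder stemmer; a real system would likely use e.g. the Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_dictionary(training_data, topics):
    """training_data: list of (document_text, topic) pairs.
    Returns dict word -> per-topic counts, i.e. (wi, t1_cnt, ..., tm_cnt)."""
    dictionary = defaultdict(lambda: {t: 0 for t in topics})
    for text, topic in training_data:
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue
            dictionary[naive_stem(token)][topic] += 1
    return dictionary
```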

10
  • Algorithm
  • Training
  • Collect all tokens: parse the training files and stem the words
  • Prepare the dictionary of these words
  • Vocabulary = set of all distinct (non-common) words
    occurring in any text document
  • Calculate the required P(tj) and P(wk | tj)
    probability terms
  • For each target value tj in T do
  • Docsj = subset of the training documents with
    target value tj
  • P(tj) = |Docsj| / (number of training documents)
  • Textj = single document created by concatenating
    all members of Docsj
  • n = total number of word positions in Textj
  • For each word wk in the vocabulary
  • nk = number of times word wk occurs in Textj
  • P(wk | tj) = (nk + 1) / (n + |Vocabulary|)
    (see the sketch after this list)
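A sketch of the training procedure above, assuming the documents have already been tokenized, stemmed, and stop-word filtered; the function and variable names are illustrative:

```python
from collections import Counter

def train_naive_bayes(training_data, topics, vocabulary):
    """training_data: list of (token_list, topic) pairs, with tokens
    already stemmed and stop-word filtered.  Returns (priors, cond_probs),
    where cond_probs[tj][wk] = P(wk | tj) with add-one (Laplace) smoothing."""
    priors, cond_probs = {}, {}
    for t in topics:
        docs_t = [tokens for tokens, topic in training_data if topic == t]
        priors[t] = len(docs_t) / len(training_data)          # P(tj)
        text_t = [w for tokens in docs_t for w in tokens]     # Textj (concatenation)
        n = len(text_t)                                       # word positions in Textj
        counts = Counter(text_t)
        cond_probs[t] = {w: (counts[w] + 1) / (n + len(vocabulary))  # P(wk | tj)
                         for w in vocabulary}
    return priors, cond_probs
```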

11
  • Classification
  • Return the estimated target value for the
    document Doc
  • Positions = all word positions in Doc whose tokens
    are found in the vocabulary
  • Return t_NB, where
  • t_NB = argmax_{tj ∈ T} P(tj) ∏_{i ∈ Positions} P(wi | tj)
  • where wi denotes the word found at the ith
    position
    (a sketch follows below)
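The corresponding classification step, continuing the training sketch above; word positions whose tokens are not in the vocabulary are skipped, and log probabilities are summed for numerical stability:

```python
import math

def classify(doc_tokens, priors, cond_probs):
    """doc_tokens: preprocessed tokens of the document Doc.
    priors, cond_probs: output of train_naive_bayes above.
    Returns t_NB = argmax_tj P(tj) * prod_{i in Positions} P(wi | tj)."""
    best_topic, best_log_prob = None, float("-inf")
    for t, prior in priors.items():
        log_prob = math.log(prior)
        for w in doc_tokens:
            if w in cond_probs[t]:     # keep only positions found in the vocabulary
                log_prob += math.log(cond_probs[t][w])
        if log_prob > best_log_prob:
            best_topic, best_log_prob = t, log_prob
    return best_topic
```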

12
Naïve Bayes Classifier
  • Advantages.
  • Simple and elegant.
  • Very good results can be obtained.
  • Disadvantages.
  • Not Generative.
  • Training data is needed.
  • No. of Topics should be known in advance.

13
Methods Used by Others
  • TF-IDF
  • LSI
  • PLSI

14
TF-IDF
  • Term Frequency (TF)
  • n(d,t) = number of times term t occurs in document d
  • TF(d,t) = 0 if n(d,t) = 0
  •         = 1 + log(1 + log(n(d,t))) otherwise
  • Can be normalized
  • Inverse Document Frequency (IDF)
  • |D| = total number of documents
  • |Dt| = number of documents containing t
  • IDF(t) = log((1 + |D|) / |Dt|)
  • The coordinate of document d along term t is TF(d,t) · IDF(t)
  • Define a topic as a vector Q in this space and a
    document as a vector D
  • Find the cosine of the angle between the two
  • A higher cosine value indicates more relevance
    (a sketch follows below)
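A small sketch of the TF, IDF, and cosine formulas on this slide, with sparse vectors represented as term-to-weight dictionaries (an illustrative choice, not prescribed by the slides):

```python
import math

def tf(n_dt):
    # TF(d,t) = 0 if the term is absent, else 1 + log(1 + log(n(d,t)))
    return 0.0 if n_dt == 0 else 1.0 + math.log(1.0 + math.log(n_dt))

def idf(num_docs, num_docs_with_term):
    # IDF(t) = log((1 + |D|) / |Dt|)
    return math.log((1.0 + num_docs) / num_docs_with_term)

def cosine(q, d):
    # Cosine of the angle between two sparse vectors (dicts: term -> weight).
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```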

15
Problems with TF-IDF
  • High dimensionality
  • Its main purpose is to answer short queries

16
LSI
  • Calculate TF-IDF coordinates
  • Compute the Singular Value Decomposition (SVD):
    Ak = Uk Σk Vk^T
  • Thus dimensionality is reduced to k
  • Singular values = square roots of the eigenvalues of
    A^H A
    (sketch below)
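A minimal sketch of the rank-k truncation using NumPy's SVD, assuming A is a term-document TF-IDF matrix; the function name is illustrative:

```python
import numpy as np

def lsi_reduce(A, k):
    """Rank-k LSI approximation of a term-document TF-IDF matrix A.

    Uses the truncated SVD A_k = U_k Sigma_k V_k^T and returns the
    k-dimensional coordinates of each document (one column per document)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_coords = np.diag(s_k) @ Vt_k      # k x num_documents
    return U_k, s_k, doc_coords
```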

17
Problems with LSI
  • Storage
  • Need to store k-dimensional representation of
    each document.
  • Efficiency
  • Need to compare every topic with every document.

18
pLSI
  • Hofmann (1999) presented the probabilistic LSI
    (pLSI) model, also known as the aspect model, as
    an alternative to LSI.
  • Each word in a document is modeled as a sample from
    a mixture model, where the mixture components are
    multinomial random variables that can be viewed as
    representations of topics:
    P(d, w) = P(d) Σz P(w | z) P(z | d)
  • Thus each word is generated from a single topic,
    and different words in a document may be generated
    from different topics.
  • Each document is represented as a list of mixing
    proportions for these mixture components and is
    thereby reduced to a probability distribution over
    a fixed set of topics.
  • This distribution is the reduced description
    associated with the document.

19
Problems with pLSI
  • The number of parameters in the model grows
    linearly with the size of the corpus: kV + kD for
    k topics, a vocabulary of V words, and D documents
    (e.g. k = 100, V = 10,000, D = 1,000 gives about
    1.1 million parameters).
  • It is not clear how to assign probability to a
    document outside of the training set.

20
Latent Dirichlet Allocation
  • Assumes a document is a bag of words.
  • This is the assumption of exchangeability for the
    words in a document.
  • Also assumes that the documents are exchangeable,
    i.e. the ordering of the documents in the corpus
    can also be neglected.
  • Thus if we want to capture the exchangeability of
    both documents and words, we need to consider a
    mixture model that captures this exchangeability.
  • This line of thinking leads to the Latent
    Dirichlet Allocation (LDA) model.

21
LDA
  • Notation and terminology
  • A word is the basic unit, defined to be an item
    from a vocabulary indexed by {1, 2, ..., V}.
  • The ith word in the vocabulary is represented as a
    V-vector w such that w^i = 1 and w^j = 0 for j ≠ i.
  • A document is a sequence of N words denoted by
    w = (w1, w2, ..., wN), where wn is the nth word in
    the sequence.
  • A corpus is a collection of M documents
    D = {w1, w2, ..., wM}.

22
LDA
  • K = number of topics.
  • There are K underlying latent topics.
  • D = number of documents.
  • Each document is a mixture of the K topics.
  • V = number of words in the vocabulary.
  • Each topic is a probability distribution over the V
    words.
  • A document is generated by first sampling a topic
    and then sampling a word from the selected topic.

23
LDA
  • Assumes the following generative process for each
    document w in D:
  • Choose θ ~ Dir(α)
  • For each of the N words wn:
  • Choose a topic zn ~ Multinomial(θ)
  • Choose a word wn from p(wn | zn, β), a multinomial
    probability distribution conditioned on the topic
    zn, where βij = p(w^j = 1 | z^i = 1)
    (a sampling sketch follows below)
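A sketch of this generative process using NumPy, assuming alpha is a length-K Dirichlet parameter vector and beta is a K x V topic-word matrix; the function name is illustrative:

```python
import numpy as np

def generate_document(N, alpha, beta, rng=None):
    """Sample one document of N words from the LDA generative process.

    alpha: length-K Dirichlet parameter vector.
    beta:  K x V matrix with beta[i, j] = p(word j | topic i), rows summing to 1.
    Returns (theta, z, words)."""
    rng = rng or np.random.default_rng()
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                    # theta ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)              # z_n ~ Multinomial(theta)
    words = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ p(w | z_n, beta)
    return theta, z, words
```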

24
LDA (contd.)
  • P(θ | α) = (Γ(Σ_{i=1..k} αi) / ∏_{i=1..k} Γ(αi))
    θ1^(α1-1) ... θk^(αk-1)
  • P(θ, z, w | α, β) = P(θ | α) ∏_{n=1..N} P(zn | θ) P(wn | zn, β)
  • P(w | α, β) = ∫ P(θ | α) (∏_{n=1..N} Σ_{zn} P(zn | θ) P(wn | zn, β)) dθ
  • P(D | α, β) = ∏_{d=1..M} ∫ P(θd | α) (∏_{n=1..Nd} Σ_{z_dn}
    P(z_dn | θd) P(w_dn | z_dn, β)) dθd

25
Graphical Representation of LDA
26
Relation of LDA to other latent variable models
27
Griffiths Algorithm
  • T = number of topics
  • D = number of documents
  • V = number of words in the vocabulary
  • T is chosen such that log P(w | T) is maximized
  • Θ ~ Dir(α): a T-dimensional vector of topic
    proportions for each document
  • Φ ~ Dir(β): a V-dimensional word distribution for
    each topic

28
Griffiths Algorithm
  • Does not treat Θ and Φ as explicit parameters to
    be estimated.
  • Instead considers the posterior P(z | w) and then
    obtains estimates of Θ and Φ by examining this
    posterior distribution.
  • Computing P(z | w) is a problem of computing a
    probability distribution over a large discrete
    space.
  • Assumes symmetric Dirichlet priors
    (α and β are scalars)

29
Griffiths Algorithm
  • Lexical scan of all documents
  • Collect word frequencies and the number of words in
    each document
  • Repeat the following sampling update for all words
    (a sketch follows below)
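Griffiths and Steyvers estimate the posterior P(z | w) by collapsed Gibbs sampling over the topic assignments. Below is a minimal sketch of one sampling sweep, assuming symmetric scalar priors alpha and beta and incrementally maintained count matrices; the exact data structures shown are an assumption, not taken from the slides:

```python
import numpy as np

def gibbs_sweep(docs, z, n_wt, n_dt, n_t, alpha, beta, rng=None):
    """One collapsed Gibbs sampling sweep over every word position.

    docs: list of documents, each a list of word indices.
    z:    current topic assignment for each word position (same shape as docs).
    n_wt: V x T word-topic count matrix.
    n_dt: D x T document-topic count matrix.
    n_t:  length-T vector of words currently assigned to each topic."""
    rng = rng or np.random.default_rng()
    V, T = n_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # Remove the current assignment from the counts.
            n_wt[w, t_old] -= 1; n_dt[d, t_old] -= 1; n_t[t_old] -= 1
            # P(z = t | rest) proportional to
            # (n_wt[w, t] + beta) / (n_t[t] + V*beta) * (n_dt[d, t] + alpha)
            p = (n_wt[w] + beta) / (n_t + V * beta) * (n_dt[d] + alpha)
            t_new = rng.choice(T, p=p / p.sum())
            # Record the new assignment.
            n_wt[w, t_new] += 1; n_dt[d, t_new] += 1; n_t[t_new] += 1
            z[d][i] = t_new
    return z
```

After the chain has mixed, estimates of Θ and Φ can be read off the count matrices, which is how the posterior is examined in practice.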

30
Example
31
After Application of the algorithm
32
Perplexity
33
Data Set
  • NAPS Archives
  • (www.websciences.org/bibliosleep/naps)
  • Arranged by Authors
  • Keywords
  • Category (25)
  • Total 1115 abstracts

34
What to do next?
  • Actually implement!!

35
What we referred to?
  • Initial References
  • [01] Thomas L. Griffiths and Mark Steyvers, "Finding
    Scientific Topics", Computing & Control Engineering
    Journal, 11(6):295-303, December 2000.
  • [02] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent
    Dirichlet Allocation", Advances in Neural Information
    Processing Systems (NIPS) 14, 2002.
  • [03] Wai Lam and Kon-Fan Low, "Automatic document
    classification based on probabilistic reasoning:
    model and performance analysis", 1997 IEEE
    International Conference on Systems, Man, and
    Cybernetics ('Computational Cybernetics and
    Simulation'), Volume 3, 12-15 Oct. 1997,
    pages 2719-2723.
  • [04] M. Steyvers, M. Rosen-Zvi, T. Griffiths, and
    P. Smyth (in progress), "The Author-Topic Model:
    a Generative Model for Authors and Documents".
  • [05] Andrew McCallum and Kamal Nigam, "A Comparison
    of Event Models for Naive Bayes Text Classification",
    AAAI/ICML-98 Workshop on Learning for Text
    Categorization, pages 41-48, Technical Report
    WS-98-05, AAAI Press, 1998.
  • [06] Thomas Hofmann, "The Cluster-Abstraction Model:
    Unsupervised Learning of Topic Hierarchies from Text
    Data", Proceedings of the Sixteenth International
    Joint Conference on Artificial Intelligence, 1999,
    pages 682-687, Morgan Kaufmann Publishers Inc.,
    ISBN 1-55860-613-0.
  • [07] Ronald Rosenfeld, "Two decades of statistical
    language modeling: where do we go from here?",
    Proceedings of the IEEE, Volume 88, Issue 8,
    Aug. 2000, pages 1270-1278.