Transcript and Presenter's Notes


1
Introduction to Computational Natural Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York
Fall 2001
  • William Gregory Sakas
  • Hunter College, Department of Computer Science
  • Graduate Center, PhD Programs in Computer Science
    and Linguistics
  • The City University of New York

2
Suppose we have a single search word, "egg." All documents in our corpus (e.g., web pages) are organized into the following categories: dishwasher, poultry, and pregnancy. What is the likelihood that the keyword is intended to "key into" each of the 3 categories of documents? That is, which category would be the best prediction for a search engine to make? This can easily be based on word frequencies in a bag-of-words approach. Say, for example, the word "egg" appears in documents classified as
  • dishwasher-related: 379 times
  • poultry-related: 1,617 times
  • pregnancy-related: 824 times
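As a sketch of this bag-of-words counting step (the toy corpus and variable names below are illustrative, not from the slides):

from collections import Counter

# Toy corpus of (category, document text) pairs; illustrative only.
corpus = [
    ("dishwasher", "remove baked egg residue from the dishwasher rack"),
    ("poultry",    "the hen lays one egg per day and the egg is graded"),
    ("pregnancy",  "the fertilized egg implants in the uterine wall"),
]

# Count occurrences of the keyword in each category. Bag of words:
# word order is ignored, only frequencies matter.
keyword = "egg"
counts = Counter()
for category, text in corpus:
    counts[category] += text.lower().split().count(keyword)

print(counts)  # Counter({'poultry': 2, 'dishwasher': 1, 'pregnancy': 1})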
3
Clearly, without any other words, poultry would be the best prediction. Let's formalize this a bit. We want

  argmax p( c | "egg" )
     c

That is, the category c that maximizes the probability of c given "egg." Take an example. What's p(poultry | "egg")? Take all occurrences of "egg" from all documents in our collection (in our example that would be 2,820 = 379 + 1,617 + 824) and partition them into their categories.
"egg"
occurrences of "egg" in documents in pregnancy
category
1,617
834
379
occurrences of "egg" in documents in dishwasher
category
occurrences of "egg" in documents in poultry
category
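The partition step takes only a few lines; the counts below are the ones from the slides:

# Occurrences of "egg" in each category, from the slides.
egg_counts = {"dishwasher": 379, "poultry": 1617, "pregnancy": 824}
total = sum(egg_counts.values())  # 2,820

# p(c | "egg") is each category's share of the 2,820 occurrences.
p_c_given_egg = {c: n / total for c, n in egg_counts.items()}
print(p_c_given_egg)                        # poultry: 1617/2820, about 0.573
print(max(egg_counts, key=egg_counts.get))  # argmax: 'poultry'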
4
p(dishwasher | "egg") = 379/2,820
p(poultry | "egg") = 1,617/2,820
p(pregnancy | "egg") = 824/2,820

In fact, since the denominator is the same, when calculating the argmax we just drop it and simply calculate the max number of occurrences of "egg" in each category. That is, we want

  argmax count( "egg" in category c )
     c

Unfortunately, calculating this is quite expensive. We have to go through EVERY document in every category. So instead we apply Bayes' Rule:

              p(B) p(A | B)
  p(B | A) = ---------------
                  p(A)

or in our example:

                       p(category) p(word | category)
  p(category | word) = ------------------------------
                                 p(word)

But since we are finding the maximum probability, we can drop the denominator:

  argmax p( c | word ) = argmax p( word | c ) p( c )
  c ∈ categories         c ∈ categories
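A quick numeric sanity check that dropping the shared denominator p(word) preserves the argmax; the per-category word totals below are hypothetical, and only the "egg" counts come from the slides:

# Hypothetical word totals per category (not from the slides), plus
# the slide's counts of how many of those words are "egg".
words_in_c = {"dishwasher": 50000, "poultry": 40000, "pregnancy": 60000}
egg_in_c   = {"dishwasher": 379,   "poultry": 1617,  "pregnancy": 824}
total_words = sum(words_in_c.values())

p_c     = {c: n / total_words for c, n in words_in_c.items()}    # prior p(c)
p_egg_c = {c: egg_in_c[c] / words_in_c[c] for c in words_in_c}   # p("egg" | c)

# Numerator of Bayes' Rule. p("egg") is the same for every c, so
# leaving it out cannot change which category wins.
score = {c: p_egg_c[c] * p_c[c] for c in words_in_c}
print(max(score, key=score.get))  # 'poultry'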
5
It is easy to extend a single word to multiple words, and we get the basic version of the NAIVE BAYES algorithm:

  (1) argmax p( words | c ) p( c )
      c ∈ categories

p( c ) is simply the probability of category c being chosen, independent of any words; for example, by the formula

  total number of words in all documents categorized as c
  --------------------------------------------------------
  total number of words in the entire corpus

(BTW, why is (1) easier to compute than

  argmax p( c | words )
  c ∈ categories

? Because in order to compute the second equation we would need to compute 2^n entries, where n = number of words, to obtain a joint probability distribution. See next slide. The more computer-oriented students, see me after class or email me if interested in further discussion.)
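One way to see the 2^n blow-up: modeling p( c | words ) directly means storing an entry for every present/absent pattern over the vocabulary. A sketch with a toy 3-word vocabulary:

from itertools import product

vocab = ["egg", "hen", "rinse"]  # toy vocabulary, n = 3
# A full joint table needs one entry per present/absent pattern:
patterns = list(product([0, 1], repeat=len(vocab)))
print(len(patterns))  # 2**3 = 8 entries; grows as 2**n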
6
"egg"
1,617
824
379
p(Poultry "egg") 1,617 / (379 1,617 824)
Dishwashers
Pregnancy
Poultry
7
In any event, the best predictor of a category, given a bunch of words, is given by (1) above. A final equation: if words contains the words w1, w2, . . . , wn, then (1) can be rewritten as (2) below, assuming the words are independently likely to occur! (pretty naive, but it works fairly well in practice):

  (2) argmax p(w1 | c) p(w2 | c) . . . p(wn | c) p( c )
      c ∈ categories

And this is the way the implementation works:
A) A corpus of documents is fed to the learner; the words are counted up and stored so that the probabilities in (2) can be efficiently calculated.
B) An unseen document is given to the learner; (2) is calculated, where w1, w2, . . . , wn are the words in the document, and the category that maximizes (2) is returned by the learner.
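A self-contained sketch of steps A) and B); the function names and toy corpus are mine, the product in (2) is computed in log space to avoid underflow, and add-one (Laplace) smoothing is assumed so unseen words don't zero out the product. Neither variation appears in the slides:

import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Step A: count words per category so p(w|c) and p(c) can be computed."""
    word_counts = defaultdict(Counter)   # category -> word frequencies
    total_per_c = Counter()              # category -> total word count
    vocab = set()
    for category, text in labeled_docs:
        for w in text.lower().split():
            word_counts[category][w] += 1
            total_per_c[category] += 1
            vocab.add(w)
    return word_counts, total_per_c, vocab

def classify(text, word_counts, total_per_c, vocab):
    """Step B: return argmax over c of p(w1|c)...p(wn|c) p(c), per (2)."""
    grand_total = sum(total_per_c.values())
    best_c, best_score = None, -math.inf
    for c in word_counts:
        # log p(c): category c's share of all the words in the corpus.
        score = math.log(total_per_c[c] / grand_total)
        for w in text.lower().split():
            # Add-one smoothed log p(w|c); smoothing is an assumption,
            # not in the slides, to avoid log(0) on unseen words.
            score += math.log((word_counts[c][w] + 1) /
                              (total_per_c[c] + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

docs = [
    ("poultry",    "hen egg farm egg"),
    ("dishwasher", "rinse rack egg residue"),
    ("pregnancy",  "egg implants uterus"),
]
model = train(docs)
print(classify("egg farm", *model))  # -> 'poultry'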
8
The vector space model.

             speech   language   processing
  Doc 1        6         0           1
  Doc 2        0         5           1
  Doc 3        1         2           1

Shorter notation: Doc 1 = <6,0,1>, Doc 2 = <0,5,1>, Doc 3 = <1,2,1>.

[Figure: the three document vectors plotted on axes labeled speech, language, and processing.]

But we need to normalize. For example, <1,2,1> should be considered very similar to <3,6,3>. Easy to do.
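A small sketch of the normalization idea: scaling each vector to unit length makes <1,2,1> and <3,6,3> identical, and the dot product of unit-length vectors is their cosine similarity (the function names are mine):

import math

def normalize(v):
    """Scale a term-count vector to unit length so only direction matters."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def cosine(u, v):
    """Cosine similarity: dot product of the unit-length vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

doc1, doc2, doc3 = (6, 0, 1), (0, 5, 1), (1, 2, 1)   # counts from the slide
print(normalize((1, 2, 1)))   # (0.408..., 0.816..., 0.408...)
print(normalize((3, 6, 3)))   # same unit vector: direction is all that counts
print(round(cosine(doc1, doc3), 2))  # 0.47
print(round(cosine(doc2, doc3), 2))  # 0.88: Doc 3 is closer to Doc 2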