Transcript and Presenter's Notes

Title: SIMS 290-2: Applied Natural Language Processing


1
SIMS 290-2 Applied Natural Language Processing
Marti Hearst Oct 23, 2006 (Slides developed by
Preslav Nakov)    
2
Today
  • Feature selection
  • TF.IDF Term Weighting
  • Weka Input File Format

3
Features for Text Categorization
  • Linguistic features
  • Words
  • lowercase? (should we convert everything to lowercase?)
  • normalized? (e.g. texts → text)
  • Phrases
  • Word-level n-grams
  • Character-level n-grams
  • Punctuation
  • Part of Speech
  • Non-linguistic features
  • document formatting
  • informative character sequences (e.g. <)

4
When Do We Need Feature Selection?
  • If the algorithm cannot handle all possible
    features
  • e.g. language identification for 100 languages
    using all words
  • text classification using n-grams
  • Good features can result in higher accuracy
  • What if we just keep all features?
  • Even the unreliable features can be helpful.
  • But we need to weight them
  • In the extreme case, the bad features can have a
    weight of 0 (or very close), which is a form of
    feature selection!

5
Why Feature Selection?
  • Not all features are equally good!
  • Bad features: best to remove
  • Infrequent
  • unlikely to be seen again
  • co-occurrence with a class can be due to chance
  • Too frequent
  • mostly function words
  • Uniform across all categories
  • Good features should be kept
  • Co-occur with a particular category
  • Do not co-occur with other categories
  • The rest: good to keep

6
Types Of Feature Selection?
  • Feature selection reduces the number of features
  • Usually
  • Eliminating features
  • Weighting features
  • Normalizing features
  • Sometimes by transforming parameters
  • e.g. Latent Semantic Indexing using Singular
    Value Decomposition
  • Method may depend on problem type
  • For classification and filtering, may want to
    use information from example documents to guide
    selection.

7
Feature Selection
  • Task-independent methods
  • Document Frequency (DF)
  • Term Strength (TS)
  • Task-dependent methods
  • Information Gain (IG)
  • Mutual Information (MI)
  • χ² statistic (CHI)
  • Empirically compared by Yang & Pedersen (1997)

8
Pedersen & Yang Experiments
  • Compared feature selection methods for text
    categorization
  • 5 feature selection methods
  • DF, MI, CHI, (IG, TS)
  • Features were just words, not phrases
  • 2 classifiers
  • kNN: k-Nearest Neighbor
  • LLSF: Linear Least Squares Fit
  • 2 data collections
  • Reuters-22173
  • OHSUMED subset of MEDLINE (1990-1991 used)

9
Document Frequency (DF)
  • DF = number of documents a term appears in
  • Based on Zipf's Law
  • Remove the rare terms (seen 1-2 times); a small
    pruning sketch follows below
  • Spurious
  • Unreliable: can be just noise
  • Unlikely to appear in new documents
  • Plus
  • Easy to compute
  • Task-independent: no need to know the
    classes
  • Minus
  • Ad hoc criterion
  • For some applications, rare terms can be good
    discriminators (e.g., in IR)

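To make the DF criterion concrete, here is a minimal Python sketch of rare-term pruning by document frequency (the function name prune_by_df, the min_df threshold, and the toy documents are illustrative, not from the lecture):

from collections import Counter

def prune_by_df(tokenized_docs, min_df=2):
    """Keep only terms whose document frequency (DF) is at least min_df."""
    # Document frequency: count each term at most once per document.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    vocab = {t for t, n in df.items() if n >= min_df}
    # Drop rare terms from every document.
    return [[t for t in doc if t in vocab] for doc in tokenized_docs]

docs = [["the", "jaguar", "purrs"], ["the", "car", "purrs"], ["the", "car", "stops"]]
print(prune_by_df(docs, min_df=2))   # 'jaguar' and 'stops' are removed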
10
Stop Word Removal
  • Common words from a predefined list
  • Mostly from closed-class categories
  • unlikely to have a new word added
  • include auxiliaries, conjunctions, determiners,
    prepositions, pronouns, articles
  • But also some open-class words like numerals
  • Bad discriminators
  • uniformly spread across all classes
  • can be safely removed from the vocabulary
  • Is this always a good idea? (e.g. author
    identification)

11
χ² statistic (CHI)
  • χ² statistic (pronounced "kai square")
  • A commonly used method of comparing
    proportions.
  • Measures the lack of independence between a term
    and a category (Yang & Pedersen)

12
χ² statistic (CHI)
  • Is "jaguar" a good predictor for the "auto"
    class?
  • We want to compare:
  • the observed distribution (below) and
  • the null hypothesis: that "jaguar" and "auto" are
    independent

                 Term = jaguar   Term = ¬jaguar
Class = auto           2              500
Class = ¬auto          3             9500
13
χ² statistic (CHI)
  • Under the null hypothesis ("jaguar" and "auto"
    independent): how many co-occurrences of "jaguar"
    and "auto" do we expect?
  • If independent: Pr(j,a) = Pr(j) × Pr(a)
  • So, there would be N × Pr(j,a), i.e. N × Pr(j)
    × Pr(a)
  • Pr(j) = (2+3)/N
  • Pr(a) = (2+500)/N
  • N = 2+3+500+9500 = 10005
  • Which is N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25

                 Term = jaguar   Term = ¬jaguar
Class = auto           2              500
Class = ¬auto          3             9500
14
χ² statistic (CHI)
Under the null hypothesis ("jaguar" and "auto"
independent): how many co-occurrences of "jaguar"
and "auto" do we expect?

                 Term = jaguar   Term = ¬jaguar
Class = auto       2  (0.25)          500
Class = ¬auto          3             9500

observed f_o, expected f_e in parentheses
15
χ² statistic (CHI)
Under the null hypothesis ("jaguar" and "auto"
independent): how many co-occurrences of "jaguar"
and "auto" do we expect?

                 Term = jaguar   Term = ¬jaguar
Class = auto       2  (0.25)       500  (502)
Class = ¬auto      3  (4.75)      9500  (9498)

observed f_o, expected f_e in parentheses
16
χ² statistic (CHI)
χ² is interested in (f_o − f_e)²/f_e, summed over all
table entries. The null hypothesis is rejected
with confidence .999, since 12.9 > 10.83 (the
critical value for .999 confidence).

                 Term = jaguar   Term = ¬jaguar
Class = auto       2  (0.25)       500  (502)
Class = ¬auto      3  (4.75)      9500  (9498)

observed f_o, expected f_e in parentheses
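A short Python check of this computation (variable names are mine; the counts come from the table above):

# Observed counts from the table: rows = class (auto, not auto),
# columns = term (jaguar, not jaguar).
observed = [[2, 500],
            [3, 9500]]

N = sum(sum(row) for row in observed)               # 10005
row_totals = [sum(row) for row in observed]         # [502, 9503]
col_totals = [sum(col) for col in zip(*observed)]   # [5, 10000]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        f_e = row_totals[i] * col_totals[j] / N     # expected count under independence
        f_o = observed[i][j]                        # observed count
        chi2 += (f_o - f_e) ** 2 / f_e

print(round(chi2, 1))   # ~12.8 (the slide rounds to 12.9), above the 10.83 cutoff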
17
χ² statistic (CHI)
There is a simpler formula for χ²(t,c):

    χ²(t,c) = N (AD − CB)² / [ (A+C)(B+D)(A+B)(C+D) ]

    A = #(t,c)      C = #(¬t,c)
    B = #(t,¬c)     D = #(¬t,¬c)
    N = A + B + C + D
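Plugging the jaguar/auto counts into this shortcut (a sketch; the variable names follow the definitions above) gives the same value as the cell-by-cell sum:

A, B, C, D = 2, 3, 500, 9500     # #(t,c), #(t,¬c), #(¬t,c), #(¬t,¬c)
N = A + B + C + D

chi2 = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
print(round(chi2, 1))            # ~12.8, matching the direct computation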
18
χ² statistic (CHI)
  • How to use χ² for multiple categories?
  • Compute χ² for each category and then combine
  • To require a feature to discriminate well across
    all categories, take the expected value of χ²:
    χ²_avg(t) = Σ_i P(c_i) χ²(t, c_i)
  • To weight for a single category, take the
    maximum: χ²_max(t) = max_i χ²(t, c_i)

19
χ² statistic (CHI)
  • Pluses
  • normalized and thus comparable across terms
  • χ²(t,c) is 0 when t and c are independent
  • can be compared to the χ² distribution, 1 degree of
    freedom
  • Minuses
  • unreliable for low frequency terms

20
Information Gain
  • A measure of importance of the feature for
    predicting the presence of the class.
  • Has an information theoretic justification
  • Defined as
  • The number of bits of information gained by
    knowing the term is present or absent
  • Based on Information Theory
  • We won't go into this in detail here.

21
Information Gain (IG)
IG = number of bits of information gained by
knowing whether the term is present or absent; t is
the term being scored, c_i is a class variable.

    G(t) = − Σ_i P(c_i) log P(c_i)                      ... entropy H(C)
           + P(t)  Σ_i P(c_i|t)  log P(c_i|t)           ... −(specific conditional entropy H(C|t))
           + P(¬t) Σ_i P(c_i|¬t) log P(c_i|¬t)          ... −(specific conditional entropy H(C|¬t))

i.e. G(t) = H(C) − P(t) H(C|t) − P(¬t) H(C|¬t)
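A minimal Python sketch of this quantity for the two-class case, using the same contingency counts as in the χ² example (the function and its arguments are illustrative):

import math

def information_gain(A, B, C, D):
    """IG of a binary term for a binary class: H(C) - P(t) H(C|t) - P(¬t) H(C|¬t).

    A = #(t,c), B = #(t,¬c), C = #(¬t,c), D = #(¬t,¬c); result is in bits.
    """
    N = A + B + C + D

    def H(p):
        # Binary entropy in bits; 0 log 0 is treated as 0.
        return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    p_t = (A + B) / N                       # P(term present)
    ig = H((A + C) / N)                     # entropy H(C)
    ig -= p_t * H(A / (A + B))              # minus P(t) H(C|t)
    ig -= (1 - p_t) * H(C / (C + D))        # minus P(¬t) H(C|¬t)
    return ig

print(information_gain(2, 3, 500, 9500))    # small but positive for the jaguar/auto example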
22
Mutual Information (MI)
  • The probability of seeing x and y together
  • vs
  • The probability of seeing x anywhere times the
    probability of seeing y anywhere (independently).
  • MI = log( P(x,y) / (P(x)P(y)) )
  •    = log P(x,y) − log( P(x)P(y) )
  • From Bayes' law: P(x,y) = P(x|y)P(y)
  •    = log( P(x|y)P(y) ) − log( P(x)P(y) )
  • MI = log P(x|y) − log P(x)

23
Mutual Information (MI)
In practice, an approximation from the counts is used:

    I(t,c) ≈ log( (A × N) / ((A + C)(A + B)) )

    A = #(t,c)      C = #(¬t,c)
    B = #(t,¬c)     D = #(¬t,¬c)
    N = A + B + C + D

Rare terms get higher scores; the measure does not use term absence.
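Evaluated on the jaguar/auto counts (a sketch only; base-2 logarithm chosen so the score is in bits):

import math

A, B, C, D = 2, 3, 500, 9500     # #(t,c), #(t,¬c), #(¬t,c), #(¬t,¬c)
N = A + B + C + D

mi = math.log2(A * N / ((A + C) * (A + B)))
print(round(mi, 2))              # ~3.0 bits: looks informative, but rests on only A = 2 observations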
24
Using Mutual Information
  • Compute MI for each category and then combine
  • If we want to discriminate well across all
    categories, then we need to take the expected
    value of MI
  • To discriminate well for a single category, we
    take the maximum

25
Mutual Information
  • Pluses
  • I(t,c) is 0, when t and c are independent
  • Has a sound information-theoretic interpretation
  • Minuses
  • Small numbers produce unreliable results
  • Does not use term absence

26
[Chart from Yang & Pedersen '97 comparing feature selection methods:
CHI max, IG, DF; Term Strength; Mutual Information.]
27
Feature Comparison
  • DF, IG and CHI are good and strongly correlated
  • thus using DF is good, cheap, and
    task-independent
  • can be used when IG and CHI are too expensive
  • MI is bad
  • favors rare terms (which are typically bad)

28
Term Weighting
  • In the study just shown, terms were (mainly)
    treated as binary features
  • If a term occurred in a document, it was assigned
    1
  • Else 0
  • Often it is useful to weight the selected
    features
  • Standard technique: tf.idf

29
TF.IDF Term Weighting
  • TF: term frequency
  • definition: TF = t_ij
  • frequency of term i in document j
  • purpose: makes the frequent words for the
    document more important
  • IDF: inverted document frequency
  • definition: IDF = log(N/n_i)
  • n_i = number of documents containing term i
  • N = total number of documents
  • purpose: makes rare words across documents more
    important
  • TF.IDF (for term i in document j)
  • definition: t_ij × log(N/n_i)
  • (a small computation sketch follows below)

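A minimal tf.idf sketch, assuming raw counts for TF and a natural-log IDF (the slides do not fix the log base); the function name and toy documents are illustrative:

import math
from collections import Counter

def tfidf(tokenized_docs):
    """Weight every term in every document by t_ij * log(N / n_i)."""
    N = len(tokenized_docs)
    # n_i: number of documents containing term i
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)   # t_ij: raw frequency of term i in document j
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["the", "jaguar", "runs"], ["the", "car", "runs"], ["the", "car", "stops"]]
for w in tfidf(docs):
    print(w)   # 'the' gets weight 0 (it appears in every document); rarer terms score higher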
30
Term Normalization
  • Combine different words into a single
    representation
  • Stemming/morphological analysis (see the sketch
    below)
  • bought, buy, buys -> buy
  • General word categories
  • 23.45, 5.30 Yen -> MONEY
  • 1984, 10,000 -> DATE, NUM
  • PERSON
  • ORGANIZATION
  • (Covered in Information Extraction segment)
  • Generalize with lexical hierarchies
  • WordNet, MeSH
  • (Covered later in the semester)

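The "bought, buy, buys -> buy" collapse can be done with an off-the-shelf lemmatizer; a small sketch assuming NLTK and its WordNet data are available (a stemmer such as Porter could be substituted, though it would not map the irregular form "bought" to "buy"):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)        # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()

for word in ["bought", "buy", "buys"]:
    # pos="v" tells the lemmatizer to treat the word as a verb.
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))   # all three map to 'buy'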
31
What Do People Do In Practice?
  • Feature selection
  • infrequent term removal
  • infrequent across the whole collection (i.e. DF)
  • seen in a single document
  • most frequent term removal (i.e. stop words)
  • Normalization
  • Stemming (often)
  • Word classes (sometimes)
  • Feature weighting: TF.IDF or IDF
  • Dimensionality reduction (sometimes)

32
Weka
  • Java-based tool for large-scale machine-learning
    problems
  • Tailored towards text analysis
  • http://weka.sourceforge.net/wekadoc/

33
Weka Input Format
  • Expects a particular input file format
  • Called ARFF: Attribute-Relation File Format
  • Consists of a Header and a Data section

http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
34
WEKA File Format: ARFF

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

age, cholesterol: numerical attributes
sex, chest_pain_type, etc.: nominal attributes
? : missing value
  • Other attribute types
  • String
  • Date

http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
35
WEKA Sparse File Format
  • Value 0 is not represented explicitly
  • Same header (i.e. @relation and @attribute tags)
  • the @data section is different
  • Instead of:
  • @data
  • 0, X, 0, Y, "class A"
  • 0, 0, W, 0, "class B"
  • We have:
  • @data
  • {1 X, 3 Y, 4 "class A"}
  • {2 W, 4 "class B"}
  • This saves LOTS of space for text applications
    (a small generation sketch follows below).
  • Why?

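Bag-of-words vectors are mostly zeros, which is why the sparse form pays off. A small sketch that writes sparse ARFF by hand (illustrative only, not a Weka API; attribute names and data are made up):

def to_sparse_arff(relation, attributes, rows):
    """Render instances in sparse ARFF: attributes with value 0 are simply omitted.

    attributes: list of (name, arff_type) pairs; rows: dicts mapping attribute index -> value.
    """
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} {arff_type}" for name, arff_type in attributes]
    lines += ["", "@data"]
    for row in rows:
        cells = ", ".join(f"{i} {v}" for i, v in sorted(row.items()))
        lines.append("{" + cells + "}")
    return "\n".join(lines)

# Toy bag-of-words features plus a class label.
attrs = [("jaguar", "numeric"), ("car", "numeric"), ("runs", "numeric"),
         ("class", "{auto, not_auto}")]
rows = [{0: 2, 2: 1, 3: "auto"},      # doc 1: jaguar=2, runs=1, everything else 0
        {1: 1, 3: "not_auto"}]        # doc 2: car=1
print(to_sparse_arff("toy_text", attrs, rows))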
36
Next Time
  • Wed: Guest lecture by Peter Jackson
  • "Pure and Applied Research in NLP: The Good, the
    Bad, and the Lucky"
  • Following week
  • Text Categorization Algorithms
  • How to use Weka