Transcript and Presenter's Notes

Title: SIMS 290-2: Applied Natural Language Processing


1
SIMS 290-2 Applied Natural Language Processing
Marti Hearst Oct 23, 2006 (Slides developed by
Preslav Nakov)    
2
Today
  • Feature selection
  • TF.IDF Term Weighting
  • Weka Input File Format

3
Features for Text Categorization
  • Linguistic features
  • Words
  • lowercase? (should we convert everything to lowercase?)
  • normalized? (e.g. texts → text)
  • Phrases
  • Word-level n-grams
  • Character-level n-grams
  • Punctuation
  • Part of Speech
  • Non-linguistic features
  • document formatting
  • informative character sequences (e.g. <)

4
When Do We Need Feature Selection?
  • If the algorithm cannot handle all possible
    features
  • e.g. language identification for 100 languages
    using all words
  • text classification using n-grams
  • Good features can result in higher accuracy
  • What if we just keep all features?
  • Even the unreliable features can be helpful.
  • But we need to weight them
  • In the extreme case, the bad features can have a
    weight of 0 (or very close), which is a form of
    feature selection!

5
Why Feature Selection?
  • Not all features are equally good!
  • Bad features: best to remove
  • Infrequent
  • unlikely to be seen again
  • co-occurrence with a class can be due to chance
  • Too frequent
  • mostly function words
  • Uniform across all categories
  • Good features should be kept
  • Co-occur with a particular category
  • Do not co-occur with other categories
  • The rest: good to keep

6
Types Of Feature Selection?
  • Feature selection reduces the number of features
  • Usually
  • Eliminating features
  • Weighting features
  • Normalizing features
  • Sometimes by transforming parameters
  • e.g. Latent Semantic Indexing using Singular
    Value Decomposition
  • Method may depend on problem type
  • For classification and filtering, may want to
    use information from example documents to guide
    selection.

7
Feature Selection
  • Task-independent methods
  • Document Frequency (DF)
  • Term Strength (TS)
  • Task-dependent methods
  • Information Gain (IG)
  • Mutual Information (MI)
  • χ² statistic (CHI)
  • Empirically compared by Yang & Pedersen (1997)

8
Pedersen & Yang Experiments
  • Compared feature selection methods for text
    categorization
  • 5 feature selection methods
  • DF, MI, CHI, (IG, TS)
  • Features were just words, not phrases
  • 2 classifiers
  • kNN: k-Nearest Neighbor
  • LLSF: Linear Least Squares Fit
  • 2 data collections
  • Reuters-22173
  • OHSUMED subset of MEDLINE (1990-1991 used)

9
Document Frequency (DF)
  • DF = number of documents a term appears in
  • Based on Zipf's Law
  • Remove the rare terms (seen 1-2 times); a small
    pruning sketch follows below
  • Spurious
  • Unreliable: can be just noise
  • Unlikely to appear in new documents
  • Plus
  • Easy to compute
  • Task-independent: no need to know the
    classes
  • Minus
  • Ad hoc criterion
  • For some applications, rare terms can be good
    discriminators (e.g., in IR)

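To make the DF criterion concrete, here is a minimal Python sketch of rare-term pruning by document frequency (the function name prune_by_df, the min_df threshold, and the toy documents are illustrative, not from the lecture):

from collections import Counter

def prune_by_df(tokenized_docs, min_df=2):
    """Keep only terms whose document frequency (DF) is at least min_df."""
    # Document frequency: count each term at most once per document.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    vocab = {t for t, n in df.items() if n >= min_df}
    # Drop rare terms from every document.
    return [[t for t in doc if t in vocab] for doc in tokenized_docs]

docs = [["the", "jaguar", "purrs"], ["the", "car", "purrs"], ["the", "car", "stops"]]
print(prune_by_df(docs, min_df=2))   # 'jaguar' and 'stops' are removed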
10
Stop Word Removal
  • Common words from a predefined list
  • Mostly from closed-class categories
  • unlikely to have a new word added
  • include auxiliaries, conjunctions, determiners,
    prepositions, pronouns, articles
  • But also some open-class words like numerals
  • Bad discriminators
  • uniformly spread across all classes
  • can be safely removed from the vocabulary
  • Is this always a good idea? (e.g. author
    identification)

11
χ² statistic (CHI)
  • χ² statistic (pronounced "kai square")
  • A commonly used method of comparing
    proportions.
  • Measures the lack of independence between a term
    and a category (Yang & Pedersen)

12
χ² statistic (CHI)
  • Is "jaguar" a good predictor for the "auto"
    class?
  • We want to compare:
  • the observed distribution (below) and
  • the null hypothesis: that "jaguar" and "auto" are
    independent

                 Term = jaguar   Term = ¬jaguar
Class = auto           2              500
Class = ¬auto          3             9500
13
χ² statistic (CHI)
  • Under the null hypothesis ("jaguar" and "auto"
    independent): how many co-occurrences of "jaguar"
    and "auto" do we expect?
  • If independent: Pr(j,a) = Pr(j) × Pr(a)
  • So, there would be N × Pr(j,a), i.e. N × Pr(j)
    × Pr(a)
  • Pr(j) = (2+3)/N
  • Pr(a) = (2+500)/N
  • N = 2+3+500+9500 = 10005
  • Which is N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25

                 Term = jaguar   Term = ¬jaguar
Class = auto           2              500
Class = ¬auto          3             9500
14
χ² statistic (CHI)
Under the null hypothesis ("jaguar" and "auto"
independent): how many co-occurrences of "jaguar"
and "auto" do we expect?

                 Term = jaguar   Term = ¬jaguar
Class = auto       2  (0.25)          500
Class = ¬auto          3             9500

observed f_o, expected f_e in parentheses
15
χ² statistic (CHI)
Under the null hypothesis ("jaguar" and "auto"
independent): how many co-occurrences of "jaguar"
and "auto" do we expect?

                 Term = jaguar   Term = ¬jaguar
Class = auto       2  (0.25)       500  (502)
Class = ¬auto      3  (4.75)      9500  (9498)

observed f_o, expected f_e in parentheses
16
χ² statistic (CHI)
χ² is interested in (f_o − f_e)²/f_e, summed over all
table entries. The null hypothesis is rejected
with confidence .999, since 12.9 > 10.83 (the
critical value for .999 confidence).

                 Term = jaguar   Term = ¬jaguar
Class = auto       2  (0.25)       500  (502)
Class = ¬auto      3  (4.75)      9500  (9498)

observed f_o, expected f_e in parentheses
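A short Python check of this computation (variable names are mine; the counts come from the table above):

# Observed counts from the table: rows = class (auto, not auto),
# columns = term (jaguar, not jaguar).
observed = [[2, 500],
            [3, 9500]]

N = sum(sum(row) for row in observed)               # 10005
row_totals = [sum(row) for row in observed]         # [502, 9503]
col_totals = [sum(col) for col in zip(*observed)]   # [5, 10000]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        f_e = row_totals[i] * col_totals[j] / N     # expected count under independence
        f_o = observed[i][j]                        # observed count
        chi2 += (f_o - f_e) ** 2 / f_e

print(round(chi2, 1))   # ~12.8 (the slide rounds to 12.9), above the 10.83 cutoff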
17
χ² statistic (CHI)
There is a simpler formula for χ²(t,c):

    χ²(t,c) = N (AD − CB)² / [ (A+C)(B+D)(A+B)(C+D) ]

    A = #(t,c)      C = #(¬t,c)
    B = #(t,¬c)     D = #(¬t,¬c)
    N = A + B + C + D
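Plugging the jaguar/auto counts into this shortcut (a sketch; the variable names follow the definitions above) gives the same value as the cell-by-cell sum:

A, B, C, D = 2, 3, 500, 9500     # #(t,c), #(t,¬c), #(¬t,c), #(¬t,¬c)
N = A + B + C + D

chi2 = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
print(round(chi2, 1))            # ~12.8, matching the direct computation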
18
χ² statistic (CHI)
  • How to use χ² for multiple categories?
  • Compute χ² for each category and then combine
  • To require a feature to discriminate well across
    all categories, take the expected value of χ²:
    χ²_avg(t) = Σ_i P(c_i) χ²(t, c_i)
  • To weight for a single category, take the
    maximum: χ²_max(t) = max_i χ²(t, c_i)

19
χ² statistic (CHI)
  • Pluses
  • normalized and thus comparable across terms
  • χ²(t,c) is 0 when t and c are independent
  • can be compared to the χ² distribution, 1 degree of
    freedom
  • Minuses
  • unreliable for low frequency terms

20
Information Gain
  • A measure of importance of the feature for
    predicting the presence of the class.
  • Has an information theoretic justification
  • Defined as
  • The number of bits of information gained by
    knowing the term is present or absent
  • Based on Information Theory
  • We won't go into this in detail here.

21
Information Gain (IG)
IG = number of bits of information gained by
knowing whether the term is present or absent; t is
the term being scored, c_i is a class variable.

    G(t) = − Σ_i P(c_i) log P(c_i)                      ... entropy H(C)
           + P(t)  Σ_i P(c_i|t)  log P(c_i|t)           ... −(specific conditional entropy H(C|t))
           + P(¬t) Σ_i P(c_i|¬t) log P(c_i|¬t)          ... −(specific conditional entropy H(C|¬t))

i.e. G(t) = H(C) − P(t) H(C|t) − P(¬t) H(C|¬t)
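A minimal Python sketch of this quantity for the two-class case, using the same contingency counts as in the χ² example (the function and its arguments are illustrative):

import math

def information_gain(A, B, C, D):
    """IG of a binary term for a binary class: H(C) - P(t) H(C|t) - P(¬t) H(C|¬t).

    A = #(t,c), B = #(t,¬c), C = #(¬t,c), D = #(¬t,¬c); result is in bits.
    """
    N = A + B + C + D

    def H(p):
        # Binary entropy in bits; 0 log 0 is treated as 0.
        return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    p_t = (A + B) / N                       # P(term present)
    ig = H((A + C) / N)                     # entropy H(C)
    ig -= p_t * H(A / (A + B))              # minus P(t) H(C|t)
    ig -= (1 - p_t) * H(C / (C + D))        # minus P(¬t) H(C|¬t)
    return ig

print(information_gain(2, 3, 500, 9500))    # small but positive for the jaguar/auto example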
22
Mutual Information (MI)
  • The probability of seeing x and y together
  • vs
  • The probability of seeing x anywhere times the
    probability of seeing y anywhere (independently).
  • MI = log( P(x,y) / (P(x)P(y)) )
  •    = log P(x,y) − log( P(x)P(y) )
  • From Bayes' law: P(x,y) = P(x|y)P(y)
  •    = log( P(x|y)P(y) ) − log( P(x)P(y) )
  • MI = log P(x|y) − log P(x)

23
Mutual Information (MI)
In practice, an approximation from the counts is used:

    I(t,c) ≈ log( (A × N) / ((A + C)(A + B)) )

    A = #(t,c)      C = #(¬t,c)
    B = #(t,¬c)     D = #(¬t,¬c)
    N = A + B + C + D

Rare terms get higher scores; the measure does not use term absence.
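Evaluated on the jaguar/auto counts (a sketch only; base-2 logarithm chosen so the score is in bits):

import math

A, B, C, D = 2, 3, 500, 9500     # #(t,c), #(t,¬c), #(¬t,c), #(¬t,¬c)
N = A + B + C + D

mi = math.log2(A * N / ((A + C) * (A + B)))
print(round(mi, 2))              # ~3.0 bits: looks informative, but rests on only A = 2 observations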
24
Using Mutual Information
  • Compute MI for each category and then combine
  • If we want to discriminate well across all
    categories, then we need to take the expected
    value of MI
  • To discriminate well for a single category, we
    take the maximum

25
Mutual Information
  • Pluses
  • I(t,c) is 0, when t and c are independent
  • Has a sound information-theoretic interpretation
  • Minuses
  • Small numbers produce unreliable results
  • Does not use term absence

26
[Chart from Yang & Pedersen '97 comparing feature selection methods:
CHI max, IG, DF; Term Strength; Mutual Information.]
27
Feature Comparison
  • DF, IG and CHI are good and strongly correlated
  • thus using DF is good, cheap, and
    task-independent
  • can be used when IG and CHI are too expensive
  • MI is bad
  • favors rare terms (which are typically bad)

28
Term Weighting
  • In the study just shown, terms were (mainly)
    treated as binary features
  • If a term occurred in a document, it was assigned
    1
  • Else 0
  • Often it is useful to weight the selected
    features
  • Standard technique: tf.idf

29
TF.IDF Term Weighting
  • TF: term frequency
  • definition: TF = t_ij
  • frequency of term i in document j
  • purpose: makes the frequent words for the
    document more important
  • IDF: inverted document frequency
  • definition: IDF = log(N/n_i)
  • n_i = number of documents containing term i
  • N = total number of documents
  • purpose: makes rare words across documents more
    important
  • TF.IDF (for term i in document j)
  • definition: t_ij × log(N/n_i)
  • (a small computation sketch follows below)

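A minimal tf.idf sketch, assuming raw counts for TF and a natural-log IDF (the slides do not fix the log base); the function name and toy documents are illustrative:

import math
from collections import Counter

def tfidf(tokenized_docs):
    """Weight every term in every document by t_ij * log(N / n_i)."""
    N = len(tokenized_docs)
    # n_i: number of documents containing term i
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)   # t_ij: raw frequency of term i in document j
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["the", "jaguar", "runs"], ["the", "car", "runs"], ["the", "car", "stops"]]
for w in tfidf(docs):
    print(w)   # 'the' gets weight 0 (it appears in every document); rarer terms score higher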
30
Term Normalization
  • Combine different words into a single
    representation
  • Stemming/morphological analysis (see the sketch
    below)
  • bought, buy, buys -> buy
  • General word categories
  • 23.45, 5.30 Yen -> MONEY
  • 1984, 10,000 -> DATE, NUM
  • PERSON
  • ORGANIZATION
  • (Covered in Information Extraction segment)
  • Generalize with lexical hierarchies
  • WordNet, MeSH
  • (Covered later in the semester)

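The "bought, buy, buys -> buy" collapse can be done with an off-the-shelf lemmatizer; a small sketch assuming NLTK and its WordNet data are available (a stemmer such as Porter could be substituted, though it would not map the irregular form "bought" to "buy"):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)        # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()

for word in ["bought", "buy", "buys"]:
    # pos="v" tells the lemmatizer to treat the word as a verb.
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))   # all three map to 'buy'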
31
What Do People Do In Practice?
  • Feature selection
  • infrequent term removal
  • infrequent across the whole collection (i.e. DF)
  • seen in a single document
  • most frequent term removal (i.e. stop words)
  • Normalization
  • Stemming (often)
  • Word classes (sometimes)
  • Feature weighting: TF.IDF or IDF
  • Dimensionality reduction (sometimes)

32
Weka
  • Java-based tool for large-scale machine-learning
    problems
  • Tailored towards text analysis
  • http://weka.sourceforge.net/wekadoc/

33
Weka Input Format
  • Expects a particular input file format
  • Called ARFF: Attribute-Relation File Format
  • Consists of a Header and a Data section

http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
34
WEKA File Format: ARFF

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

age, cholesterol: numerical attributes
sex, chest_pain_type, etc.: nominal attributes
? : missing value
  • Other attribute types
  • String
  • Date

http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
35
WEKA Sparse File Format
  • Value 0 is not represented explicitly
  • Same header (i.e. @relation and @attribute tags)
  • the @data section is different
  • Instead of:
  • @data
  • 0, X, 0, Y, "class A"
  • 0, 0, W, 0, "class B"
  • We have:
  • @data
  • {1 X, 3 Y, 4 "class A"}
  • {2 W, 4 "class B"}
  • This saves LOTS of space for text applications
    (a small generation sketch follows below).
  • Why?

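Bag-of-words vectors are mostly zeros, which is why the sparse form pays off. A small sketch that writes sparse ARFF by hand (illustrative only, not a Weka API; attribute names and data are made up):

def to_sparse_arff(relation, attributes, rows):
    """Render instances in sparse ARFF: attributes with value 0 are simply omitted.

    attributes: list of (name, arff_type) pairs; rows: dicts mapping attribute index -> value.
    """
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} {arff_type}" for name, arff_type in attributes]
    lines += ["", "@data"]
    for row in rows:
        cells = ", ".join(f"{i} {v}" for i, v in sorted(row.items()))
        lines.append("{" + cells + "}")
    return "\n".join(lines)

# Toy bag-of-words features plus a class label.
attrs = [("jaguar", "numeric"), ("car", "numeric"), ("runs", "numeric"),
         ("class", "{auto, not_auto}")]
rows = [{0: 2, 2: 1, 3: "auto"},      # doc 1: jaguar=2, runs=1, everything else 0
        {1: 1, 3: "not_auto"}]        # doc 2: car=1
print(to_sparse_arff("toy_text", attrs, rows))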
36
Next Time
  • Wed: Guest lecture by Peter Jackson
  • "Pure and Applied Research in NLP: The Good, the
    Bad, and the Lucky"
  • Following week
  • Text Categorization Algorithms
  • How to use Weka