Title: I256: Applied Natural Language Processing
1I256 Applied Natural Language Processing
Preslav Nakov and Marti Hearst
October 16, 2006
(Many slides originally by Barbara Rosario, modified here)
2Today
- Classification
  - Text categorization (and other applications)
- Various issues regarding classification
  - Clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification
- Introduce the steps necessary for a classification task
  - Define classes
  - Label text
  - Features
  - Training and evaluation of a classifier
3Classification
- Goal: Assign objects from a universe to two or more classes or categories
- Examples (problem: object, categories)
  - Tagging: word, POS
  - Sense disambiguation: word, the word's senses
  - Information retrieval: document, relevant/not relevant
  - Sentiment classification: document, positive/negative
  - Author identification: document, authors
4Text Categorization Applications
- Web pages organized into category hierarchies
- Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.)
- Responses to Census Bureau occupations
- Patents archived using International Patent Classification
- Patient records coded using international insurance categories
- E-mail message filtering
- News events tracked and filtered by topics
- Spam vs. anti-palm
6Why not a semi-automatic text categorization tool?
- Humans can encode knowledge of what constitutes membership in a category.
- This encoding can then be automatically applied by a machine to categorize new examples.
- For example...
7Expert System (late 1980s)
8Rule-based Approach to Text Categorization
- Text in a Web page: "Saeco revolutionized espresso brewing a decade ago by introducing Saeco SuperAutomatic machines, which go from bean to coffee at the touch of a button. The all-new Saeco Vienna Super-Automatic home coffee and cappucino machine combines top quality with low price!"
- Rules
  - Rule 1. (espresso or coffee or cappucino) and machine -> Coffee Maker
  - Rule 2. automat and answering and machine -> Phone
  - Rule ...
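The two rules above can be sketched in a few lines of Python. This is only an illustration, not the system from the slide: the rule table layout, the function names, and the choice to match each term as a word prefix (so that "machine" also matches "machines") are all assumptions. The spelling "cappucino" follows the slide text.

```python
import re

# Hypothetical rule table: each rule is (term groups, category).
# Every group must be satisfied by at least one of its terms.
RULES = [
    ([("espresso", "coffee", "cappucino"), ("machine",)], "Coffee Maker"),
    ([("automat",), ("answering",), ("machine",)], "Phone"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def matches(term, tokens):
    """Assumption: a term matches if some token starts with it."""
    return any(tok.startswith(term) for tok in tokens)

def categorize(text, rules=RULES):
    """Return the categories of all rules whose conditions hold for the text."""
    tokens = tokenize(text)
    found = []
    for groups, category in rules:
        if all(any(matches(term, tokens) for term in group) for group in groups):
            found.append(category)
    return found

page = ("Saeco revolutionized espresso brewing a decade ago by introducing "
        "Saeco SuperAutomatic machines, which go from bean to coffee at the "
        "touch of a button.")
print(categorize(page))  # ['Coffee Maker']
```

Note that "SuperAutomatic" does not trigger Rule 2 here, because the prefix match only looks at the start of each token. Small matching decisions like this are one place where hand-written rule sets become inconsistent as they grow.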
9Defining Rules By Hand
- This is fine for low-stakes applications
  - Google and Yahoo alerts allow users to automatically receive news articles containing certain keywords
  - Called filtering or routing
  - Works fine when it's OK to miss some things
- But when high accuracy is required, experience has shown that defining rules by hand is
  - too time consuming
  - too difficult
  - prone to inconsistency (as the rule set gets large)
10Replace Knowledge Engineering with a Statistical Learner
11Cost of Manual Text Categorization
- Yahoo!
  - 200 (?) people for manual labeling of Web pages
  - using a hierarchy of 500,000 categories
- MEDLINE (National Library of Medicine)
  - $2 million/year for manual indexing of journal articles
  - using Medical Subject Headings (18,000 categories)
- Mayo Clinic
  - $1.4 million annually for coding patient-record events
  - using the International Classification of Diseases (ICD) for billing insurance companies
- US Census Bureau decennial census (1990: 22 million responses)
  - 232 industry categories and 504 occupation categories
  - $15 million if fully done by hand
12Knowledge Engineering vs. Statistical Learning
- For US Census Bureau Decennial Census 1990
  - 232 industry categories and 504 occupation categories
  - $15 million if fully done by hand
- Define classification rules manually (knowledge engineering)
  - Expert system AIOCS
  - Development time: 192 person-months (2 people, 8 years)
  - Accuracy: 47%
- Learn classification function (statistical learning)
  - Nearest Neighbor classification (Creecy '92: 1-NN)
  - Development time: 4 person-months (Thinking Machines)
  - Accuracy: 60%
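To make the 1-NN idea concrete, here is a minimal sketch of nearest-neighbor text classification. It does not reproduce the Census system: the bag-of-words representation, the cosine similarity measure, and the toy occupation examples are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts (an assumed, very simple representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbor(text, labeled_examples):
    """1-NN: return the label of the single most similar labeled example."""
    query = bag_of_words(text)
    _, best_label = max(
        labeled_examples,
        key=lambda example: cosine(query, bag_of_words(example[0])),
    )
    return best_label

# Made-up labeled examples (not Census data).
examples = [
    ("registered nurse in a hospital ward", "nurse"),
    ("drives a delivery truck for a bakery", "driver"),
    ("teaches mathematics at a high school", "teacher"),
]
print(nearest_neighbor("truck driver for a delivery company", examples))  # driver
```

There is no rule writing here: all the "knowledge" sits in the labeled examples, which is why development time drops when labeled data is available.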
13Text Topic categorization
- Topic categorization: classify the document into semantic topics
14The Reuters collection
- A gold standard
- Collection of 21,578 newswire documents
- For research purposes: a standard text collection to compare systems and algorithms
- 135 valid topic categories
15Reuters
16Reuters Document Example
CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"
2-MAR-1987 16:51:43.42
Topics: livestock, hog
AMERICAN PORK CONGRESS KICKS OFF TOMORROW
CHICAGO, March 2 - The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added.
Reuter
17Classification vs. Clustering
- Classification assumes labeled data: we know how many classes there are and we have examples for each class
  - Classification is supervised
- In clustering we don't have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are
  - Clustering is unsupervised
18Classification
Class1
Class2
22Clustering
27Categories (Labels, Classes)
- Labeling data
- 2 problems
  - Decide the possible classes (which ones, how many)
    - Domain and application dependent
  - Label text
    - Difficult, time consuming, inconsistency between annotators
28Reuters Example, revisited
Why is the topic not "policy"?
29Binary vs. multi-way classification
- Binary classification: two classes
- Multi-way classification: more than two classes
- Sometimes it can be convenient to treat a multi-way problem as a set of binary ones: one class versus all the others, for all classes
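The "one class versus all the others" idea can be sketched as a relabeling step. The function names and the toy labels are assumptions for illustration; training the resulting binary classifiers is left out.

```python
def one_vs_rest_labels(labels, target):
    """Relabel a multi-way problem as binary: target class vs. all the others."""
    return [label == target for label in labels]

def binary_problems(labels):
    """Build one binary labeling per class, keyed by class name."""
    return {cls: one_vs_rest_labels(labels, cls) for cls in sorted(set(labels))}

# Made-up multi-way labels for four documents.
labels = ["sport", "politics", "sport", "business"]
for cls, binary in binary_problems(labels).items():
    print(cls, binary)
# business [False, False, False, True]
# politics [False, True, False, False]
# sport [True, False, True, False]
```

One binary classifier would then be trained per class. A common way to recombine them (again an assumption, not stated on the slide) is to assign the class whose binary classifier is most confident.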
30 Flat vs. Hierarchical classification
- Flat classification: relations between the classes are undetermined
- Hierarchical classification: a hierarchy where each node is a sub-class of its parent node
31Single- vs. multi-category classification
- In single-category text classification, each text belongs to exactly one category
- In multi-category text classification, each text can have zero or more categories
32Features
- text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
- label = 'sport'
- labeled_text = LabeledText(text, label)
- Here the classification takes as input the whole string
- What's the problem with that?
- What are the features that could be useful for this example?
33Feature terminology
- Feature: an aspect of the text that is relevant to the task
- Some typical features
  - Words present in text
  - Frequency of words
  - Capitalization
  - Are there named entities (NE)?
  - WordNet
  - Others?
34Feature terminology
- Feature: an aspect of the text that is relevant to the task
- Feature value: the realization of the feature in the text
  - Words present in text: Kerry, Schumacher, China
  - Frequency of word: Kerry(10), Schumacher(1)
  - Are there dates? Yes/no
  - Are there PERSONS? Yes/no
  - Are there ORGANIZATIONS? Yes/no
  - WordNet: Holonyms (China is part of Asia), Synonyms (China, People's Republic of China, mainland China)
35Feature Types
- Boolean (or binary) features
  - Features that generate boolean (binary) values.
  - Boolean features are the simplest and the most common type of feature.
  - f1(text) = 1 if text contains "Kerry", 0 otherwise
  - f2(text) = 1 if text contains a PERSON, 0 otherwise
36Feature Types
- Integer features
  - Features that generate integer values.
  - Integer features can be used to give classifiers access to more precise information about the text.
  - f1(text) = number of times text contains "Kerry"
  - f2(text) = number of times text contains a PERSON
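Boolean and integer word features like f1 can be sketched as plain functions. The function names and the feature-name strings are assumptions for illustration. The PERSON features are not shown, since they would need a named-entity recognizer.

```python
import re

def tokenize(text):
    """Split the text into alphabetic word tokens, keeping capitalization."""
    return re.findall(r"[A-Za-z]+", text)

def contains_word(text, word):
    """Boolean feature: 1 if the word occurs in the text, 0 otherwise."""
    return 1 if word in tokenize(text) else 0

def count_word(text, word):
    """Integer feature: number of times the word occurs in the text."""
    return tokenize(text).count(word)

def extract_features(text, vocabulary):
    """Map a text to a dictionary of named feature values."""
    features = {}
    for word in vocabulary:
        features["contains(%s)" % word] = contains_word(text, word)
        features["count(%s)" % word] = count_word(text, word)
    return features

text = ("Seven-time Formula One champion Michael Schumacher took on the "
        "Shanghai circuit Saturday in qualifying for the first Chinese "
        "Grand Prix.")
print(extract_features(text, ["Schumacher", "Kerry"]))
# {'contains(Schumacher)': 1, 'count(Schumacher)': 1, 'contains(Kerry)': 0, 'count(Kerry)': 0}
```

A classifier then sees only this dictionary, not the raw string, so texts that share informative words look similar to it even when the strings differ.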
37Feature selection
- How do we choose the right features?
- A future lecture
38Classification
- Define classes
- Label text
- Extract features
- Choose a classifier
  - my_classifier.classify(token)
  - The Naive Bayes Classifier
  - NN (perceptron)
  - SVM
  - ...
- Train it (and test it)
- Use it to classify new examples
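The steps above can be walked through end to end with a small Naive Bayes classifier. This is a sketch under assumptions, not the toolkit API used in the course: the `NaiveBayes` class, its `train` and `classify` methods, the add-one smoothing, and the four training texts are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Feature extraction: lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def train(self, labeled_texts):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocabulary = set()
        for text, label in labeled_texts:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def classify(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label in sorted(self.class_counts):
            # log P(class) + sum of log P(word | class)
            score = math.log(self.class_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Steps 1-2: classes and labeled text (made-up examples).
training = [
    ("Schumacher wins the Grand Prix", "sport"),
    ("the team qualifying for the race", "sport"),
    ("the senate votes on the tax law", "politics"),
    ("Kerry campaigns on farm policy", "politics"),
]
# Steps 4-6: choose a classifier, train it, classify a new example.
my_classifier = NaiveBayes().train(training)
print(my_classifier.classify("Schumacher qualifying for the Chinese Grand Prix"))  # sport
```

Testing on held-out data is deliberately omitted here. Four training texts are far too few for a meaningful evaluation.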
39Training
- Usually the classifier is defined by a set of parameters
- Training is the procedure for finding a good set of parameters
- Goodness is determined by an optimization criterion such as misclassification rate
- Some classifiers are guaranteed to find the optimal set of parameters
40Testing, evaluation of the classifier
- After choosing the parameters of the classifier (i.e., after training it), we need to test how well it's doing on a test set (not included in the training set)
- Calculate the misclassification rate on the test set
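A held-out test set and the misclassification rate can be sketched as follows. The helper names, the 25% test fraction, the fixed random seed, and the toy data are assumptions for illustration.

```python
import random

def split_data(labeled_examples, test_fraction=0.25, seed=0):
    """Shuffle the labeled examples and hold out a fraction for testing."""
    examples = list(labeled_examples)
    random.Random(seed).shuffle(examples)
    n_test = max(1, int(len(examples) * test_fraction))
    return examples[n_test:], examples[:n_test]  # (training set, test set)

def misclassification_rate(classify, test_set):
    """Fraction of test examples whose predicted label differs from the true label."""
    errors = sum(1 for item, gold in test_set if classify(item) != gold)
    return errors / len(test_set)

# Toy data: numbers labeled by sign. The "classifier" is a fixed threshold
# rule that is deliberately imperfect (it calls 1 and 2 negative).
data = [(x, "pos" if x > 0 else "neg") for x in range(-10, 11) if x != 0]
train_set, test_set = split_data(data)

def threshold_rule(x):
    return "pos" if x > 2 else "neg"

print(len(train_set), len(test_set))
print(misclassification_rate(threshold_rule, test_set))
```

The key point is that `test_set` shares no examples with `train_set`, so the error rate estimates performance on unseen data rather than on data the classifier was tuned on.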
41Evaluating classifiers
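- Contingency table for the evaluation of a binary classifier (classes GREEN and RED)
  - a: GREEN examples classified as GREEN
  - b: RED examples classified as GREEN
  - c: GREEN examples classified as RED
  - d: RED examples classified as RED
- Accuracy = (a + d) / (a + b + c + d)
- Precision: P_GREEN = a / (a + b), P_RED = d / (c + d)
- Recall: R_GREEN = a / (a + c), R_RED = d / (b + d)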
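The measures above translate directly into code. In this sketch, a and d are taken to be the correctly classified GREEN and RED counts, and b and c the two kinds of error, which is the reading the formulas imply. The function name and the example counts are made up.

```python
def evaluate(a, b, c, d):
    """Compute accuracy, precision and recall from the four contingency counts.

    Assumed cell meanings:
      a: GREEN classified as GREEN    b: RED classified as GREEN
      c: GREEN classified as RED      d: RED classified as RED
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision_green": a / (a + b),
        "precision_red": d / (c + d),
        "recall_green": a / (a + c),
        "recall_red": d / (b + d),
    }

# Made-up counts for 100 test examples.
for name, value in evaluate(a=40, b=10, c=5, d=45).items():
    print(name, round(value, 3))
# accuracy 0.85
# precision_green 0.8
# precision_red 0.9
# recall_green 0.889
# recall_red 0.818
```

This sketch does not guard against empty rows or columns: if the classifier never predicts a class, the corresponding precision divides by zero and needs separate handling.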
42Training size
- The more the better! (usually)
- Results for text classification
46Upcoming
- Classifiers
- Feature selection algorithms