I256: Applied Natural Language Processing

Provided by: coursesIs

Learn more at: https://courses.ischool.berkeley.edu

1
I256 Applied Natural Language Processing
Preslav Nakov and Marti Hearst
October 16, 2006
(Many slides originally by Barbara Rosario, modified here)
2
Today

Classification
Text categorization (and other applications)
Various issues regarding classification
Clustering vs. classification, binary vs.
multi-way, flat vs. hierarchical classification
Introduce the steps necessary for a
classification task
Define classes
Label text
Features
Training and evaluation of a classifier

3
Classification

Goal: Assign objects from a universe to two or more classes or categories
Examples:
Problem                    Object      Categories
Tagging                    Word        POS
Sense disambiguation       Word        The word's senses
Information retrieval      Document    Relevant/not relevant
Sentiment classification   Document    Positive/negative
Author identification      Document    Authors

4
Text Categorization Applications

Web pages organized into category hierarchies
Journal articles indexed by subject categories
(e.g., the Library of Congress, MEDLINE, etc.)
Responses to Census Bureau occupations
Patents archived using International Patent
Classification
Patient records coded using international
insurance categories
E-mail message filtering
News events tracked and filtered by topics
Spam vs. anti-palm

5
Yahoo News Categories

6
Why not a semi-automatic text categorization tool?

Humans can encode knowledge of what constitutes
membership in a category.
This encoding can then be automatically applied
by a machine to categorize new examples.
For example...

7
Expert System (late 1980s)
8
Rule-based Approach to Text Categorization

Text in a Web Page: "Saeco revolutionized espresso
brewing a decade ago by introducing Saeco
SuperAutomatic machines, which go from bean to
coffee at the touch of a button. The all-new
Saeco Vienna Super-Automatic home coffee and
cappucino machine combines top quality with low
price!"
Rules
Rule 1. (espresso or coffee or cappucino) and machine => Coffee Maker
Rule 2. automat and answering and machine => Phone
Rule ...
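
A minimal sketch of how the two rules above could be applied automatically. The rule encoding (a list of keyword groups per category) and the exact-token matching are assumptions made for illustration; the slide does not say how the rules were matched against text.

```python
import re

# Hypothetical rule table: each rule is (keyword groups, category).
# A rule fires when every group has at least one keyword in the text,
# i.e. groups are AND-ed and the keywords inside a group are OR-ed.
RULES = [
    ([{"espresso", "coffee", "cappucino"}, {"machine"}], "Coffee Maker"),
    ([{"automat"}, {"answering"}, {"machine"}], "Phone"),
]

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def categorize(text):
    """Return the categories of all rules that fire on the text."""
    tokens = tokenize(text)
    return [category for groups, category in RULES
            if all(group & tokens for group in groups)]

page = ("Saeco revolutionized espresso brewing a decade ago by introducing "
        "Saeco SuperAutomatic machines, which go from bean to coffee at the "
        "touch of a button. The all-new Saeco Vienna Super-Automatic home "
        "coffee and cappucino machine combines top quality with low price!")
print(categorize(page))
```

Because matching is on whole tokens, "automat" does not match "Automatic" here; a real rule language would need to state whether keywords are stems, prefixes, or exact words.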

9
Defining Rules By Hand

This is fine for low-stakes applications
Google and Yahoo alerts allow users to
automatically receive news articles containing
certain keywords
Called filtering or routing
Works fine when it's OK to miss some things
But when high accuracy is required, experience
has shown that hand-written rules are
too time consuming
too difficult
prone to inconsistency (as the rule set gets large)

10
Replace Knowledge Engineering with a Statistical
Learner
11
Cost of Manual Text Categorization

Yahoo!
200 (?) people for manual labeling of Web pages
using a hierarchy of 500,000 categories
MEDLINE (National Library of Medicine)
$2 million/year for manual indexing of journal
articles
using Medical Subject Headings (MeSH, 18,000
categories)
Mayo Clinic
$1.4 million annually for coding patient-record
events
using the International Classification of
Diseases (ICD) for billing insurance companies
US Census Bureau decennial census (1990: 22
million responses)
232 industry categories and 504 occupation
categories
$15 million if fully done by hand

12
Knowledge Engineering vs. Statistical Learning

For US Census Bureau Decennial Census 1990
232 industry categories and 504 occupation
categories
$15 million if fully done by hand
Knowledge engineering: define classification rules manually
Expert System AIOCS
Development time: 192 person-months (2 people, 8
years)
Accuracy: 47%
Statistical learning: learn classification function
Nearest Neighbor classification (Creecy '92:
1-NN)
Development time: 4 person-months (Thinking
Machine)
Accuracy: 60%
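
To illustrate what 1-NN classification means, here is a small sketch that labels a new text with the label of its most similar training text. The bag-of-words representation, the cosine similarity, and the toy training data are all assumptions for illustration; this is not a reconstruction of the Census system cited above.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_1nn(text, labeled_examples):
    """Return the label of the single nearest labeled example."""
    query = bag_of_words(text)
    _, label = max(labeled_examples,
                   key=lambda ex: cosine(query, bag_of_words(ex[0])))
    return label

# Made-up training examples (response text, occupation category).
train = [
    ("drives a delivery truck", "transport"),
    ("teaches high school math", "education"),
    ("repairs truck engines", "mechanic"),
]
print(classify_1nn("teaches math at a college", train))
```

No rules are written by hand: all the "knowledge" lives in the labeled examples, which is why development time can be so much shorter.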

13
Text: Topic categorization

Topic categorization: classify the document into
semantic topics

14
The Reuters collection

A gold standard
Collection of 21,578 newswire documents
For research purposes: a standard text collection
to compare systems and algorithms
135 valid topic categories

15
Reuters

Top topics in Reuters

16
Reuters Document Example
CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"
Date: 2-MAR-1987 16:51:43.42
Topics: livestock, hog
Title: AMERICAN PORK CONGRESS KICKS OFF TOMORROW
Dateline: CHICAGO, March 2 -
The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160
of the nation's pork producers from 44 member
states determining industry positions on a number
of issues, according to the National Pork
Producers Council, NPPC. Delegates to the
three-day Congress will be considering 26
resolutions concerning various issues, including
the future direction of farm policy and the tax
law as it applies to the agriculture sector. The
delegates will also debate whether to endorse
concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the
congress, will feature the latest in technology
in all areas of the industry, the NPPC added.
Reuter
17
Classification vs. Clustering

Classification assumes labeled data: we know how
many classes there are and we have examples for
each class (labeled data).
Classification is supervised
In clustering we don't have labeled data: we just
assume that there is a natural division in the
data, and we may not know how many divisions
(clusters) there are
Clustering is unsupervised

18-21
Classification
[Figure sequence; labels: Class1, Class2]
22-26
Clustering
[Figure sequence]
27
Categories (Labels, Classes)

Labeling data
Two problems:
1. Decide the possible classes (which ones, how
many)
Domain and application dependent
2. Label text
Difficult, time consuming, inconsistency between
annotators

28
Reuters Example, revisited
Why not topic = policy?
CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"
Date: 2-MAR-1987 16:51:43.42
Topics: livestock, hog
Title: AMERICAN PORK CONGRESS KICKS OFF TOMORROW
Dateline: CHICAGO, March 2 -
The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160
of the nation's pork producers from 44 member
states determining industry positions on a number
of issues, according to the National Pork
Producers Council, NPPC. Delegates to the
three-day Congress will be considering 26
resolutions concerning various issues, including
the future direction of farm policy and the tax
law as it applies to the agriculture sector. The
delegates will also debate whether to endorse
concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the
congress, will feature the latest in technology
in all areas of the industry, the NPPC added.
Reuter
29
Binary vs. multi-way classification

Binary classification: two classes
Multi-way classification: more than two classes
Sometimes it can be convenient to treat a
multi-way problem like a binary one: one class
versus all the others, repeated for each class
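
The one-versus-all-the-others reduction can be sketched as a relabeling step: each class gets its own binary dataset in which that class is the positive label. The function name and the toy examples are assumptions for illustration.

```python
def one_vs_rest_datasets(examples):
    """Turn (text, label) pairs into one binary dataset per class.

    In the dataset for class c, an example is labeled True if its
    original label is c and False otherwise.
    """
    classes = sorted({label for _, label in examples})
    return {
        c: [(text, label == c) for text, label in examples]
        for c in classes
    }

# Made-up multi-way training data.
examples = [
    ("pork producers meet", "livestock"),
    ("grand prix qualifying", "sport"),
    ("central bank raises rates", "finance"),
]
binary = one_vs_rest_datasets(examples)
print(sorted(binary))
print(binary["sport"])
```

A separate binary classifier would then be trained on each of these datasets.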

30
Flat vs. Hierarchical classification

Flat classification: relations between the
classes undetermined
Hierarchical classification: hierarchy where each
node is a sub-class of its parent node

31
Single- vs. multi-category classification

In single-category text classification each text
belongs to exactly one category
In multi-category text classification, each text
can have zero or more categories
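
One way to see the difference is in how a decision is made from per-category scores. In this sketch the scores and the 0.5 threshold are made up; how the scores are produced is left open.

```python
def single_category(scores):
    """Single-category: pick exactly one category, the best-scoring one."""
    return max(scores, key=scores.get)

def multi_category(scores, threshold=0.5):
    """Multi-category: keep every category whose score passes the
    threshold, which may be none at all."""
    return [c for c, s in scores.items() if s >= threshold]

# Hypothetical classifier scores for one document.
scores = {"livestock": 0.9, "hog": 0.7, "policy": 0.3}
print(single_category(scores))
print(multi_category(scores))
print(multi_category({"livestock": 0.2, "hog": 0.1}))
```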

32
Features

text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
label = "sport"
labeled_text = LabeledText(text, label)

Here the classification takes as input the whole
string
What's the problem with that?
What are the features that could be useful for
this example?

33
Feature terminology

Feature: an aspect of the text that is relevant
to the task
Some typical features
Words present in text
Frequency of words
Capitalization
Are there named entities (NE)?
WordNet
Others?

34
Feature terminology

Feature: an aspect of the text that is relevant
to the task
Feature value: the realization of the feature in
the text
Words present in text: Kerry, Schumacher, China
Frequency of word: Kerry(10), Schumacher(1)
Are there dates? Yes/no
Are there PERSONS? Yes/no
Are there ORGANIZATIONS? Yes/no
WordNet: Holonyms (China is part of Asia),
Synonyms (China, People's Republic of China,
mainland China)

35
Feature Types

Boolean (or Binary) Features
Features that generate boolean (binary) values.
Boolean features are the simplest and the most
common type of feature.
f1(text) = 1 if text contains "Kerry"
           0 otherwise
f2(text) = 1 if text contains a PERSON
           0 otherwise
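
A sketch of the two Boolean features as Python functions. Detecting a PERSON normally requires a named-entity recognizer; the tiny KNOWN_PERSONS list below is a hypothetical stand-in so that the example stays self-contained.

```python
import re

# Hypothetical stand-in for a named-entity recognizer.
KNOWN_PERSONS = {"kerry", "schumacher"}

def words(text):
    """Lowercased alphabetic tokens of the text."""
    return re.findall(r"[a-z]+", text.lower())

def f1(text):
    """1 if the text contains the word "Kerry", 0 otherwise."""
    return 1 if "kerry" in words(text) else 0

def f2(text):
    """1 if the text contains any known person name, 0 otherwise."""
    return 1 if any(w in KNOWN_PERSONS for w in words(text)) else 0

text = ("Seven-time Formula One champion Michael Schumacher "
        "took on the Shanghai circuit")
print(f1(text), f2(text))
```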

36
Feature Types

Integer Features
Features that generate integer values.
Integer features can be used to give classifiers
access to more precise information about the
text.
f1(text) = number of times text contains "Kerry"
f2(text) = number of times text contains a PERSON
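
The integer versions can be sketched by counting instead of testing for presence. As before, KNOWN_PERSONS is a hypothetical stand-in for a named-entity recognizer.

```python
import re
from collections import Counter

# Hypothetical stand-in for a named-entity recognizer.
KNOWN_PERSONS = {"kerry", "schumacher"}

def word_counts(text):
    """Count lowercased alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def f1(text):
    """Number of times the word "Kerry" occurs in the text."""
    return word_counts(text)["kerry"]

def f2(text):
    """Number of occurrences of any known person name in the text."""
    counts = word_counts(text)
    return sum(counts[name] for name in KNOWN_PERSONS)

text = "Kerry spoke. Kerry met Schumacher."
print(f1(text), f2(text))
```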

37
Feature selection

How do we choose the right features?
A future lecture

38
Classification

1. Define classes
2. Label text
3. Extract features
4. Choose a classifier
my_classifier.classify(token)
The Naive Bayes Classifier
NN (perceptron)
SVM
...
5. Train it (and test it)
6. Use it to classify new examples
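
The steps above can be sketched end to end with a small Naive Bayes classifier written from scratch. The class name, the whitespace tokenizer, the add-one smoothing, and the toy training data are assumptions for illustration; this is not the classifier API used in the course.

```python
import math
from collections import Counter, defaultdict

def extract_features(text):
    """Feature extraction: the lowercased words of the text."""
    return text.lower().split()

class NaiveBayes:
    def train(self, labeled_texts):
        """Count documents per class and words per class."""
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in labeled_texts:
            self.class_counts[label] += 1
            for w in extract_features(text):
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def classify(self, text):
        """Return the class with the highest log posterior score."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for w in extract_features(text):
                if w not in self.vocab:
                    continue  # ignore words never seen in training
                # add-one (Laplace) smoothed word likelihood
                score += math.log((self.word_counts[label][w] + 1)
                                  / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Steps 1-2: classes and labeled text (made-up examples).
train = [
    ("grand prix race circuit", "sport"),
    ("champion wins race", "sport"),
    ("pork producers farm policy", "agriculture"),
    ("farm hog prices", "agriculture"),
]
my_classifier = NaiveBayes()
my_classifier.train(train)                           # step 5
print(my_classifier.classify("race at the circuit"))  # step 6
```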

39
Training

Usually the classifier is defined by a set of
parameters
Training is the procedure for finding a good
set of parameters
Goodness is determined by an optimization
criterion such as misclassification rate
Some classifiers are guaranteed to find the
optimal set of parameters
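
As an illustration of training as a search for parameters, here is a sketch of perceptron training: the parameters are one weight per feature plus a bias, and they are adjusted whenever a training example is misclassified. The feature dictionaries, the epoch limit, and the toy data are assumptions for illustration.

```python
def score(weights, bias, features):
    """Weighted sum of the feature values plus the bias."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in features.items())

def predict(weights, bias, features):
    """Predict +1 or -1 from the sign of the score."""
    return 1 if score(weights, bias, features) > 0 else -1

def train_perceptron(examples, epochs=10):
    """examples: list of (feature_dict, label), label in {+1, -1}."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:
            if predict(weights, bias, features) != label:
                # Move the parameters toward the correct label.
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + label * v
                bias += label
                mistakes += 1
        if mistakes == 0:  # no training errors left, stop early
            break
    return weights, bias

# Made-up data: +1 = sport, -1 = agriculture.
examples = [
    ({"race": 1, "circuit": 1}, 1),
    ({"farm": 1, "hog": 1}, -1),
    ({"race": 1, "champion": 1}, 1),
    ({"farm": 1, "policy": 1}, -1),
]
weights, bias = train_perceptron(examples)
print(weights, bias)
```

Here the optimization criterion is the number of misclassified training examples; training stops once an epoch produces no mistakes.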

40
Testing, evaluation of the classifier

After choosing the parameters of the classifier
(i.e. after training it) we need to test how well
it's doing on a test set (not included in the
training set)
Calculate the misclassification rate on the test set
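
A minimal sketch of the misclassification rate: the fraction of test examples whose predicted label differs from the gold label. The keyword classifier and the test set below are made up, standing in for a trained classifier and held-out data.

```python
def misclassification_rate(classify, test_set):
    """Fraction of (text, gold_label) pairs that classify gets wrong."""
    errors = sum(1 for text, gold in test_set if classify(text) != gold)
    return errors / len(test_set)

def keyword_classifier(text):
    """Hypothetical stand-in for a trained classifier."""
    return "sport" if "race" in text.lower() else "other"

# Held-out examples that were not used for training.
test_set = [
    ("the race starts Sunday", "sport"),
    ("pork prices fall", "other"),
    ("champion retires", "sport"),
    ("bank raises rates", "other"),
]
rate = misclassification_rate(keyword_classifier, test_set)
print(rate)
```

Accuracy is simply one minus this rate.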

41
Evaluating classifiers