I256: Applied Natural Language Processing


1
I256 Applied Natural Language Processing
Preslav Nakov and Marti Hearst
October 16, 2006
(Many slides originally by Barbara Rosario, modified here)
2
Today
  • Classification
  • Text categorization (and other applications)
  • Various issues regarding classification
  • Clustering vs. classification, binary vs.
    multi-way, flat vs. hierarchical classification
  • Introduce the steps necessary for a
    classification task
  • Define classes
  • Label text
  • Features
  • Training and evaluation of a classifier

3
Classification
  • Goal: assign objects from a universe to two or
    more classes or categories
  • Examples (problem: object → categories)
  • Tagging: word → POS
  • Sense disambiguation: word → the word's senses
  • Information retrieval: document →
    relevant/not relevant
  • Sentiment classification: document →
    positive/negative
  • Author identification: document → authors

4
Text Categorization Applications
  • Web pages organized into category hierarchies
  • Journal articles indexed by subject categories
    (e.g., the Library of Congress, MEDLINE, etc.)
  • Responses to Census Bureau occupations
  • Patents archived using International Patent
    Classification
  • Patient records coded using international
    insurance categories
  • E-mail message filtering
  • News events tracked and filtered by topics
  • Spam vs. anti-spam

5
  • Yahoo News Categories

6
Why not a semi-automatic text categorization tool?
  • Humans can encode knowledge of what constitutes
    membership in a category.
  • This encoding can then be automatically applied
    by a machine to categorize new examples.
  • For example...

7
Expert System (late 1980s)
8
Rule-based Approach to Text Categorization
  • Text in a Web page: "Saeco revolutionized espresso
    brewing a decade ago by introducing Saeco
    SuperAutomatic machines, which go from bean to
    coffee at the touch of a button. The all-new
    Saeco Vienna Super-Automatic home coffee and
    cappucino machine combines top quality with low
    price!"
  • Rules
  • Rule 1. (espresso or coffee or cappucino) and
    machine → Coffee Maker
  • Rule 2. automat and answering and machine →
    Phone
  • Rule ...
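
A minimal sketch of how hand-written rules like these could be applied by a machine. The rule encoding, the `categorize` function, and the choice to match keywords as word prefixes (so that "automat" matches "Automatic") are assumptions made for illustration; the slide does not say how the rules were implemented.

```python
import re

# Hypothetical encoding of the two rules on the slide.
# Each rule is (list of keyword groups, category). A rule fires when
# every group has at least one keyword that is a prefix of some word.
RULES = [
    ([("espresso", "coffee", "cappucino"), ("machine",)], "Coffee Maker"),
    ([("automat",), ("answering",), ("machine",)], "Phone"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(text, rules=RULES):
    """Return the categories of all rules that fire on the text."""
    tokens = tokenize(text)
    matched = []
    for groups, category in rules:
        if all(
            any(tok.startswith(kw) for tok in tokens for kw in group)
            for group in groups
        ):
            matched.append(category)
    return matched


page = (
    "Saeco revolutionized espresso brewing a decade ago by introducing "
    "Saeco SuperAutomatic machines, which go from bean to coffee at the "
    "touch of a button. The all-new Saeco Vienna Super-Automatic home "
    "coffee and cappucino machine combines top quality with low price!"
)
print(categorize(page))
```

Rule 2 does not fire on this page because "answering" never occurs, even though "automat" and "machine" both match.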

9
Defining Rules By Hand
  • This is fine for low-stakes applications
  • Google and Yahoo alerts allow users to
    automatically receive news articles containing
    certain keywords
  • Called filtering or routing
  • Works fine when it's OK to miss some things
  • But when high accuracy is required, experience
    has shown the drawbacks:
  • too time consuming
  • too difficult
  • inconsistency issues (as the rule set gets large)

10
Replace Knowledge Engineering with a Statistical
Learner
11
Cost of Manual Text Categorization
  • Yahoo!
  • 200 (?) people for manual labeling of Web pages
  • using a hierarchy of 500,000 categories
  • MEDLINE (National Library of Medicine)
  • $2 million/year for manual indexing of journal
    articles
  • using Medical Subject Headings (18,000
    categories)
  • Mayo Clinic
  • $1.4 million annually for coding patient-record
    events
  • using the International Classification of
    Diseases (ICD) for billing insurance companies
  • US Census Bureau decennial census (1990: 22
    million responses)
  • 232 industry categories and 504 occupation
    categories
  • $15 million if fully done by hand

12
Knowledge Engineering vs. Statistical Learning
  • For the US Census Bureau decennial census, 1990
  • 232 industry categories and 504 occupation
    categories
  • $15 million if fully done by hand
  • Knowledge engineering: define classification
    rules manually
  • Expert system: AIOCS
  • Development time: 192 person-months (2 people, 8
    years)
  • Accuracy: 47%
  • Statistical learning: learn a classification
    function
  • Nearest Neighbor classification (Creecy '92:
    1-NN)
  • Development time: 4 person-months (Thinking
    Machines)
  • Accuracy: 60%
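
To make the 1-NN idea concrete, here is a small sketch: a new response gets the label of the most similar labeled example. This is not the Creecy system; the bag-of-words representation, the Jaccard similarity, and the toy occupation examples are all assumptions chosen for illustration.

```python
def bag_of_words(text):
    """Represent a text as the set of its lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Similarity of two word sets: shared words / all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbor_label(text, labeled_examples):
    """1-NN: return the label of the most similar training example."""
    query = bag_of_words(text)
    _, best_label = max(
        labeled_examples,
        key=lambda example: jaccard(query, bag_of_words(example[0])),
    )
    return best_label


# Hypothetical training data (free-text responses with occupation labels).
train = [
    ("drives a delivery truck", "truck driver"),
    ("teaches third grade students", "teacher"),
    ("repairs car engines", "mechanic"),
]

print(nearest_neighbor_label("drives a truck for a bakery", train))
```

No rules are written by hand here; all the knowledge comes from the labeled examples, which is the point of the comparison above.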

13
Text Topic categorization
  • Topic categorization: classify the document into
    semantic topics

14
The Reuters collection
  • A gold standard
  • A collection of 21,578 newswire documents
  • For research purposes: a standard text collection
    to compare systems and algorithms
  • 135 valid topic categories

15
Reuters
  • Top topics in Reuters

16
Reuters Document Example
CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"
Date: 2-MAR-1987 16:51:43.42
Topics: livestock, hog
Title: AMERICAN PORK CONGRESS KICKS OFF TOMORROW
Dateline: CHICAGO, March 2 -
The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160
of the nations pork producers from 44 member
states determining industry positions on a number
of issues, according to the National Pork
Producers Council, NPPC. Delegates to the
three day Congress will be considering 26
resolutions concerning various issues, including
the future direction of farm policy and the tax
law as it applies to the agriculture sector. The
delegates will also debate whether to endorse
concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the
congress, will feature the latest in technology
in all areas of the industry, the NPPC added.
Reuter
17
Classification vs. Clustering
  • Classification assumes labeled data: we know how
    many classes there are and we have examples for
    each class
  • Classification is supervised
  • In clustering we don't have labeled data; we just
    assume that there is a natural division in the
    data, and we may not know how many divisions
    (clusters) there are
  • Clustering is unsupervised

18
Classification
Class1
Class2
19
Classification
Class1
Class2
20
Classification
Class1
Class2
21
Classification
Class1
Class2
22
Clustering
23
Clustering
24
Clustering
25
Clustering
26
Clustering
27
Categories (Labels, Classes)
  • Labeling data
  • 2 problems:
  • 1. Decide the possible classes (which ones, how
    many)
  • Domain and application dependent
  • 2. Label text
  • Difficult, time consuming, inconsistency between
    annotators

28
Reuters Example, revisited
Why not topic "policy"?
CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"
Date: 2-MAR-1987 16:51:43.42
Topics: livestock, hog
Title: AMERICAN PORK CONGRESS KICKS OFF TOMORROW
Dateline: CHICAGO, March 2 -
The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160
of the nations pork producers from 44 member
states determining industry positions on a number
of issues, according to the National Pork
Producers Council, NPPC. Delegates to the
three day Congress will be considering 26
resolutions concerning various issues, including
the future direction of farm policy and the tax
law as it applies to the agriculture sector. The
delegates will also debate whether to endorse
concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the
congress, will feature the latest in technology
in all areas of the industry, the NPPC added.
Reuter
29
Binary vs. multi-way classification
  • Binary classification: two classes
  • Multi-way classification: more than two classes
  • Sometimes it can be convenient to treat a
    multi-way problem as a set of binary ones: one
    class versus all the others, for each class
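
The one-versus-all-the-others reduction can be sketched as a relabeling step: for each class, build a binary problem whose labels say "this class" or "not this class". The function names and the toy label list below are assumptions for illustration.

```python
def one_vs_rest_labels(labels, positive_class):
    """Binary labels: True for positive_class, False for every other class."""
    return [label == positive_class for label in labels]


def one_vs_rest_problems(labels):
    """One binary labeling per class found in the multi-way labels."""
    return {
        cls: one_vs_rest_labels(labels, cls)
        for cls in sorted(set(labels))
    }


# Hypothetical multi-way labels for four documents.
labels = ["sport", "politics", "sport", "business"]
problems = one_vs_rest_problems(labels)
for cls, binary_labels in problems.items():
    print(cls, binary_labels)
```

A binary classifier would then be trained on each of the three relabeled problems; how their outputs are combined into one multi-way decision is a separate design choice not covered here.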

30
Flat vs. Hierarchical classification
  • Flat classification: relations between the
    classes are undetermined
  • Hierarchical classification: a hierarchy where
    each node is a sub-class of its parent node

31
Single- vs. multi-category classification
  • In single-category text classification each text
    belongs to exactly one category
  • In multi-category text classification, each text
    can have zero or more categories

32
Features
  • text = "Seven-time Formula One champion
    Michael Schumacher took on the Shanghai circuit
    Saturday in qualifying for the first Chinese
    Grand Prix."
  • label = "sport"
  • labeled_text = LabeledText(text, label)
  • Here the classification takes as input the whole
    string
  • What's the problem with that?
  • What are the features that could be useful for
    this example?

33
Feature terminology
  • Feature: an aspect of the text that is relevant
    to the task
  • Some typical features
  • Words present in text
  • Frequency of words
  • Capitalization
  • Are there named entities (NE)?
  • WordNet
  • Others?

34
Feature terminology
  • Feature: an aspect of the text that is relevant
    to the task
  • Feature value: the realization of the feature in
    the text
  • Words present in text: Kerry, Schumacher, China
  • Frequency of word: Kerry (10), Schumacher (1)
  • Are there dates? Yes/no
  • Are there PERSONS? Yes/no
  • Are there ORGANIZATIONS? Yes/no
  • WordNet: holonyms (China is part of Asia),
    synonyms (China, People's Republic of China,
    mainland China)

35
Feature Types
  • Boolean (or Binary) Features
  • Features that generate boolean (binary) values.
  • Boolean features are the simplest and the most
    common type of feature.
  • f1(text) = 1 if text contains "Kerry",
    0 otherwise
  • f2(text) = 1 if text contains a PERSON,
    0 otherwise
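
The two boolean features can be sketched as plain functions. Detecting a PERSON properly needs a named-entity recognizer; the `KNOWN_PERSONS` lookup set below is a hypothetical stand-in for one, and the simple word splitter is likewise an assumption.

```python
import re

# Hypothetical stand-in for a named-entity recognizer: a fixed set of
# surnames treated as PERSON mentions.
KNOWN_PERSONS = {"Kerry", "Schumacher"}


def words(text):
    """Split the text into alphabetic words, keeping capitalization."""
    return re.findall(r"[A-Za-z]+", text)


def f1(text):
    """1 if the text contains the word "Kerry", 0 otherwise."""
    return 1 if "Kerry" in words(text) else 0


def f2(text):
    """1 if the text contains any known PERSON, 0 otherwise."""
    return 1 if any(w in KNOWN_PERSONS for w in words(text)) else 0


text = (
    "Seven-time Formula One champion Michael Schumacher took on the "
    "Shanghai circuit Saturday in qualifying for the first Chinese "
    "Grand Prix."
)
print(f1(text), f2(text))
```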

36
Feature Types
  • Integer Features
  • Features that generate integer values.
  • Integer features can be used to give classifiers
    access to more precise information about the
    text.
  • f1(text) = number of times text contains "Kerry"
  • f2(text) = number of times text contains a PERSON
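
The integer versions replace the yes/no test with a count. As with the boolean sketch, `KNOWN_PERSONS` is a hypothetical stand-in for a named-entity recognizer, and the example sentence is made up.

```python
import re
from collections import Counter

# Hypothetical stand-in for a named-entity recognizer.
KNOWN_PERSONS = {"Kerry", "Schumacher"}


def words(text):
    """Split the text into alphabetic words, keeping capitalization."""
    return re.findall(r"[A-Za-z]+", text)


def f1(text):
    """Number of times the word "Kerry" occurs in the text."""
    return Counter(words(text))["Kerry"]


def f2(text):
    """Number of PERSON mentions (words found in KNOWN_PERSONS)."""
    return sum(1 for w in words(text) if w in KNOWN_PERSONS)


example = "Kerry met Schumacher. Kerry spoke afterwards."
print(f1(example), f2(example))
```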

37
Feature selection
  • How do we choose the right features?
  • A future lecture

38
Classification
  • Define classes
  • Label text
  • Extract Features
  • Choose a classifier
  • my_classifier.classify(token)
  • The Naive Bayes Classifier
  • NN (perceptron)
  • SVM
  • ...
  • Train it (and test it)
  • Use it to classify new examples
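
The steps above can be sketched end to end with a tiny Naive Bayes classifier. This is a from-scratch illustration, not the toolkit class used in the course: the class name, the word-count features, the add-one smoothing, and the four training texts are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Minimal multinomial Naive Bayes over lowercase words."""

    def train(self, labeled_texts):
        """Count documents per class and words per class."""
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in labeled_texts:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def classify(self, text):
        """Return the class with the highest log-probability score."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word in self.vocabulary:  # skip unseen words
                    count = self.word_counts[label][word]
                    # Add-one smoothing avoids zero probabilities.
                    score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Steps 1-2: classes and labeled text (hypothetical examples).
training_data = [
    ("grand prix race champion", "sport"),
    ("team wins the race", "sport"),
    ("senate votes on tax bill", "politics"),
    ("president signs the bill", "politics"),
]
# Steps 3-5: features are word counts; choose and train the classifier.
my_classifier = NaiveBayesClassifier().train(training_data)
# Step 6: classify a new example.
print(my_classifier.classify("the champion wins the grand prix"))
```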

39
Training
  • Usually the classifier is defined by a set of
    parameters
  • Training is the procedure for finding a good
    set of parameters
  • Goodness is determined by an optimization
    criterion such as misclassification rate
  • Some classifiers are guaranteed to find the
    optimal set of parameters

40
Testing, evaluation of the classifier
  • After choosing the parameters of the classifier
    (i.e. after training it) we need to test how well
    it is doing on a test set (not included in the
    training set)
  • Calculate the misclassification rate on the test
    set
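
A small sketch of holding out a test set and computing the misclassification rate on it. The helper names, the 25% split, and the fixed random seed are assumptions; `classify` can be any function that maps a text to a label.

```python
import random


def train_test_split(labeled_examples, test_fraction=0.25, seed=0):
    """Shuffle the examples and hold out a fraction as the test set."""
    examples = list(labeled_examples)
    random.Random(seed).shuffle(examples)
    n_test = max(1, int(len(examples) * test_fraction))
    return examples[n_test:], examples[:n_test]


def misclassification_rate(classify, test_set):
    """Fraction of test examples whose predicted label is wrong."""
    errors = sum(1 for text, label in test_set if classify(text) != label)
    return errors / len(test_set)


# Hypothetical labeled data: eight distinct (text, label) pairs.
data = [(f"document {i}", "sport" if i % 2 else "politics") for i in range(8)]
train_set, test_set = train_test_split(data)

# A deliberately naive classifier that always answers "sport".
always_sport = lambda text: "sport"
print(len(train_set), len(test_set))
print(misclassification_rate(always_sport, test_set))
```

Because the test examples never appear in the training set, the rate estimates how the classifier behaves on new text rather than how well it memorized its training data.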

41
Evaluating classifiers
  • Contingency table for the evaluation of a binary
    classifier
  • Accuracy = (a+d)/(a+b+c+d)
  • Precision: P_GREEN = a/(a+b), P_RED = d/(c+d)
  • Recall: R_GREEN = a/(a+c), R_RED = d/(b+d)
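
The contingency table itself did not survive in this transcript. The sketch below assumes the layout implied by the formulas: a = predicted GREEN and actually GREEN, b = predicted GREEN but actually RED, c = predicted RED but actually GREEN, d = predicted RED and actually RED. The function name and the example counts are made up.

```python
def evaluate(a, b, c, d):
    """Accuracy, precision, and recall from the four contingency counts.

    Assumed layout (inferred from the formulas, not shown in the source):
      a: predicted GREEN, actually GREEN
      b: predicted GREEN, actually RED
      c: predicted RED,   actually GREEN
      d: predicted RED,   actually RED
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision_green": a / (a + b),
        "precision_red": d / (c + d),
        "recall_green": a / (a + c),
        "recall_red": d / (b + d),
    }


# Hypothetical counts for 100 test examples.
scores = evaluate(a=40, b=10, c=20, d=30)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```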

42
Training size
  • The more the better! (usually)
  • Results for text classification

43
Training size
44
Training size
45
Training Size
  • Author identification

46
Upcoming
  • Classifiers
  • Feature selection algorithms