Transcript and Presenter's Notes

Title: SIMS 290-2: Applied Natural Language Processing


1
SIMS 290-2: Applied Natural Language Processing
Marti Hearst, November 8, 2004
2
Today
  • Using Very Very Large Corpora
  • Banko & Brill: Confusion Pair Disambiguation
  • Lapata & Keller: Several NLP Subtasks
  • Cucerzan & Brill: Spelling Correction
  • Clustering Introduction

3
Algorithm or Data?
  • Given a text analysis problem and a limited
    amount of time to solve it, should we
  • Try to develop smarter algorithms, or
  • Find more training data?
  • Banko & Brill '01
  • Identified a problem for which a billion words of
    training data were available
  • Compared using more data to
  • using different classification algorithms
  • using voting algorithms
  • using active learning
  • using semi-supervised learning

4
Banko & Brill '01
  • Example problem: confusion set disambiguation
  • principle, principal
  • then, than
  • to, two, too
  • whether, weather
  • Current approaches include
  • Latent semantic analysis
  • Differential grammars
  • Decision lists
  • A variety of Bayesian classifiers

5
Banko & Brill '01
  • Collected a 1-billion-word English training
    corpus
  • 3 orders of magnitude larger than the largest
    corpus used previously for this problem
  • Consisted of
  • News articles
  • Scientific abstracts
  • Government transcripts
  • Literature
  • Etc.
  • Test set
  • 1 million words of WSJ text (not used in
    training)

6
Training on a Huge Corpus
  • Each learner trained at several cutoff points
  • First 1 million, then 5M, etc.
  • Items drawn by probabilistically sampling
    sentences from the different sources, weighted by
    source size.
  • Learners
  • Naïve Bayes, perceptron, Winnow, memory-based
  • Results
  • Accuracy continues to increase log-linearly even
    out to 1 billion words of training data
  • BUT the size of the trained model also increases
    log-linearly as a function of training set size.

7
Banko & Brill '01 (figure)
8
Comparing to Voting Algorithms
  • What about voting algorithms?
  • Ran Naïve Bayes, perceptron, and Winnow.
  • Voting was done by combining the normalized
    score each learner assigned (see the sketch below).
  • Results
  • Little gain beyond 1M words of training data
  • Starts to perform worse than single classifier
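
A minimal sketch of this kind of score-based voting; the three score
dictionaries over the confusion pair below are made-up illustrations,
not numbers from the paper:

# Score-based voting: each learner returns a dict of scores over the
# confusion set; scores are normalized per learner and then summed.
def vote(score_dicts):
    combined = {}
    for scores in score_dicts:
        total = sum(scores.values()) or 1.0
        for label, s in scores.items():
            combined[label] = combined.get(label, 0.0) + s / total
    return max(combined, key=combined.get)

# Made-up example: three learners scoring the confusion pair then/than.
naive_bayes = {"then": 0.7, "than": 0.3}
perceptron  = {"then": 0.4, "than": 0.6}
winnow      = {"then": 0.8, "than": 0.2}
print(vote([naive_bayes, perceptron, winnow]))   # -> then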

9
Banko & Brill '01 (figure)
10
Comparing to Active Learning
  • Problem
  • In most cases, huge volumes of labeled text are
    not available.
  • How can we use large unlabeled corpora?
  • Active Learning
  • Intelligently select examples for hand-labeling
  • Choose the most uncertain examples
  • Has mainly been looked at for small training sets
  • Determining uncertainty
  • Look at scores assigned across examples for a
    given learner; use relative values to find the
    tough cases
  • Committee-based sampling: choose the ones with
    the most disagreement among several algorithms

11
Comparing to Active Learning
  • Bagging (to create a committee of classifiers)
  • A variant of the original training set is
    constructed by randomly sampling sentences with
    replacement
  • Produces N new training sets of size K
  • Train N models and run them all on the same test
    set
  • Compare the classification for each model on each
    test set
  • The higher the disagreement, the tougher the
    training example
  • Select the M toughest examples for hand-labeling
    (see the sketch below)
  • They used a variation in which M/2 of the chosen
    sentences have high disagreement and M/2 are
    random
  • Otherwise the learners become too biased toward
    the hard cases and don't work as well in general.
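
A minimal sketch of the committee-based selection step, assuming the
bagged models are callables that return a class label and `unlabeled`
is a list of candidate examples (both hypothetical stand-ins, not the
paper's implementation):

# Committee-based sample selection with bagging: rank unlabeled examples
# by disagreement among the N bagged models, take the M/2 most-disputed
# examples, and fill the rest of the batch with random ones.
import random
from collections import Counter

def select_for_labeling(models, unlabeled, M, seed=0):
    rng = random.Random(seed)

    def disagreement(x):
        votes = Counter(m(x) for m in models)      # each model is a callable classifier
        return 1.0 - votes.most_common(1)[0][1] / len(models)

    ranked = sorted(unlabeled, key=disagreement, reverse=True)
    hard = ranked[: M // 2]                        # high-disagreement half
    rest = ranked[M // 2:]
    rng.shuffle(rest)
    return hard + rest[: M - len(hard)]            # random half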

12
Banko & Brill '01 (figure)
13
Active Learning
  • Results
  • Sample selection outperforms training on
    sequentially drawn data
  • However, the larger the pool of candidate
    instances, the better the results
  • This is a way to improve results with more
    unlabeled training data, with a fixed cost for
    hand-annotation
  • Thus we can benefit from large corpora even if
    they aren't all labeled

14
Weakly Supervised Learning
  • It would be ideal to use the large unlabeled
    corpus without additional hand-labeling.
  • Unsupervised learning
  • Many approaches have been proposed
  • Usually start with a seed set of labels and
    then use unlabeled data to continue training
  • Use bagging again
  • This time, choose the most certain instances to
    add to the training set.
  • If all the classifier models agree on a given
    example, then label the example and put it in the
    training set (see the sketch below).
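
A minimal sketch of the agreement-based labeling step under the same
assumptions (callable bagged models, hypothetical data structures):

# Weakly supervised step: an unlabeled example is added to the training
# set only if every bagged model predicts the same label for it.
def grow_training_set(models, unlabeled, labeled):
    still_unlabeled = []
    for x in unlabeled:
        predictions = {m(x) for m in models}   # each model is a callable classifier
        if len(predictions) == 1:              # unanimous agreement
            labeled.append((x, predictions.pop()))
        else:
            still_unlabeled.append(x)
    return labeled, still_unlabeled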

15
Banko & Brill '01 (figure)
16
Banko & Brill '01 (figure)
17
Weakly Supervised Learning
  • Results
  • Accuracy improves over the initial seed set
  • However, drops off after a certain point
  • Using the large corpus is better.

18
Lapata & Keller '04
  • On 6 different NLP problems, compared
  • Web-based n-grams (unsupervised)
  • BNC (smaller corpus)-based n-grams
  • The state-of-the-art supervised algorithm.
  • In all cases, the Web-based n-grams were
  • The same as or better than BNC-based n-grams
  • As good as or nearly as good as the
    state-of-the-art algorithm
  • Thus, they propose using the Web as a baseline
    against which most algorithms should be compared.

19
Computing Web n-grams
  • Find the number of hits for each term via AltaVista
  • This gives doc frequency, not term frequency
  • Smooth 0 counts to 0.5
  • All terms lowercased
  • Three different query types (sketched below)
  • Literal queries
  • Double quotes around the term
  • NEAR queries
  • a NEAR b: within a 10-word window
  • Inflected queries
  • Expand each term to all morphological variations
  • history change
  • histories change
  • history changes
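
A minimal sketch of how these query types and the 0.5 smoothing could
be put together; search_hit_count is a hypothetical stand-in for the
AltaVista hit-count lookup used in the paper, and the morphological
expansion is illustrative:

# Building the three query types and smoothing zero hit counts to 0.5.
def literal_query(words):
    return '"' + " ".join(w.lower() for w in words) + '"'   # double-quoted phrase

def near_query(a, b):
    return f"{a.lower()} NEAR {b.lower()}"                  # within a 10-word window

def smoothed_count(query, search_hit_count):
    hits = search_hit_count(query)       # document frequency, not term frequency
    return hits if hits > 0 else 0.5     # smooth 0 counts to 0.5

def inflected_count(variants, search_hit_count):
    # variants, e.g.: ["history change", "histories change", "history changes"]
    return sum(smoothed_count(literal_query(v.split()), search_hit_count)
               for v in variants)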

20
Target Word Selection for Machine Translation
  • E.g., decide between 5 alternative translations
  • Die Geschichte ändert sich, nicht jedoch die
    Geographie.
  • {History, store, tale, saga, strip} changes, but
    geography does not.
  • Looked at a test set of verb-object pairs from
    Prescher et al. '00
  • Assume that the target translation of the verb is
    known
  • Select which noun is semantically most compatible
    from the candidate translations.
  • Web statistics
  • Retrieve web counts for all possible (inflected)
    verb-object translations
  • Choose the most frequent one

21
Results on MT term selection
22
Confusion Set Spelling Correction
  • Confusion sets as before: principle, principal
  • Method
  • Use collocation counts as features
  • Words next to the target word
  • For each word in the confusion set
  • Use the web to estimate how frequently it
    co-occurs with a word or a pair of words
    immediately to its left or right
  • Disambiguate by selecting the word in the
    confusion set with the highest co-occurrence
    frequency
  • Ties go to the most frequently occurring term
  • Results
  • Best result obtained with one word to the left,
    one to the right of the target word: f(w1, t,
    w2). See the sketch below.
  • Not as good as the best supervised method, but far
    simpler, and far better than the baseline.
  • Many fewer features; doesn't use POS information
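
A minimal sketch of the collocation-based choice, with a hypothetical
count function standing in for the web hit-count lookup:

# Pick the confusion-set member with the highest count for the pattern
# f(w1, t, w2); ties go to the candidate's own term frequency.
def choose(confusion_set, left, right, count):
    def key(candidate):
        return (count(f'"{left} {candidate} {right}"'),   # f(w1, t, w2)
                count(f'"{candidate}"'))                   # tie-break: frequency of the term
    return max(confusion_set, key=key)

# e.g. choose({"principle", "principal"}, "the", "of", count)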

23
(No Transcript)
24
Ordering of Pre-nominal Adjectives
  • Which adjective comes before which others?
  • The small, wiggly, cute black puppy.
  • The small, black, wiggly cute puppy.
  • The wiggly, small cute puppy.
  • Approach
  • Tested on the adjective pair set from Malouf '00
  • Choose the adjective order with the highest
    frequency, using literal queries
  • Results
  • No significant difference between this and the
    state-of-the-art approach
  • (which uses a back-off bigram model, positional
    probabilities, and a memory-based learner with
    many features)

25
Ordering of Pre-nominal Adjectives
26
Compound Noun Bracketing
  • Which way do the nouns group?
  • [acute migraine] treatment
  • acute [migraine treatment]
  • Current best model (Lauer '95) uses a thesaurus
    and taxonomy in an unsupervised manner
  • Method
  • Compare the probability of left-branching to the
    probability of right-branching (as in Lauer '95)
  • But estimate the probabilities from web counts
    (see the sketch below)
  • Inflected queries and NEAR operator
  • Results
  • Far better than baseline; no significant
    difference from the best model
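
A minimal sketch of one simple bracketing heuristic in this spirit; it
compares raw counts of the two candidate inner compounds rather than
the exact probability model of Lauer '95 / Lapata & Keller, and count
is again a hypothetical web hit-count lookup:

# Compare evidence for the two bracketings of a three-noun compound by
# querying counts of the candidate inner compounds.
def bracket(n1, n2, n3, count):
    left  = count(f'"{n1} {n2}"')    # supports [n1 n2] n3
    right = count(f'"{n2} {n3}"')    # supports n1 [n2 n3]
    return f"[{n1} {n2}] {n3}" if left > right else f"{n1} [{n2} {n3}]"

# e.g. bracket("acute", "migraine", "treatment", count)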

27
(No Transcript)
28
Noun Compound Interpretation
  • Determine semantic relation between nouns
  • onion tears -> CAUSED-BY
  • pet spray -> FOR
  • Method
  • Look for prepositions that tend to indicate the
    relation
  • Used inflected queries inserted determiners
    before nouns
  • Story/stories about the/a/0 war/wars
  • Results
  • Best scores obtained for f(n1, p, n2)
  • Significantly outperforms best existing algorithm

29
Noun Compound Interpretation
30
Lapata & Keller '04 Summary
  • Simple, unsupervised models using web counts can
    be devised for a variety of NLP tasks
  • For 4/6 tasks, web counts better than BNC
  • In all but 1 case, the simple approach does not
    significantly outperform the best supervised
    model.
  • Most of these use linguistic or taxonomic
    information in addition to being supervised.
  • But not significantly different for 3/6 problems.
  • But much better than the baseline for all of
    these.
  • So this is a baseline that must be beaten in
    order to declare a new algorithm useful.

31
Cucerzan and Brill '04
  • Spelling correction for the web using query logs
  • Harder task than traditional spelling correction
  • Have to handle
  • Proper names, new terms, company names, etc.
  • blog, shrek, nsync
  • Multi-word phrases
  • Frequent and severe spelling errors (10-15%)
  • Very short contexts
  • Existing approaches
  • Rely on a dictionary for comparison
  • Assume a single point change
  • Insertion, deletion, transposition, substitution
  • Don't handle word substitution

32
Spelling Correction Algorithm
  • Main idea
  • Iteratively transform the query into other
    strings that correspond to more likely queries.
  • Use statistics from query logs to determine
    likelihood.
  • Despite the fact that many of these queries are
    misspelled
  • Assume that the less wrong a misspelling is, the
    more frequent it is, and correct > incorrect
  • Example
  • ditroitigers ->
  • detroittigers ->
  • detroit tigers

33
Cucerzan and Brill '04 (figure)
34
Spelling Correction Algorithm
  • Algorithm
  • Compute the set of all possible alternatives for
    each word in the query
  • Look at word unigrams and bigrams from the logs
  • This handles concatenation and splitting of words
  • Find the best possible alternative string to the
    input
  • Do this efficiently with a modified Viterbi
    algorithm
  • Constraints
  • No 2 adjacent in-vocabulary words can change
    simultaneously
  • Short queries have further (unstated)
    restrictions
  • In-vocabulary words can't be changed in the first
    round of iteration (a simplified sketch follows)
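
A minimal sketch of the iterative idea only; it ignores the adjacency
constraints and the modified Viterbi search, and log_counts and
candidates are hypothetical stand-ins for query-log statistics and the
edit-distance candidate generator:

# Iteratively rewrite each query word as its most likely alternative
# according to query-log counts, until nothing changes.
def correct_query(query, log_counts, candidates, max_iters=3):
    words = query.split()
    for _ in range(max_iters):
        changed = False
        new_words = []
        for w in words:
            options = set(candidates(w)) | {w}
            best = max(options, key=lambda c: log_counts.get(c, 0))
            if best != w:
                changed = True
            new_words.append(best)
        words = new_words
        if not changed:          # fixed point reached
            break
    return " ".join(words)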

35
Spelling Correction Algorithm
  • Comparing string similarity
  • Damerau-Levenshtein edit distance
  • The minimum number of point changes required to
    transform a string into another
  • Trading off distance function leniency
  • A rule that allows only one letter change can't
    fix
  • dondal duck -> donald duck
  • An overly permissive rule makes too many errors
  • log wood -> dog food
  • Actual measure
  • A modified context-dependent weighted
    Damerau-Levenshtein edit function (the plain,
    unweighted version is sketched below)
  • Point changes: insertion, deletion, substitution,
    immediate transpositions, long-distance movement
    of letters
  • Weights interactively refined using statistics
    from query logs
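
For reference, a plain unweighted Damerau-Levenshtein distance
(insertion, deletion, substitution, adjacent transposition); the
paper's measure adds context-dependent weights and long-distance letter
movement on top of this baseline:

# Unweighted Damerau-Levenshtein distance over characters.
def damerau_levenshtein(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # adjacent transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("dondal duck", "donald duck"))   # 2: more than one point change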

36
Spelling Correction Evaluation
  • Emphasizing recall
  • First evaluation
  • 1044 randomly chosen queries
  • Annotated by two people (91.3% agreement)
  • 180 misspelled; annotators provided corrections
  • 81.1% system agreement with annotators
  • 131 false positives
  • 2002 kawasaki ninja zx6e -> 2002 kawasaki ninja
    zx6r
  • 156 suggestions for the misspelled queries
  • 2 iterations were sufficient for most corrections
  • Problem: annotators were guessing user intent

37
Spelling Correction Evaluation
  • Second evaluation
  • Try to find a misspelling followed by its
    correction
  • Sample successive pairs of queries from the log
  • Must be sent by same user
  • Differ from one another by a small edit distance
  • Present the pair to human annotators for
    verification and placement into the gold standard
  • The paper doesn't say how many pairs in total

38
Spelling Correction Evaluation
  • Results on this set
  • 73.1% accuracy
  • Disagreed with gold standard 99 times (80 were
    suggestions)
  • 40 of these were bad
  • 15 were functionally equivalent (audio file vs.
    audio files)
  • 17 were different valid suggestions (phone
    listings vs. telephone listings)
  • 8 found errors in the gold standard (brandy
    sniffers)
  • 85.5% correct (speller correct or reasonable)
  • Sent an unspecified subset of the errors to
    Google's spellchecker
  • Its agreement with the gold standard was slightly
    lower

39
Spell Checking Summary
  • Can use the collective knowledge stored in query
    logs
  • Works pretty well despite the noisiness of the
    data
  • Exploits the errors made by people
  • Might be further improved to incorporate text
    from other domains

40
Very Large Corpora Summary
  • When building unsupervised NLP algorithms
  • Simple algorithms applied to large datasets can
    perform as well as state-of-the-art algorithms (in
    some cases)
  • If your algorithm can't do better than a simple
    unsupervised web-based query, don't publish it.
  • Active learning (choose the most uncertain items
    to hand-label) can reduce the amount of labeling
    needed
  • However, algorithms still aren't 100% correct, so
    it isn't proven that more data is enough; good
    algorithms are needed to go the extra mile.
  • A smaller labeled subset can provide close to the
    same results as a large one, if annotated
    intelligently

41
What is clustering?
  • Clustering
  • the act of grouping similar objects into sets
  • Classification vs. Clustering
  • Classification assigns objects to predefined
    groups
  • Clustering infers groups based on inter-object
    similarity
  • Best used for exploration, rather than
    presentation

42
Classification vs. Clustering
Classification (supervised learning): learns a
method for predicting the instance class from
pre-labeled (classified) instances
43
Clustering
Unsupervised learning: finds natural groupings
of instances given unlabeled data
44
Clustering Methods
  • Many different methods and algorithms
  • For numeric and/or symbolic data
  • Deterministic vs. probabilistic
  • Exclusive vs. overlapping
  • Hierarchical vs. flat
  • Top-down vs. bottom-up

45
Clusters: exclusive vs. overlapping
Simple 2-D representation: non-overlapping
Venn diagram: overlapping


46
Clustering Evaluation
  • Manual inspection
  • Benchmarking on existing labels
  • Cluster quality measures
  • distance measures
  • high similarity within a cluster, low across
    clusters

47
The distance function
  • Simplest case: one numeric attribute A
  • Distance(X,Y) = |A(X) - A(Y)|
  • Several numeric attributes
  • Distance(X,Y) = Euclidean distance between X and Y
  • Nominal attributes: distance is set to 1 if
    values are different, 0 if they are equal
  • Are all attributes equally important?
  • Weighting the attributes might be necessary
    (see the sketch below)
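
A minimal sketch of one way to combine these pieces (Euclidean over
numeric attributes, 0/1 mismatch for nominal ones, optional weights);
the attribute names and the weighting scheme are illustrative, not from
the slide:

# Euclidean distance over numeric attributes plus a 0/1 penalty per
# mismatched nominal attribute, with optional per-attribute weights.
import math

def distance(x, y, numeric_keys, nominal_keys, weights=None):
    weights = weights or {}
    num = sum(weights.get(k, 1.0) * (x[k] - y[k]) ** 2 for k in numeric_keys)
    nom = sum(weights.get(k, 1.0) * (0.0 if x[k] == y[k] else 1.0)
              for k in nominal_keys)
    return math.sqrt(num) + nom

a = {"temp": 85, "humidity": 90, "outlook": "sunny"}
b = {"temp": 70, "humidity": 90, "outlook": "rainy"}
print(distance(a, b, ["temp", "humidity"], ["outlook"]))   # 15.0 + 1 = 16.0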

48
Simple Clustering: K-means
  • Works with numeric data only
  • 1. Pick a number (K) of cluster centers (at
    random)
  • 2. Assign every item to its nearest cluster center
    (e.g. using Euclidean distance)
  • 3. Move each cluster center to the mean of its
    assigned items
  • 4. Repeat steps 2-3 until convergence (change in
    cluster assignments less than a threshold); see
    the sketch below
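
A minimal sketch of these four steps on 2-D points, with random initial
centers and Euclidean distance; the toy data is illustrative:

# K-means on 2-D points: random initial centers, Euclidean assignment,
# recompute means, repeat until the centers stop moving.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                     # 1. pick K centers at random
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # 2. assign each item to nearest center
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                  + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        new_centers = [                                 # 3. move centers to cluster means
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)]
        if new_centers == centers:                      # 4. stop when centers no longer move
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.5, 2), (1, 0.6), (5, 8), (8, 8), (9, 11)]
print(kmeans(pts, 2)[0])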

49
K-means example, step 1
Pick 3 initial cluster centers (randomly)
50
K-means example, step 2
Assign each point to the closest cluster center
51
K-means example, step 3
Move each cluster center to the mean of each
cluster
52
K-means example, step 4
Reassign points closest to a different new
cluster center. Q: Which points are reassigned?
53
K-means example, step 4
A: three points
54
K-means example, step 4b
re-compute cluster means
55
K-means example, step 5
move cluster centers to cluster means
56
Discussion
  • Result can vary significantly depending on
    initial choice of seeds
  • Can get trapped in local minimum
  • Example
  • To increase the chance of finding the global
    optimum, restart with different random seeds

57
K-means clustering summary
  • Advantages
  • Simple, understandable
  • items automatically assigned to clusters
  • Disadvantages
  • Must pick the number of clusters beforehand
  • All items forced into a cluster
  • Too sensitive to outliers

58
K-means variations
  • K-medoids: instead of the mean, use the median of
    each cluster
  • Mean of 1, 3, 5, 7, 9 is 5
  • Mean of 1, 3, 5, 7, 1009 is 205
  • Median of 1, 3, 5, 7, 1009 is 5
  • Median advantage: not affected by extreme values
  • For large databases, use sampling

59
Hierarchical clustering
  • Bottom up
  • Start with single-instance clusters
  • At each step, join the two closest clusters
  • Design decision: distance between clusters
  • E.g. two closest instances in clusters vs.
    distance between means
  • Top down
  • Start with one universal cluster
  • Find two clusters
  • Proceed recursively on each subset
  • Can be very fast
  • Both methods produce a dendrogram (a bottom-up
    sketch follows)
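
A minimal bottom-up sketch using SciPy's agglomerative (single-linkage)
clustering; the toy points and the choice of single linkage are
illustrative assumptions, not from the slides:

# Bottom-up (agglomerative) clustering with SciPy: single linkage joins
# the two clusters whose closest instances are nearest to each other.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])

Z = linkage(points, method="single", metric="euclidean")   # builds the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")            # cut it into 3 flat clusters
print(labels)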

60
Hierarchical Clustering Example
agglomerative
61
Incremental clustering
  • Heuristic approach (COBWEB)
  • Form a hierarchy of clusters incrementally
  • Start
  • tree consists of empty root node
  • Then
  • add instances one by one
  • update tree appropriately at each stage
  • to update, find the right leaf for an instance
  • May involve restructuring the tree
  • Base update decisions on category utility

62
Clustering weather/tennis data

ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
E Rainy Cool Normal False
F Rainy Cool Normal True
G Overcast Cool Normal True
H Sunny Mild High False
I Sunny Cool Normal False
J Rainy Mild Normal False
K Sunny Mild Normal True
L Overcast Mild High True
M Overcast Hot Normal False
N Rainy Mild High True
63
Clustering weather/tennis data
  • (the same weather/tennis data table as on the
    previous slide)

Merge best host and runner-up
Consider splitting the best host if merging
doesn't help
64
Final hierarchy
ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
Oops! A and B are actually very similar
65
Clustering Summary
  • Use clustering to find the main groups in the
    data
  • An inexact method; many parameters must be set
  • Results are not readily understandable enough to
    show to everyday users of a search system
  • Evaluation is difficult
  • Typical method is to see if the items in
    different clusters differ strongly from one
    another
  • This doesn't tell much about how understandable
    the clusters are
  • The important thing to do is inspect the clusters
    to get ideas about the characteristics of the
    collection.