Title: SIMS 290-2: Applied Natural Language Processing
1 SIMS 290-2: Applied Natural Language Processing
Marti Hearst, November 8, 2004
2 Today
- Using Very Very Large Corpora
- Banko & Brill: Confusion Pair Disambiguation
- Lapata & Keller: Several NLP Subtasks
- Cucerzan & Brill: Spelling Correction
- Clustering Introduction
3 Algorithm or Data?
- Given a text analysis problem and a limited amount of time to solve it, should we
  - Try to develop smarter algorithms, or
  - Find more training data?
- Banko & Brill '01
  - Studied one problem with a billion words of training data
  - Compared using more data against
    - using different classification algorithms
    - using voting algorithms
    - using active learning
    - using semi-supervised learning
4 Banko & Brill '01
- Example problem: confusion set disambiguation
  - {principle, principal}
  - {then, than}
  - {to, two, too}
  - {whether, weather}
- Current approaches include
  - Latent semantic analysis
  - Differential grammars
  - Decision lists
  - A variety of Bayesian classifiers
5 Banko & Brill '01
- Collected a 1-billion-word English training corpus
  - 3 orders of magnitude greater than the largest corpus previously used for this problem
  - Consisted of
    - News articles
    - Scientific abstracts
    - Government transcripts
    - Literature
    - Etc.
- Test set
  - 1 million words of WSJ text (none used in training)
6 Training on a Huge Corpus
- Each learner trained at several cutoff points
  - First 1 million words, then 5M, etc.
  - Items drawn by probabilistically sampling sentences from the different sources, weighted by source size.
- Learners
  - Naïve Bayes, perceptron, winnow, memory-based
- Results
  - Accuracy continues to increase log-linearly even out to 1 billion words of training data
  - BUT the size of the trained model also increases log-linearly as a function of training set size.
7 Banko & Brill '01 (figure)
8 Comparing to Voting Algorithms
- What about voting algorithms?
  - Ran naïve Bayes, perceptron, and winnow.
  - Voting was done by combining the normalized score each learner assigned.
- Results
  - Little gain beyond 1M words of training data
  - Starts to perform worse than a single classifier
9 Banko & Brill '01 (figure)
10 Comparing to Active Learning
- Problem
  - In most cases, huge volumes of labeled text are not available.
  - How can we use large unlabeled corpora?
- Active Learning
  - Intelligently select examples for hand-labeling
  - Choose the most uncertain examples
  - Has mainly been studied for small training sets
- Determining uncertainty
  - Look at the scores a given learner assigns across examples; use the relative values to find the tough cases
  - Committee-based sampling: choose the examples with the most disagreement among several algorithms
11 Comparing to Active Learning
- Bagging (to create a committee of classifiers)
  - A variant of the original training set is constructed by randomly sampling sentences with replacement
  - Produces N new training sets of size K
  - Train N models and run them all on the same test set
  - Compare the classifications the models assign to each test example
  - The higher the disagreement, the tougher the training example
  - Select the M toughest examples for hand-labeling
- They used a variation in which M/2 of the chosen sentences have high disagreement and M/2 are random (see the sketch below)
  - Otherwise the learners become too biased toward the hard cases and don't work as well in general.
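A minimal sketch of this bagging-and-disagreement selection, not the paper's actual code: `learner` is an assumed stand-in for any training routine that returns a trained classifier as a callable.

```python
import random
from collections import Counter

def bagged_committee(train_set, n_models, learner):
    """Train N models, each on a bootstrap resample (with replacement)
    of the original training set."""
    models = []
    for _ in range(n_models):
        resample = [random.choice(train_set) for _ in range(len(train_set))]
        models.append(learner(resample))   # learner() is an assumed stand-in
    return models

def disagreement(models, example):
    """Fraction of committee members that disagree with the majority vote."""
    votes = Counter(model(example) for model in models)
    return 1.0 - votes.most_common(1)[0][1] / len(models)

def select_for_labeling(models, pool, m):
    """The paper's variant: M/2 highest-disagreement examples plus M/2
    random ones (pool must contain more than m examples)."""
    ranked = sorted(pool, key=lambda ex: disagreement(models, ex), reverse=True)
    hard = ranked[: m // 2]
    return hard + random.sample(ranked[m // 2:], m - len(hard))
```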
12 Banko & Brill '01 (figure)
13 Active Learning
- Results
  - Sample selection outperforms training on the data sequentially
  - Moreover, the larger the pool of candidate instances, the better the results
  - This is a way to improve results with more unlabeled training data, at a fixed cost for hand-annotation
  - Thus we can benefit from large corpora even if they aren't fully labeled
14 Weakly Supervised Learning
- It would be ideal to use the large unlabeled corpus without additional hand-labeling.
- Unsupervised learning
  - Many approaches have been proposed
  - Usually start with a seed set of labels and then use unlabeled data to continue training
- Use bagging again (sketched below)
  - This time, choose the most certain instances to add to the training set.
  - If all the classifier models agree on a given example, then label the example and put it in the training set.
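A correspondingly hedged sketch of the committee-agreement labeling step, with `models` being callables as in the previous sketch:

```python
def grow_training_set(models, labeled, unlabeled):
    """Self-training step: label an unlabeled example only when every
    committee member assigns it the same class (the most-certain cases),
    then add it to the training set."""
    newly_labeled = []
    for example in unlabeled:
        predictions = {model(example) for model in models}
        if len(predictions) == 1:            # unanimous agreement
            newly_labeled.append((example, predictions.pop()))
    return labeled + newly_labeled
```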
15 Banko & Brill '01 (figure)
16 Banko & Brill '01 (figure)
17 Weakly Supervised Learning
- Results
  - Accuracy improves over the initial seed set
  - However, it drops off after a certain point
  - Using the fully labeled large corpus is still better.
18 Lapata & Keller '04
- On 6 different NLP problems, compared
  - Web-based n-grams (unsupervised)
  - BNC-based n-grams (a smaller corpus)
  - The state-of-the-art supervised algorithm
- In all cases, the Web-based n-grams were
  - The same as or better than BNC-based n-grams
  - As good as or nearly as good as the state-of-the-art algorithm
- Thus, they propose using the Web as a baseline against which new algorithms should be compared.
19 Computing Web n-grams
- Find the number of hits for each term via Altavista
  - This gives document frequency, not term frequency
  - Smooth 0 counts to 0.5
  - All terms lowercased
- Three different query types
  - Literal queries
    - Double quotes around the term
  - NEAR queries
    - A NEAR B: terms within a 10-word window
  - Inflected queries
    - Expand each term to all morphological variations
      - history change
      - histories change
      - history changes
20 Target Word Selection for Machine Translation
- E.g., decide among 5 alternate translations
  - Die Geschichte ändert sich, nicht jedoch die Geographie.
  - {History, story, tale, saga, strip} changes but geography does not.
- Looked at a test set of verb-object pairs from Prescher et al. '00
  - Assume that the target translation of the verb is known
  - Select the noun that is semantically most compatible from the candidate translations.
- Web statistics (see the sketch below)
  - Retrieve web counts for all possible (inflected) verb-object translations
  - Choose the most frequent one
21 Results on MT term selection
22 Confusion Set Spelling Correction
- Confusion sets as before: {principle, principal}
- Method (sketched below)
  - Use collocation counts as features
    - Words next to the target word
  - For each word in the confusion set
    - Use the web to estimate how frequently it co-occurs with a word or a pair of words immediately to its left or right
  - Disambiguate by selecting the word in the confusion set with the highest co-occurrence frequency
    - Ties go to the most frequently occurring term
- Results
  - Best result obtained with one word to the left and one to the right of the target word: f(w1, t, w2)
  - Not as good as the best supervised method, but far simpler, and far better than the baseline
  - Many fewer features; doesn't use POS information
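A sketch of the f(w1, t, w2) decision rule; `count` and `unigram_count` stand in for web-count lookups and are assumptions.

```python
def choose_spelling(left, confusion_set, right, count, unigram_count):
    """Pick the confusion-set member t with the highest web count for the
    trigram (left, t, right); break ties by overall term frequency."""
    scores = {t: count("%s %s %s" % (left, t, right)) for t in confusion_set}
    top = max(scores.values())
    tied = [t for t, s in scores.items() if s == top]
    return max(tied, key=unigram_count)   # tie goes to the more frequent term

# e.g. choose_spelling("the", ["principle", "principal"], "of",
#                      count, unigram_count)
```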
23 (no transcript: figure-only slide)
24 Ordering of Pre-nominal Adjectives
- Which adjective comes before which others?
  - The small, wiggly, cute black puppy.
  - The small, black, wiggly cute puppy.
  - The wiggly, small cute puppy.
- Approach
  - Tested on Malouf '00's adjective pair set
  - Choose the adjective order with the highest frequency, using literal queries
- Results
  - No significant difference between this and the state-of-the-art approach
    - Which uses a back-off bigram model, positional probabilities, and a memory-based learner with many features
25 Ordering of Pre-nominal Adjectives
26 Compound Noun Bracketing
- Which way do the nouns group?
  - [acute migraine] treatment
  - acute [migraine treatment]
- The current best model (Lauer '95) uses a thesaurus and taxonomy in an unsupervised manner
- Method (sketched below)
  - Compare the probability of left-branching to the probability of right-branching (as in Lauer '95)
  - But estimate the probabilities from web counts
    - Inflected queries and the NEAR operator
- Results
  - Far better than baseline; no significant difference from the best model
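A sketch of the count comparison. Whether the pairs compared are (n1, n2) vs. (n2, n3) (adjacency) or (n1, n2) vs. (n1, n3) (dependency) is a Lauer-model detail not shown on the slide, so the adjacency form here is an assumption; `count` is again an assumed web-count lookup.

```python
def bracket(n1, n2, n3, count):
    """Left-branching [[n1 n2] n3] if the (n1, n2) pair is more frequent
    than (n2, n3); otherwise right-branching [n1 [n2 n3]]."""
    if count("%s %s" % (n1, n2)) >= count("%s %s" % (n2, n3)):
        return "[[%s %s] %s]" % (n1, n2, n3)
    return "[%s [%s %s]]" % (n1, n2, n3)

# bracket("acute", "migraine", "treatment", count)
```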
27 (no transcript: figure-only slide)
28 Noun Compound Interpretation
- Determine the semantic relation between nouns
  - onion tears -> CAUSED-BY
  - pet spray -> FOR
- Method (sketched below)
  - Look for prepositions that tend to indicate the relation
  - Used inflected queries; inserted determiners before the nouns
    - Story/stories about the/a/0 war/wars
- Results
  - Best scores obtained for f(n1, p, n2)
  - Significantly outperforms the best existing algorithm
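The same pattern once more: score each relation by the frequency of its prepositional paraphrase f(n1, p, n2). The preposition-to-relation table below is illustrative, not the paper's, and `count` is an assumed web-count lookup.

```python
PREPOSITION_RELATIONS = {            # illustrative mapping, not the paper's
    "about": "ABOUT", "for": "FOR", "from": "CAUSED-BY", "of": "OF",
}

def interpret(n1, n2, count):
    """Choose the relation whose prepositional paraphrase is most frequent;
    the real queries are inflected and determiner-expanded, e.g.
    'stories about the war'."""
    best = max(PREPOSITION_RELATIONS,
               key=lambda p: count("%s %s %s" % (n1, p, n2)))
    return PREPOSITION_RELATIONS[best]
```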
29 Noun Compound Interpretation
30 Lapata & Keller '04 Summary
- Simple, unsupervised models using web counts can be devised for a variety of NLP tasks
- For 4/6 tasks, web counts were better than BNC counts
- In all but 1 case, the simple approach does not significantly outperform the best supervised model
  - Most of the supervised models use linguistic or taxonomic information in addition to being supervised
  - But the web approach is not significantly different from them for 3/6 problems
  - And it is much better than the baseline for all of these
- So this is a baseline that must be beaten in order to declare a new algorithm useful.
31 Cucerzan and Brill '04
- Spelling correction for the web using query logs
- A harder task than traditional spelling correction
- Must handle
  - Proper names, new terms, company names, etc.
    - blog, shrek, nsync
  - Multi-word phrases
  - Frequent and severe spelling errors (10-15%)
  - Very short contexts
- Existing approaches
  - Rely on a dictionary for comparison
  - Assume a single point change
    - Insertion, deletion, transposition, substitution
  - Don't handle word substitution
32 Spelling Correction Algorithm
- Main idea
  - Iteratively transform the query into other strings that correspond to more likely queries.
  - Use statistics from query logs to determine likelihood.
    - Despite the fact that many logged queries are themselves misspelled
    - Assume that the less wrong a misspelling is, the more frequent it is, and that correct > incorrect
- Example
  - ditroitigers -> detroittigers -> detroit tigers
33 Cucerzan and Brill '04
34 Spelling Correction Algorithm
- Algorithm (a simplified sketch follows)
  - Compute the set of all possible alternatives for each word in the query
    - Look at word unigrams and bigrams from the logs
    - This handles concatenation and splitting of words
  - Find the best possible alternative string to the input
    - Do this efficiently with a modified Viterbi algorithm
- Constraints
  - No 2 adjacent in-vocabulary words can change simultaneously
  - Short queries have further (unstated) restrictions
  - In-vocabulary words can't be changed in the first round of iteration
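A much-simplified sketch of the iterative idea, not the paper's Viterbi formulation: keep rewriting the query toward strings that are more frequent in the logs until a fixed point. `propose_alternatives` (returns a set of candidate rewrites) and `log_frequency` are assumed helpers.

```python
def correct_query(query, propose_alternatives, log_frequency, max_rounds=5):
    """Iteratively replace the query with its most log-frequent alternative;
    stop when no alternative beats the current string (a fixed point)."""
    current = query
    for _ in range(max_rounds):
        candidates = propose_alternatives(current) | {current}
        best = max(candidates, key=log_frequency)
        if best == current:
            break
        current = best   # e.g. ditroitigers -> detroittigers -> detroit tigers
    return current
```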
35 Spelling Correction Algorithm
- Comparing string similarity
  - Damerau-Levenshtein edit distance
    - The minimum number of point changes required to transform one string into another
- Trading off distance-function leniency
  - A rule that allows only one letter change can't fix
    - dondal duck -> donald duck
  - A rule that is too permissive makes too many errors
    - log wood -> dog food
- Actual measure
  - A modified, context-dependent, weighted Damerau-Levenshtein edit function (the unweighted base form is sketched below)
  - Point changes: insertion, deletion, substitution, immediate transposition, and long-distance movement of letters
  - Weights iteratively refined using statistics from query logs
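For reference, the standard unweighted Damerau-Levenshtein distance (the restricted, adjacent-transposition form); the paper's actual measure adds context-dependent weights and long-distance letter moves on top of this.

```python
def damerau_levenshtein(a, b):
    """Minimum number of insertions, deletions, substitutions, and adjacent
    transpositions needed to turn string a into string b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                          # delete all of a's prefix
    for j in range(len(b) + 1):
        d[0][j] = j                          # insert all of b's prefix
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("dondal", "donald"))    # 2: beyond one point change
print(damerau_levenshtein("log wood", "dog food"))  # 2: two substitutions
```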
36 Spelling Correction Evaluation
- Emphasizing recall
- First evaluation
  - 1044 randomly chosen queries
  - Annotated by two people (91.3% agreement)
  - 180 misspelled; annotators provided corrections
  - 81.1% system agreement with the annotators
  - 131 false positives
    - 2002 kawasaki ninja zx6e -> 2002 kawasaki ninja zx6r
  - 156 suggestions for the misspelled queries
  - 2 iterations were sufficient for most corrections
  - Problem: annotators were guessing user intent
37 Spelling Correction Evaluation
- Second evaluation
  - Try to find a misspelling followed by its correction
    - Sample successive pairs of queries from the log
    - Must be sent by the same user
    - Must differ from one another by a small edit distance
  - Present the pair to human annotators for verification and placement into the gold standard
  - The paper doesn't say how many pairs in total
38 Spelling Correction Evaluation
- Results on this set
  - 73.1% accuracy
  - Disagreed with the gold standard 99 times; made a suggestion in 80 of these
    - 40 of these were bad
    - 15 were functionally equivalent (audio file vs. audio files)
    - 17 were different valid suggestions (phone listings vs. telephone listings)
    - 8 found errors in the gold standard (brandy sniffers)
  - 85.5% correct, counting speller-correct or reasonable suggestions
- Sent an unspecified subset of the errors to Google's spellchecker
  - Its agreement with the gold standard was slightly lower
39 Spell Checking Summary
- Can use the collective knowledge stored in query logs
- Works quite well despite the noisiness of the data
- Exploits the errors made by people
- Might be further improved by incorporating text from other domains
40 Very Large Corpora Summary
- When building unsupervised NLP algorithms
  - Simple algorithms applied to large datasets can perform as well as state-of-the-art algorithms (in some cases)
  - If your algorithm can't do better than a simple unsupervised web-based query, don't publish it.
- Active learning (choosing the most uncertain items to hand-label) can reduce the amount of labeling needed
  - However, algorithms still aren't 100% correct, so it isn't proven that more data is enough; good algorithms are needed to go the extra mile.
  - A smaller labeled subset can provide close to the same results as a large one, if annotated intelligently
41 What is clustering?
- Clustering
  - The act of grouping similar objects into sets
- Classification vs. Clustering
  - Classification assigns objects to predefined groups
  - Clustering infers groups based on inter-object similarity
- Best used for exploration, rather than presentation
42 Classification vs. Clustering
Classification (supervised learning): learns a method for predicting the instance class from pre-labeled (classified) instances
43 Clustering
Unsupervised learning: finds "natural" groupings of instances given unlabeled data
44 Clustering Methods
- Many different methods and algorithms
  - For numeric and/or symbolic data
  - Deterministic vs. probabilistic
  - Exclusive vs. overlapping
  - Hierarchical vs. flat
  - Top-down vs. bottom-up
45 Clusters: exclusive vs. overlapping
(figures: a simple 2-D representation of non-overlapping clusters, and a Venn diagram of overlapping clusters)
46 Clustering Evaluation
- Manual inspection
- Benchmarking on existing labels
- Cluster quality measures
  - Distance measures
  - High similarity within a cluster, low across clusters
47 The distance function
- Simplest case: one numeric attribute A
  - Distance(X,Y) = |A(X) - A(Y)|
- Several numeric attributes
  - Distance(X,Y) = Euclidean distance between X and Y
- Nominal attributes: distance is 1 if values are different, 0 if they are equal
- Are all attributes equally important?
  - Weighting the attributes might be necessary (a sketch combining these cases follows)
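A minimal sketch of the mixed-attribute distance just described; the equal default weights are one plausible choice.

```python
import math

def distance(x, y, weights=None):
    """Numeric attributes contribute squared differences (Euclidean overall);
    nominal attributes contribute 1 if different, 0 if equal. Optional
    per-attribute weights reflect attribute importance."""
    weights = weights or [1.0] * len(x)
    total = 0.0
    for xv, yv, w in zip(x, y, weights):
        if isinstance(xv, (int, float)):
            total += w * (xv - yv) ** 2
        else:
            total += w * (1.0 if xv != yv else 0.0)
    return math.sqrt(total)

print(distance((1.0, "Sunny"), (4.0, "Rainy")))  # sqrt(9 + 1) = 3.16...
```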
48 Simple Clustering: K-means
- Works with numeric data only (a code sketch follows the steps)
1. Pick a number (K) of cluster centers (at random)
2. Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3. Move each cluster center to the mean of its assigned items
4. Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
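A minimal sketch of the four steps in plain Python (points are tuples of floats); real use would reach for a library implementation.

```python
import random

def kmeans(points, k, tol=1e-9):
    centers = [list(c) for c in random.sample(points, k)]     # step 1
    while True:
        # step 2: assign every point to its nearest center (Euclidean)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # step 3: move each center to the mean of its assigned items
        new_centers = [[sum(dim) / len(c) for dim in zip(*c)] if c
                       else centers[i] for i, c in enumerate(clusters)]
        # step 4: repeat until the centers stop moving
        shift = sum((a - b) ** 2
                    for old, new in zip(centers, new_centers)
                    for a, b in zip(old, new))
        centers = new_centers
        if shift < tol:
            return centers, clusters
```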
49 K-means example, step 1
Pick 3 initial cluster centers (randomly)
50 K-means example, step 2
Assign each point to the closest cluster center
51 K-means example, step 3
Move each cluster center to the mean of each cluster
52 K-means example, step 4
Reassign points closest to a different new cluster center. Q: Which points are reassigned?
53 K-means example, step 4
A: three points (shown with animation)
54 K-means example, step 4b
Re-compute cluster means
55 K-means example, step 5
Move cluster centers to cluster means
56 Discussion
- Results can vary significantly depending on the initial choice of seeds
- Can get trapped in a local minimum
  - Example (figure)
- To increase the chance of finding the global optimum: restart with different random seeds
57 K-means clustering summary
- Advantages
  - Simple, understandable
  - Items automatically assigned to clusters
- Disadvantages
  - Must pick the number of clusters beforehand
  - All items are forced into a cluster
  - Too sensitive to outliers
58 K-means variations
- K-medoids: instead of the mean, use the median of each cluster
  - Mean of 1, 3, 5, 7, 9 is 5
  - Mean of 1, 3, 5, 7, 1009 is 205
  - Median of 1, 3, 5, 7, 1009 is 5
- Median advantage: not affected by extreme values (checked in code below)
- For large databases, use sampling
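The slide's numbers, checked in code:

```python
from statistics import mean, median

print(mean([1, 3, 5, 7, 9]))       # 5
print(mean([1, 3, 5, 7, 1009]))    # 205 -- dragged away by the outlier
print(median([1, 3, 5, 7, 1009]))  # 5 -- unaffected by the extreme value
```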
59 Hierarchical clustering
- Bottom up (sketched in code below)
  - Start with single-instance clusters
  - At each step, join the two closest clusters
  - Design decision: distance between clusters
    - E.g. the two closest instances in the clusters vs. the distance between cluster means
- Top down
  - Start with one universal cluster
  - Find two clusters
  - Proceed recursively on each subset
  - Can be very fast
- Both methods produce a dendrogram
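A minimal bottom-up sketch using single link (distance between clusters = their two closest instances); `dist` is any pairwise distance such as the function from slide 47.

```python
def agglomerative(points, dist, target_k=1):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the two whose closest members are nearest, stopping when
    target_k clusters remain. Recording the merges in order would
    yield the dendrogram."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])   # merge the two closest clusters
        del clusters[j]
    return clusters
```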
60 Hierarchical Clustering Example
(figure: an agglomerative clustering dendrogram)
61 Incremental clustering
- Heuristic approach (COBWEB)
- Forms a hierarchy of clusters incrementally
- Start
  - The tree consists of an empty root node
- Then
  - Add instances one by one
  - Update the tree appropriately at each stage
  - To update, find the right leaf for an instance
    - May involve restructuring the tree
- Base update decisions on category utility (formula below)
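The slide leaves category utility undefined; the standard formula COBWEB uses (from Fisher '87) is, for clusters C_1..C_n, attributes A_i, and values V_ij:

```latex
CU(C_1,\dots,C_n) = \frac{1}{n} \sum_{k=1}^{n} P(C_k)
  \left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2
       - \sum_i \sum_j P(A_i = V_{ij})^2 \right]
```

A partition scores well when knowing an instance's cluster makes its attribute values much easier to guess than they are overall; dividing by n penalizes proliferating clusters.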
62 Clustering weather/tennis data
ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
E Rainy Cool Normal False
F Rainy Cool Normal True
G Overcast Cool Normal True
H Sunny Mild High False
I Sunny Cool Normal False
J Rainy Mild Normal False
K Sunny Mild Normal True
L Overcast Mild High True
M Overcast Hot Normal False
N Rainy Mild High True
63 Clustering weather/tennis data
(the same data table as above, repeated as the tree is built)
Merge the best host and runner-up
Consider splitting the best host if merging doesn't help
64 Final hierarchy
ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
Oops! A and B are actually very similar
65 Clustering Summary
- Use clustering to find the main groups in the data
- An inexact method: many parameters must be set
- Results are not readily understandable enough to show to everyday users of a search system
- Evaluation is difficult
  - The typical method is to see if the items in different clusters differ strongly from one another
  - This doesn't tell much about how understandable the clusters are
- The important thing is to inspect the clusters to get ideas about the characteristics of the collection.