Transcript and Presenter's Notes

Title: SIMS 290-2: Applied Natural Language Processing


1
SIMS 290-2: Applied Natural Language Processing
Marti Hearst, November 8, 2004
2
Today
  • Using Very Very Large Corpora
  • Banko & Brill: Confusion Pair Disambiguation
  • Lapata & Keller: Several NLP Subtasks
  • Cucerzan & Brill: Spelling Correction
  • Clustering Introduction

3
Algorithm or Data?
  • Given a text analysis problem and a limited
    amount of time to solve it, should we
  • Try to develop smarter algorithms, or
  • Find more training data?
  • Banko & Brill '01
  • Identified a problem for which a billion words of
    training data were available
  • Compared using more data to
  • using different classification algorithms
  • using voting algorithms
  • using active learning
  • using semi-supervised learning

4
Banko & Brill '01
  • Example problem: confusion set disambiguation
  • principle, principal
  • then, than
  • to, two, too
  • whether, weather
  • Current approaches include
  • Latent semantic analysis
  • Differential grammars
  • Decision lists
  • A variety of Bayesian classifiers

5
Banko & Brill '01
  • Collected a 1-billion-word English training
    corpus
  • 3 orders of magnitude larger than the largest
    corpus used previously for this problem
  • Consisted of
  • News articles
  • Scientific abstracts
  • Government transcripts
  • Literature
  • Etc.
  • Test set
  • 1 million words of WSJ text (not used in
    training)

6
Training on a Huge Corpus
  • Each learner trained at several cutoff points
  • First 1 million, then 5M, etc.
  • Items drawn by probabilistically sampling
    sentences from the different sources, weighted by
    source size.
  • Learners
  • Naïve Bayes, perceptron, Winnow, memory-based
  • Results
  • Accuracy continues to increase log-linearly even
    out to 1 billion words of training data
  • BUT the size of the trained model also increases
    log-linearly as a function of training set size.

7
Banko & Brill '01 (figure)
8
Comparing to Voting Algorithms
  • What about voting algorithms?
  • Ran Naïve Bayes, perceptron, and Winnow.
  • Voting was done by combining the normalized
    score each learner assigned (see the sketch below).
  • Results
  • Little gain beyond 1M words of training data
  • Starts to perform worse than single classifier
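
A minimal sketch of this kind of score-based voting; the three score
dictionaries over the confusion pair below are made-up illustrations,
not numbers from the paper:

# Score-based voting: each learner returns a dict of scores over the
# confusion set; scores are normalized per learner and then summed.
def vote(score_dicts):
    combined = {}
    for scores in score_dicts:
        total = sum(scores.values()) or 1.0
        for label, s in scores.items():
            combined[label] = combined.get(label, 0.0) + s / total
    return max(combined, key=combined.get)

# Made-up example: three learners scoring the confusion pair then/than.
naive_bayes = {"then": 0.7, "than": 0.3}
perceptron  = {"then": 0.4, "than": 0.6}
winnow      = {"then": 0.8, "than": 0.2}
print(vote([naive_bayes, perceptron, winnow]))   # -> then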

9
Banko & Brill '01 (figure)
10
Comparing to Active Learning
  • Problem
  • In most cases, huge volumes of labeled text are
    not available.
  • How can we use large unlabeled corpora?
  • Active Learning
  • Intelligently select examples for hand-labeling
  • Choose the most uncertain examples
  • Has mainly been looked at for small training sets
  • Determining uncertainty
  • Look at scores assigned across examples for a
    given learner; use relative values to find the
    tough cases
  • Committee-based sampling: choose the ones with
    the most disagreement among several algorithms

11
Comparing to Active Learning
  • Bagging (to create a committee of classifiers)
  • A variant of the original training set is
    constructed by randomly sampling sentences with
    replacement
  • Produces N new training sets of size K
  • Train N models and run them all on the same test
    set
  • Compare the classification for each model on each
    test set
  • The higher the disagreement, the tougher the
    training example
  • Select the M toughest examples for hand-labeling
    (see the sketch below)
  • They used a variation in which M/2 of the chosen
    sentences have high disagreement and M/2 are
    random
  • Otherwise the learners become too biased toward
    the hard cases and don't work as well in general.
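
A minimal sketch of the committee-based selection step, assuming the
bagged models are callables that return a class label and `unlabeled`
is a list of candidate examples (both hypothetical stand-ins, not the
paper's implementation):

# Committee-based sample selection with bagging: rank unlabeled examples
# by disagreement among the N bagged models, take the M/2 most-disputed
# examples, and fill the rest of the batch with random ones.
import random
from collections import Counter

def select_for_labeling(models, unlabeled, M, seed=0):
    rng = random.Random(seed)

    def disagreement(x):
        votes = Counter(m(x) for m in models)      # each model is a callable classifier
        return 1.0 - votes.most_common(1)[0][1] / len(models)

    ranked = sorted(unlabeled, key=disagreement, reverse=True)
    hard = ranked[: M // 2]                        # high-disagreement half
    rest = ranked[M // 2:]
    rng.shuffle(rest)
    return hard + rest[: M - len(hard)]            # random half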

12
Banko & Brill '01 (figure)
13
Active Learning
  • Results
  • Sample selection outperforms training on
    sequentially drawn data
  • However, the larger the pool of candidate
    instances, the better the results
  • This is a way to improve results with more
    unlabeled training data, with a fixed cost for
    hand-annotation
  • Thus we can benefit from large corpora even if
    they aren't all labeled

14
Weakly Supervised Learning
  • It would be ideal to use the large unlabeled
    corpus without additional hand-labeling.
  • Unsupervised learning
  • Many approaches have been proposed
  • Usually start with a seed set of labels and
    then use unlabeled data to continue training
  • Use bagging again
  • This time, choose the most certain instances to
    add to the training set.
  • If all the classifier models agree on a given
    example, then label the example and put it in the
    training set (see the sketch below).
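
A minimal sketch of the agreement-based labeling step under the same
assumptions (callable bagged models, hypothetical data structures):

# Weakly supervised step: an unlabeled example is added to the training
# set only if every bagged model predicts the same label for it.
def grow_training_set(models, unlabeled, labeled):
    still_unlabeled = []
    for x in unlabeled:
        predictions = {m(x) for m in models}   # each model is a callable classifier
        if len(predictions) == 1:              # unanimous agreement
            labeled.append((x, predictions.pop()))
        else:
            still_unlabeled.append(x)
    return labeled, still_unlabeled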

15
Banko & Brill '01 (figure)
16
Banko & Brill '01 (figure)
17
Weakly Supervised Learning
  • Results
  • Accuracy improves over the initial seed set
  • However, drops off after a certain point
  • Using the large corpus is better.

18
Lapata & Keller '04
  • On 6 different NLP problems, compared
  • Web-based n-grams (unsupervised)
  • BNC (smaller corpus)-based n-grams
  • The state-of-the-art supervised algorithm.
  • In all cases, the Web-based n-grams were
  • The same as or better than BNC-based n-grams
  • As good as or nearly as good as the
    state-of-the-art algorithm
  • Thus, they propose using the Web as a baseline
    against which most algorithms should be compared.

19
Computing Web n-grams
  • Find the number of hits for each term via AltaVista
  • This gives doc frequency, not term frequency
  • Smooth 0 counts to 0.5
  • All terms lowercased
  • Three different query types (sketched below)
  • Literal queries
  • Double quotes around the term
  • NEAR queries
  • a NEAR b: within a 10-word window
  • Inflected queries
  • Expand each term to all morphological variations
  • history change
  • histories change
  • history changes
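
A minimal sketch of how these query types and the 0.5 smoothing could
be put together; search_hit_count is a hypothetical stand-in for the
AltaVista hit-count lookup used in the paper, and the morphological
expansion is illustrative:

# Building the three query types and smoothing zero hit counts to 0.5.
def literal_query(words):
    return '"' + " ".join(w.lower() for w in words) + '"'   # double-quoted phrase

def near_query(a, b):
    return f"{a.lower()} NEAR {b.lower()}"                  # within a 10-word window

def smoothed_count(query, search_hit_count):
    hits = search_hit_count(query)       # document frequency, not term frequency
    return hits if hits > 0 else 0.5     # smooth 0 counts to 0.5

def inflected_count(variants, search_hit_count):
    # variants, e.g.: ["history change", "histories change", "history changes"]
    return sum(smoothed_count(literal_query(v.split()), search_hit_count)
               for v in variants)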

20
Target Word Selection for Machine Translation
  • E.g., decide between 5 alternative translations
  • Die Geschichte ändert sich, nicht jedoch die
    Geographie.
  • {History, store, tale, saga, strip} changes, but
    geography does not.
  • Looked at a test set of verb-object pairs from
    Prescher et al. '00
  • Assume that the target translation of the verb is
    known
  • Select which noun is semantically most compatible
    from the candidate translations.
  • Web statistics
  • Retrieve web counts for all possible (inflected)
    verb-object translations
  • Choose the most frequent one

21
Results on MT term selection
22
Confusion Set Spelling Correction
  • Confusion sets as before: principle, principal
  • Method
  • Use collocation counts as features
  • Words next to the target word
  • For each word in the confusion set
  • Use the web to estimate how frequently it
    co-occurs with a word or a pair of words
    immediately to its left or right
  • Disambiguate by selecting the word in the
    confusion set with the highest co-occurrence
    frequency
  • Ties go to the most frequently occurring term
  • Results
  • Best result obtained with one word to the left,
    one to the right of the target word: f(w1, t,
    w2). See the sketch below.
  • Not as good as the best supervised method, but far
    simpler, and far better than the baseline.
  • Many fewer features; doesn't use POS information
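
A minimal sketch of the collocation-based choice, with a hypothetical
count function standing in for the web hit-count lookup:

# Pick the confusion-set member with the highest count for the pattern
# f(w1, t, w2); ties go to the candidate's own term frequency.
def choose(confusion_set, left, right, count):
    def key(candidate):
        return (count(f'"{left} {candidate} {right}"'),   # f(w1, t, w2)
                count(f'"{candidate}"'))                   # tie-break: frequency of the term
    return max(confusion_set, key=key)

# e.g. choose({"principle", "principal"}, "the", "of", count)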

23
(No Transcript)
24
Ordering of Pre-nominal Adjectives
  • Which adjective comes before which others?
  • The small, wiggly, cute black puppy.
  • The small, black, wiggly cute puppy.
  • The wiggly, small cute puppy.
  • Approach
  • Tested on the adjective pair set from Malouf '00
  • Choose the adjective order with the highest
    frequency, using literal queries
  • Results
  • No significant difference between this and the
    state-of-the-art approach
  • (which uses a back-off bigram model, positional
    probabilities, and a memory-based learner with
    many features)

25
Ordering of Pre-nominal Adjectives
26
Compound Noun Bracketing
  • Which way do the nouns group?
  • [acute migraine] treatment
  • acute [migraine treatment]
  • Current best model (Lauer '95) uses a thesaurus
    and taxonomy in an unsupervised manner
  • Method
  • Compare the probability of left-branching to the
    probability of right-branching (as in Lauer '95)
  • But estimate the probabilities from web counts
    (see the sketch below)
  • Inflected queries and NEAR operator
  • Results
  • Far better than baseline; no significant
    difference from the best model
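
A minimal sketch of one simple bracketing heuristic in this spirit; it
compares raw counts of the two candidate inner compounds rather than
the exact probability model of Lauer '95 / Lapata & Keller, and count
is again a hypothetical web hit-count lookup:

# Compare evidence for the two bracketings of a three-noun compound by
# querying counts of the candidate inner compounds.
def bracket(n1, n2, n3, count):
    left  = count(f'"{n1} {n2}"')    # supports [n1 n2] n3
    right = count(f'"{n2} {n3}"')    # supports n1 [n2 n3]
    return f"[{n1} {n2}] {n3}" if left > right else f"{n1} [{n2} {n3}]"

# e.g. bracket("acute", "migraine", "treatment", count)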

27
(No Transcript)
28
Noun Compound Interpretation
  • Determine semantic relation between nouns
  • onion tears -> CAUSED-BY
  • pet spray -> FOR
  • Method
  • Look for prepositions that tend to indicate the
    relation
  • Used inflected queries inserted determiners
    before nouns
  • Story/stories about the/a/0 war/wars
  • Results
  • Best scores obtained for f(n1, p, n2)
  • Significantly outperforms best existing algorithm

29
Noun Compound Interpretation
30
Lapata & Keller '04 Summary
  • Simple, unsupervised models using web counts can
    be devised for a variety of NLP tasks
  • For 4/6 tasks, web counts better than BNC
  • In all but 1 case, the simple approach does not
    significantly outperform the best supervised
    model.
  • Most of these use linguistic or taxonomic
    information in addition to being supervised.
  • But not significantly different for 3/6 problems.
  • But much better than the baseline for all of
    these.
  • So this is a baseline that must be beaten in
    order to declare a new algorithm useful.

31
Cucerzan and Brill '04
  • Spelling correction for the web using query logs
  • Harder task than traditional spelling correction
  • Have to handle
  • Proper names, new terms, company names, etc.
  • blog, shrek, nsync
  • Multi-word phrases
  • Frequent and severe spelling errors (10-15%)
  • Very short contexts
  • Existing approaches
  • Rely on a dictionary for comparison
  • Assume a single point change
  • Insertion, deletion, transposition, substitution
  • Don't handle word substitution

32
Spelling Correction Algorithm
  • Main idea
  • Iteratively transform the query into other
    strings that correspond to more likely queries.
  • Use statistics from query logs to determine
    likelihood.
  • Despite the fact that many of these queries are
    misspelled
  • Assume that the less wrong a misspelling is, the
    more frequent it is, and correct > incorrect
  • Example
  • ditroitigers ->
  • detroittigers ->
  • detroit tigers

33
Cucerzan and Brill '04 (figure)
34
Spelling Correction Algorithm
  • Algorithm
  • Compute the set of all possible alternatives for
    each word in the query
  • Look at word unigrams and bigrams from the logs
  • This handles concatenation and splitting of words
  • Find the best possible alternative string to the
    input
  • Do this efficiently with a modified Viterbi
    algorithm
  • Constraints
  • No 2 adjacent in-vocabulary words can change
    simultaneously
  • Short queries have further (unstated)
    restrictions
  • In-vocabulary words can't be changed in the first
    round of iteration (a simplified sketch follows)
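
A minimal sketch of the iterative idea only; it ignores the adjacency
constraints and the modified Viterbi search, and log_counts and
candidates are hypothetical stand-ins for query-log statistics and the
edit-distance candidate generator:

# Iteratively rewrite each query word as its most likely alternative
# according to query-log counts, until nothing changes.
def correct_query(query, log_counts, candidates, max_iters=3):
    words = query.split()
    for _ in range(max_iters):
        changed = False
        new_words = []
        for w in words:
            options = set(candidates(w)) | {w}
            best = max(options, key=lambda c: log_counts.get(c, 0))
            if best != w:
                changed = True
            new_words.append(best)
        words = new_words
        if not changed:          # fixed point reached
            break
    return " ".join(words)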

35
Spelling Correction Algorithm
  • Comparing string similarity
  • Damerau-Levenshtein edit distance
  • The minimum number of point changes required to
    transform a string into another
  • Trading off distance function leniency
  • A rule that allows only one letter change can't
    fix
  • dondal duck -> donald duck
  • An overly permissive rule makes too many errors
  • log wood -> dog food
  • Actual measure
  • A modified context-dependent weighted
    Damerau-Levenshtein edit function (the plain,
    unweighted version is sketched below)
  • Point changes: insertion, deletion, substitution,
    immediate transpositions, long-distance movement
    of letters
  • Weights interactively refined using statistics
    from query logs
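
For reference, a plain unweighted Damerau-Levenshtein distance
(insertion, deletion, substitution, adjacent transposition); the
paper's measure adds context-dependent weights and long-distance letter
movement on top of this baseline:

# Unweighted Damerau-Levenshtein distance over characters.
def damerau_levenshtein(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # adjacent transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("dondal duck", "donald duck"))   # 2: more than one point change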

36
Spelling Correction Evaluation
  • Emphasizing recall
  • First evaluation
  • 1044 randomly chosen queries
  • Annotated by two people (91.3% agreement)
  • 180 misspelled; annotators provided corrections
  • 81.1% system agreement with annotators
  • 131 false positives
  • 2002 kawasaki ninja zx6e -> 2002 kawasaki ninja
    zx6r
  • 156 suggestions for the misspelled queries
  • 2 iterations were sufficient for most corrections
  • Problem: annotators were guessing user intent

37
Spelling Correction Evaluation
  • Second evaluation
  • Try to find a misspelling followed by its
    correction
  • Sample successive pairs of queries from the log
  • Must be sent by same user
  • Differ from one another by a small edit distance
  • Present the pair to human annotators for
    verification and placement into the gold standard
  • The paper doesn't say how many pairs in total

38
Spelling Correction Evaluation
  • Results on this set
  • 73.1% accuracy
  • Disagreed with gold standard 99 times (80 were
    suggestions)
  • 40 of these were bad
  • 15 were functionally equivalent (audio file vs.
    audio files)
  • 17 were different valid suggestions (phone
    listings vs. telephone listings)
  • 8 found errors in the gold standard (brandy
    sniffers)
  • 85.5% correct (speller correct or reasonable)
  • Sent an unspecified subset of the errors to
    Google's spellchecker
  • Its agreement with the gold standard was slightly
    lower

39
Spell Checking Summary
  • Can use the collective knowledge stored in query
    logs
  • Works pretty well despite the noisiness of the
    data
  • Exploits the errors made by people
  • Might be further improved to incorporate text
    from other domains

40
Very Large Corpora Summary
  • When building unsupervised NLP algorithms
  • Simple algorithms applied to large datasets can
    perform as well as state-of-the-art algorithms (in
    some cases)
  • If your algorithm can't do better than a simple
    unsupervised web-based query, don't publish it.
  • Active learning (choose the most uncertain items
    to hand-label) can reduce the amount of labeling
    needed
  • However, algorithms still aren't 100% correct, so
    it isn't proven that more data is enough; good
    algorithms are needed to go the extra mile.
  • A smaller labeled subset can provide close to the
    same results as a large one, if annotated
    intelligently

41
What is clustering?
  • Clustering
  • the act of grouping similar objects into sets
  • Classification vs. Clustering
  • Classification assigns objects to predefined
    groups
  • Clustering infers groups based on inter-object
    similarity
  • Best used for exploration, rather than
    presentation

42
Classification vs. Clustering
Classification (supervised learning): learns a
method for predicting the instance class from
pre-labeled (classified) instances
43
Clustering
Unsupervised learning: finds natural groupings
of instances given unlabeled data
44
Clustering Methods
  • Many different methods and algorithms
  • For numeric and/or symbolic data
  • Deterministic vs. probabilistic
  • Exclusive vs. overlapping
  • Hierarchical vs. flat
  • Top-down vs. bottom-up

45
Clusters: exclusive vs. overlapping
Simple 2-D representation: non-overlapping
Venn diagram: overlapping


46
Clustering Evaluation
  • Manual inspection
  • Benchmarking on existing labels
  • Cluster quality measures
  • distance measures
  • high similarity within a cluster, low across
    clusters

47
The distance function
  • Simplest case: one numeric attribute A
  • Distance(X,Y) = |A(X) - A(Y)|
  • Several numeric attributes
  • Distance(X,Y) = Euclidean distance between X and Y
  • Nominal attributes: distance is set to 1 if
    values are different, 0 if they are equal
  • Are all attributes equally important?
  • Weighting the attributes might be necessary
    (see the sketch below)
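
A minimal sketch of one way to combine these pieces (Euclidean over
numeric attributes, 0/1 mismatch for nominal ones, optional weights);
the attribute names and the weighting scheme are illustrative, not from
the slide:

# Euclidean distance over numeric attributes plus a 0/1 penalty per
# mismatched nominal attribute, with optional per-attribute weights.
import math

def distance(x, y, numeric_keys, nominal_keys, weights=None):
    weights = weights or {}
    num = sum(weights.get(k, 1.0) * (x[k] - y[k]) ** 2 for k in numeric_keys)
    nom = sum(weights.get(k, 1.0) * (0.0 if x[k] == y[k] else 1.0)
              for k in nominal_keys)
    return math.sqrt(num) + nom

a = {"temp": 85, "humidity": 90, "outlook": "sunny"}
b = {"temp": 70, "humidity": 90, "outlook": "rainy"}
print(distance(a, b, ["temp", "humidity"], ["outlook"]))   # 15.0 + 1 = 16.0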

48
Simple Clustering: K-means
  • Works with numeric data only
  • 1. Pick a number (K) of cluster centers (at
    random)
  • 2. Assign every item to its nearest cluster center
    (e.g. using Euclidean distance)
  • 3. Move each cluster center to the mean of its
    assigned items
  • 4. Repeat steps 2-3 until convergence (change in
    cluster assignments less than a threshold); see
    the sketch below
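
A minimal sketch of these four steps on 2-D points, with random initial
centers and Euclidean distance; the toy data is illustrative:

# K-means on 2-D points: random initial centers, Euclidean assignment,
# recompute means, repeat until the centers stop moving.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                     # 1. pick K centers at random
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # 2. assign each item to nearest center
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                  + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        new_centers = [                                 # 3. move centers to cluster means
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)]
        if new_centers == centers:                      # 4. stop when centers no longer move
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.5, 2), (1, 0.6), (5, 8), (8, 8), (9, 11)]
print(kmeans(pts, 2)[0])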

49
K-means example, step 1
Pick 3 initial cluster centers (randomly)
50
K-means example, step 2
Assign each point to the closest cluster center
51
K-means example, step 3
Move each cluster center to the mean of each
cluster
52
K-means example, step 4
Reassign points closest to a different new
cluster center. Q: Which points are reassigned?
53
K-means example, step 4
A: three points
54
K-means example, step 4b
re-compute cluster means
55
K-means example, step 5
move cluster centers to cluster means
56
Discussion
  • Result can vary significantly depending on
    initial choice of seeds
  • Can get trapped in local minimum
  • Example
  • To increase the chance of finding the global
    optimum, restart with different random seeds

57
K-means clustering summary
  • Advantages
  • Simple, understandable
  • items automatically assigned to clusters
  • Disadvantages
  • Must pick the number of clusters beforehand
  • All items forced into a cluster
  • Too sensitive to outliers

58
K-means variations
  • K-medoids: instead of the mean, use the median of
    each cluster
  • Mean of 1, 3, 5, 7, 9 is 5
  • Mean of 1, 3, 5, 7, 1009 is 205
  • Median of 1, 3, 5, 7, 1009 is 5
  • Median advantage: not affected by extreme values
  • For large databases, use sampling

59
Hierarchical clustering
  • Bottom up
  • Start with single-instance clusters
  • At each step, join the two closest clusters
  • Design decision: distance between clusters
  • E.g. two closest instances in clusters vs.
    distance between means
  • Top down
  • Start with one universal cluster
  • Find two clusters
  • Proceed recursively on each subset
  • Can be very fast
  • Both methods produce a dendrogram (a bottom-up
    sketch follows)
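
A minimal bottom-up sketch using SciPy's agglomerative (single-linkage)
clustering; the toy points and the choice of single linkage are
illustrative assumptions, not from the slides:

# Bottom-up (agglomerative) clustering with SciPy: single linkage joins
# the two clusters whose closest instances are nearest to each other.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])

Z = linkage(points, method="single", metric="euclidean")   # builds the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")            # cut it into 3 flat clusters
print(labels)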

60
Hierarchical Clustering Example
agglomerative
61
Incremental clustering
  • Heuristic approach (COBWEB)
  • Form a hierarchy of clusters incrementally
  • Start
  • tree consists of empty root node
  • Then
  • add instances one by one
  • update tree appropriately at each stage
  • to update, find the right leaf for an instance
  • May involve restructuring the tree
  • Base update decisions on category utility

62
Clustering weather/tennis data

ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
E Rainy Cool Normal False
F Rainy Cool Normal True
G Overcast Cool Normal True
H Sunny Mild High False
I Sunny Cool Normal False
J Rainy Mild Normal False
K Sunny Mild Normal True
L Overcast Mild High True
M Overcast Hot Normal False
N Rainy Mild High True
63
Clustering weather/tennis data
  • (the same weather/tennis data table as on the
    previous slide)

Merge best host and runner-up
Consider splitting the best host if merging
doesn't help
64
Final hierarchy
ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
Oops! A and B are actually very similar
65
Clustering Summary
  • Use clustering to find the main groups in the
    data
  • An inexact method; many parameters must be set
  • Results are not readily understandable enough to
    show to everyday users of a search system
  • Evaluation is difficult
  • Typical method is to see if the items in
    different clusters differ strongly from one
    another
  • This doesn't tell much about how understandable
    the clusters are
  • The important thing to do is inspect the clusters
    to get ideas about the characteristics of the
    collection.