Title: Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 9: Natural Language Processing and IR. Tagging, WSD, and Anaphora Resolution
1. Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 9: Natural Language Processing and IR. Tagging, WSD, and Anaphora Resolution
- Alexander Gelbukh
- www.Gelbukh.com
2. Previous Chapter: Conclusions
- Reducing synonyms can help IR
- Better matching
- Ontologies are used: WordNet
- Morphology is a variant of synonymy
- Widely used in IR systems
- Precise analysis: dictionary-based analyzers
- Quick-and-dirty analysis: stemmers
- Rule-based stemmers: Porter stemmer
- Statistical stemmers
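For illustration, a minimal stemming sketch (not from the slides), assuming Python with the nltk package installed:

# Rule-based stemming with NLTK's Porter stemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["flies", "flying", "banks", "banking", "retrieval"]:
    print(word, "->", stemmer.stem(word))
# Morphological variants usually reduce to a common stem,
# so they can match the same index term in IR.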
3. Previous Chapter: Research Topics
- Construction and application of ontologies
- Building morphological dictionaries
- Treatment of unknown words with morphological analyzers
- Development of better stemmers
- Statistical stemmers?
4. Contents
- Tagging: for each word, determine its POS (part of speech: noun, ...) and grammatical characteristics
- WSD (Word Sense Disambiguation): for each word, determine which homonym is used
- Anaphora resolution: for a pronoun (it, ...), determine what it refers to
5. Tagging: The Problem
- Ambiguity of parts of speech
- rice flies like sand
- insects living in rice consider sand good?
- rice can fly similarly to sand?
- ... insect of a container with rice...?
- We can fly like sand... We think fly like sand...
- Ambiguity of grammatical characteristics
- He has read the book
- He will read the book... He read the book
- A very frequent phenomenon, at nearly every word!
6. Tagger...
- A program that looks at the context and decides what the part of speech (and other characteristics) is
- Input
- He will read the book
- Morphological analysis
- He<...> will<Ns Va> read<Vpa Vpp Vinf> the<...>
- Tags: Ns = noun singular, Va = verb auxiliary, Vpa = verb past, Vpp = verb past participle, Vinf = verb infinitive, ...
7. ...Tagger
- Input of tagger
- He<...> will<Ns Va> read<Vpa Vpp Vinf> the<...>
- Task: choose one!
- Output
- He<...> will<Va> read<Vinf> the<...>
- How do we do it?
- He will<N>: not possible, so → Va
- will<Va> read → Vinf
- This is simple, but imagine "He" is ambiguous too... Explosion!
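As a quick illustration of the same task, a sketch using NLTK's default POS tagger; it uses Penn Treebank tags rather than the Ns/Va/Vinf tags above, and assumes the NLTK tokenizer and tagger model data have been downloaded:

# POS-tagging the slide's example sentence with NLTK
# (requires nltk plus its 'punkt' and perceptron tagger data)
import nltk

tokens = nltk.word_tokenize("He will read the book")
print(nltk.pos_tag(tokens))
# Output is something like [('He', 'PRP'), ('will', 'MD'), ('read', 'VB'), ...]:
# the ambiguity of "read" is resolved from the context, as described above.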
8. Applications
- Used for word sense disambiguation
- Oil well in Mexico is used.
- Oil is used well in Mexico.
- For stemming and lemmatization
- Important for matching in information retrieval
- Greatly speeds up syntactic analysis
- Tagging is local
- No need to process the whole sentence to find
that a certain tag is incorrect
9. How about Parsing?
- We can find all the syntactic structures
- Only the correct variants will enter the syntactic structure
- will + Vinf forms a syntactic unit
- will + Vpa does not
- Problems
- Computationally expensive
- What to do with ambiguities?
- fly rice like sand
- Depends on what you need
10. Statistical Tagger
- Example: the TnT tagger
- Based on a Hidden Markov Model (HMM)
- Idea
- Some words are more probable after some other words
- Find these probabilities
- Guess the word if you know the nearby ones
- Problem
- Letter strings denote meanings
- "x is more probable after y": x and y are meanings, not strings
- So we must guess what we cannot see: meanings
11. Hidden Markov Model: Idea
- A system changes its state
- What a person thinks
- Random... but not completely (how?)
- In each state, it emits an output
- What he says when he thinks something
- Random... but somehow (?) depends on what he thinks
- We know the sequence of produced outputs
- Text: we can see it!
- Guess what were the underlying states
- Hidden: we cannot see them
12. Hidden Markov Model: Hypotheses
- A finite set of states q1 ... qN (invisible)
- POS and grammatical characteristics (language)
- A finite set of observations v1 ... vM
- Strings we see in the corpus (language)
- A random sequence of states xi
- POS in the text
- Probabilities of state transitions P(xi+1 | xi)
- Language rules and use
- Probabilities of observations P(vk | xi)
- Words expressing the meanings: Vinf → ask, V3 → asks
13. Hidden Markov Model: Problem
- The same observation can correspond to different meanings
- Vinf read, Vpp read
- Looking at what we can see, guess what we cannot
- This is why it is called hidden
- Given a sequence of observations oi
- The text: a sequence of letter strings (the training set)
- Guess the sequence of states xi
- The POS of each word
- Our hypotheses on xi depend on each other
- Highly combinatorial task
14. Hidden Markov Model: Solutions
- Need to find the parameters of the model
- P(xi+1 | xi)
- P(vk | xi)
- In the optimal way: to maximize the probability of generating this specific output
- Optimization methods from Operations Research are used
- More details? Not so simple...
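A minimal sketch of the standard decoding step (Viterbi) for such a model; the tag set and all probabilities below are toy numbers invented for illustration, not parameters estimated from a corpus:

# Viterbi decoding: given P(tag_next | tag) and P(word | tag),
# find the most probable tag sequence for the observed words.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in tag t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][p] * trans_p[p].get(t, 1e-6) * emit_p[t].get(words[i], 1e-6), p)
                for p in tags)
            best[i][t], back[i][t] = prob, prev
    last = max(best[-1], key=best[-1].get)      # best final tag
    path = [last]
    for i in range(len(words) - 1, 0, -1):      # follow the back-pointers
        path.insert(0, back[i][path[0]])
    return path

tags = ["Va", "Vinf", "N"]                      # toy tag set
start_p = {"Va": 0.4, "Vinf": 0.3, "N": 0.3}    # toy probabilities
trans_p = {"Va": {"Vinf": 0.8, "N": 0.1, "Va": 0.1},
           "Vinf": {"N": 0.5, "Va": 0.3, "Vinf": 0.2},
           "N": {"Va": 0.4, "Vinf": 0.3, "N": 0.3}}
emit_p = {"Va": {"will": 0.7}, "Vinf": {"read": 0.5}, "N": {"will": 0.2, "book": 0.6}}
print(viterbi(["will", "read"], tags, start_p, trans_p, emit_p))  # -> ['Va', 'Vinf']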
15. Brill Tagger (Rule-Based)
- Eric Brill
- Makes an initial assumption about POS tags in the text
- Uses context-dependent rewriting rules to correct some tags
- Applies them iteratively
- Learns the rules from a training corpus
- The rules are in human-understandable form
- You can correct them manually to improve the tagger
- Unlike HMMs, which are not understandable
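An illustrative sketch of the idea, assuming a made-up initial lexicon and two hand-written rules (a real Brill tagger learns its rules from a training corpus):

# Brill-style tagging: start from an initial guess (most frequent tag per word)
# and apply human-readable, context-dependent rewriting rules.
initial_tags = {"he": "Pron", "will": "N", "read": "Vpa", "the": "Det", "book": "N"}

# each rule: (from_tag, to_tag, condition on the previous tag)
rules = [
    ("N",   "Va",   lambda prev: prev == "Pron"),   # "he will": will is an auxiliary
    ("Vpa", "Vinf", lambda prev: prev == "Va"),     # "will read": read is an infinitive
]

def brill_tag(words):
    tags = [initial_tags.get(w.lower(), "N") for w in words]
    for frm, to, cond in rules:                     # apply the rules iteratively
        for i in range(1, len(tags)):
            if tags[i] == frm and cond(tags[i - 1]):
                tags[i] = to
    return list(zip(words, tags))

print(brill_tag("He will read the book".split()))
# -> [('He', 'Pron'), ('will', 'Va'), ('read', 'Vinf'), ('the', 'Det'), ('book', 'N')]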
16. Word Sense Disambiguation
- Query: international bank in Seoul
- Bank ? ?
- financial institution Korean
- river shore superior official
- place to store something ??? ...
- ... ... ...
- A hotel located on the beautiful bank of the Han river.
- Relevant for the query?
- The POS is the same: a tagger will not distinguish them
17. Applications
- Translation
- ??? Great Governor of the Court
- ?? 10 thousand won
- international bank → banco internacional
- river bank → orilla del río
- Information retrieval
- Document retrieval: is it really useful? Same info
- Passage retrieval: can prove very useful!
- Semantic analysis
18. Representation of Word Senses
- Explanations: semantic dictionaries
- Bank1 is an institution to keep money
- Bank2 is a sloping edge of a river
- Synsets and ontology: WordNet (HowNet for Chinese)
- Synonyms: bank, shore
- WordNet terminology: synset 12345
- Corresponds to all ways to call a concept
- Relationships: 12345 IS_PART_OF 67890 (river, stream); 987 IS_A 654 (institution, organization)
- WordNet also has glosses
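A small sketch of inspecting these structures with NLTK's WordNet interface (assumes the wordnet corpus has been downloaded; the synset numbers differ from the slide's 12345-style examples):

# Synsets, glosses, and IS_A relations for "bank" in WordNet via NLTK
from nltk.corpus import wordnet as wn

for syn in wn.synsets("bank", pos=wn.NOUN)[:3]:
    print(syn.name(), "-", syn.definition())                   # gloss
    print("  synonyms:", [l.name() for l in syn.lemmas()])     # the synset
    print("  IS_A:", [h.name() for h in syn.hypernyms()])      # hypernyms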
19. Task
- Given a text (probably POS-tagged)
- Tag each word with its synset number (123) or dictionary sense number (bank1)
- Input
- Mary keeps the money in a bank.
- Han river's bank is beautiful.
- Output
- Mary keeps<1> the money<1> in a bank<1>
- Han river's bank<2> is beautiful.
20. Lesk Algorithm
- Michael Lesk
- Explanatory dictionary
- Bank1 is an institution to keep money
- Bank2 is a sloping edge of a river
- Mary keeps her money (savings) in a bank.
- Choose the sense which has more words in common with the immediate context
- Improvements (Pedersen; Gelbukh & Sidorov)
- Use synonyms when no direct matches
- Use synonyms of synonyms, ...
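A very simplified sketch of the Lesk idea, using a two-sense toy dictionary that paraphrases the glosses above (a real implementation would use full dictionary entries and better preprocessing):

# Simplified Lesk: choose the sense whose explanation shares the most
# (non-stop) words with the immediate context of the ambiguous word.
senses = {
    "bank1": "an institution to keep money",
    "bank2": "a sloping edge of a river",
}
STOP = {"a", "an", "the", "of", "to", "in", "is", "on", "at", "her"}

def lesk(context_sentence):
    context = set(context_sentence.lower().split()) - STOP
    overlap = {s: len(context & (set(gloss.lower().split()) - STOP))
               for s, gloss in senses.items()}
    return max(overlap, key=overlap.get)

print(lesk("Mary keeps her money savings in a bank"))        # -> bank1
print(lesk("hotel at the beautiful bank of the Han river"))  # -> bank2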
21. Other Word Relatedness Measures
- Lexical chains in WordNet
- The length of the path in the graph of relationships
- Mutual information: frequent co-occurrences
- Collocations (Bolshakov & Gelbukh)
- Keep in bank1
- Bank2 of river
- Very large dictionary of such combinations
- Number of words in common between explanations
- Recursively: common words or related words (Gelbukh & Sidorov)
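A sketch of the path-length idea with NLTK's WordNet interface (assumes the wordnet corpus is available; the sense indices below are from WordNet 3.0 and may vary):

# Path-based relatedness in the WordNet graph: closer synsets score higher
from nltk.corpus import wordnet as wn

river_bank = wn.synset("bank.n.01")   # sloping land beside a body of water
money_bank = wn.synset("bank.n.02")   # financial institution
river      = wn.synset("river.n.01")

print(river_bank.path_similarity(river))   # typically higher: short path
print(money_bank.path_similarity(river))   # typically lower: long path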
22. Other Methods
- Hidden Markov Models
- Logical reasoning
23. Yarowsky's Principles
- David Yarowsky
- One sense per text!
- One sense per collocation
- I keep my money in the bank1. This is an international bank1 with a great capital. The bank2 is located near the Han river.
- 3 words vote for institution, one for shore
- Institution!
- The bank1 is located near the Han river.
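A toy sketch of the "one sense per text" voting step, using the preliminary per-occurrence labels from the example above:

# One sense per text: the majority preliminary sense wins for the whole discourse
from collections import Counter

preliminary = ["bank1", "bank1", "bank2"]          # per-occurrence guesses
majority = Counter(preliminary).most_common(1)[0][0]
final = [majority] * len(preliminary)
print(final)                                       # -> ['bank1', 'bank1', 'bank1']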
24. Anaphora Resolution
- Mainly pronouns.
- Also co-reference: when two words refer to the same thing
- John took the cake from the table and ate it.
- John took the cake from the table and washed it.
- Translation into Spanish: la ("she") = the table / lo ("he") = the cake
- Dictionaries
- Different sources of evidence
- Logical reasoning
25. Applications
- Translation
- Information retrieval
- Can improve frequency counts (?)
- Passage retrieval can be very important
26. Mitkov's Knowledge-Poor Method
- Ruslan Mitkov
- Rule-based and statistics-based approach
- Uses simple information on POS and general word classes
- Combines different sources of evidence
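An illustrative sketch of combining several cheap sources of evidence to score antecedent candidates; the indicators and weights below are invented for illustration and are not Mitkov's actual factors:

# Knowledge-poor scoring of antecedent candidates for a pronoun:
# each candidate noun phrase gets points from simple surface indicators.
def resolve_pronoun(candidates, pronoun_position):
    def score(c):
        s = 0
        s += 2 if c["is_subject"] else 0              # subjects are preferred
        s += 1 if c["definite"] else 0                # definite NPs are preferred
        s -= (pronoun_position - c["position"])       # recency: closer is better
        return s
    return max(candidates, key=score)

candidates = [
    {"text": "John",  "position": 1, "is_subject": True,  "definite": True},
    {"text": "cake",  "position": 3, "is_subject": False, "definite": False},
    {"text": "table", "position": 6, "is_subject": False, "definite": True},
]
print(resolve_pronoun(candidates, pronoun_position=9)["text"])   # -> 'table'
# Note: such surface indicators alone cannot distinguish "ate it" (cake)
# from "washed it" (table); that needs semantic knowledge, as in slide 24.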
27. Hidden Anaphora
- John bought a house. The kitchen is big.
- that house's kitchen
- John was eating. The food was delicious.
- that eating's food
- John was buried. The widow was mad with grief.
- that burying's (death's) widow
- Intersection of scenarios of the concepts (Gelbukh & Sidorov)
- house has a kitchen
- burying results from death; widow results from death
28. Evaluation
- Senseval and TREC: international competitions
- Korean track available
- Human-annotated corpus
- Very expensive
- Inter-annotator agreement is often low!
- A program cannot do what humans cannot do
- Apply the program and compare with the corpus
- Accuracy
- Sometimes the program cannot tag a word
- Precision, recall
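A small sketch of these measures, comparing a program's output against a human-annotated gold corpus; the data is made up, and a tag of None marks a word the program could not tag:

# Accuracy, precision, and recall against a gold annotation
def evaluate(gold, predicted):
    answered = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in answered if g == p)
    precision = correct / len(answered)                      # correct / answered
    recall = correct / len(gold)                             # correct / all words
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    return precision, recall, accuracy

gold      = ["bank1", "bank2", "bank1", "bank1"]
predicted = ["bank1", "bank1", None,    "bank1"]
print(evaluate(gold, predicted))   # -> (0.666..., 0.5, 0.5)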
29. Research Topics
- Too many to list
- New methods
- Lexical resources (dictionaries)
- Computational linguistics
30. Conclusions
- Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning
- Useful in translation, information retrieval, and text understanding
- Dictionary-based methods
- good but expensive
- Statistical methods
- cheap and sometimes imperfect... but not always (if very large corpora are available)
31. Thank you! Till May 31 / June 1, 6 pm