Title: Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 9: Natural Language Processing and IR. Tagging, WSD, and Anaphora Resolution
1. Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 9: Natural Language Processing and IR. Tagging, WSD, and Anaphora Resolution
- Alexander Gelbukh
- www.Gelbukh.com
2. Previous Chapter: Conclusions
- Reducing synonyms can help IR
- Better matching
- Ontologies are used: WordNet
- Morphology is a variant of synonymy
- Widely used in IR systems
- Precise analysis: dictionary-based analyzers
- Quick-and-dirty analysis: stemmers
- Rule-based stemmers: Porter stemmer
- Statistical stemmers
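For illustration, a minimal stemming sketch (not from the slides), assuming Python with the nltk package installed:

# Rule-based stemming with NLTK's Porter stemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["flies", "flying", "banks", "banking", "retrieval"]:
    print(word, "->", stemmer.stem(word))
# Morphological variants usually reduce to a common stem,
# so they can match the same index term in IR.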
3. Previous Chapter: Research Topics
- Construction and application of ontologies
- Building morphological dictionaries
- Treatment of unknown words with morphological analyzers
- Development of better stemmers
- Statistical stemmers?
4. Contents
- Tagging: for each word, determine its POS (part of speech: noun, ...) and grammatical characteristics
- WSD (Word Sense Disambiguation): for each word, determine which homonym is used
- Anaphora resolution: for a pronoun (it, ...), determine what it refers to
5. Tagging: The Problem
- Ambiguity of parts of speech
- rice flies like sand
- insects living in rice consider sand good?
- rice can fly similarly to sand?
- ... insect of a container with rice...?
- We can fly like sand... We think fly like sand...
- Ambiguity of grammatical characteristics
- He has read the book
- He will read the book... He read the book
- A very frequent phenomenon, at nearly every word!
6. Tagger...
- A program that looks at the context and decides what the part of speech (and other characteristics) is
- Input
- He will read the book
- Morphological analysis
- He<...> will<Ns Va> read<Vpa Vpp Vinf> the<...>
- Tags: Ns = noun singular, Va = verb auxiliary, Vpa = verb past, Vpp = verb past participle, Vinf = verb infinitive, ...
7. ...Tagger
- Input of tagger
- He<...> will<Ns Va> read<Vpa Vpp Vinf> the<...>
- Task: choose one!
- Output
- He<...> will<Va> read<Vinf> the<...>
- How do we do it?
- He will<N>: not possible, so → Va
- will<Va> read → Vinf
- This is simple, but imagine "He" is ambiguous too... Explosion!
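As a quick illustration of the same task, a sketch using NLTK's default POS tagger; it uses Penn Treebank tags rather than the Ns/Va/Vinf tags above, and assumes the NLTK tokenizer and tagger model data have been downloaded:

# POS-tagging the slide's example sentence with NLTK
# (requires nltk plus its 'punkt' and perceptron tagger data)
import nltk

tokens = nltk.word_tokenize("He will read the book")
print(nltk.pos_tag(tokens))
# Output is something like [('He', 'PRP'), ('will', 'MD'), ('read', 'VB'), ...]:
# the ambiguity of "read" is resolved from the context, as described above.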
8. Applications
- Used for word sense disambiguation
- Oil well in Mexico is used.
- Oil is used well in Mexico.
- For stemming and lemmatization
- Important for matching in information retrieval
- Greatly speeds up syntactic analysis
- Tagging is local
- No need to process the whole sentence to find
that a certain tag is incorrect
9. How about Parsing?
- We can find all the syntactic structures
- Only the correct variants will enter the syntactic structure
- will + Vinf forms a syntactic unit
- will + Vpa does not
- Problems
- Computationally expensive
- What to do with ambiguities?
- fly rice like sand
- Depends on what you need
10. Statistical Tagger
- Example: the TnT tagger
- Based on a Hidden Markov Model (HMM)
- Idea
- Some words are more probable after some other words
- Find these probabilities
- Guess the word if you know the nearby ones
- Problem
- Letter strings denote meanings
- "x is more probable after y": x and y are meanings, not strings
- So we must guess what we cannot see: meanings
11. Hidden Markov Model: Idea
- A system changes its state
- What a person thinks
- Random... but not completely (how?)
- In each state, it emits an output
- What he says when he thinks something
- Random... but somehow (?) depends on what he thinks
- We know the sequence of produced outputs
- Text: we can see it!
- Guess what were the underlying states
- Hidden: we cannot see them
12. Hidden Markov Model: Hypotheses
- A finite set of states q1 ... qN (invisible)
- POS and grammatical characteristics (language)
- A finite set of observations v1 ... vM
- Strings we see in the corpus (language)
- A random sequence of states xi
- POS in the text
- Probabilities of state transitions P(xi+1 | xi)
- Language rules and use
- Probabilities of observations P(vk | xi)
- Words expressing the meanings: Vinf → ask, V3 → asks
13. Hidden Markov Model: Problem
- The same observation can correspond to different meanings
- Vinf read, Vpp read
- Looking at what we can see, guess what we cannot
- This is why it is called hidden
- Given a sequence of observations oi
- The text: a sequence of letter strings (the training set)
- Guess the sequence of states xi
- The POS of each word
- Our hypotheses on xi depend on each other
- Highly combinatorial task
14. Hidden Markov Model: Solutions
- Need to find the parameters of the model
- P(xi+1 | xi)
- P(vk | xi)
- In the optimal way: to maximize the probability of generating this specific output
- Optimization methods from Operations Research are used
- More details? Not so simple...
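A minimal sketch of the standard decoding step (Viterbi) for such a model; the tag set and all probabilities below are toy numbers invented for illustration, not parameters estimated from a corpus:

# Viterbi decoding: given P(tag_next | tag) and P(word | tag),
# find the most probable tag sequence for the observed words.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in tag t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][p] * trans_p[p].get(t, 1e-6) * emit_p[t].get(words[i], 1e-6), p)
                for p in tags)
            best[i][t], back[i][t] = prob, prev
    last = max(best[-1], key=best[-1].get)      # best final tag
    path = [last]
    for i in range(len(words) - 1, 0, -1):      # follow the back-pointers
        path.insert(0, back[i][path[0]])
    return path

tags = ["Va", "Vinf", "N"]                      # toy tag set
start_p = {"Va": 0.4, "Vinf": 0.3, "N": 0.3}    # toy probabilities
trans_p = {"Va": {"Vinf": 0.8, "N": 0.1, "Va": 0.1},
           "Vinf": {"N": 0.5, "Va": 0.3, "Vinf": 0.2},
           "N": {"Va": 0.4, "Vinf": 0.3, "N": 0.3}}
emit_p = {"Va": {"will": 0.7}, "Vinf": {"read": 0.5}, "N": {"will": 0.2, "book": 0.6}}
print(viterbi(["will", "read"], tags, start_p, trans_p, emit_p))  # -> ['Va', 'Vinf']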
15. Brill Tagger (Rule-Based)
- Eric Brill
- Makes an initial assumption about POS tags in the text
- Uses context-dependent rewriting rules to correct some tags
- Applies them iteratively
- Learns the rules from a training corpus
- The rules are in human-understandable form
- You can correct them manually to improve the tagger
- Unlike HMMs, which are not understandable
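An illustrative sketch of the idea, assuming a made-up initial lexicon and two hand-written rules (a real Brill tagger learns its rules from a training corpus):

# Brill-style tagging: start from an initial guess (most frequent tag per word)
# and apply human-readable, context-dependent rewriting rules.
initial_tags = {"he": "Pron", "will": "N", "read": "Vpa", "the": "Det", "book": "N"}

# each rule: (from_tag, to_tag, condition on the previous tag)
rules = [
    ("N",   "Va",   lambda prev: prev == "Pron"),   # "he will": will is an auxiliary
    ("Vpa", "Vinf", lambda prev: prev == "Va"),     # "will read": read is an infinitive
]

def brill_tag(words):
    tags = [initial_tags.get(w.lower(), "N") for w in words]
    for frm, to, cond in rules:                     # apply the rules iteratively
        for i in range(1, len(tags)):
            if tags[i] == frm and cond(tags[i - 1]):
                tags[i] = to
    return list(zip(words, tags))

print(brill_tag("He will read the book".split()))
# -> [('He', 'Pron'), ('will', 'Va'), ('read', 'Vinf'), ('the', 'Det'), ('book', 'N')]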
16. Word Sense Disambiguation
- Query: international bank in Seoul
- Bank ? ?
- financial institution Korean
- river shore superior official
- place to store something ??? ...
- ... ... ...
- A hotel located on the beautiful bank of the Han river.
- Relevant for the query?
- The POS is the same: a tagger will not distinguish them
17. Applications
- Translation
- ??? Great Governor of the Court
- ?? 10 thousand won
- international bank → banco internacional
- river bank → orilla del río
- Information retrieval
- Document retrieval: is it really useful? Same info
- Passage retrieval: can prove very useful!
- Semantic analysis
18. Representation of Word Senses
- Explanations: semantic dictionaries
- Bank1 is an institution to keep money
- Bank2 is a sloping edge of a river
- Synsets and ontology: WordNet (HowNet for Chinese)
- Synonyms: bank, shore
- WordNet terminology: synset 12345
- Corresponds to all ways to call a concept
- Relationships: 12345 IS_PART_OF 67890 (river, stream); 987 IS_A 654 (institution, organization)
- WordNet also has glosses
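A small sketch of inspecting these structures with NLTK's WordNet interface (assumes the wordnet corpus has been downloaded; the synset numbers differ from the slide's 12345-style examples):

# Synsets, glosses, and IS_A relations for "bank" in WordNet via NLTK
from nltk.corpus import wordnet as wn

for syn in wn.synsets("bank", pos=wn.NOUN)[:3]:
    print(syn.name(), "-", syn.definition())                   # gloss
    print("  synonyms:", [l.name() for l in syn.lemmas()])     # the synset
    print("  IS_A:", [h.name() for h in syn.hypernyms()])      # hypernyms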
19. Task
- Given a text (probably POS-tagged)
- Tag each word with its synset number (123) or dictionary sense number (bank1)
- Input
- Mary keeps the money in a bank.
- Han river's bank is beautiful.
- Output
- Mary keeps<1> the money<1> in a bank<1>
- Han river's bank<2> is beautiful.
20. Lesk Algorithm
- Michael Lesk
- Explanatory dictionary
- Bank1 is an institution to keep money
- Bank2 is a sloping edge of a river
- Mary keeps her money (savings) in a bank.
- Choose the sense which has more words in common with the immediate context
- Improvements (Pedersen; Gelbukh & Sidorov)
- Use synonyms when no direct matches
- Use synonyms of synonyms, ...
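A very simplified sketch of the Lesk idea, using a two-sense toy dictionary that paraphrases the glosses above (a real implementation would use full dictionary entries and better preprocessing):

# Simplified Lesk: choose the sense whose explanation shares the most
# (non-stop) words with the immediate context of the ambiguous word.
senses = {
    "bank1": "an institution to keep money",
    "bank2": "a sloping edge of a river",
}
STOP = {"a", "an", "the", "of", "to", "in", "is", "on", "at", "her"}

def lesk(context_sentence):
    context = set(context_sentence.lower().split()) - STOP
    overlap = {s: len(context & (set(gloss.lower().split()) - STOP))
               for s, gloss in senses.items()}
    return max(overlap, key=overlap.get)

print(lesk("Mary keeps her money savings in a bank"))        # -> bank1
print(lesk("hotel at the beautiful bank of the Han river"))  # -> bank2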
21. Other Word Relatedness Measures
- Lexical chains in WordNet
- The length of the path in the graph of relationships
- Mutual information: frequent co-occurrences
- Collocations (Bolshakov & Gelbukh)
- Keep in bank1
- Bank2 of river
- Very large dictionary of such combinations
- Number of words in common between explanations
- Recursively: common words or related words (Gelbukh & Sidorov)
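A sketch of the path-length idea with NLTK's WordNet interface (assumes the wordnet corpus is available; the sense indices below are from WordNet 3.0 and may vary):

# Path-based relatedness in the WordNet graph: closer synsets score higher
from nltk.corpus import wordnet as wn

river_bank = wn.synset("bank.n.01")   # sloping land beside a body of water
money_bank = wn.synset("bank.n.02")   # financial institution
river      = wn.synset("river.n.01")

print(river_bank.path_similarity(river))   # typically higher: short path
print(money_bank.path_similarity(river))   # typically lower: long path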
22. Other Methods
- Hidden Markov Models
- Logical reasoning
23. Yarowsky's Principles
- David Yarowsky
- One sense per text!
- One sense per collocation
- I keep my money in the bank1. This is an international bank1 with a great capital. The bank2 is located near the Han river.
- 3 words vote for institution, one for shore
- Institution!
- The bank1 is located near the Han river.
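A toy sketch of the "one sense per text" voting step, using the preliminary per-occurrence labels from the example above:

# One sense per text: the majority preliminary sense wins for the whole discourse
from collections import Counter

preliminary = ["bank1", "bank1", "bank2"]          # per-occurrence guesses
majority = Counter(preliminary).most_common(1)[0][0]
final = [majority] * len(preliminary)
print(final)                                       # -> ['bank1', 'bank1', 'bank1']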
24. Anaphora Resolution
- Mainly pronouns.
- Also co-reference: when two words refer to the same thing
- John took the cake from the table and ate it.
- John took the cake from the table and washed it.
- Translation into Spanish: la ("she") = the table / lo ("he") = the cake
- Dictionaries
- Different sources of evidence
- Logical reasoning
25. Applications
- Translation
- Information retrieval
- Can improve frequency counts (?)
- Passage retrieval can be very important
26. Mitkov's Knowledge-Poor Method
- Ruslan Mitkov
- Rule-based and statistics-based approach
- Uses simple information on POS and general word classes
- Combines different sources of evidence
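An illustrative sketch of combining several cheap sources of evidence to score antecedent candidates; the indicators and weights below are invented for illustration and are not Mitkov's actual factors:

# Knowledge-poor scoring of antecedent candidates for a pronoun:
# each candidate noun phrase gets points from simple surface indicators.
def resolve_pronoun(candidates, pronoun_position):
    def score(c):
        s = 0
        s += 2 if c["is_subject"] else 0              # subjects are preferred
        s += 1 if c["definite"] else 0                # definite NPs are preferred
        s -= (pronoun_position - c["position"])       # recency: closer is better
        return s
    return max(candidates, key=score)

candidates = [
    {"text": "John",  "position": 1, "is_subject": True,  "definite": True},
    {"text": "cake",  "position": 3, "is_subject": False, "definite": False},
    {"text": "table", "position": 6, "is_subject": False, "definite": True},
]
print(resolve_pronoun(candidates, pronoun_position=9)["text"])   # -> 'table'
# Note: such surface indicators alone cannot distinguish "ate it" (cake)
# from "washed it" (table); that needs semantic knowledge, as in slide 24.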
27. Hidden Anaphora
- John bought a house. The kitchen is big.
- that house's kitchen
- John was eating. The food was delicious.
- that eating's food
- John was buried. The widow was mad with grief.
- that burying's (death's) widow
- Intersection of scenarios of the concepts (Gelbukh & Sidorov)
- house has a kitchen
- burying results from death; widow results from death
28. Evaluation
- Senseval and TREC: international competitions
- Korean track available
- Human-annotated corpus
- Very expensive
- Inter-annotator agreement is often low!
- A program cannot do what humans cannot do
- Apply the program and compare with the corpus
- Accuracy
- Sometimes the program cannot tag a word
- Precision, recall
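A small sketch of these measures, comparing a program's output against a human-annotated gold corpus; the data is made up, and a tag of None marks a word the program could not tag:

# Accuracy, precision, and recall against a gold annotation
def evaluate(gold, predicted):
    answered = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in answered if g == p)
    precision = correct / len(answered)                      # correct / answered
    recall = correct / len(gold)                             # correct / all words
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    return precision, recall, accuracy

gold      = ["bank1", "bank2", "bank1", "bank1"]
predicted = ["bank1", "bank1", None,    "bank1"]
print(evaluate(gold, predicted))   # -> (0.666..., 0.5, 0.5)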
29. Research Topics
- Too many to list
- New methods
- Lexical resources (dictionaries)
- Computational linguistics
30. Conclusions
- Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning
- Useful in translation, information retrieval, and text understanding
- Dictionary-based methods
- good but expensive
- Statistical methods
- cheap and sometimes imperfect... but not always (if very large corpora are available)
31. Thank you! Till May 31 / June 1, 6 pm