1
Special Topics in Computer Science: Advanced
Topics in Information Retrieval. Lecture 9:
Natural Language Processing and IR. Tagging,
WSD, and Anaphora Resolution
  • Alexander Gelbukh
  • www.Gelbukh.com

2
Previous Chapter Conclusions
  • Reducing synonyms can help IR
  • Better matching
  • Ontologies are used: WordNet
  • Morphology is a variant of synonymy
  • widely used in IR systems
  • Precise analysis: dictionary-based analyzers
  • Quick-and-dirty analysis: stemmers
  • Rule-based stemmers: Porter stemmer
  • Statistical stemmers

3
Previous Chapter Research topics
  • Construction and application of ontologies
  • Building of morphological dictionaries
  • Treatment of unknown words with morphological
    analyzers
  • Development of better stemmers
  • Statistical stemmers?

4
Contents
  • Tagging: for each word, determine its POS (part
    of speech: noun, ...) and grammatical
    characteristics
  • WSD (Word Sense Disambiguation): for each word,
    determine which homonym is used
  • Anaphora resolution: for a pronoun (it, ...),
    determine what it refers to

5
Tagging The problem
  • Ambiguity of parts of speech
  • rice flies like sand
  • insects living in rice consider sand good?
  • rice can fly similarly to sand?
  • ... insect of a container with rice...?
  • We can fly like sand ... We think fly like
    sand...
  • Ambiguity of grammatical characteristics
  • He has read the book
  • He will read the book... He read the book
  • Very frequent phenomenon, nearly at each word!

6
Tagger...
  • A program that looks at the context and decides
    what the part of speech (and other
    characteristics) is
  • Input
  • He will read the book
  • Morphological analysis
    He<...> will<Ns Va> read<Vpa Vpp Vinf>
    the<...>
  • these tags are the input to the tagger
  • Ns = noun singular,
  • Va = verb auxiliary,
  • Vpa = verb past,
  • Vpp = verb past participle, Vinf = verb
    infinitive, ...

7
...Tagger
  • Input of tagger
  • He<...> will<Ns Va> read<Vpa Vpp Vinf>
    the<...>
  • Task: choose one!
  • Output
  • He<...> will<Va> read<Vinf> the<...>
  • How do we do it?
  • "He will<N>": not possible → Va
  • "will<Va> read" → Vinf
  • This is simple, but imagine "He" is ambiguous...
    Explosion!
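The elimination on this slide can be sketched as a toy constraint filter. This is not a real tagger: the candidate tag sets and the two rules below are invented for illustration.

```python
# Toy tag elimination: each word comes with candidate tags from
# morphological analysis; context rules strike out impossible ones.
# The tag sets and rules are illustrative only.

candidates = {
    "He":   {"Pron"},
    "will": {"Ns", "Va"},        # noun singular or auxiliary verb
    "read": {"Vpa", "Vpp", "Vinf"},
    "the":  {"Det"},
}

def filter_tags(words, candidates):
    tags = [set(candidates[w]) for w in words]
    for i in range(1, len(words)):
        # Rule 1: "He will<N>" is not possible -> drop the noun reading
        if "Pron" in tags[i - 1]:
            tags[i].discard("Ns")
        # Rule 2: after an unambiguous auxiliary, keep only the infinitive
        if tags[i - 1] == {"Va"} and "Vinf" in tags[i]:
            tags[i] = {"Vinf"}
    return tags

print(filter_tags(["He", "will", "read", "the"], candidates))
# [{'Pron'}, {'Va'}, {'Vinf'}, {'Det'}]
```

With genuinely ambiguous context words, such rules alone no longer suffice, which is the "explosion" the slide mentions.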

8
Applications
  • Used for word sense disambiguation
  • Oil well in Mexico is used.
  • Oil is used well in Mexico.
  • For stemming and lemmatization
  • Important for matching in information retrieval
  • Greatly speeds up syntactic analysis
  • Tagging is local
  • No need to process the whole sentence to find
    that a certain tag is incorrect

9
Why Not Parsing?
  • We can find all the syntactic structures
  • Only the correct variants will enter the
    syntactic structure
  • "will + Vinf" forms a syntactic unit
  • "will + Vpa" does not
  • Problems
  • Computationally expensive
  • What to do with ambiguities?
  • rice flies like sand
  • Depends on what you need

10
Statistical tagger
  • Example TnT tagger
  • Based on Hidden Markov Model (HMM)
  • Idea
  • Some words are more probable after some other
    words
  • Find these probabilities
  • Guess the word if you know the nearby ones
  • Problem
  • Letter strings denote meanings
  • "x is more probable after y" holds for meanings,
    not strings
  • so guess what you cannot see: the meanings

11
Hidden Markov Model Idea
  • A system changes its state
  • What a person thinks
  • Random... but not completely (how?)
  • In each state, it emits an output
  • What he says when he thinks something
  • Random... but somehow (?) depends on what he
    thinks
  • We know the sequence of produced outputs
  • Text: we can see it!
  • Guess what were the underlying states
  • Hidden: we cannot see them

12
Hidden Markov Model Hypotheses
  • A finite set of states q1 ... qN (invisible)
  • POS and grammatical characteristics (language)
  • A finite set of observations v1 ... vM
  • Strings we see in the corpus (language)
  • A random sequence of states xi
  • POS in the text
  • Probabilities of state transitions P(xi+1 | xi)
  • Language rules and use
  • Probabilities of observations P(vk | xi)
  • Words expressing the meanings: Vinf → ask,
    V3 → asks

13
Hidden Markov Model Problem
  • Same observation corresponds to different meanings
  • Vinf → read, Vpp → read
  • Looking at what we can see, guess what we cannot
  • This is why "hidden"
  • Given a sequence of observations oi
  • The text: a sequence of letter strings. Training
    set
  • Guess the sequence of states xi
  • The POS of each word
  • Our hypotheses on xi depend on each other
  • Highly combinatorial task

14
Hidden Markov Model Solutions
  • Need to find the parameters of the model
  • P(xi+1 | xi)
  • P(vk | xi)
  • Optimal way: maximize the probability of
    generating this specific output
  • Optimization methods from Operations Research are
    used
  • More details? Not so simple...
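The decoding problem above is standardly solved with the Viterbi dynamic-programming algorithm. A minimal sketch follows; the tag set, words, and all probabilities are made up for illustration, not taken from any trained model.

```python
# Viterbi decoding for a toy HMM tagger: find the most probable hidden tag
# sequence given transition probabilities P(next tag | tag) and emission
# probabilities P(word | tag). All numbers here are invented.

def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t] = probability of the best tag path ending in t at word i
    best = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][p] * trans_p[p][t] * emit_p[t].get(words[i], 0.0), p)
                for p in tags)
            best[i][t], back[i][t] = prob, prev
    # Trace the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["Va", "Vinf", "N"]
start_p = {"Va": 0.4, "Vinf": 0.2, "N": 0.4}
trans_p = {  # an auxiliary is very likely followed by an infinitive
    "Va":   {"Va": 0.05, "Vinf": 0.90, "N": 0.05},
    "Vinf": {"Va": 0.10, "Vinf": 0.10, "N": 0.80},
    "N":    {"Va": 0.40, "Vinf": 0.10, "N": 0.50},
}
emit_p = {  # "will" is ambiguous between auxiliary and noun
    "Va":   {"will": 0.9},
    "Vinf": {"read": 0.5},
    "N":    {"will": 0.3, "book": 0.7},
}
print(viterbi(["will", "read"], tags, start_p, trans_p, emit_p))
# ['Va', 'Vinf']
```

Estimating the probabilities from a training corpus, and smoothing them for unseen words, is what taggers like TnT add on top of this decoder.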

15
Brill Tagger (rule-based)
  • Eric Brill
  • Makes an initial assumption about POS tags in the
    text
  • Uses context-dependent rewriting rules to correct
    some tags
  • Applies them iteratively
  • Learns the rules from a training corpus
  • The rules are in human-understandable form
  • You can correct them manually to improve the
    tagger
  • Unlike HMMs, which are not understandable

16
Word Sense Disambiguation
  • Query: international bank in Seoul
  • bank → financial institution, river shore,
    superior official, place to store something, ...
  • each sense has a different Korean equivalent
  • Hotel located at the beautiful bank of Han river.
  • Relevant for the query?
  • The POS is the same, so a tagger will not
    distinguish the senses

17
Applications
  • Translation
  • ??? → Great Governor of the Court
  • ?? → 10 thousand won
  • international bank → banco internacional
  • river bank → orilla del río
  • Information retrieval
  • Document retrieval: is it really useful? Same info
  • Passage retrieval: can prove very useful!
  • Semantic analysis

18
Representation of word senses
  • Explanations: semantic dictionaries
  • Bank1 is an institution to keep money
  • Bank2 is a sloping edge of a river
  • Synsets and ontology: WordNet (HowNet: Chinese)
  • Synonyms: {bank, shore}
  • WordNet terminology: synset #12345
  • Corresponds to all ways to call a concept
  • Relationships: #12345 IS_PART_OF #67890 {river,
    stream}; #987 IS_A #654 {institution,
    organization}
  • WordNet also has glosses

19
Task
  • Given a text (probably POS-tagged)
  • Tag each word with its synset number (#123) or
    dictionary sense number (bank1)
  • Input:
  • Mary keeps the money in a bank.
  • The Han river's bank is beautiful.
  • Output:
  • Mary keeps<1> the money<1> in a bank<1>.
  • The Han river's bank<2> is beautiful.

20
Lesk algorithm
  • Michael Lesk
  • Explanatory dictionary
  • Bank1 is an institution to keep money
  • Bank2 is a sloping edge of a river
  • Mary keeps her money (savings) in a bank.
  • Choose the sense which has more words in common
    with the immediate context
  • Improvements (Pedersen; Gelbukh & Sidorov):
  • Use synonyms when there are no direct matches
  • Use synonyms of synonyms, ...
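A minimal sketch of the Lesk overlap count, assuming made-up glosses and a tiny stopword list (not a real dictionary):

```python
# Lesk algorithm sketch: pick the sense whose dictionary gloss shares the
# most content words with the immediate context. Glosses and the stopword
# list are illustrative only.

STOP = {"is", "a", "an", "the", "to", "of", "in", "her", "its"}

def tokens(text):
    """Lowercased content words of a text, as a set."""
    return set(text.lower().replace(".", "").split()) - STOP

def lesk(context, senses):
    """senses: dict mapping sense label -> gloss. Return the best label."""
    ctx = tokens(context)
    return max(senses, key=lambda s: len(tokens(senses[s]) & ctx))

senses = {
    "bank1": "an institution to keep money",
    "bank2": "a sloping edge of a river",
}
print(lesk("Mary keeps her money savings in a bank", senses))  # bank1
print(lesk("the bank of the Han river is beautiful", senses))  # bank2
```

The improvements listed above amount to extending `tokens` so that synonyms (and synonyms of synonyms) of gloss words also count as matches.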

21
Other word relatedness measures
  • Lexical chains in WordNet
  • The length of the path in the graph of
    relationships
  • Mutual information: frequent co-occurrences
  • Collocations (Bolshakov & Gelbukh)
  • keep in bank1
  • bank2 of a river
  • Very large dictionary of such combinations
  • Number of words in common between explanations
  • Recursive: common words or related words (Gelbukh
    & Sidorov)
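The path-length measure can be sketched as breadth-first search over a graph of synset relationships; the synset ids and edges below are invented for illustration.

```python
# Path-length relatedness: BFS distance between synsets in a toy graph of
# relationships (shorter path = more related). The graph is made up.

from collections import deque

graph = {  # undirected toy graph (IS_A, IS_PART_OF, ... edges merged)
    "bank2":        {"river"},
    "river":        {"bank2", "stream"},
    "stream":       {"river"},
    "bank1":        {"institution"},
    "institution":  {"bank1", "organization"},
    "organization": {"institution"},
}

def path_length(a, b):
    """Length of the shortest path between synsets a and b, or None."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # not connected

print(path_length("bank2", "stream"))        # 2
print(path_length("bank1", "organization"))  # 2
print(path_length("bank1", "river"))         # None: unrelated senses
```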

22
Other methods
  • Hidden Markov Models
  • Logical reasoning

23
Yarowsky's Principles
  • David Yarowsky
  • One sense per text!
  • One sense per collocation
  • I keep my money in the bank1. This is an
    international bank1 with a great capital. The
    bank2 is located near the Han river.
  • 3 occurrences vote for "institution", one for
    "shore"
  • Institution!
  • → The bank1 is located near the Han river.
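The voting step can be sketched as a majority override; the per-occurrence guesses are assumed to come from an earlier step (e.g. collocation rules).

```python
# "One sense per text" sketch: individually disambiguated occurrences of a
# word vote, and the majority sense overrides the minority ones.

from collections import Counter

def one_sense_per_text(occurrences):
    """occurrences: sense labels guessed per occurrence of one word."""
    majority, _ = Counter(occurrences).most_common(1)[0]
    return [majority] * len(occurrences)

# Three occurrences voted "institution" (bank1), one voted "shore" (bank2):
guesses = ["bank1", "bank1", "bank2", "bank1"]
print(one_sense_per_text(guesses))  # ['bank1', 'bank1', 'bank1', 'bank1']
```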

24
Anaphora resolution
  • Mainly pronouns
  • Also co-reference: when two words refer to the
    same thing
  • John took the cake from the table and ate it.
  • John took the cake from the table and washed it.
  • Translation into Spanish: la (feminine) → the
    table / lo (masculine) → the cake
  • Methods
  • Dictionaries
  • Different sources of evidence
  • Logical reasoning

25
Applications
  • Translation
  • Information retrieval
  • Can improve frequency counts (?)
  • Passage retrieval can be very important

26
Mitkovs knowledge poor method
  • Ruslan Mitkov
  • Combined rule-based and statistical approach
  • Uses simple information on POS and general word
    classes
  • Combines different sources of evidence

27
Hidden Anaphora
  • John bought a house. The kitchen is big.
  • = that house's kitchen
  • John was eating. The food was delicious.
  • = that eating's food
  • John was buried. The widow was mad with grief.
  • = that burying's → that death's widow
  • Intersection of scenarios of the concepts
    (Gelbukh & Sidorov)
  • a house has a kitchen
  • burying results from death; a widow results from
    death
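The scenario-intersection idea can be sketched with a toy table of concept scenarios; the table itself is invented for illustration.

```python
# Toy "hidden anaphora" resolver: link a definite NP to an earlier concept
# whose scenario (set of associated concepts) contains it.

scenarios = {  # made-up scenario table
    "house":   {"kitchen", "roof", "door"},
    "eating":  {"food", "table"},
    "burying": {"death", "widow", "grave"},  # widow via death, simplified
}

def antecedent(np, mentioned):
    """Return the most recent earlier concept whose scenario contains np."""
    for concept in reversed(mentioned):
        if np in scenarios.get(concept, set()):
            return concept
    return None

print(antecedent("kitchen", ["house"]))    # house
print(antecedent("widow", ["burying"]))    # burying
```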

28
Evaluation
  • Senseval and TREC: international competitions
  • Korean track available
  • Human-annotated corpus
  • Very expensive
  • Inter-annotator agreement is often low!
  • A program cannot do what humans cannot do
  • Apply the program and compare with the corpus
  • Accuracy
  • Sometimes the program cannot tag a word
  • Precision, recall
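The accuracy vs. precision/recall distinction for a program that may leave some words untagged can be sketched as:

```python
# Precision = correct / tagged; recall = correct / all words.
# When the program never abstains, both collapse into plain accuracy.

def precision_recall(gold, predicted):
    """predicted holds one tag per word, or None when the program abstains."""
    tagged = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in tagged if g == p)
    precision = correct / len(tagged) if tagged else 0.0
    recall = correct / len(gold)
    return precision, recall

gold = ["N", "Va", "Vinf", "N", "N"]
pred = ["N", "Va", None, "N", "Vpa"]   # abstains once, errs once
print(precision_recall(gold, pred))    # (0.75, 0.6)
```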

29
Research topics
  • Too many to list
  • New methods
  • Lexical resources (dictionaries)
  • Computational linguistics

30
Conclusions
  • Tagging, word sense disambiguation, and anaphora
    resolution are cases of disambiguation of meaning
  • Useful in translation, information retrieval, and
    text understanding
  • Dictionary-based methods:
  • good but expensive
  • Statistical methods:
  • cheap and sometimes imperfect... but not always
    (if very large corpora are available)

31
Thank you! Till May 31? June 1? 6 pm