1
A Hidden Markov Model Information Retrieval System
  • Mahboob Alam Khalid

2
Overview
  • Motivation
  • Hidden Markov Model (Introduction)
  • HMM for Information Retrieval System
  • Probability Model
  • Baseline System Experiments
  • HMM Refinements
  • Blind Feedback
  • Bigrams
  • Document Priors
  • Conclusion

3
Motivation
  • Hidden Markov models have been applied
    successfully to:
  • Speech Recognition
  • Named Entity Finding
  • Optical Character Recognition
  • Topic Identification
  • Ad hoc Information Retrieval (now)

4
Hidden Markov Model (Introduction)
  • You have seen a sequence of observations (words)
  • You don't know the sequence of generating states
  • An HMM is a model for exactly this situation
  • Two kinds of probabilities are involved in an HMM
  • Transition probabilities (jumps from one state to
    another), which sum to 1 for each state
  • Output probabilities (observations emitted by a
    state), which also sum to 1 for each state

5
A discrete HMM
  • Set of output symbols
  • Set of states
  • Set of transitions between states
  • Probability distribution on output symbols for
    each state
  • Observed sampling process:
  • Start from some initial state
  • Transition from it to another state
  • Sample from the output distribution at that state
  • Repeat these steps (a sketch of this process
    follows below)
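
A minimal sketch of this sampling loop in Python; the two states, the symbols, and every probability below are invented purely for illustration:

  import random

  # Illustrative discrete HMM; all numbers are made up.
  start = {"A": 0.6, "B": 0.4}              # initial-state distribution
  trans = {"A": {"A": 0.7, "B": 0.3},       # transition probabilities,
           "B": {"A": 0.4, "B": 0.6}}       # each row sums to 1
  emit  = {"A": {"x": 0.9, "y": 0.1},       # output distributions,
           "B": {"x": 0.2, "y": 0.8}}       # each row sums to 1

  def draw(dist):
      """Sample one key from a {outcome: probability} dict."""
      return random.choices(list(dist), weights=list(dist.values()))[0]

  def sample(n):
      """Start in an initial state, then repeatedly emit and transition."""
      state, out = draw(start), []
      for _ in range(n):
          out.append(draw(emit[state]))   # sample from the output distribution
          state = draw(trans[state])      # jump to the next state
      return out

  print(sample(10))   # the symbols are observed; the state path stays hidden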
6
HMM for Information Retrieval System
  • Observed data: the query Q
  • Unknown key: the relevant document D
  • Noisy channel: the user's mind, which transforms
    an imagined notion of relevance into the text of Q
  • We seek P(D is R | Q):
  • the probability that D is relevant in the user's mind
  • given that Q was the query produced
7
Probability Model
  • By Bayes' rule:
    P(D is R | Q) = P(Q | D is R) · P(D is R) / P(Q)
  • P(D is R) is the prior probability
  • P(Q) is identical for all documents, so it can be
    dropped for ranking
  • Output symbols
  • Union of all words in the corpus
  • States
  • Mechanism of query word generation
  • Document
  • General English
8
A simple two-state HMM
[Diagram: a two-state HMM. From "query start", with probability a0 the model enters the General English state, emitting q with P(q | GE); with probability a1 it enters the Document state, emitting q with P(q | D). Both paths return to "query end".]
  • The choice of which kind of word to generate next
    is independent of the previous such choice.

9
Why simplify the parameters?
  • One HMM per document
  • EM could compute these parameters
  • but EM needs training samples
  • documents paired with training queries (not available)
  • Instead, simple relative-frequency estimates are
    used (see the sketch below):
    P(q | D_k) = (number of times q appears in D_k) / (length of D_k)
    P(q | GE) = (Σ_k number of times q appears in D_k) / (Σ_k length of D_k)
  • P(Q | D_k is R) ≈ ∏_{q ∈ Q} ( a0 · P(q | GE) + a1 · P(q | D_k) )
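
A minimal sketch of these estimates and the resulting query-likelihood score in Python; the toy corpus, the query, and the value of a1 are invented here (the paper sets a0 and a1 empirically):

  import math

  # Toy corpus: each document is a list of tokens (invented for illustration).
  docs = {"d1": "the white house issued a statement".split(),
          "d2": "the house on the hill".split()}

  a1 = 0.3            # weight of the Document state; an assumed value
  a0 = 1.0 - a1       # weight of the General English state

  total_len = sum(len(d) for d in docs.values())

  def p_ge(q):
      """P(q | GE): corpus-wide relative frequency of q."""
      return sum(d.count(q) for d in docs.values()) / total_len

  def p_doc(q, d):
      """P(q | D_k): relative frequency of q within document d."""
      return docs[d].count(q) / len(docs[d])

  def log_score(query, d):
      """log P(Q | D_k is R) = sum over q of log(a0*P(q|GE) + a1*P(q|D_k)).
      The GE term smooths words absent from d; a word absent from the
      whole corpus would still get probability zero."""
      return sum(math.log(a0 * p_ge(q) + a1 * p_doc(q, d)) for q in query)

  query = "white house".split()
  print(sorted(docs, key=lambda d: log_score(query, d), reverse=True))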
10
Baseline System Performance
  • Number of queries: 50
  • An inverted index is created
  • tf values (term frequency)
  • Case is ignored
  • Porter stemmer
  • 397 stop words replaced with the special token STOP
  • Similarly, 4-digit strings replaced by YEAR, other
    digit strings by NUMBER (a preprocessing sketch
    follows this list)
  • TREC-6 and TREC-7 test collections
  • TREC-6
  • 556,077 documents, an average of 26.5 unique terms
  • News and government agencies
  • TREC-7
  • 528,155 documents, an average of 17.6 unique terms
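
A rough sketch of this preprocessing in Python; the stop list is abbreviated to a few words here, and a stub stands in for the Porter stemmer the slide mentions:

  import re

  STOP_WORDS = {"the", "of", "a", "and"}   # stand-in for the full 397-word list

  def stem(token):
      # Placeholder: a real system would call an actual Porter stemmer here.
      return token

  def normalize(token):
      token = token.lower()                 # ignore case
      if re.fullmatch(r"\d{4}", token):     # 4-digit strings -> YEAR
          return "YEAR"
      if re.fullmatch(r"\d+", token):       # other digit strings -> NUMBER
          return "NUMBER"
      if token in STOP_WORDS:               # stop words -> STOP
          return "STOP"
      return stem(token)

  print([normalize(t) for t in "The 1998 floods claimed 23 lives".split()])
  # -> ['STOP', 'YEAR', 'floods', 'claimed', 'NUMBER', 'lives']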

11
TF.IDF model
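
The transcript does not preserve the formula from this slide; for reference, a standard tf.idf weight of the kind used in such baselines looks as follows (the paper's exact variant may differ):

  import math

  def tfidf(tf, df, n_docs):
      """Term frequency scaled by inverse document frequency."""
      return tf * math.log(n_docs / df)

  # e.g. a term occurring 3 times in a document and appearing in
  # 100 of the 556,077 TREC-6 documents:
  print(tfidf(tf=3, df=100, n_docs=556_077))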
12
Non-interpolated average precision
13
HMM Refinements
  • Blind Feedback
  • a well-known technique for enhancing performance
  • Bigrams
  • some words take on a distinctive meaning in the
    context of another word, e.g. "white house",
    "Pope John Paul II"
  • Query Section Weighting
  • some sections of the query are more important than
    others
  • Document Priors
  • longer documents tend to be more informative than
    short ones

14
Blind Feedback
  • Construct a new query based on the top-ranked
    documents
  • Rocchio algorithm (sketched after this list)
  • Suppose a word occurs in 90% of the top N retrieved
    documents:
  • the word "very" is less informative
  • the word "Nixon" is highly informative
  • a0 and a1 can be estimated with the EM algorithm
    from training queries
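
A minimal sketch of Rocchio-style query expansion in Python; the weights alpha and beta and the toy vectors are invented, and the paper's own refinement re-estimates the HMM parameters from the feedback documents rather than applying Rocchio verbatim:

  from collections import Counter

  def rocchio(query_vec, top_docs, alpha=1.0, beta=0.5):
      """Move the query toward the centroid of the top-ranked documents,
      which blind feedback assumes to be relevant."""
      new_query = Counter({t: alpha * w for t, w in query_vec.items()})
      for doc_vec in top_docs:
          for t, w in doc_vec.items():
              new_query[t] += beta * w / len(top_docs)
      return new_query

  query = Counter({"germany": 1.0})
  top = [Counter({"germany": 2, "berlin": 3}), Counter({"berlin": 1, "wall": 1})]
  print(rocchio(query, top))   # "berlin" now carries weight in the new query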

15
Estimate a1
  • In equation (5) of the paper:
  • Q: a generic query
  • q: a generic query word
  • Qi: one training query
  • 𝒬: the set of available training queries
  • I_{m,Qi}: the top m documents retrieved for Qi
  • df(w): document frequency of w
  • Example: Qi = "Germany"; the top-ranked documents
    contribute a related word such as "Berlin"
  • Negative values are avoided by putting a floor on
    the estimate
16
Performance gained
17
Bigrams
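
The bigram formula itself is not preserved in the transcript. The idea is an additional document state that generates a query word conditioned on the previous one; the sketch below assumes a three-way mixture and reuses docs, p_ge, and p_doc from the earlier sketch (the interpolation weights and the treatment of the first query word are assumptions):

  import math

  def p_bigram(q, prev, d):
      """P(q | prev, D_k): relative frequency of the bigram (prev, q) in d."""
      toks = docs[d]
      pairs = sum(1 for a, b in zip(toks, toks[1:]) if (a, b) == (prev, q))
      return pairs / max(toks.count(prev), 1)

  def log_score_bigram(query, d, a0=0.3, a1=0.4, a2=0.3):
      """Assumed mixture (a0 + a1 + a2 = 1): General English, document
      unigrams, and document bigrams conditioned on the previous word.
      The first query word has no bigram context and uses only a0 and a1."""
      score, prev = 0.0, None
      for q in query:
          p = a0 * p_ge(q) + a1 * p_doc(q, d)
          if prev is not None:
              p += a2 * p_bigram(q, prev, d)
          score += math.log(p)
          prev = q
      return score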
18
Query Section Weighting
  • TREC evaluation
  • the Title section is more important than the others
  • v_s(q): weight for the section of the query that q
    comes from
  • v_desc = 1.2, v_narr = 1.9, v_title = 5.7 (one
    assumed way to apply these is sketched below)
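
The slide gives the weights but not how they enter the score; one plausible reading, assumed here, scales each query word's log-probability by the weight of the topic section it came from (reusing a0, a1, p_ge, and p_doc from the earlier sketch):

  import math

  V = {"title": 5.7, "desc": 1.2, "narr": 1.9}   # weights from the slide

  def weighted_log_score(query_with_sections, d):
      """Each (word, section) pair contributes v_s(q) * log P(q | ...)."""
      return sum(V[sec] * math.log(a0 * p_ge(q) + a1 * p_doc(q, d))
                 for q, sec in query_with_sections)

  q = [("white", "title"), ("house", "title"), ("statement", "desc")]
  print(weighted_log_score(q, "d1"))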

19
Document Priors
  • A refereed journal may be more informative than a
    supermarket tabloid
  • Most predictive features (a prior sketch follows
    this list):
  • Source
  • Length
  • Average word length
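
In a Bayesian score the prior enters as an additive log term; the sketch below reuses docs, total_len, and log_score from the earlier sketch, with an invented length-based prior (the paper instead learns priors from features such as source, length, and average word length):

  import math

  def log_posterior(query, d):
      """log P(D is R | Q) is, up to a constant,
      log P(Q | D is R) + log P(D is R)."""
      log_prior = math.log(len(docs[d]) / total_len)   # longer docs favored
      return log_score(query, d) + log_prior

  print(sorted(docs, key=lambda d: log_posterior("white house".split(), d),
               reverse=True))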

20
Conclusion
  • A novel method in IR using HMMs
  • Offers a rich setting
  • Incorporates new and familiar techniques
  • Experiments with a system that implements:
  • Blind feedback
  • Bigram modeling
  • Query Section weighting
  • Document priors
  • Future work
  • HMM can be extended to accommodate
  • Passage retrieval
  • Explicit synonym modeling
  • Concept modeling

21
Resources
  • D. Miller, T. Leek, R. Schwartz. A Hidden Markov
    Model Information Retrieval System. SIGIR '99,
    Berkeley, CA, USA, 1999.
  • L. Rabiner. A tutorial on hidden Markov models and
    selected applications in speech recognition.
    Proceedings of the IEEE 77(2), pp. 257-286, 1989.

22
Thank you very much!
  • Questions?