1
A Hidden Markov Model Information Retrieval System
  • Mahboob Alam Khalid

2
Overview
  • Motivation
  • Hidden Markov Model (Introduction)
  • HMM for Information Retrieval System
  • Probability Model
  • Baseline System Experiments
  • HMM Refinements
  • Blind Feedback
  • Bigrams
  • Document Priors
  • Conclusion

3
Motivation
  • Hidden Markov models have been applied
    successfully to:
  • Speech Recognition
  • Named Entity Finding
  • Optical Character Recognition
  • Topic Identification
  • Ad hoc Information Retrieval (now)

4
Hidden Markov Model (Introduction)
  • You have seen a sequence of observations (words)
  • You don't know the sequence of generating states
  • An HMM is a model for exactly this situation
  • Two kinds of probabilities are involved in an HMM
  • Transition probabilities (jumps from one state to
    another), which sum to 1 for each state
  • Output probabilities (observations emitted by a
    state), which also sum to 1 for each state

5
A discrete HMM
  • Set of output symbols
  • Set of states
  • Set of transitions between states
  • Probability distribution on output symbols for
    each state
  • Observed sampling process:
  • Start from some initial state
  • Transition from it to another state
  • Sample from the output distribution at that state
  • Repeat these steps (a sketch of this process
    follows below)
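
A minimal sketch of this sampling loop in Python; the two states, the symbols, and every probability below are invented purely for illustration:

  import random

  # Illustrative discrete HMM; all numbers are made up.
  start = {"A": 0.6, "B": 0.4}              # initial-state distribution
  trans = {"A": {"A": 0.7, "B": 0.3},       # transition probabilities,
           "B": {"A": 0.4, "B": 0.6}}       # each row sums to 1
  emit  = {"A": {"x": 0.9, "y": 0.1},       # output distributions,
           "B": {"x": 0.2, "y": 0.8}}       # each row sums to 1

  def draw(dist):
      """Sample one key from a {outcome: probability} dict."""
      return random.choices(list(dist), weights=list(dist.values()))[0]

  def sample(n):
      """Start in an initial state, then repeatedly emit and transition."""
      state, out = draw(start), []
      for _ in range(n):
          out.append(draw(emit[state]))   # sample from the output distribution
          state = draw(trans[state])      # jump to the next state
      return out

  print(sample(10))   # the symbols are observed; the state path stays hidden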
6
HMM for Information Retrieval System
  • Observed data: the query Q
  • Unknown key: the relevant document D
  • Noisy channel: the user's mind, which transforms
    an imagined notion of relevance into the text of Q
  • We seek P(D is R | Q):
  • the probability that D is relevant in the user's mind
  • given that Q was the query produced
7
Probability Model
  • By Bayes' rule:
    P(D is R | Q) = P(Q | D is R) · P(D is R) / P(Q)
  • P(D is R) is the prior probability
  • P(Q) is identical for all documents, so it can be
    dropped for ranking
  • Output symbols
  • Union of all words in the corpus
  • States
  • Mechanism of query word generation
  • Document
  • General English
8
A simple two-state HMM
[Diagram: a two-state HMM. From "query start", with probability a0 the model enters the General English state, emitting q with P(q | GE); with probability a1 it enters the Document state, emitting q with P(q | D). Both paths return to "query end".]
  • The choice of which kind of word to generate next
    is independent of the previous such choice.

9
Why simplify the parameters?
  • One HMM per document
  • EM could compute these parameters
  • but EM needs training samples
  • documents paired with training queries (not available)
  • Instead, simple relative-frequency estimates are
    used (see the sketch below):
    P(q | D_k) = (number of times q appears in D_k) / (length of D_k)
    P(q | GE) = (Σ_k number of times q appears in D_k) / (Σ_k length of D_k)
  • P(Q | D_k is R) ≈ ∏_{q ∈ Q} ( a0 · P(q | GE) + a1 · P(q | D_k) )
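
A minimal sketch of these estimates and the resulting query-likelihood score in Python; the toy corpus, the query, and the value of a1 are invented here (the paper sets a0 and a1 empirically):

  import math

  # Toy corpus: each document is a list of tokens (invented for illustration).
  docs = {"d1": "the white house issued a statement".split(),
          "d2": "the house on the hill".split()}

  a1 = 0.3            # weight of the Document state; an assumed value
  a0 = 1.0 - a1       # weight of the General English state

  total_len = sum(len(d) for d in docs.values())

  def p_ge(q):
      """P(q | GE): corpus-wide relative frequency of q."""
      return sum(d.count(q) for d in docs.values()) / total_len

  def p_doc(q, d):
      """P(q | D_k): relative frequency of q within document d."""
      return docs[d].count(q) / len(docs[d])

  def log_score(query, d):
      """log P(Q | D_k is R) = sum over q of log(a0*P(q|GE) + a1*P(q|D_k)).
      The GE term smooths words absent from d; a word absent from the
      whole corpus would still get probability zero."""
      return sum(math.log(a0 * p_ge(q) + a1 * p_doc(q, d)) for q in query)

  query = "white house".split()
  print(sorted(docs, key=lambda d: log_score(query, d), reverse=True))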
10
Baseline System Performance
  • Number of queries: 50
  • An inverted index is created
  • tf values (term frequency)
  • Case is ignored
  • Porter stemmer
  • 397 stop words replaced with the special token STOP
  • Similarly, 4-digit strings replaced by YEAR, other
    digit strings by NUMBER (a preprocessing sketch
    follows this list)
  • TREC-6 and TREC-7 test collections
  • TREC-6
  • 556,077 documents, an average of 26.5 unique terms
  • News and government agencies
  • TREC-7
  • 528,155 documents, an average of 17.6 unique terms
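
A rough sketch of this preprocessing in Python; the stop list is abbreviated to a few words here, and a stub stands in for the Porter stemmer the slide mentions:

  import re

  STOP_WORDS = {"the", "of", "a", "and"}   # stand-in for the full 397-word list

  def stem(token):
      # Placeholder: a real system would call an actual Porter stemmer here.
      return token

  def normalize(token):
      token = token.lower()                 # ignore case
      if re.fullmatch(r"\d{4}", token):     # 4-digit strings -> YEAR
          return "YEAR"
      if re.fullmatch(r"\d+", token):       # other digit strings -> NUMBER
          return "NUMBER"
      if token in STOP_WORDS:               # stop words -> STOP
          return "STOP"
      return stem(token)

  print([normalize(t) for t in "The 1998 floods claimed 23 lives".split()])
  # -> ['STOP', 'YEAR', 'floods', 'claimed', 'NUMBER', 'lives']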

11
TF.IDF model
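
The transcript does not preserve the formula from this slide; for reference, a standard tf.idf weight of the kind used in such baselines looks as follows (the paper's exact variant may differ):

  import math

  def tfidf(tf, df, n_docs):
      """Term frequency scaled by inverse document frequency."""
      return tf * math.log(n_docs / df)

  # e.g. a term occurring 3 times in a document and appearing in
  # 100 of the 556,077 TREC-6 documents:
  print(tfidf(tf=3, df=100, n_docs=556_077))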
12
Non-interpolated average precision
13
HMM Refinements
  • Blind Feedback
  • a well-known technique for enhancing performance
  • Bigrams
  • some words take on a distinctive meaning in the
    context of another word, e.g. "white house",
    "Pope John Paul II"
  • Query Section Weighting
  • some sections of the query are more important than
    others
  • Document Priors
  • longer documents tend to be more informative than
    short ones

14
Blind Feedback
  • Construct a new query based on the top-ranked
    documents
  • Rocchio algorithm (sketched after this list)
  • Suppose a word occurs in 90% of the top N retrieved
    documents:
  • the word "very" is less informative
  • the word "Nixon" is highly informative
  • a0 and a1 can be estimated with the EM algorithm
    from training queries
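
A minimal sketch of Rocchio-style query expansion in Python; the weights alpha and beta and the toy vectors are invented, and the paper's own refinement re-estimates the HMM parameters from the feedback documents rather than applying Rocchio verbatim:

  from collections import Counter

  def rocchio(query_vec, top_docs, alpha=1.0, beta=0.5):
      """Move the query toward the centroid of the top-ranked documents,
      which blind feedback assumes to be relevant."""
      new_query = Counter({t: alpha * w for t, w in query_vec.items()})
      for doc_vec in top_docs:
          for t, w in doc_vec.items():
              new_query[t] += beta * w / len(top_docs)
      return new_query

  query = Counter({"germany": 1.0})
  top = [Counter({"germany": 2, "berlin": 3}), Counter({"berlin": 1, "wall": 1})]
  print(rocchio(query, top))   # "berlin" now carries weight in the new query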

15
Estimate a1
  • In equation (5) of the paper:
  • Q: a generic query
  • q: a generic query word
  • Qi: one training query
  • 𝒬: the set of available training queries
  • I_{m,Qi}: the top m documents retrieved for Qi
  • df(w): document frequency of w
  • Example: Qi = "Germany"; the top-ranked documents
    contribute a related word such as "Berlin"
  • Negative values are avoided by putting a floor on
    the estimate
16
Performance gained
17
Bigrams
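
The bigram formula itself is not preserved in the transcript. The idea is an additional document state that generates a query word conditioned on the previous one; the sketch below assumes a three-way mixture and reuses docs, p_ge, and p_doc from the earlier sketch (the interpolation weights and the treatment of the first query word are assumptions):

  import math

  def p_bigram(q, prev, d):
      """P(q | prev, D_k): relative frequency of the bigram (prev, q) in d."""
      toks = docs[d]
      pairs = sum(1 for a, b in zip(toks, toks[1:]) if (a, b) == (prev, q))
      return pairs / max(toks.count(prev), 1)

  def log_score_bigram(query, d, a0=0.3, a1=0.4, a2=0.3):
      """Assumed mixture (a0 + a1 + a2 = 1): General English, document
      unigrams, and document bigrams conditioned on the previous word.
      The first query word has no bigram context and uses only a0 and a1."""
      score, prev = 0.0, None
      for q in query:
          p = a0 * p_ge(q) + a1 * p_doc(q, d)
          if prev is not None:
              p += a2 * p_bigram(q, prev, d)
          score += math.log(p)
          prev = q
      return score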
18
Query Section Weighting
  • TREC evaluation
  • the Title section is more important than the others
  • v_s(q): weight for the section of the query that q
    comes from
  • v_desc = 1.2, v_narr = 1.9, v_title = 5.7 (one
    assumed way to apply these is sketched below)
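
The slide gives the weights but not how they enter the score; one plausible reading, assumed here, scales each query word's log-probability by the weight of the topic section it came from (reusing a0, a1, p_ge, and p_doc from the earlier sketch):

  import math

  V = {"title": 5.7, "desc": 1.2, "narr": 1.9}   # weights from the slide

  def weighted_log_score(query_with_sections, d):
      """Each (word, section) pair contributes v_s(q) * log P(q | ...)."""
      return sum(V[sec] * math.log(a0 * p_ge(q) + a1 * p_doc(q, d))
                 for q, sec in query_with_sections)

  q = [("white", "title"), ("house", "title"), ("statement", "desc")]
  print(weighted_log_score(q, "d1"))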

19
Document Priors
  • A refereed journal may be more informative than a
    supermarket tabloid
  • Most predictive features (a prior sketch follows
    this list):
  • Source
  • Length
  • Average word length
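
In a Bayesian score the prior enters as an additive log term; the sketch below reuses docs, total_len, and log_score from the earlier sketch, with an invented length-based prior (the paper instead learns priors from features such as source, length, and average word length):

  import math

  def log_posterior(query, d):
      """log P(D is R | Q) is, up to a constant,
      log P(Q | D is R) + log P(D is R)."""
      log_prior = math.log(len(docs[d]) / total_len)   # longer docs favored
      return log_score(query, d) + log_prior

  print(sorted(docs, key=lambda d: log_posterior("white house".split(), d),
               reverse=True))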

20
Conclusion
  • A novel method in IR using HMMs
  • Offers a rich setting
  • Incorporates new and familiar techniques
  • Experiments with a system that implements:
  • Blind feedback
  • Bigram modeling
  • Query Section weighting
  • Document priors
  • Future work
  • HMM can be extended to accommodate
  • Passage retrieval
  • Explicit synonym modeling
  • Concept modeling

21
Resources
  • D. Miller, T. Leek, R. Schwartz. A Hidden Markov
    Model Information Retrieval System. SIGIR '99,
    Berkeley, CA, USA, 1999.
  • L. Rabiner. A tutorial on hidden Markov models and
    selected applications in speech recognition.
    Proceedings of the IEEE 77(2), pp. 257-286, 1989.

22
Thank you very much!
  • Questions?