Title: Robust Pseudo Feedback
1 Robust Pseudo Feedback and HMM Passage Extraction: UIUC at TREC 2006 Genomics Track
- Jing Jiang, Xin He, ChengXiang Zhai
- University of Illinois at Urbana-Champaign
2 Goal of Participation
- To test the effectiveness of some recent language modeling methods for genomics retrieval
  - Robust pseudo feedback (Tao and Zhai 06)
  - HMM passage extraction (Jiang and Zhai 06)
- Tasks at the 2006 Genomics Track
  - Document-level retrieval
  - Passage-level retrieval
  - Aspect-level retrieval
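3 Overall Approach
[Diagram: Medline articles are split into paragraphs. Given a query Q, the Document Retrieval Module returns ranked paragraphs (1, 2, ..., k), refined by pseudo relevance feedback or user relevance feedback. The Passage Extraction Module then turns the ranked paragraphs into ranked passages (1, 2, ..., k).]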
5-6 KL-Divergence Retrieval Model (Lafferty and Zhai 01)
[Diagram: each document D1, D2, ..., Dk is represented by a document language model and the query by a topic language model; documents are scored by the KL-divergence between the two models.]
- Example document model: the 0.020, for 0.015, prp 0.102, mad 0.034, cow 0.034, diseas 0.068
- Example topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
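The scoring idea on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `neg_kl_score`, the linear smoothing weight `smooth`, the probability `floor`, and the off-topic document are all assumptions made for the example. It ranks by the cross-entropy term of the KL-divergence, which gives the same ordering because the remaining term depends only on the query.

```python
import math

# Example models taken from the slide; DOC_OFF_TOPIC is an invented contrast case.
TOPIC = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}
DOC_ON_TOPIC = {"the": 0.020, "for": 0.015, "prp": 0.102,
                "mad": 0.034, "cow": 0.034, "diseas": 0.068}
DOC_OFF_TOPIC = {"the": 0.5, "for": 0.5}
BACKGROUND = {"the": 0.02, "for": 0.01, "prp": 0.003, "prion": 0.004}


def neg_kl_score(topic_model, doc_model, background, smooth=0.1, floor=1e-6):
    """Return sum_w p(w|topic) * log p(w|doc), rank-equivalent to -KL(topic || doc).

    The document model is linearly smoothed with the background model so that
    topic words missing from the document do not produce log(0).
    """
    score = 0.0
    for word, p_topic in topic_model.items():
        p_background = background.get(word, floor)
        p_doc = (1.0 - smooth) * doc_model.get(word, 0.0) + smooth * p_background
        score += p_topic * math.log(p_doc)
    return score


def rank(topic_model, docs, background):
    """Sort document ids by decreasing score."""
    scores = {doc_id: neg_kl_score(topic_model, model, background)
              for doc_id, model in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank(TOPIC, {"D1": DOC_ON_TOPIC, "D2": DOC_OFF_TOPIC}, BACKGROUND)
```

With these toy inputs, D1 shares three words with the topic model and therefore ranks ahead of D2.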
7-9 Model-Based Feedback (Zhai and Lafferty 01)
[Diagram: the feedback documents D1, D2, ..., Dk are modeled as a mixture of a background model and an unknown feedback model; the EM algorithm estimates the feedback model, which is then combined with the topic model.]
- Background model: the 0.02, for 0.01, prp 0.003, prion 0.004
- Topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
- Feedback model before estimation: the ?, for ?, prp ?, prion ?
- Feedback model after EM: the 0.003, for 0.002, prp 0.02, prion 0.05
- 2 parameters: α and ?
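The EM estimation step can be sketched as a two-component mixture in which each word in the feedback documents comes either from the feedback model or from the background model. This is a simplified illustration under stated assumptions: the word counts are invented, and the names `estimate_feedback_model`, `interpolate`, `bg_weight`, and `alpha` are chosen for the example. Mapping the two knobs onto the two parameters named on the slide is likewise an assumption.

```python
# Background probabilities from the slide; the word counts are toy data.
BACKGROUND = {"the": 0.02, "for": 0.01, "prp": 0.003, "prion": 0.004}
COUNTS = {"the": 40, "for": 20, "prp": 10, "prion": 12}


def estimate_feedback_model(counts, background, bg_weight=0.5, iters=50, floor=1e-6):
    """EM for a mixture of an unknown feedback model and a fixed background model."""
    total = sum(counts.values())
    # Start from the maximum-likelihood estimate over the feedback documents.
    p_feedback = {w: c / total for w, c in counts.items()}
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from the feedback model.
        from_feedback = {}
        for w in counts:
            topic_part = (1.0 - bg_weight) * p_feedback[w]
            bg_part = bg_weight * background.get(w, floor)
            from_feedback[w] = topic_part / (topic_part + bg_part)
        # M-step: re-estimate the feedback model from the expected counts.
        norm = sum(counts[w] * from_feedback[w] for w in counts)
        p_feedback = {w: counts[w] * from_feedback[w] / norm for w in counts}
    return p_feedback


def interpolate(query_model, feedback_model, alpha=0.5):
    """Updated query model: (1 - alpha) * query + alpha * feedback."""
    words = set(query_model) | set(feedback_model)
    return {w: (1.0 - alpha) * query_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0)
            for w in words}


feedback = estimate_feedback_model(COUNTS, BACKGROUND)
updated_query = interpolate({"prion": 1.0}, feedback, alpha=0.5)
```

Compared with the raw relative frequencies, the estimated feedback model shifts probability away from words the background already explains ("the", "for") and toward "prp" and "prion". Both `bg_weight` and `alpha` have to be set by hand in this formulation.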
10-12 Regularized Estimation (Tao and Zhai 06)
[Diagram: same setting as model-based feedback, but the topic model serves as a prior on the feedback model, which is estimated with a regularized EM algorithm.]
- Background model: the 0.02, for 0.01, prp 0.003, prion 0.004
- Topic model (prior): role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
- Feedback model before estimation: the ?, for ?, prp ?, prion ?
- Feedback model after regularized EM: the 0.003, for 0.002, prp 0.02, prion 0.05
- 1 parameter: ?
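One way to read "regularized EM with a prior" is as MAP estimation, where the query model contributes pseudo-counts to the M-step. The sketch below follows that reading and is only an illustration. The prior strength `mu0`, the `decay` schedule, and the rule of stopping once the prior weight drops to the level of the feedback evidence are assumptions of this example, not details taken from the slides. The counts and the query model are toy data.

```python
# Background probabilities from the slide; counts and query model are toy data.
BACKGROUND = {"the": 0.02, "for": 0.01, "prp": 0.003, "prion": 0.004}
COUNTS = {"the": 40, "for": 20, "prp": 10, "prion": 12, "mad": 8}
QUERY = {"prp": 0.5, "prion": 0.5}


def regularized_feedback(counts, background, query_model, bg_weight=0.5,
                         mu0=1000.0, decay=0.9, max_iters=200, floor=1e-6):
    """MAP-style EM: the query model acts as a prior with pseudo-count weight mu."""
    vocab = sorted(set(counts) | set(query_model))
    total = sum(counts.values())
    # Start from an even blend of the prior and the raw feedback frequencies,
    # so that words outside the query keep non-zero probability.
    p_feedback = {w: 0.5 * query_model.get(w, 0.0) + 0.5 * counts.get(w, 0) / total
                  for w in vocab}
    mu = mu0
    for _ in range(max_iters):
        # E-step: probability that an occurrence of w came from the feedback model.
        from_feedback = {}
        for w in vocab:
            topic_part = (1.0 - bg_weight) * p_feedback[w]
            bg_part = bg_weight * background.get(w, floor)
            from_feedback[w] = topic_part / (topic_part + bg_part)
        evidence = sum(counts.get(w, 0) * from_feedback[w] for w in vocab)
        # M-step with prior pseudo-counts mu * p(w | query).
        p_feedback = {w: (counts.get(w, 0) * from_feedback[w]
                          + mu * query_model.get(w, 0.0)) / (evidence + mu)
                      for w in vocab}
        # Assumed stopping rule: stop once the prior no longer outweighs the evidence.
        if mu <= evidence:
            break
        mu *= decay  # otherwise relax the prior and iterate again
    return p_feedback, mu


model, final_mu = regularized_feedback(COUNTS, BACKGROUND, QUERY)
```

In this sketch the balance between query and feedback evidence falls out of the stopping rule, so no separate interpolation weight has to be chosen by hand.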
13 Original vs. Regularized EM
- Original EM: α manually set
- Regularized EM: α dynamically set
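15 HMM Passage Extraction (Jiang and Zhai 06)
[Diagram: the words w of a paragraph are generated by a three-state HMM, B1 -> R -> B2; the words emitted from R form the relevant passage.]
- Output probabilities
  - p(w|B1): the 0.02, for 0.01, prp 0.001
  - p(w|R): the 0.003, for 0.002, prp 0.02
  - p(w|B2): the 0.02, for 0.01, prp 0.001
- Transition probabilities
  - p(B1|B1) = 0.9, p(R|B1) = 0.1
  - p(R|R) = 0.95, p(B2|R) = 0.05
  - p(B2|B2) = 1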
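Decoding with the three-state HMM above can be sketched with the Viterbi algorithm, using the example probabilities from the slide. This is an illustrative sketch only: starting in B1, allowing the sequence to end in any state, the probability `floor` for unseen words, and the function names are assumptions. The token sequence is toy input.

```python
import math

STATES = ["B1", "R", "B2"]
# Transition and output probabilities as shown on the slide.
TRANSITIONS = {("B1", "B1"): 0.9, ("B1", "R"): 0.1,
               ("R", "R"): 0.95, ("R", "B2"): 0.05,
               ("B2", "B2"): 1.0}
BACKGROUND = {"the": 0.02, "for": 0.01, "prp": 0.001}
RELEVANT = {"the": 0.003, "for": 0.002, "prp": 0.02}
EMISSIONS = {"B1": BACKGROUND, "R": RELEVANT, "B2": BACKGROUND}


def viterbi(tokens, floor=1e-6):
    """Most likely state sequence for a non-empty token list, starting in B1."""
    neg_inf = float("-inf")

    def log_emit(state, word):
        return math.log(EMISSIONS[state].get(word, floor))

    best = {s: neg_inf for s in STATES}
    best["B1"] = log_emit("B1", tokens[0])  # assumption: every path starts in B1
    backpointers = []
    for word in tokens[1:]:
        new_best, pointers = {}, {}
        for state in STATES:
            candidates = [(best[prev] + math.log(TRANSITIONS[(prev, state)]), prev)
                          for prev in STATES
                          if (prev, state) in TRANSITIONS and best[prev] > neg_inf]
            if candidates:
                score, prev = max(candidates)
                new_best[state] = score + log_emit(state, word)
                pointers[state] = prev
            else:
                new_best[state] = neg_inf
                pointers[state] = None
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def extract_passage(tokens):
    """Return the tokens that the best path assigns to the relevant state R."""
    return [w for w, s in zip(tokens, viterbi(tokens)) if s == "R"]


tokens = "the for the prp prp prp the for the for".split()
passage = extract_passage(tokens)
```

Because the only way out of R is into B2, which has no outgoing transition except to itself, each paragraph yields at most one contiguous passage.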
16 HMM Passage Extraction (Jiang and Zhai 06)
[Diagram: extended HMM with states B1, R, B3, B2 and E, including an end-of-paragraph state and a background state for smoothing.]
- Transition probabilities are estimated from observations
17 Experiment Design
- Pre-processing
  - HTML parsing
  - Paragraph boundaries
  - Tokenization
- User relevance feedback
18-21 Official Runs
[Diagram, repeated for each run: Medline articles are split into paragraphs; KL-Div Retrieval ranks the paragraphs (1, 2, ..., k) for query Q, with feedback producing an updated query Q'; HMM Passage Extraction turns the ranked paragraphs into ranked passages (1, 2, ..., k).]
- UIUCauto: regularized estimation
- UIUCinter: regularized estimation
- UIUCinter2: original estimation (the diagram adds a component labeled F)
22-24 Pseudo Relevance Feedback (k = 10)
- ? is similar to ? / (1 - ?)
25 Parameter Sensitivity (pseudo feedback, k = 10)
26-28 User Relevance Feedback
29 HMM Passage Extraction
30 Passage Length (in Bytes)
HMM passages are generally too long!
31-32 Example Passage
Prion diseases, which include Creutzfeldt-Jakob
disease in humans, mad cow disease in cattle, and
scrapie in sheep, involve the misfolding of the
benign cellular prion protein (PrPC) to the
infectious disease-causing scrapie isoform
PrPSc. The prion protein (PrPC) is a copper-binding
cell surface glycoprotein. The role of copper in
the normal function of PrP, as well as in prion
diseases, has been the subject of a number of
excellent reviews. The mature cellular form of
PrP consists of residues 23 to 231 and is
tethered to the cell surface via a
glycosylphosphatidylinositol anchor at the C
terminus. There are now a number of NMR solution
structures of copper-free mammalian PrPs. A
crystal structure of PrPC has also been
published; this structure is dimeric, involving
domain swapping of the monomeric form.
33 Conclusions and Future Work
- The two language modeling methods generally work well in the genomics domain
  - Regularized feedback estimation can effectively eliminate parameter α
  - HMM passages improve over paragraphs
  - User relevance feedback is effective
- Limitations and future work
  - Regularized feedback estimation still has parameter ? to tune
    - How to eliminate ??
  - The inherent coherence property of HMM passages may not suit the task well
    - Different/better HMM architecture?
34 The End