Title: Sentence Extraction with Classification
1. Sentence Extraction with Classification
- Unbalanced data and resampling
2. Data
- 14 question topics
- 100 documents per topic
- 50,768 sentences
- Positive: 104 (0.2%)
3. Features
- Sentence length
- Document score
- Number of query phrases
- Number of query terms
- Average idf
- Average log idf
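A minimal sketch of how these six per-sentence features might be computed; the helper names, the `idf` table, and the tokenization are assumptions for illustration, not the system's actual code.

```python
import math

def sentence_features(sentence_text, sentence_terms, query_terms,
                      query_phrases, doc_score, idf):
    """Compute the six features listed above for one sentence.

    idf maps a term to its inverse document frequency; doc_score is the
    retrieval score of the sentence's source document.
    """
    idfs = [idf.get(t, 0.0) for t in sentence_terms]
    n = max(len(idfs), 1)
    return {
        "length": len(sentence_text),
        "doc_score": doc_score,
        "num_query_phrases": sum(p in sentence_text for p in query_phrases),
        "num_query_terms": len(set(sentence_terms) & set(query_terms)),
        "avg_idf": sum(idfs) / n,
        "avg_log_idf": sum(math.log(v) for v in idfs if v > 0) / n,
    }
```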
4. Results
5. Discussion
- Average sentence length: 100 non-space characters
- Allowed number of sentences: 7000 / 100 = 70
- Current status: > 2227 / 7 ≈ 300
6. Evaluation
- Rank the sentences by their likelihood of belonging to the positive class
- Get the top n sentences so that their total length reaches 7000
- Calculate the precision and recall of this n-sentence list (sketched below)
- Instance recall
- Nugget recall
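A sketch of this evaluation protocol, assuming `scored` holds (likelihood, sentence, is_relevant) triples; the 7000-character budget comes from the slide, everything else is illustrative.

```python
def top_sentences_by_budget(scored, budget=7000):
    """Take sentences in decreasing likelihood until total length reaches the budget."""
    picked, total = [], 0
    for likelihood, sentence, is_relevant in sorted(scored, key=lambda x: -x[0]):
        if total >= budget:
            break
        picked.append((sentence, is_relevant))
        total += len(sentence)
    return picked

def precision_recall(picked, num_relevant_total):
    """Precision and recall of the selected sentence list."""
    hits = sum(1 for _, rel in picked if rel)
    precision = hits / len(picked) if picked else 0.0
    recall = hits / num_relevant_total if num_relevant_total else 0.0
    return precision, recall
```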
7. Results (Instance-based)
8. Results (Nugget recall)
- Classification
- Ranking SVM
9. List of Heuristics Last Year
- Filtering heuristics
- Document filtering
  - Get a subset of valid documents that contain at least one proper noun in each facet.
- Sentence filtering (sketched below)
  - Remove sentences whose length exceeds 50 non-stopword stems.
  - Remove a sentence if it has more than 50 non-stopword stems in common with any sentence ranked higher.
  - Discard sentences that do not contain at least one term from the question.
  - Discard sentences with fewer than 5 or more than 50 terms.
  - Use each sentence in the ranked list as a query against all sentences lower in the list, eliminating all sentences whose similarity exceeds a particular threshold.
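An illustrative rendering of the sentence-filtering heuristics above; the 50-stem and 5-to-50-term thresholds come from the slide, while the function names and the stem-overlap check are assumptions.

```python
def keep_sentence(sent_terms, sent_stems, question_terms):
    """Apply the per-sentence filters: stem length, question-term overlap, term count."""
    if len(sent_stems) > 50:                        # over 50 non-stopword stems
        return False
    if not set(sent_terms) & set(question_terms):   # no term from the question
        return False
    if not 5 <= len(sent_terms) <= 50:              # too short or too long
        return False
    return True

def drop_redundant(ranked_stem_lists, max_shared=50):
    """Drop a sentence sharing too many stems with any higher-ranked sentence."""
    kept_idx, kept_sets = [], []
    for i, stems in enumerate(ranked_stem_lists):   # assumed ranked best-first
        s = set(stems)
        if all(len(s & prev) <= max_shared for prev in kept_sets):
            kept_idx.append(i)
            kept_sets.append(s)
    return kept_idx
```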
10. List of Heuristics Last Year
- (Sentence) ranking heuristics
  - Number of facets the sentence contains.
  - Number of query terms the sentence contains.
  - Number of lexical bonds the sentence has with the following sentences in the document.
  - Average idf of all non-stopwords in the sentence.
  - Re-run the original query against the set of candidate sentences.
  - Document score.
  - Candidate recall, similar to 2b except that the terms are weighted by idf and synonyms are counted in matching.
  - The average similarity of all matching terms between the candidate and the query. Similarity values are 1 for exact matches or come from various sources of synonyms.
11. Agreement of Pyramid Scores and Vital-Okay Distinctions
12. Pyramid Score for Vital Nuggets
13. Pyramid Score for Okay Nuggets
14. Next
- Ranking SVM
- Two-stage training
- 2006 data
15. Learning Algorithms for Ranking
- Classification
  - Use the predicted likelihood as the ranking score
- Ranking SVM
  - T. Joachims. Optimizing Search Engines Using Clickthrough Data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
  - Implemented in SVMlight (input format sketched below)
- RankBoost
  - Yoav Freund, Raj Iyer, Robert E. Schapire, Yoram Singer. An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research 4 (2003) 933-969.
- Ranking Refinement
  - Hamed Valizadegan and Rong Jin
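A sketch of preparing Ranking SVM training data in SVMlight's `<target> qid:<id> <feature>:<value>` line format; the writer function and the toy numbers are illustrative, and training would then use SVMlight's ranking mode (e.g. `svm_learn -z p train.dat model`).

```python
def write_svmlight_rank(path, examples):
    """examples: (target, qid, feature_vector) triples; features are 1-based."""
    with open(path, "w") as f:
        for target, qid, feats in examples:
            pairs = " ".join(f"{i + 1}:{v}" for i, v in enumerate(feats))
            f.write(f"{target} qid:{qid} {pairs}\n")

# Within one topic (qid), a higher target means "should rank higher":
write_svmlight_rank("train.dat", [
    (2, 1, [0.8, 0.1, 3.2]),   # positive sentence of topic 1
    (1, 1, [0.4, 0.3, 1.1]),   # negative sentence of topic 1
])
```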
16. Ranking SVM
- Too slow
- Instances have to be sub-sampled
- Keep all the positive instances
- Randomly sub-sample the negative instances
- A trick to speed up (sketched below)
- Sub-sample
- Multiple runs
- Combine the results
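A hedged sketch of the speed-up trick: every run keeps all positives, draws a fresh random subsample of negatives, trains a model, and the per-sentence scores of the runs are averaged. `train_fn` and `score_fn` are placeholders for the actual Ranking SVM calls.

```python
import random

def subsampled_ensemble(train_fn, score_fn, positives, negatives,
                        test_items, n_runs=5, negatives_per_run=1000, seed=0):
    """Average ranking scores over several runs on negative subsamples."""
    rng = random.Random(seed)
    totals = [0.0] * len(test_items)
    for _ in range(n_runs):
        sample = positives + rng.sample(
            negatives, min(negatives_per_run, len(negatives)))
        model = train_fn(sample)                  # e.g. one Ranking SVM fit
        for i, item in enumerate(test_items):
            totals[i] += score_fn(model, item)
    return [t / n_runs for t in totals]           # combined ranking scores
```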
17. Results (Nugget recall)
- Classification
- Ranking SVM
18. DocRet Results on 2006 Data
- In most cases, only the slot fillers in the question template do a good job
- Add the named entities in narratives
  - In the case of a financial relationship, add words in the narrative that relate to finance: money, trade, etc.
  - In the case that an abbreviation is spelled out, use an OR modifier (see the sketch below)
- The above strategy achieves 85% document recall except for two topics
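One plausible reading of the OR-modifier step, sketched below: when a query abbreviation is spelled out in the narrative, both surface forms are joined with OR. The function and the example query are hypothetical.

```python
def expand_abbreviation(query, abbrev, spelled_out):
    """Replace an abbreviation with an OR of both surface forms."""
    return query.replace(abbrev, f'({abbrev} OR "{spelled_out}")')

# e.g. expand_abbreviation("WTO trade dispute", "WTO", "World Trade Organization")
# -> '(WTO OR "World Trade Organization") trade dispute'
```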
19. Results on 2006 Data
Overall recall: 24.1%
20. Two-stage Training
[Slide diagram: a worked two-stage example. Training panel: Label row (0 0 1 1 0), frame-independent feature rows, frame-dependent feature rows, "Predict 1" scores (0.012, 0.015, 0.018, 0.021, 0.020), "Predict 2" scores (0.07, 0.09, 0.53, 0.68, 0.21). Testing panel: the same rows with labels and both prediction rows unknown.]
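A minimal sketch of two-stage training as the diagram suggests: stage 1 is fit on frame-independent features alone, and its predicted likelihood is appended to the frame-dependent features as input to stage 2. The classifier choice (logistic regression) is an assumption.

```python
from sklearn.linear_model import LogisticRegression

def two_stage_fit(X_ind, X_dep, y):
    """X_ind: frame-independent features; X_dep: frame-dependent features."""
    stage1 = LogisticRegression().fit(X_ind, y)
    p1 = stage1.predict_proba(X_ind)[:, 1]            # "Predict 1"
    X2 = [list(d) + [p] for d, p in zip(X_dep, p1)]   # append stage-1 output
    stage2 = LogisticRegression().fit(X2, y)          # "Predict 2"
    return stage1, stage2

def two_stage_predict(stage1, stage2, X_ind, X_dep):
    """At test time, labels are unknown; stage-1 scores feed stage 2."""
    p1 = stage1.predict_proba(X_ind)[:, 1]
    X2 = [list(d) + [p] for d, p in zip(X_dep, p1)]
    return stage2.predict_proba(X2)[:, 1]
```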
21. New Features
- Sentence length
- Document score
- Number of query phrases
- Number of query terms
- Average idf
- Average log idf
- Lexical bonds
- Document position
- Paragraph position
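A sketch of the three added features, assuming a "lexical bond" counts following sentences in the same document that share at least one non-stopword stem (the definition implied by slide 10); the position features are normalized to [0, 1] here, which is an assumption.

```python
def added_features(doc_sentences, para_sentences, i, j, stems):
    """i: sentence index in the document; j: sentence index in its paragraph."""
    s = set(stems(doc_sentences[i]))
    bonds = sum(1 for later in doc_sentences[i + 1:]
                if s & set(stems(later)))             # shared-stem bond
    return {
        "lexical_bonds": bonds,
        "doc_position": i / max(len(doc_sentences) - 1, 1),
        "para_position": j / max(len(para_sentences) - 1, 1),
    }
```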
22. New Results on 2006 Data
Overall recall: 27.3% → 30.5%
23. Remark
- 3 features: NumQueryEntities, NumQueryTerms, AverageIdf
- 9 features: the 3 above plus Length, DocScore, AverageLogIdf, LexicalBonds, DocPosition, ParaPosition
- 9 NE types: GPE, Organization, Person, Substance, Date, Cardinal, Money, Percent, Quantity
- 22 NE types: the 9 above plus Animal, Disease, Event, Facility, Game, Language, Law, Location, Nationality, Plant, Product, Time, Ordinal
24.
- Template 1 (transport): QUANTITY 0.083, LANGUAGE 0.059, PRODUCT 0.043, LOCATION 0.035, CARDINAL 0.034
- Template 2 (financial): MONEY 0.047, DATE 0.012, ORGANIZATION 0.0109, PERCENT 0.01089, CARDINAL 0.009
- Template 3 (effect): LAW 0.161, PLANT 0.028, DISEASE 0.026, LOCATION 0.020, SUBSTANCE 0.019
- Template 4 (position): PERCENT 0.049, DISEASE 0.048, LAW 0.041, DATE 0.029, ANIMAL 0.028
- Template 5 (involvement): GAME 0.059, LANGUAGE 0.043, SUBSTANCE 0.037, EVENT 0.027, MONEY 0.0269
25.
- Template 1 (transport): GPE (189), DATE (145), CARDINAL (107), ORGANIZATION (94), NATIONALITY (84)
- Template 2 (financial): ORGANIZATION (57), DATE (46), PERSON (35), GPE (33), MONEY (31)
- Template 3 (effect): DATE (92), SUBSTANCE (67), ORGANIZATION (47), CARDINAL (45), GPE (40)
- Template 4 (position): DATE (156), PERSON (151), ORGANIZATION (149), GPE (129), NATIONALITY (114)
- Template 5 (involvement): PERSON (162), ORGANIZATION (124), GPE (105), DATE (102), SUBSTANCE (76)