1
Sentence Extraction with Classification
  • Unbalanced data and resampling

2
Data
  • 14 question topics
  • 100 documents per topic
  • 50768 sentences
  • Positive: 104 (0.2%; resampling sketch below)
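
A minimal resampling sketch for this imbalance (104 positives in 50,768 sentences is roughly 1 in 500), assuming instances are (features, label) pairs; the 10:1 negative-to-positive ratio is an illustrative choice, not taken from the slides:

```python
import random

def subsample_negatives(instances, neg_per_pos=10, seed=0):
    """Keep every positive instance; randomly keep at most
    `neg_per_pos` negatives per positive."""
    pos = [x for x in instances if x[1] == 1]
    neg = [x for x in instances if x[1] == 0]
    rng = random.Random(seed)
    keep = min(len(neg), neg_per_pos * len(pos))
    return pos + rng.sample(neg, keep)
```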

3
Features
  • Sentence length
  • Document score
  • Number of query phrases
  • Number of query terms
  • Average idf
  • Average log idf (feature sketch below)
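
A sketch of computing these six features for one candidate sentence; the whitespace tokenization, the idf lookup table, and the log1p smoothing are assumptions, since the slide lists the features without formulas:

```python
import math

def extract_features(sentence, query_terms, query_phrases, idf, doc_score):
    """Six features per the slide: length, document score, query-phrase
    and query-term counts, average idf, average log idf."""
    tokens = sentence.lower().split()
    idfs = [idf.get(t, 0.0) for t in tokens]
    n = max(1, len(tokens))
    return [
        len(tokens),                                        # sentence length
        doc_score,                                          # document score
        sum(p in sentence.lower() for p in query_phrases),  # query phrases
        sum(t in query_terms for t in tokens),              # query terms
        sum(idfs) / n,                                      # average idf
        sum(math.log1p(v) for v in idfs) / n,               # average log idf
    ]
```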

4
Results
5
Discussion
  • Average sentence length: 100 non-space
    characters
  • Allowed number of sentences: 7000/100 = 70
  • Current status: > 2227/7 ≈ 300

6
Evaluation
  • Rank the sentences by their likelihood of
    belonging to the positive class
  • Get the top n sentences so that their total
    length reaches 7000 characters
  • Calculate the precision and recall of this n
    sentence list
  • Instance recall
  • Nugget recall (evaluation sketch below)
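
A sketch of this evaluation loop, assuming one predicted score per sentence and binary gold labels, with "length" read as non-space characters per the Discussion slide:

```python
def evaluate(sentences, scores, labels, length_budget=7000):
    """Rank sentences by predicted positive likelihood, keep the top
    ones until the total non-space length reaches the budget, then
    score the kept list against the gold labels."""
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    picked, total = [], 0
    for i in order:
        if total >= length_budget:
            break
        picked.append(i)
        total += len(sentences[i].replace(" ", ""))
    tp = sum(labels[i] for i in picked)
    precision = tp / len(picked) if picked else 0.0
    recall = tp / max(1, sum(labels))
    return precision, recall
```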

7
Results (Instance based)
  • Classification
  • Heuristics

8
Results (Nugget recall)
  • Classification
  • Ranking SVM
  • Heuristics

9
List of Heuristics Last Year
  • Filtering heuristics.
  • Document filtering
  • Get a subset of valid documents that contain at
    least one proper noun in each facet.
  • Sentence filtering
  • Remove those whose length exceeds 50
    non-stopword stems.
  • Remove a sentence if it has more than 50
    non-stopword stems in common with any sentence
    ranked higher.
  • Sentences that do not contain at least one term
    from the question are discarded.
  • Discard sentences with fewer than 5 terms or
    more than 50 terms.
  • Use each sentence in the ranked list as a query
    against all sentences lower in the list,
    eliminating sentences whose similarity exceeds a
    particular threshold (see the sketch below).
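
A sketch of the two redundancy filters (the 50-common-stem rule and the similarity threshold), assuming a stem_set helper that returns a sentence's non-stopword stems; Jaccard similarity and the 0.8 cutoff are stand-ins, since the slide leaves the measure and threshold unspecified:

```python
def jaccard(a, b):
    return len(a & b) / max(1, len(a | b))

def filter_redundant(ranked, stem_set, max_common=50, sim_threshold=0.8):
    """Walk the ranked list top-down; drop a sentence that shares more
    than `max_common` non-stopword stems with, or is too similar to,
    any kept higher-ranked sentence."""
    kept, kept_stems = [], []
    for s in ranked:
        st = stem_set(s)
        if all(len(st & prev) <= max_common and
               jaccard(st, prev) <= sim_threshold for prev in kept_stems):
            kept.append(s)
            kept_stems.append(st)
    return kept
```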

10
List of Heuristics Last Year
  • (Sentence) ranking heuristics
  • Number of facets the sentence contains.
  • Number of query terms the sentence contains.
  • Number of lexical bonds the sentence has with the
    following sentences in the document.
  • Average idf of all non-stopwords in the sentence.
  • Re-run the original query against the set of
    candidate sentences.
  • Document score.
  • Candidate recall, similar to 2b except that the
    terms are weighted by idf and synonyms are
    counted in matching.
  • The average similarity of all matching terms
    between the candidate and the query. Similarity
    values are 1 for exact matches or come from
    various sources of synonyms. (Combined-score
    sketch below.)
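
The slides do not say how these heuristics were combined into a single ranking; a plain weighted sum is the simplest reading, sketched here with hypothetical feature names:

```python
def heuristic_score(feats, weights):
    """Weighted sum of the per-sentence heuristics above, e.g.
    feats = {"facets": 2, "query_terms": 5, "lexical_bonds": 3,
             "avg_idf": 4.1, "doc_score": 12.7}."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())
```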

11
Agreement of Pyramid Scores and Vital-Okay
Distinctions
12
Pyramid Score for Vital Nuggets
13
Pyramid Score for Okay Nuggets
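
For reference, a sketch of nugget-pyramid recall in the style of Lin and Demner-Fushman, where a nugget's weight is the fraction of assessors who marked it vital, so the binary vital/okay distinction becomes a graded weight; that this is the exact scoring behind these slides is an assumption:

```python
def pyramid_recall(matched_nuggets, vital_votes, n_assessors):
    """Weight each nugget by (votes for vital) / (assessors); recall is
    the matched weight mass over the total weight mass."""
    weight = {n: v / n_assessors for n, v in vital_votes.items()}
    total = sum(weight.values())
    return sum(weight[n] for n in matched_nuggets) / max(total, 1e-9)
```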
14
Next
  • Ranking SVM
  • Two-stage training
  • Data of 2006

15
Learning Algorithms for Ranking
  • Classification
  • Use the predicted likelihood as the ranking score
  • Ranking SVM
  • T. Joachims, Optimizing Search Engines Using
    Clickthrough Data, Proceedings of the ACM
    Conference on Knowledge Discovery and Data Mining
    (KDD), ACM, 2002.
  • Implemented in SVM-Light (input format sketched
    below)
  • RankBoost
  • Yoav Freund, Raj Iyer, Robert E. Schapire, and
    Yoram Singer. An Efficient Boosting Algorithm for
    Combining Preferences. Journal of Machine
    Learning Research 4 (2003) 933-969.
  • Ranking Refinement
  • Hamed Valizadegan and Rong Jin
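
Ranking SVM learns from preference pairs within each query, which SVM-Light reads from its ranking file format: one line per instance with a rank target, a qid, then 1-based feature:value pairs. A small writer sketch:

```python
def write_svmlight_rank(path, examples):
    """examples: iterable of (target, qid, feature_list); within a qid,
    a higher target means the instance should rank above. Lines look
    like '<target> qid:<qid> 1:<f1> 2:<f2> ...'."""
    with open(path, "w") as f:
        for target, qid, feats in sorted(examples, key=lambda e: e[1]):
            pairs = " ".join(f"{i}:{v}" for i, v in enumerate(feats, 1))
            f.write(f"{target} qid:{qid} {pairs}\n")
```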

16
Ranking SVM
  • Too slow
  • Instances have to be sub-sampled
  • Keep all the positive instances
  • Randomly sub-sample the negative instances
  • A trick to speed up (sketch below)
  • Sub-sample
  • Multiple runs
  • Combine the results
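
A sketch of the speed-up trick, reusing subsample_negatives from the earlier sketch; reading "combine the results" as averaging per-sentence scores across runs is an assumption:

```python
from statistics import mean

def bagged_scores(train, test, fit_and_score, n_runs=5, neg_per_pos=10):
    """Train on several negative sub-samples and average the test
    scores; fit_and_score(train_subset, test) stands in for one
    SVM-Light train/predict round, returning one score per test item."""
    runs = [fit_and_score(subsample_negatives(train, neg_per_pos, seed=r), test)
            for r in range(n_runs)]
    return [mean(scores) for scores in zip(*runs)]
```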

17
Results (Nugget recall)
  • Classification
  • Ranking SVM
  • Heuristics

18
DocRet Results on 2006 Data
  • In most cases, only slot fillers in the
    question template do a good job
  • Add the named entities in narratives
  • In the case of a financial relationship, add
    words in the narrative that relate to finance:
    money, trade, etc.
  • In the case that an abbreviation is spelled out,
    use an OR modifier
  • The above strategy achieves 85% document recall
    except for two topics (query sketch below)
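
A sketch of this query-building strategy; the #or operator is Indri-style syntax and an assumption, as the slide says only that an OR modifier is used when an abbreviation is spelled out:

```python
def build_docret_query(slot_fillers, narrative_entities,
                       topic_words=(), abbreviations=None):
    """Start from the template slot fillers, add named entities from
    the narrative, add topic words (e.g. finance terms for a financial
    relationship), and OR each abbreviation with its expansion."""
    terms = list(slot_fillers) + list(narrative_entities) + list(topic_words)
    for abbr, expansion in (abbreviations or {}).items():
        terms.append(f"#or({abbr} {expansion})")
    return " ".join(terms)
```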

19
Results on 2006 Data
Overall recall: 24.1%
20
Two-stage Training

Training:
  Label           0       0       1       1       0
  Frame-Ind Feat  (0, 0.2, …)  (1, 0.0, …)  (1, 0.8, …)  (0, 1.0, …)  …
  Frame-Dep Feat  …
  Predict 1       0.012   0.015   0.018   0.021   0.020
  Predict 2       0.07    0.09    0.53    0.68    0.21

Testing:
  Label           ?       ?       ?       ?       ?
  Frame-Ind Feat  …
  Frame-Dep Feat  …
  Predict 1       ?       ?       ?       ?       ?
  Predict 2       ?       ?       ?       ?       ?
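
A sketch of the scheme the table illustrates: stage 1 trains on the frame-independent features, and its predicted probability ("Predict 1") joins the frame-dependent features as input to stage 2 ("Predict 2"). Logistic regression is a stand-in; the slides do not name the stage classifiers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_stage_fit(X_ind, X_dep, y):
    """Fit stage 1 on frame-independent features, then stage 2 on
    frame-dependent features plus stage 1's predicted probability."""
    m1 = LogisticRegression(max_iter=1000).fit(X_ind, y)
    p1 = m1.predict_proba(X_ind)[:, 1:]
    m2 = LogisticRegression(max_iter=1000).fit(np.hstack([X_dep, p1]), y)
    return m1, m2

def two_stage_predict(m1, m2, X_ind, X_dep):
    p1 = m1.predict_proba(X_ind)[:, 1:]
    return m2.predict_proba(np.hstack([X_dep, p1]))[:, 1]
```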
21
New Features
  • Sentence length
  • Document score
  • Number of query phrases
  • Number of query terms
  • Average idf
  • Average log idf
  • Lexical bonds
  • Document position
  • Paragraph position (lexical-bond sketch below)
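
Of the three added features, lexical bonds is the least self-explanatory; per the earlier heuristics slide it counts ties to the following sentences in the same document. A sketch with the same assumed stem_set helper, where the at-least-one-shared-stem criterion is an assumption:

```python
def lexical_bonds(doc_sentences, idx, stem_set):
    """Number of later sentences in the document sharing at least one
    non-stopword stem with sentence `idx`; DocPosition and ParaPosition
    are then just the sentence's index within the document and within
    its paragraph."""
    st = stem_set(doc_sentences[idx])
    return sum(bool(st & stem_set(s)) for s in doc_sentences[idx + 1:])
```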

22
New Results on 2006 Data
Overall recall: 27.3% → 30.5%
23
Remark
  • 3 features: NumQueryEntities, NumQueryTerms,
    AverageIdf
  • 9 features: the 3 above plus Length, DocScore,
    AverageLogIdf, LexicalBonds, DocPosition,
    ParaPosition
  • 9 NEs: GPE, Organization, Person, Substance,
    Date, Cardinal, Money, Percent, Quantity
  • 22 NEs: the 9 above plus Animal, Disease, Event,
    Facility, Game, Language, Law, Location,
    Nationality, Plant, Product, Time, Ordinal
24
  • Template 1 (transport)
  • QUANTITY 0.083, LANGUAGE 0.059, PRODUCT 0.043,
    LOCATION 0.035, CARDINAL 0.034
  • Template 2 (financial)
  • MONEY 0.047, DATE 0.012, ORGANIZATION 0.0109,
    PERCENT 0.01089, CARDINAL 0.009
  • Template 3 (effect)
  • LAW 0.161, PLANT 0.028, DISEASE 0.026,
    LOCATION 0.020, SUBSTANCE 0.019
  • Template 4 (position)
  • PERCENT 0.049, DISEASE 0.048, LAW 0.041,
    DATE 0.029, ANIMAL 0.028
  • Template 5 (involvement)
  • GAME 0.059, LANGUAGE 0.043, SUBSTANCE 0.037,
    EVENT 0.027, MONEY 0.0269
25
  • Template 1 (transport)
  • GPE (189), DATE (145), CARDINAL (107),
    ORGANIZATION (94), NATIONALITY (84)
  • Template 2 (financial)
  • ORGANIZATION (57), DATE (46), PERSON (35),
    GPE (33), MONEY (31)
  • Template 3 (effect)
  • DATE (92), SUBSTANCE (67), ORGANIZATION (47),
    CARDINAL (45), GPE (40)
  • Template 4 (position)
  • DATE (156), PERSON (151), ORGANIZATION (149),
    GPE (129), NATIONALITY (114)
  • Template 5 (involvement)
  • PERSON (162), ORGANIZATION (124), GPE (105),
    DATE (102), SUBSTANCE (76)