Title: Information Retrieval and Web Search
1. Information Retrieval and Web Search
- Keyphrase Extraction
- Instructor: Rada Mihalcea
- (Note: most of the slides were adapted from a presentation by Andras Csomai)
2. Keyphrase extraction
- A set of words and phrases that accurately and concisely describes a document
- A dense summary
- Purpose
- Topic/content identification
- Indexing
- Browsing
- Back-of-the-book indexes
- Classification
- Surfing
- . . .
- The practice stems from early Roman times
5. Variations of keyword extraction
- Constraints on what can be selected
- Controlled vocabulary: domain ontology, etc.
- Uncontrolled vocabulary
- Constraints on where it has to be selected from
- Assignment: the keywords are not necessarily present in the document
- Extraction
- On average, 75% of human-expert-assigned keywords are present in the document (depends on the collection)
- Conclusion: extraction is a feasible approach
6. Uses of keyword extraction
- Articles, journals: topic/content identification
- Back of the book indexes
- Amazon SIPs
- Google Books keywords
- NLP and IR applications
- Document classification
- Document clustering
- Keyword-based information retrieval
- Building domain ontologies
7. Keyword extraction for back-of-the-book indexing
- Information likely to be sought by a user
- A guide to concepts, names, and places in a document
8. Evaluation data set
- Use a set of documents with expert-created indexes (gold standard)
- Measure how well the automatically generated keyword set matches the index
- Document collections
- Gutenberg Project
- 56 documents, reduced to 29
- Balanced across multiple domains
- University of California Press
- 259 training and 30 test books
- Only from the humanities
- Extract index entries
9. Index granularities
- Coarse grained
- A shorter index containing only the head expressions
- A more concise version of the index
- Fine grained
- A longer, more detailed index
- More biased towards the indexer's style
10. Evaluation metrics
- Compare the generated index to the Gold Standard
- Only coarse-grained evaluation
- Metrics (a scoring sketch follows below)
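The metric definitions were not preserved in this transcript. A standard way to score a generated keyword set against a gold-standard index is set-based precision, recall, and F1; the sketch below is only illustrative and assumes exact matching of lowercased entries.

```python
def evaluate_keywords(predicted, gold):
    """Set-based precision/recall/F1 of a generated keyword set
    against a gold-standard index (exact match after lowercasing)."""
    predicted = {p.lower() for p in predicted}
    gold = {g.lower() for g in gold}
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```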
11. Automated keyword extraction
- Unsupervised
- No training data required
- Generally portable
- Methods
- Candidate extraction
- Candidate ranking
- TFIDF
- Language model based
- Pre- and post-processing
12. Automated keyword extraction
- Supervised
- Requires a large training corpus (documents and human-expert-extracted keywords)
- Higher accuracy
- Domain/language dependent; poor portability
- Methods
- Many features in common with unsupervised methods
- Additional features (linguistic, semantic)
13. General workflow for keyword extraction
14. Candidate extraction
- N-grams (see the sketch below)
- Sequences of n consecutive words
- Do not cross sentence boundaries
- n up to 4
- The most comprehensive method
- Anything is a possible candidate
- Also the noisiest (generates a candidate set more than twice the size of the document)
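A minimal sketch of n-gram candidate extraction as described above, assuming whitespace tokenization and a naive sentence splitter (both are simplifications; the slide does not specify the tokenizer):

```python
import re

def ngram_candidates(text, max_n=4):
    """Collect all 1..max_n word n-grams as keyphrase candidates,
    without crossing sentence boundaries."""
    candidates = []
    # naive sentence split; a real system would use a proper tokenizer
    for sentence in re.split(r"[.!?]+", text):
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                candidates.append(" ".join(tokens[i:i + n]))
    return candidates
```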
15. Noun phrase chunks
- Observation (Hulth): most of the keywords are noun phrases in the document
- Noun phrase chunks (see the sketch below)
- Generates far less noise than n-grams
- Increases precision, lowers recall
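One possible NP-chunk candidate extractor, here using NLTK's regular-expression chunker; the chunk grammar is an illustrative choice, not necessarily the one used in the slides (requires the punkt and POS-tagger models to be downloaded):

```python
import nltk

def np_chunk_candidates(text):
    """Extract noun phrase chunks as keyphrase candidates using a simple
    regular-expression chunk grammar (optional adjectives + nouns)."""
    chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")
    candidates = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            candidates.append(" ".join(word for word, _ in subtree.leaves()))
    return candidates
```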
16. Named entities
- Observation: many of the keywords/index entries are named entities
- Treat named entities separately
- To weight them differently
- To complement other candidate extraction methods
- LingPipe
- Heuristic named entity recognition (sketched below)
- Capitalised phrases not appearing anywhere else lowercased
- Not at the beginning of a sentence
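A rough sketch of the capitalisation heuristic described above; it is a simplification that works on single tokens rather than phrases, and the input format (sentences as token lists) is an assumption:

```python
def heuristic_named_entities(sentences):
    """Keep capitalised tokens that never occur lowercased elsewhere in the
    document, ignoring sentence-initial positions."""
    lowercased = set()
    capitalised = set()
    for tokens in sentences:                  # sentences = lists of tokens
        for i, tok in enumerate(tokens):
            if tok.islower():
                lowercased.add(tok)
            elif i > 0 and tok[:1].isupper():  # skip sentence-initial words
                capitalised.add(tok)
    return {t for t in capitalised if t.lower() not in lowercased}
```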
17. Candidate extraction performance
18. Filtering methods
- Eliminate stopwords, common words, paraphrases
19. Unsupervised features for candidate ranking
- Informativeness features
- How representative the candidate is of the document
- TFIDF (see the sketch below)
- Information retrieval metric
- Term frequency adjusted by the frequency of words appearing in other documents
- Document frequency values obtained from the British National Corpus
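A sketch of the TFIDF informativeness feature, with document frequencies taken from a background collection such as the BNC; the slide does not fix a weighting variant, so the common tf * log(N/df) form is used here:

```python
import math

def tf_idf(term_count, doc_length, doc_freq, num_docs):
    """TFIDF informativeness score for a candidate phrase.
    term_count: occurrences of the phrase in this document;
    doc_length: total number of tokens in the document;
    doc_freq: number of background documents (e.g. BNC texts) containing it;
    num_docs: size of the background collection."""
    tf = term_count / doc_length
    idf = math.log(num_docs / (1 + doc_freq))  # +1 avoids division by zero
    return tf * idf
```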
20. Chi-squared independence
- Chi-squared independence
- Measures the degree to which two events occur more frequently than they would by chance
- Contingency table (see the sketch below)
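A sketch of the chi-squared statistic computed from a 2x2 contingency table of observed counts; how the cells are defined (e.g. phrase occurrences in the foreground document vs. a background corpus) is an assumption here:

```python
def chi_squared(n11, n12, n21, n22):
    """Chi-squared statistic from a 2x2 contingency table of observed counts.
    A large value means the two events co-occur more often than chance."""
    total = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    chi2 = 0.0
    for observed, expected in (
        (n11, row1 * col1 / total),
        (n12, row1 * col2 / total),
        (n21, row2 * col1 / total),
        (n22, row2 * col2 / total),
    ):
        if expected:
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```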
21. Language model based features
- Language model based
- Pointwise Kullback-Leibler divergence of two language models built on the book and a general corpus (sketched below)
- Background corpus: the BNC collection
- Foreground corpus: the book
- Good-Turing smoothing
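A sketch of the pointwise KL-divergence informativeness score for a single candidate: its contribution to the divergence between the foreground (book) and background (BNC) language models, assuming both probabilities are already smoothed as the slide indicates:

```python
import math

def pointwise_kl(p_foreground, p_background):
    """Pointwise KL divergence contribution of one candidate phrase: how much
    information is lost when the phrase's foreground (book) probability is
    modelled by the background (e.g. BNC) probability."""
    return p_foreground * math.log(p_foreground / p_background)
```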
22. Features for candidate ranking
- Phraseness
- The degree to which a sequence of words can be considered a phrase
- Chi-squared independence of constituent words
- Measures whether they appear together more often than by chance
- Language model based (see the sketch below)
- Information loss between a unigram and a bigram language model
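A sketch of the language-model phraseness score for a two-word candidate: the information loss when the bigram probability is replaced by the independence (unigram) model. The restriction to bigrams and the variable names are assumptions for illustration:

```python
import math

def phraseness_kl(p_bigram, p_word1, p_word2):
    """Pointwise KL divergence between a bigram model and an independence
    (unigram) model for the phrase "w1 w2". High values indicate that the
    words form a phrase rather than co-occurring by chance."""
    return p_bigram * math.log(p_bigram / (p_word1 * p_word2))
```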
23. Ranking performance of features
24. Length of candidates
- In supervised methods, incorporated as a feature
- Enforce an observed distribution on the candidate set
- N-grams
- NP chunks
25. Combination methods
- Combine different candidate extraction and ranking methods
- Each candidate extraction / ranking pair provides a score
- Combine scores using a weighted sum (see the sketch below)
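A minimal sketch of the weighted-sum combination; the score names and weight values below are hypothetical, since the slides do not specify them:

```python
def combined_score(scores, weights):
    """Combine the scores of one candidate phrase coming from several
    (candidate extraction, ranking) pairs with a weighted sum."""
    return sum(weights[name] * score for name, score in scores.items())

# Hypothetical usage; names and values are illustrative only.
scores = {"ngram_tfidf": 0.42, "np_chunk_kl": 0.31, "named_entity": 1.0}
weights = {"ngram_tfidf": 0.5, "np_chunk_kl": 0.3, "named_entity": 0.2}
print(combined_score(scores, weights))
```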
26. Performance of combined models
27. Supervised keyphrase extraction
- A machine learning component is trained to extract keyphrases
- Multiple machine learning algorithms
- Neural networks
- Capable of learning non-linear decision surfaces
- Decision trees
- Insight into the learning mechanism
- Support vector machines
- Capable of handling large amounts of data
- Naïve Bayes (a small sketch follows below)
- Used in many previous keyphrase extraction systems
- Computationally efficient
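A small sketch of the supervised setup with a Naïve Bayes classifier (here scikit-learn's GaussianNB); the feature vectors and values are made up for illustration and do not come from the slides:

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical feature vectors for candidate phrases, e.g.
# [tfidf, chi-squared, phraseness, phrase length, within-document frequency]
X_train = [
    [0.12, 3.4, 0.8, 2, 5],   # a positive (keyphrase) candidate
    [0.01, 0.2, 0.1, 1, 1],   # a negative candidate
    [0.09, 2.9, 0.6, 3, 4],
    [0.02, 0.5, 0.2, 1, 2],
]
y_train = [1, 0, 1, 0]        # 1 = expert-assigned keyphrase, 0 = not

model = GaussianNB()
model.fit(X_train, y_train)

# Rank unseen candidates by the predicted probability of being a keyphrase.
X_new = [[0.10, 3.0, 0.7, 2, 3]]
print(model.predict_proba(X_new)[:, 1])
```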
28. Features for the supervised system
- All features from the unsupervised system
- Construction-integration features
- Number of times a phrase is retained in short-term memory
- Phrase length
- Within-document frequency
- Term and document frequency separately
29. Linguistic feature
- The probability that a phrase with a given part-of-speech pattern is selected as a keyphrase (see the sketch below)
- Estimated on training data
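A sketch of estimating this linguistic feature from training data, assuming each training candidate comes with its POS pattern and a keyphrase/non-keyphrase label (the data layout is an assumption):

```python
from collections import Counter

def pos_pattern_probabilities(candidates):
    """Estimate P(keyphrase | POS pattern) from training data.
    candidates: iterable of (pos_pattern, is_keyphrase) pairs,
    e.g. ("JJ NN", True)."""
    totals, positives = Counter(), Counter()
    for pattern, is_keyphrase in candidates:
        totals[pattern] += 1
        if is_keyphrase:
            positives[pattern] += 1
    return {p: positives[p] / totals[p] for p in totals}
```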
30. Encyclopedic feature
- Previous work reports increased performance using a domain-specific thesaurus
- Wikipedia
- Can be used to extract a general thesaurus
- Estimate the likelihood that a phrase is a good keyphrase independently of the context (see the sketch below)
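One common way to derive such a context-independent estimate from Wikipedia is the ratio of the number of times a phrase is used as a link anchor to the number of times it appears at all; the slide does not show the exact formulation, so this is only a sketch of the idea:

```python
def wikipedia_keyphraseness(anchor_count, occurrence_count):
    """Context-independent likelihood that a phrase is a good keyphrase,
    estimated from Wikipedia: how often the phrase is marked up as a link
    anchor, out of all the times it occurs in Wikipedia articles."""
    if occurrence_count == 0:
        return 0.0
    return anchor_count / occurrence_count
```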
31. Training Data
- 259 books from the UCPress corpus
- N-gram candidate extraction
- Large, unbalanced training set
- 48.5 million negative, 71,853 positive instances
- Heuristic filtering
- 11.5 million negative, 66,349 positive instances
32. Evaluation
- 30 books from the UCPress corpus
33. Learning Curves