Title: Extracting Keyphrases from Books using Language Modeling Approaches
1Extracting Keyphrases from Books using Language Modeling Approaches
- Rohini U
- AOL India RD,
- Bangalore India
- Rohini.uppuluri_at_corp.aol.com
- Vamshi Ambati
- Language Technologies Institute
- Carnegie Mellon University
- Pittsburgh, USA
- vamshi_at_cs.cmu.edu
2Agenda
- Keyphrase Extraction
- Value addition to Digital Libraries
- Methods of Keyphrase Extraction
- Related Work
- Our Solution
3What are Keyphrases?
- Keyphrases
- (Give example)
- Where used?
- Cataloguing in Libraries for IR purposes
- Quick Summarization of documents
4Why important to ULIB?
- Vast growth in digital content
- More than a Million books!
- Short metadata description useful to the user while reading
- For further processing of books, such as summarization, IR, etc.
5How do we extract KPs?
- Manual entry
- Reliable, high quality outcome
- But, time-consuming, expensive
- Automatic
- Fast extraction but less reliable
- No expense at all
6Automatic techniques for KPE
- Rule based methods
- Heuristics (paragraph beginning, headline etc)
- Krulwich and Burkey, etc.
- Using Linguistic tools
- Statistical techniques
- Term counts and weighting based Methods
- Learn model from training data
- Turney [5], KEA [6], KPSpotter [3], etc.
7Requirements for a KPE for ULIB
- Automatic identification of keyphrases from chapters of books
- Language independent
- Easily adaptable for different domains
- No training data to learn from
- Most books in ULIB do not have keywords as part
of the metadata
8Solution Outline
- Language Modeling based
- Given n-grams
- Measure Informativeness, Phraseness
- Score n-grams based on the above measures
- Pick top K phrases as Keyphrases
9Extracting Keyphrases from Books
Text
Cleaning Initialization
Candidate Keyphrases Extraction
Scoring
Pruning
Extracted Keyphrases
10Extracting Keyphrases from Books
- Topics are also used to construct user profiles
via explicit specification of interests or
automatic analysis of Web pages visited
topics construct user profiles explicit
specification interests automatic analysis web
pages visited
Extracted Keyphrases
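The cleaning step on this slide can be sketched as follows. This is a minimal sketch, not the authors' implementation: the stopword list is hypothetical and was chosen only so that the slide's example sentence reduces to the cleaned text shown above, and the tokenization rule (lowercase, alphabetic tokens only) is likewise an assumption.

```python
import re

# Hypothetical stopword list (an assumption): just enough to reproduce
# the cleaned example shown on the slide.
STOPWORDS = {"are", "also", "used", "to", "via", "of", "or"}


def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


sentence = (
    "Topics are also used to construct user profiles via explicit "
    "specification of interests or automatic analysis of Web pages visited"
)
cleaned = clean(sentence)
print(" ".join(cleaned))
```

A real system would use a full stopword list for the target language; keeping that list as the only language-specific resource fits the language-independence requirement stated earlier.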
11Extracting Keyphrases from Books
- Topics are also used to construct user profiles via
explicit specification of interests or automatic
analysis of Web pages visited
topics construct user profiles explicit
specification interests automatic analysis web
pages visited
topics construct user
construct user profiles
user profiles explicit
profiles explicit specification
explicit specification interests
specification interests automatic
interests automatic analysis
automatic analysis web
analysis web pages
web pages visited
Extracted Keyphrases
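Candidate extraction on this slide amounts to sliding a fixed-size window over the cleaned tokens. The sketch below assumes trigrams only, as in the slide's example; the function name `ngrams` is illustrative and not taken from the original system.

```python
def ngrams(tokens, n=3):
    """Return every contiguous n-gram in tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = (
    "topics construct user profiles explicit specification "
    "interests automatic analysis web pages visited"
).split()

candidates = ngrams(tokens, n=3)
for phrase in candidates:
    print(phrase)
```

With 12 cleaned tokens the window yields 10 trigram candidates.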
12Extracting Keyphrases from Books
- Topics are also used to construct user profiles via
explicit specification of interests or automatic
analysis of Web pages visited
topics construct user profiles explicit
specification interests automatic analysis web
pages visited
profiles explicit specification 0.0281
explicit specification interests 0.0281
specification interests automatic 0.0272
user profiles explicit 0.0260
construct user profiles 0.0260
interests automatic analysis 0.0255
topics construct user 0.0243
automatic analysis web 0.0227
web pages visited 0.0226
analysis web pages 0.0217
Extracted Keyphrases
13Scoring
- Phraseness
- Measures degree to which a given n-gram can be considered a phrase
- Based on co-occurrence of words
- Example..
- Informativeness
- Measures how informative a given n-gram is
- Uninformative n-grams: "there is a", "a lot of", etc.
- Compares co-occurrence in a general corpus vs. the given text (book)
- Total Score
- Phraseness-Score + Informativeness-Score
14Scoring - Phraseness
- Computed by measuring distance between unigram model and N-gram model
- Pointwise KL-divergence (Tomokiyo and Hurst, 2003)
- $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
- Phraseness measure
- $\delta_w(LM_{fg}^{N} \,\|\, LM_{fg}^{1})$
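The phraseness measure can be illustrated with a small sketch. It assumes unsmoothed maximum-likelihood estimates from the foreground text: p is the relative frequency of the phrase among all n-grams of the same length, and q is the product of its words' unigram probabilities. These estimation choices and the toy token list are assumptions for illustration, not the authors' setup.

```python
import math
from collections import Counter


def pointwise_kl(p, q):
    """Pointwise KL-divergence: p * log(p / q)."""
    return p * math.log(p / q)


def phraseness(phrase, tokens):
    """Score a phrase by comparing its n-gram and unigram probabilities."""
    words = phrase.split()
    n = len(words)
    unigram = Counter(tokens)
    ngram = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    p = ngram[phrase] / sum(ngram.values())  # N-gram model probability
    if p == 0:
        return 0.0  # unseen phrase: 0 * log(0) taken as 0
    q = math.prod(unigram[w] / len(tokens) for w in words)  # unigram model
    return pointwise_kl(p, q)


# Toy foreground text (made up for this example).
toy = "new york is big new york is old".split()
print(phraseness("new york", toy))
print(phraseness("big new", toy))
```

In this toy text "new york" scores higher than "big new" because it occurs twice as a unit, while its words are no more frequent than chance would predict for a pair.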
15Scoring - Informativeness
- Computed by measuring distance between n-gram model from given data and n-gram model from general data
- Pointwise KL-divergence (Tomokiyo and Hurst, 2003)
- $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
- Informativeness measure
- $\delta_w(LM_{fg}^{1} \,\|\, LM_{bg}^{1})$
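The informativeness measure compares the same phrase under a foreground model (the book) and a background model (general text). The sketch below uses unigram models on both sides, as the formula on this slide does. Add-one smoothing is an assumption introduced here so that words missing from one corpus do not produce a zero probability; both toy corpora are made up for illustration.

```python
import math
from collections import Counter


def unigram_prob(phrase, counts, total, vocab_size):
    """Add-one smoothed unigram probability of a phrase (smoothing is an assumption)."""
    return math.prod(
        (counts[w] + 1) / (total + vocab_size) for w in phrase.split()
    )


def informativeness(phrase, fg_tokens, bg_tokens):
    """Pointwise KL-divergence between foreground and background unigram models."""
    fg, bg = Counter(fg_tokens), Counter(bg_tokens)
    vocab_size = len(set(fg) | set(bg))
    p = unigram_prob(phrase, fg, len(fg_tokens), vocab_size)
    q = unigram_prob(phrase, bg, len(bg_tokens), vocab_size)
    return p * math.log(p / q)


# Toy corpora (made up): a "book" about user profiles and generic background text.
fg_text = "user profiles describe user interests and a lot of user profiles".split()
bg_text = "there is a lot of rain and a lot of sun and a lot of wind".split()

print(informativeness("user profiles", fg_text, bg_text))
print(informativeness("a lot", fg_text, bg_text))
```

A phrase that is frequent in the book but rare in general text gets a positive score, while a generic phrase such as "a lot" scores below zero in this example.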
16Extracting Keyphrases from Books
- Topics are also used to construct user profiles via
explicit specification of interests or automatic
analysis of Web pages visited
topics construct user profiles explicit
specification interests automatic analysis web
pages visited
profiles explicit specification 0.0281
explicit specification interests 0.0281
specification interests automatic 0.0272
user profiles explicit 0.0260
construct user profiles 0.0260
interests automatic analysis 0.0255
topics construct user 0.0243
automatic analysis web 0.0227
web pages visited 0.0226
analysis web pages 0.0217
Extracted Keyphrases
17Extracting Keyphrases from Books
- Topics are also used to construct user profiles via
explicit specification of interests or automatic
analysis of Web pages visited
topics construct user profiles explicit
specification interests automatic analysis web
pages visited
profiles explicit specification
explicit specification interests
specification interests automatic
user profiles explicit
construct user profiles
interests automatic analysis
topics construct user
automatic analysis web
web pages visited
analysis web pages
Extracted Keyphrases
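The Solution Outline slide describes picking the top K scored phrases as keyphrases. The sketch below assumes that pruning means ranking by score and truncating; the slides do not give a value for K, so `k=3` is a hypothetical choice. The scores are the ones shown on the scoring slide.

```python
# Scores copied from the scoring slide.
scores = {
    "profiles explicit specification": 0.0281,
    "explicit specification interests": 0.0281,
    "specification interests automatic": 0.0272,
    "user profiles explicit": 0.0260,
    "construct user profiles": 0.0260,
    "interests automatic analysis": 0.0255,
    "topics construct user": 0.0243,
    "automatic analysis web": 0.0227,
    "web pages visited": 0.0226,
    "analysis web pages": 0.0217,
}


def top_k(scored, k):
    """Return the k highest-scoring phrases; ties keep their input order."""
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in ranked[:k]]


keyphrases = top_k(scores, k=3)  # k=3 is a hypothetical cut-off
for phrase in keyphrases:
    print(phrase)
```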
19Conclusions and Future Work
- Discussed benefits of Keyphrases in ULIB context
- Demonstrated the building of a KPE that works for books
- Robust evaluation
- Building a test set from books in ULIB for generic robust evaluation of KPE tools
- Are chapters really independent in a book?
- Revisit the assumption
21References
- [1] Fred J. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 29(4):433-447, 1993.
- [2] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148-155. ACM Press, 1998.
- [3] Min Song, Il-Yeol Song, and Xiaohua Hu. KPSpotter: a flexible information gain-based keyphrase extraction system. In WIDM '03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pages 50-53, New York, NY, USA, 2003. ACM Press.
- [4] Takashi Tomokiyo and Matthew Hurst. A language modeling approach to keyphrase extraction. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 33-40, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
- [5] P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-336, 2000.
- [6] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In E. A. Fox and N. Rowe, editors, Proceedings of Digital Libraries 99: The Fourth ACM Conference on Digital Libraries, pages 254-255. ACM Press, 1999.
- [7] Mikio Yamamoto and Kenneth W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1-30, 2001.