Extracting Keyphrases from Books using Language Modeling Approaches - PowerPoint PPT Presentation


1
Extracting Keyphrases from Books using Language
Modeling Approaches

Rohini U
AOL India R&D,
Bangalore India
Rohini.uppuluri_at_corp.aol.com
Vamshi Ambati
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, USA
vamshi_at_cs.cmu.edu

2
Agenda

Keyphrase Extraction
Value addition to Digital Libraries
Methods of Keyphrase Extraction
Related Work
Our Solution

3
What are Keyphrases?

Keyphrases
(Give example)
Where used?
Cataloguing in Libraries for IR purposes
Quick Summarization of documents

4
Why important to ULIB?

Vast growth in digital content
More than a Million books!
Short metadata descriptions are useful to users
while reading
Also useful for further processing of books, such as
summarization and IR

5
How do we extract KPs?

Manual entry
Reliable, high quality outcome
But, time-consuming, expensive
Automatic
Fast extraction but less reliable
No expense at all

6
Automatic techniques for KPE

Rule based methods
Heuristics (paragraph beginning, headline etc)
Krulwich & Burkey, etc.
Using Linguistic tools
Statistical techniques
Term counts and weighting based Methods
Learn model from training data
Turney [5], KEA [6], KPSpotter [3], etc.

7
Requirements for a KPE for ULIB

Automatic Identification of Keyphrases from
chapters of books
Language independent
Easily adaptable for different domains
No training data to learn from
Most books in ULIB do not have keywords as part
of the metadata

8
Solution Outline

Language Modeling based
Given n-grams
Measure Informativeness, Phraseness
Score n-grams based on the above measures
Pick top K phrases as Keyphrases

9
Extracting Keyphrases from Books
Pipeline: Text -> Cleaning & Initialization -> Candidate Keyphrase Extraction -> Scoring -> Pruning -> Extracted Keyphrases
10
Extracting Keyphrases from Books

Topics are also used to construct user profiles
via explicit specification of interests or
automatic analysis of Web pages visited

topics construct user profiles explicit
specification interests automatic analysis web
pages visited

Extracted Keyphrases
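
The cleaning step on this slide (lowercasing and dropping function words) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `STOPWORDS` set and the `clean` helper are hypothetical and cover only the words removed in the slide's example. A language-independent system would need a stopword list or frequency cutoff per language.

```python
import re

# Hypothetical stopword list, chosen only to cover the slide's example sentence.
STOPWORDS = {"are", "also", "used", "to", "via", "of", "or"}

def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

sentence = ("Topics are also used to construct user profiles via explicit "
            "specification of interests or automatic analysis of Web pages visited")
cleaned = clean(sentence)
print(" ".join(cleaned))
```

On the example sentence this reproduces the cleaned token sequence shown on the slide.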
11
Extracting Keyphrases from Books

Topics are also used to construct user profiles via
explicit specification of interests or automatic
analysis of Web pages visited

topics construct user profiles explicit
specification interests automatic analysis web
pages visited

topics construct user, construct user profiles,
user profiles explicit, profiles explicit specification,
explicit specification interests, specification interests automatic,
automatic analysis web, analysis web pages, web pages visited

Extracted Keyphrases
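
The candidates on this slide look like word trigrams taken with a sliding window over the cleaned text. A minimal sketch under that assumption follows; the `ngrams` helper is a hypothetical name, and the window size 3 is inferred from the example rather than stated in the presentation.

```python
def ngrams(tokens, n=3):
    """Return every contiguous run of n tokens, joined into a phrase string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ("topics construct user profiles explicit specification interests "
          "automatic analysis web pages visited").split()
candidates = ngrams(tokens, 3)
for phrase in candidates:
    print(phrase)
```

A sliding window over the 12 cleaned tokens yields 10 trigrams, including "interests automatic analysis", which appears in the scored list on the next slide.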
12
Extracting Keyphrases from Books

Topics are also used to construct user profiles via
explicit specification of interests or automatic
analysis of Web pages visited

topics construct user profiles explicit
specification interests automatic analysis web
pages visited

profiles explicit specification 0.0281
explicit specification interests 0.0281
specification interests automatic 0.0272
user profiles explicit 0.0260
construct user profiles 0.0260
interests automatic analysis 0.0255
topics construct user 0.0243
automatic analysis web 0.0227
web pages visited 0.0226
analysis web pages 0.0217

Extracted Keyphrases
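
Given scored candidates like those on this slide, pruning to the top K phrases is a sort and a cut. The sketch below uses the scores listed on the slide; the `prune` helper and the choice K = 3 are illustrative assumptions, not values from the presentation.

```python
# Scores as listed on the slide for the example sentence.
scores = {
    "profiles explicit specification": 0.0281,
    "explicit specification interests": 0.0281,
    "specification interests automatic": 0.0272,
    "user profiles explicit": 0.0260,
    "construct user profiles": 0.0260,
    "interests automatic analysis": 0.0255,
    "topics construct user": 0.0243,
    "automatic analysis web": 0.0227,
    "web pages visited": 0.0226,
    "analysis web pages": 0.0217,
}

def prune(scores, k):
    """Keep the k highest-scoring phrases; ties keep their original order."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [phrase for phrase, _ in ranked[:k]]

top = prune(scores, 3)
print(top)
```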
13
Scoring

Phraseness
Measures degree to which a given n-gram can be
considered a phrase
Based on Co-occurrence of words
Example..
Informativeness
Measures how informative a given n-gram is
Uninformative examples: "there is a", "a lot of", etc.
Compares co-occurrence in a general corpus vs. the
given text (book)
Total Score
Combines Phraseness-Score and Informativeness-Score
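
The slide does not preserve how the two scores are combined. One simple option, used here purely as an assumption, is to add them. The numbers below are toy values invented for illustration and are not taken from the presentation.

```python
def total_score(phraseness, informativeness):
    """Combine the two measures; a plain sum is an assumption made here."""
    return phraseness + informativeness

# Toy (phraseness, informativeness) pairs, invented for illustration only.
candidates = {
    "user profiles": (0.40, 0.04),
    "there is": (0.35, -0.004),
}
ranked = sorted(candidates,
                key=lambda c: total_score(*candidates[c]),
                reverse=True)
print(ranked)
```

With a sum, a phrase that is cohesive but common in general text ("there is") is pulled down by its low informativeness.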

14
Scoring - Phraseness

Computed by measuring distance between unigram
model and N-gram model
Pointwise KL-divergence (Tomokiyo and Hurst, 2003)
$\delta_w(p \,\|\, q) = p(w)\,\log\frac{p(w)}{q(w)}$
Phraseness measure
$\delta_w(LM_{fg}^{N} \,\|\, LM_{fg}^{1})$
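
A rough sketch of the phraseness measure, assuming unsmoothed maximum-likelihood estimates from the foreground text: the N-gram model probability is the relative frequency of the phrase among all N-grams, and the unigram model probability is the product of the word frequencies. The function names are hypothetical, and a real system would likely smooth both models.

```python
import math
from collections import Counter

def pointwise_kl(p, q):
    """Pointwise KL-divergence: p * log(p / q)."""
    return p * math.log(p / q)

def phraseness(phrase, tokens):
    """Compare the phrase's N-gram probability with its unigram-model probability."""
    words = tuple(phrase.split())
    n = len(words)
    ngram_counts = Counter(tuple(tokens[i:i + n])
                           for i in range(len(tokens) - n + 1))
    if ngram_counts[words] == 0:
        return 0.0  # phrase never occurs; treat 0 * log(0) as 0
    unigram_counts = Counter(tokens)
    p = ngram_counts[words] / sum(ngram_counts.values())
    q = math.prod(unigram_counts[w] / len(tokens) for w in words)
    return pointwise_kl(p, q)

tokens = "new york is big and new york is old".split()
print(phraseness("new york", tokens), phraseness("big and", tokens))
```

In this toy text, "new york" recurs as a unit and scores higher than the one-off bigram "big and".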

15
Scoring - Informativeness

Computed by measuring distance between n-gram
model from given data and n-gram model from
general data
Pointwise KL-divergence (Tomokiyo and Hurst, 2003)
$\delta_w(p \,\|\, q) = p(w)\,\log\frac{p(w)}{q(w)}$
Informativeness measure
$\delta_w(LM_{fg}^{1} \,\|\, LM_{bg}^{1})$
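
A minimal sketch of the informativeness measure with unigram models, as in the formula above: the phrase probability under each model is the product of its word probabilities, estimated from the foreground text (the book) and a background corpus. Add-one smoothing over the joint vocabulary is an assumption added here so that words missing from the background do not produce a zero denominator; the function names and toy texts are hypothetical.

```python
import math
from collections import Counter

def unigram_prob(word, counts, total, vocab_size):
    """Add-one smoothed unigram probability (smoothing is an assumption)."""
    return (counts[word] + 1) / (total + vocab_size)

def informativeness(phrase, fg_tokens, bg_tokens):
    """Pointwise KL between foreground and background unigram models."""
    fg, bg = Counter(fg_tokens), Counter(bg_tokens)
    vocab_size = len(set(fg) | set(bg))
    p = q = 1.0
    for w in phrase.split():
        p *= unigram_prob(w, fg, len(fg_tokens), vocab_size)
        q *= unigram_prob(w, bg, len(bg_tokens), vocab_size)
    return p * math.log(p / q)

# Toy texts invented for illustration only.
fg_tokens = "user profiles describe user interests there is a user".split()
bg_tokens = "there is a lot of text and there is a lot of noise".split()
print(informativeness("user profiles", fg_tokens, bg_tokens))
print(informativeness("there is", fg_tokens, bg_tokens))
```

Phrases that are more frequent in the book than in general text score positive; phrases such as "there is", which are more typical of the background, score negative.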

16
Extracting Keyphrases from Books

Topics are also used to construct user profiles via
explicit specification of interests or automatic
analysis of Web pages visited

topics construct user profiles explicit
specification interests automatic analysis web
pages visited

profiles explicit specification
explicit specification interests
specification interests automatic
user profiles explicit
construct user profiles
interests automatic analysis
topics construct user
automatic analysis web
web pages visited
analysis web pages

Extracted Keyphrases
19
Conclusions and Future Work

Discussed benefits of Keyphrases in ULIB context
Demonstrated the building of a KPE that works for
books
Robust evaluation
Building a test set from books in ULIB for
generic robust evaluation of KPE tools
Are chapters really independent in a book?
Revisit the assumption

Thank you

21
References

1. Fred J. Damerau. Generating and evaluating
domain-oriented multi-word terms from texts.
Information Processing and Management,
29(4):433-447, 1993.
2. S.T. Dumais, J. Platt, D. Heckerman, and M. Sahami.
Inductive learning algorithms and representations
for text categorization. In Proceedings of the
7th International Conference on Information and
Knowledge Management, pages 148-155. ACM Press,
1998.
3. Min Song, Il-Yeol Song, and Xiaohua Hu.
KPSpotter: a flexible information gain-based
keyphrase extraction system. In WIDM '03:
Proceedings of the 5th ACM International Workshop
on Web Information and Data Management, pages
50-53, New York, NY, USA, 2003. ACM Press.
4. Takashi Tomokiyo and Matthew Hurst. A language
modeling approach to keyphrase extraction. In
Proceedings of the ACL 2003 Workshop on Multiword
Expressions, pages 33-40, Morristown, NJ, USA,
2003. Association for Computational Linguistics.
5. P.D. Turney. Learning algorithms for keyphrase
extraction. Information Retrieval, 2(4):303-336,
2000.
6. I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin,
and C.G. Nevill-Manning. KEA: Practical automatic
keyphrase extraction. In E. A. Fox and N. Rowe,
editors, Proceedings of Digital Libraries '99: The
Fourth ACM Conference on Digital Libraries, pages
254-255. ACM Press, 1999.
7. Mikio Yamamoto and Kenneth W. Church. Using
suffix arrays to compute term frequency and
document frequency for all substrings in a
corpus. Computational Linguistics, 27(1):1-30,
2001.