Title: Automating Keyphrase Extraction with Multi-Objective Genetic Algorithms (MOGA)
1Automating Keyphrase Extraction with
Multi-Objective Genetic Algorithms (MOGA)
- Jia-Long Wu
- Alice M. Agogino
- Berkeley Expert System Laboratory
- U.C. Berkeley
2Outline
- Role of Keyphrases
- Phrase Extraction Algorithms
- Phrase Extraction with Multi-Objective Genetic Algorithm
- Experiment and Results
- Results Evaluation
- Conclusion
- Future Research
3Role of Keyphrases
- Concept representations
- Document indexing
- Enhance document retrieval / Browsing
- Query formulation assistance
- Document surrogates
4Vision of Unified Language System
Context Mapping Mechanism
Semantic Network
Unified Language System for Engineering Design
5Keyphrase Extraction Algorithms
- Heuristic, Syntactic, Machine Learning
- Requires prior training
- Heuristic cut-off thresholds in number of phrases
- Focuses on single document
- Redundancy when aggregated for the whole document
collection
6Keyphrase Extraction with MOGA
- Phrase extraction as an optimization problem
- Candidate phrases generation
- Optimize phrase selection with MOGA
- Model Genetic Operators
[Figure: Crossover, showing phenotype and genotype for parents and offspring]
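The genotype/phenotype mapping and the crossover operator named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary genotype in which bit i marks whether candidate phrase i is selected, and it uses single-point crossover because the slides do not say which crossover variant was used. The phrase list and all function names are hypothetical.

```python
import random

# Hypothetical candidate phrases; the slides do not list the real ones.
CANDIDATES = ["design theory", "genetic algorithm", "keyphrase", "semantic network"]

def decode(genotype, candidates=CANDIDATES):
    """Map a bit-string genotype to its phenotype: the selected phrases."""
    return [phrase for bit, phrase in zip(genotype, candidates) if bit == 1]

def single_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length parents at a random cut point."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have equal length")
    cut = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the string
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b
```

For example, `decode([1, 0, 1, 0])` selects the first and third candidate phrases, and crossing two parents always yields two offspring of the same length as the parents.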
7Keyphrase Extraction with MOGA
- Optimize phrase selection with MOGA (cont.)
- Model Genetic Operators (cont.)
- Evaluation fitness functions
- Minimize clustering measure / dispersion (Bookstein 98)
- Minimize number of phrases
- Non-Dominated Sorting Genetic Algorithm (NSGA-II)
[Figure: Mutation, illustrated on a binary chromosome (bits shown: 1 0 0 1 0 1 1 0 1 0)]
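The mutation operator and the two-objective comparison behind NSGA-II can be sketched as follows. This is a simplified illustration under stated assumptions: mutation is taken to be an independent bit flip, the objective values (dispersion, number of phrases) are assumed to be computed elsewhere because the Bookstein dispersion measure is not reproduced here, and only the first non-dominated front is extracted. Full NSGA-II also ranks the remaining fronts and applies crowding-distance selection, which this sketch omits. All function names are hypothetical.

```python
import random

def bit_flip_mutation(genotype, rate, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in genotype]

def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def first_front(objectives):
    """Indices of the non-dominated points in a list of objective tuples."""
    return [
        i for i, a in enumerate(objectives)
        if not any(dominates(b, a) for j, b in enumerate(objectives) if j != i)
    ]
```

With objective tuples of the form `(dispersion, number_of_phrases)`, `first_front` returns the candidates that would appear on a Pareto plot of the two objectives.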
8Experiment and Results
- Data set
- 34 papers from Design Theory and Methodology Conference 01
- Candidate phrases
- 5000 noun phrases extracted
- Genetic Algorithm Parameters
- Population size 100
- Converges at 5000 generations
- 5 hours on Xeon 1.8GHz CPU
9Experiment and Results
Pareto plot of Dispersion versus Number of Phrases
10Experiment and Results
Histogram of the number of optimal solutions in which
a keyphrase appears
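The counts behind such a histogram can be sketched as follows. This is a minimal illustration, assuming each Pareto-optimal solution is available as a list of its selected phrases; the function name and the input format are hypothetical and are not taken from the slides.

```python
from collections import Counter

def phrase_frequencies(solutions):
    """Count, for each phrase, the number of solutions that contain it."""
    counts = Counter()
    for phrases in solutions:
        counts.update(set(phrases))  # set() so a phrase counts once per solution
    return counts
```

Sorting the result with `counts.most_common()` lists the phrases that recur in the most optimal solutions first.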
11Evaluation
12Evaluation
- 6 domain experts participated in the evaluation.
- Core phrases vs. Non-core phrases.
- Less than 10 are deemed irrelevant.
- Significant deviation between evaluators.
13Conclusion
- Keyphrase extraction can be successfully
implemented as a multi-objective global
optimization problem.
- Reasonably good keyphrases can be extracted
without prior training or domain knowledge.
- Trade-off information between objectives such as
number of phrases vs. average quality of phrases
can be gained from Pareto solutions.
- Preferences can be made based on the user needs
and trade-off information.
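One way a user preference could be applied to the Pareto solutions is sketched below. This is only an illustration of the idea, assuming the preference is a cap on the number of phrases and that each Pareto point is a `(number_of_phrases, dispersion)` pair with dispersion minimized; the slides do not specify how preferences were expressed, and the function name is hypothetical.

```python
def pick_within_budget(front, max_phrases):
    """Return the (number_of_phrases, dispersion) point with the lowest
    dispersion among those within the phrase budget, or None if none fit."""
    feasible = [point for point in front if point[0] <= max_phrases]
    return min(feasible, key=lambda point: point[1]) if feasible else None
```

Other preference rules, such as a cap on dispersion or a weighted combination of the two objectives, would follow the same pattern of filtering and ranking the Pareto points.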
14Future Research
- Test on larger text collection.
- Implement extracted keyphrases in IR system as
browsing and query expansion tool and compare to
full-text search IR system.
- Evaluate with more raters and 1-5 scale.
- Build domain thesauri with extracted keyphrases
and semantic discovery algorithms (e.g. Latent
Semantic Analysis).
15Metathesaurus in Digital Library
16Thank you!
- Comments? Questions?
- jialong_at_me.berkeley.edu
- aagogino_at_me.berkeley.edu
17Mode Analysis of Scaled Evaluation