Title: Extracting Key Terms From Noisy and Multi-theme Documents
1 Extracting Key Terms From Noisy and Multi-theme Documents
- Maria Grineva, Maxim Grinev and Dmitry Lizorkin
- Proceedings of the 18th International World Wide Web Conference (WWW 2009), ACM
Speaker: Chien-Liang Wu
2 Outline
- Motivation
- Framework of Key Terms Extraction
- Candidate Terms Extraction
- Word Sense Disambiguation
- Building Semantic Graph
- Discovering Community Structure of the Semantic Graph
- Selecting Valuable Communities
- Experiments
- Conclusions
3 Motivation
- Key Terms Extraction
  - A basic step for various NLP tasks
    - Document classification
    - Document clustering
    - Text summarization
- Challenges
  - Web pages are typically noisy
    - Side bars/menus, comments, ...
  - Dealing with multi-theme Web pages
    - Portal home pages
4 Motivation (cont.)
- State-of-the-art approaches to key terms extraction
  - Based on statistical learning
    - TF×IDF model, keyphrase frequency, ...
  - Require a training set
- Approach in this paper
  - Based on analyzing syntactic or semantic term relatedness within a document
  - Compute semantic relatedness between terms using Wikipedia
  - Model the document as a semantic graph of terms and apply graph analysis techniques to it
  - No training set required
5 Framework
- Pipeline: candidate terms extraction → word sense disambiguation → building the semantic graph → discovering the community structure of the semantic graph → selecting valuable communities
6 Candidate Terms Extraction
- Goal
  - Extract all terms from the document
  - For each term, prepare a set of Wikipedia articles that can describe its meaning
- Parse the input document and extract all possible n-grams
- For each n-gram (and its morphological variations), provide a set of Wikipedia article titles (see the sketch below)
  - "drinks", "drinking", "drink" → Wikipedia articles "Drink", "Drinking"
- Avoid nonsense phrases appearing in the results
  - "using", "electric cars are", ...
7 Word Sense Disambiguation
- Goal
  - Choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step
- Reference
  - Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation. SYRCoDIS, 2008
8 Word Sense Disambiguation (contd.)
- Example text
  - "Jigsaw is W3C's open-source project that started in May 1996. It is a web server platform that provides a sample HTTP 1.1 implementation and ..."
- Ambiguous term: platform
- Four Wikipedia concepts around this word: open-source, web server, HTTP, and implementation
9 Word Sense Disambiguation (contd.)
- A neighbor of an article
  - All Wikipedia articles connected to the original article by an incoming or outgoing link
- Each term is assigned a single Wikipedia article that describes its meaning (see the sketch below)
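A hedged sketch of this disambiguation step, assuming a relatedness(a, b) function over Wikipedia articles (for example, the link-based measure from the SYRCoDIS 2008 reference) and the candidate sets produced by the previous step; terms with a single candidate serve as the context against which ambiguous terms are scored.

```python
def disambiguate(candidates_by_term, relatedness):
    """Assign each term the candidate Wikipedia article that is most
    related, on average, to the context concepts, i.e. the articles of
    terms that have only one candidate.  relatedness(a, b) is assumed
    to return a semantic relatedness score for two article titles.
    """
    context = [next(iter(arts))
               for arts in candidates_by_term.values() if len(arts) == 1]
    resolved = {}
    for term, articles in candidates_by_term.items():
        if len(articles) == 1 or not context:
            resolved[term] = next(iter(articles))  # nothing to disambiguate against
        else:
            resolved[term] = max(
                articles,
                key=lambda a: sum(relatedness(a, c) for c in context))
    return resolved  # each term now has a single Wikipedia article
```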
10 Building Semantic Graph
- Goal
  - Build the document's semantic graph using semantic relatedness between terms
- The semantic graph is a weighted graph
  - Vertex: a term
  - Edge: connects two semantically related terms
  - Edge weight: the semantic relatedness measure of the two terms
11 Building Semantic Graph (contd.)
- Using the Dice measure for Wikipedia-based semantic relatedness (reference: SYRCoDIS, 2008), with weights for various link types (see the sketch below)
- Dice(A, B) = 2 |n(A) ∩ n(B)| / (|n(A)| + |n(B)|), where n(A) is the set of neighbors of article A
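A sketch of the relatedness measure and graph construction, using networkx for the graph (an implementation choice, not prescribed by the paper). neighbors is a hypothetical mapping from a Wikipedia article to its set of linked neighbor articles; the per-link-type weights mentioned on the slide are omitted for simplicity.

```python
import networkx as nx

def dice_relatedness(n_a, n_b):
    """Dice coefficient over link neighborhoods:
    2 * |n(A) & n(B)| / (|n(A)| + |n(B)|)."""
    if not n_a or not n_b:
        return 0.0
    return 2.0 * len(n_a & n_b) / (len(n_a) + len(n_b))

def build_semantic_graph(resolved, neighbors):
    """Vertices are terms; an edge connects two semantically related
    terms and is weighted by the relatedness of their articles."""
    g = nx.Graph()
    g.add_nodes_from(resolved)
    terms = list(resolved)
    for i, t1 in enumerate(terms):
        for t2 in terms[i + 1:]:
            w = dice_relatedness(neighbors[resolved[t1]],
                                 neighbors[resolved[t2]])
            if w > 0:
                g.add_edge(t1, t2, weight=w)
    return g
```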
12 Detecting Community Structure of the Semantic Graph using the Newman Algorithm
- Example: the semantic graph of the news article "Apple to Make iTunes More Accessible For the Blind", split into term communities (see the sketch below)
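A sketch of the community detection step with the Girvan-Newman implementation in networkx, keeping the partition of highest weighted modularity; the slide names the Newman algorithm but not the stopping criterion, so the modularity-based choice here is an assumption.

```python
from networkx.algorithms import community

def detect_communities(g):
    """Split the weighted semantic graph into term communities.

    Note: girvan_newman removes edges by (unweighted) edge betweenness
    by default; pass most_valuable_edge to take weights into account.
    """
    best, best_q = (set(g.nodes),), float("-inf")
    for partition in community.girvan_newman(g):
        q = community.modularity(g, partition, weight="weight")
        if q > best_q:
            best, best_q = partition, q
    return [set(c) for c in best]
```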
13 Selecting Valuable Communities
- Goal: rank term communities so that
  - the higher-ranked communities contain key terms
  - the lower-ranked communities contain unimportant terms and possible disambiguation mistakes
- Use
  - Density of community Ci
14 Selecting Valuable Communities (contd.)
- Informativeness of community Ci
  - Based on the ratio count(D_link) / count(D_term) of each term, where
    - count(D_link) is the number of Wikipedia articles in which this term appears as a link
    - count(D_term) is the total number of articles in which it appears
  - Gives higher values to named entities (for example, Apple Inc., Steve Jobs, Braille) than to general terms (Consumer, Agreement, Information)
- Community rank = density × informativeness (see the sketch below)
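A hedged sketch of the ranking. The exact density and informativeness formulas are not reproduced on these slides, so the normalizations below are assumptions: density as the total inner edge weight divided by community size, and informativeness as the average per-term ratio count(D_link) / count(D_term).

```python
def community_rank(g, comm, keyphraseness):
    """rank = density * informativeness.

    keyphraseness[t] is assumed to hold count(D_link) / count(D_term)
    for term t, precomputed from Wikipedia.
    """
    sub = g.subgraph(comm)
    inner_weight = sum(d["weight"] for _, _, d in sub.edges(data=True))
    density = inner_weight / len(comm)                   # assumed normalization
    informativeness = sum(keyphraseness[t] for t in comm) / len(comm)
    return density * informativeness
```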
15 Selecting Valuable Communities (contd.)
- The decline (a sharp drop in community rank) is the border between important and non-important term communities (see the sketch below)
- For the test collection, the decline coincides with the maximum F-measure in 73% of the cases
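A sketch of the selection step under the assumption that the decline is the steepest drop between consecutive community ranks; everything ranked above that drop is kept as a valuable community.

```python
def select_valuable_communities(communities, rank):
    """Keep the communities that come before the steepest rank drop."""
    ranked = sorted(communities, key=rank, reverse=True)
    if len(ranked) < 2:
        return ranked
    scores = [rank(c) for c in ranked]
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cut = drops.index(max(drops)) + 1  # border between important and unimportant
    return ranked[:cut]
```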
16 Experiment
- Noise-free dataset
  - 252 posts from 5 technical blogs
- 22 annotators took part in this experiment
  - Each document was analyzed by 5 different annotators
  - Each annotator identified 5-10 key terms per document
  - A key term was considered valid if at least two participants identified it
- For each document, two sets of key terms were built
  - Uncontrolled key terms
  - Controlled key terms: must match a Wikipedia article title
- In total, 2,009 key terms were obtained; 93% of them are Wikipedia article titles
17 Evaluation
18 Results
- Revision of precision and recall
  - The communities-based method extracts more related terms in each thematic group than a human does, giving better term coverage
  - Each participant reviewed the automatically extracted key terms and, where possible, extended his manually identified key terms
  - 389 additional manually selected key terms
  - Precision: 46.1%, recall: 67.7% (see the standard computation below)
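For reference, a standard per-document precision/recall/F-measure computation of the automatically extracted key terms against the manually identified ones (not the authors' evaluation code):

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall and F-measure of extracted key terms against
    the manually identified (gold) key terms of a document."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```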
19 Evaluation on Web Pages
- 509 real-world web pages
- Key terms were manually selected from the web pages in the same manner
- Noise stability
20 Evaluation on Web Pages (contd.)
- Multi-theme stability
  - 50 web pages with diverse topics
  - News websites and home pages of Internet portals with lists of featured articles
- Result
21 Conclusion
- Extracts key terms from a text document
- No training dataset required
- Wikipedia-based knowledge base
- Word sense disambiguation
- Semantic graph
- Semantic relatedness
- Valuable key term communities