Extracting Key Terms From Noisy and Multi-theme Documents



1
Extracting Key Terms From Noisy and Multi-theme
Documents
  • Maria Grineva, Maxim Grinev and Dmitry Lizorkin
  • Proceedings of the 18th International World Wide
    Web Conference, ACM WWW, 2009

Speaker: Chien-Liang Wu
2
Outline
  • Motivation
  • Framework of Key Terms Extraction
  • Candidate Terms Extraction
  • Word Sense Disambiguation
  • Building Semantic Graph
  • Discovering Community Structure of the Semantic
    Graph
  • Selecting Valuable Communities
  • Experiments
  • Conclusions

3
Motivation
  • Key Terms Extraction
    • Basic step for various NLP tasks
      • Document classification
      • Document clustering
      • Text summarization
  • Challenges
    • Web pages are typically noisy
      • Side bars/menus, comments, etc.
    • Dealing with multi-theme Web pages
      • Portal home pages

4
Motivation (cont.)
  • State-of-the-art approaches to key terms
    extraction
    • Based on statistical learning
      • TF×IDF model, keyphrase frequency, etc.
    • Require a training set
  • Approach in this paper
    • Based on analyzing syntactic or semantic term
      relatedness within a document
    • Compute semantic relatedness between terms using
      Wikipedia
    • Model the document as a semantic graph of terms
      and apply graph analysis techniques to it
    • No training set required

5
Framework
6
Candidate Terms Extraction
  • Goal
    • Extract all terms from the document
    • For each term, prepare a set of Wikipedia articles
      that can describe its meaning
  • Parse the input document and extract all possible
    n-grams
  • For each n-gram (and its morphological variations),
    provide a set of Wikipedia article titles
    • "drinks", "drinking", "drink" → Wikipedia articles
      "Drink", "Drinking"
  • Avoid nonsense phrases appearing in the results
    • "using", "electric cars are", etc.
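This step can be sketched in a few lines of Python. The sketch assumes a toy title index (`WIKI_TITLES`) and a crude suffix-stripping stand-in for real morphological normalization; the actual system matches n-grams against the full set of Wikipedia article titles.

```python
import re

# Toy stand-in for the Wikipedia title index (lowercased form -> title).
WIKI_TITLES = {"drink": "Drink", "drinking": "Drinking", "web server": "Web server"}

def normalize(phrase):
    # Crude stemming stand-in for real morphological normalization.
    return re.sub(r"s$", "", phrase.lower())

def candidate_terms(text, max_n=3):
    """Extract all n-grams whose (normalized) form matches a Wikipedia title."""
    tokens = re.findall(r"\w+", text.lower())
    candidates = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            for form in (gram, normalize(gram)):
                if form in WIKI_TITLES:
                    candidates.setdefault(gram, WIKI_TITLES[form])
    return candidates

print(candidate_terms("People drink drinks near the web server"))
```

Because only n-grams matching a title (directly or via a morphological variant) survive, nonsense fragments such as "electric cars are" are filtered out for free.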

7
Word Sense Disambiguation
  • Goal
    • Choose the most appropriate Wikipedia article
      from the set of candidate articles for each
      ambiguous term extracted in the previous step
  • Reference
  • Semantic Relatedness Metric for Wikipedia
    Concepts Based on Link Analysis and its
    Application to Word Sense Disambiguation,
    SYRCoDIS, 2008

8
Word Sense Disambiguation (contd.)
  • Example text
    • "Jigsaw is W3C's open-source project that started
      in May 1996. It is a web server platform that
      provides a sample HTTP 1.1 implementation and …"
  • Ambiguous term: "platform"
  • Four Wikipedia concepts around this word:
    open-source, web server, HTTP, and implementation

9
Word Sense Disambiguation (contd.)
  • A neighbor of an article
    • All Wikipedia articles that have an incoming or
      an outgoing link to the original article
  • Each term is assigned a single Wikipedia article
    that describes its meaning
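A minimal sketch of this disambiguation step, assuming hypothetical neighbor sets in place of the real Wikipedia link graph and a simple shared-neighbor score in place of the paper's link-based relatedness measure (SYRCoDIS 2008):

```python
# Hypothetical neighbor sets (articles linked to/from each article);
# real data would come from the Wikipedia link graph.
NEIGHBORS = {
    "Platform (computing)": {"Web server", "HTTP", "Open-source software",
                             "Implementation"},
    "Platform (geology)": {"Sediment", "Continental shelf"},
    "Web server": {"HTTP", "Platform (computing)"},
    "HTTP": {"Web server", "Platform (computing)"},
}

def relatedness(a, b):
    # Stand-in score: number of shared neighbors.
    return len(NEIGHBORS.get(a, set()) & NEIGHBORS.get(b, set()))

def disambiguate(candidates, context_articles):
    # Choose the candidate article most related to the surrounding
    # unambiguous concepts.
    return max(candidates,
               key=lambda c: sum(relatedness(c, ctx) for ctx in context_articles))

print(disambiguate(["Platform (computing)", "Platform (geology)"],
                   ["Web server", "HTTP"]))
# → Platform (computing)
```

For the Jigsaw example, the computing sense of "platform" wins because its neighbors overlap with those of the surrounding concepts, while the geology sense shares none.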

10
Building Semantic Graph
  • Goal
    • Build a document semantic graph using semantic
      relatedness between terms
  • The semantic graph is a weighted graph
    • Vertex: a term
    • Edge: connects two vertices whose terms are
      semantically related
    • Edge weight: the semantic relatedness measure of
      the two terms

11
Building Semantic Graph (contd.)
  • Using the Dice measure for Wikipedia-based semantic
    relatedness (reference: SYRCoDIS, 2008)
    • Dice(A, B) = 2 |n(A) ∩ n(B)| / (|n(A)| + |n(B)|),
      where n(A) is the set of neighbors of article A
  • Weights for various link types
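In code, the Dice measure over neighbor sets looks like this (a sketch with made-up neighbor sets; the paper additionally weights different link types, which is omitted here):

```python
def dice_relatedness(neighbors_a, neighbors_b):
    # Dice measure: 2|n(A) ∩ n(B)| / (|n(A)| + |n(B)|)
    if not neighbors_a and not neighbors_b:
        return 0.0
    shared = neighbors_a & neighbors_b
    return 2 * len(shared) / (len(neighbors_a) + len(neighbors_b))

# Illustrative neighbor sets, not real Wikipedia link data.
n_apple = {"Steve Jobs", "iTunes", "Macintosh", "Technology"}
n_itunes = {"Steve Jobs", "Apple Inc.", "Macintosh", "Music"}
print(dice_relatedness(n_apple, n_itunes))  # 2*2 / (4+4) = 0.5
```

Each edge of the semantic graph is weighted with this score for its two endpoint terms.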
12
Detecting Community Structure of the Semantic
Graph using Newman Algorithm
A news article: "Apple to Make iTunes More
Accessible For the Blind"
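The community-detection step can be sketched with networkx, using its greedy modularity maximization (Clauset–Newman–Moore) as a stand-in for the Newman algorithm. The graph below is a toy version of the iTunes-article example with illustrative edge weights, not the paper's actual values:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy semantic graph for the news article; weights are illustrative
# relatedness scores.
G = nx.Graph()
G.add_weighted_edges_from([
    ("Apple Inc.", "iTunes", 0.8), ("Apple Inc.", "Steve Jobs", 0.7),
    ("iTunes", "Steve Jobs", 0.6),
    ("Blindness", "Braille", 0.9), ("Blindness", "Accessibility", 0.5),
    ("Braille", "Accessibility", 0.4),
    ("iTunes", "Accessibility", 0.1),  # weak cross-theme link
])

for community in greedy_modularity_communities(G, weight="weight"):
    print(sorted(community))
```

The weak cross-theme edge is cut, leaving one community for the Apple/iTunes theme and one for the accessibility theme, which is exactly the thematic grouping the paper exploits.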
13
Selecting Valuable Communities
  • Goal: rank term communities so that
    • higher-ranked communities contain key terms
    • lower-ranked communities contain unimportant
      terms and possible disambiguation mistakes
  • Use
    • Density of community Ci

14
Selecting Valuable Communities(contd.)
  • Informativeness of community Ci
    • Gives higher values to named entities (for example,
      Apple Inc., Steve Jobs, Braille) than to general
      terms (Consumer, Agreement, Information)
  • Community rank = density × informativeness
  • Where
    • count(D_link) is the number of Wikipedia
      articles in which the term appears as a link
    • count(D_term) is the total number of articles in
      which it appears
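A sketch of the ranking with toy numbers. The informativeness of a term follows the slide's ratio count(D_link)/count(D_term); the density definition below (total inner edge weight per vertex) is an assumption, since the slides omit the exact formula:

```python
def keyphraseness(term, link_doc_count, total_doc_count):
    # Informativeness of a term: fraction of Wikipedia articles
    # mentioning it in which it appears as a link.
    return link_doc_count[term] / total_doc_count[term]

def density(community, weights):
    # Assumed definition: total weight of edges inside the community
    # per vertex (the slides omit the exact formula).
    inner = sum(w for (a, b), w in weights.items()
                if a in community and b in community)
    return inner / len(community)

def rank(community, weights, link_doc_count, total_doc_count):
    info = sum(keyphraseness(t, link_doc_count, total_doc_count)
               for t in community) / len(community)
    return density(community, weights) * info

# Toy data: one community with illustrative counts and edge weights.
weights = {("Apple Inc.", "iTunes"): 0.8, ("Apple Inc.", "Steve Jobs"): 0.7,
           ("iTunes", "Steve Jobs"): 0.6}
link_doc_count = {"Apple Inc.": 900, "iTunes": 400, "Steve Jobs": 700}
total_doc_count = {"Apple Inc.": 1000, "iTunes": 800, "Steve Jobs": 900}
community = {"Apple Inc.", "iTunes", "Steve Jobs"}
print(rank(community, weights, link_doc_count, total_doc_count))
```

A dense community of named entities gets a high score on both factors, while a loose community of general terms is pushed to the bottom of the ranking.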

15
Selecting Valuable Communities(contd.)
  • The decline is the border between important and
    unimportant term communities
  • For the test collection, the decline coincides with
    the maximum F-measure in 73% of cases
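One simple way to locate such a decline (an assumed heuristic; the slides do not spell out the procedure) is to sort the community scores and cut at the largest drop between consecutive values:

```python
def split_at_decline(ranks):
    # Assumed heuristic: cut where consecutive ranked scores drop the most.
    ordered = sorted(ranks, reverse=True)
    drops = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    cut = drops.index(max(drops)) + 1
    return ordered[:cut], ordered[cut:]

print(split_at_decline([0.51, 0.47, 0.44, 0.12, 0.09, 0.05]))
# → ([0.51, 0.47, 0.44], [0.12, 0.09, 0.05])
```

Communities above the cut supply the key terms; those below it are discarded along with any disambiguation mistakes they contain.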

16
Experiment
  • Noise-free dataset
    • 252 posts from 5 technical blogs
  • 22 annotators took part in this experiment
    • Each document was analyzed by 5 different
      annotators
    • A key term was valid if at least two participants
      identified it
  • For each document, two sets of key terms were built
    • Uncontrolled key terms: each annotator identified
      5–10 key terms
    • Controlled key terms: those that match a Wikipedia
      article title
  • Finally, got 2,009 key terms; 93% of them are
    Wikipedia article titles
17
Evaluation

18
Results
  • Revision of precision and recall
    • The communities-based method extracts more related
      terms in each thematic group than a human →
      better term coverage
    • Each participant reviewed the automatically
      extracted key terms and, where possible, extended
      their manually identified key terms
    • 389 additional manually selected key terms
    • Precision → 46.1%, recall → 67.7%
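The metrics themselves are the standard set-based definitions; a sketch with hypothetical term sets:

```python
def precision_recall(extracted, gold):
    # Precision: fraction of extracted terms that are correct.
    # Recall: fraction of gold key terms that were extracted.
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)
    return tp / len(extracted), tp / len(gold)

p, r = precision_recall({"iTunes", "Braille", "Consumer"},
                        {"iTunes", "Braille", "Apple Inc.", "Steve Jobs"})
print(p, r)
```

Extending the gold set with the 389 annotator-approved terms enlarges the intersection, which is why both revised scores rise.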

19
Evaluation on Web Pages
  • 509 real-world web pages
  • Manually select key terms from web pages in the
    same manner
  • Noise stability

20
Evaluation on Web Pages (contd.)
  • Multi-theme stability
  • 50 web pages with diverse topics
  • News websites and home pages of Internet portals
    with lists of featured articles
  • Result

21
Conclusion
  • Extract key terms from text documents
  • No training dataset required
  • Wikipedia-based knowledge base
  • Word sense disambiguation
  • Semantic graph
  • Semantic relatedness
  • Valuable key term communities