Text mining tool for ontology engineering based on use of product taxonomy and web directory

About This Presentation
Description:

Text mining tool for ontology engineering based on use of product taxonomy and web directory ... lemma. word. DATESO 2005. 9. Experiments ... – PowerPoint PPT presentation

Slides: 14
Provided by: janne9

Transcript and Presenter's Notes


1
Text mining tool for ontology engineering based
on use of product taxonomy and web directory
  • Jan Nemrava and Vojtech Svatek
  • Department of Information and Knowledge
    Engineering
  • VSE Praha

2
Current state
  • Information extraction (IE) and ontology learning
    are frequently discussed issues in the field of
    the Semantic Web.
  • Semi-automatic and automatic methods for
    ontology-based information extraction are needed.
  • The Web is a great source of unstructured text.

3
Task
  • Collect specific words (verbs, in our case) that
    usually occur together with a particular product
    category, as support for ontology designers.
  • Build small, specialized ontologies, each concerning
    one product category and describing its frequent
    relations in common text.
  • Make use of fulltext search engines and the DMOZ
    directory for retrieving information.
  • Also use the UNSPSC (United Nations Standard
    Products and Services Code) product catalogue.

4
  • Web directories are rarely valid taxonomies.
  • It is easy to see that subheadings are often not
    specializations of headings.
  • Some of them are not even concepts (names of
    entities) but properties that implicitly restrict
    the extension of a preceding concept in the
    hierarchy. Consider, for example,
    .../Industries/Construction and
    Maintenance/Materials and
    Supplies/Masonry_and_Stone/Natural
    Stone/International Sources/Mexico.

5
Proposed method
  • Obtain so-called indicator verbs that
    characterize a particular term (a product category,
    in our case) in UNSPSC.
  • Particular terms will then be generalized, so that
    verbs indicative of the upper level of these
    terms may also be mined.
  • Join the UNSPSC taxonomy and its list of products
    with the content of company websites to gain
    information about verbs that usually occur in the
    same sentence as a product category from the
    taxonomy.
  • Use hand-classified web directories containing
    relevant web sites.

6
Task sequence decomposition
  • Manually select a UNSPSC product and the
    corresponding product category from the DMOZ
    Business branch
  • Search in directory heading names
  • Search in web site descriptions
  • Use fulltext search
  • 1) Input: URL of the DMOZ directory containing
    companies that manufacture the desired product.
  • Output: list of company URLs.
  • 2) Input: URL of a company website.
  • Output: list of web pages containing the
    target term.
  • 3) Input: web page containing the term.
  • Output: file with extracted sentences
    containing the term.
  • 4) Input: sentences containing the term.
  • Output: tagged sentences.
  • 5) Input: verbs.
  • Output: lemmatized, grouped and saved verbs.
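
Steps 3 and 5 of the sequence above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool presented in the slides: the function names, the tiny LEMMAS lookup table, the sample text and the Penn-Treebank-style tags are all hypothetical, and step 4 (tagging) is replaced by hand-tagged input, since a real pipeline would call a POS tagger and a proper lemmatizer there.

```python
import re
from collections import Counter

# Hypothetical lemma lookup; a real pipeline would use a lemmatizer instead.
LEMMAS = {"lifts": "lift", "lifted": "lift", "lifting": "lift",
          "moves": "move", "moved": "move", "handles": "handle"}

def sentences_with_term(text, term):
    """Step 3: return the sentences of `text` that mention `term`."""
    # Naive split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

def count_verb_lemmas(tagged_sentences):
    """Step 5: lemmatize and count verbs given as (word, tag) pairs.

    Assumes tags that start with 'V' mark verbs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            if tag.startswith("V"):
                lowered = word.lower()
                counts[LEMMAS.get(lowered, lowered)] += 1
    return counts

# Made-up page text for the target term "crane".
text = ("The crane lifts heavy loads. We are based in Prague. "
        "Our crane moved ten pallets!")
hits = sentences_with_term(text, "crane")

# Step 4 stand-in: hand-tagged versions of the two matching sentences.
tagged = [
    [("crane", "NN"), ("lifts", "VBZ"), ("loads", "NNS")],
    [("crane", "NN"), ("moved", "VBD"), ("pallets", "NNS")],
]
verb_counts = count_verb_lemmas(tagged)
```

Here `hits` keeps the first and third sentence, and `verb_counts` groups "lifts" and "moved" under the lemmas "lift" and "move".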

7
Experiment
  • Handling equipment branch: UNSPSC products with
    corresponding DMOZ categories
  • The goal is to find verbs that are:
  • common to most products
  • characteristic of one branch of products
  • specific to a small group of products, or even
    to only one product
  • 7 product categories; 303 verbs collected, which
    occurred 7,300 times on web sites.
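
One simple way to separate the three kinds of verbs named above is to look at how many product categories each verb occurs in. The sketch below is an assumption made for illustration, not the procedure used in the experiment: the category names, the counts, the labels and the `common_share` threshold are all made up.

```python
def verb_spread(freq):
    """Map each verb to the set of categories in which it occurs."""
    spread = {}
    for category, verbs in freq.items():
        for verb, count in verbs.items():
            if count > 0:
                spread.setdefault(verb, set()).add(category)
    return spread

def classify(freq, common_share=0.75):
    """Label verbs as 'common', 'branch' or 'specific'.

    A verb seen in one category is 'specific'; a verb seen in at least
    `common_share` of all categories is 'common'; the rest are 'branch'.
    The threshold is an arbitrary illustrative choice.
    """
    total = len(freq)
    labels = {}
    for verb, cats in verb_spread(freq).items():
        if len(cats) == 1:
            labels[verb] = "specific"
        elif len(cats) / total >= common_share:
            labels[verb] = "common"
        else:
            labels[verb] = "branch"
    return labels

# Toy verb counts per (hypothetical) product category.
freq = {
    "cranes": {"be": 40, "lift": 12, "hoist": 3},
    "conveyors": {"be": 35, "move": 9},
    "forklifts": {"be": 50, "lift": 20, "move": 4},
    "pallets": {"be": 20, "stack": 5},
}
labels = classify(freq)
```

With this toy data, "be" is labelled common, "lift" and "move" fall into the branch group, and "hoist" and "stack" are specific to one category.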

8
Experiment
9
Experiments
  • Some verbs are clearly neutral and do not
    characterize the products at all (be, have,
    provide, use).
  • Some are connected with manufacturing (design,
    require, offer, make, contact, manufacture,
    develop, supply).
  • Others describe activities that involve
    manipulating material (handle, lift, install, move).

10
Experiments
11
  • Normalization
  • $F_{ij} = f_{ij} \, (V_{tj} / V)$
  • Croft's normalization moderates the effect of
    high-frequency verbs
  • $cf_{ij} = K + (1 - K) \, f_{ij} / m_{ij}$
  • TF/IDF
  • $w_{ij} = f_{ij} \log_2(N / n)$
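
A small worked example of the last two weightings may help. It reads the Croft line as K + (1 - K) · f / m and the TF/IDF line as f · log2(N / n). The slide does not define the symbols, so the sketch assumes that f is the count of a verb in a category, m is the highest verb count in that category, N is the number of categories, and n is the number of categories containing the verb. K = 0.5, the category names and the counts are made up.

```python
import math

def croft(f, m, K=0.5):
    """Croft normalization: K + (1 - K) * f / m."""
    return K + (1 - K) * f / m

def tf_idf(f, N, n):
    """TF/IDF weight: f * log2(N / n)."""
    return f * math.log2(N / n)

def category_count(freq, verb):
    """Number of categories in which `verb` occurs at least once."""
    return sum(1 for verbs in freq.values() if verbs.get(verb, 0) > 0)

def weights(freq, K=0.5):
    """Return {(verb, category): (croft_weight, tf_idf_weight)}."""
    N = len(freq)
    out = {}
    for category, verbs in freq.items():
        m = max(verbs.values())  # assumed meaning of m: top count in category
        for verb, f in verbs.items():
            n = category_count(freq, verb)
            out[(verb, category)] = (croft(f, m, K), tf_idf(f, N, n))
    return out

# Toy verb counts per (hypothetical) product category.
freq = {
    "cranes": {"be": 40, "lift": 10},
    "conveyors": {"be": 30, "move": 15},
    "forklifts": {"be": 50, "move": 5},
    "pallets": {"be": 20, "stack": 5},
}
w = weights(freq)
```

Under these assumptions "be" gets a TF/IDF weight of 0 in every category because it occurs in all four, while "lift" in "cranes" gets 10 · log2(4) = 20. The Croft value for "lift" in "cranes" is 0.5 + 0.5 · 10/40 = 0.625, so the gap between it and the top-scoring "be" (1.0) is much smaller than the gap between the raw counts.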

12
Remaining problems
  • Automate the assignment of UNSPSC categories to
    DMOZ categories.
  • Some UNSPSC categories have no appropriate DMOZ
    category, which yields few or no web sites.
  • Some categories are less informative.
13
  • Thank you!