Domain-Specific Web Search with Keyword Spices

Provided by: mikech2

1
Domain-Specific Web Search with Keyword Spices
  • Thanh Viet Nguyen
  • Faculty of Information Technology, University of
    Natural Sciences
  • nvthanh_at_fit.hcmuns.edu.vn
  • http://www.fit.hcmuns.edu.vn/nvthanh
  • 06/07/2005

2
Outline
  • Introduction
  • State of the art
  • Algorithm for Extracting Keyword Spices
  • Experiment
  • Future Work
  • Reference

3
Introduction
  • A domain-specific search engine aims to return
    only pages relevant to a certain domain.
  • Why a domain-specific search engine rather than a
    general-purpose search engine (like Google)?
  • Browsing through the pages returned by Google is
    time consuming
  • The first 10, 20, or 100 returned pages matter
    most in a commercial search engine (precision is
    what counts)
  • Users have poor query refinement skills (most are
    inexperienced)
  • 70% of web searches use only one keyword (Butler,
    2000)

4
State of the art
  • (1) Index only domain-specific pages using a
    specialized crawler (machine learning approach)
  • Cora (McCallum et al., 1999): search engine for
    computer science research papers
  • SPIRAL (Cohen, 1998), WebKB (Craven et al., 1998),
    Google Scholar
  • Pros: sophisticated search, high precision (from
    indexed data, not real data)
  • Cons: the crawler costs time and network bandwidth
  • (2) Filtering model (meta search engine approach)
  • Ahoy (Shakes et al., 1997): search engine for
    personal homepages
  • Pros: reuses the power of commercial search engines
  • Cons: slow response time
  • Conclusion: (1) is efficient for domains whose
    pages are concentrated (computer research papers),
    while (2) is suitable for domains whose pages are
    dispersed (personal pages) (Oyama et al., 2004)

5
State of the art (cont)
  • Query refinement (meta search engine approach)
  • Relevance feedback (Salton and Buckley, 1990):
    best for a specific user
  • Query modification (Glover et al., 2001): best for
    a specific domain
  • Keyword spices (Oyama et al., 2001): best for a
    specific domain
  • Comparison between keyword spices and the filtering
    model

6
Keyword spices - Introduction
  • The keyword spice model aims to extend (refine) the
    user's query with domain-specific keyword spices.
  • The new query is passed to a general-purpose
    search engine
  • The returned (relevant?) pages are shown to the
    user
  • Challenge: how can we find the most effective
    keywords for a specific domain?
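
A minimal sketch of the query-extension step described above. It assumes the keyword spice is stored as a list of conjunctions, each a list of `(keyword, present)` pairs, and that the target search engine accepts `OR`, parentheses, and a leading `-` for negation. The function name `extend_query` and the sample keywords are hypothetical, not taken from the slides.

```python
def extend_query(user_query, spice):
    """Append a keyword spice (a disjunction of conjunctions) to a user query.

    spice: list of conjunctions; each conjunction is a list of
    (keyword, present) pairs, where present=False means the keyword
    must NOT appear on the page.
    """
    clauses = []
    for conjunction in spice:
        terms = [kw if present else "-" + kw for kw, present in conjunction]
        clauses.append("(" + " ".join(terms) + ")")
    return f"{user_query} ({' OR '.join(clauses)})"


# Hypothetical spice for a recipe domain.
spice = [[("ingredients", True), ("hotel", False)], [("tablespoon", True)]]
print(extend_query("beef", spice))
```

The user types only `beef`; the domain-specific part of the query is added without the user seeing it.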

7
Keyword spices: Illustrative example
  • Basic idea: domain-specific pages share some
    common keywords or phrases
  • Personal homepage: my name is, my homepage, ...
  • Call for papers: important dates, committee, ...
  • Preliminary experiment
  • Cooking recipe domain (Japanese): beef pepper
    >> beef
  • Computer business (Vietnamese): I try to find
    pages with information about how (where) to buy
    a new (old) computer.
  • Some tried keywords: máy vi tính (computer), cửa
    hành máy vi tính, cửa hàng máy vi tính (computer
    shop), mua bán máy vi tính (buying and selling
    computers), bảo hành máy vi tính (computer
    warranty)

8
Keyword spices: Illustrative example (cont)
  • Result of Google search

9
Keyword spices: Illustrative example (cont)

10
Keyword spices: Illustrative example (cont)
  • Judgment: the best search keyword is bảo hành
    máy vi tính
  • All of the top 10 and top 20 returned pages have
    information related to the computer business
  • 28 of the top 100 are tightly relevant pages
  • The results link both to most computer business
    companies, such as WestCom, MekongGreen, Nguyen
    Hoang, TH, FPT, CMS, and to many classified
    advertisements for old computers.
  • Question: should bảo hành be a keyword spice?
    How can we find other keyword spices?

11
Keyword spice extraction: Pre-processing
  • Pages collected from a general-purpose search
    engine are classified by hand into two classes: T
    (relevant to the domain) or F (irrelevant to the
    domain)
  • Remove HTML tags and extract nouns as keywords
  • Split the example pages into two disjoint subsets:
    the training set and the validation set.
  • From now on, all examples (unless explicitly
    indicated otherwise) are results of searching for
    keyword spices for the recipe domain (Japanese)
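
The pre-processing steps above could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the authors' pipeline: it keeps every lowercase ASCII word instead of extracting nouns (which would need a part-of-speech tagger or, for Japanese, a morphological analyzer), and the helper names `page_keywords` and `split_examples` are hypothetical.

```python
import random
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_keywords(html):
    """Return the set of words on a page (stand-in for noun extraction)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return set(re.findall(r"[a-z]+", text))


def split_examples(examples, train_fraction=0.5, seed=0):
    """Shuffle and split labeled examples into two disjoint subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


html = "<html><body><h1>Beef Stew</h1><p>Add salt.</p></body></html>"
print(page_keywords(html))
training, validation = split_examples(range(10))
print(len(training), len(validation))
```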

12
Initial Search Expression
  • Build an initial decision tree from the training
    set, following Quinlan (1986), to classify relevant
    and irrelevant documents.
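
As a rough illustration, a tree of this kind can be grown by repeatedly splitting on the keyword with the highest information gain, in the spirit of Quinlan (1986). The sketch below assumes each example is a `(keyword_set, label)` pair with label `"T"` or `"F"`, tests only keyword presence, and does no pruning. The slides do not give these details, so treat the representation and the function names as assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(examples, keyword):
    """Entropy reduction from splitting on presence of `keyword`."""
    labels = [label for _, label in examples]
    with_kw = [label for kws, label in examples if keyword in kws]
    without_kw = [label for kws, label in examples if keyword not in kws]
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in (with_kw, without_kw) if part)
    return entropy(labels) - remainder


def build_tree(examples, keywords):
    """Return a leaf label or a (keyword, present_subtree, absent_subtree) node."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not keywords:
        return majority
    best = max(keywords, key=lambda kw: information_gain(examples, kw))
    if information_gain(examples, best) <= 1e-12:
        return majority  # no keyword separates the classes any further
    remaining = [kw for kw in keywords if kw != best]
    present = [(kws, label) for kws, label in examples if best in kws]
    absent = [(kws, label) for kws, label in examples if best not in kws]
    return (best, build_tree(present, remaining), build_tree(absent, remaining))


# Hypothetical hand-labeled training pages.
examples = [
    ({"ingredients", "beef"}, "T"),
    ({"ingredients", "salt"}, "T"),
    ({"beef", "price"}, "F"),
    ({"hotel", "menu"}, "F"),
]
keywords = sorted(set().union(*(kws for kws, _ in examples)))
print(build_tree(examples, keywords))
```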

13
Initial Search Expression (cont)
  • Convert the tree into a set of positive
    conjunctions (class T): the initial query
  • Problem: a large decision tree -> an overly complex
    query that cannot be submitted to a commercial
    search engine -> we must reduce the keyword spice
    size
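
One way to picture the conversion: every root-to-leaf path that ends in class T becomes one conjunction, and the query is the OR of those conjunctions. The sketch below assumes a tree written as nested `(keyword, present_subtree, absent_subtree)` tuples with string leaves `"T"` and `"F"`. Both that representation and the sample tree are hypothetical.

```python
def positive_conjunctions(tree, path=()):
    """List every path to a "T" leaf as a conjunction of (keyword, present) tests."""
    if isinstance(tree, str):  # leaf
        return [list(path)] if tree == "T" else []
    keyword, present_branch, absent_branch = tree
    return (positive_conjunctions(present_branch, path + ((keyword, True),))
            + positive_conjunctions(absent_branch, path + ((keyword, False),)))


# Hypothetical tree for a recipe domain.
tree = ("ingredients",
        ("hotel", "F", "T"),        # "ingredients" present
        ("tablespoon", "T", "F"))   # "ingredients" absent
for conjunction in positive_conjunctions(tree):
    print(conjunction)
```

Even this three-node tree yields two conjunctions of two keywords each, so a realistic tree quickly produces a query too long for a search engine to accept.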

14
Simplifying Keyword Spices
  • Basic idea: simplify the initial query by
    removing keywords or conjunctions without reducing
    classification performance (measured on the
    validation set)
  • How can we evaluate query performance?

15
Information retrieval evaluation
  • $\text{Precision} = \dfrac{|\text{relevant returned documents}|}{|\text{returned documents}|}$
  • $\text{Recall} = \dfrac{|\text{relevant returned documents}|}{|\text{relevant documents}|}$
16
Keyword spices evaluation
Where D_domain is the set of documents classified
as relevant by a human and D_boolean is the set of
documents classified as relevant by the query
  • Note: the ideal case is to have both high precision
    and high recall
  • Harmonic mean F of precision P and recall R
  • The higher the value of F, the better balanced the
    query is in terms of precision and recall
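
The formulas themselves did not survive in this transcript. Given the definitions of D_domain and D_boolean above and the usual definition of the harmonic mean, they would presumably take the following standard form (a reconstruction, not a copy of the slide):

```latex
P = \frac{|D_{\text{domain}} \cap D_{\text{boolean}}|}{|D_{\text{boolean}}|},
\qquad
R = \frac{|D_{\text{domain}} \cap D_{\text{boolean}}|}{|D_{\text{domain}}|},
\qquad
F = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}
```

For instance, P = 0.5 and R = 0.4 give F = 0.4 / 0.9 ≈ 0.44, which is closer to the smaller of the two values than the arithmetic mean 0.45 would be.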

17
Simplifying Keyword Spices: Algorithm
  • Collect and classify example documents
  • Split the examples into D_training (for generating
    the initial decision tree) and D_validation (for
    simplifying the keyword spice)
  • Build the initial decision tree
  • Convert the tree into a set of positive
    conjunctions: the query
  • Simplify the query
  • For each conjunction, remove the keyword whose
    removal results in the maximum increase in the
    harmonic mean F
  • For the query, remove the conjunction whose
    removal results in the maximum increase in the
    harmonic mean F
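
The two simplification steps can be sketched as a greedy loop. The version below makes several assumptions the slides do not state: the query is a list of conjunctions of `(keyword, present)` pairs, validation examples are `(keyword_set, label)` pairs, a removal is accepted only if it strictly increases F, and a conjunction is never emptied. The names `matches`, `f_measure`, and `simplify` are hypothetical.

```python
def matches(conjunction, keywords):
    """True if the page's keyword set satisfies every test in the conjunction."""
    return all((kw in keywords) == present for kw, present in conjunction)


def f_measure(query, validation):
    """Harmonic mean of precision and recall of `query` on labeled pages."""
    selected = [label for kws, label in validation
                if any(matches(c, kws) for c in query)]
    hits = selected.count("T")
    relevant = sum(1 for _, label in validation if label == "T")
    if hits == 0:
        return 0.0
    p, r = hits / len(selected), hits / relevant
    return 2 * p * r / (p + r)


def simplify(query, validation):
    query = [list(c) for c in query]  # work on a copy
    # Step 1: inside each conjunction, drop keywords while F improves.
    for i in range(len(query)):
        while len(query[i]) > 1:
            base = f_measure(query, validation)
            trials = []
            for j in range(len(query[i])):
                shorter = query[i][:j] + query[i][j + 1:]
                trial = query[:i] + [shorter] + query[i + 1:]
                trials.append((f_measure(trial, validation), j))
            best_f, best_j = max(trials, key=lambda t: t[0])
            if best_f <= base:
                break
            del query[i][best_j]
    # Step 2: drop whole conjunctions while F improves.
    while len(query) > 1:
        base = f_measure(query, validation)
        trials = [(f_measure(query[:i] + query[i + 1:], validation), i)
                  for i in range(len(query))]
        best_f, best_i = max(trials, key=lambda t: t[0])
        if best_f <= base:
            break
        del query[best_i]
    return query


# Hypothetical validation pages and initial query.
validation = [
    ({"ingredients", "beef"}, "T"),
    ({"ingredients", "salt"}, "T"),
    ({"tablespoon", "salt"}, "T"),
    ({"hotel", "menu"}, "F"),
    ({"beef", "price"}, "F"),
]
query = [[("ingredients", True), ("beef", True)],
         [("tablespoon", True)],
         [("menu", True)]]
print(simplify(query, validation))
```

On this toy data the loop drops `beef` from the first conjunction and then removes the `menu` conjunction, raising F from about 0.67 to 1.0.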

18
Experiment Result
  • 2000 sample pages, 1000 for training and 1000 for
    validation

19
Experiment Result (cont)
  • From validation set

20
Experiment Result (cont)
  • From a general purpose search engine

21
Experiment Result (cont)
  • Comparing with filtering model

22
Future Work
  • Collecting training examples
  • Using a web directory as a source of examples
  • Noise in the training set
  • Bias in the training set
  • Learning classifiers from partially labeled data
  • Application to Vietnamese?
  • Keyword extraction problem: ambiguity in word
    segmentation
  • A commercial Vietnamese search engine: Google
    (.vn domain)?

23
Reference
  • Domain-Specific Web Search with Keyword Spices,
    IEEE Transactions on Knowledge and Data
    Engineering, vol. 16, no. 1, January 2004
  • Satoshi OYAMA, Takashi KOKUBO and Toru ISHIDA
  • Department of Social Informatics
  • Kyoto University, Kyoto 606-8501, Japan
  • Teruhiro YAMADA
  • Laboratories of Information Science and
    Technology
  • Yasuhiko KITAMURA
  • Department of Information and Communication
    Engineering
  • Osaka City University, Osaka 558-8585, Japan

24
Discussion
  • Q & A