Domain-Specific Web Search with Keyword Spices

1
Domain-Specific Web Search with Keyword Spices

Thanh Viet Nguyen
Faculty of Information Technology, University of Natural Sciences
nvthanh_at_fit.hcmuns.edu.vn
http://www.fit.hcmuns.edu.vn/nvthanh
06/07/2005

2
Outline

Introduction
State of the art
Algorithm for Extracting Keyword Spices
Experiment
Future Work
Reference

3
Introduction

A domain-specific search engine aims to return only pages relevant to a certain domain.
Why a domain-specific search engine rather than a general-purpose search engine (like Google)?
Working through the pages returned by Google is time consuming
The first 10, 20, or 100 returned pages matter most in a commercial search engine (precision is what users want)
Users are poor at refining queries (most are inexperienced)
70% of web searches use only one keyword (Butler, 2000)

4
State of the art

(1) Index only domain-specific pages using a specialized crawler (machine learning approach)
Cora (McCallum et al., 1999): search engine for computer research papers
SPIRAL (Cohen, 1998), WebKB (Craven et al., 1998), Google Scholar
Pros: sophisticated search, high precision (from indexed data, not real data)
Cons: the crawler costs time and network bandwidth
(2) Filtering model (meta search engine approach)
Ahoy (Shakes et al., 1997): search engine for personal homepages
Pros: reuses the power of commercial search engines
Cons: slow response time
Conclusion: (1) is efficient for domains whose pages are concentrated (computer research papers), while (2) suits domains whose pages are dispersed (personal pages) (Oyama et al., 2004)

5
State of the art (cont)

Query refinement (meta search engine approach)
Relevance feedback (Salton and Buckley, 1990): best for a specific user
Query modification (Glover et al., 2001): best for a specific domain
Keyword spices (Oyama et al., 2001): best for a specific domain
Comparison between keyword spices and the filtering model

6
Keyword spices - Introduction

The keyword spice model aims to extend (refine) the user's query with domain-specific keywords, the keyword spices.
The new query is passed to a general-purpose search engine
The returned (relevant?) pages are shown to the user
Challenge: how can we find the most effective keywords for a specific domain?
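
As a rough illustration of the query extension step, the sketch below joins the user's keywords with a Boolean spice expression before the query is sent to a general-purpose engine. The function name `add_spice`, the `AND`/`OR`/`NOT` syntax, and the sample spice are all assumptions made for this example; real engines differ in operator syntax, and the spice shown is not one reported in the slides.

```python
def add_spice(user_query: str, spice: str) -> str:
    """Return the user's query AND-ed with a domain-specific Boolean expression."""
    user_query = user_query.strip()
    if not user_query:
        raise ValueError("user query must not be empty")
    # Parenthesize both sides so operator precedence cannot change the meaning.
    return f"({user_query}) AND ({spice})"


# Hypothetical spice for a recipe domain; the terms are illustrative only.
RECIPE_SPICE = "ingredients OR (tablespoon AND NOT shop)"

if __name__ == "__main__":
    print(add_spice("beef", RECIPE_SPICE))
```

The user still types only "beef"; the domain knowledge lives entirely in the spice string.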

7
Keyword spices: Illustrative example

Basic idea: domain-specific pages share some characteristic keywords or phrases
Personal homepage: "my name is", "my homepage"
Call for papers: "important dates", "committee"
Preliminary experiment
Cooking recipe domain (Japanese): "beef pepper" >> "beef"
Computer business (Vietnamese): I tried to find pages with information about how (or where) to buy a new (or used) computer.
Some keywords tried: máy vi tính (computer), cửa hành máy vi tính, cửa hàng máy vi tính (computer shop), mua bán máy vi tính (buying and selling computers), bảo hành máy vi tính (computer warranty)

8
Keyword spices: Illustrative example (cont)

Result of Google search

9
Keyword spices: Illustrative example (cont)

10
Keyword spices: Illustrative example (cont)

Judgment: the best search keyword is bảo hành máy vi tính
All of the top 10 and top 20 returned pages contain information related to the computer business
28 of the top 100 are closely relevant pages
The results link both to most computer business companies, such as WestCom, MekongGreen, Nguyen Hoang, TH, FPT, and CMS, and to many classified advertisements for used computers.
Question: should bảo hành be a keyword spice?
How can we find other keyword spices?

11
Keyword spice extraction: Pre-processing

Pages collected from a general-purpose search engine are classified by hand into two classes: T (relevant to the domain) or F (irrelevant to the domain)
Remove HTML tags and extract nouns as keywords
Split the example pages into two disjoint subsets: the training set and the validation set
From here on, unless stated otherwise, all examples are results of searching for keyword spices for the recipe domain (Japanese)
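
The pre-processing steps might look roughly like the sketch below. It strips HTML with the standard-library parser and splits the labelled examples in half. One deliberate simplification: extracting nouns needs a part-of-speech tagger (and, for Japanese, a word segmenter), so here every lowercased whitespace-separated token stands in for a keyword. The names `page_keywords` and `split_examples` are hypothetical.

```python
import random
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def page_keywords(html: str) -> set:
    """Return the set of tokens on a page (a stand-in for noun extraction)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.parts).lower().split()
    return {w.strip(".,;:!?()\"'") for w in words} - {""}


def split_examples(examples, seed=0):
    """Shuffle labelled examples and split them into two disjoint halves."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # (training, validation)
```

Each example is then a pair `(page_keywords(html), label)`, with `label` set to `True` for class T and `False` for class F.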

12
Initial Search Expression

Build an initial decision tree from the training set, following Quinlan (1986), to classify relevant and irrelevant documents.
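
Quinlan's ID3 grows a tree by repeatedly splitting on the attribute with the highest information gain. The sketch below shows only that selection step for binary "keyword present / absent" attributes; whether the slides' experiments used exactly this criterion is an assumption. Examples are `(keywords, label)` pairs with `label` equal to `True` for class T, and the function names are hypothetical.

```python
import math


def entropy(labels):
    """Binary entropy (in bits) of a list of True/False labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of examples in class T
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(examples, keyword):
    """Entropy reduction from splitting the examples on presence of `keyword`."""
    if not examples:
        raise ValueError("examples must not be empty")
    labels = [label for _, label in examples]
    with_kw = [label for words, label in examples if keyword in words]
    without_kw = [label for words, label in examples if keyword not in words]
    n = len(examples)
    remainder = (len(with_kw) / n) * entropy(with_kw) \
        + (len(without_kw) / n) * entropy(without_kw)
    return entropy(labels) - remainder


def best_keyword(examples, vocabulary):
    """Pick the split keyword with the highest gain (ties: alphabetical order)."""
    return max(sorted(vocabulary), key=lambda kw: information_gain(examples, kw))
```

Applying `best_keyword` recursively to the "present" and "absent" subsets, until each subset is pure, yields the full tree.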

13
Initial Search Expression (cont)

Convert the tree into a set of positive conjunctions (class T): the initial query
Problem: a large decision tree → an overly complex query that cannot be used with a commercial search engine
→ We must reduce the size of the keyword spice
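
The conversion from tree to query can be sketched as follows: every root-to-leaf path that ends in class T becomes one conjunction, and the query is the OR of those conjunctions. The tree encoding (a leaf is a `bool`, an inner node is a tuple `(keyword, subtree_if_present, subtree_if_absent)`) and the sample tree are assumptions made for this example.

```python
def positive_conjunctions(tree, path=()):
    """List the keyword tests along every path that ends in a T (True) leaf.

    Each conjunction is a list of (keyword, must_be_present) literals.
    """
    if isinstance(tree, bool):
        return [list(path)] if tree else []
    keyword, present, absent = tree
    return (positive_conjunctions(present, path + ((keyword, True),))
            + positive_conjunctions(absent, path + ((keyword, False),)))


def to_query_string(conjunctions):
    """Render the conjunctions as a Boolean query in disjunctive normal form."""
    def literal(keyword, positive):
        return keyword if positive else f"NOT {keyword}"

    parts = [" AND ".join(literal(kw, pos) for kw, pos in conj)
             for conj in conjunctions]
    return " OR ".join(f"({part})" for part in parts)


# Hypothetical two-level tree for a recipe domain.
TREE = ("ingredients",
        ("shop", False, True),         # has "ingredients": T unless "shop" appears
        ("tablespoon", True, False))   # no "ingredients": T only with "tablespoon"

if __name__ == "__main__":
    print(to_query_string(positive_conjunctions(TREE)))
```

Every extra level of the tree adds a literal to each conjunction, which is why a large tree produces a query too long for a commercial engine.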

14
Simplifying Keyword Spices

Basic idea: simplify the initial query by removing keywords or conjunctions, as long as classification performance (measured on the validation set) does not drop
How can we evaluate query performance?

15
Information retrieval evaluation

$\text{Precision} = \frac{\text{relevant returned documents}}{\text{returned documents}}$
$\text{Recall} = \frac{\text{relevant returned documents}}{\text{relevant documents}}$

16
Keyword spices evaluation
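$P = \frac{|D_{domain} \cap D_{boolean}|}{|D_{boolean}|}, \quad R = \frac{|D_{domain} \cap D_{boolean}|}{|D_{domain}|}$
where D_domain is the set of documents classified as relevant by a human and D_boolean is the set of documents classified as relevant by the query

Note: the ideal case is high precision and high recall
Harmonic mean F of precision P and recall R:
$F = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}$

The higher the value of F, the better balanced the query is in terms of precision and recall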
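
A small sketch of the evaluation, assuming documents are identified by hashable ids. Precision is the share of query-selected documents that are truly relevant, recall is the share of truly relevant documents that the query selects, and F is their harmonic mean, 2PR / (P + R). The function name and the convention of returning 0.0 for empty sets are choices made for this example.

```python
def precision_recall_f(d_domain: set, d_boolean: set):
    """Return (precision, recall, F) for a query.

    d_domain:  ids of documents a human judged relevant
    d_boolean: ids of documents the Boolean query classifies as relevant
    """
    hits = len(d_domain & d_boolean)
    precision = hits / len(d_boolean) if d_boolean else 0.0
    recall = hits / len(d_domain) if d_domain else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


if __name__ == "__main__":
    # 4 relevant documents; the query returns 3, of which 2 are relevant.
    print(precision_recall_f({1, 2, 3, 4}, {3, 4, 5}))
```

Because the harmonic mean is dragged down by the smaller of the two values, a query cannot score well by being precise but missing most relevant pages, or the reverse.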

17
Simplifying Keyword Spices: Algorithm

Collect and classify example documents
Split the examples into D_training (for generating the initial decision tree) and D_validation (for simplifying the keyword spice)
Build the initial decision tree
Convert the tree into a set of positive conjunctions, the query
Simplify the query
For each conjunction, remove the keyword whose removal yields the largest increase in the harmonic mean F
For the query, remove the conjunction whose removal yields the largest increase in the harmonic mean F
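
The simplification step can be sketched as a greedy pruning loop over the validation set. Several details here are assumptions rather than the published procedure: each conjunction is scored on its own while its keywords are pruned, a removal is accepted whenever F does not drop (ties count as acceptable), and the last remaining keyword or conjunction is never removed. A query is a list of conjunctions, each a list of `(keyword, must_be_present)` literals, and validation examples are `(keywords, label)` pairs.

```python
def matches(conjunction, words):
    """True if every literal of the conjunction holds for the page's keywords."""
    return all((kw in words) == positive for kw, positive in conjunction)


def f_measure(query, validation):
    """Harmonic mean of precision and recall of `query` on the validation set."""
    returned = [label for words, label in validation
                if any(matches(c, words) for c in query)]
    relevant = sum(1 for _, label in validation if label)
    hits = sum(returned)
    if hits == 0:
        return 0.0
    p, r = hits / len(returned), hits / relevant
    return 2 * p * r / (p + r)


def prune(items, score):
    """Greedily drop the item whose removal gives the best score.

    Stops when every possible removal would lower the score, or when only
    one item is left.
    """
    items = list(items)
    while len(items) > 1:
        best_score, best_i = max(
            (score(items[:i] + items[i + 1:]), i) for i in range(len(items)))
        if best_score < score(items):
            break
        del items[best_i]
    return items


def simplify(query, validation):
    # Step 1: prune keywords inside each conjunction, scoring it on its own.
    pruned = [prune(c, lambda conj: f_measure([conj], validation)) for c in query]
    # Step 2: prune whole conjunctions, scoring the full query.
    return prune(pruned, lambda q: f_measure(q, validation))
```

Scoring on the validation set rather than the training set is what keeps the pruning from simply preserving rules that overfit the training pages.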

18
Experiment Result

2000 sample pages, 1000 for training and 1000 for
validation

19
Experiment Result (cont)

From validation set

20
Experiment Result (cont)

From a general purpose search engine

21
Experiment Result (cont)

Comparing with filtering model

22
Future work

Training examples collection
Using a web directory as a source of examples
Noise in the training set
Bias in the training set
Learning classifiers from partially labeled data
Apply to Vietnamese?
Keyword extraction problem: ambiguity in word segmentation
A commercial Vietnamese search engine: Google (.vn domain)?

23
Reference

Domain-Specific Web Search with Keyword Spices, IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, January 2004
Satoshi Oyama, Takashi Kokubo, and Toru Ishida: Department of Social Informatics, Kyoto University, Kyoto 606-8501, Japan
Teruhiro Yamada: Laboratories of Information Science and Technology
Yasuhiko Kitamura: Department of Information and Communication Engineering, Osaka City University, Osaka 558-8585, Japan

24
Discussion
