Title: Domain-Specific Web Search with Keyword Spices
1Domain-Specific Web Search with Keyword Spices
- Thanh Viet Nguyen
- Faculty of Information Technology, University of Natural Sciences
- nvthanh_at_fit.hcmuns.edu.vn
- http://www.fit.hcmuns.edu.vn/nvthanh
- 06/07/2005
2Outline
- Introduction
- State of the art
- Algorithm for Extracting Keyword Spices
- Experiment
- Future Work
- Reference
3Introduction
- A domain-specific search engine aims to return only pages relevant to a certain domain.
- Why a domain-specific search engine rather than a general-purpose search engine (like Google)?
  - Browsing through the pages returned by Google is time consuming
  - The first 10, 20, or 100 returned pages are the most important in a commercial search engine (precision is desired)
  - Users have poor query refinement skills (inexperienced users)
  - 70% of web searches use only one keyword (Butler, 2000)
4State of the art
- Index only domain-specific pages using a specific crawler (machine learning approach) (1)
  - Cora (McCallum et al, 1999): search engine for computer research papers
  - SPIRAL (Cohen, 1998), WebKB (Craven et al, 1998), Google Scholar
  - Pros: sophisticated search, high precision (from indexed data, not real data)
  - Cons: time and network bandwidth for the crawler
- Filtering model (meta search engine approach) (2)
  - Ahoy (Shakes et al, 1997): search engine for personal homepages
  - Pros: reuses commercial search engine power
  - Cons: slow response time
- Conclusion: while (1) is efficient for domains with convergent pages (computer research papers), (2) is suitable for domains with dispersed pages (personal pages) (Oyama et al, 2004)
5State of the art (cont)
- Query refinement (meta search engine approach)
  - Relevance feedback (Salton and Buckley, 1990): best for a specific user
  - Query modification (Glover et al, 2001): best for a specific domain
  - Keyword spices (Oyama et al, 2001): best for a specific domain
- Comparison between keyword spices and the filtering model
6Keyword spices - Introduction
- The keyword spice model aims to extend (refine) the user's query with domain-specific keyword spices.
- The new query is passed to a general-purpose search engine
- The returned (relevant?) pages are shown to the user
- Challenge: how can we find the most effective keywords for a specific domain?
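The query-extension step can be sketched as follows. This is only an illustration: it assumes the spice is a list of conjunctions of (keyword, wanted) pairs and that the target search engine accepts `OR` and a leading `-` for exclusion. The names `format_spice` and `add_spice` and the sample spice are hypothetical, not taken from the slides.

```python
def format_spice(spice):
    """Render a spice given as a list of conjunctions.

    Each conjunction is a list of (keyword, wanted) pairs; wanted=False
    means the keyword must be absent, written here with a leading '-'.
    The OR / '-' syntax is an assumption about the target search engine.
    """
    parts = []
    for conjunction in spice:
        terms = [kw if wanted else "-" + kw for kw, wanted in conjunction]
        parts.append("(" + " ".join(terms) + ")")
    return " OR ".join(parts)


def add_spice(user_query, spice):
    """Extend the user's query with the domain's keyword spice."""
    return f"{user_query} ({format_spice(spice)})"


# Hypothetical spice for a recipe domain, for illustration only.
recipe_spice = [[("ingredients", True), ("home", False)], [("tablespoon", True)]]
print(add_spice("beef", recipe_spice))
```

The user's own keywords stay untouched; only the domain-specific part is appended, so the same spice can be reused for every query in that domain.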
7Keyword spices Illustrative example
- Basic idea: domain-specific pages share some identical keywords or phrases
  - Personal homepage: my name is, my homepage, ...
  - Call for papers: important dates, committee, ...
- Preliminary experiment
  - Cooking recipe domain (Japanese): beef pepper >> beef
  - Computer business (Vietnamese): I tried to search for pages with information about how (where) to buy a new (old) computer.
  - Some tried keywords: máy vi tính (computer), cửa hành máy vi tính and cửa hàng máy vi tính (computer shop), mua bán máy vi tính (buying and selling computers), bảo hành máy vi tính (computer warranty)
8Keyword spices Illustrative example (cont)
9Keyword spices Illustrative example (cont)
10Keyword spices Illustrative example (cont)
- Judgment: the best search keyword is bảo hành máy vi tính (computer warranty)
  - All of the top 10 and top 20 returned pages have information related to computer business
  - 28 of the top 100 are tightly relevant pages
  - The results link both to most computer business companies, such as WestCom, MekongGreen, Nguyen Hoang, TH, FPT, CMS, and to many classified advertisements for old computers.
- Question: should bảo hành (warranty) be a keyword spice? How can we find other keyword spices?
11Keyword spice extraction Pre-processing
- Pages collected from a general-purpose search engine are classified by hand into two classes: T (relevant to the domain) or F (irrelevant to the domain)
- Remove HTML tags and extract nouns as keywords
- Split the example pages into two disjoint subsets, the training set and the validation set.
- From now on, all examples (unless explicitly indicated otherwise) are results of searching for keyword spices for the recipe domain (Japanese)
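The pre-processing steps above can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: noun extraction needs a part-of-speech tagger, so the sketch keeps every word token instead, and the 50/50 split, the seed, and the names `TextExtractor`, `page_keywords`, and `split_examples` are hypothetical choices.

```python
import random
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of a page, dropping the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_keywords(html):
    """Return the set of lower-cased word tokens in a page.

    The slides keep nouns only; that needs a part-of-speech tagger,
    so this sketch keeps every word token instead (an assumption).
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"\w+", " ".join(parser.chunks).lower()))


def split_examples(examples, seed=0):
    """Shuffle labeled examples and split them into two disjoint halves."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

Each page then becomes a set of keywords paired with its hand-assigned label (T or F), which is the input the later steps work on.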
12Initial Search Expression
- Build an initial decision tree from the training set, following Quinlan (1986), to classify relevant and irrelevant documents.
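As an illustration of how such a tree picks its test at each node, the sketch below assumes the information-gain criterion of Quinlan's 1986 ID3 and a binary attribute per keyword (present or absent on the page). The function names and the data layout, a list of (keyword set, label) pairs, are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of True/False labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(examples, keyword):
    """Entropy reduction from splitting on the presence of `keyword`.

    `examples` is a list of (keywords, label) pairs, where `keywords`
    is a set of strings and `label` is True for class T.
    """
    labels = [label for _, label in examples]
    with_kw = [label for kws, label in examples if keyword in kws]
    without_kw = [label for kws, label in examples if keyword not in kws]
    remainder = sum(
        len(part) / len(examples) * entropy(part)
        for part in (with_kw, without_kw)
    )
    return entropy(labels) - remainder


def best_keyword(examples):
    """Keyword with the highest information gain (first in sorted order on ties)."""
    vocabulary = sorted(set().union(*(kws for kws, _ in examples)))
    return max(vocabulary, key=lambda kw: information_gain(examples, kw))
```

Growing the full tree means applying `best_keyword` recursively to the two subsets (pages with and without the chosen keyword) until each subset contains a single class.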
13Initial Search Expression (cont)
- Convert the tree into a set of positive conjunctions (class T): the initial query
- Problem: large decision tree → overly complex query that cannot be used with a commercial search engine → we must reduce the keyword spice size
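The conversion can be sketched as a walk over all root-to-leaf paths, keeping those that end in a class T leaf. The tree encoding used here (a Boolean leaf, or a tuple of keyword, subtree if present, subtree if absent) and the sample tree are assumptions made for illustration.

```python
def tree_to_conjunctions(tree, path=()):
    """List every root-to-leaf path that ends in a class T leaf.

    A leaf is True (class T) or False (class F); an internal node is a
    tuple (keyword, subtree_if_present, subtree_if_absent). Each path is
    returned as a list of (keyword, wanted) pairs.
    """
    if isinstance(tree, bool):
        return [list(path)] if tree else []
    keyword, present, absent = tree
    return (
        tree_to_conjunctions(present, path + ((keyword, True),))
        + tree_to_conjunctions(absent, path + ((keyword, False),))
    )


# Hypothetical tree, for illustration only.
tree = ("ingredients", ("home", False, True), ("tablespoon", True, False))
print(tree_to_conjunctions(tree))
```

Every class T leaf contributes one conjunction, and each conjunction is as long as the leaf is deep, which is why a large tree yields a query too long for a search engine.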
14Simplifying Keyword Spices
- Basic idea: we simplify the initial query by removing keywords or conjunctions without reducing classification performance (measured on the validation set)
- How can we evaluate query performance?
15Information retrieval evaluation
- $\text{Precision} = \dfrac{|\text{relevant returned documents}|}{|\text{returned documents}|}$
- $\text{Recall} = \dfrac{|\text{relevant returned documents}|}{|\text{relevant documents}|}$
16Keyword spices evaluation
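$P = \dfrac{|D_{domain} \cap D_{boolean}|}{|D_{boolean}|}$, $R = \dfrac{|D_{domain} \cap D_{boolean}|}{|D_{domain}|}$
Where $D_{domain}$ is the set of relevant documents as classified by a human, and $D_{boolean}$ is the set of documents classified as relevant by the query
- Note: the ideal case is to have both high precision and high recall
- Harmonic mean F of precision P and recall R: $F = \dfrac{2}{\frac{1}{P} + \frac{1}{R}} = \dfrac{2PR}{P + R}$
- The higher the value of F, the better balanced the query is in terms of precision and recall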
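A small worked example of these measures, treating the human-labeled relevant documents and the documents accepted by the query as Python sets. The function name `evaluate`, the handling of empty sets, and the document ids are hypothetical.

```python
def evaluate(d_domain, d_boolean):
    """Return (precision, recall, F) for two sets of document ids.

    d_domain: documents a human labeled as relevant.
    d_boolean: documents the query classified as relevant.
    Empty sets are scored as 0.0 (a convention chosen for this sketch).
    """
    hits = len(d_domain & d_boolean)
    precision = hits / len(d_boolean) if d_boolean else 0.0
    recall = hits / len(d_domain) if d_domain else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical document ids, for illustration only.
d_domain = {1, 2, 3, 4}
d_boolean = {3, 4, 5}
print(evaluate(d_domain, d_boolean))
```

Here two of the three returned documents are relevant (P = 2/3) and two of the four relevant documents are returned (R = 1/2), so F = 4/7, which sits closer to the lower of the two values than the arithmetic mean would.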
17Simplifying Keyword Spices Algorithm
- Collect and classify example documents
- Split the examples into D_training (for generating the initial decision tree) and D_validation (for simplifying the keyword spice)
- Build the initial decision tree
- Convert the tree into a set of positive conjunctions, the query
- Simplify the query
  - For each conjunction, remove the keyword whose removal results in the maximum increase in the harmonic mean F
  - For the query, remove the conjunction whose removal results in the maximum increase in the harmonic mean F
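The simplification step can be sketched as a greedy search. This is a minimal illustration, not the authors' implementation: it assumes a query is a list of conjunctions of (keyword, wanted) pairs, that validation examples are (keyword set, label) pairs, and that removal repeats until no single removal strictly increases F. All function names are hypothetical.

```python
def matches(query, keywords):
    """True if any conjunction of (keyword, wanted) pairs holds for a page."""
    return any(
        all((kw in keywords) == wanted for kw, wanted in conj) for conj in query
    )


def f_measure(query, examples):
    """Harmonic mean of precision and recall of `query` on labeled examples."""
    returned = [label for kws, label in examples if matches(query, kws)]
    relevant = sum(label for _, label in examples)
    hits = sum(returned)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(returned), hits / relevant
    return 2 * precision * recall / (precision + recall)


def best_removal(candidates, score, current):
    """Best-scoring candidate if it beats `current`, else None."""
    best = max(candidates, key=score, default=None)
    return best if best is not None and score(best) > current else None


def simplify(query, validation):
    """Greedily drop keywords, then conjunctions, while F keeps increasing."""
    query = [list(conj) for conj in query]

    def score(q):
        return f_measure(q, validation)

    # Step 1: shorten each conjunction, one keyword at a time.
    for i in range(len(query)):
        while len(query[i]) > 1:
            candidates = [
                query[:i] + [query[i][:j] + query[i][j + 1:]] + query[i + 1:]
                for j in range(len(query[i]))
            ]
            better = best_removal(candidates, score, score(query))
            if better is None:
                break
            query = better
    # Step 2: drop whole conjunctions, one at a time.
    while len(query) > 1:
        candidates = [query[:k] + query[k + 1:] for k in range(len(query))]
        better = best_removal(candidates, score, score(query))
        if better is None:
            break
        query = better
    return query
```

Scoring on the validation set rather than the training set is what lets the search discard keywords that only fit the training pages; the sketch never empties a conjunction or the query, since an empty conjunction would match every page.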
18Experiment Result
- 2000 sample pages, 1000 for training and 1000 for
validation
19Experiment Result (cont)
20Experiment Result (cont)
- From a general purpose search engine
21Experiment Result (cont)
- Comparison with the filtering model
22Future works
- Training examples collection
- Using a web directory as a source of examples
- Noise in the training set
- Bias in the training set
- Learning classifiers from partially labeled data
- Apply to Vietnamese?
  - Keyword extraction problem: ambiguity in word segmentation
  - A commercial Vietnamese search engine: Google (.vn domain)?
23Reference
- Domain-Specific Web Search with Keyword Spices, IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, January 2004
- Satoshi OYAMA, Takashi KOKUBO and Toru ISHIDA
  - Department of Social Informatics
  - Kyoto University, Kyoto 606-8501, Japan
- Teruhiro YAMADA
  - Laboratories of Information Science and Technology
- Yasuhiko KITAMURA
  - Department of Information and Communication Engineering
  - Osaka City University, Osaka 558-8585, Japan
24Discussion