1
Learning Based Web Query Processing

Yanlei Diao
Computer Science Department
Hong Kong U. of Science & Technology

2
Outline

Background
Learning Based Web Query Processing
FACT: A Prototype System
Preliminary System Evaluation
Conclusions
Demonstration

3
Searching the Web

Want to find a piece of information on the Web?

Heterogeneity
Huge Size
Lack of Structure
Diversified User Bases
Ever-Changing
4
Search Engines

Maintain indices, accept keyword input, match input keywords with indices, return relevant documents.
Problems:
Large hit lists with low precision. Users find relevant documents by browsing.
URLs, not the required information, are returned. Users must read the pages to find the required information.

5
Web Information Retrieval

IR: Vector-space model, search and browse capabilities
Web IR: Web navigation, indexing, query languages, query-document matching, output ranking, user relevance feedback
Recent Improvement: Hierarchical classification, better presentation of results, hypertext study, metasearching...

6
Web IR for Query Processing

Problems:
A list of URLs or documents is returned. Users browse a lot to find information.
It asks users for precise query requirements, which is hard for casual users.
It lacks a well-defined underlying model. The vector-space model does not convey as much as hypertext.
→ Large hit lists with low precision; results rely on input queries

7
Intelligent Agents

The agents learn user profiles/models from their search behaviors and employ the knowledge to predict URLs of interest to the user.
Some rely on search engines and heuristics to find targets of a specific type, e.g. papers or homepages.
Some help users in an interactive mode: they learn while users are browsing.
Some adaptive agents work autonomously: they use heuristics, recommend pages of interest and take user feedback to improve.

8
Agents for Query Processing

Problems:
Recommending pages of interest, but not information of interest to the user
Using the vector-space model or converting HTML to text documents
Requiring a priori knowledge, such as user profiles, or using heuristics for a particular domain
→ Not well suited for ad hoc queries

9
Database Approaches

The Web is a directed graph: nodes are Web pages and edges are hyperlinks between pages.
Query languages: the 1st generation combines content-based and structure-based queries; the 2nd generation accesses the structure of Web objects and creates complex objects.
Wrappers and mediators: they present an integrated view of the resources.

10
DB Approaches for Query Processing

Problems:
Wrapper generation is only feasible for a limited number of sites in a domain. The Web is growing very fast!
Web query languages require knowledge of the Web sites (content and linkage) and the language syntax. They are hard to use.
→ Not scalable; good for Web site management but not for queries on the entire Web.

11
Our Goal

A Web query processing system for any Web user that
processes ad hoc queries on HTML pages
automatically extracts succinct and precise query results (a result may take the form of a table, a list or a paragraph).
→ Learn the knowledge for query processing from the User!

12
Proposed Approach

An approach with learning capabilities
Keyword input (probably not precise)
Search engines return a URL list
During browsing, learns from users how to navigate through the web pages and how to identify the required information on a web page
Processes the remaining URLs automatically
Returns succinct and precise results

13
Unique Features

Returning succinct and precise results, i.e. segments of pages
No a priori knowledge or preprocessing; suited for ad hoc queries
Exploiting page formatting and linkage information simultaneously, making good use of the rich information conveyed by HTML.

14
Benefits from Learning

Bridging the gap between keyword input and real
query requirements
Capable of navigating in the neighborhoods of
documents returned by search engines
Automating the processing of all possibly
relevant documents in one query
Almost imperceptible to users, user-friendly

15
Outline

Background
Learning Based Web Query Processing
FACT: A Prototype System
Preliminary System Evaluation
Conclusions
Demonstration

16
Modeling a Web Page

Segment: a group of tag-delimited elements and the unit in query processing, e.g. paragraph, table, list. Segments nest (from atomic segments up to the document), forming a Segment Tree.
Attributes of a segment:
content: text in the scope of the segment
description: summary of the content
Hyperlink: represented as a segment to be comparable
content: URL
description: anchor text
associated with the parent segment
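
The page model above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `Segment`, the `is_link` flag, and the helpers `make_link` and `walk` are hypothetical and are not taken from the FACT system.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Segment:
    """A tag-delimited unit of a page (paragraph, table, list, ...)."""
    content: str                      # text in the scope of the segment
    description: str = ""             # summary of the content
    is_link: bool = False             # hyperlinks are modeled as segments too
    children: List["Segment"] = field(default_factory=list)
    # Excluded from repr/compare so parent <-> child cycles do not recurse.
    parent: Optional["Segment"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Segment") -> "Segment":
        """Attach a child segment and return it."""
        child.parent = self
        self.children.append(child)
        return child


def make_link(url: str, anchor_text: str) -> Segment:
    """For a hyperlink, content is the URL and description is the anchor text."""
    return Segment(content=url, description=anchor_text, is_link=True)


def walk(segment: Segment) -> Iterator[Segment]:
    """Depth-first traversal of a segment tree."""
    yield segment
    for child in segment.children:
        yield from walk(child)


# A tiny segment tree: document -> paragraph -> link.
page = Segment(content="", description="document")
rates = page.add(Segment(content="Room rates ...", description="Room rates"))
link = rates.add(make_link("rates.html", "Hotel Rates"))
print([s.description for s in walk(page)])
```

Keeping links as ordinary segments means one scoring routine can rank both kinds of component, which is the reason the slide gives for making them comparable.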

17
A Sample
18
Modeling a Web Site

Ignore backward links, links pointing to themselves, links outside a site.
A Web site is modeled as hyperlink-connected segment trees, called a Segment Graph.
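
A minimal sketch of the link filtering described above, using only the Python standard library. Two assumptions are made here that the slides do not state: a "backward link" is taken to mean a link to an already visited page, and "same site" is taken to mean the same host name. The function name `followable_links` and the example URLs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def followable_links(page_url, hrefs, visited):
    """Return the absolute URLs that would be kept as edges of the Segment Graph."""
    site = urlparse(page_url).netloc
    current, _ = urldefrag(page_url)
    kept = []
    for href in hrefs:
        # Resolve relative links and drop any #fragment.
        target, _ = urldefrag(urljoin(page_url, href))
        if urlparse(target).netloc != site:
            continue              # link outside the site
        if target == current:
            continue              # link pointing to the page itself
        if target in visited:
            continue              # backward link (assumed: already visited page)
        kept.append(target)
    return kept


links = followable_links(
    "http://hotel.example/rooms/index.html",
    ["rates.html", "#top", "../index.html", "http://other.example/x", "/contact.html"],
    visited={"http://hotel.example/index.html"},
)
print(links)
```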

19
Knowledge for the Locating Task
The locating task is to find a segment in the
Segment Graph of a site as the query result.
20
Two Types of Knowledge
A link conveys description of the pointed page
while a queried segment contains both description
and the result itself.
21
Navigation Knowledge

Concerns descriptive information and helps find the navigational path
A set of (term, weight) pairs
Term: a selected word from the description of segments and links on the navigational path
Weight: indicates the importance of the term in leading to the queried segment

22
Learning Navigation Knowledge

Navigational path: (link →)* segment, e.g. L2 → L4 → S41.
Extended navigational path: ((segment →)* link →)* ((segment →)* segment), e.g. (S1 → S11 → L2) → (S3 → S31 → L4) → (S4 → S41).

Step 1. Assign a weight to each component on the path, e.g. L2, S31, S41. The closer to the target, the higher the weight.
Step 2. Assign a weight to each term in the description of a component on the path.
The weight of a term can be summed up over navigational paths. The set of (term, weight) pairs is stored into the navigation knowledge base.
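
The two steps can be illustrated with a short sketch. The slides only say that components closer to the target get higher weights; the linear scheme `(i + 1) / n`, the whitespace tokenization, and the function name `learn_navigation` are assumptions made for this example, not the weighting scheme actually used by the system.

```python
from collections import defaultdict


def learn_navigation(paths, knowledge=None):
    """Accumulate (term, weight) pairs from navigational paths.

    Each path is a list of component descriptions, ordered from the
    starting component to the queried segment.
    """
    if knowledge is None:
        knowledge = defaultdict(float)
    for path in paths:
        n = len(path)
        for i, description in enumerate(path):
            # Step 1 (assumed linear scheme): later components weigh more.
            component_weight = (i + 1) / n
            # Step 2: every distinct term inherits its component's weight,
            # and weights are summed over all paths seen so far.
            for term in set(description.lower().split()):
                knowledge[term] += component_weight
    return knowledge


kb = learn_navigation([["Hotel Reservation", "Room Rates"]])
print(dict(kb))
```

Passing an existing knowledge base as the second argument adds the weights from new paths to the stored ones, which mirrors the "summed up over navigational paths" remark.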
23
Classification knowledge

Checks if a segment meets query requirements on both descriptive information and the result.
Cast in the Bayesian learning framework.
Set of triples (feature, NP, NN)
Feature: word, integer, real, symbol, ..., date, time, email address, ..., contained in a segment
NP: occurrences of the feature in positive samples
NN: occurrences of the feature in negative samples

24
Learning Classification knowledge
The queried segment is a positive sample. All other segments on the same page are negative samples.
The content of each segment is parsed into a set of features, of either simple or complex types.
Count NP and NN cumulatively for each feature over all samples. Store all triples (feature, NP, NN) into the classification knowledge base.
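
The counting step might look like the sketch below. The feature extractor is a deliberately simplified stand-in: it recognizes only words, integers and reals, the type labels `<real>` and `<integer>` are invented, and each feature is counted once per sample. None of these details come from the slides.

```python
import re
from collections import defaultdict

# Hypothetical typed features; the real system lists more types
# (symbol, date, time, email address, ...).
FEATURE_PATTERNS = [
    ("<real>", re.compile(r"\d{1,3}(,\d{3})*\.\d+|\d+\.\d+")),
    ("<integer>", re.compile(r"\d+")),
]


def features(content):
    """Parse segment content into a set of features."""
    found = set()
    for raw in content.lower().split():
        token = raw.strip(".,;:")
        if not token:
            continue
        for name, pattern in FEATURE_PATTERNS:
            if pattern.fullmatch(token):
                found.add(name)
                break
        else:
            found.add(token)  # plain word or symbol
    return found


def learn_classification(segments, queried_index, knowledge=None):
    """Update {feature: [NP, NN]} from the segments of one page."""
    if knowledge is None:
        knowledge = defaultdict(lambda: [0, 0])
    for i, content in enumerate(segments):
        column = 0 if i == queried_index else 1  # 0 = NP, 1 = NN
        for feature in features(content):
            knowledge[feature][column] += 1
    return knowledge


kb = learn_classification(
    ["Holiday Inn Golden Mile In the heart of Tsim Sha Tsui",
     "standard room 999.00 1,039.00 deluxe room 1,199.00 1,239.00"],
    queried_index=1,
)
print(kb["<real>"], kb["holiday"])
```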
25
Query Processing Using Learned Knowledge

After a Web page is retrieved, the segment graph is built.
For each segment and link, a score is computed by applying the navigation knowledge (ApplyNavigation).
Segments/links are sorted by score.
If a link has the highest score, the system navigates through the link.
If a segment has the highest score, all segments on the page are checked to see if there is a queried segment.
The process is repeated until either a segment is found or the system can conclude that the site does not contain the queried information.
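
The loop described above can be sketched as follows. This is a simplified reading, not the system's algorithm: it follows the best link without backtracking, stops after `max_pages` pages, and takes the scoring and confidence functions as parameters. The dictionary layout of a component and all names (`locate`, `components_of`, `nav_score`, `confidence`) are hypothetical.

```python
def locate(start, components_of, nav_score, confidence, threshold=0.5, max_pages=10):
    """Return (queried segment content or None, number of pages visited)."""
    page, visited = start, 0
    while page is not None and visited < max_pages:
        visited += 1
        # Rank every link and segment on the page by its navigation score.
        comps = sorted(components_of(page),
                       key=lambda c: nav_score(c["description"]),
                       reverse=True)
        page = None
        segments_checked = False
        for comp in comps:
            if comp["kind"] == "link":
                page = comp["target"]      # navigate through the link
                break
            if not segments_checked:
                # A segment ranks highest: classify all segments on the page.
                segments_checked = True
                segments = [c for c in comps if c["kind"] == "segment"]
                best = max(segments, key=lambda c: confidence(c["content"]))
                if confidence(best["content"]) >= threshold:
                    return best["content"], visited
    return None, visited


# Toy site with made-up content, used only to exercise the loop.
site = {
    "home": [
        {"kind": "segment", "description": "welcome", "content": "Welcome"},
        {"kind": "link", "description": "hotel rates", "target": "rates"},
        {"kind": "link", "description": "dining", "target": "dining"},
    ],
    "rates": [
        {"kind": "segment", "description": "room rates", "content": "standard room 999.00"},
    ],
    "dining": [],
}
weights = {"rates": 1.0, "room": 0.8}
result = locate(
    "home",
    site.__getitem__,
    lambda d: max((weights.get(t, 0.0) for t in d.split()), default=0.0),
    lambda content: 1.0 if "999.00" in content else 0.0,
)
print(result)
```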

26
Locating Algorithm
On each page, if the result is not found:
choose an unprocessed component with the highest score
if a link is chosen →
if a segment is chosen
27
Locating Algorithm
On each page, if the result is not found:
choose an unprocessed component with the highest score
if a link is chosen
if a segment is chosen → (ApplyClassification)
28
Applying Learned Knowledge

Application of Navigation Knowledge:
extracts terms in the description of a link/segment
reads the weights of the terms and assigns a score to the link/segment by a certain function (currently max)
sorts all links and segments by their scores
Application of Classification Knowledge:
computes the confidence C to classify a segment as the queried result
chooses the segment on a page with the largest C. If the largest C is over a threshold, returns the segment
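
A sketch of both application steps. The max function for navigation scores is stated on the slide. The confidence formula is not given there, so the example substitutes a standard naive Bayes estimate with add-one smoothing; it also assumes the numbers of positive and negative training samples are available alongside the (feature, NP, NN) triples. The function names and the 0.5 threshold are hypothetical.

```python
import math


def apply_navigation(description, nav_kb):
    """Score a link/segment as the max weight of the known terms it contains."""
    terms = description.lower().split()
    return max((nav_kb.get(t, 0.0) for t in terms), default=0.0)


def apply_classification(feats, cls_kb, n_pos, n_neg):
    """Confidence C that a segment is the queried result.

    Assumed formula: naive Bayes over the segment's features with
    add-one smoothing, returned as P(positive | features).
    """
    log_pos = math.log(n_pos / (n_pos + n_neg))
    log_neg = math.log(n_neg / (n_pos + n_neg))
    for f in feats:
        np_count, nn_count = cls_kb.get(f, (0, 0))
        log_pos += math.log((np_count + 1) / (n_pos + 2))
        log_neg += math.log((nn_count + 1) / (n_neg + 2))
    return 1.0 / (1.0 + math.exp(log_neg - log_pos))


def pick_segment(segment_feats, cls_kb, n_pos, n_neg, threshold=0.5):
    """Index of the segment with the largest C, or None if C is below threshold."""
    scored = [(apply_classification(f, cls_kb, n_pos, n_neg), i)
              for i, f in enumerate(segment_feats)]
    best_c, best_i = max(scored)
    return best_i if best_c >= threshold else None


# Made-up counts, for illustration only.
cls_kb = {"<real>": (3, 0), "room": (3, 1), "located": (0, 5)}
print(apply_navigation("Hotel Rates", {"rates": 2.5, "hotel": 0.5}))
print(pick_segment([{"located"}, {"<real>", "room"}], cls_kb, n_pos=3, n_neg=9))
```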

29
[Screenshot: Hotel 1, Hotel 2. Callout: User browses it!]
30
User clicks here!
31
Room information
User marks it!
32
Generating Navigation Knowledge

The navigation path looks like:
Hotel Reservation → single hk double hk standard room deluxe room executive room
By our weighting scheme, a weight is assigned to each term.

33
Generating Classification Knowledge

Training Samples
Occurrences of each feature are counted

Negative: Holiday Inn Golden Mile In the heart of Tsim Sha Tsui - Kowloon, Holiday Inn Golden Mile is your number one choice for accommodation, dining, meetings and banquets. Ideally situated in the heart of ...
Positive: single hk double hk standard room 999.00 1,039.00 deluxe room 1,199.00 1,239.00 executive room 1,399.00 1,499.00
34
FACT starts here!
36
Applying Navigation Knowledge

The page contains:
Paragraph: 57 - 73 Lockhart Road, Wanchai, Hong Kong, SAR, PRC
Paragraph: Located in the hub of Wanchai, the Wharney Hotel is within walking distance of the Hong Kong Arts Centre, Convention and Exhibition Centre, busy commercial complexes and shopping malls. ...
Paragraph: TEL (852) 2861-1000 FAX (852) 2865-6023
Links: Main Features Services Dining and Banqueting Hotel Rates Reservation ...
Navigation knowledge shows
37
Navigation Knowledge assigns scores
FACT chooses it!
38
Navigation Knowledge assigns scores
39
Classification Knowledge computes confidence
Apply Classification Knowledge to all Segments
40
FACT finds it!
41
Outline

Background
Learning Based Web Query Processing
FACT: A Prototype System
Preliminary System Evaluation
Conclusions
Demonstration

42
A Query Processing System

A learning based query processing system
User Interface: accepts user queries, presents query results; includes a browser capable of capturing user actions
Query Analyzer: analyzes and transforms user queries
Session Controller: coordinates learning and locating
Learner: generates knowledge from captured user actions
Locator: applies knowledge and locates query results
Retriever & Parser: retrieves pages and parses them into trees
Knowledge Base: stores learned knowledge

43
Reference Architecture
44
A Query Session
45
Training Strategies

Sequential
First n sites: user browses and system learns
Next N-n sites: system processes
Random
Randomly chosen n sites: user browses and system learns
The system processes the rest
Interleaved
First n0 sites: user browses and system learns
Next n - n0 sites: system makes decisions. For incorrect ones, user browses and system re-learns
Next N-n sites: system processes
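
The three strategies differ only in which sites are used for training and when. A minimal sketch, assuming the site list is already ordered as returned by the search engine; the callbacks `learn`, `decide` and `is_correct` stand in for the user and the system and are hypothetical.

```python
import random


def sequential_split(sites, n):
    """First n sites are training sites, the remaining N - n are processed."""
    return sites[:n], sites[n:]


def random_split(sites, n, seed=0):
    """n randomly chosen sites are training sites, the rest are processed."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(sites)), n))
    train = [s for i, s in enumerate(sites) if i in chosen]
    rest = [s for i, s in enumerate(sites) if i not in chosen]
    return train, rest


def interleaved(sites, n0, n, learn, decide, is_correct):
    """Train on the first n0 sites, re-learn only on mistakes among the next
    n - n0 sites, then process the remaining N - n sites automatically."""
    for site in sites[:n0]:
        learn(site)
    for site in sites[n0:n]:
        if not is_correct(site, decide(site)):
            learn(site)               # user browses, system re-learns
    return [decide(site) for site in sites[n:]]


print(sequential_split(["a", "b", "c", "d", "e"], 2))
```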

46
Outline

Background
Learning Based Web Query Processing
FACT: A Prototype System
Preliminary System Evaluation
Conclusions
Demonstration

47
System Evaluation

System Capabilities
Performance
Effectiveness: precision, recall, correctness
Efficiency: in a site, how many pages the system visits to find a result or to recognize irrelevancy
Training efficiency: how many training samples are needed
Key Issues
Effectiveness of the knowledge
Effectiveness of training strategies
Tests on a Range of Queries

48
A System Output Sample
49
System Capabilities

The system returns segments of the Web pages
The segments may not contain any input keyword
but meet the requirement of room rates.
The system learned the query requirement from the
user!
Segments can be from pages whose URLs are not
directly returned by Yahoo!.
The system learned how to follow the hyperlinks
to the queried segment!

50
System Evaluation - Effectiveness

Given a set of URLs in a query session, the system makes N decisions
N = N1 + N2 + N3 + N4
Precision = N1 / (N1 + N3),
Recall = N1 / (number of sites that contain results),
Correctness = (N1 + N2) / N.
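
The three measures are simple ratios. The transcript does not define N1 to N4, so the reading below is an assumption inferred from the formulas: N1 counts correct results returned, N2 correct "no result" conclusions, N3 wrong results returned, and N4 missed results. The counts in the example are made up and are not results from the evaluation.

```python
def effectiveness(n1, n2, n3, n4, sites_with_results):
    """Return (precision, recall, correctness) for one query session.

    Assumed meaning of the counters (not stated in the transcript):
    n1: correct result returned      n2: correctly reported no result
    n3: wrong result returned        n4: existing result missed
    """
    n = n1 + n2 + n3 + n4
    precision = n1 / (n1 + n3)
    recall = n1 / sites_with_results
    correctness = (n1 + n2) / n
    return precision, recall, correctness


# Made-up counts, for illustration only.
print(effectiveness(8, 5, 2, 1, sites_with_results=10))
```

The recall denominator is passed in separately because a wrong result may be returned for a site that does contain a result, so it cannot be derived from N1 and N4 alone under this reading.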

51
System Evaluation - Efficiency

How efficiently does the system find a queried segment in a site?
Level of a Queried Segment: the length of the shortest path to find it
Absolute Path Length = Visited pages,
Relative Path Length = Visited pages / Level of the Queried Segment.
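
A small sketch of the efficiency measures. It assumes the level is counted in pages along the shortest path from the entry page to the page holding the queried segment, with both ends included, so that an ideal run has a relative path length of 1. The page graph and function names are hypothetical.

```python
from collections import deque


def level(graph, start, target):
    """Pages on the shortest path from start to target (breadth-first search).

    Returns None if the target page is unreachable.
    """
    queue, seen = deque([(start, 1)]), {start}
    while queue:
        page, depth = queue.popleft()
        if page == target:
            return depth
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


def relative_path_length(visited_pages, segment_level):
    """Visited pages divided by the level of the queried segment."""
    return visited_pages / segment_level


# Toy site: the rates page is reachable directly and via the about page.
graph = {"home": ["about", "rates"], "about": ["rates"], "rates": []}
lvl = level(graph, "home", "rates")
print(lvl, relative_path_length(3, lvl))
```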

52
Basic Performance

Q11: Hong Kong Hotel Room Rate; Q12: Hong Kong Hotel
Sequential training
53
Effectiveness of Knowledge

Two other systems were implemented for comparison.
Classification Knowledge Only: treats links and segments the same, using the Bayes classifier
Learning:

Action           positive       negative
click a link     the link       other links on the page
mark a segment   the segment    other segments on the page

Locating: Classify all segments and links. If a link has the highest confidence, follow the link. If a segment has the highest confidence and passes the threshold, return it.
54
Effectiveness of Knowledge

Navigation Knowledge Only: only checks the descriptive information of links and segments
Learning: Navigational path → Navigation Knowledge
Locating: Assign scores to all links and segments using navigation knowledge. If a link has the highest score, follow the link. If a segment has the highest score, return it.
55
Effectiveness of Knowledge
56
Effects of Training Strategies
Query: Q12. Training size: 3-10
57
Effects of Training Strategies

Random training performs badly and is low in recall
As the training size increases, interleaved training outperforms sequential training
Best accuracy reaches or exceeds 90% in all metrics when the interleaved training strategy is used
Enlarging the training size for random and sequential training is not effective

58
Improved Performance
Interleaved training
59
A Range of Queries

Hotel room rates: targets prices, easy to identify
Admission requirements for graduate students: include items such as degree, GPA, GRE, etc. that are not easy to specify in keywords but easy to show by marking
Data Mining Researcher: a concept, subjective; evidence includes research interests, projects, professional activity, etc.

60
Results of A Range of Queries
Interleaved training
61
Performance for the Queries

Effectiveness
first 4 queries: accuracy is 80% to above 90%
the last query: still capable of filtering out irrelevant sites
Efficiency
relative path length to locate a queried segment is close to 1
absolute path length to conclude irrelevancy is no more than 2.5 pages.
The performance is not affected much by how precise the keyword query is. The system learns query requirements.

62
Outline

Background
Learning Based Web Query Processing
FACT: A Prototype System
Preliminary System Evaluation
Conclusions
Demonstration

63
Conclusions

Proposed and implemented learning based Web query processing with the following features:
Returning succinct results: segments of pages
No a priori knowledge or preprocessing; suited for ad hoc queries
Exploiting page formatting and linkage information simultaneously.
The preliminary results are promising.

64
Future Work