Title: Learning Based Web Query Processing
1Learning Based Web Query Processing
- Yanlei Diao
- Computer Science Department
- Hong Kong U. of Science and Technology
2Outline
- Background
- Learning Based Web Query Processing
- FACT A Prototype System
- Preliminary System Evaluation
- Conclusions
- Demonstration
3Searching the Web
- Want to find a piece of information on the Web?
Heterogeneity
Huge Size
Lack of Structure
Diversified User Bases
Ever-Changing
4Search Engines
- Maintain indices, accept keyword input, match input keywords against the indices, and return relevant documents.
- Problems
- Large hit lists with low precision. Users find relevant documents by browsing.
- URLs, but not the required information, are returned. Users read the pages for the required information.
5Web Information Retrieval
- IR: vector-space model, search and browse capabilities
- Web IR: Web navigation, indexing, query languages, query-document matching, output ranking, user relevance feedback
- Recent improvements: hierarchical classification, better presentation of results, hypertext study, metasearching...
6Web IR for Query Processing
- Problems
- A list of URLs or documents is returned. Users browse a lot to find information.
- It asks users for precise query requirements, which is hard for casual users.
- It lacks a well-defined underlying model. The vector-space model does not convey as much as hypertext.
- => Large hit lists with low precision, reliance on input queries
7Intelligent Agents
- The agents learn user profiles/models from their search behaviors and employ the knowledge to predict URLs of interest to the user.
- Some rely on search engines and heuristics to find targets of a specific type, e.g. papers or homepages.
- Some help users in an interactive mode: they learn while users are browsing.
- Some adaptive agents work autonomously: they use heuristics, recommend pages of interest, and take user feedback to improve.
8Agents for Query Processing
- Problems
- Recommending pages of interest, but not information of interest to the user
- Using the vector-space model or converting HTML to text documents
- Requiring a priori knowledge, such as user profiles, or using heuristics for a particular domain
- => Not well suited for ad hoc queries
9Database Approaches
- The Web is a directed graph: nodes are Web pages and edges are hyperlinks between pages.
- Query languages: the 1st generation combines content-based and structure-based queries. The 2nd generation accesses the structure of Web objects and creates complex objects.
- Wrappers and mediators: they present an integrated view of the resources.
10DB Approaches for Query Processing
- Problems
- Wrapper generation is only feasible for a limited number of sites in a domain. The Web is growing very fast!
- Web query languages require knowledge of the Web sites (content and linkage) and the language syntax. They are hard to use.
- => Not scalable; good for Web site management but not for queries on the entire Web.
11Our Goal
- A Web query processing system for any Web user that
- processes ad hoc queries on HTML pages
- automatically extracts succinct and precise query results (a result may take the form of a table, a list or a paragraph).
- => Learn the knowledge for query processing from the user!
12Proposed Approach
- An approach with learning capabilities
- Keyword input (probably not precise)
- Search engines return a URL list
- During browsing, learns from users
- to navigate through the web pages
- to identify the required information on a web page
- Processes the remaining URLs automatically
- Returns succinct and precise results
13Unique Features
- Returning succinct and precise results, i.e. segments of pages
- No a priori knowledge or preprocessing; suited for ad hoc queries
- Exploiting page formatting and linkage information simultaneously, making good use of the rich information conveyed by HTML.
14Benefits from Learning
- Bridging the gap between keyword input and real query requirements
- Capable of navigating in the neighborhoods of documents returned by search engines
- Automating the processing of all possibly relevant documents in one query
- Almost imperceptible to users, user-friendly
15Outline
- Background
- Learning Based Web Query Processing
- FACT A Prototype System
- Preliminary System Evaluation
- Conclusions
- Demonstration
16Modeling a Web Page
- Segment: a group of tag-delimited elements; the unit of query processing, e.g. paragraph, table, list. Segments are nested (from atomic segments up to the whole document) and form a Segment Tree.
- Attributes of a segment
- content: text in the scope of the segment
- description: summary of the content
- Hyperlink: represented as a segment so that links and segments are comparable
- content: URL
- description: anchor text
- associated with the parent segment
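A minimal sketch of this page model, assuming a simple in-memory representation. The class and field names (Segment, Link, walk) are illustrative and not taken from FACT, and the slides do not say how a description is derived, so the descriptions below are written by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    """A hyperlink, modeled like a segment so the two can be compared."""
    content: str      # the URL
    description: str  # the anchor text


@dataclass
class Segment:
    """A tag-delimited unit of a page: paragraph, table, list, or the document."""
    tag: str
    content: str      # text in the scope of the segment
    description: str  # summary of the content (hand-written here)
    children: List["Segment"] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)  # links attached to this parent


def walk(segment):
    """Yield every segment of a segment tree, parents before children."""
    yield segment
    for child in segment.children:
        yield from walk(child)


# A hypothetical page: the document segment with two nested segments.
page = Segment("body", "Hotel rates and contact", "Wharney Hotel", children=[
    Segment("table", "single 999.00 double 1,039.00", "Room rates"),
    Segment("p", "TEL (852) 2861-1000", "Contact",
            links=[Link("reservation.html", "Reservation")]),
])
tags = [s.tag for s in walk(page)]
```

Keeping links inside their parent segment mirrors the "associated with the parent segment" bullet: a link can later be scored by its own anchor text while still being reachable from the segment that contains it.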
17A Sample
18Modeling a Web Site
- Ignore backward links, links pointing to themselves, links outside a site.
- A Web site is modeled as hyperlink-connected segment trees, called the Segment Graph.
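The link-filtering rule can be sketched as follows. This is an assumption-laden illustration: forward_links is a hypothetical helper, "backward link" is taken to mean a link to a page that has already been visited, and "same site" is taken to mean the same host name.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def forward_links(page_url, hrefs, visited):
    """Return the absolute URLs on a page that are worth following."""
    site = urlparse(page_url).netloc
    here = urldefrag(page_url)[0]
    kept = []
    for href in hrefs:
        # Resolve relative links and drop any #fragment.
        target = urldefrag(urljoin(page_url, href))[0]
        if urlparse(target).netloc != site:
            continue  # link outside the site
        if target == here:
            continue  # link pointing to the page itself
        if target in visited:
            continue  # backward link (assumed: page already visited)
        if target not in kept:
            kept.append(target)
    return kept


links = forward_links(
    "http://hotel.example/index.html",
    ["rates.html", "#top", "http://other.example/x", "/home.html"],
    visited={"http://hotel.example/home.html"},
)
```

With these rules only rates.html survives, so the pages of a site connect into a graph that is traversed forward from the entry page.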
19Knowledge for the Locating Task
The locating task is to find a segment in the
Segment Graph of a site as the query result.
20Two Types of Knowledge
A link conveys a description of the page it points to, while a queried segment contains both a description and the result itself.
21Navigation Knowledge
- concerns descriptive information and helps find the navigational path
- a set of (term, weight) pairs
- Term: a selected word from the description of segments and links on the navigational path
- Weight: indicates the importance of the term in leading to the queried segment
22Learning Navigation Knowledge
- Navigational path: (link ->)* segment, e.g. L2 -> L4 -> S41.
- Extended navigational path: ((segment ->)* link ->)* ((segment ->)* segment), e.g. (S1 -> S11 -> L2) -> (S3 -> S31 -> L4) -> (S4 -> S41).
Step 1. Assign a weight to each component on the path, e.g. L2, S31, S41. The closer to the target, the higher the weight.
Step 2. Assign a weight to each term in the description of a component on the path.
The weight of a term can be summed up over navigational paths. The set of (term, weight) pairs is stored in the navigation knowledge base.
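The two steps can be sketched as below. The slides do not give the actual weighting function, so this sketch assumes a component's weight grows linearly with its position on the path and that every term in a description inherits its component's weight; learn_navigation is a hypothetical name.

```python
from collections import defaultdict


def learn_navigation(paths, knowledge=None):
    """Accumulate (term, weight) pairs from navigational paths.

    Each path is a list of component descriptions, ordered from the
    start page to the queried segment.
    """
    knowledge = defaultdict(float) if knowledge is None else knowledge
    for path in paths:
        n = len(path)
        for position, description in enumerate(path, start=1):
            # Step 1 (assumed scheme): closer to the target => higher weight.
            component_weight = position / n
            # Step 2 (assumed scheme): each distinct term gets that weight,
            # and weights are summed over all paths.
            for term in set(description.lower().split()):
                knowledge[term] += component_weight
    return knowledge


kb = learn_navigation([
    ["Hotel Reservation", "standard room deluxe room"],
    ["Room Rates"],
])
```

Here "room" ends with weight 2.0 because it appears at the target of both paths, while "hotel" only reaches 0.5 because it appears one step away from the target on a single path.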
23Classification knowledge
- Checks if a segment meets query requirements on both descriptive information and the result.
- Cast in the Bayesian learning framework.
- Set of triples (feature, NP, NN)
- Feature: word, integer, real, symbol, ..., date, time, email address, ..., contained in a segment
- NP: occurrences of the feature in positive samples
- NN: occurrences of the feature in negative samples
24Learning Classification knowledge
The queried segment is a positive sample. All other segments on the same page are negative samples.
The content of each segment is parsed into a set of features of either simple or complex types.
Count NP and NN cumulatively for each feature over all samples. Store all triples (feature, NP, NN) in the classification knowledge base.
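A small sketch of this counting step, under stated assumptions: the feature recognizers (PATTERNS) are illustrative regular expressions covering only a few of the listed types, and each feature is counted once per sample rather than once per occurrence. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed recognizers; the slides name the feature types but not how they are detected.
PATTERNS = [
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("real", re.compile(r"^\d{1,3}(,\d{3})*\.\d+$|^\d+\.\d+$")),
    ("integer", re.compile(r"^\d+$")),
]


def features(text):
    """Parse segment content into a set of typed features and plain words."""
    found = set()
    for token in text.split():
        token = token.strip(".,;:()")
        if not token:
            continue
        for name, pattern in PATTERNS:
            if pattern.match(token):
                found.add("<" + name + ">")
                break
        else:
            found.add(token.lower())
    return found


def learn_classification(page_segments, queried_index, knowledge=None):
    """Update {feature: [NP, NN]} from one page the user has marked."""
    knowledge = defaultdict(lambda: [0, 0]) if knowledge is None else knowledge
    for i, text in enumerate(page_segments):
        slot = 0 if i == queried_index else 1  # 0 -> NP (positive), 1 -> NN (negative)
        for feature in features(text):
            knowledge[feature][slot] += 1
    return knowledge


kb = learn_classification(
    ["Holiday Inn Golden Mile is your number one choice",
     "standard room 999.00 1,039.00 deluxe room 1,199.00"],
    queried_index=1,
)
```

Mapping prices to a single `<real>` feature is what lets a learned room-rate query match a page whose prices differ from the training page.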
25Query Processing Using Learned Knowledge
- After a Web page is retrieved, the segment graph is built
- For each segment and link, a score is computed by applying the navigation knowledge (ApplyNavigation).
- Segments/links are sorted by score
- If a link has the highest score, the system navigates through the link
- If a segment has the highest score, all segments on the page are checked to see if there is a queried segment
- The process is repeated until either a segment is found or the conclusion can be made that the site does not contain the queried information.
26Locating Algorithm
On each page, if the result is not found:
choose an unprocessed component with the highest score
if a link is chosen => navigate through the link
if a segment is chosen
27Locating Algorithm
On each page, if the result is not found:
choose an unprocessed component with the highest score
if a link is chosen
if a segment is chosen => (ApplyClassification)
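The locating loop can be sketched on a toy site as follows. This is not FACT's implementation: the data layout, the locate function, and the two scoring callbacks are hypothetical, and the stopping rule (give up once every reachable component has been tried) is an assumption, since the slides do not say how irrelevance is concluded.

```python
def locate(site, page, nav_score, confidence, threshold, visited=None):
    """Search a site for the queried segment, best-scored component first.

    site maps a page name to a list of components; a component is
    ("link", description, target_page) or ("segment", description, content).
    """
    visited = set() if visited is None else visited
    if page in visited or page not in site:
        return None
    visited.add(page)
    classified = False
    ranked = sorted(site[page], key=lambda c: nav_score(c[1]), reverse=True)
    for kind, description, payload in ranked:
        if kind == "link":
            # A link is chosen: navigate through it.
            result = locate(site, payload, nav_score, confidence, threshold, visited)
            if result is not None:
                return result
        elif not classified:
            # A segment is chosen: classify all segments on this page once.
            classified = True
            segments = [c for c in site[page] if c[0] == "segment"]
            best = max(segments, key=lambda c: confidence(c[2]))
            if confidence(best[2]) >= threshold:
                return best[2]
    return None  # assumed stopping rule: nothing left to try


# Hypothetical site and hand-made scoring functions.
site = {
    "home": [("segment", "welcome", "Located in the hub of Wanchai"),
             ("link", "dining", "dining"),
             ("link", "hotel rates", "rates")],
    "dining": [("segment", "restaurants", "Buffet dinner daily")],
    "rates": [("segment", "room rates", "standard room 999.00")],
}
weights = {"rates": 2.0, "room": 1.5}
nav_score = lambda d: max((weights.get(t, 0.0) for t in d.split()), default=0.0)
confidence = lambda text: 0.9 if "999.00" in text else 0.1

visited = set()
result = locate(site, "home", nav_score, confidence, 0.5, visited)
```

In this run the "hotel rates" link outranks everything else on the home page, so the rates page is reached and accepted without the dining page ever being retrieved.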
28Applying Learned Knowledge
- Application of Navigation Knowledge
- extracts terms in the description of a link/segment
- reads the weights of the terms and assigns a score to the link/segment by a certain function (currently max)
- sorts all links and segments by their scores
- Application of Classification Knowledge
- computes the confidence C to classify a segment as the queried result
- chooses the segment on a page with the largest C. If the largest C is over a threshold, returns the segment
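Both applications can be sketched in a few lines. The max-based navigation score follows the slide; the confidence formula is not given, so the sketch assumes a naive Bayes posterior with add-one smoothing over the (feature, NP, NN) counts. All names, counts, and the numbers of positive and negative samples are hypothetical.

```python
import math


def apply_navigation(description, nav_knowledge):
    """Score a link/segment as the max weight among its description's terms."""
    terms = description.lower().split()
    return max((nav_knowledge.get(t, 0.0) for t in terms), default=0.0)


def confidence(segment_features, cls_knowledge, n_pos, n_neg):
    """Assumed confidence C: P(positive | features) under naive Bayes.

    cls_knowledge maps feature -> (NP, NN); n_pos and n_neg are the numbers
    of positive and negative training samples (both must be > 0).
    """
    log_pos = math.log(n_pos / (n_pos + n_neg))
    log_neg = math.log(n_neg / (n_pos + n_neg))
    for feature in segment_features:
        np_count, nn_count = cls_knowledge.get(feature, (0, 0))
        # Add-one smoothing keeps unseen features from zeroing a class out.
        log_pos += math.log((np_count + 1) / (n_pos + 2))
        log_neg += math.log((nn_count + 1) / (n_neg + 2))
    return 1 / (1 + math.exp(log_neg - log_pos))


score = apply_navigation("Hotel Rates", {"rates": 2.0, "hotel": 0.5})
counts = {"<real>": (3, 0), "room": (3, 1), "dining": (0, 4)}
c_rates = confidence({"<real>", "room"}, counts, n_pos=3, n_neg=6)
c_dining = confidence({"dining"}, counts, n_pos=3, n_neg=6)
```

A segment would then be returned only if it has the largest C on its page and that C exceeds the chosen threshold.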
29forward
Hotel 1
3
Hotel 2
User browses it!
done
30User clicks here!
31Room information
User marks it!
32Generating Navigation Knowledge
- The navigation path looks like
- Hotel Reservation -> single hk double hk standard room deluxe room executive room
- By our weighting scheme, a weight is assigned to each term
33Generating Classification Knowledge
- Training Samples
- Occurrences of each feature are counted
Negative: Holiday Inn Golden Mile In the heart of Tsim Sha Tsui - Kowloon, Holiday Inn Golden Mile is your number one choice for accommodation, dining, meetings and banquets. Ideally situated in the heart of ...
Positive: single hk double hk standard room 999.00 1,039.00 deluxe room 1,199.00 1,239.00 executive room 1,399.00 1,499.00
34back
Fact starts here!
36Applying Navigation Knowledge
- The page contains
Paragraph: 57 - 73 Lockhart Road, Wanchai, Hong Kong, SAR, PRC
Paragraph: Located in the hub of Wanchai, the Wharney Hotel is within walking distance of the Hong Kong Arts Centre, Convention and Exhibition Centre, busy commercial complexes and shopping malls. ...
Paragraph: TEL (852) 2861-1000 FAX (852) 2865-6023
Links: Main Features Services Dining and Banqueting Hotel Rates Reservation ...
- Navigation knowledge shows
37Navigation Knowledge assigns scores
Fact chooses it!
38Navigation Knowledge assigns scores
39Classification Knowledge computes confidence
Apply Classification Knowledge to all Segments
40Fact finds it!
41Outline
- Background
- Learning Based Web Query Processing
- FACT A Prototype System
- Preliminary System Evaluation
- Conclusions
- Demonstration
42A Query Processing System
- A learning based query processing system
- User Interface: accepts user queries, presents query results, and provides a browser capable of capturing user actions
- Query Analyzer: analyzes and transforms user queries
- Session Controller: coordinates learning and locating
- Learner: generates knowledge from captured user actions
- Locator: applies knowledge and locates query results
- Retriever and Parser: retrieves pages and parses them into trees
- Knowledge Base: stores learned knowledge
43Reference Architecture
44A Query Session
45Training Strategies
- Sequential
- First n sites: user browses and system learns
- Next N - n sites: system processes
- Random
- Randomly choose n sites: user browses and system learns
- The system processes the rest
- Interleaved
- First n0 sites: user browses and system learns
- Next n - n0 sites: system makes decisions. For incorrect ones, user browses and system re-learns
- Next N - n sites: system processes
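The three strategies differ only in which sites end up as training samples. A rough sketch, assuming the N sites arrive as an ordered list; the function names and the system_is_correct callback (standing in for the user judging the system's decision) are hypothetical.

```python
import random


def sequential(sites, n):
    """Train on the first n sites, process the remaining N - n."""
    return sites[:n], sites[n:]


def random_split(sites, n, seed=0):
    """Train on n randomly chosen sites, process the rest (order preserved)."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(sites)), n))
    train = [s for i, s in enumerate(sites) if i in chosen]
    rest = [s for i, s in enumerate(sites) if i not in chosen]
    return train, rest


def interleaved(sites, n0, n, system_is_correct):
    """Train on the first n0 sites, then re-learn only on mistakes up to site n."""
    train = list(sites[:n0])
    for site in sites[n0:n]:
        if not system_is_correct(site):
            train.append(site)  # user browses, system re-learns
    return train, sites[n:]


sites = ["s%d" % i for i in range(10)]
seq_train, seq_rest = sequential(sites, 3)
rnd_train, rnd_rest = random_split(sites, 3)
int_train, int_rest = interleaved(sites, 2, 5, lambda s: s != "s3")
```

In the interleaved case the user is only asked to browse the sites the system got wrong, so the training set concentrates on examples the current knowledge does not yet cover.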
46Outline
- Background
- Learning Based Web Query Processing
- FACT A Prototype System
- Preliminary System Evaluation
- Conclusions
- Demonstration
47System Evaluation
- System Capabilities
- Performance
- Effectiveness: precision, recall, correctness
- Efficiency: within a site, how many pages the system visits to find a result or to recognize irrelevancy
- Training efficiency: how many training samples are needed
- Key Issues
- Effectiveness of the knowledge
- Effectiveness of training strategies
- Tests on a Range of Queries
48A System Output Sample
49System Capabilities
- The system returns segments of the Web pages
- The segments may not contain any input keyword but meet the requirement of room rates.
- The system learned the query requirement from the user!
- Segments can be from pages whose URLs are not directly returned by Yahoo!.
- The system learned how to follow the hyperlinks to the queried segment!
50System Evaluation - Effectiveness
- Given a set of URLs in a query session, the system makes N decisions
- N = N1 + N2 + N3 + N4
- Precision = N1 / (N1 + N3)
- Recall = N1 / (number of sites that contain results)
- Correctness = (N1 + N2) / N
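As a worked example of these three ratios: the definitions of N1 to N4 did not survive in this transcript, so the sketch assumes N1 counts correct results returned, N2 correct conclusions that a site has no result, and N3 incorrect results returned, which is what the formulas suggest. The counts below are made up for illustration and are not results from the evaluation.

```python
def effectiveness(n1, n2, n3, n4, sites_with_results):
    """Compute precision, recall and correctness from the decision counts."""
    n = n1 + n2 + n3 + n4  # total number of decisions
    return {
        "precision": n1 / (n1 + n3),
        "recall": n1 / sites_with_results,
        "correctness": (n1 + n2) / n,
    }


# Hypothetical session: 20 decisions over sites of which 10 contain a result.
scores = effectiveness(n1=8, n2=9, n3=2, n4=1, sites_with_results=10)
```

For these counts precision and recall are both 0.8 and correctness is 0.85, since 17 of the 20 decisions fall into N1 or N2.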
51System Evaluation - Efficiency
- How efficiently does the system find a queried segment in a site?
- Level of a Queried Segment: the length of the shortest path to find it
- Absolute Path Length = number of visited pages
- Relative Path Length = number of visited pages / Level of the Queried Segment
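A sketch of how these measures could be computed, assuming the level is counted in pages on the shortest path (so a segment on the entry page has level 1, which keeps the ratio defined). The level and relative_path_length functions and the toy link structure are hypothetical.

```python
from collections import deque


def level(links, start, target_page):
    """Breadth-first search for the number of pages on the shortest path."""
    queue, seen = deque([(start, 1)]), {start}
    while queue:
        page, depth = queue.popleft()
        if page == target_page:
            return depth
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None  # target not reachable from the start page


def relative_path_length(visited_pages, segment_level):
    """Visited pages divided by the level of the queried segment."""
    return visited_pages / segment_level


links = {"home": ["dining", "rates"], "dining": [], "rates": ["suite"]}
suite_level = level(links, "home", "suite")
ratio = relative_path_length(4, suite_level)
```

A ratio of 1 means the system walked straight down the shortest path; here visiting 4 pages for a level-3 segment means one detour.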
52Basic Performance
Q11: Hong Kong Hotel Room Rate
Q12: Hong Kong Hotel
Sequential training
53Effectiveness of Knowledge
- Two other systems implemented for comparison
- Classification Knowledge Only: treats links and segments the same, using the Bayes classifier
- Learning
Action            Positive       Negative
click a link      the link       other links on the page
mark a segment    the segment    other segments on the page
- Locating
Classify all segments and links. If a link has the highest confidence, follow the link. If a segment has the highest confidence and passes the threshold, return it.
54Effectiveness of Knowledge
- Navigation Knowledge Only: only checks the descriptive information of links and segments
- Learning
Navigational path -> Navigation Knowledge
- Locating
Assign scores to all links and segments using navigation knowledge. If a link has the highest score, follow the link. If a segment has the highest score, return it.
55Effectiveness of Knowledge
56Effects of Training Strategies
Query: Q12; Training size: 3-10
57Effects of Training Strategies
- Random training performs badly, with low recall
- As the training size increases, interleaved training outperforms sequential training
- Best accuracy reaches or exceeds 90% in all metrics when the interleaved training strategy is used
- Enlarging the training size for random and sequential training is not effective
58Improved Performance
Interleaved training
59A Range of Queries
- Hotel room rates: targets prices, which are easy to identify
- Admission requirements for graduate students: include items such as degree, GPA, GRE, etc. that are not easy to specify in keywords but easy to show by marking
- Data Mining Researcher: a concept, subjective; evidence includes research interests, projects, professional activity, etc.
60Results of A Range of Queries
Interleaved training
61Performance for the Queries
- Effectiveness
- first 4 queries: accuracy is 80% to above 90%
- the last query: still capable of filtering out irrelevant sites
- Efficiency
- relative path length to locate a queried segment is close to 1
- absolute path length to conclude irrelevancy is no more than 2.5 pages.
- The performance is not affected much by how precise the keyword query is. The system learns query requirements
62Outline
- Background
- Learning Based Web Query Processing
- FACT A Prototype System
- Preliminary System Evaluation
- Conclusions
- Demonstration
63Conclusions
- Proposed and implemented learning based Web query processing with the following features
- Returning succinct results: segments of pages
- No a priori knowledge or preprocessing; suited for ad hoc queries
- Exploiting page formatting and linkage information simultaneously.
- The preliminary results are promising
64Future Work
- Better segmentation for HTML documents
- Better knowledge, the key factor that affects system performance
- other weighting schemes for navigation knowledge
- other implementations of classification knowledge
- More system evaluation
- Dynamic web pages
65Outline
- Background
- Learning Based Web Query Processing
- FACT A Prototype System
- Preliminary System Evaluation
- Conclusions
- Demonstration
66Demonstration