On the Automatic Extraction of Data from the Hidden Web presentation

Transcript and Presenter's Notes

Title: On the Automatic Extraction of Data from the Hidden Web

1
On the Automatic Extraction of Data from the
Hidden Web

Stephen W. Liddle, Sai Ho Yau, David W. Embley
Brigham Young University

2
The Hidden Web

Many Web documents are hidden in some form
Requires user/password authentication
Firewall restricts access
Search engines simply miss these pages
Proprietary document format
A common cause of hidden documents
Page is dynamically generated from a query
specified through an HTML form
Solution
Automatically fill in forms to retrieve records
from underlying databases

3
Reasons to Crawl the Hidden Web

Why fill in forms automatically?
Automated agents (bots)
Site wrappers for higher-level queries
Multi-site information extraction and integration

4
A Reference Model of Info Search Task

Formulate query or task description
Find sources that pertain to the task
For each potentially useful source
Fill in the source's search form
Analyze the results
Gather any useful information supporting the task
Refine the query criteria and repeat if necessary

5
Issues in Automatic Form Filling

Wide variety of controls in forms
Text fields, radio buttons, check boxes, lists,
push buttons, hidden fields, MIME encoded
attachments, etc.
CGI request is fundamentally a list of name/value
pairs
F = <U, (N1, V1), (N2, V2), ..., (Nn, Vn)>
But there are other complications
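
A minimal Python sketch of how such a list of name/value pairs might be encoded for submission. The URL and field names are hypothetical, and encode_form_request is an illustrative helper, not part of the presented system.

```python
from urllib.parse import urlencode


def encode_form_request(base_url, pairs, method="GET"):
    """Return (url, body) for a form request given as name/value pairs."""
    # urlencode percent-encodes each name and value and joins them with '&'.
    query = urlencode(pairs)
    if method.upper() == "GET":
        # GET carries the pairs in the URL's query string.
        separator = "&" if "?" in base_url else "?"
        return base_url + separator + query, None
    # POST carries the same encoded pairs in the request body.
    return base_url, query.encode("ascii")


# Hypothetical base URL U and pairs (N1, V1), (N2, V2).
url, body = encode_form_request(
    "http://example.com/search",
    [("make", "Ford"), ("max price", "35000")],
)
```

The same pairs serve both methods; only where they travel (query string or body) changes.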

6
Difficulties in Automatic Form Filling

HTTP GET vs. POST
One form leads to another, specialized form
Logical request is physically divided into
sub-steps
State information captured on the server
Session structure required to enforce sequence of
interactions
Cookies
Hidden fields
Values encoded into the base URL
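
One way a client might carry server-side state across sub-steps, sketched with the Python standard library. The cookie jar handles cookies, and hidden fields from the previous page are merged into the next request. The URL, field names, and both helper functions are assumptions made for illustration.

```python
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urlencode


def make_session():
    """Build an opener that stores and resends cookies across requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def next_step_request(action_url, hidden_fields, user_fields):
    """Build the POST for the next sub-step, echoing back hidden fields."""
    # Hidden fields come first; a user-supplied value with the same name wins.
    data = urlencode({**hidden_fields, **user_fields}).encode("ascii")
    return urllib.request.Request(action_url, data=data, method="POST")


opener, jar = make_session()
request = next_step_request(
    "http://example.com/step2",      # hypothetical second-step URL
    {"sid": "abc"},                  # hidden field found on the first page
    {"make": "Ford"},                # value chosen for the second form
)
# opener.open(request) would send it; no network call is made here.
```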

7
More Difficulties

Some fields may be required
Rely on user to supply required text values
Semantic constraints known to users
When searching for cars by location, "within 500
kilometers" is more inclusive than "within 50
kilometers"
When searching by price, 35,000 to 75,000 is
less inclusive than 0 to 35,000
Some combinations don't make sense
4-door motorcycles

8
Scripts

Some forms rely on scripts to transform fields
and then submit the form
Range checking, other field validation
Automatic calculation of certain fields
Understanding arbitrary scripts is
computationally hard
Can watch what gets submitted when a user
interacts with a form
But in general can't predict what a script will
do, or even guarantee that the script will halt

9
Our Approach

Within context of ontology-based data extraction
system
Attempt to retrieve all data behind a particular
form
Not directed search supporting a specific query

10
Filling in the Form

Parsing an HTML form and encoding a particular
request is straightforward
Fill in a form by choosing a value for each field
We could attempt to fill in the form in all
possible ways
Text fields are practically, if not literally,
unbounded in possibilities
Aside from text fields, the process may be too
time consuming
50 choices in one list, 25 in another: 1,250 HTTP
transactions
We likely would have retrieved all data before
exhausting all possible combinations
Indeed, some choices in lists represent "any"
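
The arithmetic on this slide can be checked with a short Python sketch that counts and enumerates the combinations of non-text fields. The field names and values are made up; only the list sizes (50 and 25) come from the slide.

```python
from itertools import product
from math import prod


def count_combinations(fields):
    """Number of distinct ways to fill in the enumerable fields."""
    return prod(len(values) for values in fields.values())


def all_combinations(fields):
    """Yield every combination as a dict of field name to chosen value."""
    names = list(fields)
    for values in product(*(fields[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical form with two selection lists.
fields = {
    "make": [f"make{i}" for i in range(50)],
    "state": [f"state{i}" for i in range(25)],
}
total = count_combinations(fields)
```

Each additional list multiplies the total, which is why exhaustive submission quickly becomes expensive.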

11
Query Submission Plan

Issue default query
Sample a small number of non-default queries
If the sample set yields no new records, assume
we have retrieved all data
Otherwise proceed to exhaustive phase
Try all combinations
But get user's permission first
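
The plan can be read as the following control flow. This is a sketch under stated assumptions: run_query and ask_user are hypothetical callables supplied by the caller, records are plain strings, and error handling is left out.

```python
def retrieve_all(default_query, sample, remaining, run_query, ask_user):
    """Default query, then sampling, then (if needed) the exhaustive phase."""
    seen = set(run_query(default_query))

    # Sampling phase: count records the default query did not return.
    new_in_sample = 0
    for query in sample:
        records = set(run_query(query))
        new_in_sample += len(records - seen)
        seen |= records

    # No new records in the sample: assume everything has been retrieved.
    if new_in_sample == 0:
        return seen

    # Exhaustive phase runs only with the user's permission.
    if not ask_user(len(remaining)):
        return seen
    for query in remaining:
        seen |= set(run_query(query))
    return seen
```

Here remaining is expected to hold only the combinations not already sampled, so no query is issued twice.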

12
Using Default Values

Assign default values to each field
The form always supplies a default
Our system does allow user to provide specific
choices for text fields
Otherwise these retain their default value
(usually the empty string)
Encode and submit default request to see what
happens
This is like the user submitting the form without
making any changes
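
A minimal sketch of collecting a form's default values with Python's standard HTML parser. It handles only text and hidden inputs, checked radio buttons and check boxes, and single-selection lists whose options carry a value attribute. Those limits, and the sample form, are assumptions of this example rather than a description of the presented system.

```python
from html.parser import HTMLParser


class DefaultValueParser(HTMLParser):
    """Collect the value each field would submit if the user changed nothing."""

    def __init__(self):
        super().__init__()
        self.defaults = {}
        self._select = None        # name of the <select> currently open
        self._first_option = None  # fallback when no option is marked selected
        self._chosen = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in ("radio", "checkbox"):
                if "checked" in a:
                    self.defaults[a["name"]] = a.get("value") or "on"
            elif kind in ("text", "hidden"):
                self.defaults[a["name"]] = a.get("value") or ""
        elif tag == "select":
            self._select = a.get("name")
            self._first_option, self._chosen = None, False
        elif tag == "option" and self._select:
            value = a.get("value") or ""
            if self._first_option is None:
                self._first_option = value
            if "selected" in a and not self._chosen:
                self.defaults[self._select] = value
                self._chosen = True

    def handle_endtag(self, tag):
        if tag == "select" and self._select:
            # Browsers submit the first option when none is marked selected.
            if not self._chosen and self._first_option is not None:
                self.defaults[self._select] = self._first_option
            self._select = None


def default_values(html):
    parser = DefaultValueParser()
    parser.feed(html)
    parser.close()
    return parser.defaults


SAMPLE_FORM = """
<form action="/search">
  <input type="text" name="keyword">
  <select name="make"><option value="">Any</option><option value="ford">Ford</option></select>
  <select name="doors"><option value="2">2</option><option value="4" selected>4</option></select>
  <input type="radio" name="cond" value="new" checked>
  <input type="radio" name="cond" value="used">
</form>
"""
defaults = default_values(SAMPLE_FORM)
```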

13
Result of Default Query

Often the default query is set to return all
records
Sometimes the default query gives an error
Required fields
Sometimes text field must be given
Or a non-default selection is required in a list
or radio-button group
Time-out because default request is too large
Designers obviously expected the user to narrow
the search

14
Sampling Phase

Choose a random stratified sample of combinations
For each combination
Issue query
Validate result
Filter duplicate records
Store any new records found

15
Sampling Approach

Random sample might ignore some fields and
overemphasize others

16
Sampling Approach

Regular stratified sample is biased

17
Sampling Approach

Random stratified sample seems reasonable
If N is the total number of combinations, our sample
size should be ⌈log2 N⌉
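
One possible reading of a random stratified sample, sketched in Python: number the N combinations, cut that range into equal-sized strata, and pick one combination at random from each. Rounding log2 N up, the contiguous strata, and all function names are assumptions of this sketch. The slide itself gives only the sample size.

```python
import math
import random


def sample_size(n):
    """About log2(N) samples, rounded up, and at least one."""
    return max(1, math.ceil(math.log2(n)))


def stratified_indices(n, k, rng):
    """Pick one random index from each of k contiguous strata of range(n)."""
    # Stratum i covers indices [i*n//k, (i+1)*n//k); it is non-empty when k <= n.
    return [rng.randrange(i * n // k, (i + 1) * n // k) for i in range(k)]


def index_to_combination(index, fields):
    """Decode an index into one value per field (mixed-radix digits)."""
    combo = {}
    for name in reversed(list(fields)):
        values = fields[name]
        index, position = divmod(index, len(values))
        combo[name] = values[position]
    return combo


n = 50 * 25                       # the 1,250 combinations from the earlier slide
k = sample_size(n)
indices = stratified_indices(n, k, random.Random(0))
```

Spreading one pick across each stratum keeps the sample from clustering in one region of the combination space, while the random choice within a stratum avoids the fixed pattern of a regular sample.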

18
Exhaustive Phase

For each combination
Issue query
Validate result
Remove duplicates
Store any new records found
Don't repeat combinations that were already
sampled

19
User Input

First we get permission from our user
Estimate maximum required space
And time
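
The slide does not say how the estimates are computed. One simple upper bound, sketched below, multiplies the number of remaining combinations by the average time and page size observed during sampling. The formula, the function name, and the figures in the example are all assumptions.

```python
from datetime import timedelta


def estimate_exhaustive_cost(total, sampled, avg_seconds, avg_bytes):
    """Upper-bound time and space for the combinations not yet tried."""
    remaining = total - sampled
    return timedelta(seconds=remaining * avg_seconds), remaining * avg_bytes


# Hypothetical figures: 1,250 combinations, 11 already sampled,
# 10 seconds and 20,000 bytes per query on average.
time_needed, bytes_needed = estimate_exhaustive_cost(1250, 11, 10.0, 20_000)
```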

20
Validating Results

Possible results
HTTP error
Page contains no records
Determined based on size of unique portion of the
page
Page contains links to more result records
E.g., "displaying 1 to 10 of 47"
Need to follow "next" links to get complete
results
Page contains all records
No "next" links found

21
Retrieving More Results

Presence of "next" or "more" in a hyperlink or
button often signals a link to more results
Often a numeric sequence signals more results
1 2 3 4
10 20 30
We follow these links, assemble all the results,
and consider this a single query
But multiple HTTP requests
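
A sketch of the two heuristics in Python, limited to hyperlinks (buttons are not handled). Treating numeric link texts as pagination only when they form an evenly spaced increasing sequence is this example's interpretation of "numeric sequence"; the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

NEXT_WORDS = re.compile(r"\b(next|more)\b", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def is_arithmetic(numbers):
    """True for sequences such as 1 2 3 4 or 10 20 30."""
    if len(numbers) < 2:
        return False
    step = numbers[1] - numbers[0]
    return step > 0 and all(b - a == step for a, b in zip(numbers, numbers[1:]))


def find_result_links(html):
    """Return hrefs that probably lead to further pages of results."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    hrefs = [href for href, text in collector.links if NEXT_WORDS.search(text)]
    numeric = [(int(text), href) for href, text in collector.links
               if re.fullmatch(r"[0-9]+", text)]
    if is_arithmetic(sorted({number for number, _ in numeric})):
        hrefs += [href for _, href in numeric]
    return list(dict.fromkeys(hrefs))  # drop repeats, keep first-seen order


PAGE = ('<a href="/r?p=2">2</a> <a href="/r?p=3">3</a> '
        '<a href="/r?p=2">Next 10</a> <a href="/about">About us</a>')
more = find_result_links(PAGE)
```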

22
Filtering Duplicates

Compare records and discard duplicates
Based on string comparison
Compute hash value for each candidate record
string
Identical hash values indicate duplicate records

23
Filtering Duplicates

Separate records heuristically
HTML tags that constitute likely record
separators mark boundaries
<HR>, <P>, </TR>, ...
Strip non-boundary tags
Sometimes there are minor variations in tags or
their attributes that interfere with duplicate
detection
Now calculate hash values and remove any
duplicate strings
If ratio of unique strings to total document size
is < 5%, we assume no new records are present
There is noise in page headers, footers,
advertisements, etc.
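
The steps on these two slides might be sketched in Python as follows. The separator list covers only the three tags named on the slide, SHA-1 is an arbitrary choice of hash, and sizes are measured in characters; none of these details is confirmed by the slides.

```python
import hashlib
import re

# Tags treated as likely record separators: <HR>, <P>, </TR>.
SEPARATORS = re.compile(r"<\s*(?:hr|p|/tr)\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def candidate_records(html):
    """Split at separator tags, strip remaining tags, normalize whitespace."""
    records = []
    for chunk in SEPARATORS.split(html):
        text = " ".join(ANY_TAG.sub(" ", chunk).split())
        if text:
            records.append(text)
    return records


def new_records(html, seen_hashes):
    """Return records whose hash is not yet in seen_hashes, and remember them."""
    fresh = []
    for record in candidate_records(html):
        digest = hashlib.sha1(record.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            fresh.append(record)
    return fresh


def has_new_records(html, fresh, threshold=0.05):
    """Apply the 5% rule: unique text must be a large enough share of the page."""
    unique_size = sum(len(record) for record in fresh)
    return len(html) > 0 and unique_size / len(html) >= threshold


PAGE = ("<table><tr><td>Ford</td><td>1999</td></tr>"
        "<tr><td class='x'>Honda</td><td>2001</td></tr></table>")
seen = set()
first_pass = new_records(PAGE, seen)
second_pass = new_records(PAGE, seen)
```

Stripping the non-boundary tags before hashing is what lets the second row match a later copy even if its class attribute changes.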

24
Experimental Results

Roughly 80% of forms in our test set were
automatically processed correctly
Sources of failure
Missing required fields (user must supply)
No records from default and sample queries
Invalid URL (Web site error)
For 1/3 of forms, the default query returned all
records

25
Experimental Results

Processing a single HTTP request took between 2
and 25 seconds on average
A single query (including following links) took
between 5 seconds and 14 minutes
The number of "next" links ranged from none to
more than 140
Sampling took from 30 seconds to 3 hours per form
In all cases, manual verification corroborated
what the system reported

26
Time Saved

When the sampling phase successfully returned all
records, considerable time was saved compared to
exhaustive query
15 minutes
Almost 3 hours
> 4 days
> 40 days

27
Future Work

Conduct more experiments
To further validate our initial results
To learn how to improve
Better metrics
Integrate this tool into our ontology-based data
extraction framework
Upstream automatic selection of
domain-appropriate forms
Downstream automatic record-boundary detection
and extraction

28
Intent of Form

Is the purpose of the form transactional or
informational?
Transactional
Purchase a DVD
Transfer money between accounts
Update customer information
Request contact from a sales representative
Goal of transactional form is to interact with a
business partner to support a business process of
some kind

29
Transactional vs. Informational

Informational form
Issues a query
Find documents or records matching given criteria
Goal of informational form is to retrieve data,
not execute a business process
We're typically interested only in the
informational forms
But eventually agents will need to handle
transactional forms also

30
Conclusion

We have presented the prototype of a synergistic
tool that
Automatically retrieves data behind HTML forms
Including following links to retrieve multiple
pages of results associated with a single query
Is domain-independent
Can easily integrate with our ontology-based
source discovery and data extraction tools
The world is ready for tools that understand and
access the Hidden Web
