Title: On the Automatic Extraction of Data from the Hidden Web


1
On the Automatic Extraction of Data from the
Hidden Web
  • Stephen W. Liddle, Sai Ho Yau, David W. Embley
  • Brigham Young University

2
The Hidden Web
  • Many Web documents are hidden in some form
  • Requires user/password authentication
  • Firewall restricts access
  • Search engines simply miss these pages
  • Proprietary document format
  • A common cause of hidden documents
  • Page is dynamically generated from a query
    specified through an HTML form
  • Solution
  • Automatically fill in forms to retrieve records
    from underlying databases

3
Reasons to Crawl the Hidden Web
  • Why fill in forms automatically?
  • Automated agents (bots)
  • Site wrappers for higher-level queries
  • Multi-site information extraction and integration

4
A Reference Model of Info Search Task
  • Formulate query or task description
  • Find sources that pertain to the task
  • For each potentially useful source
  • Fill in the source's search form
  • Analyze the results
  • Gather any useful information supporting the task
  • Refine the query criteria and repeat if necessary

5
Issues in Automatic Form Filling
  • Wide variety of controls in forms
  • Text fields, radio buttons, check boxes, lists,
    push buttons, hidden fields, MIME encoded
    attachments, etc.
  • CGI request is fundamentally a list of name/value
    pairs
  • F = ⟨U, (N1, V1), (N2, V2), ..., (Nn, Vn)⟩
  • But there are other complications
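
As a rough illustration of the name/value view of a request, the sketch below encodes a base URL U and a list of (N, V) pairs as a GET query string. The URL and field names are hypothetical and do not come from the talk.

```python
from urllib.parse import urlencode

def encode_get_request(base_url, pairs):
    """Append the (name, value) pairs to the base URL as a query string."""
    return base_url + "?" + urlencode(pairs)

# Hypothetical form F = <U, (N1, V1), (N2, V2)>; none of these names are from the talk.
base_url = "http://example.com/search"
pairs = [("make", "Ford"), ("max_price", "35000")]
request_url = encode_get_request(base_url, pairs)
```

`urlencode` also percent-escapes characters that are not safe in a URL, so arbitrary text values can be passed through unchanged.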

6
Difficulties in Automatic Form Filling
  • HTTP GET vs. POST
  • One form leads to another, specialized form
  • Logical request is physically divided into
    sub-steps
  • State information captured on the server
  • Session structure required to enforce sequence of
    interactions
  • Cookies
  • Hidden fields
  • Values encoded into the base URL
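
The GET/POST distinction and the state carried by cookies and hidden fields can be sketched with the Python standard library. This is only an illustration under assumed names: `build_request` is a hypothetical helper, and hidden fields are treated as ordinary name/value pairs.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_request(method, url, pairs, cookie=None):
    """Build (but do not send) a form submission as an HTTP request object."""
    body = urlencode(pairs)
    if method.upper() == "GET":
        # GET: the pairs travel in the URL itself.
        req = Request(url + "?" + body, method="GET")
    else:
        # POST: the pairs travel in the request body.
        req = Request(url, data=body.encode("ascii"), method="POST")
        req.add_header("Content-Type", "application/x-www-form-urlencoded")
    if cookie:
        # Session state the server handed out earlier is sent back as-is.
        req.add_header("Cookie", cookie)
    return req
```

Passing the returned object to `urllib.request.urlopen` would issue the request; that step is left out here so the sketch runs without network access.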

7
More Difficulties
  • Some fields may be required
  • Rely on user to supply required text values
  • Semantic constraints known to users
  • When searching for cars by location, within 500
    kilometers is more inclusive than within 50
    kilometers
  • When searching by price, 35,000 to 75,000 is
    less inclusive than 0 to 35,000
  • Some combinations don't make sense
  • 4-door motorcycles

8
Scripts
  • Some forms rely on scripts to transform fields
    and then submit the form
  • Range checking, other field validation
  • Automatic calculation of certain fields
  • Understanding arbitrary scripts is
    computationally hard
  • Can watch what gets submitted when a user
    interacts with a form
  • But in general we can't predict what a script will
    do, or even guarantee that the script will halt

9
Our Approach
  • Within context of ontology-based data extraction
    system
  • Attempt to retrieve all data behind a particular
    form
  • Not directed search supporting a specific query

10
Filling in the Form
  • Parsing an HTML form and encoding a particular
    request is straightforward
  • Fill in a form by choosing a value for each field
  • We could attempt to fill in the form in all
    possible ways
  • Text fields are practically, if not literally,
    unbounded in possibilities
  • Aside from text fields, the process may be too
    time consuming
  • 50 choices in one list and 25 in another means
    50 × 25 = 1,250 HTTP transactions
  • We likely would have retrieved all data before
    exhausting all possible combinations
  • Indeed, some choices in lists represent "any"
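
The combinatorial growth can be made concrete with a small sketch. The two lists below are placeholders with the same sizes as the example on this slide; the field names are invented, and text fields are assumed to stay at their defaults.

```python
import itertools
import math

# Hypothetical form with two selection lists of 50 and 25 choices.
choices = {
    "make": [f"make{i}" for i in range(50)],
    "year": [f"year{i}" for i in range(25)],
}

def count_combinations(choices):
    """Number of distinct ways to fill in the non-text fields."""
    return math.prod(len(values) for values in choices.values())

def all_combinations(choices):
    """Lazily yield every field assignment as a dict."""
    names = list(choices)
    for values in itertools.product(*(choices[n] for n in names)):
        yield dict(zip(names, values))

total = count_combinations(choices)
first = next(all_combinations(choices))
```

Generating the combinations lazily keeps memory use flat even when the total count is large.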

11
Query Submission Plan
  • Issue default query
  • Sample a small number of non-default queries
  • If the sample set yields no new records, assume
    we have retrieved all data
  • Otherwise proceed to exhaustive phase
  • Try all combinations
  • But get the user's permission first
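
The plan on this slide can be read as the control flow below. It is a sketch, not the authors' implementation: `run_query` and `ask_user` are hypothetical callbacks supplied by the caller, and records are assumed to be hashable values such as strings.

```python
def retrieve_all(default_query, sample_queries, remaining_queries, run_query, ask_user):
    """Default query, then a sample, then (with permission) everything else."""
    records = set(run_query(default_query))

    found_new = False
    for query in sample_queries:
        new = set(run_query(query)) - records
        if new:
            found_new = True
            records |= new

    # No new records from the sample: assume the data is complete.
    if not found_new:
        return records

    # The exhaustive phase is expensive, so it needs the user's consent.
    if not ask_user():
        return records

    for query in remaining_queries:
        records |= set(run_query(query))
    return records
```

Keeping the network access behind `run_query` makes the control flow easy to test against a fake data source.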

12
Using Default Values
  • Assign default values to each field
  • The form always supplies a default
  • Our system does allow the user to provide specific
    choices for text fields
  • Otherwise these retain their default value
    (usually the empty string)
  • Encode and submit default request to see what
    happens
  • This is like the user submitting the form without
    making any changes
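
One way to collect the defaults a form supplies is to parse its controls, as in the sketch below. It uses Python's standard `html.parser` and covers only a few control types; the class name and the simplifications (for example, ignoring option text when no `value` attribute is present) are assumptions made for illustration.

```python
from html.parser import HTMLParser

class DefaultValueParser(HTMLParser):
    """Collect the value each form field would have if submitted unchanged."""

    def __init__(self):
        super().__init__()
        self.defaults = {}
        self._select = None            # name of the <select> being read
        self._first_option_seen = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and "name" in a:
            kind = (a.get("type") or "text").lower()
            if kind in ("text", "hidden"):
                # Text fields usually default to the empty string.
                self.defaults[a["name"]] = a.get("value") or ""
            elif kind in ("radio", "checkbox") and "checked" in a:
                self.defaults[a["name"]] = a.get("value") or "on"
        elif tag == "select" and "name" in a:
            self._select = a["name"]
            self._first_option_seen = False
        elif tag == "option" and self._select is not None:
            # The first option is the default unless another is marked selected.
            if "selected" in a or not self._first_option_seen:
                self.defaults[self._select] = a.get("value") or ""
            self._first_option_seen = True

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None

def default_values(html):
    parser = DefaultValueParser()
    parser.feed(html)
    return parser.defaults
```

The resulting dictionary is the set of name/value pairs to encode for the default request.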

13
Result of Default Query
  • Often the default query is set to return all
    records
  • Sometimes the default query gives an error
  • Required fields
  • Sometimes text field must be given
  • Or a non-default selection is required in a list
    or radio-button group
  • Time-out because default request is too large
  • Designers obviously expected the user to narrow
    the search

14
Sampling Phase
  • Choose a random stratified sample of combinations
  • For each combination
  • Issue query
  • Validate result
  • Filter duplicate records
  • Store any new records found

15
Sampling Approach
  • Random sample might ignore some fields and
    overemphasize others

16
Sampling Approach
  • Regular stratified sample is biased

17
Sampling Approach
  • Random stratified sample seems reasonable
  • If N is the total number of combinations, our
    sample size should be ⌈log2 N⌉
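
The sketch below shows one possible reading of a random stratified sample with a logarithmic sample size, taking the size as the ceiling of log2 N. The talk does not spell out the procedure, so the balancing scheme here (shuffling each field's values and cycling through them so that every value is used about equally often) is an assumption, as are the function names.

```python
import math
import random

def sample_size(total_combinations):
    """Ceiling of log2(N), where N is the number of combinations."""
    return math.ceil(math.log2(total_combinations))

def random_stratified_sample(choices, rng):
    """Pick sample_size(N) field assignments, balancing values per field."""
    total = math.prod(len(values) for values in choices.values())
    k = sample_size(total)
    columns = {}
    for name, values in choices.items():
        # Concatenate shuffled copies of the value list until k are available,
        # so no value of this field is used much more often than another.
        pool = []
        while len(pool) < k:
            block = list(values)
            rng.shuffle(block)
            pool.extend(block)
        columns[name] = pool[:k]
    return [{name: columns[name][i] for name in choices} for i in range(k)]
```

With lists of 50 and 25 choices, N is 1,250 and the sample contains 11 queries. The same combination can in principle be drawn twice when fields have few values, so a caller may want to drop repeats.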

18
Exhaustive Phase
  • For each combination
  • Issue query
  • Validate result
  • Remove duplicates
  • Store any new records found
  • Don't repeat combinations that were already
    sampled

19
User Input
  • First we get permission from our user
  • Estimate maximum required space
  • And time
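
A simple upper-bound estimate could multiply the number of remaining queries by the worst per-query cost seen so far. The talk does not describe the estimator, so the formula and the parameter names below are assumptions.

```python
def estimate_exhaustive_cost(total_combinations, already_sampled,
                             seconds_per_query, bytes_per_query):
    """Return (seconds, bytes) needed if every remaining query hits the given cost."""
    remaining = total_combinations - already_sampled
    return remaining * seconds_per_query, remaining * bytes_per_query

# Hypothetical figures: 1,250 combinations, 11 already sampled,
# at most 20 seconds and 50,000 bytes per query observed so far.
seconds, size = estimate_exhaustive_cost(1250, 11, 20, 50_000)
```

Using the slowest and largest responses from the sampling phase as the per-query figures yields a conservative estimate to show the user.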

20
Validating Results
  • Possible results
  • HTTP error
  • Page contains no records
  • Determined based on size of unique portion of the
    page
  • Page contains links to more result records
  • E.g., "displaying 1 to 10 of 47"
  • Need to follow "next" links to get complete
    results
  • Page contains all records
  • No "next" links found
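
The four outcomes listed on this slide can be expressed as a small classifier. This is a sketch: the inputs are assumed to have been computed elsewhere, and the 0.05 default threshold is borrowed from the duplicate-filtering slide later in the talk.

```python
from enum import Enum

class ResultKind(Enum):
    HTTP_ERROR = "http error"
    NO_RECORDS = "no records"
    MORE_RESULTS = "more results"
    ALL_RECORDS = "all records"

def classify_result(status, unique_ratio, has_next_link, threshold=0.05):
    """Map a fetched page onto one of the four possible results.

    status        -- HTTP status code of the response
    unique_ratio  -- size of the unique portion of the page / total page size
    has_next_link -- whether a link to further results was detected
    """
    if status >= 400:
        return ResultKind.HTTP_ERROR
    if unique_ratio < threshold:
        return ResultKind.NO_RECORDS
    if has_next_link:
        return ResultKind.MORE_RESULTS
    return ResultKind.ALL_RECORDS
```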

21
Retrieving More Results
  • Presence of "next" or "more" in a hyperlink or
    button often signals a link to more results
  • Often a numeric sequence signals more results
  • 1 2 3 4
  • 10 20 30
  • We follow these links, assemble all the results,
    and consider this a single query
  • But multiple HTTP requests
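
Both heuristics (link text containing "next" or "more", and link texts forming a numeric sequence) can be sketched with the standard HTML parser. The class and function names are hypothetical, buttons are not handled, and requiring a constant step between the numbers is an assumption about what counts as a sequence.

```python
import re
from html.parser import HTMLParser

NEXT_WORDS = re.compile(r"\b(next|more)\b", re.IGNORECASE)

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def find_more_result_links(html):
    """Return hrefs that probably lead to further pages of results."""
    parser = LinkCollector()
    parser.feed(html)
    word_links = [href for href, text in parser.links if NEXT_WORDS.search(text)]
    numeric = sorted((int(text), href) for href, text in parser.links
                     if re.fullmatch(r"[0-9]+", text))
    numbers = [n for n, _ in numeric]
    steps = {b - a for a, b in zip(numbers, numbers[1:])}
    # A run such as 1 2 3 4 or 10 20 30: at least two numbers, one constant step.
    is_sequence = len(numbers) >= 2 and len(steps) == 1 and min(steps) > 0
    numeric_links = [href for _, href in numeric] if is_sequence else []
    return word_links + [h for h in numeric_links if h not in word_links]
```

Following each returned link and concatenating the pages gives the combined result of the single logical query.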

22
Filtering Duplicates
  • Compare records and discard duplicates
  • Based on string comparison
  • Compute hash value for each candidate record
    string
  • Identical hash values indicate duplicate records
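
Hash-based duplicate filtering might look like the sketch below. The talk does not name a hash function; SHA-1 from Python's `hashlib` is used purely as a stand-in, and the function names are invented.

```python
import hashlib

def record_hash(record):
    """Hash value of a candidate record string."""
    return hashlib.sha1(record.encode("utf-8")).hexdigest()

def filter_duplicates(candidates, seen_hashes):
    """Return records not seen before, updating seen_hashes in place."""
    new_records = []
    for record in candidates:
        digest = record_hash(record)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_records.append(record)
    return new_records
```

Keeping only the hash values means the set of seen records stays small even when the records themselves are long.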

23
Filtering Duplicates
  • Separate records heuristically
  • HTML tags that constitute likely record
    separators mark boundaries
  • <HR>, <P>, </TR>, ...
  • Strip non-boundary tags
  • Sometimes there are minor variations in tags or
    their attributes that interfere with duplicate
    detection
  • Now calculate hash values and remove any
    duplicate strings
  • If the ratio of unique strings to total document
    size is < 5%, we assume no new records are present
  • There is noise in page headers, footers,
    advertisements, etc.
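
The separation and ratio steps can be sketched with regular expressions. This is a simplification under stated assumptions: only the three separator tags named on this slide are used, the ratio is measured in characters, and a Python set of record strings stands in for the stored hash values.

```python
import re

# Tags treated as likely record boundaries: <HR>, <P>, </TR>.
SEPARATORS = re.compile(r"<hr\b[^>]*>|<p\b[^>]*>|</tr\s*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")

def candidate_records(html):
    """Split at separator tags, strip all other tags, normalize whitespace."""
    pieces = SEPARATORS.split(html)
    stripped = (" ".join(ANY_TAG.sub(" ", piece).split()) for piece in pieces)
    return [s for s in stripped if s]

def unique_ratio(html, seen):
    """Return (ratio, new_records), where ratio is new text length / page length."""
    new = [r for r in dict.fromkeys(candidate_records(html)) if r not in seen]
    if not html:
        return 0.0, new
    return sum(len(r) for r in new) / len(html), new
```

A page whose ratio falls below 0.05 would then be treated as containing no new records, with the leftover unique text attributed to headers, footers, and advertisements.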

24
Experimental Results
  • Roughly 80% of forms in our test set were
    automatically processed correctly
  • Sources of failure
  • Missing required fields (user must supply)
  • No records from default and sample queries
  • Invalid URL (Web site error)
  • For 1/3 of forms, the default query returned all
    records

25
Experimental Results
  • Processing a single HTTP request took between 2
    and 25 seconds on average
  • A single query (including following links) took
    between 5 seconds and 14 minutes
  • The number of "next" links ranged from none to
    more than 140
  • Sampling took from 30 seconds to 3 hours per form
  • In all cases, manual verification corroborated
    what the system reported

26
Time Saved
  • When the sampling phase successfully returned all
    records, considerable time was saved compared to
    exhaustive query
  • 15 minutes
  • Almost 3 hours
  • > 4 days
  • > 40 days

27
Future Work
  • Conduct more experiments
  • To further validate our initial results
  • To learn how to improve
  • Better metrics
  • Integrate this tool into our ontology-based data
    extraction framework
  • Upstream automatic selection of
    domain-appropriate forms
  • Downstream automatic record-boundary detection
    and extraction

28
Intent of Form
  • Is the purpose of the form transactional or
    informational?
  • Transactional
  • Purchase a DVD
  • Transfer money between accounts
  • Update customer information
  • Request contact from a sales representative
  • Goal of transactional form is to interact with a
    business partner to support a business process of
    some kind

29
Transactional vs. Informational
  • Informational form
  • Issues a query
  • Find documents or records matching given criteria
  • Goal of informational form is to retrieve data,
    not execute a business process
  • We're typically interested only in the
    informational forms
  • But eventually agents will need to handle
    transactional forms also

30
Conclusion
  • We have presented the prototype of a synergistic
    tool that
  • Automatically retrieves data behind HTML forms
  • Including following links to retrieve multiple
    pages of results associated with a single query
  • Is domain-independent
  • Can easily integrate with our ontology-based
    source discovery and data extraction tools
  • The world is ready for tools that understand and
    access the Hidden Web