Using Weight-controlled Token Matching to Extract Data From HTML Files


Title: Using Weight-controlled Token Matching to Extract Data From HTML Files


1
Using Weight-controlled Token Matching to Extract
Data From HTML Files
  • Yan Xu, Tok Wang Ling
  • Dept. of Computer Science
  • National University of Singapore
  • (xuyan, lingtw)@comp.nus.edu.sg

2
Outline
Outline
  • Motivation and background
  • Our approach
  • Generate wrapper
  • Extract data
  • Experimental results
  • Conclusion

3
Motivation and Background
Motivation and Background
  • What is a wrapper
  • XML and HTML
  • Related works
  • Some criteria to build wrappers for Web pages

4
What Is a Wrapper?
Motivation and Background
  • A wrapper is a software component.
  • A wrapper is used to extract data from source
    files and convert it into a structured form.
  • On the Web, the source files are usually HTML
    files.

5
What is a Wrapper? (Cont)
Motivation and Background
  • The source files are usually semistructured or
    unstructured.
  • We only discuss HTML files as source files in
    this paper.

6
XML and HTML
Motivation and Background
  • XML is better suited to organizing data than
    HTML. HTML is simple, widely used, and widely
    accepted.
  • More and more XML sites are appearing on the Web,
    but HTML files still far outnumber XML files.

7
XML and HTML (Cont)
Motivation and Background
  • Querying XML is easier, and a standard query
    language is coming. HTML files are usually
    queried through web search engines.
  • XML carries no less information than HTML
  • XML is easier to convert to other data models,
    especially semistructured data models
  • XML is a suitable output format for a wrapper
  • XML provides one possible semantic
    interpretation of a document.

8
Related Works
Motivation and Background
  • Construct wrappers for HTML files manually or
    automatically
  • Use specification files to extract data, as in
    the extraction system in TSIMMIS
  • Advantages: sufficiently expressive and high
    precision
  • Limitations: must be built by experienced
    programmers and are hard to maintain
  • Very time-consuming to build a wrapper

9
Related Works (Cont)
Motivation and Background
  • Rule-based wrappers
  • Using rules to extract data
  • Inducing rules from training examples
  • Using delimiter-based rules
  • Examples: WIEN, STALKER, SoftMealy
  • Our wrapper is rule-based

10
Some Criteria to Build Wrappers for Web Pages
Motivation and Background
  • Simple and powerful extracting rules
  • Need fewer examples and less user interaction
  • Use HTML structure information as much as
    possible
  • Easy to maintain and update
  • Less time to build a wrapper

11
Our Approach
Our Approach
  • Rule-based wrapper
  • Use delimiters to identify data
  • Use training examples to induce rules
  • Use a weighted token list to identify each
    delimiter
  • Use rules and a threshold to extract data

12
Our Approach (Cont)
Our Approach
  • Weighted token list (WTL): a list of vectors. Each
    vector contains a set of <token, weight> pairs
  • A token can be an HTML tag, a word, or a
    punctuation mark in an HTML file
  • Weight: how important a token is in its position.
    It is a number between 0 and 1.
  • Generate WTL using labeled examples
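
A minimal sketch of one way a weighted token list could be held in memory, assuming one dictionary per position that maps each token to its weight. The alias `WTL`, the helper `weight_at`, and the sample values are illustrative names and data, not part of the slides.

```python
from typing import Dict, List

# One dict per position relative to the data; each maps token -> weight in [0, 1].
WTL = List[Dict[str, float]]


def weight_at(wtl: WTL, position: int, token: str) -> float:
    """Return the weight of `token` at `position`, or 0.0 if it was never seen there."""
    if position < 0 or position >= len(wtl):
        return 0.0
    return wtl[position].get(token, 0.0)


# Illustrative values only.
example_wtl: WTL = [
    {"</font>": 1.0},
    {"(": 1.0},
    {"hardcover": 0.33, "paperback": 0.67},
]

print(weight_at(example_wtl, 2, "paperback"))
print(weight_at(example_wtl, 2, "audio"))
```

Keeping a dictionary per position lets several alternative tokens share one position, each with its own weight.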

13
An Example
Our Approach
  • Part of a page from Amazon.com

14
An Example (Cont)
Our Approach
  • We want the wrapper to output the following
    result
  • TITLE: professional xml (2nd edition)
  • AUTHOR: nikola ozu, et al
  • TYPE: paperback
  • DATE: may 2001
  • SHIPINFO: usually ships in 24 hours
  • LISTPRICE: 59.99
  • OURPRICE: 47.99
  • Save: 20%

15
An Example (Cont)
Our Approach
  • The label information is input by the user
  • The label gives the meaning of the data, so that
    we can identify the extracted data. Examples are
    TITLE, AUTHOR, etc. on the previous page
  • There are two kinds of users:
  • The user who builds the wrapper
  • The user who uses the generated wrapper to
    extract data

16
An Example (Cont)
Our Approach
  • Part of the HTML source code for the author
    information in the example is
  • </td> <td > <font><b>
  • <A href="/exec/obidos/ASIN/1861005059/qid992196
    84 8/sr1-4/refsc_b_4/104-9977965-3139126">
  • Professional XML (2nd Edition)
  • </A></b><br><font>by Nikola Ozu, et al </font>
  • ( Paperback May 2001)<br>
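
A minimal tokenizer sketch for source code like the fragment above, assuming that tags are reduced to their lower-cased name without attributes and that words are lower-cased. The slides do not specify the tokenizer, so the function `tokenize` and the regular expression are assumptions made for illustration.

```python
import re
from typing import List

# A tag (attributes discarded), or a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>|\w+|[^\w\s]")


def tokenize(html: str) -> List[str]:
    """Split HTML text into tag, word, and punctuation tokens (all lower-cased)."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        if match.group(2):
            # HTML tag: keep only "<name>" or "</name>".
            tokens.append(f"<{match.group(1)}{match.group(2).lower()}>")
        else:
            tokens.append(match.group(0).lower())
    return tokens


snippet = "</A></b><br><font>by Nikola Ozu, et al </font>"
print(tokenize(snippet))
```

Dropping attributes keeps session-specific values, such as the long `href` in the example, out of the delimiter tokens.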

17
An Example (Cont)
Our Approach
  • If we choose the token "by" as the left delimiter
    of the author data, and the HTML tag </font> as
    the right delimiter, we get high recall but low
    precision when we try to extract author data.
  • If we choose a sequence of tokens as the
    delimiter, for example 5 tokens:
  • The 5 tokens before the author information are
  • <a> </b> <br> <font> by
  • and the 5 tokens after the author information are
  • </font> ( Paperback - May

18
An Example (Cont)
Our Approach
  • Surveying the entire example page (25 books), we
    find:
  • The 5 tokens before the author data do not change
    and they are expressive enough to be the left
    delimiter.
  • The 5 tokens after the author data are not
    precise enough to be the delimiter. For example,
    one book is a hardcover with no publication date,
    so the 5 tokens after its author data are
  • </font> ( Hardcover ) <br>
  • We will have high precision but low recall

19
An Example (Cont)
Our Approach
  • Surveying the example page, we find 6 out of 25
    books are hardcover and 18 out of 25 books are
    paperback.
  • Using 3 books as training examples, we obtain the
    following token lists

20
An Example (Cont)
Our Approach
  • The begin weighted token list (the tokens before
    the author data)
  • The end weighted token list (the tokens after the
    author data)

21
An Example (Cont)
Our Approach
  • Each token near the data is associated with a
    weight at its position
  • For example, <paperback, 0.67> means the token
    paperback is found 2 out of 3 times (i.e. 67%)
    in the training examples. We assign this
    probability (0.67) as the weight of the token
    paperback in this position

22
An Example (Cont)
Our Approach
  • The Weighted Token List to identify the left
    delimiter is
  • by,1.0
  • <font>,1.0
  • <br>,1.0
  • </b>,1.0
  • <a>,1.0

23
An Example (Cont)
Our Approach
  • The Weighted Token List to identify the right
    delimiter is
  • </font>,1.0
  • (,1.0
  • hardcover,0.33 paperback,0.67
  • ),0.67 -,0.33
  • <br>,0.67 may,0.33
  • A line with two pairs means that two different
    tokens were found at that position after the
    author data, as in the third position.

24
An Example (Cont)
Our Approach
  • Using a Weighted Token List, we achieve:
  • A list of tokens as the delimiter
  • By associating weights with tokens, we obtain a
    better recall-precision tradeoff
  • We can tolerate small modifications to HTML
    pages, especially when the modification does not
    occur near the data

25
Label the Example Page
Our Approach
  • Using our GUI tool

26
Label the Example Page (Cont)
Our Approach
  • The user highlights the data of interest
  • The user clicks the input label button
  • A dialog window pops up and the user inputs the
    label
  • We insert the label into the HTML file following
    our specification

27
Label the Example Page (Cont)
Our Approach
  • After labeling, the modified HTML file is
  • LABELTITLEProfessional XML (2nd
    Edition)INFOREND
  • </a></b><br><font>by
  • LABELAUTHORNikola Ozu, et alINFOREND
  • </font>(
  • LABELTYPEPaperbackINFOREND
  • LABELDATEMay 2001INFOREND)<br>
  • The parts input by the user are TITLE, AUTHOR,
    etc.

28
An Extraction Rule Has
Our Approach
  • Label information
  • Delimiter information
  • Begin WTL (BWTL): a WTL that describes a list of
    tokens as the begin delimiter
  • End WTL (EWTL): a WTL that describes a list of
    tokens as the end delimiter
  • A rule contains enough information to extract a
    piece of data

29
A Rule Looks Like
Our Approach
  • <LABEL, BWTL, EWTL>
  • LABEL is AUTHOR in our example
  • BWTL in our example is
  • by,1.0
  • <font>,1.0
  • <br>,1.0
  • </b>,1.0
  • <a>,1.0
  • EWTL in our example is
  • </font>,1.0
  • (,1.0
  • hardcover,0.33 paperback,0.67
  • ),0.67 -,0.33
  • <br>,0.67 may,0.33
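
As an illustration, the AUTHOR rule above could be written down as a small record. This is only a sketch: the class name `Rule` and its field names are assumptions, and each WTL is stored with the token nearest to the data first, following the order in which the slides list them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    label: str                     # meaning of the extracted data, e.g. "AUTHOR"
    bwtl: List[Dict[str, float]]   # begin WTL, nearest-to-data position first
    ewtl: List[Dict[str, float]]   # end WTL, nearest-to-data position first


# Token spellings and weights copied from the example rule.
author_rule = Rule(
    label="AUTHOR",
    bwtl=[{"by": 1.0}, {"<font>": 1.0}, {"<br>": 1.0}, {"</b>": 1.0}, {"<a>": 1.0}],
    ewtl=[
        {"</font>": 1.0},
        {"(": 1.0},
        {"hardcover": 0.33, "paperback": 0.67},
        {")": 0.67, "-": 0.33},
        {"<br>": 0.67, "may": 0.33},
    ],
)

print(author_rule.label, len(author_rule.bwtl), len(author_rule.ewtl))
```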

30
How to Generate Rule?
Our Approach
  • Find label information after LABEL
  • and before the next from examples that the
    user labeled using our GUI tool
  • We set the number of tokens to 5 by default. The
    user can generate rules with this setting and
    test the result; if the result is not good, the
    user can set the number manually
  • Generate the BWTL for the left delimiter
  • Generate the EWTL for the right delimiter
  • Assemble the label, BWTL and EWTL into a rule

31
How to Generate WTL
Our Approach
  • Find the begin point and end point of the data
    in the labeled training example
  • Detect the lists of tokens before and after the
    data
  • Use the collected tokens to generate a new WTL,
    or add the lists of tokens to the corresponding
    WTL, and calculate the weight of each token
  • The weight of a token is the number of times the
    token appears at its position near the data,
    divided by the total number of times any token
    appears at that position near the same data in
    the training examples
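
The weight calculation described above can be sketched as follows, assuming every training example contributes a token list of the same length. The function `build_wtl` is a hypothetical name, and the three token lists are assumed training data, chosen so that the result reproduces the end WTL weights shown in the author example.

```python
from collections import Counter
from typing import Dict, List


def build_wtl(token_lists: List[List[str]]) -> List[Dict[str, float]]:
    """Build a weighted token list from equal-length token lists, one per example."""
    if not token_lists:
        return []
    wtl = []
    for position in range(len(token_lists[0])):
        # Count how often each token occurs at this position across the examples.
        counts = Counter(tokens[position] for tokens in token_lists)
        total = sum(counts.values())
        wtl.append({token: round(n / total, 2) for token, n in counts.items()})
    return wtl


# Assumed tokens after the author data in three labeled books.
end_tokens = [
    ["</font>", "(", "paperback", "-", "may"],
    ["</font>", "(", "paperback", ")", "<br>"],
    ["</font>", "(", "hardcover", ")", "<br>"],
]

for position, weights in enumerate(build_wtl(end_tokens)):
    print(position, weights)
```

Adding a further training example only means appending its token list and recomputing the counts.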

32
Extract Data Using Rules
Our Approach
  • Tokenize the target HTML file
  • Obtain a list of tokens and find the
    corresponding rule in the rule set
  • Obtain the data
  • Associate the label with the data
  • Output the result

33
Find the Corresponding Rule
Our Approach
  • Obtain a list of tokens from the web page
  • Find a rule in the rule set such that the given
    tokens are found in its Weighted Token List and
    the sum of the tokens' weights is larger than
    the threshold multiplied by the number of tokens
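
A sketch of the rule lookup, reading the condition literally: every token must be known at its position, and the summed weight must exceed the threshold times the number of tokens. The slides could also be read as letting an unseen token contribute weight 0; that variant is not shown. The function `find_rule`, the rule-set layout (label mapped to begin WTL), and the TITLE entry are assumptions for illustration.

```python
from typing import Dict, List, Optional

WTL = List[Dict[str, float]]


def find_rule(tokens: List[str], rule_set: Dict[str, WTL],
              threshold: float = 0.5) -> Optional[str]:
    """Return the label of the first rule whose WTL accepts `tokens`, else None."""
    for label, wtl in rule_set.items():
        if len(tokens) != len(wtl):
            continue
        if any(token not in weights for token, weights in zip(tokens, wtl)):
            continue  # some token was never seen at this position
        total = sum(weights[token] for token, weights in zip(tokens, wtl))
        if total > threshold * len(tokens):
            return label
    return None


# Begin WTLs, nearest-to-data token first. The TITLE entry is hypothetical.
rules = {
    "TITLE": [{"<a>": 1.0}, {"<b>": 1.0}, {"<font>": 1.0}, {"<td>": 1.0}, {"</td>": 1.0}],
    "AUTHOR": [{"by": 1.0}, {"<font>": 1.0}, {"<br>": 1.0}, {"</b>": 1.0}, {"<a>": 1.0}],
}

print(find_rule(["by", "<font>", "<br>", "</b>", "<a>"], rules))
print(find_rule(["in", "<font>", "<br>", "</b>", "<a>"], rules))
```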

34
Threshold
Our Approach
  • The threshold is between 0 and 1
  • After testing, we found the result is usually
    good when the threshold is set between 0.4 and
    0.6. We set it to 0.5 by default
  • The user can test the wrapper and change the
    threshold

35
Another Example
Our Approach
  • HTML source code from Amazon.com for the author
    data of a book
  • </a></b>
  • <br><font>by Cisco Systems (Editor), Vito
    Amato</font>
  • (Hardcover)
  • <br>

36
Another Example (Cont)
Our Approach
  • We detect <a></b><br><font>by as the left
    delimiter of the author data; the weight is 5,
    which is larger than 5 * 0.5
  • We detect </font>(Hardcover)<br> as the right
    delimiter of the author data; the weight is 3.7,
    which is larger than 5 * 0.5
  • The author data is between the two lists of
    tokens
  • Cisco Systems (Editor), Vito Amato
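
The example above can be replayed with a short sketch, assuming the page has already been tokenized and that a token unseen at its position rejects the window. The helpers `score` and `extract` are hypothetical names, and token spellings (including `<a>`) follow the token lists given in the slides.

```python
from typing import Dict, List, Optional

WTL = List[Dict[str, float]]

BWTL: WTL = [{"by": 1.0}, {"<font>": 1.0}, {"<br>": 1.0}, {"</b>": 1.0}, {"<a>": 1.0}]
EWTL: WTL = [{"</font>": 1.0}, {"(": 1.0}, {"hardcover": 0.33, "paperback": 0.67},
             {")": 0.67, "-": 0.33}, {"<br>": 0.67, "may": 0.33}]


def score(tokens: List[str], wtl: WTL) -> Optional[float]:
    """Sum the position weights; None if lengths differ or a token is unknown."""
    if len(tokens) != len(wtl):
        return None
    total = 0.0
    for token, weights in zip(tokens, wtl):
        if token not in weights:
            return None
        total += weights[token]
    return total


def extract(tokens: List[str], bwtl: WTL, ewtl: WTL,
            threshold: float = 0.5) -> Optional[List[str]]:
    """Return the tokens between the first accepted begin and end delimiters."""
    n, m = len(bwtl), len(ewtl)
    for start in range(n, len(tokens)):
        # The BWTL is stored nearest-to-data first, so reverse the window.
        begin = score(tokens[start - n:start][::-1], bwtl)
        if begin is None or begin <= threshold * n:
            continue
        for end in range(start, len(tokens) - m + 1):
            finish = score(tokens[end:end + m], ewtl)
            if finish is not None and finish > threshold * m:
                return tokens[start:end]
    return None


page = ["<a>", "</b>", "<br>", "<font>", "by",
        "cisco", "systems", "(", "editor", ")", ",", "vito", "amato",
        "</font>", "(", "hardcover", ")", "<br>"]

print(score(page[:5][::-1], BWTL))   # weight of the left delimiter
print(score(page[13:], EWTL))        # weight of the right delimiter
print(extract(page, BWTL, EWTL))
```

The right delimiter scores 1.0 + 1.0 + 0.33 + 0.67 + 0.67 = 3.67, which the slide rounds to 3.7; both delimiters exceed 5 * 0.5 = 2.5.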

37
Result Analysis
Result Analysis
  • We define:
  • Field: a piece of data, the smallest unit that
    our wrapper can handle. For example, the author
    data of a book
  • Item: a group of fields, such as all the data of
    a book.
  • Our wrapper's training examples are items. For
    example, an Amazon.com page usually contains
    information on more than ten books (ten items);
    we need only several of them (3 items), not the
    entire page, to be labeled as training examples

38
Result Analysis (Cont)
Result Analysis
  • Basic information on the ten test web sites
  • Environment: Java SDK 1.3 on a PC running
    Windows NT 4.0 Workstation (Intel PIII 800, 128
    MB RAM)

39
Result Analysis (Cont)
Result Analysis
  • Recall-precision table for the ten test web sites

40
Recall and Precision
Result Analysis
  • Recall and precision
  • Recall: 80% of the test sites have 100% recall
  • Precision: 50% of the test sites have 100%
    precision; all have more than 80% precision
  • Four sites reach 100% in both the recall and
    precision tests
  • The recall of News.com is the lowest because
    News.com's web pages are assembled from several
    news and newspaper web sites
  • The results show the best recall-precision
    balance. Increasing the number of tokens gives
    better recall but lower precision. Increasing
    the threshold gives better precision but lower
    recall.
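
For reference, the two measures can be computed as below using their standard definitions. The slides do not spell out how the fields were counted, so the helper names and the counts are purely illustrative and are not taken from the experiments.

```python
def recall(correct: int, relevant: int) -> float:
    """Share of the fields that should be extracted that were extracted correctly."""
    return correct / relevant if relevant else 0.0


def precision(correct: int, extracted: int) -> float:
    """Share of the extracted fields that are correct."""
    return correct / extracted if extracted else 0.0


# Hypothetical counts for one page, not experimental results.
relevant, extracted, correct = 25, 24, 22
print(f"recall={recall(correct, relevant):.2f}")
print(f"precision={precision(correct, extracted):.2f}")
```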

41
Wrapper Generation Time
Result Analysis
  • Wrapper generation time: except for two examples,
    all the others need less than 5 seconds
  • Labeling time is not included in the wrapper
    generation time, and labels are input with the
    help of our GUI tool. The time depends on how
    many items are selected as training examples and
    how many fields one item contains. Labeling time
    is less than 10 minutes for all examples except
    the worldfact book example page

42
Extraction Time
Result Analysis
  • Data extraction time: 70% of the examples take
    less than 10 seconds
  • The extraction time is related to the HTML file
    size. HTML files are usually not very large.
  • The wrapper generation time and the labeling time
    are acceptable
  • The data extraction time is not too long and is
    tolerable in real-time web applications

43
Comparison with Other Approaches
Result Analysis
  • We generate wrappers automatically and provide a
    friendly GUI tool to help users input labels and
    extract data
  • Simple and powerful rules that can deal with
    missing and misordered items in web pages

44
Comparison with Other Approaches (Cont)
Result Analysis
  • We need fewer training examples because:
  • When the HTML file has no missing or misordered
    items, we need no more examples than other
    methods.
  • When there are missing or misordered items, the
    training examples need not cover every case of
    missing and misordered items in the web pages
  • Quick wrapper generation and the assignment of
    weights to tokens make maintenance and updates
    easier

45
Conclusion
Conclusion
  • Use a weighted token list to find and extract
    data from HTML files.
  • A friendly GUI tool to generate wrappers easily
  • Acceptable results

46
References
  • [1] S. Abiteboul. Querying Semistructured Data.
    In Proceedings of the International Conference on
    Database Theory (ICDT), January 1997.
  • [2] S. Abiteboul, D. Quass, J. McHugh, J. Widom,
    and J. Wiener. The Lorel Query Language for
    Semistructured Data. Journal of Digital
    Libraries, November 1996: 68-88.
  • [3] Naveen Ashish, Craig A. Knoblock.
    Semi-Automatic Wrapper Generation for Internet
    Information Sources. CoopIS 1997: 160-169.
  • [4] Naveen Ashish and Craig Knoblock. Wrapper
    Generation for Semi-Structured Internet Sources.
    SIGMOD Record 26(4): 8-15, 1997.
  • [5] S. Chawathe, H. Garcia-Molina, J. Hammer, K.
    Ireland, Y. Papakonstantinou, J. Ullman, and J.
    Widom. The TSIMMIS Project: Integration of
    Heterogeneous Information Sources. Proceedings of
    Tenth Anniversary Meeting of Information
    Processing Society of Japan, Tokyo, Japan, 1994:
    7-18.
  • [6] J. Hammer, H. Garcia-Molina, J. Cho, R.
    Aranha, A. Crespo. Extracting Semistructured
    Information from the Web. In Proceedings of the
    Workshop on Management of Semistructured Data.
    Tucson, Arizona, May 1997.
  • [7] Chun-nan Hsu et al. Finite-State Transducers
    for Semi-structured Data Extraction From the Web.
    Information Systems, 23(8): 521-538, 1998.
  • [8] Nicholas Kushmerick, Daniel S. Weld, Robert
    Doorenbos. Wrapper Induction for Information
    Extraction. International Joint Conference on
    Artificial Intelligence: 729-737, 1997.
  • [9] Ion Muslea, Steve Minton, Craig Knoblock.
    Hierarchical Wrapper Induction for Semistructured
    Information Sources. Journal of Autonomous Agents
    and Multi-Agent Systems 4: 93-114, 2001.
  • [10] Arnaud Sahuguet, Fabien Azavant. WysiWyg Web
    Wrapper Factory (W4F). Unpublished, 1999.
    http://db.cis.upenn.edu/Research/w4f.html
  • [11] W3C. HTML 4.01 specification,
    http://www.w3.org/TR/html4/
  • [12] W3C. XML 1.0,
    http://www.w3.org/TR/1998/REC-xml-19980210