Title: Using Weight-controlled Token Matching to Extract Data From HTML Files
- Yan Xu, Tok Wang Ling
- Dept. of Computer Science
- National University of Singapore
- (xuyan, lingtw)@comp.nus.edu.sg
2Outline
Outline
- Motivation and background
- Our approach
- Generate wrapper
- Extract data
- Experimental results
- Conclusion
3Motivation and Background
Motivation and Background
- What is a wrapper
- XML and HTML
- Related works
- Some criteria to build wrappers for Web pages
4What Is a Wrapper?
Motivation and Background
- A wrapper is a software component.
- A wrapper extracts data from source files and converts it into a structured form.
- On the Web, the source files are usually HTML files.
5What is a Wrapper? (Cont)
Motivation and Background
- The source files are usually semistructured or unstructured.
- We only discuss HTML files as source files in this paper.
6XML and HTML
Motivation and Background
- XML is more suitable than HTML for organizing data. HTML is simple, widely used, and widely accepted.
- More and more XML sites are appearing on the Web, but HTML files still far outnumber XML files.
7XML and HTML (Cont)
Motivation and Background
- Querying XML is easier, and a standard query language is coming. HTML files are usually queried through web search engines.
- XML contains no less information than HTML.
- XML is easier to convert to other data models, especially semistructured data models.
- XML is suitable as the output of a wrapper.
- XML provides one possible semantic interpretation of a document.
8Related Works
Motivation and Background
- Construct wrappers for HTML files manually or automatically
- Using specification files to extract data, such as the extraction system in TSIMMIS
- Advantages: sufficiently expressive and high precision
- Limitations: must be built by experienced programmers and is hard to maintain
- Building a wrapper is very time-consuming
9Related Works (Cont)
Motivation and Background
- Rule-based wrappers
- Using rules to extract data
- Inducing rules from training examples
- Using delimiter-based rules
- Such as WIEN, STALKER, SoftMealy
- Our wrapper is rule-based
10Some Criteria to Build Wrappers for Web Pages
Motivation and Background
- Simple and powerful extraction rules
- Need fewer examples and less user interaction
- Use HTML structure information as much as possible
- Easy to maintain and update
- Less time to build a wrapper
11Our Approach
Our Approach
- Rule-based wrapper
- Use delimiters to identify data
- Use training examples to induce rules
- Use a Weighted Token List to identify each delimiter
- Use rules and a threshold to extract data
12Our Approach (Cont)
Our Approach
- Weighted token list (WTL): a list of vectors. Each vector contains a set of <token, weight> pairs
- Token: an HTML tag, a word, or a punctuation mark in an HTML file
- Weight: how important a token is at its position. It is a number between 0 and 1.
- Generate the WTL using labeled examples
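A minimal Python sketch of how an HTML fragment could be split into the three kinds of tokens named above. The slides do not specify the tokenizer, so the details here are assumptions: tags are reduced to their lowercase names (attributes dropped), words are lowercased, and each punctuation character is its own token. The names `tokenize`, `TOKEN_RE` and `TAG_RE` are hypothetical.

```python
import re

# Assumed token pattern: an HTML tag, a run of word characters,
# or a single punctuation character.
TOKEN_RE = re.compile(r"<[^>]*>|\w+|[^\w\s]")
# Captures the optional closing slash and the tag name.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][\w-]*)[^>]*>")


def normalize(token):
    """Reduce a tag to <name> or </name>; lowercase everything else."""
    match = TAG_RE.fullmatch(token)
    if match:
        return "<" + match.group(1) + match.group(2).lower() + ">"
    return token.lower()


def tokenize(html):
    """Split HTML text into normalized tokens (tags, words, punctuation)."""
    return [normalize(tok) for tok in TOKEN_RE.findall(html)]


sample_tokens = tokenize("</A></b><br><font>by Nikola Ozu, et al </font>")
```

Dropping attributes is a design choice of this sketch: it keeps a tag such as an anchor with a long href comparable across pages.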
13An Example
Our Approach
- Part of a page from Amazon.Com
14An Example (Cont)
Our Approach
- We hope the wrapper could output the result below
- TITLE: professional xml (2nd edition)
- AUTHOR: nikola ozu, et al
- TYPE: paperback
- DATE: may 2001
- SHIPINFO: usually ships in 24 hours
- LISTPRICE: 59.99
- OURPRICE: 47.99
- Save: 20%
15An Example (Cont)
Our Approach
- The label information is input by the user
- The label is the meaning of the data, so that we can identify the extracted data, e.g. TITLE, AUTHOR, etc. on the previous page
- There are two kinds of users:
- The user who builds the wrapper
- The user who uses the generated wrapper to extract data
16An Example (Cont)
Our Approach
- Part of the HTML source code for the author information in the example is:
- </td> <td > <font><b>
- <A href="/exec/obidos/ASIN/1861005059/qid=992196848/sr=1-4/ref=sc_b_4/104-9977965-3139126">
- Professional XML (2nd Edition)
- </A></b><br><font>by Nikola Ozu, et al </font>
- ( Paperback - May 2001)<br>
17An Example (Cont)
Our Approach
- If we choose the token "by" as the left delimiter of the author data and the HTML tag </font> as the right delimiter, we will have high recall but low precision when we try to extract author data.
- If we choose a sequence of tokens as the delimiter, for example 5 tokens:
- The 5 tokens before the author information
- </a> </b> <br> <font> by
- and the 5 tokens after the author information
- </font> ( Paperback - May
18An Example (Cont)
Our Approach
- Surveying the entire example page (25 books), we find:
- The 5 tokens before the author data do not change, and they are expressive enough to be the left delimiter.
- The 5 tokens after the author data are not precise enough to be the delimiter. For example, for a hardcover book that has no publication date, the 5 tokens after the author data are
- </font> ( Hardcover ) <br>
- With a fixed token sequence as the delimiter, we will have high precision but low recall
19An Example (Cont)
Our Approach
- Surveying the example page, we find 6 out of 25 books are hardcover and 18 out of 25 books are paperback.
- Using 3 books as training examples, we obtain the following token lists
20An Example (Cont)
Our Approach
- The begin weighted token list (the tokens before the author data)
- The end weighted token list (the tokens after the author data)
21An Example (Cont)
Our Approach
- Each token near the data is associated with a weight at its position
- For example, <paperback, 0.67> means the token paperback is found 2 out of 3 times (i.e. 67%) in the training examples. We assign the probability of the token paperback (0.67) as the weight of this token at this position
22An Example (Cont)
Our Approach
- The Weighted Token List to identify the left delimiter is
- by, 1.0
- <font>, 1.0
- <br>, 1.0
- </b>, 1.0
- </a>, 1.0
23An Example (Cont)
Our Approach
- The Weighted Token List to identify the right delimiter is
- </font>, 1.0
- (, 1.0
- hardcover, 0.33 | paperback, 0.67
- ), 0.67 | -, 0.33
- <br>, 0.67 | may, 0.33
- The colored line shows that two different tokens are found in the third position after the author data.
24An Example (Cont)
Our Approach
- Using the Weighted Token List, we achieve:
- A list of tokens as the delimiter
- By associating weights with tokens, we obtain a better recall-precision tradeoff
- We can tolerate small modifications of HTML pages, especially when the modification does not occur near the data
25Label the Example Page
Our Approach
26Label the Example Page (Cont)
Our Approach
- The user highlights the data of interest
- The user clicks the input-label button
- A dialog window pops up and the user inputs the label
- We insert the label into the HTML file following our specification
27Label the Example Page (Cont)
Our Approach
- After labeling, the modified HTML file is
- LABELTITLEProfessional XML (2nd Edition)INFOREND
- </a></b><br><font>by
- LABELAUTHORNikola Ozu, et alINFOREND
- </font>(
- LABELTYPEPaperbackINFOREND
- LABELDATEMay 2001INFOREND)<br>
- The user-input parts are TITLE, AUTHOR, etc.
28An Extraction Rule Has
Our Approach
- Label information
- Delimiter information
- Begin WTL (BWTL): a WTL that describes a list of tokens as the begin delimiter
- End WTL (EWTL): a WTL that describes a list of tokens as the end delimiter
- A rule contains enough information to extract a piece of data
29A Rule Looks Like
Our Approach
- <LABEL, BWTL, EWTL>
- LABEL is AUTHOR in our example
- BWTL in our example is
- by, 1.0
- <font>, 1.0
- <br>, 1.0
- </b>, 1.0
- </a>, 1.0
- EWTL in our example is
- </font>, 1.0
- (, 1.0
- hardcover, 0.33 | paperback, 0.67
- ), 0.67 | -, 0.33
- <br>, 0.67 | may, 0.33
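One way to hold such a rule in memory, sketched in Python. The representation is an assumption, not taken from the slides: each WTL is a list with one dictionary per position, mapping token to weight, ordered from the token nearest the data outward. The names `Rule`, `Position` and `author_rule` are hypothetical; the weights are the ones listed above.

```python
from dataclasses import dataclass
from typing import Dict, List

# One position of a WTL: token -> weight (assumed representation).
Position = Dict[str, float]


@dataclass
class Rule:
    label: str
    bwtl: List[Position]  # begin WTL, token nearest the data first
    ewtl: List[Position]  # end WTL, token nearest the data first


author_rule = Rule(
    label="AUTHOR",
    bwtl=[{"by": 1.0}, {"<font>": 1.0}, {"<br>": 1.0},
          {"</b>": 1.0}, {"</a>": 1.0}],
    ewtl=[{"</font>": 1.0}, {"(": 1.0},
          {"hardcover": 0.33, "paperback": 0.67},
          {")": 0.67, "-": 0.33},
          {"<br>": 0.67, "may": 0.33}],
)
```

With this layout the weights at every position sum to 1, which makes a rule easy to sanity-check after training.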
30How to Generate Rule?
Our Approach
- Find the label information, located after LABEL and before the next marker, in examples that the user has labeled with our GUI tool
- We set the number of tokens needed to 5. The user can generate rules with this setting and test the result; if the result is not good, the user can set the number manually
- Generate the BWTL for the left delimiter
- Generate the EWTL for the right delimiter
- Assemble the label, BWTL, and EWTL into a rule
31How to Generate WTL
Our Approach
- Find the begin point and end point of the data in the labeled training example
- Detect the lists of tokens before and after the data
- Use the collected tokens to generate a new WTL, or add the lists of tokens to the corresponding WTL, and calculate the weight of each token
- The weight of a token is the number of times the token appears at its position near the data, divided by the total number of times all tokens appear at that position near the same data in the training examples
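The weight calculation can be sketched in Python as a per-position frequency count. This is an illustration under assumptions: the function name `build_wtl` is hypothetical, and the three training token lists below are invented so that they reproduce the end-WTL weights of the author example (one paperback with a date, one paperback without, one hardcover without); the slides do not show the actual training items.

```python
from collections import Counter


def build_wtl(token_lists):
    """Build a WTL from token lists, one list per training example.

    Each list holds the tokens next to the data, nearest token first.
    The weight of a token at a position is its count at that position
    divided by the number of examples that have a token there.
    """
    n_positions = max(len(tokens) for tokens in token_lists)
    wtl = []
    for pos in range(n_positions):
        counts = Counter(t[pos] for t in token_lists if pos < len(t))
        total = sum(counts.values())
        wtl.append({tok: n / total for tok, n in counts.items()})
    return wtl


# Assumed training data: the 5 tokens after the author field of 3 books.
end_tokens = [
    ["</font>", "(", "paperback", "-", "may"],
    ["</font>", "(", "paperback", ")", "<br>"],
    ["</font>", "(", "hardcover", ")", "<br>"],
]
ewtl = build_wtl(end_tokens)
```

Adding a further training example only requires rebuilding the counts, which is consistent with the claim that rules are cheap to regenerate.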
32Extract Data Using Rules
Our Approach
- Tokenize the target HTML file
- Obtain a list of tokens and find the corresponding rule in the rule set
- Obtain the data
- Associate the label with the data
- Output the result
33Find the Corresponding Rule
Our Approach
- Obtain a list of tokens from the web page
- Find a rule in the rule set such that the given tokens are found in its Weighted Token List and the sum of the weights of the tokens is larger than the threshold multiplied by the number of tokens
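The matching test above can be written as a short Python sketch. Two points are assumptions rather than statements from the slides: a token that is absent from the WTL at its position contributes a weight of 0 instead of rejecting the match outright, and the candidate must have as many tokens as the WTL has positions. The names `score` and `matches` are hypothetical.

```python
def score(tokens, wtl):
    """Sum the weights of the tokens, position by position."""
    return sum(position.get(tok, 0.0) for tok, position in zip(tokens, wtl))


def matches(tokens, wtl, threshold=0.5):
    """True if the weight sum exceeds threshold * number of tokens."""
    if len(tokens) != len(wtl):
        return False
    return score(tokens, wtl) > threshold * len(tokens)


# End WTL of the author rule, nearest token first.
ewtl = [
    {"</font>": 1.0},
    {"(": 1.0},
    {"hardcover": 0.33, "paperback": 0.67},
    {")": 0.67, "-": 0.33},
    {"<br>": 0.67, "may": 0.33},
]

hardcover_ok = matches(["</font>", "(", "hardcover", ")", "<br>"], ewtl)
unrelated_ok = matches(["</td>", "<td>", "usually", "ships", "in"], ewtl)
```

Raising `threshold` makes the test stricter, which matches the recall-precision tradeoff reported in the results.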
34Threshold
Our Approach
- The threshold is between 0 and 1
- After testing, we found the result is usually good when the threshold is set between 0.4 and 0.6. We set it to 0.5 by default
- The user can test the wrapper and change the threshold
35Another Example
Our Approach
- HTML source code from Amazon.com for the author data of a book
- </a></b>
- <br><font>by Cisco Systems (Editor), Vito Amato</font>
- (Hardcover)
- <br>
36Another Example (Cont)
Our Approach
- We detect </a></b><br><font>by as the left delimiter of the author data; the weight sum is 5, larger than 5 × 0.5 = 2.5
- We detect </font>(Hardcover)<br> as the right delimiter of the author data; the weight sum is 3.7, larger than 5 × 0.5 = 2.5
- The author data is between the two lists of tokens
- Cisco Systems (Editor), Vito Amato
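The extraction in this example can be reproduced with a small Python sketch that scans the token stream for a position whose 5 preceding tokens pass the begin WTL and then for the first later position whose 5 following tokens pass the end WTL. The scanning strategy, the pre-tokenized lowercase input, and the names `score` and `extract` are assumptions made for illustration; the slides do not describe the search procedure in this detail.

```python
def score(tokens, wtl):
    """Sum the weights of the tokens, position by position (missing = 0)."""
    return sum(position.get(tok, 0.0) for tok, position in zip(tokens, wtl))


def extract(tokens, bwtl, ewtl, threshold=0.5):
    """Return the tokens between the first matching begin and end delimiters."""
    n = len(bwtl)
    limit = threshold * n
    for start in range(n, len(tokens)):
        before = tokens[start - n:start][::-1]  # nearest token first
        if score(before, bwtl) <= limit:
            continue
        for end in range(start + 1, len(tokens) - len(ewtl) + 1):
            if score(tokens[end:end + len(ewtl)], ewtl) > limit:
                return tokens[start:end]
    return None


bwtl = [{"by": 1.0}, {"<font>": 1.0}, {"<br>": 1.0}, {"</b>": 1.0}, {"</a>": 1.0}]
ewtl = [{"</font>": 1.0}, {"(": 1.0}, {"hardcover": 0.33, "paperback": 0.67},
        {")": 0.67, "-": 0.33}, {"<br>": 0.67, "may": 0.33}]

page = ["</a>", "</b>", "<br>", "<font>", "by",
        "cisco", "systems", "(", "editor", ")", ",", "vito", "amato",
        "</font>", "(", "hardcover", ")", "<br>"]

author = extract(page, bwtl, ewtl)
# author == ['cisco', 'systems', '(', 'editor', ')', ',', 'vito', 'amato']
```

Note that the "(" and ")" inside the author text score only 1.67 against the end WTL, below the 2.5 limit, so the parenthesized "(Editor)" does not end the field early.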
37Result Analysis
Result Analysis
- We define
- field: a piece of data, the smallest unit that our wrapper can handle. For example, the author data of a book
- item: a group of fields, such as all the data of a book.
- Our wrapper's training example is an item. For example, an Amazon.com page usually contains information on more than ten books (ten items); we need only a few of them (3 items), not the entire page, to be labeled as training examples
38Result Analysis (Cont)
Result Analysis
- Basic information on the ten test web sites
- Java SDK 1.3; PC with Windows NT 4.0 Workstation (Intel PIII 800 / 128 MB RAM)
39Result Analysis (Cont)
Result Analysis
- Recall-precision table for the ten test web sites
40Recall and Precision
Result Analysis
- Recall and precision
- Recall: 80% of the sites have 100% recall
- Precision: 50% of the sites have 100% precision; all have more than 80% precision
- Four sites have 100% in both the recall and precision tests
- The recall of News.com is the lowest because News.com's web pages are assembled from several news and newspaper web sites
- The result shows the best recall-precision balance. Increasing the number of tokens gives better recall but lower precision. Increasing the threshold gives better precision but lower recall.
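The slides do not define recall and precision. Assuming the standard information-retrieval definitions (recall is the share of expected fields that were extracted correctly, precision is the share of extracted fields that are correct), they could be computed as in the Python sketch below. The function name `recall_precision` and the sample values are hypothetical and are not measurements from the experiments.

```python
def recall_precision(extracted, expected):
    """Return (recall, precision) for two collections of field values."""
    correct = len(set(extracted) & set(expected))
    recall = correct / len(expected) if expected else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision


# Made-up values: 2 of 4 expected fields found, 1 wrong extraction.
recall, precision = recall_precision(
    extracted=["a", "b", "x"],
    expected=["a", "b", "c", "d"],
)
```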
41Wrapper Generation Time
Result Analysis
- Wrapper generation time: except for two examples, all the others need less than 5 seconds
- Labeling time is not included in the wrapper generation time; labels are input with the help of our GUI tool. The labeling time depends on how many items are selected as training examples and how many fields each item contains. Labeling took less than 10 minutes for every example except the World Factbook example page
42Extraction Time
Result Analysis
- Data extraction time: 70% take less than 10 seconds
- The extraction time is related to the HTML file size, which is usually not very large.
- The wrapper generation time and the labeling time are acceptable
- The data extraction time is not too long and is tolerable for real-time web applications
43Compare to other approaches
Result Analysis
- Automatically generate wrappers and implement a friendly GUI tool to help the user input labels and extract data
- Simple and powerful rules that can deal with missing and mis-ordered items in web pages
44Compare to other approaches (Cont)
Result Analysis
- We need fewer training examples because:
- When the HTML file does not have missing or mis-ordered items, we require no more examples than other methods.
- When there are missing or mis-ordered items, the training examples do not need to cover every combination of missing and mis-ordered items in the web pages
- Quick wrapper generation and the assignment of weights to tokens make maintenance and updates easier
45Conclusion
Conclusion
- Use a weighted token list to find and extract data from HTML files.
- A friendly GUI tool to generate wrappers easily
- Acceptable results
46Reference
- [1] S. Abiteboul. Querying Semistructured Data. In Proceedings of the International Conference on Database Theory (ICDT), January 1997.
- [2] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel Query Language for Semistructured Data. Journal of Digital Libraries, November 1996: 68-88.
- [3] Naveen Ashish, Craig A. Knoblock. Semi-Automatic Wrapper Generation for Internet Information Sources. CoopIS 1997: 160-169.
- [4] Naveen Ashish and Craig Knoblock. Wrapper Generation for Semi-Structured Internet Sources. SIGMOD Record 26(4): 8-15, 1997.
- [5] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS Project: Integration of Heterogeneous Information Sources. Proceedings of Tenth Anniversary Meeting of Information Processing Society of Japan, Tokyo, Japan, 1994: 7-18.
- [6] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo. Extracting Semistructured Information from the Web. In Proceedings of the Workshop on Management of Semistructured Data. Tucson, Arizona, May 1997.
- [7] Chun-nan Hsu et al. Finite-State Transducers for Semi-structured Data Extraction From the Web. Information Systems, 23(8): 521-538, 1998.
- [8] Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos. Wrapper Induction for Information Extraction. International Joint Conference on Artificial Intelligence: 729-737, 1997.
- [9] Ion Muslea, Steve Minton, Craig Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems 4: 93-114, 2001.
- [10] Arnaud Sahuguet, Fabien Azavant. WysiWyg Web Wrapper Factory (W4F). Unpublished, 1999. http://db.cis.upenn.edu/Research/w4f.html
- [11] W3C. HTML 4.01 specification, http://www.w3.org/TR/html4/
- [12] W3C. XML 1.0, http://www.w3.org/TR/1998/REC-xml-19980210