Using Weight-controlled Token Matching to Extract Data From HTML Files presentation

1
Using Weight-controlled Token Matching to Extract
Data From HTML Files

Yan Xu, Tok Wang Ling
Dept. of Computer Science
National University of Singapore
(xuyan, lingtw)_at_comp.nus.edu.sg

2
Outline
Outline

Motivation and background
Our approach
Generating the wrapper
Extracting data
Experimental results
Conclusion

3
Motivation and Background
Motivation and Background

What is a wrapper
XML and HTML
Related works
Some criteria to build wrappers for Web pages

4
What Is a Wrapper?
Motivation and Background

A wrapper is a software component.
A wrapper extracts data from source files and
converts it into a structured form.
On the Web, the source files are usually HTML
files.

5
What is a Wrapper? (Cont)
Motivation and Background

The source files are usually semistructured or
unstructured.
We only discuss HTML files as source files in
this paper.

6
XML and HTML
Motivation and Background

XML is better suited than HTML to organizing data.
HTML is simple, widely used, and widely accepted.
More and more XML sites appear on the Web.
HTML files still far outnumber XML files on the
Web.

7
XML and HTML (Cont)
Motivation and Background

Querying XML is easier, and a standard query
language is coming. HTML files are usually
queried through web search engines.
XML contains no less information than HTML.
XML is easier to convert to other data models,
especially semistructured data models.
XML is suitable as the output of a wrapper.
XML provides one possible semantic
interpretation of a document.

8
Related Works
Motivation and Background

Wrappers for HTML files are constructed manually
or automatically.
Specification files can be used to extract data,
as in the extraction system of TSIMMIS.
Advantages: sufficiently expressive, with high
precision.
Limitations: must be built by experienced
programmers and are hard to maintain.
Building a wrapper is very time consuming.

9
Related Works (Cont)
Motivation and Background

Rule-based wrappers
Use rules to extract data
Induce rules from training examples
Use delimiter-based rules
Examples: WIEN, STALKER, SoftMealy
Our wrapper is rule-based

10
Some Criteria to Build Wrappers for Web Pages
Motivation and Background

Simple and powerful extraction rules
Need fewer examples and less user interaction
Use HTML structure information as much as
possible
Easy to maintain and update
Less time to build a wrapper

11
Our Approach
Our Approach

Rule-based wrapper
Use delimiters to identify data
Use training examples to induce rules
Use a Weighted Token List to identify delimiters
Use rules and a threshold to extract data

12
Our Approach (Cont)
Our Approach

Weighted token list (WTL): a list of vectors.
Each vector contains a set of <token, weight>
pairs.
Token: an HTML tag, a word, or a punctuation mark
in an HTML file.
Weight: how important a token is in its position.
It is a number between 0 and 1.
WTLs are generated from labeled examples.
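
One possible in-memory shape for a WTL, given only as a sketch: the slides do not say how the authors store it. The type aliases and the helper `weight_at` are hypothetical names. The sample weights are the ones shown on later slides for the tokens after the author data.

```python
from typing import Dict, List

# One position in the list: every token seen there, mapped to its weight.
WeightedPosition = Dict[str, float]
# A weighted token list: one entry per position, nearest to the data first.
WeightedTokenList = List[WeightedPosition]

example_wtl: WeightedTokenList = [
    {"</font>": 1.0},
    {"(": 1.0},
    {"hardcover": 0.33, "paperback": 0.67},
]


def weight_at(wtl: WeightedTokenList, position: int, token: str) -> float:
    """Return the weight of `token` at `position`, or 0.0 if it was never seen there."""
    if position < 0 or position >= len(wtl):
        return 0.0
    # Tokens are compared in lower case (an assumption of this sketch).
    return wtl[position].get(token.lower(), 0.0)


print(weight_at(example_wtl, 2, "Paperback"))
print(weight_at(example_wtl, 2, "audio"))
```

A list of dictionaries keeps the position explicit, which matters because a weight describes a token at one position only.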

13
An Example
Our Approach

Part of a page from Amazon.Com

14
An Example (Cont)
Our Approach

We want the wrapper to output the following
result:
TITLE: professional xml (2nd edition)
AUTHOR: nikola ozu, et al
TYPE: paperback
DATE: may 2001
SHIPINFO: usually ships in 24 hours
LISTPRICE: 59.99
OURPRICE: 47.99
Save: 20%

15
An Example (Cont)
Our Approach

The label information is input by the user.
A label gives the meaning of a piece of data, so
that the extracted data can be identified.
Examples are TITLE, AUTHOR, etc. on the previous
slide.
There are two kinds of users:
The user who builds the wrapper
The user who uses the generated wrapper to
extract data

16
An Example (Cont)
Our Approach

Part of the HTML source code containing the
author information in the example:
</td> <td > <font><b>
<A href="/exec/obidos/ASIN/1861005059/qid=992196848/sr=1-4/ref=sc_b_4/104-9977965-3139126">
Professional XML (2nd Edition)
</A></b><br><font>by Nikola Ozu, et al </font>
( Paperback - May 2001)<br>

17
An Example (Cont)
Our Approach

If we choose the token by as the left delimiter
of the author data and the HTML tag </font> as
the right delimiter, we get high recall but low
precision when extracting author data.
Alternatively, we can choose a sequence of tokens
as the delimiter, for example 5 tokens.
The 5 tokens before the author information:
</a> </b> <br> <font> by
The 5 tokens after the author information:
</font> ( Paperback - May
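
The token sequences above could be produced by a tokenizer along these lines. This is a minimal sketch, not the authors' tokenizer: the regular expression, the lower-casing, and the dropping of tag attributes are all assumptions, and `tokenize` is a hypothetical name.

```python
import re
from typing import List

# A token is an HTML tag, a word, or a single punctuation character.
TOKEN_RE = re.compile(r"<[^>]*>|\w+|[^\w\s]")


def tokenize(html: str) -> List[str]:
    """Split HTML text into lower-cased tokens; tags lose their attributes."""
    tokens = []
    for raw in TOKEN_RE.findall(html):
        is_tag = raw.startswith("<") and raw.endswith(">")
        parts = raw[1:-1].split() if is_tag else []
        if parts:
            tokens.append(f"<{parts[0].lower()}>")  # '<A href=...>' becomes '<a>'
        else:
            tokens.append(raw.lower())
    return tokens


snippet = "</A></b><br><font>by Nikola Ozu, et al </font>( Paperback - May 2001)<br>"
tokens = tokenize(snippet)
print(tokens[:5])     # the 5 tokens before the author data
print(tokens[10:15])  # the 5 tokens after the author data
```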

18
An Example (Cont)
Our Approach

Surveying the entire example page (25 books), we
find:
The 5 tokens before the author data do not change,
and they are expressive enough to serve as the
left delimiter.
The 5 tokens after the author data are not
consistent enough to serve as the delimiter. For
example, one book is a hardcover with no
publication date, so the 5 tokens after its
author data are
</font> ( Hardcover ) <br>
Matching all 5 tokens exactly gives high
precision but low recall.

19
An Example (Cont)
Our Approach

Surveying the example page, we find 6 out of 25
books are hardcover and 18 out of 25 books are
paperback.
Using 3 books as training examples, we obtain the
following token lists

20
An Example (Cont)
Our Approach

The begin weighted token list (the tokens before
the author data)
The end weighted token list (the tokens after the
author data)

21
An Example (Cont)
Our Approach

Each token near the data is associated with a
weight at its position.
For example, <paperback, 0.67> means that the token
paperback is found 2 out of 3 times (i.e. 67%)
in the training examples. We assign the
probability of the token paperback (0.67) as the
weight of this token in this position.

22
An Example (Cont)
Our Approach

The Weighted Token List that identifies the left
delimiter is
by,1.0
<font>,1.0
<br>,1.0
</b>,1.0
</a>,1.0

23
An Example (Cont)
Our Approach

The Weighted Token List that identifies the right
delimiter is
</font>,1.0
(,1.0
hardcover,0.33 paperback,0.67
),0.67 -,0.33
<br>,0.67 may,0.33
The colored line shows that two different tokens
were found in the third position after the author
data.

24
An Example (Cont)
Our Approach

Using a Weighted Token List, we achieve:
A list of tokens serves as the delimiter
By associating weights with tokens, we obtain a
better recall-precision tradeoff
We can tolerate small modifications to HTML pages,
especially when the modification does not occur
near the data

25
Label the Example Page
Our Approach

Using our GUI tool

26
Label the Example Page (Cont)
Our Approach

The user highlights the data of interest
The user clicks the input label button
A dialog window pops up and the user inputs the
label
We insert the label into the HTML file following
our specification

27
Label the Example Page (Cont)
Our Approach

After labeling, the modified HTML file is
LABELTITLEProfessional XML (2nd Edition)INFOREND
</a></b><br><font>by
LABELAUTHORNikola Ozu, et alINFOREND
</font>(
LABELTYPEPaperbackINFOREND
LABELDATEMay 2001INFOREND)<br>
The parts input by the user are TITLE, AUTHOR, etc.

28
An Extraction Rule Has
Our Approach

Label information
Delimiter information
Begin WTL (BWTL): a WTL that describes a list of
tokens serving as the begin delimiter
End WTL (EWTL): a WTL that describes a list of
tokens serving as the end delimiter
A rule contains enough information to extract a
piece of data

29
A Rule Looks Like
Our Approach

<LABEL, BWTL, EWTL>
LABEL is AUTHOR in our example
BWTL in our example is
by,1.0
<font>,1.0
<br>,1.0
</b>,1.0
</a>,1.0
EWTL in our example is
</font>,1.0
(,1.0
hardcover,0.33 paperback,0.67
),0.67 -,0.33
<br>,0.67 may,0.33
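
The rule triple could be held in a small record type, sketched here under the assumption that each WTL is a list of token-to-weight dictionaries ordered nearest-to-data first. The class name `Rule` and its field names are hypothetical. The weights are copied from this slide, with the first begin-delimiter token written as `</a>` to match the HTML source shown earlier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    """An extraction rule: a label plus begin and end weighted token lists."""
    label: str
    bwtl: List[Dict[str, float]]  # tokens before the data, nearest first
    ewtl: List[Dict[str, float]]  # tokens after the data, nearest first


AUTHOR_RULE = Rule(
    label="AUTHOR",
    bwtl=[{"by": 1.0}, {"<font>": 1.0}, {"<br>": 1.0}, {"</b>": 1.0}, {"</a>": 1.0}],
    ewtl=[
        {"</font>": 1.0},
        {"(": 1.0},
        {"hardcover": 0.33, "paperback": 0.67},
        {")": 0.67, "-": 0.33},
        {"<br>": 0.67, "may": 0.33},
    ],
)

print(AUTHOR_RULE.label, len(AUTHOR_RULE.bwtl), len(AUTHOR_RULE.ewtl))
```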

30
How to Generate Rule?
Our Approach

Find the label information after LABEL and before
the next delimiter in the examples that the user
labeled using our GUI tool
By default, the number of tokens is set to 5. The
user can generate rules with it and test the
result; if the result is not good, the user can
set the number manually
Generate the BWTL for the left delimiter
Generate the EWTL for the right delimiter
Assemble the label, BWTL, and EWTL into a rule

31
How to Generate WTL
Our Approach

Find the begin point and end point of the data
in the labeled training example
Collect the lists of tokens before and after the
data
Use the collected tokens to generate a new WTL, or
add them to the corresponding WTL, and calculate
the weight of each token
The weight of a token is the number of times the
token appears at a position near the data, divided
by the total number of times any token appears at
that position near the same data in the training
examples
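
The weight calculation can be sketched as a frequency count per position. This is an illustration under assumptions, not the authors' code: `build_wtl` is a hypothetical name, and the three training token lists are reconstructed so that they reproduce the weights shown on the earlier slides (the third list, a paperback without a date, is inferred from those weights).

```python
from collections import Counter
from typing import Dict, List


def build_wtl(token_lists: List[List[str]]) -> List[Dict[str, float]]:
    """Build a WTL from one token list per training example.

    Each token list is ordered nearest-to-data first. The weight of a token
    at a position is its count there divided by the number of tokens
    observed at that position.
    """
    length = max((len(tokens) for tokens in token_lists), default=0)
    wtl = []
    for pos in range(length):
        counts = Counter(t[pos] for t in token_lists if pos < len(t))
        total = sum(counts.values())
        wtl.append({token: n / total for token, n in counts.items()})
    return wtl


# Tokens after the author data in three labeled books (reconstructed).
after_author = [
    ["</font>", "(", "paperback", "-", "may"],
    ["</font>", "(", "hardcover", ")", "<br>"],
    ["</font>", "(", "paperback", ")", "<br>"],
]
ewtl = build_wtl(after_author)
for position in ewtl:
    print({token: round(weight, 2) for token, weight in position.items()})
```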

32
Extract Data Using Rules
Our Approach

Tokenize the target HTML file
Obtain a list of tokens and find the corresponding
rule in the rule set
Obtain the data
Associate the label with the data
Output the result

33
Find the Corresponding Rule
Our Approach

Obtain a list of tokens from the web page
Find a rule in the rule set such that the given
tokens are found in its Weighted Token List and
the sum of their weights is larger than the
threshold multiplied by the number of tokens
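
The matching condition can be sketched as follows. The function names `score` and `find_rule` are hypothetical, the rule set is simplified to a mapping from label to begin WTL, and reading "the number of tokens" as the length of the token window is an assumption.

```python
from typing import Dict, List, Optional, Sequence

WTL = List[Dict[str, float]]


def score(tokens: Sequence[str], wtl: WTL) -> float:
    """Sum the weight of each token at its position; unseen tokens add 0."""
    return sum(position.get(token, 0.0) for token, position in zip(tokens, wtl))


def find_rule(tokens: Sequence[str], rules: Dict[str, WTL],
              threshold: float = 0.5) -> Optional[str]:
    """Return the label of the first rule whose WTL matches the tokens."""
    for label, wtl in rules.items():
        if len(tokens) == len(wtl) and score(tokens, wtl) > threshold * len(tokens):
            return label
    return None


rules = {
    "AUTHOR": [{"by": 1.0}, {"<font>": 1.0}, {"<br>": 1.0}, {"</b>": 1.0}, {"</a>": 1.0}],
}
# Token windows are listed nearest-to-data first, like the WTL itself.
print(find_rule(["by", "<font>", "<br>", "</b>", "</a>"], rules))
print(find_rule(["by", "<font>", "<p>", "<td>", "<tr>"], rules))
```

In the second call only two of the five tokens match, so the sum (2.0) does not exceed 0.5 * 5 = 2.5 and no rule is selected.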

34
Threshold
Our Approach

The threshold is between 0 and 1
In testing, we found that results are usually
good when the threshold is set between 0.4 and
0.6. We set it to 0.5 by default
The user can test the wrapper and change the
threshold

35
Another Example
Our Approach

HTML source code from Amazon.com containing the
author data of a book:
</a></b>
<br><font>by Cisco Systems (Editor), Vito
Amato</font>
(Hardcover)
<br>

36
Another Example (Cont)
Our Approach

We detect </a></b><br><font>by as the left
delimiter of the author data; its weight is 5,
larger than 5 * 0.5 = 2.5
We detect </font>(Hardcover)<br> as the right
delimiter of the author data; its weight is 3.7,
larger than 5 * 0.5 = 2.5
The author data lies between the two lists of
tokens:
Cisco Systems (Editor), Vito Amato
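
This example can be replayed end to end with a short script. It is a sketch under assumptions rather than the authors' implementation: the tokenizer regex, the helper names, and the left-to-right scan order are hypothetical. The weights are the ones from the earlier AUTHOR rule, and the page text is the snippet above with the author names joined on one line.

```python
import re
from typing import Dict, List, Optional, Tuple

WTL = List[Dict[str, float]]
TOKEN_RE = re.compile(r"<[^>]*>|\w+|[^\w\s]")

# Both lists are ordered nearest-to-data first.
BWTL: WTL = [{"by": 1.0}, {"<font>": 1.0}, {"<br>": 1.0}, {"</b>": 1.0}, {"</a>": 1.0}]
EWTL: WTL = [
    {"</font>": 1.0},
    {"(": 1.0},
    {"hardcover": 0.33, "paperback": 0.67},
    {")": 0.67, "-": 0.33},
    {"<br>": 0.67, "may": 0.33},
]


def tokenize(html: str) -> List[Tuple[str, int, int]]:
    """Return (normalized token, start offset, end offset) triples."""
    result = []
    for m in TOKEN_RE.finditer(html):
        raw = m.group()
        is_tag = raw.startswith("<") and raw.endswith(">")
        parts = raw[1:-1].split() if is_tag else []
        token = f"<{parts[0].lower()}>" if parts else raw.lower()
        result.append((token, m.start(), m.end()))
    return result


def score(tokens: List[str], wtl: WTL) -> float:
    return sum(position.get(token, 0.0) for token, position in zip(tokens, wtl))


def extract(html: str, bwtl: WTL, ewtl: WTL, threshold: float = 0.5) -> Optional[str]:
    """Return the text between the first matching begin and end delimiters."""
    toks = tokenize(html)
    names = [token for token, _, _ in toks]
    n, m = len(bwtl), len(ewtl)
    for i in range(n, len(toks) + 1):
        before = names[i - n:i][::-1]  # reverse so the nearest token comes first
        if score(before, bwtl) <= threshold * n:
            continue
        for j in range(i + 1, len(toks) - m + 1):
            if score(names[j:j + m], ewtl) > threshold * m:
                return html[toks[i - 1][2]:toks[j][1]].strip()
    return None


PAGE = (
    "</a></b>\n"
    "<br><font>by Cisco Systems (Editor), Vito Amato</font>\n"
    "(Hardcover)\n"
    "<br>"
)
print(extract(PAGE, BWTL, EWTL))
```

The end delimiter scores 1.0 + 1.0 + 0.33 + 0.67 + 0.67 = 3.67, which is the 3.7 quoted above.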

37
Result Analysis
Result Analysis

We define:
Field: a piece of data, the smallest unit that
our wrapper can handle. For example, the author
data of a book.
Item: a group of fields, such as all the data of a
book.
Our wrapper's training examples are items. For
example, an Amazon.com page usually contains
information on more than ten books (ten items); we
need to label only a few of them (3 items), not
the entire page, as training examples

38
Result Analysis (Cont)
Result Analysis

Ten test web sites: basic information
Test environment: Java SDK 1.3 on a PC running
Windows NT 4.0 Workstation (Intel PIII 800, 128 MB
RAM)

39
Result Analysis (Cont)
Result Analysis

Ten test web sites: recall-precision table

40
Recall and Precision
Result Analysis

Recall and precision
Recall: 80% of the sites have 100% recall
Precision: 50% of the sites have 100% precision;
all have more than 80% precision
Four sites have 100% in both recall and precision
The recall for News.com is the lowest because
News.com's web pages are assembled from several
news and newspaper web sites
The result shows the best recall-precision
balance. Increasing the number of tokens gives
better recall but lower precision. Increasing the
threshold gives better precision but lower
recall.
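
The slides do not spell out how recall and precision were measured. Under the usual information-retrieval definitions, which this sketch assumes, they can be computed per site as below. The function name and the sample field values are hypothetical.

```python
from typing import Set, Tuple


def recall_precision(extracted: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Recall = correct / expected fields; precision = correct / extracted fields."""
    correct = len(extracted & expected)
    recall = correct / len(expected) if expected else 1.0
    precision = correct / len(extracted) if extracted else 1.0
    return recall, precision


# Hypothetical site: 4 fields should be found; the wrapper returns 4, 3 correct.
expected = {"title-1", "author-1", "title-2", "author-2"}
extracted = {"title-1", "author-1", "title-2", "banner-text"}
print(recall_precision(extracted, expected))
```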

41
Wrapper Generation Time
Result Analysis

Wrapper generation time: except for two examples,
all the others need less than 5 seconds
Labeling time is not included in the wrapper
generation time; labels are input with the help
of our GUI tool. Labeling time depends on how
many items are selected as training examples and
how many fields one item contains. Labeling time
for every example is less than 10 minutes, except
for the World Factbook example page

42
Extraction Time
Result Analysis

Data extraction time: 70% of the sites take less
than 10 seconds
Extraction time is related to the HTML file size,
and HTML files are usually not very large.
The wrapper generation time and the labeling time
are acceptable
The data extraction time is not too long and is
tolerable for real-time web applications

43
Comparison With Other Approaches
Result Analysis

We generate wrappers automatically and provide a
friendly GUI tool that helps users input labels
and extract data
Simple and powerful rules that can deal with
missing and misordered items in web pages

44
Comparison With Other Approaches (Cont)
Result Analysis

We need fewer training examples because:
When an HTML file has no missing or misordered
items, we need no more examples than other
methods.
When there are missing or misordered items, the
training examples need not cover every case of
missing or misordered items in the web pages.
Quick wrapper generation and the assignment of
weights to tokens make maintenance and updates
easier

45
Conclusion
Conclusion

Use weighted token lists to find and extract data
from HTML files.
A friendly GUI tool to generate wrappers easily
Acceptable results

46
Reference

[1] S. Abiteboul. Querying Semistructured Data.
In Proceedings of the International Conference on
Database Theory (ICDT), January 1997.
[2] S. Abiteboul, D. Quass, J. McHugh, J. Widom,
and J. Wiener. The Lorel Query Language for
Semistructured Data. Journal of Digital
Libraries, November 1996, pp. 68-88.
[3] Naveen Ashish, Craig A. Knoblock.
Semi-Automatic Wrapper Generation for Internet
Information Sources. CoopIS 1997, pp. 160-169.
[4] Naveen Ashish and Craig Knoblock. Wrapper
Generation for Semi-Structured Internet Sources.
SIGMOD Record 26(4), pp. 8-15, 1997.
[5] S. Chawathe, H. Garcia-Molina, J. Hammer, K.
Ireland, Y. Papakonstantinou, J. Ullman, and J.
Widom. The TSIMMIS Project: Integration of
Heterogeneous Information Sources. Proceedings of
Tenth Anniversary Meeting of Information
Processing Society of Japan, Tokyo, Japan, 1994,
pp. 7-18.
[6] J. Hammer, H. Garcia-Molina, J. Cho, R.
Aranha, A. Crespo. Extracting Semistructured
Information from the Web. In Proceedings of the
Workshop on Management of Semistructured Data.
Tucson, Arizona, May 1997.
[7] Chun-nan Hsu et al. Finite-State Transducers
for Semi-structured Data Extraction From the Web.
Information Systems 23(8), pp. 521-538, 1998.
[8] Nicholas Kushmerick, Daniel S. Weld, Robert
Doorenbos. Wrapper Induction for Information
Extraction. International Joint Conference on
Artificial Intelligence, pp. 729-737, 1997.
[9] Ion Muslea, Steve Minton, Craig Knoblock.
Hierarchical Wrapper Induction for Semistructured
Information Sources. Journal of Autonomous Agents
and Multi-Agent Systems 4, pp. 93-114, 2001.
[10] Arnaud Sahuguet, Fabien Azavant. WysiWyg Web
Wrapper Factory (W4F). Unpublished, 1999.
http://db.cis.upenn.edu/Research/w4f.html
[11] W3C. HTML 4.01 specification,
http://www.w3.org/TR/html4/
[12] W3C. XML 1.0,
http://www.w3.org/TR/1998/REC-xml-19980210
