Structured Data Extraction From Web Based on Partial Tree Alignment presentation

1
Structured Data Extraction From Web Based on
Partial Tree Alignment

by
Yanhong Zhai and Bing Liu

2
Introduction

A large amount of information on the Web is
contained in regularly structured data objects,
which are data records retrieved from databases.
Such Web data records are important because
they often present the essential information of
their host pages, e.g., lists of products and
services.
Applications: integrated and value-added
services, e.g., comparative shopping, meta-search
queries, etc.

3
Example (a)
4
Years   Persons   ()    Persons    ()    Persons   ()
40-49   51,000    0.1   80,000     0.2   131,000   0.3
50-59   45,000    0.1   102,000    0.3   147,000   0.4
60-69   59,000    0.3   178,000    0.9   235,000   1.2
70-79   134,000   0.8   471,000    3.0   605,000   3.8
>80     648,000   7.0   1,532,00   16    905,00    23
5
Existing Methods

Wrapper programming languages
This approach provides specialized languages to
facilitate the construction of data extraction
programs.
Wrapper induction
This approach uses machine learning techniques
to learn data extraction rules from a set
of manually labeled examples.
Automatic extraction
This approach is based on the idea of automatic
pattern discovery.

6
Proposed Method

DEPTA (Data Extraction based on Partial Tree
Alignment)
This method consists of two steps:
1) Identifying individual records in a page.
2) Aligning and extracting data items from
the identified records.

7
Architecture of DEPTA System
Input: a Web page
-> DOM Tree Builder
-> Data Region Identifier
-> Data Records Identifier
-> Data Items Extractor
-> Output: data tables

8
DATA RECORD IDENTIFICATION

MDR (Mining Data Records)
Given a single page with multiple data records,
MDR extracts the data records, but not the data
items (step 1).
MDR is based on two observations about data
records in a Web page and on a tree matching
algorithm.
It handles both contiguous and non-contiguous
records.

9
Two Observations

A group of data records is presented in a
contiguous region (a data region) of a page, and
the records are formatted using similar HTML tags.
A set of similar data records is formed by some
child subtrees of the same parent node.

10
DOM tree of the previous page

[Figure: DOM tree rooted at TABLE > TBODY, with TR children that contain TD cells; two subtrees are labeled Data record 1 and Data record 2.]
11
The approach

Given a page, the approach has three steps:
Building the DOM tree based on visual information
Mining data regions
Identifying data records
Rendering (or visual) information is very
useful in the whole process.

12
Building DOM Trees Based on Visual Information

HTML source:
<table>
<tr>
<td>data1</td>
<td>data2</td>
<tr>
<td>data3</td>
<td>data4</td>
</tr>
</table>

Bounding boxes of the rendered elements:
Left   Right   Top   Bottom
100    300     200   400
100    300     200   300
100    300     200   400
200    300     200   300
100    300     300   400
100    200     300   400
200    300     200   400
[Figure: DOM tree built from the bounding boxes, with table as the root and tr nodes below it.]
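
One way to picture the idea is to nest elements purely by rectangle containment, ignoring closing tags. The sketch below is only an illustration under assumptions: the Box class, the build_tree and outline helpers, and the coordinates are made up for this example (a clean 2x2 table layout) and are not taken from the DEPTA implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A rendered element: tag name plus its rectangle on the page."""
    tag: str
    left: int
    right: int
    top: int
    bottom: int
    children: list = field(default_factory=list)

    def contains(self, other):
        # True when other's rectangle lies inside this rectangle.
        return (self.left <= other.left and other.right <= self.right
                and self.top <= other.top and other.bottom <= self.bottom)


def build_tree(boxes):
    """Nest boxes (given in document order) by visual containment."""
    root = boxes[0]
    stack = [root]
    for box in boxes[1:]:
        # Climb up until an open ancestor's rectangle encloses this box.
        while len(stack) > 1 and not stack[-1].contains(box):
            stack.pop()
        stack[-1].children.append(box)
        stack.append(box)
    return root


def outline(node):
    """Render the tree as nested (tag, children) tuples."""
    return (node.tag, [outline(child) for child in node.children])


# Hypothetical coordinates: (tag, left, right, top, bottom).
boxes = [
    Box("table", 100, 300, 200, 400),
    Box("tr", 100, 300, 200, 300),
    Box("td", 100, 200, 200, 300),
    Box("td", 200, 300, 200, 300),
    Box("tr", 100, 300, 300, 400),
    Box("td", 100, 200, 300, 400),
    Box("td", 200, 300, 300, 400),
]
tree = build_tree(boxes)
```

Because the second tr is not visually inside the first, it becomes a sibling of the first tr even if the closing tag of the first row is missing in the source.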
13
Enhanced Simple Tree Matching
[Figure: trees T1 and T2, each rooted at p with child nodes such as a, b, c and g over the data items data1 to data4, shown in two panels (a) and (b): one wrong alignment and one correct alignment.]
Alignment using tags only can produce wrong
alignments.
Two trees can have more than one possible match.
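
For orientation, here is a minimal sketch of the classic Simple Tree Matching dynamic program that compares trees by tags only. It assumes trees are plain (tag, children) tuples, and the function and helper names are illustrative; the enhancement discussed on this slide (using data content in addition to tags) is not implemented here.

```python
def simple_tree_matching(a, b):
    """Return the number of node pairs in a maximum matching of trees a and b.

    Each tree is a (tag, children) tuple, where children is a list of trees.
    Only tags are compared, and child order must be preserved.
    """
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0  # different roots: nothing below them can match
    m, n = len(kids_a), len(kids_b)
    # best[i][j] = best score using the first i children of a
    # and the first j children of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(kids_a[i - 1], kids_b[j - 1])
            best[i][j] = max(best[i - 1][j],
                             best[i][j - 1],
                             best[i - 1][j - 1] + pair)
    return best[m][n] + 1  # +1 for the matched roots


def leaf(tag):
    """Helper for building a childless node."""
    return (tag, [])


t1 = ("p", [leaf("a"), leaf("a"), leaf("b")])
t2 = ("p", [leaf("a"), leaf("b")])
score = simple_tree_matching(t1, t2)
```

The score for t1 and t2 is the same whichever of the two a nodes in t1 is paired with the a node in t2, which is exactly the ambiguity that tag-only alignment cannot resolve.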
14
Mining Data Regions

Find every data region with similar data records.
Definition: A generalized node (or a node
combination) of length r consists of r (r ≥ 1)
nodes in the HTML tag tree with the following two
properties:
1. The nodes all have the same parent.
2. The nodes are adjacent.
Definition: A data region is a collection of two
or more generalized nodes with the following
properties:
1. The generalized nodes all have the same
parent.
2. The generalized nodes are all adjacent.
3. Adjacent generalized nodes are similar.

15
Determining Data Regions

To find each data region, the algorithm needs to
answer the following questions:
1. Where does the first generalized node of the
data region start?
Try starting from each child node under a parent.
2. How many tag nodes or components does a
generalized node have?
Try one-node, two-node, ..., K-node combinations.
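
The search over start positions and combination lengths can be sketched as follows. This is a simplified illustration, not MDR itself: each child is reduced to a single tag string, similarity is measured with difflib's SequenceMatcher and an arbitrary 0.7 threshold, and the names similar, find_data_regions and max_len (standing in for K) are assumptions of this example.

```python
from difflib import SequenceMatcher


def similar(nodes_a, nodes_b, threshold=0.7):
    """Compare two generalized nodes, each a list of tag strings."""
    return SequenceMatcher(None, nodes_a, nodes_b).ratio() >= threshold


def find_data_regions(children, max_len=3, threshold=0.7):
    """Scan the children of one parent for data regions.

    Returns (start, r, count) tuples: the region starts at index start
    and consists of count adjacent generalized nodes of length r.
    """
    regions = []
    n = len(children)
    start = 0
    while start < n:
        best = None  # (children covered, r, count)
        for r in range(1, max_len + 1):  # one-node ... K-node combinations
            count, pos = 1, start
            # Extend the run while the next generalized node is similar.
            while pos + 2 * r <= n and similar(children[pos:pos + r],
                                               children[pos + r:pos + 2 * r],
                                               threshold):
                count += 1
                pos += r
            if count >= 2 and (best is None or count * r > best[0]):
                best = (count * r, r, count)
        if best is None:
            start += 1  # no region starts here; try the next child
        else:
            regions.append((start, best[1], best[2]))
            start += best[0]  # continue after the region
    return regions


rows = find_data_regions(["h1", "tr", "tr", "tr", "p"])
pairs = find_data_regions(["name", "desc", "name", "desc"])
```

In the first call the three adjacent tr children form one region of one-node combinations; in the second, no single child repeats, but the two-node combination (name, desc) does.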

16
An illustration of generalized nodes and data
regions
[Figure: tag tree with nodes numbered 1 to 19; shaded nodes are generalized nodes, grouped into three data regions: Region 1, Region 2 and Region 3.]
17
Identifying Data Records

A generalized node may not be a data record.
Extra mechanisms are needed to identify true
atomic objects.
Some highlights: contiguous and non-contiguous
data records.

Contiguous data records:
Name1   Description of object 1
Name2   Description of object 2
Name3   Description of object 3
Name4   Description of object 4

Non-contiguous data records:
Name1                     Name2
Description of object 1   Description of object 2
Name3                     Name4
Description of object 3   Description of object 4
18
DEPTA: Extracting Data from Data Records

Once a list of data records is identified, we
can align and extract the data items in them.
Multiple tree alignment
We need multiple alignment because we have
multiple data records.
Most multiple alignment methods work like
hierarchical clustering and require n^2 pairwise
matchings.
Too expensive
Optimal alignment/matching is exponential.
A partial tree matching algorithm is proposed in
DEPTA to perform multiple tree alignment.

19
The Partial Tree Alignment Approach

Choose a seed tree: a seed tree, denoted by Ts,
is picked with the maximum number of data items.
Tree matching:
For each unmatched tree Ti (i ≠ s),
match Ts and Ti.
Each pair of matched nodes is linked (aligned).
For each unmatched node nj in Ti,
expand Ts by inserting nj into Ts if a position
for insertion can be uniquely determined in Ts.
The expanded seed tree Ts is then used in
subsequent matching.
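
The loop above can be sketched on a simplified model in which every record is flattened to an ordered list of child labels. The sketch makes several assumptions that are not from the slides: labels are matched greedily from left to right instead of with tree matching, a position counts as uniquely determined only when the neighbouring matched labels are adjacent in the seed (or the gap touches the start or end of the seed), and the names try_merge and partial_alignment are illustrative.

```python
def try_merge(seed, record):
    """Align record against seed and insert labels whose position is unique.

    Returns (new_seed, complete); complete is False when at least one
    unmatched label could not be placed yet.
    """
    # Greedy left-to-right matching of equal labels.
    matched, cursor = [], 0
    for label in record:
        if label in seed[cursor:]:
            idx = seed.index(label, cursor)
            matched.append(idx)
            cursor = idx + 1
        else:
            matched.append(None)

    insertions, complete = [], True
    i = 0
    while i < len(record):
        if matched[i] is not None:
            i += 1
            continue
        j = i
        while j < len(record) and matched[j] is None:
            j += 1  # record[i:j] is a run of unmatched labels
        left = matched[i - 1] if i > 0 else -1
        right = matched[j] if j < len(record) else len(seed)
        if right == left + 1:
            insertions.append((right, record[i:j]))  # gap is unambiguous
        else:
            complete = False  # other seed labels sit in between
        i = j

    new_seed = list(seed)
    for pos, labels in reversed(insertions):  # right to left keeps indexes valid
        new_seed[pos:pos] = labels
    return new_seed, complete


def partial_alignment(records):
    """Grow a seed from the largest record; retry records that were deferred."""
    seed_index = max(range(len(records)), key=lambda k: len(records[k]))
    seed = list(records[seed_index])
    pending = [r for k, r in enumerate(records) if k != seed_index]
    while pending:
        deferred, grew = [], False
        for record in pending:
            new_seed, complete = try_merge(seed, record)
            grew = grew or new_seed != seed
            seed = new_seed
            if not complete:
                deferred.append(record)
        if not grew:
            break  # nothing changed, so the leftovers cannot be placed
        pending = deferred
    return seed


aligned = partial_alignment([
    ["x", "b", "d"],
    ["b", "n", "c", "k", "g"],
    ["b", "c", "d", "h", "k"],
])
```

Deferring a record and retrying it after the seed has grown is what lets a label such as d find a unique position later, once other records have filled in its neighbours.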

20
Illustration of partial tree alignment
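[Figure: two cases.
Insertion is possible: Ts = p(a, b, e) and Ti = p(b, c, d, e). Nodes c and d fall between b and e, which are adjacent in Ts, so they are inserted; the new Ts is p(a, b, c, d, e).
Insertion is not possible: Ts = p(a, b, e) and Ti = p(a, x, e). Node x falls between a and e, but its position relative to b cannot be determined.]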
21
A complete example
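Ts = T1: p(.., x, b, d)
T2: p(b, n, c, k, g)
T3: p(b, c, d, h, k)

Ts matched with T2: no node inserted; Ts stays p(.., x, b, d).
Ts matched with T3: c, h and k inserted; the new Ts is p(.., x, b, c, d, h, k).
T2 is matched again: n and g inserted; the final Ts is p(.., x, b, n, c, d, h, k, g).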
22
Output data table

    ..  x   b   n   c   d   h   k   g
T1  ..  1   1           1
T2          1   1   1           1   1
T3          1       1   1   1   1

The final tree may also be used to match and
extract data from other
similar pages
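
Once the final seed fixes the column order, filling the table is a simple lookup. The sketch below assumes each record has already been reduced to a mapping from aligned label to extracted value; build_table and the sample values are hypothetical.

```python
def build_table(columns, records):
    """One row per record; a cell is the record's value for that column, or None."""
    return [[record.get(column) for column in columns] for record in records]


# Column order taken from the final seed tree; cell values are made up.
columns = ["x", "b", "n", "c", "d", "h", "k", "g"]
records = [
    {"x": "x1", "b": "b1", "d": "d1"},
    {"b": "b2", "n": "n2", "c": "c2", "k": "k2", "g": "g2"},
    {"b": "b3", "c": "c3", "d": "d3", "h": "h3", "k": "k3"},
]
table = build_table(columns, records)
```

Empty cells (None) mark data items that a record does not have, which corresponds to the blank positions in the table above.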

23
Conclusion

Existing techniques are either inaccurate or make
several assumptions.
Our method does not make these assumptions.
Our technique consists of two steps:
1) Identifying data records.
2) Aligning corresponding data items from multiple
data records.
Step 1 is based on visual cues.
Step 2 is based on partial tree alignment.
