Title: Structured Data Extraction From Web Based on Partial Tree Alignment
1Structured Data Extraction From Web Based on Partial Tree Alignment
- by Yanhong Zhai and Bing Liu
2Introduction
- A large amount of information on the Web is contained in regularly structured data objects, which are data records retrieved from databases.
- Such Web data records are important because
- they often present the essential information of their host pages, e.g., lists of products and services.
- Applications: integrated and value-added services,
- e.g., comparative shopping, meta-search, etc.
3Example (a)
4
Years   Persons    ()    Persons     ()    Persons    ()
40-49    51,000   0.1     80,000    0.2    131,000   0.3
50-59    45,000   0.1    102,000    0.3    147,000   0.4
60-69    59,000   0.3    178,000    0.9    235,000   1.2
70-79   134,000   0.8    471,000    3.0    605,000   3.8
>80     648,000   7.0   1,532,00     16     905,00    23
5Existing Methods
- Wrapper programming languages
- This approach provides some languages to facilitate the construction of data extraction programs.
- Wrapper induction
- This approach uses machine learning techniques to learn data extraction rules from a set of manually labeled examples.
- Automatic extraction
- This approach is based on the idea of automatic pattern discovery.
6Proposed Method
- DEPTA (Data Extraction based on Partial Tree Alignment)
- This method consists of two steps:
- 1) Identifying individual records in a page.
- 2) Aligning and extracting data items from the identified records.
7Architecture of DEPTA System
Input a web page -> DOM Tree Builder -> Data Region Identifier -> Data Records Identifier -> Data Items Extractor -> Output data tables
8DATA RECORD IDENTIFICATION
- MDR (Mining Data Records)
- Given a single page with multiple data records, MDR extracts data records, but not data items (step 1).
- MDR is based on
- two observations about data records in a Web page and
- a tree matching algorithm.
- It considers both
- contiguous and
- non-contiguous records.
9Two Observations
- A group of data records is presented in a contiguous region (a data region) of a page and is formatted using similar HTML tags.
- A set of similar data records is formed by some child subtrees of the same parent node.
10DOM tree of the previous page
[Figure: DOM tree with a TABLE root, a TBODY node, TR nodes and their TD children; two parts of the tree are labeled Data record1 and Data record2.]
11The approach
- Given a page,
- build the DOM tree based on visual information,
- mine data regions,
- identify data records.
- Rendering (or visual) information is very useful in the whole process.
12Building DOM Trees Based on Visual Information
<table>
<tr>
<td>data1</td>
<td>data2</td>
<tr>
<td>data3</td>
<td>data4</td>
</tr>
</table>
[Figure: boundary coordinates (left, right, top, bottom) of each rendered element, e.g. 100 300 200 400 for the table, and the tag tree built from them.]
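A minimal Python sketch of building a tree from visual information, assuming each rendered element is known only by its tag and its bounding box (left, right, top, bottom), listed in document order. The Box class, the build_tree helper and the coordinate values are illustrative assumptions, not DEPTA's actual implementation. An element becomes a child of the nearest preceding element whose rectangle contains it, so the nesting can be recovered even when a closing tag is missing.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A rendered element: its tag and its bounding rectangle."""
    tag: str
    left: int
    right: int
    top: int
    bottom: int
    children: list = field(default_factory=list)

    def contains(self, other: "Box") -> bool:
        # True when other's rectangle lies inside (or equals) this one.
        return (self.left <= other.left and other.right <= self.right
                and self.top <= other.top and other.bottom <= self.bottom)


def build_tree(boxes):
    """Nest boxes (given in document order) by visual containment."""
    root = boxes[0]
    open_path = [root]  # chain of ancestors that may still receive children
    for box in boxes[1:]:
        # Close every ancestor whose rectangle does not contain this box.
        while len(open_path) > 1 and not open_path[-1].contains(box):
            open_path.pop()
        open_path[-1].children.append(box)
        open_path.append(box)
    return root


# Hypothetical coordinates for a table with two rows of two cells each.
boxes = [
    Box("table", 100, 300, 200, 400),
    Box("tr", 100, 300, 200, 300),
    Box("td", 100, 200, 200, 300),
    Box("td", 200, 300, 200, 300),
    Box("tr", 100, 300, 300, 400),
    Box("td", 100, 200, 300, 400),
    Box("td", 200, 300, 300, 400),
]
root = build_tree(boxes)
row_tags = [child.tag for child in root.children]
cell_tags = [[cell.tag for cell in row.children] for row in root.children]
print(row_tags, cell_tags)
```

The second tr is attached to the table rather than to the first tr because its rectangle is not contained in the first row's rectangle.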
13Enhanced Simple Tree Matching
[Figure, panels (a) and (b): two trees T1 and T2 with more than one possible match. Alignment using tags only can produce wrong alignments; the figure contrasts a wrong alignment and a correct alignment of the data items data1 to data4.]
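For orientation, here is a hedged sketch of plain Simple Tree Matching, the dynamic-programming algorithm that the slide title builds on. It returns the maximum number of matched node pairs between two ordered labeled trees. The enhancement shown on the slide (going beyond tag-only matching) is not reproduced here, and the tuple representation (label, children) is an assumption made for this example.

```python
def simple_tree_matching(a, b):
    """Return the number of node pairs in a maximum matching of trees a and b.

    Each tree is a tuple (label, children), children being a list of trees.
    """
    label_a, children_a = a
    label_b, children_b = b
    if label_a != label_b:
        return 0  # roots differ, so nothing below them can be matched
    m, n = len(children_a), len(children_b)
    # best[i][j]: best matching between the first i children of a
    # and the first j children of b, preserving their order.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(children_a[i - 1], children_b[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1],
                             best[i - 1][j - 1] + pair)
    return best[m][n] + 1  # +1 for the matched roots


# Hypothetical trees: the roots match, and so do the children b and c.
t1 = ("p", [("a", []), ("b", []), ("c", [])])
t2 = ("p", [("b", []), ("c", []), ("d", [])])
score = simple_tree_matching(t1, t2)
print(score)
```

Because only labels are compared, two subtrees with identical tags but different content score the same, which is how tag-only alignment can go wrong.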
14Mining Data Regions
- Find every data region with similar data records.
- Definition: A generalized node (or a node combination) of length r consists of r (r ≥ 1) nodes in the HTML tag tree with the following two properties:
- 1. the nodes all have the same parent, and
- 2. the nodes are adjacent.
- Definition: A data region is a collection of two or more generalized nodes with the following properties:
- 1. The generalized nodes all have the same parent.
- 2. The generalized nodes are all adjacent.
- 3. Adjacent generalized nodes are similar.
15Determining Data Regions
- To find each data region, the algorithm needs to find the following:
- 1. Where does the first generalized node of the data region start?
- Try to start from each child node under a parent.
- 2. How many tag nodes or components does a generalized node have?
- Try one-node, two-node, ..., K-node combinations.
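The two questions above can be sketched as a search over start positions and combination lengths. This is a simplified greedy scan written for illustration, not MDR's actual procedure: children are represented by tag names only, similarity is measured with Python's difflib in place of MDR's comparison, and the threshold 0.7 and max_len 3 are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def similar(node_a, node_b, threshold=0.7):
    """Compare two generalized nodes given as lists of tag names."""
    return SequenceMatcher(None, node_a, node_b).ratio() >= threshold


def find_data_regions(children, max_len=3, threshold=0.7):
    """Return (start, r, count) triples: count adjacent, similar
    generalized nodes of length r beginning at child index start."""
    regions = []
    n = len(children)
    pos = 0
    while pos < n:
        best = None  # (covered children, r, count)
        for r in range(1, max_len + 1):
            count = 1
            # Extend while the next r children resemble the previous r.
            while pos + (count + 1) * r <= n:
                prev = children[pos + (count - 1) * r: pos + count * r]
                nxt = children[pos + count * r: pos + (count + 1) * r]
                if not similar(prev, nxt, threshold):
                    break
                count += 1
            # A data region needs two or more generalized nodes.
            if count >= 2 and (best is None or r * count > best[0]):
                best = (r * count, r, count)
        if best:
            regions.append((pos, best[1], best[2]))
            pos += best[0]
        else:
            pos += 1
    return regions


# Hypothetical children of one parent node.
children = ["h1", "tr", "tr", "tr", "p", "dt", "dd", "dt", "dd"]
regions = find_data_regions(children)
print(regions)
```

Here the three tr children form a region of one-node combinations, and the dt/dd pairs form a region of two-node combinations.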
16An illustration of generalized nodes and data regions
[Figure: a tag tree with nodes numbered 1 to 19. Shaded nodes are generalized nodes; they are grouped into three data regions: Region 1, Region 2 and Region 3.]
17Identifying Data Records
- A generalized node may not be a data record.
- Extra mechanisms are needed to identify true atomic objects.
- Some highlights:
- contiguous data records
- non-contiguous data records
Name1   Description of object 1
Name2   Description of object 2
Name3   Description of object 3
Name4   Description of object 4

Name1                     Name2
Description of object 1   Description of object 2
Name3                     Name4
Description of object 3   Description of object 4
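A small sketch of the non-contiguous case, assuming each generalized node is a row of names followed by a row of descriptions, so that one record is spread over two rows. The split_records helper and the list-of-rows representation are hypothetical and only illustrate why a generalized node is not always a single data record.

```python
def split_records(generalized_node):
    """Pair the i-th cell of every row to form the i-th data record."""
    return [tuple(column) for column in zip(*generalized_node)]


# Each generalized node: one row of names, one row of descriptions.
region = [
    [["Name1", "Name2"],
     ["Description of object 1", "Description of object 2"]],
    [["Name3", "Name4"],
     ["Description of object 3", "Description of object 4"]],
]
records = [record for node in region for record in split_records(node)]
print(records)
```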
18DEPTA Extract Data from Data Records
- Once a list of data records is identified, we can align and extract the data items in them.
- Multiple tree alignment
- We need multiple alignment as we have multiple data records.
- Most multiple alignment methods work like hierarchical clustering and require n^2 pairwise matchings.
- Too expensive.
- Optimal alignment/matching is exponential.
- A partial tree matching algorithm is proposed in DEPTA to perform multiple tree alignment.
19The partial Tree Alignment Approach
- Choose a seed tree: a seed tree, denoted by Ts, is picked with the maximum number of data items.
- Tree matching:
- For each unmatched tree Ti (i ≠ s),
- match Ts and Ti.
- Each pair of matched nodes is linked (aligned).
- For each unmatched node nj in Ti,
- expand Ts by inserting nj into Ts if a position for insertion can be uniquely determined in Ts.
- The expanded seed tree Ts is then used in subsequent matching.
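The steps above can be sketched in Python under strong simplifying assumptions: each record is flattened to the ordered list of child labels under its root, and matching uses a longest common subsequence instead of tree matching. The function names and the example trees are illustrative, not DEPTA's code. An unmatched run of nodes is inserted only when its matched neighbours are adjacent in the seed (or the run touches the seed's boundary), and trees with nodes left over are retried after the seed has grown.

```python
def lcs_pairs(seed, tree):
    """Index pairs (i, j) of one longest common subsequence alignment."""
    m, n = len(seed), len(tree)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if seed[i] == tree[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if seed[i] == tree[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def expand_seed(seed, tree):
    """Insert unmatched nodes of tree where the position is unique.

    Returns (new_seed, all_placed)."""
    tree_to_seed = {j: i for i, j in lcs_pairs(seed, tree)}
    inserts, all_placed, j = [], True, 0
    while j < len(tree):
        if j in tree_to_seed:
            j += 1
            continue
        start = j
        while j < len(tree) and j not in tree_to_seed:
            j += 1
        left = tree_to_seed[start - 1] if start > 0 else -1
        right = tree_to_seed[j] if j < len(tree) else len(seed)
        if right - left == 1:  # neighbours adjacent: position is unique
            inserts.append((right, tree[start:j]))
        else:
            all_placed = False
    new_seed = list(seed)
    for pos, run in reversed(inserts):  # right to left keeps indices valid
        new_seed[pos:pos] = run
    return new_seed, all_placed


def partial_tree_alignment(trees):
    """Grow a seed (the longest tree) until no more nodes can be inserted."""
    s = max(range(len(trees)), key=lambda k: len(trees[k]))
    seed = list(trees[s])
    pending = [t for k, t in enumerate(trees) if k != s]
    progress = True
    while pending and progress:
        progress, retry = False, []
        for tree in pending:
            new_seed, all_placed = expand_seed(seed, tree)
            progress = progress or len(new_seed) > len(seed)
            seed = new_seed
            if not all_placed:
                retry.append(tree)  # match this tree again later
        pending = retry
    return seed


possible = expand_seed(["a", "b", "e"], ["b", "c", "d", "e"])
not_possible = expand_seed(["a", "b", "e"], ["a", "x", "e"])
aligned = partial_tree_alignment([["x", "b", "d"],
                                  ["b", "n", "c", "k", "g"],
                                  ["b", "c", "d", "h", "k"]])
print(possible, not_possible, aligned)
```

In the second call, x cannot be placed because a and e are not adjacent in the seed: x could go either before or after b.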
20Illustration of partial tree alignment
[Figure: two cases. Insertion is possible: Ts = p(a, b, e) and Ti = p(b, c, d, e); c and d form the new part of Ts, giving Ts = p(a, b, c, d, e). Insertion is not possible: Ts = p(a, b, e) and Ti = p(a, x, e).]
21A complete example
[Figure: trees T1, T2 and T3 aligned against the seed tree Ts. First match: no node inserted. Next match: c, h and k inserted into Ts. T2 is then matched again, giving a final tree with nodes x, b, n, c, d, h, k and g.]
22Output data table
      ...   x   b   n   c   d   h   k   g
T1    ...   1   1           1
T2              1   1   1           1   1
T3              1       1   1   1   1
- The final tree may also be used to match and extract data from other similar pages.
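A short sketch of how such a table could be filled once the final tree is known, assuming node labels are unique within a record so that membership is enough to place a mark. The build_table helper and the record contents are illustrative assumptions; a real extractor would store the data item under each column instead of a 1.

```python
def build_table(columns, records):
    """One row per record: 1 where the record has a node for the column."""
    return {name: [1 if col in nodes else 0 for col in columns]
            for name, nodes in records.items()}


columns = ["x", "b", "n", "c", "d", "h", "k", "g"]  # nodes of the final tree
records = {
    "T1": ["x", "b", "d"],
    "T2": ["b", "n", "c", "k", "g"],
    "T3": ["b", "c", "d", "h", "k"],
}
table = build_table(columns, records)
for name, row in table.items():
    print(name, row)
```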
23Conclusion
- Existing techniques are either inaccurate or make several assumptions.
- Our method does not make these assumptions.
- Our technique consists of two steps:
- Identifying data records
- Aligning corresponding data items from multiple data records.
- Step 1 is based on visual cues.
- Step 2 is based on partial tree alignment.