Title: Schema-Guided Wrapper
1. Schema-Guided Wrapper Maintenance (SG-WRAM) for
Web-Data Extraction
2. Why Wrapper Maintenance?
Web pages are dynamic and continually evolving, so
a previously constructed wrapper may stop
working. In this paper, a novel approach to
wrapper maintenance, SG-WRAM, is introduced. Its
strategy is based on the observation that some
features of the extracted data items, such as data
patterns (syntactic patterns), hyperlinks, and
annotations, are often preserved when a page
changes, while the schema for the extracted data
does not change.
3. Overview: Schema-Guided Wrapper Generation
A new wrapper is generated by mapping between the
HTML tree and the user-defined schema tree.
Internally, the system parses the HTML page
into an HTML tree, computes the corresponding
internal mappings from the HTML tree to the
schema tree, and outputs an extraction rule
(wrapper) as an XQuery expression.
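
The parsing step can be illustrated with a minimal sketch, assuming well-nested HTML without void elements such as `<br>`. The names `Node`, `TreeBuilder`, `parse_html`, `find_by_text`, and `path_to` are hypothetical, and the slash-separated tag path only stands in for the XQuery expression that the system itself produces.

```python
from html.parser import HTMLParser


class Node:
    """One element of the parsed HTML tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; assumes every start tag has an end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse_html(html):
    """Parse an HTML string and return the root of the tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_by_text(node, text):
    """Depth-first search for the first node whose own text equals `text`."""
    if node.text == text:
        return node
    for child in node.children:
        hit = find_by_text(child, text)
        if hit is not None:
            return hit
    return None


def path_to(node):
    """Return the slash-separated tag path from the root down to `node`."""
    parts = []
    while node.parent is not None:
        parts.append(node.tag)
        node = node.parent
    return "/" + "/".join(reversed(parts))


page = "<html><body><table><tr><td>Title</td><td>9.99</td></tr></table></body></html>"
root = parse_html(page)
price_node = find_by_text(root, "9.99")
price_path = path_to(price_node)
```

Here a user would point at the value `9.99` for a schema element, and the tag path recorded for it is the kind of information a mapping from the HTML tree to the schema tree has to capture.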
4. Maintaining Extraction Rules
Four steps:
- Data-feature discovery
- Data-item recovery
- Block configuration
- Rule re-induction
5. Step 1: Data-Feature Discovery
Data features are computed from the given DTD,
the previous extraction rule, and the previous
extraction results.
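
As a rough illustration of feature discovery, the sketch below derives a syntactic pattern from previously extracted values and stores it with the other two features. The record type `ItemFeatures`, the functions `generalize` and `discover_features`, and the character-class generalization itself are assumptions made for this example; the slides do not specify how SG-WRAM computes data patterns.

```python
import re
from dataclasses import dataclass

# Tokenizer for the generalization: digit runs, letter runs, whitespace, other.
TOKEN = re.compile(
    r"(?P<num>[0-9]+)|(?P<word>[A-Za-z]+)|(?P<space>\s+)|(?P<other>.)",
    re.DOTALL,
)
CLASS_FOR = {"num": "[0-9]+", "word": "[A-Za-z]+", "space": r"\s+"}


@dataclass
class ItemFeatures:
    name: str           # schema (DTD) element name
    pattern: str        # regular expression generalizing the old values
    is_hyperlink: bool  # whether the old values were hyperlinks
    annotation: str     # label text that accompanied the value, "" if none


def generalize(value):
    """Turn one value into a coarse regex, keeping punctuation literal."""
    parts = []
    for match in TOKEN.finditer(value):
        kind = match.lastgroup
        parts.append(CLASS_FOR.get(kind) or re.escape(match.group()))
    return "".join(parts)


def discover_features(name, old_values, was_hyperlink, annotation):
    """Combine the patterns of all old values into one alternation."""
    alternatives = sorted({generalize(v) for v in old_values})
    return ItemFeatures(name, "|".join(alternatives), was_hyperlink, annotation)


price = discover_features("price", ["$12.99", "$5.00"], False, "Our price")
```

With these inputs, `re.fullmatch(price.pattern, "$101.50")` succeeds while `re.fullmatch(price.pattern, "free")` does not, which is the behavior the recovery step relies on.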
6. Step 2: Data-Item Recovery
Data features are used to recognize the relevant
data items in the new page.
We traverse the new HTML tree in depth-first
order and compare each node with the DTD
elements.
7. Step 2: Data-Item Recovery (cont.)
Case 1: If the leaf node n is an annotation of an
item, we try to find the corresponding value of
this item starting from this node. Otherwise, we
check the item table to see whether the node
satisfies the three features of an item row r. If
all three features are satisfied, the node is an
instance of the item r.
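
Case 1 can be sketched as follows, assuming a toy tree of `Element` and `Leaf` objects and an item table of `ItemRow` records (all hypothetical names). The sketch reads "starting from this node" as "the next leaf in depth-first order", and treats the annotation feature as satisfied only when that annotation was the leaf seen immediately before; both readings are assumptions, not details given in the slides.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Leaf:
    text: str
    is_link: bool = False


@dataclass
class Element:
    tag: str
    children: list = field(default_factory=list)


@dataclass
class ItemRow:
    name: str         # schema element name
    pattern: str      # data pattern as a regular expression
    is_link: bool     # hyperlink feature
    annotation: str   # annotation feature, "" if the item has none


def leaves(node):
    """Yield the leaves of the tree in depth-first, left-to-right order."""
    if isinstance(node, Leaf):
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def recover_items(root, item_table):
    """Return (item name, value) pairs recognized in the new tree."""
    found = []
    pending = None  # row whose annotation was seen on the previous leaf
    for leaf in leaves(root):
        text = leaf.text.strip()
        annotated = next(
            (r for r in item_table if r.annotation and r.annotation == text), None
        )
        if annotated is not None:
            pending = annotated  # the value is expected on the next leaf
            continue
        for row in item_table:
            if row.is_link != leaf.is_link:
                continue
            if re.fullmatch(row.pattern, text) is None:
                continue
            if row.annotation and pending is not row:
                continue
            found.append((row.name, text))
            break
        pending = None
    return found


table = [
    ItemRow("title", r".+", True, ""),
    ItemRow("price", r"[0-9]+\.[0-9]+", False, "Our price"),
]
new_page = Element("tr", [
    Element("td", [Leaf("Book A", is_link=True)]),
    Element("td", [Leaf("Our price")]),
    Element("td", [Leaf("12.99")]),
    Element("td", [Leaf("3.50")]),
])
recovered = recover_items(new_page, table)
```

The trailing `3.50` matches the price pattern but has no annotation in front of it, so it is not reported as a price.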
8. Step 2: Data-Item Recovery (cont.)
Case 2: If the annotation of an item has changed
in the new pages, for example from "Our price
1.00" to "Price 1.00", the user needs to decide
whether it should be treated as a different item.
9. Step 2: Data-Item Recovery (cont.)
Case 3: Noise, such as "click here for more
info", could be mistaken for a data item. Such
noise can be removed by computing its frequency
of occurrence.
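
One possible frequency filter is sketched below. It assumes that noise repeats verbatim across the page while genuine values mostly differ, and that a fixed repeat limit separates the two; the function name `drop_frequent_noise` and the parameter `max_repeats` are hypothetical, and the slides do not state which threshold SG-WRAM uses.

```python
from collections import Counter


def drop_frequent_noise(candidates, max_repeats=2):
    """Drop candidates whose text occurs more than `max_repeats` times.

    `candidates` is a list of (item name, text) pairs from data-item recovery.
    """
    counts = Counter(text for _, text in candidates)
    return [(name, text) for name, text in candidates if counts[text] <= max_repeats]


candidates = [
    ("title", "Book A"),
    ("title", "click here for more info"),
    ("title", "Book B"),
    ("title", "click here for more info"),
    ("title", "Book C"),
    ("title", "click here for more info"),
]
cleaned = drop_frequent_noise(candidates)
```

A fixed limit would also discard genuine values that happen to repeat often, so in practice the limit would need to be tuned to the page.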
10. Step 3: Block Configuration
We group the recognized data items according to
the user-defined schema and the HTML tree
structure. We compute a semantic block as an
instance of the given schema.
11. Step 3: Block Configuration (cont.)
1. Identify the level of block configuration, k.
2. Compare the HTML tree with the schema.
3. Obtain schema instances (semantic blocks).
12. Step 3: Block Configuration (cont.)
Three cases of match:
- Full match: a block contains all items of the
schema and satisfies the constraint of each item
in the schema.
- Over match: there is at least one item in the
schema that occurs at least twice in a block.
(k+1)
- Partial match: a block contains a proper subset
of items of the schema. (k-1)
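
The three cases can be expressed as a small classifier. This sketch assumes the simplest possible constraint, namely that every schema item occurs exactly once per block; real DTD constraints may be richer. The names `classify_block` and `adjust_level` are hypothetical, and reading the level annotations as "over match moves to level k+1, partial match moves to level k-1" is an interpretation of the slide.

```python
from collections import Counter


def classify_block(schema_items, block_items):
    """Classify a candidate block as 'full', 'over', or 'partial'."""
    counts = Counter(item for item in block_items if item in schema_items)
    if any(n >= 2 for n in counts.values()):
        return "over"      # some schema item occurs at least twice
    if set(counts) == set(schema_items):
        return "full"      # every schema item occurs exactly once
    return "partial"       # only a proper subset of the schema items occurs


def adjust_level(k, match):
    """Return the tree level to try next for the given match result."""
    if match == "over":
        return k + 1
    if match == "partial":
        return k - 1
    return k


schema = {"title", "author", "price"}
full = classify_block(schema, ["title", "author", "price"])
over = classify_block(schema, ["title", "price", "title", "price"])
partial = classify_block(schema, ["title"])
```

Checking the over-match condition first means that a block with duplicated items is treated as too coarse even when other items are missing from it.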
13. Step 4: Rule Re-induction
The computed instances (semantic blocks) are
used to re-induce the new extraction rule for
this changed page.
14. Step 4: Rule Re-induction (cont.)
15. Experiment Results (1)
16. Experiment Results (2)
17. Related Work
The two approaches from Kushmerick and Lerman
rely heavily on the syntactic features of data
items. They cannot detect the difference between
the following two tables.
The SG-WRAM algorithm can effectively find the
appropriate level in the HTML tree where the
correct semantic blocks reside.
18. Conclusion
- SG-WRAM is a new approach to wrapper maintenance.
- It detects the desired data based on the
preserved data features (annotation, hyperlink,
and data pattern).
- Semantic blocks conforming to the user-defined
schema are used for grouping data items to
recognize the underlying structure of Web pages.
19. Questions?