Title: Machine-learning based Semi-structured IE
1Machine-learning based Semi-structured IE
- Chia-Hui Chang
- Department of Computer Science and Information Engineering
- National Central University
- chia_at_csie.ncu.edu.tw
- 9/24/2002
2Wrapper Induction
- Wrapper
- An extraction program that pulls the desired information from Web pages.
- Semi-structured document -> wrapper -> structured information
- Web wrappers wrap...
- Query-able or Search-able Web sites
- Web pages with large itemized lists
- The primary issues are
- How to build the extractor quickly?
3Semi-structured IE
- Developed independently of traditional IE
- Driven by the need to extract and integrate data from multiple Web-based sources
4Machine-Learning Based Approach
- A key component of IE systems is
- a set of extraction patterns
- that can be generated by machine learning algorithms.
- Extractor
- Driver Architecture
- Rule Format
5Related Work
- Shopbot
- Doorenbos, Etzioni, Weld, AA-97
- Ariadne
- Ashish, Knoblock, CoopIS-97
- WIEN
- Kushmerick, Weld, Doorenbos, IJCAI-97
- SoftMealy wrapper representation
- Hsu, IJCAI-99
- STALKER
- Muslea, Minton, Knoblock, AA-99
- A hierarchical FST
6WIEN
- N. Kushmerick, D. S. Weld,
- R. Doorenbos,
- University of Washington, 1997
- http://www.cs.ucd.ie/staff/nick/
7Example 1
8Extractor for Example 1
9HLRT
10Wrapper Induction
- Induction
- The task of generalizing from labeled examples to a hypothesis
- Instances: pages
- Labels: (Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)
- Hypotheses
- E.g. (<P>, <HR>, <B>, </B>, <I>, </I>)
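A minimal sketch of how a six-delimiter hypothesis like the one above could be executed as an HLRT wrapper: a head delimiter, a tail delimiter, and one left/right pair per attribute. The page string is a hypothetical reconstruction in the style of the country-code example, and `exec_hlrt` is an illustrative name, not WIEN's actual code.

```python
def exec_hlrt(page, head, tail, delims):
    """Extract tuples from `page` with an HLRT wrapper.

    head/tail bound the region holding the tuples; delims is a list of
    (left, right) delimiter pairs, one pair per attribute.
    """
    start = page.find(head)
    if start < 0:
        return []
    pos = start + len(head)            # skip past the page head
    first_left = delims[0][0]
    rows = []
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no tuple starts before the tail delimiter.
        if next_left < 0 or 0 <= next_tail < next_left:
            break
        row = []
        for left, right in delims:
            begin = page.find(left, pos)
            if begin < 0:
                return rows
            begin += len(left)
            end = page.find(right, begin)
            if end < 0:
                return rows
            row.append(page[begin:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows


# Hypothetical page in the style of the country-code example.
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)

tuples = exec_hlrt(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
print(tuples)
```

The head delimiter keeps the bold page title from being read as a country, and the tail delimiter keeps the bold "End" footer out of the result; a plain left/right wrapper would pick up both.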
11BuildHLRT
succeeds
12Other Family
- OCLR (Open-Close-Left-Right)
- Use Open and Close as delimiters for each tuple
- HOCLRT
- Combine OCLR with Head and Tail
- N-LR and N-HLRT
- Nested LR
- Nested HLRT
13Terminology
- Oracles
- Page Oracle
- Label Oracle
- PAC analysis
- determines how many examples are necessary to build a wrapper, given two parameters: accuracy ε and confidence δ
- Pr[E(w) < ε] > 1 - δ, or Pr[E(w) > ε] < δ
14Probably Approximately Correct (PAC) Analysis
- With ε = 0.1, δ = 0.1, K = 4, and an average of 5 tuples/page, BuildHLRT must examine at least 72 examples
15Empirical Evaluation
- Extract 48 web pages successfully.
- Weakness
- Missing attributes, attributes not in order,
tabular data, etc.
16Softmealy
- Chun-Nan Hsu, Ming-Tzung Dung, 1998
- Arizona State University
- http://kaukoai.iis.sinica.edu.tw/chunnan/mypublications.html
17Softmealy Architecture
- Finite-State Transducers for Semi-Structured Text Mining
- Labeling: use an interface to label examples manually.
- Learner: FST (Finite-State Transducer)
- Extractor
- Demonstration
- http://kaukoai.iis.sinica.edu.tw/video.html
18Softmealy Wrapper
- SoftMealy wrapper representation
- Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path
- Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes
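A rough sketch of the idea, under simplifying assumptions: states named after an attribute (N, A, M) extract, dummy states with a leading "-" skip, and each transition fires when its left and right contexts match at the current position. The contexts here are literal strings rather than SoftMealy's token-class rules, and the start state "b", the end state "e", the table `FST` and the sample strings are all hypothetical.

```python
# from_state -> ordered list of (to_state, left_context, right_context).
# A transition fires at position i when the text before i ends with
# left_context and the text from i starts with right_context.
FST = {
    "b":  [("N", "<B>", "")],
    "N":  [("-N", "", "</B>")],
    "-N": [("A", "<I>", "Professor"), ("M", "<I>", "")],  # two paths
    "A":  [("-A", "", "</I>")],
    "-A": [("M", "<I>", "")],
    "M":  [("e", "", "</I>")],
}


def run_fst(text, fst=FST, start="b", final="e"):
    """Single pass over `text`; returns {attribute: extracted string}."""
    state, out = start, {}
    for i in range(len(text) + 1):
        for nxt, left, right in fst.get(state, []):
            if text.endswith(left, 0, i) and text.startswith(right, i):
                state = nxt
                break
        if state == final:
            break
        # Attribute states extract; start and dummy ("-X") states skip.
        if state != start and not state.startswith("-") and i < len(text):
            out[state] = out.get(state, "") + text[i]
    return out


print(run_fst("<B>Ann Lee</B>, <I>Professor of CS</I>, <I>Officer</I>"))
print(run_fst("<B>Bob Ray</B>, <I>Dean</I>"))
```

The two outgoing transitions of "-N" give the transducer two successful paths, one for the order (N, A, M) and one for (N, M), so a missing A attribute needs no separate wrapper.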
19Example
20Label the Answer Key
21Finite State Transducer
- Figure: finite-state transducer with states N, -N, U, -U, A, -A, M and e, whose transitions are labeled extract or skip; the example contains the two attribute orders (N, M) and (N, A, M).
22Find the starting position -- Single Pass
23Contextual based Rule Learning
- Tokens
- Separators
- SL: Punc(,) Spc(1) Html(<I>)
- SR: C1Alph(Professor) Spc(1) OAlph(of)
- Rule generalization
- Taxonomy Tree
24Tokens
- All-uppercase string: CAlph
- An uppercase letter followed by at least one lowercase letter: C1Alph
- A lowercase letter followed by zero or more characters: OAlph
- HTML tag: Html
- Punctuation symbol: Punc
- Control characters: NL(1), Tab(4), Spc(3)
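The token classes above can be approximated with an ordered regular-expression tokenizer. This is a sketch, not SoftMealy's own tokenizer: the exact patterns, the `Num` class for digit runs, and the choice to restrict OAlph to letters are assumptions made for illustration.

```python
import re

# Ordered alternatives: the first class that matches at a position wins.
TOKEN_CLASSES = [
    ("Html", r"<[^>]+>"),
    ("NL", r"\n+"),
    ("Tab", r"\t+"),
    ("Spc", r" +"),
    ("C1Alph", r"[A-Z][a-z]+"),
    ("CAlph", r"[A-Z]+"),
    ("OAlph", r"[a-z][A-Za-z]*"),
    ("Num", r"[0-9]+"),             # assumption: not listed on the slide
    ("Punc", r"[^\sA-Za-z0-9]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_CLASSES))
RUN_CLASSES = {"NL", "Tab", "Spc"}  # reported with a repeat count


def tokenize(text):
    """Return tokens written as Class(argument), e.g. Spc(1) or Html(<I>)."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        cls = m.lastgroup
        arg = len(m.group()) if cls in RUN_CLASSES else m.group()
        tokens.append(f"{cls}({arg})")
    return tokens


print(tokenize(", <I>Professor of"))
```

On the string `, <I>Professor of` this yields the left context Punc(,) Spc(1) Html(<I>) followed by the right context C1Alph(Professor) Spc(1) OAlph(of), matching the separator example on the rule-learning slide.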
25Rule Generalization
26Learning Algorithm
- Align the contextual rules token by token and generalize each column by replacing its tokens with their least common ancestor in the taxonomy tree
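A small sketch of column-wise generalization with a least common ancestor. The taxonomy in `PARENT` is hypothetical (the slide's tree is not reproduced in the text), tokens are modeled as (class, value) pairs, and "_" is an illustrative marker for "any value".

```python
# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "CAlph": "Alph", "C1Alph": "Alph", "OAlph": "Alph",
    "Alph": "Text", "Num": "Text",
    "NL": "Control", "Tab": "Control", "Spc": "Control",
    "Text": "Any", "Control": "Any", "Html": "Any", "Punc": "Any",
}


def ancestors(cls):
    """Chain from a class up to the root, most specific first."""
    chain = [cls]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def least_common_ancestor(classes):
    common = ancestors(classes[0])
    for cls in classes[1:]:
        others = set(ancestors(cls))
        common = [c for c in common if c in others]
    return common[0]


def generalize(rules):
    """rules: equal-length token sequences; each token is (class, value)."""
    result = []
    for column in zip(*rules):
        if len(set(column)) == 1:
            result.append(column[0])       # identical tokens stay as they are
        else:
            lca = least_common_ancestor([cls for cls, _ in column])
            result.append((lca, "_"))      # "_" stands for any value
    return result


rule_a = [("Html", "<I>"), ("C1Alph", "Professor")]
rule_b = [("Html", "<I>"), ("OAlph", "assistant")]
print(generalize([rule_a, rule_b]))
```

Here the first column is kept unchanged and the second is lifted to the hypothetical class `Alph`, the lowest class that covers both C1Alph and OAlph.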
27Taxonomy Tree
28Generating to Extract the Body
- The contextual rules for the head and tail separators are
- hL: C1Alph(Staff) Html(</H2>) NL(1) Html(<HR>) NL(1) Html(<UL>)
- tR: Html(</UL>) NL(1) Html(<HR>) NL(1) Html(<ADDRESS>) NL(1) Html(<I>) C1Alph(Please)
29More Expressive Power
- Softmealy allows
- Disjunction
- Multiple attribute orders within tuples
- Missing attributes
- Features of candidate strings
30Stalker
- I. Muslea, S. Minton, C. Knoblock,
- University of Southern California
- http://www.isi.edu/muslea/
31STALKER
- Embedded Catalog Tree
- Leaves (primitive items)
- Internal nodes (items)
- Homogeneous list, or
- Heterogeneous tuple.
32EC Tree of a page
33Extracting Data from a Document
- For each node in the EC Tree, the wrapper needs a rule that extracts that particular node from its parent
- Additionally, for each list node, the wrapper requires a list iteration rule that decomposes the list into individual tuples.
- Advantages
- First, the hierarchical extraction based on the EC tree allows us to wrap information sources that have arbitrarily many levels of embedded data.
- Second, as each node is extracted independently of its siblings, our approach does not rely on there being a fixed ordering of the items, and we can easily handle extraction tasks from documents that may have missing items or items that appear in various orders.
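The hierarchical scheme can be sketched as a small recursive extractor. This is an illustration under strong assumptions: rules are reduced to literal start/end strings instead of STALKER's landmark automata, and the `Node` class, the sample page and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    start: str                  # literal landmark before the item
    end: str                    # literal landmark after the item
    children: list = field(default_factory=list)
    item_start: str = ""        # list iteration rule (list nodes only)
    item_end: str = ""


def between(text, start, end, pos=0):
    """Return (substring between start and end, position after end)."""
    i = text.find(start, pos)
    if i < 0:
        return None, pos
    i += len(start)
    j = text.find(end, i)
    if j < 0:
        return None, pos
    return text[i:j], j + len(end)


def extract_children(children, text):
    # Each child is located independently of its siblings.
    return {c.name: extract(c, text) for c in children}


def extract(node, parent_text):
    chunk, _ = between(parent_text, node.start, node.end)
    if chunk is None:
        return None                         # missing item
    if node.item_start:                     # list node: iterate over items
        items, pos = [], 0
        while True:
            item, pos = between(chunk, node.item_start, node.item_end, pos)
            if item is None:
                break
            items.append(extract_children(node.children, item))
        return items
    if node.children:                       # tuple node
        return extract_children(node.children, chunk)
    return chunk                            # leaf


# Hypothetical page: one name plus a list of (city, phone) tuples.
PAGE = ("<h1>Yala</h1><ul>"
        "<li><b>Phoenix</b> <i>555-1234</i></li>"
        "<li><i>555-9999</i> <b>Tempe</b></li>"
        "<li><b>Mesa</b></li></ul>")
EC_TREE = [
    Node("name", "<h1>", "</h1>"),
    Node("locations", "<ul>", "</ul>",
         children=[Node("city", "<b>", "</b>"), Node("phone", "<i>", "</i>")],
         item_start="<li>", item_end="</li>"),
]
record = extract_children(EC_TREE, PAGE)
print(record)
```

The second list item has its fields in the opposite order and the third has no phone; because each child is searched for within its parent only, both cases are handled without extra rules.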
34Extraction Rules as Finite Automata
- Landmarks
- A sequence of tokens and wildcards
- Landmark automata
- A non-deterministic finite automaton
35Landmark Automata (LA)
- A linear LA has one accepting state
- from each non-accepting state, there are exactly two possible transitions: a loop to itself, and a transition to the next state
- each non-looping transition is labeled by a landmark
- all looping transitions mean "consume all tokens until you encounter the landmark that leads to the next state".
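A linear landmark automaton can be read as a chain of "skip to this landmark" steps. The sketch below makes that concrete under assumptions: the tokenizer and the regular expressions behind the wildcards `_HtmlTag_`, `_Word_` and `_Symbol_` are illustrative guesses, the sample text is hypothetical, and the matcher takes the first match of each landmark rather than exploring every path of a non-deterministic automaton.

```python
import re

# Illustrative wildcard definitions (assumptions, not STALKER's own).
WILDCARDS = {
    "_HtmlTag_": re.compile(r"<[^>]+>"),
    "_Word_": re.compile(r"[A-Za-z]+"),
    "_Symbol_": re.compile(r"[^\w\s]"),
}


def tokenize(text):
    """Split into HTML tags, alphanumeric runs and single symbols."""
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", text)


def matches(pattern, token):
    wildcard = WILDCARDS.get(pattern)
    return bool(wildcard.fullmatch(token)) if wildcard else pattern == token


def skip_to(tokens, start, landmark):
    """Loop transition: consume tokens until `landmark` matches.

    Returns the index just past the landmark, or None if it never occurs.
    """
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(matches(p, t) for p, t in zip(landmark, tokens[i:i + n])):
            return i + n
    return None


def apply_rule(tokens, landmarks):
    """Run a linear landmark automaton: one state per landmark."""
    pos = 0
    for landmark in landmarks:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None          # rule does not match this document
    return pos                   # accepting state: index where the item starts


tokens = tokenize("Name: Yala<p>Cuisine: Thai<p><i>4000 Colfax</i>")
simple = apply_rule(tokens, [["<i>"]])
chained = apply_rule(tokens, [["Cuisine", ":"], ["_HtmlTag_", "_HtmlTag_"]])
print(tokens[simple], tokens[chained])
```

Both rules stop at the same token: the first uses a single literal landmark, the second chains a literal landmark with a wildcard landmark of two consecutive HTML tags.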
36Rule Generation
- Extract Credit info.
- 1st: terminals: reservation, _Symbol_, _Word_
- Candidates: <i>, _Symbol_, _HtmlTag_
- Perfect disjuncts: <i>, _HtmlTag_ (positive examples: D3, D4)
- 2nd: uncovered: D1, D2
- Candidate: _Symbol_
37Possible Rules
40The STALKER Algorithm
43Features
- Process is performed in a hierarchical manner.
- Attributes that are not in order can be handled.
- Disjunctive rules are used to handle missing attributes.
44Multi-pass Softmealy
- Chun-Nan Hsu and Chien-Chi Chang
- Institute of Information Science
- Academia Sinica
- Taipei, Taiwan
45Multi-pass
46Tabular style document
(Quote Server)
47Tagged-list style document
(Internet Address Finder)
48Layout styles and learnability
- Tabular style
- missing attributes, ordering as hints
- Tagged-list style
- variant ordering, tags as hints
- Prediction
- single-pass for tabular style
- multi-pass for tagged-list style
49Tabular result (Quote Server)
50Tagged-list result (Internet Address Finder)
51Alternative for Tagged-List Docs
52Comparison
- Both
- can handle irregular missing attributes.
- Single-pass
- Affected by attribute permutations
- Single-pass is good for tabular pages
- Multi-pass
- Attribute permutations are not a problem
- Multi-pass is good for tagged-list pages
53Experiments
- Okra (tabular pages)
- Stalker: 97%, 1 example tuple
- WIEN: 100%, 13 example tuples, 30 trials
- SoftMealy single-pass: 100%, 1 example tuple, 30 trials
- Big-book (tagged-list pages)
- Stalker: 97%, 8 example tuples
- WIEN: perfect, 18 example tuples, 30 trials
- SoftMealy single-pass: 97%, 4 examples, 30 trials
- SoftMealy multi-pass: 100%, 6 examples, 30 trials
54Experiments (Cont.)
- Quote Server
- Stalker: 79%, 10 example tuples, 500 trials
- WIEN: the collection is beyond its learning capability
- SoftMealy: multi-pass 85%, single-pass 97%
- Internet Address Finder
- Stalker: 80-100%, 300 trials (omitting multiple occurrences of Organization)
- WIEN: the collection is beyond its learning capability
- SoftMealy: multi-pass 68%, single-pass 41%, 99% with the alternative approach
55References
- Kushmerick, N. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence J., 118(1-2):15-68, 2000.
- Chun-Nan Hsu and Ming-Tzung Dung. Generating finite-state transducers for semistructured data extraction from the Web. Information Systems, 23(8):521-538, 1998.
- Chun-Nan Hsu and Chien-Chi Chang. Finite-State Transducers for Semi-Structured Text Mining. In Proceedings of the IJCAI-99 Workshop on Text Mining, Stockholm, Sweden, 1999, pp. 38-49.
- Ion Muslea, Steve Minton, Craig Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.