Machine-learning based Semi-structured IE

1
Machine-learning based Semi-structured IE
  • Chia-Hui Chang
  • Department of Computer Science and Information
    Engineering
  • National Central University
  • chia@csie.ncu.edu.tw
  • 9/24/2002

2
Wrapper Induction
  • Wrapper
  • An extraction program that extracts the desired
    information from Web pages.
  • Semi-structured documents -> wrapper -> structured
    information
  • Web wrappers wrap:
  • Query-able or search-able Web sites
  • Web pages with large itemized lists
  • The primary issue: how to build the extractor
    quickly?

3
Semi-structured IE
  • Developed independently of traditional IE
  • Motivated by the need to extract and integrate
    data from multiple Web-based sources

4
Machine-Learning Based Approach
  • A key component of IE systems is
  • a set of extraction patterns
  • that can be generated by machine learning
    algorithms.
  • Extractor
  • Driver Architecture
  • Rule Format

5
Related Work
  • Shopbot
  • Doorenbos, Etzioni, Weld, AA-97
  • Ariadne
  • Ashish, Knoblock, CoopIS-97
  • WIEN
  • Kushmerick, Weld, Doorenbos, IJCAI-97
  • SoftMealy wrapper representation
  • Hsu, IJCAI-99
  • STALKER
  • Muslea, Minton, Knoblock, AA-99
  • A hierarchical FST

6
WIEN
  • N. Kushmerick, D. S. Weld,
  • R. Doorenbos,
  • University of Washington, 1997
  • http://www.cs.ucd.ie/staff/nick/

7
Example 1
8
Extractor for Example 1
9
HLRT
10
Wrapper Induction
  • Induction
  • The task of generalizing from labeled examples to
    a hypothesis
  • Instances: pages
  • Labels: (Congo, 242), (Egypt, 20), (Belize,
    501), (Spain, 34)
  • Hypotheses
  • E.g. (<p>, <HR>, <B>, </B>, <I>, </I>)
    (an HLRT extraction sketch using these delimiters
    follows below)
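
A minimal sketch (in Python) of how an HLRT wrapper (h, t, l1, r1, ..., lK, rK) could be executed with a hypothesis like the one above; the sample page is an invented stand-in for the country/code example, not the slide's actual document.

    # Apply an HLRT wrapper: skip the head h, then repeatedly extract each
    # attribute between its left/right delimiters until the tail t is reached.
    def execute_hlrt(page, h, t, delims):
        """delims is a list of (l_k, r_k) delimiter pairs, one per attribute."""
        tuples = []
        pos = page.find(h) + len(h)          # skip the page head
        end = page.find(t, pos)              # everything after t is the tail
        while True:
            start = page.find(delims[0][0], pos)
            if start == -1 or start > end:   # next tuple would begin in the tail
                break
            row = []
            for l, r in delims:
                a = page.find(l, pos) + len(l)
                b = page.find(r, a)
                row.append(page[a:b])
                pos = b + len(r)
            tuples.append(tuple(row))
        return tuples

    page = ("<B>Country Codes</B><P>"
            "<B>Congo</B> <I>242</I><BR>"
            "<B>Egypt</B> <I>20</I><BR>"
            "<B>Spain</B> <I>34</I><BR>"
            "<HR><B>End</B>")
    print(execute_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
    # -> [('Congo', '242'), ('Egypt', '20'), ('Spain', '34')]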

11
BuildHLRT
12
Other Family
  • OCLR (Open-Close-Left-Right)
  • Use Open and Close as delimiters for each tuple
  • HOCLRT
  • Combine OCLR with Head and Tail
  • N-LR and N-HLRT
  • Nested LR
  • Nested HLRT

13
Terminology
  • Oracles
  • Page Oracle
  • Label Oracle
  • PAC analysis
  • determines how many examples are necessary to
    build a wrapper, given two parameters:
    accuracy ε and confidence δ
  • Pr[E(w) < ε] > 1 - δ, i.e. Pr[E(w) > ε] < δ

14
Probably Approximate Correct (PAC) Analysis
  • With ε = 0.1, δ = 0.1, K = 4, and an average of 5
    tuples/page, BuildHLRT must examine at least 72
    examples

15
Empirical Evaluation
  • Extracted 48 Web pages successfully.
  • Weakness
  • Missing attributes, attributes not in order,
    tabular data, etc.

16
SoftMealy
  • Chun-Nan Hsu, Ming-Tzung Dung, 1998
  • Arizona State University
  • http://kaukoai.iis.sinica.edu.tw/chunnan/mypublications.html

17
SoftMealy Architecture
  • Finite-State Transducers for Semi-Structured Text
    Mining
  • Labeling: use an interface to label examples
    manually.
  • Learner: FST (finite-state transducer)
  • Extractor
  • Demonstration
  • http://kaukoai.iis.sinica.edu.tw/video.html

18
SoftMealy Wrapper
  • SoftMealy wrapper representation
  • Uses a finite-state transducer in which each
    distinct attribute permutation is encoded as a
    successful path
  • Replaces delimiters with contextual rules that
    describe the context delimiting two adjacent
    attributes (a toy matching sketch follows below)
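
A rough sketch (not SoftMealy's actual code) of what "matching a contextual rule" can mean: the token classes on each side of a candidate boundary are compared with the rule's left and right contexts. The (class, text) token representation and the rule format are simplifying assumptions.

    # Tokens are (class, text) pairs; a contextual rule is a pair of contexts,
    # each a list of (class, text) requirements (text None = any text of that class).
    def matches(tokens, pattern):
        if len(tokens) < len(pattern):
            return False
        return all(tok[0] == cls and (txt is None or tok[1] == txt)
                   for tok, (cls, txt) in zip(tokens, pattern))

    def separator_positions(tokens, left, right):
        """Return every token boundary whose left/right context matches the rule."""
        hits = []
        for i in range(len(tokens) + 1):
            if (matches(list(reversed(tokens[:i])), list(reversed(left))) and
                    matches(tokens[i:], right)):
                hits.append(i)
        return hits

    # Toy token stream around "..., <I>Professor of ..." and the rule shown on slide 23.
    toks = [("C1Alph", "Smith"), ("Punc", ","), ("Spc", " "), ("Html", "<I>"),
            ("C1Alph", "Professor"), ("Spc", " "), ("OAlph", "of")]
    left = [("Punc", ","), ("Spc", None), ("Html", "<I>")]        # s_L
    right = [("C1Alph", None), ("Spc", None), ("OAlph", "of")]    # s_R
    print(separator_positions(toks, left, right))                 # -> [4]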

19
Example
20
Label the Answer Key
21
Finite State Transducer
(FST diagram: the transducer encodes the two attribute
permutations (N, M) and (N, A, M) as alternative paths,
with extract and skip transitions between the attribute
states N, U, A, M and the end state e.)
22
Find the starting position -- Single Pass

23
Contextual Rule Learning
  • Tokens
  • Separators
  • s_L: Punc(,) Spc(1) Html(<I>)
  • s_R: C1Alph(Professor) Spc(1) OAlph(of)
  • Rule generalization
  • Taxonomy Tree

24
Tokens
  • All-uppercase string: CAlph
  • An uppercase letter, followed by at least one
    lowercase letter: C1Alph
  • A lowercase letter, followed by zero or more
    characters: OAlph
  • HTML tag: Html
  • Punctuation symbol: Punc
  • Control characters: NL(1), Tab(4), Spc(3)
    (a rough tokenizer sketch follows below)
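
A small sketch of a tokenizer for these classes using Python regular expressions; the exact patterns (and the choice to report control characters by their length) are assumptions that follow the slide loosely rather than the SoftMealy paper.

    import re

    # Ordered (class, pattern) pairs approximating the token classes above.
    TOKEN_SPEC = [
        ("Html",   r"<[^>]+>"),        # HTML tag
        ("CAlph",  r"[A-Z][A-Z]+"),    # all-uppercase string
        ("C1Alph", r"[A-Z][a-z]+"),    # uppercase letter, then lowercase letters
        ("OAlph",  r"[a-z]\w*"),       # lowercase letter, then other characters
        ("NL",     r"\n+"),            # control characters, reported with a count
        ("Tab",    r"\t+"),
        ("Spc",    r" +"),
        ("Punc",   r"[^\w\s<>]"),      # punctuation symbol
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def tokenize(text):
        """Return (class, lexeme) pairs; control classes carry their length instead."""
        out = []
        for m in MASTER.finditer(text):
            cls, lexeme = m.lastgroup, m.group()
            out.append((cls, len(lexeme)) if cls in ("NL", "Tab", "Spc") else (cls, lexeme))
        return out

    print(tokenize("<I>Professor of CS</I>, Dept."))
    # -> [('Html', '<I>'), ('C1Alph', 'Professor'), ('Spc', 1), ('OAlph', 'of'),
    #     ('Spc', 1), ('CAlph', 'CS'), ('Html', '</I>'), ('Punc', ','), ('Spc', 1),
    #     ('C1Alph', 'Dept'), ('Punc', '.')]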

25
Rule Generalization
26
Learning Algorithm
  • Generalize each column by replacing its tokens
    with their least common ancestor in the taxonomy
    tree (a toy sketch follows below)
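
A toy sketch of the column-wise generalization step; the taxonomy fragment below is invented for illustration (the actual tree is on the next slide), and examples are simply aligned lists of token classes.

    # Hypothetical taxonomy fragment (child class -> parent class).
    PARENT = {
        "CAlph": "Alph", "C1Alph": "Alph", "OAlph": "Alph",
        "Spc": "Ctrl", "NL": "Ctrl", "Tab": "Ctrl",
        "Alph": "Token", "Ctrl": "Token", "Punc": "Token", "Html": "Token",
    }

    def ancestors(cls):
        chain = [cls]
        while cls in PARENT:
            cls = PARENT[cls]
            chain.append(cls)
        return chain

    def least_common_ancestor(classes):
        common = set(ancestors(classes[0]))
        for c in classes[1:]:
            common &= set(ancestors(c))
        # the deepest class shared by all examples (longest ancestor chain)
        return max(common, key=lambda c: len(ancestors(c)))

    def generalize(examples):
        """Generalize column-wise over equal-length lists of token classes."""
        return [least_common_ancestor(list(col)) for col in zip(*examples)]

    # Two labeled contexts for the same separator:
    print(generalize([["Punc", "Spc", "C1Alph"],
                      ["Punc", "NL", "CAlph"]]))
    # -> ['Punc', 'Ctrl', 'Alph']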

27
Taxonomy Tree
28
Generating to Extract the Body
  • The contextual rules for the head and tail
    separators are
  • h_L: C1Alph(Staff) Html(</H2>) NL(1) Html(<HR>)
    NL(1) Html(<UL>)
  • t_R: Html(</UL>) NL(1) Html(<HR>) NL(1)
    Html(<ADDRESS>) NL(1) Html(<I>) C1Alph(Please)

29
More Expressive Power
  • SoftMealy allows
  • Disjunction
  • Multiple attribute orders within tuples
  • Missing attributes
  • Features of candidate strings

30
STALKER
  • I. Muslea, S. Minton, C. Knoblock
  • University of Southern California
  • http://www.isi.edu/muslea/

31
STALKER
  • Embedded Catalog (EC) Tree
  • Leaves (primitive items): the data to be extracted
  • Internal nodes (items):
  • a homogeneous list, or
  • a heterogeneous tuple.

32
EC Tree of a page
33
Extracting Data from a Document
  • For each node in the EC tree, the wrapper needs a
    rule that extracts that particular node from its
    parent.
  • Additionally, for each list node, the wrapper
    requires a list iteration rule that decomposes
    the list into individual tuples.
  • Advantages
  • The hierarchical extraction based on the EC tree
    allows us to wrap information sources that have
    arbitrarily many levels of embedded data.
  • Second, because each node is extracted
    independently of its siblings, the approach does
    not rely on a fixed ordering of the items, and it
    can easily handle documents with missing items or
    items that appear in various orders (a simplified
    extraction sketch follows below).
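
A highly simplified sketch of the hierarchical idea: each node carries its own extraction rule and is pulled out of its parent's region independently, and list nodes additionally iterate over their items. Plain string delimiters and the restaurant document below are invented stand-ins; real STALKER rules are the landmark automata described on the next slides.

    # A node is a dict: leaves have start/end delimiters; list nodes also have
    # a "split" delimiter; tuple nodes have "children".
    def between(text, start, end):
        a = text.find(start)
        if a == -1:
            return ""                      # missing item: extract nothing
        a += len(start)
        b = text.find(end, a)
        return text[a:b] if b != -1 else text[a:]

    def extract(node, text):
        region = between(text, node["start"], node["end"])
        if "split" in node:                # list node: iterate, then recurse per item
            items = [s for s in region.split(node["split"]) if s.strip()]
            return [extract_children(node, item) for item in items]
        if "children" in node:             # tuple node: recurse on each child
            return extract_children(node, region)
        return region.strip()              # leaf: the extracted value

    def extract_children(node, region):
        return {c["name"]: extract(c, region) for c in node["children"]}

    restaurant = {
        "name": "restaurant", "start": "<body>", "end": "</body>",
        "children": [
            {"name": "name", "start": "<h1>", "end": "</h1>"},
            {"name": "addresses", "start": "<ul>", "end": "</ul>", "split": "<li>",
             "children": [{"name": "street", "start": "", "end": ","},
                          {"name": "phone", "start": "(", "end": ")"}]},
        ],
    }
    doc = ("<body><h1>Zagat</h1><ul><li>4 Apple St, (800-777-1111)"
           "<li>12 Pico Blvd, (213-555-2222)</ul></body>")
    print(extract(restaurant, doc))
    # -> {'name': 'Zagat', 'addresses': [{'street': '4 Apple St',
    #     'phone': '800-777-1111'}, {'street': '12 Pico Blvd',
    #     'phone': '213-555-2222'}]}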

34
Extraction Rules as Finite Automata
  • Landmark
  • A sequence of tokens and wildcards
  • Landmark automaton (LA)
  • A non-deterministic finite automaton

35
Landmark Automata (LA)
  • A linear LA has one accepting state
  • From each non-accepting state there are exactly
    two possible transitions: a loop to itself and a
    transition to the next state
  • Each non-looping transition is labeled by a
    landmark
  • All looping transitions mean "consume all tokens
    until you encounter the landmark that leads to the
    next state" (a toy sketch follows below).
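
A minimal sketch of running a linear LA over a token list: each landmark is applied with "skip to" semantics, i.e. the loop consumes tokens until the landmark matches and the non-looping transition then consumes the landmark itself. Tokens are plain strings and wildcards are reduced to two crude checks; this is an assumption-heavy toy, not STALKER's implementation.

    # A landmark is a list of tokens/wildcards; a linear LA is a list of landmarks.
    def token_matches(tok, pat):
        if pat == "_HtmlTag_":
            return tok.startswith("<")      # crude wildcard: any HTML tag
        if pat == "_Number_":
            return tok.isdigit()            # crude wildcard: any number
        return tok == pat                   # otherwise an exact landmark token

    def skip_to(tokens, pos, landmark):
        """Consume tokens until `landmark` matches; return the index just after it."""
        for i in range(pos, len(tokens) - len(landmark) + 1):
            if all(token_matches(t, p) for t, p in zip(tokens[i:], landmark)):
                return i + len(landmark)
        return None                         # landmark never found: the rule fails

    def run_linear_la(tokens, landmarks, start=0):
        pos = start
        for lm in landmarks:
            pos = skip_to(tokens, pos, lm)
            if pos is None:
                return None
        return pos                          # accepting state: extraction starts here

    toks = ["Maison", ",", "<i>", "Cuisine", ":", "Seafood", "</i>",
            "<p>", "Phone", ":", "321-1234"]
    rule = [["<p>"], ["Phone", ":"]]        # SkipTo(<p>) then SkipTo(Phone :)
    print(run_linear_la(toks, rule))        # -> 10, the index of the phone number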

36
Rule Generation
Example: extract the credit info.
  • 1st iteration: terminals of the shortest
    uncovered example: reservation, _Symbol_, _Word_
  • Candidates: <i>, _Symbol_, _HtmlTag_
  • Perfect disjunct: <i> _HtmlTag_, covering the
    positive examples D3, D4
  • 2nd iteration: D1, D2 still uncovered
  • Candidate: _Symbol_
37
Possible Rules
38
(No Transcript)
39
(No Transcript)
40
The STALKER Algorithm
41
(No Transcript)
42
(No Transcript)
43
Features
  • Extraction is performed in a hierarchical manner.
  • Handles attributes that are not in order.
  • Uses disjunctive rules to handle missing
    attributes.

44
Multi-pass SoftMealy
  • Chun-Nan Hsu and Chien-Chi Chang
  • Institute of Information Science
  • Academia Sinica
  • Taipei, Taiwan

45
Multi-pass
46
Tabular style document
(Quote Server)
47
Tagged-list style document
(Internet Address Finder)
48
Layout styles and learnability
  • Tabular style
  • missing attributes, ordering as hints
  • Tagged-list style
  • variant ordering, tags as hints
  • Prediction
  • single-pass for tabular style
  • multi-pass for tagged-list style

49
Tabular result (Quote Server)
50
Tagged-list result (Internet Address Finder)
51
Alternative for Tagged-List Docs
52
Comparison
  • Both
  • can handle irregular missing attributes
  • require labeled example attributes for training
  • Single-pass
  • must see every attribute permutation in training
  • Single-pass is good for tabular pages
  • Multi-pass
  • attribute permutations are not a problem
  • Multi-pass is good for tagged-list pages

53
Experiments
  • Okra (tabular pages)
  • STALKER: 97%, 1 example tuple
  • WIEN: 100%, 13 example tuples, 30 trials
  • SoftMealy single-pass: 100%, 1 example tuple,
    30 trials
  • BigBook (tagged-list pages)
  • STALKER: 97%, 8 example tuples
  • WIEN: perfect, 18 example tuples, 30 trials
  • SoftMealy single-pass: 97%, 4 examples, 30 trials
  • SoftMealy multi-pass: 100%, 6 examples, 30 trials

54
Experiments (Cont.)
  • Quote Server
  • STALKER: 79%, 10 example tuples, 500 trials
  • WIEN: the collection is beyond its learning
    capability
  • SoftMealy: multi-pass 85%, single-pass 97%
  • Internet Address Finder
  • STALKER: 80%/100%, 300 trials (omitting multiple
    occurrences of Organization)
  • WIEN: the collection is beyond its learning
    capability
  • SoftMealy: multi-pass 68%, single-pass 41%, 99%
    with the alternative approach

55
References
  • Kushmerick, N. Wrapper induction: efficiency and
    expressiveness. Artificial Intelligence Journal,
    118(1-2):15-68, 2000.
  • Chun-Nan Hsu and Ming-Tzung Dung. Generating
    finite-state transducers for semi-structured data
    extraction from the Web. Information Systems,
    23(8):521-538, 1998.
  • Chun-Nan Hsu and Chien-Chi Chang. Finite-State
    Transducers for Semi-Structured Text Mining. In
    Proceedings of the IJCAI-99 Workshop on Text
    Mining, Stockholm, Sweden, 1999, pages 38-49.
  • Ion Muslea, Steve Minton, Craig Knoblock.
    Hierarchical Wrapper Induction for Semistructured
    Information Sources. Journal of Autonomous Agents
    and Multi-Agent Systems, 4:93-114, 2001.