Learning Information Extraction Rules for
Semi-Structured and Free Text
  • Stephen Soderland (1999)

WHISK: learning IE rules / Frans Adriaans

Overview
  • Introduction
  • WHISK representation
  • WHISK algorithm
  • Results
  • Concluding remarks

Introduction
  • Information Extraction (IE)
  • structured, semi-structured and free text
  • learning rules for extraction
  • WHISK: supervised learning, using hand-tagged
    examples

WHISK rule representation
  • regular expressions
  • single-slot vs. multi-slot extraction
  • user-defined semantic classes, e.g.
    Bdrm = (brs | br | bds | bdrm | bd | bedroom | bed)
  • free text requires preprocessing
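
To make the rule representation concrete, the sketch below approximates a multi-slot WHISK-style rule for rental ads with an ordinary Python regular expression. The sample ad, the rule itself and the names `BDRM`, `RULE` and `extract_rental` are illustrative assumptions, not material from the slides; WHISK's own pattern syntax differs from Python's `re` syntax.

```python
import re

# Semantic class from the slide, written as a regex alternation.
BDRM = r"(?:brs|br|bds|bdrm|bd|bedroom|bed)"

# Rough regex equivalent of a WHISK-like rule:  * ( Digit ) Bdrm * '$' ( Number )
# The lazy '.*?' plays the role of WHISK's '*' wildcard (skip until the next
# term matches); the named groups play the role of the two extraction slots.
RULE = re.compile(
    r".*?(?P<bedrooms>\d)\s*" + BDRM + r"\b.*?\$\s*(?P<price>\d+)",
    re.IGNORECASE | re.DOTALL,
)


def extract_rental(ad):
    """Return both slots as a dict, or None when the rule does not apply."""
    match = RULE.search(ad)
    return match.groupdict() if match else None


# Hypothetical ad text, used only to exercise the rule.
result = extract_rental(
    "Capitol Hill - 1 br twnhme. D/W W/D. Undrgrnd pkg incl $675."
)
```

Because both slots are filled by one rule, the number of bedrooms and the price stay associated with the same rental, which is the point of multi-slot extraction.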

Example 1
  • semi-structured text
  • WHISK pattern
  • WHISK output

Example 2
  • grammatical text
  • WHISK pattern
  • WHISK output

Creating training instances
  • supervised learning: hand-tagged training
    instances
  • tagging interleaved with learning process
  • selective sampling of:
    1) instances covered by an existing rule
    2) instances that are near misses of a rule
    3) instances not covered by any rule
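
The selective sampling step can be sketched as drawing the next batch to tag from the three groups listed above. The function `select_batch`, the `bucket_of` callback and the equal split between groups are assumptions made for illustration; the slides do not say how the proportions are chosen or how a near miss is detected.

```python
import random

KINDS = ("covered", "near_miss", "uncovered")


def select_batch(reservoir, bucket_of, batch_size, seed=0):
    """Pick untagged instances for the user to tag, spread over the three groups.

    bucket_of(inst) is a hypothetical callback returning one of KINDS,
    judged against the current rule set.
    """
    rng = random.Random(seed)
    buckets = {kind: [] for kind in KINDS}
    for inst in reservoir:
        buckets[bucket_of(inst)].append(inst)

    per_bucket = batch_size // len(KINDS)  # equal shares: an assumption
    batch = []
    for kind in KINDS:
        items = buckets[kind]
        batch.extend(rng.sample(items, min(per_bucket, len(items))))
    return batch


# Toy reservoir: integers whose remainder mod 3 decides the group.
batch = select_batch(list(range(30)), lambda i: KINDS[i % 3], batch_size=6)
```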

WHISK algorithm
  • begins with an empty rule
  • adds one term at a time, using selection metric
  • stops adding terms when errors are reduced to
    zero or a pre-pruning criterion has been
    satisfied
    - error is measured by the Laplacian
      (e + 1) / (n + 1), where e is the number of
      errors among n extractions on the training set
  • repeats this until set of rules covers all
    training instances
  • post-pruning removes rules to prevent overfitting
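
As a small worked example of the Laplacian error estimate, the function below computes (e + 1) / (n + 1) for a rule that makes e errors among n extractions. The function name and the argument check are assumptions added for illustration.

```python
def laplacian(errors: int, extractions: int) -> float:
    """Laplacian expected error (e + 1) / (n + 1) of a rule."""
    if errors < 0 or extractions < errors:
        raise ValueError("need 0 <= errors <= extractions")
    return (errors + 1) / (extractions + 1)
```

An error-free rule with a single extraction scores 1/2, while an error-free rule with nine extractions scores 1/10, so the measure favours rules that are supported by more training instances.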

WHISK algorithm
WHISK(Reservoir)
    RuleSet = NULL
    Training = NULL
    Repeat at user's request
        Select a batch of NewInst from Reservoir
        (User tags the NewInst)
        Add NewInst to Training
        Discard rules with errors on NewInst
        For each Inst in Training
            For each Tag of Inst
                If Tag is not covered by RuleSet
                    Rule = GROW_RULE(Inst, Tag, Training)
        Prune RuleSet
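
The top-level loop can be sketched in Python as follows. This is a simplified illustration rather than the original implementation: rule growing, coverage and error checks are passed in as callbacks, pruning is left out, and the `toy_*` helpers and the sample batch are hypothetical stand-ins that exist only to make the sketch runnable.

```python
import re


def whisk(batches, grow_rule, covers, makes_error):
    """Covering loop; each batch stands in for one round of user tagging."""
    rule_set, training = [], []
    for new_inst in batches:  # "repeat at user's request"
        training.extend(new_inst)
        # Discard rules that make errors on the newly tagged instances.
        rule_set = [r for r in rule_set
                    if not any(makes_error(r, inst) for inst in new_inst)]
        for inst in training:
            for tag in inst["tags"]:
                if not any(covers(r, inst, tag) for r in rule_set):
                    rule_set.append(grow_rule(inst, tag, training))
        # Post-pruning of rule_set would go here; omitted in this sketch.
    return rule_set


# Toy stand-ins (assumptions): a rule is a regex with one numeric slot,
# anchored on the character just before the tagged value.
def toy_grow_rule(inst, tag, training):
    start = inst["text"].index(tag)
    prefix = inst["text"][max(start - 1, 0):start]
    return re.compile(re.escape(prefix) + r"(\d+)")


def toy_covers(rule, inst, tag):
    return tag in rule.findall(inst["text"])


def toy_makes_error(rule, inst):
    return any(m not in inst["tags"] for m in rule.findall(inst["text"]))


batch = [{"text": "1 br $675", "tags": ["675"]},
         {"text": "2 br $995", "tags": ["995"]}]
rules = whisk([batch], toy_grow_rule, toy_covers, toy_makes_error)
```

In this toy run the rule grown from the first ad already covers the tag in the second ad, so only one rule is induced.
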
Anchoring extraction slots
  • Creating two base rules:
    - a rule with terms added within extraction
      boundary
    - a rule with terms added just outside boundary
  • Example:
    - Anchoring Slot 1
      Base_1: * ( Neighbr )
      Base_2: @start ( * ) -
    - Anchoring Slot 2
      Base_1: * ( Neighbr ) * ( Digit )
      Base_2: * ( Neighbr ) * - ( * ) br

WHISK algorithm
GROW_RULE(Inst, Tag, Training)
    Rule = empty rule (terms replaced by wildcards)
    For i = 1 to number of slots in Tag
        ANCHOR(Rule, Inst, Tag, Training, i)
    Do until Rule makes no errors on Training
            or no improvement in Laplacian
        EXTEND_RULE(Rule, Inst, Tag, Training)

Extending Rules
  • every term in instance is considered
  • proposed extension is tested on training set
  • best performing extension is selected
  • heuristics for equally performing extensions:
    - semantic classes and syntactic tags are
      preferred
    - proximity to slot is preferred
  • extending continues until error is sufficiently
    reduced
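
The selection of the best extension, including the tie-breaking heuristics, can be sketched as a single sort key. The `Candidate` record, its fields and the sample values are assumptions for illustration. Performance is scored here with the Laplacian (e + 1) / (n + 1), and how the two tie-breakers are ordered relative to each other is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    term: str          # term proposed as an extension of the rule
    errors: int        # errors of the extended rule on the training set
    extractions: int   # extractions of the extended rule on the training set
    is_class: bool     # True for a semantic class or syntactic tag
    distance: int      # distance (in terms) from the extraction slot


def best_extension(candidates):
    """Lowest Laplacian wins; ties prefer classes/tags, then proximity."""
    return min(
        candidates,
        key=lambda c: ((c.errors + 1) / (c.extractions + 1),
                       not c.is_class,
                       c.distance),
    )


# Hypothetical candidates: the first two perform equally well.
candidates = [
    Candidate("br", errors=1, extractions=9, is_class=False, distance=1),
    Candidate("Bdrm", errors=1, extractions=9, is_class=True, distance=2),
    Candidate("$", errors=3, extractions=9, is_class=False, distance=1),
]
chosen = best_extension(candidates)
```

Here "br" and "Bdrm" tie on error, so the semantic class "Bdrm" is chosen even though the literal is closer to the slot.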

Pruning
  • pre-pruning: stop induction if error is below a
    certain threshold (for example, a Laplacian of 0.1)
  • post-pruning: remove rules that have low
    coverage to prevent overfitting
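
Both pruning steps can be sketched as simple threshold checks. The function names, the strict "below" comparison, the `min_coverage` default and the sample rule names are assumptions for illustration; only the example threshold of 0.1 comes from the slide.

```python
def stop_growing(errors, extractions, threshold=0.1):
    """Pre-pruning: stop once the rule is error-free or its Laplacian
    (e + 1) / (n + 1) drops below the threshold."""
    return errors == 0 or (errors + 1) / (extractions + 1) < threshold


def post_prune(coverage_by_rule, min_coverage=3):
    """Post-pruning: keep only rules covering at least min_coverage
    training extractions (min_coverage is a hypothetical setting)."""
    return {rule: n for rule, n in coverage_by_rule.items()
            if n >= min_coverage}


# Hypothetical rule names mapped to their coverage on the training set.
kept = post_prune({"rule_a": 1, "rule_b": 5})
```

With these settings a rule with one error among nine extractions (Laplacian 0.2) keeps growing, while one error among thirty extractions (about 0.065) is accepted as good enough.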

Experiments
  • Structured text: CNN Weather Forecast, BigBook
    Telephone Directory
  • Semi-structured text: Rental Ads, Seminar
    Announcements
  • Free text: News articles

Results
  • Structured text: 100% recall, 100% precision
    after only a few examples
  • Semi-structured text: more examples necessary,
    recall 92%, precision 96%
  • Free text: recall 46%, precision 67%
    after many, many training examples

Results
  • regularities make learning rules easier
  • free text:
    - concepts often described differently
    - many instances, only a few contain target slots
    → many low-coverage, high-error rules

Concluding remarks
  • structured, semi-structured and free text
  • requires no hand-coded rules/keywords
  • multi-slot extraction
  • free text requires preprocessing (syntactic
    analysis, semantic tagger)
  • performance worse than hand-coded rules

Questions?