Learning Information Extraction Rules for Semi-Structured and Free Text

1
Learning Information Extraction Rules for
Semi-Structured and Free Text

Stephen Soderland (1999)

WHISK: learning IE rules / Frans Adriaans
2
Overview

Introduction
WHISK representation
WHISK algorithm
Results
Concluding remarks

3
Introduction

Information Extraction (IE)
structured, semi-structured and free text
learning rules for extraction
WHISK: supervised learning, using hand-tagged examples

4
WHISK rule representation

regular expressions
single-slot vs. multi-slot extraction
user-defined semantic classes, e.g. Bdrm = (brs | br | bds | bdrm | bd | bedroom | bed)
free text requires preprocessing
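
As a rough illustration of the points above, a semantic class such as Bdrm can be approximated by a regex alternation, and a multi-slot rule by a regex with one capture group per slot. The ad text, the `PATTERN` regex and the slot names `bedrooms` and `price` are assumptions made for this sketch; WHISK has its own rule syntax and matcher, which this does not reproduce.

```python
import re

# Semantic class from the slide: any of these tokens counts as "Bdrm".
BDRM_TERMS = ["brs", "br", "bds", "bdrm", "bd", "bedroom", "bed"]
# Longest alternatives first, so "bedroom" is not cut short as "bed".
BDRM = "(?:" + "|".join(sorted(BDRM_TERMS, key=len, reverse=True)) + ")"

# Hypothetical multi-slot rule: capture a digit followed by a Bdrm term,
# skip ahead to the next '$', and capture the number after it.
PATTERN = re.compile(r"(\d+)\s*" + BDRM + r"\b.*?\$\s*(\d+)", re.IGNORECASE)


def extract_rentals(text):
    """Return one {bedrooms, price} case frame per match in text."""
    return [
        {"bedrooms": int(m.group(1)), "price": int(m.group(2))}
        for m in PATTERN.finditer(text)
    ]


# Made-up rental ad, written in the abbreviated style of classified ads.
ad = "Capitol Hill - 1 br twnhme. D/W W/D. pkg incl $675. 3 BR, upper flr. grt loc $995."
print(extract_rentals(ad))
```

A single-slot rule would have only one capture group; the multi-slot form keeps the bedroom count and the price of the same rental together in one case frame.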

5
Example 1
Semi-structured text
6
Example 1
WHISK pattern
WHISK output
7
Example 2
Grammatical text
8
Example 2
WHISK pattern
WHISK output
9
Creating training instances

supervised learning: hand-tagged training instances
tagging interleaved with the learning process
selective sampling of:
  1) instances covered by an existing rule
  2) instances that are near misses of a rule
  3) instances not covered by any rule
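
The three sampling categories can be sketched as a simple bucketing step. In this sketch a rule is just a set of required terms, and a near miss is an instance lacking exactly one of them; both are simplifying assumptions for illustration, not WHISK's actual rule matching, and the names `classify_instance` and `select_batch` are hypothetical.

```python
def classify_instance(tokens, rules):
    """Return 'covered', 'near_miss', or 'uncovered' for one untagged instance.

    Assumption: a rule is a set of required terms. It covers an instance
    when all its terms occur in it, and nearly misses when exactly one
    term is absent.
    """
    token_set = set(tokens)
    best = "uncovered"
    for rule in rules:
        missing = len(rule - token_set)
        if missing == 0:
            return "covered"
        if missing == 1:
            best = "near_miss"
    return best


def select_batch(reservoir, rules, per_class=1):
    """Pick up to per_class instances from each of the three categories."""
    buckets = {"covered": [], "near_miss": [], "uncovered": []}
    for tokens in reservoir:
        buckets[classify_instance(tokens, rules)].append(tokens)
    return {name: insts[:per_class] for name, insts in buckets.items()}


rules = [{"br", "$"}]
reservoir = [
    ["3", "br", "$", "995"],   # contains both terms
    ["2", "br", "995"],        # lacks "$"
    ["garage", "sale"],        # lacks both terms
]
print(select_batch(reservoir, rules))
```

Drawing from all three buckets lets the user confirm existing rules, correct rules that are slightly too narrow, and supply examples for cases no rule handles yet.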

10
WHISK algorithm

begins with an empty rule
adds one term at a time, using a selection metric
stops adding terms when errors are reduced to zero or a pre-pruning criterion has been satisfied
- error measured by the Laplacian expected error, (e + 1) / (n + 1), where e is the number of errors among the n extractions the rule makes on the training set
repeats this until the set of rules covers all training instances
post-pruning removes rules to prevent overfitting
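
A small worked example of the Laplacian expected error, (e + 1) / (n + 1), shows why it is a useful selection metric: a rule with no errors but very few extractions still gets a high estimate. The function name `laplacian_error` and the counts are made up for illustration.

```python
def laplacian_error(errors, extractions):
    """Laplacian expected error: (e + 1) / (n + 1)."""
    return (errors + 1) / (extractions + 1)


# A perfect rule with one extraction still looks risky ...
print(laplacian_error(0, 1))   # (0 + 1) / (1 + 1)
# ... while a perfect rule with nine extractions looks much safer.
print(laplacian_error(0, 9))   # (0 + 1) / (9 + 1)
# One error in nine extractions.
print(laplacian_error(1, 9))   # (1 + 1) / (9 + 1)
```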

11
WHISK algorithm
WHISK(Reservoir)
  RuleSet = NULL
  Training = NULL
  Repeat at user's request
    Select a batch of NewInst from Reservoir
    (User tags the NewInst)
    Add NewInst to Training
    Discard rules with errors on NewInst
    For each Inst in Training
      For each Tag of Inst
        If Tag is not covered by RuleSet
          Rule = GROW_RULE(Inst, Tag, Training)
    Prune RuleSet
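
The top-level loop can be sketched in Python with the rule-specific parts passed in as functions. Everything prefixed `toy_` is a stand-in invented for this sketch (a "rule" is merely the token that must follow the value to extract), and a fixed number of rounds replaces "repeat at user's request"; none of this is WHISK's actual rule learner.

```python
def whisk(reservoir, tag, grow_rule, covers, has_error, prune,
          batch_size=2, rounds=2):
    """Covering loop from the slide; all callables are supplied by the caller."""
    reservoir = list(reservoir)          # do not mutate the caller's list
    rule_set, training = [], []
    for _ in range(rounds):              # stands in for "repeat at user's request"
        batch, reservoir = reservoir[:batch_size], reservoir[batch_size:]
        new_inst = [tag(text) for text in batch]   # the user tags the batch
        training.extend(new_inst)
        # Discard rules that make errors on the newly tagged instances.
        rule_set = [r for r in rule_set
                    if not any(has_error(r, inst) for inst in new_inst)]
        for inst in training:
            for t in inst["tags"]:
                if not any(covers(r, inst, t) for r in rule_set):
                    rule_set.append(grow_rule(inst, t, training))
        rule_set = prune(rule_set, training)
    return rule_set


# Toy stand-ins (assumptions): every digit token is a tagged slot filler,
# and a rule is the single token that must follow the filler.
def toy_tag(text):
    tokens = text.split()
    return {"tokens": tokens, "tags": [t for t in tokens if t.isdigit()]}


def toy_grow_rule(inst, t, training):
    tokens = inst["tokens"]
    return tokens[tokens.index(t) + 1]   # assumes the filler is not the last token


def toy_covers(rule, inst, t):
    tokens = inst["tokens"]
    return any(a == t and b == rule for a, b in zip(tokens, tokens[1:]))


def toy_has_error(rule, inst):
    tokens = inst["tokens"]
    return any(b == rule and a not in inst["tags"]
               for a, b in zip(tokens, tokens[1:]))


rules = whisk(["3 br house", "2 br flat", "1 bdrm apt"], toy_tag, toy_grow_rule,
              toy_covers, toy_has_error, prune=lambda rs, tr: rs)
print(rules)
```

In the first round the rule grown from "3 br house" already covers "2 br flat", so no second rule is grown; the third instance, added in the next round, is uncovered and triggers a new rule.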
12
Anchoring extraction slots

Creating two base rules:
- a rule with terms added within the extraction boundary
- a rule with terms added just outside the boundary
Example:
Anchoring Slot 1
  Base_1: * ( Neighbr )
  Base_2: '@start' ( * ) ' -'
Anchoring Slot 2
  Base_1: * ( Neighbr ) * ( Digit )
  Base_2: * ( Neighbr ) * '- ' ( * ) ' br'

13
WHISK algorithm
GROW_RULE(Inst, Tag, Training)
  Rule = empty rule (terms replaced by wildcards)
  For i = 1 to number of slots in Tag
    ANCHOR(Rule, Inst, Tag, Training, i)
  Do until Rule makes no errors on Training
      or no improvement in Laplacian
    EXTEND_RULE(Rule, Inst, Tag, Training)
14
Extending Rules

every term in the instance is considered
each proposed extension is tested on the training set
the best-performing extension is selected
heuristics for equally performing extensions:
- semantic classes and syntactic tags are preferred
- proximity to the slot is preferred
extending continues until the error is sufficiently reduced
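
The selection step above can be sketched as a minimum over candidate extensions, with the two heuristics used as tie-breakers. The `Extension` record, its field names and the order in which the tie-breakers are applied (class or tag first, then proximity) are assumptions of this sketch; the slides do not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extension:
    """One candidate term with its performance on the training set."""
    term: str
    errors: int
    extractions: int
    is_class_or_tag: bool     # semantic class or syntactic tag?
    distance_to_slot: int     # tokens between the term and the slot

    @property
    def laplacian(self):
        return (self.errors + 1) / (self.extractions + 1)


def best_extension(candidates):
    """Lowest Laplacian wins; ties go to classes/tags, then to nearer terms."""
    return min(
        candidates,
        key=lambda c: (c.laplacian, not c.is_class_or_tag, c.distance_to_slot),
    )


candidates = [
    Extension("br", errors=0, extractions=9, is_class_or_tag=False, distance_to_slot=1),
    Extension("Bdrm", errors=0, extractions=9, is_class_or_tag=True, distance_to_slot=1),
    Extension("Capitol", errors=2, extractions=9, is_class_or_tag=False, distance_to_slot=5),
]
print(best_extension(candidates).term)
```

Here "br" and "Bdrm" perform equally, so the semantic class is chosen; "Capitol" is ruled out by its higher error.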

15
Pruning

pre-pruning: stop induction if the error is below a certain threshold (for example, a Laplacian threshold of 0.1)
post-pruning: remove rules that have low coverage, to prevent overfitting
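
Both pruning steps reduce to simple threshold checks. A minimal sketch, assuming each rule is a dict with an `extractions` count; the default `min_coverage` of 3 is a made-up value, while the 0.1 threshold is the example given on the slide.

```python
def laplacian(errors, extractions):
    """Laplacian expected error: (e + 1) / (n + 1)."""
    return (errors + 1) / (extractions + 1)


def should_stop_growing(errors, extractions, threshold=0.1):
    """Pre-pruning: stop extending once the expected error is below threshold."""
    return laplacian(errors, extractions) < threshold


def post_prune(rules, min_coverage=3):
    """Post-pruning: keep only rules whose coverage reaches min_coverage."""
    return [r for r in rules if r["extractions"] >= min_coverage]


print(should_stop_growing(errors=0, extractions=10))   # 1/11 is below 0.1
print(should_stop_growing(errors=1, extractions=9))    # 2/10 is not

rules = [
    {"name": "rule_a", "extractions": 12},
    {"name": "rule_b", "extractions": 1},   # low coverage, likely overfitted
]
print([r["name"] for r in post_prune(rules)])
```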

16
Experiments

Structured text: CNN Weather Forecast, BigBook Telephone Directory
Semi-structured text: Rental Ads, Seminar Announcements
Free text: News articles

17
Results

Structured text: 100% recall, 100% precision after only a few examples
Semi-structured text: more examples necessary; recall 92%, precision 96%
Free text: recall 46%, precision 67% after many, many training examples
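
For reference, recall and precision follow the usual definitions: recall is the share of target slots that were extracted correctly, and precision is the share of extractions that were correct. The counts below are hypothetical, chosen only so that the result lands near the free-text figures; they are not taken from the experiments.

```python
def precision_recall(correct, extracted, total_targets):
    """Return (precision, recall) from raw extraction counts."""
    precision = correct / extracted if extracted else 0.0
    recall = correct / total_targets if total_targets else 0.0
    return precision, recall


# Hypothetical counts: 100 target slots, 69 extractions, 46 of them correct.
precision, recall = precision_recall(correct=46, extracted=69, total_targets=100)
print(f"precision {precision:.0%}, recall {recall:.0%}")
```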

18
Results

regularities make learning rules easier
free text:
- concepts are often described differently
- many instances, of which only a few contain target slots, leading to many low-coverage, high-error rules

19
Concluding remarks