Title: Learning Information Extraction Rules for Semi-Structured and Free Text
Learning Information Extraction Rules for Semi-Structured and Free Text
WHISK learning IE rules / Frans Adriaans 1
Overview
- Introduction
- WHISK representation
- WHISK algorithm
- Results
- Concluding remarks
Introduction
- Information Extraction (IE)
- structured, semi-structured and free text
- learning rules for extraction
- WHISK: supervised learning, using hand-tagged examples
WHISK rule representation
- regular expressions
- single-slot vs. multi-slot extraction
- user-defined semantic classes, e.g. Bdrm = (brs|br|bds|bdrm|bd|bedroom|bed)
- free text requires preprocessing
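As a rough illustration of the bullets above, a semantic class can be read as a named disjunction of tokens, and a multi-slot rule as a pattern that fills several slots in one match. The sketch below uses Python's re module as a stand-in, not WHISK's own rule syntax; the sample ad text, the slot names and the helper extract_rental are assumptions made for illustration.

```python
import re

# Semantic class from the slide: any of these tokens counts as "Bdrm".
BDRM_TOKENS = ["brs", "br", "bds", "bdrm", "bd", "bedroom", "bed"]
# Longest alternatives first, so "bedroom" is tried before "bed".
BDRM = "(?:" + "|".join(sorted(BDRM_TOKENS, key=len, reverse=True)) + ")"

# Multi-slot pattern (assumed for illustration):
# slot 1 = number of bedrooms, slot 2 = price.
RENTAL = re.compile(
    r"(\d+)\s*" + BDRM + r"\b.*?\$\s*(\d+)",
    re.IGNORECASE | re.DOTALL,
)


def extract_rental(text):
    """Return one {bedrooms, price} dict per match of the pattern."""
    return [
        {"bedrooms": int(m.group(1)), "price": int(m.group(2))}
        for m in RENTAL.finditer(text)
    ]


# Made-up sample ad text in the style of a rental listing.
ad = "Capitol Hill - 1 br twnhme. Undrgrnd pkg incl $675. 3 BR, upper flr, grt loc $995."
print(extract_rental(ad))
```

A single-slot rule would correspond to a pattern with one capture group; the multi-slot form keeps the bedroom count and the price of the same listing together.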
Example 1
Semi-structured text
Example 1
WHISK pattern
WHISK output
Example 2
Grammatical text
Example 2
WHISK pattern
WHISK output
Creating training instances
- supervised learning: hand-tagged training instances
- tagging interleaved with learning process
- selective sampling of
  1) instances covered by an existing rule
  2) instances that are near misses of a rule
  3) instances not covered by any rule
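The three sampling classes above can be sketched as a simple partition of the untagged reservoir. This is a toy model, not WHISK's actual procedure: a rule is assumed to be a set of required tokens, an instance is "covered" when all of them are present, and a "near miss" is assumed to mean that exactly one required token is missing. The names classify and select_batch are hypothetical.

```python
import random


def classify(instance_tokens, rules):
    """Return 'covered', 'near_miss' or 'uncovered' for one untagged instance."""
    tokens = set(instance_tokens)
    best = "uncovered"
    for rule in rules:  # each rule: a set of required tokens (toy assumption)
        missing = len(rule - tokens)
        if missing == 0:
            return "covered"
        if missing == 1:  # assumed definition of a near miss
            best = "near_miss"
    return best


def select_batch(reservoir, rules, per_class=1, seed=0):
    """Draw up to per_class instances from each of the three classes."""
    rng = random.Random(seed)
    buckets = {"covered": [], "near_miss": [], "uncovered": []}
    for inst in reservoir:
        buckets[classify(inst, rules)].append(inst)
    batch = []
    for name in ("covered", "near_miss", "uncovered"):
        pool = buckets[name]
        batch.extend(rng.sample(pool, min(per_class, len(pool))))
    return batch


rules = [{"br", "$"}]
reservoir = [
    ["3", "br", "$", "995"],      # covered by the rule
    ["3", "br", "upper", "flr"],  # near miss: "$" is missing
    ["seminar", "at", "noon"],    # not covered by any rule
]
print(select_batch(reservoir, rules))
```

Sampling from all three classes lets the user check the precision of existing rules, refine rules that almost apply, and find cases that no rule handles yet.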
WHISK algorithm
- begins with an empty rule
- adds one term at a time, using a selection metric
- stops adding terms when errors are reduced to zero or a pre-pruning criterion has been satisfied
  - error measured by the Laplacian: (e + 1) / (n + 1), where e is the number of errors among n extractions on the training set
- repeats this until the set of rules covers all training instances
- post-pruning removes rules to prevent overfitting
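The Laplacian error above is easy to compute directly. The sketch below assumes e is the number of incorrect extractions a rule makes on the training set and n is the total number of extractions it makes; the function name laplacian is chosen here for illustration.

```python
def laplacian(errors, extractions):
    """Laplacian expected error (e + 1) / (n + 1) of a rule."""
    if errors < 0 or extractions < errors:
        raise ValueError("need 0 <= errors <= extractions")
    return (errors + 1) / (extractions + 1)


print(laplacian(0, 1))  # error-free rule covering a single extraction
print(laplacian(0, 9))  # error-free rule covering nine extractions
print(laplacian(1, 9))  # one error among nine extractions
```

The added 1 in numerator and denominator means that an error-free rule with low coverage still gets a high expected error, so rules that cover more of the training set are preferred.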
WHISK algorithm
WHISK(Reservoir)
  RuleSet = NULL
  Training = NULL
  Repeat at user's request
    Select a batch of NewInst from Reservoir
    (User tags the NewInst)
    Add NewInst to Training
    Discard rules with errors on NewInst
    For each Inst in Training
      For each Tag of Inst
        If Tag is not covered by RuleSet
          Rule = GROW_RULE(Inst, Tag, Training)
    Prune RuleSet
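One pass through the body of the Repeat loop can be sketched in Python as follows. This is only a skeleton under stated assumptions: instances are (text, tags) pairs, the callables grow_rule, covers and makes_error are hypothetical stand-ins supplied by the caller, and the pruning step is left out.

```python
def whisk_round(rule_set, training, new_batch, grow_rule, covers, makes_error):
    """Add newly tagged instances, discard rules that err on them,
    and grow a rule for every tag that no remaining rule covers.

    grow_rule(inst, tag, training) -> rule
    covers(rule, inst, tag) -> bool
    makes_error(rule, inst) -> bool
    """
    training = training + new_batch
    # Discard rules with errors on the new instances.
    rule_set = [
        r for r in rule_set
        if not any(makes_error(r, inst) for inst in new_batch)
    ]
    for inst in training:
        _, tags = inst
        for tag in tags:
            if not any(covers(r, inst, tag) for r in rule_set):
                rule_set.append(grow_rule(inst, tag, training))
    return rule_set, training


# Toy stand-ins: a "rule" is just the literal string it extracts.
toy_grow = lambda inst, tag, training: tag
toy_covers = lambda rule, inst, tag: rule == tag and rule in inst[0]
toy_error = lambda rule, inst: False

batch = [("3 br $995", ["995"]), ("1 br $675", ["675"])]
rules, training = whisk_round([], [], batch, toy_grow, toy_covers, toy_error)
print(rules)
```

Passing the rule-growing and coverage tests in as functions keeps the covering loop separate from how rules are represented.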
Anchoring extraction slots
- Creating two base rules
  - a rule with terms added within extraction boundary
  - a rule with terms added just outside boundary
- Example
  Anchoring Slot 1:
    Base_1: * ( Neighbr )
    Base_2: @start ( * ) -
  Anchoring Slot 2:
    Base_1: * ( Neighbr ) * ( Digit )
    Base_2: * ( Neighbr ) - ( * ) br
WHISK algorithm
GROW_RULE(Inst, Tag, Training)
  Rule = empty rule (terms replaced by wildcards)
  For i = 1 to number of slots in Tag
    ANCHOR(Rule, Inst, Tag, Training, i)
  Do until Rule makes no errors on Training
      or no improvement in Laplacian
    EXTEND_RULE(Rule, Inst, Tag, Training)
Extending Rules
- every term in instance is considered
- proposed extension is tested on training set
- best performing extension is selected
- heuristics for equally performing extensions
  - semantic classes and syntactic tags are preferred
  - proximity to slot is preferred
- extending continues until error is sufficiently reduced
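The selection step above can be sketched as a ranking over candidate extensions. The sketch assumes that performance is measured by the Laplacian (e + 1) / (n + 1) and that ties are broken first by term type and then by distance to the slot, in the order the slide lists them; the Candidate fields and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    term: str               # term proposed as an extension of the rule
    errors: int             # errors on the training set with this term added
    extractions: int        # extractions on the training set with this term added
    is_class_or_tag: bool   # True for a semantic class or syntactic tag
    distance_to_slot: int   # distance in tokens from the extraction slot


def laplacian(errors, extractions):
    return (errors + 1) / (extractions + 1)


def best_extension(candidates):
    """Lowest Laplacian wins; ties go to classes/tags, then to closer terms."""
    return min(
        candidates,
        key=lambda c: (
            laplacian(c.errors, c.extractions),
            not c.is_class_or_tag,  # False sorts first, so classes/tags win
            c.distance_to_slot,
        ),
    )


candidates = [
    Candidate("'br'", errors=0, extractions=9, is_class_or_tag=False, distance_to_slot=1),
    Candidate("Bdrm", errors=0, extractions=9, is_class_or_tag=True, distance_to_slot=3),
    Candidate("'$'", errors=2, extractions=9, is_class_or_tag=False, distance_to_slot=1),
]
print(best_extension(candidates).term)
```

Here the literal 'br' and the class Bdrm perform equally well, so the class is chosen because it is more likely to generalize to unseen ads.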
Pruning
- pre-pruning
  - stop induction if error is below a certain threshold (for example, a Laplacian of 0.1)
- post-pruning
  - remove rules that have low coverage to prevent overfitting
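Both pruning steps above reduce to simple threshold tests. In the sketch below the error is assumed to be the Laplacian (e + 1) / (n + 1), the 0.1 threshold is the slide's example value, and the minimum coverage of 3 as well as the rule dictionaries are made-up illustration values.

```python
def stop_induction(errors, extractions, threshold=0.1):
    """Pre-pruning: stop extending a rule once its Laplacian is below threshold."""
    return (errors + 1) / (extractions + 1) < threshold


def post_prune(rules, min_coverage=3):
    """Post-pruning: keep only rules that cover at least min_coverage instances."""
    return [r for r in rules if r["coverage"] >= min_coverage]


rules = [
    {"name": "rule_a", "coverage": 12},
    {"name": "rule_b", "coverage": 1},   # low coverage, likely overfitted
    {"name": "rule_c", "coverage": 5},
]
print([r["name"] for r in post_prune(rules)])
print(stop_induction(errors=0, extractions=19))
```

Pre-pruning accepts a small residual error in exchange for a shorter, more general rule; post-pruning removes rules that were fitted to only a handful of training instances.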
Experiments
- Structured text: CNN Weather Forecast, BigBook Telephone Directory
- Semi-structured text: Rental Ads, Seminar Announcements
- Free text: News articles
Results
- Structured text: 100% recall, 100% precision after only a few examples
- Semi-structured text: more examples necessary, recall 92%, precision 96%
- Free text: recall 46%, precision 67% after many, many training examples
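For reference, recall and precision as reported above can be computed from the set of extracted slot fillers and the set of hand-tagged ones. The sketch assumes exact matching of (slot, value) pairs; the sample data is made up and is not taken from the experiments.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted, recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = [("bedrooms", "1"), ("price", "675"), ("bedrooms", "3"),
        ("price", "995"), ("price", "1200")]
extracted = [("bedrooms", "1"), ("price", "675"), ("bedrooms", "3"),
             ("price", "206")]  # the last extraction is wrong
print(precision_recall(extracted, gold))
```

Recall drops when rules miss tagged slots, while precision drops when rules extract something that was not tagged.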
Results
- regularities make learning rules easier
- free text
  - concepts often described differently
  - many instances, only a few contain target slots -> many low-coverage, high-error rules
Concluding remarks
- structured, semi-structured and free text
- requires no hand-coded rules/keywords
- multi-slot extraction
- free text requires preprocessing (syntactic analysis, semantic tagger)
- performance worse than hand-coded rules
Questions?