Title: Learning Information Extraction Rules for Semi-Structured and Free Text
1Learning Information Extraction Rules for
Semi-Structured and Free Text
- Presented by Daniyal Alghazzawi
- November 16, 2005
2TOC
- Contributions of WHISK
- Rules in WHISK
- WHISK Algorithm (Supervised Learning)
- Empirical Results
4Contributions of WHISK
- The first system to learn text extraction rules
for the full range of text styles:
- Structured Text
- Semi-Structured Text
- Free Text
- Multi-Slot
- Syntactic Analysis
- Semantic Tagging
- More Automated (Less Hand-work)
- Does not require trimming extraneous words
6Structured Text Style
- Fixed order of relevant information
- HTML tags delimit strings to be extracted
- Examples
- CNN Weather Forecast
- BigBook searchable telephone directory
8Semi-Structured Text Style
- Telegraphic abbreviations, so semantic tagging is required
- No grammar, so a syntactic analyzer will not work
- Examples
- Apartment rental ads
- Medical records
- Equipment maintenance logs
9Semi-Structured Text Style
Neighborhood = (capitol hill | downtown)
10Semi-Structured Text Style
Bedroom = (br | brs | bds | bdrm | bedrooms)
12Free Text Style
- Needs syntactic analysis
- May need semantic tagging
- Examples
- Management Succession
13Free Text Style
SUBJ - PN
14Free Text Style
SUBJ - PS
16Single-Slot vs. Multi-Slot
22Rules in WHISK
- Structured and Semi-Structured Text
- Semantic Tags
- Free-Text
- Semantic Tags, and
- Syntactic Tags
24A Rule Using Semantic Tags
25A Rule Using Semantic Tags
- Input (Semi-Structured text)
- Output
27A Rule Using Semantic Tags
Wildcard *: skip any characters until the next term occurs
28A Rule Using Semantic Tags
( X ): the phrase X is extracted and stored
29A Rule Using Semantic Tags
Italicized word not enclosed in quotes: a semantic class name
30A Rule Using Semantic Tags
Nghbr = (Capitol Hill | Downtown)
33A Rule Using Semantic Tags
Digit = (0 | 1 | 2 | 3 | ... | 9)
34A Rule Using Semantic Tags
Single-quoted term: match exactly
35A Rule Using Semantic Tags
Bdrm = (br | brs | bds | bedroom)
38A Rule Using Semantic Tags
Number = (0 | 1 | 2 | 3 | ... | inf)
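The rule elements described on the preceding slides (wildcard, extraction parentheses, semantic classes, quoted literals) can be sketched as a small translator into a Python regular expression. This is a minimal illustration under stated assumptions, not WHISK's implementation: the term-list format, the helper names `term_to_regex` and `compile_rule`, the contents of `SEMANTIC_CLASSES`, the example rule, and the sample ad are all chosen here for the example.

```python
import re

# Hypothetical semantic classes; the member lists are illustrative only.
SEMANTIC_CLASSES = {
    "Nghbr": ["Capitol Hill", "Downtown"],
    "Bdrm": ["bedrooms", "bedroom", "bdrm", "brs", "bds", "br"],
    "Digit": [str(d) for d in range(10)],
}


def term_to_regex(term):
    """Translate one rule term into a regex fragment."""
    if term == "*":
        # Wildcard: skip as few characters as possible until the next term.
        return r".*?"
    if len(term) >= 2 and term.startswith("'") and term.endswith("'"):
        # Quoted literal: match exactly.
        return re.escape(term[1:-1])
    if term == "Number":
        # Assumption: a Number is a run of decimal digits.
        return r"\d+"
    if term in SEMANTIC_CLASSES:
        # Try longer alternatives first so "brs" is preferred over "br".
        words = sorted(SEMANTIC_CLASSES[term], key=len, reverse=True)
        return "(?:" + "|".join(re.escape(w) for w in words) + ")"
    raise ValueError(f"unknown term: {term!r}")


def compile_rule(terms):
    """Compile a list of terms; bare '(' and ')' delimit an extraction slot."""
    parts = []
    for term in terms:
        if term in ("(", ")"):
            parts.append(term)  # becomes a regex capture group
        else:
            parts.append(term_to_regex(term))
    return re.compile("".join(parts), re.IGNORECASE)


# Illustrative three-slot rule: neighborhood, bedroom count, price.
rule = compile_rule(
    ["*", "(", "Nghbr", ")", "*", "(", "Digit", ")", "' '", "Bdrm",
     "*", "'$'", "(", "Number", ")"]
)
ad = "Capitol Hill - 1 br twnhme. fplc D/W W/D. Undrgrnd pkg incl $675."
match = rule.match(ad)
slots = match.groups() if match else None
print(slots)
```

Each pair of parentheses becomes one capture group, so a multi-slot rule fills all of its slots from a single match.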
41A Rule Using (Syntactic and Semantic)
54WHISK Algorithm (Supervised Learning)
- Choosing Instance (By the System)
- Create Tags (By the User)
- Create Rules (By the System)
- Check Reliability (By the System)
- Store the Rules
56Choosing Instances Automatically
- By using
- HTML tags
- Regular expression
- Sentence analysis
- Pre-adding semantic tags or syntactic annotations
- Selecting samples automatically from the
reservoir, to reduce the training effort, by
dividing them into three classes:
- Instances covered by an existing rule
- Instances that are near misses of a rule
- Instances not covered by any rule
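The three sampling classes above can be sketched as a simple partition of the untagged reservoir. This is a minimal sketch, not the system's actual procedure: rules are represented here as pairs of compiled regular expressions, and treating a "near miss" as an instance that matches a relaxed form of a rule but not its strict form is an assumption made for the example. The function name `classify_instances` and the sample texts are hypothetical.

```python
import re
from collections import defaultdict


def classify_instances(instances, rules):
    """Partition untagged instances into the three sampling classes.

    `rules` is a list of (strict, relaxed) compiled-regex pairs. An instance
    is "covered" if any strict form matches, a "near miss" if only a relaxed
    form matches, and "not covered" otherwise (assumed definitions).
    """
    classes = defaultdict(list)
    for text in instances:
        if any(strict.search(text) for strict, _ in rules):
            classes["covered"].append(text)
        elif any(relaxed.search(text) for _, relaxed in rules):
            classes["near_miss"].append(text)
        else:
            classes["not_covered"].append(text)
    return dict(classes)


rules = [(
    re.compile(r"(\d) br\b.*?\$(\d+)"),  # strict: bedroom count and price
    re.compile(r"(\d) br\b"),            # relaxed: price term dropped
)]
reservoir = [
    "Capitol Hill - 1 br twnhme. $675",
    "Downtown 2 br, call for price",
    "Studio apt near campus",
]
classes = classify_instances(reservoir, rules)
```

Drawing the next instances to tag from each class separately lets the user spend tagging effort on near misses and uncovered instances rather than on text the current rules already handle.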
60Hand-Tagged
- An instance has been chosen by the system
- The user hand-tags the instance
Instance
Seed
Tag
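A hand-tagged training instance pairs the instance text with the output the user expects from it. One possible representation is sketched below; the class name `TaggedInstance`, its fields, the slot names, and the sample ad are assumptions for illustration, not the format used by the system.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedInstance:
    """One training instance plus the user's hand-entered tags."""
    text: str
    # Each tag is one case frame: slot name -> filler string.
    tags: list = field(default_factory=list)

    def fillers_present(self):
        """Check that every tagged filler occurs verbatim in the text."""
        return all(
            filler in self.text
            for frame in self.tags
            for filler in frame.values()
        )


seed = TaggedInstance(
    text="Capitol Hill - 1 br twnhme. fplc D/W W/D. Undrgrnd pkg incl $675.",
    tags=[{"Neighborhood": "Capitol Hill", "Bedrooms": "1", "Price": "675"}],
)
print(seed.fillers_present())
```

Checking that each filler appears in the text is a cheap sanity test on the hand-tagging before the instance is used as a seed.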
64Creating A Rule
- The instance seed
- Anchoring Slot 1
- Base_1: * ( Nghbr )
- Base_2: '@start' ( * ) ' -'
66Creating A Rule
- The instance seed
- Anchoring Slot 2
- Base_1: * ( Nghbr ) * ( Digit )
- Base_2: * ( Nghbr ) * '- ' ( * ) ' br'
68Creating A Rule
- The instance seed
- Anchoring Slot 3
- Base_1: * ( Nghbr ) * ( Digit ) * ( Number )
- Base_2: * ( Nghbr ) * ( Digit ) * '$' ( * ) '.'
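The two base rules proposed for each slot differ in where they place their terms: Base_1 uses a term inside the extraction, and Base_2 uses the terms just outside it. A minimal sketch of that idea for a single slot follows. The function `anchor_slot`, the whitespace tokenization, the `'@end'` marker, and the class contents are assumptions for illustration; the sketch also produces only the fragment for one slot and does not carry over the terms already chosen for earlier slots.

```python
def anchor_slot(text, filler, classes):
    """Propose two base rule fragments that extract `filler` from `text`.

    Base_1 uses a term inside the extraction: a semantic class containing
    the filler, else the quoted filler itself. Base_2 uses the tokens just
    outside the extraction as delimiters around a wildcard.
    """
    start = text.index(filler)  # first occurrence; enough for this sketch
    end = start + len(filler)

    inside = next(
        (name for name, members in classes.items() if filler in members),
        f"'{filler}'",
    )
    base_1 = ["*", "(", inside, ")"]

    before = text[:start].split()
    after = text[end:].split()
    left = ["*", f"'{before[-1]}'"] if before else ["'@start'"]
    right = [f"'{after[0]}'"] if after else ["'@end'"]  # '@end' is hypothetical
    base_2 = left + ["(", "*", ")"] + right
    return base_1, base_2


classes = {"Nghbr": {"Capitol Hill", "Downtown"}, "Digit": set("0123456789")}
seed_text = "Capitol Hill - 1 br twnhme. fplc D/W W/D. Undrgrnd pkg incl $675."
slot_1 = anchor_slot(seed_text, "Capitol Hill", classes)
slot_2 = anchor_slot(seed_text, "1", classes)
print(slot_1)
print(slot_2)
```

Whichever base rule performs better on the training set would then be kept as the starting point for the next slot.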
69WHISK Algorithm (Supervised Learning)
- Choosing instances automatically
- Creating hand-tagged training instances
- Creating a rule automatically from a seed
instance
- Anchoring the extraction slots
- Adding terms to a proposed rule
- Estimating reliability using the Laplacian expected error
70Reliability Using Laplacian
- Hill Climbing and Horizon Effects
- Pre-Pruning
- Post-Pruning
- When to stop tagging?
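The Laplacian expected error of a rule is (e + 1) / (n + 1), where n is the number of extractions the rule makes on the training set and e is the number of those extractions that are errors. The sketch below computes it and applies a simple pre-pruning filter. The function names, the candidate counts, and the 0.1 threshold are assumptions chosen for illustration, not values taken from the presentation.

```python
def laplacian_expected_error(extractions, errors):
    """Laplacian expected error: (e + 1) / (n + 1)."""
    if extractions < 0 or not 0 <= errors <= extractions:
        raise ValueError("need 0 <= errors <= extractions")
    return (errors + 1) / (extractions + 1)


def pre_prune(rules, max_error=0.1):
    """Keep rules whose Laplacian is at or below `max_error`.

    `rules` maps a rule name to an (extractions, errors) pair.
    The default threshold is an assumed value for this sketch.
    """
    return {
        name: counts
        for name, counts in rules.items()
        if laplacian_expected_error(*counts) <= max_error
    }


candidates = {
    "rule_a": (20, 0),  # 1/21, about 0.048
    "rule_b": (2, 0),   # 1/3: no errors, but too little coverage
    "rule_c": (20, 5),  # 6/21, about 0.286
}
kept = pre_prune(candidates)
print(kept)
```

The measure penalizes low coverage as well as errors: a rule with no errors on only two extractions scores worse than one with no errors on twenty.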
72Empirical Results
- Recall is the percentage of relevant information
that is correctly reported by the system.
- Precision is the percentage of the information
reported as relevant by the system that is
correct.
- Recall = TP / (TP + FN)
- Precision = TP / (TP + FP)
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
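The three measures can be computed directly from the four confusion counts. The sketch below uses made-up counts purely to exercise the formulas; they are not results from the presentation, and the function names are chosen for this example.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN); 0.0 when there is nothing to find."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 when nothing was reported."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Illustrative counts only.
tp, tn, fp, fn = 8, 86, 2, 4
r = recall(tp, fn)
p = precision(tp, fp)
a = accuracy(tp, tn, fp, fn)
print(f"recall={r:.3f} precision={p:.3f} accuracy={a:.3f}")
```

With many true negatives, accuracy stays high even when recall is modest, which is why extraction results are usually reported as recall and precision.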
73Empirical Results
- Structured Text (CNN weather domain)
- Recall: 100%
- Precision: 100%
74Empirical Results
- Structured Text (BigBook domain)
- Recall: 100%
- Precision: 100%
75Empirical Results
- Semi-Structured Text (Rental ads)
76Empirical Results
- Semi-Structured Text (Software jobs)
77Empirical Results
- Free Text (Management succession)
78Conclusion
- First system that learns extraction rules for the
full range of text styles:
- Structured
- Semi-Structured
- Free Text
- Multi-Slot
- Syntactic Analyzer
- Semantic Tagging
- More Automated (Less Hand-work)
- Does not require trimming extraneous words
79Thank You