1
Learning Information Extraction Rules for
Semi-Structured and Free Text
  • Presented by Daniyal Alghazzawi
  • November 16, 2005

2
TOC
  • Contributions of WHISK
  • Rules in WHISK
  • WHISK Algorithm (Supervised Learning)
  • Empirical Results

4
Contributions of WHISK
  • The first system to learn text extraction rules
    for the full range of text styles: structured,
    semi-structured, and free text
  • Multi-slot extraction
  • Syntactic analysis
  • Semantic tagging
  • More automated (less hand work)
  • Does not require trimming extraneous words

6
Structured Text Style
  • Fixed order of relevant information
  • HTML tags delimit the strings to be extracted
  • Examples: CNN weather forecast, BigBook searchable
    telephone directory

8
Semi-Structured Text Style
  • Telegraphic abbreviations → semantic tagging required
  • Neighborhood = (capitol hill | downtown | ...)
  • bedroom = (br | brs | bds | bdrm | bedrooms)
  • No grammar → a syntactic analyzer will not work
  • Examples: apartment rental ads, medical records,
    equipment maintenance logs

12
Free Text Style
  • Needs syntactic analysis (e.g. the SUBJ tag)
  • May need semantic tagging (e.g. the PN and PS tags)
  • Example: management succession

16
Single-Slot vs. Multi-Slot
  • Single-slot: each extracted item is isolated
  • Multi-slot: extracted items are kept related
    to one another
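
The difference can be made concrete with a small sketch. The `Rental` record, the field names, and the values are hypothetical illustrations and are not taken from the slides; the point is only that multi-slot output keeps related values in one record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rental:
    """One multi-slot record: the three values belong together."""
    neighborhood: str
    bedrooms: int
    price: int


# Single-slot output: each slot is filled in isolation, so nothing records
# which price belongs to which bedroom count.
single_slot = {
    "neighborhood": ["Capitol Hill"],
    "bedrooms": [1, 3],
    "price": [675, 995],
}

# Multi-slot output: related values stay together, one record per rental.
multi_slot = [
    Rental("Capitol Hill", 1, 675),
    Rental("Capitol Hill", 3, 995),
]


def price_for(records, bedrooms):
    """Answer a relational question that needs the slots to stay linked."""
    return [r.price for r in records if r.bedrooms == bedrooms]


print(price_for(multi_slot, 3))
```

With the single-slot lists alone, the question "what does the 3-bedroom unit cost?" cannot be answered reliably, because the pairing between values is lost.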

22
Rules in WHISK
  • Structured and semi-structured text: rules use
    semantic tags
  • Free text: rules use semantic tags and syntactic
    tags

24
A Rule Using Semantic Tags
  • Input (semi-structured text)
  • Output

26
A Rule Using Semantic Tags
  • Input
  • The Rule
  • Rule notation
  • * : skip any characters until the next term occurs
  • ( X ) : the phrase matching X is extracted and stored
  • Italicized word not enclosed in quotes: a class name
  • Single quotes: the quoted text must match exactly
  • Nghbr = (Capitol Hill | Downtown | ...)
  • Digit = (0 | 1 | 2 | 3 | ... | 9)
  • Bdrm = (br | brs | bds | bedroom | ...)
  • Number = (0 | 1 | 2 | 3 | ... | inf)
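
The rule notation described on these slides (wildcard, parenthesized extraction, class names, quoted literals) can be approximated with ordinary regular expressions. The following is a minimal sketch and not the WHISK implementation: the rule string, the ad text, the class members, and the function names `class_to_regex`, `compile_rule`, and `extract` are all assumptions made for illustration, since the slide images with the actual input and rule did not survive in this transcript.

```python
import re

# Hypothetical semantic classes, modeled on the class callouts in the slides.
SEMANTIC_CLASSES = {
    "Nghbr": ["Capitol Hill", "Downtown"],
    "Bdrm": ["br", "brs", "bds", "bdrm", "bedroom", "bedrooms"],
    "Digit": [str(d) for d in range(10)],
}


def class_to_regex(name):
    """Return a regex fragment for a semantic class name."""
    if name == "Number":
        return r"\d+"  # any non-negative integer
    # Try longer members first so "brs" is preferred over "br".
    members = sorted(SEMANTIC_CLASSES[name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(m) for m in members) + ")"


def compile_rule(pattern):
    """Translate a WHISK-style pattern string into a compiled regex."""
    parts = []
    for tok in re.findall(r"'[^']*'|[()*]|\w+", pattern):
        if tok == "*":
            parts.append(".*?")  # skip characters until the next term matches
        elif tok in ("(", ")"):
            parts.append(tok)  # parentheses mark a phrase to extract
        elif tok.startswith("'"):
            parts.append(re.escape(tok[1:-1]))  # quoted text must match exactly
        else:
            parts.append(class_to_regex(tok))  # unquoted word is a class name
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def extract(pattern, slots, text):
    """Apply the rule once and pair each extracted phrase with a slot name."""
    match = compile_rule(pattern).search(text)
    return dict(zip(slots, match.groups())) if match else None


# Illustrative rule and ad text (assumptions, not copied from the slides).
RULE = "* ( Nghbr ) * ( Digit ) ' ' Bdrm * '$' ( Number )"
AD = "Capitol Hill - 1 br twnhme. fplc D/W W/D. Undrgrnd pkg incl $675."

print(extract(RULE, ["Neighborhood", "Bedrooms", "Price"], AD))
```

The regex translation works on characters, so it only approximates the wildcard behaviour described on the slide; it is meant to show how the four notation elements combine in one rule.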

41
A Rule Using Syntactic and Semantic Tags
  • Input (free text)
  • Output

42
A Rule Using Syntactic and Semantic Tags
  • Input
  • The Rule

54
WHISK Algorithm (Supervised Learning)
  • Choosing instance (by the system)
  • Create tags (by the user)
  • Create rules (by the system)
  • Check reliability (by the system)
  • Store the rules

56
Choosing Instances Automatically
  • Instances are delimited by using
  • HTML tags
  • Regular expressions
  • Sentence analysis
  • Semantic tags or syntactic annotations can be added
    beforehand
  • Training samples are selected automatically from the
    reservoir, to reduce the training effort, by
    dividing its instances into three classes
  • Instances covered by an existing rule
  • Instances that are near misses of a rule
  • Instances not covered by any rule
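
The three classes can be sketched as a simple partition of the reservoir. This is an illustration under stated assumptions: rules are represented as lists of regex terms, and a "near miss" is approximated as an instance matched by a rule with one term dropped. The slides do not define the near-miss test, and the names `covers`, `classify`, `RULES`, and `RESERVOIR` are hypothetical.

```python
import re


def covers(terms, text):
    """A rule covers a text if its terms match in order, with gaps allowed."""
    return re.search(".*?".join(terms), text, re.IGNORECASE) is not None


def classify(text, rules):
    """Return 'covered', 'near miss', or 'uncovered' for one instance."""
    if any(covers(rule, text) for rule in rules):
        return "covered"
    for rule in rules:
        for i in range(len(rule)):
            # Assumed near-miss test: the rule matches once one term is dropped.
            relaxed = rule[:i] + rule[i + 1:]
            if relaxed and covers(relaxed, text):
                return "near miss"
    return "uncovered"


# Illustrative rule and reservoir (assumptions, not data from the slides).
RULES = [[r"Capitol Hill", r"\d br", r"\$\d+"]]
RESERVOIR = [
    "Capitol Hill - 1 br twnhme $675",
    "Capitol Hill - 2 bdrm apt $800",
    "Call now for details",
]

for ad in RESERVOIR:
    print(classify(ad, RULES), "<-", ad)
```

Sampling from each class separately lets the user's tagging effort go to instances that are likely to teach the system something new.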

58
Hand-Tagging
  • An instance has been chosen by the system
  • The user tags the instance by hand
  • Figure labels: Instance, Seed, Tag

62
Creating A Rule
  • The instance seed

63
Creating A Rule
  • The instance seed
  • Anchoring Slot 1
  • Base_1: * ( Nghbr )
  • Base_2: '@start' ( * ) ' -'

66
Creating A Rule
  • The instance seed
  • Anchoring Slot 2
  • Base_1: * ( Nghbr ) * ( Digit )
  • Base_2: * ( Nghbr ) ' - ' ( * ) ' br'

68
Creating A Rule
  • The instance seed
  • Anchoring Slot 3
  • Base_1: * ( Nghbr ) * ( Digit ) * ( Number )
  • Base_2: * ( Nghbr ) * ( Digit ) * '$' ( * ) '.'

69
WHISK Algorithm (Supervised Learning)
  • Choosing instances automatically
  • Creating hand-tagged training instances
  • Creating a rule automatically from a seed
    instance
  • Anchoring the extraction slots
  • Adding terms to a proposed rule
  • Estimating reliability using the Laplacian

70
Reliability Using Laplacian
  • Hill Climbing and Horizon Effects
  • Pre-Pruning
  • Post-Pruning
  • When to stop tagging?
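
For reference, the Laplacian expected error used by WHISK is (e + 1) / (n + 1), where n is the number of extractions a rule makes on the training set and e is the number of those that are errors. The sketch below computes it and applies a threshold filter as one plausible form of pruning; the threshold value, the rule names, and the counts are assumptions for illustration only.

```python
def laplacian_expected_error(extractions, errors):
    """Laplacian expected error: (e + 1) / (n + 1)."""
    if extractions < 0 or errors < 0 or errors > extractions:
        raise ValueError("need 0 <= errors <= extractions")
    return (errors + 1) / (extractions + 1)


def prune(rule_stats, max_error):
    """Keep the rules whose expected error is at or below the threshold."""
    return [
        name
        for name, (n, e) in rule_stats.items()
        if laplacian_expected_error(n, e) <= max_error
    ]


# Hypothetical counts per rule: (extractions, errors) on the training set.
RULE_STATS = {
    "rule_a": (20, 0),  # well supported, no errors
    "rule_b": (1, 0),   # no errors, but only one extraction as evidence
    "rule_c": (10, 4),  # frequent errors
}

for name, (n, e) in RULE_STATS.items():
    print(name, round(laplacian_expected_error(n, e), 3))
print(prune(RULE_STATS, max_error=0.1))
```

Note that a rule with a single correct extraction still scores 0.5: the estimate penalizes rules that have little supporting evidence, not only rules that make errors.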

72
Empirical Results
  • Recall is the percentage of relevant information
    that is correctly reported by the system.
  • Precision is the percentage of the information
    reported as relevant by the system that is
    correct.
  • Recall = TP / (TP + FN)
  • Precision = TP / (TP + FP)
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
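
As a worked example of these three measures, the sketch below evaluates them on made-up counts. The numbers are illustrative assumptions and are not results from the presentation.

```python
def recall(tp, fn):
    """Share of the relevant items that the system reported."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of the reported items that are correct."""
    return tp / (tp + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all decisions, positive and negative, that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion counts for one extraction task.
TP, FP, FN, TN = 90, 10, 30, 70

print("recall   ", recall(TP, FN))
print("precision", precision(TP, FP))
print("accuracy ", accuracy(TP, TN, FP, FN))
```

With these counts the system is precise (few wrong extractions) but misses a quarter of the relevant items, which is the kind of trade-off the two measures are meant to expose.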

73
Empirical Results
  • Structured Text (CNN weather domain)
  • Recall: 100%
  • Precision: 100%

74
Empirical Results
  • Structured Text (BigBook domain)
  • Recall: 100%
  • Precision: 100%

75
Empirical Results
  • Semi-Structured Text (Rental ads)

76
Empirical Results
  • Semi-Structured Text (Software jobs)

77
Empirical Results
  • Free Text (Management succession)

78
Conclusion
  • First system to learn extraction rules for the
    full range of text styles: structured,
    semi-structured, and free text
  • Multi-slot extraction
  • Syntactic analysis
  • Semantic tagging
  • More automated (less hand work)
  • Does not require trimming extraneous words

79
Thank You