Title: Relational Learning of Pattern-Match Rules for Information Extraction
1. Relational Learning of Pattern-Match Rules for Information Extraction
- Presentation by Tim Chartrand of a paper by Mary Elaine Califf and Raymond J. Mooney
2. Introduction
- Information Extraction (IE) is the task of locating specific pieces of information in NL text
- IE is an important subpart of text understanding
- IE systems are difficult and time-consuming to build, and they don't port well to different domains
- Researchers are combining learning methods with NLP methods to automate IE
3. Overview of RAPIER
- RAPIER: Robust Automated Production of Information Extraction Rules
- Learns IE rules automatically
- Uses a corpus of documents paired with filled templates
- Resulting rules do not require prior parsing or subsequent processing
- Uses limited syntactic information from a POS tagger
- Induced patterns incorporate semantic classes
- Rules characterize slot fillers and their context
4. RAPIER Rules
- Rules consist of three parts
  - Pre-filler pattern: matches text immediately preceding the extracted information
  - Filler pattern: matches the exact text to be extracted
  - Post-filler pattern: matches text immediately following the extracted information
- Each pattern is a sequence of pattern items or pattern lists
  - A pattern item specifies constraints for exactly one word or symbol
  - A pattern list specifies constraints for 0..n words or symbols
- Constraints include
  - A list of words, one of which must match
  - A POS tag
  - A semantic class
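The item/list distinction above can be sketched in code. The following is a minimal illustration under stated assumptions, not the paper's implementation; names such as `PatternItem` and `match` are hypothetical. A pattern element carries optional word, tag, and semantic-class constraints, and a pattern list may consume zero up to `max_len` tokens.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

@dataclass
class PatternItem:
    """One element of a RAPIER-style pattern (hypothetical sketch).

    max_len == 1 models a pattern item (exactly one token);
    max_len > 1 models a pattern list (0..max_len tokens)."""
    words: Optional[Set[str]] = None   # token must be one of these words, if set
    tags: Optional[Set[str]] = None    # POS-tag constraint, if set
    sem: Optional[str] = None          # semantic-class constraint, if set
    max_len: int = 1

    def matches(self, word: str, tag: str, sems: Set[str]) -> bool:
        return ((self.words is None or word.lower() in self.words)
                and (self.tags is None or tag in self.tags)
                and (self.sem is None or self.sem in sems))

# Tokens are (word, POS tag, semantic classes) triples.
Token = Tuple[str, str, Set[str]]

def match(pattern: List[PatternItem], tokens: List[Token], i: int = 0) -> bool:
    """True if `pattern` matches tokens[i:] exactly (recursive backtracking)."""
    if not pattern:
        return i == len(tokens)
    head, rest = pattern[0], pattern[1:]
    lo = 0 if head.max_len > 1 else 1   # a pattern list may match zero tokens
    for n in range(lo, head.max_len + 1):
        if i + n > len(tokens):
            break
        if all(head.matches(*tokens[i + k]) for k in range(n)):
            if match(rest, tokens, i + n):
                return True
    return False
```

For instance, the filler pattern "list of length 2 with tags {nn, nns}" from the next slide accepts both "firm" alone and "telecommunications firm".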
5. RAPIER Rules (cont.)
- Example 1, matching text such as "Leading telecommunications firm in need"
  - Pre-filler: 1) word: leading
  - Filler: 1) list: len 2, tags: {nn, nns}
  - Post-filler: 1) word: {firm, company}
- Example 2, generalizing "sold to the bank for an undisclosed amount" and "paid Honeywell an undisclosed price"
  - Pre-filler: 1) tag: {nn, nnp}; 2) list: length 2
  - Filler: 1) word: undisclosed, tag: jj
  - Post-filler: 1) sem: price
6. Learning Algorithm
- Example: learning a location rule from two training fragments, "located in Atlanta, Georgia." and "offices in Kansas City, Missouri."
- Most specific rules (one per document):
  - Pre-filler: 1) word: located, tag: vbn; 2) word: in, tag: in
    Filler: 1) word: atlanta, tag: nnp
    Post-filler: 1) word: ",", tag: ","; 2) word: georgia, tag: nnp; 3) word: ".", tag: "."
  - Pre-filler: 1) word: offices, tag: nns; 2) word: in, tag: in
    Filler: 1) word: kansas, tag: nnp; 2) word: city, tag: nnp
    Post-filler: 1) word: ",", tag: ","; 2) word: missouri, tag: nnp; 3) word: ".", tag: "."
- Candidate generalizations of the fillers:
  - 1) list: len 2, word: {atlanta, kansas, city}, tag: nnp
  - 1) list: len 2, tag: nnp
- Final rule after specializing with context:
  Pre-filler: 1) word: in, tag: in
  Filler: 1) list: len 2, tag: nnp
  Post-filler: 1) word: ",", tag: ","; 2) tag: nnp, sem: state
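The two candidate filler generalizations above (union the word sets, or keep only the tag constraint) can be sketched as follows. This is an illustrative sketch, not the paper's code; `generalize_fillers` is a hypothetical name.

```python
# Hedged sketch of RAPIER-style pairwise generalization of two fillers.
# A filler is a list of (word, POS tag) pairs; the result is a list of
# candidate pattern lists, each a dict of constraints.
def generalize_fillers(f1, f2):
    """Return two candidate generalizations covering both fillers:
    one constraining words to the union of both word sets, and one
    keeping only the unioned tag constraint."""
    max_len = max(len(f1), len(f2))            # pattern list long enough for either filler
    words = {w.lower() for w, _ in f1 + f2}    # union of observed words
    tags = {t for _, t in f1 + f2}             # union of observed tags
    return [
        {"max_len": max_len, "words": words, "tags": tags},
        {"max_len": max_len, "words": None, "tags": tags},
    ]
```

Applied to the fillers "Atlanta" and "Kansas City", this yields exactly the two candidates listed above: a length-2 list over {atlanta, kansas, city} with tag nnp, and a length-2 list with tag nnp alone.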
For each slot S in the template being learned:
    SlotRules := most specific rules from the documents for S
    while compression has failed fewer than lim times:
        randomly select r pairs of rules from SlotRules
        find the set L of generalizations of the fillers of the rule pairs
        create rules from L, evaluate, and initialize RuleList
        let n := 0
        while the best rule in RuleList produces spurious fillers
              and the weighted information value of the best rule is improving:
            increment n
            specialize each rule in RuleList with generalizations of the
                last n items of the pre-filler patterns of the rule pair,
                and add the specializations to RuleList
            specialize each rule in RuleList with generalizations of the
                first n items of the post-filler patterns of the rule pair,
                and add the specializations to RuleList
        if the best rule in RuleList produces only valid fillers:
            add it to SlotRules
            remove empirically subsumed rules
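The algorithm's starting point, the "most specific rules", can be sketched directly: every token of a training document becomes its own fully constrained pattern item, split around the filler span. This is an illustrative sketch (the function name and dict representation are hypothetical, not the paper's data structures).

```python
# Hedged sketch of building the most specific rule for one training example.
def most_specific_rule(tagged_doc, start, end):
    """tagged_doc: list of (word, POS tag) pairs for one document.
    The filler occupies token indices [start, end).
    Every token becomes a pattern item constrained to exactly
    that word and that tag."""
    item = lambda w, t: {"words": {w.lower()}, "tags": {t}}
    return {
        "pre":    [item(w, t) for w, t in tagged_doc[:start]],
        "filler": [item(w, t) for w, t in tagged_doc[start:end]],
        "post":   [item(w, t) for w, t in tagged_doc[end:]],
    }
```

Run on "located in Atlanta, Georgia." with "Atlanta" as the filler, this reproduces the first most-specific rule shown on this slide.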
7. Experimental Results
- The task: extract information from computer-related job postings
- 17 slots used, including employer, salary, etc.
- Results do not employ semantic categories
- 100-document dataset with filled templates, evaluated with 10-fold cross-validation
- Measured precision, recall, and F-measure
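For reference, the three measures reported in the experiments have standard definitions over the sets of extracted and correct fillers; a minimal sketch (function name hypothetical):

```python
# Standard precision / recall / F-measure over extracted vs. correct fillers.
def precision_recall_f(extracted: set, correct: set, beta: float = 1.0):
    """precision = fraction of extracted fillers that are correct;
    recall = fraction of correct fillers that were extracted;
    F-measure = weighted harmonic mean of the two."""
    tp = len(extracted & correct)
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(correct) if correct else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f
```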
8. Experimental Results (cont.)
- Performance
  - Is comparable to Crystal on a medical domain
  - Is better than AutoSlog and AutoSlog-TS on the MUC-4 terrorism task
  - Is hard to compare directly because the systems were tested on different domains
  - Is good because precision, where RAPIER is strong, is the most important measure for this task
9. Related Work
- Resolve
  - Uses decision trees
  - Uses annotated coreference examples
- Crystal
  - Uses a clustering algorithm to build a dictionary of extraction patterns
  - Requires patterns identified by an expert
  - Requires prior syntax analysis to identify syntactic elements and their relationships
- AutoSlog
  - Specializes a set of general syntactic patterns
  - An expert must examine the patterns it produces
  - Requires prior syntax analysis
- Liep
  - Requires prior syntax analysis
  - Makes no real use of semantic information
  - Has not been applied to complex domains
10. Related Work: BYU DEG
- RAPIER rules correspond closely to DEG data frames
  - Data frames are finer-grained, based on character patterns, whereas RAPIER rules are based on word patterns
  - Pre-filler and post-filler patterns correspond closely to data-frame contexts and keywords
  - Semantic categories correspond closely to lexicons
- It is not mentioned how RAPIER handles multiple-record documents
- RAPIER's data structure is given by the template (slots) defined in the input data
- RAPIER is very similar in purpose to what Joe is trying to do: learn extraction rules based on a filled-in form
11. Conclusions
- Extracting desired pieces of information from NL text is important
- Manually constructing IE systems is too difficult and time consuming
- RAPIER uses relational learning to build a set of pattern-match rules given a database of texts and filled templates
- Learned patterns employ syntactic and semantic information to match slot fillers and their context
- Fairly accurate results can be obtained for a real-world problem with relatively small datasets
- RAPIER compares favorably with other IE learning systems