Title: Relational Learning of Pattern-Match Rules for Information Extraction
1. Relational Learning of Pattern-Match Rules for Information Extraction
- Presentation by Tim Chartrand of a paper by Mary Elaine Califf and Raymond J. Mooney
2. Introduction
- Information Extraction (IE) is the task of locating specific pieces of information in NL text
- IE is an important subpart of text understanding
- IE systems are difficult and time-consuming to build, and they don't port well to different domains
- Researchers are combining learning methods with NLP methods to automate IE
3. Overview of RAPIER
- RAPIER: Robust Automated Production of Information Extraction Rules
- Learns IE rules automatically
- Uses a corpus of documents paired with filled templates
- Resulting rules do not require prior parsing or subsequent processing
- Uses limited syntactic information from a POS tagger
- Induced patterns incorporate semantic classes
- Rules characterize slot fillers and their context
4. RAPIER Rules
- Rules consist of three parts
  - Pre-filler pattern: matches text immediately preceding the extracted information
  - Filler pattern: matches the exact text to be extracted
  - Post-filler pattern: matches text immediately following the extracted information
- Each pattern is a sequence of pattern items or pattern lists
  - A pattern item specifies constraints for exactly one word or symbol
  - A pattern list specifies constraints for 0..n words or symbols
- Constraints include
  - A list of words, one of which must match
  - A POS tag
  - A semantic class
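The item/list distinction above can be sketched in code. The following is a minimal illustration under stated assumptions, not the paper's implementation; names such as `PatternItem` and `match` are hypothetical. A pattern element carries optional word, tag, and semantic-class constraints, and a pattern list may consume zero up to `max_len` tokens.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

@dataclass
class PatternItem:
    """One element of a RAPIER-style pattern (hypothetical sketch).

    max_len == 1 models a pattern item (exactly one token);
    max_len > 1 models a pattern list (0..max_len tokens)."""
    words: Optional[Set[str]] = None   # token must be one of these words, if set
    tags: Optional[Set[str]] = None    # POS-tag constraint, if set
    sem: Optional[str] = None          # semantic-class constraint, if set
    max_len: int = 1

    def matches(self, word: str, tag: str, sems: Set[str]) -> bool:
        return ((self.words is None or word.lower() in self.words)
                and (self.tags is None or tag in self.tags)
                and (self.sem is None or self.sem in sems))

# Tokens are (word, POS tag, semantic classes) triples.
Token = Tuple[str, str, Set[str]]

def match(pattern: List[PatternItem], tokens: List[Token], i: int = 0) -> bool:
    """True if `pattern` matches tokens[i:] exactly (recursive backtracking)."""
    if not pattern:
        return i == len(tokens)
    head, rest = pattern[0], pattern[1:]
    lo = 0 if head.max_len > 1 else 1   # a pattern list may match zero tokens
    for n in range(lo, head.max_len + 1):
        if i + n > len(tokens):
            break
        if all(head.matches(*tokens[i + k]) for k in range(n)):
            if match(rest, tokens, i + n):
                return True
    return False
```

For instance, the filler pattern "list of length 2 with tags {nn, nns}" from the next slide accepts both "firm" alone and "telecommunications firm".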
5. RAPIER Rules (cont.)
- Example 1, matching text such as "Leading telecommunications firm in need"
  - Pre-filler: 1) word: leading
  - Filler: 1) list: len 2, tags: {nn, nns}
  - Post-filler: 1) word: {firm, company}
- Example 2, generalizing "sold to the bank for an undisclosed amount" and "paid Honeywell an undisclosed price"
  - Pre-filler: 1) tag: {nn, nnp}; 2) list: length 2
  - Filler: 1) word: undisclosed, tag: jj
  - Post-filler: 1) sem: price
6. Learning Algorithm
- Example: learning a location rule from two training fragments, "located in Atlanta, Georgia." and "offices in Kansas City, Missouri."
- Most specific rules (one per document):
  - Pre-filler: 1) word: located, tag: vbn; 2) word: in, tag: in
    Filler: 1) word: atlanta, tag: nnp
    Post-filler: 1) word: ",", tag: ","; 2) word: georgia, tag: nnp; 3) word: ".", tag: "."
  - Pre-filler: 1) word: offices, tag: nns; 2) word: in, tag: in
    Filler: 1) word: kansas, tag: nnp; 2) word: city, tag: nnp
    Post-filler: 1) word: ",", tag: ","; 2) word: missouri, tag: nnp; 3) word: ".", tag: "."
- Candidate generalizations of the fillers:
  - 1) list: len 2, word: {atlanta, kansas, city}, tag: nnp
  - 1) list: len 2, tag: nnp
- Final rule after specializing with context:
  Pre-filler: 1) word: in, tag: in
  Filler: 1) list: len 2, tag: nnp
  Post-filler: 1) word: ",", tag: ","; 2) tag: nnp, sem: state
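The two candidate filler generalizations above (union the word sets, or keep only the tag constraint) can be sketched as follows. This is an illustrative sketch, not the paper's code; `generalize_fillers` is a hypothetical name.

```python
# Hedged sketch of RAPIER-style pairwise generalization of two fillers.
# A filler is a list of (word, POS tag) pairs; the result is a list of
# candidate pattern lists, each a dict of constraints.
def generalize_fillers(f1, f2):
    """Return two candidate generalizations covering both fillers:
    one constraining words to the union of both word sets, and one
    keeping only the unioned tag constraint."""
    max_len = max(len(f1), len(f2))            # pattern list long enough for either filler
    words = {w.lower() for w, _ in f1 + f2}    # union of observed words
    tags = {t for _, t in f1 + f2}             # union of observed tags
    return [
        {"max_len": max_len, "words": words, "tags": tags},
        {"max_len": max_len, "words": None, "tags": tags},
    ]
```

Applied to the fillers "Atlanta" and "Kansas City", this yields exactly the two candidates listed above: a length-2 list over {atlanta, kansas, city} with tag nnp, and a length-2 list with tag nnp alone.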
For each slot S in the template being learned:
    SlotRules := most specific rules from the documents for S
    while compression has failed fewer than lim times:
        randomly select r pairs of rules from SlotRules
        find the set L of generalizations of the fillers of the rule pairs
        create rules from L, evaluate, and initialize RuleList
        let n := 0
        while the best rule in RuleList produces spurious fillers
              and the weighted information value of the best rule is improving:
            increment n
            specialize each rule in RuleList with generalizations of the
                last n items of the pre-filler patterns of the rule pair,
                and add the specializations to RuleList
            specialize each rule in RuleList with generalizations of the
                first n items of the post-filler patterns of the rule pair,
                and add the specializations to RuleList
        if the best rule in RuleList produces only valid fillers:
            add it to SlotRules
            remove empirically subsumed rules
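The algorithm's starting point, the "most specific rules", can be sketched directly: every token of a training document becomes its own fully constrained pattern item, split around the filler span. This is an illustrative sketch (the function name and dict representation are hypothetical, not the paper's data structures).

```python
# Hedged sketch of building the most specific rule for one training example.
def most_specific_rule(tagged_doc, start, end):
    """tagged_doc: list of (word, POS tag) pairs for one document.
    The filler occupies token indices [start, end).
    Every token becomes a pattern item constrained to exactly
    that word and that tag."""
    item = lambda w, t: {"words": {w.lower()}, "tags": {t}}
    return {
        "pre":    [item(w, t) for w, t in tagged_doc[:start]],
        "filler": [item(w, t) for w, t in tagged_doc[start:end]],
        "post":   [item(w, t) for w, t in tagged_doc[end:]],
    }
```

Run on "located in Atlanta, Georgia." with "Atlanta" as the filler, this reproduces the first most-specific rule shown on this slide.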
7. Experimental Results
- The task: extract information from computer-related job postings
- 17 slots used, including employer, salary, etc.
- Results do not employ semantic categories
- 100-document dataset with filled templates, evaluated with 10-fold cross-validation
- Measured precision, recall, and F-measure
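For reference, the three measures reported in the experiments have standard definitions over the sets of extracted and correct fillers; a minimal sketch (function name hypothetical):

```python
# Standard precision / recall / F-measure over extracted vs. correct fillers.
def precision_recall_f(extracted: set, correct: set, beta: float = 1.0):
    """precision = fraction of extracted fillers that are correct;
    recall = fraction of correct fillers that were extracted;
    F-measure = weighted harmonic mean of the two."""
    tp = len(extracted & correct)
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(correct) if correct else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f
```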
8. Experimental Results (cont.)
- Performance
  - Is comparable to Crystal on a medical domain
  - Is better than AutoSlog and AutoSlog-TS on the MUC-4 terrorism task
  - Is hard to compare directly because the systems were tested on different domains
  - Is good because precision, where RAPIER is strong, is the most important measure for this task
9. Related Work
- Resolve
  - Uses decision trees
  - Uses annotated coreference examples
- Crystal
  - Uses a clustering algorithm to build a dictionary of extraction patterns
  - Requires patterns identified by an expert
  - Requires prior syntax analysis to identify syntactic elements and their relationships
- AutoSlog
  - Specializes a set of general syntactic patterns
  - An expert must examine the patterns it produces
  - Requires prior syntax analysis
- Liep
  - Requires prior syntax analysis
  - Makes no real use of semantic information
  - Has not been applied to complex domains
10. Related Work: BYU DEG
- RAPIER rules correspond closely to DEG data frames
  - Data frames are finer-grained, based on character patterns, whereas RAPIER rules are based on word patterns
  - Pre-filler and post-filler patterns correspond closely to data-frame contexts and keywords
  - Semantic categories correspond closely to lexicons
- It is not mentioned how RAPIER handles multiple-record documents
- RAPIER's data structure is given by the template (slots) defined in the input data
- RAPIER is very similar in purpose to what Joe is trying to do: learn extraction rules based on a filled-in form
11. Conclusions
- Extracting desired pieces of information from NL text is important
- Manually constructing IE systems is too difficult and time consuming
- RAPIER uses relational learning to build a set of pattern-match rules given a database of texts and filled templates
- Learned patterns employ syntactic and semantic information to match slot fillers and their context
- Fairly accurate results can be obtained for a real-world problem with relatively small datasets
- RAPIER compares favorably with other IE learning systems