Title: Information Extraction: A Practical Survey
1. Information Extraction: A Practical Survey
Mihai Surdeanu
- TALP Research Center
- Dep. Llenguatges i Sistemes Informàtics
- Universitat Politècnica de Catalunya
- surdeanu_at_lsi.upc.es
2. Overview
- What is information extraction?
- A traditional system and its problems
- Pattern learning and classification
- Beyond patterns
3. What is information extraction?
- "The extraction or pulling out of pertinent information from large volumes of texts." (http://www.itl.nist.gov/iad/894.02/related_projects/muc/index.html)
- Information extraction (IE) systems extract concepts, events, and relations that are relevant for a given scenario domain.
- But what is a concept, an event, or a scenario domain? Actual implementations of IE systems varied throughout the history of the task: MUC, Event99, EELD.
- The tendency is to simplify the definition (or rather the implementation) of the task.
4. Information Extraction at the Message Understanding Conferences
- Seven MUC conferences, between 1987 and 1998.
- Scenario domains driven by template specifications (fairly similar to database schemas), which define the content to be extracted.
- Each event fills exactly one template (fairly similar to a database record).
- Each template slot contains either text or pointers to other templates.
- The goal was to use IE technology to populate relational databases. This never really happened:
  - The chosen representation was too complicated.
  - The task did not address real-world problems, but artificial benchmarks.
  - Systems never achieved good-enough accuracy.
5. MUC-6 Management Succession Example
Barry Diller was appointed chief executive officer of QVC Network Inc.

<SUCCESSION_EVENT-9301190125-1>
  SUCCESSION_ORG: <ORGANIZATION-9301190125-1>
  POST: "chief executive officer"
  IN_AND_OUT: <IN_AND_OUT-9301190125-1>
              <IN_AND_OUT-9301190125-2>
  VACANCY_REASON: REASSIGNMENT

<IN_AND_OUT-9301190125-1>
  IO_PERSON: <PERSON-9301190125-1>
  NEW_STATUS: IN
  ON_THE_JOB: UNCLEAR
  OTHER_ORG: <ORGANIZATION-9301190125-2>
  REL_OTHER_ORG: OUTSIDE_ORG
  COMMENT: "Barry Diller IN"

<ORGANIZATION-9301190125-1>
  ORG_NAME: "QVC Network Inc."
  ORG_TYPE: COMPANY

In this MUC-6 template, POST is a template slot with a text fill, while SUCCESSION_ORG and IN_AND_OUT are template slots that point to other templates.
6. Information Extraction at DARPA's HUB-4 Event99
- Event99 was planned as a successor of MUC.
- Identification and extraction of relevant information is dictated by templettes, which are flat, simplified templates. Slots are filled only with text; no pointers to other templettes are accepted.
- Domains closer to real-world applications are addressed: natural disasters, bombings, deaths, elections, financial fluctuations, illness outbreaks.
- The goal was to provide event-level indexing into documents such as news wires, radio and television transcripts, and so on. Imagine querying "BOMBING AND Gaza" in news messages, and retrieving only the relevant text about bombing events in the Gaza area, classified into templettes.
- Event99: A Proposed Event Indexing Task For Broadcast News. Lynette Hirschman et al. (http://citeseer.nj.nec.com/424439.html)
7. Event99 Death Example: Templettes Versus Templates
The sole survivor of the car crash that killed Princess Diana and Dodi Fayed last year in France is remembering more about the accident.

<DEATH-CNN3-1>
  DECEASED: "Princess Diana" / "Dodi Fayed"
  MANNER_OF_DEATH: "the car crash that killed Princess Diana and Dodi Fayed" / "the accident"
  LOCATION: "in France"
  DATE: "last year"
8. Information Extraction at DARPA's Evidence Extraction and Link Detection (EELD) Program
- IE is used as a tool for the more general problem of link discovery: sift through large data collections and derive complex rules from collections of simpler IE patterns.
- Example: certain sets of account_number(Person,Account), deposit(Account,Amount), greater_than(Amount,reporting_amount) patterns imply is_a(Person, money_launderer). Note that the fact that Person is a money_launderer is not stated in any form in the text! (A toy sketch of this inference follows the list below.)
- IE is used to identify concepts (typically named entities), events (typically identified by trigger words), and basic entity-entity and entity-event relations.
- Simpler IE problem:
  - No templates or templettes are generated.
  - No event merging.
  - Events are always marked by trigger words, e.g. "murder" triggers a MURDER event.
  - Relations are always intra-sentential.
- EELD web portal: http://www.rl.af.mil/tech/programs/eeld/
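To make the rule-combination idea concrete, here is a minimal Python sketch under invented assumptions: the facts, the reporting threshold, and the rule encoding are all illustrative, not EELD's actual formalism.

    # Minimal link-discovery sketch: IE-extracted facts are combined by a
    # hand-written rule to infer a relation never stated in any text.
    REPORTING_AMOUNT = 10_000  # hypothetical reporting threshold

    # Facts an IE system might have extracted from documents:
    account_number = {("John Smith", "acct-17")}          # account_number(Person, Account)
    deposit = {("acct-17", 25_000), ("acct-99", 3_000)}   # deposit(Account, Amount)

    def infer_money_launderers():
        """Rule: account_number(P, A) & deposit(A, Amt) & Amt > reporting_amount
        => is_a(P, money_launderer)."""
        inferred = set()
        for person, account in account_number:
            for acct, amount in deposit:
                if acct == account and amount > REPORTING_AMOUNT:
                    inferred.add(("is_a", person, "money_launderer"))
        return inferred

    print(infer_money_launderers())  # {('is_a', 'John Smith', 'money_launderer')}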
9. EELD Example
"John Smith is the chief scientist of Hardcom Corporation."
  Entities:  Person(John Smith), Organization(Hardcom Corporation)
  Events:    --
  Relations: person-affiliation(Person(John Smith), Organization(Hardcom Corporation))

"The murder of John Smith"
  Entities:  Person(John Smith)
  Events:    Murder(murder)
  Relations: murder-victim(Person(John Smith), Murder(murder))
10. Overview
- What is information extraction?
- A traditional system and its problems
- Pattern learning and classification
- Beyond patterns
11. Traditional IE Architecture
- The Finite State Automaton Text Understanding System (FASTUS) approach: cascaded finite-state automata (FSA).
- Each FSA level recognizes larger linguistic constructs (from tokens to chunks to clauses to domain patterns), which become the simplified input for the next FSA in the cascade. (A toy sketch in code follows the reference list below.)
- Why? Speed. Robustness to unstructured input. Handles data sparsity well.
- The FSA cascade is enriched with limited discourse processing components: coreference resolution and event merging.
- Most systems in MUC ended up using this architecture: CIRCUS from UMass (actually the first to introduce the cascaded FSA architecture), PROTEUS (NYU), PLUM (BBN), CICERO (LCC), and many others.
- An ocean of information is available:
  - FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. Jerry R. Hobbs et al. http://www.ai.sri.com/natural-language/projects/fastus-schabes.html
  - Infrastructure for Open-Domain Information Extraction. Mihai Surdeanu and Sanda Harabagiu. http://www.languagecomputer.com/papers/hlt2002.pdf
  - Rich IE bibliography maintained by Horacio Rodriguez at http://www.lsi.upc.es/horacio/varios/sevilla2001.zip
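Below is a toy Python sketch of the cascade idea, in the spirit of FASTUS but not its actual grammars: each stage rewrites the text into a simpler form for the next stage, and the regexes stand in for compiled FSAs. All patterns and the example sentence are illustrative.

    import re

    def stage_numeric(text):
        # Stage 1: numeric entities (money, dates) become atomic tokens.
        text = re.sub(r"\$\d[\d,]*", "[MONEY]", text)
        return re.sub(r"\b(Monday|Tuesday|Wednesday|Thursday|Friday)\b", "[DATE]", text)

    def stage_named(text):
        # Stage 2: a tiny gazetteer of known organization names.
        return re.sub(r"QVC Network Inc\.?", "[ORG]", text)

    def stage_domain(text):
        # Stage 3: domain patterns fire over the simplified token stream.
        m = re.search(r"(\w+ \w+) was appointed (chief executive officer) of \[ORG\]", text)
        return {"PERSON_IN": m.group(1), "POST": m.group(2)} if m else None

    s = "Barry Diller was appointed chief executive officer of QVC Network Inc. on Monday"
    print(stage_domain(stage_named(stage_numeric(s))))
    # {'PERSON_IN': 'Barry Diller', 'POST': 'chief executive officer'}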
12. Language Computer's CICERO Information Extraction System
Documents flow through the following pipeline; each module simplifies the input for the next:
- Known word recognition: recognizes known concepts using lexicons and gazetteers.
- Numerical-entity recognition: identifies numerical entities such as money, percents, dates, and times (FSA).
- Named-entity recognition: identifies named entities such as person, location, and organization names (FSA).
- Name aliasing: disambiguates incomplete or ambiguous names. (Named-entity recognition plus name aliasing form a stand-alone named-entity recognizer.)
- Phrasal parser: identifies basic noun, verb, and particle phrases (TBL + FSA).
- Phrase combiner: identifies domain-dependent complex noun and verb phrases (FSA).
- Entity coreference resolution: detects pronominal and nominal coreference links.
- Domain pattern recognition: identifies domain-dependent patterns (FSA).
- Event coreference: resolves empty templette slots.
- Event merging: merges templettes belonging to the same event.
The output is templettes/templates.
13. Walk-Through Example (1/5)
At least seven police officers were killed and as many as 52 other people, including several children, were injured Monday in a car bombing that also wrecked a police station. Kirkuk's police said they had "good information" that Ansar al-Islam was behind the blast.

<BOMBING>
  BOMB: "a car bombing"
  PERPETRATOR: "Ansar al-Islam"
  DEAD: "At least seven police officers"
  INJURED: "as many as 52 other people, including several children"
  DAMAGE: "a police station"
  LOCATION: "Kirkuk"
  DATE: "Monday"
14. Walk-Through Example (2/5)
15. Walk-Through Example (3/5)
Entity coreference resolution:
they → the police; the blast → a car bombing
16. Walk-Through Example (4/5)
At least seven police officers were killed/PATTERN and as many as 52 other people, including several children, were injured Monday in a car bombing/PATTERN car bombing that also wrecked a police station/PATTERN. Kirkuk's police said they had "good information" that Ansar al-Islam was behind the blast/PATTERN.
17. Walk-Through Example (5/5)
18. Coreference for IE
- Algorithm detailed in: Recognizing Referential Links: An Information Extraction Perspective. Megumi Kameyama. http://citeseer.nj.nec.com/kameyama97recognizing.html
- 3-step algorithm (a minimal sketch in code follows the list below):
  - Identify all anaphoric entities, e.g. pronouns, nouns, ambiguous named entities.
  - For each anaphoric entity, identify all possible candidates and sort them according to some salience ordering, e.g. left-to-right traversal in the same sentence, right-to-left traversal in previous sentences.
  - Extract the first candidate that matches some semantic constraints, e.g. number and gender consistency. Merge the candidate with the anaphoric entity.
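A minimal Python sketch of steps 2 and 3, assuming a toy mention representation and toy agreement constraints; this illustrates the salience ordering, not Kameyama's implementation.

    def candidates_in_salience_order(mentions, anaphor):
        """Step 2: same-sentence candidates left-to-right, then previous
        sentences right-to-left, most recent sentence first."""
        same = [m for m in mentions
                if m["sent"] == anaphor["sent"] and m["pos"] < anaphor["pos"]]
        prev = [m for m in mentions if m["sent"] < anaphor["sent"]]
        prev.sort(key=lambda m: (-m["sent"], -m["pos"]))
        return same + prev

    def resolve(mentions, anaphor):
        """Step 3: the first candidate satisfying the semantic constraints wins."""
        for cand in candidates_in_salience_order(mentions, anaphor):
            if (cand["number"] == anaphor["number"]
                    and anaphor["gender"] in (cand["gender"], "any")):
                return cand
        return None

    mentions = [
        {"text": "Kirkuk's police", "sent": 1, "pos": 0, "number": "pl", "gender": "any"},
        {"text": "Ansar al-Islam",  "sent": 1, "pos": 9, "number": "sg", "gender": "any"},
    ]
    anaphor = {"text": "they", "sent": 1, "pos": 4, "number": "pl", "gender": "any"}
    print(resolve(mentions, anaphor)["text"])  # Kirkuk's police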
19. The Role of Coreference in Named Entity Recognition
- Classifies unknown named entities that are likely part of a name but cannot be identified as such due to insufficient local context.
  - Example: "Michigan National Corp./ORG said it will eliminate some senior management jobs ... Michigan National/? said the restructuring ..."
- Disambiguates named entities of ambiguous length and/or ambiguous type.
  - "Michigan" is changed from LOC to ORG when "Michigan Corp." appears in the same context.
  - The text "McDonalds" may contain a person name ("McDonald" followed by a possessive) or an organization name ("McDonalds"). A non-deterministic FSA is used to maintain both alternatives until after name aliasing, when one is selected.
- Disambiguates headline named entities.
  - Headlines are typically capitalized, e.g. "McDermott Completes Sale".
  - Processing of headlines is postponed until after the body of the text is processed.
  - A longest-match approach matches the headline sequence of tokens against entities found in the first body paragraph. For example, "McDermott" is labeled ORG because it matches over "McDermott International Inc." in the first document paragraph. (A small sketch follows below.)
- Over 5-point increase in accuracy (F-measure), from 87.81 to 93.64.
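A small Python sketch of the longest-match headline labeling described above; the entity list, the headline, and the prefix-matching heuristic are illustrative assumptions.

    def label_headline(headline_tokens, body_entities):
        """Match headline token spans against entities from the first body
        paragraph, preferring the longest match at each position."""
        labels, i = [], 0
        while i < len(headline_tokens):
            best = None
            for name, etype in body_entities:
                ent_tokens = name.split()
                for n in range(len(ent_tokens), 0, -1):
                    # A headline span matches if it is a prefix of the entity.
                    if headline_tokens[i:i+n] == ent_tokens[:n] and (best is None or n > best[0]):
                        best = (n, etype)
            if best:
                n, etype = best
                labels.append((" ".join(headline_tokens[i:i+n]), etype))
                i += n
            else:
                labels.append((headline_tokens[i], None))
                i += 1
        return labels

    body_entities = [("McDermott International Inc.", "ORG")]
    print(label_headline("McDermott Completes Sale".split(), body_entities))
    # [('McDermott', 'ORG'), ('Completes', None), ('Sale', None)]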
20. The Role of Coreference in IE
21. The Good
- Relatively good performance with a simple system:
  - F-measures over 75%, up to 88% for some simpler Event99 domains.
  - Execution times below 10 seconds per 5KB document.
- Improvements to the FSA-only approach:
  - Coreference almost doubles the FSA-only performance.
  - More extraction rules add little to IE performance, whereas different forms of coreference add more.
  - Non-determinism is used to mitigate the limited power of FSA grammars.
22. The Bad
- Needs domain-specific lexicons, e.g. an ontology of bombing devices. Work to automate this process: Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Ellen Riloff and Rosie Jones. http://www.cs.utah.edu/riloff/psfiles/aaai99.pdf (not covered in this presentation)
- Domain-specific patterns must be developed, e.g. <SUBJECT> explode.
- Patterns must be classified: What does the above pattern mean? Is the subject a bomb, a perpetrator, a location?
- Patterns cannot cover the flexibility of natural language. We need better models that go beyond the pattern limitations.
- Event merging is another NP-complete problem. One of the few stochastic models for event merging: Probabilistic Coreference in Information Extraction. Andrew Kehler. http://ling.ucsd.edu/kehler/Papers/emnlp97.ps.gz (not covered in this presentation)
- All of the above components are manually developed, which yields high domain development time (more than 40 person-hours per domain). This prohibits the use of this approach for real-time information extraction.
23. Overview
- What is information extraction?
- A traditional system and its problems
- Pattern learning and classification
- Beyond patterns
24. Automatically Generating Extraction Patterns from Untagged Text
- The first system to successfully discover domain patterns: AutoSlog-TS.
- Automatically Generating Extraction Patterns from Untagged Text. Ellen Riloff. http://www.cs.utah.edu/riloff/psfiles/aaai96.pdf
- The intuition is that domain-specific patterns will appear more often in documents related to the domain of interest than in unrelated documents.
25. Weakly-Supervised Pattern Learning Algorithm (1/2)
- Separate the training document set into relevant and irrelevant documents (manual process).
- Generate all possible patterns in all documents, according to some meta-patterns. Examples below.

  Meta-pattern           Example pattern
  <subj> active-verb     <perpetrator> bombed
  active-verb <dobj>     bombed <target>
  infinitive <dobj>      to kill <victim>
  gerund <dobj>          killing <victim>
  <np> prep <np>         <bomb> against <target>
26. Weakly-Supervised Pattern Learning Algorithm (2/2)
- Rank all generated patterns according to the formula relevance_rate × log2(frequency), where relevance_rate indicates the ratio of relevant instances (i.e. instances in relevant documents versus non-relevant documents) of the corresponding pattern, and frequency indicates the number of times the pattern was seen in relevant documents.
- Add the top-ranked pattern to the list of learned patterns, and mark all documents where the pattern appears as relevant.
- Repeat the process from Step 3 (the ranking step) for N iterations. Hence the output of the algorithm is N learned patterns. (A compact sketch of this loop follows below.)
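A compact Python sketch of this ranking-and-bootstrapping loop, in the spirit of AutoSlog-TS; the patterns and document sets are invented, and pattern frequency is approximated here by document counts.

    import math

    def score(docs_with_pattern, relevant):
        """relevance_rate * log2(frequency), with frequency approximated by
        the number of relevant documents containing the pattern."""
        rel = sum(1 for d in docs_with_pattern if d in relevant)
        if rel == 0:
            return 0.0
        return (rel / len(docs_with_pattern)) * math.log2(rel)

    def learn(pattern_docs, seed_relevant, n_iterations):
        learned, relevant = [], set(seed_relevant)
        for _ in range(n_iterations):
            candidates = {p: score(d, relevant)
                          for p, d in pattern_docs.items() if p not in learned}
            best = max(candidates, key=candidates.get)
            learned.append(best)                 # keep the top-ranked pattern
            relevant |= set(pattern_docs[best])  # its documents become relevant
        return learned

    pattern_docs = {
        "<subj> exploded": ["d1", "d2", "d3"],
        "murder of <np>":  ["d2", "d3", "d4"],
        "<subj> said":     ["d1", "d2", "d6", "d7", "d8"],
    }
    print(learn(pattern_docs, {"d1", "d2"}, n_iterations=2))
    # ['<subj> exploded', 'murder of <np>']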
27. Examples of Learned Patterns
Patterns learned for the MUC-4 terrorism domain:
<subj> exploded
murder of <np>
assassination of <np>
<subj> was killed
<subj> was kidnapped
attack on <np>
<subj> was injured
exploded in <np>
death of <np>
<subj> took_place
28. The Good and the Bad
- The good:
  - Performance very close to the manually-customized system.
- The bad:
  - Documents must be separated into relevant/irrelevant by hand.
  - When does the learning process stop?
  - Pattern classification and event merging are still developed by human experts.
29. The ExDisco IE System
- Automatic Acquisition of Domain Knowledge for Information Extraction. Roman Yangarber et al. http://www.cs.nyu.edu/roman/Papers/2000-coling-pub.ps.gz
- Quasi-automatically separates documents into relevant/non-relevant using a set of seed patterns selected by the user, e.g. <company> appoint-verb <person> for the MUC-6 management succession domain.
- In addition to ranking patterns, ExDisco ranks documents based on how many relevant patterns they contain, with an immediate application to text filtering.
30. Counter-Training for Pattern Discovery
- Counter-Training in Discovery of Semantic Patterns. Roman Yangarber. http://www.cs.nyu.edu/roman/Papers/2003-acl-countertrain-web.pdf
- Previous approaches are iterative learning algorithms whose output is a continuous stream of patterns with degrading precision. What is the best stopping point?
- The approach is to introduce competition among multiple scenario learners (e.g. management succession, mergers and acquisitions, legal actions). Stop when the learners wander into territories already discovered by others:
  - Pattern frequency is weighted by document relevance.
  - Document relevance receives a negative weight based on how many patterns from a different scenario the document contains.
  - The learning for each scenario stops when its best pattern has a negative score.
31. Pattern Classification
- Multiple systems perform successful pattern acquisition by now, e.g. "attacked <np>" is discovered for the bombing domain. But what does the <np> actually mean? Is it the victim, the physical target, or something else?
- An Empirical Approach to Conceptual Case Frame Acquisition. Ellen Riloff and Mark Schmelzenbach. http://www.cs.utah.edu/riloff/psfiles/wvlc98.pdf
32. Pattern Classification Algorithm
- Requires 5 seed words per semantic category (e.g. PERPETRATOR, VICTIM, etc.).
- Builds a context for each semantic category by expanding the seed word set with words that appear frequently in the proximity of the previous seed words.
- Uses AutoSlog to discover domain patterns.
- Builds a semantic profile for each discovered pattern, based on the overlap between the noun phrases extracted by the pattern and the previous semantic contexts.
- Each pattern is associated with the best-ranked semantic category. (A toy sketch of the profile computation follows the example on the next slide.)
33. Pattern Classification Example

  Semantic Category   Probability
  BUILDING            0.10
  CIVILIAN            0.03
  DATE                0.05
  GOVOFFICIAL         0.03
  LOCATION            0.03
  MILITARYPEOPLE      0.09
  TERRORIST           0.00
  VEHICLE             0.03
  WEAPON              0.00

Semantic profile for the pattern "attack on <np>"
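A toy Python sketch of how such a semantic profile could be computed; the contexts, noun phrases, and the head-word heuristic are illustrative, not the paper's exact procedure.

    from collections import Counter

    semantic_contexts = {
        "BUILDING": {"embassy", "office", "station", "building"},
        "TERRORIST": {"guerrilla", "terrorist", "rebel"},
        "MILITARYPEOPLE": {"soldier", "soldiers", "troops", "officer"},
    }

    def semantic_profile(extracted_nps, contexts):
        """Fraction of a pattern's extractions whose head word overlaps
        each semantic category's context."""
        counts = Counter()
        for np in extracted_nps:
            head = np.split()[-1].lower()  # crude rightmost-word head heuristic
            for category, words in contexts.items():
                if head in words:
                    counts[category] += 1
        total = len(extracted_nps)
        return {cat: counts[cat] / total for cat in contexts}

    # Noun phrases hypothetically extracted by the pattern "attack on <np>":
    nps = ["the US embassy", "a police station", "government office",
           "seven soldiers", "the village"]
    print(semantic_profile(nps, semantic_contexts))
    # {'BUILDING': 0.6, 'TERRORIST': 0.0, 'MILITARYPEOPLE': 0.2}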
34. Other Pattern-Learning Systems: RAPIER (1/2)
- Relational Learning of Pattern-Match Rules for Information Extraction. Mary Elaine Califf and Raymond J. Mooney. http://citeseer.nj.nec.com/califf98relational.html
- Uses Inductive Logic Programming (ILP) to implement a bottom-up generalization of patterns.
- Patterns are specified with pre-fillers (conditions on the tokens preceding the pattern), fillers (conditions on the tokens included in the pattern), and post-fillers (conditions on the tokens following the pattern).
- The only linguistic resource used is a part-of-speech (POS) tagger. No parser (full or partial) is used!
- More robust to unstructured text.
- Applicability limited to simpler domains (e.g. job postings).
35. Other Pattern-Learning Systems: RAPIER (2/2)
"located in Atlanta, Georgia"
  Pre-filler:  word: located, tag: VBN / word: in, tag: IN
  Filler:      word: Atlanta, tag: NNP
  Post-filler: word: ",", tag: "," / word: Georgia, tag: NNP

"offices in Kansas City, Missouri"
  Pre-filler:  word: offices, tag: NNS / word: in, tag: IN
  Filler:      word: Kansas, tag: NNP / word: City, tag: NNP
  Post-filler: word: ",", tag: "," / word: Missouri, tag: NNP

Generalized rule (a toy matcher for it is sketched below):
  Pre-filler:  word: in, tag: IN
  Filler:      list: len 2, tag: NNP
  Post-filler: word: ",", tag: "," / semantic: STATE, tag: NNP
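A rough Python sketch of how the generalized rule above could be represented and matched against POS-tagged tokens; this is a hand-written toy matcher, not RAPIER's ILP learner, and the STATE lexicon is invented.

    STATES = {"Georgia", "Missouri", "Texas"}  # stand-in semantic lexicon

    def match_constraint(token, constraint):
        word, tag = token
        if "word" in constraint and word != constraint["word"]:
            return False
        if "tag" in constraint and tag != constraint["tag"]:
            return False
        if constraint.get("semantic") == "STATE" and word not in STATES:
            return False
        return True

    def apply_rule(tagged, pre, post, filler_tag, filler_max_len):
        """Find a filler: up to filler_max_len tokens tagged filler_tag,
        bracketed by the pre- and post-filler constraints."""
        for i in range(len(tagged) - len(pre) + 1):
            if not all(match_constraint(tagged[i + k], c) for k, c in enumerate(pre)):
                continue
            start = i + len(pre)
            for flen in range(1, filler_max_len + 1):
                span = tagged[start:start + flen]
                rest = tagged[start + flen:start + flen + len(post)]
                if (len(span) == flen
                        and all(t == filler_tag for _, t in span)
                        and len(rest) == len(post)
                        and all(match_constraint(tok, c) for tok, c in zip(rest, post))):
                    return " ".join(w for w, _ in span)
        return None

    tagged = [("offices", "NNS"), ("in", "IN"), ("Kansas", "NNP"),
              ("City", "NNP"), (",", ","), ("Missouri", "NNP")]
    pre = [{"word": "in", "tag": "IN"}]
    post = [{"word": ",", "tag": ","}, {"semantic": "STATE", "tag": "NNP"}]
    print(apply_rule(tagged, pre, post, "NNP", 2))  # Kansas City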
36. Other Pattern-Learning Systems
- SRV:
  - Toward General-Purpose Learning for Information Extraction. Dayne Freitag. http://citeseer.nj.nec.com/freitag98toward.html
  - Supervised machine learning based on FOIL. Constructs Horn clauses from examples.
- Active learning:
  - Active Learning for Information Extraction with Multiple View Feature Sets. Rosie Jones et al. http://www.cs.utah.edu/riloff/psfiles/ecml-wkshp03.pdf
  - Active learning with multiple views. Ion Muslea. http://www.ai.sri.com/muslea/PS/dissertation-02.pdf
  - Interactively learn and annotate data to reduce the human effort of data annotation.
37. Overview
- What is information extraction?
- A traditional system and its problems
- Pattern learning and classification
- Beyond patterns
38. The Need to Move Beyond the Pattern-Based Paradigm (1/2)
The space shuttle Challenger/AGENT_OF_DEATH flew apart over Florida like a billion-dollar confetti, killing/MANNER_OF_DEATH six astronauts/DECEASED.
Identifying these roles is hard using surface-level information, but easier using full parse trees.
[Figure: the AGENT_OF_DEATH, MANNER_OF_DEATH, and DECEASED roles attached to constituents of the full parse tree]
39. The Need to Move Beyond the Pattern-Based Paradigm (2/2)
- Pattern-based systems:
  - Have limited power due to the strict formalism: accuracy below 60% without additional discourse processing.
  - Were also developed due to a historical conjuncture: no high-performance full parser was widely available.
- Recent NLP developments:
  - Full syntactic parsing at about 90% accuracy [Collins, 1997; Charniak, 2000].
  - Predicate-argument frames provide an open-domain event representation [Surdeanu et al., 2003; Gildea and Jurafsky, 2002; Gildea and Palmer, 2002].
40. Goal
- Novel IE paradigm:
  - Syntactic representation provided by a full parser.
  - Event representation based on predicate-argument frames.
  - Entity coreference provides pronominal and nominal anaphora resolution (future work).
  - Event merging merges similar/overlapping events (future work).
- Advantages:
  - High accuracy due to enhanced syntactic and semantic processing.
  - Minimal domain customization time, because most components are open-domain.
41. Proposition Bank Overview
[Parse tree figure: "The futures halt was assailed by Big Board floor traders"; ARG1 (entity assailed) = "The futures halt", PRED = "assailed", ARG0 (agent) = "Big Board floor traders"]
- A one-million-word corpus annotated with predicate-argument structures [Kingsbury, 2002]. Currently only predicates lexicalized by verbs are annotated.
- Numbered arguments from 0 to 5. Typically ARG0 = agent, ARG1 = direct object or theme, ARG2 = indirect object, benefactive, or instrument, but the numbered roles are predicate-dependent!
- Functional tags: ARGM-LOC = locative, ARGM-TMP = temporal, ARGM-DIR = direction.
42. Block Architecture
[Block diagram of the system; the labeled module recovered here is "identification of pred-arg structures"]
43. Walk-Through Example
The space shuttle Challenger flew apart over Florida like a billion-dollar confetti, killing six astronauts.
44. The Model
- Consists of two tasks: (1) identifying the parse-tree constituents corresponding to predicate arguments, and (2) assigning a role to each argument constituent.
- Both tasks are modeled using C5.0 decision-tree learning and two sets of features: Feature Set 1, adapted from [Gildea and Jurafsky, 2002], and Feature Set 2, a novel set of semantic and syntactic features.
45. Feature Set 1
- POSITION (pos): indicates whether the constituent appears before the predicate in the sentence. E.g. true for ARG1 and false for ARG0.
- VOICE (voice): the predicate's voice (active or passive). E.g. passive for PRED.
- HEAD WORD (hw): the head word of the evaluated phrase. E.g. "halt" for ARG1.
- GOVERNING CATEGORY (gov): indicates whether an NP is dominated by an S phrase or a VP phrase. E.g. S for ARG1, VP for ARG0.
- PREDICATE WORD: the verb with morphological information preserved (verb), and the verb normalized to lower case and infinitive form (lemma). E.g. for PRED, verb is "assailed", lemma is "assail".
- PHRASE TYPE (pt): the type of the syntactic phrase serving as argument. E.g. NP for ARG1.
- PARSE TREE PATH (path): the path between argument and predicate. E.g. NP → S → VP → VP for ARG1.
- PATH LENGTH (pathLen): the number of labels stored in the predicate-argument path. E.g. 4 for ARG1.
(A sketch that computes several of these features follows below.)
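A small Python sketch that computes a few of these features from a gold parse; nltk is used only for the tree data structure, and the bracketing below plus the rightmost-leaf head heuristic are simplifying assumptions.

    from nltk import Tree

    tree = Tree.fromstring(
        "(S"
        " (NP (DT The) (NNS futures) (NN halt))"
        " (VP (VBD was)"
        "  (VP (VBN assailed)"
        "   (PP (IN by)"
        "    (NP (NNP Big) (NNP Board) (NN floor) (NNS traders))))))")

    def path_features(tree, arg_pos, pred_pos):
        """Parse-tree path between argument and predicate: up from the
        argument to their lowest common ancestor, then down to the predicate."""
        common = 0
        while (common < min(len(arg_pos), len(pred_pos))
               and arg_pos[common] == pred_pos[common]):
            common += 1
        up = [tree[arg_pos[:i]].label() for i in range(len(arg_pos), common, -1)]
        down = [tree[pred_pos[:i]].label() for i in range(common, len(pred_pos))]
        path = up + down
        return "->".join(path), len(path)

    arg_pos = (0,)        # the NP "The futures halt" (the ARG1 candidate)
    pred_pos = (1, 1, 0)  # the VBN "assailed" (the predicate)

    features = {
        "pt": tree[arg_pos].label(),            # phrase type: NP
        "hw": tree[arg_pos].leaves()[-1],       # head word (rightmost-leaf heuristic)
        "pos_before_pred": arg_pos < pred_pos,  # POSITION: argument precedes predicate
    }
    features["path"], features["pathLen"] = path_features(tree, arg_pos, pred_pos)
    print(features)
    # {'pt': 'NP', 'hw': 'halt', 'pos_before_pred': True,
    #  'path': 'NP->S->VP->VP', 'pathLen': 4}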
46. Observations about Feature Set 1
- Because most of the argument constituents are prepositional attachments (PP) and relative clauses (SBAR), the head word (hw) is often not the most informative word in the phrase.
- Due to its strong lexicalization, the model suffers from data sparsity; e.g. a typical hw value is used fewer than 3 times. The problem can be addressed with a back-off model from words to part-of-speech tags.
- The features in Set 1 capture only syntactic information, even though semantic information like named-entity tags should help. For example, ARGM-TMP typically contains DATE entities, and ARGM-LOC includes LOCATION named entities.
- Feature Set 1 does not capture predicates lexicalized by phrasal verbs, e.g. "put up".
47. Feature Set 2 (1/2)
- CONTENT WORD (cw): a lexicalized feature that selects an informative word from the constituent, other than the head. Selection heuristics are available in the paper. E.g. "June" for the phrase "in last June".
- PART OF SPEECH OF CONTENT WORD (cPos): the part-of-speech tag of the content word. E.g. NNP for the phrase "in last June".
- PART OF SPEECH OF HEAD WORD (hPos): the part-of-speech tag of the head word. E.g. NN for the phrase "the futures halt".
- NAMED ENTITY CLASS OF CONTENT WORD (cNE): the class of the named entity that includes the content word. 7 named-entity classes (from the MUC-7 specification) are covered. E.g. DATE for "in last June".
48. Feature Set 2 (2/2)
- BOOLEAN NAMED ENTITY FLAGS: a set of features that indicate whether a named entity is included at any position in the phrase:
  - neOrganization: set to true if an organization name is recognized in the phrase.
  - neLocation: set to true if a location name is recognized in the phrase.
  - nePerson: set to true if a person name is recognized in the phrase.
  - neMoney: set to true if a currency expression is recognized in the phrase.
  - nePercent: set to true if a percentage expression is recognized in the phrase.
  - neTime: set to true if a time-of-day expression is recognized in the phrase.
  - neDate: set to true if a date temporal expression is recognized in the phrase.
- PHRASAL VERB COLLOCATIONS: a set of two features that capture information about phrasal verbs (a sketch of their computation follows below):
  - pvcSum: the frequency with which the verb is immediately followed by any preposition or particle.
  - pvcMax: the frequency with which the verb is followed by its predominant preposition or particle.
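A quick Python sketch of how these two collocation features could be computed from a POS-tagged corpus; the counting scheme, tag set, and corpus are assumptions, and the paper's exact scheme may differ.

    from collections import Counter, defaultdict

    PARTICLE_TAGS = {"IN", "RP"}  # prepositions and particles

    def pvc_features(tagged_sentences):
        """For each verb, count how often it is immediately followed by any
        preposition/particle (pvcSum) and by its single most frequent
        preposition/particle (pvcMax)."""
        following = defaultdict(Counter)
        for sent in tagged_sentences:
            for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
                if t1.startswith("VB") and t2 in PARTICLE_TAGS:
                    following[w1.lower()][w2.lower()] += 1
        return {verb: (sum(c.values()), max(c.values()))
                for verb, c in following.items()}

    corpus = [
        [("put", "VBD"), ("up", "RP"), ("a", "DT"), ("fight", "NN")],
        [("put", "VBD"), ("up", "RP"), ("resistance", "NN")],
        [("put", "VBD"), ("off", "RP"), ("the", "DT"), ("vote", "NN")],
    ]
    print(pvc_features(corpus))  # {'put': (3, 2)} -> pvcSum=3, pvcMax=2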
49. Experiments (1/3)
- Trained on PropBank release 2002/7/15 and Treebank release 2, both without Section 23. Named-entity information was extracted using CiceroLite.
- Tested on PropBank and Treebank Section 23. Used gold-standard trees from Treebank and named entities from CiceroLite.
- Task 1 (identifying argument constituents):
  - Negative examples: any Treebank phrases not tagged in PropBank. Due to memory limitations, we used 11% of Treebank.
  - Positive examples: Treebank phrases (from the same 11% set) annotated with any PropBank role.
- Task 2 (assigning roles to argument constituents):
  - Due to memory limitations, we limited the example set to the first 60% of the PropBank annotations.
50. Experiments (2/3)

  Features                          Arg P   Arg R   Arg F1   Role A
  FS1                               84.96   84.26   84.61    78.76
  FS1 + POS tag of head word        92.24   84.50   88.20    79.04
  FS1 + content word and POS tag    92.19   84.67   88.27    80.80
  FS1 + NE label of content word    83.93   85.69   84.80    79.85
  FS1 + phrase NE flags             87.78   85.71   86.73    81.28
  FS1 + phrasal verb information    84.88   82.77   83.81    78.62
  FS1 + FS2                         91.62   85.06   88.22    83.05
  FS1 + FS2 + boosting              93.00   85.29   88.98    83.74
51. Experiments (3/3)
- Four models compared:
  - [Gildea and Palmer, 2002]
  - [Gildea and Palmer, 2002], our implementation
  - Our model with FS1
  - Our model with FS1 + FS2 + boosting

  Model            Implementation          Arg F1   Role A
  Statistical      Gildea and Palmer       -        82.8
  Statistical      This study              71.86    78.87
  Decision Trees   FS1                     84.61    78.76
  Decision Trees   FS1 + FS2 + boosting    88.98    83.74
52. Mapping Predicate-Argument Structures to Templettes
- The mapping rules from predicate-argument structures to templette slots are currently manually produced, using training texts and the corresponding templettes. The effort per domain is less than 3 person-hours if training information is available. (An illustrative rule format is sketched below.)
- We focused on two Event99 domains:
  - Market change: tracks changes of financial instruments. Relevant slots: INSTRUMENT (description of the financial instrument), AMOUNT_CHANGE (change amount), and CURRENT_VALUE (current instrument value after the change).
  - Death: extracts person death events. Relevant slots: DECEASED (the person deceased), MANNER_OF_DEATH (the manner of death), and AGENT_OF_DEATH (the entity that caused the death event).
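An illustrative Python sketch of what a manually produced mapping rule might look like for the Death domain; the rule format and the frame encoding are invented for this example, not the system's actual representation.

    # One extracted predicate-argument frame for "...killing six astronauts":
    frame = {"pred": "kill", "ARG0": "the space shuttle Challenger",
             "ARG1": "six astronauts"}

    # Mapping rules: predicate lemma -> {argument label: templette slot}.
    DEATH_MAPPINGS = {
        "kill": {"ARG0": "AGENT_OF_DEATH", "ARG1": "DECEASED"},
        "die":  {"ARG1": "DECEASED", "ARGM-MNR": "MANNER_OF_DEATH"},
    }

    def frame_to_templette(frame, mappings):
        rule = mappings.get(frame["pred"], {})
        return {slot: frame[arg] for arg, slot in rule.items() if arg in frame}

    print(frame_to_templette(frame, DEATH_MAPPINGS))
    # {'AGENT_OF_DEATH': 'the space shuttle Challenger', 'DECEASED': 'six astronauts'}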
53. Mappings for Event99 Death and Market Change Domains
54. Experimental Setup
- Three systems compared:
  - This model, with predicate-argument structures detected using the statistical approach.
  - This model, with predicate-argument structures detected using decision trees.
  - The cascaded finite-state-automata system (CICERO).
- In all systems, entity coreference and event fusion were disabled.
55. Experiments

  System                  Market Change   Death
  Pred/Args Statistical   68.9            58.4
  Pred/Args Inductive     82.8            67.0
  FSA                     91.3            72.7

  System                  Correct   Missed   Incorrect
  Pred/Args Statistical   26        16       3
  Pred/Args Inductive     33        9        2
  FSA                     38        4        2
56. The Good and the Bad
- The good:
  - The method achieves over 88% F-measure for the task of identifying argument constituents, and over 83% accuracy for role labeling.
  - The model scales well to unknown predicates because predicate lexical information is used for less than 5% of the branching decisions.
  - Domain customization of the complete IE system takes less than 3 person-hours per domain because most of the components are open-domain. Domain-specific components can be modeled with machine learning (future work).
  - Performance degradation versus a fully-customized IE system is only 10%. It will be further decreased by including coreference resolution (open-domain) and event fusion (domain-specific).
- The bad:
  - Currently PropBank provides annotations only for verb-based predicates. Noun-noun relations cannot be modeled for now.
  - Cannot be applied to unstructured text, where full parsing does not work.
  - Slower than the cascaded FSA models.
57. Other Pattern-Free Systems
- Algorithms That Learn To Extract Information: BBN: Description of the SIFT System as Used for MUC-7. Scott Miller et al. http://citeseer.nj.nec.com/miller98algorithms.html
  - Probabilistic model with features extracted from full parse trees enhanced with named entities.
- Kernel Methods for Relation Extraction. Dmitry Zelenko and Chinatsu Aone. http://citeseer.nj.nec.com/zelenko02kernel.html
  - Tree-based SVM kernels used to discover EELD relations.
- Automatic Pattern Acquisition for Japanese Information Extraction. Kiyoshi Sudo et al. http://citeseer.nj.nec.com/sudo01automatic.html
  - Learns parse trees that subsume the information of interest.
58. End
Gràcies! (Thank you!)