Information Extraction: A Practical Survey (presentation transcript)
1
Information Extraction: A Practical Survey
Mihai Surdeanu
  • TALP Research Center
  • Dep. Llenguatges i Sistemes Informàtics
  • Universitat Politècnica de Catalunya
  • surdeanu@lsi.upc.es

2
Overview
  • What is information extraction?
  • A traditional system and its problems
  • Pattern learning and classification
  • Beyond patterns

3
What is information extraction?
  • "The extraction or pulling out of pertinent
    information from large volumes of texts."
    (http://www.itl.nist.gov/iad/894.02/related_projects/muc/index.html)
  • Information extraction (IE) systems extract
    concepts, events, and relations that are relevant
    for a given scenario domain.
  • But what is a concept, an event, or a scenario
    domain? Actual implementations of IE systems
    have varied throughout the history of the task:
    MUC, Event99, EELD.
  • The tendency is to simplify the definition (or
    rather the implementation) of the task.

4
Information Extraction at the Message
Understanding Conferences
  • Seven MUC conferences, between 1987 and 1998.
  • Scenario domains driven by template
    specifications (fairly similar to database
    schemas), which define the content to be
    extracted.
  • Each event fills exactly one template (fairly
    similar to a database record).
  • Each template slot contains either text, or
    pointers to other templates.
  • The goal was to use IE technology to populate
    relational databases. This never really happened:
  • The chosen representation was too complicated.
  • It did not address real-world problems, but
    artificial benchmarks.
  • Systems never achieved good-enough accuracy.

5
MUC-6 Management Succession Example
Barry Diller was appointed chief executive
officer of QVC Network Inc.
  • <SUCCESSION_EVENT-9301190125-1>
  •   SUCCESSION_ORG: <ORGANIZATION-9301190125-1>
  •   POST: chief executive officer
  •   IN_AND_OUT:
  •     <IN_AND_OUT-9301190125-1>
  •     <IN_AND_OUT-9301190125-2>
  •   VACANCY_REASON: REASSIGNMENT
  • <IN_AND_OUT-9301190125-1>
  •   IO_PERSON: <PERSON-9301190125-1>
  •   NEW_STATUS: IN
  •   ON_THE_JOB: UNCLEAR
  •   OTHER_ORG: <ORGANIZATION-9301190125-2>
  •   REL_OTHER_ORG: OUTSIDE_ORG
  •   COMMENT: Barry Diller IN
  • <ORGANIZATION-9301190125-1>
  •   ORG_NAME: QVC Network Inc.
  •   ORG_TYPE: COMPANY

MUC-6 template: some slots are filled with text (e.g. POST), while
others point to another template (e.g. SUCCESSION_ORG).
6
Information Extraction at DARPA's HUB-4 Event99
  • Planned as a successor of MUC.
  • Identification and extraction of relevant
    information dictated by templettes, which are
    flat, simplified templates. Slots are filled
    only with text; no pointers to other templettes
    are accepted.
  • Domains closer to real-world applications are
    addressed: natural disasters, bombings, deaths,
    elections, financial fluctuations, illness
    outbreaks.
  • The goal was to provide event-level indexing into
    documents such as news wires, radio and
    television transcripts, etc. Imagine
    querying "BOMBING AND Gaza" in news messages
    and retrieving only the relevant text about
    bombing events in the Gaza area, classified into
    templettes.
  • Event99: A Proposed Event Indexing Task For
    Broadcast News. Lynette Hirschman et al.
    (http://citeseer.nj.nec.com/424439.html)

7
Event99 Death ExampleTemplettes Versus
Templates
The sole survivor of the car crash that killed
Princess Diana and Dodi Fayed last year in France
is remembering more about the accident.

<DEATH-CNN3-1>
  DECEASED: Princess Diana / Dodi Fayed
  MANNER_OF_DEATH: the car crash that killed Princess Diana and
    Dodi Fayed / the accident
  LOCATION: in France
  DATE: last year
8
Information Extraction at DARPA's Evidence
Extraction and Link Detection (EELD) Program
  • IE is used as a tool for the more general problem of
    link discovery: sift through large data
    collections and derive complex rules from
    collections of simpler IE patterns.
  • Example: certain sets of account_number(Person, Account),
    deposit(Account, Amount),
    greater_than(Amount, reporting_amount) patterns
    imply is_a(Person, money_launderer). Note that the
    fact that Person is a money_launderer is not
    stated in any form in the text! (A minimal sketch of
    such a rule follows this list.)
  • IE is used to identify concepts (typically named
    entities), events (typically identified by
    trigger words), and basic entity-entity and
    entity-event relations.
  • Simpler IE problem:
  • No templates or templettes are generated.
  • Does not deal with event merging.
  • Events are always marked by trigger words, e.g.
    "murder" triggers a MURDER event.
  • Relations are always intra-sentential.
  • EELD web portal: http://www.rl.af.mil/tech/programs/eeld/
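To make this concrete, here is a minimal Python sketch (not the EELD formalism) of how a hand-written link-discovery rule can be evaluated over relations produced by an IE front end; the facts and the reporting threshold are invented for illustration.

from itertools import product

# Facts produced by an IE front end: relation name -> set of argument tuples.
facts = {
    "account_number": {("John Smith", "ACC-17")},
    "deposit": {("ACC-17", 25000)},
}
REPORTING_AMOUNT = 10000  # hypothetical reporting threshold

def money_launderer_candidates(facts):
    """Derive is_a(Person, money_launderer) from simpler IE relations."""
    people = set()
    for (person, account), (acc, amount) in product(
            facts["account_number"], facts["deposit"]):
        # Join on the shared Account variable, then test the amount constraint.
        if account == acc and amount > REPORTING_AMOUNT:
            people.add(person)
    return people

print(money_launderer_candidates(facts))  # {'John Smith'}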

9
EELD Example
John Smith is the chief scientist of Hardcom
Corporation.

Entities: Person(John Smith), Organization(Hardcom Corporation)
Events: --
Relations: person-affiliation(Person(John Smith),
           Organization(Hardcom Corporation))

The murder of John Smith

Entities: Person(John Smith)
Events: Murder(murder)
Relations: murder-victim(Person(John Smith), Murder(murder))
10
Overview
  • What is information extraction?
  • A traditional system and its problems
  • Pattern learning and classification
  • Beyond patterns

11
Traditional IE Architecture
  • The Finite State Automaton Text Understanding
    System (FASTUS) approach: cascaded finite state
    automata (FSA).
  • Each FSA level recognizes larger linguistic
    constructs (from tokens to chunks to clauses to
    domain patterns), which become the simplified
    input for the next FSA in the cascade.
  • Why? Speed. Robustness to unstructured input.
    Handles data sparsity well.
  • The FSA cascade is enriched with limited
    discourse processing components: coreference
    resolution and event merging.
  • Most systems in MUC ended up using this
    architecture: CIRCUS from UMass (actually the
    first to introduce the cascaded FSA
    architecture), PROTEUS (NYU), PLUM (BBN), CICERO
    (LCC), and many others.
  • An ocean of information is available:
  • FASTUS: A Cascaded Finite-State Transducer for
    Extracting Information from Natural-Language
    Text. Jerry R. Hobbs et al.
    http://www.ai.sri.com/natural-language/projects/fastus-schabes.html
  • Infrastructure for Open-Domain Information
    Extraction. Mihai Surdeanu and Sanda Harabagiu.
    http://www.languagecomputer.com/papers/hlt2002.pdf
  • Rich IE bibliography maintained by Horacio
    Rodriguez at
    http://www.lsi.upc.es/horacio/varios/sevilla2001.zip

12
Language Computer's CICERO Information
Extraction System

Documents flow through the following cascade:
  • known word recognition: recognizes known concepts using lexicons
    and gazetteers.
  • numerical-entity recognition: identifies numerical entities such as
    money, percents, dates and times (FSA).
  • named-entity recognition: identifies named entities such as person,
    location, and organization names (FSA).
  • name aliasing: disambiguates incomplete or ambiguous names.
    (The entity recognition and aliasing modules also act as a
    stand-alone named-entity recognizer.)
  • phrasal parser: identifies basic noun, verb, and particle
    phrases (TBL + FSA).
  • phrase combiner: identifies domain-dependent complex noun and verb
    phrases (FSA).
  • entity coreference resolution: detects pronominal and nominal
    coreference links.
  • domain pattern recognition: identifies domain-dependent patterns (FSA).
  • event coreference: resolves empty templette slots.
  • event merging: merges templettes belonging to the same event.
The output is a set of templettes/templates.
13
Walk-Through Example (1/5)
At least seven police officers were killed and as
many as 52 other people, including several
children, were injured Monday in a car bombing
that also wrecked a police station. Kirkuk's
police said they had "good information" that
Ansar al-Islam was behind the blast.
  • <BOMBING>
  • BOMB: a car bombing
  • PERPETRATOR: Ansar al-Islam
  • DEAD: At least seven police officers
  • INJURED: as many as 52 other people, including
    several children
  • DAMAGE: a police station
  • LOCATION: Kirkuk
  • DATE: Monday

14
Walk-Through Example (2/5)
15
Walk-Through Example (3/5)
Entity coreference resolution
they → the police; the blast → a car bombing
16
Walk-Through Example (4/5)
At least seven police officers were
killed/PATTERN and as many as 52 other people,
including several children, were injured Monday
in a car bombing/PATTERN car bombing that also
wrecked a police station/PATTERN. Kirkuk's police
said they had "good information" that Ansar
al-Islam was behind the blast/PATTERN.
17
Walk-Through Example (5/5)
18
Coreference for IE
  • Algorithm detailed in Recognizing Referential
    Links: An Information Extraction Perspective.
    Megumi Kameyama.
    http://citeseer.nj.nec.com/kameyama97recognizing.html
  • 3-step algorithm (a minimal sketch follows this list):
  • Identify all anaphoric entities, e.g. pronouns,
    nouns, ambiguous named entities.
  • For each anaphoric entity, identify all possible
    candidates and sort them according to some
    salience ordering, e.g. left-to-right traversal
    in the same sentence, right-to-left traversal in
    previous sentences.
  • Extract the first candidate that matches some
    semantic constraints, e.g. number and gender
    consistency. Merge the candidate with the
    anaphoric entity.
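A minimal Python sketch of the candidate-selection step, assuming candidates arrive already sorted by the salience ordering and that each mention carries number and gender attributes; the data structures are invented for illustration and are not Kameyama's implementation.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Mention:
    text: str
    number: str  # "sg" or "pl"
    gender: str  # "m", "f", or "n"

def resolve(anaphor: Mention, candidates: List[Mention]) -> Optional[Mention]:
    """Return the first candidate, in salience order, that satisfies
    simple semantic constraints (number and gender consistency)."""
    for cand in candidates:  # candidates are pre-sorted by salience
        if cand.number == anaphor.number and cand.gender == anaphor.gender:
            return cand
    return None

# "they" -> "the police": plural, neutral gender in this toy encoding.
they = Mention("they", "pl", "n")
cands = [Mention("the police", "pl", "n"), Mention("a car bombing", "sg", "n")]
print(resolve(they, cands).text)  # the police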

19
The Role of Coreference in Named Entity
Recognition
  • Classifies unknown named entities that are
    likely part of a name but cannot be identified
    as such due to insufficient local context.
  • Example: "Michigan National Corp."/ORG said it
    will eliminate some senior management jobs ...
    "Michigan National"/? said the restructuring ...
  • Disambiguates named entities of ambiguous length
    and/or ambiguous type.
  • "Michigan" is changed from LOC to ORG when
    "Michigan Corp." appears in the same context.
  • The text "McDonald's" may contain a person name
    ("McDonald") or an organization name ("McDonald's").
    A non-deterministic FSA is used to maintain both
    alternatives until after name aliasing, when one
    is selected.
  • Disambiguates headline named entities.
  • Headlines are typically capitalized, e.g.
    "McDermott Completes Sale".
  • Processing of headlines is postponed until after the
    body of text is processed.
  • A longest-match approach is used to match the
    headline sequence of tokens against entities
    found in the first body paragraph. For example,
    "McDermott" is labeled as ORG because it matches
    against "McDermott International Inc." in the first
    document paragraph. (A sketch of this longest-match
    step follows the list.)
  • Over 5% increase in accuracy (F-measure), from
    87.81 to 93.64.
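A minimal Python sketch of the longest-match idea, assuming the body entities have already been recognized and typed; the function and data layout are invented for illustration.

def label_headline(headline_tokens, body_entities):
    """body_entities: dict mapping entity token tuples to NE types,
    e.g. {("McDermott", "International", "Inc."): "ORG"}."""
    labels = [None] * len(headline_tokens)
    for i, token in enumerate(headline_tokens):
        best_type = None
        best_len = 0
        for entity_tokens, ne_type in body_entities.items():
            # Count how many consecutive headline tokens match a prefix
            # of this entity; keep the longest such match.
            k = 0
            while (i + k < len(headline_tokens) and k < len(entity_tokens)
                   and headline_tokens[i + k] == entity_tokens[k]):
                k += 1
            if k > best_len:
                best_len, best_type = k, ne_type
        if best_type:
            labels[i] = best_type
    return list(zip(headline_tokens, labels))

body = {("McDermott", "International", "Inc."): "ORG"}
print(label_headline(["McDermott", "Completes", "Sale"], body))
# [('McDermott', 'ORG'), ('Completes', None), ('Sale', None)]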

20
The Role of Coreference in IE
21
The Good
  • Relatively good performance with a simple system:
  • F-measures over 75%, up to 88% for some simpler
    Event99 domains
  • Execution times below 10 seconds per 5KB document
  • Improvements to the FSA-only approach:
  • Coreference almost doubles the FSA-only
    performance
  • More extraction rules add little to the IE
    performance, whereas different forms of
    coreference add more
  • Non-determinism is used to mitigate the limited
    power of FSA grammars

22
The Bad
  • Needs domain-specific lexicons, e.g. an ontology
    of bombing devices. Work to automate this
    process: Learning Dictionaries for Information
    Extraction by Multi-Level Bootstrapping. Ellen
    Riloff and Rosie Jones.
    http://www.cs.utah.edu/riloff/psfiles/aaai99.pdf
    (not covered in this presentation)
  • Domain-specific patterns must be developed, e.g.
    <SUBJECT> explode.
  • Patterns must be classified: what does the above
    pattern mean? Is the subject a bomb, a
    perpetrator, or a location?
  • Patterns cannot cover the flexibility of natural
    language. We need better models that go beyond
    the pattern limitations.
  • Event merging is another NP-complete problem. One
    of the few stochastic models for event merging:
    Probabilistic Coreference in Information
    Extraction. Andrew Kehler.
    http://ling.ucsd.edu/kehler/Papers/emnlp97.ps.gz
    (not covered in this presentation)
  • All of the above components are manually developed,
    which yields a high domain development time (more
    than 40 person-hours per domain). This prohibits
    the use of this approach for real-time
    information extraction.

23
Overview
  • What is information extraction?
  • A traditional system and its problems
  • Pattern learning and classification
  • Beyond patterns

24
Automatically Generating Extraction Patterns from
Untagged Text
  • The first system to successfully discover domain
    patterns: AutoSlog-TS.
  • Automatically Generating Extraction Patterns from
    Untagged Text. Ellen Riloff.
    http://www.cs.utah.edu/riloff/psfiles/aaai96.pdf
  • The intuition is that domain-specific patterns
    will appear more often in documents related to
    the domain of interest than in unrelated
    documents.

25
Weakly-Supervised Pattern Learning Algorithm
(1/2)
  1. Separate the training document set into relevant
    and irrelevant documents (manual process).
  2. Generate all possible patterns in all documents,
    according to some meta-patterns. Examples below.

Meta-Pattern            Example Pattern
<subj> active-verb      <perpetrator> bombed
active-verb <dobj>      bombed <target>
infinitive <dobj>       to kill <victim>
gerund <dobj>           killing <victim>
<np> prep <np>          <bomb> against <target>
26
Weakly-Supervised Pattern Learning Algorithm
(2/2)
  3. Rank all generated patterns according to the
    formula relevance_rate x log2(frequency), where
    the relevance_rate indicates the ratio of
    relevant instances (i.e. in relevant documents
    versus non-relevant documents) of the
    corresponding pattern, and frequency indicates
    the number of times the pattern was seen in
    relevant documents.
  4. Add the top-ranked pattern to the list of learned
    patterns, and mark all documents where the
    pattern appears as relevant.
  5. Repeat the process from Step 3 for N
    iterations. Hence the output of the algorithm is
    N learned patterns. (A minimal sketch of this
    loop follows.)
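A minimal Python sketch of the ranking-and-selection loop, assuming patterns have already been generated and counted per document; the counts are invented, and relevance_rate is taken here as a pattern's frequency in relevant documents divided by its total frequency.

import math

# pattern -> (count in relevant docs, count in all docs); toy counts.
counts = {
    "<subj> exploded":   (20, 22),
    "<subj> was killed": (15, 18),
    "attack on <np>":    (10, 30),
    "<subj> said":       (50, 400),
}

def score(rel, total):
    """relevance_rate x log2(frequency), as on the slide."""
    if rel == 0:
        return 0.0
    relevance_rate = rel / total
    return relevance_rate * math.log2(rel)

def learn_patterns(counts, n_iterations):
    learned = []
    remaining = dict(counts)
    for _ in range(n_iterations):
        best = max(remaining, key=lambda p: score(*remaining[p]))
        learned.append(best)
        del remaining[best]
        # In the full algorithm, documents containing `best` would now be
        # marked relevant and the counts recomputed before the next round.
    return learned

print(learn_patterns(counts, 2))  # ['<subj> exploded', '<subj> was killed']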

27
Examples of Learned Patterns
Patterns learned for the MUC-4 terrorism domain:
<subj> exploded
murder of <np>
assassination of <np>
<subj> was killed
<subj> was kidnapped
attack on <np>
<subj> was injured
exploded in <np>
death of <np>
<subj> took_place
28
The Good and the Bad
  • The good
  • Performance very close to the manually-customized
    system
  • The bad
  • Documents must be separated into
    relevant/irrelevant by hand
  • When does the learning process stop?
  • Pattern classification and event merging still
    developed by human experts

29
The ExDisco IE System
  • Automatic Acquisition of Domain Knowledge for
    Information Extraction. Roman Yangarber et al.
    http://www.cs.nyu.edu/roman/Papers/2000-coling-pub.ps.gz
  • Quasi-automatically separates documents into
    relevant/non-relevant using a set of seed
    patterns selected by the user, e.g. <company>
    appoint-verb <person> for the MUC-6 management
    succession domain.
  • In addition to ranking patterns, ExDisco ranks
    documents based on how many relevant patterns
    they contain → immediate application to text
    filtering.

30
Counter-Training for Pattern Discovery
  • Counter-Training in Discovery of Semantic
    Patterns. Roman Yangarber.
    http://www.cs.nyu.edu/roman/Papers/2003-acl-countertrain-web.pdf
  • Previous approaches are iterative learning
    algorithms, where the output is a continuous
    stream of patterns with degrading precision. What
    is the best stopping point?
  • The approach is to introduce competition among
    multiple scenario learners (e.g. management
    succession, mergers and acquisitions, legal
    actions). Stop when the learners wander into
    territories already discovered by others.
  • Pattern frequency is weighted by document
    relevance.
  • Document relevance receives a negative weight based
    on how many patterns from a different scenario it
    contains.
  • The learning for each scenario stops when the
    best pattern has a negative score. (A minimal
    sketch of this stopping criterion follows.)
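A minimal, heavily simplified Python sketch of the stopping criterion (not Yangarber's implementation): document relevance is rewarded by a scenario's own accepted patterns and penalized by competitors' patterns, and a scenario stops learning when its best candidate no longer scores positively.

def doc_relevance(doc_patterns, own_patterns, other_patterns):
    """+1 per own accepted pattern in the document, -1 per competitor's."""
    return (len(doc_patterns & own_patterns)
            - len(doc_patterns & other_patterns))

def best_candidate(docs, own, others):
    """Score each unaccepted pattern by summing the relevance of the
    documents it occurs in; return the best (pattern, score) pair."""
    scores = {}
    for doc_patterns in docs:
        rel = doc_relevance(doc_patterns, own, others)
        for p in doc_patterns - own - others:
            scores[p] = scores.get(p, 0) + rel
    if not scores:
        return None, float("-inf")
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy corpus: each document is the set of candidate patterns it contains.
docs = [
    {"<company> appointed <person>", "<person> resigned"},
    {"<company> appointed <person>", "<company> acquired <company>"},
    {"<company> acquired <company>", "<company> paid <money>"},
]
succession = {"<person> resigned"}                 # accepted so far
acquisitions = {"<company> acquired <company>"}    # competing scenario

pattern, score = best_candidate(docs, succession, acquisitions)
if score < 0:
    print("stop learning for this scenario")
else:
    print("accept:", pattern, "score:", score)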

31
Pattern Classification
  • Multiple systems perform successful pattern
    acquisition by now, e.g. "attacked <np>" is
    discovered for the bombing domain. But what does
    the <np> actually mean? Is it the victim, the
    physical target, or something else?
  • An Empirical Approach to Conceptual Case Frame
    Acquisition. Ellen Riloff and Mark Schmelzenbach.
    http://www.cs.utah.edu/riloff/psfiles/wvlc98.pdf

32
Pattern Classification Algorithm
  • Requires 5 seed words per semantic category (e.g.
    PERPETRATOR, VICTIM, etc.).
  • Builds a context for each semantic category by
    expanding the seed word set with words that
    appear frequently in the proximity of previous
    seed words.
  • Uses AutoSlog to discover domain patterns.
  • Builds a semantic profile for each discovered
    pattern based on the overlap between the noun
    phrases contained in the pattern and the previous
    semantic contexts.
  • Each pattern is associated with the best-ranked
    semantic category. (A minimal sketch of this
    profiling step follows.)
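A minimal Python sketch of the semantic-profile step, with invented contexts and filler counts; this is an illustration, not the Riloff and Schmelzenbach implementation.

from collections import Counter

# Each category context: seed words expanded with frequently co-occurring words.
contexts = {
    "TERRORIST": {"guerrillas", "rebels", "terrorists", "extremists"},
    "BUILDING":  {"embassy", "office", "headquarters", "building", "home"},
}

# Head nouns of the NPs filling the pattern "attack on <np>" in a toy corpus.
pattern_fillers = Counter({"embassy": 4, "rebels": 1, "office": 2, "convoy": 1})

def semantic_profile(fillers, contexts):
    """Fraction of the pattern's fillers that fall in each category context."""
    total = sum(fillers.values())
    profile = {}
    for category, context in contexts.items():
        overlap = sum(n for word, n in fillers.items() if word in context)
        profile[category] = overlap / total
    return profile

profile = semantic_profile(pattern_fillers, contexts)
print(profile)                        # {'TERRORIST': 0.125, 'BUILDING': 0.75}
print(max(profile, key=profile.get))  # BUILDING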

33
Pattern Classification Example
Semantic Category    Probability
BUILDING             0.10
CIVILIAN             0.03
DATE                 0.05
GOVOFFICIAL          0.03
LOCATION             0.03
MILITARYPEOPLE       0.09
TERRORIST            0.00
VEHICLE              0.03
WEAPON               0.00

Semantic profile for the pattern "attack on <np>"
34
Other Pattern-Learning Systems RAPIER (1/2)
  • Relational Learning of Pattern-Match Rules for
    Information Extraction. Mary Elaine Califf and
    Raymond J. Mooney.
    http://citeseer.nj.nec.com/califf98relational.html
  • Uses Inductive Logic Programming (ILP) to
    implement a bottom-up generalization of patterns.
  • Patterns are specified with pre-fillers (conditions
    on the tokens preceding the pattern), fillers
    (conditions on the tokens included in the
    pattern), and post-fillers (conditions on the
    tokens following the pattern).
  • The only linguistic resource used is a
    part-of-speech (POS) tagger. No parser (full or
    partial) is used!
  • More robust to unstructured text.
  • Applicability limited to simpler domains (e.g.
    job postings).

35
Other Pattern-Learning Systems RAPIER (2/2)
located in Atlanta, Georgia

  Pre-filler:  1) word: located, tag: VBN   2) word: in, tag: IN
  Filler:      1) word: Atlanta, tag: NNP
  Post-filler: 1) word: ",", tag: ","       2) word: Georgia, tag: NNP

offices in Kansas City, Missouri

  Pre-filler:  1) word: offices, tag: NNS   2) word: in, tag: IN
  Filler:      1) word: Kansas, tag: NNP    2) word: City, tag: NNP
  Post-filler: 1) word: ",", tag: ","       2) word: Missouri, tag: NNP

Generalized rule:

  Pre-filler:  1) word: in, tag: IN
  Filler:      1) list of up to 2 words, tag: NNP
  Post-filler: 1) word: ",", tag: ","       2) semantic: STATE, tag: NNP
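A minimal Python sketch of how such a rule can be represented and matched against POS-tagged tokens; this is an illustration only (RAPIER's rule language and ILP-based generalization are richer, and the semantic-class constraint is omitted here).

def item_matches(token, constraint):
    word, tag = token
    return (constraint.get("word") in (None, word)
            and constraint.get("tag") in (None, tag))

def match(tokens, rule, max_filler_len=2):
    """Return the text of the first filler that satisfies the rule, or None."""
    pre, post = rule["pre"], rule["post"]
    for start in range(len(tokens)):
        if start < len(pre):
            continue  # pre-filler must fit immediately before the filler
        if not all(item_matches(tokens[start - len(pre) + i], c)
                   for i, c in enumerate(pre)):
            continue
        for flen in range(1, max_filler_len + 1):
            filler = tokens[start:start + flen]
            after = tokens[start + flen:start + flen + len(post)]
            if (len(filler) == flen and len(after) == len(post)
                    and all(item_matches(t, rule["filler"]) for t in filler)
                    and all(item_matches(t, c) for t, c in zip(after, post))):
                return " ".join(w for w, _ in filler)
    return None

rule = {
    "pre": [{"word": "in", "tag": "IN"}],
    "filler": {"tag": "NNP"},               # list of up to 2 NNP tokens
    "post": [{"word": ",", "tag": ","}, {"tag": "NNP"}],
}
tokens = [("offices", "NNS"), ("in", "IN"), ("Kansas", "NNP"),
          ("City", "NNP"), (",", ","), ("Missouri", "NNP")]
print(match(tokens, rule))  # Kansas City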
36
Other Pattern-Learning Systems
  • SRV
  • Toward General-Purpose Learning for Information
    Extraction. Dayne Freitag.
    http://citeseer.nj.nec.com/freitag98toward.html
  • Supervised machine learning based on FOIL.
    Constructs Horn clauses from examples.
  • Active learning
  • Active Learning for Information Extraction with
    Multiple View Feature Sets. Rosie Jones et al.
    http://www.cs.utah.edu/riloff/psfiles/ecml-wkshp03.pdf
  • Active learning with multiple views. Ion Muslea.
    http://www.ai.sri.com/muslea/PS/dissertation-02.pdf
  • Interactively learn and annotate data to reduce
    human effort in data annotation.

37
Overview
  • What is information extraction?
  • A traditional system and its problems
  • Pattern learning and classification
  • Beyond patterns

38
The Need to Move Beyond the Pattern-Based
Paradigm (1/2)
The space shuttle Challenger/AGENT_OF_DEATH flew
apart over Florida like a billion-dollar
confetti, killing/MANNER_OF_DEATH six
astronauts/DECEASED.

Identifying these roles is hard using surface-level information,
but easier using full parse trees.
(Figure: the AGENT_OF_DEATH, MANNER_OF_DEATH, and DECEASED roles
attached to parse-tree constituents.)
39
The Need to Move Beyond the Pattern-Based
Paradigm (2/2)
  • Pattern-based systems:
  • Have limited power due to the strict formalism →
    accuracy < 60% without additional discourse
    processing.
  • Were also developed due to the historical
    conjuncture: there was no high-performance full
    parser widely available.
  • Recent NLP developments:
  • Full syntactic parsing reaches about 90% accuracy
    [Collins, 1997; Charniak, 2000].
  • Predicate-argument frames provide an open-domain
    event representation [Surdeanu et al., 2003;
    Gildea and Jurafsky, 2002; Gildea and Palmer,
    2002].

40
Goal
  • Novel IE paradigm
  • Syntactic representation provided by full parser.
  • Event representation based on predicate-argument
    frames.
  • Entity coreference provides pronominal and
    nominal anaphora resolution (future work).
  • Event merging merges similar/overlapping events
    (future work).
  • Advantages
  • High accuracy due to enhanced syntactic and
    semantic processing.
  • Minimal domain customization time because most
    components are open-domain.

41
Proposition Bank Overview
Example: "The futures halt was assailed by Big Board floor traders."

  ARG1 (entity assailed): The futures halt
  PRED:                   assailed
  ARG0 (agent):           Big Board floor traders
  • A one million word corpus annotated with
    predicate-argument structures [Kingsbury, 2002].
    Currently only predicates lexicalized by verbs.
  • Numbered arguments from 0 to 5. Typically ARG0 =
    agent, ARG1 = direct object or theme, ARG2 =
    indirect object, benefactive, or instrument, but
    they are predicate dependent!
  • Functional tags: ARGM-LOC (locative), ARGM-TMP
    (temporal), ARGM-DIR (direction).

42
Block Architecture
(Figure: block architecture, including a module for the identification
of predicate-argument structures.)
43
Walk-Through Example
The space shuttle Challenger flew apart over
Florida like a billion-dollar confetti killing
six astronauts.
44
The Model
  • Consists of two tasks: (1) identifying parse tree
    constituents corresponding to predicate
    arguments, and (2) assigning a role to each
    argument constituent.
  • Both tasks are modeled using C5.0 decision tree
    learning and two sets of features: Feature Set 1,
    adapted from [Gildea and Jurafsky, 2002], and
    Feature Set 2, a novel set of semantic and
    syntactic features. (A minimal two-stage sketch
    follows.)
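A minimal two-stage Python sketch, using scikit-learn's DecisionTreeClassifier as a stand-in for C5.0 (an assumption for illustration) and toy encodings of a few Feature Set 1 features.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy training data: one dict of FS1-style features per constituent.
X = [
    {"pt": "NP", "pos_before": True,  "voice": "passive", "gov": "S"},
    {"pt": "NP", "pos_before": False, "voice": "passive", "gov": "VP"},
    {"pt": "PP", "pos_before": False, "voice": "passive", "gov": "VP"},
    {"pt": "NP", "pos_before": True,  "voice": "active",  "gov": "S"},
]
is_arg = [1, 1, 0, 1]                      # task 1 labels
roles = ["ARG1", "ARG0", None, "ARG0"]     # task 2 labels (arguments only)

vec = DictVectorizer()
Xv = vec.fit_transform(X)

arg_clf = DecisionTreeClassifier().fit(Xv, is_arg)            # task 1
role_rows = [i for i, y in enumerate(is_arg) if y == 1]
role_clf = DecisionTreeClassifier().fit(Xv[role_rows],        # task 2
                                        [roles[i] for i in role_rows])

test = vec.transform([{"pt": "NP", "pos_before": True,
                       "voice": "passive", "gov": "S"}])
if arg_clf.predict(test)[0] == 1:
    print("role:", role_clf.predict(test)[0])  # role: ARG1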

45
Feature Set 1
  • POSITION (pos): indicates if the constituent appears
    before the predicate in the sentence. E.g. true for ARG1
    and false for ARG2.
  • VOICE (voice): predicate voice (active or
    passive). E.g. passive for PRED.
  • HEAD WORD (hw): head word of the evaluated
    phrase. E.g. "halt" for ARG1.
  • GOVERNING CATEGORY (gov): indicates if an NP is
    dominated by an S phrase or a VP phrase. E.g. S
    for ARG1, VP for ARG0.
  • PREDICATE WORD: the verb with morphological
    information preserved (verb), and the verb
    normalized to lower case and infinitive form
    (lemma). E.g. for PRED, verb is "assailed", lemma
    is "assail".
  • PHRASE TYPE (pt): type of the syntactic phrase as
    argument. E.g. NP for ARG1.
  • PARSE TREE PATH (path): path between the argument
    and the predicate. E.g. NP → S → VP → VP for ARG1.
  • PATH LENGTH (pathLen): number of labels stored in
    the predicate-argument path. E.g. 4 for ARG1.
    (The sketch below collects these FS1 values for
    ARG1 in the running example.)
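Collecting the example values above, the FS1 features for ARG1 ("The futures halt") of the running example can be written as a simple dictionary; this is an illustrative encoding, not the system's internal format.

# FS1 features for ARG1 ("The futures halt") of
# "The futures halt was assailed by Big Board floor traders."
fs1_arg1 = {
    "pos": True,            # constituent appears before the predicate
    "voice": "passive",     # predicate voice
    "hw": "halt",           # head word of the phrase
    "gov": "S",             # NP dominated by an S phrase
    "verb": "assailed",     # predicate word as it appears
    "lemma": "assail",      # normalized predicate
    "pt": "NP",             # phrase type
    "path": "NP>S>VP>VP",   # parse tree path to the predicate
    "pathLen": 4,           # number of labels in the path
}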

46
Observations about Feature Set 1
  • Because most of the argument constituents are
    prepositional attachments (PP) and relative
    clauses (SBAR), often the head word (hw) is not
    the most informative word in the phrase.
  • Due to its strong lexicalization, the model
    suffers from data sparsity. E.g. hw is used in
    < 3% of cases. The problem can be addressed with
    a back-off model from words to part-of-speech
    tags (a minimal back-off sketch follows this list).
  • The features in set 1 capture only syntactic
    information, even though semantic information
    like named-entity tags should help. For example,
    ARGM-TMP typically contains DATE entities, and
    ARGM-LOC includes LOCATION named entities.
  • Feature set 1 does not capture predicates
    lexicalized by phrasal verbs, e.g. "put up".
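A minimal Python sketch of the suggested back-off from head words to POS tags; the counts and the threshold are invented for illustration.

# Use the head word when it was seen often enough in training,
# otherwise fall back to its part-of-speech tag.
hw_counts = {"halt": 12, "traders": 7}   # toy training counts
MIN_COUNT = 5                            # hypothetical back-off threshold

def hw_feature(head_word, head_pos):
    if hw_counts.get(head_word, 0) >= MIN_COUNT:
        return ("hw", head_word)         # lexicalized feature
    return ("hwPos", head_pos)           # backed-off POS feature

print(hw_feature("halt", "NN"))          # ('hw', 'halt')
print(hw_feature("confetti", "NN"))      # ('hwPos', 'NN')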

47
Feature Set 2 (1/2)
  • CONTENT WORD (cw): lexicalized feature that
    selects an informative word from the constituent,
    other than the head. Selection heuristics are
    available in the paper. E.g. "June" for the
    phrase "in last June".
  • PART OF SPEECH OF CONTENT WORD (cPos): part of
    speech tag of the content word. E.g. NNP for the
    phrase "in last June".
  • PART OF SPEECH OF HEAD WORD (hPos): part of
    speech tag of the head word. E.g. NN for the
    phrase "the futures halt".
  • NAMED ENTITY CLASS OF CONTENT WORD (cNE): the
    class of the named entity that includes the
    content word. 7 named entity classes (from the
    MUC-7 specification) are covered. E.g. DATE for
    "in last June".

48
Feature Set 2 (2/2)
  • BOOLEAN NAMED ENTITY FLAGS: set of features that
    indicate if a named entity is included at any
    position in the phrase:
  • neOrganization: set to true if an organization
    name is recognized in the phrase.
  • neLocation: set to true if a location name is
    recognized in the phrase.
  • nePerson: set to true if a person name is
    recognized in the phrase.
  • neMoney: set to true if a currency expression is
    recognized in the phrase.
  • nePercent: set to true if a percentage expression
    is recognized in the phrase.
  • neTime: set to true if a time-of-day expression
    is recognized in the phrase.
  • neDate: set to true if a date temporal expression
    is recognized in the phrase.
  • PHRASAL VERB COLLOCATIONS: set of two features
    that capture information about phrasal verbs
    (a counting sketch follows this list):
  • pvcSum: the frequency with which the verb is
    immediately followed by any preposition or
    particle.
  • pvcMax: the frequency with which the verb is
    followed by its predominant preposition or
    particle.
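A minimal Python sketch of how pvcSum and pvcMax could be computed from a POS-tagged corpus; the corpus representation and the particle tag set are assumptions for illustration.

from collections import Counter

PARTICLE_TAGS = {"IN", "RP", "TO"}   # assumed preposition/particle tags

def phrasal_verb_counts(tagged_sentences, lemma):
    """tagged_sentences: list of [(word, pos_tag, verb_lemma_or_None), ...]."""
    followers = Counter()
    for sent in tagged_sentences:
        for (_, _, lem), (next_word, next_pos, _) in zip(sent, sent[1:]):
            if lem == lemma and next_pos in PARTICLE_TAGS:
                followers[next_word.lower()] += 1
    pvc_sum = sum(followers.values())              # any particle follows
    pvc_max = max(followers.values(), default=0)   # predominant particle
    return pvc_sum, pvc_max

corpus = [
    [("He", "PRP", None), ("put", "VBD", "put"), ("up", "RP", None),
     ("a", "DT", None), ("fight", "NN", None)],
    [("They", "PRP", None), ("put", "VBD", "put"), ("up", "RP", None),
     ("posters", "NNS", None)],
    [("She", "PRP", None), ("put", "VBD", "put"), ("the", "DT", None),
     ("book", "NN", None), ("down", "RP", None)],
]
print(phrasal_verb_counts(corpus, "put"))  # (2, 2)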

49
Experiments (1/3)
  • Trained on PropBank release 2002/7/15, Treebank
    release 2, both without Section 23. Named entity
    information extracted using CiceroLite.
  • Tested on PropBank and Treebank Section 23. Used
    gold-standard trees from Treebank, and named
    entities from CiceroLite.
  • Task 1 (identifying argument constituents):
  • Negative examples: any Treebank phrases not
    tagged in PropBank. Due to memory limitations, we
    used 11% of Treebank.
  • Positive examples: Treebank phrases (from the
    same 11% set) annotated with any PropBank role.
  • Task 2 (assigning roles to argument
    constituents):
  • Due to memory limitations, we limited the example
    set to the first 60% of PropBank annotations.

50
Experiments (2/3)
Features                          Arg P   Arg R   Arg F1   Role A
FS1                               84.96   84.26   84.61    78.76
FS1 + POS tag of head word        92.24   84.50   88.20    79.04
FS1 + content word and POS tag    92.19   84.67   88.27    80.80
FS1 + NE label of content word    83.93   85.69   84.80    79.85
FS1 + phrase NE flags             87.78   85.71   86.73    81.28
FS1 + phrasal verb information    84.88   82.77   83.81    78.62
FS1 + FS2                         91.62   85.06   88.22    83.05
FS1 + FS2 + boosting              93.00   85.29   88.98    83.74

(Arg P/R/F1: argument identification precision, recall, and F-measure;
Role A: role assignment accuracy. All values are percentages.)
51
Experiments (3/3)
  • Four models compared:
  • [Gildea and Palmer, 2002]
  • [Gildea and Palmer, 2002], our implementation
  • Our model with FS1
  • Our model with FS1 + FS2 + boosting

Model            Implementation              Arg F1   Role A
Statistical      Gildea and Palmer           -        82.8
Statistical      This study                  71.86    78.87
Decision Trees   FS1                         84.61    78.76
Decision Trees   FS1 + FS2 + boosting        88.98    83.74
52
Mapping Predicate-Argument Structures to
Templettes
  • The mapping rules from predicate-argument
    structures to templette slots are currently
    manually produced, using training texts and the
    corresponding templettes. Effort per domain is
    under 3 person-hours, if training information is
    available. (A minimal sketch of such mapping
    rules follows this list.)
  • We focused on two Event99 domains:
  • Market change: tracks changes of financial
    instruments. Relevant slots: INSTRUMENT
    (description of the financial instrument),
    AMOUNT_CHANGE (change amount), and CURRENT_VALUE
    (current instrument value after the change).
  • Death: extracts person death events. Relevant
    slots: DECEASED (person deceased),
    MANNER_OF_DEATH (manner of death), and
    AGENT_OF_DEATH (entity that caused the death
    event).
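A minimal Python sketch of what such a hand-written mapping could look like for the Death domain; the rule table and argument labels are illustrative assumptions, not the system's actual mapping rules.

# Map arguments of death-related predicates to Death templette slots.
DEATH_RULES = {
    # predicate lemma -> {argument label: templette slot}
    "kill": {"ARG0": "AGENT_OF_DEATH", "ARG1": "DECEASED"},
    "die":  {"ARG1": "DECEASED", "ARGM-MNR": "MANNER_OF_DEATH"},
}

def fill_templette(pred_arg):
    """pred_arg: {'lemma': ..., 'args': {label: text, ...}}"""
    rules = DEATH_RULES.get(pred_arg["lemma"])
    if rules is None:
        return None
    return {slot: pred_arg["args"][label]
            for label, slot in rules.items() if label in pred_arg["args"]}

structure = {"lemma": "kill",
             "args": {"ARG0": "The space shuttle Challenger",
                      "ARG1": "six astronauts"}}
print(fill_templette(structure))
# {'AGENT_OF_DEATH': 'The space shuttle Challenger', 'DECEASED': 'six astronauts'}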

53
Mappings for Event99 Death and Market Change
Domains
54
Experimental Setup
  • Three systems compared:
  • This model, with predicate-argument structures
    detected using the statistical approach.
  • This model, with predicate-argument structures
    detected using decision trees.
  • Cascaded finite-state-automata system (CICERO).
  • In all systems, entity coreference and event
    fusion are disabled.

55
Experiments
System                  Market Change   Death
Pred/Args Statistical   68.9            58.4
Pred/Args Inductive     82.8            67.0
FSA                     91.3            72.7

System                  Correct   Missed   Incorrect
Pred/Args Statistical   26        16       3
Pred/Args Inductive     33        9        2
FSA                     38        4        2
56
The good and the bad
  • The good:
  • The method achieves over 88% F-measure for the
    task of identifying argument constituents, and
    over 83% accuracy for role labeling.
  • The model scales well to unknown predicates
    because predicate lexical information is used for
    less than 5% of the branching decisions.
  • Domain customization of the complete IE system is
    less than 3 person-hours per domain because most
    of the components are open-domain.
    Domain-specific components can be modeled with
    machine learning (future work).
  • Performance degradation versus a fully-customized
    IE system is only 10%. It will be further
    decreased by including coreference resolution
    (open-domain) and event fusion (domain-specific).
  • The bad:
  • Currently PropBank provides annotations only for
    verb-based predicates. Noun-noun relations cannot
    be modeled for now.
  • Cannot be applied to unstructured text, where
    full parsing does not work.
  • Slower than the cascaded FSA models.

57
Other Pattern-Free Systems
  • Algorithms That Learn To Extract Information:
    BBN Description Of The SIFT System As Used For
    MUC-7. Scott Miller et al.
    http://citeseer.nj.nec.com/miller98algorithms.html
  • Probabilistic model with features extracted from
    full parse trees enhanced with NEs.
  • Kernel Methods for Relation Extraction. Dmitry
    Zelenko and Chinatsu Aone.
    http://citeseer.nj.nec.com/zelenko02kernel.html
  • Tree-based SVM kernels used to discover EELD
    relations.
  • Automatic Pattern Acquisition for Japanese
    Information Extraction. Kiyoshi Sudo et al.
    http://citeseer.nj.nec.com/sudo01automatic.html
  • Learns parse trees that subsume the information
    of interest.

58
End
Gràcies! (Thank you!)