Title: Information Extraction
1Information Extraction
Ronen Feldman Bar Ilan University Israel
- Junichi Tsujii
- Graduate School of Science
- University of Tokyo
- Japan
2Application Tasks of NLP
(1)Information Retrieval/Detection
To search and retrieve documents in response to
queries for information
(2)Passage Retrieval
To search and retrieve part of documents in
response to queries for information
(3)Information Extraction
To extract information that fits pre-defined
database schemas or templates, specifying the
output formats
(4) Question/Answering Tasks
To answer general questions by using texts as a
knowledge base: fact retrieval, a combination of IR
and IE
(5)Text Understanding
To understand texts as people do: Artificial
Intelligence
3Ranges of Queries
(1)Information Retrieval/Detection
(2)Passage Retrieval
Pre-Defined Fixed aspects of information carried
in texts
(3)Information Extraction
(4) Question/Answering Tasks
(5)Text Understanding
4IE definitions
- Entity: an object of interest, such as a person or organization
- Attribute: a property of an entity, such as name, alias, descriptor or type
- Fact: a relationship held between two or more entities, such as Position of Person in Company
- Event: an activity involving several entities, such as terrorist act, airline crash, product information
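The four definitions above can be pictured as a small data model. The following is a minimal sketch only; the class names, field names and the relation label are illustrative assumptions, not part of any IE standard or of the original slides.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An object of interest, e.g. a person or an organization."""
    entity_type: str                                 # e.g. "Person", "Company"
    attributes: dict = field(default_factory=dict)   # name, alias, descriptor, type


@dataclass
class Fact:
    """A relationship held between two or more entities."""
    relation: str        # hypothetical label, e.g. "PositionOfPersonInCompany"
    arguments: tuple     # the entities taking part in the relation


@dataclass
class Event:
    """An activity involving several entities."""
    event_type: str      # e.g. "AirlineCrash"
    participants: dict   # role name -> Entity


person = Entity("Person", {"name": "Jurgen Pfrang",
                           "descriptor": "deputy general manager"})
company = Entity("Company", {"name": "Yaxing Benz"})
position = Fact("PositionOfPersonInCompany", (person, company))
print(position.relation, [e.attributes["name"] for e in position.arguments])
```

Attributes hang off an entity, facts link entities, and events group entities by role, which matches the ordering of difficulty suggested by the accuracy figures on the next slide.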
5IE accuracy: typical figures by information type
- Entity: 90-98%
- Attribute: 80%
- Fact: 60-70%
- Event: 50-60%
6MUC conferences
- MUC 1 to MUC 7
- 1987 to 1997
- Topics
- Naval operations (2)
- Terrorist Activity (2)
- Joint venture and microelectronics
- Management changes
- Space Vehicles and Missile launches
7The ACE Evaluation
- The ACE program: the challenge of extracting content
from human language. Research effort directed to
master:
- first, the extraction of entities
- then, the extraction of relations among these entities
- finally, the extraction of events that are
causally related sets of relations
- After two years, top systems successfully capture
well over 50% of the value at the entity level
8Applications of IE
- Routing of information
- Infrastructure for IR and categorization (higher
level features)
- Event based summarization
- Automatic creation of databases and knowledge
bases
9Where would IE be useful?
- Semi-structured text
- Generic documents like news articles
- Most of the information in the doc is centred
around a set of easily identifiable entities
10Example of IE FASTUS(1993)
15FASTUS
Based on finite-state automata (FSA)
1. Complex Words: recognition of multi-words and
proper names
(e.g. "set up", "new Taiwan dollars")
2. Basic Phrases: simple noun groups, verb groups
and particles
(e.g. "a Japanese trading house", "had set up")
3. Complex Phrases: complex noun groups and verb
groups
4. Domain Events: patterns for events of interest
to the application; basic templates are
built.
5. Merging Structures: templates from different
parts of the texts are merged if they provide
information about the same entity or event.
16Example of IE FASTUS(1993)
17Information Extraction
... Jurgen Pfrang, 51, reportedly stumbled upon
the robbers on the second floor of his Nanjing
home early on Sunday. The deputy general manager
of Yaxing Benz, a Sino-German joint venture that
makes buses and bus chassis in nearby
Yangzhou, was hacked to death with 45 cm
watermelon knives. ...
Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
(1) Name: X?
    Country: Germany
(2) Name: Y?
    Country: China
18Information Extraction
A German vehicle-firm executive was stabbed to
death ... Jurgen Pfrang, 51, reportedly
stumbled upon the robbers on the second floor of
his Nanjing home early on Sunday. The deputy
general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis
in nearby Yangzhou, was hacked to death with 45
cm watermelon knives. ...
Crime-Type: Murder
Type: Stabbing
The killed:
    Name: Jurgen Pfrang
    Age: 51
    Profession: Deputy general manager
Location: Nanjing, China
Different template for crimes
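To make the idea of filling a pre-defined template concrete, here is a minimal sketch that fills part of the crime template from the news text using hand-written regular expressions. The slot names, the function name fill_crime_template and the three patterns are illustrative assumptions tuned to this one passage; a real system would use the staged analysis described later in the slides.

```python
import re

TEXT = ("A German vehicle-firm executive was stabbed to death. "
        "Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the "
        "second floor of his Nanjing home early on Sunday.")


def fill_crime_template(text):
    """Fill a fixed crime template; slots with no evidence stay None."""
    template = {"crime_type": None, "type": None,
                "victim_name": None, "victim_age": None, "location": None}

    # Hypothetical trigger phrase for the crime type.
    if re.search(r"\bstabbed to death\b", text):
        template["crime_type"] = "Murder"
        template["type"] = "Stabbing"

    # "<First> <Last>, <age>," : a name followed by an age in apposition.
    m = re.search(r"\b([A-Z][a-z]+ [A-Z][a-z]+), (\d{1,3}),", text)
    if m:
        template["victim_name"] = m.group(1)
        template["victim_age"] = int(m.group(2))

    # "his <Place> home" : a crude location cue.
    m = re.search(r"\bhis ([A-Z][a-z]+) home\b", text)
    if m:
        template["location"] = m.group(1)
    return template


print(fill_crime_template(TEXT))
```

The same text would fill a joint-venture template with entirely different slots, which is the point of the two slides: the template, not the text, decides what counts as information.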
19Interpretation of Texts
(1)Information Retrieval/Detection
(2)Passage Retrieval
(3)Information Extraction
(4) Question/Answering Tasks
(5)Text Understanding
20IR System
Collection of Texts
22Passage IR System
Collection of Texts
23Passage IR System
IE System
Collection of Texts
Texts
24IE System
Templates
Texts
25IE as compromise NLP
Interpretation
IE System
Templates
Texts
Predefined
26Performance Evaluation
(1)Information Retrieval/Detection
(2)Passage Retrieval
(3)Information Extraction
(4) Question/Answering Tasks
(5)Text Understanding
27Collection of Documents
28Collection of Documents
More complicated due to partially filled
templates
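One common way to score partially filled templates is to count slots rather than whole documents. The sketch below computes slot-level precision, recall and F-measure for a single template; the function name score_template and the exact-match criterion are assumptions for illustration and are simpler than the official MUC scoring rules.

```python
def score_template(gold, predicted):
    """Slot-level precision, recall and F1 for one template.

    A slot counts as correct only if the predicted value equals the gold
    value exactly (a simplifying assumption).
    """
    gold_filled = {k: v for k, v in gold.items() if v is not None}
    pred_filled = {k: v for k, v in predicted.items() if v is not None}
    correct = sum(1 for k, v in pred_filled.items() if gold_filled.get(k) == v)

    precision = correct / len(pred_filled) if pred_filled else 0.0
    recall = correct / len(gold_filled) if gold_filled else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {"name": "Jurgen Pfrang", "age": 51,
        "profession": "Deputy general manager", "location": "Nanjing"}
predicted = {"name": "Jurgen Pfrang", "age": 51,
             "profession": None, "location": "Yangzhou"}

p, r, f = score_template(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Here two of the three predicted slots are right and two of the four gold slots are found, so a template can be partly correct, which is what makes IE evaluation more complicated than counting retrieved documents.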
29Framework of IE
IE as compromise NLP
30Difficulties of NLP
General Framework of NLP
(1) Robustness Incomplete Knowledge
Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Incomplete Domain Knowledge Interpretation
Rules
Context processing Interpretation
32Approaches for building IE systems
- Knowledge Engineering Approach
- Rules crafted by linguists in cooperation with
domain experts
- Most of the work done by inspecting a set of
relevant documents
33Approaches for building IE systems
- Automatically trainable systems
- Techniques based on statistics and almost no
linguistic knowledge
- Language independent
- Main input: annotated corpus
- Small effort for creating rules, but creating the
annotated corpus is laborious
34Techniques in IE
(1) Domain Specific Partial Knowledge:
knowledge relevant to the information to be extracted
(2) Ambiguities: ignoring irrelevant
ambiguities; simpler NLP techniques
(3) Robustness: coping with incomplete
dictionaries (open class words); ignoring
irrelevant parts of sentences
(4) Adaptation Techniques: machine
learning, trainable systems
35General Framework of NLP
Open class words: Named entity recognition
(ex) Locations, Persons,
Companies, Organizations,
Position names
Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Domain specific rules: <Word> <Word>, Inc.
Mr. <Cpt-L>. <Word>
Machine Learning: HMM, Decision Trees
Rules + Machine Learning
Context processing, Interpretation
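The two domain-specific rules on this slide translate almost directly into regular expressions. This is a minimal sketch under two assumptions: that a Word slot stands for a capitalized word and that Cpt-L stands for a single capital letter. The sample sentence and the company name in it are invented for the demonstration.

```python
import re

# Hand-written named entity rules, one compiled pattern per entity type.
RULES = [
    # <Word> <Word>, Inc.
    ("COMPANY", re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+, Inc\.")),
    # Mr. <Cpt-L>. <Word>
    ("PERSON", re.compile(r"\bMr\. [A-Z]\. [A-Z][a-z]+")),
]


def find_named_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group()))
    return [(label, span) for _, label, span in sorted(found)]


# Hypothetical example sentence.
print(find_named_entities("Mr. J. Smith joined Acme Widget, Inc. last year."))
```

Rules of this kind cope with open class words because they key on the surrounding cues ("Mr.", ", Inc.") rather than on a dictionary entry for the name itself.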
36FASTUS
Based on finite-state automata (FSA); each stage is shown next to the
corresponding level of the General Framework of NLP
1. Complex Words: recognition of multi-words and
proper names
(Morphological and Lexical Processing)
2. Basic Phrases: simple noun groups, verb groups
and particles
(Syntactic Analysis)
3. Complex Phrases: complex noun groups and verb
groups
4. Domain Events: patterns for events of interest
to the application; basic templates are
built.
(Semantic Analysis)
5. Merging Structures: templates from different
parts of the texts are merged if they provide
information about the same entity or event.
(Context processing, Interpretation)
39Chomsky Hierarchy
Hierarchy of Grammar           Hierarchy of Automata
Regular Grammar                Finite State Automata
Context Free Grammar           Push Down Automata
Context Sensitive Grammar      Linear Bounded Automata
Type 0 Grammar                 Turing Machine
41Figure: finite-state automaton with states 0-4 and transitions labelled PN, 's, Art, ADJ, N and P
Example phrase: John's interesting book with a nice cover
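A finite-state automaton of this kind can be written down as a transition table. The sketch below is an assumed reconstruction: the slide gives only the state numbers and the arc labels, so the wiring of the arcs and the choice of state 3 as the accepting state are guesses that happen to accept the example phrase.

```python
# (state, part-of-speech tag) -> next state. The arcs are an assumed
# reconstruction of the figure, not a copy of it.
TRANSITIONS = {
    (0, "PN"): 1,    # proper noun:  John
    (1, "'s"): 2,    # possessive:   's
    (0, "Art"): 2,   # article:      a / the
    (2, "ADJ"): 2,   # any number of adjectives
    (2, "N"): 3,     # head noun
    (3, "P"): 4,     # preposition starts an attached phrase
    (4, "Art"): 2,
    (4, "PN"): 1,
}
ACCEPTING = {3}


def accepts(tags):
    """Run the automaton over a tag sequence; True if it ends in state 3."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:       # no arc for this tag: reject
            return False
    return state in ACCEPTING


# John 's interesting book with a nice cover
phrase = ["PN", "'s", "ADJ", "N", "P", "Art", "ADJ", "N"]
print(accepts(phrase))
```

The automaton keeps no stack, only a current state, which is why recognition is fast and why nested structure has to be handled by later stages.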
51Pattern-matching
PN 's (ADJ) N P Art (ADJ) N
PN 's / Art (ADJ) N (P Art (ADJ) N)
Figure: finite-state automaton with states 0-4 and transitions labelled PN, 's, Art, ADJ, N and P
Example phrase: John's interesting book with a nice cover
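Because the pattern is regular, it can also be written as an ordinary regular expression over the sequence of part-of-speech tags. The sketch below assumes that the parenthesized elements are optional and repeatable (the repetition marks are an assumption, since none survive on the slide) and that "/" separates alternatives.

```python
import re

# (PN 's | Art) (ADJ)* N (P Art (ADJ)* N)*  over space-separated tags.
NOUN_GROUP = re.compile(r"(?:PN 's|Art)(?: ADJ)* N(?: P Art(?: ADJ)* N)*")


def is_noun_group(tags):
    """True if the whole tag sequence matches the noun-group pattern."""
    return NOUN_GROUP.fullmatch(" ".join(tags)) is not None


# John 's interesting book with a nice cover
print(is_noun_group(["PN", "'s", "ADJ", "N", "P", "Art", "ADJ", "N"]))
```

Matching on tags instead of words keeps the pattern small: one expression covers every phrase with the same tag sequence.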
53Example of IE FASTUS(1993)
1. Complex words
2. Basic Phrases
Bridgestone Sports Co.    Company name
said                      Verb Group
Friday                    Noun Group
it                        Noun Group
had set up                Verb Group
a joint venture           Noun Group
in                        Preposition
Taiwan                    Location
54Example of IE FASTUS(1993)
a Japanese tea house
58Example of IE FASTUS(1993)
3.Complex Phrases
Syntactic structures relevant to information to
be extracted are dealt with.
59Syntactic variations
GM set up a joint venture with Toyota.
GM announced it was setting up a joint venture with Toyota.
GM signed an agreement setting up a joint venture with Toyota.
GM announced it was signing an agreement to set up a joint venture with Toyota.
60Syntactic variations
GM plans to set up a joint venture with Toyota.
GM expects to set up a joint venture with Toyota.
61Syntactic variations
Figure: partial parse tree with nodes S, NP (GM), VP and V (set up)
62Example of IE FASTUS(1993)
3. Complex Phrases
4. Domain Events
COMPANY SET-UP JOINT-VENTURE with COMPANY
COMPANY SET-UP JOINT-VENTURE (others) with COMPANY
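A domain-event pattern of this form can be matched against the sequence of typed phrases produced by the earlier stages. The sketch below is an illustration under assumptions: the chunk type names, the function match_joint_venture and the output slot names are invented, and "(others)" is read as any number of unrelated chunks between the joint-venture phrase and "with".

```python
PATTERN = ["COMPANY", "SET-UP", "JOINT-VENTURE", "with", "COMPANY"]


def match_joint_venture(chunks):
    """Match the pattern over (type, text) chunks; return a basic template."""
    for start in range(len(chunks)):
        i, matched = start, []
        for wanted in PATTERN:
            if wanted == "with":
                # "(others)": skip unrelated chunks before "with".
                while i < len(chunks) and chunks[i][0] != "with":
                    i += 1
            if i >= len(chunks) or chunks[i][0] != wanted:
                break
            matched.append(chunks[i][1])
            i += 1
        else:
            # Every pattern element was matched in order.
            return {"relation": "joint venture",
                    "company_1": matched[0], "company_2": matched[4]}
    return None


chunks = [("COMPANY", "GM"), ("SET-UP", "set up"),
          ("JOINT-VENTURE", "a joint venture"), ("OTHER", "in Taiwan"),
          ("with", "with"), ("COMPANY", "Toyota")]
print(match_joint_venture(chunks))
```

Variants such as "announced it was setting up" would be absorbed earlier, when the complex verb group is reduced to a single SET-UP chunk, so the event pattern itself stays short.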
63Complications caused by syntactic variations
Relative clause: The mayor, who was kidnapped
yesterday, was found dead today.
NG Relpro NG/others VG NG/others VG
NG Relpro NG/others VG
67FASTUS
Based on finite-state automata (FSA)
Example: NP, who was kidnapped, was found.
1. Complex Words
2. Basic Phrases
3. Complex Phrases
4. Domain Events: patterns for events of interest
to the application; basic templates are
built.
Piece-wise recognition of basic templates
5. Merging Structures: templates from different
parts of the texts are merged if they provide
information about the same entity or event.
Reconstructing information carried via syntactic
structures by merging basic templates
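The merging step can be sketched as combining two partial templates when they describe the same entity and do not contradict each other. This is a simplified illustration, not the FASTUS merging algorithm: the function merge_templates, the key-slot test and the slot names are assumptions.

```python
def merge_templates(a, b, key):
    """Merge two partial templates that share the same value in slot `key`.

    Returns the merged template, or None if the templates are about
    different entities or disagree on a slot that both have filled.
    """
    if a.get(key) is None or a.get(key) != b.get(key):
        return None                      # not about the same entity
    for slot in a.keys() & b.keys():
        if a[slot] is not None and b[slot] is not None and a[slot] != b[slot]:
            return None                  # conflicting information
    merged = dict(a)
    for slot, value in b.items():
        if merged.get(slot) is None:
            merged[slot] = value         # fill gaps from the other template
    return merged


# Basic templates recognized piece-wise from
# "The mayor, who was kidnapped yesterday, was found dead today."
kidnapping = {"victim": "the mayor", "kidnapped_on": "yesterday",
              "found_dead_on": None}
discovery = {"victim": "the mayor", "kidnapped_on": None,
             "found_dead_on": "today"}
print(merge_templates(kidnapping, discovery, key="victim"))
```

Each clause yields its own small template, and the information that the relative clause tied together syntactically is recovered afterwards by the merge, without ever parsing the relative clause as such.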
70Current state of the art of IE
- Carefully constructed IE systems
- F-60 level (interannotator agreement 60-80%)
- Domains:
  - telegraphic messages about naval operations (MUC-1, 87; MUC-2, 89)
  - news articles and transcriptions of radio broadcasts about Latin American terrorism (MUC-3, 91; MUC-4, 1992)
  - news articles about joint ventures (MUC-5, 93)
  - news articles about management changes (MUC-6, 95)
  - news articles about space vehicles (MUC-7, 97)
- Handcrafted rules (named entity recognition,
domain events, etc)
Automatic learning from texts: supervised
learning (corpus preparation);
non-supervised, or controlled learning