Title: Introduction to Information Extraction
1Introduction to Information Extraction
- Chia-Hui Chang
- Dept. of Computer Science and Information Engineering, National Central University, Taiwan
- chia_at_csie.ncu.edu.tw
2Problem Definition
- Information Extraction (IE) identifies relevant information in documents, pulling information from a variety of sources and aggregating it into a homogeneous form.
- The output template of the IE task
- Several fields (slots)
- Several instances of a field
3Difficulty of IE tasks depends on
- Text type
- From Wall Street Journal articles, or email messages, to HTML documents.
- Domain
- From financial news, or tourist information, to various languages.
- Scenario
4Various IE Tasks
- Free-text IE
- For MUC (Message Understanding Conference)
- E.g. terrorist activities, corporate joint ventures
- Semi-structured IE
- E.g. meta-search engines, shopping agents, bio-integration systems
5Types of IE from MUC
- Named Entity recognition (NE)
- Finds and classifies names, places, etc.
- Coreference Resolution (CO)
- Identifies identity relations between entities in texts.
- Template Element construction (TE)
- Adds descriptive information to NE results.
- Scenario Template production (ST)
- Fits TE results into specified event scenarios.
6Named Entity Recognition
- http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html
7NE Recognition (Cont.)
- Spanish 93
- Japanese 92
- Chinese 84.51
8Coreference Resolution
- Coreference resolution (CO) involves identifying identity relations between entities in texts.
- For example, in
- "Alas, poor Yorick, I knew him well."
- Tie "Yorick" with "him".
- The Sheffield system scored 51% recall and 71% precision.
- http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html
9Template Element Production
- Adds descriptive information to named entities
- Sheffield system scores 71
10Scenario Template Extraction
- STs are the prototypical outputs of IE systems
- They tie together TE entities into event and relation descriptions.
- Performance for Sheffield: 49
- http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html
11Example
- User interests center on operational domains such as drug enforcement, money laundering, organized crime, terrorism, ...
- 1. Input texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation
- 2. NE recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, ...)
- 3. TE associates certain types of descriptive information with these entities, e.g. the location of companies
- 4. ST identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.
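The NE, TE and ST steps above can be sketched as a toy pipeline. This is only an illustration under loud assumptions: the gazetteer, the sentence, the "based in" pattern and the "was hired by" trigger are all invented here and do not come from MUC or from any real system.

```python
from dataclasses import dataclass, field

# Hypothetical gazetteer (name -> category); every entry is invented for illustration.
GAZETTEER = {
    "John Doe": "person",
    "Acme Corp": "company",
    "Springfield": "location",
}

@dataclass
class Entity:
    text: str
    category: str
    descriptors: dict = field(default_factory=dict)

def recognize_entities(text):
    """NE: label every gazetteer name that occurs in the text."""
    return [Entity(name, cat) for name, cat in GAZETTEER.items() if name in text]

def add_descriptions(text, entities):
    """TE: attach a location to a company when the text reads
    '<company>, based in <location>' (an assumed surface pattern)."""
    locations = [e.text for e in entities if e.category == "location"]
    for entity in entities:
        if entity.category != "company":
            continue
        for loc in locations:
            if f"{entity.text}, based in {loc}" in text:
                entity.descriptors["location"] = loc
    return entities

def build_scenarios(text, entities):
    """ST: tie a person and a company into a 'hire' event when the text reads
    '<person> was hired by <company>' (an assumed trigger phrase)."""
    people = [e for e in entities if e.category == "person"]
    companies = [e for e in entities if e.category == "company"]
    return [
        {"event": "hire", "person": p.text, "company": c.text,
         "company_location": c.descriptors.get("location")}
        for p in people for c in companies
        if f"{p.text} was hired by {c.text}" in text
    ]

text = "John Doe was hired by Acme Corp, based in Springfield."
entities = add_descriptions(text, recognize_entities(text))
events = build_scenarios(text, entities)
print(events)
```

Each stage consumes the output of the previous one, which mirrors how the scenario template is built on top of the template elements.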
12Example Text
13Output Example (NE, TE)
14Output (STs)
15Another IE Example
- Corporate Management Changes
- Purpose
- Which positions in which organizations are changing hands?
- Who is leaving a position and where is the person going to?
- Who is appointed to a position and where is the person coming from?
- The locations and types of the organizations involved in the succession events
- The names and titles of the persons involved in the succession events
- http://www.cs.umanitoba.ca/lindek/ie-ex.htm
16Input Text
- President Clinton nominated John Rollwagen, the
chairman and CEO of Cray Research Inc., as the
No. 2 Commerce Department official. Mr. Rollwagen
said he wants to push the Clinton administration
to aggressively confront U.S. trading partners
such as Japan to open their markets, particularly
for high-tech industries. In a letter sent
throughout the Eagan, Minn.-based company on
Friday, Mr. Rollwagen warned: "Whether we like it
or not, our country is in an economic war and we
are at a key turning point in that war." ......
- Cray said it has appointed John F. Carlson, its
president and chief operating officer, to succeed
him. ......
17Extraction Result
18MUC
- Data Set for
- MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
- MUC34: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
- MUC67 from LDC: http://www.ldc.upenn.edu/
- MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
- MUC-7
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html
19Summary
- Evaluation
- Precision = (# of correctly extracted fields) / (# of extracted fields)
- Recall = (# of correctly extracted fields) / (# of fields to be extracted)
- Design Methodology
- Natural Language Processing
- Machine Learning
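The precision and recall ratios on this slide can be computed directly once the extracted fields and the answer key are available. The sketch below assumes each field is a (slot, value) pair and that a field counts as correct only on an exact match. The function name and the sample data are hypothetical.

```python
def precision_recall(extracted, answer_key):
    """Return (precision, recall) for two collections of (slot, value) pairs.

    Assumption: exact match; duplicates are collapsed because sets are used.
    """
    extracted, answer_key = set(extracted), set(answer_key)
    correct = len(extracted & answer_key)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(answer_key) if answer_key else 0.0
    return precision, recall

# Invented sample data: 3 fields extracted, 2 of them correct, 4 fields expected.
extracted = [("person", "A"), ("title", "CEO"), ("org", "X Inc.")]
answer_key = [("person", "A"), ("title", "CEO"), ("org", "Y Inc."), ("location", "Z")]

p, r = precision_recall(extracted, answer_key)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

The same function applies to record-level evaluation if each element is a whole record instead of a single field.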
20IE from Semi-structured Documents
- Output template: k-tuple
- Multiple instances of a field
- Missing data
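A k-tuple output template with multi-valued fields and missing data might be represented as follows. This is a minimal sketch: the three fields (title, authors, price) and the sample records are hypothetical and chosen only to show one multi-instance field and one missing value.

```python
from typing import List, Optional, Tuple

# A record is a k-tuple; here k = 3 with hypothetical fields (title, authors, price).
Record = Tuple[str, List[str], Optional[float]]
FIELD_NAMES = ("title", "authors", "price")

records: List[Record] = [
    ("Book A", ["Author 1", "Author 2"], 12.5),  # multiple instances of the authors field
    ("Book B", ["Author 3"], None),              # missing data: no price on the page
]

def missing_fields(record: Record) -> List[str]:
    """Names of the fields that have no value in this record."""
    return [name for name, value in zip(FIELD_NAMES, record)
            if value is None or value == []]

for record in records:
    print(record[0], "missing:", missing_fields(record))
```

Using a list for a multi-instance field and None for an absent one keeps every record at the same length k, so records stay comparable during evaluation.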
21Various IE Tasks for Semi-structured Documents
- Multiple-record page extraction
- One-record (singular) page extraction
22Multiple-record page extraction
23One-record (singular) page extraction
24Summary
- Evaluation
- Precision = (# of correctly extracted records) / (# of extracted records)
- Recall = (# of correctly extracted records) / (# of records to be extracted)
- Design Methodology
- Machine Learning
- Pattern Mining
25News Group IE
- Example: Computer-Related Jobs
26Output Template
- Between free-text IE and semi-structured IE
- Califf, Rapier 99
27Annotated Training Examples
- Most systems require annotated training examples (answer keys)
- AutoSlog, Rapier, SRV, WIEN, Softmealy, Stalker
- Very few systems work from unannotated training examples
- AutoSlog-TS, IEPAD, OLERA
28The Type of Extraction Rule
- Delimiter-based Rule
- WIEN, Stalker
- Content-based Rule
- Context-based Rule
- Rapier, AutoSlog, SRV, IEPAD
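To make the delimiter-based rule type concrete, here is a simplified sketch in which each field is located by a left and a right delimiter string. It is an illustration only, not the actual WIEN or Stalker implementation. The function name, the sample page and the delimiters are assumptions.

```python
def extract_with_delimiters(page, delimiters):
    """Apply one (left, right) delimiter pair per field, repeatedly, over a page.

    Returns a list of tuples, one per record. An incomplete trailing
    record is discarded.
    """
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))

# Invented sample page with two records of two fields each.
page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
rule = [("<b>", "</b>"), ("<i>", "</i>")]

print(extract_with_delimiters(page, rule))
# [('Congo', '242'), ('Egypt', '20')]
```

A rule of this kind never inspects the field content itself, only the strings around it, which is what separates it from content-based and context-based rules.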
29Background Knowledge
- For Rule Generalization
- Implicit or Explicit
- Example
- Specified format for date, email, etc.
- Special feature for color, location, etc.
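Explicit background knowledge of this kind can be sketched as a small set of format patterns and feature lexicons. The patterns below are deliberately simplified assumptions (ISO-style dates, a loose email shape, a five-word color list) and are not taken from any of the systems named in this deck.

```python
import re

# Assumed, simplified formats; real date and email formats vary far more.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Hypothetical feature lexicon for the 'color' feature.
COLORS = {"red", "green", "blue", "black", "white"}

def tag_tokens(text):
    """Return (token, tag) pairs for tokens matching a known format or feature."""
    tags = []
    for raw in text.split():
        token = raw.strip(".,;:()")  # drop surrounding punctuation
        if DATE.fullmatch(token):
            tags.append((token, "date"))
        elif EMAIL.fullmatch(token):
            tags.append((token, "email"))
        elif token.lower() in COLORS:
            tags.append((token, "color"))
    return tags

print(tag_tokens("Contact jane@example.org before 2001-05-17 about the red car."))
# [('jane@example.org', 'email'), ('2001-05-17', 'date'), ('red', 'color')]
```

Tags like these give a rule learner something more general than literal tokens to generalize over.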
30Conclusion
- Define the IE problem
- Specify the input training example
- with annotation, or
- without annotation
- Depict the extraction rule
- Use necessary background knowledge
31References
- H. Cunningham, Information Extraction: a User Guide, http://www.dcs.shef.ac.uk
- MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
- I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
- Califf, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99.