Introduction to Information Extraction - PowerPoint PPT Presentation

About This Presentation
Title:

Introduction to Information Extraction

Slides: 32
Provided by: jahui

Transcript and Presenter's Notes

1
Introduction to Information Extraction
  • Chia-Hui Chang
  • Dept. of Computer Science and Information
    Engineering, National Central University, Taiwan
  • chia_at_csie.ncu.edu.tw

2
Problem Definition
  • Information Extraction (IE) identifies relevant
    information in documents, pulling it from a
    variety of sources and aggregating it into a
    homogeneous form.
  • The output template of the IE task
  • Several fields (slots)
  • Several instances of a field
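
As a rough illustration of the output template, here is a minimal Python sketch. The `Template` class and the slot names (`company`, `location`, `person`) are hypothetical and do not come from the slides; the sample values are borrowed from the input text shown later in the deck. The only point is that a template has several slots and each slot may hold several instances.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """An IE output template: named slots, each holding zero or more instances."""
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot can be filled more than once (several instances of a field).
        self.slots.setdefault(slot, []).append(value)


template = Template()
template.add("company", "Cray Research Inc.")
template.add("location", "Eagan, Minn.")
template.add("person", "John Rollwagen")
template.add("person", "John F. Carlson")
```

Storing every slot as a list keeps single-valued and multi-valued fields uniform, at the cost of not enforcing how many instances a slot may have.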

3
The difficulty of an IE task depends on
  • Text type
  • Ranges from Wall Street Journal articles and
    email messages to HTML documents.
  • Domain
  • Ranges from financial news and tourist
    information to various languages.
  • Scenario

4
Various IE Tasks
  • Free-text IE
  • For MUC (Message Understanding Conference)
  • E.g. terrorist activities, corporate joint
    ventures
  • Semi-structured IE
  • E.g. meta-search engines, shopping agents,
    bio-integration systems

5
Types of IE from MUC
  • Named Entity recognition (NE)
  • Finds and classifies names, places, etc.
  • Coreference Resolution (CO)
  • Identifies identity relations between entities in
    texts.
  • Template Element construction (TE)
  • Adds descriptive information to NE results.
  • Scenario Template production (ST)
  • Fits TE results into specified event scenarios.

6
Named Entity Recognition
  • http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html

7
NE Recognition (Cont.)
  • Spanish 93%
  • Japanese 92%
  • Chinese 84.51%

8
Coreference Resolution
  • Coreference resolution (CO) involves identifying
    identity relations between entities in texts.
  • For example, in
  • Alas, poor Yorick, I knew him well.
  • Tie "Yorick" with "him".
  • The Sheffield system scored 51% recall and 71%
    precision.
  • http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html
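
To make the Yorick example concrete, here is a deliberately naive sketch that links each pronoun to the closest entity name mentioned before it. This is an illustrative baseline only, not the Sheffield system; the `resolve_pronouns` function and the `PRONOUNS` list are hypothetical, and the entity names are assumed to be single tokens supplied by an earlier NE step.

```python
import re

# Hypothetical, incomplete list of third-person pronouns.
PRONOUNS = {"he", "him", "his", "she", "her", "hers",
            "it", "its", "they", "them", "their"}


def resolve_pronouns(text, entity_names):
    """Link each pronoun to the nearest entity name that precedes it."""
    tokens = re.findall(r"[A-Za-z']+", text)
    links = []
    last_entity = None
    for token in tokens:
        if token in entity_names:
            last_entity = token
        elif token.lower() in PRONOUNS and last_entity is not None:
            links.append((token, last_entity))
    return links


links = resolve_pronouns("Alas, poor Yorick, I knew him well.", {"Yorick"})
```

A nearest-antecedent rule ignores gender, number, and syntax, which is one reason real coreference scores are well below NE scores.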

9
Template Element Production
  • Adds descriptive information to named entities
  • The Sheffield system scored 71%

10
Scenario Template Extraction
  • STs are the prototypical outputs of IE systems
  • They tie together TE entities into event and
    relation descriptions.
  • Performance of the Sheffield system: 49%
  • http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html

11
Example
  • User interests center on the operational domains
    of drug enforcement, money laundering, organized
    crime, terrorism, ...
  • 1. Input texts dealing with drug enforcement,
    money laundering, organized crime, terrorism, and
    legislation
  • 2. NE recognizes entities in those texts and
    assigns them to one of a number of categories
    drawn from the set of entities of interest
    (person, company, ...)
  • 3. TE associates certain types of descriptive
    information with these entities, e.g. the
    location of companies
  • 4. ST identifies a set (relatively small to
    begin with) of events of interest by tying
    entities together into event relations.
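
The four steps above can be sketched as a small pipeline. Everything here is an assumption for illustration: the gazetteer lookup, the `descriptions` table, and the single hard-coded event pattern stand in for real NE, TE, and ST components, and the sample sentence is taken from the input text shown later in the deck rather than from the drug-enforcement domain.

```python
# Hypothetical gazetteer: surface string -> entity category.
GAZETTEER = {"Cray Research Inc.": "company", "John Rollwagen": "person"}


def named_entities(text):
    """NE: find known names in the text and assign each a category."""
    return [{"name": n, "type": t} for n, t in GAZETTEER.items() if n in text]


def template_elements(entities, descriptions):
    """TE: attach descriptive information (e.g. a company's location)."""
    return [{**e, **descriptions.get(e["name"], {})} for e in entities]


def scenario_templates(text, elements):
    """ST: tie entities together into event relations."""
    people = [e for e in elements if e["type"] == "person"]
    companies = [e for e in elements if e["type"] == "company"]
    events = []
    for p in people:
        for c in companies:
            # One hard-coded phrasing; a real system would learn many patterns.
            if f"{p['name']}, the chairman and CEO of {c['name']}" in text:
                events.append({"event": "holds_position",
                               "person": p["name"],
                               "organization": c["name"],
                               "position": "chairman and CEO"})
    return events


text = ("President Clinton nominated John Rollwagen, the chairman and CEO "
        "of Cray Research Inc., as the No. 2 Commerce Department official.")
descriptions = {"Cray Research Inc.": {"location": "Eagan, Minn."}}

entities = named_entities(text)                       # step 2
elements = template_elements(entities, descriptions)  # step 3
events = scenario_templates(text, elements)           # step 4
```

Each stage consumes the previous stage's output, so errors made in NE propagate into TE and ST.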

12
Example Text
13
Output Example (NE, TE)
14
Output (STs)
15
Another IE Example
  • Corporate Management Changes
  • Purpose
  • which positions in which organizations are
    changing hands?
  • who is leaving a position and where the person is
    going to?
  • who is appointed to a position and where the
    person is coming from?
  • the locations and types of the organizations
    involved in the succession events
  • the names and titles of the persons involved in
    the succession events
  • http://www.cs.umanitoba.ca/lindek/ie-ex.htm

16
Input Text
  • President Clinton nominated John Rollwagen, the
    chairman and CEO of Cray Research Inc., as the
    No. 2 Commerce Department official. Mr. Rollwagen
    said he wants to push the Clinton administration
    to aggressively confront U.S. trading partners
    such as Japan to open their markets, particularly
    for high-tech industries. In a letter sent
    throughout the Eagan, Minn.-based company on
    Friday, Mr. Rollwagen warned "Whether we like it
    or not, our country is in an economic war and we
    are at a key turning point in that war." ......
  • Cray said it has appointed John F. Carlson, its
    president and chief operating officer, to succeed
    him. ......
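
As a hedged illustration of how one succession event might be pulled from the last sentence of this text, the sketch below uses a single hand-written regular expression. The pattern and the slot names (`organization`, `person_in`, `post`) are hypothetical; they cover only this one phrasing and are not the method used by any system named in these slides.

```python
import re

# Hypothetical pattern for one phrasing of an appointment.
APPOINTMENT = re.compile(
    r"(?P<organization>[A-Z]\w+) said it has appointed "
    r"(?P<person_in>[^,]+), its (?P<post>[^,]+), to succeed"
)

sentence = ("Cray said it has appointed John F. Carlson, its president and "
            "chief operating officer, to succeed him.")

match = APPOINTMENT.search(sentence)
succession = match.groupdict() if match else {}
```

The pattern stops at "to succeed" because the person being replaced appears only as "him"; filling that slot with "Rollwagen" would require coreference resolution across sentences.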

17
Extraction Result
18
MUC
  • Data sets
  • MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
  • MUC34: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
  • MUC67 from LDC: http://www.ldc.upenn.edu/
  • MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
  • MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html

19
Summary
  • Evaluation
  • Precision = (# of correctly extracted fields) /
    (# of extracted fields)
  • Recall = (# of correctly extracted fields) /
    (# of fields to be extracted)
  • Design Methodology
  • Natural Language Processing
  • Machine Learning

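The two evaluation measures in this summary can be computed as follows. This is a minimal sketch that assumes each extracted field is represented as a (slot, value) pair and counts as correct only on an exact match with the answer key; the sample fields are hypothetical.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for extracted fields against an answer key."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    # Precision: correctly extracted fields / all extracted fields.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: correctly extracted fields / all fields to be extracted.
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


extracted = {("person", "John Rollwagen"),
             ("person", "John F. Carlson"),
             ("company", "Clinton")}           # a wrong extraction
expected = {("person", "John Rollwagen"),
            ("person", "John F. Carlson"),
            ("company", "Cray Research Inc."),
            ("person", "Clinton")}

precision, recall = precision_recall(extracted, expected)
```

Here two of the three extracted fields are correct, and two of the four expected fields are found. Exact matching is the strictest choice; evaluations that give partial credit would score differently.
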
20
IE from Semi-structured Documents
  • Output template: k-tuple
  • Multiple instances of a field
  • Missing data

21
Various IE Tasks for Semi-structured Documents
  • Multiple-record page extraction
  • One-record (singular) page extraction

22
Multiple-record page extraction
23
One-record (singular) page extraction
24
Summary
  • Evaluation
  • Precision = (# of correctly extracted records) /
    (# of extracted records)
  • Recall = (# of correctly extracted records) /
    (# of records to be extracted)
  • Design Methodology
  • Machine Learning
  • Pattern Mining

25
News Group IE
  • Example: computer-related jobs

26
Output Template
  • Between free-text IE and semi-structured IE
  • (Califf, Rapier, 1999)

27
Annotated Training Examples
  • Most systems require annotated training examples
    (answer keys)
  • AutoSlog, Rapier, SRV, WIEN, SoftMealy, Stalker
  • Very few systems can learn from unannotated
    training examples
  • AutoSlog-TS, IEPAD, OLERA

28
The Type of Extraction Rule
  • Delimiter-based Rule
  • WIEN, Stalker
  • Content-based Rule
  • Context-based Rule
  • Rapier, AutoSlog, SRV, IEPAD
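
To show what a delimiter-based rule looks like in practice, here is a minimal sketch in which each field of a k-tuple is located by a (left, right) delimiter pair. It is written in the spirit of delimiter-based wrappers and is not the actual WIEN or Stalker algorithm; the `extract_records` function, the sample page, and the delimiters are all hypothetical, and the delimiters are given by hand rather than learned.

```python
def extract_records(page, delimiters):
    """Extract k-tuples from a page using one (left, right) pair per field."""
    if not delimiters:
        return []
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records          # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records          # unterminated field: stop
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
rules = [("<b>", "</b>"), ("<i>", "</i>")]
records = extract_records(page, rules)
```

A rule of this kind looks only at the markup around a value, never at the value itself, so it breaks when a field is missing or the page layout changes.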

29
Background Knowledge
  • For Rule Generalization
  • Implicit or Explicit
  • Example
  • Specified format for date, email, etc.
  • Special feature for color, location, etc.
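
Explicit background knowledge of this kind can be as simple as a table of known formats. The sketch below assumes two hypothetical patterns, an ISO-style date and a simplified email address, and tags any matching strings in a text; real systems would use richer patterns, and the sample sentence is invented.

```python
import re

# Hypothetical format knowledge: field type -> pattern.
FORMATS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # assumes YYYY-MM-DD
}


def tag_by_format(text):
    """Return every substring that matches a known format, keyed by type."""
    return {name: pattern.findall(text) for name, pattern in FORMATS.items()}


tags = tag_by_format("Posted 1999-07-18. Send resumes to jobs@example.com.")
```

Supplying formats up front lets a rule generalize from "this exact string" to "any date here", which is what the generalization step needs.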

30
Conclusion
  • Define the IE problem
  • Specify the input training example
  • with annotation, or
  • without annotation
  • Depict the extraction rule
  • Use necessary background knowledge

31
References
  • H. Cunningham, Information Extraction: A User
    Guide, http://www.dcs.shef.ac.uk
  • MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
  • I. Muslea, Extraction Patterns for Information
    Extraction Tasks: A Survey, The AAAI-99 Workshop
    on Machine Learning for Information Extraction.
  • Califf, Relational Learning of Pattern-Match
    Rules for Information Extraction, AAAI-99.