Introduction to Information Extraction

Provided by: jahui

Transcript and Presenter's Notes

1
Introduction to Information Extraction
  • Chia-Hui Chang
  • Dept. of Computer Science and Information
    Engineering, National Central University, Taiwan
  • chia_at_csie.ncu.edu.tw

2
Problem Definition
  • Information Extraction (IE) identifies relevant
    information in documents, pulling it from a
    variety of sources and aggregating it into a
    homogeneous form.
  • Input → extractor → structured output
  • The output template of the IE task
  • Several fields (slots)
  • Several instances of a field
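
As a rough illustration of an output template with several fields (slots) and several instances of a field, here is a minimal Python sketch. The `Template` class, the slot names, and the filler values are all hypothetical and do not come from any MUC specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """An IE output template: named slots, each holding zero or more fillers."""
    name: str
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def fill(self, slot: str, value: str) -> None:
        # A field may have several instances, so values accumulate in a list.
        self.slots.setdefault(slot, []).append(value)


# Hypothetical job-posting template with made-up fillers.
job = Template("job_posting")
job.fill("title", "Software Engineer")
job.fill("language", "Java")
job.fill("language", "C++")

print(job.slots)
```

Storing every slot as a list keeps single-valued and multi-valued fields uniform, at the cost of one extra indexing step for single values.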

3
The difficulty of an IE task depends on
  • Text type
  • From plain text to semi-structured Web pages
  • e.g. Wall Street Journal articles, email
    messages, or HTML documents
  • Domain
  • From financial news or tourist information to
    texts in various languages
  • Scenario

4
Various IE Tasks
  • Free-text IE
  • For MUC (Message Understanding Conference)
  • E.g. terrorist activities, corporate joint
    ventures
  • Semi-structured IE
  • E.g. meta-search engines, shopping agents,
    bio-integration systems

5
Types of IE from MUC
  • Named Entity recognition (NE)
  • Finds and classifies names, places, etc.
  • Coreference Resolution (CO)
  • Identifies identity relations between entities in
    texts.
  • Template Element construction (TE)
  • Adds descriptive information to NE results.
  • Scenario Template production (ST)
  • Fits TE results into specified event scenarios.

6
Named Entity Recognition
  • http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html

7
NE Recognition (Cont.)
  • Spanish: 93%
  • Japanese: 92%
  • Chinese: 84.51%

8
Coreference Resolution
  • Coreference resolution (CO) involves identifying
    identity relations between entities in texts.
  • For example, in "Alas, poor Yorick, I knew him
    well," CO ties "Yorick" to "him".
  • The Sheffield system scored 51% recall and 71%
    precision.

http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html
9
Template Element Production
  • Adds descriptive information to named entities
  • The Sheffield system scored 71%

10
Scenario Template Extraction
  • STs are the prototypical outputs of IE systems
  • They tie together TE entities into event and
    relation descriptions.
  • The Sheffield system scored 49%

http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html
11
Example
  • The operational domains at the center of the
    user's interests are drug enforcement, money
    laundering, organized crime, terrorism, ...
  • 1. Input texts dealing with drug enforcement,
    money laundering, organized crime, terrorism, and
    legislation
  • 2. NE recognizes entities in those texts and
    assigns them to one of a number of categories
    drawn from the set of entities of interest
    (person, company, ...)
  • 3. TE associates certain types of descriptive
    information with these entities, e.g. the
    location of companies
  • 4. ST identifies a set (relatively small to
    begin with) of events of interest by tying
    entities together into event relations.
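
Steps 2 to 4 above can be sketched as a toy pipeline. This is only an illustration under strong assumptions: the gazetteer, the sample sentence, the regular expressions, and all function names are invented here, and real MUC systems such as the Sheffield one use far richer linguistic processing.

```python
import re

# Hypothetical gazetteer mapping entity names to categories (not from the slides).
GAZETTEER = {"John Doe": "person", "Acme Corp": "company"}

# Made-up input text from the money-laundering domain.
TEXT = ("John Doe, an executive at Acme Corp, based in Springfield, "
        "was charged with money laundering.")


def recognize_entities(text):
    """NE: find known names and assign each to a category."""
    return [{"name": name, "type": kind}
            for name, kind in GAZETTEER.items() if name in text]


def add_template_elements(text, entities):
    """TE: attach descriptive information, here the location of companies."""
    for entity in entities:
        if entity["type"] == "company":
            match = re.search(re.escape(entity["name"]) + r", based in (\w+)", text)
            if match:
                entity["location"] = match.group(1)
    return entities


def build_scenario_templates(text, entities):
    """ST: tie entities together into event relations."""
    match = re.search(r"charged with ([a-z ]+)", text)
    if not match:
        return []
    people = [e["name"] for e in entities if e["type"] == "person"]
    companies = [e["name"] for e in entities if e["type"] == "company"]
    return [{"event": "charge", "offense": match.group(1),
             "persons": people, "organizations": companies}]


entities = add_template_elements(TEXT, recognize_entities(TEXT))
events = build_scenario_templates(TEXT, entities)
print(entities)
print(events)
```

Each stage consumes the output of the previous one, which mirrors how TE builds on NE results and ST builds on TE results.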

12
Example Text
13
Output Example (NE, TE)
14
Output (STs)
15
Another IE Example
  • Corporate Management Changes
  • Purpose: to determine
  • Which positions in which organizations are
    changing hands?
  • Who is leaving a position, and where is that
    person going?
  • Who is appointed to a position, and where is
    that person coming from?
  • The locations and types of the organizations
    involved in the succession events
  • The names and titles of the persons involved in
    the succession events
  • http://www.cs.umanitoba.ca/lindek/ie-ex.htm

16
Input Text
  • President Clinton nominated John Rollwagen, the
    chairman and CEO of Cray Research Inc., as the
    No. 2 Commerce Department official. Mr. Rollwagen
    said he wants to push the Clinton administration
    to aggressively confront U.S. trading partners
    such as Japan to open their markets, particularly
    for high-tech industries. In a letter sent
    throughout the Eagan, Minn.-based company on
    Friday, Mr. Rollwagen warned "Whether we like it
    or not, our country is in an economic war and we
    are at a key turning point in that war." ......
  • Cray said it has appointed John F. Carlson, its
    president and chief operating officer, to succeed
    him. ......

17
Extraction Result
18
MUC
  • Data sets for:
  • MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
  • MUC-3/4: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
  • MUC-6/7 (from LDC): http://www.ldc.upenn.edu/
  • MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
  • MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html

19
Summary
  • Evaluation
  • Precision
  • Recall
  • Design Methodology for Text IE
  • Natural Language Processing
  • Machine Learning

Precision = (# of correctly extracted fields) / (# of extracted fields)
Recall = (# of correctly extracted fields) / (# of fields to be extracted)
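
The precision and recall measures listed under Evaluation can be computed with a few lines of Python. This is a minimal sketch that assumes exact matching of (slot, value) pairs; the gold and extracted fields below are made up for illustration, and MUC scoring itself uses more elaborate matching rules.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for a set of extracted fields against a gold set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # fields that were extracted and are right
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard fields (5 fields to be extracted).
gold = {
    ("person", "John Rollwagen"),
    ("person", "John F. Carlson"),
    ("organization", "Cray Research Inc."),
    ("title", "chairman and CEO"),
    ("title", "president and chief operating officer"),
}

# Hypothetical system output (4 extracted fields, 3 of them correct).
extracted = {
    ("person", "John Rollwagen"),
    ("person", "John F. Carlson"),
    ("organization", "Cray Research Inc."),
    ("organization", "Japan"),
}

precision, recall = precision_recall(extracted, gold)
print(precision, recall)  # 0.75 0.6
```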
20
IE from Web pages
  • Output template: k-tuple
  • Multiple instances of a field
  • Missing data

21
Web data extraction
  • Various Web pages
  • Multiple-record page extraction
  • One-record (singular) page extraction

22
Multiple-record page extraction
23
One-record (singular) page extraction
24
Applications
  • Information integration
  • Meta Search Engines
  • Shopping agents
  • Travel agents

25
Information Integration Systems
  • Abstracted Information
  • Agent/Module Coordination
  • Mediation
  • Semantic Integration
  • Translation and Wrapping
  • Unprocessed, Unintegrated Details
26
Web Wrappers
  • What is a wrapper?
  • A program that extracts desired information
    from Web pages.
  • Web pages → wrapper → structured information
  • Web wrappers wrap...
  • Query-able or Search-able Web sites
  • Web pages with large itemized lists

27
Summary
  • Evaluation
  • Precision
  • Recall
  • Methodology for Web IE
  • Programming package
  • Machine Learning
  • Pattern Mining

Precision = (# of correctly extracted records) / (# of extracted records)
Recall = (# of correctly extracted records) / (# of records to be extracted)
28
Type III: Newsgroup IE
  • Example: computer-related jobs

29
Output Template
  • Between free-text IE and semi-structured IE
  • Califf, Rapier 99

30
Wrapper Induction Systems
  • Wrapper induction (WI) or information extraction
    (IE) systems are software tools designed to
    generate wrappers.
  • Taxonomy of Web IE systems by:
  • Task domain
  • free text vs semi-structured pages
  • Automation degree
  • supervised vs unsupervised
  • Techniques applied
  • Machine learning vs pattern mining

31
Task Domain
  • Document type
  • Extraction level
  • Field-level, record-level, page-level
  • Extraction target variation
  • Missing Attributes
  • Multi-valued Attributes
  • Multi-order attribute Permutations
  • Nested Data Objects
  • Template variation
  • Various Templates for an attribute
  • Common Templates for various attributes
  • Untokenized Attributes

32
Automation Degree
  • Page-fetching Support
  • Annotation Requirement
  • Output Support
  • API Support

33
Techniques Applied
  • Scan passes
  • Extraction rule types
  • Learning algorithms
  • Tokenization schemes
  • Features used

34
Conclusion
  • Define the IE problem
  • Specify the input training example
  • with annotation, or
  • without annotation
  • Depict the extraction rule
  • Use necessary background knowledge

35
References
  • H. Cunningham, "Information Extraction: A User
    Guide," http://www.dcs.shef.ac.uk
  • MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
  • I. Muslea, "Extraction Patterns for Information
    Extraction Tasks: A Survey," The AAAI-99 Workshop
    on Machine Learning for Information Extraction.
  • Califf, "Relational Learning of Pattern-Matching
    Rules for Information Extraction," AAAI-99.