CS 388: Natural Language Processing: Information Extraction - PowerPoint PPT Presentation

About This Presentation
Title:

CS 388: Natural Language Processing: Information Extraction

Description:

... (Perl) Perl Regex s Perl Regex Examples Simple Extraction Patterns Adding NLP Information to Patterns Pattern-Match Rule Learning RAPIER Pattern ... – PowerPoint PPT presentation

Number of Views:82
Avg rating:3.0/5.0
Slides: 53
Provided by: Raymond161
Category:

less

Transcript and Presenter's Notes

Title: CS 388: Natural Language Processing: Information Extraction


1
CS 388 Natural Language ProcessingInformation
Extraction
  • Raymond J. Mooney
  • University of Texas at Austin

1
1
2
Information Extraction (IE)
  • Identify specific pieces of information (data) in
    a unstructured or semi-structured textual
    document.
  • Transform unstructured information in a corpus of
    documents or web pages into a structured
    database.
  • Applied to different types of text
  • Newspaper articles
  • Web pages
  • Scientific articles
  • Newsgroup messages
  • Classified ads
  • Medical notes

3
Sample Job Posting
Subject US-TN-SOFTWARE PROGRAMMER Date 17 Nov
1996 173729 GMT Organization Reference.Com
Posting Service Message-ID lt56nigpmrs_at_bilbo.refe
rence.comgt SOFTWARE PROGRAMMER Position
available for Software Programmer experienced in
generating software for PC-Based Voice Mail
systems. Experienced in C Programming. Must be
familiar with communicating with and controlling
voice cards preferable Dialogic, however,
experience with others such as Rhetorix and
Natural Microsystems is okay. Prefer 5 years or
more experience with PC Based Voice Mail, but
will consider as little as 2 years. Need to find
a Senior level person who can come on board and
pick up code with very little training. Present
Operating System is DOS. May go to OS-2 or UNIX
in future. Please reply to Kim
Anderson AdNET (901) 458-2888 fax kimander_at_memphis
online.com
Subject US-TN-SOFTWARE PROGRAMMER Date 17 Nov
1996 173729 GMT Organization Reference.Com
Posting Service Message-ID lt56nigpmrs_at_bilbo.refe
rence.comgt SOFTWARE PROGRAMMER Position
available for Software Programmer experienced in
generating software for PC-Based Voice Mail
systems. Experienced in C Programming. Must be
familiar with communicating with and controlling
voice cards preferable Dialogic, however,
experience with others such as Rhetorix and
Natural Microsystems is okay. Prefer 5 years or
more experience with PC Based Voice Mail, but
will consider as little as 2 years. Need to find
a Senior level person who can come on board and
pick up code with very little training. Present
Operating System is DOS. May go to OS-2 or UNIX
in future. Please reply to Kim
Anderson AdNET (901) 458-2888 fax kimander_at_memphis
online.com
4
Extracted Job Template
computer_science_job id 56nigpmrs_at_bilbo.referenc
e.com title SOFTWARE PROGRAMMER salary company
recruiter state TN city country US language
C platform PC \ DOS \ OS-2 \ UNIX application ar
ea Voice Mail req_years_experience
2 desired_years_experience 5 req_degree desired_
degree post_date 17 Nov 1996
5
Named Entity Recognition
  • Specific type of information extraction in which
    the goal is to extract formal names of particular
    types of entities such as people, places,
    organizations, etc.
  • Usually a preprocessing step for subsequent
    task-specific IE, or other tasks such as question
    answering.

6
Named Entity Recognition Example
U.S. Supreme Court quashes
'illegal' Guantanamo trials Military trials
arranged by the Bush administration for detainees
at Guantanamo Bay are illegal, the United States
Supreme Court ruled Thursday. The court found
that the trials known as military commissions
for people detained on suspicion of terrorist
activity abroad do not conform to any act of
Congress. The justices also rejected the
government's argument that the Geneva Conventions
regarding prisoners of war do not apply to those
held at Guantanamo Bay. Writing for the 5-3
majority, Justice Stephen Breyer said the White
House had overstepped its powers under the U.S.
Constitution. "Congress has not issued the
executive a blank cheque," Breyer
wrote. President George W. Bush said he takes the
ruling very seriously and would find a way to
both respect the court's findings and protect the
American people.
7
Named Entity Recognition Example
people places
organizations U.S. Supreme Court
quashes 'illegal' Guantanamo trials Military
trials arranged by the Bush administration for
detainees at Guantanamo Bay are illegal, the
United States Supreme Court ruled Thursday. The
court found that the trials known as military
commissions for people detained on suspicion of
terrorist activity abroad do not conform to any
act of Congress. The justices also rejected the
government's argument that the Geneva Conventions
regarding prisoners of war do not apply to those
held at Guantanamo Bay. Writing for the 5-3
majority, Justice Stephen Breyer said the White
House had overstepped its powers under the U.S.
Constitution. "Congress has not issued the
executive a blank cheque," Breyer
wrote. President George W. Bush said he takes the
ruling very seriously and would find a way to
both respect the court's findings and protect the
American people.
8
Relation Extraction
  • Once entities are recognized, identify specific
    relations between entities
  • Employed-by
  • Located-at
  • Part-of
  • Example
  • Michael Dell is the CEO of Dell Computer
    Corporation and lives in Austin Texas.

9
Early Information Extraction
  • FRUMP (Dejong, 1979) was an early information
    extraction system that processed news stories and
    identified various types of events (e.g.
    earthquakes, terrorist attacks, floods).
  • Used sketchy scripts of various events to
    identify specific pieces of information about
    such events.
  • Able to summarize articles in multiple languages.
  • Relied on brittle hand-built symbolic knowledge
    structures that were hard to build and not very
    robust.

10
MUC
  • DARPA funded significant efforts in IE in the
    early to mid 1990s.
  • Message Understanding Conference (MUC) was an
    annual event/competition where results were
    presented.
  • Focused on extracting information from news
    articles
  • Terrorist events
  • Industrial joint ventures
  • Company management changes
  • Information extraction of particular interest to
    the intelligence community (CIA, NSA).
  • Established standard evaluation methodolgy using
    development (training) and test data and metrics
    precision, recall, F-measure.

11
Other Applications
  • Job postings
  • Newsgroups Rapier from austin.jobs
  • Web pages Flipdog
  • Job resumes
  • BurningGlass
  • Mohomine
  • Seminar announcements
  • Company information from the web
  • Continuing education course info from the web
  • University information from the web
  • Apartment rental ads
  • Molecular biology information from MEDLINE

12
Medline Corpus
TI - Two potentially oncogenic cyclins, cyclin A
and cyclin D1, share common properties of subunit
configuration, tyrosine phosphorylation and
physical association with the Rb protein AB -
Originally identified as a mitotic cyclin,
cyclin A exhibits properties of growth factor
sensitivity, susceptibility to viral subversion
and association with a tumor-suppressor protein,
properties which are indicative of an
S-phase-promoting factor (SPF) as well as a
candidate proto-oncogene Moreover, cyclin D1
was found to be phosphorylated on tyrosine
residues in vivo and, like cyclin A, was readily
phosphorylated by pp60c-src in vitro. In
synchronized human osteosarcoma cells, cyclin D1
is induced in early G1 and becomes associated
with p9Ckshs1, a Cdk-binding subunit. Immunoprecip
itation experiments with human osteosarcoma cells
and Ewings sarcoma cells demonstrated that
cyclin D1 is associated with both p34cdc2 and
p33cdk2, and that cyclin D1 immune complexes
exhibit appreciable histone H1 kinase activity
13
Medline Corpus Named Entity Recognition
(Proteins)
TI - Two potentially oncogenic cyclins, cyclin A
and cyclin D1, share common properties of subunit
configuration, tyrosine phosphorylation and
physical association with the Rb protein AB -
Originally identified as a mitotic cyclin,
cyclin A exhibits properties of growth factor
sensitivity, susceptibility to viral subversion
and association with a tumor-suppressor protein,
properties which are indicative of an
S-phase-promoting factor (SPF) as well as a
candidate proto-oncogene Moreover, cyclin D1
was found to be phosphorylated on tyrosine
residues in vivo and, like cyclin A, was readily
phosphorylated by pp60c-src in vitro. In
synchronized human osteosarcoma cells, cyclin D1
is induced in early G1 and becomes associated
with p9Ckshs1, a Cdk-binding subunit. Immunoprecip
itation experiments with human osteosarcoma cells
and Ewings sarcoma cells demonstrated that
cyclin D1 is associated with both p34cdc2 and
p33cdk2, and that cyclin D1 immune complexes
exhibit appreciable histone H1 kinase activity
14
Medline Corpus Relation ExtractionProtein
Interactions
TI - Two potentially oncogenic cyclins, cyclin A
and cyclin D1, share common properties of subunit
configuration, tyrosine phosphorylation and
physical association with the Rb protein AB -
Originally identified as a mitotic cyclin,
cyclin A exhibits properties of growth factor
sensitivity, susceptibility to viral subversion
and association with a tumor-suppressor protein,
properties which are indicative of an
S-phase-promoting factor (SPF) as well as a
candidate proto-oncogene Moreover, cyclin D1
was found to be phosphorylated on tyrosine
residues in vivo and, like cyclin A, was readily
phosphorylated by pp60c-src in vitro. In
synchronized human osteosarcoma cells, cyclin D1
is induced in early G1 and becomes associated
with p9Ckshs1, a Cdk-binding subunit. Immunoprecip
itation experiments with human osteosarcoma cells
and Ewings sarcoma cells demonstrated that
cyclin D1 is associated with both p34cdc2 and
p33cdk2, and that cyclin D1 immune complexes
exhibit appreciable histone H1 kinase activity
15
Web Extraction
  • Many web pages are generated automatically from
    an underlying database.
  • Therefore, the HTML structure of pages is fairly
    specific and regular (semi-structured).
  • However, output is intended for human
    consumption, not machine interpretation.
  • An IE system for such generated pages allows the
    web site to be viewed as a structured database.
  • An extractor for a semi-structured web site is
    sometimes referred to as a wrapper.
  • Process of extracting from such pages is
    sometimes referred to as screen scraping.

16
Amazon Book Description
. lt/tdgtlt/trgt lt/tablegt ltb class"sans"gtThe Age of
Spiritual Machines When Computers Exceed Human
Intelligencelt/bgtltbrgt ltfont faceverdana,arial,helv
etica size-1gt by lta href"/exec/obidos/search-han
dle-url/indexbooksfield-author
Kurzweil2C20Ray/002-6235079-4593641"gt Ray
Kurzweillt/agtltbrgt lt/fontgt ltbrgt lta
href"http//images.amazon.com/images/P/0140282025
.01.LZZZZZZZ.jpg"gt ltimg src"http//images.amazon.
com/images/P/0140282025.01.MZZZZZZZ.gif" width90
height140 alignleft border0gtlt/agt ltfont
faceverdana,arial,helvetica size-1gt ltspan
class"small"gt ltspan class"small"gt ltbgtList
Pricelt/bgt ltspan classlistpricegt14.95lt/spangtltbrgt
ltbgtOur Price ltfont color990000gt11.96lt/fontgtlt/
bgtltbrgt ltbgtYou Savelt/bgt ltfont color990000gtltbgt2.
99 lt/bgt (20)lt/fontgtltbrgt lt/spangt ltpgt ltbrgt
. lt/tdgtlt/trgt lt/tablegt ltb class"sans"gtThe Age of
Spiritual Machines When Computers Exceed Human
Intelligencelt/bgtltbrgt ltfont faceverdana,arial,helv
etica size-1gt by lta href"/exec/obidos/search-han
dle-url/indexbooksfield-author
Kurzweil2C20Ray/002-6235079-4593641"gt Ray
Kurzweillt/agtltbrgt lt/fontgt ltbrgt lta
href"http//images.amazon.com/images/P/0140282025
.01.LZZZZZZZ.jpg"gt ltimg src"http//images.amazon.
com/images/P/0140282025.01.MZZZZZZZ.gif" width90
height140 alignleft border0gtlt/agt ltfont
faceverdana,arial,helvetica size-1gt ltspan
class"small"gt ltspan class"small"gt ltbgtList
Pricelt/bgt ltspan classlistpricegt14.95lt/spangtltbrgt
ltbgtOur Price ltfont color990000gt11.96lt/fontgtlt/
bgtltbrgt ltbgtYou Savelt/bgt ltfont color990000gtltbgt2.
99 lt/bgt (20)lt/fontgtltbrgt lt/spangt ltpgt ltbrgt
17
Extracted Book Template
Title The Age of Spiritual Machines
When Computers Exceed Human Intelligence Author
Ray Kurzweil List-Price 14.95 Price 11.96
18
Template Types
  • Slots in template typically filled by a substring
    from the document.
  • Some slots may have a fixed set of pre-specified
    possible fillers that may not occur in the text
    itself.
  • Terrorist act threatened, attempted,
    accomplished.
  • Job type clerical, service, custodial, etc.
  • Company type SEC code
  • Some slots may allow multiple fillers.
  • Programming language
  • Some domains may allow multiple extracted
    templates per document.
  • Multiple apartment listings in one ad

19
IE as Sequence Labeling
  • Can treat IE as a sequence labeling problem.
  • Can apply a sliding window classifier using
    various classification algorithms.
  • Can apply probabilistic sequence models
  • HMM
  • CRF

20
Pattern-Matching Rule Extraction
  • Another approach to building IE systems is to use
    pattern-matching rules for each field to identify
    the strings to extract for that field.
  • When building web extraction systems (wrappers)
    manually, it is common to write regular
    expression patterns (in a language like Perl) to
    identify the desired regions of the text.
  • Works well when a fairly fixed local context is
    sufficient to identify extractions, as in
    extracting from web pages generated by a program
    or very stylized text like classified ads.

21
Regular Expressions
  • Language for composing complex patterns from
    simpler ones.
  • An individual character is a regex.
  • Union If e1 and e2 are regexes, then (e1 e2 )
    is a regex that matches whatever either e1 or e2
    matches.
  • Concatenation If e1 and e2 are regexes, then e1
    e2 is a regex that matches a string that consists
    of a substring that matches e1 immediately
    followed by a substring that matches e2
  • Repetition (Kleene closure) If e1 is a regex,
    then e1 is a regex that matches a sequence of
    zero or more strings that match e1

22
Regular Expression Examples
  • (ue)nabl(eing) matches
  • unable
  • unabling
  • enable
  • enabling
  • (unen)able matches
  • able
  • unable
  • unenable
  • enununenable

23
Enhanced Regexs (Perl)
  • Special terms for common sets of characters, such
    as alphabetic or numeric or general wildcard.
  • Special repetition operator () for 1 or more
    occurrences.
  • Special optional operator (?) for 0 or 1
    occurrences.
  • Special repetition operator for specific range of
    number of occurrences min,max.
  • A1,5 One to five As.
  • A5, Five or more As
  • A5 Exactly five As

24
Perl Regexs
  • Character classes
  • \w (word char) Any alpha-numeric (not \W)
  • \d (digit char) Any digit (not \D)
  • \s (space char) Any whitespace (not \S)
  • . (wildcard) Anything
  • Anchor points
  • \b (boundary) Word boundary
  • Beginning of string
  • End of string

25
Perl Regex Examples
  • U.S. phone number with optional area code
  • /\b(\(\d3\)\s?)?\d3-\d4\b/
  • Email address
  • /\b\S_at_\S(\.com\.edu\.gov\.org\.net)\b/

26
Simple Extraction Patterns
  • Specify an item to extract for a slot using a
    regular expression pattern.
  • Price pattern \b\\d(\.\d2)?\b
  • May require preceding (pre-filler) pattern to
    identify proper context.
  • Amazon list price
  • Pre-filler pattern ltbgtList Pricelt/bgt ltspan
    classlistpricegt
  • Filler pattern \\d(\.\d2)?\b
  • May require succeeding (post-filler) pattern to
    identify the end of the filler.
  • Amazon list price
  • Pre-filler pattern ltbgtList Pricelt/bgt ltspan
    classlistpricegt
  • Filler pattern .
  • Post-filler pattern lt/spangt

27
Adding NLP Information to Patterns
  • If extracting from automatically generated web
    pages, simple regex patterns usually work.
  • If extracting from more natural, unstructured,
    human-written text, some NLP may help.
  • Part-of-speech (POS) tagging
  • Mark each word as a noun, verb, preposition, etc.
  • Syntactic parsing
  • Identify phrases NP, VP, PP
  • Semantic word categories (e.g. from WordNet)
  • KILL kill, murder, assassinate, strangle,
    suffocate
  • Extraction patterns can use POS or phrase tags.
  • Crime victim
  • Prefiller POS V, Hypernym KILL
  • Filler Phrase NP

28
Pattern-Match Rule Learning
  • Writing accurate patterns for each slot for each
    application requires laborious software
    engineering.
  • Alternative is to use rule induction methods.
  • RAPIER system (Califf Mooney, 1999) learns
    three regex-style patterns for each slot
  • Pre-filler pattern
  • Filler pattern
  • Post-filler pattern
  • RAPIER allows use of POS and WordNet categories
    in patterns to generalize over lexical items.

29
RAPIER Pattern Induction Example
  • If goal is to extract the name of the city in
    which a posted job is located, the
    least-general-generalization constructed by
    RAPIER is

30
Evaluating IE Accuracy
  • Always evaluate performance on independent,
    manually-annotated test data not used during
    system development.
  • Measure for each test document
  • Total number of correct extractions in the
    solution template N
  • Total number of slot/value pairs extracted by the
    system E
  • Number of extracted slot/value pairs that are
    correct (i.e. in the solution template) C
  • Compute average value of metrics adapted from IR
  • Recall C/N
  • Precision C/E
  • F-Measure Harmonic mean of recall and precision

31
IE Experiment in Bioinformatics
  • Large scale comparison of IE methods on
    identifying names of human proteins in biomedical
    journal abstracts (Bunescu et al. 2004).
  • Goal is to mine the large body of biomedical
    literature to extract a useful database of all
    known protein interactions.
  • Biologists can use this protein network to
    better understand the overall biochemical
    functioning of an organism.

32
Non-Learning Protein Extractors
  • Dictionary-based extraction
  • Uses a gazetteer of known human protein names.
  • KEX (Fukuda et al., 1998)
  • General protein-name identifier not specialized
    for human.

33
Learning Methods for Protein Extraction
  • Rule-based pattern induction
  • Rapier (Califf Mooney, 1999)
  • BWI (Freitag Kushmerick, 2000)
  • Token classification (chunking approach)
  • K-nearest neighbor
  • Transformation-Based Rule Learning Abgene
    (Tanabe Wilbur, 2002)
  • Support Vector Machine (maximum-margin
    Perceptron)
  • Maximum entropy (discriminative version of Naïve
    Bayes)
  • Hidden Markov Models
  • Conditional Random Fields (Lafferty, McCallum,
    and Pereira, 2001)
  • Relational Markov Networks (Taskar, Abbeel, and
    Koller, 2002)

34
Biomedical Corpora
  • AIMed 750 abstracts that contain the word human
    were randomly chosen from Medline for testing
    protein name extraction. They were manually
    tagged by experts to annotate a total of 5,206
    human protein references (Bunescu et al., 2005).
  • Yapex Another corpus of 200 abstracts manually
    tagged for human protein names.

35
Experimental Method
  • 10-fold cross-validation Average results over 10
    trials with different training and (independent)
    test data.
  • For methods which produce confidence in
    extractions, vary threshold for extraction in
    order to explore recall-precision trade-off.
  • Use standard methods from information-retrieval
    to generate a complete precision-recall curve.
  • Maximizing F-measure assumes a particular
    cost-benefit trade-off between incorrect and
    missed extractions.

36
Protein Name Extraction ResultsAIMed Corpus
37
Protein Name Extraction Results Yapex Corpus
38
Relation Extraction
  • Biomedical corpora gt Interactions between
    Proteins.

interaction
protein
protein
Cyclin D1 is induced in early G1 and becomes
associated with p9Ckshs1, a Cdk binding subunit.
  • Newspaper corpora gt relationships (e.g. Role,
    Part, Location, Near, Social) between predefined
    types of entities (e.g. Person, Organization,
    Facility, Location, Geo-Political).

location
location
people
people
facility
Protesters seized several pumping stations,
holding 127 Shell workers hostage.
39
ELCS (Extraction using Longest Common
Subsequences)
  • A method for inducing pattern-match rules that
    extract interactions between previously tagged
    proteins.
  • Each rule consists of a sequence of words with
    allowable word gaps between them (similar to
    Blaschke Valencia, 2001, 2002). - (7)
    interactions (0) between (5) PROT (9) PROT (17) .
  • Any pair of proteins in a sentence if tagged as
    interacting forms a positive example, otherwise
    it forms a negative example.
  • Positive examples are repeatedly generalized to
    form rules until the rules become overly general
    and start matching negative examples.

40
Generalizing Rules using Longest Common
Subsequence
The self - association site appears to be formed
by interactions between helices 1 and 2 of beta
spectrin repeat 17 of one dimer with helix 3 of
alpha spectrin repeat 1 of the other dimer to
form two combined alpha - beta triple - helical
segments . Title - Physical and functional
interactions between the transcriptional
inhibitors Id3 and ITF-2b .
41
Protein Interaction Corpus
  • 200 abstracts previously known to contain protein
    interactions were obtained from the Database of
    Interacting Proteins. They contain 1,101
    interactions and 4,141 protein names.
  • As negative examples for interaction extraction
    are rare, an extra set of 30 abstracts containing
    sentences with non-interacting proteins are
    included.
  • The resulting 230 abstracts are used for testing
    protein interaction extraction.

42
Protein Interaction Extraction Results(gold-stand
ard protein tags)
43
Protein Interaction Extraction Results(automated
protein tags)
44
ERK Relation Extraction using a String
Subsequence Kernel
  • Subsequences of words and POS tags are used as
    implicit features.
  • Assumes the entities have already been annotated.
  • The feature space can be further pruned down in
    almost all examples, a sentence asserts a
    relationship between two entities using one of
    the following patterns
  • FI Fore-Inter interaction of P1 with P2,
    activation of P1 by P2
  • I Inter P1 interacts with P2, P1 is
    activated by P2
  • IA Inter-After P1 P2 complex, P1 and P2
    interact

Bunescu et al., 2005.
interaction of (3) PROT (3) with PROT
45
Protein Interaction Extraction Results(gold-stand
ard protein tags)
46
ACE 2002 Newspaper Corpus
  • Newspaper article extraction task.
  • Documents
  • 422 training documents
  • 97 test documents
  • Extracted information
  • Entities Person, Organization, Facility,
    Location, _______Geopolitical Entity
  • Relations Role, Part, Located, Near, Social

47
ACE 2002 Newspaper Corpus
  • Compared
  • ERK string subsequence kernel extractor
  • K4 The tree dependency kernel from
  • Culotta et. al, 2004.

Method Precision Recall F-measure
ERK 73.9 35.2 47.7
K4 70.3 26.3 38.0
48
Text Mining
  • Automatically extract information from a large
    corpus to build a large database or
    knowledge-base of useful information.
  • For example, we have used our trained protein
    interaction extractor to mine biomedical journal
    abstracts
  • Input 753,459 Medline abstracts that reference
    human
  • Output Database of 6,580 interactions between
    3,737 human proteins

49
Active Learning
  • Annotating training documents for each
    application is difficult and expensive.
  • Random selection can waste effort on annotating
    documents that do not help the learner.
  • Best to focus human effort on annotating the most
    informative documents.
  • Active learning methods pick only the most
    informative examples for training.
  • At each step, select the example that is
    estimated to be the most useful for improving the
    current learner and then ask the human oracle to
    annotate this example.

50
Uncertainty Sampling
  • Assume learned system can provide confidence in
    its predicted labelings of examples.
  • From a pool of unlabeled data, pick as most
    informative, the unlabelled example about which
    the current learned system is most uncertain.

Let D be a set of unlabeled examples Until
desired accuracy is reached Apply current
learned system, L, to all examples in D
From D, select the example, E, whose label is
most uncertain Ask the user to label E and
remove it from D. Add E to the training
set and retrain L
51
Rapier Uncertainty Sampling Results
52
Information Extraction Issues
  • Effectively exploiting global information
  • Better active learning methods
  • Integrating entity and relation extraction
  • Unsupervised IE
  • Semi-supervised IE
  • Adaptation and transfer to new tasks
  • Mining extracted data to find cross-document
    regularities.
  • Use resulting mined knowledge to improve IE
Write a Comment
User Comments (0)
About PowerShow.com