Information Extraction and Integration: an Overview - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Information Extraction and Integration: an Overview


1
Information Extraction and Integration: an
Overview
  • William W. Cohen
  • Carnegie Mellon University
  • April 26, 2004

2
Example: The Problem
Martin Baker, a person
Genomics job
Employer's job posting form
3
Example: A Solution
4
Extracting Job Openings from the Web
5
Job Openings: Category = Food Services; Keyword =
Baker; Location = Continental U.S.
6
IE from Research Papers
7
What is Information Extraction?
As a task:
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered, saying...
NAME              TITLE     ORGANIZATION
8
What is Information Extraction?
As a task:
Filling slots in a database from sub-segments of
text.
(same news story as above)
IE
NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..
9
What is Information Extraction?
As a family of techniques:
Information Extraction =
segmentation + classification + clustering + association
(same news story as above)
Microsoft Corporation / CEO / Bill Gates /
Microsoft / Gates / Microsoft / Bill Veghte /
Microsoft / VP / Richard Stallman / founder /
Free Software Foundation
aka named entity extraction
10
What is Information Extraction?
As a family of techniques:
Information Extraction =
segmentation + classification + association + clustering
(same news story and extracted entities as above)
11
What is Information Extraction?
As a family of techniques:
Information Extraction =
segmentation + classification + association + clustering
(same news story and extracted entities as above)
12
What is Information Extraction?
As a family of techniques:
Information Extraction =
segmentation + classification + association + clustering
(same news story and extracted entities as above)
13
Tutorial Outline
  • IE History
  • Landscape of problems and solutions
  • Models for named entity recognition
  • Sliding window
  • Boundary finding
  • Finite state machines
  • Overview of related problems and solutions
  • Association, Clustering
  • Integration with Data Mining

14
IE History
  • Pre-Web
  • Mostly news articles
  • DeJong's FRUMP [1982]
  • Hand-built system to fill Schank-style scripts
    from news wire
  • Message Understanding Conference (MUC) DARPA
    '87-'95, TIPSTER '92-'96
  • Early work dominated by hand-built models
  • E.g. SRI's FASTUS, hand-built FSMs
  • But by 1990s, some machine learning: Lehnert,
    Cardie, Grishman; and then HMMs: Elkan & Leek '97,
    BBN (Bikel et al '98)
  • Web
  • AAAI '94 Spring Symposium on Software Agents
  • Much discussion of ML applied to the Web: Maes,
    Mitchell, Etzioni
  • Tom Mitchell's WebKB, '96
  • Build KBs from the Web
  • Wrapper Induction
  • Initially hand-built, then ML: Soderland '96,
    Kushmerick '97, ...
  • CiteSeer; Cora; FlipDog; contEd courses,
    corpInfo, ...

15
IE History
  • Biology
  • Gene/protein entity extraction
  • Protein/protein interaction facts
  • Automated curation/integration of databases
  • At CMU: SLIF (Murphy et al: subcellular
    information from images and text in journal
    articles)
  • Email
  • EPCA, PAL, RADAR, CALO: intelligent office
    assistants that understand some part of email
  • At CMU: web site update requests, office-space
    requests, calendar/scheduling requests, social
    network analysis of email

16
IE is different in different domains!
Example: on the web there is less grammar, but more
formatting and linking.
Newswire
Web
www.apple.com/retail
Apple to Open Its First Retail Store in New York
City MACWORLD EXPO, NEW YORK--July 17,
2002--Apple's first retail store in New York City
will open in Manhattan's SoHo district on
Thursday, July 18 at 8:00 a.m. EDT. The SoHo
store will be Apple's largest retail store to
date and is a stunning example of Apple's
commitment to offering customers the world's best
computer shopping experience. "Fourteen months
after opening our first retail store, our 31
stores are attracting over 100,000 visitors each
week," said Steve Jobs, Apple's CEO. "We hope our
SoHo store will surprise and delight both Mac and
PC users who want to see everything the Mac can
do to enhance their digital lifestyles."
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
The directory structure, link structure,
formatting, and layout of the Web is its own new
grammar.
17
Landscape of IE Tasks (1/4): Degree of Formatting
Text paragraphs without formatting
Grammatical sentences and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets, rich formatting & links
Tables
18
Landscape of IE Tasks (2/4): Intended Breadth of
Coverage
Web site specific (formatting): e.g., Amazon.com
book pages
Genre specific (layout): e.g., resumes
Wide, non-specific (language): e.g., university
names
19
Landscape of IE Tasks (3/4): Complexity
(e.g., word patterns)

Closed set (e.g., U.S. states):
"He was born in Alabama..."
"...the big Wyoming sky..."

Regular set (e.g., U.S. phone numbers):
"Phone: (413) 545-1323"
"The CALD main office can be reached at
412-268-1299"

Complex pattern (e.g., U.S. postal addresses):
"University of Arkansas, P.O. Box 140, Hope, AR
71802"
"Headquarters: 1128 Main Street, 4th Floor,
Cincinnati, Ohio 45210"

Ambiguous patterns, needing context and many
sources of evidence (e.g., person names):
"...was among the six houses sold by Hope Feldman
that year."
"Pawel Opalinski, Software Engineer at WhizBang
Labs."
20
Landscape of IE Tasks (4/4): Single Field/Record

Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.

Single entity (named entity extraction):
Person: Jack Welch
Person: Jeffrey Immelt
Location: Connecticut

Binary relationship:
Relation: Person-Title; Person: Jack Welch;
Title: CEO
Relation: Company-Location; Company: General
Electric; Location: Connecticut

N-ary record:
Relation: Succession; Company: General Electric;
Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
21
Landscape of IE Techniques (1/1): Models

Lexicon example: "Abraham Lincoln was born in
Kentucky." -- member of {Alabama, Alaska, ...,
Wisconsin, Wyoming}?

Any of these models can be used to capture words,
formatting, or both.
22
Sliding Windows
23
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer
Science Carnegie Mellon University
3:30 pm, 7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
24
Extraction by Sliding Window
(same seminar announcement as above)
25
Extraction by Sliding Window
(same seminar announcement as above)
26
Extraction by Sliding Window
(same seminar announcement as above)
27
A Naïve Bayes Sliding Window Model
Freitag 1997
...00 pm
Place: Wean Hall Rm 5409
Speaker: Sebastian Thrun

[Window diagram: prefix w_{t-m} ... w_{t-1} |
contents w_t ... w_{t+n} | suffix w_{t+n+1} ...
w_{t+n+m}]

Estimate Pr(LOCATION | window) using Bayes rule.
Try all reasonable windows (vary length, position).
Assume independence for length, prefix words,
suffix words, content words.
Estimate from data quantities like Pr("Place" in
prefix | LOCATION).
If Pr("Wean Hall Rm 5409" = LOCATION) is above
some threshold, extract it.
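To make this concrete, a minimal Python sketch of the scoring (not Freitag's code; the probability tables, smoothing floor, and threshold are illustrative assumptions, estimated beforehand from labeled windows):

import math

def window_log_score(tokens, i, j, m, tables, prior):
    # Log Pr(LOCATION | window) up to a constant, under the naive
    # independence assumptions: length, prefix, contents, suffix.
    score = math.log(prior) + math.log(tables['length'].get(j - i, 1e-6))
    for w in tokens[max(0, i - m):i]:          # prefix words
        score += math.log(tables['prefix'].get(w, 1e-6))
    for w in tokens[i:j]:                      # content words
        score += math.log(tables['content'].get(w, 1e-6))
    for w in tokens[j:j + m]:                  # suffix words
        score += math.log(tables['suffix'].get(w, 1e-6))
    return score

def extract(tokens, tables, prior, min_len, max_len, m=3, threshold=-50.0):
    # Try all reasonable windows (vary length and position);
    # keep those scoring above a threshold.
    hits = []
    for i in range(len(tokens)):
        for k in range(min_len, max_len + 1):
            if i + k <= len(tokens):
                s = window_log_score(tokens, i, i + k, m, tables, prior)
                if s > threshold:
                    hits.append((s, ' '.join(tokens[i:i + k])))
    return sorted(hits, reverse=True)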
28
Naïve Bayes Sliding Window Results
Domain: CMU UseNet Seminar Announcements
(same seminar announcement as above)
Field         F1
Person Name   30
Location      61
Start Time    98
29
SRV: a realistic sliding-window-classifier IE
system
Freitag, AAAI '98
  • What windows to consider?
  • all windows containing as many tokens as the
    shortest example, but no more tokens than the
    longest example
  • How to represent a classifier? It might
  • Restrict the length of window
  • Restrict the vocabulary or formatting used
    before/after/inside window
  • Restrict the relative order of tokens
  • Use inductive logic programming techniques to
    express all these

<title>Course Information for CS213</title>
<h1>CS 213 C Programming</h1>
30
SRV: a rule-learner for sliding-window
classification
  • Primitive predicates used by SRV:
  • token(X,W), allLowerCase(W), numerical(W), ...
  • nextToken(W,U), previousToken(W,V)
  • HTML-specific predicates:
  • inTitleTag(W), inH1Tag(W), inEmTag(W), ...
  • emphasized(W) = inEmTag(W) or inBTag(W) or ...
  • tableNextCol(W,U): U is some token in the
    column after the column W is in
  • tablePreviousCol(W,V), tableRowHeader(W,T), ...

31
SRV: a rule-learner for sliding-window
classification
  • Non-primitive conditions used by SRV (see the
    sketch below):
  • every(X, f, c): for all W in X, f(W) = c
  • some(X, W, <f1,...,fk>, g, c): there exists W
    such that g(fk(...(f1(W))...)) = c
  • tokenLength(X, relop, c)
  • position(W, direction, relop, c)
  • e.g., tokenLength(X, >, 4),
    position(W, fromEnd, <, 2)
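A minimal Python sketch of evaluating such conditions over a candidate window (illustrative only: SRV itself learns rules with an ILP-style search, and this token representation is an assumption, not SRV's):

import operator

# Tokens are dicts of features; relational features like 'next'
# point at the neighboring token dict (or None at the boundary).

def every(window, f, c):
    # every(X, f, c): f(W) = c for all tokens W in the window X
    return all(tok.get(f) == c for tok in window)

def some(window, path, g, c):
    # some(X, W, <f1..fk>, g, c): some W with g(fk(...f1(W)...)) = c
    for tok in window:
        cur = tok
        for f in path:
            cur = cur.get(f) if cur is not None else None
        if cur is not None and cur.get(g) == c:
            return True
    return False

def token_length(window, relop, c):
    # tokenLength(X, relop, c), e.g. token_length(w, operator.gt, 4)
    return relop(len(window), c)

# Tiny usage example with a two-token window linked by 'next':
t2 = {'word': '213', 'numerical': True, 'next': None}
t1 = {'word': 'CS', 'numerical': False, 'next': t2}
window = [t1, t2]
print(every(window, 'numerical', True))           # False ('CS' is not)
print(some(window, ['next'], 'numerical', True))  # True ('CS' -> '213')
print(token_length(window, operator.gt, 4))       # False (length 2)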

32
Rapier results vs. SRV
33
Rule-learning approaches to sliding-window
classification: Summary
  • SRV, Rapier, and WHISK [Soderland, KDD '97]
  • Representations for classifiers allow restriction
    of the relationships between tokens, etc
  • Representations are carefully chosen subsets of
    even more powerful representations based on logic
    programming (ILP and Prolog)
  • Use of these heavyweight representations is
    complicated, but seems to pay off in results
  • Some questions to consider:
  • Can simpler, propositional representations for
    classifiers work? (see Roth and Yih)
  • What learning methods to consider? (NB, ILP,
    boosting, semi-supervised; see Collins & Singer)
  • When do we want to use this method vs. fancier
    ones?

34
BWI: Learning to detect boundaries
Freitag & Kushmerick, AAAI 2000
  • Another formulation: learn three probabilistic
    classifiers:
  • START(i) = Prob(position i starts a field)
  • END(j) = Prob(position j ends a field)
  • LEN(k) = Prob(an extracted field has length k)
  • Then score a possible extraction (i,j) by
  • START(i) * END(j) * LEN(j-i)
  • LEN(k) is estimated from a histogram (see the
    sketch below)
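A minimal sketch of this scoring in Python (hypothetical inputs; the real BWI learns its boundary detectors by boosting, which is not shown here):

def span_score(i, j, start_prob, end_prob, len_hist, n_fields):
    # Score the extraction spanning [i, j): START(i) * END(j) * LEN(j - i),
    # with LEN read off a histogram of training-field lengths.
    return start_prob[i] * end_prob[j] * (len_hist.get(j - i, 0) / n_fields)

def best_span(start_prob, end_prob, len_hist, n_fields, max_len):
    # Search all candidate (i, j) pairs up to max_len apart.
    best, best_ij = 0.0, None
    for i in range(len(start_prob)):
        for j in range(i + 1, min(i + max_len + 1, len(end_prob))):
            s = span_score(i, j, start_prob, end_prob, len_hist, n_fields)
            if s > best:
                best, best_ij = s, (i, j)
    return best_ij, best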

35
BWI: Learning to detect boundaries

Field         F1
Person Name   30
Location      61
Start Time    98
36
Problems with Sliding Windows and Boundary
Finders
  • Decisions in neighboring parts of the input are
    made independently from each other.
  • Expensive for long entity names
  • Sliding Window may predict a seminar end time
    before the seminar start time.
  • It is possible for two overlapping windows to
    both be above threshold.
  • In a Boundary-Finding system, left boundaries are
    laid down independently from right boundaries,
    and their pairing happens as a separate step.

37
Finite State Machines
38
IE with Hidden Markov Models
Given a sequence of observations
Yesterday Pedro Domingos spoke this example
sentence.
and a trained HMM
person name
location name
background
Find the most likely state sequence (Viterbi)
Yesterday Pedro Domingos spoke this example
sentence.
Any words said to be generated by the designated
"person name" state are extracted as a person name:
Person name: Pedro Domingos
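A compact Viterbi sketch in Python (toy code; a trained HMM would supply the start, trans, and emit tables):

import math

def viterbi(words, states, start, trans, emit, unk=1e-6):
    # Most likely state sequence for the observed words.
    V = [{s: math.log(start[s]) + math.log(emit[s].get(words[0], unk))
          for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans[prev][s])
                       + math.log(emit[s].get(words[t], unk)))
            back[t][s] = prev
    seq = [max(states, key=lambda s: V[-1][s])]   # best final state
    for t in range(len(words) - 1, 0, -1):        # follow back-pointers
        seq.append(back[t][seq[-1]])
    return list(reversed(seq))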
39
HMM for Segmentation
  • Simplest Model: One state per entity type

40
What is a symbol ???
Cohen => Cohen, cohen, Xxxxx, Xx, ... ?
4601 => 4601, 9999, 9, number, ... ?
Datamold: choose the best abstraction level using a
holdout set
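A small sketch of computing abstraction levels like these (illustrative; Datamold's exact levels are not reproduced here):

import re

def abstractions(word):
    # Return the word plus successively coarser versions of it.
    shape = re.sub(r'[A-Z]', 'X',
            re.sub(r'[a-z]', 'x',
            re.sub(r'[0-9]', '9', word)))
    return [word,
            word.lower(),
            shape,                                  # 'Cohen' -> 'Xxxxx'
            re.sub(r'(.)\1+', r'\1', shape),        # collapse runs: 'Xx', '9'
            'number' if word.isdigit() else 'word'] # coarsest level

print(abstractions('Cohen'))  # ['Cohen', 'cohen', 'Xxxxx', 'Xx', 'word']
print(abstractions('4601'))   # ['4601', '4601', '9999', '9', 'number']

Choosing among such levels (e.g., on a holdout set) trades sparseness against discriminative power.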
41
HMM Example: Nymble
Bikel et al 1998 (BBN "IdentiFinder")
Task: Named Entity Extraction

[State diagram: start-of-sentence -> {Person, Org,
(five other name classes), Other} -> end-of-sentence]

Transition probabilities:  P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1})
                           or P(o_t | s_t, o_{t-1})
Back-off to: P(s_t | s_{t-1}), then P(s_t)
             P(o_t | s_t), then P(o_t)

Train on 500k words of news wire text.

Results:
Case    Language    F1
Mixed   English     93
Upper   English     91
Mixed   Spanish     90

Other examples of shrinkage for HMMs in IE:
Freitag and McCallum '99
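A toy sketch of the back-off idea (Nymble weights its back-off levels by training counts; the fixed interpolation weights here are an assumption for brevity):

def backoff_prob(word, state, prev_word, bigram, unigram, marginal,
                 lambdas=(0.6, 0.3, 0.1), floor=1e-7):
    # Interpolate P(w | s, w_prev) with P(w | s) and P(w).
    p2 = bigram.get((state, prev_word), {}).get(word, 0.0)
    p1 = unigram.get(state, {}).get(word, 0.0)
    p0 = marginal.get(word, 0.0)
    l2, l1, l0 = lambdas
    return max(l2 * p2 + l1 * p1 + l0 * p0, floor)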
42
What is a symbol?
  • Bikel et al mix symbols from two abstraction
    levels

43
What is a symbol?
Ideally we would like to use many, arbitrary,
overlapping features of words.
[Graphical model: states S_{t-1}, S_t, S_{t+1}
emitting observations O_{t-1}, O_t, O_{t+1};
example features of the word "Wisniewski": identity
of word, ends in "-ski", is capitalized, is part of
a noun phrase, is in a list of city names, is under
node X in WordNet, is in bold font, is indented, is
in hyperlink anchor]

Lots of learning systems are not confounded by
multiple, non-independent features: decision trees,
neural nets, SVMs, ...
44
What is a symbol?
[Same graphical model and features as above]

Idea: replace the generative model in the HMM with a
maxent model, where the state depends on observations
45
What is a symbol?
[Same graphical model and features as above]

Idea: replace the generative model in the HMM with a
maxent model, where the state depends on observations
and the previous state
46
What is a symbol?
[Same graphical model and features as above]

Idea: replace the generative model in the HMM with a
maxent model, where the state depends on observations
and the previous state history
47
Ratnaparkhi's MXPOST
  • Sequential learning problem predict POS tags of
    words.
  • Uses MaxEnt model described above.
  • Rich feature set.
  • To smooth, discard features occurring < 10 times.

48
Conditional Markov Models (CMMs), aka MEMMs, aka
Maxent Taggers, vs. HMMs

[Two chain diagrams over states S_{t-1}, S_t, S_{t+1}
and observations O_{t-1}, O_t, O_{t+1}: in the HMM the
states generate the observations; in the CMM the
observations feed into the state transitions]
49
Label Bias Problem
  • Consider this MEMM, and enough training data to
    perfectly model it:

[FSM: path 0-1-2-3 reads "rib"; path 0-4-5-3 reads "rob"]

Pr(0123 | rib) = 1    Pr(0453 | rob) = 1

Pr(0123 | rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3
               = 0.5 * 1 * 1
Pr(0453 | rib) = Pr(4|0,r)/Z1 * Pr(5|4,i)/Z2 * Pr(3|5,b)/Z3
               = 0.5 * 1 * 1

Because each state's outgoing scores are locally
normalized, states 1 and 4 must pass all their mass
forward whatever the middle observation is, so the
wrong path scores as well as the right one.
50
Another view of label bias [Sha & Pereira]
So what's the alternative?
51
CMMs to CRFs
52
What's the new model look like?
What's independent?
53
Graphical comparison among HMMs, MEMMs and CRFs
[Diagrams: the dependency structures of an HMM, an
MEMM, and a CRF]
54
Sha & Pereira results
CRF beats MEMM (McNemar's test); MEMM probably
beats voted perceptron
55
Sha & Pereira results
[Table: training times in minutes, 375k examples]
56
Combining Sliding Windows and Finite-State Models
57
Problems with Tagging-based NER
  • Decisions are made token-by-token
  • Feature engineering can be awkward
  • How long is the extracted name?
  • Does the extracted name appear in a dictionary?
  • Is the extracted name similar to (an acronym for,
    a short version of...) a name appearing earlier
    in the document?
  • Tag engineering is also crucial
  • Do we use one state per entity? a special state
    for beginning an entity? a special state for end?

58
Semi-Markov models for IE
with Sunita Sarawagi, IIT Bombay
  • Train on sequences of labeled segments, not
    labeled words.
  • S = (start, end, label)
  • Build a probability model of segment sequences,
    not word sequences
  • Define features f of segments (see the sketch
    below)
  • (Approximately) optimize feature weights on
    training data

f(S) = (words x_t ... x_u, length, previous words,
case information, ..., distance to known name);
maximize the training objective over these weights
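A small sketch of segment-level features (illustrative names; the point is that a feature can see the whole candidate segment at once, unlike token-level tagging):

def segment_features(tokens, start, end, label, known_names):
    # Features of segment S = (start, end, label) over tokens[start:end].
    words = tokens[start:end]
    phrase = ' '.join(words)
    return {
        f'{label}:length={end - start}': 1.0,        # whole-segment length
        f'{label}:first={words[0]}': 1.0,
        f'{label}:prev={tokens[start - 1] if start else "<s>"}': 1.0,
        f'{label}:all-caps': float(all(w[:1].isupper() for w in words)),
        f'{label}:in-dictionary': float(phrase in known_names),
    }

tokens = 'Fred please stop by my office this afternoon'.split()
print(segment_features(tokens, 0, 1, 'Person', {'Fred'}))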
59
Segments vs tagging
Tagging (one label per token):
pos:  1      2       3      4      5    6       7      8
x:    Fred   please  stop   by     my   office  this   afternoon
y:    Person other   other  other  Loc  Loc     other  Time
features: f(x_t, y_t) per position t

Segments (one label per segment):
(t1=1, u1=1)  Fred             Person
(t2=2, u2=4)  please stop by   other
(t3=5, u3=6)  my office        Loc
(t4=7, u4=7)  this             other
(t5=8, u5=8)  afternoon        Time
features: f(x_j, y_j) per segment j
60
A training algorithm for CSMMs (1)
Review: Collins' perceptron training algorithm
[Figure: compare the correct tags with the Viterbi
tags under the current weights; see the sketch below]
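The core update, as a minimal Python sketch (the plain structured perceptron, without the voting/averaging step; decode and feats are assumed supplied):

from collections import defaultdict

def perceptron_train(examples, decode, feats, epochs=5):
    # examples: list of (x, correct_tags); decode(x, w) returns Viterbi
    # tags under weights w; feats(x, y) returns feature counts for (x, y).
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in examples:
            y_hat = decode(x, w)                 # Viterbi tags
            if y_hat != y:
                for k, v in feats(x, y).items():
                    w[k] += v                    # toward the correct tags
                for k, v in feats(x, y_hat).items():
                    w[k] -= v                    # away from the predicted tags
    return w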
61
A training algorithm for CSMMs (2)
Variant of Collins' perceptron training algorithm:
a voted-perceptron learner for T_TRANS
[Decoding is Viterbi-like]
62
A training algorithm for CSMMs (3)
Variant of Collins' perceptron training algorithm:
a voted-perceptron learner for T_SEGTRANS
[Decoding is Viterbi-like]
63
Results
64
[Figure-only slide]
65
Results: varying history
66
Results: changing the dictionary
67
Results vs CRF
68
Results vs CRF
69
Broader Issues in IE
70
Broader View
Up to now we have been focused on segmentation
and classification
[Pipeline diagram: Create ontology -> Spider ->
Filter by relevance -> IE (Segment, Classify,
Associate, Cluster) -> Load DB -> Database ->
Query, Search / Data mine; with a Document
collection feeding the spider, and Label training
data -> Train extraction models feeding IE]
71
Broader View
Now touch on some other issues
[Same pipeline diagram, with the issues numbered:
(1) Associate and (2) Cluster inside the IE box
(which also includes Tokenize), (3) Create
ontology, (4) Train extraction models, (5) Data
mine]
72
(1) Association as Binary Classification
Christos Faloutsos conferred with Ted Senator,
the KDD 2003 General Chair.
Person
Person
Role
Person-Role(Christos Faloutsos, KDD 2003 General
Chair) -> NO
Person-Role(Ted Senator, KDD 2003 General
Chair) -> YES
Do this with SVMs and tree kernels over parse
trees.
Zelenko et al, 2002
73
(1) Association with Finite State Machines
Ray & Craven, 2001
This enzyme, UBC6, localizes to the endoplasmic
reticulum, with the catalytic domain facing the
cytosol.
DET this / N enzyme / N ubc6 / V localizes /
PREP to / ART the / ADJ endoplasmic / N reticulum /
PREP with / ART the / ADJ catalytic / N domain /
V facing / ART the / N cytosol

Subcellular-localization(UBC6, endoplasmic
reticulum)
74
(1) Association with Graphical Models
Roth & Yih 2002
Capture arbitrary-distance dependencies among
predictions.
75
(1) Association with Graphical Models
Roth & Yih 2002
Also capture long-distance dependencies among
predictions.
[Diagram: a random variable over the class of the
relation between entity 1 and entity 2 (e.g.,
lives-in, is-boss-of, ...), and random variables
over the classes of entity 1 and entity 2 (e.g.,
person, location, ...); local language models
contribute evidence to entity and relation
classification; dependencies between the classes
of entities and relations!]
Inference with loopy belief propagation.
76
Broader View
Now touch on some other issues
[Same pipeline diagram as above]
When do two extracted strings refer to the same
object?
77
(2) Information Integration
Minton, Knoblock, et al 2001; Doan, Domingos &
Halevy 2001; Richardson & Domingos 2003
  • Goal might be to merge the results of two IE
    systems:

Record A: Title: Intro. to Comp. Sci. / Num: 101 /
Dept: Computer Science / Teacher: Dr. Klüdge /
TA: John Smith / Topic: Java Programming /
Start time: 9:10 AM

Record B: Name: Introduction to Computer Science /
Number: CS 101 / Teacher: M. A. Kludge /
Time: 9-11am

Record C: Name: Data Structures in Java /
Room: 5032 Wean Hall
78
(2) Other Information Integration Issues
  • Distance metrics for text: which work well?
    (see the sketch at the end of this slide)
  • Cohen, Ravikumar, Fienberg, 2003
  • Finessing integration by soft database operations
    based on similarity
  • Cohen, 2000
  • Integration of complex structured databases
    (capture dependencies among multiple merges)
  • Cohen, Kautz & McAllester, KDD 2000; Pasula,
    Marthi, Milch, Russell & Shpitser, NIPS 2002;
    McCallum and Wellner, KDD WS 2003
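As one concrete example of such a metric, a minimal TF-IDF cosine sketch (toy document frequencies; a real system estimates df from the whole corpus, and this is only one of the measures compared in that work):

import math
from collections import Counter

def tfidf_vector(text, df, n_docs):
    counts = Counter(text.lower().split())
    return {w: (1 + math.log(c)) * math.log(n_docs / (1 + df.get(w, 0)))
            for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

df = {'computer': 2, 'science': 2, 'to': 2}      # toy document frequencies
a = tfidf_vector('Intro. to Comp. Sci.', df, n_docs=10)
b = tfidf_vector('Introduction to Computer Science', df, n_docs=10)
print(cosine(a, b))  # higher = more likely to denote the same course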

79
IE Resources
  • Data
  • RISE, http://www.isi.edu/~muslea/RISE/index.html
  • Linguistic Data Consortium (LDC)
  • Penn Treebank, Named Entities, Relations, etc.
  • http://www.biostat.wisc.edu/~craven/ie
  • Papers, tutorials, lectures, code
  • http://www.cs.cmu.edu/~wcohen/10-707

80
References
  • Bikel et al 1997 Bikel, D. Miller, S.
    Schwartz, R. and Weischedel, R. Nymble a
    high-performance learning name-finder. In
    Proceedings of ANLP97, p194-201.
  • Califf Mooney 1999, Califf, M.E. Mooney, R.
    Relational Learning of Pattern-Match Rules for
    Information Extraction, in Proceedings of the
    Sixteenth National Conference on Artificial
    Intelligence (AAAI-99).
  • Cohen, Hurst, Jensen, 2002 Cohen, W. Hurst,
    M. Jensen, L. A flexible learning system for
    wrapping tables and lists in HTML documents.
    Proceedings of The Eleventh International World
    Wide Web Conference (WWW-2002)
  • Cohen, Kautz, McAllester 2000 Cohen, W Kautz,
    H. McAllester, D. Hardening soft information
    sources. Proceedings of the Sixth International
    Conference on Knowledge Discovery and Data Mining
    (KDD-2000).
  • Cohen, 1998 Cohen, W. Integration of
    Heterogeneous Databases Without Common Domains
    Using Queries Based on Textual Similarity, in
    Proceedings of ACM SIGMOD-98.
  • Cohen, 2000a Cohen, W. Data Integration using
    Similarity Joins and a Word-based Information
    Representation Language, ACM Transactions on
    Information Systems, 18(3).
  • Cohen, 2000b Cohen, W. Automatically Extracting
    Features for Concept Learning from the Web,
Machine Learning: Proceedings of the Seventeenth
    International Conference (ML-2000).
  • Collins Singer 1999 Collins, M. and Singer,
    Y. Unsupervised models for named entity
    classification. In Proceedings of the Joint
    SIGDAT Conference on Empirical Methods in Natural
    Language Processing and Very Large Corpora, 1999.
  • De Jong 1982 De Jong, G. An Overview of the
    FRUMP System. In Lehnert, W. Ringle, M. H.
    (eds), Strategies for Natural Language
Processing. Lawrence Erlbaum, 1982, 149-176.
  • Freitag 98 Freitag, D Information extraction
    from HTML application of a general machine
    learning approach, Proceedings of the Fifteenth
    National Conference on Artificial Intelligence
    (AAAI-98).
  • Freitag, 1999, Freitag, D. Machine Learning
    for Information Extraction in Informal Domains.
    Ph.D. dissertation, Carnegie Mellon University.
  • Freitag 2000, Freitag, D Machine Learning for
    Information Extraction in Informal Domains,
    Machine Learning 39(2/3) 99-101 (2000).
  • Freitag & Kushmerick, 2000 Freitag, D. &
    Kushmerick, N. Boosted Wrapper Induction.
    Proceedings of the Seventeenth National Conference
    on Artificial Intelligence (AAAI-2000)
  • Freitag McCallum 1999 Freitag, D. and
    McCallum, A. Information extraction using HMMs
and shrinkage. In Proceedings of the AAAI-99 Workshop
    on Machine Learning for Information Extraction.
    AAAI Technical Report WS-99-11.
  • Kushmerick, 2000 Kushmerick, N. Wrapper
    Induction: efficiency and expressiveness,
    Artificial Intelligence, 118: 15-68.
  • Lafferty, McCallum Pereira 2001 Lafferty,
    J. McCallum, A. and Pereira, F., Conditional
    Random Fields Probabilistic Models for
    Segmenting and Labeling Sequence Data, In
    Proceedings of ICML-2001.
  • Leek 1997 Leek, T. R. Information extraction
    using hidden Markov models. Master's thesis, UC
    San Diego.
  • McCallum, Freitag Pereira 2000 McCallum, A.
    Freitag, D. and Pereira. F., Maximum entropy
    Markov models for information extraction and
    segmentation, In Proceedings of ICML-2000
  • Miller et al 2000 Miller, S. Fox, H.
    Ramshaw, L. Weischedel, R. A Novel Use of
    Statistical Parsing to Extract Information from
    Text. Proceedings of the 1st Annual Meeting of
    the North American Chapter of the ACL (NAACL), p.
    226 - 233.

81
References
  • Muslea et al, 1999 Muslea, I. Minton, S.
    Knoblock, C. A. A Hierarchical Approach to
    Wrapper Induction. Proceedings of Autonomous
    Agents-99.
  • Muslea et al, 2000 Muslea, I. Minton, S.
    and Knoblock, C. Hierarchical wrapper induction
    for semistructured information sources. Journal
    of Autonomous Agents and Multi-Agent Systems.
  • Nahm Mooney, 2000 Nahm, Y. and Mooney, R. A
    mutually beneficial integration of data mining
    and information extraction. In Proceedings of the
    Seventeenth National Conference on Artificial
    Intelligence, pages 627--632, Austin, TX.
  • Punyakanok Roth 2001 Punyakanok, V. and
    Roth, D. The use of classifiers in sequential
    inference. Advances in Neural Information
    Processing Systems 13.
  • Ratnaparkhi 1996 Ratnaparkhi, A., A maximum
    entropy part-of-speech tagger, in Proc.
    Empirical Methods in Natural Language Processing
    Conference, p133-141.
  • Ray & Craven 2001 Ray, S. and Craven, M.
    Representing Sentence Structure in Hidden Markov
    Models for Information Extraction. Proceedings
    of the 17th International Joint Conference on
    Artificial Intelligence, Seattle, WA. Morgan
    Kaufmann.
  • Soderland 1997 Soderland, S. Learning to
    Extract Text-Based Information from the World
    Wide Web. Proceedings of the Third International
    Conference on Knowledge Discovery and Data Mining
    (KDD-97).
  • Soderland 1999 Soderland, S. Learning
    information extraction rules for semi-structured
    and free text. Machine Learning,
    34(1/3)233-277.