Title: Lecture 16: Information Extraction
1. Lecture 16: Information Extraction
Oct. 26, 2007, ChengXiang Zhai
Most slides are from Eugene Agichtein's and William Cohen's tutorials
2. The Value of Text Data
- Unstructured text data is the primary form of human-generated information
  - Blogs, web pages, news, scientific literature, online reviews, ...
- Semi-structured data (database generated): see Prof. Bing Liu's KDD webinar, http://www.cs.uic.edu/~liub/WCM-Refs.html
  - The techniques discussed here are complementary to structured object extraction methods
- Need to extract structured information to effectively manage, search, and mine the data
- Information Extraction: mature, but active research area
  - Intersection of Computational Linguistics, Machine Learning, Data Mining, Databases, and Information Retrieval
  - Traditional focus on accuracy of extraction
  - Recently, attention paid to scalability
3. Example: Answering Queries Over Text
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying ...

Select Name From PEOPLE Where Organization = 'Microsoft'

PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

Query result: Bill Gates, Bill Veghte
(from William Cohen's IE tutorial, 2003)
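Once the text has been turned into rows, the query on this slide is an ordinary database lookup. The following is a minimal sketch using Python's built-in sqlite3 module; the rows are typed in by hand here (an assumption for illustration), whereas an IE system would produce them from the text. The helper name people_at is hypothetical.

```python
import sqlite3

# Rows mirror the PEOPLE table on the slide. In a real system they would be
# produced by an extractor rather than typed in by hand.
ROWS = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Software Foundation"),
]

def people_at(organization):
    """Return the names of people whose Organization equals `organization`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PEOPLE (Name TEXT, Title TEXT, Organization TEXT)")
    conn.executemany("INSERT INTO PEOPLE VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(
        "SELECT Name FROM PEOPLE WHERE Organization = ? ORDER BY Name",
        (organization,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]

print(people_at("Microsoft"))
```

The hard part of the task is filling the table, not querying it; the rest of the lecture is about that step.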
4. IE History: Pre-Web
- Mostly news articles
  - De Jong's FRUMP [1982]
    - Hand-built system to fill Schank-style scripts from news wire
  - Message Understanding Conference (MUC), DARPA ['87-'95], TIPSTER ['92-'96]
- Early work dominated by hand-built models
  - E.g. SRI's FASTUS, hand-built FSMs.
  - But by 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]
5. IE History: Web
- AAAI '94 Spring Symposium on Software Agents
  - Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
- Tom Mitchell's WebKB, '96
  - Build KBs from the Web.
- Wrapper Induction
  - Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], ...
- Citeseer, Cora, FlipDog, contEd courses, corpInfo, ...
- WebFountain (IBM)
- KnowItAll (University of Washington)
6. IE History: Other Domains
- Biology
  - Gene/protein entity extraction
  - Protein/protein interaction facts
  - Automated curation/integration of databases
  - At CMU: SLIF (Murphy et al, subcellular information from images and text in journal articles)
  - At UIUC: BeeSpace (http://www.beespace.uiuc.edu/)
- Email
  - EPCA, PAL, RADAR, CALO: intelligent office assistant that understands some part of email
  - At CMU: web site update requests, office-space requests, calendar scheduling requests, social network analysis of email.
7. Landscape of IE Tasks (1/4): Degree of Formatting
- Text paragraphs without formatting
- Grammatical sentences and some formatting and links
- Non-grammatical snippets, rich formatting and links
- Tables
Example of a text paragraph without formatting:
Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
8. Landscape of IE Tasks (2/4): Intended Breadth of Coverage
- Web site specific: formatting (e.g., Amazon.com Book Pages)
- Genre specific: layout (e.g., Resumes)
- Wide, non-specific: language (e.g., University Names)
9. Landscape of IE Tasks (3/4): Complexity
E.g. word patterns:
- Closed set: U.S. states
  - He was born in Alabama
  - The big Wyoming sky
- Regular set: U.S. phone numbers
  - Phone: (413) 545-1323
  - The CALD main office can be reached at 412-268-1299
- Complex pattern: U.S. postal addresses
  - University of Arkansas, P.O. Box 140, Hope, AR 71802
  - Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
- Ambiguous patterns, needing context and many sources of evidence: person names
  - ...was among the six houses sold by Hope Feldman that year.
  - Pawel Opalinski, Software Engineer at WhizBang Labs.
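To make the "regular set" case concrete, here is a small sketch of a hand-written regular expression that covers the two phone-number formats shown on this slide. The pattern and the helper name find_phones are illustrative assumptions; real phone numbers come in more formats than these two.

```python
import re

# One possible pattern for the two formats on the slide:
# "(413) 545-1323" and "412-268-1299". Real-world numbers vary more than this.
PHONE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.]\d{4}\b")

def find_phones(text):
    """Return every substring of `text` that matches the PHONE pattern."""
    return PHONE.findall(text)

print(find_phones("Phone (413) 545-1323"))
print(find_phones("The CALD main office can be reached at 412-268-1299"))
print(find_phones("Headquarters 1128 Main Street, 4th Floor Cincinnati, Ohio 45210"))
```

Person names have no comparable surface pattern ("Hope" is both a town and a first name in the examples above), which is why that category needs context and several sources of evidence.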
10. Landscape of IE Tasks (4/4): Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
- Single entity (named entity extraction)
  - Person: Jack Welch
  - Person: Jeffrey Immelt
  - Location: Connecticut
- Binary relationship
  - Relation: Person-Title; Person: Jack Welch; Title: CEO
  - Relation: Company-Location; Company: General Electric; Location: Connecticut
- N-ary record
  - Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
11. Landscape of IE Techniques (1/1): Models
Lexicons
Abraham Lincoln was born in Kentucky.
member? Alabama, Alaska, ..., Wisconsin, Wyoming
Any of these models can be used to capture words, formatting or both.
12. Hand-Coded Methods
- Easy to construct in some cases
  - e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
- Intuitive to debug and maintain
  - Especially if written in a high-level language
  - Can incorporate domain knowledge
- Scalability issues
  - Labor-intensive to create
  - Highly domain-specific
  - Often corpus-specific
  - Rule-matches can be expensive
IBM Avatar
13. Machine Learning Methods
- Can work well when lots of training data is easy to construct
- Can capture complex patterns that are hard to encode with hand-crafted rules
  - e.g., determine whether a review is positive or negative
  - extract long complex gene names
  - Non-local dependencies
14. Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
(CMU UseNet Seminar Announcement)
18. A "Naïve Bayes" Sliding Window Model [Freitag 1997]
... 00 pm Place: Wean Hall Rm 5409 Speaker: Sebastian Thrun ...
Window layout:
  prefix:   w_{t-m} ... w_{t-1}
  contents: w_t ... w_{t+n}
  suffix:   w_{t+n+1} ... w_{t+n+m}
- Estimate Pr(LOCATION | window) using Bayes rule
- Try all "reasonable" windows (vary length, position)
- Assume independence for length, prefix words, suffix words, content words
- Estimate from data quantities like Pr("Place" in prefix | LOCATION)
If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.
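The steps above can be sketched in a few lines. This is a simplified illustration, not Freitag's implementation: it assumes add-one smoothing, scores each window by its log-odds against a unigram background model (one way to apply Bayes rule up to a constant), and returns the best window instead of applying a threshold. All function names are hypothetical.

```python
import math
from collections import Counter

PARTS = ("prefix", "contents", "suffix")

def split_window(tokens, start, end, k):
    """Return the k-token prefix, the contents, and the k-token suffix of a window."""
    return {
        "prefix": tokens[max(0, start - k):start],
        "contents": tokens[start:end],
        "suffix": tokens[end:end + k],
    }

def train(examples, k=2, max_len=5):
    """examples: list of (tokens, start, end); the labeled field is tokens[start:end]."""
    model = {part: Counter() for part in PARTS}
    model["length"] = Counter()
    model["background"] = Counter()
    for tokens, start, end in examples:
        for part, words in split_window(tokens, start, end, k).items():
            model[part].update(words)
        model["length"][end - start] += 1
        model["background"].update(tokens)
    model["vocab"] = len(model["background"]) + 1  # +1 slot for unseen words
    model["k"], model["max_len"] = k, max_len
    return model

def smoothed(counter, key, num_outcomes):
    """Add-one smoothed probability estimate."""
    return (counter[key] + 1) / (sum(counter.values()) + num_outcomes)

def score(model, tokens, start, end):
    """Log-odds that tokens[start:end] is a field, assuming independent parts."""
    total = math.log(smoothed(model["length"], end - start, model["max_len"]))
    for part, words in split_window(tokens, start, end, model["k"]).items():
        for w in words:
            total += math.log(smoothed(model[part], w, model["vocab"]))
            total -= math.log(smoothed(model["background"], w, model["vocab"]))
    return total

def best_window(model, tokens):
    """Try every window of up to max_len tokens; return (score, start, end) of the best."""
    candidates = [
        (score(model, tokens, s, e), s, e)
        for s in range(len(tokens))
        for e in range(s + 1, min(len(tokens), s + model["max_len"]) + 1)
    ]
    return max(candidates)

doc = "Time : 3 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
model = train([(doc, 6, 10)])
_, s, e = best_window(model, doc)
print(doc[s:e])
```

Training and testing on the same announcement only checks that the pieces fit together; a real experiment would train on many labeled announcements and extract from unseen ones.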
19. BWI: Learning to detect boundaries [Freitag & Kushmerick, AAAI 2000]
- Another formulation: learn three probabilistic classifiers:
  - START(i) = Prob(position i starts a field)
  - END(j) = Prob(position j ends a field)
  - LEN(k) = Prob(an extracted field has length k)
- Then score a possible extraction (i,j) by
  - START(i) * END(j) * LEN(j-i)
- LEN(k) is estimated from a histogram
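The scoring rule above is easy to state in code. In this sketch the START and END probabilities are made-up numbers standing in for the outputs of the learned boundary classifiers, and the function names are hypothetical; only the product START(i) * END(j) * LEN(j-i) and the histogram estimate of LEN come from the slide.

```python
from collections import Counter

def length_histogram(field_lengths):
    """Estimate LEN(k) as the relative frequency of length k among training fields."""
    counts = Counter(field_lengths)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def best_extraction(start, end, length):
    """Score every span (i, j) by START(i) * END(j) * LEN(j - i); return the best."""
    scored = [
        (start[i] * end[j] * length.get(j - i, 0.0), i, j)
        for i in range(len(start))
        for j in range(i, len(end))
    ]
    return max(scored)

# Made-up boundary probabilities for a four-position document.
START = [0.1, 0.8, 0.1, 0.0]
END = [0.0, 0.1, 0.2, 0.7]
LEN = length_histogram([2, 2, 2, 1])  # training fields had these values of j - i

print(best_extraction(START, END, LEN))
```

Position 1 is the most likely start and position 3 the most likely end, and a length of 2 is common in the histogram, so the span (1, 3) wins.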
20. IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM (states: person name, location name, background),
find the most likely state sequence (Viterbi):
  Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Pedro Domingos
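A minimal Viterbi sketch for the example above. The three states follow the slide, but every probability in the tables is an invented number chosen for illustration (a trained HMM would estimate them from labeled text), and the helper names are hypothetical.

```python
import math

STATES = ["background", "person", "location"]
START_P = {"background": 0.8, "person": 0.1, "location": 0.1}
TRANS_P = {
    "background": {"background": 0.8, "person": 0.1, "location": 0.1},
    "person": {"background": 0.4, "person": 0.5, "location": 0.1},
    "location": {"background": 0.5, "person": 0.1, "location": 0.4},
}
# Invented emission tables; words missing from a table get a small default.
EMIT_P = {
    "background": {"Yesterday": 0.1, "spoke": 0.2, "this": 0.2,
                   "example": 0.2, "sentence": 0.2},
    "person": {"Pedro": 0.4, "Domingos": 0.4},
    "location": {"Seattle": 0.5},
}

def viterbi(tokens, states, start_p, trans_p, emit_p, unseen=0.01):
    """Return the most likely state sequence for `tokens` (log-space Viterbi)."""
    def emit(state, token):
        return math.log(emit_p[state].get(token, unseen))

    best = [{s: math.log(start_p[s]) + emit(s, tokens[0]) for s in states}]
    back = []
    for token in tokens[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this position.
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = best[-1][prev] + math.log(trans_p[prev][s]) + emit(s, token)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def extract(tokens, path, target):
    """Join maximal runs of tokens labeled with `target` into extracted strings."""
    spans, current = [], []
    for token, state in zip(tokens, path):
        if state == target:
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = "Yesterday Pedro Domingos spoke this example sentence".split()
path = viterbi(tokens, STATES, START_P, TRANS_P, EMIT_P)
print(extract(tokens, path, "person"))
```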
21. HMM for Segmentation
- Simplest Model: One state per entity type
22. HMM Example: "Nymble" [Bikel, et al 1998], [BBN "IdentiFinder"]
Task: Named Entity Extraction
States: start-of-sentence, end-of-sentence, Person, Org, (five other name classes), Other
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), back-off to P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), back-off to P(o_t | s_t), then P(o_t)
Train on 500k words of news wire text.
Results:
Case   Language  F1
Mixed  English   93
Upper  English   91
Mixed  Spanish   90
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]
23. Popular Machine Learning Methods
For details: [Feldman, 2006] and [Cohen, 2004]
- Naive Bayes
- SRV [Freitag 1998], Inductive Logic Programming
- Rapier [Califf and Mooney 1997]
- Hidden Markov Models [Leek 1997]
- Maximum Entropy Markov Models [McCallum et al. 2000]
- Conditional Random Fields [Lafferty et al. 2001]
- Scalability
  - Can be labor intensive to construct training data
  - At run time, complex features can be expensive to construct or process (batch algorithms can help [Chandel et al. 2006])
24. Some Available Entity Taggers
- ABNER
  - http://www.cs.wisc.edu/~bsettles/abner/
  - Linear-chain conditional random fields (CRFs) with orthographic and contextual features.
- Alias-I LingPipe
  - http://www.alias-i.com/lingpipe/
- MALLET
  - http://mallet.cs.umass.edu/index.php/Main_Page
  - Collection of NLP and ML tools, can be trained for named entity tagging
- MinorThird
  - http://minorthird.sourceforge.net/
  - Tools for learning to extract entities, categorization, and some visualization
- Stanford Named Entity Recognizer
  - http://nlp.stanford.edu/software/CRF-NER.shtml
  - CRF-based entity tagger with non-local features
25. Alias-I LingPipe (http://www.alias-i.com/lingpipe/)
- Statistical named entity tagger
  - Generative statistical model
  - Find most likely tags given lexical and linguistic features
  - Accuracy at (or near) state of the art on benchmark tasks
- Explicitly targets scalability
  - 100K tokens/second runtime on single PC
  - Pipelined extraction of entities
- User-defined mentions, pronouns and stop list
  - Specified in a dictionary, left-to-right, longest match
- Can be trained/bootstrapped on annotated corpora
26. Relation Extraction Examples
- Extract tuples of entities that are related in a predefined way
"Disease Outbreaks" relation:
Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire
Relation extraction example (from AliBaba):
"We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex."
27. Relation Extraction Approaches
- Knowledge engineering
  - Experts develop rules, patterns
    - Can be defined over lexical items: "<company> located in <location>"
    - Over syntactic structures: "((Obj <company>) (Verb located) () (Subj <location>))"
  - Sophisticated development/debugging environments
    - Proteus, GATE
- Machine learning
  - Supervised: train system over manually labeled data
    - Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al 2005, Cardie et al 2006, Mooney et al. 2005, ...
  - Partially-supervised: train system by bootstrapping from "seed" examples
    - Agichtein & Gravano 2000, Etzioni et al., 2004, Yangarber & Grishman 2001, ...
  - "Open" (no seeds): Sekine et al. 2006, Cafarella et al. 2007, Banko et al. 2007
- Hybrid or interactive systems
  - Experts interact with machine learning algorithms (e.g., active learning family) to iteratively refine/extend rules and patterns
  - Interactions can involve annotating examples, modifying rules, or any combination
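As an illustration of the knowledge-engineering approach, the lexical pattern "<company> located in <location>" can be sketched as a regular expression over two small lexicons. The lexicons stand in for an entity tagger and, like the function names, are assumptions made for this example; systems such as Proteus or GATE use their own rule languages.

```python
import re

# Toy lexicons standing in for an entity tagger (assumption for illustration).
COMPANIES = ["General Electric", "Microsoft"]
LOCATIONS = ["Connecticut", "Redmond"]

def alternation(names):
    """Build a regex alternation that matches any of the given names literally."""
    return "|".join(re.escape(n) for n in names)

# Lexical pattern: <company> [is|was] located in <location>
LOCATED_IN = re.compile(
    r"(?P<company>%s),? (?:is |was )?located in (?P<location>%s)"
    % (alternation(COMPANIES), alternation(LOCATIONS))
)

def company_locations(text):
    """Return (company, location) tuples for every match of the pattern."""
    return [(m.group("company"), m.group("location"))
            for m in LOCATED_IN.finditer(text)]

print(company_locations("General Electric is located in Connecticut."))
print(company_locations("Microsoft, located in Redmond, announced a new product."))
```

A rule like this is precise but brittle: it misses paraphrases such as "headquartered in", which is one motivation for the learning-based approaches listed above.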
28. Open Information Extraction [Banko et al., IJCAI 2007]
- Self-Supervised Learner
  - All triples in a sample corpus (e1, r, e2) are considered potential "tuples" for relation r
  - Positive examples: candidate triplets generated by a dependency parser
  - Train classifier on lexical features for positive and negative examples
- Single-Pass Extractor
  - Classify all pairs of candidate entities for some (undetermined) relation
  - Heuristically generate a relation name from the words between entities
- Redundancy-Based Assessor
  - Estimate probability that entities are related from co-occurrence statistics
- Scalability
  - Extraction/Indexing
    - No tuning or domain knowledge during extraction, relation inclusion determined at query time
    - 0.04 CPU seconds per sentence, 9M web page corpus in 68 CPU hours
    - Every document retrieved, processed (parsed, indexed, classified) in a single pass
  - Query-time
    - Distributed index for tuples by hashing on the relation name text
- Related efforts: [Cucerzan and Agichtein 2005], [Pasca et al. 2006], [Sekine et al. 2006], [Rozenfeld and Feldman 2006], ...
29. Event Extraction
- Similar to Relation Extraction, but:
  - Events can be nested
  - Significantly more complex (e.g., more slots) than relations/template elements
  - Often requires coreference resolution, disambiguation, deduplication, and inference
- Example: an integrated disease outbreak event [Huttunen et al. 2002]
30. Event Extraction: Integration Challenges
- Information spans multiple documents
  - Missing or incorrect values
  - Combining simple tuples into complex events
  - No single key to order or cluster likely duplicates while separating them from similar but different entities.
  - Ambiguity: distinct physical entities with same name (e.g., Kennedy)
- Duplicate entities, relation tuples extracted
  - Large lists with multiple noisy mentions of the same entity/tuple
  - Need to depend on fuzzy and expensive string similarity functions
  - Cannot afford to compare each mention with every other.
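One common way to avoid comparing every mention with every other is blocking: group mentions by a cheap key and run the expensive similarity function only within each group. The slides do not name a specific method, so the sketch below is an assumption for illustration; the blocking key (first three letters of the last token) and the 0.8 threshold are arbitrary choices, and difflib's SequenceMatcher stands in for the "fuzzy and expensive" similarity function.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def blocking_key(mention):
    """Cheap key: first three letters of the last token (an arbitrary choice)."""
    return mention.lower().split()[-1][:3]

def likely_duplicates(mentions, threshold=0.8):
    """Compare mentions only within blocks; return pairs whose similarity >= threshold."""
    blocks = defaultdict(list)
    for mention in mentions:
        blocks[blocking_key(mention)].append(mention)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs

mentions = ["Jeffrey Immelt", "Jeffery Immelt", "Jack Welch",
            "John F. Kennedy", "Ted Kennedy"]
print(likely_duplicates(mentions))
```

With these five mentions, all-pairs matching needs 10 comparisons while blocking needs 2. The two Kennedy mentions share a block but fall below the threshold, which reflects the ambiguity case above: same surname, different people. A poorly chosen key can also put true duplicates in different blocks, so the key trades recall for speed.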
31. Accuracy of Extraction Tasks [Feldman, ICML 2006 tutorial]
- Errors cascade (errors in entity tag cause errors in relation extraction)
- This estimate is optimistic:
  - Primarily for well-established (tuned) tasks
  - Many specialized or novel IE tasks (e.g. bio- and medical- domains) exhibit lower accuracy
  - Accuracy for all tasks is significantly lower for non-English
32. Multilingual Information Extraction
- Active research area, beyond the scope of this talk. Nevertheless, a few (incomplete) pointers are provided.
- Closely tied to machine translation and cross-language information retrieval efforts.
- Language-independent named entity tagging and related tasks at CoNLL
  - 2006: multi-lingual dependency parsing (http://nextens.uvt.nl/~conll/)
  - 2002, 2003 shared tasks: language independent Named Entity Tagging (http://www.cnts.ua.ac.be/conll2003/ner/)
- Global Autonomous Language Exploitation program (GALE)
  - http://www.darpa.mil/ipto/Programs/gale/concept.htm
- Interlingual Annotation of Multilingual Text Corpora (IAMTC)
  - Tools and data for building MT and IE systems for six languages
  - http://aitc.aitcnet.org/nsf/iamtc/index.html
- REFLEX project: NER for 50 languages
  - Exploits temporal correlations in weekly aligned corpora for training
  - http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX
33. Scaling Information Extraction to the Web
- Dimensions of Scalability
  - Corpus size
    - Applying rules/patterns is expensive
    - Need efficient ways to select/filter relevant documents
  - Document accessibility
    - Deep web: documents only accessible via a search interface
    - Dynamic sources: documents disappear from top page
  - Source heterogeneity
    - Coding/learning patterns for each source is expensive
    - Requires many rules (expensive to apply)
  - Domain diversity
    - Extracting information for any domain, entities, relationships
34. Scaling Up Information Extraction
- Scan-based extraction
  - Classification/filtering to avoid processing documents
  - Sharing common tags/annotations
- General keyword index-based techniques
  - QXtract, KnowItAll
- Specialized indexes
  - BE/KnowItNow, Linguist's Search Engine
- Parallelization/distributed processing
  - IBM WebFountain, UIMA, Google's Map/Reduce
35. Efficient Scanning for Information Extraction
[Figure: Text Database -> retrieve docs from database (filtered) -> process documents (Extraction System) -> extract output tuples -> Output Tuples]
- 80/20 rule: use few simple rules to capture majority of the instances [Pantel et al. 2004]
- Train a classifier to discard irrelevant documents without processing [Grishman et al. 2002]
  - (e.g., the Sports section of NYT is unlikely to describe disease outbreaks)
- Share base annotations (entity tags) for multiple extraction tasks
36. Exploiting Keyword and Phrase Indexes
- Generate queries to retrieve only relevant documents
- Data mining problem!
- Some methods in literature
  - Traversing Query Graphs [Agichtein et al. 2003]
  - Iteratively refine queries [Agichtein and Gravano 2003]
  - Iteratively partition document space [Etzioni et al., 2004]
- Example systems: QXtract, KnowItAll
37. Index Structures for Information Extraction
- Bindings Engine [Cafarella and Etzioni 2005]
- Indexing and querying entities [K. Chakrabarti et al. 2006]
- IBM Avatar project
  - http://www.almaden.ibm.com/cs/projects/avatar/
- Other indexing schemes
  - Linguist's search engine (P. Resnik): http://lse.umiacs.umd.edu:8080/
  - FREE: Indexing regular expressions [Cho and Rajagopalan, ICDE 2002]
  - Indexing and querying linguistic information in XML [Bird et al., 2006]
38. Bindings Engine (BE) [Cafarella and Etzioni 2005]
- Variabilized search query language
- Integrates variable/type data with inverted index, minimizing query seeks
- Index <NounPhrase>, <Adj-Term> terms
- Key idea: neighbor index
  - At each position in the index, store "neighbor text": both lexemes and tags
- Query: "cities such as <NounPhrase>"
[Figure: neighbor index layout. A term dictionary (as, billy, cities, friendly, give, mayors, nickels, philadelphia, such, words) points to posting lists (#docs, docid_0 ... docid_{#docs-1}); each document entry holds #posns and pos_0 ... pos_{#pos-1}; in the neighbor index each position is stored with a neighbor block (blk_offset, #neighbors, neighbor_0 str_0, neighbor_1 str_1, ...).]
Result in document 19: "I love cities such as Philadelphia."
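The neighbor-index idea can be sketched with an in-memory inverted index in which every posting also stores the token to its right together with its tag. This is a toy illustration, not the BE data layout: it assumes the input is already tokenized and tagged, treats a noun phrase as a single token, and uses hypothetical function names.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: [(token, tag), ...]} -> term -> [(doc_id, pos, right_neighbor)]."""
    index = defaultdict(list)
    for doc_id, tagged in docs.items():
        for pos, (token, _tag) in enumerate(tagged):
            # Store the right-hand neighbor (token, tag) next to the position.
            right = tagged[pos + 1] if pos + 1 < len(tagged) else None
            index[token.lower()].append((doc_id, pos, right))
    return index

def query(index, phrase, wanted_tag):
    """Bind the variable in '<phrase> <wanted_tag>' using only the index."""
    words = phrase.lower().split()
    # Start positions at which the whole phrase occurs.
    starts = {(d, p) for d, p, _ in index[words[0]]}
    for offset, word in enumerate(words[1:], 1):
        starts &= {(d, p - offset) for d, p, _ in index[word]}
    last = len(words) - 1
    # The binding is read from the neighbor stored with the phrase's last word.
    return [
        (d, neighbor[0])
        for d, p, neighbor in index[words[-1]]
        if (d, p - last) in starts and neighbor and neighbor[1] == wanted_tag
    ]

docs = {
    19: [("I", "O"), ("love", "O"), ("cities", "O"), ("such", "O"),
         ("as", "O"), ("Philadelphia", "NounPhrase"), (".", "O")],
}
index = build_index(docs)
print(query(index, "cities such as", "NounPhrase"))
```

The point of the design is that the binding for <NounPhrase> comes straight out of the posting list for "as", so the original document never has to be fetched and re-parsed at query time.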
39. Parallelization/Adaptive Processing
- Parallelize processing
  - WebFountain [Gruhl et al. 2004]
  - UIMA architecture
  - Map/Reduce
40. IBM WebFountain [Gruhl et al. 2004]
- Dedicated shared-nothing 256-node cluster
- Blackboard annotation architecture
- Data pipelined and streamed past each augmenter to add annotations
- Merge and index annotations
- Index both tokens and annotations
- Between 25K and 75K entities per second
41. UIMA (IBM Research)
- Unstructured Information Management Architecture (UIMA)
  - http://www.research.ibm.com/UIMA/
- Open component software architecture for development, composition, and deployment of text processing and analysis components.
- Run-time framework allows components and applications to be plugged in and run on different platforms. Supports distributed processing, failure recovery, ...
- Scales to millions of documents: incorporated into IBM OmniFind, grid computing-ready
- The UIMA SDK (freely available) includes a run-time framework, APIs, and tools for composing and deploying UIMA components.
- Framework source code also available on Sourceforge
  - http://uima-framework.sourceforge.net/
42. Map/Reduce [Dean & Ghemawat, OSDI 2004]
43. Map/Reduce (continued)
- General framework
  - Scales to 1000s of machines
  - Recently implemented in Nutch and other open source efforts
- Maps nicely to information extraction
  - Map phase
    - Parse individual documents
    - Tag entities
    - Propose candidate relation tuples
  - Reduce phase
    - Merge multiple mentions of same relation tuple
    - Resolve co-references, duplicates
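The mapping to information extraction can be sketched in plain Python, with the shuffle step simulated by a dictionary. This is a single-process illustration under stated assumptions: the "was born in" regular expression stands in for parsing and entity tagging, the function names are hypothetical, and a real deployment would run the same two functions on a Map/Reduce framework.

```python
import re
from collections import defaultdict

# A capitalized-word sequence stands in for a tagged entity (a crude assumption).
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"
BORN_IN = re.compile(rf"({NAME}) was born in ({NAME})")

def map_phase(doc_id, text):
    """Propose candidate (person, place) tuples from one document."""
    for match in BORN_IN.finditer(text):
        yield (match.group(1), match.group(2)), doc_id

def reduce_phase(candidate, doc_ids):
    """Merge all mentions of the same tuple into one record with its sources."""
    return candidate, sorted(set(doc_ids))

def run(docs):
    """Simulate map -> shuffle (group by key) -> reduce in a single process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_phase(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

docs = {
    1: "Abraham Lincoln was born in Kentucky.",
    2: "Historians agree that Abraham Lincoln was born in Kentucky.",
    3: "Bill Gates was born in Seattle.",
}
print(run(docs))
```

Because each document is mapped independently, the map phase parallelizes trivially; merging and deduplication happen only in the reduce phase, where all mentions of a tuple arrive together.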
44. References
- Tutorials
  - Eugene Agichtein, Towards Web-Scale Information Extraction, KDD 2007. http://www.mathcs.emory.edu/~eugene/kdd-webinar/
  - R. Feldman, Information Extraction: Theory and Practice, ICML 2006. http://www.cs.biu.ac.il/~feldman/icml_tutorial.html
  - W. Cohen, A. McCallum, Information Extraction and Integration: an Overview, KDD 2003. http://www.cs.cmu.edu/~wcohen/ie-survey.ppt
45. What Should You Know
- Information extraction is key to converting unstructured data to structured data
- Basic tasks in information extraction (entities, relations, events)
- Basic ideas of some of the methods