Title: Towards Web-Scale Information Extraction
1Towards Web-Scale Information Extraction
- Eugene Agichtein
- Mathematics and Computer Science
- Emory University
- eugene_at_mathcs.emory.edu
- http://www.mathcs.emory.edu/eugene/
2The Value of Text Data
- Unstructured text data is the primary form of human-generated information
- Blogs, web pages, news, scientific literature, online reviews
- Semi-structured data (database generated): see Prof. Bing Liu's KDD webinar, http://www.cs.uic.edu/liub/WCM-Refs.html
- The techniques discussed here are complementary to structured object extraction methods
- Need to extract structured information to effectively manage, search, and mine the data
- Information Extraction: mature, but active research area
- Intersection of Computational Linguistics, Machine Learning, Data Mining, Databases, and Information Retrieval
- Traditional focus on accuracy of extraction
3Example Answering Queries Over Text
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying
Query: SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'
PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
Result: Bill Gates, Bill Veghte
(from William Cohen's IE tutorial, 2003)
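Once the tuples have been extracted, the query on this slide is ordinary SQL. The following is a minimal sketch, assuming the extracted tuples are loaded into an in-memory SQLite table; the lowercase table and column names and the helper `names_in_organization` are illustrative choices, and Richard Stallman's organization is spelled out from the passage text.

```python
import sqlite3

# Tuples as shown in the PEOPLE table; the full organization name for
# Richard Stallman is taken from the passage, not from the truncated cell.
PEOPLE = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Software Foundation"),
]

def names_in_organization(organization):
    """Return the names of people in the given organization, in insertion order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people (name TEXT, title TEXT, organization TEXT)")
        conn.executemany("INSERT INTO people VALUES (?, ?, ?)", PEOPLE)
        rows = conn.execute(
            "SELECT name FROM people WHERE organization = ? ORDER BY rowid",
            (organization,),
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

print(names_in_organization("Microsoft"))
```

The hard part, of course, is producing the PEOPLE table from text in the first place; the query itself is trivial once extraction has been done.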
4Outline
- Information Extraction Tasks
- Entity tagging
- Relation extraction
- Event extraction
- Scaling up Information Extraction
- Focus on scaling up to large collections (where data mining can be most beneficial)
- Other dimensions of scalability
5Information Extraction Tasks
- Extracting entities and relations: this talk
- Entities: named (e.g., Person) and generic (e.g., disease name)
- Relations: entities related in a predefined way (e.g., Location of a Disease outbreak, or a CEO of a Company)
- Events: can be composed from multiple relation tuples
- Common extraction subtasks
- Preprocess: sentence chunking, syntactic parsing, morphological analysis
- Create rules or extraction patterns: hand-coded, machine learning, and hybrid
- Apply extraction patterns or rules to extract new information
- Postprocess and integrate information
- Co-reference resolution, deduplication, disambiguation
6Entity Tagging
- Identifying mentions of entities (e.g., person names, locations, companies) in text
- MUC (1997): Person, Location, Organization, Date/Time/Currency
- ACE (2005): more than 100 more specific types
- Hand-coded vs. Machine Learning approaches
- Best approach depends on entity type and domain
- Closed class (e.g., geographical locations, disease names, gene and protein names): hand-coded dictionaries
- Syntactic (e.g., phone numbers, zip codes): regular expressions
- Semantic (e.g., person and company names): mixture of context, syntactic features, dictionaries, heuristics, etc.
- Almost solved for common/typical entity types
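The two simplest cases above (closed-class dictionaries and syntactic regular expressions) can be sketched in a few lines. This is an illustration only: the disease dictionary, the US-style phone and zip patterns, and the function name `tag_entities` are assumptions, and real formats vary far more.

```python
import re

# Hypothetical, tiny dictionary for a closed-class entity type.
DISEASES = {"malaria", "ebola", "cholera"}

# Deliberately simple US-style patterns, for illustration only.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "ZIP": re.compile(r"\b\d{5}\b"),
}

def tag_entities(text):
    """Return (start, end, type, surface) tuples sorted by position."""
    mentions = []
    # Syntactic entity types: regular expressions.
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            mentions.append((m.start(), m.end(), label, m.group()))
    # Closed-class entity types: dictionary lookup over alphabetic tokens.
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.group().lower() in DISEASES:
            mentions.append((m.start(), m.end(), "DISEASE", m.group()))
    return sorted(mentions)

example = "Ebola cases were reported; call 404-555-0100 or write to San Diego CA 92122."
for mention in tag_entities(example):
    print(mention)
```

Semantic types such as person and company names are not covered here; as the slide notes, they need context and learned features rather than a lookup.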
7Example Extracting Entities from Text
- Useful for data warehousing, data cleaning, web data integration
Address: 4089 Whispering Pines Nobel Drive San Diego CA 92122
Citation: Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002
8Hand-Coded Methods
- Easy to construct in some cases
- e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
- Intuitive to debug and maintain
- Especially if written in a high-level language
- Can incorporate domain knowledge
- Scalability issues
- Labor-intensive to create
- Highly domain-specific
- Often corpus-specific
- Rule-matches can be expensive (IBM Avatar)
9Machine Learning Methods
- Can work well when lots of training data is easy to construct
- Can capture complex patterns that are hard to encode with hand-crafted rules
- e.g., determine whether a review is positive or negative
- extract long complex gene names
- Non-local dependencies
10Representation Models (Cohen and McCallum, 2003)
[Figure: six representation models, each illustrated on the sentence "Abraham Lincoln was born in Kentucky."]
- Lexicons: is the candidate a member of a list (Alabama, Alaska, ..., Wisconsin, Wyoming)?
- Classify Pre-segmented Candidates: classifier decides which class
- Sliding Window: classifier decides which class; try alternate window sizes
- Boundary Models: classifier marks BEGIN and END positions
- Finite State Machines: most likely state sequence?
- Context Free Grammars: most likely parse?
- ... and beyond
Any of these models can be used to capture words, formatting or both.
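The sliding-window model can be illustrated with a short sketch: enumerate every window of up to a few tokens and ask a classifier which class it belongs to. Here a lexicon lookup stands in for a trained classifier; the `US_STATES` set, `classify`, and the window limit are assumptions made for the example.

```python
def sliding_windows(tokens, max_size):
    """Yield (start, end, words) for every window of 1..max_size tokens."""
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            yield start, start + size, tokens[start:start + size]

# Hypothetical stand-in for a trained classifier: lexicon membership.
US_STATES = {"alabama", "alaska", "kentucky", "wisconsin", "wyoming"}

def classify(words):
    """Return a class label for the window, or None if it is not an entity."""
    return "LOCATION" if " ".join(words).lower() in US_STATES else None

tokens = "Abraham Lincoln was born in Kentucky".split()
found = [
    (start, end, " ".join(words), classify(words))
    for start, end, words in sliding_windows(tokens, max_size=2)
    if classify(words)
]
print(found)
```

The number of windows grows with both sentence length and the maximum window size, which is one reason the classifier applied to each window needs to be cheap.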
11Popular Machine Learning Methods
For details: Feldman, 2006 and Cohen, 2004
- Naive Bayes
- SRV (Freitag 1998), Inductive Logic Programming
- Rapier (Califf and Mooney 1997)
- Hidden Markov Models (Leek 1997)
- Maximum Entropy Markov Models (McCallum et al. 2000)
- Conditional Random Fields (Lafferty et al. 2001)
- Scalability
- Can be labor intensive to construct training data
- At run time, complex features can be expensive to construct or process (batch algorithms can help: Chandel et al. 2006)
12Some Available Entity Taggers
- ABNER
- http://www.cs.wisc.edu/bsettles/abner/
- Linear-chain conditional random fields (CRFs) with orthographic and contextual features.
- Alias-I LingPipe
- http://www.alias-i.com/lingpipe/
- MALLET
- http://mallet.cs.umass.edu/index.php/Main_Page
- Collection of NLP and ML tools, can be trained for named entity tagging
- MinorThird
- http://minorthird.sourceforge.net/
- Tools for learning to extract entities, categorization, and some visualization
- Stanford Named Entity Recognizer
- http://nlp.stanford.edu/software/CRF-NER.shtml
- CRF-based entity tagger with non-local features
13Alias-I LingPipe (http://www.alias-i.com/lingpipe/)
- Statistical named entity tagger
- Generative statistical model
- Find most likely tags given lexical and linguistic features
- Accuracy at (or near) state of the art on benchmark tasks
- Explicitly targets scalability
- 100K tokens/second runtime on single PC
- Pipelined extraction of entities
- User-defined mentions, pronouns and stop list
- Specified in a dictionary, left-to-right, longest match
- Can be trained/bootstrapped on annotated corpora
14Outline
- Overview of Information Extraction
- Entity tagging
- Relation extraction
- Event extraction
- Scaling up Information Extraction
- Focus on scaling up to large collections (where data mining and ML techniques shine)
- Other dimensions of scalability
15Relation Extraction Examples
- Extract tuples of entities that are related in a predefined way
[Figure: relation extraction producing a Disease Outbreaks relation]
"We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex."
From AliBaba
16Relation Extraction Approaches
- Knowledge engineering
- Experts develop rules, patterns
- Can be defined over lexical items: <company> located in <location>
- Over syntactic structures: ((Obj <company>) (Verb located) () (Subj <location>))
- Sophisticated development/debugging environments
- Proteus, GATE
- Machine learning
- Supervised: train system over manually labeled data
- Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al. 2005, Cardie et al. 2006, Mooney et al. 2005
- Partially-supervised: train system by bootstrapping from seed examples
- Agichtein and Gravano 2000, Etzioni et al. 2004, Yangarber and Grishman 2001
- Open (no seeds): Sekine et al. 2006, Cafarella et al. 2007, Banko et al. 2007
- Hybrid or interactive systems
- Experts interact with machine learning algorithms (e.g., active learning family) to iteratively refine/extend rules and patterns
- Interactions can involve annotating examples, modifying rules, or any combination
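A hand-written lexical pattern of the "company located in location" kind can be sketched as a regular expression over entity-tagged text. The inline tag markup, the optional "is", and the names `PATTERN` and `extract_located_in` are assumptions for this illustration; systems such as Proteus or GATE use their own rule languages.

```python
import re

# Assumes an entity tagger has already marked mentions inline, e.g.
# <company>Microsoft</company>. This markup is hypothetical.
PATTERN = re.compile(
    r"<company>(?P<company>[^<]+)</company>\s+(?:is\s+)?located\s+in\s+"
    r"<location>(?P<location>[^<]+)</location>"
)

def extract_located_in(tagged_text):
    """Return (company, location) tuples matched by the lexical pattern."""
    return [(m.group("company"), m.group("location"))
            for m in PATTERN.finditer(tagged_text)]

text = (
    "<company>Microsoft</company> is located in <location>Redmond</location>. "
    "The offices of <company>Acme Corp</company> located in "
    "<location>Springfield</location> were closed."
)
print(extract_located_in(text))
```

A single pattern like this is precise but brittle: "headquartered in" or "based in" would each need another rule, which is the labor cost the bootstrapping and open approaches try to reduce.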
17Open Information Extraction (Banko et al., IJCAI 2007)
- Self-Supervised Learner
- All triples in a sample corpus (e1, r, e2) are considered potential tuples for relation r
- Positive examples: candidate triplets generated by a dependency parser
- Train classifier on lexical features for positive and negative examples
- Single-Pass Extractor
- Classify all pairs of candidate entities for some (undetermined) relation
- Heuristically generate a relation name from the words between entities
- Redundancy-Based Assessor
- Estimate probability that entities are related from co-occurrence statistics
- Scalability
- Extraction/Indexing
- No tuning or domain knowledge during extraction, relation inclusion determined at query time
- 0.04 CPU seconds per sentence, 9M web page corpus in 68 CPU hours
- Every document retrieved, processed (parsed, indexed, classified) in a single pass
- Query-time
- Distributed index for tuples by hashing on the relation name text
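The query-time idea (partitioning tuples across machines by hashing the relation name) can be sketched as follows. This is a single-process illustration, not the system's actual implementation: the shard count, the use of MD5, and the function names are assumptions.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of index machines

def shard_for(relation_name, num_shards=NUM_SHARDS):
    """Map a relation name to a shard id that is stable across processes."""
    digest = hashlib.md5(relation_name.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def build_index(triples, num_shards=NUM_SHARDS):
    """Distribute (e1, relation, e2) triples over shards keyed by relation name."""
    shards = [dict() for _ in range(num_shards)]
    for e1, relation, e2 in triples:
        shard = shards[shard_for(relation, num_shards)]
        shard.setdefault(relation.lower(), []).append((e1, e2))
    return shards

def lookup(shards, relation):
    """Answer a relation query by contacting only the shard that owns it."""
    return shards[shard_for(relation, len(shards))].get(relation.lower(), [])

triples = [
    ("Bill Gates", "is CEO of", "Microsoft"),
    ("Bill Veghte", "is VP of", "Microsoft"),
    ("Richard Stallman", "founded", "Free Software Foundation"),
]
shards = build_index(triples)
print(lookup(shards, "founded"))
```

A cryptographic digest is used instead of Python's built-in `hash` because the latter is randomized per process for strings, which would send the same relation name to different shards on different machines.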
18Event Extraction
- Similar to Relation Extraction, but
- Events can be nested
- Significantly more complex (e.g., more slots) than relations/template elements
- Often requires coreference resolution, disambiguation, deduplication, and inference
- Example: an integrated disease outbreak event (Huttunen et al. 2002)
19Event Extraction Integration Challenges
- Information spans multiple documents
- Missing or incorrect values
- Combining simple tuples into complex events
- No single key to order or cluster likely duplicates while separating them from similar but different entities.
- Ambiguity: distinct physical entities with same name (e.g., Kennedy)
- Duplicate entities, relation tuples extracted
- Large lists with multiple noisy mentions of the same entity/tuple
- Need to depend on fuzzy and expensive string similarity functions
- Cannot afford to compare each mention with every other.
- See Part II of KDD 2006 Tutorial "Scalable Information Extraction and Integration" -- scaling up integration: http://www.scalability-tutorial.net/
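One common way to avoid comparing every mention with every other is blocking: group mentions by a cheap key and run the expensive similarity function only within each group. The sketch below is a generic illustration, not the method from the tutorial cited above; the blocking key, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the fuzzy similarity function are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def blocking_key(mention):
    """Cheap key (an assumption): first three letters of the last word."""
    return mention.lower().split()[-1][:3]

def likely_duplicates(mentions, threshold=0.8):
    """Return pairs of mentions that look like the same entity."""
    blocks = defaultdict(list)
    for mention in mentions:
        blocks[blocking_key(mention)].append(mention)
    pairs = []
    for block in blocks.values():
        # The expensive fuzzy comparison runs only within a block.
        for a, b in combinations(block, 2):
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs

mentions = [
    "Bill Gates", "Bil Gates", "Bill Veghte",
    "Richard Stallman", "Richard M. Stallman",
]
print(likely_duplicates(mentions))
```

With five mentions, all-pairs comparison needs ten similarity calls; the blocks here need two. The trade-off is that a poor key can separate true duplicates into different blocks, so they are never compared.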
20Other Information Extraction Tutorials
- See these tutorials for more details
- R. Feldman, Information Extraction: Theory and Practice, ICML 2006, http://www.cs.biu.ac.il/feldman/icml_tutorial.html
- W. Cohen, A. McCallum, Information Extraction and Integration: an Overview, KDD 2003, http://www.cs.cmu.edu/wcohen/ie-survey.ppt
21Summary Accuracy of Extraction Tasks
(Feldman, ICML 2006 tutorial)
- Errors cascade (errors in entity tagging cause errors in relation extraction)
- This estimate is optimistic
- Primarily for well-established (tuned) tasks
- Many specialized or novel IE tasks (e.g., bio- and medical domains) exhibit lower accuracy
- Accuracy for all tasks is significantly lower for non-English
22Multilingual Information Extraction
- Active research area, beyond the scope of this talk. Nevertheless, a few (incomplete) pointers are provided.
- Closely tied to machine translation and cross-language information retrieval efforts.
- Language-independent named entity tagging and related tasks at CoNLL
- 2006: multi-lingual dependency parsing (http://nextens.uvt.nl/conll/)
- 2002, 2003 shared tasks: language independent Named Entity Tagging (http://www.cnts.ua.ac.be/conll2003/ner/)
- Global Autonomous Language Exploitation program (GALE)
- http://www.darpa.mil/ipto/Programs/gale/concept.htm
- Interlingual Annotation of Multilingual Text Corpora (IAMTC)
- Tools and data for building MT and IE systems for six languages
- http://aitc.aitcnet.org/nsf/iamtc/index.html
- REFLEX project: NER for 50 languages
- Exploit temporal correlations in weekly aligned corpora for training
- http://l2r.cs.uiuc.edu/cogcomp/wpt.php?pr_keyREFLEX
23Outline
- Overview of Information Extraction
- Entity tagging
- Relation extraction
- Event Extraction
- Scaling up Information Extraction
- Focus on scaling up to large collections (where data mining and ML techniques shine)
- Other dimensions of scalability
24Scaling Information Extraction to the Web
- Dimensions of Scalability
- Corpus size
- Applying rules/patterns is expensive
- Need efficient ways to select/filter relevant documents
- Document accessibility
- Deep web: documents only accessible via a search interface
- Dynamic sources: documents disappear from top page
- Source heterogeneity
- Coding/learning patterns for each source is expensive
- Requires many rules (expensive to apply)
- Domain diversity
- Extracting information for any domain, entities, relationships
- Some recent progress (e.g., see slide 17)
- Not the focus of this talk
25Scaling Up Information Extraction
- Scan-based extraction
- Classification/filtering to avoid processing documents
- Sharing common tags/annotations
- General keyword index-based techniques
- QXtract, KnowItAll
- Specialized indexes
- BE/KnowItNow, Linguist's Search Engine
- Parallelization/distributed processing
- IBM WebFountain, UIMA, Google's Map/Reduce
26Efficient Scanning for Information Extraction
[Figure: documents flow from the Text Database through a filter to the Extraction System]
- Retrieve docs from database
- 80/20 rule: use few simple rules to capture majority of the instances (Pantel et al. 2004)
- Train a classifier to discard irrelevant documents without processing (Grishman et al. 2002)
- (e.g., the Sports section of NYT is unlikely to describe disease outbreaks)
- Share base annotations (entity tags) for multiple extraction tasks
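The filtering idea can be sketched as a cheap test that runs before the expensive extractor. Here a keyword check stands in for a trained document classifier and a regular expression stands in for the extraction system; both, along with the sample documents, are hypothetical.

```python
import re

OUTBREAK_KEYWORDS = {"outbreak", "epidemic"}        # hypothetical filter vocabulary
PATTERN = re.compile(r"(\w+) outbreak in (\w+)")   # toy stand-in for a real extractor

def is_promising(text):
    """Cheap filter: keep a document only if it mentions an outbreak keyword."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & OUTBREAK_KEYWORDS)

def extract(text):
    """Stand-in for the expensive extraction system."""
    return PATTERN.findall(text)

def scan_with_filter(documents):
    """Return extracted tuples and the number of documents fully processed."""
    tuples, processed = [], 0
    for text in documents:
        if not is_promising(text):
            continue  # discarded without running the extractor
        processed += 1
        tuples.extend(extract(text))
    return tuples, processed

documents = [
    "Sports: the home team won again last night.",
    "An Ebola outbreak in Zaire was reported.",
    "Health officials confirmed a Cholera outbreak in Sudan.",
]
print(scan_with_filter(documents))
```

The saving depends on the filter being much cheaper than extraction and on its recall: every useful document the filter discards is a tuple lost.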
27Exploiting Keyword and Phrase Indexes
- Generate queries to retrieve only relevant documents
- Data mining problem!
- Some methods in literature
- Traversing Query Graphs (Agichtein et al. 2003)
- Iteratively refine queries (Agichtein and Gravano 2003)
- Iteratively partition document space (Etzioni et al. 2004)
- Case studies: QXtract, KnowItAll
28Simple Strategy Iterative Set Expansion
[Figure: loop between Query Generation, the Text Database, and the Extraction System]
- Query database with seed tuples: seed tuple (e.g., <Malaria, Ethiopia>), query (e.g., Ebola AND Zaire)
- Process retrieved documents
- Augment seed tuples with new tuples
- Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
- R: time for retrieving a document
- P: time for processing a document
- Q: time for answering a query
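The loop on this slide and its cost model can be sketched together. This is a toy illustration: the three-document database, the AND-style `search`, the regex `extract`, and the time values passed to `execution_time` are all made up; only the structure of the loop and the cost formula (retrieved documents times R plus P, plus queries times Q) follow the slide.

```python
import re

# Hypothetical text database.
DOCS = {
    1: "Ebola outbreak in Zaire; Malaria outbreak in Ethiopia.",
    2: "Malaria outbreak in Ethiopia and Cholera outbreak in Sudan.",
    3: "H5N1 outbreak in Vietnam.",
}

def search(tup):
    """Keyword AND query: documents that contain every field of the tuple."""
    return [(d, t) for d, t in sorted(DOCS.items()) if all(term in t for term in tup)]

def extract(text):
    """Toy stand-in for the extraction system."""
    return re.findall(r"(\w+) outbreak in (\w+)", text)

def iterative_set_expansion(seed_tuples, max_queries=100):
    """Query with known tuples, extract from new documents, repeat."""
    known, pending = set(seed_tuples), list(seed_tuples)
    seen_docs, num_queries = set(), 0
    while pending and num_queries < max_queries:
        tup = pending.pop(0)
        num_queries += 1
        for doc_id, text in search(tup):
            if doc_id in seen_docs:
                continue
            seen_docs.add(doc_id)
            for new in extract(text):
                if new not in known:
                    known.add(new)
                    pending.append(new)
    return known, len(seen_docs), num_queries

def execution_time(num_docs, num_queries, R, P, Q):
    """Cost model: |Retrieved Docs| * (R + P) + |Queries| * Q."""
    return num_docs * (R + P) + num_queries * Q

known, num_docs, num_queries = iterative_set_expansion([("Ebola", "Zaire")])
print(sorted(known), num_docs, num_queries)
print(execution_time(num_docs, num_queries, R=1, P=10, Q=2))
```

Note that the H5N1 tuple in document 3 is never found from this seed: no query built from a known tuple retrieves that document.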
29Reachability via Querying
(Agichtein et al. 2003b)
[Figure: reachability graph linking tuples t1-t5 to documents d1-d5]
- t1: <SARS, China>
- t2: <Ebola, Zaire>
- t3: <Malaria, Ethiopia>
- t4: <Cholera, Sudan>
- t5: <H5N1, Vietnam>
- t1 retrieves document d1 that contains t2
- Upper recall limit: determined by the size of the biggest connected component
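A reachability graph of this kind can be built and explored with a breadth-first search. In the sketch below only the link "t1 retrieves d1, which contains t2" comes from the slide; every other edge in `retrieves` and `contains` is invented for illustration, as are the function names.

```python
from collections import deque

def reachability_graph(retrieves, contains):
    """Edge t -> t2 when a query for t retrieves a document that contains t2."""
    graph = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            graph[t].update(x for x in contains.get(d, ()) if x != t)
    return graph

def reachable(graph, seeds):
    """All tuples that can be discovered by querying, starting from the seeds."""
    seen, queue = set(seeds), deque(seeds)
    while queue:
        t = queue.popleft()
        for nxt in graph.get(t, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical links, except t1 -> d1 -> t2, which is from the slide.
retrieves = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
contains = {
    "d1": ["t1", "t2"], "d2": ["t2", "t3"], "d3": ["t3"],
    "d4": ["t4", "t5"], "d5": ["t5"],
}

graph = reachability_graph(retrieves, contains)
found = reachable(graph, ["t1"])
print(sorted(found), len(found) / len(graph))
```

In this toy graph, starting from t1 reaches three of the five tuples, so recall is capped at 0.6 no matter how long the querying continues.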
30Reachability Graph for DiseaseOutbreaks
DiseaseOutbreaks, New York Times 1995
31Getting Around Reachability Limits
- KnowItAll (Etzioni et al., WWW 2004)
- Add keywords to partition documents into retrievable disjoint sets
- Submit queries with parts of extracted instances
- QXtract (Agichtein and Gravano 2003a)
- General queries with many matching documents
- Assumes many documents retrievable per query
32QXtract (Agichtein and Gravano 2003)
[Figure: pipeline from User-Provided Seed Tuples through Seed Sampling, Information Extraction, Classifier Training, and Query Generation to Queries]
- Get document sample with likely negative and likely positive examples.
- Label sample documents using information extraction system as oracle.
- Train classifiers to recognize useful documents.
- Generate queries from classifier model/rules.
33Using Generic Indexes Summary
- Order of magnitude speed up in runtime
- But keyword indexes are approximate (so the queries are not precise)
- Require many documents to retrieve
- Can we do better?
34Index Structures for Information Extraction
- Bindings Engine (Cafarella and Etzioni 2005)
- Indexing and querying entities (K. Chakrabarti et al. 2006)
- IBM Avatar project
- http://www.almaden.ibm.com/cs/projects/avatar/
- Other indexing schemes
- Linguist's search engine (P. Resnik): http://lse.umiacs.umd.edu:8080/
- FREE: Indexing regular expressions (Cho and Rajagopalan, ICDE 2002)
- Indexing and querying linguistic information in XML (Bird et al. 2006)
35Bindings Engine (BE) (Cafarella and Etzioni 2005)
- Variabilized search query language
- Integrates variable/type data with inverted index, minimizing query seeks
- Index <NounPhrase>, <Adj-Term> terms
- Key idea: neighbor index
- At each position in the index, store neighbor text: both lexemes and tags
- Query: "cities such as <NounPhrase>"
[Figure: neighbor index layout. Fields shown: number of docs, docid0, docid1, ...; number of positions, pos0, pos1, ...; per position a block (blk_offset) of neighbors: neighbor0/str0, neighbor1/str1, ...]
- Result: in document 19, "I love cities such as Philadelphia."
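The neighbor-index idea can be illustrated with a small in-memory sketch: each posting stores, next to the position, the typed text that follows it, so a query such as "cities such as NounPhrase" is answered from the posting lists of the query words alone. This is a simplification for illustration, not BE's on-disk layout; the data structures and function names are assumptions, and the noun-phrase annotation is supplied by hand.

```python
from collections import defaultdict

def build_neighbor_index(docs):
    """docs maps doc_id to a list of (word, right_neighbors) pairs, where
    right_neighbors maps a type name to the typed string starting at the
    next token (precomputed by a tagger or chunker)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, (word, right_neighbors) in enumerate(tokens):
            index[word.lower()][(doc_id, pos)] = right_neighbors
    return index

def query(index, words, var_type):
    """Answer 'w1 ... wn <var_type>' without reading a posting list for the type."""
    words = [w.lower() for w in words]
    results = []
    for doc_id, pos in sorted(index.get(words[0], {})):
        # The remaining query words must follow at consecutive positions.
        if all((doc_id, pos + i) in index.get(w, {}) for i, w in enumerate(words)):
            neighbors = index[words[-1]][(doc_id, pos + len(words) - 1)]
            if var_type in neighbors:
                results.append((doc_id, neighbors[var_type]))
    return results

docs = {
    19: [("I", {}), ("love", {}), ("cities", {}), ("such", {}),
         ("as", {"NounPhrase": "Philadelphia"}), ("Philadelphia", {})],
}
index = build_neighbor_index(docs)
print(query(index, ["cities", "such", "as"], "NounPhrase"))
```

The binding for the variable is read directly from the posting of the last query word, which is what removes the need to intersect with a (very long) posting list for every noun phrase in the corpus.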
36Related Approach (K. Chakrabarti et al. 2006)
- Support relationship keyword queries over indexed entities
- Top-K support for early processing termination
37Workload-Driven Indexing (S. Chakrabarti et al. 2006): Indexing Thousands of Entity Types
38Workload-Driven Indexing (continued)
39Parallelization/Adaptive Processing
- Parallelize processing
- WebFountain (Gruhl et al. 2004)
- UIMA architecture
- Map/Reduce
40IBM WebFountain
(Gruhl et al. 2004)
- Dedicated share-nothing 256-node cluster
- Blackboard annotation architecture
- Data pipelined and streamed past each augmenter to add annotations
- Merge and index annotations
- Index both tokens and annotations
- Between 25K-75K entities per second
41UIMA (IBM Research)
- Unstructured Information Management Architecture (UIMA)
- http://www.research.ibm.com/UIMA/
- Open component software architecture for development, composition, and deployment of text processing and analysis components.
- Run-time framework allows components and applications to be plugged in and run on different platforms. Supports distributed processing and failure recovery.
- Scales to millions of documents: incorporated into IBM OmniFind, grid computing-ready
- The UIMA SDK (freely available) includes a run-time framework, APIs, and tools for composing and deploying UIMA components.
- Framework source code also available on SourceForge
- http://uima-framework.sourceforge.net/
42Map/Reduce (Dean and Ghemawat, OSDI 2004)
43Map/Reduce (continued)
- General framework
- Scales to 1000s of machines
- Recently implemented in Nutch and other open source efforts
- Maps nicely to information extraction
- Map phase
- Parse individual documents
- Tag entities
- Propose candidate relation tuples
- Reduce phase
- Merge multiple mentions of same relation tuple
- Resolve co-references, duplicates
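The mapping of extraction onto the two phases can be sketched in a single process. This is not a distributed implementation and uses no Map/Reduce library: the regex extractor, the lowercase normalization used to merge mentions, and the function names are assumptions, and a dictionary stands in for the shuffle step between the phases.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"(\w+) outbreak in (\w+)")  # hypothetical toy extractor

def map_phase(doc_id, text):
    """Per document: propose candidate relation tuples, keyed by the tuple."""
    for disease, location in PATTERN.findall(text):
        yield (disease.lower(), location.lower()), doc_id

def reduce_phase(key, doc_ids):
    """Per tuple: merge all mentions into one record with its support."""
    unique = sorted(set(doc_ids))
    return {"tuple": key, "support": len(unique), "docs": unique}

def run(documents):
    groups = defaultdict(list)  # stands in for the shuffle between phases
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    return [reduce_phase(key, values) for key, values in sorted(groups.items())]

documents = {
    "d1": "Ebola outbreak in Zaire reported.",
    "d2": "An EBOLA outbreak in Zaire; Cholera outbreak in Sudan.",
}
for record in run(documents):
    print(record)
```

Because each document is mapped independently and each tuple is reduced independently, both phases parallelize without coordination; the expensive cross-document work (merging, deduplication) is confined to the reduce step.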
44Summary
- Presented a brief overview of information extraction
- Techniques and ideas to scale up information extraction
- Scan-based techniques (limited impact)
- Exploiting general indexes (limited accuracy)
- Building specialized index structures (most promising)
- Scalability is a data mining problem
- Querying graphs -> link discovery
- Workload mining for index optimization
- Automatically optimizing for specific text mining application
- Considerations for building integrated data mining systems
45References and Supplemental Materials
- References, slides, and additional information available at
- http://www.mathcs.emory.edu/eugene/kdd-webinar/
- Will also post answers to questions