Discovering and Utilizing Structure in Large Unstructured Text Datasets

Transcript and Presenter's Notes

1
Discovering and Utilizing Structure in Large
Unstructured Text Datasets
  • Eugene Agichtein, Math and Computer Science Department

2
Information Extraction Example
  • Information extraction systems represent text in
    structured form

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis...
Disease Outbreaks in The New York Times
Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire
Information Extraction System
3
How can information extraction help?
Structured Relation
  • allow precise and efficient querying
  • allow returning answers instead of documents
  • support powerful query constructs
  • allow data integration with (structured) RDBMS
  • provide input to data mining and statistics analysis

4
Goal: Detect, Monitor, Predict Outbreaks
Detection, Monitoring, Prediction
Data Integration, Data Mining, Trend Analysis
IE Sys 1 ... IE Sys 4, running over:
  • Historical news, breaking news stories, wire, alerts, ...
  • Current patient records: diagnosis, physicians' notes, lab results/analysis, ...
  • 911 calls: traffic accidents, ...
5
Challenges in Information Extraction
  • Portability
  • Reduce effort to tune for new domains and tasks
  • MUC systems: experts would take 8-12 weeks to tune
  • Scalability, Efficiency, Access
  • Enable information extraction over large collections
  • 1 sec/document × 5 billion docs ≈ 158 CPU years
  • Approach: learn from data ("bootstrapping")
  • Snowball: Partially Supervised Information Extraction
  • Querying Large Text Databases for Efficient Information Extraction

6
Outline
  • Information extraction overview
  • Partially supervised information extraction
  • Adaptivity
  • Confidence estimation
  • Text retrieval for scalable extraction
  • Query-based information extraction
  • Implicit connections/graphs in text databases
  • Current and future work
  • Inferring and analyzing social networks
  • Utility-based extraction tuning
  • Multi-modal information extraction and data
    mining
  • Authority/trust/confidence estimation

7
What is Information Extraction?
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access. Richard
Stallman, founder of the Free Software
Foundation, countered saying
NAME TITLE ORGANIZATION
8
What is Information Extraction?
As a task:
Filling slots in a database from sub-segments of text.
(same news passage as above)
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
9
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access. Richard
Stallman, founder of the Free Software
Foundation, countered saying
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
10
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
(Slides 10-12 repeat the example passage and extracted segments from slide 9, highlighting each of these steps in turn.)
13
IE in Context
A typical pipeline: create ontology; spider a document collection; filter by relevance; IE (segment, classify, associate, cluster); load DB; query, search, and data mine the database. Labeled training data is used to train the extraction models.
14
Information Extraction Tasks
  • Extracting entities and relations
  • Entities
  • Named (e.g., Person)
  • Generic (e.g., disease name)
  • Relations
  • Entities related in a predefined way (e.g.,
    Location of a Disease outbreak)
  • Discovered automatically
  • Common information extraction steps
  • Preprocessing: sentence chunking, parsing, morphological analysis
  • Rules/extraction patterns: manual, machine learning, and hybrid
  • Applying extraction patterns to extract new information
  • Postprocessing and complex extraction (not covered)
  • Co-reference resolution
  • Combining relations into events, rules, ...

15
Two kinds of IE approaches
  • Knowledge Engineering
  • rule based
  • developed by experienced language engineers
  • make use of human intuition
  • requires only small amount of training data
  • development could be very time consuming
  • some changes may be hard to accommodate
  • Machine Learning
  • use statistics or other machine learning
  • developers do not need language-engineering expertise
  • requires large amounts of annotated training data
  • some changes may require re-annotation of the
    entire training corpus
  • annotators are cheap (but you get what you pay
    for!)

16
Extracting Entities from Text
Approaches, each illustrated on "Abraham Lincoln was born in Kentucky.":
  • Classify pre-segmented candidates: a classifier asks "which class?" for each candidate
  • Lexicons: is the candidate a member of a list (Alabama, Alaska, ... Wisconsin, Wyoming)?
  • Sliding window: classify each window, trying alternate window sizes
  • Boundary models: classify BEGIN/END boundaries
  • Finite state machines: most likely state sequence?
  • Context-free grammars: most likely parse (NNP, V, P, NP; NP, VP, PP, S)?
  • ... and beyond
Any of these models can be used to capture words, formatting, or both.
17
Hidden Markov Models
Graphical model / finite state model: states s(t-1), s(t), s(t+1), ... linked by transitions, each emitting observations o(t-1), o(t), o(t+1), ...
Generates: a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.
Parameters, for all states S = {s1, s2, ...}:
  • Start state probabilities P(st)
  • Transition probabilities P(st | st-1)
  • Observation (emission) probabilities P(ot | st), usually a multinomial over an atomic, fixed alphabet
Training: maximize probability of training observations (with prior).
18
IE with Hidden Markov Models
Given a sequence of observations
Yesterday Lawrence Saul spoke this example
sentence.
and a trained HMM
Find the most likely state sequence (Viterbi)
Yesterday Lawrence Saul spoke this example
sentence.
Any words generated by the designated person-name state are extracted as a person name.
Person name: Lawrence Saul
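The Viterbi step above can be sketched as a small dynamic program. The two states and all probabilities below are made-up toy values, not a trained model; unseen words get a small smoothing probability.

```python
# Viterbi decoding for a toy 2-state HMM (states: "person", "other").
# All probabilities are illustrative, not trained; 1e-6 smooths unseen words.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observation sequence."""
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(obs[t], 1e-6), p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Follow back-pointers from the best final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["person", "other"]
start_p = {"person": 0.2, "other": 0.8}
trans_p = {"person": {"person": 0.6, "other": 0.4},
           "other": {"person": 0.2, "other": 0.8}}
emit_p = {"person": {"Lawrence": 0.4, "Saul": 0.4},
          "other": {"Yesterday": 0.3, "spoke": 0.3}}

tags = viterbi(["Yesterday", "Lawrence", "Saul", "spoke"], states,
               start_p, trans_p, emit_p)
print(tags)  # words tagged "person" are extracted as a person name
```

With these toy parameters the middle two words decode to the person-name state, mirroring the Lawrence Saul example.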
19
HMM Example: Nymble
[Bikel et al. 1998], BBN IdentiFinder
Task: Named Entity Extraction
States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.
Transition probabilities: P(st | st-1, ot-1), backing off to P(st | st-1), then P(st).
Observation probabilities: P(ot | st, st-1) or P(ot | st, ot-1), backing off to P(ot | st), then P(ot).
Train on 450k words of newswire text.
Results (F1): Mixed-case English 93, Upper-case English 91, Mixed-case Spanish 90.
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum 99]
20
Relation Extraction
  • Extract structured relations from text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis...
Disease Outbreaks in The New York Times
Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire
Information Extraction System
21
Relation Extraction
  • Typically require Entity Tagging as preprocessing
  • Knowledge Engineering
  • Rules defined over lexical items: <company> located in <location>
  • Rules defined over parsed text: ((Obj <company>) (Verb located) () (Subj <location>))
  • Proteus, GATE, ...
  • Machine Learning-based
  • Learn rules/patterns from examples: Dan Roth 2005, Cardie 2006, Mooney 2005, ...
  • Partially supervised: bootstrap from seed examples: Agichtein & Gravano 2000, Etzioni et al. 2004, ...
  • Recently, hybrid models: Feldman 2004, 2006

22
Comparison of Approaches
  • Significant effort: use language-engineering environments to help experts create extraction patterns (GATE 2002, Proteus 1998)
  • Substantial effort: train the system over manually labeled data (Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996)
  • Minimal effort: exploit large amounts of unlabeled data
  • DIPRE (Brin 1998), Snowball (Agichtein & Gravano 2000)
  • Etzioni et al. (04): KnowItAll, extracting unary relations
  • Yangarber et al. (00, 02): pattern refinement, generalized names detection
23
The Snowball System: Overview
Organization Location Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7


157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
Snowball
... ... ..
24
Snowball: Getting User Input
ACM DL 2000
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
  • User input
  • a handful of example instances
  • integrity constraints on the relation, e.g., Organization is a key, Age > 0, etc.

25
Snowball: Finding Example Occurrences
Can use any full-text search engine
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
Search Engine
Computer servers at Microsoft's headquarters in Redmond... In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp... The Armonk-based IBM introduced a new line... Change of guard at IBM Corporation's headquarters near Armonk, NY ...
26
Snowball: Tagging Entities
Named entity taggers can recognize Dates, People, Locations, Organizations: MITRE's Alembic, IBM's Talent, LingPipe, ...
Computer servers at <Microsoft>'s headquarters in <Redmond>... In mid-afternoon trading, shares of <Redmond, WA>-based <Microsoft Corp>... The <Armonk>-based <IBM> introduced a new line... Change of guard at <IBM Corporation>'s headquarters near <Armonk, NY> ...
27
Snowball: Extraction Patterns
  • General extraction pattern model: acceptor0, Entity, acceptor1, Entity, acceptor2
  • Acceptor instantiations:
  • String Match (accepts the string "'s headquarters in")
  • Vector-Space (vector <'s 0.5>, <headquarters 0.5>, <in 0.5>)
  • Sequence Classifier (Prob(T valid | 's, headquarters, in))
  • HMMs, sparse sequences, Conditional Random Fields, ...

28
Snowball Generating Patterns
Represent occurrences as vectors of tags and terms
1
Cluster similar occurrences.
2
ORGANIZATION  <'s 0.57>, <headquarters 0.57>, <in 0.57>  LOCATION

29
Snowball Generating Patterns
Represent occurrences as vectors of tags and terms
1
Cluster similar occurrences.
2
Create patterns as filtered cluster centroids
3
ORGANIZATION  <'s 0.71>, <headquarters 0.71>  LOCATION
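A minimal sketch of the centroid-and-filter step above, assuming the occurrences have already been grouped by similarity. The filtering threshold and the second occurrence vector are illustrative assumptions; cosine similarity is shown because it is the natural match function for these term-weight vectors.

```python
# Snowball-style pattern generation sketch: middle contexts are term-weight
# vectors; a cluster's pattern is its centroid with low-weight terms dropped.
# The min_weight threshold and second occurrence are illustrative, not the
# paper's actual settings.
import math

def cosine(u, v):
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors, min_weight=0.3):
    """Average the vectors, drop low-weight terms, renormalize to unit length."""
    acc = {}
    for v in vectors:
        for t, w in v.items():
            acc[t] = acc.get(t, 0.0) + w / len(vectors)
    acc = {t: w for t, w in acc.items() if w >= min_weight}
    norm = math.sqrt(sum(w * w for w in acc.values())) or 1.0
    return {t: round(w / norm, 2) for t, w in acc.items()}

occurrences = [
    {"'s": 0.57, "headquarters": 0.57, "in": 0.57},
    {"'s": 0.6, "headquarters": 0.6, "near": 0.5},
]
pattern = centroid(occurrences)
print(pattern)  # {"'s": 0.71, "headquarters": 0.71}
```

The terms shared by the cluster survive filtering; the idiosyncratic ones ("in", "near") are dropped, yielding the <'s 0.71>, <headquarters 0.71> pattern shown on the slide.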

30
Vector Space Clustering
31
Snowball: Extracting New Tuples
Match tagged text fragments against patterns:
Google 's new headquarters in Mountain View are ...
Candidate occurrence: ORGANIZATION  <'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>  LOCATION, followed by <are 1>
  • P1: ORGANIZATION  <'s 0.71>, <headquarters 0.71>  LOCATION, Match = 0.8
  • P2: ORGANIZATION  <located 0.71>, <in 0.71>  LOCATION, Match = 0.4
  • P3: LOCATION  <- 0.71>, <based 0.71>  ORGANIZATION, Match = 0
32
Snowball: Evaluating Patterns
Automatically estimate pattern confidence: Conf(P4) = Positive / Total = 2/3 = 0.66
Pattern P4: ORGANIZATION  <, 1>  LOCATION
Current seed tuples:
Organization  Headquarters
IBM           Armonk
Intel         Santa Clara
Microsoft     Redmond
  • "IBM, Armonk, reported..." → Positive
  • "Intel, Santa Clara, introduced..." → Positive
  • "Bet on Microsoft, New York-based analyst Jane Smith said..." → Negative
33
Snowball: Evaluating Tuples
Automatically evaluate tuple confidence Conf(T): a tuple has high confidence if generated by high-confidence patterns.
Example: candidate tuple <3Com, Santa Clara>
  • matches P4: ORGANIZATION  <, 1>  LOCATION, Conf(P4) = 0.66, with similarity 0.4
  • matches P3: LOCATION  <- 0.75>, <based 0.75>  ORGANIZATION, Conf(P3) = 0.95, with similarity 0.8
Conf(T) = 0.83
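Snowball combines the confidences of the patterns supporting a tuple noisy-or style: a tuple is wrong only if every supporting pattern match misfires. A minimal sketch with the slide's numbers (plain rounding gives 0.82 here, versus the slide's 0.83):

```python
# Pattern confidence: fraction of seed-checked extractions that agreed with
# known tuples. Tuple confidence: noisy-or over its supporting patterns.
def pattern_conf(positive, total):
    return positive / total

def tuple_conf(supports):
    """supports: list of (pattern_confidence, context_match_similarity)."""
    p_wrong = 1.0
    for conf_p, match in supports:
        p_wrong *= 1.0 - conf_p * match  # this support fails to confirm T
    return 1.0 - p_wrong

print(round(pattern_conf(2, 3), 2))                      # 0.67
print(round(tuple_conf([(0.66, 0.4), (0.95, 0.8)]), 2))  # 0.82
```

A single strong support (e.g., Conf 0.95 with similarity 0.8) already dominates the score; additional weak supports can only raise it.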
34
Snowball: Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7


157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1

... .... ..
Keep only high-confidence tuples for next
iteration
35
Snowball: Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7

Start new iteration with expanded example set
Iterate until no new tuples are extracted
36
Pattern-Tuple Duality
  • A good tuple:
  • extracted by good patterns
  • tuple weight ≈ goodness
  • A good pattern:
  • generated by good tuples
  • extracts good new tuples
  • pattern weight ≈ goodness
  • Edge weight:
  • match/similarity of tuple context to pattern

37
How to Set Node Weights
  • Constraint violation (from before):
  • Conf(P) = Log(Pos) × Pos/(Pos + Neg)
  • Conf(T)
  • HITS (Hassan et al., EMNLP 2006):
  • Conf(P) = Σ Conf(T)
  • Conf(T) = Σ Conf(P)
  • URNS (Downey et al., IJCAI 2005)
  • EM-Spy (Agichtein, SDM 2006):
  • Unknown tuples → Neg
  • Compute Conf(P), Conf(T)
  • Iterate
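The HITS-style mutual reinforcement above can be sketched as an iterative update over a bipartite pattern-tuple graph; the edge set below is a made-up toy extraction graph, not data from the paper.

```python
# HITS-style mutual reinforcement: a pattern is good if it extracts good
# tuples, and a tuple is good if good patterns extract it.
# The bipartite edge set is an invented toy example.
def hits(edges, patterns, tuples, iters=20):
    """edges: set of (pattern, tuple) extraction pairs."""
    conf_p = {p: 1.0 for p in patterns}
    conf_t = {t: 1.0 for t in tuples}
    for _ in range(iters):
        conf_p = {p: sum(conf_t[t] for q, t in edges if q == p)
                  for p in patterns}
        conf_t = {t: sum(conf_p[p] for p, u in edges if u == t)
                  for t in tuples}
        # Normalize so scores stay comparable across iterations.
        zp = max(conf_p.values()) or 1.0
        zt = max(conf_t.values()) or 1.0
        conf_p = {p: c / zp for p, c in conf_p.items()}
        conf_t = {t: c / zt for t, c in conf_t.items()}
    return conf_p, conf_t

edges = {("P1", "T1"), ("P1", "T2"), ("P2", "T2"), ("P3", "T3")}
conf_p, conf_t = hits(edges, ["P1", "P2", "P3"], ["T1", "T2", "T3"])
print(max(conf_p, key=conf_p.get))  # P1 extracts the best-connected tuples
```

After a few iterations the scores stabilize: the pattern touching the most well-supported tuples (P1) and the tuple extracted by the most reliable patterns (T2) rank highest.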

38
Evaluating Patterns and Tuples Expectation
Maximization
  • EM-Spy Algorithm
  • Hide labels for some seed tuples
  • Iterate EM algorithm to convergence on
    tuple/pattern confidence values
  • Set threshold t such that Conf > t for 90% of spy tuples
  • Re-initialize Snowball using new seed tuples

Organization Headquarters Initial Final
Microsoft Redmond 1 1
IBM Armonk 1 0.8
Intel Santa Clara 1 0.9
AG Edwards St Louis 0 0.9
Air Canada Montreal 0 0.8
7th Level Richardson 0 0.8
3Com Corp Santa Clara 0 0.8
3DO Redwood City 0 0.7
3M Minneapolis 0 0.7
MacWorld San Francisco 0 0.7
...
157th Street Manhattan 0 0.52
15th Party Congress China 0 0.3
15th Century Europe Dark Ages 0 0.1
..
39
Adapting Snowball for New Relations
  • Large parameter space:
  • initial seed tuples (randomly chosen, multiple runs)
  • acceptor features: words, stems, n-grams, phrases, punctuation, POS
  • feature selection techniques: OR, NB, Freq, support, combinations
  • feature weights: TF·IDF, TF, TF·NB, NB
  • pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
  • Automatically estimate parameter values:
  • estimate operating parameters based on occurrences of seed tuples
  • run cross-validation on hold-out sets of seed tuples for optimal performance
  • seed occurrences that do not have close neighbors are discarded

40
Example Task: DiseaseOutbreaks
SDM 2006
Proteus: 0.409, Snowball: 0.415
41
Snowball Used in Various Domains
  • News: NYT, WSJ, AP (DL00, SDM06)
  • CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
  • Medical literature: PDR, Micromedex (Thesis)
  • AdverseEffects, DrugInteractions, RecommendedTreatments
  • Biological literature: GeneWays corpus (ISMB03)
  • Gene and Protein Synonyms

42
Outline
  • Information extraction overview
  • Partially supervised information extraction
  • Adaptivity
  • Confidence estimation
  • Text retrieval for scalable extraction
  • Query-based information extraction
  • Implicit connections/graphs in text databases
  • Current and future work
  • Inferring and analyzing social networks
  • Utility-based extraction tuning
  • Multi-modal information extraction and data
    mining
  • Authority/trust/confidence estimation

43
Extracting A Relation From a Large Text Database
InformationExtraction System
StructuredRelation
  • Brute-force approach: feed all docs to the information extraction system
  • Only a tiny fraction of documents are often useful
  • Many databases are not crawlable
  • Often a search interface is available, with an existing keyword index
  • How to identify useful documents?

44
An Abstract View of Text-Centric Tasks
Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006
Output tuples

Text Database
Extraction System
  1. Retrieve documents from database
  2. Process documents
  3. Extract output tuples

Task                    "Tuple"
Information Extraction  Relation Tuple
Database Selection      Word (Frequency)
Focused Crawling        Web Page about a Topic
45
Executing a Text-Centric Task
Output tuples

Text Database
Extraction System
  1. Retrieve documents from database
  2. Process documents
  3. Extract output tuples
  • Two major execution paradigms:
  • Scan-based: retrieve and process documents sequentially
  • Index-based: query the database (e.g., "case fatality rate"), retrieve and process documents in results
  • Similar to the relational world: the underlying data distribution dictates what is best
  • Unlike the relational world:
  • indexes are only approximate: the index is on keywords, not on tuples of interest
  • choice of execution plan affects output completeness (not only speed)
46
Scan
Output tuples

Extraction System
Text Database
  1. Retrieve docs from database
  2. Process documents
  3. Extract output tuples
  • Scan retrieves and processes documents sequentially (until reaching target recall)
  • Execution time = |Retrieved Docs| × (R + P), where R is the time to retrieve a document and P the time to process it
Question: how many documents does Scan retrieve to reach target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper)
47
Iterative Query Expansion
Output tuples

Text Database
Extraction System
Query Generation
  1. Query database with seed tuples (e.g., Ebola AND Zaire)
  2. Process retrieved documents
  3. Extract tuples from docs
  4. Augment seed tuples with new tuples (e.g., <Malaria, Ethiopia>)
  • Execution time = |Retrieved Docs| × (R + P) + |Queries| × Q, where R and P are the per-document retrieval and processing times and Q is the time to answer a query
Question: how many queries and how many documents does Iterative Set Expansion need to reach target recall?
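The two cost formulas above can be compared with a back-of-the-envelope model. All per-document and per-query times below are illustrative placeholders, not measured values.

```python
# Back-of-the-envelope cost model for the two execution paradigms.
# r = time to retrieve a doc, p = time to process it, q = time per query.
# All times (seconds) are illustrative placeholders.
def scan_time(num_docs, r=0.1, p=1.0):
    # Scan: retrieve and process every document sequentially.
    return num_docs * (r + p)

def query_time(num_docs_retrieved, num_queries, r=0.1, p=1.0, q=0.5):
    # Query-based: pay per query, then retrieve/process only the results.
    return num_docs_retrieved * (r + p) + num_queries * q

full = scan_time(135_000)
targeted = query_time(13_500, 100)  # e.g. reach 10% of the docs via 100 queries
print(full, targeted)
```

With these placeholder times the query-based plan is roughly 10x cheaper, which is exactly why the choice matters; but as the slide notes, the query-based plan may also cap achievable recall.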
48
QXtract: Querying Text Databases for Robust, Scalable Information Extraction
User-Provided Seed Tuples
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Query Generation
Queries
Promising Documents
Information Extraction System
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Mad Cow Disease The U.K. July 1995
Pneumonia The U.S. Feb. 1995
Problem: learn keyword queries to retrieve promising documents
Extracted Relation
49
Learning Queries to Retrieve Promising Documents
User-Provided Seed Tuples
  1. Get document sample with likely negative and
    likely positive examples.
  2. Label sample documents using information
    extraction system as oracle.
  3. Train classifiers to recognize useful
    documents.
  4. Generate queries from classifier model/rules.

Seed Sampling
Information Extraction System
Classifier Training
Query Generation
Queries
50
Training Classifiers to Recognize Useful Documents
Document features: words.
Sample documents (D1-D4), labeled useful (+) or not (-) by the extraction system, with features such as:
  • disease reported epidemic expected area
  • virus reported expected infected patients
  • products made used exported far
  • past old homerun sponsored event
Example learned models:
  • Ripper (rules): disease AND reported => USEFUL
  • SVM (term weights): virus 3, infected 2, sponsored -1, ...
  • Okapi (IR, ranked terms): disease, reported, epidemic, infected, virus, exported, used, far
51
Generating Queries from Classifiers
From the learned models (Ripper: disease AND reported => USEFUL; SVM: virus 3, infected 2, sponsored -1; Okapi: disease, exported, reported, used, epidemic, far, infected, virus), generate keyword queries:
  • disease AND reported
  • epidemic virus
  • virus infected
QCombined: disease AND reported; epidemic; virus
52
SIGMOD 2003 Demonstration
53
An Even Simpler Querying Strategy: Tuples
Ebola and Zaire
DiseaseName Location Date
Ebola Zaire May 1995
Search Engine
Malaria Ethiopia Jan. 1995
hemorrhagic fever Africa May 1995
InformationExtraction System
  1. Convert given tuples into queries
  2. Retrieve matching documents
  3. Extract new tuples from documents and iterate
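The three steps above can be sketched end to end. The tiny in-memory corpus and the hard-coded `extract()` below are stand-ins for a real search engine and IE system; only the loop structure is the point.

```python
# Minimal sketch of the "Tuples" querying strategy: each known tuple becomes
# a keyword query, and new tuples extracted from the results seed the next
# round. The corpus and extract() are toy stand-ins, not a real IE system.
corpus = [
    "the deadly Ebola epidemic in Zaire echoes the Malaria outbreak in Ethiopia",
    "a Malaria outbreak was reported in Ethiopia last month",
    "hemorrhagic fever spread in Africa",
]

def search(query_terms):
    # Stand-in for a keyword search engine over the corpus.
    return [d for d in corpus if all(t.lower() in d.lower() for t in query_terms)]

def extract(doc):
    # Stand-in for a real extraction system: hard-coded demo patterns.
    found = set()
    if "Ebola" in doc and "Zaire" in doc:
        found.add(("Ebola", "Zaire"))
    if "Malaria" in doc and "Ethiopia" in doc:
        found.add(("Malaria", "Ethiopia"))
    return found

seen = set()
frontier = {("Ebola", "Zaire")}          # the user-provided seed tuple
while frontier:
    tup = frontier.pop()
    seen.add(tup)
    for doc in search(tup):              # 1. convert the tuple into a query
        for new in extract(doc):         # 2.-3. extract new tuples, iterate
            if new not in seen:
                frontier.add(new)
print(sorted(seen))
```

Note that the third corpus document is never reached: no known tuple retrieves it, which previews the reachability limits analyzed on the following slides.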

54
Comparison of Document Access Methods
QXtract: 60% of the relation extracted from 10% of the documents of a 135,000 newspaper article database. Tuples strategy: recall at most 46%.
55
Predicting Recall of Tuples Strategy
WebDB 2003
Seed Tuple → SUCCESS!
Seed Tuple → FAILURE?
Can we predict if Tuples will succeed?
56
Using Querying Graph for Analysis
tuples
Documents
  • We need to compute:
  • the number of documents retrieved after sending Q tuples as queries (estimates time)
  • the number of tuples that appear in the retrieved documents (estimates recall)
  • To estimate these we need to compute:
  • the degree distribution of the tuples discovered by retrieving documents
  • the degree distribution of the documents retrieved by the tuples
  • (Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees)

t1 <SARS, China>, t2 <Ebola, Zaire>, t3 <Malaria, Ethiopia>, t4 <Cholera, Sudan>, t5 <H5N1, Vietnam>; documents d1-d5
57
Information Reachability Graph
Tuples t1-t5 and documents d1-d5: t1 retrieves document d1, which contains t2; t2, t3, and t4 are reachable from t1.
58
Connected Components
  • In: tuples that retrieve other tuples but are not reachable
  • Core: tuples that retrieve other tuples and themselves
  • Out: reachable tuples that do not retrieve tuples in Core
59
Sizes of Connected Components
How many tuples are in the largest Core + Out?
  • Conjecture: the degree distribution in reachability graphs follows a power law
  • Then the reachability graph has at most one giant component
  • Define Reachability as the fraction of tuples in the largest Core + Out (the Core is strongly connected)
60
NYT Reachability Graph Outdegree Distribution
Matches the power-law distribution
MaxResults = 10
MaxResults = 50
61
NYT Component Size Distribution
Not reachable
reachable
MaxResults = 10: |CG| / |T| = 0.297
MaxResults = 50: |CG| / |T| = 0.620
62
Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
63
Estimating Reachability
  • In a power-law random graph G, a giant component CG emerges if d (the average outdegree) is greater than 1 (Chung and Lu, Annals of Combinatorics, 2002; for power-law exponent β < 3.457)
  • Estimate Reachability = |CG| / |T|
  • Depends only on d (average outdegree)
64
Estimating Reachability Algorithm
Tuples
  1. Pick some random tuples
  2. Use tuples to query database
  3. Extract tuples from matching documents to compute
    reachability graph edges
  4. Estimate average outdegree
  5. Estimate reachability using results of Chung and
    Lu, Annals of Combinatorics, 2002

Example reachability-graph sample over tuples t1-t4 and documents d1-d4, with average outdegree d = 1.5.
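The estimation algorithm above, in miniature: sample some tuples, record which tuples each one reaches through the retrieved documents, and check the average outdegree. The edge list below is invented so that d matches the 1.5 of the figure.

```python
# Sketch of reachability estimation: sample tuples, build tuple-to-tuple
# edges from the documents each tuple's query retrieves, then use the
# average outdegree. The edge list is an invented sample.
def average_outdegree(edges, nodes):
    return len(edges) / len(nodes)

# Edge (t, t') means: querying with t retrieved a document containing t'.
sampled = ["t1", "t2", "t3", "t4"]
edges = [("t1", "t3"), ("t1", "t2"), ("t2", "t2"),
         ("t3", "t4"), ("t4", "t1"), ("t2", "t4")]

d = average_outdegree(edges, sampled)
print(d)  # 1.5
# Chung & Lu: in a power-law graph a giant component emerges when d > 1,
# so a Tuples-style strategy is predicted to reach most of the relation.
print("giant component likely" if d > 1 else "querying likely to stall")
```

The same check run on a sample with d below 1 would predict the failure case on the previous slide, where querying stalls after a few tuples.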
65
Estimating Reachability of NYT
Reachability ≈ 0.46
Approximate reachability is estimated after 50 queries. Can be used to predict success (or failure) of a Tuples querying strategy.
66
Outline
  • Information extraction overview
  • Partially supervised information extraction
  • Adaptivity
  • Confidence estimation
  • Text retrieval for scalable extraction
  • Query-based information extraction
  • Implicit connections/graphs in text databases
  • Current and future work
  • Adaptive information extraction and tuning
  • Authority/trust/confidence estimation
  • Inferring and analyzing social networks
  • Multi-modal information extraction and data
    mining

67
Goal: Detect, Monitor, Predict Outbreaks
Detection, Monitoring, Prediction
Data Integration, Data Mining, Trend Analysis
IE Sys 1 ... IE Sys 4, running over:
  • Historical news, breaking news stories, wire, alerts, ...
  • Current patient records: diagnosis, physicians' notes, lab results/analysis, ...
  • 911 calls: traffic accidents, ...
68
Adaptive, Utility-Driven Extraction
  • Extract relevant symptoms and modifiers from text
  • Physician notes, patient narrative, call
    transcripts
  • Call transcripts: a difficult extraction problem
  • not grammatical, dialogue, speech-to-text unreliable, ...
  • Use partially supervised techniques to learn
    extraction patterns
  • One approach:
  • link together (when possible) call transcript and patient record (e.g., by time, address, and patient name)
  • correlate patterns in the transcript with diagnosis/symptoms
  • Fine-grained learning: can automatically train for each symptom or group of patients, etc.

69
Authority, Trust, Confidence
  • How reliable are signals emitted by information
    extraction?
  • Dimensions of trust/confidence:
  • source reliability: diagnosis vs. notes vs. 911 calls
  • tuple extraction confidence
  • source extraction difficulty

70
Source Confidence Estimation
CIKM 2005
  • Task is easy when context term distributions diverge from the background
  • Quantify as relative entropy (Kullback-Leibler divergence)
  • After calibration, the metric predicts whether a task is easy or hard
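The KL-divergence signal can be computed directly from term frequencies. The two distributions below are toy values over a four-word vocabulary, not corpus statistics.

```python
# Relative entropy (KL divergence) between the term distribution around
# extracted entities and the background corpus distribution.
# Both distributions here are illustrative toy values.
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[t] > 0 wherever p[t] > 0."""
    return sum(pw * math.log2(pw / q[t]) for t, pw in p.items() if pw > 0)

background = {"the": 0.5, "reported": 0.2, "outbreak": 0.1, "said": 0.2}
context = {"the": 0.2, "reported": 0.4, "outbreak": 0.3, "said": 0.1}

score = kl_divergence(context, background)
print(round(score, 3))  # 0.511
# Higher divergence: extraction contexts stand out from the background,
# which (after calibration) predicts an easier task.
```

If the context distribution equaled the background, the score would be 0: contexts carry no distinguishing signal, predicting a hard task.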

71
Inferring Social Networks
  • Explicit networks
  • Patient records: family, geographical entities in structured and unstructured portions
  • Implicit connections:
  • extract events (e.g., "went to restaurant X yesterday")
  • extract relationships (e.g., "I work in Kroeger's in Toco Hills")

72
Modeling Social Networks for
  • epidemiology, security, ...

Email exchange mapped onto cubicle locations.
73
Improve Prediction Accuracy
  • Suppose we managed to
  • Automatically identify people currently sick or
    about to get sick
  • Automatically infer (part of) their social
    network
  • Can we improve prediction for dynamics of an
    outbreak?

74
Multimodal Information Extraction and Data Mining
  • Develop joint models over structured data and text
  • e.g., lab results and symptoms extracted from text
  • One approach: mutual reinforcement
  • co-training: train classifiers on redundant views of the data (e.g., structured + unstructured)
  • bootstrap on examples proposed by both views
  • More generally: graphical models

75
Summary
  • Information extraction overview
  • Partially supervised information extraction
  • Adaptivity
  • Confidence estimation
  • Text retrieval for scalable extraction
  • Query-based information extraction
  • Implicit connections/graphs in text databases
  • Current and future work
  • Adaptive information extraction and tuning
  • Authority/trust/confidence estimation
  • Inferring and analyzing social networks
  • Multi-modal information extraction and data
    mining

76
Thank You
  • Details: papers, other talk slides
  • http://www.mathcs.emory.edu/eugene/