Surfacing Information in Large Text Collections - PowerPoint PPT Presentation

About This Presentation

Title:

Surfacing Information in Large Text Collections

Provided by: EugeneAg8

Learn more at: http://www.mathcs.emory.edu

Transcript and Presenter's Notes

Title: Surfacing Information in Large Text Collections

1
Surfacing Information in Large Text Collections

Eugene Agichtein
Microsoft Research

2
Example: Angina treatments
Structured databases (e.g., drug info, WHO drug
adverse effects DB, etc.)
Medical reference and literature
Web search results
3
Research Goal

Seamless, intuitive, efficient, and robust access
to knowledge in unstructured sources
Some approaches
Retrieve the relevant documents or passages
Question answering
Construct domain-specific verticals (MedLine)
Extract entities and relationships
Network of relationships: Semantic Web

4
Semantic Relationships Buried in Unstructured
Text
RecommendedTreatment
A number of well-designed and -executed
large-scale clinical trials have now shown that
treatment with statins reduces recurrent
myocardial infarction, reduces strokes, and
lessens the need for revascularization or
hospitalization for unstable angina pectoris

Web, newsgroups, web logs
Text databases (PubMed, CiteSeer, etc.)
Newspaper Archives
Corporate mergers, succession, location
Terrorist attacks

5
What Structured Representation Can Do for You
Structured Relation

allow precise and efficient querying
allow returning answers instead of documents
support powerful query constructs
allow data integration with (structured) RDBMS
provide useful content for Semantic Web

6
Challenges in Information Extraction

Portability
Reduce effort to tune for new domains and tasks
MUC systems: experts would take 8-12 weeks to
tune
Scalability, Efficiency, Access
Enable information extraction over large
collections
1 sec/document × 5 billion docs ≈ 158 CPU years
Approach: learn from data (bootstrapping)
Snowball: Partially Supervised Information
Extraction
Querying Large Text Databases for Efficient
Information Extraction
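The scalability figure above can be checked with a few lines of arithmetic. This is a minimal sketch; the function and constant names are illustrative, and it assumes strictly sequential processing on one CPU with a 365-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year


def cpu_years(num_docs: int, seconds_per_doc: float) -> float:
    """Total sequential processing time, expressed in CPU years."""
    return num_docs * seconds_per_doc / SECONDS_PER_YEAR


# Figures from the slide: 1 second per document, 5 billion documents.
estimate = cpu_years(num_docs=5_000_000_000, seconds_per_doc=1.0)
print(f"{estimate:.1f} CPU years")
```

The result is between 158 and 159 CPU years, which matches the slide's figure.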

7
The Snowball System: Overview
[Figure: Snowball system diagram]
8
Snowball: Getting User Input
ACM DL 2000

User input
a handful of example instances
integrity constraints on the relation, e.g.,
Organization is a key, Age > 0, etc.

9
Evaluating Patterns and Tuples: Expectation
Maximization

EM-Spy Algorithm
Hide labels for some seed tuples
Iterate EM algorithm to convergence on
tuple/pattern confidence values
Set threshold t such that (t > 90% of spy
tuples)
Re-initialize Snowball using new seed tuples

10
Adapting Snowball for New Relations

Large parameter space
Initial seed tuples (randomly chosen, multiple
runs)
Acceptor features: words, stems, n-grams,
phrases, punctuation, POS
Feature selection techniques: OR, NB, Freq,
support, combinations
Feature weights: TFIDF, TF, TFNB, NB
Pattern evaluation strategies: NN, Constraint
violation, EM, EM-Spy
Automatically estimate parameter values
Estimate operating parameters based on
occurrences of seed tuples
Run cross-validation on hold-out sets of seed
tuples for optimal performance
Seed occurrences that do not have close
neighbors are discarded

11
Example Task 1: DiseaseOutbreaks
SDM 2006
Proteus: 0.409; Snowball: 0.415
12
Example Task 2: Bioinformatics
ISMB 2003
APO-1, also known as DR6
MEK4, also called SEK1

100,000 gene and protein synonyms extracted from
50,000 journal articles
Approximately 40% of confirmed synonyms were not
previously listed in a curated authoritative
reference (SWISSPROT)

13
Snowball Used in Various Domains

News: NYT, WSJ, AP [DL00, SDM06]
CompanyHeadquarters, MergersAcquisitions,
DiseaseOutbreaks
Medical literature: PDRHealth, Micromedex [Ph.D.
Thesis]
AdverseEffects, DrugInteractions,
RecommendedTreatments
Biological literature: GeneWays corpus [ISMB03]
Gene and Protein Synonyms

14
Limits of Bootstrapping for Extraction
CIKM 2005

Task is easy when context term distributions
diverge from the background distribution
Quantify as relative entropy (Kullback-Leibler
divergence)
After calibration, the metric predicts whether
bootstrapping is likely to work
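A minimal sketch of the divergence measurement, assuming add-one smoothing over a shared vocabulary and whitespace tokenization; the toy token lists and function names are illustrative and not taken from the CIKM 2005 work. The calibration step that turns the score into a prediction is not shown.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, smoothing=1.0):
    """Smoothed unigram distribution over a fixed vocabulary.

    Assumes every token is in the vocabulary, so probabilities sum to 1.
    """
    counts = Counter(tokens)
    total = len(tokens) + smoothing * len(vocabulary)
    return {term: (counts[term] + smoothing) / total for term in vocabulary}


def kl_divergence(p, q):
    """Relative entropy D_KL(P || Q) in bits; q must be non-zero wherever p is."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


# Toy data: terms seen around relation instances vs. terms in general text.
context_tokens = "outbreak of ebola reported in zaire outbreak of cholera reported".split()
background_tokens = "the report of the committee in the city of the state".split()

vocabulary = set(context_tokens) | set(background_tokens)
p_context = term_distribution(context_tokens, vocabulary)
p_background = term_distribution(background_tokens, vocabulary)

divergence = kl_divergence(p_context, p_background)
print(f"D_KL(context || background) = {divergence:.3f} bits")
```

A larger value means the context terms stand out more from the background, which is the situation the slide describes as easy for bootstrapping.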

15
Extracting All Relation Instances From a Text
Database
Information Extraction System
Structured Relation

Brute-force approach: feed all docs to the
information extraction system
Often only a tiny fraction of documents is
useful
Many databases are not crawlable
Often a search interface is available, with
existing keyword index
How to identify useful documents?

16
Accessing Text DBs via Search Engines
Information Extraction System
Search Engine

Search engines impose limitations
Limit on documents retrieved per query
Support simple keywords and phrases
Ignore stopwords (e.g., a, is)

Structured Relation
17
Text-Centric Task I: Information Extraction

Information extraction applications extract
structured relations from unstructured text

May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire, is finding itself hard
pressed to cope with the crisis
Disease Outbreaks in The New York Times
Information Extraction System (e.g., NYU's
Proteus)
Information Extraction tutorial yesterday by
AnHai Doan, Raghu Ramakrishnan, Shivakumar
Vaithyanathan
18
Executing a Text-Centric Task
Text Database
Extraction System

Retrieve documents from database

Extract output tokens

Process documents

Two major execution paradigms
Scan-based: retrieve and process documents
sequentially
Index-based: query the database (e.g., [case
fatality rate]), then retrieve and process the
documents in the results

Similar to relational world

The underlying data distribution dictates what
is best

Indexes are only approximate: the index is on
keywords, not on tokens of interest
Choice of execution plan affects output
completeness (not only speed)

Unlike the relational world
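The two execution paradigms can be contrasted on a toy collection. This is a sketch under stated assumptions: the regular expression is a hypothetical stand-in for an extraction system, the four documents are made up, and the keyword index is a plain in-memory inverted index rather than a real search engine.

```python
import re

# Hypothetical stand-in for an extraction system: "<disease> outbreak in <place>".
PATTERN = re.compile(r"(\w+) outbreak in (\w+)", re.IGNORECASE)


def extract_tuples(document):
    return {(disease.lower(), place) for disease, place in PATTERN.findall(document)}


def scan_plan(documents):
    """Scan-based: process every document sequentially."""
    tuples = set()
    for doc in documents:
        tuples |= extract_tuples(doc)
    return tuples, len(documents)


def build_keyword_index(documents):
    """Inverted index on keywords (not on the tokens of interest)."""
    index = {}
    for doc_id, doc in enumerate(documents):
        for word in re.findall(r"\w+", doc.lower()):
            index.setdefault(word, set()).add(doc_id)
    return index


def index_plan(documents, index, queries):
    """Index-based: process only documents retrieved by keyword queries."""
    doc_ids = set()
    for query in queries:
        doc_ids |= index.get(query.lower(), set())
    tuples = set()
    for doc_id in sorted(doc_ids):
        tuples |= extract_tuples(documents[doc_id])
    return tuples, len(doc_ids)


docs = [
    "Ebola outbreak in Zaire strains health agencies",
    "Officials confirm cholera outbreak in Peru",
    "Stock markets rallied on Tuesday",
    "Measles outbreak in Ohio prompts vaccination drive",
]
scan_tuples, scan_cost = scan_plan(docs)
index_tuples, index_cost = index_plan(docs, build_keyword_index(docs), ["ebola", "cholera"])
print(len(scan_tuples), scan_cost)    # tuples found, documents processed
print(len(index_tuples), index_cost)
```

With these queries the index-based plan processes half as many documents but misses the measles tuple, which illustrates why the choice of plan affects completeness and not only speed.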
19
QXtract: Querying Text Databases for Robust
Scalable Information EXtraction
User-Provided Seed Tuples
Query Generation
Queries
Promising Documents
Information Extraction System
Problem: learn keyword queries to retrieve
promising documents
Extracted Relation
20
Learning Queries to Retrieve Promising Documents
User-Provided Seed Tuples

Get document sample with likely negative and
likely positive examples.
Label sample documents using information
extraction system as oracle.
Train classifiers to recognize useful
documents.
Generate queries from classifier model/rules.
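The four steps above can be sketched end to end on toy data. Everything here is an assumption made for illustration: the oracle is a hypothetical one-line stand-in for an extraction system, and a smoothed log-odds term score replaces the trained classifiers, so this shows the shape of the pipeline rather than the actual QXtract query learner.

```python
import math
from collections import Counter


def document_frequencies(documents):
    """Number of documents in which each term appears."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    return df


def label_with_oracle(documents, oracle):
    """Split a sample into useful / useless documents using the oracle."""
    useful = [d for d in documents if oracle(d)]
    useless = [d for d in documents if not oracle(d)]
    return useful, useless


def generate_queries(useful_docs, useless_docs, num_queries=3):
    """Rank terms by smoothed log-odds of appearing in useful documents."""
    useful_df = document_frequencies(useful_docs)
    useless_df = document_frequencies(useless_docs)
    scores = {}
    for term in useful_df:
        p_useful = (useful_df[term] + 0.5) / (len(useful_docs) + 1.0)
        p_useless = (useless_df[term] + 0.5) / (len(useless_docs) + 1.0)
        scores[term] = (math.log(p_useful / (1 - p_useful))
                        - math.log(p_useless / (1 - p_useless)))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:num_queries]


sample = [
    "ebola outbreak reported in zaire",
    "cholera outbreak reported in peru",
    "measles outbreak spreads in ohio",
    "senate vote reported in washington",
    "markets rally in new york",
]
# Hypothetical oracle: pretend extraction succeeds when "outbreak" occurs.
useful, useless = label_with_oracle(sample, lambda d: "outbreak" in d.split())
queries = generate_queries(useful, useless)
print(queries)
```

The top-ranked term is "outbreak", while common words such as "in" score low because they appear in both classes.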

Seed Sampling
Information Extraction System
Classifier Training
Query Generation
Queries
21
SIGMOD 2003 Demonstration
22
Querying Graph
[Figure: querying graph with token nodes t1-t5 and document nodes d1-d5]

The querying graph is a bipartite graph,
containing tokens and documents
Each token (transformed to a keyword query)
retrieves documents
Documents contain tokens
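A minimal sketch of traversing such a querying graph from a seed token. The edge lists are hypothetical (the edges of the original figure are not recoverable), and the two dictionaries stand in for a search interface and an extraction system.

```python
from collections import deque


def reachable(seed_tokens, retrieves, contains):
    """Breadth-first traversal of the bipartite querying graph.

    retrieves: token -> documents returned when the token is issued as a query
    contains:  document -> tokens extracted from that document
    """
    seen_tokens = set(seed_tokens)
    seen_docs = set()
    queue = deque(seed_tokens)
    while queue:
        token = queue.popleft()
        for doc in retrieves.get(token, []):
            if doc in seen_docs:
                continue
            seen_docs.add(doc)
            for new_token in contains.get(doc, ()):
                if new_token not in seen_tokens:
                    seen_tokens.add(new_token)
                    queue.append(new_token)
    return seen_tokens, seen_docs


# Hypothetical edges for illustration only.
retrieves = {"t1": ["d1", "d2"], "t2": ["d3"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
contains = {"d1": {"t1"}, "d2": {"t1", "t2"}, "d3": {"t2", "t3"},
            "d4": {"t4"}, "d5": {"t4", "t5"}}

tokens, documents = reachable(["t1"], retrieves, contains)
print(sorted(tokens), sorted(documents))
```

Starting from t1, the traversal never reaches t4 or t5, because no document retrieved along the way contains them.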

23
Sizes of Connected Components
How many tuples are in the largest Core + Out?
[Figure: reachability graph components In, Core (strongly connected), and Out, with seed tuple t0]

Conjecture:
The degree distribution in reachability graphs
follows a power law.
Then the reachability graph has at most one giant
component.
Define reachability as the fraction of tuples in
the largest Core + Out
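A minimal sketch of the reachability measure on a small directed graph over tuples, assuming that Core is the largest strongly connected component and Out is everything reachable from it. The example edges are hypothetical, and the all-pairs search is quadratic, so this is for illustration rather than for large graphs.

```python
def forward_reach(graph, start):
    """All nodes reachable from start (including start) by directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reachability(graph):
    """Return (|Core + Out| / |T|, Core, Out) for the largest Core."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    fwd = {n: forward_reach(graph, n) for n in nodes}
    # The strongly connected component of v: nodes u with v -> u and u -> v.
    core = max(({u for u in fwd[v] if v in fwd[u]} for v in sorted(nodes)), key=len)
    core_and_out = set().union(*(fwd[v] for v in core))
    return len(core_and_out) / len(nodes), core, core_and_out - core


# Hypothetical reachability graph: edge a -> b means querying with a
# retrieves a document from which b is extracted.
graph = {
    "t0": ["t1"],        # In: reaches the core but is not reachable from it
    "t1": ["t2"],
    "t2": ["t3"],
    "t3": ["t1", "t4"],  # t1, t2, t3 form the core
    "t4": ["t5"],        # Out
    "t5": [],
    "t6": [],            # isolated, never reachable
}
score, core, out = reachability(graph)
print(f"reachability = {score:.3f}")
```

Here five of the seven tuples fall in Core + Out, so no amount of querying that starts inside the core can recover t0 or t6.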

24
NYT Reachability Graph: Outdegree Distribution
Matches the power-law distribution
[Figure panels: MaxResults = 10, MaxResults = 50]
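One rough way to check whether an outdegree distribution looks like a power law is to fit a straight line to the degree histogram on log-log axes. This sketch uses synthetic degrees, not the NYT data, and ordinary least squares on the log-log histogram is a crude estimator chosen here only for brevity; it is not presented as the method behind the original plots.

```python
import math
from collections import Counter


def fit_power_law_exponent(degrees):
    """Estimate alpha in count(k) ~ k^(-alpha) by least squares in log-log space.

    Requires at least two distinct positive degree values.
    """
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic outdegrees: about 1000 / k^2 nodes have outdegree k, for k = 1..10.
synthetic_degrees = []
for k in range(1, 11):
    synthetic_degrees.extend([k] * round(1000 / k ** 2))

alpha = fit_power_law_exponent(synthetic_degrees)
print(f"estimated exponent: {alpha:.2f}")
```

On this synthetic input the estimate comes out close to the true exponent of 2; on real data, a poor linear fit in log-log space would be a warning sign against the power-law assumption.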
25
NYT Component Size Distribution
[Figure: reachable vs. not reachable components]
MaxResults = 10: CG / T = 0.297
MaxResults = 50: CG / T = 0.620
26
Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
27
Estimate Cost of Retrieval Methods
SIGMOD 2006

Alternatives
Scan, Filtered Scan, Tuples, QXtract
General cost model for text-centric tasks
Information extraction, summary construction,
etc
Estimate the expected cost of each access method
Parametric model describing all retrieval steps
Extended analysis to arbitrary degree
distributions
Parameter estimates can be piggybacked at
runtime
Cost estimates can be provided to a query
optimizer for nearly optimal execution

28
Optimized Execution of Text-Centric Tasks
Scan
Filtered Scan
Tuples
29
Current Research Agenda

Seamless, intuitive, and robust access to
knowledge in biological and medical sources
Some research problems
Robust query processing over unstructured data
Intelligently interpreting user information needs
Text mining for bio- and medical informatics
Model implicit network structures
Entity graphs in Wikipedia
Protein-Protein interaction networks
Semantic maps of MedLine

30
Deriving Actionable Knowledge from Unstructured
(text) Data