Surfacing Information in Large Text Collections


1
Surfacing Information in Large Text Collections
  • Eugene Agichtein
  • Microsoft Research

2
Example: Angina treatments
Structured databases (e.g., drug info, WHO drug
adverse effects DB, etc.)
Medical reference and literature
Web search results
3
Research Goal
  • Seamless, intuitive, efficient, and robust access
    to knowledge in unstructured sources
  • Some approaches:
  • Retrieve the relevant documents or passages
  • Question answering
  • Construct domain-specific verticals (MedLine)
  • Extract entities and relationships
  • Network of relationships: Semantic Web

4
Semantic Relationships Buried in Unstructured
Text
RecommendedTreatment
A number of well-designed and -executed
large-scale clinical trials have now shown that
treatment with statins reduces recurrent
myocardial infarction, reduces strokes, and
lessens the need for revascularization or
hospitalization for unstable angina pectoris
  • Web, newsgroups, web logs
  • Text databases (PubMed, CiteSeer, etc.)
  • Newspaper Archives
  • Corporate mergers, succession, location
  • Terrorist attacks

5
What Structured Representation Can Do for You
Structured Relation
  • allow precise and efficient querying
  • allow returning answers instead of documents
  • support powerful query constructs
  • allow data integration with (structured) RDBMS
  • provide useful content for Semantic Web

6
Challenges in Information Extraction
  • Portability
  • Reduce effort to tune for new domains and tasks
  • MUC systems: experts would take 8-12 weeks to
    tune
  • Scalability, Efficiency, Access
  • Enable information extraction over large
    collections
  • 1 sec/document × 5 billion docs ≈ 158 CPU years
  • Approach: learn from data (bootstrapping)
  • Snowball: Partially Supervised Information
    Extraction
  • Querying Large Text Databases for Efficient
    Information Extraction
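The scalability figure on this slide (1 second per document over 5 billion documents) can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `cpu_years` is made up for illustration, and a year is assumed to have 365 days.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def cpu_years(num_docs: int, seconds_per_doc: float) -> float:
    """Sequential processing time for a collection, expressed in CPU years."""
    return num_docs * seconds_per_doc / SECONDS_PER_YEAR


estimate = cpu_years(5_000_000_000, 1.0)
print(f"{estimate:.1f} CPU years")  # prints "158.5 CPU years"
```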

7
The Snowball System: Overview
[Figure: Snowball system overview diagram]
8
Snowball: Getting User Input
ACM DL 2000
  • User input
  • a handful of example instances
  • integrity constraints on the relation, e.g.,
    Organization is a key, Age > 0, etc.

9
Evaluating Patterns and Tuples: Expectation
Maximization
  • EM-Spy Algorithm
  • Hide labels for some seed tuples
  • Iterate EM algorithm to convergence on
    tuple/pattern confidence values
  • Set threshold t such that (t > 90% of spy
    tuples)
  • Re-initialize Snowball using new seed tuples
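The thresholding step of EM-Spy can be sketched as follows. The slide does not spell out the exact rule, so this sketch assumes the threshold t is a confidence value lying above that of 90% of the hidden (spy) seed tuples. If the intended reading is instead that 90% of the spies should pass the threshold, pass `fraction_below=0.1`. The function names, the confidence values, and the candidate tuples are all hypothetical, and the EM iterations that would produce the confidences are not shown.

```python
import math


def spy_threshold(spy_confidences, fraction_below=0.9):
    """Pick t so that about `fraction_below` of the spy tuples score below t."""
    if not spy_confidences:
        raise ValueError("need at least one spy tuple")
    ranked = sorted(spy_confidences)
    # Index of the first spy that is allowed to reach the threshold.
    k = min(math.ceil(fraction_below * len(ranked)), len(ranked) - 1)
    return ranked[k]


def select_new_seeds(candidate_confidences, threshold):
    """Keep candidate tuples whose confidence reaches the threshold."""
    return [tup for tup, conf in candidate_confidences.items() if conf >= threshold]


# Hypothetical confidences assigned to ten spy tuples after EM converges.
spies = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.10]
t = spy_threshold(spies, fraction_below=0.9)

# Hypothetical candidate tuples with their confidences.
candidates = {("Org A", "City X"): 0.97, ("Org B", "City Y"): 0.40}
new_seeds = select_new_seeds(candidates, t)
print(t, new_seeds)
```

The selected tuples would then serve as the new seed set when Snowball is re-initialized.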

10
Adapting Snowball for New Relations
  • Large parameter space
  • Initial seed tuples (randomly chosen, multiple
    runs)
  • Acceptor features: words, stems, n-grams,
    phrases, punctuation, POS
  • Feature selection techniques: OR, NB, Freq,
    support, combinations
  • Feature weights: TFIDF, TF, TFNB, NB
  • Pattern evaluation strategies: NN, Constraint
    violation, EM, EM-Spy
  • Automatically estimate parameter values
  • Estimate operating parameters based on
    occurrences of seed tuples
  • Run cross-validation on hold-out sets of seed
    tuples for optimal performance
  • Seed occurrences that do not have close
    neighbors are discarded

11
Example Task 1: DiseaseOutbreaks
SDM 2006
Proteus: 0.409; Snowball: 0.415
12
Example Task 2: Bioinformatics
ISMB 2003
APO-1, also known as DR6MEK4, also called
SEK1
  • 100,000 gene and protein synonyms extracted from
    50,000 journal articles
  • Approximately 40% of confirmed synonyms were not
    previously listed in the curated authoritative
    reference (SWISSPROT)

13
Snowball Used in Various Domains
  • News: NYT, WSJ, AP (DL 2000, SDM 2006)
  • CompanyHeadquarters, MergersAcquisitions,
    DiseaseOutbreaks
  • Medical literature: PDRHealth, Micromedex (Ph.D.
    Thesis)
  • AdverseEffects, DrugInteractions,
    RecommendedTreatments
  • Biological literature: GeneWays corpus (ISMB 2003)
  • Gene and Protein Synonyms

14
Limits of Bootstrapping for Extraction
CIKM 2005
  • Task is easy when context term distributions
    diverge from background
  • Quantify as relative entropy (Kullback-Leibler
    divergence)
  • After calibration, metric predicts if
    bootstrapping is likely to work
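As an illustration of the relative-entropy measure named on this slide, the sketch below computes the Kullback-Leibler divergence between a context term distribution and a background distribution. The add-one smoothing, the function name `kl_divergence`, and the toy term lists are assumptions for the example. The calibration step that turns the score into a prediction is not shown.

```python
import math
from collections import Counter


def kl_divergence(context_terms, background_terms, smoothing=1.0):
    """KL(context || background) in bits over the joint vocabulary."""
    vocab = set(context_terms) | set(background_terms)
    if not vocab:
        raise ValueError("both term lists are empty")
    ctx, bg = Counter(context_terms), Counter(background_terms)
    # Additive smoothing keeps every probability strictly positive.
    ctx_total = len(context_terms) + smoothing * len(vocab)
    bg_total = len(background_terms) + smoothing * len(vocab)
    divergence = 0.0
    for term in vocab:
        p = (ctx[term] + smoothing) / ctx_total
        q = (bg[term] + smoothing) / bg_total
        divergence += p * math.log2(p / q)
    return divergence


# Toy data: terms seen near relation mentions vs. terms from the whole collection.
context = "outbreak of virus reported outbreak confirmed".split()
background = "the market rose as the team won the game".split()

print(kl_divergence(context, background))  # clearly above zero
print(kl_divergence(context, context))     # 0.0 for identical distributions
```

A larger value means the contexts around the relation look less like ordinary text, which is the situation the slide describes as easy for bootstrapping.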

15
Extracting All Relation Instances From a Text
Database
InformationExtraction System
StructuredRelation
  • Brute force approach: feed all docs to
    information extraction system
  • Often only a tiny fraction of documents is
    useful
  • Many databases are not crawlable
  • Often a search interface is available, with
    existing keyword index
  • How to identify useful documents?

16
Accessing Text DBs via Search Engines
InformationExtraction System
Search Engine
  • Search engines impose limitations
  • Limit on documents retrieved per query
  • Support simple keywords and phrases
  • Ignore stopwords (e.g., a, is)

StructuredRelation
17
Text-Centric Task I: Information Extraction
  • Information extraction applications extract
    structured relations from unstructured text

May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire, is finding itself hard
pressed to cope with the crisis
Disease Outbreaks in The New York Times
Information Extraction System (e.g., NYU's
Proteus)
Information Extraction tutorial yesterday by
AnHai Doan, Raghu Ramakrishnan, Shivakumar
Vaithyanathan
18
Executing a Text-Centric Task
Text Database
Extraction System
  • Retrieve documents from database
  • Extract output tokens
  • Process documents
  • Two major execution paradigms:
  • Scan-based: retrieve and process documents
    sequentially
  • Index-based: query database (e.g., "case
    fatality rate"), retrieve and process
    documents in results
  • Similar to relational world: underlying data
    distribution dictates what is best

Unlike the relational world:
  • Indexes are only approximate: index is on
    keywords, not on tokens of interest
  • Choice of execution plan affects output
    completeness (not only speed)
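The two execution paradigms on this slide can be contrasted with a toy sketch. Everything here is hypothetical: the regex-based `toy_extract` stands in for a real extraction system, `keyword_search` stands in for a database's search interface, and the four documents are invented. The point is that the index-based plan processes fewer documents but can miss output that the scan finds.

```python
import re


def toy_extract(text):
    """Hypothetical extractor: words directly preceding 'outbreak' or 'epidemic'."""
    return set(re.findall(r"(\w+) (?:outbreak|epidemic)", text))


def keyword_search(documents, query, max_results):
    """Toy search interface: documents containing every query word, capped."""
    words = query.lower().split()
    hits = [doc_id for doc_id, text in documents.items()
            if all(w in text.lower().split() for w in words)]
    return hits[:max_results]


def scan_based(documents, extract):
    """Process every document sequentially."""
    tokens = set()
    for text in documents.values():
        tokens |= extract(text)
    return tokens, len(documents)


def index_based(documents, queries, extract, max_results=10):
    """Process only the documents returned by keyword queries."""
    retrieved = []
    for query in queries:
        for doc_id in keyword_search(documents, query, max_results):
            if doc_id not in retrieved:
                retrieved.append(doc_id)
    tokens = set()
    for doc_id in retrieved:
        tokens |= extract(documents[doc_id])
    return tokens, len(retrieved)


documents = {
    "d1": "Ebola epidemic in Zaire strains the response",
    "d2": "Officials report a cholera outbreak",
    "d3": "Stock markets rallied on Friday",
    "d4": "A measles outbreak closed two schools",
}

scan_tokens, scan_processed = scan_based(documents, toy_extract)
index_tokens, index_processed = index_based(documents, ["outbreak"], toy_extract)
print(scan_processed, sorted(scan_tokens))    # 4 ['Ebola', 'cholera', 'measles']
print(index_processed, sorted(index_tokens))  # 2 ['cholera', 'measles']
```

The query "outbreak" never retrieves the document that says "epidemic", so the index-based plan is faster here but less complete.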
19
QXtract: Querying Text Databases for Robust
Scalable Information EXtraction
User-Provided Seed Tuples
Query Generation
Queries
Promising Documents
Information Extraction System
Problem: learn keyword queries to retrieve
promising documents
Extracted Relation
20
Learning Queries to Retrieve Promising Documents
User-Provided Seed Tuples
  • Get document sample with likely negative and
    likely positive examples.
  • Label sample documents using information
    extraction system as oracle.
  • Train classifiers to recognize useful
    documents.
  • Generate queries from classifier model/rules.

Seed Sampling
Information Extraction System
Classifier Training
Query Generation
Queries
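The pipeline on this slide (sample, label with the extraction system as oracle, train, generate queries) can be sketched in miniature. This is not the QXtract implementation: classifier training is replaced by a simple smoothed log-odds score per term, the seed-based sampling step is skipped by passing in a ready-made sample, and `toy_extractor`, `learn_queries`, and the documents are hypothetical.

```python
import math
import re
from collections import Counter


def toy_extractor(text):
    """Hypothetical oracle: a document is useful if it yields any tuple."""
    return set(re.findall(r"(\w+) outbreak", text))


def learn_queries(sample_docs, extractor, num_queries=3):
    """Label sample documents with the extractor, then rank terms as queries."""
    useful, useless = [], []
    for text in sample_docs:
        terms = set(text.lower().split())
        (useful if extractor(text) else useless).append(terms)

    useful_df = Counter(t for doc in useful for t in doc)
    useless_df = Counter(t for doc in useless for t in doc)

    def log_odds(term):
        # Smoothed document frequencies keep both probabilities inside (0, 1).
        p = (useful_df[term] + 0.5) / (len(useful) + 1)
        q = (useless_df[term] + 0.5) / (len(useless) + 1)
        return math.log(p / (1 - p)) - math.log(q / (1 - q))

    # Highest-scoring terms first; ties broken alphabetically.
    ranked = sorted(useful_df, key=lambda t: (-log_odds(t), t))
    return ranked[:num_queries]


sample = [
    "cholera outbreak reported in the region",
    "measles outbreak spreads in schools",
    "markets rallied in early trading",
    "the team won in overtime",
]
queries = learn_queries(sample, toy_extractor, num_queries=1)
print(queries)  # ['outbreak']
```

Terms such as "in" that appear equally often in useful and useless documents score zero, so they are not proposed as queries.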
21
SIGMOD 2003 Demonstration
22
Querying Graph
[Figure: bipartite querying graph with tokens t1-t5 and documents d1-d5]
  • The querying graph is a bipartite graph,
    containing tokens and documents
  • Each token (transformed to a keyword query)
    retrieves documents
  • Documents contain tokens


23
Sizes of Connected Components
How many tuples are in the largest Core + Out?
[Figure: reachability graph components In, Core (strongly connected), and Out, with example tuple t0]
  • Conjecture:
  • Degree distribution in reachability graphs
    follows a power law.
  • Then, reachability graph has at most one giant
    component.
  • Define reachability as the fraction of tuples
    in the largest Core + Out
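The reachability measure can be sketched on a toy querying graph. The sketch assumes that an edge t → t' exists when the query for token t retrieves a document containing t', that Core is the largest strongly connected component, and that Out is everything reachable from Core. The data, the function names, and the `max_results` cap are hypothetical, and the quadratic component search is only suitable for small examples.

```python
from collections import deque


def reachability_graph(token_to_docs, doc_to_tokens, max_results):
    """Edge t -> t' if one of the first max_results documents for t contains t'."""
    graph = {token: set() for token in token_to_docs}
    for token, docs in token_to_docs.items():
        for doc in docs[:max_results]:
            for other in doc_to_tokens[doc] - {token}:
                graph.setdefault(other, set())
                graph[token].add(other)
    return graph


def reachable_from(graph, start):
    """All nodes reachable from start (including start), by breadth-first search."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reachability(graph):
    """Fraction of tokens in the largest strongly connected Core plus its Out set."""
    if not graph:
        raise ValueError("empty graph")
    reach = {t: reachable_from(graph, t) for t in graph}
    # The strongly connected component of u: nodes that u reaches and that reach u.
    core = max(({v for v in reach[u] if u in reach[v]} for u in graph), key=len)
    core_and_out = reach[next(iter(core))]  # every Core node reaches the same set
    return len(core_and_out) / len(graph)


token_to_docs = {"t1": ["d1", "d2"], "t2": ["d1"], "t3": ["d3"], "t4": ["d4"], "t5": []}
doc_to_tokens = {"d1": {"t1", "t2"}, "d2": {"t1", "t3"}, "d3": {"t3", "t4"}, "d4": {"t4"}}

print(reachability(reachability_graph(token_to_docs, doc_to_tokens, max_results=2)))  # 0.8
print(reachability(reachability_graph(token_to_docs, doc_to_tokens, max_results=1)))  # 0.4
```

In this toy data, lowering the cap on results per query removes the edge from t1 to t3, so fewer tokens remain reachable from the Core.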

24
NYT: Reachability Graph Outdegree Distribution
Matches the power-law distribution
MaxResults=10
MaxResults=50
25
NYT: Component Size Distribution
[Figure: component size distributions, reachable vs. not reachable]
MaxResults=10: CG / T = 0.297
MaxResults=50: CG / T = 0.620
26
Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
27
Estimate Cost of Retrieval Methods
SIGMOD 2006
  • Alternatives:
  • Scan, Filtered Scan, Tuples, QXtract
  • General cost model for text-centric tasks
  • Information extraction, summary construction,
    etc.
  • Estimate the expected cost of each access method
  • Parametric model describing all retrieval steps
  • Extended analysis to arbitrary degree
    distributions
  • Parameter estimates can be piggybacked at
    runtime
  • Cost estimates can be provided to a query
    optimizer for nearly optimal execution
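To illustrate how per-method cost estimates could feed a plan choice, here is a deliberately simplified sketch. It is not the cost model from the cited work. It covers only Scan and Filtered Scan, ignores output completeness, and every parameter name and value below is an assumption. The query-based methods (Tuples, QXtract) would also need a model of how many documents each query returns, which is omitted.

```python
from dataclasses import dataclass


@dataclass
class CostParams:
    num_docs: int            # documents in the database
    t_retrieve: float        # seconds to retrieve one document
    t_extract: float         # seconds to run the extractor on one document
    t_filter: float          # seconds to run the filtering classifier on one document
    filter_pass_rate: float  # fraction of documents the classifier passes on


def scan_cost(p: CostParams) -> float:
    """Every document is retrieved and processed by the extractor."""
    return p.num_docs * (p.t_retrieve + p.t_extract)


def filtered_scan_cost(p: CostParams) -> float:
    """Every document is retrieved and filtered; only some reach the extractor."""
    return p.num_docs * (p.t_retrieve + p.t_filter + p.filter_pass_rate * p.t_extract)


def cheapest_plan(p: CostParams) -> str:
    costs = {"Scan": scan_cost(p), "Filtered Scan": filtered_scan_cost(p)}
    return min(costs, key=costs.get)


params = CostParams(num_docs=1_000_000, t_retrieve=0.01, t_extract=1.0,
                    t_filter=0.05, filter_pass_rate=0.1)
print(cheapest_plan(params))  # Filtered Scan
```

A real optimizer would also weigh how much of the output each plan is expected to recover, since a cheaper plan can return fewer tuples.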

28
Optimized Execution of Text-Centric Tasks
Scan
Filtered Scan
Tuples
29
Current Research Agenda
  • Seamless, intuitive, and robust access to
    knowledge in biological and medical sources
  • Some research problems
  • Robust query processing over unstructured data
  • Intelligently interpreting user information needs
  • Text mining for bio- and medical informatics
  • Model implicit network structures
  • Entity graphs in Wikipedia
  • Protein-Protein interaction networks
  • Semantic maps of MedLine

30
Deriving Actionable Knowledge from Unstructured
(text) Data
  • Extract actionable rules from medical
    text (Medline, patient reports, ...)
  • Joint project (early stages) with medical school,
    GT
  • Epidemiology surveillance (w/ SPH)
  • Query processing over unstructured data
  • Tune extraction for query workload
  • Index structures to support effective extraction
  • Queries over extracted and native tables

31
Text Mining for Bioinformatics
  • Impossible to keep up with literature,
    experimental notes
  • Automatically update ontologies, indexes
  • Automate tedious work of post-wetlab search
  • Identify (and assign text label) DNA structures

32
Mining Text and Sequence Data
PSB 2004
ROC50 scores for each class and method