Accessing, Managing, and Mining Unstructured Data - PowerPoint PPT Presentation

Title: Accessing, Managing, and Mining Unstructured Data


1
Accessing, Managing, and Mining Unstructured Data
  • Eugene Agichtein

2
The Web
  • 20B of machine-readable text (some of it useful)
  • (Mostly) human-generated for human consumption
  • Both artificial and natural phenomenon
  • Still growing?
  • Local and global structure (links)
  • Headaches
  • Dynamic vs. static content
  • People figured out how to make money ?
  • Positives
  • Everything (almost) is on the web
  • People (eventually) can find info
  • People (on average) are not evil

3
Wait, there is more
  • Blogs, wikipedia
  • Hidden web: > 25 million databases
  • Accessible via keyword search interfaces
  • E.g., MedLine, CancerLit, USPTO, ...
  • 100x more data than surface web
  • (Transcribed) speech from
  • Genetic sequence annotations
  • Biological & Medical literature
  • Medical records, reports, alerts, 911 calls

Classified
4
Outline
  • Unstructured data (text, web, ...) is
  • Important (really!)
  • Not so unstructured
  • Main tasks/requirements and challenges
  • Example problem: query optimization for
    text-centric tasks
  • Fundamental research problems/directions

5
Unstructured data: natural language text (for
this talk)
  • Incredibly powerful and flexible means of
    communicating knowledge
  • Papers, news, web pages, lecture notes, patient
    records, shopping lists
  • Local structures: syntax
  • English syntax
  • HTML layout
  • Semantics: implicit, ambiguous, subjective
  • "I saw a man with a chainsaw"
  • Need incredibly powerful and flexible decoder

6
Some more structure
  • Explicit link structure
  • Web, Blogs, Wikipedia, citations
  • Implicit link structure
  • Co-occurrence of entities within same
    document/context implies link between entities
  • Occurrence of same entity in multiple documents
    implies link between documents
  • Physical location
  • Page primarily about Atlanta
  • User somewhere around N. Decatur Rd
  • E-mail sender is two floors down
  • More on this later
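
Note (illustrative sketch, not from the talk): the two implicit-link rules above can be written down directly. The function name `cooccurrence_links` and the toy documents are assumptions made for this example; entity recognition itself is assumed to have happened already.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_links(doc_entities):
    """Derive implicit links from a {doc_id: set_of_entities} mapping.

    Returns (entity_links, doc_links):
      entity_links[e] = entities that co-occur with e in some document
      doc_links[d]    = documents that share at least one entity with d
    """
    entity_links = defaultdict(set)
    docs_by_entity = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            docs_by_entity[entity].add(doc_id)
        # Rule 1: entities in the same document are linked to each other.
        for a, b in combinations(sorted(entities), 2):
            entity_links[a].add(b)
            entity_links[b].add(a)

    doc_links = defaultdict(set)
    # Rule 2: documents mentioning the same entity are linked to each other.
    for docs in docs_by_entity.values():
        for d1, d2 in combinations(sorted(docs), 2):
            doc_links[d1].add(d2)
            doc_links[d2].add(d1)
    return entity_links, doc_links


# Hypothetical toy collection.
docs = {
    "d1": {"Ebola", "Zaire"},
    "d2": {"Ebola", "CDC"},
    "d3": {"Malaria", "Ethiopia"},
}
entity_links, doc_links = cooccurrence_links(docs)
```

Here d1 and d2 become linked because both mention "Ebola", while d3 stays isolated.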

7
Global Problem Space
  • Crawling (accessing) the data
  • Storing (multiple versions of) data
  • Understanding the data → information
  • Indexing information
  • Integration from multiple sources
  • User-driven information retrieval
  • Exploiting unstructured data in applications
  • System-driven knowledge discovery
  • Building a nuclear/hydro/wind/... power plant

8
To Search or to Crawl? Towards a Query Optimizer
for Text-Centric Tasks [Ipeirotis, Agichtein,
Jain, Gravano, SIGMOD 2006]
  • Information extraction applications extract
    structured relations from unstructured text

Example input text:
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire, is finding itself hard
pressed to cope with the crisis

Information Extraction System (e.g., NYU's
Proteus)

Extracted relation: Disease Outbreaks in The New York Times
Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   U.K.
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire
9
An Abstract View of Text-Centric Tasks
Text Database → Extraction System → Output Tokens
  1. Retrieve documents from database
  2. Process documents
  3. Extract output tokens

Task                     Token
Information Extraction   Relation Tuple
Database Selection       Word (Frequency)
Focused Crawling         Web Page about a Topic
10
Executing a Text-Centric Task
Text Database → Extraction System → Output Tokens
  1. Retrieve documents from database
  2. Process documents
  3. Extract output tokens
  • Two major execution paradigms
  • Scan-based: retrieve and process documents
    sequentially
  • Index-based: query database (e.g., [case
    fatality rate]), retrieve and process
    documents in results

Similar to the relational world:
  • Underlying data distribution dictates what is
    best

Unlike the relational world:
  • Indexes are only approximate: index is on
    keywords, not on tokens of interest
  • Choice of execution plan affects output
    completeness (not only speed)
11
Execution Plan Characteristics
Question: How do we choose the fastest execution
plan for reaching a target recall?
Text Database → Extraction System → Output Tokens
  1. Retrieve documents from database
  2. Process documents
  3. Extract output tokens
  • Execution plans have two main characteristics
  • Execution time
  • Recall (fraction of tokens retrieved)

Example: What is the fastest plan for discovering 10% of
the disease outbreaks mentioned in The New York
Times archive?
12
Outline
  • Description and analysis of crawl- and
    query-based plans
  • Scan (crawl-based)
  • Filtered Scan (crawl-based)
  • Iterative Set Expansion (query-based, i.e., index-based)
  • Automatic Query Generation (query-based, i.e., index-based)
  • Optimization strategy
  • Experimental results and conclusions
13
Scan
Text Database → Extraction System → Output Tokens
  1. Retrieve docs from database
  2. Process documents
  3. Extract output tokens
  • Scan retrieves and processes documents
    sequentially (until reaching target recall)
  • Execution time = |Retrieved Docs| · (R + P)
  • R: time for retrieving a document
  • P: time for processing a document

Question: How many documents does Scan retrieve
to reach target recall?

Filtered Scan uses a classifier to identify and
process only promising documents (details in
paper)
14
Estimating Recall of Scan
Example token: <SARS, China>
  • Modeling Scan for token t
  • What is the probability of seeing t (with
    frequency g(t)) after retrieving S documents?
  • A sampling-without-replacement process
  • After retrieving S documents, frequency of token
    t follows hypergeometric distribution
  • Recall for token t is the probability that
    frequency of t in S docs > 0
  • Probability of seeing token t after retrieving S
    documents
  • g(t): frequency of token t
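
Note (illustrative sketch): under the hypergeometric model above, the probability of seeing token t at least once equals one minus the probability that all S retrieved documents miss it. The sketch assumes g(t) counts the documents that contain t; the function and parameter names are made up for this example.

```python
from math import comb


def token_recall(num_docs, token_freq, sample_size):
    """P(token appears at least once) after retrieving `sample_size` of
    `num_docs` documents without replacement, when `token_freq` documents
    contain the token (hypergeometric model)."""
    if sample_size > num_docs - token_freq:
        # Not enough token-free documents to fill the sample: a hit is certain.
        return 1.0
    # P(zero hits) = C(D - g, S) / C(D, S)
    p_miss = comb(num_docs - token_freq, sample_size) / comb(num_docs, sample_size)
    return 1.0 - p_miss
```

For instance, with 10 documents, a token contained in 2 of them, and 5 documents retrieved, the miss probability is C(8,5)/C(10,5) = 56/252, so the recall for that token is about 0.78.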

15
Estimating Recall of Scan
Example tokens: <SARS, China>, <Ebola, Zaire>
  • Modeling Scan
  • Multiple sampling-without-replacement
    processes, one for each token
  • Overall recall is average recall across tokens
  • → We can compute number of documents required to
    reach target recall

Execution time = |Retrieved Docs| · (R + P)
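
Note (illustrative sketch): averaging the per-token hypergeometric recall and searching for the smallest sample size that meets the target gives the number of documents Scan must retrieve. The execution time is taken here as the number of retrieved documents times the sum of the per-document retrieval time R and processing time P. All function names, and the choice of binary search, are assumptions of this example rather than the paper's procedure.

```python
from math import comb


def scan_recall(num_docs, token_freqs, sample_size):
    """Expected recall of Scan: the average, over tokens, of the probability
    that a token occurs in a random sample of `sample_size` documents."""
    total = 0.0
    for g in token_freqs:
        if sample_size > num_docs - g:
            total += 1.0
        else:
            total += 1.0 - comb(num_docs - g, sample_size) / comb(num_docs, sample_size)
    return total / len(token_freqs)


def docs_for_target_recall(num_docs, token_freqs, target):
    """Smallest sample size whose expected recall reaches `target`.
    Binary search is valid because recall never decreases with sample size."""
    lo, hi = 0, num_docs
    while lo < hi:
        mid = (lo + hi) // 2
        if scan_recall(num_docs, token_freqs, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


def scan_time(retrieved_docs, retrieve_time, process_time):
    """Execution time of Scan: retrieved documents times (R + P)."""
    return retrieved_docs * (retrieve_time + process_time)
```

With 10 documents and two tokens that each occur in a single document, the expected recall after S documents is S/10, so a target recall of 0.5 needs 5 documents.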
16
Outline
  • Description and analysis of crawl- and
    query-based plans
  • Scan (crawl-based)
  • Filtered Scan (crawl-based)
  • Iterative Set Expansion (query-based)
  • Automatic Query Generation (query-based)
  • Optimization strategy
  • Experimental results and conclusions
17
Iterative Set Expansion
Components: Text Database, Extraction System, Query Generation, Output Tokens
  1. Query database with seed tokens
  2. Process retrieved documents
  3. Extract tokens from docs
  4. Augment seed tokens with new tokens

Example token: <Malaria, Ethiopia>
Example query: [Ebola AND Zaire]
  • Execution time = |Retrieved Docs| · (R + P)
    + |Queries| · Q
  • R: time for retrieving a document
  • P: time for processing a document
  • Q: time for answering a query

Question: How many queries and how many documents
does Iterative Set Expansion need to reach target
recall?
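
Note (illustrative sketch): the Iterative Set Expansion loop can be simulated on a toy index. The dictionaries `query_index` (token → documents its keyword query returns) and `doc_tokens` (document → tokens the extraction system finds) stand in for the real search interface and extraction system; they and every function name here are assumptions of this example.

```python
from collections import deque


def iterative_set_expansion(seed_tokens, query_index, doc_tokens, max_queries=None):
    """Query with known tokens, extract new tokens from the results, repeat.
    Returns (tokens found, documents retrieved, number of queries sent)."""
    seen_tokens = set(seed_tokens)
    pending = deque(seed_tokens)          # tokens not yet sent as queries
    retrieved_docs = set()
    queries_sent = 0
    while pending and (max_queries is None or queries_sent < max_queries):
        token = pending.popleft()
        queries_sent += 1
        for doc in query_index.get(token, ()):
            if doc in retrieved_docs:
                continue                  # each document is processed once
            retrieved_docs.add(doc)
            for new_token in doc_tokens.get(doc, ()):
                if new_token not in seen_tokens:
                    seen_tokens.add(new_token)
                    pending.append(new_token)
    return seen_tokens, retrieved_docs, queries_sent


def ise_time(num_docs, num_queries, retrieve_time, process_time, query_time):
    """Execution time: docs * (R + P) + queries * Q."""
    return num_docs * (retrieve_time + process_time) + num_queries * query_time


# Hypothetical toy data.
query_index = {
    ("SARS", "China"): ["d1"],
    ("Ebola", "Zaire"): ["d2"],
    ("Malaria", "Ethiopia"): ["d3"],
}
doc_tokens = {
    "d1": [("SARS", "China"), ("Ebola", "Zaire")],
    "d2": [("Ebola", "Zaire")],
    "d3": [("Malaria", "Ethiopia")],
}
tokens, docs, queries = iterative_set_expansion(
    [("SARS", "China")], query_index, doc_tokens
)
```

Starting from the seed <SARS, China>, the loop finds <Ebola, Zaire> through d1 but never reaches <Malaria, Ethiopia>, since no retrieved document mentions it.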
18
Querying Graph
  • The querying graph is a bipartite graph,
    containing tokens and documents
  • Each token (transformed to a keyword query)
    retrieves documents
  • Documents contain tokens

Tokens                     Documents
t1  <SARS, China>          d1
t2  <Ebola, Zaire>         d2
t3  <Malaria, Ethiopia>    d3
t4  <Cholera, Sudan>       d4
t5  <H5N1, Vietnam>        d5
19
Using Querying Graph for Analysis
  • We need to compute the
  • Number of documents retrieved after sending Q
    tokens as queries (estimates time)
  • Number of tokens that appear in the retrieved
    documents (estimates recall)
  • To estimate these we need to compute the
  • Degree distribution of the tokens discovered by
    retrieving documents
  • Degree distribution of the documents retrieved by
    the tokens
  • (Not the same as the degree distribution of a
    randomly chosen token or document: it is easier
    to discover documents and tokens with high
    degrees)

(Figure: querying graph with tokens t1-t5 and documents d1-d5)

Elegant analysis framework based on generating
functions; details in the paper
20
Recall Limit: Reachability Graph
(Figure: querying graph over tokens t1-t5 and documents d1-d5, next to the
reachability graph over tokens t1-t5)
  • Edge t1 → t2: t1 retrieves document d1 that contains t2
  • Upper recall limit is determined by the size of
    the biggest connected component
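
Note (illustrative sketch): a reachability graph can be derived from a querying graph by linking token a to token b whenever a's query retrieves a document that contains b. The slide states the recall bound in terms of the biggest connected component; this sketch computes the closely related quantity for a given seed set, namely the fraction of tokens reachable from those seeds. The toy graph below is hypothetical (only the t1 → d1 → t2 link comes from the slide).

```python
def reachability_graph(query_index, doc_tokens):
    """Directed token graph: a -> b if a's query retrieves a doc containing b."""
    graph = {}
    for token, docs in query_index.items():
        targets = graph.setdefault(token, set())
        for doc in docs:
            targets.update(t for t in doc_tokens.get(doc, ()) if t != token)
    return graph


def recall_limit(graph, seeds, all_tokens):
    """Fraction of all tokens reachable from the seeds (depth-first search)."""
    reached = set(seeds)
    stack = list(seeds)
    while stack:
        token = stack.pop()
        for nxt in graph.get(token, ()):
            if nxt not in reached:
                reached.add(nxt)
                stack.append(nxt)
    return len(reached & set(all_tokens)) / len(all_tokens)


# Hypothetical querying graph.
query_index = {"t1": ["d1"], "t2": ["d2"], "t3": ["d2"], "t4": ["d4"], "t5": ["d5"]}
doc_tokens = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d4": ["t4", "t5"], "d5": ["t5"]}
graph = reachability_graph(query_index, doc_tokens)
limit = recall_limit(graph, ["t1"], ["t1", "t2", "t3", "t4", "t5"])
```

In this toy graph a seed of t1 can reach only t2 and t3, so no amount of querying pushes recall past 3 of 5 tokens.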
21
Automatic Query Generation
  • Iterative Set Expansion has recall limitation due
    to iterative nature of query generation
  • Automatic Query Generation avoids this problem by
    creating queries offline (using machine
    learning), which are designed to return documents
    with tokens

Details in the papers
22
Outline
  • Description and analysis of crawl- and
    query-based plans
  • Optimization strategy
  • Experimental results and conclusions

23
Summary of Cost Analysis
  • Our analysis so far
  • Takes as input a target recall
  • Gives as output the time for each plan to reach
    target recall (time = infinity if the plan cannot
    reach target recall)
  • Time and recall depend on task-specific
    properties of database
  • Token degree distribution
  • Document degree distribution
  • Next, we show how to estimate degree
    distributions on-the-fly
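
Note (illustrative sketch): once each plan has a predicted time for the target recall, with infinity marking plans that cannot reach it, choosing a plan reduces to taking the minimum over the feasible ones. The function name and the timings below are invented for illustration and are not results from the paper.

```python
import math


def choose_plan(predicted_times):
    """Return the plan with the smallest finite predicted time, or None
    when no plan can reach the target recall."""
    feasible = {plan: t for plan, t in predicted_times.items() if math.isfinite(t)}
    if not feasible:
        return None
    return min(feasible, key=feasible.get)


# Hypothetical predicted times (seconds) for one target recall.
predicted = {
    "Scan": 1200.0,
    "Filtered Scan": 900.0,
    "Iterative Set Expansion": math.inf,   # cannot reach the target recall
    "Automatic Query Generation": 450.0,
}
best = choose_plan(predicted)
```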

24
Estimating Cost Model Parameters
  • Token and document degree distributions belong to
    known distribution families

Task                           Document Distribution   Token Distribution
Information Extraction         Power-law               Power-law
Content Summary Construction   Lognormal               Power-law (Zipf)
Focused Resource Discovery     Uniform                 Uniform

Can characterize distributions with only a few
parameters!
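
Note (illustrative sketch): to show what "only a few parameters" can mean for a power-law family, the block below fits a single exponent to observed degrees using the standard continuous maximum-likelihood estimator, 1 + n / Σ ln(x / x_min). This estimator is swapped in for illustration and is not claimed to be the one used in the paper; for integer degrees it is only an approximation. The function name and the `x_min` default are assumptions.

```python
import math


def estimate_power_law_exponent(degrees, x_min=1.0):
    """Continuous maximum-likelihood estimate of a power-law exponent,
    using only observations that are at least `x_min`."""
    tail = [x for x in degrees if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if not tail or log_sum == 0:
        raise ValueError("need observations above x_min to fit an exponent")
    return 1.0 + len(tail) / log_sum
```

A heavier tail (more large degrees) yields a smaller estimated exponent.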
25
Parameter Estimation
  • Naïve solution for parameter estimation
  • Start with separate, parameter-estimation phase
  • Perform random sampling on database
  • Stop when cross-validation indicates high
    confidence
  • We can do better than this!
  • No need for separate sampling phase
  • Sampling is equivalent to executing the task
  • → Piggyback parameter estimation into execution

26
On-the-fly Parameter Estimation
Correct (but unknown) distribution
  • Pick most promising execution plan for target
    recall assuming default parameter values
  • Start executing task
  • Update parameter estimates during execution
  • Switch plan if updated statistics indicate so
  • Important
  • Only Scan acts as random sampling
  • All other execution plans need parameter
    adjustment (see paper)

27
Outline
  • Description and analysis of crawl- and
    query-based plans
  • Optimization strategy
  • Experimental results and conclusions

28
Correctness of Theoretical Analysis
  • Task: Disease Outbreaks
  • Snowball IE system
  • 182,531 documents from NYT
  • 16,921 tokens
  • Solid lines: actual time
  • Dotted lines: predicted time with correct
    parameters

29
Experimental Results (Information Extraction)
  • Solid lines: actual time
  • Green line: time with optimizer
  • (Results similar in other experiments; see
    paper)

30
Conclusions
  • Common execution plans for multiple text-centric
    tasks
  • Analytic models for predicting execution time and
    recall of various crawl- and query-based plans
  • Techniques for on-the-fly parameter estimation
  • Optimization framework picks on-the-fly the
    fastest plan for target recall

31
Global Problem Space
  • Crawling (accessing) the data
  • Understanding the data → information
  • Indexing information
  • Integration from multiple sources
  • User-driven information retrieval
  • Exploiting unstructured data in applications
  • System-driven knowledge discovery

32
Some Research Directions
  • Modeling explicit and implicit network structures
  • Modeling evolution of explicit structure on web,
    blogspace, wikipedia
  • Modeling implicit link structures in text,
    collections, web
  • Exploiting implicit & explicit social networks
    (e.g., for epidemiology)
  • Knowledge Discovery from Biological and Medical
    Data
  • Automatic sequence annotation → bioinformatics,
    genetics
  • Actionable knowledge extraction from medical
    articles
  • Robust information extraction, retrieval, and
    query processing
  • Integrating information in structured and
    unstructured sources
  • Robust search/question answering for medical
    applications
  • Confidence estimation for extraction from text
    and other sources
  • Detecting reliable signals from (noisy) text data
    (e.g., medical surveillance)
  • Accuracy (!authority) of online sources
  • Information diffusion/propagation in online
    sources
  • Information propagation on the web
  • In collaborative sources (wikipedia, MedLine)

33
Page Quality: In Search of an Unbiased Web
Ranking [Cho, Roy, Adams, SIGMOD 2005]
  • "Popular pages tend to get even more popular,
    while unpopular pages get ignored by an average
    user"

34
Sic Transit Gloria Telae: Towards an
Understanding of the Web's Decay [Bar-Yossef,
Broder, Kumar, Tomkins, WWW 2004]
35
Modeling Social Networks for
  • Epidemiology, security, ...

Email exchange mapped onto cubicle locations.
36
Some Research Directions
  • Modeling explicit and implicit network structures
  • Modeling evolution of explicit structure on web,
    blogspace, wikipedia
  • Modeling implicit link structures in text,
    collections, web
  • Exploiting implicit & explicit social networks
    (e.g., for epidemiology)
  • Knowledge Discovery from Biological and Medical
    Data
  • Automatic sequence annotation → bioinformatics,
    genetics
  • Actionable knowledge extraction from medical
    articles
  • Robust information extraction, retrieval, and
    query processing
  • Integrating information in structured and
    unstructured sources
  • Query processing over unstructured text
  • Robust search/question answering for medical
    applications
  • Confidence estimation for extraction from text
    and other sources
  • Detecting reliable signals from (noisy) text data
    (e.g., medical surveillance)
  • Information diffusion/propagation in online
    sources
  • Information propagation on the web
  • In collaborative sources (wikipedia, MedLine)

37
Applying Text Mining for Bioinformatics
ISMB 2003
Examples: "APO-1, also known as DR6"; "MEK4, also called
SEK1"
  • 100,000 gene and protein synonyms extracted from
    50,000 journal articles
  • Approximately 40% of confirmed synonyms not
    previously listed in curated authoritative
    reference (SWISSPROT)

38
Examples of Entity-Relationship Extraction
We show that CBF-A and CBF-C interact with each
other to form a CBF-A-CBF-C complex and that
CBF-B does not interact with CBF-A or CBF-C
individually but that it associates with the
CBF-A-CBF-C complex.
39
Another Example
Z-100 is an arabinomannan extracted from
Mycobacterium tuberculosis that has various
immunomodulatory activities, such as the
induction of interleukin 12, interferon gamma
(IFN-gamma) and beta-chemokines. The effects of
Z-100 on human immunodeficiency virus type 1
(HIV-1) replication in human monocyte-derived
macrophages (MDMs) are investigated in this
paper. In MDMs, Z-100 markedly suppressed the
replication of not only macrophage-tropic
(M-tropic) HIV-1 strain (HIV-1JR-CSF), but also
HIV-1 pseudotypes that possessed amphotropic
Moloney murine leukemia virus or vesicular
stomatitis virus G envelopes. Z-100 was found to
inhibit HIV-1 expression, even when added 24 h
after infection. In addition, it substantially
inhibited the expression of the pNL43lucDeltaenv
vector (in which the env gene is defective and
the nef gene is replaced with the firefly
luciferase gene) when this vector was transfected
directly into MDMs. These findings suggest that
Z-100 inhibits virus replication, mainly at HIV-1
transcription. However, Z-100 also downregulated
expression of the cell surface receptors CD4 and
CCR5 in MDMs, suggesting some inhibitory effect
on HIV-1 entry. Further experiments revealed that
Z-100 induced IFN-beta production in these cells,
resulting in induction of the 16-kDa
CCAAT/enhancer binding protein (C/EBP) beta
transcription factor that represses HIV-1 long
terminal repeat transcription. These effects were
alleviated by SB 203580, a specific inhibitor of
p38 mitogen-activated protein kinases (MAPK),
indicating that the p38 MAPK signalling pathway
was involved in Z-100-induced repression of HIV-1
replication in MDMs. These findings suggest that
Z-100 might be a useful immunomodulator for
control of HIV-1 infection.
40
AliBaba, Ulf Leser, http://wbi.informatik.hu-berlin.de:8080/
Query
Extracted info
PubMed visualized
Links to databases
41
Mining Text and Sequence Data
[Agichtein & Eskin, PSB 2004]
ROC50 scores for each class and method
42
Some Research Directions
  • Modeling explicit and implicit network structures
  • Modeling evolution of explicit structure on web,
    blogspace, wikipedia
  • Modeling implicit link structures in text,
    collections, web
  • Exploiting implicit & explicit social networks
    (e.g., for epidemiology)
  • Knowledge Discovery from Biological and Medical
    Data
  • Automatic sequence annotation → bioinformatics,
    genetics
  • Actionable knowledge extraction from medical
    articles
  • Robust information extraction, retrieval, and
    query processing
  • Integrating information in structured and
    unstructured sources
  • Robust search/question answering for medical
    applications
  • Confidence estimation for extraction from text
    and other sources
  • Detecting reliable signals from (noisy) text data
    (e.g., medical surveillance)
  • Accuracy (!authority) of online sources
  • Information diffusion/propagation in online
    sources
  • Information propagation on the web
  • In collaborative sources (wikipedia, MedLine)

43
Structure and evolution of blogspace [Kumar,
Novak, Raghavan, Tomkins, CACM 2004, KDD 2006]
Fraction of nodes in components of various sizes
within Flickr and Yahoo! 360 timegraph, by week.
44
Connected Components Visualization
Structure of implicit entity-entity networks in
text [Agichtein & Gravano, ICDE 2003]
DiseaseOutbreaks, New York Times 1995
45
Some Research Directions
  • Modeling explicit and implicit network structures
  • Modeling evolution of explicit structure on web,
    blogspace, wikipedia
  • Modeling implicit link structures in text,
    collections, web
  • Exploiting implicit & explicit social networks
    (e.g., for epidemiology)
  • Knowledge Discovery from Biological and Medical
    Data
  • Automatic sequence annotation → bioinformatics,
    genetics
  • Actionable knowledge extraction from medical
    articles
  • Robust information extraction, retrieval, and
    query processing
  • Integrating information in structured and
    unstructured sources
  • Robust search/question answering for medical
    applications
  • Confidence estimation for extraction from text
    and other sources
  • Detecting reliable signals from (noisy) text data
    (e.g., medical surveillance)
  • Accuracy (!authority) of online sources
  • Information diffusion/propagation in online
    sources
  • Information propagation on the web, news
  • In collaborative sources (wikipedia, MedLine)

46
Thank You
  • Details
  • http://www.mathcs.emory.edu/eugene/