On Beyond Hypertext: Searching in Graphs Containing Documents, Words, and Data presentation

Transcript and Presenter's Notes

1
On Beyond Hypertext: Searching in Graphs
Containing Documents, Words, and Data

William W. Cohen
Center for Automated Learning and Discovery
Language Technologies Institute
Center for Bioimage Informatics
Joint CMU-Pitt Program in Bioinformatics
Carnegie Mellon University

2
On Beyond Hypertext: Searching in Graphs
Containing Documents, Words, and Data

William W. Cohen
Machine Learning Department
Language Technologies Institute
Center for Bioimage Informatics
Joint CMU-Pitt Program in Bioinformatics
Carnegie Mellon University

3
On Beyond Hypertext: Searching in Graphs
Containing Documents, Words, and Data

William W. Cohen
Carnegie Mellon University
joint work with
Einat Minkov (CMU)
Andrew Ng (Stanford)

4
Outline

Motivation: why I'm interested in
structured data that is partly text
structured data represented as graphs
measuring similarity of nodes in graphs
Contributions:
a simple query language for graphs
experiments on natural types of queries
techniques for learning to answer queries of a certain type better

5
"A Little Knowledge is a Dangerous Thing" (A. Pope, 1709)

Three centuries later, we've learned that a lot of knowledge is also sort of dangerous....
... so how do we deal with information overload?

6
One approach: adding structure to unstructured
information
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
... by recognizing entity names... ... and
relationships between them...
7
One approach: adding structure to unstructured information
[Carvalho & Cohen, SIGIR 2005] [Cohen, Carvalho & Mitchell, EMNLP 2004]
8
One approach: adding structure to unstructured information
[Mitchell et al., CEAS 2004]
9
One approach: adding structure to unstructured information
10
One approach: adding structure to unstructured information
[McCallum et al., IJCAI 2005]
17
Is converting unstructured data to structured
data enough?
18
Limitations of structured data
What is the email address for the person named Halevy mentioned in this presentation?
What files from my home machine will I need for this meeting?
What people will attend this meeting?
...

Diversity: many different types of information from many different sources, which arise to fill many different needs.
Uncertainty: information from many sources (like IE programs or the web) need not be correct.
Complexity of interaction: formulating information needs as queries to a DB can be difficult, especially for a heterogeneous DB with a complex or changing schema.

How do you discover and access the tens or hundreds of structured databases?
How do you understand and combine the hundreds of schemata, with thousands of fields?
How do you relate the thousands or millions or ... of entity identifiers from the different databases?

How can you include many diverse sources of information in a single database?
19
When are two entities the same? When is
referent(oid1) = referent(oid2)?

Bell Labs
Bell Telephone Labs
AT&T Bell Labs
AT&T Labs
AT&T Labs-Research
AT&T Labs Research, Shannon Laboratory
Shannon Labs
Bell Labs Innovations
Lucent Technologies/Bell Labs Innovations

History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers. [www.research.att.com]
Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925. [bell-labs.com]
20
Is there a definition of entity identity that
is user- and purpose-independent?

?
Bell Telephone Labs

21
When are two entities the same?
Buddhism rejects the key element in folk psychology: the idea of a self (a unified personal identity that is continuous through time).
King Milinda and Nagasena (the Buddhist sage) discuss personal identity. Milinda gradually realizes that "Nagasena" (the word) does not stand for anything he can point to: not the hairs on Nagasena's head, nor the hairs of the body, nor the "nails, teeth, skin, muscles, sinews, bones, marrow, kidneys, ..." etc. Milinda concludes that "Nagasena" doesn't stand for anything.
If we can't say what a person is, then how do we know a person is the same person through time?
There's really no you, and if there's no you, there are no beliefs or desires for you to have. The folk psychology picture is profoundly misleading and believing it will make you miserable. -S. LaFave
22
Traditional approach
Linkage
Queries
Uncertainty about what to link must be decided by
the integration system, not the end user
23
WHIRL vision
Strongest links: those agreeable to most users
Weaker links: those agreeable to some users
even weaker links
24
WHIRL vision
DB1 + DB2 ≠ DB
Link items as needed by Q.
Incrementally produce a ranked list of possible links, with best matches first. The user (or a downstream process) decides how much of the list to generate and examine.
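
A minimal sketch of the ranked-list idea, assuming a plain TF-IDF cosine score over whitespace tokens and an exhaustive pairwise comparison (WHIRL itself uses a smarter search; this only illustrates the interface). The function names `tfidf_vectors` and `ranked_links` and the two tiny name lists are hypothetical.

```python
import heapq
import math
from collections import Counter
from itertools import islice


def tfidf_vectors(names):
    """Map each name to a unit-length TF-IDF vector over lowercase tokens."""
    docs = [Counter(n.lower().split()) for n in names]
    df = Counter(tok for d in docs for tok in d)
    vectors = []
    for d in docs:
        v = {t: tf * math.log(1 + len(names) / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({t: w / norm for t, w in v.items()})
    return vectors


def ranked_links(db1, db2):
    """Yield (score, name1, name2) candidate links, best match first."""
    vecs = tfidf_vectors(db1 + db2)
    v1, v2 = vecs[:len(db1)], vecs[len(db1):]
    heap = []
    for i, a in enumerate(v1):
        for j, b in enumerate(v2):
            # cosine similarity of two unit vectors = dot product
            score = sum(w * b[t] for t, w in a.items() if t in b)
            if score > 0:
                heap.append((-score, i, j))
    heapq.heapify(heap)
    while heap:  # the caller decides how far down the list to read
        neg, i, j = heapq.heappop(heap)
        yield -neg, db1[i], db2[j]


# Hypothetical inputs: two small lists of organization names.
db1 = ["Bell Labs", "AT&T Labs Research"]
db2 = ["Bell Telephone Labs", "AT&T Labs", "Shannon Labs"]
top3 = list(islice(ranked_links(db1, db2), 3))
```

Because `ranked_links` is a generator, a user or downstream process can stop after any number of links; here `islice` takes only the first three.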
25
Outline

Motivation: why I'm interested in
structured data that is partly text: similarity!
structured data represented as graphs
measuring similarity of nodes in graphs
Contributions:
a simple query language for graphs
experiments on natural types of queries
techniques for learning to answer queries of a certain type better

There are general-purpose, fast, robust similarity measures for text, which are useful for data integration, and hence for combining information from multiple sources.
26
Limitations of structured data
What is the email address for the person named Halevy mentioned in this presentation?
What files from my home machine will I need for this meeting?
What people will attend this meeting?
...

Diversity: many different types of information from many different sources, which arise to fill many different needs.
Uncertainty: information from many sources (like IE programs or the web) need not be correct.
Complexity of interaction: formulating information needs as queries to a DB can be difficult, especially for a heterogeneous one.

How can you exploit structure without understanding the structure?
27
Schema-free structured search

DataSpot (DTL)/Mercado Intuifind [VLDB 98]
Proximity Search [VLDB 98]
Information units (linked Web pages) [WWW10]
Microsoft DBExplorer, Microsoft English Query
BANKS (Browsing ANd Keyword Search) [Chakrabarti & others, VLDB 02, VLDB 05]

28
BANKS Basic Data Model

Database is modeled as a graph
Nodes = tuples
Edges = references between tuples
edges are directed
foreign key, inclusion dependencies, ...

User need not know organization of database to formulate queries.
[Figure: example graph with paper nodes ("MultiQuery Optimization", "BANKS Keyword search"), writes nodes, and author nodes (S. Sudarshan, Prasan Roy, Charuta)]
29
BANKS Answer to Query
Query: "sudarshan roy". Answer: subtree from graph
[Figure: answer subtree with the paper node "MultiQuery Optimization", two writes nodes, and the author nodes S. Sudarshan and Prasan Roy]
30
BANKS Basic Data Model

Database is modeled as a graph
Nodes = tuples
Edges = references between tuples
edges are directed
foreign key, inclusion dependencies, ...

31
BANKS Basic Data Model
(not quite so basic)

All information is modeled as a graph
Nodes = tuples or documents or strings or words
Edges = references between nodes
edges are directed, labeled and weighted
foreign key, inclusion dependencies, ...
doc/string D to word contained by D (TFIDF weighted, perhaps)
word W to doc/string containing W (inverted index)
string S to strings similar to S
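
A minimal sketch of how text could be poured into such a graph, assuming a nested-dict layout `{source: {label: {target: weight}}}` and a simple smoothed IDF. The function name `build_text_graph`, the node-name prefixes, and the choice of `hasTerm` / `inFile` as labels for these two edge types are illustrative assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def build_text_graph(docs):
    """Return {source: {label: {target: weight}}} for a dict of id -> text."""
    graph = defaultdict(lambda: defaultdict(dict))
    bags = {d: Counter(re.findall(r"\w+", text.lower()))
            for d, text in docs.items()}
    df = Counter(w for bag in bags.values() for w in bag)
    for doc_id, bag in bags.items():
        for word, tf in bag.items():
            idf = math.log(1 + len(docs) / df[word])
            word_node = "term:" + word
            # doc/string D -> word contained by D, TF-IDF weighted
            graph[doc_id]["hasTerm"][word_node] = tf * idf
            # word W -> doc/string containing W (the inverted index)
            graph[word_node]["inFile"][doc_id] = 1.0
    return graph


# Hypothetical string nodes: two spellings of the same name.
docs = {
    "string:1": "William W. Cohen, CMU",
    "string:2": "Dr. W. W. Cohen",
}
g = build_text_graph(docs)
```

The two strings share the word nodes `term:w` and `term:cohen`, so each is reachable from the other by a two-step walk even without an explicit string-to-string similarity edge.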

32
Outline

Motivation: why I'm interested in
structured data that is partly text: similarity!
structured data represented as graphs: all sorts of information can be poured into this model.
measuring similarity of nodes in graphs
Contributions:
a simple query language for graphs
experiments on natural types of queries
techniques for learning to answer queries of a certain type better

33
Yet another schema-free query language

Assume data is encoded in a graph with:
a node for each object x
a type of each object x, T(x)
an edge for each binary relation r: x → y
Queries are of this form:
Given type t and node x, find y: T(y) = t and y ~ x.
We'd like to construct a general-purpose similarity function x ~ y for objects in the graph
We'd also like to learn many such functions for different specific tasks (like who should attend a meeting)

Node similarity
34
Similarity of Nodes in Graphs

Given type t and node x, find y: T(y) = t and y ~ x.
Similarity defined by a damped version of PageRank
Similarity between nodes x and y:
Random surfer model: from a node z,
with probability α, stop and output z
pick an edge label r using Pr(r | z) ... e.g. uniform
pick a y uniformly from { y : z → y with label r }
repeat from node y ....
Similarity x ~ y = Pr(output y | start at x)
Intuitively, x ~ y is the summation of the weights of all paths from x to y, where the weight of a path decreases exponentially with length.
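
The random surfer model above can be sketched as a truncated probability propagation. This is an illustration under stated assumptions, not the GHIRL implementation: the graph is a dict `{node: {label: [targets]}}`, a node with no outgoing edges forces a stop, and any probability mass still walking after `max_steps` is dropped. The toy email graph and all node names are hypothetical.

```python
from collections import defaultdict


def walk_similarity(graph, start, alpha=0.5, max_steps=10):
    """Approximate Pr(output y | start at x) for a damped random walk."""
    output = defaultdict(float)
    mass = {start: 1.0}
    for _ in range(max_steps):
        next_mass = defaultdict(float)
        for z, p in mass.items():
            edges = {r: ys for r, ys in graph.get(z, {}).items() if ys}
            if not edges:
                output[z] += p  # assumption: a dead end forces a stop
                continue
            output[z] += alpha * p  # stop and output z
            for ys in edges.values():  # uniform Pr(r | z) over labels
                share = (1 - alpha) * p / len(edges) / len(ys)
                for y in ys:  # uniform over targets of that label
                    next_mass[y] += share
        mass = next_mass
    return dict(output)


def query(graph, node_type, x, t, **kw):
    """Given type t and node x, return nodes y of type t ranked by score."""
    scores = walk_similarity(graph, x, **kw)
    hits = [(y, s) for y, s in scores.items() if node_type.get(y) == t]
    return sorted(hits, key=lambda ys: -ys[1])


# Hypothetical toy graph: a name term, two messages, two people.
graph = {
    "term:andy": {"in_file": ["file1", "file2"]},
    "file1": {"sent_from": ["person:andrew"], "has_term": ["term:andy"]},
    "file2": {"sent_from": ["person:bob"], "sent_to": ["person:andrew"],
              "has_term": ["term:andy"]},
}
node_type = {"term:andy": "term", "file1": "file", "file2": "file",
             "person:andrew": "person", "person:bob": "person"}
ranked = query(graph, node_type, "term:andy", "person", alpha=0.5, max_steps=3)
```

In this toy graph `person:andrew` is reachable through both messages and `person:bob` through only one, so `person:andrew` ranks first.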

35
BANKS Basic Data Model
(not quite so basic)

All information is modeled as a graph
Nodes = tuples or documents or strings or words
Edges = references between nodes
edges are directed, labeled and weighted
foreign key, inclusion dependencies, ...
doc/string D to word contained by D (TFIDF weighted, perhaps)
word W to doc/string containing W (inverted index)
string S to strings similar to S
optional: strings that are similar in TFIDF/cosine distance will still be nearby in the graph (connected by many paths of length 2)

[Figure: the string nodes "William W. Cohen, CMU" and "Dr. W. W. Cohen" connected through the word nodes cohen, william, w, cmu, and dr]
36
Similarity of Nodes in Graphs

Random surfer on graphs:
natural extension to PageRank
closely related to Lafferty's heat diffusion kernel
but generalized to directed graphs
somewhat amenable to learning parameters of the walk (gradient search, with various optimization metrics):
Toutanova, Manning & Ng, ICML 2004
Nie et al., WWW 2005
Xi et al., SIGIR 2005
can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering
our current implementation (GHIRL): Lucene + Sleepycat, with extensive use of memory caching (sampling approaches visit many nodes repeatedly)

37
Query: "sudarshan roy". Answer: subtree from graph
[Figure: answer subtree with the paper node "MultiQuery Optimization", two writes nodes, and the author nodes S. Sudarshan and Prasan Roy]
38
y: paper(y), y ~ roy
w: paper(y), w ~ roy
AND
Query: "sudarshan roy". Answer: subtree from graph
39
Evaluation on Personal Information Management Tasks
[Minkov et al., SIGIR 2006]
Many tasks can be expressed as simple, non-conjunctive search queries in this framework.

Such as:
Person Name Disambiguation in Email
Threading
Finding email-address aliases given a person's name
Finding relevant meeting attendees

What is the email address for the person named Halevy mentioned in this presentation?
What files from my home machine will I need for this meeting?
What people will attend this meeting?
...
novel
e.g. Diehl, Getoor & Namata, 2006
e.g. Lewis & Knowles 97
novel
Also consider a generalization: x → Vq, where Vq is a distribution over nodes x
novel
40
Email as a graph
[Figure: email graph. Node types: file (file1, file2), email address (1 to 5), person name (1 to 5), date (date1, date2), and term (term1 to term11). Edge labels: sent_from, sent_to, sent_date, alias, in_file, in_subj, 1_day, and the inverse edges sf_inv, st_inv, sd_inv, a_inv, if_inv, is_inv.]
41
Person Name Disambiguation
Q: "who is Andy?"
Given: a term that is not mentioned "as is" in the header (otherwise, easy), and that is known to be a personal name
Output: ranked person nodes.
[Figure: walk from the node term:andy through file nodes to Person nodes, one of them Person: Andrew Johns]

This task is complementary to person name annotation in email (E. Minkov, R. Wang, W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, HLT/EMNLP 2005)
42
Corpora and Datasets
a. Corpora
Example nicknames: Dave for David, Kai for Keiko, Jenny for Qing
b. Types of names
43
Person Name Disambiguation

1. Baseline: string matching (+ common nicknames)
Find persons that are similar to the name term (Jaro)
Successful in many cases
Not successful for some nicknames
Cannot handle ambiguity (arbitrary)

2. Graph walk: term
Vq = name term node (2 steps)
Models co-occurrences.
Cannot handle ambiguity (dominant)

3. Graph walk: term + file
Vq = name term + file nodes (2 steps)
The file node is naturally available context
Solves the ambiguity problem!
But, incorporates additional noise.

4. Graph walk: term + file, reranked using learning
Re-rank the output of (3), using:
path-describing features
source count: do the paths originate from a single or two source nodes
string similarity

44
Results
45
Results
after learning-to-rank
graph walk from name,file
graph walk from name
baseline string match, nicknames
46
Results
Enron execs
47
Results
48
Learning

There is no single best measure of similarity
How can you learn how to better rank graph nodes for a particular task?
Learning methods for graph walks:
The parameters can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005)
We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning)
Features of candidate answer y describe the set of paths from query x to y

49
Re-ranking overview

Boosting-based reranking, following (Collins and Koo, Computational Linguistics, 2005)
A training example includes:
a ranked list of l_i nodes
each node is represented through m features
at least one known correct node
Scoring function: a linear combination of features, combined with the original score y ~ x [formula missing from transcript]
Find w that minimizes (boosted version) [formula missing from transcript]
Requires binary features, and has a closed-form formula to find the best feature and delta in each iteration.
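
The slide's formulas did not survive in the transcript. The sketch below shows one plausible shape for them, assuming the exponential loss used in Collins and Koo style boosting-based reranking: the score is the log of the original walk probability plus a weighted sum of binary features, and the loss sums exp(-(score of correct node - score of other node)) over the ranked list. The dictionary layout, the weight `w0`, and the feature names are hypothetical.

```python
import math


def score(candidate, weights, w0=1.0):
    """Original log-score plus a weighted sum of binary feature indicators."""
    return w0 * math.log(candidate["walk_prob"]) + sum(
        weights.get(f, 0.0) for f in candidate["features"])


def exp_loss(examples, weights, w0=1.0):
    """Sum of exp(-(F(correct) - F(other))) over every incorrect candidate."""
    total = 0.0
    for correct, others in examples:
        f_correct = score(correct, weights, w0)
        total += sum(math.exp(-(f_correct - score(o, weights, w0)))
                     for o in others)
    return total


# Hypothetical example: the graph walk ranks the wrong node first.
correct = {"walk_prob": 0.1, "features": {"bigram:in_file>sent_from"}}
other = {"walk_prob": 0.2, "features": {"unigram:in_file"}}
examples = [(correct, [other])]

loss_before = exp_loss(examples, {})
learned = {"bigram:in_file>sent_from": 1.0}
loss_after = exp_loss(examples, learned)
```

Putting weight on a feature that fires only for the correct node lowers the loss and, in this example, moves the correct node above the one the walk preferred.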
50
Path describing Features

The set of paths to a target node in step k is recovered in full.

Edge unigram features: was edge type l used in reaching x from Vq?
Edge bigram features: were edge types l1 and l2 used (in that order) in reaching x from Vq?
Top edge bigram features: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest scoring paths?

[Figure: walk over nodes x1 to x5 at steps K=0, K=1, K=2]
Paths (x3, k=2):
x2 → x1 → x3
x4 → x1 → x3
x2 → x2 → x3
x2 → x3
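
A minimal sketch of the three feature families, assuming each recovered path is given as a score plus its sequence of edge labels. The function name `path_features`, the feature-name prefixes, and the example paths are hypothetical.

```python
def path_features(paths, top_k=2):
    """paths: list of (score, [edge labels]) reaching one candidate node."""
    feats = set()
    for _, labels in paths:
        # edge unigrams: every edge type used on any path
        feats.update("unigram:" + l for l in labels)
        # edge bigrams: consecutive edge types, in order
        feats.update("bigram:%s>%s" % (a, b)
                     for a, b in zip(labels, labels[1:]))
    # top edge bigrams: bigrams restricted to the highest scoring paths
    best = sorted(paths, key=lambda p: -p[0])[:top_k]
    for _, labels in best:
        feats.update("top_bigram:%s>%s" % (a, b)
                     for a, b in zip(labels, labels[1:]))
    return feats


# Hypothetical paths from the query nodes to one candidate node.
paths = [
    (0.05, ["in_file", "sent_from"]),
    (0.02, ["in_file", "sent_to"]),
    (0.01, ["in_file", "in_subj"]),
]
feats = path_features(paths, top_k=2)
```

All features are binary (present or absent in the returned set), which matches the requirement stated on the re-ranking slide.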

51
Results
52
Threading

Threading is an interesting problem, because:
There are often irregularities in thread structural information, thus thread discourse should be captured using an intelligent approach (D.E. Lewis and K.A. Knowles, Threading email: A preliminary study, Information Processing and Management, 1997)
Threading information can improve message categorization into topical folders (B. Klimt and Y. Yang, The Enron corpus: A new dataset for email classification research, ECML, 2004)
Adjacent messages in a thread can be assumed to be most similar to each other in the corpus. Therefore, threading is related to the general problem of finding similar messages in a corpus.

The task: given a message, retrieve adjacent messages in the thread
53
Some intuition?
file x
54
Some intuition?
file x
Shared content
55
Some intuition?
file x
Shared content
Social network
56
Some intuition?
file x
Shared content
Social network
Timeline
57
Threading experiments

1. Baseline: TF-IDF similarity
Consider all the available information (header + body) as text

2. Graph walk: uniform
Start from the file node, 2 steps, uniform edge weights

3. Graph walk: random
Start from the file node, 2 steps, random edge weights (best out of 10)

4. Graph walk: reranked
Rerank the output of (3) using the graph-describing features

58
Results

Highly-ranked edge-bigrams:
sent-from → sent-to^-1
date-of → date-of^-1
has-term → has-term^-1

59
Finding email-aliases given a name
Given: a person's name (term)
Retrieve: the full set of relevant email-addresses (email-address)
60
Finding Meeting Attendees

Extended graph contains 2 months of calendar data

61
Main Contributions

Presented an extended similarity measure
incorporating non-textual objects
Finite lazy random walks to perform typed search
A re-ranking paradigm to improve on graph walk
results
Instantiation of this framework for email
Defined and evaluated novel tasks for email

62
Another Task that Can be Formulated as a Graph Query: GeneId-Ranking

Given:
a biomedical paper abstract
Find:
the geneId for every gene mentioned in the abstract
Method:
from paper x, ranked list of geneId y: x ~ y
Background resources:
a synonym list: geneId → { name1, name2, ... }
one or more protein NER systems
training/evaluation data: pairs of (paper, {geneId1, ...., geneIdn})

63
Sample abstracts and synonyms

MGI:96273: Htr1a; 5-hydroxytryptamine (serotonin) receptor 1A; 5-HT1A receptor
MGI:104886: Gpx5; glutathione peroxidase 5; Arep
...
52,000 for mouse, 35,000 for fly

[Figure: sample abstracts marked with true labels and NER extractor output]
64
Graph for the task....
[Figure: graph with five node groups. Abstracts: file:doc115, ... Proteins: CA1, HT1A, HT1, ... Terms: term:HT, term:1, term:A, term:CA, term:hippocampus, ... Synonyms: 5-HT1A receptor, Htr1a, eIF-1A, ... GeneIds: MGI:95298, MGI:46273, ... Edge labels: hasProtein, hasTerm, inFile, synonym.]
65
[Figure: the same graph (abstracts, proteins, terms, synonyms, geneIds; edge labels hasProtein, hasTerm, inFile, synonym), extended with noisy training abstracts: file:doc214, file:doc523, file:doc6273, ...]
66
Experiments

Data: BioCreative Task 1B
mouse: 10,000 train abstracts; 250 devtest, using the first 150 for now; 50,000 geneIds; graph has 525,000 nodes
NER systems:
likelyProtein: trained on yapex.train using off-the-shelf NER systems (Minorthird)
possibleProtein: same, but modified (on yapex.test) to optimize F3, not F1 (rewards recall over precision)

67
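Experiments with NER
[Table: NER results for likelyProtein and possibleProtein on yapex.test, and for likelyProtein, possibleProtein, and dictionary on mouse; values not in transcript]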
68
Experiments with Graph Search

Baseline method:
extract entities of type x
for each string of type x, find best-matching synonym, and then its geneId
consider only synonyms sharing >1 token
SoftTFIDF distance
break ties randomly
rank geneIds by number of times they are reached
rewards multiple mentions (even via alternate synonyms)
Evaluation:
average, over 50 test documents, of non-interpolated average precision (plausible for curators)
max F1 over all cutoffs
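
A minimal sketch of the vote-counting step and the two evaluation measures, assuming the mention-to-geneId lookup is supplied as a function (here a plain dictionary stands in for the SoftTFIDF synonym match, and the token-sharing filter is omitted). Unlike the slide, ties are kept in first-seen order rather than broken randomly. All names and ids below are hypothetical.

```python
from collections import Counter


def rank_by_votes(mentions, best_gene_for):
    """Rank geneIds by how many extracted mentions map to them."""
    votes = Counter()
    for m in mentions:
        gene = best_gene_for(m)
        if gene is not None:
            votes[gene] += 1
    # sorted() is stable, so ties keep first-seen order (not random)
    return [g for g, _ in sorted(votes.items(), key=lambda kv: -kv[1])]


def average_precision(ranked, relevant):
    """Non-interpolated average precision of one ranked list."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k  # precision at each relevant rank
    return total / len(relevant) if relevant else 0.0


def max_f1(ranked, relevant):
    """Best F1 over all cutoffs k of the ranked list."""
    best, hits = 0.0, 0
    for k, item in enumerate(ranked, start=1):
        hits += item in relevant
        if hits:
            p, r = hits / k, hits / len(relevant)
            best = max(best, 2 * p * r / (p + r))
    return best


# Hypothetical lookup table standing in for the synonym matcher.
lookup = {"HT1A": "geneA", "5-HT1A receptor": "geneA", "CA1": "geneB"}
mentions = ["HT1A", "5-HT1A receptor", "CA1", "HT1A", "unknown"]
ranked = rank_by_votes(mentions, lookup.get)
ap = average_precision(ranked, {"geneA", "geneC"})
f1 = max_f1(ranked, {"geneA", "geneC"})
```

Averaging `average_precision` over the test documents gives the first measure on the slide; `max_f1` gives the second for a single document.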

69
Experiments with Graph Search
70
Baseline vs Graphwalk

Baseline includes:
softTFIDF distances from NER entity to gene synonyms
knowledge that the shortcut path doc → entity → synonym → geneId is important
Graph includes:
IDF effects, correlations, training data, etc.
Proposed graph extension:
add softTFIDF and shortcut edges
Learning and reranking:
start with local features f_i(e) of edges e = u → v
for answer y, compute expectations E(f_i(e) | start = x, end = y)
use expectations as feature values and voted perceptron (Collins, 2002) as learning-to-rank method.

71
Experiments with Graph Search
72
Experiments with Graph Search
73
Hot off the presses

Ongoing work: learn NER system from pairs of (document, geneIdList)
much easier to obtain training data than documents in which every occurrence of every gene name is highlighted (usual NER training data)
obtains F1 of 71.4 on mouse data (vs 45.3 by training on YAPEX data, which is from a different distribution)
Joint work with Richard Wang, Bob Frederking, Anthony Tomasic

74
Experiments with Graph Search
75
Summary

Contributions
a very simple query language for graphs, based on
a diffusion-kernel (damped PageRank,...)
similarity metric
experiments on natural types of queries
finding likely meeting attendees
finding related documents (email threading)
disambiguating person and gene/protein entity
names
techniques for learning to answer queries
reranking using expectations of simple, local
features
tune performance to a particular similarity

76
Summary

Some open problems:
scalability & efficiency
K-step walk on node-node graph with fan-out b is O(KbN)
accurate sampling is O(1 min) for 10 steps with O(10^6) nodes
faster, better learning methods
combine re-ranking with learning parameters of graph walk
add language modeling, topic modeling
extend graph to include models as well as data
