Title: On Beyond Hypertext: Searching in Graphs Containing Documents, Words, and Data
1On Beyond Hypertext: Searching in Graphs
Containing Documents, Words, and Data
- William W. Cohen
- Center for Automated Learning and Discovery
- Language Technology Institute
- Center for Bioimage Informatics
- Joint CMU-Pitt Program in Bioinformatics
- Carnegie Mellon University
2On Beyond Hypertext: Searching in Graphs
Containing Documents, Words, and Data
- William W. Cohen
- Machine Learning Department
- Language Technology Institute
- Center for Bioimage Informatics
- Joint CMU-Pitt Program in Bioinformatics
- Carnegie Mellon University
3On Beyond Hypertext: Searching in Graphs
Containing Documents, Words, and Data
- William W. Cohen
- Carnegie Mellon University
- joint work with
- Einat Minkov (CMU)
- Andrew Ng (Stanford)
4Outline
- Motivation: why I'm interested in
- structured data that is partly text
- structured data represented as graphs
- measuring similarity of nodes in graphs
- Contributions
- a simple query language for graphs
- experiments on natural types of queries
- techniques for learning to answer queries of a
certain type better
5A Little Knowledge is a Dangerous Thing (A. Pope, 1709)
- Three centuries later, we've learned that a lot
of knowledge is also sort of dangerous...
- ... so how do we deal with information overload?
6One approach: adding structure to unstructured
information
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
... by recognizing entity names... ... and
relationships between them...
7One approach: adding structure to unstructured
information
Carvalho & Cohen, SIGIR 2005; Cohen, Carvalho &
Mitchell, EMNLP 2004
8One approach: adding structure to unstructured
information
Mitchell et al., CEAS 2004
10One approach: adding structure to unstructured
information
McCallum et al., IJCAI 2005
17 Is converting unstructured data to structured
data enough?
18Limitations of structured data
What is the email address for the person named
Halevy mentioned in this presentation? What
files from my home machine will I need for this
meeting? What people will attend this
meeting? ... ?
- Diversity: many different types of information from many different sources, that arise to fill many different needs.
- Uncertainty: information from many sources (like IE programs or the web) need not be correct.
- Complexity of interaction: formulating information needs as queries to a DB can be difficult... especially a heterogeneous DB, with a complex/changing schema.
How do you discover and access the tens or hundreds
of structured databases? How do you understand
and combine the hundreds of schemata, with thousands
of fields? How do you relate the thousands or
millions or ... of entity identifiers from the
different databases?
How can you include many diverse sources of
information in a single database?
19When are two entities the same? When is
referent(oid1) = referent(oid2)?
- Bell Labs
- Bell Telephone Labs
- AT&T Bell Labs
- AT&T Labs
- AT&T Labs-Research
- AT&T Labs Research, Shannon Laboratory
- Shannon Labs
- Bell Labs Innovations
- Lucent Technologies/Bell Labs Innovations
1925
History of Innovation: From 1925 to today, AT&T
has attracted some of the world's greatest
scientists, engineers and developers.
www.research.att.com
Bell Labs Facts Bell Laboratories, the research
and development arm of Lucent Technologies, has
been operating continuously since 1925
bell-labs.com
20Is there a definition of entity identity that
is user- and purpose- independent?
?
Bell Telephone Labs
21When are two entities the same?
Buddhism rejects the key element in folk
psychology: the idea of a self (a unified
personal identity that is continuous through
time). King Milinda and Nagasena (the Buddhist
sage) discuss personal identity. Milinda
gradually realizes that "Nagasena" (the word)
does not stand for anything he can point to:
not the hairs on Nagasena's head, nor the hairs
of the body, nor the "nails, teeth, skin,
muscles, sinews, bones, marrow, kidneys, ..."
etc. Milinda concludes that "Nagasena" doesn't
stand for anything. If we can't say what a person
is, then how do we know a person is the same
person through time? There's really no you,
and if there's no you, there are no beliefs or
desires for you to have. The folk psychology
picture is profoundly misleading and believing it
will make you miserable. -S. LaFave
22Traditional approach
Linkage
Queries
Uncertainty about what to link must be decided by
the integration system, not the end user
23WHIRL vision
Strongest links: those agreeable to most users
Weaker links: those agreeable to some users
even weaker links
24WHIRL vision
DB1 DB2 ? DB
Link items as needed by Q
Incrementally produce a ranked list of possible
links, with best matches first. User (or
downstream process) decides how much of the list
to generate and examine.
25Outline
- Motivation: why I'm interested in
- structured data that is partly text: similarity!
- structured data represented as graphs
- measuring similarity of nodes in graphs
- Contributions
- a simple query language for graphs
- experiments on natural types of queries
- techniques for learning to answer queries of a
certain type better
There are general-purpose, fast, robust
similarity measures for text, which are useful
for data integration....and hence, combining
information from multiple sources.
26Limitations of structured data
What is the email address for the person named
Halevy mentioned in this presentation? What
files from my home machine will I need for this
meeting? What people will attend this
meeting? ... ?
- Diversity: many different types of information from many different sources, that arise to fill many different needs.
- Uncertainty: information from many sources (like IE programs or the web) need not be correct.
- Complexity of interaction: formulating information needs as queries to a DB can be difficult... especially a heterogeneous one.
How can you exploit structure without
understanding the structure?
27Schema-free structured search
- DataSpot (DTL)/Mercado Intuifind, VLDB 98
- Proximity Search, VLDB 98
- Information units (linked Web pages), WWW10
- Microsoft DBExplorer, Microsoft English Query
- BANKS (Browsing ANd Keyword Search): Chakrabarti & others, VLDB 02, VLDB 05
28BANKS Basic Data Model
- Database is modeled as a graph
- Nodes: tuples
- Edges: references between tuples
- edges are directed.
- foreign key, inclusion dependencies, ..
User need not know organization of database to
formulate queries.
[Figure: example graph with paper nodes (MultiQuery Optimization; BANKS Keyword search), writes nodes, and author nodes (S. Sudarshan, Prasan Roy, Charuta).]
29BANKS Answer to Query
Query: "sudarshan roy"  Answer: subtree from graph
[Figure: answer subtree in which the paper MultiQuery Optimization is connected through two writes nodes to the authors S. Sudarshan and Prasan Roy.]
31BANKS Basic Data Model (not quite so basic)
- All information is modeled as a graph
- Nodes: tuples or documents or strings or words
- Edges: references between nodes
- edges are directed, labeled and weighted
- foreign key, inclusion dependencies, ...
- doc/string D to word contained by D (TFIDF weighted, perhaps)
- word W to doc/string containing W (inverted index)
- string S to strings similar to S
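As a rough illustration of the extended data model, the sketch below builds a small labeled, weighted graph from a handful of strings: each string node points to the words it contains (weighted by a simple TF-IDF variant), and each word node points back to the strings containing it. The function name `build_text_graph`, the edge labels `has_term` and `in_string`, the node-id prefixes, and the exact weighting formula are all assumptions made for this example, not details taken from the slides.

```python
import math
import re
from collections import Counter, defaultdict


def build_text_graph(strings):
    """Build edges[src][label][dst] = weight from {string_id: text}.

    Hypothetical conventions: string nodes are named "string:<id>",
    word nodes are named "term:<word>".
    """
    tokens = {sid: re.findall(r"[a-z0-9]+", text.lower())
              for sid, text in strings.items()}
    # Document frequency: in how many strings does each word occur?
    df = Counter(w for toks in tokens.values() for w in set(toks))
    n = len(strings)
    edges = defaultdict(lambda: defaultdict(dict))
    for sid, toks in tokens.items():
        src = "string:" + sid
        for word, tf in Counter(toks).items():
            term = "term:" + word
            # One simple TF-IDF variant (an assumption): rarer words weigh more.
            edges[src]["has_term"][term] = tf * math.log(1.0 + n / df[word])
            # Inverted-index edge from the word back to the string.
            edges[term]["in_string"][src] = 1.0
    return edges


graph = build_text_graph({
    "s1": "William W. Cohen, CMU",
    "s2": "Dr. W. W. Cohen",
})
```

With these two strings, `string:s1` and `string:s2` are connected by length-2 paths through the shared word nodes `term:w` and `term:cohen`, which is the effect the similar-strings bullet describes.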
32Outline
- Motivation: why I'm interested in
- structured data that is partly text: similarity!
- structured data represented as graphs: all sorts of information can be poured into this model.
- measuring similarity of nodes in graphs
- Contributions
- a simple query language for graphs
- experiments on natural types of queries
- techniques for learning to answer queries of a
certain type better
33Yet another schema-free query language
- Assume data is encoded in a graph with
- a node for each object x
- a type of each object x, T(x)
- an edge for each binary relation r: x → y
- Queries are of this form
- Given type t and node x, find y such that T(y) = t and y ~ x.
- We'd like to construct a general-purpose similarity function x ~ y for objects in the graph
- We'd also like to learn many such functions for different specific tasks (like "who should attend a meeting")
Node similarity
34Similarity of Nodes in Graphs
- Given type t and node x, find y such that T(y) = t and y ~ x.
- Similarity defined by damped version of PageRank
- Similarity between nodes x and y
- Random surfer model: from a node z,
- with probability α, stop and output z
- pick an edge label r using Pr(r | z) ... e.g. uniform
- pick a y uniformly from {y : z → y with label r}
- repeat from node y ....
- Similarity x ~ y = Pr(output y | start at x)
- Intuitively, x ~ y is summation of weight of all paths from x to y, where weight of path decreases exponentially with length.
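The random surfer model above can be sketched as an exact, finite-step propagation of probability mass. This is a minimal sketch under stated assumptions: the graph is a plain dict of the form `{node: {edge_label: [neighbors]}}`, edge labels are chosen uniformly, a node with no outgoing edges forces a stop, and any mass still walking after `max_steps` is dropped. The names `walk_similarity` and `typed_search` are hypothetical.

```python
from collections import defaultdict


def walk_similarity(graph, start, stop_prob=0.5, max_steps=10):
    """Return {y: Pr(output y | start at x)} for a finite damped walk."""
    output = defaultdict(float)
    frontier = {start: 1.0}  # probability mass currently located at each node
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for z, mass in frontier.items():
            by_label = {r: ys for r, ys in graph.get(z, {}).items() if ys}
            if not by_label:
                # Assumption: a dead end forces the surfer to stop here.
                output[z] += mass
                continue
            output[z] += stop_prob * mass          # stop and output z
            moving = (1.0 - stop_prob) * mass      # otherwise keep walking
            for ys in by_label.values():
                # Uniform over labels, then uniform over that label's targets.
                share = moving / len(by_label) / len(ys)
                for y in ys:
                    next_frontier[y] += share
        frontier = next_frontier
    # Mass still in `frontier` is discarded: the walk is truncated.
    return dict(output)


def typed_search(graph, node_type, start, wanted_type, **walk_args):
    """Rank nodes y with T(y) = wanted_type by similarity to `start`."""
    sims = walk_similarity(graph, start, **walk_args)
    hits = [(y, p) for y, p in sims.items()
            if y != start and node_type.get(y) == wanted_type]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


toy = {"a": {"r1": ["b"], "r2": ["c", "d"]}}
types = {"a": "file", "b": "person", "c": "person", "d": "term"}
ranked = typed_search(toy, types, "a", "person", stop_prob=0.5, max_steps=3)
```

In the toy graph, node `b` is the only target of label `r1`, so it receives more mass than `c`, which shares label `r2` with `d`.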
35BANKS Basic Data Model (not quite so basic)
- All information is modeled as a graph
- Nodes: tuples or documents or strings or words
- Edges: references between nodes
- edges are directed, labeled and weighted
- foreign key, inclusion dependencies, ...
- doc/string D to word contained by D (TFIDF weighted, perhaps)
- word W to doc/string containing W (inverted index)
- string S to strings similar to S (optional: strings that are similar in TFIDF/cosine distance will still be nearby in graph, connected by many length-2 paths)
[Figure: the strings "William W. Cohen, CMU" and "Dr. W. W. Cohen" connected through the word nodes william, w, cohen, cmu, dr.]
36Similarity of Nodes in Graphs
- Random surfer on graphs
- natural extension to PageRank
- closely related to Lafferty's heat diffusion kernel
- but generalized to directed graphs
- somewhat amenable to learning parameters of the walk (gradient search, w/ various optimization metrics)
- Toutanova, Manning & Ng, ICML 2004
- Nie et al, WWW 2005
- Xi et al, SIGIR 2005
- can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering
- our current implementation (GHIRL): Lucene and Sleepycat with extensive use of memory caching (sampling approaches visit many nodes repeatedly)
37Query: "sudarshan roy"  Answer: subtree from graph
[Figure: answer subtree in which the paper MultiQuery Optimization is connected through two writes nodes to the authors S. Sudarshan and Prasan Roy.]
38y: paper(y), y ~ roy
w: paper(y), w ~ roy
AND
Query: "sudarshan roy"  Answer: subtree from graph
39Evaluation on Personal Information Management
Tasks
Minkov et al, SIGIR 2006
Many tasks can be expressed as simple,
non-conjunctive search queries in this framework.
- Such as
- Person Name Disambiguation in Email
- Threading
- Finding email-address aliases given a person's name
- Finding relevant meeting attendees
What is the email address for the person named
Halevy mentioned in this presentation? What
files from my home machine will I need for this
meeting? What people will attend this
meeting? ... ?
novel
e.g. Diehl, Getoor, Namata, 2006
e.g. Lewis & Knowles 97
novel
Also consider a generalization: x → Vq, where Vq is a
distribution over nodes x
novel
40Email as a graph
[Figure: an email corpus drawn as a graph. Node types: file (file1, file2), email address (1 to 5), person name (1 to 5), date (date1, date2), and term (term1 to term11). Edge labels: sent_from, sent_to, sent_date, alias, in_file, in_subj, 1_day, and the inverse labels sf_inv, st_inv, sd_inv, a_inv, if_inv, is_inv.]
41Person Name Disambiguation
- Q: "who is Andy?"
- Given: a term that is not mentioned "as is" in the header (otherwise, easy), and that is known to be a personal name
- Output: ranked person nodes.
[Figure: the term "andy" connected through file nodes to Person nodes, including Person: Andrew Johns.]
This task is complementary to person name
annotation in email (E. Minkov, R. Wang,
W. Cohen, Extracting Personal Names from Emails:
Applying Named Entity Recognition to Informal
Text, HLT/EMNLP 2005)
42Corpora and Datasets
a. Corpora
Example nicknames: Dave for David, Kai for
Keiko, Jenny for Qing
b. Types of names
43Person Name Disambiguation
- 1. Baseline: String matching (+ common nicknames)
- Find persons that are similar to the name term (Jaro)
- Successful in many cases
- Not successful for some nicknames
- Can not handle ambiguity (arbitrary)
- 2. Graph walk: term
- Vq = name term node (2 steps)
- Models co-occurrences.
- Can not handle ambiguity (dominant)
- 3. Graph walk: term+file
- Vq = name term + file nodes (2 steps)
- The file node is natural available context
- Solves the ambiguity problem!
- But, incorporates additional noise.
- 4. Graph walk: term+file, reranked using learning
- Re-rank the output of (3), using
- path-describing features
- source count: do the paths originate from a single or two source nodes
- string similarity
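For the string-matching baseline, a plain Jaro similarity can be sketched as follows. The slides only name the measure, so everything else here is an assumption for illustration: the helper `rank_persons`, the scoring of a person by the best-matching token of the full name, and the small nickname dictionary.

```python
def jaro(s, t):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # Look for an unused equal character within the match window.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0


def rank_persons(term, persons, nicknames=None):
    """Rank person names by their best Jaro match against the name term.

    `nicknames` is a hypothetical dict such as {"andy": ["andrew"]}.
    """
    variants = [term.lower()] + list((nicknames or {}).get(term.lower(), []))
    scored = []
    for person in persons:
        best = max(jaro(v, tok) for v in variants
                   for tok in person.lower().split())
        scored.append((person, best))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_persons("andy", ["Bill Smith", "Andrew Johns"],
                       nicknames={"andy": ["andrew"]})
```

A baseline of this kind scores each candidate from the strings alone, so two people who share a first name receive the same score; this is the ambiguity limitation the list above notes.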
44Results
45Results
after learning-to-rank
graph walk from name,file
graph walk from name
baseline string match, nicknames
46Results
Enron execs
47Results
48Learning
- There is no single "best" measure of similarity
- How can you learn how to better rank graph nodes, for a particular task?
- Learning methods for graph walks
- The parameters can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005)
- We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning)
- Features of candidate answer y describe the set of paths from query x to y
49Re-ranking overview
- Boosting-based reranking, following (Collins and Koo, Computational Linguistics, 2005)
- A training example includes
- a ranked list of li nodes.
- Each node is represented through m features
- At least one known correct node
- Scoring function: a linear combination of the features and the original score y ~ x
- Find w that minimizes the loss (boosted version). Requires binary features, and has a closed-form formula to find the best feature and delta in each iteration.
[Equations for the scoring function and the loss were not transcribed.]
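Since the slide's equations did not survive, the following is only a hedged sketch of what a re-ranker of this shape could look like. It assumes a score of the form w0 * log(original walk score) plus the sum of weights of the binary features that fire, and an exponential loss over (correct, incorrect) candidate pairs in the spirit of Collins and Koo. The dict layout, the function names, and the choice to compare against the best-scoring correct node are all assumptions, and no boosting update is shown.

```python
import math


def score(candidate, weights, w0=1.0):
    """Linear score: weighted log of the walk score plus fired binary features.

    `candidate` is a hypothetical dict with keys "walk_score" (> 0),
    "features" (a set of feature names), and "correct" (bool).
    """
    fired = sum(weights.get(f, 0.0) for f in candidate["features"])
    return w0 * math.log(candidate["walk_score"]) + fired


def rerank(candidates, weights, w0=1.0):
    """Sort candidates by the learned score, best first."""
    return sorted(candidates, key=lambda c: score(c, weights, w0),
                  reverse=True)


def exp_loss(examples, weights, w0=1.0):
    """Exponential ranking loss summed over all training lists.

    Each example is a list of candidates with at least one correct node;
    every incorrect node is penalized by how close it scores to the
    best-scoring correct node (an assumption of this sketch).
    """
    total = 0.0
    for candidates in examples:
        best_correct = max(score(c, weights, w0)
                           for c in candidates if c["correct"])
        for c in candidates:
            if not c["correct"]:
                total += math.exp(-(best_correct - score(c, weights, w0)))
    return total


example = [
    {"node": "a", "walk_score": 0.6, "features": {"f1"}, "correct": False},
    {"node": "b", "walk_score": 0.3, "features": {"f2"}, "correct": True},
]
before = [c["node"] for c in rerank(example, {})]
after = [c["node"] for c in rerank(example, {"f2": 2.0})]
```

With empty weights the list keeps the graph-walk order; giving weight to a feature that fires only on the correct node moves it to the top and lowers the loss.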
50Path describing Features
- The set of paths to a target node in step k is recovered in full.
- Edge unigram features: was edge type l used in reaching x from Vq?
- Edge bigram features: were edge types l1 and l2 used (in that order) in reaching x from Vq?
- Top edge bigram features: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest scoring paths?
[Figure: nodes x1 to x5 laid out over walk steps K=0, K=1, K=2.]
- Paths (x3, k=2)
- x2 → x1 → x3
- x4 → x1 → x3
- x2 → x2 → x3
- x2 → x3
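The three feature families can be computed from a recovered path set roughly as below. This is a sketch under assumptions: each path is given as a pair of (list of edge labels, path score), features are encoded as tuples, and the function name `path_features` and the example edge labels are hypothetical.

```python
def path_features(paths, top_k=2):
    """Binary path-describing features for one candidate node.

    `paths` is a list of (edge_labels, path_score) pairs, one per path
    from the query distribution to the candidate.
    """
    def bigrams(labels):
        return zip(labels, labels[1:])

    features = set()
    for labels, _ in paths:
        for label in labels:
            features.add(("unigram", label))            # edge type was used
        for l1, l2 in bigrams(labels):
            features.add(("bigram", l1, l2))            # used in that order
    # Bigrams restricted to the top_k highest scoring paths.
    top = sorted(paths, key=lambda p: p[1], reverse=True)[:top_k]
    for labels, _ in top:
        for l1, l2 in bigrams(labels):
            features.add(("top_bigram", l1, l2))
    return features


feats = path_features([
    (["sent_from", "sent_to_inv"], 0.5),
    (["has_term", "has_term_inv"], 0.1),
    (["in_file", "in_file_inv"], 0.3),
])
```

Each tuple can then serve as one binary feature of the candidate in a re-ranker.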
51Results
52Threading
- Threading is an interesting problem, because
- There are often irregularities in thread structural information, thus thread discourse should be captured using an intelligent approach (D.D. Lewis and K.A. Knowles, Threading electronic mail: A preliminary study, Information Processing and Management, 1997)
- Threading information can improve message categorization into topical folders (B. Klimt and Y. Yang, The Enron corpus: A new dataset for email classification research, ECML, 2004)
- Adjacent messages in a thread can be assumed to be most similar to each other in the corpus. Therefore, threading is related to the general problem of finding similar messages in a corpus.
The task: given a message, retrieve adjacent
messages in the thread
56Some intuition ?
filex
Shared content
Social network
Timeline
57Threading experiments
- 1. Baseline: TF-IDF similarity
- Consider all the available information (header and body) as text
- 2. Graph walk: uniform
- Start from the file node, 2 steps, uniform edge weights
- 3. Graph walk: random
- Start from the file node, 2 steps, random edge weights (best out of 10)
- 4. Graph walk: reranked
- Rerank the output of (3) using the graph-describing features
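The TF-IDF baseline can be illustrated with a small, dependency-free sketch that treats each message (header and body together) as a bag of words and ranks the other messages by cosine similarity. The weighting scheme (log-scaled term frequency times log inverse document frequency), the whitespace tokenizer, and the function names are assumptions; the slides do not specify which TF-IDF variant was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (1.0 + math.log(c)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(docs, query_index):
    """Rank all other messages by similarity to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scored = [(j, cosine(vectors[query_index], vectors[j]))
              for j in range(len(docs)) if j != query_index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


messages = [
    "budget meeting on friday",
    "re budget meeting moved to monday",
    "lunch menu for the cafeteria",
]
neighbors = most_similar(messages, 0)
```

Here the reply that shares "budget" and "meeting" with the first message ranks above the unrelated one, which is the behavior a threading baseline relies on.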
58Results
- Highly-ranked edge-bigrams
- sent-from → sent-to^-1
- date-of → date-of^-1
- has-term → has-term^-1
59Finding email-aliases given a name
Given: a person's name (term)
Retrieve: the full set of relevant email-addresses (email-address)
60Finding Meeting Attendees
- Extended graph contains 2 months of calendar data
61Main Contributions
- Presented an extended similarity measure incorporating non-textual objects
- Finite lazy random walks to perform typed search
- A re-ranking paradigm to improve on graph walk results
- Instantiation of this framework for email
- Defined and evaluated novel tasks for email
62Another Task that Can be Formulated as a Graph
Query: GeneId-Ranking
- Given
- a biomedical paper abstract
- Find
- the geneId for every gene mentioned in the abstract
- Method
- from paper x, ranked list of geneId y: x ~ y
- Background resources
- a synonym list: geneId → name1, name2, ...
- one or more protein NER systems
- training/evaluation data: pairs of (paper, geneId1, ...., geneIdn)
63Sample abstracts and synonyms
- MGI:96273
- Htr1a
- 5-hydroxytryptamine (serotonin) receptor 1A
- 5-HT1A receptor
- MGI:104886
- Gpx5
- glutathione peroxidase 5
- Arep
- ...
- 52,000 for mouse, 35,000 for fly
true labels
NER extractor
64Graph for the task....
[Figure: layered graph for the task. Node layers: abstracts (file: doc115, ...), proteins (CA1, HT1A, HT1, ...), terms (HT, 1, A, CA, hippocampus, ...), synonyms (5-HT1A receptor, Htr1a, eIF-1A, ...), and geneIds (MGI95298, MGI46273, ...). Edge labels: hasProtein, hasTerm, inFile, synonym.]
65[Figure: the same layered graph (abstracts, proteins, terms, synonyms, geneIds), extended with noisy training abstracts (file: doc214, file: doc523, file: doc6273, ...).]
66Experiments
- Data: BioCreative Task 1B
- mouse: 10,000 train abstracts, 250 devtest, using first 150 for now; 50,000 geneIds; graph has 525,000 nodes
- NER systems
- likelyProtein: trained on yapex.train using off-the-shelf NER systems (Minorthird)
- possibleProtein: same, but modified (on yapex.test) to optimize F3, not F1 (rewards recall over precision)
67Experiments with NER
[Table not transcribed. Surviving labels: likely and possible (yapex.test); likely, possible, and dictionary (mouse).]
68Experiments with Graph Search
- Baseline method
- extract entities of type x
- for each string of type x, find best-matching synonym, and then its geneId
- consider only synonyms sharing ≥1 token
- Soft/TFIDF distance
- break ties randomly
- rank geneIds by number of times they are reached
- rewards multiple mentions (even via alternate synonyms)
- Evaluation
- average, over 50 test documents, of
- non-interpolated average precision (plausible for curators)
- max F1 over all cutoffs
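The two evaluation measures can be computed per document as in the sketch below and then averaged over the test documents. The function names are hypothetical, and the handling of a document with no relevant geneIds (returning 0.0) is an assumption.

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision of one ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k  # precision at each rank holding a correct item
    return total / len(relevant)


def max_f1(ranked, relevant):
    """Best F1 obtainable by cutting the ranked list at any rank."""
    relevant = set(relevant)
    best = 0.0
    hits = 0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        if hits:
            precision = hits / k
            recall = hits / len(relevant)
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


ranked_ids = ["g1", "g2", "g3", "g4"]
true_ids = {"g1", "g3"}
ap = average_precision(ranked_ids, true_ids)
f1 = max_f1(ranked_ids, true_ids)
```

For this toy ranking the correct ids sit at ranks 1 and 3, so average precision is (1/1 + 2/3) / 2 and the best F1 is reached by cutting at rank 3.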
69Experiments with Graph Search
70Baseline vs Graphwalk
- Baseline includes
- softTFIDF distances from NER entity to gene synonyms
- knowledge that shortcut path doc → entity → synonym → geneId is important
- Graph includes
- IDF effects, correlations, training data, etc
- Proposed graph extension
- add softTFIDF and shortcut edges
- Learning and reranking
- start with local features fi(e) of edges e = u → v
- for answer y, compute expectations E(fi(e) | start = x, end = y)
- use expectations as feature values and voted perceptron (Collins, 2002) as learning-to-rank method.
71Experiments with Graph Search
72Experiments with Graph Search
73Hot off the presses
- Ongoing work: learn NER system from pairs of (document, geneIdList)
- much easier to obtain training data than documents in which every occurrence of every gene name is highlighted (usual NER training data)
- obtains F1 of 71.4 on mouse data (vs 45.3 by training on YAPEX data, which is from a different distribution)
- Joint work with Richard Wang, Bob Frederking, Anthony Tomasic
74Experiments with Graph Search
75Summary
- Contributions
- a very simple query language for graphs, based on a diffusion-kernel (damped PageRank,...) similarity metric
- experiments on natural types of queries
- finding likely meeting attendees
- finding related documents (email threading)
- disambiguating person and gene/protein entity names
- techniques for learning to answer queries
- reranking using expectations of simple, local features
- tune performance to a particular similarity
76Summary
- Some open problems
- scalability and efficiency
- K-step walk on node-node graph with fan-out b is O(KbN)
- accurate sampling is O(1 min) for 10 steps with O(10^6) nodes.
- faster, better learning methods
- combine re-ranking with learning parameters of graph walk
- add language modeling, topic modeling
- extend graph to include models as well as data