Learning Similarity Measures Based on Random Walks in Graphs

1
Learning Similarity Measures Based on Random
Walks in Graphs

William W. Cohen
Machine Learning Department and Language
Technologies Institute
School of Computer Science
Carnegie Mellon University
joint work with
Ni Lao, CMU → Google
Tom Mitchell, CMU
Einat Minkov, Univ Haifa
Amarnag Subramanya, Fernando Pereira, Google

2
Motivation: The simple and the complex

In computer science there is a tension between:
The elegant, simple and general
The messy, complex and problem-specific
Graphs are:
Simple, so they are easy to analyze and store
General, so
They appear in many contexts
They are often a natural representation of
important aspects of information
Well-understood

3
Motivation: The simple and the complex

The real world is complex
learning is a way to incorporate that
complexity in our models without sacrificing
elegance and generality

4
Motivation: The simple and the complex

This talk: Learning Similarity Measures Based on
Random Walks in Graphs
Many fundamental tasks in computer science map an
input to an output
i.e., the task can be modeled as a relation
between input and output
and further, the relation can often be viewed as
a similarity relation: the desired outputs are
similar to the input (query)
we want to learn this relationship
even if (especially if) it is complex
even if it is described by a multi-step process
Here: one line of work on learning complex
relationships

5
Motivation: The simple and the complex

This talk:
One line of work on learning complex
relationships
Not covered here:
Minkov et al. 2006, 2008, 2011: Similar framework
for personalized information management queries
and NLP relationships (e.g., synonyms) using
generative and reranking-based learning
strategies
Backstrom and Leskovec, 2011: Alternative, very
expressive parameterization of learning complex
similarity metrics in graphs with feature vectors
on the edges.

6
Similarity Queries on Graphs

1) Given type t and node x in G, find y such that $T(y) = t$
and $y \sim x$.
2) Given type t and node set X, find y such that $T(y) = t$
and $y \sim X$.
Nearest-neighbor classification:
G contains feature nodes and instance nodes
A link (x,f) means feature f is true for instance
x
x is a query instance; $y \sim x$ means y is likely of the
same class as x
Information retrieval:
G contains word nodes and document nodes
A link (w,d) means word w is in document d
X is a set of keywords; $y \sim X$ means y is likely to be
relevant to X
Database retrieval:
G encodes a database
...

7
BANKS: Browsing and Keyword Search
(Aditya et al., VLDB 2002)

Database is modeled as a graph
Nodes = tuples
Edges = references between tuples
edges are directed and indicate foreign key,
inclusion dependencies, ...

[Figure: the paper tuple "MultiQuery Optimization" linked through two "writes" tuples to the author tuples "S. Sudarshan" and "Prasan Roy"]
8
Query: "sudarshan, roy". Answer: subtree
from graph
[Figure: answer subtree connecting the paper "MultiQuery Optimization" through two "writes" tuples to the authors "S. Sudarshan" and "Prasan Roy"]
9
y: paper(y) and $y \sim$ "sudarshan"
AND
w: paper(y) and $w \sim$ "roy"
Query: "sudarshan, roy". Answer: subtree from
graph
10
(No Transcript)
11
Similarity Queries on Graphs

1) Given type t and node x in G, find y such that $T(y) = t$
and $y \sim x$.
2) Given type t and node set X, find y such that $T(y) = t$
and $y \sim X$.
Nearest-neighbor classification
Information retrieval
Database retrieval
Evaluation: specific families of tasks for
scientific publications
Citation recommendation for a paper (given
title, year, ..., of paper p, what papers should
be cited by p?)
Expert-finding (given keywords, genes, suggest
a possible author)
Entity recommendation (given title, author,
year, predict entities mentioned in a paper,
e.g. gene-protein entities); can improve NER
Literature recommendation: given researcher and
year, suggest papers to read that year
Evaluation: inference in a DB of
automatically-extracted facts

Core tasks in CS
12
Similarity Queries on Graphs
For each task:
(query 1, ans 1), (query 2, ans 2), ...
LEARNER
Sim(s,p): mapping from query → ans
variant of PPR
may use PPR

Evaluation: specific families of tasks for
scientific publications
Citation recommendation for a paper (given
title, year, ..., of paper p, what papers should
be cited by p?)
Expert-finding (given keywords, genes, suggest
a possible author)
Entity recommendation (given title, author,
year, predict entities mentioned in a paper,
e.g. gene-protein entities)
Literature recommendation: given researcher and
year, suggest papers to read that year
Evaluation: inference in a DB of
automatically-extracted facts

13
Outline

Motivation for Learning Similarity in Graphs
A Baseline Similarity Metric
Some Literature-related Tasks
The Path Ranking Algorithm (Learning Method)
Motivation
Details
Results: BioLiterature tasks
Results: KB Inference tasks

14
Defining Similarity on Graphs: PPR/RWR
(Personalized PageRank, 1999)

Given type t and node x, find y such that $T(y) = t$ and $y \sim x$.
Similarity defined by a damped version of
PageRank
Similarity between nodes x and y:
Random surfer model: from a node z,
with probability $\alpha$, teleport back to x (reset)
Else pick a y uniformly from $\{y : z \to y\}$
repeat from node y ....
Similarity $x \sim y$ = Pr(surfer is at y | restart is
always to x)
Intuitively, $x \sim y$ is the sum of the weights of all paths
from x to y, where the weight of a path decreases
exponentially with length (and fanout)
Can easily extend to a query set $X = \{x_1, \ldots, x_k\}$
Disadvantages?
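
The random surfer model above can be sketched as a short power iteration. This is only an illustration under stated assumptions: the function name `personalized_pagerank`, the toy graph, the reset probability of 0.2, and the choice to send dead-end mass back to x are not from the slides.

```python
from collections import defaultdict

def personalized_pagerank(out_links, x, alpha=0.2, iterations=50):
    """Approximate Pr(surfer is at y | restart is always to x) by power iteration."""
    dist = {x: 1.0}  # the surfer starts at x
    for _ in range(iterations):
        nxt = defaultdict(float)
        nxt[x] += alpha  # dist sums to 1, so the total reset mass is alpha
        for z, mass in dist.items():
            neighbors = out_links.get(z, [])
            if not neighbors:
                # Assumption of this sketch: a dead end also resets to x.
                nxt[x] += (1 - alpha) * mass
                continue
            share = (1 - alpha) * mass / len(neighbors)
            for y in neighbors:  # uniform choice among {y : z -> y}
                nxt[y] += share
        dist = dict(nxt)
    return dist

# Hypothetical toy graph: x links to a and b, both link to c, c links back to x.
toy_graph = {"x": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["x"]}
scores = personalized_pagerank(toy_graph, "x")
```

In this toy graph c receives mass along two paths from x, so it scores higher than a or b even though it is one step further away, which matches the "sum of the weights of all paths" intuition.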

15
Some BioLiterature Retrieval Tasks

Data used in this study
Yeast: 0.2M nodes, 5.5M links
Fly: 0.8M nodes, 3.5M links
E.g. the fly graph

16
Learning Proximity Measures for BioLiterature
Retrieval Tasks

Tasks:
Gene recommendation: author, year → gene
Reference recommendation: words, year → paper
Expert-finding: words, genes → author
Literature recommendation: author,
papers read in past
Baseline method:
Typed RWR proximity methods
Baseline learning method:
parameterize Prob(walk edge | edge label = L) and
tune the parameters for each label L (somehow)

P(L=cite) = a
P(bindTo) = d
P(express) = d
P(NE) = c
P(write) = b
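
A minimal sketch of one walk step with one parameter per edge label. The slide does not say how the per-label parameters are turned into transition probabilities, so this assumes each outgoing edge is chosen in proportion to its label's weight. The function name, the toy edges, and the weight values are hypothetical.

```python
def typed_walk_step(dist, edges, label_weight):
    """One step of a walk whose edge choice is biased by a per-label weight."""
    nxt = {}
    for z, mass in dist.items():
        out = edges.get(z, [])  # list of (label, target) pairs
        total = sum(label_weight[label] for label, _ in out)
        if total == 0:
            continue  # mass at dead ends is dropped in this sketch
        for label, y in out:
            nxt[y] = nxt.get(y, 0.0) + mass * label_weight[label] / total
    return nxt

# Hypothetical example: one paper with a citation edge and two named-entity edges.
edges = {"paper1": [("cite", "paper2"), ("NE", "geneA"), ("NE", "geneB")]}
weights = {"cite": 2.0, "NE": 1.0}  # illustrative values for the parameters a and c
step = typed_walk_step({"paper1": 1.0}, edges, weights)
```

Every edge with the same label gets the same weight wherever it appears in the walk.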
17
Path-based vs Edge-label based learning

RWR is a very robust and useful similarity metric
Learning one-parameter-per-edge label is very
limited
In many cases, there aren't enough parameters to
express a complex relationship

18
Path-based vs Edge-label based learning

Learning one-parameter-per-edge label is limited
because the context in which an edge label
appears is ignored
E.g. (observed from real data; task: find papers
to read)
Instead, we will learn path-specific parameters

Path | Comments
author --read--> paper --contain--> gene --contain^-1--> paper | Don't read about genes I've already read about
author --read--> paper --write^-1--> author --write--> paper | Do read papers from my favorite authors

Paths will be interpreted as constrained random
walks that give a similarity-like weight to every
reachable node
Step 0: D0 = {a}; start at author a
Step 1: D1 = uniform over all papers p read by a
Step 2: D2 = authors a' of papers in D1, weighted by
number of papers in D1 published by a'
Step 3: D3 = papers p' written by a', weighted by
....

19
A Limitation of RWR Learning Methods

Learning one-parameter-per-edge label is limited
because the context in which an edge label
appears is ignored
E.g. (observed from real data; task: find papers
to read)
Instead, we will learn path-specific parameters

Path | Comments
author --read--> paper --contain--> gene --contain^-1--> paper | Don't read about genes I've already read about
author --read--> paper --write^-1--> author --write--> paper | Do read papers from my favorite authors

Path | Comments
author --write--> paper --contain--> gene --contain^-1--> paper | Do read about the genes I'm working on
author --write--> paper --publish^-1--> institute --publish--> paper | Don't read papers from my own lab
20
Path Constrained Random Walks as Basis of a
Proximity Measure

Our work (Lao & Cohen, ECML 2010):
learn a weighted combination of simple path
experts, each of which corresponds to a
particular labeled path through the graph
Citation recommendation: an example
In the TREC-CHEM Prior Art Search Task,
researchers found that it is more effective to
first find patents about the topic, then aggregate
their citations
Our proposed model can discover this kind of
retrieval scheme and assign proper weights to
combine them. E.g.

Weighted Paths
Weighted Paths
21
Definitions

A graph $G = (T, R, X, E)$ is
a set of entity types $\{T\}$ and a set of
relations $\{R\}$
a set of entities (nodes) $X = \{x\}$, where each node
x has a type from T
a set of edges $E = \{e = (x, y)\}$, where each edge has a
relation label from R
A path $P = (R_1, \ldots, R_n)$ is a sequence of relations
Path Constrained Random Walk:
Given a query set S of source nodes
Distribution $D_0$ at time 0 is uniform over s in S
Distribution $D_t$ at time $t > 0$ is formed by:
Pick x from $D_{t-1}$
Pick y uniformly from all things related to x
by an edge labeled $R_t$
Notation: $f_P(s,t) = \mathrm{Prob}(s \to t; P)$
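
The definition above can be sketched directly. The `edges` encoding (a dict keyed by node and relation), the function name `pcrw`, and the toy author/paper graph are assumptions made for illustration. Mass at a node with no outgoing edge for the current relation is simply dropped here, which the slides do not specify.

```python
from collections import defaultdict

def pcrw(edges, sources, path):
    """Return the distribution D_n reached from `sources` by following `path`.

    `edges` maps (node, relation) to a list of neighbor nodes.
    """
    dist = {s: 1.0 / len(sources) for s in sources}  # D_0: uniform over S
    for relation in path:
        nxt = defaultdict(float)
        for x, mass in dist.items():
            targets = edges.get((x, relation), [])
            for y in targets:  # uniform over nodes related to x by R_t
                nxt[y] += mass / len(targets)
        dist = dict(nxt)
    return dist

# Hypothetical graph: author a has read p1 and p2; b wrote p1, p2, p3; c wrote p2, p4.
edges = {
    ("a", "read"): ["p1", "p2"],
    ("p1", "write^-1"): ["b"],
    ("p2", "write^-1"): ["b", "c"],
    ("b", "write"): ["p1", "p2", "p3"],
    ("c", "write"): ["p2", "p4"],
}
f = pcrw(edges, ["a"], ["read", "write^-1", "write"])
```

Here `f[t]` plays the role of $f_P(a, t)$ for the path read, write^-1, write: papers by authors that a reads often end up with more weight.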

22
x --AthletePlaysForTeam--> y --TeamPlaysInLeague--> z
23
Path Ranking Algorithm (PRA)
(Lao & Cohen, ECML 2010)

A PRA model scores a source-target node pair by a
linear function of their path features:
$\mathrm{score}(s,t) = \sum_{P \in \mathcal{P}} \theta_P f_P(s,t)$
where $\mathcal{P}$ is the set of all relation paths with
length ≤ L (with support on data, in some cases;
see Lao and Cohen EMNLP 2011)
For a relation R and a set of node pairs $\{(s_i, t_i)\}$,
we construct a training dataset $D = \{(x_i, y_i)\}$,
where $x_i$ is a vector of all the path
features for $(s_i, t_i)$, and $y_i$ indicates whether
$R(s_i, t_i)$ is true or not
$\theta$ is estimated using L1,L2-regularized logistic
regression
We've gone from a small parameter space to a huge
one
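
A rough sketch of the scoring and fitting step, assuming path features have already been computed. It uses plain gradient ascent with an L2 penalty only. This is a simplification for illustration, not the L1,L2-regularized training the slides describe. The path names, feature values, and hyperparameters are made up.

```python
import math

def score(theta, features):
    """Linear PRA-style score: sum over paths of theta_P * f_P(s, t)."""
    return sum(theta[p] * v for p, v in features.items())

def train(data, paths, l2=0.01, lr=0.5, epochs=200):
    """Fit theta by gradient ascent on the L2-regularized average log-likelihood."""
    theta = {p: 0.0 for p in paths}
    for _ in range(epochs):
        grad = {p: -l2 * theta[p] for p in paths}  # gradient of the L2 penalty
        for features, label in data:
            prob = 1.0 / (1.0 + math.exp(-score(theta, features)))
            for p, v in features.items():
                grad[p] += (label - prob) * v / len(data)
        for p in paths:
            theta[p] += lr * grad[p]
    return theta

# Hypothetical path features for four (source, target) pairs; label 1 means R(s, t) holds.
paths = ["read/write^-1/write", "read/contain/contain^-1"]
data = [
    ({"read/write^-1/write": 0.4, "read/contain/contain^-1": 0.1}, 1),
    ({"read/write^-1/write": 0.3, "read/contain/contain^-1": 0.0}, 1),
    ({"read/write^-1/write": 0.0, "read/contain/contain^-1": 0.5}, 0),
    ({"read/write^-1/write": 0.1, "read/contain/contain^-1": 0.4}, 0),
]
theta = train(data, paths)
```

On this toy data the first path gets a positive weight and the second a negative one, the kind of path-specific sign a single per-label parameter cannot express.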

24
Supervised PCRW Retrieval Model

A Retrieval Model ranks target entities by
linearly combining the distributions of different
paths
This model can be optimized by maximizing the
probability of the observed relevance
Given a set of training data $D = \{(q^{(m)}, A^{(m)}, y^{(m)})\}$,
$y_e^{(m)} = 1/0$

25
Parameter Estimation (Details)

Given a set of training data
$D = \{(q^{(m)}, A^{(m)}, y^{(m)})\}$, $m = 1 \ldots M$, $y^{(m)}(e) = 1/0$
We can define a regularized objective function
Use average log-likelihood as the objective $o_m(\theta)$
$P^{(m)}$: the index set of relevant entities,
$N^{(m)}$: the index set of irrelevant entities (how
to choose them will be discussed later)

26
Parameter Estimation (Details)

Selecting the negative entity set $N_m$
Few positive entities vs. thousands (or millions)
of negative entities?
First sort all the negative entities with a
uniform-weight RWR model
Then take negative entities at the $k(k+1)/2$-th
position, for $k = 1, 2, \ldots$
The gradient is simple
Use orthant-wise L-BFGS (Andrew & Gao, 2007) to
estimate $\theta$
Efficient; can deal with L1 regularization
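
The negative-sampling rule can be illustrated with a few lines, assuming the negatives have already been sorted by the uniform-weight RWR model and that positions are 1-based. The function names are hypothetical.

```python
def sample_negative_positions(num_negatives):
    """Return the 1-based ranks k(k+1)/2 (1, 3, 6, 10, ...) that fit in the list."""
    positions = []
    k = 1
    while k * (k + 1) // 2 <= num_negatives:
        positions.append(k * (k + 1) // 2)
        k += 1
    return positions

def select_negatives(ranked_negatives):
    """Pick the entities at triangular-number ranks of an RWR-sorted list."""
    ranks = sample_negative_positions(len(ranked_negatives))
    return [ranked_negatives[r - 1] for r in ranks]

picked = select_negatives(list("abcdefghij"))
```

The gaps between sampled ranks grow with k, so the sample is dense among the highest-ranked negatives and sparse further down the list.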

27
L2 Regularization

Improves retrieval quality
On the citation recommendation task

28
L1 Regularization

Does not improve retrieval quality

29
L1 Regularization

but can help reduce number of features

30
Another potential regularization: approximate RWR
31
Experiment Setup for BioLiterature

Data sources for bio-informatics:
PubMed: online archive of over 18 million
biological abstracts
PubMed Central (PMC): full-text copies of over 1
million of these papers
Saccharomyces Genome Database (SGD): a database
for yeast
Flymine: a database for fruit flies
Tasks:
Gene recommendation: author, year → gene
Venue recommendation: genes, title words → journal
Citation recommendation: title words, year → paper
Expert-finding: title words, genes → author
Data split:
2000 training, 2000 tuning, 2000 test
Time-variant graph:
each edge is tagged with a time stamp (year)
only consider edges that are earlier than the
query, during random walk
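
The time-variant graph rule amounts to a filter applied before walking. A minimal sketch, assuming edges are stored as (source, relation, target, year) tuples and that "earlier" means strictly earlier. The tuple layout and the example edges are hypothetical.

```python
def edges_visible_to(edges, query_year):
    """Keep only edges whose time stamp is strictly earlier than the query year."""
    return [(src, rel, dst) for src, rel, dst, year in edges if year < query_year]

# Hypothetical time-stamped edges.
edges = [
    ("p1", "cite", "p0", 1998),
    ("p2", "cite", "p1", 2003),
    ("a", "write", "p2", 2003),
]
visible = edges_visible_to(edges, 2003)
```

Filtering this way keeps a random walk for a 2003 query from using edges stamped 2003 or later, such as citations made by the query paper itself.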

32
BioLiterature: Some Results

Compare the mean average precision (MAP) of PRA
to:
RWR model
RWR trained with one parameter per link

Except where marked, all improvements are
statistically significant at p < 0.05 using a paired
t-test
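
For reference, mean average precision can be computed as below. This is the standard definition, not code from the study. The ranked lists and relevance sets are invented for illustration.

```python
def average_precision(ranked, relevant):
    """Average of precision@rank taken at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
m = mean_average_precision([(["a", "b", "c", "d"], {"a", "c"}), (["x", "y"], {"y"})])
```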
33
Example Path Features and their Weights

A PRA+qip+pop model trained for the citation
recommendation task on the yeast data

1) papers co-cited with on-topic papers
6) approx. standard IR retrieval
7,8) papers cited during the past two years
9) well cited papers
10,11) key early papers about specific genes
12,13) papers published during the past two years
14) old papers
34
Extension 1: Query-Independent Paths

PageRank (and other query-independent rankings)
assign an importance score (query independent) to
each web page
later combined with relevance score (query
dependent)
We generalize PageRank to heterogeneous graphs:
We add to each query a special entity e0 of
special type T0
T0 is related to all other entity types, and each
type is related to all instances of that type
This defines a set of PageRank-like query-independent
relation paths
Compute f(→ t; P) offline for efficiency
Example:

well cited papers
all papers
productive authors
all authors
35
Extension 2: Entity-specific rankings

There are entity-specific characteristics which
cannot be captured by a general model
Some items are interesting to the users because
of features not captured in the data
To model this, assume the identity of the entity
matters
Introduce new features f(s → t; Ps,t) to account
for jumping from s to t, and new features f(→ t;
P,t)
At each gradient step, add a few new features of
this sort with the highest gradient; count on
regularization to avoid overfitting

36
BioLiterature: Some Results

Compare the MAP of PRA to:
RWR model
query independent paths (qip)
popular entity biases (pop)

Except where marked, all improvements are
statistically significant at p < 0.05 using a paired
t-test
37
Example Path Features and their Weights

A PRA+qip+pop model trained for the citation
recommendation task on the yeast data

Random Walk With Reset/Personalized PageRank
What is it?
Similarity Queries
Learning How to Tune Similarity Functions for
An Application/Subdomains
Applications and Results
BioLiterature
Knowledge Base Inference

39
Outline

Motivation for Learning Similarity in Graphs
A Baseline Similarity Metric
Some Literature-related Tasks
The Path Ranking Algorithm (Learning Method)
Motivation
Details
Results: BioLiterature tasks
Results: KB Inference tasks
Lao, Mitchell, Cohen, EMNLP 2011

40
Large Scale Knowledge-Bases

Large-Scale Collections of Automatically
Extracted Knowledge
KnowItAll (Univ. Washington)
0.5B facts extracted from 0.1B web pages
DBpedia (Univ. Leipzig)
3.5M entities, 0.7B facts extracted from Wikipedia
YAGO (Max-Planck-Institute)
2M entities, 20M facts extracted from Wikipedia
and WordNet
Freebase
20M entities, 0.3B links, integrated from
different data sources and human judgments
NELL (Never-Ending Language Learning, CMU)
0.85M facts extracted from 0.5B web pages

41
Inference in Noisy Knowledge Bases

Challenges
Robustness: extracted knowledge is incomplete and
noisy
Scalability: the size of the knowledge base is large

42
The NELL Case Study

Never-Ending Language Learning: a never-ending
learning system that operates 24 hours per day,
for years, to continuously improve its ability to
read (extract structured facts from) the web
(Carlson et al., 2010)
Closed domain, semi-supervised extraction
Combines multiple strategies: morphological
patterns, textual context, HTML patterns,
logical inference
Example beliefs

43
A Link Prediction Task

We consider 48 relations for which the NELL database
has more than 100 instances
We create two link prediction tasks for each
relation:
AthletePlaysInLeague(HinesWard, ?)
AthletePlaysInLeague(?, NFL)
The actual nodes y known to satisfy R(x, ?) are
treated as labeled positive examples, and all
other nodes are treated as negative examples

44
Current NELL method (baseline)

FOIL (Quinlan and Cameron-Jones, 1993) is a
learning algorithm similar to decision trees, but
in relational domains
NELL implements two assumptions for efficient
learning
The predicates are functional, e.g. an athlete
plays in at most one league
Only find clauses that correspond to
bounded-length paths of binary relations:
relational pathfinding (Richards & Mooney, 1992)

45
Current NELL method (baseline)

FOL not great for handling uncertainty
FOIL can only combine rules with disjunctions,
therefore cannot leverage low accuracy rules
E.g. rules for teamPlaysSports

High accuracy but low recall
46
Experiments: Cross Validation on KB data (for
parameter setting, etc.)

RWR: Random Walk with Restart (PPR)
Paired t-tests give p-values $7 \times 10^{-3}$, $9 \times 10^{-4}$,
$9 \times 10^{-8}$, $4 \times 10^{-4}$
47
Example Paths
Synonyms of the query team
48
Evaluation by Mechanical Turk

There are many test queries per predicate
All entities of a predicate's domain/range, e.g.
WorksFor(person, organization)
On average 7,000 test queries for each functional
predicate, and 13,000 for each non-functional
predicate
Sampled evaluation:
We only evaluate the top-ranked result for each
query
We sort the queries for each predicate according
to the scores of their top-ranked results, and
then evaluate precision at the top 10, 100 and 1000
queries
Each belief is voted on by 5 workers
Workers are given assertions like "Hines Ward
plays for the team Steelers", as well as Google
search links for each entity
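
The sampled evaluation can be sketched as follows, assuming each query contributes one (score of top-ranked answer, judged correct) pair. The function name and the example judgments are hypothetical, and how the 5 worker votes are combined into a single judgment is not specified in the slides.

```python
def precision_at_top_queries(results, k):
    """Precision over the k queries whose top-ranked answers score highest.

    `results` holds one (score_of_top_answer, judged_correct) pair per query.
    """
    top = sorted(results, key=lambda r: r[0], reverse=True)[:k]
    return sum(1 for _, correct in top if correct) / len(top)

# Hypothetical judgments for four queries of one predicate.
results = [(0.9, True), (0.2, False), (0.8, False), (0.7, True)]
p2 = precision_at_top_queries(results, 2)
p3 = precision_at_top_queries(results, 3)
```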

49
Evaluation by Mechanical Turk

On 8 functional predicates where N-FOIL can
successfully learn:
PRA is comparable to N-FOIL for p@10, but has
significantly better p@100
On 8 randomly sampled non-functional (one-many)
predicates:
Slightly lower accuracy than functional
predicates

Task | N-FOIL Rules | N-FOIL p@10 | N-FOIL p@100 | PRA Paths | PRA p@10 | PRA p@100
Functional Predicates | 2.1(37) | 0.76 | 0.380 | 43 | 0.79 | 0.668
Non-functional Predicates | ---- | ---- | ---- | 92 | 0.65 | 0.620
PRA: Path Ranking Algorithm
50
Beyond Pure KB Inference

Following Minkov et al., 2008:
Learn paths in a graph composed of multiple
dependency trees to find synonyms, etc.

51
Learning Lexico-Syntactic Patterns

Following Minkov et al, 2008
Learn paths in a graph composed of text and
knowledge (Lao et al., EMNLP 2011)

52
Beyond Pure KB Inference

Following Minkov et al, 2008
Learn paths in a graph composed of text and
knowledge (Lao et al., EMNLP 2011)

53
Learning Lexico-Syntactic Patterns
54
Learning Lexico-Syntactic Patterns
55
Outline

Motivation for Learning Similarity in Graphs
A Baseline Similarity Metric
Some Literature-related Tasks
The Path Ranking Algorithm (Learning Method)
Motivation
Details
Results: BioLiterature tasks
Results: KB Inference tasks
Conclusions

56
Summary/Conclusion

Learning is the way to make a clean, elegant
formulation of a task work in the messy,
complicated real world
Learning how to navigate graphs is a significant,
core task that models
Recommendation, expert-finding, ...
Information retrieval
Inference in KBs
It includes significant, core learning problems:
Regularization/search of huge feature space
Discovery: long paths, lexicalized paths, ...
Incorporating knowledge of graph structure
...

57
Looking Forward

PRA learns very restricted inference rules:
desiredResult(Query, Result) ←
p1(Query, X1), p2(X1, X2), ..., pk(Xk-1, Result)
Can you generalize from these to a larger set of
inference rules?
Can you generalize from binary to n-ary
relationships?
Can you jointly learn several relationships at
once?
PRA learns to navigate real graphs
What about graphs that are built on-the-fly?
E.g., graphs that summarize a program's
execution, or a theorem-prover's behavior?
Future work?

Thanks to
My co-authors on this work
All of you for being here
NSF grant IIS-0811562
NIH grant R01GM081293
Gifts from Google
CIKM Organizers!
