
Transcript and Presenter's Notes

Title: Learning Similarity Measures Based on Random Walks in Graphs


1
Learning Similarity Measures Based on Random
Walks in Graphs
  • William W. Cohen
  • Machine Learning Department and Language
    Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • joint work with
  • Ni Lao, CMU → Google
  • Tom Mitchell, CMU
  • Einat Minkov, Univ Haifa
  • Amarnag Subramanya, Fernando Pereira, Google

2
Motivation: The simple and the complex
  • In computer science there is a tension between
  • The elegant, simple and general
  • The messy, complex and problem-specific
  • Graphs are
  • Simple, so they are easy to analyze and store
  • General, so
  • They appear in many contexts
  • They are often a natural representation of
    important aspects of information
  • Well-understood

3
Motivation: The simple and the complex
  • The real world is complex
  • Learning is a way to incorporate that
    complexity into our models without sacrificing
    elegance and generality

4
Motivation: The simple and the complex
  • This talk: Learning Similarity Measures Based on
    Random Walks in Graphs
  • Many fundamental tasks in computer science map an
    input to an output
  • i.e., the task can be modeled as a relation
    between input and output
  • and further, the relation can often be viewed as
    a similarity relation: the desired outputs are
    similar to the input (query)
  • we want to learn this relationship
  • even if (especially if) it is complex
  • even if it is described by a multi-step process
  • Here: one line of work on learning complex
    relationships

5
Motivation: The simple and the complex
  • This talk
  • One line of work on learning complex
    relationships
  • Not covered here
  • Minkov et al., 2006, 2008, 2011: a similar framework
    for personalized information-management queries
    and NLP relationships (e.g., synonyms), using
    generative and reranking-based learning
    strategies
  • Backstrom and Leskovec, 2011: an alternative, very
    expressive parameterization for learning complex
    similarity metrics in graphs with feature vectors
    on the edges

6
Similarity Queries on Graphs
  • 1) Given type t and node x in G, find y: T(y) = t
    and y ∼ x.
  • 2) Given type t and node set X, find y: T(y) = t
    and y ∼ X.
  • Nearest-neighbor classification
  • G contains feature nodes and instance nodes
  • A link (x,f) means feature f is true for instance
    x
  • x is a query instance; y ∼ x means y is likely of
    the same class as x
  • Information retrieval
  • G contains word nodes and document nodes
  • A link (w,d) means word w is in document d
  • X is a set of keywords; y ∼ X means y is likely to
    be relevant to X
  • Database retrieval
  • G encodes a database
  • ?

7
BANKS: Browsing and Keyword Search
[Aditya et al., VLDB 2002]
  • Database is modeled as a graph
  • Nodes = tuples
  • Edges = references between tuples
  • edges are directed and indicate foreign keys,
    inclusion dependencies, etc.

[Figure: a small graph with a paper node "MultiQuery
Optimization" linked by writes edges to author nodes
"S. Sudarshan" and "Prasan Roy"]
8
Query: "sudarshan, roy"  →  Answer: subtree from the graph
[Figure: the same paper/author graph, with the answer
subtree highlighted]
9
[Figure: the answer as a tree: a paper y with
paper(y) ∼ "sudarshan" AND paper(y) ∼ "roy"]
Query: "sudarshan, roy"  →  Answer: subtree from the
graph
11
Similarity Queries on Graphs
  • 1) Given type t and node x in G, find y: T(y) = t
    and y ∼ x.
  • 2) Given type t and node set X, find y: T(y) = t
    and y ∼ X.
  • Nearest-neighbor classification
  • Information retrieval
  • Database retrieval
  • Evaluation: specific families of tasks for
    scientific publications
  • Citation recommendation for a paper (given
    title, year, etc., of paper p, what papers should
    be cited by p?)
  • Expert-finding (given keywords, genes, suggest
    a possible author)
  • Entity recommendation (given title, author,
    year, predict entities mentioned in a paper,
    e.g. gene-protein entities); can improve NER
  • Literature recommendation: given researcher and
    year, suggest papers to read that year
  • Evaluation: inference in a DB of
    automatically-extracted facts

Core tasks in CS
12
Similarity Queries on Graphs
For each task:
[Diagram: (query 1, ans 1), (query 2, ans 2), … are fed
to a LEARNER, which outputs Sim(s,p), a mapping from
query → answers; the learned measure is a variant of
PPR, and may use PPR as a feature]
  • Evaluation: specific families of tasks for
    scientific publications
  • Citation recommendation for a paper (given
    title, year, etc., of paper p, what papers should
    be cited by p?)
  • Expert-finding (given keywords, genes, suggest
    a possible author)
  • Entity recommendation (given title, author,
    year, predict entities mentioned in a paper,
    e.g. gene-protein entities)
  • Literature recommendation: given researcher and
    year, suggest papers to read that year
  • Evaluation: inference in a DB of
    automatically-extracted facts

13
Outline
  • Motivation for Learning Similarity in Graphs
  • A Baseline Similarity Metric
  • Some Literature-related Tasks
  • The Path Ranking Algorithm (Learning Method)
  • Motivation
  • Details
  • Results: BioLiterature tasks
  • Results: KB inference tasks

14
Defining Similarity on Graphs: PPR/RWR
Personalized PageRank [1999]
  • Given type t and node x, find y: T(y) = t and y ∼ x.
  • Similarity defined by a damped version of
    PageRank
  • Similarity between nodes x and y:
  • Random surfer model: from a node z,
  • with probability α, teleport back to x (reset)
  • else pick y uniformly from {y : z → y}
  • repeat from node y ...
  • Similarity x ∼ y = Pr( surfer is at y | restart is
    always to x )
  • Intuitively, x ∼ y is the sum of the weights of all
    paths from x to y, where the weight of a path
    decreases exponentially with length (and fanout)
  • Can easily extend to a query set X = {x1, …, xk}
  • Disadvantages?
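A minimal sketch of this random-surfer computation (power
iteration with restart), assuming the graph is given as an
adjacency dict; all names here are illustrative, not from
the talk:

```python
from collections import defaultdict

def personalized_pagerank(graph, x, alpha=0.15, iters=50):
    """Approximate PPR scores with restart node x.

    graph: dict mapping node -> list of out-neighbors.
    alpha: restart (teleport) probability.
    Returns dict: node -> Pr(surfer at node | restarts go to x).
    """
    scores = {x: 1.0}
    for _ in range(iters):
        nxt = defaultdict(float)
        nxt[x] += alpha                  # restart mass always returns to x
        for z, mass in scores.items():
            out = graph.get(z, [])
            if not out:                  # dangling node: send mass back to x
                nxt[x] += (1 - alpha) * mass
                continue
            for y in out:                # walk uniformly over out-edges
                nxt[y] += (1 - alpha) * mass / len(out)
        scores = dict(nxt)
    return scores
```

Extending to a query set X = {x1, …, xk} just spreads the
restart mass uniformly over X instead of a single x.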

15
Some BioLiterature Retrieval Tasks
  • Data used in this study
  • Yeast: 0.2M nodes, 5.5M links
  • Fly: 0.8M nodes, 3.5M links
  • E.g., the fly graph:

16
Learning Proximity Measures for BioLiterature
Retrieval Tasks
  • Tasks
  • Gene recommendation: author, year → gene
  • Reference recommendation: words, year → paper
  • Expert-finding: words, genes → author
  • Literature recommendation: author, papers read in
    the past → papers
  • Baseline method
  • Typed RWR proximity methods
  • Baseline learning method
  • parameterize Prob(walk edge | edge label = L) and
    tune the parameters for each label L (somehow)

[Figure: per-label walk probabilities, e.g. P(L=cite) = a,
P(L=write) = b, P(L=NE) = c, P(L=bindTo) = d,
P(L=express) = d]
17
Path-based vs. edge-label-based learning
  • RWR is a very robust and useful similarity metric
  • Learning one parameter per edge label is very
    limited
  • In many cases, there aren't enough parameters to
    express a complex relationship

18
Path-based vs. edge-label-based learning
  • Learning one parameter per edge label is limited
    because the context in which an edge label
    appears is ignored
  • E.g. (observed from real data; task: find papers
    to read):
  • Instead, we will learn path-specific parameters

Path | Comment
author −read→ paper −contain→ gene −contain⁻¹→ paper | Don't read about genes I've already read about
author −read→ paper −write⁻¹→ author −write→ paper | Do read papers from my favorite authors
  • Paths will be interpreted as constrained random
    walks that give a similarity-like weight to every
    reachable node
  • Step 0: D0 = {a}; start at author a
  • Step 1: D1 = uniform over all papers p read by a
  • Step 2: D2 = authors a′ of papers in D1, weighted
    by the number of papers in D1 written by a′
  • Step 3: D3 = papers p written by a′, weighted by
    ...
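To make the step-by-step distributions concrete, here is a
small sketch of the author −read→ paper −write⁻¹→ author
−write→ paper walk; the tiny read/write graph is invented
purely for illustration:

```python
def step(dist, edges):
    """One path-constrained step: follow only edges with the
    required label, uniformly. dist maps node -> probability."""
    nxt = {}
    for x, p in dist.items():
        targets = edges.get(x, [])
        for y in targets:
            nxt[y] = nxt.get(y, 0.0) + p / len(targets)
    return nxt

# Hypothetical toy data: which papers each author read / wrote.
read      = {"ann": ["p1", "p2"]}
wrote     = {"bob": ["p1", "p3"], "carol": ["p2", "p3"]}
wrote_inv = {"p1": ["bob"], "p2": ["carol"], "p3": ["bob", "carol"]}

D0 = {"ann": 1.0}          # Step 0: start at author a
D1 = step(D0, read)        # Step 1: papers read by a
D2 = step(D1, wrote_inv)   # Step 2: authors of those papers
D3 = step(D2, wrote)       # Step 3: papers by those authors
print(D3)                  # {'p1': 0.25, 'p3': 0.5, 'p2': 0.25}
```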

19
A Limitation of RWR Learning Methods
  • Learning one parameter per edge label is limited
    because the context in which an edge label
    appears is ignored
  • E.g. (observed from real data; task: find papers
    to read):
  • Instead, we will learn path-specific parameters

Path | Comment
author −read→ paper −contain→ gene −contain⁻¹→ paper | Don't read about genes I've already read about
author −read→ paper −write⁻¹→ author −write→ paper | Do read papers from my favorite authors

Path | Comment
author −write→ paper −contain→ gene −contain⁻¹→ paper | Do read about the genes I'm working on
author −write→ paper −publish⁻¹→ institute −publish→ paper | Don't read papers from my own lab
20
Path-Constrained Random Walks as the Basis of a
Proximity Measure
  • Our work (Lao & Cohen, ECML 2010):
  • learn a weighted combination of simple "path
    experts", each of which corresponds to a
    particular labeled path through the graph
  • Citation recommendation, an example:
  • In the TREC-CHEM Prior Art Search Task,
    researchers found that it is more effective to
    first find patents about the topic, then
    aggregate their citations
  • Our proposed model can discover this kind of
    retrieval scheme and assign proper weights to
    combine such paths. E.g.:

[Figure: weighted paths]
21
Definitions
  • A graph G = (T, R, X, E) is
  • a set of entity types T = {T} and a set of
    relations R = {R}
  • a set of entities (nodes) X = {x}, where each node
    x has a type from T
  • a set of edges e = (x, y), where each edge has a
    relation label from R
  • A path P = (R1, …, Rn) is a sequence of relations
  • Path-constrained random walk:
  • Given a query set S of source nodes
  • Distribution D0 at time 0 is uniform over s in S
  • Distribution Dt at time t > 0 is formed by:
  • Pick x from Dt-1
  • Pick y uniformly from all things related to x
    by an edge labeled Rt
  • Notation: fP(s,t) = Prob(s → t | P)
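Spelled out, this definition gives a simple recursion for
the walk distributions; writing R_t(x) = {y : x −Rt→ y},
the following is just a transcription of the slide into
formulas:

```latex
D_0(s) = \frac{1}{|S|} \ \text{for } s \in S, \qquad
D_t(y) = \sum_{x :\, y \in R_t(x)} \frac{D_{t-1}(x)}{|R_t(x)|}, \qquad
f_P(s,t) = D_n(t) \ \text{with } S = \{s\}.
```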

22
x −AthletePlaysForTeam→ y −TeamPlaysInLeague→ z
23
Path Ranking Algorithm (PRA)
[Lao & Cohen, ECML 2010]
  • A PRA model scores a source-target node pair by a
    linear function of their path features:
    score(s,t) = Σ_P θ_P f_P(s,t)
  • where P ranges over the set of all relation paths
    with length ≤ L (restricted to paths with support
    in the data, in some cases; see Lao and Cohen,
    EMNLP 2011)
  • For a relation R and a set of node pairs {(si,
    ti)}, we construct a training dataset D = {(xi,
    yi)}, where xi is a vector of all the path
    features for (si, ti), and yi indicates whether
    R(si, ti) is true or not
  • θ is estimated using L1/L2-regularized logistic
    regression
  • We've gone from a small parameter space to a huge
    one
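A hedged sketch of this pipeline, assuming a helper
path_prob(P, s, t) that computes f_P(s,t) as defined above,
and using scikit-learn's elastic-net logistic regression as
a stand-in for the paper's L1/L2-regularized fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pra_features(pairs, paths, path_prob):
    """One row per (s, t) pair, one column per relation path P,
    holding the walk probability f_P(s, t)."""
    return np.array([[path_prob(P, s, t) for P in paths]
                     for s, t in pairs])

def train_pra(pairs, labels, paths, path_prob):
    """pairs: [(s_i, t_i)]; labels: y_i = 1 iff R(s_i, t_i) holds."""
    X = pra_features(pairs, paths, path_prob)
    # elastic net approximates the combined L1 + L2 regularization
    model = LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, C=1.0, max_iter=1000)
    model.fit(X, labels)
    return model  # model.coef_ holds one weight theta_P per path
```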

24
Supervised PCRW Retrieval Model
  • A retrieval model ranks target entities by
    linearly combining the distributions of different
    paths
  • This model can be optimized by maximizing the
    probability of the observed relevance
  • Given a set of training data D = {(q(m), A(m),
    y(m))}, with y(m)(e) ∈ {0, 1}

25
Parameter Estimation (Details)
  • Given a set of training data
    D = {(q(m), A(m), y(m))}, m = 1…M, with
    y(m)(e) ∈ {0, 1}
  • We can define a regularized objective function
  • Use the average log-likelihood as the per-query
    objective o(m)(θ)
  • P(m): the index set of relevant entities
  • N(m): the index set of irrelevant entities (how
    to choose them is discussed below)
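The objective itself appears as an image in the deck; per
Lao & Cohen (ECML 2010) it has roughly this form, with
p_e^{(m)} the model's predicted relevance of entity e for
query m:

```latex
O(\theta) = \sum_{m=1}^{M} o^{(m)}(\theta)
  - \lambda_1 \lVert\theta\rVert_1
  - \lambda_2 \lVert\theta\rVert_2^2,
\qquad
o^{(m)}(\theta) = \frac{1}{|P^{(m)}|} \sum_{e \in P^{(m)}} \ln p^{(m)}_e
  + \frac{1}{|N^{(m)}|} \sum_{e \in N^{(m)}} \ln\bigl(1 - p^{(m)}_e\bigr).
```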

26
Parameter Estimation (Details)
  • Selecting the negative entity set N(m):
  • few positive entities vs. thousands (or millions)
    of negative entities?
  • first, sort all the negative entities with a
    uniform-weight RWR model
  • then take the negative entities at the
    k(k+1)/2-th positions, for k = 1, 2, …
  • The gradient is simple
  • Use orthant-wise L-BFGS (Andrew & Gao, 2007) to
    estimate θ
  • efficient, and can deal with L1 regularization
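A small sketch of that selection rule, assuming candidates
is the list of negative entities already ranked by the
uniform-weight RWR model:

```python
def sample_negatives(candidates, max_k=None):
    """Take the 1st, 3rd, 6th, 10th, ... (k(k+1)/2-th) entities
    from a ranked list, so negatives are dense near the top of
    the ranking and sparse further down."""
    picked, k = [], 1
    while True:
        idx = k * (k + 1) // 2 - 1   # 0-based position of k(k+1)/2
        if idx >= len(candidates) or (max_k is not None and k > max_k):
            return picked
        picked.append(candidates[idx])
        k += 1
```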

27
L2 Regularization
  • Improves retrieval quality
  • On the citation recommendation task

28
L1 Regularization
  • Does not improve retrieval quality

29
L1 Regularization
  • ... but can help reduce the number of features

30
Another potential regularization: approximate RWR
31
Experiment Setup for BioLiterature
  • Data sources for bio-informatics
  • PubMed: on-line archive of over 18 million
    biological abstracts
  • PubMed Central (PMC): full-text copies of over 1
    million of these papers
  • Saccharomyces Genome Database (SGD): a database
    for yeast
  • Flymine: a database for fruit flies
  • Tasks
  • Gene recommendation: author, year → gene
  • Venue recommendation: genes, title words → journal
  • Citation recommendation: title words, year → paper
  • Expert-finding: title words, genes → author
  • Data split
  • 2000 training, 2000 tuning, 2000 test
  • Time-variant graph
  • each edge is tagged with a time stamp (year)
  • during random walks, only consider edges that are
    earlier than the query (see the sketch below)
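The time-variant restriction can be implemented as a filter
during neighbor expansion; a hypothetical sketch:

```python
def neighbors_before(graph, node, label, query_year):
    """Follow only edges whose time stamp precedes the query
    year, so walks never use information from the 'future'.
    graph: dict node -> list of (label, target, year) triples."""
    return [target for (lbl, target, year) in graph.get(node, [])
            if lbl == label and year < query_year]
```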

32
BioLiterature: Some Results
  • Compare the mean average precision (MAP) of PRA
    to:
  • the RWR model
  • RWR trained with one parameter per link

Except for the marked entries, all improvements are
statistically significant at p < 0.05 using a paired
t-test
33
Example Path Features and their Weights
  • A PRA+qip+pop model trained for the citation
    recommendation task on the yeast data

1) papers co-cited with on-topic papers
6) approx. standard IR retrieval
7,8) papers cited during the past two years
9) well cited papers
10,11) key early papers about specific genes
12,13) papers published during the past two years
14) old papers
34
Extension 1: Query-Independent Paths
  • PageRank (and other query-independent rankings):
  • assign an importance score (query-independent) to
    each web page
  • later combined with a relevance score
    (query-dependent)
  • We generalize PageRank to heterogeneous graphs
  • We add to each query a special entity e0 of a
    special type T0
  • T0 is related to all other entity types, and each
    type is related to all instances of that type
  • This defines a set of PageRank-like,
    query-independent relation paths
  • Compute f(→ t | P) offline for efficiency
  • Example:

[Figure: e.g., e0 → all papers → well-cited papers;
e0 → all authors → productive authors]
35
Extension 2: Entity-Specific Rankings
  • There are entity-specific characteristics which
    cannot be captured by a general model
  • Some items are interesting to users because
    of features not captured in the data
  • To model this, assume the identity of the entity
    matters
  • Introduce new features f(s → t | P, s, t) to
    account for jumping from s to t, and new features
    f(→ t | P, t)
  • At each gradient step, add a few new features of
    this sort with the highest gradient, and count on
    regularization to avoid overfitting

36
BioLiterature: Some Results
  • Compare the MAP of PRA to:
  • the RWR model
  • + query-independent paths (qip)
  • + popular entity biases (pop)

Except for the marked entries, all improvements are
statistically significant at p < 0.05 using a paired
t-test
37
Example Path Features and their Weights
  • A PRA+qip+pop model trained for the citation
    recommendation task on the yeast data

1) papers co-cited with on-topic papers
6) approx. standard IR retrieval
7,8) papers cited during the past two years
9) well cited papers
10,11) key early papers about specific genes
12,13) papers published during the past two years
14) old papers
38
Outline
  • Random Walk with Restart / Personalized PageRank
  • What is it?
  • Similarity queries
  • Learning how to tune similarity functions for
    an application / subdomain
  • Applications and results
  • BioLiterature
  • Knowledge base inference

39
Outline
  • Motivation for Learning Similarity in Graphs
  • A Baseline Similarity Metric
  • Some Literature-related Tasks
  • The Path Ranking Algorithm (Learning Method)
  • Motivation
  • Details
  • Results: BioLiterature tasks
  • Results: KB inference tasks
  • [Lao, Mitchell, Cohen, EMNLP 2011]

40
Large-Scale Knowledge Bases
  • Large-scale collections of automatically
    extracted knowledge:
  • KnowItAll (Univ. Washington)
  • 0.5B facts extracted from 0.1B web pages
  • DBpedia (Univ. Leipzig)
  • 3.5M entities, 0.7B facts extracted from Wikipedia
  • YAGO (Max Planck Institute)
  • 2M entities, 20M facts extracted from Wikipedia
    and WordNet
  • FreeBase
  • 20M entities, 0.3B links, integrated from
    different data sources and human judgments
  • NELL (Never-Ending Language Learning, CMU)
  • 0.85M facts extracted from 0.5B webpages

41
Inference in Noisy Knowledge Bases
  • Challenges:
  • Robustness: extracted knowledge is incomplete and
    noisy
  • Scalability: the knowledge base is large

42
The NELL Case Study
  • Never-Ending Language Learning: a never-ending
    learning system that operates 24 hours per day,
    for years, to continuously improve its ability to
    read (extract structured facts from) the web
    (Carlson et al., 2010)
  • Closed-domain, semi-supervised extraction
  • Combines multiple strategies: morphological
    patterns, textual context, HTML patterns,
    logical inference
  • Example beliefs:

43
A Link Prediction Task
  • We consider 48 relations for which the NELL
    database has more than 100 instances
  • We create two link prediction tasks for each
    relation:
  • AthletePlaysInLeague(HinesWard, ?)
  • AthletePlaysInLeague(?, NFL)
  • The actual nodes y known to satisfy R(x, ?) are
    treated as labeled positive examples, and all
    other nodes are treated as negative examples

44
Current NELL method (baseline)
  • FOIL (Quinlan and Cameron-Jones, 1993) is a
    learning algorithm similar to decision trees, but
    for relational domains
  • NELL implements two assumptions for efficient
    learning:
  • the predicates are functional, e.g., an athlete
    plays in at most one league
  • only find clauses that correspond to
    bounded-length paths of binary relations:
    "relational pathfinding" (Richards & Mooney, 1992)

45
Current NELL method (baseline)
  • First-order logic is not great for handling
    uncertainty
  • FOIL can only combine rules with disjunctions,
    and therefore cannot leverage low-accuracy rules
  • E.g., rules for teamPlaysSports:

[Figure: learned rules, high accuracy but low recall]
46
Experiments: Cross-Validation on KB Data
(for parameter setting, etc.)

RWR = Random Walk with Restart (PPR)
Paired t-tests give p-values of 7×10⁻³, 9×10⁻⁴,
9×10⁻⁸, and 4×10⁻⁴
47
Example Paths
Synonyms of the query team
48
Evaluation by Mechanical Turk
  • There are many test queries per predicate:
  • all entities of a predicate's domain/range, e.g.,
    WorksFor(person, organization)
  • on average, 7,000 test queries for each functional
    predicate, and 13,000 for each non-functional
    predicate
  • Sampled evaluation:
  • we only evaluate the top-ranked result for each
    query
  • we sort the queries for each predicate according
    to the scores of their top-ranked results, and
    then evaluate precision at the top 10, 100 and
    1000 queries
  • Each belief is voted on by 5 workers
  • Workers are given assertions like "Hines Ward
    plays for the team Steelers", as well as Google
    search links for each entity

49
Evaluation by Mechanical Turk
  • On the 8 functional predicates where N-FOIL can
    successfully learn:
  • PRA is comparable to N-FOIL for p@10, but has
    significantly better p@100
  • On 8 randomly sampled non-functional (one-to-many)
    predicates:
  • slightly lower accuracy than on functional
    predicates

Task | N-FOIL #Rules | N-FOIL p@10 | N-FOIL p@100 | PRA #Paths | PRA p@10 | PRA p@100
Functional predicates | 2.1 (37) | 0.76 | 0.380 | 43 | 0.79 | 0.668
Non-functional predicates | -- | -- | -- | 92 | 0.65 | 0.620

PRA = Path Ranking Algorithm
50
Beyond Pure KB Inference
  • Following Minkov et al., 2008:
  • learn paths in a graph composed of multiple
    dependency trees, to find synonyms, etc.

51
Learning Lexico-Syntactic Patterns
  • Following Minkov et al., 2008:
  • learn paths in a graph composed of text and
    knowledge [Lao et al., EMNLP 2011]

52
Beyond Pure KB Inference
  • Following Minkov et al., 2008:
  • learn paths in a graph composed of text and
    knowledge [Lao et al., EMNLP 2011]

53
Learning Lexico-Syntactic Patterns
54
Learning Lexico-Syntactic Patterns
55
Outline
  • Motivation for Learning Similarity in Graphs
  • A Baseline Similarity Metric
  • Some Literature-related Tasks
  • The Path Ranking Algorithm (Learning Method)
  • Motivation
  • Details
  • Results: BioLiterature tasks
  • Results: KB inference tasks
  • Conclusions

56
Summary/Conclusion
  • Learning is the way to make a clean, elegant
    formulation of a task work in the messy,
    complicated real world
  • Learning how to navigate graphs is a significant,
    core task that models:
  • recommendation, expert-finding, …
  • information retrieval
  • inference in KBs
  • It includes significant, core learning problems:
  • regularization / search of a huge feature space
  • discovery of long paths, lexicalized paths, …
  • incorporating knowledge of graph structure
  • …

57
Looking Forward
  • PRA learns very restricted inference rules:
  • desiredResult(Query,Result) ←
  • p1(Query,X1), p2(X1,X2), …, pk(Xk-1,Result)
  • Can you generalize from these to a larger set of
    inference rules?
  • Can you generalize from binary to n-ary
    relationships?
  • Can you jointly learn several relationships at
    once?
  • PRA learns to navigate real graphs
  • What about graphs that are built on the fly?
  • E.g., graphs that summarize a program's
    execution, or a theorem-prover's behavior?
  • Future work?

58
  • Thanks to
  • My co-authors on this work
  • All of you for being here
  • NSF grant IIS-0811562
  • NIH grant R01GM081293
  • Gifts from Google
  • CIKM Organizers!
