Title: Statistical Learning from Relational Data
1. Statistical Learning from Relational Data
- Daphne Koller
- Stanford University
- Joint work with many, many people
2. Relational Data is Everywhere
- The web
  - Webpages (and the entities they represent), hyperlinks
- Social networks
  - People, institutions, friendship links
- Biological data
  - Genes, proteins, interactions, regulation
- Bibliometrics
  - Papers, authors, journals, citations
- Corporate databases
  - Customers, products, transactions
3. Relational Data is Different
- Data instances are not independent
  - Topics of linked webpages are correlated
- Data instances are not identically distributed
  - Heterogeneous instances (papers, authors)
No IID assumption! This is a good thing!
4. New Learning Tasks
- Collective classification of related instances
  - Labeling an entire website of related webpages
- Relational clustering
  - Finding coherent clusters in the genome
- Link prediction / classification
  - Predicting when two people are likely to be friends
- Pattern detection in networks of related objects
  - Finding groups (research groups, terrorist groups)
5. Probabilistic Models
- Uncertainty model:
  - Space of possible worlds
  - Probability distribution over this space
- Worlds often defined via a set of state variables
  - Medical diagnosis: diseases, symptoms, findings, ...
  - Each world: an assignment of values to variables
- Number of worlds is exponential in the number of variables
  - 2^n if we have n binary variables
6. Outline
- Relational Bayesian networks
- Relational Markov networks
- Collective Classification
- Relational clustering
with Avi Pfeffer, Nir Friedman, Lise Getoor
7. Bayesian Networks
[Figure: example network over Difficulty, Intelligence, Grade, SAT, Letter, Job]
- Nodes: variables; edges: direct influence
- Graph structure encodes independence assumptions: Letter is conditionally independent of Intelligence given Grade
8. BN Semantics
conditional independencies in BN structure + local probability models = full joint distribution over domain
- Compact, natural representation
  - If nodes have ≤ k parents, need about 2^k·n params vs. 2^n
  - Parameters natural and easy to elicit
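The parameter-count comparison on this slide can be checked with a short sketch (illustrative Python, not part of the original deck):

```python
# Parameter counts for n binary variables: a full joint table
# vs. a Bayesian network whose nodes each have at most k parents.

def full_joint_params(n):
    # One probability per assignment, minus 1 for normalization.
    return 2 ** n - 1

def bn_params(n, k):
    # Each of the n CPDs has 2^k rows (one per parent assignment),
    # with one free parameter per row for a binary variable.
    return n * (2 ** k)

print(full_joint_params(20))   # 1048575
print(bn_params(20, 3))        # 160
```

Even for 20 variables the factored representation is four orders of magnitude smaller.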
9. Reasoning using BNs
[Figure: the network over Difficulty, Intelligence, Grade, SAT, Letter, with evidence on Letter and SAT]
Full joint distribution specifies the answer to any query: P(variable | evidence about others)
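A toy illustration of answering P(variable | evidence) by summing the full joint; the two-node network (Intelligence → SAT) and its CPD numbers are invented for the sketch:

```python
# Inference by enumeration: the full joint distribution
# answers any query P(variable | evidence).

# P(I) and P(S | I) for binary Intelligence (I) and SAT (S).
p_I = {0: 0.7, 1: 0.3}
p_S_given_I = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}

def joint(i, s):
    # Chain rule: P(I, S) = P(I) P(S | I).
    return p_I[i] * p_S_given_I[i][s]

# Query: P(I = 1 | S = 1) by summing the joint.
num = joint(1, 1)
den = sum(joint(i, 1) for i in (0, 1))
print(round(num / den, 3))   # 0.873
```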
10. Bayesian Networks: Problem
- Bayesian nets use a propositional representation
- Real world has objects, related to each other
[Figure: Intelligence, Difficulty, and Grade variables replicated for students A and C]
These instances are not independent
11. Relational Schema
- Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects
[Figure: classes Student (attribute Intelligence), Professor (attribute Teaching-Ability), Course (attribute Difficulty); relations Teach, Take, In]
12. St. Nordaf University
[Figure: an example world: Prof. Smith and Prof. Jones teach courses, including "Welcome to CS101"; George and Jane are registered in courses, with Grade and Satisfaction attributes on each registration]
13. Relational Bayesian Networks
- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies
- Links define potential interactions
K., Pfeffer; Poole; Ngo, Haddawy
14. RBN Semantics
- Ground model:
  - Variables: attributes of all objects
  - Dependencies: determined by relational links and the template model
[Figure: ground network over Prof. Smith, Prof. Jones, George, Jane, and the course "Welcome to CS101"]
15. The Web of Influence
[Figure: beliefs about course difficulty (easy / hard) and student intelligence (low / high) propagate through the ground network]
16. Likelihood Function
- Likelihood of a BN with shared parameters
- Joint likelihood is a product of likelihood terms
  - One for each attribute X.A and its family
- For each X.A, the likelihood function aggregates counts from all occurrences x.A in world ω
Friedman, Getoor, K., Pfeffer, 1999
17. Likelihood Function: Multinomials
[Equations: the log-likelihood decomposes into one term per parameter, weighted by the aggregated counts (sufficient statistics)]
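A minimal sketch of why shared parameters make the log-likelihood depend only on aggregated counts: the same class-level CPD is reused by every object, so occurrences can be pooled into sufficient statistics (the attribute values and parameters below are invented):

```python
# Shared-parameter multinomial log-likelihood: one CPD per
# class-level attribute X.A, reused by every object x, so the
# likelihood only needs aggregated counts N[parent_config, value].
import math
from collections import Counter

def log_likelihood(theta, observations):
    # observations: one (parent_config, value) pair per
    # occurrence x.A in the world.
    counts = Counter(observations)           # sufficient statistics
    return sum(n * math.log(theta[u][v]) for (u, v), n in counts.items())

theta = {"easy": {"A": 0.7, "B": 0.3}, "hard": {"A": 0.2, "B": 0.8}}
obs = [("easy", "A"), ("easy", "A"), ("hard", "B")]
print(round(log_likelihood(theta, obs), 3))   # -0.936
```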
18. RBN Parameter Estimation
- MLE parameters: computed from aggregated sufficient statistics
- Bayesian estimation:
  - Prior for each attribute X.A
  - Posterior uses aggregated sufficient statistics
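A sketch of both estimators from aggregated counts (the counts and the symmetric Dirichlet prior below are invented for illustration):

```python
# MLE vs. Bayesian estimation of a multinomial CPD row from
# aggregated sufficient statistics.

def mle(counts):
    # Maximum-likelihood: normalized counts.
    total = sum(counts.values())
    return {v: n / total for v, n in counts.items()}

def bayes(counts, alpha=1.0):
    # Posterior mean under a symmetric Dirichlet(alpha) prior.
    total = sum(counts.values()) + alpha * len(counts)
    return {v: (n + alpha) / total for v, n in counts.items()}

counts = {"A": 8, "B": 2}       # aggregated over all objects
print(mle(counts))              # {'A': 0.8, 'B': 0.2}
print(bayes(counts))            # {'A': 0.75, 'B': 0.25}
```

The prior smooths the estimate toward uniform, which matters when some attribute values have few occurrences.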
19. Learning RBN Structure
- Define set of legal RBN structures
  - Ones with legal class dependency graphs
- Define scoring function: Bayesian score
  - Product of family scores
    - One for each X.A
    - Uses aggregated sufficient statistics
- Search for high-scoring legal structure
Friedman, Getoor, K., Pfeffer, 1999
20. Learning RBN Structure
- All operations done at class level
  - Dependency structure: parents for X.A
  - Acyclicity checked using class dependency graph
  - Score computed at class level
- Individual objects only contribute to sufficient statistics
  - Can be obtained efficiently using standard DB queries
21. Outline
- Relational Bayesian networks
- Relational Markov networks
- Collective Classification
- Relational clustering
with Avi Pfeffer, Nir Friedman, Lise Getoor
with Ben Taskar, Pieter Abbeel
22. Why Undirected Models?
- Symmetric, non-causal interactions
  - E.g., web: categories of linked pages are correlated
  - Cannot introduce directed edges because of cycles
- Patterns involving multiple entities
  - E.g., web: triangle patterns
  - Directed edges not appropriate
- "Solution": impose arbitrary direction
  - Not clear how to parameterize CPD for variables involved in multiple interactions
  - Very difficult within a class-based parameterization
Taskar, Abbeel, K. 2001
23. Markov Networks
- A Markov network is an undirected graph over some set of variables V
- The graph is associated with a set of potentials φ_i
  - Each potential is a factor over a subset V_i
  - Variables in V_i must be a (sub)clique in the network
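A minimal sketch of the Markov-network joint distribution as a normalized product of potentials, here with a single clique {X, Y} and invented potential values:

```python
# Markov network joint: P(x) = (1/Z) * product of clique
# potentials; tiny two-variable example with one potential.
import itertools

phi = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}

# Partition function: sum of the unnormalized product
# over all joint assignments.
Z = sum(phi[xy] for xy in itertools.product((0, 1), repeat=2))

def p(x, y):
    return phi[(x, y)] / Z

print(p(0, 0))   # 5/12: agreeing values are more likely
```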
24. Markov Networks
[Figure: example network over people: James, Mary, Kyle, Noah, Laura]
25. Relational Markov Networks
- Universals: probabilistic patterns hold for all groups of objects
- Locality: represent local probabilistic dependencies
- Sets of links give us possible interactions
26. RMN Semantics
[Figure: ground network: the Intelligence and Grade variables of George, Jane, and Jill, coupled through the Geo and CS study groups and the course "Welcome to CS101"]
27. Outline
- Relational Bayesian Networks
- Relational Markov Networks
- Collective Classification
- Discriminative training
- Web page classification
- Link prediction
- Relational clustering
with Ben Taskar, Carlos Guestrin, Ming Fai
Wong, Pieter Abbeel
28. Collective Classification
[Diagram: training data (features D.x, labels D.y) + probabilistic relational model → learning → model structure; new data (features D.x) + inference → conclusions (labels D.y)]
Example:
- Train on one year of student intelligence, course difficulty, and grades
- Given only grades in the following year, predict all students' intelligence
29. Learning RMN Parameters
Parameterize potentials as a log-linear model
[Equation: template potential shared across all groundings]
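A sketch of a log-linear template potential, the exponential of a weighted feature sum, with invented feature names and weights:

```python
# Log-linear template potential: phi = exp(w . f), where the
# shared weight vector w is applied to the feature counts f
# of every grounding of the template.
import math

def potential(w, f):
    # w, f: dicts mapping feature name -> weight / count.
    return math.exp(sum(w[k] * f.get(k, 0.0) for k in w))

w = {"same_label": 1.2, "word_match": 0.5}
f = {"same_label": 1.0, "word_match": 2.0}
print(round(potential(w, f), 3))   # exp(2.2) = 9.025
```

Because the weights are shared, learning adjusts one parameter vector that affects every grounding of the template at once.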
30. Max Likelihood Estimation
We don't care about the joint distribution P(D.x, D.y)!
[Equations: estimation maximizes the likelihood over w; classification computes argmax_y]
31. Web → KB
[Figure: WebKB domain of university webpages]
Craven et al.
32. Web Classification Experiments
- WebKB dataset
- Four CS department websites
- Bag of words on each page
- Links between pages
- Anchor text for links
- Experimental setup
- Trained on three universities
- Tested on fourth
- Repeated for all four combinations
33. Standard Classification
[Figure: a page classified from its own words]
- Categories: faculty, course, project, student, other
- Word features: professor, department, extract, information, computer, science, machine, learning
34. Standard Classification
[Figure: test-set error; anchor text such as "working with Tom Mitchell" provides additional features]
- Discriminatively trained naïve Markov = logistic regression
- 4-fold CV: trained on 3 universities, tested on the 4th
35. Power of Context
[Figure: a linked page: Professor? Student? Post-doc?]
36. Collective Classification
37. Collective Classification
Classify all pages collectively, maximizing the joint label probability
[Figure: test-set error]
Taskar, Abbeel, K., 2002
38. More Complex Structure
39. More Complex Structure
40. Collective Classification Results
- 35.4% error reduction over logistic regression
[Figure: test-set error for Logistic, Links, Section, and Link+Section models]
Taskar, Abbeel, K., 2002
41. Max Conditional Likelihood
We don't care about the conditional distribution P(D.y | D.x)!
[Equations: estimation maximizes the conditional likelihood over w; classification computes argmax_y]
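The gap between fitting a distribution and just getting labels right shows up even in the single-node case. A sketch of conditional-likelihood training by gradient ascent on a toy logistic model (data and step size invented):

```python
# Conditional-likelihood training for a one-feature binary
# logistic model P(y=1 | x, w) = sigmoid(w * x): the simplest
# analogue of discriminative (conditional) RMN training.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad(w, data):
    # Gradient of sum_i log P(y_i | x_i, w).
    return sum(x * (y - sigmoid(w * x)) for x, y in data)

data = [(1.0, 1), (2.0, 1), (-1.0, 0)]
w = 0.0
for _ in range(100):
    w += 0.1 * grad(w, data)
print(w > 0)   # the weight moves toward separating the classes
```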
42. Max Margin Estimation
What we really want: correct class labels
[Quadratic program: maximize the margin, scaled by the number of labeling mistakes in y]
- Exponentially many constraints!
Taskar, Guestrin, K., 2003 (see also Collins, 2002; Hofmann, 2003)
43. Max Margin Markov Networks
- We use the structure of the Markov network to provide an equivalent formulation of the QP
  - Exponential only in the tree width of the network
  - Complexity = max-likelihood classification
- Can solve approximately in networks where induced width is too large
  - Analogous to loopy belief propagation
- Can use kernel-based features!
  - SVMs meet graphical models
Taskar, Guestrin, K., 2003
44. WebKB Revisited
16.1% relative reduction in error over cond. likelihood RMNs
45. Predicting Relationships
[Figure: Tom Mitchell (Professor) and Sean Slattery (Student), linked through the WebKB Project]
- Even more interesting: relationships between objects
46. Predicting Relations
- Introduce an exists/type attribute for each potential link
- Learn a discriminative model for this attribute
- Collectively predict its value in a new world
- 72.9% error reduction over flat
[Figure: model linking From-Page and To-Page (Category, Word1 ... WordN) to the Relation's Exists/Type attribute via LinkWord1 ... LinkWordN]
Taskar, Wong, Abbeel, K., 2003
47. Outline
- Relational Bayesian Networks
- Relational Markov Networks
- Collective Classification
- Relational clustering
- Movie data
- Biological data
with Ben Taskar, Eran Segal
with Eran Segal, Nir Friedman, Aviv Regev, Dana Pe'er, Haidong Wang, Micha Shapira, David Botstein
48. Relational Clustering
[Diagram: unlabeled relational data + probabilistic relational model → learning → model structure and clustering of instances]
Example:
- Given only students' grades, cluster similar students
49. Learning w. Missing Data: EM
- EM algorithm applies essentially unchanged
  - E-step computes expected sufficient statistics, aggregated over all objects in class
  - M-step uses ML (or MAP) parameter estimation
- Key difference:
  - In general, the hidden variables are not independent
  - Computation of expected sufficient statistics requires inference over the entire network
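A sketch of one E-step/M-step round for a two-cluster multinomial mixture, the simplest instance of "expected sufficient statistics, then ML estimation" (data and starting parameters invented; the relational case differs in that the E-step needs joint inference over the whole network):

```python
# One EM iteration for a two-cluster multinomial mixture.

def e_step(data, pi, theta):
    # Responsibilities: posterior cluster probabilities per item.
    resp = []
    for x in data:
        w = [pi[c] * theta[c][x] for c in (0, 1)]
        z = sum(w)
        resp.append([wi / z for wi in w])
    return resp

def m_step(data, resp):
    # ML estimates from the expected sufficient statistics.
    pi = [sum(r[c] for r in resp) / len(data) for c in (0, 1)]
    theta = []
    for c in (0, 1):
        tot = sum(r[c] for r in resp)
        theta.append({v: sum(r[c] for x, r in zip(data, resp) if x == v) / tot
                      for v in set(data)})
    return pi, theta

data = ["A", "A", "B"]
resp = e_step(data, [0.5, 0.5],
              [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}])
pi, theta = m_step(data, resp)
print([round(p, 3) for p in pi])
```

Here each item's responsibilities are computed independently; in an RBN/RMN the hidden variables are coupled, so this step becomes a global inference problem.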
50. Learning w. Missing Data: EM
Dempster et al. '77
[Figure: beliefs about course difficulty (easy / hard) and student intelligence (low / high) updated across EM iterations]
51. Movie Data
Internet Movie Database: http://www.imdb.com
52. Discovering Hidden Types
Learn model using EM
[Figure: hidden Type attribute added to each class]
Taskar, Segal, K., 2001
53. Discovering Hidden Types
Taskar, Segal, K., 2001
54. Biology 101: Gene Expression
[Figure: the transcription factor Swi5 binding DNA]
Cells express different subsets of their genes in different tissues and under different conditions.
55. Gene Expression Microarrays
- Measure mRNA level for all genes in one condition
- Hundreds of experiments
- Highly noisy
[Figure: genes × experiments matrix; each entry is the expression of gene i in experiment j (induced vs. repressed)]
56. Standard Analysis
- Cluster genes by similarity of expression profiles
- Manually examine clusters to understand what's common to genes in a cluster
57. General Approach
- Expression level is a function of gene properties and experiment properties
- Learn the model that best explains the data
  - Observed properties: gene sequence, array condition, ...
  - Hidden properties: gene cluster
- Assignment to hidden variables (e.g., module assignment)
- Expression level as function of properties
58. Clustering as a PRM
[Figure: PRM with classes Gene (Cluster), Experiment (ID), and Expression (Level)]
59. Modular Regulation
- Learn functional modules
- Clusters of genes that are similarly controlled
- Learn control program for modules
- Expression as function of control genes
60. Module Network PRM
[Figure: PRM with Gene (Cluster), Experiment (Control1, Control2, ..., Controlk: activity levels of control genes in the experiment), and Expression (Level)]
Segal, Regev, Pe'er, Koller, Friedman, 2003
61. Experimental Results
- Yeast Stress Data (Gasch et al.)
  - 2355 genes that showed activity
  - 173 experiments (microarrays)
  - Diverse environmental stress conditions (e.g. heat shock)
- Learned module network with 50 modules
  - Cluster assignments are hidden variables
  - Structure of dependency trees unknown
  - Learned model using structural EM algorithm
Segal et al., Nature Genetics, 2003
62. Biological Evaluation
- Find sets of co-regulated genes (regulatory modules): 46/50
- Find the regulators of each module: 30/50
Segal et al., Nature Genetics, 2003
63. Experimental Results
- Hypothesis: regulator X regulates process Y
- Experiment: knock out X and rerun the experiment
Segal et al., Nature Genetics, 2003
64. Differentially Expressed Genes
Segal et al., Nature Genetics, 2003
65. Biological Experiments Validation
- Were the differentially expressed genes predicted as targets?
- Rank modules by enrichment for diff. expressed genes
Segal et al., Nature Genetics, 2003
66. Biology 102: Pathways
- Pathways are sets of genes that act together to achieve a common function
67. Finding Pathways: Attempt I
- Use protein-protein interaction data
68. Finding Pathways: Attempt I
- Use protein-protein interaction data
69. Finding Pathways: Attempt I
- Use protein-protein interaction data
- Problems:
  - Data is very noisy
  - Structure is lost
  - Large connected component in interaction graph (3527/3589 genes)
70. Finding Pathways: Attempt II
- Use expression microarray clusters
- Problems:
  - Expression is only a weak indicator of interaction
  - Interacting pathways are not separable
[Figure: Pathway I, Pathway II]
71. Finding Pathways: Our Approach
- Use both types of data to find pathways
  - Find active interactions using gene expression
  - Find pathway-related co-expression using interactions
[Figure: Pathways I-IV]
Segal, Wang, K., 2003
72. Probabilistic Model
[Figure: each Gene has a Pathway assignment and expression levels Exp1 ... ExpN in N arrays; an Interacts relation (protein product interaction) couples the Pathway assignments of interacting genes via a compatibility potential]
Cluster all genes collectively, maximizing the joint model likelihood
Segal, Wang, K., 2003
73. Capturing Protein Complexes
- Independent data set of interacting proteins
[Chart: number of complexes vs. complex coverage (%), comparing our method to standard expression clustering]
- 124 complexes covered at 50% for our method
- 46 complexes covered at 50% for clustering
Segal, Wang, K., 2003
74. RNAse Complex Pathway
[Figure: pathway containing YHR081W, RRP40, RRP42, MTR3, RRP45, RRP4, RRP43, DIS3, TRM7, SKI6, RRP46, CSL4]
- Includes all 10 known pathway genes
- Only 5 genes found by clustering
Segal, Wang, K., 2003
75. Interaction Clustering
- RNAse complex found by interaction clustering as part of a cluster with 138 genes
Segal, Wang, K., 2003
76. Truth in Advertising
- Huge graphical models
  - 3,000-50,000 hidden variables
  - Hundreds of thousands of observed nodes
  - Very densely connected
- Learning
  - Multiple iterations of model updates
  - Each requires running inference on the model
- Inference
  - Exact inference is intractable
  - Use belief propagation
  - Single inference iteration: 1-6 hours
- Algorithmic ideas key to scaling
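A minimal sum-product belief-propagation sketch on a 3-node chain, where BP is exact and can be checked against brute-force enumeration (the potentials are invented; real models here are loopy, so BP is run iteratively and approximately):

```python
# Sum-product belief propagation on a binary chain X1 - X2 - X3,
# compared against exact marginalization of the joint.
import itertools

# One pairwise potential shared by both edges.
psi = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def exact_marginal_x2():
    # Brute force: sum the unnormalized joint over x1, x3.
    scores = [0.0, 0.0]
    for x1, x2, x3 in itertools.product((0, 1), repeat=3):
        scores[x2] += psi[(x1, x2)] * psi[(x2, x3)]
    z = sum(scores)
    return [s / z for s in scores]

def bp_marginal_x2():
    # Messages from the leaves X1 and X3 into X2.
    m1 = [sum(psi[(x1, x2)] for x1 in (0, 1)) for x2 in (0, 1)]
    m3 = [sum(psi[(x2, x3)] for x3 in (0, 1)) for x2 in (0, 1)]
    belief = [m1[x2] * m3[x2] for x2 in (0, 1)]
    z = sum(belief)
    return [b / z for b in belief]

print(exact_marginal_x2())   # matches the BP beliefs on a tree
print(bp_marginal_x2())
```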
77. Relational Data: A New Challenge and Opportunity
- Data consists of different types of instances
- Instances are related in complex networks
- Instances are not independent
- New tasks for machine learning
  - Collective classification
  - Relational clustering
  - Link prediction
  - Group detection
78. Thank You!
http://robotics.stanford.edu/koller/