Title: Probabilistic Models of Relational Data
Slide 1: Probabilistic Models of Relational Data
- Daphne Koller
- Stanford University
- Joint work with:
- Ben Taskar
- Lise Getoor
- Eran Segal
- Pieter Abbeel
- Ming-Fai Wong
- Avi Pfeffer
- Nir Friedman
Slide 2: Why Relational?
- The real world is composed of objects that have properties and are related to each other
- Natural language is all about objects and how they relate to each other
- "George got an A in Geography 101"
Slide 3: Attribute-Based Worlds
"Smart students get As in easy classes"
- World = assignment of values to attributes / truth values to propositional symbols
Slide 4: Object-Relational Worlds
∀x,y (Smart(x) ∧ Easy(y) ∧ Take(x,y) → Grade(A,x,y))
- World = relational interpretation:
- Objects in the domain
- Properties of these objects
- Relations (links) between objects
Slide 5: Why Probabilities?
- All universals are false
- "Smart students get As in easy classes"
- True universals are rarely useful
- "Smart students get either A, B, C, D, or F" (almost)
"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful … Therefore the true logic for this world is the calculus of probabilities."
— James Clerk Maxwell
Slide 6: Probable Worlds
- Probabilistic semantics
- A set of possible worlds
- Each world associated with a probability
[Figure: twelve possible worlds — every combination of course difficulty (easy/hard), student ability (smart/weak), and grade (A/B/C) — each assigned a probability.]
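The possible-worlds semantics above can be made concrete in a few lines of Python. This is a toy sketch: the twelve worlds match the slide, but the world weights and the query are invented for illustration.

```python
from itertools import product

# Twelve possible worlds: difficulty x ability x grade (as on the slide).
worlds = {}
for w in product(["easy", "hard"], ["smart", "weak"], "ABC"):
    worlds[w] = 1.0  # start with uniform weights (invented)

# Reweight one world: smart students in easy classes tend to get As.
worlds[("easy", "smart", "A")] = 6.0

# Normalize so the probabilities of all worlds sum to 1.
total = sum(worlds.values())
worlds = {w: p / total for w, p in worlds.items()}

# Any query is a sum over worlds, e.g. P(grade = A | ability = smart):
num = sum(p for (d, a, g), p in worlds.items() if a == "smart" and g == "A")
den = sum(p for (d, a, g), p in worlds.items() if a == "smart")
print(num / den)  # 7/11, higher than the 1/3 a uniform model would give
```

Every query reduces to summing world probabilities; the representations that follow exist to avoid enumerating worlds explicitly.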
Slide 7: Representation Design Axes
[Figure: design axes for representations. One axis runs over world state (attributes → objects), the other over sequences. Logical representations: propositional logic & CSPs, automata & grammars, first-order logic & relational databases. Probabilistic counterparts: Bayesian nets & Markov nets, n-gram models & HMMs, probabilistic CFGs.]
Slide 8: Outline
- Bayesian Networks
- Representation & Semantics
- Reasoning
- Probabilistic Relational Models
- Collective Classification
- Undirected discriminative models
- Collective Classification Revisited
- PRMs for NLP
Slide 9: Bayesian Networks
[Figure: BN with nodes Difficulty, Intelligence, Grade, SAT, Letter]
nodes = variables; edges = direct influence
Graph structure encodes independence assumptions: Letter is conditionally independent of Intelligence given Grade
Slide 10: BN Semantics
conditional independencies in BN structure + local probability models = full joint distribution over domain
- Compact & natural representation:
- nodes have ≤ k parents ⇒ O(2^k · n) vs. O(2^n) params
- parameters natural and easy to elicit
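A minimal sketch of this factorization on the student network, with invented CPD entries (SAT omitted for brevity). The point is that a few small tables determine the full joint, which still sums to one without ever being built explicitly.

```python
# Hypothetical CPDs for the student network (all numbers invented).
P_D = {"easy": 0.6, "hard": 0.4}            # P(Difficulty)
P_I = {"low": 0.7, "high": 0.3}             # P(Intelligence)
P_G = {                                     # P(Grade | Difficulty, Intelligence)
    ("easy", "low"):  {"A": 0.3,  "B": 0.4,  "C": 0.3},
    ("easy", "high"): {"A": 0.9,  "B": 0.08, "C": 0.02},
    ("hard", "low"):  {"A": 0.05, "B": 0.25, "C": 0.7},
    ("hard", "high"): {"A": 0.5,  "B": 0.3,  "C": 0.2},
}
P_L = {                                     # P(Letter | Grade): depends only on Grade
    "A": {"yes": 0.9, "no": 0.1},
    "B": {"yes": 0.6, "no": 0.4},
    "C": {"yes": 0.1, "no": 0.9},
}

def joint(d, i, g, s):
    """Chain rule: P(D, I, G, L) = P(D) P(I) P(G | D, I) P(L | G)."""
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_L[g][s]

# The full joint sums to 1 even though we never stored a 2x2x3x2 table.
total = sum(joint(d, i, g, s)
            for d in P_D for i in P_I for g in "ABC" for s in ("yes", "no"))
print(total)
```

The local tables hold 2 + 2 + 12 + 6 numbers here, versus 24 entries for the explicit joint; the gap widens exponentially as variables are added.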
Slide 11: Reasoning Using BNs
"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon Laplace
[Figure: the student BN (Difficulty, Intelligence, Grade, SAT, Letter), with queries and evidence on Letter and SAT]
Full joint distribution specifies the answer to any query: P(variable | evidence about others)
Slide 12: BN Inference
- BN inference is NP-hard
- But: can use graph structure
- Graph separation ⇒ conditional independence
- Do separate inference in parts
- Results combined over interface
- Complexity exponential in largest separator
- Structured BNs allow effective inference
- Exact inference in dense BNs is intractable
Slide 13: Approximate BN Inference
- Belief propagation is an iterative message-passing algorithm for approximate inference in BNs
- Each iteration (until convergence):
- Nodes pass beliefs as messages to neighboring nodes
- Cons:
- Limited theoretical guarantees
- Might not converge
- Pros:
- Linear time per iteration
- Works very well in practice, even for dense networks
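The message-passing loop described above can be sketched generically. This is a toy sum-product sketch on an invented three-node pairwise model (not the networks from the talk); the potentials, graph, and iteration count are all made up for illustration.

```python
from math import prod

# Toy pairwise model: three binary nodes in a cycle (invented).
nodes = [0, 1, 2]
edge_list = [(0, 1), (1, 2), (2, 0)]
unary = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.5, 0.5]}  # node "evidence"
pair = [[2.0, 1.0], [1.0, 2.0]]  # shared potential favoring agreement

# Directed messages m[i -> j], initialized uniform.
msgs = {(i, j): [1.0, 1.0] for i, j in edge_list}
msgs.update({(j, i): [1.0, 1.0] for i, j in edge_list})

def neighbors(i):
    return [j for j in nodes if (j, i) in msgs]

for _ in range(50):  # iterate until (hopefully) convergence
    new = {}
    for (i, j) in msgs:
        # m[i -> j](xj) = sum_xi unary(xi) * pair(xi, xj) * incoming msgs
        m = [sum(unary[i][xi] * pair[xi][xj]
                 * prod(msgs[(k, i)][xi] for k in neighbors(i) if k != j)
                 for xi in (0, 1))
             for xj in (0, 1)]
        z = sum(m)
        new[(i, j)] = [v / z for v in m]
    msgs = new  # synchronous update: all messages use last iteration's values

def belief(i):
    """Node belief: unary potential times incoming messages, normalized."""
    b = [unary[i][x] * prod(msgs[(k, i)][x] for k in neighbors(i))
         for x in (0, 1)]
    z = sum(b)
    return [v / z for v in b]

print(belief(1))  # node 0's evidence pulls its neighbors toward state 0
```

Each iteration touches every directed edge once, which is the linear-time-per-iteration property noted on the slide.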
Slide 14: Outline
- Bayesian Networks
- Probabilistic Relational Models
- Language & Semantics
- Web of Influence
- Collective Classification
- Undirected discriminative models
- Collective Classification Revisited
- PRMs for NLP
Slide 15: Bayesian Networks: The Problem
- Bayesian nets use propositional representation
- Real world has objects, related to each other
[Figure: several (Intelligence, Difficulty, Grade) instances, with grades A and C]
These instances are not independent
Slide 16: Probabilistic Relational Models
- Combine advantages of relational logic & BNs:
- Natural domain modeling: objects, properties, relations
- Generalization over a variety of situations
- Compact, natural probability models
- Integrate uncertainty with relational model:
- Properties of domain entities can depend on properties of related entities
- Uncertainty over relational structure of domain
Slide 17: St. Nordaf University
[Figure: Prof. Smith and Prof. Jones each Teach a course; George and Jane are Registered In-course, with a Grade and a Satisfaction attribute on each registration.]
Slide 18: Relational Schema
- Specifies types of objects in domain, attributes of each type of object, & types of relations between objects
[Figure: schema with classes Professor (Teaching-Ability), Student (Intelligence), and Course (Difficulty), and relations Teach, Take, In.]
Slide 19: Probabilistic Relational Models
- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies
- Links define potential interactions
[K. & Pfeffer; Poole; Ngo & Haddawy]
Slide 20: PRM Semantics
- Instantiated PRM ⇒ BN
- variables: attributes of all objects
- dependencies: determined by links & PRM
[Figure: ground BN over the attributes of Prof. Smith, Prof. Jones, George, and Jane.]
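The instantiation step can be sketched directly. Here the relational skeleton (the object names and registrations) is invented, and the class-level dependency "a registration's Grade depends on the course's Difficulty and the student's Intelligence" is unrolled into ground variables with shared structure.

```python
# Invented relational skeleton for illustration.
students = ["George", "Jane"]
courses = ["Geo101", "CS101"]
registrations = [("George", "Geo101"), ("George", "CS101"), ("Jane", "CS101")]

# One ground variable per attribute of each object.
ground_vars = (
    [f"Difficulty({c})" for c in courses]
    + [f"Intelligence({s})" for s in students]
    + [f"Grade({s},{c})" for s, c in registrations]
)

# Parents come from the relational structure: each grade variable's
# parents are the attributes of the objects its registration links.
parents = {f"Grade({s},{c})": [f"Difficulty({c})", f"Intelligence({s})"]
           for s, c in registrations}

print(len(ground_vars))  # 7 ground variables in this skeleton
```

All grade variables share one class-level CPD, so the ground BN grows with the skeleton while the model itself stays fixed.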
Slide 21: The Web of Influence
[Figure: evidence about one course (easy/hard) and one student (low/high) propagates through the network of registrations.]
Slide 22: Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Learning models from data
- Collective classification of webpages
- Undirected discriminative models
- Collective Classification Revisited
- PRMs for NLP
Slide 23: Learning PRMs
[Figure: a learner combines a relational database with expert knowledge to produce a PRM.]
[Friedman, Getoor, K., Pfeffer]
Slide 24: Learning PRMs
- Parameter estimation:
- Probabilistic model with shared parameters
- Grades for all students share same model
- Can use standard techniques for max-likelihood or Bayesian parameter estimation
- Structure learning:
- Define scoring function over structures
- Use combinatorial search to find high-scoring structure
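Parameter sharing makes max-likelihood estimation a matter of pooled counting, as in this toy sketch (data invented): every registration contributes to the same CPD, no matter which student or course it involves.

```python
from collections import Counter

# One row per registration: (difficulty, intelligence, grade). Invented.
data = [
    ("easy", "high", "A"), ("easy", "high", "A"), ("easy", "low", "B"),
    ("hard", "low", "C"), ("hard", "high", "A"), ("hard", "low", "C"),
]

# Pool counts across all objects: joint (parents, grade) and parent contexts.
joint = Counter(((d, i), g) for d, i, g in data)
ctx = Counter((d, i) for d, i, _ in data)

# Max-likelihood estimate of the shared CPD P(Grade | Difficulty, Intelligence).
cpd = {key: count / ctx[key[0]] for key, count in joint.items()}
print(cpd[(("easy", "high"), "A")])
```

A Bayesian variant would simply add pseudo-counts before dividing; structure learning then scores alternative parent sets using the same pooled statistics.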
Slide 25: Web → KB
[Craven et al.]
Slide 26: Web Classification Experiments
- WebKB dataset
- Four CS department websites
- Bag of words on each page
- Links between pages
- Anchor text for links
- Experimental setup
- Trained on three universities
- Tested on fourth
- Repeated for all four combinations
Slide 27: Standard Classification
Categories: faculty, course, project, student, other
[Example page text: "Professor … department … extract information … computer science … machine learning"]
[Bar chart: test error (0 to 0.35) for the words-only classifier]
Slide 28: Exploiting Links
[Figure: a page linked to with the anchor text "working with Tom Mitchell"]
[Bar chart: test error (0 to 0.35) — words only vs. link words]
Slide 29: Collective Classification
Classify all pages collectively, maximizing the joint label probability
Approx. inference: belief propagation
[Bar chart: test error (0 to 0.35) — words only vs. link words vs. collective]
[Getoor, Segal, Taskar, Koller]
Slide 30: Learning with Missing Data: EM
[Dempster et al. '77]
[Figure: EM iterates over hidden course difficulty (easy/hard) and student intelligence (low/high).]
Slide 31: Discovering Hidden Types
Internet Movie Database: http://www.imdb.com
Slide 32: Discovering Hidden Types
[Figure: actors, movies, and directors each assigned a hidden Type variable.]
[Taskar, Segal, Koller]
Slide 33: Discovering Hidden Types
Slide 34: Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Undirected Discriminative Models
- Markov Networks
- Relational Markov Networks
- Collective Classification Revisited
- PRMs for NLP
Slide 35: Directed Models: Limitations / Solution: Undirected Models
- Acyclicity constraint limits expressive power (two objects linked to by a student are probably not both professors) → allow arbitrary patterns over sets of objects & links
- Acyclicity forces modeling of all potential links: network size O(N²), quadratic inference → let influence flow over existing links, exploiting link-graph sparsity: network size O(N)
- Generative training fits all of the data, not maximal accuracy → allow discriminative training: maximize P(labels | observations)
[Lafferty, McCallum, Pereira]
Slide 36: Markov Networks
[Figure: Markov network over Alice, Betty, Chris, Dave, and Eve]
Graph structure encodes independence assumptions: Chris is conditionally independent of Eve given Alice & Dave
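A Markov network defines the joint as a normalized product of potentials, and graph separation then gives exactly the kind of independence stated above. A brute-force sketch on an assumed five-node graph (edge set and potentials invented) in which every Chris–Eve path passes through Alice or Dave:

```python
from itertools import product
from math import prod

nodes = ["Alice", "Betty", "Chris", "Dave", "Eve"]
edges = [("Alice", "Betty"), ("Betty", "Chris"), ("Alice", "Chris"),
         ("Chris", "Dave"), ("Alice", "Eve"), ("Dave", "Eve")]
phi = [[3.0, 1.0], [1.0, 3.0]]  # shared agreement potential (invented)

assigns = [dict(zip(nodes, vals)) for vals in product((0, 1), repeat=5)]

def unnorm(a):
    """Unnormalized probability: product of the pairwise potentials."""
    return prod(phi[a[u]][a[v]] for u, v in edges)

Z = sum(unnorm(a) for a in assigns)  # partition function

def cond_chris(given):
    """P(Chris = 1 | given), by brute-force enumeration."""
    num = den = 0.0
    for a in assigns:
        if all(a[k] == v for k, v in given.items()):
            den += unnorm(a) / Z
            if a["Chris"] == 1:
                num += unnorm(a) / Z
    return num / den

# Conditioning on Alice and Dave separates Chris from Eve in this graph,
# so additionally observing Eve changes nothing:
p1 = cond_chris({"Alice": 1, "Dave": 0})
p2 = cond_chris({"Alice": 1, "Dave": 0, "Eve": 1})
print(p1, p2)
```

The enumeration over 2^5 assignments is only feasible because the example is tiny; it is the semantics, not an inference algorithm.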
Slide 37: Relational Markov Networks
- Universals: probabilistic patterns hold for all groups of objects
- Locality: represent local probabilistic dependencies
- Sets of links give us possible interactions
[Taskar, Abbeel, Koller '02]
Slide 38: RMN Semantics
- Instantiated RMN ⇒ MN
- variables: attributes of all objects
- dependencies: determined by links & RMN
[Figure: ground MN over George, Jane, and Jill, linked through the Geo study group, the CS study group, and the "Welcome to CS101" page.]
Slide 39: Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Undirected Discriminative Models
- Collective Classification Revisited
- Discriminative training of RMNs
- Webpage classification
- Link prediction
- PRMs for NLP
Slide 40: Learning RMNs
- Parameter estimation is not closed form
- Convex problem ⇒ unique global maximum
- Maximize L = log P(Grades, Intelligence | Difficulty)
[Figure: ground network over grades (A/B/C), course difficulties (easy/hard), and student intelligences (low/high).]
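The discriminative objective above can be sketched with plain gradient ascent on a tiny log-linear (logistic) model; the data and learning rate are invented. The gradient takes the classic "empirical minus expected feature counts" form, and concavity of the log conditional likelihood gives the unique global maximum the slide mentions.

```python
from math import exp, log

# Toy labeled data: (features, label). All numbers invented.
data = [((1.0, 0.0), 1), ((0.0, 1.0), 0), ((1.0, 1.0), 1), ((0.5, 0.0), 0)]
w = [0.0, 0.0]

def p_y1(x):
    """P(y = 1 | x) under the log-linear model."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + exp(-s))

for _ in range(1000):
    grad = [0.0, 0.0]
    for x, y in data:
        err = y - p_y1(x)  # empirical minus expected feature count
        for j in range(2):
            grad[j] += err * x[j]
    w = [wi + 0.1 * g for wi, g in zip(w, grad)]  # gradient ascent step

# Conditional log-likelihood L = sum_m log P(y_m | x_m)
loglik = sum(log(p_y1(x)) if y == 1 else log(1.0 - p_y1(x)) for x, y in data)
print(loglik)
```

In an RMN the expected counts inside the gradient are themselves computed by inference (e.g. belief propagation) over the ground network, which is what makes training costlier than the counting used for directed PRMs.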
Slide 41: Flat Models
Logistic regression: P(Category | Words)
Slide 42: Exploiting Links
42.1% relative reduction in error relative to the generative approach
Slide 43: More Complex Structure
Slide 44: Collective Classification Results
35.4% relative reduction in error relative to the strong flat approach
Slide 45: Scalability
[Chart: training and classification time, directed vs. undirected models]
- WebKB data set size: 1300 entities, 180K attributes, 5800 links
- Network size per school:
- Directed model: 200,000 variables, 360,000 edges
- Undirected model: 40,000 variables, 44,000 edges
- Difference in training time decreases substantially when:
- some training data is unobserved
- we want to model with hidden variables
Slide 46: Predicting Relationships
[Figure: "Tom Mitchell, Professor" — WebKB Project — "Sean Slattery, Student", connected by hyperlinks]
- Even more interesting are the relationships between objects
- e.g., verbs are almost always relationships
Slide 47: Flat Model
[Figure: flat model — a From-Page and a To-Page, each with words Word1 … WordN; the link carries words LinkWord1 … LinkWordN and a relation variable Rel with type NONE, advisor, instructor, TA, member, or project-of.]
Slide 48: Flat Model
Slide 49: Collective Classification & Links
[Figure: link model — the From-Page and To-Page now also carry Category variables, connected to the page words, the link words, and the Rel variable.]
Slide 50: Link Model
Slide 51: Triad Model
[Figure: triad — a Professor is Advisor of a Student; both are Members of the same Group.]
Slide 52: Triad Model
[Figure: triad — a Professor is Advisor of a Student; the professor is Instructor and the student is TA of the same Course.]
Slide 53: Triad Model
Slide 54: WebKB
- Four new department web sites:
- Berkeley, CMU, MIT, Stanford
- Labeled page type (8 types):
- faculty, student, research scientist, staff, research group, research project, course, organization
- Labeled hyperlinks and virtual links (6 types):
- advisor, instructor, TA, member, project-of, NONE
- Data set size:
- 11K pages
- 110K links
- 2 million words
Slide 55: Link Prediction Results
72.9% relative reduction in error relative to the strong flat approach
- Error measured over links predicted to be present
- Link presence cutoff is at the precision/recall break-even point (~30% for all models)
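The break-even criterion above can be computed by sweeping a threshold down the ranked link scores; the scores and labels here are invented stand-ins for model output.

```python
# Invented link scores and gold labels (1 = link truly present).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0]

ranked = sorted(zip(scores, labels), reverse=True)
total_pos = sum(labels)

best = None
tp = 0
for k, (score, y) in enumerate(ranked, start=1):
    tp += y  # true positives among the top-k predicted links
    precision, recall = tp / k, tp / total_pos
    # Keep the threshold where precision and recall are closest.
    if best is None or abs(precision - recall) < abs(best[0] - best[1]):
        best = (precision, recall, score)

precision, recall, threshold = best
print(f"break-even precision = recall = {precision} at threshold {threshold}")
```

At the break-even point the number of predicted links roughly equals the number of true links, which makes error rates comparable across models with different score scales.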
Slide 56: Summary
- PRMs inherit key advantages of probabilistic graphical models:
- Coherent probabilistic semantics
- Exploit structure of local interactions
- Relational models are inherently more expressive
- Web of influence: use all available information to reach powerful conclusions
- Exploit both relational information and the power of probabilistic reasoning
Slide 57: Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Undirected Discriminative Models
- Collective Classification Revisited
- PRMs for NLP
- Word-Sense Disambiguation
- Relation Extraction
- Natural Language Understanding (?)
or: Why Should I Care? An outsider's perspective
Slide 58: Word Sense Disambiguation
Her advisor gave her feedback about the draft.
- Neighboring words alone may not provide enough information to disambiguate
- We can gain insight by considering compatibility between senses of related words
Slide 59: Collective Disambiguation
Her advisor gave her feedback about the draft.
Can we infer grammatical structure and disambiguate word senses simultaneously rather than sequentially?
- Objects: words in text
- Attributes: sense, gender, number, pos, …
- Links:
- Grammatical relations (subject-object, modifier, …)
- Close semantic relations (is-a, cause-of, …)
- Same word in different sentences (one sense per discourse)
- Compatibility parameters:
- Learned from tagged data
- Based on prior knowledge (e.g., WordNet, FrameNet)
Can we integrate inter-word relationships directly into our probabilistic model?
Slide 60: Relation Extraction
"ACME's board of directors began a search for a new CEO after the departure of current CEO, James Jackson, following allegations of creative accounting practices at ACME." (6/01)
"In an attempt to improve the company's image, ACME is considering former judge Mary Miller for the job." (7/01)
"As her first act in her new position, Miller announced that ACME will be doing a stock buyback." (9/01)
[Figure: extracted relation graph with edges labeled Candidate, Departs, Hired?, Of, Made, Concerns]
Slide 61: Understanding Language
Professor Sarah met Jane. She explained the hole
in her proof.
Most likely interpretation:
[Figure: interpretation identifying Professor Sarah and student Jane]
Slide 62: Resolving Ambiguity
Professor Sarah met Jane. She explained the hole
in her proof.
- Professors often meet with students → Jane is probably a student
- Professors like to explain → "She" is probably Prof. Sarah
Attribute values, link types, object identity: probabilistic reasoning about objects, their attributes, and the relationships between them
[Goldman & Charniak; Pasula & Russell]
Slide 63: Acquiring Semantic Models
- Statistical NLP reveals patterns
- Standard models learn patterns at the word level
- But word patterns are only implicit surrogates for underlying semantic patterns
- "Teacher" objects tend to participate in certain relationships
- Can use this pattern for objects not explicitly labeled as a teacher
[Figure: co-occurrence statistics for "teacher" with verbs such as train, be, hire, pay, fire, and serenade, each with an associated weight (e.g., 24, 3, 1.5, 1.4, 0.3).]
Slide 64: Competing Approaches → Complementary Approaches
[Table: desiderata — semantic understanding, scaling up (via learning), robustness to noise & ambiguity — compared across logical approaches, statistical approaches, and PRMs.]
Slide 65: Statistics: from Words to Semantics
- Represent statistical patterns at the semantic level
- What types of objects participate in what types of relationships
- Learn statistical models of semantics from text
- Reason using the models to obtain a global semantic understanding of the text
[Image: Georgia O'Keeffe, "Ladder to the Moon"]