1
Probabilistic Logic Learning
Advanced Link Mining
15.01.04
  • Ramya Ramakrishnan

2
Overview
  • Statistical Predicate Invention
  • Probabilistic Classification and Clustering
  • Identity Uncertainty
  • - Experimental Results
  • Reference

3
Link Mining
  • Hypertext and link mining combine techniques
    from ILP with statistical learning algorithms to
    construct features from related documents
  • A learned link-based model defines a distribution
    over link and content attributes, which may be
    correlated based on the links between them; this
    calls for a more complex classification algorithm
    for relational learning
  • Record linkage is a link mining task used to
    resolve identity uncertainty; it is important to
    take into account not only the similarity between
    objects based on their attributes but also the
    similarity based on their links

4
Statistical Predicate Invention
  • The hypertext classifier combines
  • - a statistical text learning method
  • - a relational rule learner
  • Well suited to hypertext domains, which involve
  • - word frequencies
  • - hyperlinks
  • Applied to tasks that involve learning
  • - classes of pages
  • - particular relations
  • - locating a particular class of information

5
Motivation
  • Documents are related to one another by hyperlinks
  • - sources of evidence are found in neighbouring
    pages and hyperlinks
  • A suitable feature set is needed
  • - text classifiers have feature spaces of
    hundreds or thousands of words

6
Approaches to Hypertext Learning
  • Naive Bayes Algorithm - for text classification
  • FOIL - for relational learning
  • Evaluate this combined algorithm on tasks
  • - learning definition of page classes
  • - learning definitions of relations between
    pages
  • - learning to locate a particular type of
    information within pages

7
Naive Bayes
  • Represent documents using a bag-of-words
    representation (a minimal sketch follows below)
  • The feature vector consists of one feature for each
    word
  • - can be either boolean or continuous (some
    measure of the frequency of the word)
  • The position of the words does not matter
  • The occurrence of a given word in a document is
    assumed to be independent of all other words in the
    document
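As a minimal illustration (not from the slides), a bag-of-words Naive Bayes classifier could be sketched as follows; the toy documents, Laplace smoothing and multinomial word counting are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Train a multinomial Naive Bayes model on bag-of-words documents.
    docs: list of token lists; labels: parallel list of class labels."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)            # per-class word frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab):
    """Pick the class with the highest posterior; word order is ignored
    and words are treated as independent given the class."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in tokens:
            if w in vocab:                        # unseen words are skipped
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Toy usage
docs = [["course", "homework"], ["research", "paper"], ["homework", "exam"]]
labels = ["course", "research", "course"]
model = train_naive_bayes(docs, labels)
print(classify(["homework", "grading"], *model))  # -> 'course'
```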

8
Naive Bayes
  • Use some type of feature selection method (a
    sketch follows below)
  • Dropping un-informative words that occur on a
    stop list
  • Dropping words that occur fewer than a specified
    number of times in the training set
  • Ranking words by a measure such as their mutual
    information with the class variable, and then
    dropping low-ranked words
  • Stemming - the process of heuristically reducing
    words to their root form
  • - e.g., the words compute, computers and
    computing would be stemmed to comput
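A hedged sketch of these feature selection steps (stop list, frequency threshold, mutual-information ranking); the thresholds and function names are assumptions, and stemming is left out.

```python
import math
from collections import Counter

def mutual_information(word, docs, labels):
    """MI between the boolean feature 'word occurs in the document'
    and the class variable."""
    n = len(docs)
    mi = 0.0
    for present in (True, False):
        p_w = sum(1 for d in docs if (word in d) == present) / n
        for c in set(labels):
            p_c = labels.count(c) / n
            joint = sum(1 for d, l in zip(docs, labels)
                        if (word in d) == present and l == c) / n
            if joint > 0:
                mi += joint * math.log(joint / (p_w * p_c))
    return mi

def select_features(docs, labels, stop_list, min_count=2, top_k=1000):
    """Drop stop words and rare words, then keep the top-ranked words by MI."""
    doc_freq = Counter(w for d in docs for w in set(d))
    candidates = [w for w, c in doc_freq.items()
                  if w not in stop_list and c >= min_count]
    ranked = sorted(candidates,
                    key=lambda w: mutual_information(w, docs, labels),
                    reverse=True)
    return ranked[:top_k]
```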

9
Relational Text Learning
  • Enable learned classifiers
  • - to represent the relationships among documents
  • - to represent information about the
    occurrence of words in documents
  • The problem representation for our relational
    learning tasks consists of (illustrated below)
  • - link_to(Hyperlink,Page,Page)
  • - has_word(Page)
  • - has_anchor_word(Hyperlink)
  • - has_neighbourhood_word(Hyperlink)
  • - all_words_capitalized(Hyperlink)
  • - has_alphanumeric_word(Hyperlink)
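A small, hypothetical illustration of how such ground facts could be stored for a relational learner; the constants (page1, link3, ...) are invented, and the word is written as an explicit argument purely for readability.

```python
# Hypothetical ground facts: (predicate_name, argument_tuple)
facts = [
    ("link_to", ("link3", "page1", "page2")),          # link3 points from page1 to page2
    ("has_word", ("page1", "faculty")),                # word argument added for readability
    ("has_anchor_word", ("link3", "homepage")),
    ("has_neighbourhood_word", ("link3", "professor")),
    ("all_words_capitalized", ("link3",)),
]

def query(facts, predicate):
    """Return every argument tuple for which the predicate holds."""
    return [args for name, args in facts if name == predicate]

print(query(facts, "has_word"))   # [('page1', 'faculty')]
```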

10
Relational Rule Learning
  • Predicates describe the graph structure of the Web
  • Predicates capture word occurrences in pages and
    hyperlinks
  • The has_word, has_anchor_word and
    has_neighbourhood_word predicates provide the
    bag-of-words representation

11
Combination
  • Able to learn predicates that characterize pages
    or hyperlinks by their word statistics
  • Able to represent the graph structure of the Web,
    and thereby the word statistics of neighbouring
    pages and hyperlinks
  • FOIL also employs the bag-of-words representation
  • Two properties -
  • 1. It does not depend on the presence or
    absence of specific key words; the statistical
    classifiers in its learned rules consider the
    weighted evidence of many words
  • 2. It can perform feature selection in a more
    directed manner

12
  • Uses the relational learner to represent the
    graph structure and the statistical learner to
    characterize the edges and nodes of the graph
  • FOIL-PILFS (FOIL with Predicate Invention for
    Large Feature Spaces) is basically FOIL
    augmented with a predicate-invention method
  • The invented predicates are statistical classifiers
    applied to some textual description of pages,
    hyperlinks or their components
  • The assumption is that each constant in the problem
    domain has a type, and that each type may have
    one or more associated document collections
  • Each constant of the given type maps to a unique
    document in each associated collection
  • New predicates are invented at each step of the
    search for a clause

13
  • The next step is to assemble the training set for
    the Naive Bayes learner
  • The learner should focus mainly on the
    characteristics that are common to many of the
    documents in the training set, instead of on
    the characteristics of a few instances that
    occur many times in the training set
  • Our method should determine the vocabulary to be
    used by Naive Bayes
  • We do not necessarily want to allow Naive Bayes to
    use all of the words that occur in the training
    set as features

14
  • First, each word wi that occurs in the
    predicate's training set is ranked
  • Given this ranking, we take the vocabulary for
    the Naive Bayes classifier to be the n top-ranked
    words, where n is determined by (see the sketch
    below)
  • n = e × m
  • m - the number of instances in the predicate's
    training set
  • e - a parameter (set to 0.05 for the
    experiments)
  • This keeps the dimensionality (feature set size) of
    the predicate learning task small
  • - when a predicate is found that fits the
    training set well, it is likely to generalize to
    new instances of the target class
  • The class priors also have to be set in the Naive
    Bayes classifier
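Assuming n = e × m is rounded up, the vocabulary selection might look like the following sketch; vocabulary_size and build_vocabulary are made-up names, and ranked_words is assumed to come from a ranking such as the mutual-information one above.

```python
import math

def vocabulary_size(m, e=0.05):
    """n = e * m (rounded up): the vocabulary grows with the size of
    the predicate's training set."""
    return max(1, math.ceil(e * m))

def build_vocabulary(ranked_words, m, e=0.05):
    """Keep only the n top-ranked words as Naive Bayes features."""
    return ranked_words[:vocabulary_size(m, e)]

# e.g. a predicate training set with 200 instances keeps only 10 words
print(vocabulary_size(200))   # -> 10
```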

15
Summary
  • A hybrid relational/statistical approach to
    learning in hypertext domains
  • This approach is applicable to learning tasks
    other than those that involve hypertext
  • Well suited for domains that involve both
    relational structure and potentially large
    feature spaces

16
Probabilistic Classification and Clustering
  • Relational data is best described by relational
    models that capture probabilistic dependencies
    between related instances
  • Standard methods assume that data instances are
    independent and identically distributed (IID)
  • Classification and clustering approaches have
    been designed for such IID data, where each data
    instance is a fixed-length vector of attribute
    values
  • Real-world data sets are richer in structure
  • The IID assumption is violated for two papers
    written by the same author or two papers linked by
    citation, which are likely to have the same topic

17
Classification
  • Iterative classification - iteratively assign
    labels to test instances the classifier is
    confident about, and use these labels to
    classify related instances (sketched below)
  • FOIL-HUBS - for classifying web pages
  • Neither produces a single coherent model of
    the correlations between different related
    instances
  • Hence the method is purely procedural: the
    results of different classification steps or
    algorithms are combined without a unifying
    principle
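A hedged sketch of the iterative classification idea: confidently labelled test instances feed relational evidence to the rest. The confidence threshold, its relaxation schedule and the classify callback are assumptions.

```python
def iterative_classification(instances, neighbours, classify,
                             rounds=5, threshold=0.9):
    """classify(instance, neighbour_labels) -> (label, confidence).
    Labels are committed only when the classifier is confident; committed
    labels become features for the instances linked to them."""
    labels = {}
    for _ in range(rounds):
        changed = False
        for i in instances:
            if i in labels:
                continue
            neighbour_labels = [labels[j] for j in neighbours[i] if j in labels]
            label, conf = classify(i, neighbour_labels)
            if conf >= threshold:
                labels[i] = label
                changed = True
        if not changed:
            threshold -= 0.1        # relax the cut-off if nothing new was labelled
    return labels

# Toy usage with a dummy classifier that grows more confident with labelled neighbours
print(iterative_classification(
    instances=["p1", "p2"], neighbours={"p1": ["p2"], "p2": ["p1"]},
    classify=lambda i, nb: ("A", 0.95 if nb else 0.85)))
```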

18
Clustering
  • The emphasis is on dyadic data, such as
    word-document co-occurrence, document citations,
    web links and gene expression data
  • Here clustering is viewed over one or two
    types of instances with a single relation between
    them
  • But we have to model the richer structures present
    in many real-world domains
  • Identify, for each instance type, sub-populations
    of instances that are similar in both their
    attributes and their relations to other instances

19
PRM
  • Extends Bayesian networks to a relational setting
  • A language that allows capturing probabilistic
    dependencies between related instances
  • Accommodates the entire spectrum between purely
    supervised classification and purely unsupervised
    clustering
  • Can learn from data where some instances have a
    class label and others do not
  • Transductive learning setting, where we use the
    test data, without the labels, in the training
    phase
  • An EM algorithm learns such PRMs with latent
    variables from a relational database (a simplified
    sketch follows below)
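As a simplified, non-relational stand-in for such learning, the sketch below runs EM for a single latent class variable over flat binary attribute vectors; rows with a known label are clamped, which hints at how partially labelled (or transductive) data can be used. All names and modelling choices here are assumptions, not the PRM algorithm itself.

```python
import numpy as np

def em_latent_class(X, k, labels=None, iters=50, seed=0):
    """EM for a latent class model over binary attributes X (n x d).
    labels: optional list with a class id for labelled rows and -1 otherwise."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    resp = rng.dirichlet(np.ones(k), size=n)              # soft class assignments
    for _ in range(iters):
        if labels is not None:                             # clamp labelled instances
            for i, y in enumerate(labels):
                if y >= 0:
                    resp[i] = np.eye(k)[y]
        # M-step: class priors and per-class Bernoulli parameters (with smoothing)
        prior = resp.sum(axis=0) / n
        theta = (resp.T @ X + 1) / (resp.sum(axis=0)[:, None] + 2)
        # E-step: recompute responsibilities from the current parameters
        log_p = (np.log(prior)[None, :]
                 + X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return prior, theta, resp

# Toy usage: two latent clusters over three binary attributes, two rows labelled
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
prior, theta, resp = em_latent_class(X, k=2, labels=[0, -1, 1, -1])
```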

20
Models for relational data
  • Data instances are assumed to be IID samples
  • Each instance belongs to exactly one of a fixed
    number of classes or clusters
  • A PRM is a template for a probability distribution
    over a relational database of a given schema
  • Probabilistic Model -
  • - the quantitative part of the PRM specifies the
    parameterization of the model
  • - given a set of parents for an attribute, we
    define a local probability model by associating
    with it a CPD (conditional probability
    distribution)

21
Aggregates
  • There are many possible choices of aggregation
    operator to allow dependencies on a set of
    variables
  • The mode aggregate computes the most common
    value of its parents, but it is not very sensitive
    to the distribution of values of its parents
  • The stochastic mode aggregate defines a set of
    distributions; the aggregate is a weighted
    average of these distributions, where the weight
    of a value vi is the frequency of this value
    (see the sketch below)
  • It can be viewed as a randomized selector node
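A hedged sketch contrasting the two aggregates; the per-value CPDs and the movie-genre toy example are assumptions made for illustration.

```python
from collections import Counter

def mode_aggregate(parent_values):
    """Most common parent value - insensitive to the rest of the distribution."""
    return Counter(parent_values).most_common(1)[0][0]

def stochastic_mode_cpd(parent_values, cpd_per_value):
    """Weighted average of the per-value CPDs, where the weight of a value vi
    is its frequency among the parents (the 'randomized selector' view)."""
    counts = Counter(parent_values)
    total = sum(counts.values())
    mixed = {}
    for v, freq in counts.items():
        for child_val, p in cpd_per_value[v].items():
            mixed[child_val] = mixed.get(child_val, 0.0) + (freq / total) * p
    return mixed

# Toy usage: parents split 2:1, so the mixture leans toward the majority's CPD
cpds = {"comedy": {"high": 0.7, "low": 0.3}, "drama": {"high": 0.4, "low": 0.6}}
print(stochastic_mode_cpd(["comedy", "comedy", "drama"], cpds))
# ≈ {'high': 0.6, 'low': 0.4}
```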

22
[Figure: PRM for the IMDB domain - Actor (Gender), Director, Movie (Genre, Year, MPAA Rating, Rating, Votes) and Role (Credit Order), with latent class variables C]
23
[Figure: PRM for the CORA domain - Author (latent class C) and Paper (Topic, Word 1, Word 2, ..., Word N), connected by Wrote and Cited relations]
24
Summary
  • There is growing interest in learning methods that
    exploit the relational structure of the domain
  • Here we construct clusters based on relational
    information
  • - since instances are not independent,
    information about some instances can be used to
    reach conclusions about others
  • The main problem is model selection when domain
    expertise is lacking; using techniques from
    Bayesian networks, the learning algorithm can
    select the model structure that best suits the
    data

25
Identity Uncertainty
  • Objects are not labeled with unique identifiers, or
    those identifiers are not perceived perfectly
  • Observations may or may not correspond to the same
    object
  • Citation Matching - which citations correspond to
    the same publication?
  • Relational Probability Model
  • Markov chain Monte Carlo method

26
Citation Matching
  • Infer the existence of a set of objects and their
    properties and relations, given a collection of
    raw perceptual data
  • When two observations describe the same object,
    they are combined to develop a complete description
    of the object
  • Objects seldom carry unique identifiers, so
    identity uncertainty is ubiquitous
  • Example - two citations of the same paper
  • Lashkari et al 94 Collaborative Interface
    Agents, Yezdi Lashkari, Max Metral, and Pattie
    Maes, Proceedings of the Twelfth national
    Conference On Artificial Inteligence, MIT Press,
    Cambridge, MA, 1994.
  • Metral M. Lashkari, Y. and P. Maes.
    Collaborative Interface Agents. In Conference of
    the American Asociation for Artificial
    Intelligence, Seattle, WA, August 1994.

27
Identity Uncertainity
  • Record linkage - matching up records in two
    files; might be required when merging two
    databases
  • The probability model developed by Cohen et al.
    models the database as a combination of some
    original records and some number of erroneous
    versions
  • Data association - assigning new observations to
    existing trajectories when multiple objects are
    tracked
  • CiteSeer is the best-known example
  • - the system groups citations by greedy
    agglomerative clustering based on text
    similarity (a sketch follows below)
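A hedged sketch of greedy agglomerative grouping by text similarity; the Jaccard measure and the 0.5 threshold are assumptions, not the actual CiteSeer heuristics.

```python
def jaccard(a, b):
    """Word-overlap similarity between two citation strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def greedy_group_citations(citations, threshold=0.5):
    """Greedily merge each citation into the first cluster whose representative
    is similar enough, otherwise start a new cluster."""
    clusters = []                       # each cluster is a list of citation strings
    for c in citations:
        for cluster in clusters:
            if jaccard(c, cluster[0]) >= threshold:
                cluster.append(c)
                break
        else:
            clusters.append([c])
    return clusters
```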

28
RPM
  • Consists of
  • - a set C of classes
  • - a set I of named instances
  • - a set A of complex attributes denoting
    functional relations
  • - a set B of simple attributes denoting functions
  • - a set of conditional probability models
    P(B | Pa(B)) for the simple attributes
  • Number uncertainty
  • Same-as statements
  • RPMs include the unique names assumption
  • The RPM can be expressed as a Bayesian network

29
RPMs for citations
Citation.obsAuthors[i].author same-as Citation.paper.authors[i]
30
Bayesian Network equivalent to the RPM
[Figure: the network for citation C1 - C1.text, C1.parse, C1.obsTitle, C1.(authors), P1.title, P1.pubtype, author nodes A11-A13 (fnames, surname) and observed author nodes D11-D13 (fnames, surname)]
31
Bayesian Network equivalent to the RPM
[Figure: the corresponding network for citation C2 - C2.text, C2.parse, C2.obsTitle, C2.(authors), P2.title, P2.pubtype, author nodes A21-A23 and observed author nodes D21-D23 (fnames, surname)]
32
Identity uncertainty
  • An assignment i is a mapping of terms in the
    language to objects in the world (the space of
    assignments is enumerated in the sketch below)
  • For example, if P1 and P2 are the only terms and
    they co-refer, then i is {{P1, P2}}
  • If P1 and P2 do not co-refer, then i is {{P1},
    {P2}}
  • The probability model for the space of extended
    possible worlds, P(i), can be rewritten as a
    product of P(i_C) over all the classes C ∈ C
  • Two cases are distinguished -
  • - for some classes, the unique names assumption
    remains appropriate, and P(i_Citation) is taken
    to be 1.0
  • - for classes such as Paper and Author,
    elements are subject to identity uncertainty
  • P(i_C | k, m) - the probability of an assignment
    that maps k named instances to m distinct objects
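To illustrate the space such an assignment ranges over, the sketch below enumerates every way a list of terms can be partitioned into co-referring groups (i.e. mapped onto distinct objects); the function name is invented.

```python
def assignments(terms):
    """Yield every partition of the terms into co-referring groups,
    i.e. every possible mapping of named instances to distinct objects."""
    if not terms:
        yield []
        return
    first, rest = terms[0], terms[1:]
    for partial in assignments(rest):
        # put `first` into an existing group (it co-refers with that object) ...
        for i in range(len(partial)):
            yield partial[:i] + [partial[i] + [first]] + partial[i + 1:]
        # ... or let it denote a new, distinct object
        yield [[first]] + partial

for a in assignments(["P1", "P2"]):
    print(a)        # [['P2', 'P1']] then [['P1'], ['P2']]
```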

33
Inference
  • The model grows from a single Bayesian network to
    a collection of networks, so inference is
    approximated by Markov chain Monte Carlo
  • Markov chain Monte Carlo - approximating an
    expectation over some distribution p (a generic
    sketch follows below)
  • Citation Matching Algorithm -
  • - uses a factored q function to propose, first,
    a change to i, and then values for all the
    hidden attributes of all the objects affected by
    that change
  • Scaling up - an acknowledged flaw of MCMC
    algorithms is that they often fail to scale, so an
    efficient algorithm is used to preprocess a large
    dataset and fragment it into many smaller,
    overlapping sets of elements that have a non-zero
    probability of matching
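A generic Metropolis-Hastings sketch for approximating an expectation over a distribution p; the target, proposal and sample counts below are toy assumptions and not the citation-matching sampler itself.

```python
import math
import random

def mcmc_expectation(log_p, propose, f, x0, samples=10000, burn_in=1000):
    """Approximate E_p[f(x)] with Metropolis-Hastings and a symmetric proposal."""
    x, total, kept = x0, 0.0, 0
    for t in range(samples + burn_in):
        y = propose(x)
        # accept the move with probability min(1, p(y)/p(x))
        if random.random() < math.exp(min(0.0, log_p(y) - log_p(x))):
            x = y
        if t >= burn_in:
            total += f(x)
            kept += 1
    return total / kept

# Toy usage: E[x^2] under a standard normal is approximately 1.0
est = mcmc_expectation(log_p=lambda x: -0.5 * x * x,
                       propose=lambda x: x + random.uniform(-1, 1),
                       f=lambda x: x * x, x0=0.0)
print(round(est, 1))
```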

34
Experimental Results
Results on four CiteSeer data sets, for the text
matching and MCMC algorithms
35
Reference
  • S. Slattery and M. Craven. Combining Statistical
    and Relational Methods for Learning in Hypertext
    Domains. In Proceedings of the 8th International
    Conference on Inductive Logic Programming.
    Springer-Verlag, 1998.
  • B. Taskar, E. Segal and D. Koller. Probabilistic
    Classification and Clustering in Relational Data.
    In Proceedings of IJCAI-01, 2001.
  • H. Pasula, B. Marthi, B. Milch, S. Russell, and
    I. Shpitser. Identity Uncertainty and Citation
    Matching. In Advances in Neural Information
    Processing Systems 15 (NIPS 2002). MIT Press,
    2003.