Statistical Relational Learning for Link Prediction

Slides: 20

Provided by: Ron5150

Learn more at: https://web.engr.oregonstate.edu

1
Statistical Relational Learning for Link
Prediction

Alexandrin Popescul and Lyle H. Unger
Presented by Ron Bjarnason
11 November 2003

2
Link Prediction

Link Prediction is an important problem arising
in many domains
Web pages
Computers
Scientific publications
Organizations
People

Predicting the presence of links or connections
in a domain is both important and difficult to
do well

3
Characteristics in Link Prediction Domains

Their nature is inherently multi-relational
This makes the standard flat file domain
representation inadequate
Data is often noisy or partially observed
e.g. articles may be cited for any number of
reasons, and those reasons are not fully observed

4
Typical Learning Approaches

Assume one-table flat domain representation
Process of feature creation is decoupled from
feature selection (and is often performed
manually)
Relevant features may not be readily observed by
human eyes

5
The Full Join Approach

Perform a full join on the entire database and
statistically analyze the entries
Both impractical and incorrect
Size is prohibitive
Notion of an object is lost (stored across
multiple rows)
Entries will be atomic attribute values, rather
than results from a complex search
Negates option to introduce intelligent search
heuristics

6
The Relational Method

Integrates standard statistical modeling
(logistic regression) with a process for
systematically generating features from
relational data
Feature generation is formulated as search in the
space of relational database queries
Search-space bias can be controlled by specifying
valid query types
Aggregations or statistical operations
Groupings
Richer join conditions
Arg-max based queries
Allows for discovery of complex, interesting
relationships
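
As a rough illustration of a feature defined by a database query, the sketch below computes one aggregated feature for a candidate link: the number of documents cited by both endpoints. The table layout (Citation with columns src and dst), the toy rows, and the function name common_citations are assumptions made for this example and are not taken from the slides.

```python
import sqlite3

# Hypothetical toy schema: Citation(src, dst) means document src cites dst.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Citation (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO Citation VALUES (?, ?)",
    [("d1", "a"), ("d1", "b"), ("d1", "c"),
     ("d2", "b"), ("d2", "c"),
     ("d3", "x")],
)

def common_citations(conn, d1, d2):
    """Aggregated query: COUNT of documents cited by both d1 and d2."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM Citation AS c1
        JOIN Citation AS c2 ON c1.dst = c2.dst
        WHERE c1.src = ? AND c2.src = ?
        """,
        (d1, d2),
    ).fetchone()
    return row[0]

# One scalar feature value per candidate pair of documents.
feature_d1_d2 = common_citations(conn, "d1", "d2")
feature_d1_d3 = common_citations(conn, "d1", "d3")
```

Each such query yields one numeric column for the statistical model, so searching over queries amounts to searching over candidate features.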

7
Link Prediction in the Citeseer Domain

Can be used as a citation recommendation service
User would provide an abstract, author names,
possibly a partial reference list
Citeseer provides a rich set of relational data
Texts of titles
Abstracts and documents
Citation information
Author names and affiliations
Conference or journal names

8
Methodology

Couple the two main processes
Generation of feature candidates from relational
data
Their selection with statistical model selection
criteria

9
Relational Feature Generation

Main principle of search formulation is based on
the concept of refinement graphs
Start with the most general clauses and progress
by refining them into more specialized clauses

10
Relational Feature Generation: Refinement Graphs

Directed acyclic graphs specifying search space
Constrained by specifying legal clauses
Negation and recursion disallowed
Structured by partial ordering of clauses
A search node is expanded (refined) to produce
the most general specializations
ILP systems using refinement graph search usually
apply two refinement operators
Add a predicate to a clause
A single variable substitution
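
The two refinement operators can be sketched on a toy clause representation. Here a clause is a tuple of literals, each literal is a (predicate, arguments) pair, and variables are named V0, V1, and so on. The schema, the variable naming scheme, and all function names are hypothetical choices for this sketch, not the representation used by any particular ILP system.

```python
from itertools import combinations

# Hypothetical schema: predicate name -> arity.
SCHEMA = {"Citation": 2, "Author": 2}

def variables(clause):
    """Variables of a clause, in order of first appearance."""
    seen = []
    for _, args in clause:
        for v in args:
            if v not in seen:
                seen.append(v)
    return seen

def fresh_variables(clause, count):
    """Names V<k> that do not yet occur in the clause."""
    used = [int(v[1:]) for v in variables(clause)]
    start = max(used) + 1 if used else 0
    return tuple(f"V{start + i}" for i in range(count))

def add_predicate(clause):
    """Operator 1: append one literal whose arguments are fresh variables."""
    return [clause + ((pred, fresh_variables(clause, arity)),)
            for pred, arity in SCHEMA.items()]

def substitute_variable(clause):
    """Operator 2: replace one variable by another, unifying the two."""
    out = []
    for keep, drop in combinations(variables(clause), 2):
        out.append(tuple(
            (pred, tuple(keep if v == drop else v for v in args))
            for pred, args in clause
        ))
    return out

def refine(clause):
    """Expand one search node into its immediate specializations."""
    return add_predicate(clause) + substitute_variable(clause)

start = (("Citation", ("V0", "V1")),)
children = refine(start)
```

Applying refine repeatedly to the children walks down the refinement graph from general clauses to more specialized ones. A real system would also filter out clauses that the language bias declares illegal.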

11
Relational Feature Generation: Aggregates

Query results are aggregated to produce scalar
numeric values to be used in statistical learning
Any statistical aggregate can be valid, but some
are expected to be more useful than others
Count
Average
Max
Min
Mode
Empty
Aggregations are considered for inclusion at each
node, but not factored into further search
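
A minimal sketch of the aggregation step, assuming a query result arrives as a list of numbers. The function name aggregate and the choice to return None for aggregates that are undefined on an empty result are assumptions of this example.

```python
from statistics import mean, mode

def aggregate(values):
    """Turn one query result (a list of numbers) into scalar features."""
    empty = len(values) == 0
    return {
        "count": len(values),
        "empty": int(empty),  # 1 if the query returned no rows
        # The aggregates below are undefined on an empty result;
        # returning None in that case is an assumption of this sketch.
        "average": None if empty else mean(values),
        "max": None if empty else max(values),
        "min": None if empty else min(values),
        "mode": None if empty else mode(values),
    }

features = aggregate([3, 1, 3, 5])
no_rows = aggregate([])
```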

12
Relational Feature Selection

Logistic Regression is used for binary
classification problems
Regression coefficients are learned to maximize
the likelihood function
Stepwise model selection and Bayesian Information
Criterion (BIC) are used to avoid overfitting
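
The selection step can be sketched as forward stepwise logistic regression scored by BIC = -2 log L + k ln n, where k is the number of fitted coefficients and n the number of examples. Everything below is an illustrative assumption: the toy data, the plain gradient-ascent fit (a stand-in for whatever optimizer the authors used), and the function names.

```python
import math

def fit_loglik(X, y, steps=1000, lr=0.5):
    """Fit logistic regression by gradient ascent; return the log-likelihood."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (yi - p) * xi[j]
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    # log L = sum_i [ y_i * z_i - log(1 + exp(z_i)) ]
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        total += yi * z - math.log1p(math.exp(z))
    return total

def bic(loglik, k, n):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * loglik + k * math.log(n)

def design(features, cols):
    """Rows of [1.0 (intercept), selected feature columns...]."""
    return [[1.0] + [row[c] for c in cols] for row in features]

def forward_stepwise(features, y):
    """Add one feature at a time while doing so lowers the BIC."""
    n = len(y)
    selected = []
    best = bic(fit_loglik(design(features, selected), y), 1, n)
    while True:
        scored = []
        for c in range(len(features[0])):
            if c not in selected:
                ll = fit_loglik(design(features, selected + [c]), y)
                scored.append((bic(ll, len(selected) + 2, n), c))
        if not scored or min(scored)[0] >= best:
            return selected, best
        best, chosen = min(scored)
        selected.append(chosen)

# Toy data: feature 0 agrees with the label in 34 of 40 rows,
# feature 1 alternates 0/1 and carries no information about the label.
y = [0] * 20 + [1] * 20
features = [[1.0 if (i < 3 or i >= 23) else 0.0, float(i % 2)]
            for i in range(40)]
selected, best_bic = forward_stepwise(features, y)
```

On this toy data the informative feature is kept and the uninformative one is rejected, because the penalty of ln n per extra coefficient outweighs its negligible gain in likelihood.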

13
Tasks and Data: IID Violation

The relational structure violates the assumption
of independence
This can be remedied by choosing the right
features
When the right features are used, the
observations are independent given the features

14
Two Prediction Tasks

1. The identity of all objects is known and some
link structure is known. Predict the unobserved
links.
2. New objects arrive. Predict their links.
What do we know about the objects?
Some of their links
Some of their attributes
This paper presents results for task 1

15
The Citeseer Environment

271,343 documents
1,092,200 citations
Five data sets defined
Four data sets consist of links among documents
containing a certain query phrase (e.g.
artificial intelligence)
Fifth data set includes all documents

16
Learning Methodology

Populate three relations: Citation, Author and
PublishedIn
Sample 2,500 citations for each of:
Positive training examples (from available links)
Negative training examples (absence of a link)
Positive test examples
Negative test examples
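
The sampling step could look roughly like the sketch below, which draws equally sized positive and negative sets and splits each into training and test halves. Keeping the training and test halves disjoint, drawing negatives uniformly over document pairs, the fixed seed, and the function name are assumptions of this example. The slides do not describe the sampling procedure in this detail.

```python
import random

def sample_four_sets(docs, citations, k, seed=0):
    """Draw k examples each: train/test positives and train/test negatives."""
    rng = random.Random(seed)
    observed = set(citations)
    # Positives: observed citation links.
    pos = rng.sample(sorted(observed), 2 * k)
    # Negatives: ordered pairs of distinct documents with no observed link.
    # Assumes enough such pairs exist; otherwise this loop would not end.
    neg = set()
    while len(neg) < 2 * k:
        src, dst = rng.sample(docs, 2)
        if (src, dst) not in observed:
            neg.add((src, dst))
    neg = sorted(neg)
    rng.shuffle(neg)
    return {
        "train_pos": pos[:k], "test_pos": pos[k:],
        "train_neg": neg[:k], "test_neg": neg[k:],
    }

# Toy data: ten documents citing each other in a chain.
docs = [f"d{i}" for i in range(10)]
citations = [(f"d{i}", f"d{i + 1}") for i in range(9)]
sets = sample_four_sets(docs, citations, k=3)
```

Sampling the same number of positives and negatives gives the balanced priors under which the accuracies are reported.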

17
Learning Methodology

Remove citations from test set (but no other
relevant information)
Remove citations from training set (so answers
are not contained in background information)
Perform learning
Using citations only
Using all relevant information (citation, authors
and venue)

18
Results: Training and test set accuracies (%),
balanced priors

Dataset                  BK: Citation      BK: All
                         Train    Test     Train    Test
artificial intelligence  90.24    89.68    92.60    92.14
data mining              87.40    87.20    89.70    89.18
information retrieval    85.98    85.34    88.88    88.82
machine learning         89.40    89.14    91.42    91.14
Entire collection        92.80    92.28    93.66    93.22

19
The End
