Title: Statistical Relational Learning for Link Prediction
1 Statistical Relational Learning for Link Prediction
- Alexandrin Popescul and Lyle H. Unger
- Presented by Ron Bjarnason
- 11 November 2003
2 Link Prediction
- Link prediction is an important problem arising in many domains
  - Web pages
  - Computers
  - Scientific publications
  - Organizations
  - People
- Being able to predict the presence of links or connections in a domain is both important and difficult to do well
3 Characteristics in Link Prediction Domains
- Their nature is inherently multi-relational
  - This makes the standard flat-file domain representation inadequate
- Data is often noisy or partially observed
  - e.g. articles may be cited for any number of reasons, and those reasons are not fully observed
4 Typical Learning Approaches
- Assume a one-table, flat domain representation
- The process of feature creation is decoupled from feature selection (and is often performed manually)
- Relevant features may not be readily apparent to the human eye
5 The Full Join Approach
- Perform a full join on the entire database and statistically analyze the entries
- Both impractical and incorrect
  - Size is prohibitive
  - Notion of an object is lost (stored across multiple rows)
  - Entries will be atomic attribute values, rather than results from a complex search
  - Negates the option to introduce intelligent search heuristics
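
To make the "object stored across multiple rows" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The Citation and Author relation names come from the slides; the column names and the toy rows are assumptions made purely for illustration.

```python
import sqlite3

# Toy database: one document (d1) with 3 outgoing citations and 2 authors.
# Column names (from_doc, to_doc, doc, person) are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Citation(from_doc TEXT, to_doc TEXT);
    CREATE TABLE Author(doc TEXT, person TEXT);
    INSERT INTO Citation VALUES ('d1', 'd2'), ('d1', 'd3'), ('d1', 'd4');
    INSERT INTO Author VALUES ('d1', 'a1'), ('d1', 'a2');
""")

# Joining the two relations repeats d1 once per (citation, author) pair.
joined = conn.execute("""
    SELECT c.from_doc, c.to_doc, a.person
    FROM Citation AS c JOIN Author AS a ON a.doc = c.from_doc
""").fetchall()

print(len(joined))  # 3 citations x 2 authors = 6 rows for a single document
```

Even in this tiny case a single document is spread over six rows, so row-level statistics no longer correspond to objects.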
6 The Relational Method
- Integrates standard statistical modeling (logistic regression) with a process for systematically generating features from relational data
- Feature generation is formulated as search in the space of relational database queries
- Search space bias can be controlled by specifying valid query types
  - Aggregations or statistical operations
  - Groupings
  - Richer join conditions
  - Arg-max based queries
- Allows for discovery of complex, interesting relationships
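
As an illustration of a feature expressed as an aggregated database query, the sketch below counts the documents cited by both members of a candidate pair. It is only one example of the kind of query such a search might generate, not a query taken from the paper; the column names and toy data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Citation(from_doc TEXT, to_doc TEXT);
    INSERT INTO Citation VALUES
        ('d1', 'd3'), ('d1', 'd4'),
        ('d2', 'd3'), ('d2', 'd4'), ('d2', 'd5');
""")

def shared_citation_count(conn, doc_a, doc_b):
    """Hypothetical feature: COUNT aggregate over a self-join of Citation."""
    query = """
        SELECT COUNT(*)
        FROM Citation AS c1 JOIN Citation AS c2 ON c1.to_doc = c2.to_doc
        WHERE c1.from_doc = ? AND c2.from_doc = ?
    """
    return conn.execute(query, (doc_a, doc_b)).fetchone()[0]

feature_value = shared_citation_count(conn, "d1", "d2")
print(feature_value)  # d1 and d2 both cite d3 and d4, so the count is 2
```

The scalar returned by the query is what would be passed to the statistical model as one candidate feature for the pair (d1, d2).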
7 Link Prediction in the Citeseer Domain
- Can be used as a citation recommendation service
  - User would provide an abstract, author names, possibly a partial reference list
- Citeseer provides a rich set of relational data
  - Texts of titles
  - Abstracts and documents
  - Citation information
  - Author names and affiliations
  - Conference or journal names
8 Methodology
- Couple the two main processes
  - Generation of feature candidates from relational data
  - Their selection with statistical model selection criteria
9 Relational Feature Generation
- Main principle of search formulation is based on the concept of refinement graphs
- Start with the most general clauses and progress by refining them into more specialized clauses
10 Relational Feature Generation: Refinement Graphs
- Directed acyclic graphs specifying the search space
- Constrained by specifying legal clauses
  - Negation and recursion disallowed
- Structured by partial ordering of clauses
- A search node is expanded (refined) to produce its most general specializations
- ILP systems using refinement graph search usually apply two refinement operators
  - Add a predicate to a clause
  - A single variable substitution
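
The two refinement operators can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: clauses are tuples of literals, variables are integers, the predicate names and arities are hypothetical, and no attempt is made to detect clauses that are equivalent up to variable renaming.

```python
from itertools import combinations

# A clause body is a tuple of literals; a literal is (predicate, argument_tuple).
# Variables are integers. The predicates and arities below are hypothetical.
PREDICATES = {"cites": 2, "author": 2}

def variables(clause):
    """Return the sorted distinct variables appearing in a clause."""
    return sorted({v for _, args in clause for v in args})

def add_predicate(clause):
    """Operator 1: append a literal whose arguments are all fresh variables."""
    start = max(variables(clause), default=-1) + 1
    for pred, arity in PREDICATES.items():
        fresh = tuple(range(start, start + arity))
        yield clause + ((pred, fresh),)

def substitute_variable(clause):
    """Operator 2: replace one variable with another (a single substitution)."""
    for keep, drop in combinations(variables(clause), 2):
        yield tuple(
            (pred, tuple(keep if v == drop else v for v in args))
            for pred, args in clause
        )

def refine(clause):
    """Expand a search node into its immediate refinements."""
    yield from add_predicate(clause)
    yield from substitute_variable(clause)

start = (("cites", (0, 1)), ("cites", (2, 3)))
children = list(refine(start))
print(len(children))  # 2 added-predicate children + 6 substitution children
```

One of the substitution children is `cites(0, 1), cites(2, 1)`, which reads as "documents 0 and 2 both cite document 1": a more specialized clause reached from a more general one.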
11 Relational Feature Generation: Aggregates
- Query results are aggregated to produce scalar numeric values to be used in statistical learning
- Any statistical aggregate can be valid, but some are expected to be more useful than others
  - Count
  - Average
  - Max
  - Min
  - Mode
  - Empty
- Aggregations are considered for inclusion at each node, but not factored into further search
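
A minimal sketch of turning one query result into scalar feature values with the aggregates listed above. The function name is hypothetical, and "Empty" is interpreted here as a 0/1 indicator that the query returned no rows, which is an assumption about the slide's meaning.

```python
from statistics import mean, mode

def aggregate(values):
    """Map a list of numeric query results to candidate scalar features."""
    values = list(values)
    if not values:
        # Only count and the emptiness indicator are defined for no rows.
        return {"count": 0, "empty": 1}
    return {
        "count": len(values),
        "average": mean(values),
        "max": max(values),
        "min": min(values),
        "mode": mode(values),
        "empty": 0,  # assumed meaning: 1 if the result set is empty, else 0
    }

features = aggregate([2, 3, 3, 4])
print(features)
print(aggregate([]))
```

Each entry in the returned dictionary is a separate candidate feature that the selection step may accept or reject.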
12 Relational Feature Selection
- Logistic regression is used for binary classification problems
- Regression coefficients are learned to maximize the likelihood function
- Stepwise model selection and the Bayesian Information Criterion (BIC) are used to avoid overfitting
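
The selection step can be sketched as forward stepwise selection scored by BIC = -2 ln L + k ln n, where L is the maximized likelihood, k the number of parameters and n the number of observations. The code below is a toy illustration under assumptions: it uses plain gradient ascent as a stand-in optimizer, the forward-only search is one possible reading of "stepwise", and all function names and data are hypothetical.

```python
import math

def fit_logistic(rows, y, steps=2000, lr=0.5):
    """Fit logistic regression by gradient ascent; return the log-likelihood."""
    n = len(rows)
    design = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    w = [0.0] * len(design[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t in zip(design, y):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j, xj in enumerate(x):
                grad[j] += (t - p) * xj
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    ll = 0.0
    for x, t in zip(design, y):
        z = sum(wi * xi for wi, xi in zip(w, x))
        ll += t * z - math.log1p(math.exp(z))  # log-likelihood of one example
    return ll

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def forward_select(X, y):
    """Add one feature at a time while doing so lowers the BIC."""
    n, n_features = len(X), len(X[0])

    def score(cols):
        rows = [[x[c] for c in cols] for x in X]
        return bic(fit_logistic(rows, y), len(cols) + 1, n)

    selected, best = [], score([])
    while True:
        candidates = [(score(selected + [c]), c)
                      for c in range(n_features) if c not in selected]
        if not candidates or min(candidates)[0] >= best:
            return selected
        best, chosen = min(candidates)
        selected.append(chosen)

# Toy data: feature 0 is informative, feature 1 is unrelated to the label.
X, y = [], []
for noise in (0, 1):
    for signal, n_pos in ((1, 9), (0, 1)):
        for i in range(10):
            X.append([signal, noise])
            y.append(1 if i < n_pos else 0)

selected = forward_select(X, y)
print(selected)  # [0]: the uninformative feature does not pay for its BIC penalty
```

The k ln n penalty is what keeps the noise feature out: it cannot raise the likelihood enough to offset the cost of an extra parameter.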
13 Tasks and Data: IID Violation
- The relational structure violates the assumption that observations are independent
- This can be remedied by choosing the right features
- When the right features are used, the observations are independent given the features
14 Two Prediction Tasks
- Task 1: The identity of all objects is known. Some link structure is known. Predict unobserved links.
- Task 2: New objects arrive. Predict their links.
  - What do we know about the objects?
    - Some of their links
    - Some of their attributes
- This paper presents results for task 1
15 The Citeseer Environment
- 271,343 documents
- 1,092,200 citations
- Five data sets defined
  - Four data sets consist of links among documents containing a certain query phrase (e.g. artificial intelligence)
  - Fifth data set includes all documents
16 Learning Methodology
- Populate three relations: Citation, Author and PublishedIn
- Sample 2,500 citations each of
  - Positive training examples (from available links)
  - Negative training examples (absence of a link)
  - Positive test examples
  - Negative test examples
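
The sampling step can be sketched as follows. This is an illustration only: the function name is hypothetical, and drawing negatives as uniformly random document pairs with no observed link is an assumption about how "absence of a link" is sampled.

```python
import random

def sample_examples(links, docs, n_per_set, seed=0):
    """Draw disjoint train/test sets of positive and negative document pairs."""
    rng = random.Random(seed)
    link_set = set(links)
    # Positives: observed links, split between training and test.
    positives = rng.sample(sorted(link_set), 2 * n_per_set)
    # Negatives: random ordered pairs with no observed link (assumed scheme).
    # Note: this loop does not terminate if too few unlinked pairs exist.
    negatives, seen = [], set()
    while len(negatives) < 2 * n_per_set:
        pair = tuple(rng.sample(docs, 2))
        if pair not in link_set and pair not in seen:
            seen.add(pair)
            negatives.append(pair)
    return {
        "train_pos": positives[:n_per_set],
        "test_pos": positives[n_per_set:],
        "train_neg": negatives[:n_per_set],
        "test_neg": negatives[n_per_set:],
    }

# Toy data: 20 documents linked in a chain (the slides use 2,500 per set).
docs = [f"d{i}" for i in range(20)]
links = [(docs[i], docs[i + 1]) for i in range(19)]
sets = sample_examples(links, docs, n_per_set=4)
print({name: len(pairs) for name, pairs in sets.items()})
```

Sampling equal numbers of positives and negatives corresponds to the balanced priors under which the accuracies are reported.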
17 Learning Methodology
- Remove the test set citations from the background data (but no other relevant information)
- Remove the training set citations from the background data (so answers are not contained in background information)
- Perform learning
  - Using citations only
  - Using all relevant information (citation, authors and venue)
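
A minimal sketch of the removal step, assuming links are stored as (citing, cited) pairs. The function name and toy data are hypothetical; the point is only that sampled positive links are dropped from the Citation relation before features are computed, while other information is left in place.

```python
def build_background(all_links, train_pos, test_pos):
    """Drop every sampled positive link from the background Citation relation."""
    held_out = set(train_pos) | set(test_pos)
    return [link for link in all_links if link not in held_out]

all_links = [("d1", "d2"), ("d1", "d3"), ("d2", "d3"), ("d3", "d4")]
background = build_background(
    all_links,
    train_pos=[("d1", "d2")],
    test_pos=[("d3", "d4")],
)
print(background)  # [('d1', 'd3'), ('d2', 'd3')]
```

Without this step a feature could simply look up the link being predicted, which would make the reported accuracies meaningless.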
18 Results: Training and test set accuracies (%), balanced priors
Dataset                    BK: Citation       BK: All
                           Train    Test      Train    Test
artificial intelligence    90.24    89.68     92.60    92.14
data mining                87.40    87.20     89.70    89.18
information retrieval      85.98    85.34     88.88    88.82
machine learning           89.40    89.14     91.42    91.14
Entire collection          92.80    92.28     93.66    93.22
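
To read the table more easily, the short snippet below computes the test accuracy gain from using all background knowledge instead of citations only. The numbers are copied from the table above; the variable names are hypothetical.

```python
# (test accuracy with BK: Citation, test accuracy with BK: All), in percent
test_accuracy = {
    "artificial intelligence": (89.68, 92.14),
    "data mining": (87.20, 89.18),
    "information retrieval": (85.34, 88.82),
    "machine learning": (89.14, 91.14),
    "Entire collection": (92.28, 93.22),
}

gains = {
    name: round(all_bk - citation_only, 2)
    for name, (citation_only, all_bk) in test_accuracy.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

In every data set the test accuracy is higher when author and venue information is added to the citation data.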
19 The End