Statistical Relational Learning for Link Prediction

1
Statistical Relational Learning for Link
Prediction
  • Alexandrin Popescul and Lyle H. Unger
  • Presented by Ron Bjarnason
  • 11 November 2003

2
Link Prediction
  • Link Prediction is an important problem arising
    in many domains
  • Web pages
  • Computers
  • Scientific publications
  • Organizations
  • People

Being able to predict the presence of links or
connections in a domain is both important and
difficult to do well.

3
Characteristics in Link Prediction Domains
  • Their nature is inherently multi-relational
  • This makes the standard flat file domain
    representation inadequate
  • Data is often noisy or partially observed
  • e.g. articles may be cited for any number of
    reasons, and those reasons are not fully observed

4
Typical Learning Approaches
  • Assume one-table flat domain representation
  • Process of feature creation is decoupled from
    feature selection (and is often performed
    manually)
  • Relevant features may not be readily observed by
    human eyes

5
The Full Join Approach
  • Perform a full join on the entire database and
    statistically analyze the entries
  • Both impractical and incorrect
  • Size is prohibitive
  • Notion of an object is lost (stored across
    multiple rows)
  • Entries will be atomic attribute values, rather
    than results from a complex search
  • Negates option to introduce intelligent search
    heuristics

6
The Relational Method
  • Integrates standard statistical modeling
    (logistic regression) with a process for
    systematically generating features from
    relational data
  • Feature generation is formulated as search in the
    space of relational database queries
  • Search space bias can be controlled by specifying
    valid query types
  • Aggregations or statistical operations
  • Groupings
  • Richer join conditions
  • Arg-max based queries
  • Allows for discovery of complex, interesting
    relationships
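
A minimal sketch of "feature generation as search in the space of database queries", under stated assumptions: the tables Citation(citing, cited) and Author(doc, name), their toy rows, and the two example queries are all hypothetical and are not taken from the paper. Each candidate feature is a query about a document pair (d1, d2), reduced to one number by an aggregate.

```python
import sqlite3

# Toy in-memory database; schema and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Citation (citing TEXT, cited TEXT);
    CREATE TABLE Author (doc TEXT, name TEXT);
    INSERT INTO Citation VALUES
        ('a','c'), ('b','c'), ('a','d'), ('b','d'), ('e','c');
    INSERT INTO Author VALUES
        ('a','smith'), ('b','smith'), ('b','jones'), ('e','lee');
""")

# Each candidate feature is a parameterized query that returns a single
# number for the pair (d1, d2). A real search would generate these queries
# systematically instead of listing them by hand.
FEATURES = {
    # number of documents cited by both d1 and d2
    "count_shared_cited": """
        SELECT COUNT(*) FROM Citation c1
        JOIN Citation c2 ON c1.cited = c2.cited
        WHERE c1.citing = :d1 AND c2.citing = :d2""",
    # number of author names that d1 and d2 have in common
    "count_shared_authors": """
        SELECT COUNT(*) FROM Author a1
        JOIN Author a2 ON a1.name = a2.name
        WHERE a1.doc = :d1 AND a2.doc = :d2""",
}

def feature_vector(d1, d2):
    """Evaluate every candidate query for one document pair."""
    params = {"d1": d1, "d2": d2}
    return {name: conn.execute(sql, params).fetchone()[0]
            for name, sql in FEATURES.items()}
```

Calling feature_vector("a", "b") returns one numeric value per query, which is the form a statistical model such as logistic regression can consume.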

7
Link Prediction in the Citeseer Domain
  • Can be used as a citation recommendation service
  • User would provide an abstract, author names,
    possibly a partial reference list
  • Citeseer provides a rich set of relational data
  • Texts of titles
  • Abstracts and documents
  • Citation information
  • Author names and affiliations
  • Conference or journal names

8
Methodology
  • Couple the two main processes
  • Generation of feature candidates from relational
    data
  • Their selection with statistical model selection
    criteria

9
Relational Feature Generation
  • Main principle of search formulation is based on
    the concept of refinement graphs
  • Start with the most general clauses and progress
    by refining them into more specialized clauses

10
Relational Feature Generation: Refinement Graphs
  • Directed acyclic graphs specifying search space
  • Constrained by specifying legal clauses
  • Negation and recursion disallowed
  • Structured by partial ordering of clauses
  • A search node is expanded (refined) to produce
    the most general specializations
  • ILP systems using refinement graph search usually
    apply two refinement operators
  • Add a predicate to a clause
  • A single variable substitution
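
The two refinement operators named above can be sketched as follows. This is an illustration only: the clause representation (a tuple of (predicate, arguments) literals), the predicate names in SCHEMA, and the variable naming scheme V0, V1, ... are assumptions, and the sketch does no duplicate detection or pruning.

```python
from itertools import combinations

# Hypothetical schema: predicate name -> arity.
SCHEMA = {"cites": 2, "author_of": 2}

def variables(clause):
    """Distinct variables of a clause, in order of first appearance."""
    seen = []
    for _, args in clause:
        for v in args:
            if v not in seen:
                seen.append(v)
    return seen

def fresh_index(clause):
    """Smallest variable index not yet used in the clause."""
    used = [int(v[1:]) for v in variables(clause)]
    return max(used, default=-1) + 1

def add_predicate(clause):
    """Operator 1: append a literal whose arguments are all fresh variables."""
    start = fresh_index(clause)
    for pred, arity in SCHEMA.items():
        fresh = tuple(f"V{start + i}" for i in range(arity))
        yield clause + ((pred, fresh),)

def substitute_variable(clause):
    """Operator 2: a single substitution that replaces variable b by a."""
    for a, b in combinations(variables(clause), 2):
        yield tuple((pred, tuple(a if v == b else v for v in args))
                    for pred, args in clause)

def refine(clause):
    """All one-step specializations of a clause (its children in the graph)."""
    return list(add_predicate(clause)) + list(substitute_variable(clause))
```

Starting from the single literal cites(V0, V1), refine yields three children: two with an added literal over fresh variables and one with V1 replaced by V0. Repeated application walks the graph from general to more specialized clauses.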

11
Relational Feature Generation: Aggregates
  • Query results are aggregated to produce scalar
    numeric values to be used in statistical learning
  • Any statistical aggregate can be valid, but some
    are expected to be more useful than others
  • Count
  • Average
  • Max
  • Min
  • Mode
  • Empty
  • Aggregations are considered for inclusion at each
    node, but not factored into further search
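
A small sketch of the aggregation step, assuming a query has already returned a list of numeric values. The function names and the choice to return 0.0 for average, max, min and mode on an empty result are assumptions made for this example; the slides do not say how empty results are handled.

```python
from statistics import mean, mode

# Each aggregate maps a list of query results to one scalar.
# Returning 0.0 on an empty list is an assumption of this sketch.
AGGREGATES = {
    "count": lambda xs: len(xs),
    "average": lambda xs: mean(xs) if xs else 0.0,
    "max": lambda xs: max(xs) if xs else 0.0,
    "min": lambda xs: min(xs) if xs else 0.0,
    "mode": lambda xs: mode(xs) if xs else 0.0,
    "empty": lambda xs: int(len(xs) == 0),  # 1 if the query returned nothing
}

def aggregate_features(query_name, values):
    """Turn one query result into one candidate feature per aggregate."""
    return {f"{name}({query_name})": agg(values)
            for name, agg in AGGREGATES.items()}
```

In line with the last bullet, the features produced here would be offered to model selection at the current search node, while refinement continues from the unaggregated query.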

12
Relational Feature Selection
  • Logistic Regression is used for binary
    classification problems
  • Regression coefficients are learned to maximize
    the likelihood function
  • Stepwise model selection and Bayesian Information
    Criterion (BIC) are used to avoid overfitting
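
A rough sketch of BIC-guided selection, under stated assumptions: this is a forward-only greedy variant fitted by plain gradient ascent, written for illustration, and not the authors' implementation. It uses the standard definition BIC = -2 log L + k ln n, where k is the number of fitted parameters (including the intercept) and n is the number of examples. All function names are hypothetical.

```python
import math

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() and log() finite
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, steps=2000, lr=0.5):
    """Fit weights (intercept first) by batch gradient ascent on the
    log-likelihood; return (weights, log_likelihood)."""
    rows = [[1.0] + [float(v) for v in r] for r in X]
    n = len(rows)
    w = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for r, t in zip(rows, y):
            err = t - sigmoid(sum(wi * xi for wi, xi in zip(w, r)))
            for j, xj in enumerate(r):
                grad[j] += err * xj
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    log_lik = 0.0
    for r, t in zip(rows, y):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, r)))
        log_lik += math.log(p) if t == 1 else math.log(1.0 - p)
    return w, log_lik

def bic(log_lik, k, n):
    return -2.0 * log_lik + k * math.log(n)

def forward_stepwise(columns, y):
    """columns: dict of feature name -> list of values.
    Repeatedly add the feature that lowers BIC most; stop when none does."""
    n = len(y)
    selected = []
    _, ll = fit_logistic([[] for _ in y], y)  # intercept-only model
    best = bic(ll, 1, n)
    while True:
        candidates = []
        for name in columns:
            if name in selected:
                continue
            trial = selected + [name]
            X = [[columns[c][i] for c in trial] for i in range(n)]
            _, ll = fit_logistic(X, y)
            candidates.append((bic(ll, len(trial) + 1, n), name))
        if not candidates or min(candidates)[0] >= best:
            return selected, best
        best, chosen = min(candidates)
        selected.append(chosen)
```

The ln n penalty is what guards against overfitting: a feature is kept only if it raises the log-likelihood by more than half of ln n, so a feature unrelated to the label is rejected even though it can never lower the training likelihood.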

13
Tasks and Data: IID Violation
  • The relational structure violates the assumption
    of independence
  • This can be remedied by choosing the right
    features
  • When the right features are used, the
    observations are independent given the features

14
Two Prediction Tasks
  • Task 1: The identity of all objects is known. Some
    link structure is known. Predict unobserved links.
  • Task 2: New objects arrive. Predict their links.
  • What do we know about the objects?
  • Some of their links
  • Some of their attributes
  • This paper presents results for task 1

15
The Citeseer Environment
  • 271,343 documents
  • 1,092,200 citations
  • Five data sets defined
  • Four data sets consist of links among documents
    containing a certain query phrase (e.g.
    artificial intelligence)
  • Fifth data set includes all documents

16
Learning Methodology
  • Populate three relations: Citation, Author and
    PublishedIn
  • Sample 2,500 examples for each of the following:
  • Positive training examples (from available links)
  • Negative training examples (absence of a link)
  • Positive test examples
  • Negative test examples

17
Learning Methodology
  • Remove citations from test set (but no other
    relevant information)
  • Remove citations from training set (so answers
    are not contained in background information)
  • Perform learning
  • Using citations only
  • Using all relevant information (citation, authors
    and venue)
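
The sampling and removal steps on the two Learning Methodology slides can be sketched as below. This is an assumption-laden illustration: the function name make_split, the representation of citations as (citing, cited) pairs, and the rejection-sampling strategy for negatives are all hypothetical, and the loop assumes the graph is sparse enough that non-links are easy to find.

```python
import random

def make_split(links, nodes, k, seed=0):
    """Sample k positive and k negative examples for training and for test,
    and return the background links with all sampled positives removed."""
    rng = random.Random(seed)
    link_set = set(links)

    # Positive examples: observed links, split into disjoint train and test.
    pos = rng.sample(sorted(link_set), 2 * k)

    # Negative examples: ordered pairs of distinct nodes with no observed link.
    neg = set()
    while len(neg) < 2 * k:
        a, b = rng.sample(nodes, 2)
        if (a, b) not in link_set:
            neg.add((a, b))
    neg = sorted(neg)
    rng.shuffle(neg)

    # Remove sampled citations from the background relation so the answers
    # are not contained in the information used to build features.
    held_out = set(pos)
    background = [link for link in links if link not in held_out]

    return {
        "train_pos": pos[:k], "test_pos": pos[k:],
        "train_neg": neg[:k], "test_neg": neg[k:],
        "background": background,
    }
```

Only the sampled citation links are removed; other relations, such as Author and PublishedIn, would stay untouched, matching the "but no other relevant information" note above.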

18
Results: Training and Test Set Accuracies (%), balanced priors

Dataset                    BK: Citation       BK: All
                           Train    Test      Train    Test
artificial intelligence    90.24    89.68     92.60    92.14
data mining                87.40    87.20     89.70    89.18
information retrieval      85.98    85.34     88.88    88.82
machine learning           89.40    89.14     91.42    91.14
Entire collection          92.80    92.28     93.66    93.22

19
The End