1
Probabilistic Logic Learning
Advanced Link Mining
15.01.04
  • Ramya Ramakrishnan

2
Overview
  • Statistical Predicate Invention
  • Probabilistic Classification and Clustering
  • Identity Uncertainty
  • - Experimental Results
  • Reference

3
Link Mining
  • Hypertext and link mining combine techniques
    from ILP with statistical learning algorithms to
    construct features from related documents
  • A learned link-based model defines a distribution
    over link and content attributes, which may be
    correlated based on the links between them; this
    calls for a more complex classification algorithm
    for relational learning
  • Record linkage is a link mining task used to
    resolve identity uncertainty; it is important to
    take into account not only the similarity between
    objects based on their attributes but also the
    similarity based on their links

4
Statistical Predicate Invention
  • The hypertext classifier combines
  • - a statistical text learning method
  • - a relational rule learner
  • Well suited to hypertext domains, which involve
  • - word frequencies
  • - hyperlinks
  • Applied to tasks that involve learning
  • - classes of pages
  • - particular relations
  • - locating a particular class of information

5
Motivation
  • Documents are related to one another by hyperlinks
  • - sources of evidence are found in neighbouring
    pages and hyperlinks
  • A suitable feature set is needed
  • - text classifiers have feature spaces of
    hundreds or thousands of words

6
Approaches to Hypertext Learning
  • Naive Bayes Algorithm - for text classification
  • FOIL - for relational learning
  • Evaluate this combined algorithm on tasks
  • - learning definition of page classes
  • - learning definitions of relations between
    pages
  • - learning to locate a particular type of
    information within pages

7
Naive Bayes
  • Represent documents using a bag-of-words
    representation (a minimal sketch follows below)
  • The feature vector consists of one feature for each
    word
  • - can be either boolean or continuous (some
    measure of the frequency of the word)
  • The position of the words does not matter
  • The occurrence of a given word in a document is
    assumed to be independent of all other words in the
    document
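As a minimal illustration (not from the slides), a bag-of-words Naive Bayes classifier could be sketched as follows; the toy documents, Laplace smoothing and multinomial word counting are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Train a multinomial Naive Bayes model on bag-of-words documents.
    docs: list of token lists; labels: parallel list of class labels."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)            # per-class word frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab):
    """Pick the class with the highest posterior; word order is ignored
    and words are treated as independent given the class."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in tokens:
            if w in vocab:                        # unseen words are skipped
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Toy usage
docs = [["course", "homework"], ["research", "paper"], ["homework", "exam"]]
labels = ["course", "research", "course"]
model = train_naive_bayes(docs, labels)
print(classify(["homework", "grading"], *model))  # -> 'course'
```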

8
Naive Bayes
  • Use some type of feature selection method (a
    sketch follows below)
  • Dropping un-informative words that occur on a
    stop list
  • Dropping words that occur fewer than a specified
    number of times in the training set
  • Ranking words by a measure such as their mutual
    information with the class variable, and then
    dropping low-ranked words
  • Stemming - the process of heuristically reducing
    words to their root form
  • - e.g., the words compute, computers and
    computing would be stemmed to comput
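A hedged sketch of these feature selection steps (stop list, frequency threshold, mutual-information ranking); the thresholds and function names are assumptions, and stemming is left out.

```python
import math
from collections import Counter

def mutual_information(word, docs, labels):
    """MI between the boolean feature 'word occurs in the document'
    and the class variable."""
    n = len(docs)
    mi = 0.0
    for present in (True, False):
        p_w = sum(1 for d in docs if (word in d) == present) / n
        for c in set(labels):
            p_c = labels.count(c) / n
            joint = sum(1 for d, l in zip(docs, labels)
                        if (word in d) == present and l == c) / n
            if joint > 0:
                mi += joint * math.log(joint / (p_w * p_c))
    return mi

def select_features(docs, labels, stop_list, min_count=2, top_k=1000):
    """Drop stop words and rare words, then keep the top-ranked words by MI."""
    doc_freq = Counter(w for d in docs for w in set(d))
    candidates = [w for w, c in doc_freq.items()
                  if w not in stop_list and c >= min_count]
    ranked = sorted(candidates,
                    key=lambda w: mutual_information(w, docs, labels),
                    reverse=True)
    return ranked[:top_k]
```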

9
Relational Text Learning
  • Enable learned classifiers
  • - to represent the relationships among documents
  • - to represent information about the
    occurrence of words in documents
  • The problem representation for our relational
    learning tasks consists of (illustrated below)
  • - link_to(Hyperlink,Page,Page)
  • - has_word(Page)
  • - has_anchor_word(Hyperlink)
  • - has_neighbourhood_word(Hyperlink)
  • - all_words_capitalized(Hyperlink)
  • - has_alphanumeric_word(Hyperlink)
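A small, hypothetical illustration of how such ground facts could be stored for a relational learner; the constants (page1, link3, ...) are invented, and the word is written as an explicit argument purely for readability.

```python
# Hypothetical ground facts: (predicate_name, argument_tuple)
facts = [
    ("link_to", ("link3", "page1", "page2")),          # link3 points from page1 to page2
    ("has_word", ("page1", "faculty")),                # word argument added for readability
    ("has_anchor_word", ("link3", "homepage")),
    ("has_neighbourhood_word", ("link3", "professor")),
    ("all_words_capitalized", ("link3",)),
]

def query(facts, predicate):
    """Return every argument tuple for which the predicate holds."""
    return [args for name, args in facts if name == predicate]

print(query(facts, "has_word"))   # [('page1', 'faculty')]
```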

10
Relational Rule Learning
  • Predicates describe the graph structure of the Web
  • Predicates capture word occurrences in pages and
    hyperlinks
  • The has_word, has_anchor_word and
    has_neighbourhood_word predicates provide the
    bag-of-words representation

11
Combination
  • Able to learn predicates that characterize pages
    or hyperlinks by their word statistics
  • Able to represent the graph structure of the Web,
    and thereby the word statistics of neighbouring
    pages and hyperlinks
  • FOIL also employs the bag-of-words representation
  • Two properties -
  • 1. It does not depend on the presence or
    absence of specific key words; the statistical
    classifiers in its learned rules consider the
    weighted evidence of many words
  • 2. It can perform feature selection in a more
    directed manner

12
  • Uses the relational learner to represent the
    graph structure and the statistical learner to
    characterize the edges and nodes of the graph
  • FOIL-PILFS (FOIL with Predicate Invention for
    Large Feature Spaces) is basically FOIL
    augmented with a predicate-invention method
  • The invented predicates are statistical classifiers
    applied to some textual description of pages,
    hyperlinks or their components
  • The assumption is that each constant in the problem
    domain has a type, and that each type may have
    one or more associated document collections
  • Each constant of the given type maps to a unique
    document in each associated collection
  • New predicates are invented at each step of the
    search for a clause

13
  • The next step is to assemble the training set for
    the Naive Bayes learner
  • The learner should focus mainly on the
    characteristics that are common to many of the
    documents in the training set, instead of on
    the characteristics of a few instances that
    occur many times in the training set
  • Our method should determine the vocabulary to be
    used by Naive Bayes
  • We do not necessarily want to allow Naive Bayes to
    use all of the words that occur in the training
    set as features

14
  • First, each word wi that occurs in the
    predicate's training set is ranked
  • Given this ranking, we take the vocabulary for
    the Naive Bayes classifier to be the n top-ranked
    words, where n is determined by (see the sketch
    below)
  • n = e × m
  • m - the number of instances in the predicate's
    training set
  • e - a parameter (set to 0.05 for the
    experiments)
  • This keeps the dimensionality (feature set size) of
    the predicate learning task small
  • - when a predicate is found that fits the
    training set well, it is likely to generalize to
    new instances of the target class
  • The class priors also have to be set in the Naive
    Bayes classifier
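Assuming n = e × m is rounded up, the vocabulary selection might look like the following sketch; vocabulary_size and build_vocabulary are made-up names, and ranked_words is assumed to come from a ranking such as the mutual-information one above.

```python
import math

def vocabulary_size(m, e=0.05):
    """n = e * m (rounded up): the vocabulary grows with the size of
    the predicate's training set."""
    return max(1, math.ceil(e * m))

def build_vocabulary(ranked_words, m, e=0.05):
    """Keep only the n top-ranked words as Naive Bayes features."""
    return ranked_words[:vocabulary_size(m, e)]

# e.g. a predicate training set with 200 instances keeps only 10 words
print(vocabulary_size(200))   # -> 10
```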

15
Summary
  • A hybrid relational/statistical approach to
    learning in hypertext domains
  • This approach is applicable to learning tasks
    other than those that involve hypertext
  • Well suited for domains that involve both
    relational structure and potentially large
    feature spaces

16
Probabilistic Classification and Clustering
  • Relational data is best described by relational
    models that capture probabilistic dependencies
    between related instances
  • Standard methods assume that data instances are
    independent and identically distributed (IID)
  • Classification and clustering approaches have
    been designed for such IID data, where each data
    instance is a fixed-length vector of attribute
    values
  • Real-world data sets are richer in structure
  • The IID assumption is violated for two papers
    written by the same author or two papers linked by
    citation, which are likely to have the same topic

17
Classification
  • Iterative classification - iteratively assign
    labels to test instances the classifier is
    confident about, and use these labels to
    classify related instances (sketched below)
  • FOIL-HUBS - for classifying web pages
  • Neither produces a single coherent model of
    the correlations between different related
    instances
  • Hence the method is purely procedural: the
    results of different classification steps or
    algorithms are combined without a unifying
    principle
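A hedged sketch of the iterative classification idea: confidently labelled test instances feed relational evidence to the rest. The confidence threshold, its relaxation schedule and the classify callback are assumptions.

```python
def iterative_classification(instances, neighbours, classify,
                             rounds=5, threshold=0.9):
    """classify(instance, neighbour_labels) -> (label, confidence).
    Labels are committed only when the classifier is confident; committed
    labels become features for the instances linked to them."""
    labels = {}
    for _ in range(rounds):
        changed = False
        for i in instances:
            if i in labels:
                continue
            neighbour_labels = [labels[j] for j in neighbours[i] if j in labels]
            label, conf = classify(i, neighbour_labels)
            if conf >= threshold:
                labels[i] = label
                changed = True
        if not changed:
            threshold -= 0.1        # relax the cut-off if nothing new was labelled
    return labels

# Toy usage with a dummy classifier that grows more confident with labelled neighbours
print(iterative_classification(
    instances=["p1", "p2"], neighbours={"p1": ["p2"], "p2": ["p1"]},
    classify=lambda i, nb: ("A", 0.95 if nb else 0.85)))
```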

18
Clustering
  • The emphasis is on dyadic data, such as
    word-document co-occurrence, document citations,
    web links and gene expression data
  • Here clustering is viewed over one or two
    types of instances with a single relation between
    them
  • But we have to model the richer structures present
    in many real-world domains
  • Identify, for each instance type, sub-populations
    of instances that are similar in both their
    attributes and their relations to other instances

19
PRM
  • Extends Bayesian networks to a relational setting
  • A language that allows capturing probabilistic
    dependencies between related instances
  • Accommodates the entire spectrum between purely
    supervised classification and purely unsupervised
    clustering
  • Can learn from data where some instances have a
    class label and others do not
  • Transductive learning setting, where we use the
    test data, without the labels, in the training
    phase
  • An EM algorithm learns such PRMs with latent
    variables from a relational database (a simplified
    sketch follows below)
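As a simplified, non-relational stand-in for such learning, the sketch below runs EM for a single latent class variable over flat binary attribute vectors; rows with a known label are clamped, which hints at how partially labelled (or transductive) data can be used. All names and modelling choices here are assumptions, not the PRM algorithm itself.

```python
import numpy as np

def em_latent_class(X, k, labels=None, iters=50, seed=0):
    """EM for a latent class model over binary attributes X (n x d).
    labels: optional list with a class id for labelled rows and -1 otherwise."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    resp = rng.dirichlet(np.ones(k), size=n)              # soft class assignments
    for _ in range(iters):
        if labels is not None:                             # clamp labelled instances
            for i, y in enumerate(labels):
                if y >= 0:
                    resp[i] = np.eye(k)[y]
        # M-step: class priors and per-class Bernoulli parameters (with smoothing)
        prior = resp.sum(axis=0) / n
        theta = (resp.T @ X + 1) / (resp.sum(axis=0)[:, None] + 2)
        # E-step: recompute responsibilities from the current parameters
        log_p = (np.log(prior)[None, :]
                 + X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return prior, theta, resp

# Toy usage: two latent clusters over three binary attributes, two rows labelled
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
prior, theta, resp = em_latent_class(X, k=2, labels=[0, -1, 1, -1])
```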

20
Models for relational data
  • Data instances are assumed to be IID samples
  • Each instance belongs to exactly one of a fixed
    number of classes or clusters
  • A PRM is a template for a probability distribution
    over a relational database of a given schema
  • Probabilistic Model -
  • - the quantitative part of the PRM specifies the
    parameterization of the model
  • - given a set of parents for an attribute, we
    define a local probability model by associating
    with it a CPD (conditional probability
    distribution)

21
Aggregates
  • There are many possible choices of aggregation
    operator to allow dependencies on a set of
    variables
  • The mode aggregate computes the most common
    value of its parents, but it is not very sensitive
    to the distribution of values of its parents
  • The stochastic mode aggregate defines a set of
    distributions; the aggregate is a weighted
    average of these distributions, where the weight
    of a value vi is the frequency of this value
    (see the sketch below)
  • It can be viewed as a randomized selector node
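A hedged sketch contrasting the two aggregates; the per-value CPDs and the movie-genre toy example are assumptions made for illustration.

```python
from collections import Counter

def mode_aggregate(parent_values):
    """Most common parent value - insensitive to the rest of the distribution."""
    return Counter(parent_values).most_common(1)[0][0]

def stochastic_mode_cpd(parent_values, cpd_per_value):
    """Weighted average of the per-value CPDs, where the weight of a value vi
    is its frequency among the parents (the 'randomized selector' view)."""
    counts = Counter(parent_values)
    total = sum(counts.values())
    mixed = {}
    for v, freq in counts.items():
        for child_val, p in cpd_per_value[v].items():
            mixed[child_val] = mixed.get(child_val, 0.0) + (freq / total) * p
    return mixed

# Toy usage: parents split 2:1, so the mixture leans toward the majority's CPD
cpds = {"comedy": {"high": 0.7, "low": 0.3}, "drama": {"high": 0.4, "low": 0.6}}
print(stochastic_mode_cpd(["comedy", "comedy", "drama"], cpds))
# ≈ {'high': 0.6, 'low': 0.4}
```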

22
[Figure: PRM for the IMDB domain - Actor (Gender), Director, Movie (Genre, Year, MPAA Rating, Rating, Votes) and Role (Credit Order), with latent class variables C]
23
[Figure: PRM for the CORA domain - Author (latent class C) and Paper (Topic, Word 1, Word 2, ..., Word N), connected by Wrote and Cited relations]
24
Summary
  • There is growing interest in learning methods that
    exploit the relational structure of the domain
  • Here we construct clusters based on relational
    information
  • - since instances are not independent,
    information about some instances can be used to
    reach conclusions about others
  • The main problem is model selection when domain
    expertise is lacking; using techniques from
    Bayesian networks, the learning algorithm can
    select the model structure that best suits the
    data

25
Identity Uncertainty
  • Objects are not labeled with unique identifiers, or
    those identifiers are not perceived perfectly
  • Observations may or may not correspond to the same
    object
  • Citation Matching - which citations correspond to
    the same publication?
  • Relational Probability Model
  • Markov chain Monte Carlo method

26
Citation Matching
  • Infer the existence of a set of objects and their
    properties and relations, given a collection of
    raw perceptual data
  • When two observations describe the same object,
    they are combined to develop a complete description
    of the object
  • Objects seldom carry unique identifiers, so
    identity uncertainty is ubiquitous
  • Example - two citations of the same paper
  • Lashkari et al 94 Collaborative Interface
    Agents, Yezdi Lashkari, Max Metral, and Pattie
    Maes, Proceedings of the Twelfth national
    Conference On Artificial Inteligence, MIT Press,
    Cambridge, MA, 1994.
  • Metral M. Lashkari, Y. and P. Maes.
    Collaborative Interface Agents. In Conference of
    the American Asociation for Artificial
    Intelligence, Seattle, WA, August 1994.

27
Identity Uncertainity
  • Record linkage - matching up records in two
    files; might be required when merging two
    databases
  • The probability model developed by Cohen et al.
    models the database as a combination of some
    original records and some number of erroneous
    versions
  • Data association - assigning new observations to
    existing trajectories when multiple objects are
    tracked
  • CiteSeer is the best-known example
  • - the system groups citations by greedy
    agglomerative clustering based on text
    similarity (a sketch follows below)
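A hedged sketch of greedy agglomerative grouping by text similarity; the Jaccard measure and the 0.5 threshold are assumptions, not the actual CiteSeer heuristics.

```python
def jaccard(a, b):
    """Word-overlap similarity between two citation strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def greedy_group_citations(citations, threshold=0.5):
    """Greedily merge each citation into the first cluster whose representative
    is similar enough, otherwise start a new cluster."""
    clusters = []                       # each cluster is a list of citation strings
    for c in citations:
        for cluster in clusters:
            if jaccard(c, cluster[0]) >= threshold:
                cluster.append(c)
                break
        else:
            clusters.append([c])
    return clusters
```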

28
RPM
  • Consists of
  • - a set C of classes
  • - a set I of named instances
  • - a set A of complex attributes denoting
    functional relations
  • - a set B of simple attributes denoting functions
  • - a set of conditional probability models
    P(B | Pa(B)) for the simple attributes
  • Number uncertainty
  • Same-as statements
  • RPMs include the unique names assumption
  • The RPM can be expressed as a Bayesian network

29
RPMs for citations
Citation.obsAuthors[i].author same-as Citation.paper.authors[i]
30
Bayesian Network equivalent to the RPM
[Figure: the network for citation C1 - C1.text, C1.parse, C1.obsTitle, C1.(authors), P1.title, P1.pubtype, author nodes A11-A13 (fnames, surname) and observed author nodes D11-D13 (fnames, surname)]
31
Bayesian Network equivalent to the RPM
[Figure: the corresponding network for citation C2 - C2.text, C2.parse, C2.obsTitle, C2.(authors), P2.title, P2.pubtype, author nodes A21-A23 and observed author nodes D21-D23 (fnames, surname)]
32
Identity uncertainty
  • An assignment i is a mapping of terms in the
    language to objects in the world (the space of
    assignments is enumerated in the sketch below)
  • For example, if P1 and P2 are the only terms and
    they co-refer, then i is {{P1, P2}}
  • If P1 and P2 do not co-refer, then i is {{P1},
    {P2}}
  • The probability model for the space of extended
    possible worlds, P(i), can be rewritten as a
    product of P(i_C) over all the classes C ∈ C
  • Two cases are distinguished -
  • - for some classes, the unique names assumption
    remains appropriate, and P(i_Citation) is taken
    to be 1.0
  • - for classes such as Paper and Author,
    elements are subject to identity uncertainty
  • P(i_C | k, m) - the probability of an assignment
    that maps k named instances to m distinct objects
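To illustrate the space such an assignment ranges over, the sketch below enumerates every way a list of terms can be partitioned into co-referring groups (i.e. mapped onto distinct objects); the function name is invented.

```python
def assignments(terms):
    """Yield every partition of the terms into co-referring groups,
    i.e. every possible mapping of named instances to distinct objects."""
    if not terms:
        yield []
        return
    first, rest = terms[0], terms[1:]
    for partial in assignments(rest):
        # put `first` into an existing group (it co-refers with that object) ...
        for i in range(len(partial)):
            yield partial[:i] + [partial[i] + [first]] + partial[i + 1:]
        # ... or let it denote a new, distinct object
        yield [[first]] + partial

for a in assignments(["P1", "P2"]):
    print(a)        # [['P2', 'P1']] then [['P1'], ['P2']]
```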

33
Inference
  • The model grows from a single Bayesian network to
    a collection of networks, so inference is
    approximated by Markov chain Monte Carlo
  • Markov chain Monte Carlo - approximating an
    expectation over some distribution p (a generic
    sketch follows below)
  • Citation Matching Algorithm -
  • - uses a factored q function to propose, first,
    a change to i, and then values for all the
    hidden attributes of all the objects affected by
    that change
  • Scaling up - an acknowledged flaw of MCMC
    algorithms is that they often fail to scale, so an
    efficient algorithm is used to preprocess a large
    dataset and fragment it into many smaller,
    overlapping sets of elements that have a non-zero
    probability of matching
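A generic Metropolis-Hastings sketch for approximating an expectation over a distribution p; the target, proposal and sample counts below are toy assumptions and not the citation-matching sampler itself.

```python
import math
import random

def mcmc_expectation(log_p, propose, f, x0, samples=10000, burn_in=1000):
    """Approximate E_p[f(x)] with Metropolis-Hastings and a symmetric proposal."""
    x, total, kept = x0, 0.0, 0
    for t in range(samples + burn_in):
        y = propose(x)
        # accept the move with probability min(1, p(y)/p(x))
        if random.random() < math.exp(min(0.0, log_p(y) - log_p(x))):
            x = y
        if t >= burn_in:
            total += f(x)
            kept += 1
    return total / kept

# Toy usage: E[x^2] under a standard normal is approximately 1.0
est = mcmc_expectation(log_p=lambda x: -0.5 * x * x,
                       propose=lambda x: x + random.uniform(-1, 1),
                       f=lambda x: x * x, x0=0.0)
print(round(est, 1))
```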

34
Experimental Results
Results on four CiteSeer data sets, for the text
matching and MCMC algorithms
35
Reference
  • S. Slattery and M. Craven. Combining Statistical
    and Relational Methods for Learning in Hypertext
    Domains. In Proceedings of the 8th International
    Conference on Inductive Logic Programming.
    Springer-Verlag, 1998.
  • B. Taskar, E. Segal and D. Koller. Probabilistic
    Classification and Clustering in Relational Data.
    In Proceedings of IJCAI-01, 2001.
  • H. Pasula, B. Marthi, B. Milch, S. Russell, and
    I. Shpitser. Identity Uncertainty and Citation
    Matching. In Advances in Neural Information
    Processing Systems 15 (NIPS 2002). MIT Press,
    2003.