CMSC 828G: Introduction to Statistical Relational Learning (SRL) - PowerPoint PPT Presentation

Provided by: csUmdEdu5
Learn more at: http://www.cs.umd.edu

Transcript and Presenter's Notes


1
CMSC 828G: Introduction to Statistical Relational
Learning (SRL) & Link Analysis (LA)
  • January 28, 2005

2
Today's Outline
  • Brief Introduction to SRL
  • Student Introductions
  • Course Mechanics
  • Slightly Longer Introduction to SRL
  • SRL focus problem
  • Exercise: Create your own SRL focus problem
  • Discussion of SRL focus problems
  • Survey
  • Resources

3
Statistical Relational Learning
  • Traditional machine learning and data mining
    approaches assume:
      • A random sample of homogeneous objects from a
        single relation
  • Real-world data sets are:
      • Multi-relational, heterogeneous and
        semi-structured
  • SRL:
      • A newly emerging research area at the intersection
        of research in graphical models, social network
        and link analysis, hypertext and web mining,
        graph mining, relational learning and inductive
        logic programming

4
SRL Approaches
  • Combine logical/combinatorial structures with
    statistical/probabilistic models
  • Families of Approaches:
      • Entity-relation Models + Graphical Models
        (BNs/Markov Models)
      • First-Order Logic + Graphical Models
      • Functional Programming + Stochastic Execution

5
Sample Domains
  • web data (web)
  • bibliographic data (cite)
  • epidemiological data (epi)
  • communication data (comm)
  • customer networks (cust)
  • collaborative filtering problems (cf)
  • trust networks (trust)
  • biological data (bio)

6
Recent SRL Activities
  • Dagstuhl Workshop on Probabilistic, Logical and
    Relational Learning - Towards a Synthesis
    (1/30/05-2/04/05): http://www.dagstuhl.de/05051/
  • ICML 2004 workshop on Statistical Relational
    Learning and its Connections to Other
    Fields: http://www.cs.umd.edu/projects/srl2004/
  • IJCAI 2003 workshop on Statistical Relational
    Learning: http://kdl.cs.umass.edu/srl2003/
  • AAAI 2000 workshop on Statistical Relational
    Learning: http://robotics.stanford.edu/srl
  • Several related workshops
      • KDD MRDM workshops
      • http://www-ai.ijs.si/SasoDzeroski/MRDM2004/
      • http://www-ai.ijs.si/SasoDzeroski/MRDM2003/
      • http://www-ai.ijs.si/SasoDzeroski/MRDM2002/
  • Benjamin Taskar and I are working on an edited
    SRL collection, and ideally we will have access
    to draft chapters from this collection.

7
Other SRL Related Courses
  • Tom Dietterich's course at OSU:
    http://web.engr.oregonstate.edu/tgd/classes/539/
  • David Page, Mark Craven and Jude Shavlik at
    UWisc: http://www.biostat.wisc.edu/page/838.html
  • Pedro Domingos's course at UWash
  • Eric Mjolsness's course at UCI on Probabilistic
    Knowledge Representation:
    http://computableplant.ics.uci.edu/emj/classes/280_04/Syllabus20ICS2028020v2.doc
  • Stuart Russell's course at Berkeley on Knowledge
    Representation and Reasoning:
    http://www.cs.berkeley.edu/russell/classes/cs289/f04/
  • Joydeep Ghosh's course at UT Austin on Advanced
    Topics in Data Mining:
    http://www.lans.ece.utexas.edu/course/382v/05sp/
  • Michael Littman's course at Rutgers on Learned
    Representations in AI:
    http://www.cs.rutgers.edu/mlittman/courses/lightai03/
  • David Jensen and Andrew McCallum's course at UMass
    on Computational Social Network Analysis:
    http://kdl.cs.umass.edu/courses/csna/

8
Goals of this Course
  • NEW area
  • Understand Foundations
      • Tutorials on Graphical Models, Logic, ILP, etc.
  • Understand existing work
      • Wade through and make sense of the Alphabet Soup of
        approaches (PRMs, BLPs, SLPs, MLPs, RMNs, LBNs,
        etc.)
  • Understand interesting theoretical issues
      • Collective classification, Open World
        assumptions, etc.
  • Study interesting and practical applications of
    SRL
  • Do a significant (publishable) project in this
    area.

9
Course Mechanics
  • Course meets 10:00-12:45.
  • We will have a 15-minute break, typically
    11:15-11:30.
  • Class will consist of:
      • Tutorials
      • Exercises
      • Readings and Discussion
  • Course URL:
      • http://www.cs.umd.edu/class/spring2005/cmsc828g/
  • Course Wiki:
      • stay tuned.

10
Course Expectations
  • SRL Focus problem (15)
      • Each student will develop an SRL focus problem
        (10), due Feb. 11:
          • Describe a domain
          • Describe useful inference and learning tasks
          • (Ideally) Collect data
      • Each student will solve an SRL focus problem using
        at least two different SRL techniques (5)
  • Lead at least one class discussion (5)
      • Each student will sign up to lead the discussion
        of one (or more, depending on class size) class
        discussion topic.
  • Class Participation (15)
      • Each week, each student must turn in a short
        discussion of the readings by noon Thursday
        before class. The discussion leader should
        review the others' responses and use them to
        structure the class discussion.
  • Class Project (50)
      • Each student is expected to do a research project
        for the course.
      • Feb. 18, Project Proposals due
      • Mar. 18, Project Progress Report 1 due
      • Apr. 22, Project Progress Report 2 due
      • May 6, Project Presentations
      • May 12, Project Write-up due
  • Class Exercises (10)
      • Throughout the course, there will be small class
        exercises

11
Introductions
  • Name
  • Where you are originally from
  • Research Interest/Advisor if you have one

12
SRL Intro Part II
  • An Example: Probabilistic Relational Models

13
Bayesian Networks: Problem
  • Bayesian nets use propositional representation
  • Real world has objects, related to each other

[Figure: network fragment with nodes Intelligence, Difficulty and Grade (grade values A and C shown). Annotation: These instances are not independent.]

14
Probabilistic Relational Models
  • Combine advantages of relational logic & BNs:
      • Natural domain modeling: objects, properties,
        relations
      • Generalization over a variety of situations
      • Compact, natural probability models
  • Integrate uncertainty with relational model:
      • Properties of domain entities can depend on
        properties of related entities
      • Uncertainty over relational structure of domain
15
St. Nordaf University
[Figure: instance diagram with professors Prof. Smith and Prof. Jones, students George and Jane, links labeled Teaches, In-course and Registered, and attributes Grade and Satisfac.]

16
Relational Schema
  • Specifies types of objects in domain, attributes
    of each type of object, and types of relations
    between objects
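
A minimal Python sketch of such a schema, assuming a simple dataclass layout. The class and attribute names are taken from this slide; the value domains and the endpoints of each relation are illustrative assumptions, not something the deck specifies.

```python
from dataclasses import dataclass, field


@dataclass
class ClassSpec:
    """One object type in the schema, with its attributes and value domains."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class RelationSpec:
    """A typed relation between two object classes."""
    name: str
    source: str
    target: str


# Class and attribute names follow the slide; the value domains are assumed.
SCHEMA_CLASSES = {
    "Student": ClassSpec("Student", {"Intelligence": ["low", "high"]}),
    "Professor": ClassSpec("Professor", {"Teaching-Ability": ["low", "high"]}),
    "Course": ClassSpec("Course", {"Difficulty": ["easy", "hard"]}),
}

# Relation names follow the slide; the endpoints are an assumed reading.
SCHEMA_RELATIONS = [
    RelationSpec("Teach", "Professor", "Course"),
    RelationSpec("Take", "Student", "Course"),
]


def check_schema(classes, relations):
    """Return True if every relation endpoint names a declared class."""
    return all(r.source in classes and r.target in classes for r in relations)
```

Keeping classes and relations in separate tables mirrors the slide's split into object types, attributes and relation types.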

[Figure: schema diagram. Classes: Student, Professor, Course. Attributes: Intelligence, Teaching-Ability, Difficulty. Relations: Teach, Take, In.]

17
Representing the Distribution
  • Very large probability space for a given context:
      • All possible assignments of all attributes of all
        objects
  • Infinitely many potential contexts:
      • Each associated with a very different set of
        worlds

Need to represent an infinite set of complex
distributions

18
Probabilistic Relational Models
  • Universals: Probabilistic patterns hold for all
    objects in class
  • Locality: Represent direct probabilistic
    dependencies
  • Links define potential interactions

[Koller & Pfeffer; Poole; Ngo & Haddawy]

19
PRM Semantics
  • Instantiated PRM → BN
      • variables: attributes of all objects
      • dependencies: determined by links & PRM

[Figure: ground network over Prof. Smith, Prof. Jones, George and Jane]

20
The Web of Influence
[Figure: network annotated with the values easy / hard and low / high]

21
Reasoning with a PRM
  • Generic approach:
      • Instantiate PRM to produce ground BN
      • Use standard BN inference
  • In most cases, the resulting BN is too densely
    connected to allow exact inference
      • Use approximate inference: belief propagation
  • Improvement: use domain structure (objects &
    relations) to guide computation
      • Kikuchi approximation where clusters = objects
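
The "instantiate PRM to produce ground BN" step can be sketched as below. This is only an illustration of grounding under assumed data structures, not the procedure from any particular PRM system: the function name `ground_prm`, the `Registration` class and the relation names `of_student` and `in_course` are hypothetical.

```python
def ground_prm(objects, attributes, links, dependencies):
    """Instantiate class-level dependencies into a ground network.

    objects:      {object_id: class_name}
    attributes:   {class_name: [attribute, ...]}
    links:        [(relation, source_id, target_id), ...]
    dependencies: {(class_name, attribute): [(relation, parent_attribute), ...]}
    Returns {(object_id, attribute): [parent ground variables]}.
    """
    # One ground variable per attribute of every object.
    ground = {(obj, attr): []
              for obj, cls in objects.items()
              for attr in attributes[cls]}
    # Follow each link that matches a class-level dependency.
    for (child_cls, child_attr), parents in dependencies.items():
        for relation, parent_attr in parents:
            for rel, src, dst in links:
                if rel == relation and objects[src] == child_cls:
                    ground[(src, child_attr)].append((dst, parent_attr))
    return ground


# Hypothetical instance: two registrations in the same course.
objects = {"reg1": "Registration", "reg2": "Registration",
           "george": "Student", "jane": "Student", "cs101": "Course"}
attributes = {"Registration": ["Grade"], "Student": ["Intelligence"],
              "Course": ["Difficulty"]}
links = [("of_student", "reg1", "george"), ("in_course", "reg1", "cs101"),
         ("of_student", "reg2", "jane"), ("in_course", "reg2", "cs101")]
dependencies = {("Registration", "Grade"): [("of_student", "Intelligence"),
                                            ("in_course", "Difficulty")]}

ground = ground_prm(objects, attributes, links, dependencies)
```

Both Grade variables end up with the shared parent `("cs101", "Difficulty")`, which is why the instances are not independent and why the ground network grows densely connected as more objects are added.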

22
Data → Model → Objects

[Figure: pipeline diagram with labels Database, Learner, Probabilistic Model, Expert knowledge, Prob. Inference and Data for New Situation]

  • What are the objects in the new situation?
  • How are they related to each other?

[Friedman, Getoor, Koller & Pfeffer]

23
PRM Summary
  • PRMs inherit key advantages of probabilistic
    graphical models:
      • Coherent probabilistic semantics
      • Exploit structure of local interactions
  • Relational models are inherently more expressive
  • Web of influence: use multiple sources of
    information to reach conclusions
  • Exploit both relational information and power of
    probabilistic reasoning

24
SRL & Link Mining
  • General Issues

25
Linked Data
  • Heterogeneous, multi-relational data represented
    as a graph or network
  • Nodes are objects
  • May have different kinds of objects
  • Objects have attributes
  • Objects may have labels or classes
  • Edges are links
  • May have different kinds of links
  • Links may have attributes
  • Links may be directed, are not required to be
    binary
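
One way to hold such data in memory is sketched below, assuming a plain Python representation. The names `Node`, `Link` and `LinkedData`, and the example objects, are hypothetical; a link stores a tuple of members so that it is not restricted to two participants.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    node_id: str
    kind: str                      # object type, e.g. "paper" or "author"
    attributes: dict = field(default_factory=dict)
    label: Optional[str] = None    # class label, if known


@dataclass
class Link:
    kind: str                      # link type, e.g. "cites"
    members: tuple                 # participants in order; may hold more than two
    directed: bool = False
    attributes: dict = field(default_factory=dict)


class LinkedData:
    """Heterogeneous graph with typed nodes and typed, possibly n-ary links."""

    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_link(self, link):
        # Reject links that mention objects that were never added.
        missing = [m for m in link.members if m not in self.nodes]
        if missing:
            raise KeyError(f"unknown node ids: {missing}")
        self.links.append(link)

    def links_of(self, node_id, kind=None):
        """Return the links a node participates in, optionally filtered by type."""
        return [l for l in self.links
                if node_id in l.members and (kind is None or l.kind == kind)]


graph = LinkedData()
graph.add_node(Node("p1", "paper", {"year": 2004}, label="ML"))
graph.add_node(Node("p2", "paper"))
graph.add_node(Node("a1", "author"))
graph.add_node(Node("v1", "venue"))
graph.add_link(Link("cites", ("p1", "p2"), directed=True))
graph.add_link(Link("published", ("a1", "p1", "v1"), attributes={"role": "first author"}))
```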

26
Link Mining Tasks
  • Link-based Object Classification
  • Object Type Prediction
  • Link Type Prediction
  • Predicting Link Existence
  • Link Cardinality Estimation
  • Object Consolidation
  • Group Detection
  • Subgraph Discovery
  • Metadata Mining

27
Link-based Object Classification
  • Predicting the category of an object based on its
    attributes and its links and attributes of linked
    objects
  • web: Predict the category of a web page, based on
    words that occur on the page, links between
    pages, anchor text, html tags, etc.
  • cite: Predict the topic of a paper, based on word
    occurrence, citations, co-citations
  • epi: Predict disease type based on
    characteristics of the patients infected by the
    disease
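
As a rough illustration of using links for classification, the sketch below predicts a label by majority vote over the known labels of linked objects. This is a simple baseline chosen for the example, not a method named in the deck; the function name and the tie-breaking rule are assumptions.

```python
from collections import Counter


def classify_by_links(node, neighbors, known_labels, fallback):
    """Predict a label for `node` from the labels of its linked objects.

    neighbors:    {node_id: iterable of linked node ids}
    known_labels: {node_id: label} for already-labeled objects
    fallback:     label to return when no linked object is labeled,
                  e.g. the output of an attribute-only classifier
    """
    votes = Counter(known_labels[n]
                    for n in neighbors.get(node, ())
                    if n in known_labels)
    if not votes:
        return fallback
    # Break ties deterministically by taking the alphabetically first label.
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


neighbors = {"p4": ["p1", "p2", "p3"], "p5": ["p1", "p3"]}
known_labels = {"p1": "ML", "p2": "ML", "p3": "DB"}
prediction = classify_by_links("p4", neighbors, known_labels, fallback="unknown")
```

A fuller approach would combine this vote with the object's own attributes and iterate, since the labels of linked objects may themselves be predictions.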

28
Object Class Prediction
  • Predicting the type of an object based on its
    attributes and its links and attributes of linked
    objects
  • comm: Predict whether a communication contact is
    by email, phone call or mail.
  • cite: Predict the venue type of a publication
    (conference, journal, workshop)

29
Link Type Classification
  • Predicting type or purpose of link based on
    properties of the participating objects
  • web: predict advertising link or navigational
    link; predict an advisor-advisee relationship
  • epi: predicting whether contact is familial,
    co-worker or acquaintance

30
Predicting Link Existence
  • Predicting whether a link exists between two
    objects
  • web: predict whether there will be a link between
    two pages
  • cite: predicting whether a paper will cite
    another paper
  • epi: predicting who a patient's contacts are
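
A common structural heuristic for this task scores a candidate pair by the overlap of their neighborhoods. The sketch below uses the Jaccard coefficient; it is offered as one simple baseline, not as the approach taken in the course, and the function name and example graph are hypothetical.

```python
def jaccard_link_score(adjacency, a, b):
    """Score a candidate link (a, b) by the overlap of their neighbor sets.

    adjacency: {node_id: set of neighboring node ids}
    Returns a value in [0, 1]; higher suggests a more plausible link.
    """
    neighbors_a = set(adjacency.get(a, ()))
    neighbors_b = set(adjacency.get(b, ()))
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0  # no evidence either way
    return len(neighbors_a & neighbors_b) / len(union)


adjacency = {"a": {"x", "y"}, "b": {"y", "z"}, "c": set()}
score = jaccard_link_score(adjacency, "a", "b")  # one shared neighbor out of three
```

In practice the score would be thresholded or fed into a classifier alongside attribute features of the two objects.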

31
Link Cardinality Estimation I
  • Predicting the number of links to an object
  • web: predict the authoritativeness of a page
    based on the number of in-links; identifying hubs
    based on the number of out-links
  • cite: predicting the impact of a paper based on
    the number of citations
  • epi: predicting the number of people that will be
    infected based on the infectiousness of a disease.
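
The quantities being estimated here are in-link and out-link counts. A minimal sketch of computing them on observed data follows; the function name and the edge list are hypothetical, and a real estimator would predict these counts for unseen or future links rather than just tally them.

```python
from collections import Counter


def degree_counts(edges):
    """Count in-links and out-links per node from directed (source, target) pairs."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree


edges = [("a", "c"), ("b", "c"), ("c", "d")]
in_degree, out_degree = degree_counts(edges)
```

`Counter` returns 0 for nodes that never appear on a given side, so pages with no in-links need no special case.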

32
Link Cardinality Estimation II
  • Predicting the number of objects reached along a
    path from an object
  • Important for estimating the number of objects
    that will be returned by a query
  • web: predicting number of pages retrieved by
    crawling a site
  • cite: predicting the number of citations of a
    particular author in a specific journal
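
To make "objects reached along a path" concrete, the sketch below follows a sequence of relation types from a start object and counts the distinct objects at the end. It computes the exact count on known links, which is the quantity an estimator would try to predict without running the query. The function name, relation names and data are hypothetical.

```python
def count_reached(links, start, path):
    """Count distinct objects reached from `start` by following `path`.

    links: [(relation, source_id, target_id), ...]
    path:  sequence of relation names to follow in order
    """
    frontier = {start}
    for relation in path:
        # Step along every link of the required type that leaves the frontier.
        frontier = {dst for rel, src, dst in links
                    if rel == relation and src in frontier}
    return len(frontier)


links = [("wrote", "author1", "p1"), ("wrote", "author1", "p2"),
         ("cited_by", "p1", "p3"), ("cited_by", "p1", "p4"),
         ("cited_by", "p2", "p4")]
reached = count_reached(links, "author1", ["wrote", "cited_by"])
```

Here `p4` cites both papers but is counted once, which matches the number of distinct rows a query would return.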

33
Entity Resolution
  • Predicting when two objects are the same, based
    on their attributes and their links
  • aka record linkage, duplicate elimination,
    identity uncertainty
  • web: predict when two sites are mirrors of each
    other.
  • cite: predicting when two citations are referring
    to the same paper.
  • epi: predicting when two disease strains are the
    same
  • bio: learning when two names refer to the same
    protein
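
A small sketch of combining attribute and link evidence for this decision, assuming each reference carries a title string and a set of linked object ids. The equal weighting, the 0.6 threshold and all names are assumptions made for illustration; this is a pairwise baseline, not a collective method.

```python
def token_jaccard(text_a, text_b):
    """Jaccard overlap of lowercase word sets from two strings."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def same_entity(ref_a, ref_b, weight=0.5, threshold=0.6):
    """Decide whether two references denote the same object.

    Each reference is a dict with a "title" string and a "links" set of
    linked object ids (for a citation, e.g. its author ids).
    Returns (decision, combined_score).
    """
    attribute_sim = token_jaccard(ref_a["title"], ref_b["title"])
    links_a, links_b = set(ref_a["links"]), set(ref_b["links"])
    union = links_a | links_b
    link_sim = len(links_a & links_b) / len(union) if union else 0.0
    score = weight * attribute_sim + (1 - weight) * link_sim
    return score >= threshold, score


# Hypothetical references.
ref_a = {"title": "Relational Learning Survey", "links": {"author_a", "author_b"}}
ref_b = {"title": "relational learning survey", "links": {"author_a", "author_b", "author_c"}}
ref_c = {"title": "graph mining methods", "links": {"author_z"}}

match_ab, score_ab = same_entity(ref_a, ref_b)
match_ac, score_ac = same_entity(ref_a, ref_c)
```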

34
Group Detection
  • Predicting when a set of entities belong to the
    same group based on clustering both object
    attribute values and link structure
  • web: identifying communities
  • cite: identifying research communities

35
Subgraph Identification
  • Find characteristic subgraphs
  • Focus of graph-based data mining (Cook & Holder;
    Inokuchi, Washio & Motoda; Kuramochi & Karypis;
    Yan & Han)
  • bio: protein structure discovery
  • comm: legitimate vs. illegitimate groups
  • chem: chemical substructure discovery

36
Metadata Mining
  • Schema mapping, schema discovery, schema
    reformulation
  • cite: matching between two bibliographic sources
  • web: discovering schema from unstructured or
    semi-structured data
  • bio: mapping between two medical ontologies

37
Link Mining Tasks
  • Link-based Object Classification
  • Object Type Prediction
  • Link Type Prediction
  • Predicting Link Existence
  • Link Cardinality Estimation
  • Object Consolidation
  • Group Detection
  • Subgraph Discovery
  • Metadata Mining

38
SRL General Issues Summary
  • SRL Tasks
      • Link-based Object Classification
      • Object Type Prediction
      • Link Type Prediction
      • Predicting Link Existence
      • Link Cardinality Estimation
      • Object Consolidation
      • Group Detection
      • Subgraph Discovery
      • Metadata Mining
  • SRL Challenges
      • Logical vs. Statistical dependencies
      • Feature construction
      • Instances vs. Classes
      • Collective Classification
      • Collective Consolidation
      • Effective Use of Labeled & Unlabeled Data
      • Link Prediction
      • Closed vs. Open World

39
SRL Focus Problem 1
  • Citation Analysis

40
Domain
  • The first focus problem domain is bibliographic
    citation analysis. A large number of SRL
    researchers have worked with this domain. Some
    advantages of this domain are:
      • the availability of data (thanks largely to
        Andrew McCallum, William Cohen, Steve Lawrence
        and others),
      • the ease of understanding the domain,
      • our obvious inherent interest in the domain as
        academics, and
      • the potential high payoff and high visibility of SRL
        approaches if they can solve this problem.
  • Within this domain, some of the objects are
    papers, authors, affiliations, venues and so on.
  • Some of the links or relationships are
    citations, authorship, co-authorship and so on.
  • An interesting aspect of the problem is that one
    must deal with identity uncertainty: objects can
    be referenced in many ways, and an important task
    is entity resolution, that is, figuring out the
    underlying object domains and the mappings between
    references and objects.

41
SRL Tasks in FP 1
  • topic prediction: collective classification of
    the topics of papers.
  • author attribution: predicting the author of a
    paper. An issue is whether we assume a closed or
    open world for the authors. Plagiarism detection.
  • author-topic identification: discovering the
    topic areas for authors. This can be used, for
    example, to assign reviewers for papers.
  • entity resolution: collective clustering of the
    references to objects to determine the set of
    authors, papers and venues.
  • topic evolution: tracking change in topics over
    time.
  • group detection: finding collaboration networks.
  • citation counting/ranking: predicting number of
    citations or ranking based on predicted number of
    citations.
  • hidden object invention: analogous to hidden
    variable introduction, the introduction of a
    hidden object, such as an advisor, that relates
    two author instances.
  • predicate invention: from co-author information,
    affiliation information and perhaps information
    such as position and room location, invent an
    advisor predicate.

42
Data for FP 1
  • Many people have constructed data sets by
    crawling bibliography servers such as CiteSeer,
    ACM, DBLP and, soon one would imagine,
    Google Scholar.
  • Several years ago, Steve Lawrence made available a
    large collection of the CiteSeer data; it is
    available by contacting him.
  • Several versions of the Cora data set are
    available here:
    http://www.cs.umass.edu/mccallum/code-data.html
  • The recent 2003 KDD Cup challenge has data
    available from high energy physics:
    http://www.cs.cornell.edu/projects/kddcup/

43
Your Turn
  • Come up with an SRL focus problem
  • Define the schema, objects, links, etc.
  • Describe some SRL tasks in this domain
  • Think about where you could get the data

44
Survey

45
Next Time
  • Graphical Models Review
  • Led by Indrajit Bhattacharya
  • Readings available for pickup and in library.
    (Due to draft nature, they are not available on
    the web)