Learning to construct knowledge bases from the World Wide Web
1
Learning to construct knowledge bases from the
World Wide Web
  • Presented by Janneke van der Zwaan

2
Overview
  • Introduction
  • Overview of the Web→KB system
  • Learning tasks and algorithms
  • Conclusions

3
Introduction
  • Goal of the Web→KB system: automatically create a
    computer-understandable knowledge base (KB)
  • Knowledge is extracted from the WWW
  • Assertions are in symbolic, probabilistic form, e.g.
    employed_by(mark_craven, carnegie_mellon_univ)
    with probability 0.99 (see the sketch below)
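
In Python, such an assertion could be represented by a small
data structure like the following (a minimal sketch; the name
Assertion and its fields are ours, not from the paper):

    from dataclasses import dataclass

    @dataclass
    class Assertion:
        """One probabilistic KB assertion: relation(args) plus
        a confidence, as in the employed_by example above."""
        relation: str
        args: tuple
        probability: float

    # the example from this slide
    fact = Assertion("employed_by",
                     ("mark_craven", "carnegie_mellon_univ"),
                     0.99)
    print(fact.relation, fact.args, fact.probability)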

4
Introduction
  • The knowledge base can be used for
    – effective information retrieval for complex queries,
      e.g. finding all universities within 20 miles of
      Pittsburgh
    – knowledge-based inference and problem solving,
      e.g. a software travel agent

5
Overview of Web→KB
  • Web→KB is trained to extract the desired knowledge
  • It then browses new Web sites to extract assertions
    from hypertext

6
Learning task
  • Learn a hypertext classifier
  • Input
    – an ontology (a specification of the classes and
      relations of interest)
    – labeled training instances
  • Output
    – general procedures for extracting new instances of
      those classes and relations from the Web

7
Example ontology
8
(figure-only slide; no transcript available)
9
Assumptions about the types of knowledge
  • Instances of classes are segments of hypertext:
    – a single Web page
    – a rooted graph of several Web pages
    – a contiguous string within a Web page
  • Instances of relations are represented by:
    – an undirected path of hyperlinks between pages
    – a segment of text contained within another text
      segment
    – a hypertext segment that satisfies some learned
      model of relatedness

10
Primary learning tasks
  • Recognizing class instances
  • Recognizing relation instances
  • Recognizing class and relation instances by
    extracting small fields of text

11
Experimental testbed
  • All experiments are based on an ontology for computer
    science departments
  • A set of pages and hyperlinks from the computer
    science departments of 4 universities
  • Labeled pages and relation instances
  • A set of pages labeled as "other"
  • Four-fold cross-validation: train on 3 universities,
    test on the 4th

12
Performance measures
  • Coverage: the percentage of pages of a class that are
    correctly classified
  • Accuracy: the percentage of pages classified into a
    given class that are actually members of that class
    (see the sketch below)
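
In modern terms, coverage is recall and accuracy is precision.
A minimal sketch of the two measures, assuming parallel lists
of true and predicted labels (the function names are ours):

    def coverage(true_labels, predicted, cls):
        """Recall for class `cls`: fraction of its true members
        that the classifier also assigned to `cls`."""
        members = sum(t == cls for t in true_labels)
        hits = sum(t == cls and p == cls
                   for t, p in zip(true_labels, predicted))
        return hits / members if members else 0.0

    def accuracy(true_labels, predicted, cls):
        """Precision for class `cls`: fraction of pages assigned
        to `cls` that truly belong to it."""
        assigned = sum(p == cls for p in predicted)
        hits = sum(t == cls and p == cls
                   for t, p in zip(true_labels, predicted))
        return hits / assigned if assigned else 0.0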

13
Learning to recognize class instances
  • Assumption: class instances are represented by
    Web pages
  • Statistical bag-of-words approach (Naïve Bayes)
  • First-order rules
  • A combination of classifiers

14
Statistical bag-of-words approach
  • Three independent classifiers based on different
    representations of a page:
    – full text (baseline)
    – title/heading
    – hyperlink

15
Statistical bag-of-words approach
  • Naïve Bayes classifier with minor modifications based
    on Kullback-Leibler divergence
  • Witten-Bell smoothing of the word probabilities
  • Vocabulary restricted to 2000 words (see the sketch
    below)
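
A minimal sketch of such a bag-of-words classifier, assuming
uniform class priors and a pre-selected vocabulary; the
KL-divergence scoring modification is omitted and plain
log-likelihood is used (the class name and the exact
Witten-Bell variant are ours):

    import math
    from collections import Counter

    class NaiveBayesText:
        def __init__(self, vocabulary):
            self.vocab = set(vocabulary)   # e.g. 2000 words
            self.counts = {}               # class -> word Counter
            self.totals = {}               # class -> token count

        def train(self, docs_by_class):
            # assumes every class has some training text
            for cls, docs in docs_by_class.items():
                c = Counter(w for d in docs for w in d
                            if w in self.vocab)
                self.counts[cls] = c
                self.totals[cls] = sum(c.values())

        def word_prob(self, word, cls):
            """Witten-Bell smoothing: reserve probability mass
            proportional to the number of distinct seen words."""
            c, n = self.counts[cls], self.totals[cls]
            t = len(c)                       # distinct seen words
            z = max(len(self.vocab) - t, 1)  # unseen vocab words
            if c[word] > 0:
                return c[word] / (n + t)
            return t / (z * (n + t))

        def classify(self, doc):
            words = [w for w in doc if w in self.vocab]
            return max(self.counts, key=lambda cls: sum(
                math.log(self.word_prob(w, cls)) for w in words))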

16
Results
(figure: coverage/accuracy results for the full-text,
title/heading, and hyperlink classifiers; data not in the
transcript)
17
First-order text classification
  • Takes the hyperlink structure into account
  • Based on Quinlan's FOIL algorithm
  • A greedy covering algorithm for learning Horn
    clauses
  • Literals are added to a clause until it covers
    (almost) only positive instances (see the sketch
    below)
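
A schematic sketch of the covering loop, assuming a caller
supplies the candidate literals and a covers(rule, instance)
test (all names are ours; the real FOIL uses an
information-theoretic gain, simplified here):

    def gain(rule, literal, pos, neg, covers):
        # simplified score: positives kept minus negatives kept
        new_rule = rule + [literal]
        return (sum(covers(new_rule, x) for x in pos)
                - sum(covers(new_rule, x) for x in neg))

    def foil(positives, negatives, literals, covers):
        rules, remaining = [], list(positives)
        while remaining:
            rule, neg = [], list(negatives)
            # grow one clause until it excludes all negatives
            while neg:
                best = max(literals, key=lambda l:
                           gain(rule, l, remaining, neg, covers))
                rule.append(best)
                still = [x for x in neg if covers(rule, x)]
                if len(still) == len(neg):
                    break                 # no literal helps; stop
                neg = still
            covered = [x for x in remaining if covers(rule, x)]
            if not covered:
                break                     # useless clause; stop
            rules.append(rule)
            # drop the positives this clause already explains
            remaining = [x for x in remaining
                         if not covers(rule, x)]
        return rules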

18
First-order text classification
  • Representation of Web pages and hyperlinks as
    predicates
    – has_word(Page), one predicate per vocabulary word
    – link_to(Page, Page)
  • Rules are scored with an m-estimate of their accuracy
    (see the sketch below)
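
The m-estimate smooths a rule's observed accuracy toward the
class prior. A minimal sketch (the value of m used by Web→KB
is not given on the slide; m=2 below is illustrative):

    def m_estimate(p, n, prior, m=2.0):
        """Smoothed rule accuracy.
        p, n  -- positives/negatives covered by the rule
        prior -- prior probability of the positive class
        m     -- smoothing weight"""
        return (p + m * prior) / (p + n + m)

    # a rule covering 10 positives and 1 negative, with a 5%
    # positive-class prior:
    print(m_estimate(10, 1, prior=0.05))  # ~0.78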

19
First-order text classification example rules
20
First-order text classification results
21
Combination of classifiers
  • Predict the class chosen by a majority of the
    classifiers
  • If there is no majority, select the class predicted
    with the highest confidence
  • The confidence measures must be made comparable
    across classifiers (see the sketch below)
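
A minimal sketch of this voting scheme, assuming each
classifier returns a (class, confidence) pair and that the
confidences have already been made comparable:

    from collections import Counter

    def combine(predictions):
        votes = Counter(cls for cls, _ in predictions)
        top_class, top_votes = votes.most_common(1)[0]
        if top_votes > len(predictions) / 2:
            return top_class              # strict majority wins
        # otherwise fall back on the most confident prediction
        return max(predictions, key=lambda pc: pc[1])[0]

    # e.g. the three classifiers disagree:
    print(combine([("student", 0.6), ("faculty", 0.8),
                   ("course", 0.3)]))     # -> faculty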

22
Combination of classifiers results
  • Not uniformly better than the constituent classifiers
  • Attributed to the simple method of combining the
    classifiers
  • Performs better with a small vocabulary

23
Learning to recognize relation instances
  • Relations hold between instances of classes
  • They are often represented by hyperlink paths
  • E.g. instructors_of(A, B) :- course(A),
    person(B), link_to(A, B)

24
Learning to recognize relation instances
  • Rule learning consists of two phases:
    – learn a path of hyperlinks connecting the two pages
    – add literals describing the pages along that path
      (a path-finding sketch follows)
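
Phase one amounts to finding hyperlink paths between page
pairs. A minimal breadth-first sketch, assuming `links` maps
each page to the set of pages it links to (names are ours):

    from collections import deque

    def hyperlink_path(start, goal, links):
        """Shortest undirected hyperlink path between pages."""
        adj = {}
        for src, dsts in links.items():  # undirected graph
            for dst in dsts:
                adj.setdefault(src, set()).add(dst)
                adj.setdefault(dst, set()).add(src)
        parent, queue = {start: None}, deque([start])
        while queue:
            page = queue.popleft()
            if page == goal:             # walk back to get path
                path = []
                while page is not None:
                    path.append(page)
                    page = parent[page]
                return path[::-1]
            for nxt in adj.get(page, ()):
                if nxt not in parent:
                    parent[nxt] = page
                    queue.append(nxt)
        return None                      # no path exists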

25
Example rules
26
Results
27
Learning to extract text fields
  • An instance is sometimes represented by a text field
    rather than an entire Web page
  • Information extraction (IE) with Sequence Rules with
    Validation (SRV)

28
Sequence Rules with Validation
  • Input
    – a set of pages with labeled text fields
    – a set of features defined over tokens
  • Output
    – a set of information extraction rules
  • Rules are grown literal by literal; two literal types
    (sketched in code below):
    – length(Relop, N), with Relop one of <, >, =
    – some(Var, Path, Feat, Value)
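
A minimal sketch of how these two literal types could be
tested against a candidate token field (the representation of
tokens and features is ours, not SRV's actual one):

    def length_literal(tokens, relop, n):
        """length(Relop, N): constrain the field's token count."""
        return {"<": len(tokens) < n,
                ">": len(tokens) > n,
                "=": len(tokens) == n}[relop]

    def some_literal(tokens, feat, value, features):
        """Simplified some(Var, Path, Feat, Value) with an empty
        path: does any token in the field have the feature?"""
        return any(features.get((t, feat)) == value
                   for t in tokens)

    # e.g. a two-token candidate field:
    field = ["Machine", "Learning"]
    feats = {(t, "capitalized"): t[0].isupper() for t in field}
    print(length_literal(field, "<", 3))                    # True
    print(some_literal(field, "capitalized", True, feats))  # True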

29
SRV Example
30
Results
31
Conclusions
  • Web→KB works reasonably well
    – accuracy > 70% at coverage levels of 30%
  • But:
    – What level of accuracy can be achieved?
    – What level of accuracy is required?
    – A lot of hand-labeled Web pages are needed for
      training
    – How much effort will be required to train the
      system for each new or extended ontology?

32
Questions?