Title: Learning to construct knowledge bases from the World Wide Web
1Learning to construct knowledge bases from the World Wide Web
- Presented by Janneke van der Zwaan
2Overview
- Introduction
- Overview of the Web→KB system
- Learning tasks and algorithms
- Conclusions
3Introduction
- Goal of the Web→KB system
- Automatically create a computer-understandable knowledge base (KB)
- Knowledge is extracted from the WWW
- Assertions are in symbolic, probabilistic form, e.g. employed_by(mark_craven, carnegie_mellon_univ), probability = 0.99
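As an illustration (not the system's actual internals), such a probabilistic assertion might be represented as a simple record type; the field names below are hypothetical:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Assertion:
        relation: str        # e.g. "employed_by"
        arguments: tuple     # e.g. ("mark_craven", "carnegie_mellon_univ")
        probability: float   # extraction confidence, e.g. 0.99

    # The example assertion from this slide:
    fact = Assertion("employed_by", ("mark_craven", "carnegie_mellon_univ"), 0.99)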
4Introduction
- The knowledge base can be used for
- Effective information retrieval for complex queries
- E.g. find all universities within 20 miles of Pittsburgh
- Knowledge-based inference and problem solving
- E.g. a software travel agent
5Overview of Web→KB
- Web→KB is trained to extract the desired knowledge
- It then browses new Web sites to extract assertions from hypertext
6Learning task
- Hypertext classifier
- Input
- Ontology (specification of the classes and relations of interest)
- Labeled training instances
- Output
- General procedures for extracting new instances (classes and relations) from the Web
7Example ontology
9Assumptions about the types of knowledge
- Instances of classes are segments of hypertext
- Single Web page
- Rooted graph of several Web pages
- Contiguous string within a Web page
- Instances of relations
- Undirected path of hyperlinks between pages
- Segment of text contained within another text segment
- Hypertext segment satisfies some learned model for relatedness
10Primary learning tasks
- Recognizing class instances
- Recognizing relation instances
- Recognizing class and relation instances by extracting small fields of text
11Experimental testbed
- All experiments are based on an ontology for computer science departments
- Set of pages and hyperlinks from the computer science departments of 4 universities
- Labeled pages and relation instances
- Set of pages labeled as other
- Four-fold cross-validation
- Train on 3 universities and test on the remaining 1
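A minimal sketch of this leave-one-university-out protocol; `universities`, `train`, and `evaluate` are hypothetical stand-ins for the labeled data and whichever classifier is being tested:

    def leave_one_university_out(universities, train, evaluate):
        # `universities` maps each university name to its labeled pages.
        scores = {}
        for held_out in universities:
            train_pages = [page for name, pages in universities.items()
                           if name != held_out for page in pages]
            model = train(train_pages)
            scores[held_out] = evaluate(model, universities[held_out])
        return scores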
12Performance measures
- Coverage
- Percentage of pages of a class that are correctly classified
- Accuracy
- Percentage of pages classified into a given class that are actually members of that class
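These two measures correspond to recall and precision. A minimal sketch of computing them from parallel lists of predicted and true class names (function and parameter names are hypothetical):

    def coverage_and_accuracy(predictions, labels, cls):
        predicted = sum(1 for p in predictions if p == cls)
        actual = sum(1 for y in labels if y == cls)
        correct = sum(1 for p, y in zip(predictions, labels) if p == y == cls)
        coverage = correct / actual if actual else 0.0        # recall
        accuracy = correct / predicted if predicted else 0.0  # precision
        return coverage, accuracy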
13Learning to recognize class instances
- Assumption: class instances are represented by Web pages
- Statistical bag-of-words approach (Naïve Bayes)
- First-order rules
- Combination of classifiers
14Statistical bag-of-words approach
- Three independent classifiers based on different representations
- Full-text (baseline)
- Title/Heading
- Hyperlink
15Statistical bag-of-words approach
- Naïve Bayes classifier with minor modifications based on Kullback-Leibler divergence
- Witten-Bell smoothing
- Restricted vocabulary of 2000 words
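A minimal sketch of a bag-of-words Naïve Bayes classifier in this spirit. Note the actual system uses Witten-Bell smoothing and a KL-divergence-selected vocabulary; this sketch substitutes plain add-one (Laplace) smoothing for brevity:

    import math
    from collections import Counter, defaultdict

    class NaiveBayesText:
        def fit(self, docs, labels, vocab):
            self.vocab = set(vocab)                  # e.g. the 2000-word vocabulary
            self.word_counts = defaultdict(Counter)  # per-class word frequencies
            self.class_totals = Counter()            # in-vocabulary tokens per class
            counts = Counter(labels)
            for doc, cls in zip(docs, labels):
                for word in doc:
                    if word in self.vocab:
                        self.word_counts[cls][word] += 1
                        self.class_totals[cls] += 1
            self.priors = {cls: n / len(labels) for cls, n in counts.items()}
            return self

        def predict(self, doc):
            def log_posterior(cls):
                score = math.log(self.priors[cls])
                denom = self.class_totals[cls] + len(self.vocab)  # add-one smoothing
                for word in doc:
                    if word in self.vocab:
                        score += math.log((self.word_counts[cls][word] + 1) / denom)
                return score
            return max(self.priors, key=log_posterior)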
16Results
Full-text
Title/Heading
Hyperlink
17First-order text classification
- Takes hyperlink structure into account
- Based on Quinlan's FOIL algorithm
- Greedy covering algorithm for learning Horn clauses
- Literals are added to a rule until the clause no longer covers any negative instances
18First-order text classification
- Representation of Web pages and hyperlinks
- has_word(Page) (one such predicate per vocabulary word)
- link_to(Page, Page)
- m-estimate of rule accuracy: (n_c + m·p) / (n + m), where the rule covers n instances of which n_c are correct, p is a prior estimate of accuracy, and m weights the prior
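The greedy covering loop might be sketched as follows; `candidate_literals` and `gain` are hypothetical stand-ins for FOIL's literal generator and information-gain heuristic, and termination safeguards are elided:

    def learn_horn_rules(positives, negatives, candidate_literals, gain):
        # Literals are boolean functions over instances, e.g. has_word
        # or link_to tests over pages.
        rules = []
        uncovered = set(positives)
        while uncovered:                      # outer loop: cover every positive
            rule, pos, neg = [], set(uncovered), set(negatives)
            while neg:                        # inner loop: specialize until no
                best = max(candidate_literals(rule),      # negatives remain
                           key=lambda lit: gain(lit, pos, neg))
                rule.append(best)
                pos = {x for x in pos if best(x)}
                neg = {x for x in neg if best(x)}
            rules.append(rule)
            uncovered -= pos
        return rules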
19First-order text classification example rules
20First-order text classification results
21Combination of classifiers
- Predict the class chosen by a majority of the classifiers
- If there is no majority, select the class with the highest confidence
- Confidence measures are rescaled to make them comparable
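A minimal sketch of this voting scheme, assuming confidences have already been rescaled to a comparable range:

    from collections import Counter

    def combine(predictions):
        # `predictions`: one (class, confidence) pair per constituent classifier.
        votes = Counter(cls for cls, _ in predictions)
        top_class, top_votes = votes.most_common(1)[0]
        if top_votes > len(predictions) / 2:            # strict majority wins
            return top_class
        return max(predictions, key=lambda p: p[1])[0]  # else most confident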
22Combination of classifiers results
- Not uniformly better than constituent classifiers
- Due to simple method of combining classifiers
- Better performance with small vocabulary
23Learning to recognize relation instances
- Relations between instances of classes
- Often represented by hyperlink paths
- E.g. instructors_of(A, B) :- course(A), person(B), link_to(A, B)
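Applying the example rule above amounts to a join over the page classifications and the hyperlink graph. A minimal sketch, where `classes` and `links` are hypothetical stand-ins for Web→KB's internal representation:

    def instructors_of(classes, links):
        # `classes` maps page -> predicted class; `links` maps page -> set
        # of pages it links to.
        return [(a, b)
                for a, targets in links.items() if classes.get(a) == "course"
                for b in targets if classes.get(b) == "person"]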
24Learning to recognize relation instances
- Rule learning consists of two phases
- First, learn a path of hyperlinks connecting the pages
- Then, add literals describing the pages along the path
25Example rules
26Results
27Learning to extract text fields
- An instance is sometimes represented by a text field instead of an entire Web page
- Information Extraction (IE) with Sequence Rules with Validation (SRV)
28Sequence Rules with Validation
- Input
- Set of pages with labeled text fields
- Set of features defined over tokens
- Output
- Set of information extraction rules
- Rules are grown literal by literal
- length(Relop, N), where Relop is one of <, >, =
- some(Var, Path, Feat, Value)
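Illustrative (not SRV's actual) implementations of these two predicate forms over a candidate fragment of tokens; the `some` sketch assumes an empty relational path for simplicity:

    def length(relop, n, fragment):
        # length(Relop, N): compare the fragment's token count to N.
        ops = {"<": lambda a, b: a < b,
               ">": lambda a, b: a > b,
               "=": lambda a, b: a == b}
        return ops[relop](len(fragment), n)

    def some(fragment, token_features, feat, value):
        # some(Var, Path, Feat, Value) with an empty path: does some token
        # in the fragment have feature Feat equal to Value?
        # `token_features(token)` returns a feature dict (hypothetical).
        return any(token_features(tok).get(feat) == value for tok in fragment)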
29SRV Example
30Results
31Conclusions
- Web→KB works reasonably well
- Accuracy > 70% at coverage levels of 30%
- But
- What level of accuracy can be achieved?
- What level of accuracy is required?
- A lot of hand-labeled Web pages are needed for training
- How much effort will be required to train the system for each new or extended ontology?
32Questions?