Extracting Symbolic Knowledge From The Web

1
Extracting Symbolic Knowledge From The Web

Ofer Neiman

2
The Problem

The WWW contains information that humans understand easily but computers do not.
Web pages contain mostly text, images, and sounds.
Humans can process these immediately, but the information is not necessarily arranged in the best way for automatic problem solving by computers.

3
The Web->KB Project's Long-Range Goal

Create a computer-understandable knowledge base whose content mirrors that of the WWW.
This knowledge base would consist of assertions in symbolic form.

4
Web->KB Goal (ctd.)

At the minimum, the KB would allow more
sophisticated queries than keyword-based search
engines.

5
The 1st Step: A Machine Learning Approach

Develop a trainable system that can be taught to extract information.
The inputs to the system:
1) An ontology specifying classes in hierarchical tree form and relations between these classes.
Class examples: Student, Person, Research.Project
Relation examples: Advisor.Of(Instructor,Student), Projects.Led.By(Project,Researcher)
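As a rough illustration of the first input, the ontology can be sketched as a parent map plus a table of typed relations. This is only a sketch: the `Ontology` class, its field names, and the placement of Instructor and Researcher under Person are assumptions made for the example, not something the slides specify.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # child class -> parent class (None marks a root class)
    parents: dict = field(default_factory=dict)
    # relation name -> (class of first argument, class of second argument)
    relations: dict = field(default_factory=dict)

    def ancestors(self, cls):
        """Return the ancestors of cls, nearest first."""
        chain = []
        parent = self.parents.get(cls)
        while parent is not None:
            chain.append(parent)
            parent = self.parents.get(parent)
        return chain


# Hypothetical hierarchy: only the class and relation names come from the slide.
ontology = Ontology(
    parents={
        "Person": None,
        "Student": "Person",
        "Instructor": "Person",
        "Researcher": "Person",
        "Research.Project": None,
    },
    relations={
        "Advisor.Of": ("Instructor", "Student"),
        "Projects.Led.By": ("Research.Project", "Researcher"),
    },
)

print(ontology.ancestors("Student"))
print(ontology.relations["Advisor.Of"])
```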

6
A Machine Learning Approach (ctd.)

2) Training examples that represent instances of the classes and relations.
Given the ontology and the training examples, the system is expected to extract new instances of classes and relations from the web.

7
(No Transcript)
8
Some Simplifying Assumptions

1) Instances of ontology classes are represented by single web pages (home pages).
2) Instances of relations R(A,B) are represented in one of 3 ways:
- a segment of text connecting the segment representing A to the segment representing B.
- a contiguous segment of text representing A that contains a segment representing B.

9
Assumptions (ctd.)

Example: Jim's home page contains the words "Intro To AI", so courses.taught.by(Jim,Intro2AI) holds.
- The segment representing A is related to B because it fits some pattern.
Example: Jim's page contains words typical of AI researchers, so the relation Research.Of(Jim,AI) holds.

10
The Goal In More Specific Terms

1. Recognizing class instances by classifying
segments of hypertext
2. Recognizing relation instances, mostly by
classifying hyperlinks

11
A KB for CS Departments

Two sets of pages were drawn from CS departments, each with more than 4000 pages.
The 1st set was drawn from four departments, and the 2nd from many departments.
The class and relation instances were hand-labeled.
The idea is to learn instances from one department using the data from the other departments as training input.

12
1st Method For Recognizing Class Instances- A
Statistical Model

From the labeled data, each class A is assigned a vector of probabilities W(A):
$W(A) = (W_1, \ldots, W_{|\text{vocabulary}|})$
where $W_i$ is the frequency of word i in pages representing class A.
The working vocabulary is finite (2000 key words).
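A minimal sketch of how such a vector might be estimated from labeled pages follows. The function name `class_word_vector`, the whitespace tokenization, and the add-one smoothing are assumptions made for the example; the slides do not say how the frequencies were computed.

```python
from collections import Counter


def class_word_vector(pages, vocabulary):
    """Estimate W(A): relative frequency of each vocabulary word
    over the labeled pages of one class."""
    counts = Counter()
    for text in pages:
        # Naive tokenization (assumption): lowercase and split on whitespace.
        counts.update(w for w in text.lower().split() if w in vocabulary)
    total = sum(counts.values())
    # Add-one smoothing (assumption) so that no word gets probability zero.
    return {w: (counts[w] + 1) / (total + len(vocabulary)) for w in vocabulary}


vocabulary = {"course", "homework", "advisor"}
course_pages = ["Course homework course", "homework"]
w_course = class_word_vector(course_pages, vocabulary)
print(w_course)
```

The resulting values sum to 1, so the vector can be treated as a probability distribution over the working vocabulary.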

13
Recognizing Class Instances- A Statistical
Model (ctd.)

A new page P is assigned to the class that is most probable given the distribution of words in P (the class with minimal distance).
The calculation uses a variant of the KL divergence, which measures the distance between distributions:
$D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
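The minimal-distance rule can be sketched with the plain KL divergence. Note that the slides mention a variant of KL that is not spelled out here, so the standard formula is swapped in; the toy distributions and the names `kl_divergence` and `classify` are assumptions for the example.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x)).
    Terms with P(x) = 0 contribute nothing; Q(x) must be positive elsewhere."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def classify(page_dist, class_dists):
    """Return the class whose word distribution is closest to the page's."""
    return min(class_dists, key=lambda c: kl_divergence(page_dist, class_dists[c]))


# Toy class distributions over a three-word vocabulary (made-up numbers).
class_dists = {
    "student": {"advisor": 0.6, "homework": 0.3, "syllabus": 0.1},
    "course": {"advisor": 0.1, "homework": 0.4, "syllabus": 0.5},
}
page = {"advisor": 0.5, "homework": 0.5, "syllabus": 0.0}
print(classify(page, class_dists))
```

Smoothing the class distributions matters here: if Q(x) were zero for a word that appears on the page, the divergence would be undefined.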

14
A Statistical Model (ctd.)

The results are evaluated according to two criteria:
1) Coverage - the percentage of pages from a given class that are correctly classified as belonging to that class.
2) Accuracy - the percentage of pages classified to a class that are actually members of that class.
There's a natural trade-off between the two.
We can raise coverage at the cost of accuracy (or vice versa) by setting a confidence threshold: page P is classified to class A only if the distribution in P is sufficiently similar to the distribution in A.
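The two measures and the effect of the threshold can be sketched as follows. The function name, the tuple layout, and the toy confidence scores are assumptions for illustration only.

```python
def coverage_and_accuracy(scored_pages, target, threshold):
    """scored_pages: list of (true_class, predicted_class, confidence).
    A page counts as classified to `target` only if its confidence
    reaches the threshold."""
    accepted = [t for t, p, conf in scored_pages if p == target and conf >= threshold]
    correct = sum(1 for t in accepted if t == target)
    members = sum(1 for t, _, _ in scored_pages if t == target)
    coverage = correct / members if members else 0.0
    accuracy = correct / len(accepted) if accepted else 0.0
    return coverage, accuracy


# Made-up predictions for five pages.
scored = [
    ("student", "student", 0.9),
    ("student", "student", 0.6),
    ("course", "student", 0.5),
    ("student", "course", 0.8),
    ("student", "student", 0.3),
]
strict = coverage_and_accuracy(scored, "student", threshold=0.8)
loose = coverage_and_accuracy(scored, "student", threshold=0.4)
print(strict, loose)
```

On this toy data the strict threshold gives lower coverage and higher accuracy than the loose one, which is the trade-off described above.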

15
A Statistical Model (ctd.)

Mistaken classifications: since the ontology is hierarchical (Person -> Faculty or Staff or Student), classifying an instance into a more general ancestor class can still be useful.

16
Some Experimental Results

The Student class:
Coverage 20%, Accuracy 67%
Coverage 80%, Accuracy 45%
The Course class:
Coverage 20%, Accuracy 55%
Coverage 80%, Accuracy 30%

17
2nd Method For Recognizing Class Instances
Learning First Order Rules

Instead of looking only at the word pattern on a given page, we also look at the word pattern in neighboring pages.
Example: A page is a course home page if it contains the words "textbook" and "TA", and is linked to a page that contains the word "assignment".

18
Recognizing Class Instances Learning First
Order Rules ( ctd .)

We need an algorithm that infers rules in predicate form, where the arguments are pages.
A rule defines a target class using basic (atomic) predicates of 2 kinds:
has_word(Page) - a finite family of predicates, where "word" can be any word, indicating that the page contains that word.
link_to(Page1,Page2) - there is a hyperlink from Page1 to Page2.

19
Recognizing Class Instances Learning First
Order Rules (ctd.)

The input: positive instances of the basic predicates, and positive and negative instances of the target class.
The output: rules in terms of the basic predicates (previous slide).
The algorithm (FOIL) uses a greedy method: at each stage, add to the current definition the basic predicate that excludes as many negative examples as possible from the instances still unaccounted for, while including as many positive examples as possible.
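The greedy step can be illustrated with a much-simplified sketch. This is not FOIL itself: FOIL scores candidate literals with an information-based gain and handles variables and link_to literals, whereas the sketch below swaps in a plain count difference and considers only has_word literals. The function name `learn_rule` and the toy data are assumptions.

```python
def learn_rule(positives, negatives, candidate_words):
    """Greedily build a conjunction of has_<word> literals.
    positives / negatives: lists of word sets, one set per page."""
    rule = []
    pos, neg = list(positives), list(negatives)
    while neg:  # stop once no negative example satisfies the rule
        best, best_score = None, None
        for w in candidate_words:
            if w in rule:
                continue
            kept_pos = sum(1 for p in pos if w in p)
            kept_neg = sum(1 for n in neg if w in n)
            # Skip literals that lose every positive or exclude no negative.
            if kept_pos == 0 or kept_neg == len(neg):
                continue
            score = kept_pos - kept_neg  # simple stand-in for FOIL's gain
            if best_score is None or score > best_score:
                best, best_score = w, score
        if best is None:
            break  # no literal helps any further
        rule.append(best)
        pos = [p for p in pos if best in p]
        neg = [n for n in neg if best in n]
    return rule


positives = [{"professor", "ph", "research"}, {"professor", "ph", "teaching"}]
negatives = [{"professor", "student"}, {"ph", "student"}, {"research", "course"}]
candidates = ["professor", "ph", "research", "student"]
learned = learn_rule(positives, negatives, candidates)
print(learned)
```

Each accepted literal strictly shrinks the set of negatives still covered, so the loop always terminates.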

20
Recognizing Class Instances Learning First
Order Rules (ctd.)

Example: The algorithm found the following definition for the class faculty:
faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculty(B)
Meaning: A page belongs to a faculty member if it contains the words "professor" and "ph" (prefix of "phd"), and there is a link to it from a page containing the word "faculty".
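To make the reading of this rule concrete, it can be checked against a small hand-made page graph. The `pages` dictionary and the page names are hypothetical, and `has_word(word, page)` takes the word as an argument rather than using one predicate per word as the slides do.

```python
# Hypothetical toy web: page name -> words on the page and outgoing links.
pages = {
    "dept": {"words": {"faculty", "staff"}, "links": {"jane", "bob"}},
    "jane": {"words": {"professor", "ph", "research"}, "links": set()},
    "bob": {"words": {"student", "ph"}, "links": set()},
}


def has_word(word, page):
    """True if the page contains the word."""
    return word in pages[page]["words"]


def link_to(src, dst):
    """True if there is a hyperlink from src to dst."""
    return dst in pages[src]["links"]


def faculty(a):
    """faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculty(B)."""
    return (
        has_word("professor", a)
        and has_word("ph", a)
        and any(link_to(b, a) and has_word("faculty", b) for b in pages)
    )


print([name for name in pages if faculty(name)])
```

The variable B is existential: the rule holds as soon as one linking page contains the word "faculty".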

21
More Experimental Results (for first order
rules)

The Student class:
Coverage 20%, Accuracy 90%
Coverage 80%, Accuracy 70%
First-order rules tend to be more accurate than statistical classification, but the coverage is not as good: it is hard to come up with rules that classify all instances of the class as belonging to the class.

22
Recognizing Relation Instances

- The main issue to consider is hyperlink connection.
- The algorithm is similar to the algorithm for learning 1st order class rules.
- The underlying assumption: a relation between pages is expressed in terms of a hyperlink or a chain of hyperlinks. Therefore, we need predicates whose arguments are pages and hyperlinks.
- The algorithm is applied assuming class instances have already been extracted.

23
Recognizing Relation Instances (ctd.)

The basic predicates:
class(Page)
link_to(Hyperlink,Page,Page)
has_word(Hyperlink)
all_words_capitalized(Hyperlink)
has_alphanumeric_word(Hyperlink)
has_neighborhood_word(Hyperlink)

24
Recognizing Relation Instances (ctd.)

An example learned rule:
members.of.project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B)
Meaning: The project's home page A points to an intermediate page D, which points to a personal home page B.
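A small sketch of how this rule could be evaluated over a set of labeled pages and identified hyperlinks follows. The `links` and `classes` tables are made-up data, and the Python function is named `members_of_project` with underscores only because dots are not valid in Python identifiers.

```python
# Hypothetical data: hyperlink id -> (source page, target page).
links = {
    "h1": ("proj", "people"),
    "h2": ("people", "ann"),
    "h3": ("proj", "pubs"),
}
# Class labels assumed to have been extracted already.
classes = {
    "proj": "research_project",
    "ann": "person",
    "people": "other",
    "pubs": "other",
}


def link_to(hyperlink, src, dst):
    """True if the given hyperlink goes from src to dst."""
    return links.get(hyperlink) == (src, dst)


def members_of_project(a, b):
    """research_project(A), person(B), link_to(C,A,D), link_to(E,D,B)."""
    if classes.get(a) != "research_project" or classes.get(b) != "person":
        return False
    return any(
        link_to(c, a, d) and link_to(e, d, b)
        for d in classes
        for c in links
        for e in links
    )


print(members_of_project("proj", "ann"))
```

Here the intermediate page D is "people": the project page links to it, and it links to Ann's home page.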

25
Recognizing Relation Instances (ctd.)

Another rule:
department.of.person(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_graduate(E)
Meaning: There is a path of 3 hyperlinks from a department to a person, and the word "graduate" occurs near the 2nd hyperlink.

26
Recognizing Relation Instances (ctd.)

Better results than for 1st order class learning rules: the coverage was not perfect, but the accuracy was close to 100%.

27
Future Research

1) Relaxation of restrictions, e.g. a class instance may be represented by more than one page.
2) Exploiting HTML structure: there are different kinds of text fields in a page.

28
References

1) M. Craven et al., Learning to Extract Symbolic Knowledge from the World Wide Web, AAAI-98.
2) More articles: The Machine Learning Group at CMU, www.cs.cmu.edu/Groups/TextLearning/
