Title: Extracting Symbolic Knowledge From The Web
1Extracting Symbolic Knowledge From The Web
2The Problem
- The WWW contains information that is easily understandable to humans, but less understandable to computers.
- Web pages contain mostly text, images and sounds. These can be immediately processed by humans, but the information is not necessarily arranged in the optimal manner for automatic problem solving by computers.
3The Web->KB Project's Long Range Goal
- Create a computer-understandable knowledge base whose content mirrors that of the WWW.
- This knowledge base would consist of assertions in symbolic form.
4Web->KB Goal (ctd.)
- At the minimum, the KB would allow more
sophisticated queries than keyword-based search
engines.
5The 1st Step: A Machine Learning Approach
- Develop a TRAINABLE system that can be taught to extract information.
- The inputs to the system:
1) An ontology specifying classes in hierarchical tree form and relations between these classes.
Class examples: Student, Person, Research.Project
Relation examples: Advisor.Of(Instructor, Student), Projects.Led.By(Project, Researcher)
6A Machine Learning Approach (ctd.)
- 2) (2nd input) Training examples that represent instances of the classes and relations.
- Given the ontology and training examples, the system is expected to extract from the web NEW instances of classes and relations.
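The two inputs described above can be pictured as plain data structures. The following is only a minimal sketch under assumptions: the `Ontology` class, its field names, the `Entity` root and the example.edu URLs are all hypothetical and are not taken from the Web->KB system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Ontology:
    # Maps each class to its parent class; the root maps to None.
    parent: Dict[str, Optional[str]] = field(default_factory=dict)
    # Maps each relation name to the classes of its two arguments.
    relations: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def ancestors(self, cls: str) -> List[str]:
        """Return the classes above `cls`, nearest parent first."""
        chain = []
        current = self.parent[cls]
        while current is not None:
            chain.append(current)
            current = self.parent[current]
        return chain


ontology = Ontology(
    parent={
        "Entity": None,  # hypothetical root, not named in the slides
        "Person": "Entity",
        "Student": "Person",
        "Instructor": "Person",
        "Research.Project": "Entity",
    },
    relations={"Advisor.Of": ("Instructor", "Student")},
)

# Training examples: labeled pages and labeled page pairs (URLs are made up).
class_examples = {
    "http://example.edu/~jim": "Instructor",
    "http://example.edu/~ann": "Student",
}
relation_examples = [
    ("Advisor.Of", "http://example.edu/~jim", "http://example.edu/~ann"),
]

print(ontology.ancestors("Student"))  # ['Person', 'Entity']
```

Storing only the parent pointer keeps the hierarchy a tree, which matches the "hierarchical tree form" mentioned on the slide.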
8Some Simplifying Assumptions
- 1) Instances of ontology classes are represented by single web pages (home pages).
- 2) Instances of relations R(A,B) are represented in one of 3 ways:
- a segment of text connecting the segment representing A to the segment representing B.
- a contiguous segment of text representing A that contains a segment representing B.
9Assumptions (ctd.)
- Example: Jim's home page contains the words "Intro To AI", so courses.taught.by(Jim, Intro2AI) holds.
- The segment representing A is related to B because it fits some pattern. Example: Jim's page contains words typical of AI researchers, so the relation Research.Of(Jim, AI) holds.
10The Goal In More Specific Terms
- 1. Recognizing class instances by classifying segments of hypertext.
- 2. Recognizing relation instances, mostly by classifying hyperlinks.
11A KB for CS Departments
- 2 sets of pages from CS departments were drawn, each one with more than 4000 pages.
- The 1st set was drawn from four departments, and the 2nd from many departments.
- The classes and relation instances were hand labeled.
- The idea is to learn instances from one department using the data from other departments as training input.
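The evaluation idea in the last bullet amounts to holding out one department at a time. Here is a rough sketch of such a split; the function name, the department keys and the page names are hypothetical placeholders, not the actual data set.

```python
def leave_one_department_out(pages_by_department):
    """Yield (held_out, train_pages, test_pages) for each department in turn."""
    for held_out in pages_by_department:
        train = [
            page
            for dept, pages in pages_by_department.items()
            if dept != held_out
            for page in pages
        ]
        test = list(pages_by_department[held_out])
        yield held_out, train, test


# Toy data: department names and page identifiers are invented.
pages_by_department = {
    "dept_a": ["a1", "a2"],
    "dept_b": ["b1"],
    "dept_c": ["c1", "c2", "c3"],
}

splits = list(leave_one_department_out(pages_by_department))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Each department serves as the test set exactly once, so no page is ever used to train the model that classifies it.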
12 1st Method For Recognizing Class Instances: A Statistical Model
- From the labeled data, each class A is assigned a vector of probabilities $W(A) = (W_1, \ldots, W_{|\mathrm{vocabulary}|})$, where $W_i$ is the frequency of word $i$ in pages representing class A.
- The working vocabulary size is finite (2000 key words).
13 Recognizing Class Instances: A Statistical Model (ctd.)
- A new page P is assigned to the class that is most probable given the distribution of words in P (the class with minimal distance).
- The calculation is done using a variant of the KL divergence, which measures the distance between distributions:
$D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
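As a rough illustration of minimum-divergence classification, the sketch below uses the plain KL divergence with add-one smoothing. Both choices are assumptions: the slides only say that "a variant" of the KL divergence is used. The tiny vocabulary and word lists are invented.

```python
import math
from collections import Counter


def word_distribution(words, vocabulary, alpha=1.0):
    """Smoothed relative frequency of each vocabulary word in a word list."""
    counts = Counter(w for w in words if w in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) * log(P(x) / Q(x)); needs q[x] > 0 where p[x] > 0."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


def classify(page_words, class_models, vocabulary):
    """Return the class whose word distribution is closest to the page's."""
    p = word_distribution(page_words, vocabulary)
    return min(class_models, key=lambda c: kl_divergence(p, class_models[c]))


# Toy vocabulary and training words (invented for illustration).
vocabulary = ["homework", "textbook", "advisor", "thesis"]
class_models = {
    "Course": word_distribution(
        ["homework", "textbook", "textbook", "homework"], vocabulary
    ),
    "Student": word_distribution(["advisor", "thesis", "thesis"], vocabulary),
}

predicted = classify(["textbook", "homework", "syllabus"], class_models, vocabulary)
print(predicted)  # Course
```

Smoothing keeps every probability positive, so the logarithm is always defined even for words a class never used in training.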
14A Statistical Model (ctd.)
- The results are evaluated according to two criteria:
1) Coverage: the percentage of pages from a given class that are correctly classified as belonging to that class.
2) Accuracy: the percentage of pages classified to a class that are actually members of that class.
- There's a natural trade-off between the two.
- We can raise or lower coverage (accuracy) by setting a confidence threshold: page P will be classified to class A only if the distribution in P is sufficiently similar to the distribution in A.
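The two measures and the effect of the threshold can be sketched as follows. The function name, the (class, confidence) prediction format and the toy numbers are assumptions made for illustration; they are not results from the experiments.

```python
def coverage_and_accuracy(predictions, true_labels, target, threshold):
    """Compute (coverage, accuracy) for `target` as fractions in [0, 1].

    predictions: list of (predicted_class, confidence) pairs.
    true_labels: list of true classes, aligned with predictions.
    """
    accepted = [
        truth
        for (pred, conf), truth in zip(predictions, true_labels)
        if pred == target and conf >= threshold
    ]
    correct = sum(1 for truth in accepted if truth == target)
    actual = sum(1 for truth in true_labels if truth == target)
    coverage = correct / actual if actual else 0.0
    accuracy = correct / len(accepted) if accepted else 0.0
    return coverage, accuracy


# Toy predictions (invented): a strict threshold trades coverage for accuracy.
predictions = [
    ("Student", 0.9),
    ("Student", 0.8),
    ("Student", 0.4),
    ("Course", 0.7),
    ("Student", 0.3),
]
true_labels = ["Student", "Student", "Course", "Course", "Student"]

strict = coverage_and_accuracy(predictions, true_labels, "Student", 0.5)
loose = coverage_and_accuracy(predictions, true_labels, "Student", 0.0)
print(strict, loose)
```

On this toy data the strict threshold accepts two pages, both correct, while the loose threshold recovers every Student page at the cost of one wrong acceptance.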
15 A Statistical Model (ctd.)
- Mistaken classifications: since the ontology is hierarchical (Person -> Faculty or Staff or Student), a classification of an instance into a more general ancestor class can still be useful.
16Some Experimental Results
- The Student class:
- Coverage 20%, Accuracy 67%
- Coverage 80%, Accuracy 45%
- The Course class:
- Coverage 20%, Accuracy 55%
- Coverage 80%, Accuracy 30%
17 2nd Method For Recognizing Class Instances: Learning First Order Rules
- Instead of looking just at the word pattern on a given page, we look at the word pattern in neighboring pages as well.
- Example: A page is a course home page if it contains the words "textbook" and "TA", and is linked to a page that contains the word "assignment".
18Recognizing Class Instances: Learning First Order Rules (ctd.)
- We need an algorithm to infer rules in predicate form, where the arguments are pages.
- The rule defines a target class, using basic (atomic) predicates of 2 kinds:
- has_word(Page): a finite family of predicates, where "word" can be any word, indicating that the page contains the word.
- link_to(Page1, Page2): there is a hyperlink from page 1 to page 2.
19Recognizing Class Instances: Learning First Order Rules (ctd.)
- The input: positive instances of basic predicates, and positive and negative instances of the target class.
- The output: rules in terms of the basic predicates (previous slide).
- The algorithm (FOIL) uses a greedy method: at each stage, add to the current definition a basic predicate that excludes from the instances still unaccounted for as many negative examples as possible, while including as many positive examples as possible.
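The greedy step can be illustrated with a much-simplified propositional sketch. This is not FOIL itself: real FOIL learns clauses with variables and scores candidates with an information-gain heuristic, whereas the sketch below treats each example as a set of true features and uses "positives kept minus negatives kept" as a stand-in score. The feature names and examples are invented.

```python
def learn_rule(positives, negatives, candidate_literals):
    """Greedily add literals until no negative example satisfies the rule."""
    rule = []
    pos, neg = list(positives), list(negatives)
    remaining = list(candidate_literals)
    while neg and remaining:
        # Only consider literals that exclude at least one remaining negative.
        useful = [lit for lit in remaining if any(lit not in ex for ex in neg)]
        if not useful:
            break

        def score(lit):
            # Stand-in score: positives still covered minus negatives still covered.
            return sum(lit in ex for ex in pos) - sum(lit in ex for ex in neg)

        best = max(useful, key=score)
        rule.append(best)
        remaining.remove(best)
        pos = [ex for ex in pos if best in ex]
        neg = [ex for ex in neg if best in ex]
    return rule


# Toy examples: each page is the set of features that hold for it.
positives = [
    {"has_professor", "has_ph", "linked_from_faculty"},
    {"has_professor", "has_ph", "linked_from_faculty", "has_course"},
]
negatives = [
    {"has_professor", "has_course"},
    {"has_ph", "linked_from_faculty"},
    {"has_course"},
]
candidates = ["has_professor", "has_ph", "linked_from_faculty", "has_course"]

rule = learn_rule(positives, negatives, candidates)
print(rule)  # ['has_professor', 'has_ph']
```

Ties are broken by list order here, so a different ordering of `candidates` could yield a different but equally consistent rule.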
20Recognizing Class Instances: Learning First Order Rules (ctd.)
- Example: The algorithm found the following definition for the class faculty:
faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculty(B)
- Meaning: A page belongs to a faculty member if it contains the words "professor" and "ph" (prefix of "phd"), and there is a link to it from a page containing the word "faculty".
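To make the faculty rule concrete, the sketch below checks it against a toy page graph. The representation is an assumption: pages are sets of already-stemmed tokens, links are (source, target) pairs, and all page names are invented.

```python
def has_word(pages, page, word):
    """True if the token set of `page` contains `word`."""
    return word in pages[page]


def is_faculty(pages, links, a):
    """faculty(A): A has 'professor' and 'ph', and some page B with 'faculty' links to A."""
    if not (has_word(pages, a, "professor") and has_word(pages, a, "ph")):
        return False
    return any(
        dst == a and has_word(pages, src, "faculty") for src, dst in links
    )


# Toy page graph (invented): token sets per page and directed hyperlinks.
pages = {
    "dept/people": {"faculty", "staff"},
    "~jim": {"professor", "ph", "research"},
    "~ann": {"student", "ph"},
}
links = [("dept/people", "~jim"), ("dept/people", "~ann")]

print(is_faculty(pages, links, "~jim"))  # True
print(is_faculty(pages, links, "~ann"))  # False
```

The variable B in the rule is existential, which is why the check uses `any` over incoming links rather than naming a specific page.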
21More Experimental Results (for first order rules)
- The Student class:
- Coverage 20%, Accuracy 90%
- Coverage 80%, Accuracy 70%
- Tends to be more accurate than statistical classification, but the coverage is not as good (it is hard to come up with rules under which all instances of the class are classified as belonging to it).
22Recognizing Relation Instances
- The main issue to consider is hyperlink connection.
- The algorithm is similar to the algorithm for learning 1st order class rules.
- The underlying assumption: a relation between pages is expressed in terms of a hyperlink or a chain of hyperlinks. Therefore, we need predicates whose arguments are pages and hyperlinks.
- The algorithm is applied assuming class instances have already been extracted.
23Recognizing Relation Instances (ctd.)
- The basic predicates
- class(Page)
- link_to(Hyperlink,Page,Page)
- has_word(Hyperlink)
- all_words_capitalized(Hyperlink)
- has_alphanumeric_word(Hyperlink)
- has_neighborhood_word(Hyperlink)
24Recognizing Relation Instances (ctd.)
- An Example Learned Rule
- members.of.project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B)
- Meaning: The project's home page A points to an intermediate page D, which points to a personal home page B.
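The members.of.project rule can be checked on a toy graph in the same spirit. This sketch assumes that links are stored as (hyperlink, source page, target page) triples, mirroring link_to(Hyperlink, Page, Page), and that class labels have already been assigned; all identifiers are invented.

```python
def members_of_project(classes, links, a, b):
    """True if project page A reaches person page B through one intermediate page."""
    if classes.get(a) != "research_project" or classes.get(b) != "person":
        return False
    # Pages D such that link_to(C, A, D) holds for some hyperlink C.
    intermediates = {dst for _, src, dst in links if src == a}
    # Check link_to(E, D, B) for some hyperlink E and some such D.
    return any(src in intermediates and dst == b for _, src, dst in links)


# Toy data (invented): previously extracted classes and labeled hyperlinks.
classes = {"proj": "research_project", "jim": "person", "ann": "person"}
links = [
    ("h1", "proj", "proj/people"),
    ("h2", "proj/people", "jim"),
]

print(members_of_project(classes, links, "proj", "jim"))  # True
print(members_of_project(classes, links, "proj", "ann"))  # False
```

The class lookups come first because, as stated earlier, relation learning assumes class instances have already been extracted.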
25Recognizing Relation Instances (ctd.)
- Another Rule
- department.of.person(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_graduate(E)
- Meaning: A 3-hyperlink path from a department to a person, requiring that the word "graduate" occur near the 2nd hyperlink.
26Recognizing Relation Instances (ctd.)
- Better results than for 1st order class learning rules: the coverage was not perfect, but the accuracy was close to 100%.
27Future Research
- 1) Relaxation of restrictions, e.g. a class may be represented by more than one page.
- 2) Exploiting HTML structure: there are different kinds of text fields in a page.
28References
- 1) M. Craven et al., "Learning to Extract Symbolic Knowledge from the World Wide Web", AAAI-98.
- 2) More articles: The Machine Learning Group at CMU, www.cs.cmu.edu/Groups/TextLearning/