
Title: Extracting Symbolic Knowledge From The Web


1
Extracting Symbolic Knowledge From The Web
  • Ofer Neiman

2
The Problem
  • The WWW contains information which is easily
    understandable to humans, but less
    understandable to computers.
  • Web pages contain mostly text, images and
    sounds. These can be immediately processed by
    humans, but the information is not necessarily
    arranged in the optimal manner for automatic
    problem solving by computers.

3
The Web->KB Project's Long-Range Goal
  • Create a computer understandable knowledge base
    whose content mirrors that of the WWW.
  • This Knowledge Base would consist of
    assertions in symbolic form.

4
Web->KB Goal (ctd.)
  • At the minimum, the KB would allow more
    sophisticated queries than keyword-based search
    engines.

5
The 1st Step: A Machine Learning Approach
  • Develop a TRAINABLE system that can be
    taught to extract information.
  • The inputs to the system:
    1) An ontology specifying classes in
    hierarchical tree form and relations between
    these classes.
    Class examples: Student, Person, Research.Project
    Relation examples: Advisor.Of(Instructor, Student),
    Projects.Led.By(Project, Researcher)
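
The ontology input described above could be represented roughly as follows. This is a minimal sketch, not the project's actual data structure: the `Ontology` class, the root class `Entity`, and the placement of `Instructor` under `Person` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # child class -> parent class (None for the root)
    parents: dict = field(default_factory=dict)
    # relation name -> (class of 1st argument, class of 2nd argument)
    relations: dict = field(default_factory=dict)

    def add_class(self, name, parent=None):
        self.parents[name] = parent

    def add_relation(self, name, first, second):
        self.relations[name] = (first, second)

    def ancestors(self, name):
        """Return the ancestors of `name`, nearest first."""
        chain = []
        parent = self.parents.get(name)
        while parent is not None:
            chain.append(parent)
            parent = self.parents.get(parent)
        return chain


onto = Ontology()
onto.add_class("Entity")                      # hypothetical root class
onto.add_class("Person", "Entity")
onto.add_class("Student", "Person")
onto.add_class("Instructor", "Person")        # placement is an assumption
onto.add_class("Research.Project", "Entity")
onto.add_relation("Advisor.Of", "Instructor", "Student")

print(onto.ancestors("Student"))
```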


6
A Machine Learning Approach (ctd.)
  • 2) Training examples that represent instances
    of the classes and relations.
  • Given the ontology and the training examples,
    the system is expected to extract NEW instances
    of classes and relations from the web.

7
(No Transcript)
8
Some Simplifying Assumptions
  • 1) Instances of ontology classes are represented
    by single web pages (home pages).
  • 2) Instances of relations R(A,B) are
    represented in one of 3 ways:
    - a segment of text connecting the segment
      representing A to the segment representing B.
    - a contiguous segment of text representing
      A that contains a segment representing B.

9
Assumptions (ctd.)
  • Example: Jim's home page contains the words
    "Intro To AI", so courses.taught.by(Jim, Intro2AI)
    holds.
  • - (3rd way) The segment representing A is related
    to B because it fits some pattern.
    Example: Jim's page contains words typical of AI
    researchers, so the relation Research.Of(Jim, AI)
    holds.

10
The Goal In More Specific Terms
  • 1. Recognizing class instances by classifying
    segments of hypertext
  • 2. Recognizing relation instances, mostly by
    classifying hyperlinks

11
A KB for CS Departments
  • 2 sets of pages from CS departments were drawn,
    each one with more than 4000 pages.
  • The 1st set was drawn from four departments, and
    the 2nd from many departments.
  • The classes and relation instances were hand
    labeled.
  • The idea is to learn instances from one
    department using the data from other departments
    as training input.

12
1st Method For Recognizing Class Instances: A
Statistical Model
  • From the labeled data, each class A is assigned
    a vector of probabilities
    $W(A) = (W_1, \ldots, W_{|vocabulary|})$, where $W_i$
    is the frequency of word i in pages representing
    class A.
  • The working vocabulary size is finite (2000
    key words).

13
Recognizing Class Instances: A Statistical
Model (ctd.)
  • A new page P is assigned to the class that
    is most probable given the distribution of
    words in P (the class with minimal distance).
  • The calculation is done using a variant of the
    KL divergence, which measures the distance
    between two distributions P and Q:
    $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
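
The minimal-distance classification above can be sketched as follows. The slides only say that a "variant" of the KL divergence is used, so this sketch assumes the plain KL divergence with add-one smoothing (to avoid zero probabilities); the function names, the 4-word vocabulary and the toy training words are hypothetical.

```python
import math
from collections import Counter


def word_distribution(words, vocabulary, smoothing=1.0):
    """Smoothed word frequencies over a fixed vocabulary."""
    counts = Counter(w for w in words if w in vocabulary)
    total = sum(counts.values()) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def classify(page_words, class_models, vocabulary):
    """Assign the page to the class whose distribution is closest."""
    p = word_distribution(page_words, vocabulary)
    return min(class_models, key=lambda c: kl_divergence(p, class_models[c]))


# Toy vocabulary and training words, made up for illustration.
vocabulary = ["advisor", "homework", "syllabus", "thesis"]
class_models = {
    "student": word_distribution(["thesis", "advisor", "thesis"], vocabulary),
    "course": word_distribution(["homework", "syllabus", "homework"], vocabulary),
}

predicted = classify(["my", "thesis", "advisor"], class_models, vocabulary)
print(predicted)
```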

14
A Statistical Model (ctd.)
  • The results are evaluated according to two
    criteria:
    1) Coverage - the percentage of pages from a
    given class that are correctly classified as
    belonging to that class.
    2) Accuracy - the percentage of pages
    classified to a class that are actually members
    of that class.
  • There's a natural trade-off between the two.
  • We can raise coverage at the cost of accuracy
    (or the reverse) by setting a confidence
    threshold: page P will be classified to class A
    only if the distribution in P is sufficiently
    similar to the distribution in A.
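
The two criteria and the effect of the threshold can be illustrated with a small sketch. The function name, the confidence scores and the labels below are made up for illustration; a higher confidence is assumed to mean a closer match between the page and the class.

```python
def coverage_and_accuracy(predictions, true_labels, target, threshold):
    """predictions: list of (predicted_class, confidence) pairs."""
    # Pages assigned to the target class with enough confidence.
    accepted = [
        truth
        for (pred, conf), truth in zip(predictions, true_labels)
        if pred == target and conf >= threshold
    ]
    correct = sum(1 for truth in accepted if truth == target)
    actual = sum(1 for truth in true_labels if truth == target)
    coverage = correct / actual if actual else 0.0
    accuracy = correct / len(accepted) if accepted else 0.0
    return coverage, accuracy


# Hypothetical classifier output and hand labels.
predictions = [("student", 0.9), ("student", 0.6), ("student", 0.4), ("course", 0.8)]
true_labels = ["student", "student", "course", "student"]

low = coverage_and_accuracy(predictions, true_labels, "student", 0.3)
high = coverage_and_accuracy(predictions, true_labels, "student", 0.8)
print(low, high)
```

With the low threshold more student pages are found but one wrong page is accepted; with the high threshold only the most confident page is accepted, so accuracy rises while coverage drops.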

15
A Statistical Model (ctd.)
  • Mistaken classifications: since the
    ontology is hierarchical (Person -> Faculty
    or Staff or Student), a classification of an
    instance into a more general ancestor class
    can still be useful.

16
Some Experimental Results
  • The Student class:
    Coverage = 20%, Accuracy = 67%
    Coverage = 80%, Accuracy = 45%
  • The Course class:
    Coverage = 20%, Accuracy = 55%
    Coverage = 80%, Accuracy = 30%

17
2nd Method For Recognizing Class Instances:
Learning First Order Rules
  • Instead of looking just at the word pattern on a
    given page, we look at the word pattern in
    neighboring pages as well.
  • Example: a page is a course home page if it
    contains the words "textbook" and "TA", and is
    linked to a page that contains the word
    "assignment".

18
Recognizing Class Instances: Learning First
Order Rules (ctd.)
  • We need an algorithm to infer rules in
    predicate form, where the arguments are pages.
  • The rule defines a target class using
    basic (atomic) predicates of 2 kinds:
  • has_word(Page) - a finite family of
    predicates, where "word" can be any word, indicating
    that the page contains that word.
  • link_to(Page1,Page2) - there is a hyperlink
    from page 1 to page 2.
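
The two kinds of basic predicates can be sketched over a toy set of pages, together with the course-page example from the previous slide. The page names and contents are invented, and `has_word` is written here as one function taking the word as an argument rather than as a family of separate predicates.

```python
# Hypothetical pages: the words each contains and the pages it links to.
pages = {
    "course.html": {"words": {"textbook", "ta", "lecture"}, "links": {"hw.html"}},
    "hw.html": {"words": {"assignment", "due"}, "links": set()},
    "jim.html": {"words": {"textbook", "ta"}, "links": {"pubs.html"}},
    "pubs.html": {"words": {"paper"}, "links": set()},
}


def has_word(word, page):
    """True if `page` contains `word`."""
    return word in pages[page]["words"]


def link_to(page1, page2):
    """True if there is a hyperlink from page1 to page2."""
    return page2 in pages[page1]["links"]


def is_course_page(a):
    """Course rule from the example: words on A plus a word on a linked page B."""
    return (
        has_word("textbook", a)
        and has_word("ta", a)
        and any(link_to(a, b) and has_word("assignment", b) for b in pages)
    )


course_pages = [p for p in pages if is_course_page(p)]
print(course_pages)
```

Note that jim.html has the right words but fails the rule, because none of the pages it links to contains "assignment".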

19
Recognizing Class Instances: Learning First
Order Rules (ctd.)
  • The input: positive instances of the basic
    predicates, and positive and negative instances
    of the target class.
  • The output: rules in terms of the basic
    predicates (previous slide).
  • The algorithm (FOIL) uses a greedy method:
    at each stage, add to the current
    definition the basic predicate that excludes
    as many negative examples as possible from the
    instances still unaccounted for, while including
    as many positive examples as possible.
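
The greedy step can be sketched in a heavily simplified form. This is not the full FOIL algorithm: it handles only word predicates on a single page (no variables, no link_to literals), and it scores candidates with the standard FOIL information gain, which the slides do not spell out. All names and the toy examples are hypothetical.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """Information gain of a literal that narrows (p0, n0) to (p1, n1)."""
    if p1 == 0:
        return float("-inf")
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


def learn_rule(positives, negatives, candidates):
    """Greedily conjoin literals until no negative example is covered.

    positives / negatives: examples (here, sets of words on a page).
    candidates: dict mapping literal name -> test on an example.
    """
    rule, pos, neg = [], list(positives), list(negatives)
    remaining = dict(candidates)
    while neg and remaining:

        def gain(name):
            test = remaining[name]
            p1 = sum(1 for e in pos if test(e))
            n1 = sum(1 for e in neg if test(e))
            return foil_gain(len(pos), len(neg), p1, n1)

        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break  # no literal improves the rule any further
        test = remaining.pop(best)
        rule.append(best)
        pos = [e for e in pos if test(e)]
        neg = [e for e in neg if test(e)]
    return rule


# Toy training pages, each given as the set of words it contains.
positives = [{"professor", "ph", "research"}, {"professor", "ph", "teaching"}]
negatives = [{"professor", "student"}, {"ph", "student"}, {"course", "homework"}]
candidates = {
    f"has_{w}": (lambda e, w=w: w in e) for w in ["professor", "ph", "student"]
}

rule = learn_rule(positives, negatives, candidates)
print(rule)
```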

20
Recognizing Class Instances: Learning First
Order Rules (ctd.)
  • Example: the algorithm found the following
    definition for the class faculty:
    faculty(A) :- has_professor(A),
    has_ph(A), link_to(B,A), has_faculty(B)
  • Meaning: a page belongs to a faculty member if it
    contains the words "professor" and "ph" (prefix
    of "phd"), and there is a link to it from a
    page containing the word "faculty".

21
More Experimental Results (for first order
rules)
  • The Student class:
    Coverage = 20%, Accuracy = 90%
    Coverage = 80%, Accuracy = 70%
  • Tends to be more accurate than
    statistical classification, but the coverage is
    not as good (it is hard to come up with rules
    that classify all instances of the class
    as belonging to the class).

22
Recognizing Relation Instances
  • The main issue to consider is hyperlink
    connection.
  • The algorithm is similar to the algorithm for
    learning 1st order class rules.
  • The underlying assumption: a relation between
    pages is expressed in terms of a hyperlink or
    a chain of hyperlinks. Therefore, we need
    predicates whose arguments are pages and
    hyperlinks.
  • The algorithm is applied assuming class
    instances have already been extracted.

23
Recognizing Relation Instances (ctd.)
  • The basic predicates:
  • class(Page)
  • link_to(Hyperlink,Page,Page)
  • has_word(Hyperlink)
  • all_words_capitalized(Hyperlink)
  • has_alphanumeric_word(Hyperlink)
  • has_neighborhood_word(Hyperlink)

24
Recognizing Relation Instances (ctd.)
  • An example learned rule:
  • members.of.project(A,B) :-
    research_project(A), person(B), link_to(C,A,D),
    link_to(E,D,B)
  • Meaning: the project's home page A points to an
    intermediate page D, which points to a personal
    home page B.
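
Applying this rule to already-classified pages amounts to finding two-hyperlink paths. The sketch below assumes hyperlinks are stored as (hyperlink, source page, target page) triples, matching link_to(Hyperlink,Page,Page); the page names, hyperlink ids and class labels are made up.

```python
# Hypothetical hyperlinks as (link_id, source_page, target_page).
hyperlinks = [
    ("h1", "proj.html", "people.html"),
    ("h2", "people.html", "alice.html"),
    ("h3", "people.html", "bob.html"),
    ("h4", "dept.html", "proj.html"),
]

# Class labels assumed to come from the earlier class-recognition step.
page_class = {
    "proj.html": "research_project",
    "people.html": "other",
    "alice.html": "person",
    "bob.html": "person",
    "dept.html": "department",
}


def members_of_project():
    """All (A, B) with research_project(A), person(B),
    link_to(C, A, D) and link_to(E, D, B)."""
    results = set()
    for _c, a, d in hyperlinks:
        if page_class.get(a) != "research_project":
            continue
        for _e, source, b in hyperlinks:
            if source == d and page_class.get(b) == "person":
                results.add((a, b))
    return results


members = members_of_project()
print(sorted(members))
```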

25
Recognizing Relation Instances (ctd.)
  • Another rule:
  • department.of.person(A,B) :-
    person(A), department(B), link_to(C,D,A),
    link_to(E,F,D), link_to(G,B,F),
    has_neighborhood_graduate(E)
  • Meaning: a 3-hyperlink path from a department
    to a person, requiring that the word "graduate"
    occur near the 2nd hyperlink.

26
Recognizing Relation Instances (ctd.)
  • Better results than for 1st order class learning
    rules: the coverage was not perfect, but the
    accuracy was close to 100%.

27
Future Research
  • 1) Relaxation of restrictions, e.g. a class
    instance may be represented by more than one page.
  • 2) Exploiting HTML structure: there are
    different kinds of text fields in a page.

28
References
  • 1) M. Craven et al., Learning to Extract
    Symbolic Knowledge from the World Wide Web,
    AAAI-98.
  • 2) More articles: the Machine Learning Group at
    CMU,
    www.cs.cmu.edu/Groups/TextLearning/