Title: Learning to construct knowledge bases from the World Wide Web
1Learning to construct knowledge bases from the World Wide Web
- Presented by Janneke van der Zwaan
2Overview
- Introduction
- Overview of the Web→KB system
- Learning tasks and algorithms
- Conclusions
3Introduction
- Goal of the Web→KB system
- Automatically create a computer-understandable knowledge base (KB)
- Knowledge is extracted from the WWW
- Assertions are in symbolic, probabilistic form, e.g. employed_by(mark_craven, carnegie_mellon_univ), probability = 0.99
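As an illustration (not the system's actual internals), such a probabilistic assertion might be represented as a simple record type; the field names below are hypothetical:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Assertion:
        relation: str        # e.g. "employed_by"
        arguments: tuple     # e.g. ("mark_craven", "carnegie_mellon_univ")
        probability: float   # extraction confidence, e.g. 0.99

    # The example assertion from this slide:
    fact = Assertion("employed_by", ("mark_craven", "carnegie_mellon_univ"), 0.99)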
4Introduction
- The knowledge base can be used for
- Effective information retrieval for complex queries
- E.g. find all universities within 20 miles of Pittsburgh
- Knowledge-based inference and problem solving
- E.g. a software travel agent
5Overview of Web→KB
- Web→KB is trained to extract the desired knowledge
- It then browses new Web sites to extract assertions from hypertext
6Learning task
- Hypertext classifier
- Input
- Ontology (specification of the classes and relations of interest)
- Labeled training instances
- Output
- General procedures for extracting new instances (classes and relations) from the Web
7Example ontology
9Assumptions about the types of knowledge
- Instances of classes are segments of hypertext
- Single Web page
- Rooted graph of several Web pages
- Contiguous string within a Web page
- Instances of relations
- Undirected path of hyperlinks between pages
- Segment of text contained within another text segment
- Hypertext segment satisfies some learned model for relatedness
10Primary learning tasks
- Recognizing class instances
- Recognizing relation instances
- Recognizing class and relation instances by extracting small fields of text
11Experimental testbed
- All experiments are based on an ontology for computer science departments
- Set of pages and hyperlinks from the computer science departments of 4 universities
- Labeled pages and relation instances
- Set of pages labeled as other
- Four-fold cross-validation
- Train on 3 universities and test on the remaining 1
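A minimal sketch of this leave-one-university-out protocol; `universities`, `train`, and `evaluate` are hypothetical stand-ins for the labeled data and whichever classifier is being tested:

    def leave_one_university_out(universities, train, evaluate):
        # `universities` maps each university name to its labeled pages.
        scores = {}
        for held_out in universities:
            train_pages = [page for name, pages in universities.items()
                           if name != held_out for page in pages]
            model = train(train_pages)
            scores[held_out] = evaluate(model, universities[held_out])
        return scores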
12Performance measures
- Coverage
- Percentage of pages of a class that are correctly classified
- Accuracy
- Percentage of pages classified into a given class that are actually members of that class
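These two measures correspond to recall and precision. A minimal sketch of computing them from parallel lists of predicted and true class names (function and parameter names are hypothetical):

    def coverage_and_accuracy(predictions, labels, cls):
        predicted = sum(1 for p in predictions if p == cls)
        actual = sum(1 for y in labels if y == cls)
        correct = sum(1 for p, y in zip(predictions, labels) if p == y == cls)
        coverage = correct / actual if actual else 0.0        # recall
        accuracy = correct / predicted if predicted else 0.0  # precision
        return coverage, accuracy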
13Learning to recognize class instances
- Assumption: class instances are represented by Web pages
- Statistical bag-of-words approach (Naïve Bayes)
- First-order rules
- Combination of classifiers
14Statistical bag-of-words approach
- Three independent classifiers based on different representations
- Full-text (baseline)
- Title/Heading
- Hyperlink
15Statistical bag-of-words approach
- Naïve Bayes classifier with minor modifications based on Kullback-Leibler divergence
- Witten-Bell smoothing
- Restricted vocabulary of 2000 words
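A minimal sketch of a bag-of-words Naïve Bayes classifier in this spirit. Note the actual system uses Witten-Bell smoothing and a KL-divergence-selected vocabulary; this sketch substitutes plain add-one (Laplace) smoothing for brevity:

    import math
    from collections import Counter, defaultdict

    class NaiveBayesText:
        def fit(self, docs, labels, vocab):
            self.vocab = set(vocab)                  # e.g. the 2000-word vocabulary
            self.word_counts = defaultdict(Counter)  # per-class word frequencies
            self.class_totals = Counter()            # in-vocabulary tokens per class
            counts = Counter(labels)
            for doc, cls in zip(docs, labels):
                for word in doc:
                    if word in self.vocab:
                        self.word_counts[cls][word] += 1
                        self.class_totals[cls] += 1
            self.priors = {cls: n / len(labels) for cls, n in counts.items()}
            return self

        def predict(self, doc):
            def log_posterior(cls):
                score = math.log(self.priors[cls])
                denom = self.class_totals[cls] + len(self.vocab)  # add-one smoothing
                for word in doc:
                    if word in self.vocab:
                        score += math.log((self.word_counts[cls][word] + 1) / denom)
                return score
            return max(self.priors, key=log_posterior)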
16Results
Full-text
Title/Heading
Hyperlink
17First-order text classification
- Takes hyperlink structure into account
- Based on Quinlan's FOIL algorithm
- Greedy covering algorithm for learning Horn clauses
- Literals are added to a rule until the clause no longer covers any negative instances
18First-order text classification
- Representation of Web pages and hyperlinks
- has_word(Page) (one such predicate per vocabulary word)
- link_to(Page, Page)
- m-estimate of rule accuracy: (n_c + m·p) / (n + m), where the rule covers n instances of which n_c are correct, p is a prior estimate of accuracy, and m weights the prior
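The greedy covering loop might be sketched as follows; `candidate_literals` and `gain` are hypothetical stand-ins for FOIL's literal generator and information-gain heuristic, and termination safeguards are elided:

    def learn_horn_rules(positives, negatives, candidate_literals, gain):
        # Literals are boolean functions over instances, e.g. has_word
        # or link_to tests over pages.
        rules = []
        uncovered = set(positives)
        while uncovered:                      # outer loop: cover every positive
            rule, pos, neg = [], set(uncovered), set(negatives)
            while neg:                        # inner loop: specialize until no
                best = max(candidate_literals(rule),      # negatives remain
                           key=lambda lit: gain(lit, pos, neg))
                rule.append(best)
                pos = {x for x in pos if best(x)}
                neg = {x for x in neg if best(x)}
            rules.append(rule)
            uncovered -= pos
        return rules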
19First-order text classification example rules
20First-order text classification results
21Combination of classifiers
- Predict the class chosen by a majority of the classifiers
- If there is no majority, select the class with the highest confidence
- Confidence measures are rescaled to make them comparable
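A minimal sketch of this voting scheme, assuming confidences have already been rescaled to a comparable range:

    from collections import Counter

    def combine(predictions):
        # `predictions`: one (class, confidence) pair per constituent classifier.
        votes = Counter(cls for cls, _ in predictions)
        top_class, top_votes = votes.most_common(1)[0]
        if top_votes > len(predictions) / 2:            # strict majority wins
            return top_class
        return max(predictions, key=lambda p: p[1])[0]  # else most confident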
22Combination of classifiers results
- Not uniformly better than constituent classifiers
- Due to simple method of combining classifiers
- Better performance with small vocabulary
23Learning to recognize relation instances
- Relations between instances of classes
- Often represented by hyperlink paths
- E.g. instructors_of(A, B) :- course(A), person(B), link_to(A, B)
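Applying the example rule above amounts to a join over the page classifications and the hyperlink graph. A minimal sketch, where `classes` and `links` are hypothetical stand-ins for Web→KB's internal representation:

    def instructors_of(classes, links):
        # `classes` maps page -> predicted class; `links` maps page -> set
        # of pages it links to.
        return [(a, b)
                for a, targets in links.items() if classes.get(a) == "course"
                for b in targets if classes.get(b) == "person"]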
24Learning to recognize relation instances
- Rule learning consists of two phases
- First, learn a path of hyperlinks connecting the pages
- Then, add literals describing the pages along the path
25Example rules
26Results
27Learning to extract text fields
- An instance is sometimes represented by a text field instead of an entire Web page
- Information Extraction (IE) with Sequence Rules with Validation (SRV)
28Sequence Rules with Validation
- Input
- Set of pages with labeled text fields
- Set of features defined over tokens
- Output
- Set of information extraction rules
- Rules are grown literal by literal
- length(Relop, N), where Relop is one of <, >, =
- some(Var, Path, Feat, Value)
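Illustrative (not SRV's actual) implementations of these two predicate forms over a candidate fragment of tokens; the `some` sketch assumes an empty relational path for simplicity:

    def length(relop, n, fragment):
        # length(Relop, N): compare the fragment's token count to N.
        ops = {"<": lambda a, b: a < b,
               ">": lambda a, b: a > b,
               "=": lambda a, b: a == b}
        return ops[relop](len(fragment), n)

    def some(fragment, token_features, feat, value):
        # some(Var, Path, Feat, Value) with an empty path: does some token
        # in the fragment have feature Feat equal to Value?
        # `token_features(token)` returns a feature dict (hypothetical).
        return any(token_features(tok).get(feat) == value for tok in fragment)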
29SRV Example
30Results
31Conclusions
- Web→KB works reasonably well
- Accuracy > 70% at coverage levels of 30%
- But
- What level of accuracy can be achieved?
- What level of accuracy is required?
- A lot of hand-labeled Web pages are needed for training
- How much effort will be required to train the system for each new or extended ontology?
32Questions?