Slides: 40

Provided by: wuf

Learn more at: https://homes.cs.washington.edu

Transcript and Presenter's Notes

1
The Intelligence in Wikipedia Project
Daniel S. Weld
Department of Computer Science & Engineering
University of Washington, Seattle, WA, USA
Joint work with Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers & Michael Skinner
2
Wikipedia for AI

Benefit to AI

Fantastic Corpus
Dynamic Environment
E.g., Bot Framework
Powers Reasoning
Semantic Distance Measure [Ponzetto & Strube 07]
Word-Sense Disambiguation [Bunescu & Pasca 06, Mihalcea 07]
Coreference Resolution [Ponzetto & Strube 06, Yang & Su 07]
Ontology / Taxonomy [Suchanek 07, Muchnik 07]
Multi-Lingual Alignment [Adafre & Rijke 06]
Question Answering [Ahn et al. 05, Kaisser 08]
Basis of Huge KB [Auer et al. 07]

3
AI for Wikipedia

Benefit to Wikipedia: Tools

Internal link maintenance
Infobox Creation
Schema Management
Reference suggestion & fact checking
Disambiguation page maintenance
Translation across languages
Vandalism Alerts

Comscore MediaMetrix August 2007
4
Motivating Vision

Next-Generation Search = Information Extraction + Ontology + Inference

Which German Scientists Taught at US Universities?
5
Next-Generation Search

Information Extraction
Ontology
Physicist(x) ⇒ Scientist(x)
Inference
Einstein = Einstein

Scalable Means Self-Supervised
6
Why Mine Wikipedia?

Pros

High-quality, comprehensive
UID for key concepts
First sentence as definition
Infoboxes
Categories & lists
Redirection pages
Disambiguation pages
Revision history
Multilingual corpus

Cons

Natural-language
Missing data
Inconsistent
Low redundancy

7
The Intelligence in Wikipedia Project

Outline
1. Self-supervised extraction from Wikipedia text (and the greater Web)
2. Automatic ontology generation
3. Scalable probabilistic inference for Q/A

9
Building on SNOWBALL [Agichtein & Gravano 00], MULDER [Kwok et al. TOIS01], AskMSR [Brill et al. EMNLP02], KnowItAll [Etzioni et al. AAAI04]

10
Kylin: Self-Supervised Information Extraction from Wikipedia
[Wu & Weld CIKM 2007]
From infoboxes to a training set
Clearfield County was created in 1804 from parts
of Huntingdon and Lycoming Counties but was
administered as part of Centre County until 1812.
Its county seat is Clearfield.
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
As of 2005, the population density was 28.2/km².
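
The step from infoboxes to a training set can be sketched roughly as follows: a sentence that mentions an infobox attribute value is treated as a positive example for that attribute. This is a minimal sketch, not Kylin's actual heuristics. The `infobox` dictionary, its attribute names, and the `label_sentences` helper are assumptions made up for illustration; only the sentences come from the slide.

```python
# Hypothetical infobox for the Clearfield County article. Attribute names
# and values are assumed for illustration, not taken from Wikipedia.
infobox = {
    "founded": "1804",
    "seat": "Clearfield",
    "land_area": "2,972 km²",
}

# Article sentences shown on the slide.
sentences = [
    "Clearfield County was created in 1804 from parts of Huntingdon and "
    "Lycoming Counties but was administered as part of Centre County until 1812.",
    "Its county seat is Clearfield.",
    "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.",
    "As of 2005, the population density was 28.2/km².",
]


def label_sentences(infobox, sentences):
    """Pair each attribute with every sentence that contains its value."""
    examples = []
    for attribute, value in infobox.items():
        for sentence in sentences:
            if value in sentence:  # naive substring match (an assumption)
                examples.append((attribute, value, sentence))
    return examples


training_set = label_sentences(infobox, sentences)
for attribute, value, sentence in training_set:
    print(f"{attribute}: {value!r} <- {sentence[:45]}...")
```

Note that plain substring matching is noisy: "Clearfield" also occurs in the first sentence, so `seat` picks up two candidate sentences. A real system would need some way to choose among such candidates.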
11
Kylin Architecture
12
Preliminary Evaluation

Kylin Performed Well on Popular Classes
Precision: mid-70% to high-90%
Recall: low-50% to mid-90%

... Floundered on Sparse Classes: Little Training Data

82
13
Shrinkage?
person (1201)
.birth_place
performer (44)
.location
.birthplace .birth_place .cityofbirth .origin
actor (8738)
comedian (106)
14

Outline
1. Self-Supervised Extraction from Wikipedia Text
Training on Infoboxes
Improving Recall: Shrinkage, Retraining, Web Extraction
Community Content Creation
2. Automatic Ontology Generation
3. Scalable Probabilistic Inference for Q/A

15
KOG: Kylin Ontology Generator [Wu & Weld, WWW08]
16
Subsumption Detection

Binary Classification Problem
Nine Complex Features
E.g., String Features
IR Measures
Mapping to Wordnet
Hearst Pattern Matches
Class Transitions in Revision History
Learning Algorithm
SVM & MLN Joint Inference

Person
6/07 Einstein
Scientist
Physicist
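
Two of the feature types listed above (string features and Hearst pattern matches) can be illustrated with a short sketch. The function names, the single "such as" pattern, and the toy corpus are assumptions for illustration, not KOG's actual feature set. The resulting vectors would feed a binary classifier, which is not shown here.

```python
import re


def hearst_match(child, parent, text):
    """True if text contains '<parent>(s) such as <child>(s)'."""
    pattern = (
        rf"\b{re.escape(parent)}s?\s+such\s+as\s+{re.escape(child)}s?\b"
    )
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def name_suffix(child, parent):
    """String feature: the child class name ends with the parent name."""
    c, p = child.lower(), parent.lower()
    return c != p and c.endswith(p)


def features(child, parent, text):
    """Feature vector for the candidate relation child is-a parent."""
    return [int(hearst_match(child, parent, text)),
            int(name_suffix(child, parent))]


# Toy corpus, invented for this example.
corpus = "Many scientists such as physicists rely on mathematics."

print(features("physicist", "scientist", corpus))           # [1, 0]
print(features("scientist", "physicist", corpus))           # [0, 0]
print(features("computer scientist", "scientist", corpus))  # [0, 1]
```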
17
KOG Architecture
18
Schema Mapping
Person: birth_date, birth_place, name, other_names
Performer: birthdate, location, name, othername

Heuristics
Edit History
String Similarity
Experiments
Precision 94%, Recall 87%
Future
Integrated Joint Inference
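
The string-similarity heuristic can be sketched with Python's standard `difflib`. This is an illustrative sketch only: the normalization step, the 0.8 threshold, and the `map_schema` helper are assumptions, not KOG's implementation. The attribute names are the ones shown on the slide.

```python
from difflib import SequenceMatcher

person = ["birth_date", "birth_place", "name", "other_names"]
performer = ["birthdate", "location", "name", "othername"]


def normalize(attr):
    """Drop underscores and case so 'birth_date' and 'birthdate' compare equal."""
    return attr.replace("_", "").lower()


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized attribute names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def map_schema(source, target, threshold=0.8):
    """Map each source attribute to its most similar target attribute,
    or to None when nothing reaches the (assumed) threshold."""
    mapping = {}
    for attr in source:
        best = max(target, key=lambda t: similarity(attr, t))
        mapping[attr] = best if similarity(attr, best) >= threshold else None
    return mapping


mapping = map_schema(performer, person)
print(mapping)
```

In this sketch `location` stays unmapped, because no Person attribute resembles it as a string. That gap is the kind of case where a second signal, such as the edit history listed above, would be needed.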

20
Improving Recall on Sparse Classes [Wu et al. KDD-08]

Shrinkage
Extra Training Examples from Related Classes
How to Weight New Examples?
Retraining
Compare Kylin Extractions with Ones from TextRunner [Banko et al. IJCAI-07]
Additional Positive Examples
Eliminate False Negatives
Extraction from Broader Web
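
One possible answer to the weighting question is to discount borrowed examples by their distance from the target class in the ontology. The sketch below is a hypothetical scheme, not the one used in the paper: the `decay` parameter, the tree-distance measure, and the helper names are all assumptions. The class names and counts reuse the figures from the Shrinkage slide.

```python
# Class hierarchy (child -> parent) and per-class counts, reused from the
# Shrinkage slide. Treating the counts as numbers of available training
# examples is an assumption for this sketch.
parent = {"performer": "person", "actor": "performer", "comedian": "performer"}
counts = {"person": 1201, "performer": 44, "actor": 8738, "comedian": 106}


def ancestors(cls, parent):
    """Path from cls up to the root, starting with cls itself."""
    path = [cls]
    while cls in parent:
        cls = parent[cls]
        path.append(cls)
    return path


def distance(a, b, parent):
    """Number of edges between classes a and b in the class tree."""
    path_a, path_b = ancestors(a, parent), ancestors(b, parent)
    common = next(c for c in path_a if c in path_b)
    return path_a.index(common) + path_b.index(common)


def shrinkage_weights(target, counts, parent, decay=0.5):
    """Weight each class's examples by decay ** distance from the target."""
    return {c: decay ** distance(target, c, parent) for c in counts}


weights = shrinkage_weights("performer", counts, parent)
effective = sum(weights[c] * counts[c] for c in counts)
print(weights)
print(effective)
```

With these assumed settings, the sparse `performer` class goes from 44 examples of its own to an effective 5066.5 weighted examples, most of them borrowed from `actor`.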

21
Effect of Shrinkage & Retraining
22
Effect of Shrinkage & Retraining
1755% improvement for a sparse class
13.7% improvement for a popular class
23
Related Work on Ontology-Driven Information Extraction