Title: David W. Embley
1WoK: A Web of Knowledge
- David W. Embley
- Brigham Young University
- Provo, Utah, USA
2A Web of Pages → A Web of Facts
- Birthdate of my great grandpa Orson
- Price and mileage of red Nissans, 1990 or newer
- Location and size of chromosome 17
- US states with property crime rates above 1
3Toward a Web of Knowledge
- Fundamental questions
- What is knowledge?
- What are facts?
- How does one know?
- Philosophy
- Ontology
- Epistemology
- Logic and reasoning
4Ontology
- Existence → asks "What exists?"
- Concepts, relationships, and constraints with a
formal foundation
5Epistemology
- The nature of knowledge → asks "What is
knowledge?" and "How is knowledge acquired?"
- Populated conceptual model
6Logic and Reasoning
- Principles of valid inference → asks "What is
known?" and "What can be inferred?"
- For us, it answers what can be inferred (in a
formal sense) from conceptualized data.
Find price and mileage of red Nissans, 1990 or
newer
7Making this Work → How?
- Distill knowledge from the wealth of digital web
data
- Annotate web pages
- Need a computational alembic to algorithmically
turn raw symbols contained in web pages into
knowledge
(Diagram labels: Annotation, Fact)
8Turning Raw Symbols into Knowledge
- Symbols: 11,500 117K Nissan CD AC
- Data: price(11,500) mileage(117K) make(Nissan)
- Conceptualized data
- Car(C123) has Price(11,500)
- Car(C123) has Mileage(117,000)
- Car(C123) has Make(Nissan)
- Car(C123) has Feature(AC)
- Knowledge
- Correct facts
- Provenance
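The progression above (symbols, data, conceptualized data, knowledge with provenance) can be sketched roughly as follows. This is a minimal illustration, not the WoK implementation: the `Fact` class, the converter functions, and the source URL are all hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str    # non-lexical object, e.g. the car "C123"
    relation: str   # relationship-set name, e.g. "has Price"
    value: object   # lexical value after normalization
    source: str     # provenance: where the raw symbol was found

def normalize_mileage(symbol: str) -> int:
    """Turn '117K' or '117,000' into the integer 117000."""
    s = symbol.replace(",", "")
    return int(s[:-1]) * 1000 if s.upper().endswith("K") else int(s)

def conceptualize(car_id, data, source):
    """Attach labeled data to one object and record where it came from."""
    converters = {
        "Price": lambda s: int(s.replace(",", "")),
        "Mileage": normalize_mileage,
    }
    facts = []
    for name, symbol in data.items():
        convert = converters.get(name, str)  # default: keep the symbol as text
        facts.append(Fact(car_id, f"has {name}", convert(symbol), source))
    return facts

# Labeled data from the slide; the URL is a made-up provenance marker.
facts = conceptualize(
    "C123",
    {"Price": "11,500", "Mileage": "117K", "Make": "Nissan", "Feature": "AC"},
    "http://example.com/ad42",
)
```

Each resulting `Fact` carries its source, so a query answer can be traced back to the page that supports it.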
9Actualization (with Extraction Ontologies)
Find me the price and mileage of all red Nissans.
I want a 1990 or newer.
10Data Extraction Demo
11Semantic Annotation Demo
12Free-Form Query Demo
13Explanation: How it Works
- Extraction Ontologies
- Semantic Annotation
- Free-Form Query Interpretation
14Extraction Ontologies
- Object sets
- Relationship sets
- Participation constraints
- Lexical
- Non-lexical
- Primary object set
- Aggregation
- Generalization/Specialization
15Extraction Ontologies
Data Frame
Internal Representation: float
Values
External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
Left Context
Key Word Phrase
Key Words: ([Pp]rice)|([Cc]ost)
Operators
Operator: gt
Key Words: (more\sthan)|(more\scostly)
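As a rough illustration of how a data frame of this kind might recognize values, the sketch below accepts a number as a price only when a dollar sign or a keyword immediately precedes it. The `DataFrame` class, its field names, and the `NUMBER` pattern are assumptions made for this example; they are not the regular expressions or the data structures of the actual system.

```python
import re
from dataclasses import dataclass, field

# Assumed number pattern: comma-grouped or plain digits, optional cents.
NUMBER = r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?"

@dataclass
class DataFrame:
    name: str
    left_context: str   # regex that may directly precede a value
    keywords: str       # regex alternatives for context keywords
    operators: dict = field(default_factory=dict)  # operator -> keyword regex

    def extract(self, text):
        """Return values (as floats) that appear in an accepting context."""
        values = []
        for m in re.finditer(NUMBER, text):
            left = text[:m.start()]
            has_left_context = re.search(self.left_context + r"\s*$", left)
            has_keyword = re.search(r"(?:" + self.keywords + r")\W{0,3}$", left)
            if has_left_context or has_keyword:
                values.append(float(m.group(0).replace(",", "")))
        return values

    def find_operator(self, query):
        """Return the first operator whose keyword phrase occurs in the query."""
        for op, phrases in self.operators.items():
            if re.search(phrases, query):
                return op
        return None

price = DataFrame(
    "Price",
    left_context=r"\$",
    keywords=r"[Pp]rice|[Cc]ost",
    operators={"gt": r"more\sthan|more\scostly"},
)
```

With these assumptions, `price.extract("Red Nissan, price $11,500, 117K miles")` keeps 11,500 and skips 117, because nothing price-like directly precedes the mileage figure.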
16Generality and Resiliency of Extraction Ontologies
- Generality: assumptions about web pages
- Data rich
- Narrow domain
- Document types
- Single-record documents (hard, but doable)
- Multiple-record documents (harder)
- Records with scattered components (even harder)
- Resiliency: declarative
- Still works when web pages change
- Works for new, unseen pages in the same domain
- Scalable, but takes work to declare the
extraction ontology
17Semantic Annotation
18Free-Form Query Interpretation
- Parse Free-Form Query
- (with respect to data extraction ontology)
- Select Ontology
- Formulate Query Expression
- Run Query Over Semantically Annotated Data
19Parse Free-Form Query
Find me the [price] and [mileage] of all
[red] [Nissan]s. I want a [1996] [or newer].
Recognized terms: price, mileage, red, Nissan, 1996, or newer
or newer: gt Operator
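One way to picture this parsing step is to run the ontology's recognizers over the query text and record what they match. The sketch below is an assumption-laden illustration: the `KEYWORDS`, `VALUES`, and `OPERATOR_PHRASES` tables are hypothetical stand-ins for the recognizers declared in an extraction ontology, and it keeps the slide's `gt` label for "or newer".

```python
import re

# Hypothetical recognizers standing in for an extraction ontology's data frames.
KEYWORDS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
VALUES = {
    "Color": r"\b(?:red|blue|white|black)\b",
    "Make": r"\b(?:Nissan|Toyota|Honda)",
    "Year": r"\b(?:19|20)\d{2}\b",
}
OPERATOR_PHRASES = {"gt": r"or newer\b"}

def parse_query(query):
    """Return (mentioned object sets, value constraints) found in the query."""
    projection = [n for n, p in KEYWORDS.items() if re.search(p, query, re.I)]
    selection = {}
    for name, pattern in VALUES.items():
        m = re.search(pattern, query, re.I)
        if not m:
            continue
        op = "eq"  # default: the value must match exactly
        rest = query[m.end():]
        for candidate, phrase in OPERATOR_PHRASES.items():
            if re.match(r"\s*" + phrase, rest, re.I):
                op = candidate  # an operator phrase directly follows the value
        text = m.group(0)
        selection[name] = (op, int(text) if text.isdigit() else text)
    return projection, selection

projection, selection = parse_query(
    "Find me the price and mileage of all red Nissans. I want a 1996 or newer."
)
```

Keyword matches (price, mileage) become the object sets to report, while value matches (red, Nissan, 1996) become constraints.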
20Select Ontology
Find me the price and mileage of all red Nissans.
I want a 1996 or newer.
21Formulate Query Expression
- Conjunctive queries and aggregate queries
- Projection on mentioned object sets
- Selection via values and operator keywords
- Color = red
- Make = Nissan
- Year gt 1996 (gt operator)
22Formulate Query Expression
For
Let
Where
Return
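The shape of such a query (select in the Where clause, project in the Return clause) can be imitated over a small in-memory set of annotated records. This is only a sketch: the record layout, the `run_query` helper, and the sample cars are invented for illustration, and `gt` is treated as a strict comparison, which is an assumption about how the slide's operator is meant.

```python
import operator

OPS = {"eq": operator.eq, "gt": operator.gt}

def run_query(records, projection, selection):
    """Conjunctive query: keep records meeting every constraint, then project."""
    results = []
    for rec in records:
        if all(
            name in rec and OPS[op](rec[name], value)
            for name, (op, value) in selection.items()
        ):
            results.append({name: rec.get(name) for name in projection})
    return results

# Hypothetical annotated data, one dictionary per extracted car.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1994, "Price": 4900, "Mileage": 160000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 9000, "Mileage": 80000},
]
selection = {"Color": ("eq", "red"), "Make": ("eq", "Nissan"), "Year": ("gt", 1996)}
answer = run_query(cars, ["Price", "Mileage"], selection)
```

Only the first car satisfies all three constraints, so `answer` holds its price and mileage.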
23Run Query Over Semantically Annotated Data
24Great! But Problems Still Need Resolution
- How do we create extraction ontologies?
- Manual creation requires several dozen person
hours
- Semi-automatic creation
- TISP (Table Interpretation by Sibling Pages)
- TANGO (Table ANalysis for Generating Ontologies)
- Nested Schemas with Regular Expressions
- Synergistic Bootstrapping
- Form-based Information Harvesting
- How do we scale up?
- Practicalities of technology transfer and usage
- Millions of queries over zillions of facts for
thousands of ontologies
25Manual Creation
26Manual Creation
27Manual Creation
- Library of instance recognizers
- Library of lexicons
28Automatic Annotation with TISP (Table
Interpretation by Sibling Pages)
- Recognize tables (discard non-tables)
- Locate table labels
- Locate table values
- Find label/value associations
29Recognize Tables
Layout Tables (discard)
Data Table
Nested Data Tables
30Locate Table Labels
Examples: Identification.Gene model(s).Protein
and Identification.Gene model(s).2
31Locate Table Labels
Examples: Identification.Gene model(s).Gene Model
and Identification.Gene model(s).2
1 2
32Locate Table Values
Value
33Find Label/Value Associations
Example: the label pair (Identification.Gene model(s).Protein,
Identification.Gene model(s).2) is associated with the value WPCE28918
1 2
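Label/value association of this kind can be sketched as a walk over a nested table that builds dotted label paths. The nested-dictionary layout and the `associations` function are assumptions for illustration; only the value WPCE28918 and the label names come from the example above, and the other cell values are placeholders.

```python
def associations(table, prefix=""):
    """Yield (label paths, value) pairs from a nested table.

    Dictionaries are label -> content; a list is a nested data table whose
    rows are dictionaries keyed by column label.
    """
    for label, cell in table.items():
        path = f"{prefix}.{label}" if prefix else label
        if isinstance(cell, dict):
            yield from associations(cell, path)
        elif isinstance(cell, list):
            for row_number, row in enumerate(cell, start=1):
                for column, value in row.items():
                    # Pair the column label path with the row label path.
                    yield (f"{path}.{column}", f"{path}.{row_number}"), value
        else:
            yield (path,), cell

table = {
    "Identification": {
        "Gene model(s)": [
            {"Gene Model": "model-A (placeholder)", "Protein": "protein-A (placeholder)"},
            {"Gene Model": "model-B (placeholder)", "Protein": "WPCE28918"},
        ]
    }
}
found = dict(associations(table))
```

Looking up the pair of paths ending in `.Protein` and `.2` in `found` returns the value from the second row's Protein column.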
34Interpretation Technique: Sibling Page Comparison
35Interpretation Technique: Sibling Page Comparison
Same
36Interpretation Technique: Sibling Page Comparison
Almost Same
37Interpretation Technique: Sibling Page Comparison
Different
Same
38Technique Details
- Unnest tables
- Match tables in sibling pages
- Perfect match (table for layout → discard)
- Reasonable match (sibling table)
- Determine and use table-structure pattern
- Discover pattern
- Pattern usage
- Dynamic pattern adjustment
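The matching idea can be sketched as a cell-by-cell comparison of two tables taken from sibling pages: cells that stay the same are likely labels, cells that vary are likely values, and a table that is identical on both pages is treated as layout. The functions, the 0.3 threshold, and the sample tables below are illustrative assumptions, not the matching procedure TISP actually uses.

```python
def match_ratio(table_a, table_b):
    """Fraction of aligned cells whose text is identical in both tables."""
    cells_a = [c for row in table_a for c in row]
    cells_b = [c for row in table_b for c in row]
    if not cells_a or len(cells_a) != len(cells_b):
        return 0.0
    same = sum(a == b for a, b in zip(cells_a, cells_b))
    return same / len(cells_a)

def classify(table_a, table_b, low=0.3):
    """Perfect match -> layout (discard); reasonable match -> sibling table."""
    ratio = match_ratio(table_a, table_b)
    if ratio == 1.0:
        return "layout"
    if ratio >= low:
        return "sibling"
    return "unrelated"

def split_labels_values(table_a, table_b):
    """Cells shared by both siblings are labels; differing cells are values."""
    labels, values = [], []
    for row_a, row_b in zip(table_a, table_b):
        for a, b in zip(row_a, row_b):
            (labels if a == b else values).append(a)
    return labels, values

# Hypothetical tables from two sibling pages.
page1 = [["Name", "alpha"], ["Length", "120"]]
page2 = [["Name", "beta"], ["Length", "340"]]
navigation = [["Home", "About"]]
```

A value that happens to be identical on both pages would be mistaken for a label here, which is one reason comparing more than two sibling pages helps.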
39Generated RDF
40WoK Demo (via TISP)
41Semi-Automatic Annotation with TANGO (Table
Analysis for Generating Ontologies)
- Recognize and normalize table information
- Construct mini-ontologies from tables
- Discover inter-ontology mappings
- Merge mini-ontologies into a growing ontology
42Recognize Table Information
Country      Population         Religion (%)
             (July 2001 est.)   Albanian  Muslim  Roman     Shia    Sunni   other
                                Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                     15      84      1
Albania      3,510,484          20        70      10
43Construct Mini-Ontology
44Discover Mappings
45Merge
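The construct, map, and merge steps can be pictured with a very small model in which a mini-ontology is a set of object sets plus binary relationships from a key object set. Everything in the sketch is an assumption for illustration: the dictionary representation, the `mini_ontology` and `merge` helpers, and the second table with its `Nation` and `Area` columns are hypothetical, and real mappings are discovered rather than supplied by hand.

```python
def mini_ontology(name, key, columns):
    """Model a normalized table as relationships from its key to each column."""
    return {
        "name": name,
        "object_sets": {key, *columns},
        "relationships": {(key, column) for column in columns},
    }

def merge(growing, mini, mappings):
    """Merge a mini-ontology into the growing one.

    `mappings` renames the mini-ontology's object sets to the matching
    object sets in the growing ontology (the inter-ontology mappings).
    """
    def rename(object_set):
        return mappings.get(object_set, object_set)

    return {
        "name": growing["name"],
        "object_sets": growing["object_sets"]
        | {rename(s) for s in mini["object_sets"]},
        "relationships": growing["relationships"]
        | {(rename(a), rename(b)) for a, b in mini["relationships"]},
    }

# First table: Country with Population and Religion, as on the slide.
growing = mini_ontology("world", "Country", ["Population", "Religion"])
# Second table: hypothetical, uses "Nation" for the same concept.
geography = mini_ontology("geography", "Nation", ["Area"])
merged = merge(growing, geography, {"Nation": "Country"})
```

After the merge, `Area` hangs off `Country`, and `Nation` does not appear as a separate object set.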
46Semi-Automatic Annotation via Synergistic
Bootstrapping (Based on Nested Schemas with
Regular Expressions)
- Build a page-layout, pattern-based annotator
- Automate layout recognition based on examples
- Auto-generate examples with extraction ontologies
- Synergistically run pattern-based annotator and
extraction-ontology annotator
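The interplay of the two annotators can be sketched on a toy document of "label: value" lines. An ontology-style recognizer first annotates the values it is sure about, the layout label shared by those values becomes a pattern, and the pattern then picks up values the recognizer missed. The three functions, the line format, and the sample lines are assumptions made for this example only.

```python
import re

def conceptual_annotate(lines, recognizer):
    """Ontology-based pass: annotate lines whose value the recognizer accepts."""
    found = {}
    for i, line in enumerate(lines):
        label, _, value = line.partition(":")
        if re.fullmatch(recognizer, value.strip()):
            found[i] = (label.strip(), value.strip())
    return found

def generate_pattern(annotations):
    """Pattern generation: the layout label most often seen with recognized values."""
    labels = [label for label, _ in annotations.values()]
    return max(set(labels), key=labels.count)

def structural_annotate(lines, label):
    """Layout-driven pass: annotate every value that follows the learned label."""
    return [
        line.partition(":")[2].strip()
        for line in lines
        if line.partition(":")[0].strip() == label
    ]

# Hypothetical document; the recognizer only knows "$" followed by digits.
lines = [
    "Make: Nissan",
    "Price: $11,500",
    "Make: Toyota",
    "Price: 9,000 firm",
    "Price: $4,900",
]
partial = conceptual_annotate(lines, r"\$\d{1,3}(?:,\d{3})*")
pattern = generate_pattern(partial)
prices = structural_annotate(lines, pattern)
```

The recognizer alone misses "9,000 firm", but the layout pattern learned from its two hits recovers it.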
48Synergistic Execution
Extraction Ontology
Partially Annotated Document
Conceptual Annotator (ontology-based annotation)
Pattern Generation
Document
Layout Patterns
Annotated Document
Structural Annotator (layout-driven annotation)
49Form-Based Information Harvesting
- Forms
- General familiarity
- Reasonable conceptual framework
- Appropriate correspondence
- Transformable to ontological descriptions
- Capable of accepting source data
- Instance recognizers
- Some pre-existing instance recognizers
- Lexicons
- Automated extraction ontology creation?
50Form Creation
- Basic form-construction facilities
- single-entry field
- multiple-entry field
- nested form
- ...
51Created Sample Form
52Generated Ontology View
53Source-to-Form Mapping
54Source-to-Form Mapping
55Source-to-Form Mapping
56Source-to-Form Mapping
57Almost Ready to Harvest
- Need reading path: DOM-tree structure
- Need to resolve mapping problems
- Split/Merge
- Union/Selection
58Almost Ready to Harvest
- Need reading path: DOM-tree structure
- Need to resolve mapping problems
- Split/Merge
- Union/Selection
Name: Voltage-dependent anion-selective channel
protein 3; VDAC-3; hVDAC3; Outer mitochondrial
membrane protein porin 3
59Almost Ready to Harvest
- Need reading path: DOM-tree structure
- Need to resolve mapping problems
- Split/Merge
- Union/Selection
Name: Voltage-dependent anion-selective channel
protein 3; VDAC-3; hVDAC3; Outer mitochondrial
membrane protein porin 3
60Almost Ready to Harvest
- Need reading path: DOM-tree structure
- Need to resolve mapping problems
- Split/Merge
- Union/Selection
Name: T-complex protein 1 subunit theta;
TCP-1-theta; CCT-theta; Renal carcinoma antigen
NY-REN-15
61Almost Ready to Harvest
- Need reading path: DOM-tree structure
- Need to resolve mapping problems
- Split/Merge
- Union/Selection
Name: T-complex protein 1 subunit theta;
TCP-1-theta; CCT-theta; Renal carcinoma antigen
NY-REN-15
62Can Now Harvest
Name
63Can Now Harvest
Name: 14-3-3 protein epsilon; Mitochondrial import
stimulation factor L subunit; Protein kinase C
inhibitor protein-1; KCIP-1; 14-3-3E
64Can Now Harvest
Name: Voltage-dependent anion-selective channel
protein 3; VDAC-3; hVDAC3; Outer mitochondrial
membrane protein porin 3
65Can Now Harvest
Name: Tryptophanyl-tRNA synthetase, mitochondrial
precursor; EC 6.1.1.2; Tryptophan--tRNA ligase;
TrpRS; (Mt)TrpRS
66Harvesting Populates Ontology
67Harvesting Populates Ontology
Also helps adjust ontology constraints
68Can Harvest from Additional Sites
Name: T-complex protein 1 subunit theta;
TCP-1-theta; CCT-theta; Renal carcinoma antigen
NY-REN-15
69Automating Extraction Ontology Creation
Lexicons
Name: 14-3-3 protein epsilon; Mitochondrial import
stimulation factor L subunit; Protein kinase C
inhibitor protein-1; KCIP-1; 14-3-3E
14-3-3 protein epsilon; Mitochondrial import
stimulation factor L subunit; Protein kinase C
inhibitor protein-1; KCIP-1; 14-3-3E; T-complex
protein 1 subunit theta; TCP-1-theta; CCT-theta;
Renal carcinoma antigen NY-REN-15; Tryptophanyl-tRNA
synthetase, mitochondrial precursor; EC 6.1.1.2;
Tryptophan--tRNA ligase; TrpRS; (Mt)TrpRS
Name: T-complex protein 1 subunit theta;
TCP-1-theta; CCT-theta; Renal carcinoma antigen
NY-REN-15
Name: Tryptophanyl-tRNA synthetase, mitochondrial
precursor; EC 6.1.1.2; Tryptophan--tRNA ligase;
TrpRS; (Mt)TrpRS
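Building a lexicon from harvested values can be sketched as a union of the names collected for a form field, which then serves as a simple recognizer on unseen text. The helper names and the sample sentence are assumptions for illustration, and plain substring matching is a deliberate simplification.

```python
def build_lexicon(harvested):
    """Union the names harvested for one form field into a single lexicon."""
    lexicon = set()
    for names in harvested:
        lexicon.update(name.strip() for name in names)
    return lexicon

def recognize(text, lexicon):
    """Return lexicon entries that occur in the text, longest first."""
    lowered = text.lower()
    hits = [entry for entry in lexicon if entry.lower() in lowered]
    return sorted(hits, key=len, reverse=True)

# Names taken from the harvested "Name" fields on the slides.
harvested = [
    ["14-3-3 protein epsilon", "KCIP-1", "14-3-3E"],
    ["T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta"],
    ["Tryptophanyl-tRNA synthetase, mitochondrial precursor", "TrpRS"],
]
name_lexicon = build_lexicon(harvested)

# Hypothetical sentence from a page that was not used for harvesting.
hits = recognize("Expression of TCP-1-theta was elevated.", name_lexicon)
```

Each additional harvested site enlarges the lexicon, which is how harvesting can feed back into automated extraction ontology creation.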
70Automating Extraction Ontology Creation
Instance Recognizers
Number Patterns
Context Keywords and Phrases
71Automatic Source-to-Form Mapping
72Automatic Semantic Annotation
Recognize and annotate with respect to an ontology
73Ontology Transformations
Transformations to and from all
74Practicalities: WoK Query Interfaces
(Future Work)
- Advanced free-form queries with disjunction and
negation
- Form-based query language
- Table-based query languages
- Graphical query languages
75Practicalities: Bootstrapping the WoK
(Future Work)
- Won't just happen without sufficient content
- Niche applications
- Historical Data (e.g. Genealogy)
- Topical Blogs
- Local WoKs
- Intra-organizational effort
- Individual interests
76Practicalities: Scalability
(Future Work)
- Potential for rapid growth
- Thousands of ontologies
- Millions of simultaneous queries
- Billions of annotated pages
- Trillions of facts
- Search-engine-like caching and query processing
77Key to Success: Simplicity via Automation
- Automatic (or near automatic) creation of
extraction ontologies
- Automatic (or near automatic) annotation of web
pages
- Simple but accurate query specification without
specialized training
www.deg.byu.edu