Title: Scalability: A Semantic Web Perspective
1Scalability: A Semantic Web Perspective
- Jeff Heflin
- Lehigh University
2My Background
- Semantic Web
- eight years of experience
- helped design DAML+OIL and OWL
- research focus
- scalable reasoning
- distributed semantics
- distributed queries
3Reasoning vs. Resources
- Lehigh University Benchmark
- used to evaluate semantic web reasoning systems
- becoming a de facto standard in the Semantic Web community
- helping to push research on scalability
- Features
- OWL ontology for university domain (moderate complexity)
- customizable data generation
- can select number of universities and random number generator seed
- arbitrary size
- repeatable
- plausible
- real world constraints are applied
- Naming Scheme
- LUBM(# of universities, seed)
4Metrics
- initialization metrics
- load time
- total time to load input files and do any pre-processing
- repository size
- disk space utilized (for systems with secondary storage only)
- query metrics
- 14 queries that test a range of features
- query response time
- each query is executed 10 times and averaged to account for caching
- degree of completeness
- # of correct answers / # of entailed answers
- degree of soundness
- # of correct answers / # of returned answers
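The query metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function names, the treatment of empty answer sets, and the sample answer sets are assumptions made for the example.

```python
import time


def completeness(returned, entailed):
    """# of correct answers / # of entailed answers."""
    returned, entailed = set(returned), set(entailed)
    if not entailed:
        return 1.0  # assumption: nothing to find counts as complete
    return len(returned & entailed) / len(entailed)


def soundness(returned, entailed):
    """# of correct answers / # of returned answers."""
    returned, entailed = set(returned), set(entailed)
    if not returned:
        return 1.0  # assumption: an empty answer set contains no wrong answers
    return len(returned & entailed) / len(returned)


def mean_response_time(run_query, runs=10):
    """Average wall-clock time of `run_query` over `runs` executions."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        total += time.perf_counter() - start
    return total / runs


# Hypothetical answer sets for one query.
entailed = {"s1", "s2", "s3", "s4"}   # answers that logically follow
returned = {"s1", "s2", "s5"}         # answers the system gave

comp = completeness(returned, entailed)   # 2 correct out of 4 entailed
snd = soundness(returned, entailed)       # 2 correct out of 3 returned
```

A system that does no reasoning can still be fully sound while scoring low on completeness, which is why the two ratios are reported separately.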
5Experiment Data Loading
- Time: Sesame-mem is the fastest in loading up to 10 univ.
- DLDB scales better
- OWLJessKB is the slowest, can only load 1 univ.
- Space: DLDB also scales better
- Only DLDB loaded 50 univ. in a day
6Experiment Query Time
7Completeness vs. Response Time
8State-of-the-art Scalable Reasoners
- Sesame 2 (Aduna)
- RDF(S): simple semantic net style reasoning
- LUBM(500,0) (~70 million statements)
- BRAHMS (U. of Georgia)
- RDF(S): simple semantic net style reasoning
- LUBM(700,0) (~100 million statements)
- query time 1-5 minutes
- OWLIM (Ontotext)
- complete for most of OWL Lite (a simple DL)
- LUBM(300,0) (~40 million statements)
- DLDB (Lehigh)
- complete for a large subset of OWL DL (an expressive DL)
- LUBM(100,0) (~13 million statements)
- 45 million statements (real Semantic Web documents)
9Challenges for MANET Reasoners
- scalability
- millions of devices each with potentially millions of facts (changing over time)
- no big servers
- complete reasoning is infeasible
- how to find the best answers first?
- other considerations
- power consumption
- bandwidth consumption
- how do we empirically compare approaches?
- develop a MANET benchmark
- to drive research on reasoners for the unique
aspects of this problem
10MANET Benchmark
- more distributed than LUBM
- dynamically changing data?
- ontologies
- policy ontologies, device ontologies, etc.
- new metrics
- power consumption
- bandwidth consumption
- time to first answer
- relative quality of answers
11For more information...
- My information
- heflin_at_cse.lehigh.edu
- http://www.cse.lehigh.edu/heflin/
- For more on the Semantic Web
- http://www.semwebcentral.org/
- http://www.w3.org/2001/sw/
- http://www.daml.org/
- http://www.semanticweb.org/
12The End
13Ontology
- Definition
- a logical theory that accounts for the intended meaning of a formal vocabulary (Guarino 98)
- has a formal syntax and unambiguous semantics
- inference algorithms can compute what logically follows
- Relevance to Web
- identify context
- provide shared definitions
- eases the integration of distinct resources
14RDF and RDF Schema
RDF Schema:
<rdf:Property rdf:ID="name">
  <rdfs:domain rdf:resource="#Person"/>
</rdf:Property>
<rdfs:Class rdf:ID="Chair">
  <rdfs:subClassOf rdf:resource="http://schema.org/gen#Person"/>
</rdfs:Class>
[Graph diagram: nodes rdfs:Class, rdf:Property, g:Person, u:Chair, g:name; edges rdf:type, rdfs:domain, rdfs:subClassOf]
RDF data:
<rdf:RDF xmlns:g="http://schema.org/gen#"
         xmlns:u="http://schema.org/univ#">
  <u:Chair rdf:ID="john">
    <g:name>John Smith</g:name>
  </u:Chair>
</rdf:RDF>
[Graph diagram: the resource john has rdf:type u:Chair and g:name "John Smith"]
15URIs and Namespaces
- URI
- Uniform Resource Identifier
- includes URLs
- but also anything that you can design an identification scheme for
- helps to prevent collision of names
- all the symbols in RDF are either URIs or Literals
- Namespace
- a mechanism for abbreviating URIs
- by assigning a prefix for a URI fragment
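As a small illustration of the namespace mechanism, the sketch below expands a prefixed name into a full URI. The `expand` helper and the prefix table are hypothetical names chosen for this example; the `g` namespace is taken from the RDF slide and is assumed to end in `#`.

```python
# Prefix table: each prefix abbreviates the leading fragment of a URI.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "g": "http://schema.org/gen#",  # assumed form of the slide's example namespace
}


def expand(qname, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full URI using the prefix table."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {qname!r}")
    return prefixes[prefix] + local


person_uri = expand("g:Person")
type_uri = expand("rdf:type")
```

Two vocabularies can both define a term called Person without colliding, because the expanded URIs differ even when the local names match.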
16OWL
markup linked to semantics
- Web Ontology Language
- W3C Recommendation
- released Feb. 2004
- based on RDF
semantic markup:
<rdf:Description rdf:about="">
  <imports resource="www.books.com/bookont"/>
</rdf:Description>
<Book rdf:ID="book26489">
  <author>E.B. White</author>
  <title>Charlotte's Web</title>
  <price>6.99</price>
  <subject rdf:resource="bookont#FictionChild"/>
</Book>
bookont ontology (imported by the markup above):
<Class ID="Book"/>
<Property ID="subject">
  <domain resource="#Book"/>
  <range resource="#Topic"/>
</Property>
<Class ID="FictionChild">
  <subclassOf resource="#Fiction"/>
  <subclassOf resource="#Childrens"/>
</Class>
17Species of OWL
- OWL Full
- very expressive (e.g., classes as instances)
- theoretical properties not well understood
- OWL DL
- has a standard model theoretic semantics
- OWL Lite
- subset of OWL DL
- easier to reason with
18OWL Class Constructors
borrowed from Ian Horrocks
19OWL Axioms
borrowed from Ian Horrocks
20Benefit of Description Logic
- optimized computation of subsumption
- calculate implicit subClassOf relations
- ontology integration
- if two ontologies use class expressions to define
their vocabularies in terms of a third ontology,
then subsumption can be used to compute an
integrated ontology
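To make "calculate implicit subClassOf relations" concrete, here is a deliberately simplified sketch. It only follows stated subClassOf links transitively; a real description logic reasoner also classifies complex class expressions, which this toy does not attempt. The class names and the `superclasses` helper are illustrative assumptions.

```python
def superclasses(cls, told):
    """Return every direct or implied superclass of `cls`.

    `told` maps a class name to the set of its explicitly stated superclasses.
    """
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in told.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical told hierarchy.
told = {
    "UndergraduateStudent": {"Student"},
    "GraduateStudent": {"Student"},
    "Student": {"Person"},
}

implied = superclasses("GraduateStudent", told)
```

Here GraduateStudent is never declared a subclass of Person, yet the relation follows from the two stated links.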
21OWL RDF Syntax
<owl:Class rdf:ID="Band">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMember" />
      <owl:allValuesFrom rdf:resource="#Musician" />
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
- A Band is a subset of the set of objects which only have Musicians as members
22OWL Inference
<owl:ObjectProperty rdf:ID="head">
  <rdfs:subPropertyOf rdf:resource="#member" />
</owl:ObjectProperty>
<owl:Class rdf:ID="Terrorist">
  <owl:equivalentClass>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#member" />
      <owl:someValuesFrom rdf:resource="#TerroristOrg" />
    </owl:Restriction>
  </owl:equivalentClass>
</owl:Class>
- The head of an organization is also a member of it
- A member of a terror organization is a terrorist
- Therefore, the head of a terror organization is a terrorist
[Diagram: Bin Laden is linked by "head" to Al Qaeda; Al Qaeda has type TerroristOrg; the inferred statement is that Bin Laden has type Terrorist]
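The inference on this slide can be traced with a tiny forward-chaining loop. This is a hand-coded sketch of the two rules for this one example, not a general OWL reasoner; the triple encoding, the `infer` function, and the direction of the "head" link (person to organization) are assumptions.

```python
def infer(triples):
    """Apply the slide's two rules until no new facts appear."""
    facts = set(triples)
    changed = True
    while changed:
        new = set()
        for s, p, o in facts:
            # Rule 1: head is a subproperty of member.
            if p == "head":
                new.add((s, "member", o))
            # Rule 2: anything that is a member of a TerroristOrg is a Terrorist.
            if p == "member" and (o, "type", "TerroristOrg") in facts:
                new.add((s, "type", "Terrorist"))
        changed = not new <= facts  # stop once nothing new was derived
        facts |= new
    return facts


asserted = {
    ("BinLaden", "head", "AlQaeda"),
    ("AlQaeda", "type", "TerroristOrg"),
}
closure = infer(asserted)
derived = closure - asserted
```

Rule 2 only fires after rule 1 has added the member statement, so the conclusion needs two passes over the facts.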
23A Web of Ontologies
[Diagram: graph over nodes A1, A2, B1, B2, B3, C1, D1, E1, F1 and S1 to S5, with edges labeled "extends", "revises", and "commits to"]
24Criticisms of the Semantic Web
- Who will create all of the RDF/OWL data?
- How do you integrate heterogeneous ontologies?
- How can you handle spam / deceit / misinformation?
- How can a system based on formal logic achieve Web scale?
25Semantic Web Scalability
- Questions
- what inference algorithms are best for large scale data?
- can AI reasoning be combined with databases to achieve the best of both worlds?
- how do we accurately evaluate systems when there is relatively little real world data available?
- how do we compare systems with very different capabilities?
26DLDB
- approach
- lightweight coupling of a database and a description logic reasoner
- optimized table design
- implementation
- DL: Description Logics (FaCT reasoner)
- rich inference capability
- close correspondence to semantics of OWL
- DB: Relational Database (Microsoft Access)
- ubiquitous DBMS for small to medium size databases
27 Design RDF(S) Entailment
- Use views to reason about class membership
<owl:Class rdf:ID="Student"/>
<owl:Class rdf:ID="UndergraduateStudent">
  <rdfs:subClassOf rdf:resource="#Student" />
</owl:Class>
CREATE VIEW Student_view AS
  SELECT * FROM Student
  UNION
  SELECT * FROM UndergraduateStudent_view
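The view-based approach can be tried out with an in-memory database. The sketch below uses SQLite through Python's standard `sqlite3` module in place of the DBMS used by DLDB, and the single `id` column and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per class holds the explicitly asserted instances.
    CREATE TABLE Student (id TEXT PRIMARY KEY);
    CREATE TABLE UndergraduateStudent (id TEXT PRIMARY KEY);

    -- A leaf class view exposes just its own table.
    CREATE VIEW UndergraduateStudent_view AS
        SELECT id FROM UndergraduateStudent;

    -- A superclass view is the union of its own table and its subclass views.
    CREATE VIEW Student_view AS
        SELECT id FROM Student
        UNION
        SELECT id FROM UndergraduateStudent_view;

    INSERT INTO Student VALUES ('alice');
    INSERT INTO UndergraduateStudent VALUES ('bob');
    """
)

# 'bob' was only asserted to be an UndergraduateStudent,
# but the view also reports him as a Student.
students = sorted(row[0] for row in conn.execute("SELECT id FROM Student_view"))
```

Because the subclass reasoning lives in the view definitions, a query for class membership needs no reasoning step at query time.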
28Design OWL Entailment
Ontology:
Student ≡ Person who takes a Course
GraduateStudent ≡ Person who takes a GraduateCourse
GraduateCourse ⊑ Course
DL Reasoner
Inferred Hierarchy:
GraduateStudent ⊑ Student
table view creation:
CREATE VIEW Student_1_view AS
  SELECT * FROM Student_1
  UNION SELECT * FROM UndergraduateStudent_1_view
  UNION SELECT * FROM GraduateStudent_1_view
Database operation
29Implementation DB Schema
Student_1
Ontologies_Index
Source_Index
TakeCourse_1
URI_Index
30Implementation Query
Query Interface application
KIF-like conjunctive query:
(Type GraduateStudent ?X) (TakeCourse ?X http://www.foo.edu/dept0/course0)
Query API
Query Translation Algorithm
SQL Sentences:
SELECT GraduateStudent_2_view.ID
FROM GraduateStudent_2_view, takeCourse_2_view
WHERE GraduateStudent_2_view.id = takeCourse_2_view.subject
  AND takeCourse_2_view.object = 'http://www.foo.edu/dept0/course0'
RDBMS
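One way such a translation step might look is sketched below. It is not DLDB's actual algorithm: the `translate` function, the tuple encoding of query atoms, the `t0`/`t1` aliases, and the simplified view names (without the ontology index suffix seen on the slide) are all assumptions. Constants become bound parameters, and a variable shared by several atoms becomes a join condition.

```python
def translate(atoms, select_var):
    """Translate a conjunctive query into one SQL SELECT over class/property views."""
    tables, conditions, params = [], [], []
    bindings = {}  # variable name -> first column it was seen in
    for i, atom in enumerate(atoms):
        alias = f"t{i}"
        if atom[0] == "Type":
            # (Type Class term) reads the class view, which has an id column.
            _, cls, term = atom
            tables.append(f"{cls}_view AS {alias}")
            columns = [(f"{alias}.id", term)]
        else:
            # (Property subject object) reads the property view.
            pred, subj, obj = atom
            tables.append(f"{pred}_view AS {alias}")
            columns = [(f"{alias}.subject", subj), (f"{alias}.object", obj)]
        for column, term in columns:
            if term.startswith("?"):
                if term in bindings:
                    conditions.append(f"{bindings[term]} = {column}")  # join
                else:
                    bindings[term] = column
            else:
                conditions.append(f"{column} = ?")  # constant as parameter
                params.append(term)
    sql = f"SELECT {bindings[select_var]} FROM {', '.join(tables)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


query = [
    ("Type", "GraduateStudent", "?X"),
    ("TakeCourse", "?X", "http://www.foo.edu/dept0/course0"),
]
sql, params = translate(query, "?X")
```

Class and property names are interpolated directly into the SQL text here, so a real implementation would need to validate them against the known schema first.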
31Benchmark System
32Initial Experiment
- Conducted in 2004
- Four systems tested
- Sesame Memory, Sesame DB, OWLJessKB, DLDB
- Five data sizes
- ranging from 15 files (8 MB) to 999 files (583 MB)
- Test Environment
- 1.8 GHz CPU / 256 MB mem / 80 GB disk / WinXP Pro
- JDK 1.4.1, 512 MB max heap size (1 GB for OWLJessKB)
- note, this is a very inexpensive platform
33Results - Completeness
34Results - Soundness
- Sesame and DLDB were sound on all queries
- OWLJessKB was unsound on some queries
- this problem has been fixed in the most recent
release of OWLJessKB
35Results Query Time Scaling
- DLDB did better on some queries, Sesame-DB did better on others
36Results - Overall
37High Performance DLDB
- Sun W2100z workstation
- dual 64-bit Opteron / 2GB / Solaris10
- RDBMS: PostgreSQL 8.0
- Racer on 2.4GHz / 256MB / Windows XP
- Racer not available for Solaris
- Two machines are connected via 100 Base-T Ethernet
- Additional features
- support for owl:inverseOf
- complete on 12 out of 14 queries
38Improved Scalability
- Unlike MS Access, PostgreSQL has no limitations on number of tables and DB sizes
- Conducted experiment with up to 13 million triples
- The load times grew about proportionally to the dataset sizes
- 2 GB disk space for the largest data set
39Real Semantic Web Data
- Used Swoogle's crawl of Semantic Web documents (RDF and OWL)
- High performance DLDB loaded 343,977 SW documents in 15.6 days
- 41,741 ontologies
- 45 million triples were stored using 8 GB
- 50,976 classes and 24,094 properties
- Sample queries showed reasonable response time
- many queries under a second
40Benchmark Architecture
41Determination of Query Completeness and Soundness
42Experimental Results and Their Interpretation
- Combined metric (multi-metrics, multi-datasets)
43Recent Work
- High performance DLDB
- DLDB's architecture allows easy composition of any SQL-compliant RDBMS and DIG-compliant DL reasoner.
- Benchmarking of other systems
44Improved Scalability (II)
- Most query response times increase linearly as the data set size increases.
- DLDB adds support for owl:inverseOf, making it complete on 12 out of 14 queries.
45Knowledge Acquisition
- data
- create or find relevant ontology
- then either
- convert existing forms to RDF
- e.g., XML, relational DBs, CGs, etc.
- information extraction
- natural language processing
- controlled English? (Sowa, yesterday)
- ontologies
- import existing ontologies
- manual creation (e.g., Protégé)
- machine learning
- formal concept analysis? (Rudolph, yesterday)
46Semantic Web Timeline
- Mar. 1996: SHOE 0.90 (simple frames in HTML)
- Jan. 1998: SHOE 1.0 (frames + Horn logic)
- Feb. 1998: XML (semi-structured data for Web)
- Sep. 1998: Berners-Lee's Semantic Web Roadmap
- Feb. 1999: RDF (semantic nets in XML)
- Mar. 2001: DAML+OIL (expressive DL in RDF)
- May 2001: Berners-Lee et al. Scientific American article
- June 2002: 1st Int'l Semantic Web Conference
- Feb. 2004: OWL (W3C Rec.)
47Semantic Web Challenges
- The Web is distributed
- many sources, varying authority
- inconsistency
- The Web is dynamic
- representational needs may change
- The Web is enormous
- systems must scale well
- The Web is an open world