Title: Towards a Science of Knowledge Base Performance Analysis

1
Towards a Science of Knowledge Base Performance Analysis

Mike Dean
mdean_at_bbn.com
4th International Workshop on
Scalable Semantic Web Knowledge Base Systems
(SSWS2008)
Karlsruhe, Germany
27 October 2008
http://asio.bbn.com/2008/10/iswc2008/mdean-ssws-2008-10-27.ppt

2
Outline

Metrics
Parliament™ Knowledge Base
Analysis of the Billion Triples Challenge Corpus
Conclusions

3
Metrics

I find it helpful to compare latencies in terms of machine instructions
3 GHz processor: 3 billion instructions/sec
Subroutine call: 10 instructions
Round-trip local host inter-process communication: 100,000 instructions
Reading 4K from a 7200 rpm SATA drive: 45 million instructions
Speed of light
4 inches: 1 instruction
Round-trip US transcontinental: 100 million instructions
Round-trip geosynchronous satellite: 1.5 billion instructions
It pays to have your data in memory whenever possible
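
The conversions on this slide can be reproduced with a little arithmetic. The sketch below is illustrative only: it assumes one instruction per clock cycle at 3 GHz, and the function names, the 15 ms disk latency, and the four-leg satellite path (request up and down, reply up and down, at geostationary altitude) are assumptions chosen to land near the slide's figures, not values taken from the talk.

```python
# Assumption: one instruction per clock cycle on a 3 GHz processor.
CLOCK_HZ = 3_000_000_000
SPEED_OF_LIGHT_M_PER_S = 299_792_458   # exact by definition of the metre
GEOSTATIONARY_ALTITUDE_M = 35_786_000  # altitude above the equator


def instructions_for(latency_seconds, clock_hz=CLOCK_HZ):
    """Instructions the processor could execute during the given latency."""
    return latency_seconds * clock_hz


def light_inches_per_instruction(clock_hz=CLOCK_HZ):
    """Distance light travels in a vacuum during one clock cycle, in inches."""
    metres = SPEED_OF_LIGHT_M_PER_S / clock_hz
    return metres / 0.0254


def geosync_round_trip_seconds():
    """Assumed path: request up and down, reply up and down (four legs)."""
    return 4 * GEOSTATIONARY_ALTITUDE_M / SPEED_OF_LIGHT_M_PER_S


# A hypothetical 15 ms disk access corresponds to the slide's 45 million.
disk_read = instructions_for(15e-3)
satellite = instructions_for(geosync_round_trip_seconds())
```

Under these assumptions light covers a little under 4 inches per instruction, and the satellite round trip comes out slightly below the slide's rounded 1.5 billion.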

4
Usual Triple Store Implementation Approaches

RDBMS
Inherent scalability and ACID properties
Generic, table-per-class, or table-per-property
Column stores (VLDB 2007 Best Paper)
B-Trees
Multiple indexes on spo, pos, osp
Can easily be distributed
Most implementations intern URIs and literal values into fixed-length integers
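
To make the last two ideas concrete, here is a minimal in-memory sketch of interning terms to integers and keeping spo, pos, and osp indexes. It is not how any particular triple store is implemented; the class and method names are hypothetical, and nested dictionaries stand in for the B-Trees or tables a real system would use.

```python
from collections import defaultdict


class TinyTripleStore:
    """Illustrative only: interned terms plus three nested-dict indexes."""

    def __init__(self):
        self._ids = {}    # term string -> integer id
        self._terms = []  # integer id -> term string
        self._spo = defaultdict(lambda: defaultdict(set))
        self._pos = defaultdict(lambda: defaultdict(set))
        self._osp = defaultdict(lambda: defaultdict(set))

    def intern(self, term):
        """Return the integer id for a term, assigning one on first sight."""
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def add(self, s, p, o):
        """Store one statement in all three indexes."""
        si, pi, oi = self.intern(s), self.intern(p), self.intern(o)
        self._spo[si][pi].add(oi)
        self._pos[pi][oi].add(si)
        self._osp[oi][si].add(pi)

    def objects(self, s, p):
        """Objects of statements with the given subject and predicate."""
        si, pi = self._ids.get(s), self._ids.get(p)
        return {self._terms[i] for i in self._spo.get(si, {}).get(pi, set())}

    def subjects(self, p, o):
        """Subjects of statements with the given predicate and object."""
        pi, oi = self._ids.get(p), self._ids.get(o)
        return {self._terms[i] for i in self._pos.get(pi, {}).get(oi, set())}

    def predicates(self, o, s):
        """Predicates linking the given subject to the given object."""
        oi, si = self._ids.get(o), self._ids.get(s)
        return {self._terms[i] for i in self._osp.get(oi, {}).get(si, set())}
```

Each index answers the query patterns whose bound terms form a prefix of its ordering, which is why several orderings are usually kept.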

5
(Ancient) History

Several mainframe technologies I used as a teenager left a lasting impression
Multics
Memory-mapped filesystem (survives as Unix mmap)
CODASYL (Network) DBMSs
Linked-list chains with hashed lookups
Page allocation and locking
Similar structure to later OODBMSs (e.g. Objectivity), which added inheritance

6
Parliament™

Lightweight embedded triple store
Started as DAML DB in September 2001
Multiple re-implementations over the years
Simple rule engine added
Now part of Asio™ tool suite
Still the primary triple store used in BBN projects
Will soon be released as open source under BSD license on SemWebCentral.org

7
Embedding

Embedded storage layer
Used with higher-level parsers, APIs, query, and reasoning mechanisms
Efficient, persistent, and scalable
Memory mapped files (same as OS virtual memory)
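
As a rough illustration of why memory-mapped files suit an embedded store, the sketch below keeps fixed-length records in a file and reads one back through a mapping, letting the operating system page data in on demand. The record layout (three 32-bit ids per statement) and the function names are hypothetical and are not Parliament's actual file format.

```python
import mmap
import struct

# Hypothetical layout: subject, predicate, object ids as little-endian uint32.
RECORD = struct.Struct("<III")


def write_records(path, triples):
    """Write each (s, p, o) id triple as one fixed-length record."""
    with open(path, "wb") as f:
        for triple in triples:
            f.write(RECORD.pack(*triple))


def read_record(path, index):
    """Read record number `index` through a read-only memory mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return RECORD.unpack_from(mapped, index * RECORD.size)
```

Because every record has the same length, locating record n is a multiplication rather than a search, and only the pages actually touched need to be resident.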

8
Example
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns='http://www.daml.org/2001/01/gedcom/gedcom#'>
  <Individual rdf:ID='thornton'>
    <name>Thornton Dean</name>
    <sex>M</sex>
    <birth>
      <Birth>
        <date>1844-05-10</date>
        <place rdf:resource="fips55VAc165"/>
      </Birth>
    </birth>
  </Individual>
  <Individual rdf:ID='sol'>
    <name>Solomon Job Hensley</name>
    <sex>M</sex>
    <birth>
      <Birth>
        <date>1855-04-12</date>
        <place rdf:resource="fips55VAc165"/>
      </Birth>
    </birth>
  </Individual>
</rdf:RDF>
9
LUBM Results
Rohloff, Dean, Emmons, Ryder, Sumner SSWS2007
10
Desires

A means of formally comparing performance between Parliament, RDBMS, and B-Tree implementations
I don't know how to do this
Probably based on counts of some shared primitive operations
Work on formal system and/or database performance models should be relevant here

11
Billion Triples Challenge

A new Semantic Web Challenge track in 2008
Do something interesting with a large subset of a billion provided triples
12 real web data sets
Not a scientific sample
Enough to be interesting and probably representative
Stable snapshot
Our analysis initially arose from discussing a possible application
We now know: yes, there is enough data to support what we wanted to do
Tools and techniques should be generally applicable to other corpora

12
Billion Triples Corpus
Data Set   Format        Triples        URLs     Size  Composition
Webscope   WARC       82,768,342   1,979,022   2.7 GB  Heterogeneous
Falcon     WARC       32,512,340     541,518   834 MB  Heterogeneous
Swoogle    WARC      174,981,639   1,468,766   3.2 GB  Heterogeneous
Watson     WARC       59,750,019     130,701   267 MB  Heterogeneous
SWSE-1     WARC       30,346,451     194,259     4 GB  Heterogeneous
SWSE-2     WARC       60,504,716     389,107   2.4 GB  Heterogeneous
DBpedia    tar.gz    110,241,463          29   1.9 GB  Homogeneous
Geonames   WARC       69,778,255   6,668,395   3.4 GB  Homogeneous
SwetoDBLP  tar.gz     14,936,600           1   167 MB  Homogeneous
WordNet    tar.gz      1,942,887           1    17 MB  Homogeneous
Freebase   tar.gz     63,069,952           1   569 MB  Heterogeneous
US Census  tar.gz    445,752,172           1   3.3 GB  Homogeneous
TOTAL              1,146,584,836  11,371,801  22.8 GB
http://www.cs.vu.nl/~pmika/swc/btc.html
13
Analysis

Stream processing of the compressed data set archives
Statement counts
Datatype, language, predicate, and type counts
Use of RDF, RDFS, OWL, FOAF, and other vocabularies
(May include duplicate statements)
Load each dataset into its own Parliament KB
(Eliminates duplicates within dataset)
(Both programs used code based on Peter Mika's WARC example with the OpenRDF Rio parser and no inference)
Process the statement and resource tables
Mark each node as resource and/or literal
URI, blank node, and literal counts
Chain length statistics and histograms
(Parliament worked very well here. Each operation took 1-736 seconds.)
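
The stream-processing pass can be approximated in a few lines. This sketch is not the program used for the talk (which built on Peter Mika's WARC example and the OpenRDF Rio parser); it assumes a gzip-compressed file with one N-Triples statement per line, uses a naive whitespace split instead of a real parser, and the function name is hypothetical.

```python
import gzip
from collections import Counter


def count_predicates(gz_path):
    """Stream a gzip-compressed N-Triples file and count each predicate."""
    counts = Counter()
    with gzip.open(gz_path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Subject and predicate never contain whitespace in N-Triples,
            # so two splits isolate them; the remainder is the object.
            parts = line.split(None, 2)
            if len(parts) < 3:
                continue  # malformed line: ignore rather than fail
            counts[parts[1]] += 1
    return counts
```

Because the archive is read line by line, memory use depends on the number of distinct predicates rather than on the number of statements, and duplicate statements are counted each time they appear.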

14
Classes and Predicates
Data Set   Classes  Predicates
Webscope       724         782
Falcon      19,660      29,248
Swoogle     33,318      33,981
Watson      13,660      18,091
SWSE-1         115       1,040
SWSE-2         104         625
DBpedia          4         288
Geonames         1          17
SwetoDBLP       11         145
WordNet         22          41
Freebase         0       5,008
US Census        8       1,682
15
Statements

Statement (subject, predicate, object)
  Resource object
    rdf:type predicate
    Other predicate
  Literal object
    rdf:datatype
    Plain literal
      xml:lang
      Neither datatype nor language
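
The taxonomy above maps directly onto a small classifier. The sketch below assumes predicates and objects are given as N-Triples term strings (literals start with a double quote); the category labels and function names are hypothetical choices for illustration.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def classify(predicate, obj):
    """Assign a statement to one of the five leaf categories."""
    if not obj.startswith('"'):
        # Resource object: URI or blank node.
        return "rdf:type" if predicate == RDF_TYPE else "rdf:resource"
    # Literal object: inspect whatever follows the closing quote.
    suffix = obj[obj.rfind('"') + 1:]
    if suffix.startswith("^^"):
        return "rdf:datatype"
    if suffix.startswith("@"):
        return "xml:lang"
    return "neither"


def category_percentages(statements):
    """Percentage of (s, p, o) statements falling in each category."""
    counts = Counter(classify(p, o) for _, p, o in statements)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Every statement lands in exactly one category, so the percentages for a data set sum to 100 before rounding.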

16
Statement (distinct values)
Dataset    rdf:type      rdf:resource  rdf:datatype  xml:lang   Neither
Webscope   24% (724)     32%           3% (10)       14% (93)   27%
Falcon     16% (19,660)  50%           16% (72)      9% (252)   18%
Swoogle    15% (33,318)  39%           2% (87)       18% (280)  26%
Watson     16% (13,660)  40%           2% (79)       29% (162)  13%
SWSE-1     13% (115)     53%           0% (1)        32% (6)    1%
SWSE-2     13% (104)     53%           0% (1)        32% (15)   1%
DBpedia    0% (4)        91%           0% (6)        8% (1)     0%
Geonames   10% (1)       49%           0% (0)        1% (342)   41%
SwetoDBLP  18% (11)      28%           14% (4)       0% (0)     41%
WordNet    24% (22)      30%           0% (1)        46% (1)    0%
Freebase   0% (0)        62%           0% (0)        19% (169)  19%
US Census  0% (8)        19%           78% (2)       0% (0)     3%
17
Resources and Literals

Node
  Resource
    URI
    Blank Node
  Literal

18
Node
Data Set   URI   Blank Node  Literal
Webscope   24%   53%         23%
Falcon     56%   13%         31%
Swoogle    31%   34%         41%
Watson     29%   32%         40%
SWSE-1     39%   36%         25%
SWSE-2     35%   42%         23%
DBpedia    74%   0%          26%
Geonames   45%   0%          55%
SwetoDBLP  27%   17%         56%
WordNet    55%   0%          45%
Freebase   52%   0%          48%
US Census  0%    98%         2%
19
Chain Lengths

How long are the linked-list chains used by Parliament?
How many statements share the same subject, predicate, or object?
Histograms proved unwieldy
Presenting summary statistics instead
rdf:type statements significantly impact results
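
The summary statistics can be sketched as follows, treating a chain as the set of statements that share a given subject, predicate, or object. This works on an in-memory list of triples rather than on Parliament's statement and resource tables, the function name is hypothetical, and the choice of population (rather than sample) standard deviation is an assumption, since the slides do not say which was used.

```python
from collections import Counter
from statistics import mean, pstdev


def chain_length_stats(triples):
    """Mean and standard deviation of statements per subject/predicate/object."""
    triples = list(triples)  # iterated once per position
    stats = {}
    for position, name in enumerate(("subject", "predicate", "object")):
        # One chain per distinct term in this position; its length is the
        # number of statements that use the term there.
        lengths = list(Counter(t[position] for t in triples).values())
        stats[name] = (mean(lengths), pstdev(lengths))
    return stats
```

A handful of very common terms, such as a popular predicate, produces a few very long chains, which shows up as a standard deviation far larger than the mean.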

20
Mean chain lengths (std dev)
Data Set   Subject      Predicate              Object         Literal Object
Webscope   3.96 (9.77)  87,900 (722,575)       3.43 (2,170)   4.33 (659)
Falcon     4.22 (13)    983 (31,773)           2.56 (328)     2.31 (217)
Swoogle    5.65 (36)    4,464 (188,023)        3.27 (1,793)   3.38 (569)
Watson     5.58 (56)    3,040 (98,288)         2.87 (918)     2.91 (407)
SWSE-1     5.25 (15)    25,404 (289,000)       2.46 (1,138)   2.29 (187)
SWSE-2     5.37 (15)    83,773 (739,736)       2.89 (1,741)   2.87 (300)
DBpedia    15 (39)      300,855 (3,560,666)    3.84 (148)     1.17 (22)
Geonames   10.4 (1.66)  4,096,150 (3,167,048)  2.81 (1,623)   1.67 (15)
SwetoDBLP  5.63 (3.82)  103,009 (325,380)      2.93 (629)     2.36 (168)
WordNet    4.18 (2.04)  47,387 (100,907)       2.53 (295)     2.39 (271)
Freebase   4.45 (15)    12,329 (316,363)       2.79 (1,286)   1.83 (116)
US Census  5.39 (9.18)  265,005 (1,921,537)    5.29 (15,916)  227 (115,616)
21
RDF/RDFS/OWL Usage

80,309,558 rdf:type statements in 11 data sets
4,033,540 rdfs:subClassOf statements in 6 data sets
2,988,396 owl:Class instances in 6 data sets
1,492,214 rdf:_1 statements in 7 data sets
1,042,032 owl:Restriction instances in 5 data sets
480,771 owl:sameAs statements in 9 data sets
299,962 rdfs:Class instances in same 6 data sets as owl:Class
238,000 reified statements in 4 data sets
50,482 instances of rdf:Bag in 5 data sets
22,154 instances of owl:Ontology in 5 data sets
14,913 owl:imports statements in 3 data sets
83 rdf:_2000 statements in 3 data sets
1 rdf:_10763 statement in 1 data set

22
Popular Vocabularies

FOAF
29,308,169 Person instances in 7 data sets
25,864,527 knows statements in 6 data sets
Dublin Core
43,591,844 title statements in 7 data sets
4,416,716 date statements in 6 data sets
Geospatial
7,075,380 wgs84_pos:lat statements in 9 data sets
4,436 georss:point statements in 5 data sets
SKOS
6,619,912 subject statements in 4 data sets
403,912 Concept instances in 4 data sets
RSS 1.0
2,893,750 item instances in 6 data sets
OWL-S
92 Profiles (versions 0.9-1.2) in 3 data sets
OWL-Time
No usage?

23
Errors

95,937 Java exceptions
Lots of bad languages and datatypes
Lots of namespace/URI typos/confusion
Slightly different statement counts, due to
exceptions, duplicates, etc.
1,063,616,774 statements (4 less)

24
Next Steps

Increased factoring of rdf:type statements
How many rdf:types are associated with each resource?
Compare to LUBM synthetic data
Analyze the combined corpus
Determine how many URIs are (still) resolvable?
Start with the predicates.
Discussion of specific datasets
SemTech 2009 submission

25
Data Set Characterization

Metrics that can impact selection/tuning of KB implementations
Statement count
Number of classes and predicates
Statements per subject/predicate/object
Degree of interconnectedness (percentage of non-literal statements, with/without rdf:type)
RDFS and OWL reasoning employed
Use of reification
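
One possible reading of the interconnectedness metric is sketched below. The slides do not give an exact definition, so this is an assumption: the "with" figure is the share of all statements whose object is not a literal, and the "without" figure is the same share after rdf:type statements are dropped from both numerator and denominator. Terms are assumed to be N-Triples strings and the function name is hypothetical.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def interconnectedness(triples):
    """Return (pct_with_rdf_type, pct_without_rdf_type) of non-literal statements."""
    triples = list(triples)
    without_type = [t for t in triples if t[1] != RDF_TYPE]

    def pct_non_literal(statements):
        if not statements:
            return 0.0
        # A literal object starts with a double quote in N-Triples.
        linked = sum(1 for _, _, o in statements if not o.startswith('"'))
        return 100.0 * linked / len(statements)

    return pct_non_literal(triples), pct_non_literal(without_type)
```

Reporting both figures separates links between instances from the class-membership statements that the earlier slides note can dominate the results.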

26
Conclusions