Towards a Science of Knowledge Base Performance Analysis

1
Towards a Science of Knowledge Base Performance Analysis
  • Mike Dean
  • mdean_at_bbn.com
  • 4th International Workshop on Scalable Semantic Web Knowledge Base
    Systems (SSWS2008)
  • Karlsruhe, Germany
  • 27 October 2008
  • http://asio.bbn.com/2008/10/iswc2008/mdean-ssws-2008-10-27.ppt

2
Outline
  • Metrics
  • Parliament™ Knowledge Base
  • Analysis of the Billion Triples Challenge Corpus
  • Conclusions

3
Metrics
  • I find it helpful to compare latencies in terms of machine
    instructions
  • 3 GHz processor: 3 billion instructions/sec
      • Subroutine call: 10 instructions
      • Round-trip local host inter-process communication: 100,000
        instructions
      • Reading 4K from a 7200 rpm SATA drive: 45 million instructions
  • Speed of light
      • 4 inches: 1 instruction
      • Round-trip US transcontinental: 100 million instructions
      • Round-trip geosynchronous satellite: 1.5 billion instructions
  • It pays to have your data in memory whenever possible
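
The conversions above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the author's calculation: it assumes one instruction per clock cycle at 3 GHz, a 15 ms disk read (the time implied by the slide's 45 million instructions), and a satellite directly overhead at the standard geosynchronous altitude, so that a round trip crosses that distance four times. All function and constant names are illustrative.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
CLOCK_HZ = 3_000_000_000              # slide's 3 GHz CPU, assumed 1 instruction/cycle
GEO_ALTITUDE_M = 35_786_000           # geosynchronous altitude above the equator


def instructions_for(seconds, clock_hz=CLOCK_HZ):
    """Number of instructions that could execute during `seconds`."""
    return round(seconds * clock_hz)


def light_inches_per_instruction(clock_hz=CLOCK_HZ):
    """Distance light travels in a vacuum during one instruction, in inches."""
    metres = SPEED_OF_LIGHT_M_PER_S / clock_hz
    return metres / 0.0254


# Assumption: a 4K read from a 7200 rpm drive costs about 15 ms (seek + rotation).
disk_read = instructions_for(0.015)

# Assumption: request and reply each go up to the satellite and back down.
geo_seconds = 4 * GEO_ALTITUDE_M / SPEED_OF_LIGHT_M_PER_S
geo_round_trip = instructions_for(geo_seconds)

print(f"disk read: {disk_read:,} instructions")
print(f"light per instruction: {light_inches_per_instruction():.2f} inches")
print(f"geosynchronous round trip: {geo_round_trip:,} instructions")
```

The geosynchronous figure comes out slightly below the slide's 1.5 billion because this sketch ignores ground-station distance and processing delay.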

4
Usual Triple Store Implementation Approaches
  • RDBMS
      • Inherent scalability and ACID properties
      • Generic, table-per-class, or table-per-property
      • Column stores (VLDB 2007 Best Paper)
  • B-Trees
      • Multiple indexes on spo, pos, osp
      • Can easily be distributed
  • Most implementations intern URIs and literal values into
    fixed-length integers
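
Interning and the three index orders can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not any particular product's design: nested Python dictionaries stand in for B-Trees, and the class and method names (TinyTripleStore, intern, add, subjects) are hypothetical.

```python
from collections import defaultdict


class TinyTripleStore:
    """Toy store: terms are interned to integers, then indexed three ways."""

    def __init__(self):
        self._ids = {}    # term string -> integer id
        self._terms = []  # integer id -> term string
        # Nested dicts stand in for the spo, pos, and osp B-Tree indexes.
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def intern(self, term):
        """Return the integer id for `term`, assigning the next id if new."""
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def add(self, subject, predicate, obj):
        s, p, o = (self.intern(t) for t in (subject, predicate, obj))
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def subjects(self, predicate, obj):
        """All subjects with the given predicate and object, via the pos index."""
        p, o = self._ids.get(predicate), self._ids.get(obj)
        if p is None or o is None:
            return set()
        matches = self.pos.get(p, {}).get(o, set())
        return {self._terms[s] for s in matches}


store = TinyTripleStore()
store.add("thornton", "rdf:type", "Individual")
store.add("sol", "rdf:type", "Individual")
store.add("thornton", "name", "Thornton Dean")
print(store.subjects("rdf:type", "Individual"))
```

Each index answers a different access pattern without scanning: spo for "everything about a subject", pos for "who has this property value", and osp for "what points at this object".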

5
(Ancient) History
  • Several mainframe technologies I used as a teenager left a lasting
    impression
  • Multics
      • Memory-mapped filesystem (survives as Unix mmap)
  • CODASYL (Network) DBMSs
      • Linked-list chains with hashed lookups
      • Page allocation and locking
      • Similar structure to later OODBMSs (e.g. Objectivity), which added
        inheritance

6
Parliament™
  • Lightweight embedded triple store
  • Started as DAML DB in September 2001
  • Multiple re-implementations over the years
  • Simple rule engine added
  • Now part of Asio™ tool suite
  • Still the primary triple store used in BBN
    projects
  • Will soon be released as open source under BSD
    license on SemWebCentral.org

7
Embedding
  • Embedded storage layer
  • Used with higher-level parsers, APIs, query, and
    reasoning mechanisms
  • Efficient, persistent, and scalable
  • Memory mapped files (same as OS virtual memory)
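
The memory-mapped approach can be sketched with Python's standard mmap module. The record layout below (three unsigned 32-bit ids per statement) is purely an assumption for illustration and is not Parliament's file format; the file name and helper names are likewise hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-length record: subject id, predicate id, object id.
RECORD = struct.Struct("<III")


def write_records(path, triples):
    """Write each (s, p, o) id triple as one fixed-length record."""
    with open(path, "wb") as f:
        for triple in triples:
            f.write(RECORD.pack(*triple))


def read_record(mapped, index):
    """Read record `index` directly from the mapping, with no explicit I/O call."""
    return RECORD.unpack_from(mapped, index * RECORD.size)


path = os.path.join(tempfile.mkdtemp(), "statements.bin")
write_records(path, [(0, 1, 2), (3, 1, 2), (0, 4, 5)])

with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mapped:
    second = read_record(mapped, 1)
    # Update the third record in place; the OS writes the page back to the file.
    RECORD.pack_into(mapped, 2 * RECORD.size, 0, 4, 6)
    mapped.flush()

# Re-read through ordinary file I/O to show the change persisted.
with open(path, "rb") as f:
    third = RECORD.unpack_from(f.read(), 2 * RECORD.size)

print(second, third)
```

Because records are fixed-length, locating record n is a multiplication rather than a search, and paging is left to the operating system's virtual memory.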

8
Example
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns='http://www.daml.org/2001/01/gedcom/gedcom#'>
  <Individual rdf:ID='thornton'>
    <name>Thornton Dean</name>
    <sex>M</sex>
    <birth>
      <Birth>
        <date>1844-05-10</date>
        <place rdf:resource="fips55VAc165"/>
      </Birth>
    </birth>
  </Individual>
  <Individual rdf:ID='sol'>
    <name>Solomon Job Hensley</name>
    <sex>M</sex>
    <birth>
      <Birth>
        <date>1855-04-12</date>
        <place rdf:resource="fips55VAc165"/>
      </Birth>
    </birth>
  </Individual>
</rdf:RDF>
9
LUBM Results
Rohloff, Dean, Emmons, Ryder, Sumner (SSWS2007)
10
Desires
  • A means of formally comparing performance between
    Parliament, RDBMS, and B-Tree implementations
  • I don't know how to do this
  • Probably based on counts of some shared primitive
    operations
  • Work on formal system and/or database performance
    models should be relevant here

11
Billion Triples Challenge
  • A new Semantic Web Challenge track in 2008
  • Do something interesting with a large subset of
    a billion provided triples
  • 12 real web data sets
  • Not a scientific sample
  • Enough to be interesting and probably
    representative
  • Stable snapshot
  • Our analysis initially arose from discussing a
    possible application
  • We now know: yes, there is enough data to support
    what we wanted to do
  • Tools and techniques should be generally
    applicable to other corpora

12
Billion Triples Corpus
Data Set   Format        Triples        URLs     Size  Composition
Webscope   WARC       82,768,342   1,979,022   2.7 GB  Heterogeneous
Falcon     WARC       32,512,340     541,518   834 MB  Heterogeneous
Swoogle    WARC      174,981,639   1,468,766   3.2 GB  Heterogeneous
Watson     WARC       59,750,019     130,701   267 MB  Heterogeneous
SWSE-1     WARC       30,346,451     194,259     4 GB  Heterogeneous
SWSE-2     WARC       60,504,716     389,107   2.4 GB  Heterogeneous
DBpedia    tar.gz    110,241,463          29   1.9 GB  Homogeneous
Geonames   WARC       69,778,255   6,668,395   3.4 GB  Homogeneous
SwetoDBLP  tar.gz     14,936,600           1   167 MB  Homogeneous
WordNet    tar.gz      1,942,887           1    17 MB  Homogeneous
Freebase   tar.gz     63,069,952           1   569 MB  Heterogeneous
US Census  tar.gz    445,752,172           1   3.3 GB  Homogeneous
TOTAL              1,146,584,836  11,371,801  22.8 GB
http://www.cs.vu.nl/~pmika/swc/btc.html
13
Analysis
  • Stream processing of the compressed data set archives
      • Statement counts
      • Datatype, language, predicate, and type counts
      • Use of RDF, RDFS, OWL, FOAF, and other vocabularies
      • (May include duplicate statements)
  • Load each dataset into its own Parliament KB
      • (Eliminates duplicates within dataset)
      • (Both programs used code based on Peter Mika's WARC example with
        the OpenRDF RIO parser and no inference)
  • Process the statement and resource tables
      • Mark each node as resource and/or literal
      • URI, blank node, and literal counts
      • Chain length statistics and histograms
      • (Parliament worked very well here. Each operation took 1-736
        seconds.)
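
The streaming pass can be approximated in a few lines. The sketch below is not the program used for the talk (which read WARC archives through the OpenRDF RIO parser); it assumes gzip-compressed N-Triples input and uses a deliberately simplified regular-expression parser that skips lines it cannot match. The function name analyze and the sample data are illustrative.

```python
import gzip
import io
import re
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified N-Triples patterns: enough for a sketch, not a conforming parser.
TRIPLE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+<([^>]*)>\s+(.*?)\s*\.\s*$')
LITERAL = re.compile(
    r'^"((?:[^"\\]|\\.)*)"(?:\^\^<([^>]*)>|@([A-Za-z0-9-]+))?$'
)


def analyze(lines):
    """Single pass over N-Triples lines, counting without storing statements."""
    counts = {
        "statements": 0,
        "predicates": Counter(),
        "types": Counter(),
        "datatypes": Counter(),
        "languages": Counter(),
        "plain_literals": 0,
    }
    for line in lines:
        match = TRIPLE.match(line)
        if not match:
            continue  # blank line, comment, or malformed statement
        _subject, predicate, obj = match.groups()
        counts["statements"] += 1
        counts["predicates"][predicate] += 1
        literal = LITERAL.match(obj)
        if literal:
            _value, datatype, language = literal.groups()
            if datatype:
                counts["datatypes"][datatype] += 1
            elif language:
                counts["languages"][language] += 1
            else:
                counts["plain_literals"] += 1
        elif predicate == RDF_TYPE and obj.startswith("<"):
            counts["types"][obj[1:-1]] += 1
    return counts


# Tiny made-up sample, compressed in memory to mimic reading an archive.
sample = (
    '<http://example.org/thornton> <' + RDF_TYPE + '> <http://example.org/Individual> .\n'
    '<http://example.org/thornton> <http://example.org/name> "Thornton Dean" .\n'
    '<http://example.org/thornton> <http://example.org/date> '
    '"1844-05-10"^^<http://www.w3.org/2001/XMLSchema#date> .\n'
    '_:b1 <http://example.org/label> "Karlsruhe"@de .\n'
)
compressed = gzip.compress(sample.encode("utf-8"))
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as lines:
    counts = analyze(lines)

print(counts["statements"], dict(counts["types"]))
```

Like the first pass described above, this counts every statement it sees, so duplicates are included; memory use depends on the number of distinct predicates, types, datatypes, and languages rather than on corpus size.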

14
Classes and Predicates
Data Set   Classes  Predicates
Webscope       724         782
Falcon      19,660      29,248
Swoogle     33,318      33,981
Watson      13,660      18,091
SWSE-1         115       1,040
SWSE-2         104         625
DBpedia          4         288
Geonames         1          17
SwetoDBLP       11         145
WordNet         22          41
Freebase         0       5,008
US Census        8       1,682
15
Statements
  • Statement (subject, predicate, object)
      • Resource object
          • rdf:type predicate
          • Other predicate
      • Literal object
          • rdf:datatype
          • Plain literal
              • xml:lang
              • Neither datatype nor language

16
Statement % (distinct values)
Dataset    rdf:type     rdf:resource  rdf:datatype  xml:lang  Neither
Webscope   24 (724)     32            3 (10)        14 (93)   27
Falcon     16 (19,660)  50            16 (72)       9 (252)   18
Swoogle    15 (33,318)  39            2 (87)        18 (280)  26
Watson     16 (13,660)  40            2 (79)        29 (162)  13
SWSE-1     13 (115)     53            0 (1)         32 (6)    1
SWSE-2     13 (104)     53            0 (1)         32 (15)   1
DBpedia    0 (4)        91            0 (6)         8 (1)     0
Geonames   10 (1)       49            0 (0)         1 (342)   41
SwetoDBLP  18 (11)      28            14 (4)        0 (0)     41
WordNet    24 (22)      30            0 (1)         46 (1)    0
Freebase   0 (0)        62            0 (0)         19 (169)  19
US Census  0 (8)        19            78 (2)        0 (0)     3
17
Resources and Literals
  • Node
      • Resource
          • URI
          • Blank Node
      • Literal

18
Node %
Data Set   URI  Blank Node  Literal
Webscope    24          53       23
Falcon      56          13       31
Swoogle     31          34       41
Watson      29          32       40
SWSE-1      39          36       25
SWSE-2      35          42       23
DBpedia     74           0       26
Geonames    45           0       55
SwetoDBLP   27          17       56
WordNet     55           0       45
Freebase    52           0       48
US Census    0          98        2
19
Chain Lengths
  • How long are the linked-list chains used by
    Parliament?
  • How many statements share the same subject,
    predicate, or object?
  • Histograms proved unwieldy
  • Presenting summary statistics instead
  • rdf:type statements significantly impact results
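
Treating a chain as the set of statements that share one subject, predicate, or object, the summary statistics can be computed as sketched below. This is an illustration over an in-memory list of made-up triples, not Parliament's table scan, and it assumes the population standard deviation; the slides do not say which variant was used.

```python
from collections import Counter
from statistics import mean, pstdev

SUBJECT, PREDICATE, OBJECT = 0, 1, 2


def chain_stats(triples, position):
    """Mean and population std dev of statements per distinct term at `position`."""
    chain_lengths = Counter(triple[position] for triple in triples)
    lengths = list(chain_lengths.values())
    return mean(lengths), pstdev(lengths)


# Made-up data: one frequent predicate ("p") and one rare one ("q").
triples = [
    ("a", "p", "x"),
    ("a", "q", "y"),
    ("b", "p", "x"),
    ("c", "p", "z"),
]

subject_mean, subject_sd = chain_stats(triples, SUBJECT)
predicate_mean, predicate_sd = chain_stats(triples, PREDICATE)
print(f"subject chains:   mean {subject_mean:.2f}, std dev {subject_sd:.2f}")
print(f"predicate chains: mean {predicate_mean:.2f}, std dev {predicate_sd:.2f}")
```

Even in this tiny example the predicate chains are longer than the subject chains, because a few predicates are reused across many subjects.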

20
Mean chain lengths (std dev)
Data Set   Subject      Predicate              Object         Literal Object
Webscope   3.96 (9.77)  87,900 (722,575)       3.43 (2,170)   4.33 (659)
Falcon     4.22 (13)    983 (31,773)           2.56 (328)     2.31 (217)
Swoogle    5.65 (36)    4,464 (188,023)        3.27 (1,793)   3.38 (569)
Watson     5.58 (56)    3,040 (98,288)         2.87 (918)     2.91 (407)
SWSE-1     5.25 (15)    25,404 (289,000)       2.46 (1,138)   2.29 (187)
SWSE-2     5.37 (15)    83,773 (739,736)       2.89 (1,741)   2.87 (300)
DBpedia    15 (39)      300,855 (3,560,666)    3.84 (148)     1.17 (22)
Geonames   10.4 (1.66)  4,096,150 (3,167,048)  2.81 (1,623)   1.67 (15)
SwetoDBLP  5.63 (3.82)  103,009 (325,380)      2.93 (629)     2.36 (168)
WordNet    4.18 (2.04)  47,387 (100,907)       2.53 (295)     2.39 (271)
Freebase   4.45 (15)    12,329 (316,363)       2.79 (1,286)   1.83 (116)
US Census  5.39 (9.18)  265,005 (1,921,537)    5.29 (15,916)  227 (115,616)
21
RDF/RDFS/OWL Usage
  • 80,309,558 rdf:type statements in 11 data sets
  • 4,033,540 rdfs:subClassOf statements in 6 data sets
  • 2,988,396 owl:Class instances in 6 data sets
  • 1,492,214 rdf:_1 statements in 7 data sets
  • 1,042,032 owl:Restriction instances in 5 data sets
  • 480,771 owl:sameAs statements in 9 data sets
  • 299,962 rdfs:Class instances in same 6 data sets as owl:Class
  • 238,000 reified statements in 4 data sets
  • 50,482 instances of rdf:Bag in 5 data sets
  • 22,154 instances of owl:Ontology in 5 data sets
  • 14,913 owl:imports statements in 3 data sets
  • 83 rdf:_2000 statements in 3 data sets
  • 1 rdf:_10763 statement in 1 data set

22
Popular Vocabularies
  • FOAF
      • 29,308,169 Person instances in 7 data sets
      • 25,864,527 knows statements in 6 data sets
  • Dublin Core
      • 43,591,844 title statements in 7 data sets
      • 4,416,716 date statements in 6 data sets
  • Geospatial
      • 7,075,380 wgs84_pos:lat statements in 9 data sets
      • 4,436 georss:point statements in 5 data sets
  • SKOS
      • 6,619,912 subject statements in 4 data sets
      • 403,912 Concept instances in 4 data sets
  • RSS 1.0
      • 2,893,750 item instances in 6 data sets
  • OWL-S
      • 92 Profiles (versions 0.9-1.2) in 3 data sets
  • OWL-Time
      • No usage?

23
Errors
  • 95,937 Java exceptions
  • Lots of bad languages and datatypes
  • Lots of namespace/URI typos/confusion
  • Slightly different statement counts, due to
    exceptions, duplicates, etc.
  • 1,063,616,774 statements (4 less)

24
Next Steps
  • Increased factoring of rdf:type statements
      • How many rdf:types are associated with each resource?
  • Compare to LUBM synthetic data
  • Analyze the combined corpus
  • Determine how many URIs are (still) resolvable?
    Start with the predicates.
  • Discussion of specific datasets
  • SemTech 2009 submission

25
Data Set Characterization
  • Metrics that can impact selection/tuning of KB implementations
      • Statement count
      • Number of classes and predicates
      • Statements per subject/predicate/object
      • Degree of interconnectedness (percentage of non-literal
        statements, with/without rdf:type)
      • RDFS and OWL reasoning employed
      • Use of reification
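
One possible reading of the interconnectedness metric is sketched below. The slides do not define it precisely, so two assumptions are made here: the "with rdf:type" figure is the share of all statements whose object is a resource, and the "without rdf:type" figure drops rdf:type statements from both the numerator and the denominator. The function name and input shape are illustrative.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def interconnectedness(statements):
    """Return (percent with rdf:type, percent without rdf:type).

    `statements` yields (subject, predicate, object, object_is_literal) tuples.
    """
    total = non_literal = type_statements = 0
    for _subject, predicate, _obj, object_is_literal in statements:
        total += 1
        if not object_is_literal:
            non_literal += 1
            if predicate == RDF_TYPE:
                type_statements += 1
    with_type = 100.0 * non_literal / total if total else 0.0
    remaining = total - type_statements
    without_type = (
        100.0 * (non_literal - type_statements) / remaining if remaining else 0.0
    )
    return with_type, without_type


# Made-up data: one rdf:type statement, one resource link, two literals.
statements = [
    ("ex:thornton", RDF_TYPE, "ex:Individual", False),
    ("ex:thornton", "ex:birth", "ex:birth1", False),
    ("ex:thornton", "ex:name", "Thornton Dean", True),
    ("ex:thornton", "ex:sex", "M", True),
]
with_type, without_type = interconnectedness(statements)
print(f"{with_type:.1f}% with rdf:type, {without_type:.1f}% without")
```

Reporting both figures matters because rdf:type statements link instances only to classes, which inflates the apparent connectivity between instances.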

26
Conclusions
  • Needs
  • Better means of formally characterizing KB
    implementations and data sets
  • Please help!

27
More Information
  • http://parliament.projects.semwebcentral.org
      • Parliament download (soon)
  • http://asio.bbn.com/2008/10/btc/
      • Full raw Billion Triples Corpus analysis results