Title: Li Ding
1Enhancing Web-scale Semantic Web Data Access
- Li Ding
- Knowledge Systems, AI Lab, Stanford University
- Joint work with Tim Finin, Anupam Joshi, Rong Pan, Pavan Reddivari, Joel Sachs, and Yun Peng
- Presented in logic group, Computer Science Department, Stanford University
- November 15, 2006
2The Semantic Web is Simple
- Each URI denotes a concept, and optionally a web address
- URIs are connected by triples, i.e. a simple relational database
- Machines read data as a directed RDF graph
- Ontologies, e.g. RDFS, OWL, are supported by
inference-engines
Don't say "colour", say <http://example.com/2002/std6#col>
Relational database
RDF (Resource Description Framework)
Picture source: Tim Berners-Lee, "Putting the Web back into Semantic Web", ISWC 2005 Keynote
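The idea that URIs connected by triples form a directed graph can be illustrated with a minimal sketch. The `EX` namespace and the three triples below are made-up example data, not taken from the talk; only the FOAF namespace is a real vocabulary.

```python
from collections import defaultdict

# Hypothetical example namespace; FOAF is the real FOAF vocabulary namespace.
EX = "http://example.com/people#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Each triple is (subject, predicate, object), like a row in a 3-column table.
triples = [
    (EX + "li", FOAF + "knows", EX + "tim"),
    (EX + "li", FOAF + "name", "Li Ding"),
    (EX + "tim", FOAF + "name", "Tim Finin"),
]

def as_graph(triples):
    """Index triples as a directed, edge-labelled graph: subject -> [(predicate, object)]."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)

graph = as_graph(triples)
for predicate, obj in graph[EX + "li"]:
    print(predicate, "->", obj)
```

The same list of tuples can be read either as rows of a relation or as labelled edges, which is the point of the two views on this slide.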
3Motivation Connecting Producers and Consumers
IWBrowser
Swoop
WWW
FOAF
OWL
?
consume
Tabulator
publish
- To use the Semantic Web on the Web, we need to know
- How is semantic content distributed on the Web?
- Where and how can it be accessed and reused?
- What are the useful portions?
- The above problems also occur when integrating information from unknown sources
4Outline
- To characterize the Semantic Web on the Web
- Describing the Semantic Web and its Context
- Measuring the availability and utility of online semantic content
- To enhance Web-scale Semantic Web data access
- Harvesting online semantic content semi-automatically
- Enhancing Semantic Web data access with search
- Discussion: How to improve user experiences
- Granularity of Semantic Web knowledge
- Ontology inference, integration, evolution, and lifecycle
- Knowledge provenance
- Information integration and social network
- Conclusion
5How to describe the Semantic Web and Its Context
- Scope: three worlds and provenance
- Conceptual model: our focus
- Vocabulary
Li Ding and Tim Finin, "Characterizing the
Semantic Web on the Web", in Proceedings of the
5th International Semantic Web Conference,
(ISWC'06) November 2006
6The three worlds to be modeled
The Agent World
legends
trusts
subClassOf modal assertion provenance
Agent
Person
publishes
creates
The RDF Graph World
believes
describes
RDF graph
RDF Resource
The Web
serializes
defines
Semantic Web Document
Ontology
Web Document
7Our Conceptual Model
SWNamespace
officialOntology
6
linked in RDF graph
use
5
use
4
RDF
1
SWTerm
define reference
Web
instantiate
8
3
SWOntology
SWDocument
2
7
subClassOf
imports
hyperlink
SWD = SWDocument: Semantic Web Document.
SWO = SWOntology: Semantic Web Ontology.
SWT = SWTerm: Semantic Web Term, i.e. a URI that is observed being used as a class or property.
SWN = SWNamespace: Semantic Web namespace.
8Vocabulary
meta-description (SWD, SWT)
http://www.cs.umbc.edu/finin/foaf.rdf
SWT
rdf:type
foaf:Person
instantiate class
foaf:mbox
instantiate property
finin_at_umbc.edu
SWD
reference property
http://foo.com/mappings.rdf
sub-class
rdfs:subPropertyOf
SWO
foaf:maker
docreator
reference class
http://xmlns.com/foaf/1.0/
rdfs:subClassOf
SWN
wordNet:Agent
foaf:Person
define Class
rdf:type
rdfs:Class
rdfs:domain
http://xmlns.com/foaf/1.0/
foaf:mbox
define Property
rdf:Property
rdf:type
SW Semantic Web
9Measurements and Observations (based on data collected by May 2006)
- Scale of the Semantic Web
- SWD
- SWDs per website: power distribution
- Size: biased by test/auto-generated files; embedded SWDs are small
- Age: exponential for SWD; flat tail for SWO
- Size change: usually an increase; some SWDs do lose triples
- SWT
- Usage: few are really defined and populated
- Definition quality, instance space: power distribution
- Instance properties of class: power distribution
Li Ding and Tim Finin, "Characterizing the
Semantic Web on the Web", in Proceedings of the
5th International Semantic Web Conference,
(ISWC'06) November 2006
10The Scale of the Semantic Web
Statistics based on Semantic Web data indexed by Swoogle
Year   Terms      Documents   Individuals   Triples     Bytes
       (million)  (million)   (million)     (million)   (billion)
2004   0.15       0.33        7.3           48          4.3
2006   1.9        1.6         16            276         47
2008   10         100         1000          20,000      3000

Estimated number of documents based on Google query
               Docs    Corresponding Google query
Optimistic     10^9    rdf OR inurl:rss OR inurl:foaf -filetype:html
Conservative   10^5    rdf filetype:rdf
11Where the semantic content is from
- Instance data are mainly from com (>39% pure SWDs)
- Ontologies are mainly from non-profit organizations (>46% org) and academia (>14% edu)
- One IP may be shared by many websites (domain names) using virtual hosting technology.
Note: statistics of top-level domains are also used in characterizing the Web (Henzinger and Lawrence 2004)
12Major Semantic Web data sources
- FOAF, DC, RSS dominate
- More languages: PML, VML
- More domains: Book, BBC program
- Newly reported sources: DBLP in RDF
The unpinged column gives the number of URLs
discovered on the site that are suspected of also
being Semantic Web documents but have not yet
been processed.
13Source websites of SWD
Jan 2005- Aug 2005
Jan 2005- Mar 2006
- Invariant power distribution found!
- The number of websites hosting more than m SWDs follows a power law distribution
- Similar to the Web
- Head: virtual hosting
- Tail: crawling strategy
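A rough way to check a claim of this kind is to count, for each m, the websites hosting more than m SWDs and look at the slope on a log-log scale. The sketch below is only an illustration: the function names are hypothetical, the input would be a list of SWD counts per website, and a least-squares line on log-log points is a quick diagnostic rather than a rigorous power-law fit.

```python
import math

def ccdf(swds_per_site):
    """Return [(m, number of sites hosting more than m SWDs)] for each observed m."""
    points = []
    for m in sorted(set(swds_per_site)):
        n = sum(1 for count in swds_per_site if count > m)
        if n > 0:  # skip the largest m, where no site hosts more
            points.append((m, n))
    return points

def loglog_slope(points):
    """Least-squares slope of log(n) against log(m); roughly constant for a power law."""
    xs = [math.log(m) for m, _ in points]
    ys = [math.log(n) for _, n in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up counts for four sites, just to show the shape of the output.
print(ccdf([1, 1, 2, 5]))
```

A straight line over several orders of magnitude would be consistent with the power law described above; deviations at the head and tail are what the last two bullets attribute to virtual hosting and crawling strategy.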
14Size of SWD
- Embedded SWDs are small
- 69% have 3 triples
- 96% have <10 triples
- Pure SWDs
- 60% have 5 to 1000 triples.
- Special size of RSS: 130
- 17 triples for channel
- 7 triples for each of the 15 items
- SWOs
- Biased by PML
- Small ones from RDF test
- Largest has 1M triples
[Figure: number of SWDs and number of SWOs versus number of triples]
15Age of SWD
- Measured by the last-modified time of SWD
- PSWD: exponential distribution
- SWO: flat tail
16Size Change of SWD
- Measured by comparing the size of two versions of the same PSWD
- #swd: the number of PSWDs having that change
- delta: the number of triples affected
- Observations: overall increase in 183,464 alive, pure SWDs
- Decrease: 37,012 PSWDs lost a total of 1,774,161 triples
- Increase: 73,964 PSWDs gained a total of 6,064,218 triples
- Unchanged: 72,488 PSWDs
[Figure: number of PSWDs (#swd) versus number of triples changed (delta)]
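The counts on this slide can be cross-checked with simple arithmetic: the three groups should add up to the 183,464 alive pure SWDs, and the net change in triples follows from the gains and losses. The dictionary layout below is just a convenient way to hold the slide's figures.

```python
# Figures copied from the slide; losses are stored as negative triple counts.
groups = {
    "decrease": {"pswds": 37_012, "triples": -1_774_161},
    "increase": {"pswds": 73_964, "triples": 6_064_218},
    "unchanged": {"pswds": 72_488, "triples": 0},
}

total_pswds = sum(g["pswds"] for g in groups.values())
net_triples = sum(g["triples"] for g in groups.values())

print(total_pswds)  # 183464
print(net_triples)  # 4290057
```

The total matches the 183,464 documents quoted in the observations, and the net gain of about 4.3 million triples is what makes the overall trend an increase.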
17SWT definition and usage
18How are Semantic Web Terms used?
- All usage distributions follow a power distribution
- Few SWTs have been well populated
- 371 have >100 class instances
- 1,208 have >100 property instances
19Well instantiated terms
20Swoogle Enhancing Web-scale Semantic Web Data
Access
- Architecture
- Hybrid crawling semi-automatically
- Metadata design
- Semantic Web surfing and ranking
- Search based Semantic Web data access
21Swoogle Architecture
Analysis
Ranking
SWD classifier
Index
Search Services
Semantic Web metadata
IR Indexer
Web Service
Web Server
SWD Indexer
html
rdf/xml
Discovery
the Web
document cache
SwoogleBot
Semantic Web
Candidate URLs
Bounded Web Crawler Google Crawler
human
machine
Legends
Information flow
Swoogle's web interface
22A Hybrid Harvesting Framework
true
Swoogle Sample Dataset
Manual submission
Inductive learner
would
Seeds R
Seeds M
Seeds H
RDF crawling
Bounded HTML crawling
Meta crawling
google
Google API call
crawl
crawl
the Web
23Harvest Performance (May 2006)
- Confirmed 1.6M SWDs among 3.7M URLs
- Average daily discovery 6800 URLs, 3000 SWDs
- Also noticed that over 100K SWDs have gone offline
Swoogle started crawling data in 2004 and was restarted in January 2005.
Ping: fetch a document from a URL.
PSWD: pure SWD, i.e. not embedded, which can be correctly parsed by an RDF parser (e.g. Jena).
24Site-wise Coverage Swoogle vs Google
- We compare the number of SWDs per website
- Similar trends with variations
- Google estimation is usually too optimistic
- Swoogle sometimes finds more SWDs
25Summary Swoogle Metadata
- Metadata for SWD and SWO
- document metadata
- semantic content summary
- content annotation
- Metadata for SWT
- URI, local-name
- aggregated annotation
- Relational metadata
- SWD and SWT: six types of meta-description relation
- SWT and SWT: inductive domain/range relation
- SWD and SWN: namespace (prefix) usage
- Triples
- Triples describing SWD
- Triples describing SWT: CBD (Stickler 04)
26Semantic Web Surfing
- Semantic Web Surfing is not simply Web surfing
- Hyperlinks
- Same URI or instance
- Motivating Scenarios
- Search the best Person ontology (Web)
- List ontologies importing FOAF ontology (Web)
- List all instance data of foaf:Person (DB)
- List popular instance properties of foaf:Person (KB)
- Approaches
- Build conceptual search and navigation model
- Support the conceptual model using Swoogle
27Current Search and Navigation
SWNamespace
officialOnto
6
links in RDF graph
use
5
4
SWTerm
use
1
def,ref
pop
8
3
SWOntology
SWDocument
2
7
subClassOf
imports
link
Conventional Web Search
- Search the best Person ontology -- OK but not perfect
- List ontologies importing FOAF ontology -- hard
- List all instance data of foaf:Person -- hard
- List popular instance properties of foaf:Person -- hard
28Search Engine based Search and Navigation
- Semantic Web Search Engine
- New paths
- reverse links: A, B, C, D
- Enhanced paths
- term usage: 3, 8
- double links: 2, 7
- inductive path: 4
- heuristic path: 6
SWNamespace
6
5
B
4
A
SWTerm
1
8
3
D
SWOntology
SWDocument
C
2
7
subClassOf
Semantic Web Search
- Search the best Person ontology -- YES, semantic web search
- List ontologies importing FOAF ontology -- YES, use 7
- List all instance data of foaf:Person -- YES, use C
- List popular instance properties of foaf:Person -- YES, use 4
29Ranking using rational surfing model
- Ranking both SWTs and SWDs using the following guidelines
- Pursue term definition
- Follow links to other SWDs
- Terminate/Restart surfing
hasDefinitionIn
SELF
o1
swt1
Uniform random jump
imports
Source
o2
d1
hasDefinitionIn
IMP
swt2
hasOfficialOntology
ns2
Uniform Random jump
EXT
swt3
hasNamespace
LINK
node4
d2
sameURL
STOP
STOP
node5
30Simple Examples
[Figure: three example graphs annotated with node ranks and link weights]
(a) Simple DAG
(b) Simple Loop
(c) 10-node graph with loop, clique, and self links
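The surfing guidelines (follow weighted links, or stop and restart with a uniform random jump) resemble a weighted random-surfer ranking. The sketch below is a generic weighted PageRank-style power iteration, not Swoogle's actual ranking code; the function name, the damping value of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def weighted_rank(links, damping=0.85, iterations=50):
    """Rank nodes of a weighted link graph given as {source: {target: weight}}.

    Each node keeps a base score of (1 - damping), modelling the random
    restart, and passes the rest of its rank to its targets in proportion
    to link weight. Rank held by nodes with no out-links is simply not
    passed on in this simplified version.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    rank = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        new_rank = {node: 1.0 - damping for node in nodes}
        for source, targets in links.items():
            total = sum(targets.values())
            if total == 0:
                continue
            for target, weight in targets.items():
                new_rank[target] += damping * rank[source] * weight / total
        rank = new_rank
    return rank

# A two-node loop and a small DAG with link weights 1 and 2 (weights are made up).
loop = weighted_rank({"d1": {"d2": 1}, "d2": {"d1": 1}})
dag = weighted_rank({"a": {"b": 1, "c": 2}})
print(loop)
print(dag)
```

In the loop both nodes end up with equal rank, while in the DAG the target of the heavier link ranks higher, which is the qualitative behaviour the example graphs are meant to convey.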
31Swoogle Rank (May 2006)
[Figure: link graph among ten highly ranked Semantic Web namespaces; rank positions and link weights omitted]
http://www.w3.org/2000/01/rdf-schema  indegree=432,984, mean(inflow)=0.039
http://www.w3.org/1999/02/22-rdf-syntax-ns  indegree=1,077,768, mean(inflow)=0.100
http://purl.org/rss/1.0  indegree=270,178, mean(inflow)=0.168
http://www.w3.org/2002/07/owl  indegree=86,959, mean(inflow)=0.069
http://web.resource.org/cc  indegree=57,066, mean(inflow)=0.195
http://www.w3.org/2001/vcard-rdf/3.0  indegree=155,949, mean(inflow)=0.036
http://purl.org/dc/elements/1.1  indegree=861,416, mean(inflow)=0.096
http://www.hackcraft.net/bookrdf/vocab/0_1/  indegree=16,380, mean(inflow)=0.167
http://purl.org/dc/terms  indegree=54,909, mean(inflow)=0.042
http://xmlns.com/foaf/0.1/index.rdf  indegree=512,790, mean(inflow)=0.217
Computed using Swoogle metadata by May 2006
32Swoogle based Search and Navigation
RDF graph World
Swoogle Term Search
5
SWTerm
SWNamespace
6
7
9
4
8
2
Swoogle Document Search
The Web
subClassOf
SWOntology
SWDocument
1
3
10
search SWD
list in-link
list SWT
list SWD
1. Search the best Person ontology (Web)
3. List ontologies importing FOAF ontology (Web)
4. List all instance data of foaf:Person (DB)
5. List popular instance properties of foaf:Person (KB)
33Swoogle web-scale semantic web data access
the Web
agent
harvest RDF data
ask (person)
Search vocabulary
Search URIrefs in SW vocabulary
inform (foaf:Person)
Compose query
ask (?x rdf:type foaf:Person)
Search URLs in SWD index
Populate RDF database
inform (doc URLs)
Fetch docs
Query local RDF database
http://swoogle.umbc.edu/
34Triple Shop SPARQL dataset finder
Who knows Anupam Joshi? Show me their names, email addresses, and pictures
1. Compose a SPARQL query without FROM clause
2. Parse SPARQL query, search Swoogle for
related URLs, and compose a dataset
3. Run SPARQL query on dataset
http://sparql.cs.umbc.edu/tripleshop2/
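The three steps can be sketched as follows. This is not the Triple Shop implementation: the query text is one plausible rendering of the question on the slide, the helper names are hypothetical, the parsing is a simple regular-expression approximation rather than a SPARQL parser, and the search step is replaced by a hand-written list of example URLs because no Swoogle API is assumed here.

```python
import re

# Step 1: a SPARQL query with no FROM clause (illustrative wording of the question).
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox ?depiction
WHERE {
  ?person foaf:knows ?joshi .
  ?joshi foaf:name "Anupam Joshi" .
  ?person foaf:name ?name .
  OPTIONAL { ?person foaf:mbox ?mbox }
  OPTIONAL { ?person foaf:depiction ?depiction }
}
"""

def extract_term_uris(query):
    """Step 2a: expand prefixed names (e.g. foaf:knows) into full term URIs."""
    prefixes = dict(re.findall(r"PREFIX\s+(\w*):\s*<([^>]*)>", query, flags=re.IGNORECASE))
    body = re.sub(r"<[^>]*>", "", query)  # drop full IRIs so only prefixed names remain
    terms = set()
    for prefix, local in re.findall(r"\b(\w+):(\w+)", body):
        if prefix in prefixes:
            terms.add(prefixes[prefix] + local)
    return sorted(terms)

def add_from_clauses(query, urls):
    """Step 2b: turn candidate document URLs into FROM clauses before WHERE."""
    from_lines = "".join(f"FROM <{url}>\n" for url in urls)
    return re.sub(r"\bWHERE\b", lambda match: from_lines + match.group(0), query, count=1)

terms = extract_term_uris(QUERY)
# Placeholder for the search step: these URLs are invented examples.
candidate_urls = ["http://example.org/a.rdf", "http://example.org/b.rdf"]
query_with_dataset = add_from_clauses(QUERY, candidate_urls)
print(terms)
print(query_with_dataset)
```

Step 3 would hand `query_with_dataset` to any SPARQL engine that loads the documents named in the FROM clauses.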
35Discussion How to Improve User Experiences
36Granularity of Semantic Web knowledge
Universal RDF Graph
The Semantic Web About 211M documents
Semantic Web Documents
Physically host knowledge; about 200 triples on average
Instances / Named-graph
cluster of triples highly relevant
Molecules
Finest lossless set of triples in RDF graph
Triples
Atomic knowledge block
- Who will publish/consume semantic content at what granularity?
- Documents are used to publish Semantic Web data
- End users tend to access data in terms of ABox, or using SPARQL on the universal RDF graph
- RDF molecules help with signing RDF graphs, tracking provenance, and more
- Triples are managed by triple stores
37Integrated Ontology Dictionary
- Reverse engineering TBox from observed ABox
- Building block is TBox
- Ontology-documents and namespaces enable grouping
Onto 1
Onto 2
rdf:type
owl:Class
foaf:name
rdfs:domain
foaf:Person
foaf:Person
foaf:Agent
rdfs:subClassOf
foaf:name
rdfs:domain
rdf:type
owl:Class
wob:hasInstanceDomain
foaf:Person
wob:hasInstanceDomain
foaf:Agent
dc:title
rdfs:subClassOf
SWD3
foaf:name
Tim Finin
rdf:type
foaf:Person
dc:title
Dr.
http://swoogle.umbc.edu/2005/modules.php?name=Ontology_Dictionary
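Reverse engineering TBox-style information from observed instance data can be sketched as counting which classes the subjects of each property belong to. The code below is an illustrative assumption of how such "instance domain" evidence might be collected; the function name and the blank-node label `_:x` are hypothetical, and the three triples mirror the SWD3 example in the diagram.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

def instance_domains(triples):
    """Count (property, class) pairs where an instance of the class uses the property."""
    types = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types.setdefault(subject, set()).add(obj)
    domains = Counter()
    for subject, predicate, obj in triples:
        if predicate != RDF_TYPE:
            for cls in types.get(subject, ()):
                domains[(predicate, cls)] += 1
    return domains

# Instance data as in the SWD3 example: a foaf:Person with a name and a title.
swd3 = [
    ("_:x", "foaf:name", "Tim Finin"),
    ("_:x", RDF_TYPE, "foaf:Person"),
    ("_:x", "dc:title", "Dr."),
]
domains = instance_domains(swd3)
print(domains)
```

Here `dc:title` is observed on a `foaf:Person` instance even though no ontology declares that domain, which is the kind of induced relation the dictionary records alongside the declared `rdfs:domain` statements.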
38Life cycle Semantic Web Archive
- Like Internet Archive: RDF graph > URL > version
- Track the evolution of an ontology, e.g., the Protege ontology
- Track the growth of instance data, e.g., a FOAF document.
- Permanent URI for different versions
39Tracking Provenance via RDF Molecule
decompose
The graph's RDF molecules
An RDF graph G
http://www.cs.umbc.edu/dingli1
t1
foaf:knows
foaf:name
t2
t1
Li Ding
foaf:name
t2
t3
t4
Tim Finin
foaf:mbox
t3
t4
t4
mailto:finin_at_umbc.edu
Match sub-Graph
Web pages containing one or more molecules
discovered by Swoogle
Ding, L., Finin, T., Peng, Y., Pinheiro da Silva, P., and McGuinness, D.L., "Tracking RDF Graph Provenance using RDF Molecules", Proceedings of the Fourth International Semantic Web Conference (poster), November 2005, http://www-ksl.stanford.edu/KSL_Abstracts/KSL-05-06.html
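A simple way to see what decomposing a graph into lossless pieces means is a naive split by blank nodes: triples without blank nodes stand alone, and triples that share a blank node must stay together. The sketch below implements only this naive variant under stated assumptions (blank nodes are strings starting with `_:`, the function name is hypothetical); it ignores any finer, ontology-aware decomposition, so its output may be coarser than the molecules drawn on the slide.

```python
def naive_molecules(triples):
    """Group triples so that triples sharing a blank node stay in one group."""
    parent = {}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    molecules, with_blanks = [], []
    for triple in triples:
        blanks = [term for term in (triple[0], triple[2]) if term.startswith("_:")]
        if not blanks:
            molecules.append([triple])  # ground triple: a molecule on its own
            continue
        for blank in blanks:
            parent.setdefault(blank, blank)
        parent[find(blanks[0])] = find(blanks[-1])  # link subject and object blanks
        with_blanks.append((triple, blanks[0]))

    groups = {}
    for triple, blank in with_blanks:
        groups.setdefault(find(blank), []).append(triple)
    molecules.extend(groups.values())
    return molecules

# The example graph G, with the person known by Li Ding modelled as blank node _:b1.
g = [
    ("http://www.cs.umbc.edu/dingli1", "foaf:knows", "_:b1"),
    ("http://www.cs.umbc.edu/dingli1", "foaf:name", "Li Ding"),
    ("_:b1", "foaf:name", "Tim Finin"),
    ("_:b1", "foaf:mbox", "mailto:finin_at_umbc.edu"),
]
molecules = naive_molecules(g)
print(molecules)
```

Every triple lands in exactly one group, so merging the groups gives back the original graph; matching a group against other documents is then a sub-graph match as described above.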
40Integrating multiple social networks (FOAF and DBLP) and discovering interesting semantic associations
Real world
FOAF
DBLP
Pranam Kolari
Anupam Joshi
Anupam Joshi
coauthor
same
knows
knows
Tim Finin
Timothy W. Finin
knows
Amit P. Sheth
Amit Sheth
coauthor
sname_3_6
sname_3_6
sname_3_5
sname_3_5
sname_3_5
SName3(with mid-initial)
SName2(no mid-initial)
T. Finin
P. Kolari
A. Sheth
T. W. Finin
A. P. Sheth
A. Joshi
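The two short-name forms in the diagram (with and without a middle initial) suggest a simple normalization for matching people across FOAF and DBLP. The sketch below is an assumption about how such keys could be derived; the function name is hypothetical and real name matching would need to handle many more cases.

```python
def short_names(full_name):
    """Return (sname2, sname3): initials without and with middle initials.

    sname3 is None when the name has no middle part.
    """
    parts = full_name.replace(".", "").split()
    first, last = parts[0], parts[-1]
    middles = parts[1:-1]
    sname2 = f"{first[0]}. {last}"
    sname3 = None
    if middles:
        initials = " ".join(f"{middle[0]}." for middle in middles)
        sname3 = f"{first[0]}. {initials} {last}"
    return sname2, sname3

print(short_names("Tim Finin"))         # ('T. Finin', None)
print(short_names("Timothy W. Finin"))  # ('T. Finin', 'T. W. Finin')
print(short_names("Amit P. Sheth"))     # ('A. Sheth', 'A. P. Sheth')
```

Two records that share the no-middle-initial form, such as "Tim Finin" in FOAF and "Timothy W. Finin" in DBLP, become candidates for the "same" link shown in the figure.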
41Conclusion Discussion
42Conclusions and Future Work
- Summary
- Characterized the Semantic Web on the Web
- Built Swoogle for Semantic Web surfing
- Investigated issues on user experiences
- How to promote the Semantic Web
- More ontologies with reasoner support
- We have OWL, RDFS, and we are working on policy, rule languages
- We may also need temporal-spatial reasoning soon
- More data
- DBLP is online, how to use it? And what is next?
- How to encourage end users to input semantic content?
- More user applications
- How to align consumers' ontology and publishers' ontology
- How to lower the learning curve for programmers
- More showcases to promote the benefits
- Controlled common vocabulary
- Expressive class description, and subsumption inference
- Declarative programming?