1
Enhancing Web-scale Semantic Web Data Access
  • Li Ding
  • Knowledge Systems, AI Lab, Stanford University
  • Joint work with Tim Finin, Anupam Joshi, Rong
    Pan, Pavan Reddivari, Joel Sachs, and Yun Peng
  • Presented to the Logic Group, Computer Science
    Department, Stanford University
  • November 15, 2006

2
The Semantic Web is Simple
  • Each URI denotes a concept and, optionally, a web
    address
  • URIs are connected by triples, i.e. a simple
    relational database
  • Machines read the data as a directed RDF graph
  • Ontology languages, e.g. RDFS and OWL, are
    supported by inference engines
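
A minimal sketch of the "triples as a directed graph" idea above. The `ex:` identifiers and the sample data are hypothetical and only illustrate the structure; a real application would use an RDF library rather than plain tuples.

```python
from collections import defaultdict

# Hypothetical example data: each triple is (subject, predicate, object).
triples = [
    ("ex:li", "foaf:knows", "ex:tim"),
    ("ex:li", "foaf:name", "Li Ding"),
    ("ex:tim", "foaf:name", "Tim Finin"),
]

def build_graph(triples):
    """Index triples as a directed graph: subject -> [(predicate, object)]."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return dict(graph)

def objects(graph, subject, predicate):
    """Follow the edges labeled `predicate` that leave `subject`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]

graph = build_graph(triples)
# Two-hop traversal: names of everyone ex:li knows.
friend_names = [name
                for friend in objects(graph, "ex:li", "foaf:knows")
                for name in objects(graph, friend, "foaf:name")]
print(friend_names)
```

Each triple is one labeled edge, so a "join" in the relational view becomes a path traversal in the graph view.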

Don't say "colour", say <http://example.com/2002/std6#col>
[Figure panels: Relational database; RDF (Resource Description Framework)]
Picture source: Tim Berners-Lee, "Putting the Web
back into Semantic Web", ISWC 2005 keynote
3
Motivation: Connecting Producers and Consumers
[Figure labels: publish, consume, WWW, FOAF, OWL, IWBrowser, Swoop, Tabulator, "?"]
  • To use the Semantic Web on the Web, we need to
    know
      • How is semantic content distributed on the Web?
      • Where and how can it be accessed and reused?
      • Which portions are useful?
  • The same problems also occur when integrating
    information from unknown sources

4
Outline
  • To characterize the Semantic Web on the Web
      • Describing the Semantic Web and its context
      • Measuring the availability and utility of online
        semantic content
  • To enhance Web-scale Semantic Web data access
      • Harvesting online semantic content
        semi-automatically
      • Enhancing Semantic Web data access with search
  • Discussion: how to improve user experiences
      • Granularity of Semantic Web knowledge
      • Ontology inference, integration, evolution, and
        lifecycle
      • Knowledge provenance
      • Information integration and social networks
  • Conclusion

5
How to describe the Semantic Web and Its Context
  • Scope: three worlds and provenance
  • Conceptual model: our focus
  • Vocabulary

Li Ding and Tim Finin, "Characterizing the
Semantic Web on the Web", in Proceedings of the
5th International Semantic Web Conference,
(ISWC'06) November 2006
6
The three worlds to be modeled
[Figure: the three worlds and the relations among them]
  • The Agent World: Agent, Person
  • The RDF Graph World: RDF graph, RDF Resource
  • The Web: Web Document, Semantic Web Document,
    Ontology
  • Relations shown: trusts, believes, publishes,
    creates, describes, serializes, defines
  • Legend: subClassOf, modal assertion, provenance
7
Our Conceptual Model
[Figure: conceptual model with four node types (SWNamespace, SWTerm, SWOntology, SWDocument) and links numbered 1 to 8. Link labels: officialOntology, use, define, reference, instantiate, subClassOf, imports, hyperlink. Annotations: "linked in RDF graph", "RDF", "Web".]
  • SWD (SWDocument): Semantic Web Document
  • SWO (SWOntology): Semantic Web Ontology
  • SWT (SWTerm): Semantic Web Term, i.e. a URI that
    is observed being used as a class or property
  • SWN (SWNamespace): Semantic Web namespace
8
Vocabulary
[Figure: meta-description relations between SWDs and SWTs, with examples]
  • SWD http://www.cs.umbc.edu/finin/foaf.rdf
      • instantiate class: rdf:type foaf:Person
      • instantiate property: foaf:mbox
        finin_at_umbc.edu
  • SWD http://foo.com/mappings.rdf
      • reference property: rdfs:subPropertyOf
        (foaf:maker, docreator)
      • reference class: rdfs:subClassOf
        (wordNet:Agent, foaf:Person)
  • SWO http://xmlns.com/foaf/1.0/ (also the SWN
    http://xmlns.com/foaf/1.0/)
      • define class: foaf:Person rdf:type rdfs:Class
      • define property: foaf:mbox rdf:type
        rdf:Property, with rdfs:domain
SW = Semantic Web
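
The meta-description relations named on this slide (define, reference, and instantiate, each for classes and properties) can be sketched as a small classifier over a document's triples. The rules below are a simplified assumption made for illustration, not Swoogle's actual rules; the prefixed names and sample triples are hypothetical.

```python
RDF_TYPE = "rdf:type"
CLASS_TYPES = {"rdfs:Class", "owl:Class"}
PROPERTY_TYPES = {"rdf:Property", "owl:ObjectProperty", "owl:DatatypeProperty"}

def meta_usage(triples):
    """Return a set of (term, relation) pairs for one document.

    Simplified, assumed rules:
      (t rdf:type <class type>)     -> t is defined as a class
      (t rdf:type <property type>)  -> t is defined as a property
      (_ rdf:type t)                -> t is instantiated as a class
      (_ rdfs:subClassOf t)         -> t is referenced as a class
      (_ rdfs:domain / range t)     -> t is referenced as a class
      (_ rdfs:subPropertyOf t)      -> t is referenced as a property
      (_ p _) for any other p       -> p is instantiated as a property
    """
    usage = set()
    for s, p, o in triples:
        if p == RDF_TYPE and o in CLASS_TYPES:
            usage.add((s, "define class"))
        elif p == RDF_TYPE and o in PROPERTY_TYPES:
            usage.add((s, "define property"))
        elif p == RDF_TYPE:
            usage.add((o, "instantiate class"))
        elif p in ("rdfs:subClassOf", "rdfs:domain", "rdfs:range"):
            usage.add((o, "reference class"))
        elif p == "rdfs:subPropertyOf":
            usage.add((o, "reference property"))
        else:
            usage.add((p, "instantiate property"))
    return usage

# Hypothetical instance document and ontology fragment.
instance_doc = [
    ("ex:finin", "rdf:type", "foaf:Person"),
    ("ex:finin", "foaf:mbox", "mailto:someone@example.org"),
]
ontology_doc = [
    ("foaf:Person", "rdf:type", "rdfs:Class"),
    ("foaf:Person", "rdfs:subClassOf", "wordNet:Agent"),
    ("foaf:mbox", "rdf:type", "rdf:Property"),
]
print(sorted(meta_usage(instance_doc)))
print(sorted(meta_usage(ontology_doc)))
```

Under these assumed rules, the same term (here foaf:Person) is defined by one document and merely instantiated by another, which is the distinction the figure draws.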
9
Measurements and Observations (based on data
collected by May 2006)
  • Scale of the Semantic Web
  • SWD
      • SWDs per website: power distribution
      • Size: biased by test and auto-generated files;
        embedded SWDs are small
      • Age: exponential for SWD; flat tail for SWO
      • Size change: usually an increase; some SWDs do
        lose triples
  • SWT
      • Usage: few are really defined and populated
      • Definition quality; instance space: power
        distribution
      • Instance properties of class: power distribution

Li Ding and Tim Finin, "Characterizing the
Semantic Web on the Web", in Proceedings of the
5th International Semantic Web Conference,
(ISWC'06) November 2006
10
The Scale of the Semantic Web
Statistics based on Semantic Web data indexed by
Swoogle

Year  Terms      Documents  Individuals  Triples    Bytes
      (million)  (million)  (million)    (million)  (billion)
2004  0.15       0.33       7.3          48         4.3
2006  1.9        1.6        16           276        47
2008  10         100        1000         20,000     3000

Estimated number of documents based on Google
queries

Estimate      Docs  Corresponding Google query
Optimistic    10^9  rdf OR inurl:rss OR inurl:foaf -filetype:html
Conservative  10^5  rdf filetype:rdf
11
Where the semantic content comes from
  • Instance data are mainly from .com (>39% of pure
    SWDs)
  • Ontologies are mainly from non-profit
    organizations (>46% .org) and academia (>14% .edu)
  • One IP address may be shared by many websites
    (domain names) through virtual hosting

Note: statistics by top-level domain are also used
in characterizing the Web (Henzinger and Lawrence
2004)
12
Major Semantic Web data sources
  • FOAF, DC, and RSS dominate
  • More languages: PML, VML
  • More domains: books, BBC programs
  • Newly reported sources: DBLP in RDF

The "unpinged" column gives the number of URLs
discovered on the site that are suspected of also
being Semantic Web documents but have not yet
been processed.
13
Source websites of SWD
[Figure panels: Jan 2005 to Aug 2005; Jan 2005 to Mar 2006]
  • Invariant power distribution found!
  • The number of websites hosting more than m SWDs
    follows a power-law distribution
  • Similar to the Web
  • Head: virtual hosting
  • Tail: crawling strategy
14
Size of SWD
  • Embedded SWDs are small
      • 69% have 3 triples
      • 96% have <10 triples
  • Pure SWDs
      • 60% have 5 to 1000 triples
      • Special size of RSS: 130
          • 17 triples for the channel
          • 7 triples for each of the 15 items
  • SWOs
      • Biased by PML
      • Small ones come from RDF tests
      • The largest has 1M triples

[Figure: number of SWDs and number of SWOs vs. # of triples]
15
Age of SWD
  • Measured by the last-modified time of SWD
  • PSWD: exponential distribution
  • SWO: flat tail

16
Size Change of SWD
  • Measured by comparing the size of two versions of
    the same PSWD
      • #swd: the number of PSWDs having that change
      • delta: the number of triples affected
  • Observation: overall increase across 183,464 live,
    pure SWDs
      • Decrease: 37,012 PSWDs lost a total of
        1,774,161 triples
      • Increase: 73,964 PSWDs gained a total of
        6,064,218 triples
      • Unchanged: 72,488 PSWDs
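
The figures above are internally consistent, and the net change follows by simple arithmetic. The short check below uses only the numbers from the slide; the variable names are illustrative.

```python
# Counts and triple deltas as reported on the slide.
decreased = {"pswds": 37_012, "triples": -1_774_161}
increased = {"pswds": 73_964, "triples": 6_064_218}
unchanged = {"pswds": 72_488, "triples": 0}

groups = [decreased, increased, unchanged]
total_pswds = sum(g["pswds"] for g in groups)
net_triples = sum(g["triples"] for g in groups)

print(total_pswds)  # 183464, the number of live pure SWDs compared
print(net_triples)  # 4290057, the net gain in triples
```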

17
SWT definition and usage
18
How are Semantic Web Terms used?
  • All usage distributions follow a power distribution
  • Few SWTs have been well populated
      • 371 have >100 class instances
      • 1,208 have >100 property instances

19
Well instantiated terms
20
Swoogle: Enhancing Web-scale Semantic Web Data
Access
  • Architecture
  • Hybrid, semi-automatic crawling
  • Metadata design
  • Semantic Web surfing and ranking
  • Search-based Semantic Web data access

21
Swoogle Architecture
[Figure: Swoogle architecture and information flow]
  • Discovery: SwoogleBot, Bounded Web Crawler, Google
    Crawler, Candidate URLs (inputs: the Web, the
    Semantic Web)
  • Analysis: SWD classifier, Ranking
  • Index: IR Indexer, SWD Indexer
  • Search Services: Web Server (html, for humans),
    Web Service (rdf/xml, for machines)
  • Data: Semantic Web metadata, document cache
  • Also shown: Swoogle's web interface
22
A Hybrid Harvesting Framework
[Figure: hybrid harvesting framework. Components: Manual submission, Inductive learner, Swoogle Sample Dataset, Seeds R, Seeds M, Seeds H, RDF crawling, Bounded HTML crawling, Meta crawling (Google API call), the Web. Edge labels: crawl, true, would.]
23
Harvest Performance (May 2006)
  • Confirmed 1.6M SWDs among 3.7M URLs
  • Average daily discovery 6800 URLs, 3000 SWDs
  • Also noticed that over 100K SWDs have gone offline

Swoogle started crawling data in 2004; we
restarted in Jan 2005. Ping: fetch a document from
a URL. PSWD: pure SWD, i.e. not embedded, and can
be parsed correctly by an RDF parser (e.g. Jena)
24
Site-wise Coverage: Swoogle vs. Google
  • We compare the number of SWDs per website
  • Similar trends, with variations
  • Google's estimate is usually too optimistic
  • Swoogle sometimes finds more SWDs

25
Summary: Swoogle Metadata
  • Metadata for SWD and SWO
      • document metadata
      • semantic content summary
      • content annotation
  • Metadata for SWT
      • URI, local name
      • aggregated annotation
  • Relational metadata
      • SWD and SWT: six types of meta-description
        relation
      • SWT and SWT: inductive domain/range relation
      • SWD and SWN: namespace (prefix) usage
  • Triples
      • Triples describing SWD
      • Triples describing SWT: CBD (Stickler 04)

26
Semantic Web Surfing
  • Semantic Web surfing is not simply Web surfing
      • Hyperlinks
      • Same URI or instance
  • Motivating scenarios
      • Search the best Person ontology (Web)
      • List ontologies importing the FOAF ontology (Web)
      • List all instance data of foaf:Person (DB)
      • List popular instance properties of foaf:Person
        (KB)
  • Approaches
      • Build a conceptual search and navigation model
      • Support the conceptual model using Swoogle

27
Current Search and Navigation
[Figure: the conceptual model (SWNamespace, SWTerm, SWOntology, SWDocument) with links numbered 1 to 8. Link labels: officialOnto, links in RDF graph, use, def, ref, pop, subClassOf, imports, link.]
Conventional Web Search
  1. Search the best Person ontology: OK but not
    perfect
  2. List ontologies importing FOAF ontology: hard
  3. List all instance data of foaf:Person: hard
  4. List popular instance properties of foaf:Person:
    hard

28
Search Engine based Search and Navigation
  • Semantic Web Search Engine
  • New paths
      • reverse links: A, B, C, D
  • Enhanced paths
      • term usage: 3, 8
      • double links: 2, 7
      • inductive path: 4
      • heuristic path: 6

[Figure: the conceptual model (SWNamespace, SWTerm, SWOntology, SWDocument, subClassOf) with paths 1 to 8 and the new reverse links A to D]
Semantic Web Search
  1. Search the best Person ontology: YES,
    semantic web search
  2. List ontologies importing FOAF ontology: YES,
    use 7
  3. List all instance data of foaf:Person: YES,
    use C
  4. List popular instance properties of foaf:Person:
    YES, use 4

29
Ranking using the rational surfing model
  • Rank both SWTs and SWDs, with a surfer that
    follows these guidelines
      • Pursue term definitions
      • Follow links to other SWDs
      • Terminate or restart surfing
[Figure: rational surfing model. Nodes: Source, o1, o2, d1, d2, swt1, swt2, swt3, ns2, node4, node5. Link labels: hasDefinitionIn, imports, hasOfficialOntology, hasNamespace, sameURL. Surfer moves: SELF, IMP, EXT, LINK, uniform random jump, STOP.]
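
As a generic illustration of the surfing guidelines on this slide, the sketch below runs a weighted random-surfer iteration in which stronger links (for example, links toward a term's defining ontology) carry more of a node's score and dead ends trigger a uniform restart. This is a standard weighted PageRank-style computation, not Swoogle's exact ranking; the damping value, link weights, and node names are assumptions.

```python
def weighted_rank(links, damping=0.85, iterations=100):
    """links maps node -> {target: weight}; returns node -> score (sums to 1)."""
    nodes = sorted(set(links) | {t for out in links.values() for t in out})
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the uniform random-jump share first.
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links.get(v, {})
            total = sum(out.values())
            if total == 0:
                # Dead end: the surfer restarts with a uniform random jump.
                for u in nodes:
                    nxt[u] += damping * score[v] / n
            else:
                # Split the score among out-links in proportion to weight.
                for u, w in out.items():
                    nxt[u] += damping * score[v] * w / total
        score = nxt
    return score

# Hypothetical graph: document d1 uses ontology o1 heavily (weight 2)
# and links to document d2; d2 also uses o1.
example_links = {
    "d1": {"o1": 2.0, "d2": 1.0},
    "d2": {"o1": 1.0},
    "o1": {},
}
scores = weighted_rank(example_links)
print(max(scores, key=scores.get))
```

The link weights are where a "rational" preference would be encoded: raising the weight of definition-seeking links shifts rank toward ontologies.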
30
Simple Examples
[Figure: example rank computations on three small graphs, with node ranks and link weights (in thirds) marked on the diagrams: (a) Simple DAG; (b) Simple Loop; (c) 10-node graph with loop, clique, and self links]
31
Swoogle Rank (May 2006)
[Figure: ten ontologies, numbered 1 to 10, with weighted links between them; per-node statistics are listed below and link weights are omitted]
  • http://www.w3.org/2000/01/rdf-schema:
    indegree 432,984, mean(inflow) 0.039
  • http://www.w3.org/1999/02/22-rdf-syntax-ns:
    indegree 1,077,768, mean(inflow) 0.100
  • http://purl.org/rss/1.0:
    indegree 270,178, mean(inflow) 0.168
  • http://www.w3.org/2002/07/owl:
    indegree 86,959, mean(inflow) 0.069
  • http://web.resource.org/cc:
    indegree 57,066, mean(inflow) 0.195
  • http://www.w3.org/2001/vcard-rdf/3.0:
    indegree 155,949, mean(inflow) 0.036
  • http://purl.org/dc/elements/1.1:
    indegree 861,416, mean(inflow) 0.096
  • http://www.hackcraft.net/bookrdf/vocab/0_1/:
    indegree 16,380, mean(inflow) 0.167
  • http://purl.org/dc/terms:
    indegree 54,909, mean(inflow) 0.042
  • http://xmlns.com/foaf/0.1/index.rdf:
    indegree 512,790, mean(inflow) 0.217
Computed using Swoogle metadata by May 2006
32
Swoogle based Search and Navigation
[Figure: the conceptual model split into the RDF graph world (SWTerm, SWNamespace; entry point: Swoogle Term Search) and the Web (SWOntology, SWDocument, subClassOf; entry point: Swoogle Document Search), with paths numbered 1 to 10. Operations: search SWD, list in-link, list SWT, list SWD.]
  1. Search the best Person ontology (Web)
  3. List ontologies importing FOAF ontology (Web)
  4. List all instance data of foaf:Person (DB)
  5. List popular instance properties of
    foaf:Person (KB)
33
Swoogle: web-scale semantic web data access
[Figure: message flow between an agent, Swoogle, and the Web]
  • Agent steps: Search vocabulary; Compose query;
    Fetch docs (harvest RDF data from the Web);
    Populate RDF database; Query local RDF database
  • Vocabulary lookup: ask (person); Swoogle searches
    URIrefs in the SW vocabulary; inform (foaf:Person)
  • Document lookup: ask (?x rdf:type foaf:Person);
    Swoogle searches URLs in the SWD index; inform
    (doc URLs)
http://swoogle.umbc.edu/
34
Triple Shop: SPARQL dataset finder
Who knows Anupam Joshi? Show me their names,
email addresses, and pictures
  1. Compose a SPARQL query without a FROM clause
  2. Parse the SPARQL query, search Swoogle for
    related URLs, and compose a dataset
  3. Run the SPARQL query on the dataset
http://sparql.cs.umbc.edu/tripleshop2/
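
Steps 1 and 2 above can be sketched as follows: pull the vocabulary terms out of a FROM-less query, look up documents that use them, and splice the resulting URLs in as FROM clauses. The in-memory `TERM_INDEX` is a hypothetical stand-in for a Swoogle lookup, the example.org URLs are made up, and the regular expression is a simplification rather than a real SPARQL parser.

```python
import re

# Hypothetical stand-in for a Swoogle lookup: term -> URLs of documents using it.
TERM_INDEX = {
    "foaf:knows": ["http://example.org/a.rdf", "http://example.org/b.rdf"],
    "foaf:name": ["http://example.org/b.rdf"],
}

QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE { ?p foaf:knows ?q . ?p foaf:name ?name }
"""

def query_terms(query):
    """Collect prefixed names (e.g. foaf:knows) used in the WHERE body."""
    body = query[query.upper().index("WHERE"):]
    return sorted(set(re.findall(r"\b[A-Za-z][\w-]*:[A-Za-z_][\w-]*", body)))

def add_dataset(query, index):
    """Insert one FROM clause per document that uses any query term."""
    urls = sorted({u for t in query_terms(query) for u in index.get(t, [])})
    from_clauses = "".join(f"FROM <{u}>\n" for u in urls)
    pos = query.upper().index("WHERE")
    return query[:pos] + from_clauses + query[pos:]

print(add_dataset(QUERY, TERM_INDEX))
```

Step 3 would hand the rewritten query to any SPARQL engine; the point of the design is that the user never has to know which documents hold the answer.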
35
Discussion: How to Improve User Experiences
36
Granularity of Semantic Web knowledge
[Figure: levels of granularity, from coarsest to finest]
  • Universal RDF Graph: the Semantic Web; about 211M
    documents
  • Semantic Web Documents: physically host
    knowledge; about 200 triples on average
  • Instances / Named graphs: clusters of highly
    relevant triples
  • Molecules: the finest lossless sets of triples in
    an RDF graph
  • Triples: atomic knowledge blocks

  • Who will publish or consume semantic content, and
    at what granularity?
  • Documents are used to publish Semantic Web data
  • End users tend to access data in terms of the
    ABox, or using SPARQL on the universal RDF graph
  • RDF molecules help with signing RDF graphs,
    tracking provenance, and more
  • Triples are managed by triple stores

37
Integrated Ontology Dictionary
  • Reverse engineering TBox from observed ABox
  • Building block is TBox
  • Ontology-documents and namespaces enable grouping

[Figure: dictionary entries assembled from two ontologies (Onto 1, Onto 2) and one instance document (SWD3). Ontology statements shown: rdf:type owl:Class; rdfs:subClassOf; foaf:name rdfs:domain; terms foaf:Person and foaf:Agent. Induced statements: wob:hasInstanceDomain for foaf:Person and foaf:Agent, including dc:title. SWD3 instance data: foaf:name "Tim Finin", dc:title "Dr.", rdf:type foaf:Person.]
http://swoogle.umbc.edu/2005/modules.php?name=Ontology_Dictionary
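
"Reverse engineering TBox from observed ABox" can be illustrated by counting, for each property, the classes of the instances it is actually used on. The sketch below is an assumption about how a relation such as wob:hasInstanceDomain might be induced, not the dictionary's actual procedure; the instance data mirror the SWD3 example in the figure, with a hypothetical subject identifier.

```python
from collections import Counter

def instance_domains(triples):
    """Count (property, class) pairs observed in instance data.

    A pair (p, C) is counted once for every triple (s, p, o) whose
    subject s is declared as an instance of C via rdf:type.
    """
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    counts = Counter()
    for s, p, o in triples:
        if p != "rdf:type":
            for cls in types.get(s, ()):
                counts[(p, cls)] += 1
    return counts

# Instance data in the spirit of SWD3; "ex:x" is a made-up subject.
swd3 = [
    ("ex:x", "rdf:type", "foaf:Person"),
    ("ex:x", "foaf:name", "Tim Finin"),
    ("ex:x", "dc:title", "Dr."),
]
domains = instance_domains(swd3)
print(domains)
```

Here dc:title gains foaf:Person as an observed instance domain even though no ontology declares it, which is the kind of usage-derived knowledge the dictionary adds to the declared TBox.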
38
Life cycle: Semantic Web Archive
  • Like the Internet Archive: RDF graph > URL >
    version
  • Track the evolution of an ontology, e.g., the
    Protege ontology
  • Track the growth of instance data, e.g., a FOAF
    document
  • Permanent URI for different versions

39
Tracking Provenance via RDF Molecule
[Figure: an RDF graph G about http://www.cs.umbc.edu/dingli1, with triples t1 to t4 (foaf:knows; foaf:name "Li Ding"; foaf:name "Tim Finin"; foaf:mbox mailto:finin_at_umbc.edu), is decomposed into RDF molecules. The molecules are then matched as sub-graphs against Web pages, discovered by Swoogle, that contain one or more of them.]
Ding, L., Finin, T., Peng, Y., Pinheiro da Silva,
P., McGuinness, D.L. Tracking RDF Graph
Provenance using RDF Molecules. Proceedings of
the Fourth International Semantic Web Conference
(poster), November 2005.
http://www-ksl.stanford.edu/KSL_Abstracts/KSL-05-06.html
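
A simplified sketch of the decomposition step, under the assumption that blank nodes are what tie triples together: triples that share a blank node (directly or transitively) stay in one molecule, and every triple without blank nodes forms its own molecule. This is only the basic blank-node grouping; the cited paper describes the actual decompositions. The sample graph and the `_:` convention for blank nodes are illustrative.

```python
def is_blank(node):
    """Blank nodes are written with the conventional '_:' prefix."""
    return node.startswith("_:")

def decompose(triples):
    """Split a list of (s, p, o) triples into molecules (lists of triples)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Union the blank nodes that co-occur in a triple.
    for s, _, o in triples:
        blanks = [n for n in (s, o) if is_blank(n)]
        if len(blanks) == 2:
            parent[find(blanks[0])] = find(blanks[1])

    molecules, groups = [], {}
    for t in triples:
        blanks = [n for n in (t[0], t[2]) if is_blank(n)]
        if not blanks:
            molecules.append([t])  # ground triple: a molecule on its own
        else:
            groups.setdefault(find(blanks[0]), []).append(t)
    return molecules + list(groups.values())

# Hypothetical graph: the person known by ex:li is a blank node.
graph_g = [
    ("ex:li", "foaf:knows", "_:b1"),
    ("_:b1", "foaf:name", "Tim Finin"),
    ("_:b1", "foaf:mbox", "mailto:someone@example.org"),
    ("ex:li", "foaf:name", "Li Ding"),
]
molecules = decompose(graph_g)
print(len(molecules))
```

The decomposition is lossless because the union of the molecules is the original graph, and no molecule can be split further without separating triples that share a blank node.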
40
Integrating multiple social networks (FOAF and
DBLP) and discovering interesting semantic
associations
[Figure: the same real-world people appear in FOAF and in DBLP under different name forms, and are linked through derived short names]
  • Real world / FOAF / DBLP name forms: Pranam
    Kolari; Anupam Joshi; Tim Finin and Timothy W.
    Finin; Amit P. Sheth and Amit Sheth
  • Relations: knows, coauthor, same
  • Derived names: SName3 (with middle initial), e.g.
    T. W. Finin, A. P. Sheth; SName2 (no middle
    initial), e.g. T. Finin, P. Kolari, A. Sheth,
    A. Joshi
  • Link labels: sname_3_5, sname_3_6
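
The two derived name forms in the figure, SName2 (no middle initial) and SName3 (with middle initial), can be sketched as below. The function and the "same SName2 means candidate match" rule are assumptions for illustration; the slide does not spell out the matching procedure.

```python
def short_names(full_name):
    """Return (sname2, sname3) for a name such as 'Timothy W. Finin'.

    sname2 uses the first initial and last name; sname3 also keeps the
    first middle initial and is None when there is no middle name.
    """
    parts = full_name.replace(".", "").split()
    first, last, middle = parts[0], parts[-1], parts[1:-1]
    sname2 = f"{first[0]}. {last}"
    sname3 = f"{first[0]}. {middle[0][0]}. {last}" if middle else None
    return sname2, sname3

def candidate_match(name_a, name_b):
    """Assumed rule: two names are candidates if their SName2 forms agree."""
    return short_names(name_a)[0] == short_names(name_b)[0]

print(short_names("Timothy W. Finin"))
print(short_names("Tim Finin"))
print(candidate_match("Amit P. Sheth", "Amit Sheth"))
```

Short-name agreement only proposes candidates; initials collide easily, so a real integration would need further evidence (for example, shared coauthors or email addresses) before asserting that two entries are the same person.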
41
Conclusion and Discussion
42
Conclusions and Future Work
  • Summary
  • Characterized the Semantic Web on the Web
  • Built Swoogle for Semantic Web surfing
  • Investigated issues on user experiences
  • How to promote the Semantic Web
  • More ontologies with reasoner support
  • We have OWL and RDFS, and we are working on policy
    and rule languages
  • We may also need temporal-spatial reasoning soon
  • More data
  • DBLP is online; how should we use it, and what is
    next?
  • How to encourage end users to input semantic
    content?
  • More user applications
  • How to align consumers' ontologies with
    publishers' ontologies
  • How to lower the learning curve for programmers
  • More showcases to promote the benefits
  • Controlled common vocabulary
  • Expressive class description and subsumption
    inference
  • Declarative programming?