Title: Knowledge Access: Semantic technology for KM
- John Davies
- BT Research
john.nj.davies_at_bt.com
- Introduction to the Semantic Web
- Language stack
- Semantic Search and Browse
- Knowledge Sharing
- Natural Language Generation & Summarisation
- Knowledge Delivery via Device Independence
- Quiz!
3Limitations of the Web today
- Machine-to-human, not machine-to-machine
4The Semantic Web
- allowing information to be shared and processed
- adding context and structure
- "an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation" (Tim Berners-Lee)
- An open platform
5Semantic Web
The Semantic Web is an extension of the current
web in which information is given well-defined
meaning, better enabling computers and people to
work in co-operation. Berners-Lee et al.,
6... Semantic Web HISTORY
10.2.2004: Resource Description Framework (RDF) and Web Ontology Language (OWL) become W3C
Source: http://www.zakon.org/robert/internet/tim
7Semantic Web Layers
Entailment of the Implicit
Explicit Semantics
Relational Distributed Data
Data Exchange
8Where we are Today the Syntactic Web
Hendler & Miller 02
9i.e. the Syntactic Web is
- A place where
- computers do the presentation (easy) and
- people do the linking and interpreting (hard).
- Why not get computers to do more of the hard work?
Goble 03
10Hard Work using the Syntactic Web
- Complex queries involving background knowledge
- Find information about animals that use sonar but are neither bats, dolphins nor whales
- Locating information in data repositories
- Travel enquiries
- Prices of goods and services
- Results of human genome experiments
- Delegating complex tasks to web agents
- Book me a holiday next weekend somewhere warm,
not too far away, and where they speak French or
11Motivation Knowledge Management
- Knowledge workers are overwhelmed with information
- from intranets, emails, external newslines
- but may still lack the information required
- They need information identified
- by semantics, not just keywords
- by their interests and their task context
- in a form appropriate to their current physical context
- mobile phone, PDA, BlackBerry, laptop,
12Knowledge access
- context-aware tools for access to semantically-annotated knowledge
- search, browse, share, summarise
- integrated into day-to-day business processes
- automatic knowledge delivery based on current context
- activity, location, device, interests
- support multiple end-user devices
13XML is a first step
- Semantic markup
- HTML → layout
- use bold font
- Insert an image here
- XML → content
- this part of the document is the product price
- this document describes a telecommunications
- <play>
- <title>The Life and Death of King John</title>
- <Dramatis Personae>
- <persona>The Earl of PEMBROKE</persona>
- <persona>The Earl of ESSEX</persona>
- </Dramatis Personae>
- <Stagedir>SCENE England, the Court.</Stagedir>
- <act>Act 1
- <scene>Scene I.
- <speech>
- <speaker>John</speaker>
- <line>Now, Chatillon, what would France with us?</line>
- </speech>
- Standard search engine
- WWW pages indexed
- maps keywords to WWW pages
- QuizXML
- A finer-grained index
- maps keywords to documents and the XML tags in
which they occur
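The finer-grained index can be illustrated with a minimal Python sketch. It is not the QuizXML implementation; the sample documents, the function names `build_index` and `search`, and the simple tokenisation are all assumptions made for illustration. Each keyword is mapped to the (document, tag) pairs in which it occurs, so a query can be restricted to a given tag.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample documents; tag names follow the play example.
DOCS = {
    "king_john.xml": (
        "<play><title>The Life and Death of King John</title>"
        "<speech><speaker>John</speaker>"
        "<line>Now, Chatillon, what would France with us?</line></speech></play>"
    ),
    "notes.xml": "<notes><line>John was King of England</line></notes>",
}

def build_index(docs):
    """Map each lower-cased keyword to the set of (document, tag) pairs it occurs in."""
    index = defaultdict(set)
    for name, xml_text in docs.items():
        for element in ET.fromstring(xml_text).iter():
            # Only the text directly inside each element is indexed.
            for word in (element.text or "").split():
                index[word.strip(".,?!").lower()].add((name, element.tag))
    return index

def search(index, keyword, tag=None):
    """Return documents containing keyword, optionally only inside the given tag."""
    hits = index.get(keyword.lower(), set())
    return sorted({doc for doc, t in hits if tag is None or t == tag})

index = build_index(DOCS)
print(search(index, "john"))             # plain keyword search over both documents
print(search(index, "john", "speaker"))  # sub-document search: "John" as a speaker
```

A standard keyword index would stop at the document name; keeping the tag alongside it is what allows the second, narrower query.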
16 17XML is a first step
- Metadata (with limitations)
- within documents, not across documents
- prescriptive, not descriptive
- No commitment on vocabulary and modelling primitives (subclass, instance, etc.)
- <vehicle>
- <car>ford
- <engine>xyz123-4</engine>
- <model>mondeo</model>
- </car>
- </vehicle>
- RDF and ontologies are the next step
18What are Ontologies?
- Ontologies provide a shared and common understanding of a domain (medicine, finance, ...)
- a shared specification of a conceptualisation
- Concept map
- A simple example - Yahoo
- Business & Economy > Finance > Banking
- for WWW, defined using RDF(S) & OWL
20Ontology of People and their Roles
Programme Mgr
Project Mgr
21Structure of an Ontology
- Typically two distinct components
- Names for important concepts and relationships in the domain
- Elephant is a concept whose members are a kind of animal
- Herbivore is a concept whose members are those animals who eat only plants
- Background knowledge/constraints on the domain
- Adult_Elephants weigh at least 2,000 kg
- No individual can be both a Herbivore and a
22Why develop an ontology?
- Define web resources more precisely and make them amenable to machine processing
- Make domain assumptions explicit
- Easier to change domain assumptions
- Easier to understand and update legacy data
- Separate domain and operational knowledge
- Re-use separately
- A community reference for applications
- To share a consistent understanding of what
information means
23Ontologies - Some Examples
- General purpose ontologies
- The Upper Cyc Ontology, http://www.cyc.com/cyc-2-1/index.html
- IEEE Standard Upper Ontology, http://suo.ieee.org/
- Domain and application-specific ontologies
- RDF Site Summary (RSS), http://groups.yahoo.com/group/rss-dev/files/schema.rdf
- Dublin Core, http://dublincore.org/
- UMLS, http://www.nlm.nih.gov/research/umls/
- Open Biological Ontologies, http://obo.sourceforge.net/
- FOAF, www.foaf.org
- Ontologies in a wider sense
- Agrovoc, http://www.fao.org/agrovoc/
- UNSPSC, http://eccma.org/unspsc/
- DAML.org library, http://www.daml.org/
24Ontology and Logic
- Reasoning over ontologies
- Inferencing capabilities
- X is author of Y ⇒ Y is written by X
- X co-wrote D and Y co-wrote D ⇒ X and Y collaborate
- Cars are a kind of vehicle and Vehicles have 2 or more wheels ⇒ Cars have 2 or more wheels
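The three inference examples can be sketched as a small forward-chaining loop in Python. This is an illustration only, not a description logic reasoner; the predicate names (`author_of`, `co_wrote`, `kind_of`, `has`) and the sample facts are assumptions chosen to mirror the rules on the slide.

```python
# Facts are (subject, predicate, object) triples; all names are illustrative.
facts = {
    ("twain", "author_of", "tom_sawyer"),
    ("twain", "co_wrote", "gilded_age"),
    ("warner", "co_wrote", "gilded_age"),
    ("car", "kind_of", "vehicle"),
    ("vehicle", "has", "2_or_more_wheels"),
}

def infer(facts):
    """Apply the three rules repeatedly until no new fact is derived."""
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            if p == "author_of":
                # X is author of Y  =>  Y is written by X
                new.add((o, "written_by", s))
            if p == "co_wrote":
                # X co-wrote D and Y co-wrote D  =>  X and Y collaborate
                new |= {(s, "collaborates_with", s2)
                        for s2, p2, o2 in known
                        if p2 == "co_wrote" and o2 == o and s2 != s}
            if p == "kind_of":
                # A subclass inherits the properties of its superclass
                new |= {(s, p2, o2) for s2, p2, o2 in known
                        if s2 == o and p2 == "has"}
        if new <= known:
            return known
        known |= new

closed = infer(facts)
for triple in sorted(closed - facts):
    print(triple)  # only the derived (implicit) facts
```

The derived triples are exactly the "implicit" statements: nothing in the input says that cars have wheels or that the two authors collaborate.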
25RDF and RDF-S
- W3C standards
- RDF-S defines the ontology
- classes and their properties and relationships
- There are books and authors. Authors write books.
- RDF defines the instances of these classes and their properties
- Mark Twain is an author
- Mark Twain wrote Adventures of Tom Sawyer
- Adventures of Tom Sawyer is a book
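The division of labour between RDF-S and RDF can be sketched with plain Python tuples standing in for triples. The short names (`Author`, `Book`, `writes`, `MarkTwain`) are assumptions used in place of full URIs, and `infer_types` is a hypothetical helper; it applies the standard RDFS reading of `rdfs:domain` and `rdfs:range` to derive the class of each instance.

```python
# Schema-level statements (the role RDF-S plays): classes and a property.
schema = {
    ("Author", "rdf:type", "rdfs:Class"),
    ("Book", "rdf:type", "rdfs:Class"),
    ("writes", "rdfs:domain", "Author"),
    ("writes", "rdfs:range", "Book"),
}

# Instance-level statements (the role RDF plays).
instances = {
    ("MarkTwain", "writes", "AdventuresOfTomSawyer"),
}

def infer_types(schema, instances):
    """Derive rdf:type statements from rdfs:domain and rdfs:range declarations."""
    domain = {s: o for s, p, o in schema if p == "rdfs:domain"}
    range_ = {s: o for s, p, o in schema if p == "rdfs:range"}
    types = set()
    for s, p, o in instances:
        if p in domain:
            types.add((s, "rdf:type", domain[p]))   # subject belongs to the domain class
        if p in range_:
            types.add((o, "rdf:type", range_[p]))   # object belongs to the range class
    return types

for triple in sorted(infer_types(schema, instances)):
    print(triple)
```

With the schema in place, the single statement that Mark Twain wrote the book is enough to conclude that he is an author and that the work is a book.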
26An example RDF Schema
Annotation of WWW resources and semantic links
hasName (http://www.famouswriters.org/twain/mark, "Mark Twain")
hasWritten (http://www.famouswriters.org/twain/mark, http://www.books.org/ISBN00001047582)
title (http://www.books.org/ISBN00001047582, "The Adventures of Tom Sawyer")
XML version:
<rdf:Description
<s:hasName>Mark Twain</s:hasName>
<s:hasWritten rdf:resource="http://www.books.org/ISBN0001047"/>
<
- Searching RDF-annotated web resources
29RDF metadata annotations
Data (WWW document)
Annotation (metadata)
Lost information
- Subjective
- One of several interpretations
- Not exhaustive
30RDF as an Enrichment
31Precision and recall - the IR dilemma
- Trade-off between precision and recall
- recall - what fraction of the relevant documents were found
- precision - what fraction of the found documents were relevant
- Holy grail: high precision and high recall
- QuizRDF offers both
- separately
- closely-coupled
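The two measures can be made concrete with a short Python sketch. The function name `precision_recall` and the example URIs are assumptions for illustration; the definitions are the standard information retrieval ones given on the slide.

```python
def precision_recall(found, relevant):
    """Return (precision, recall) for a result set against the relevant set."""
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)                         # relevant documents that were found
    precision = hits / len(found) if found else 0.0      # share of found that is relevant
    recall = hits / len(relevant) if relevant else 0.0   # share of relevant that was found
    return precision, recall

# Illustrative query: four documents returned, three documents actually relevant.
p, r = precision_recall(
    found=["URI1", "URI3", "URI7", "URI9"],
    relevant=["URI1", "URI3", "URI5"],
)
print(f"precision={p:.2f} recall={r:.2f}")
```

Returning more documents tends to raise recall and lower precision, which is the trade-off the slide refers to.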
32Indexing data model
33Multidimensional Indexing
- Traditional search engine indexing
- term → documents
- employee → URI1, URI3, URI9
- miller → URI3, URI7
- QuizRDF indexing
- <literal, class, property> → URIs
- <george, Employee, first_name> → URI2
- <miller, Employee, last_name> → URI1, URI3
- <miller, Employee, ?> → URI1, URI3, URI7
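A minimal Python sketch of this kind of index follows. It is not the QuizRDF implementation: the annotation list, the function names `build_index` and `lookup`, and the property assigned to URI7 are assumptions made so that the lookups reproduce the slide's example, with `None` playing the role of the "?" wildcard.

```python
from collections import defaultdict

# Toy annotations: (URI, class, property, literal).
# The property recorded for URI7 is hypothetical; the slide only shows that
# URI7 matches <miller, Employee, ?> but not the last_name entry.
ANNOTATIONS = [
    ("URI2", "Employee", "first_name", "george"),
    ("URI1", "Employee", "last_name", "miller"),
    ("URI3", "Employee", "last_name", "miller"),
    ("URI7", "Employee", "first_name", "miller"),
]

def build_index(annotations):
    """Index URIs under the triple (literal, class, property)."""
    index = defaultdict(set)
    for uri, cls, prop, literal in annotations:
        index[(literal, cls, prop)].add(uri)
    return index

def lookup(index, literal, cls=None, prop=None):
    """Return matching URIs; None acts as the '?' wildcard for class or property."""
    result = set()
    for (lit, c, p), uris in index.items():
        if lit == literal and cls in (None, c) and prop in (None, p):
            result |= uris
    return sorted(result)

index = build_index(ANNOTATIONS)
print(lookup(index, "miller", "Employee", "last_name"))  # exact triple
print(lookup(index, "miller", "Employee"))               # property left open
```

The linear scan in `lookup` keeps the sketch short; a real index would store additional keys for the wildcard combinations.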
34QuizRDF demo
35Two Retrieval Channels
Browser interface
Keyword query
- Precise
- Machine readable
- Subjective
- Incomplete
- Higher precision
- Original content
- Complete
- Imprecise
- Higher recall
- Combination of
- User familiar keyword search
- More precise RDF querying
- Data and metadata as complementary
- Low threshold, high ceiling
- Works on non-RDF information
- Exploits RDF where it exists
- Integrates browsing and querying
- Fits users' information-seeking behavior
37Conclusions about RDF(S)
- Next step up from plain XML
- (small) ontological commitment to modeling primitives
- possible to define domain vocabulary
- limited reasoning
- subsumption, but no transitivity, symmetry,
- limited expressive power
- no cardinality constraints, equality,
38Web Ontology Language Requirements
- Desirable features identified for Web Ontology Language
- Extends existing Web standards
- Such as XML, RDF, RDFS
- Easy to understand and use
- Should be based on familiar KR idioms
- Formally specified
- Of adequate expressive power
- Possible to provide automated reasoning support
39OWL Language
- OWL is based on the Description Logics (DL) knowledge representation formalism
- OWL (DL) benefits from many years of DL research
- Well defined semantics
- Formal properties well understood (complexity, decidability)
- Known reasoning algorithms
- Implemented systems (highly optimised)
- Three species of OWL
- OWL Full - maximum expressivity, undecidable
- OWL DL - based on SHIQ DL (strictly, it corresponds to SHOIN(D)), decidable
- OWL Lite - subset of OWL DL, most efficient
40Why OWL?
- OWL Web Ontology Language
- Owl's superior intelligence is known throughout the Hundred Acre Wood, as are his talents for Writing, Spelling, other Educated and Special tasks.
- "My spelling is Wobbly. It's good spelling, but it Wobbles, and the letters get in the wrong
- XML, RDF, OWL language stack
- Increasingly sophisticated search
- QuizXML
- subdocument searching
- QuizRDF
- browsing by concept and across relations
- searching on metadata and full-text
- Next steps in semantic search
- identification of named entities within documents
- Exploitation of world knowledge
- KIM (Ontotext)
43The KIM Platform
- A platform offering services and infrastructure for
- (semi-) automatic semantic annotation
- ontology population
- semantic indexing and retrieval of content
- query and navigation
- Based on an Information Extraction technology
- Aim to underpin Semantic Web applications
- by providing a metadata generation technology
- in a standard, consistent, and scalable framework
- PROTON - a light-weight upper-level ontology
- 250 NE classes
- 100 relations and attributes
- covers mostly NE classes, and to a smaller degree
general concepts
45Ontologies II
46KIM World KB
- Aims to cover the most popular entities in the world
- Entities of general importance like the ones that appear in the news
- KIM knows about
- Organizations: all important sorts of business, international, political, government, sport, academic
- Specific people (e.g. politicians)
- Locations: countries, regions, cities, roads,
47KIM World KB Content
- Collected from various sources, like geographical and business intelligence gazetteers.
- KIM also learns from documents indexed
- via GATE information extraction
- KB scale

RDF Statements      Small KB     Full KB
explicit             444,086   2,248,576
after inference    1,014,409   5,200,017
48KIM Scaling on Data
- The Semantic Repository is based on Sesame/OWLIM.
- Our practical tests demonstrate a perfect performance on top of
- 1.2M entity descriptions
- about 15M explicit statements
- above 30M statements after forward chaining.
- Fulltext indexing with Lucene
- .5M docs, retrieval in milliseconds
49Semantic Annotation
50Simple Usage Highlight, Hyperlink, and
51Simple Usage Explore and Navigate
52People search for People
- A recent large-scale human interaction study on a personal content IR system, carried out by Microsoft, demonstrated that
- "The most common query types in our logs were People/places/things, Computers/internet and Health/science. In the People/places/things category, names were especially prevalent. Their importance is highlighted by the fact that 25% of the queries involved people's names ... . In contrast, general informational queries are less
53Semantic Queries
- The standard IR query is
- give me documents that contain the words "company", "Europe", "telecommunication"
- KIM provides indexing & retrieval with respect to named entities (NEs)
- More precise specification and satisfaction of information needs
- specify the NEs we are interested in, and restrict them by their attributes and relations
- Give me documents that mention a company in Europe from the telecommunications industry
54Precision in Semantic Search
- KIM can match
- a query: "Documents concerning a telecom company in Europe, John Smith, and a date in the first half of 2002."
- with a document containing: "At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO"
- Classical IR cannot do the required reasoning
- Vodafone is a mobile operator, which is a kind of telecom company
- Vodafone is in the UK, which is a part of Europe.
- 10th of May is a "date in the first half of 2002"
- John G. Smith matches John Smith.
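Two of these reasoning steps, the class hierarchy and the part-of relation, can be sketched in Python. This is not KIM's API or data model; the dictionaries `SUBCLASS_OF`, `PART_OF` and `ENTITIES`, and the helper names, are assumptions holding only the facts from the slide's example. Date and name matching are left out.

```python
# Toy knowledge base; every fact here is illustrative.
SUBCLASS_OF = {"MobileOperator": "TelecomCompany", "TelecomCompany": "Company"}
PART_OF = {"UK": "Europe"}
ENTITIES = {"Vodafone": {"type": "MobileOperator", "location": "UK"}}

def is_a(cls, target):
    """Follow subclass links upwards until target is found or the chain ends."""
    while cls is not None:
        if cls == target:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

def located_in(place, region):
    """Follow part-of links outwards until region is found or the chain ends."""
    while place is not None:
        if place == region:
            return True
        place = PART_OF.get(place)
    return False

def matches(entity, wanted_class, wanted_region):
    """True if a recognised entity satisfies both the class and the location restriction."""
    info = ENTITIES.get(entity)
    return (info is not None
            and is_a(info["type"], wanted_class)
            and located_in(info["location"], wanted_region))

# "a telecom company in Europe" is satisfied by the Vodafone mention.
print(matches("Vodafone", "TelecomCompany", "Europe"))
```

A keyword engine sees no overlap between "telecom company in Europe" and "Vodafone"; the match only appears once the entity is looked up in background knowledge.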
55Entity Pattern Search
56Pattern Search Entity Results
57Entity Pattern Search KIM Explorer
58Predefined Pattern Search
59Pattern Search Multiple-Entity Results
60Pattern Search, Referring Documents
61Document Details
62KIM - summary
- KIM is a platform for
- semantic annotation,
- ontology population,
- semantic indexing and retrieval,
- providing an API for remote access and integration,
- based on Information Extraction (IE) using mature HLT (GATE).
- powered by massive world knowledge
- http://www.ontotext.com/kim
- Periodic agent search for named entities
- e.g. a person in an organisation
- Returns relevant documents and metadata
- Proactive knowledge delivery
- Linked to device independence module (see later)
- Based upon KIM architecture
- Result-led indexing
- Adds relevant pages to next crawl list
64SEKTAgent demo
- Uses Google for traditional search
- Augments results with relevant data aggregated from distributed (and semantically annotated) data
- Offers distributed query interface
- See tap.stanford.edu for more information
- Searching for semantic web documents and ontologies
- See swoogle.umbc.edu
68Google vs. Swoogle
- How to find a popular ontology that defines the concept of "person"?
- Ask Google?
- Type: Person filetype:rdf
- Type: Person filetype:owl
- More complicated query: person rdfs:Class filetype:rdf
- Ask Swoogle?
- Type "person" in document search
- #1: http://xmlns.com/foaf/0.1/index.rdf
69Find Time Ontology
We can use a set of keywords to search for an ontology.
For example, "time", "before" and "after" are basic
concepts for a Time ontology.
70Beyond search, beyond documents
- a long list of documents is rarely the ultimate information need of the end user
- there's too much relevant information!
- support for the next step - the analysis of the returned information
- e.g. key points on a topic from a large document you don't want to read
- e.g. creation of a digest of information from multiple documents about Bush's statements on a given topic
71Search Engine trends
- Seamless and integrated
- one search engine for Web and desktop
- implicit queries based on user activity
- Personalisation
- based on user interaction
- Beyond document lists
- sub-document analysis
- Taxonomies and classification
- taxonomy / enterprise search growing at 10% p.a.
- Ontologies and semantic annotation
- A coherent approach to all these issues
72Knowledge Sharing
- Sharing knowledge through an organisation
- learning from success and failures of others
- avoiding duplication of effort
- (Virtual) communities of practice
- Groups with shared interests who will benefit from collaboration and sharing knowledge
- (Using WWW technology to increase collaborative
73Communities & the Semantic Web
- Communities require a shared conceptual vocabulary
- Consensual, evolving concept map
- Ontologies!
- OntoShare
- automates sharing of knowledge in an organisation via community-based RDF(S) ontologies
- Sharing and Classifying resources according to an Ontology
- Informs users when relevant document added to store
- Ontology-based personalisation
- Provides knowledge store for browsing and
76OntoShare Sharing knowledge
- User shares knowledge
- WWW document
- Any textual data
- Can supply annotation
77OntoShare Sharing knowledge
- System automatically extracts keywords and summary
- System assigns knowledge to concepts
78OntoShare Sharing knowledge
- System emails an alert to selected users based on
match to user profile
79OntoShare Evolving Ontologies
- OntoShare automatically suggests changes to concept characterisation
- Concept characterisations evolve over time
80OntoShare Evolving Ontologies
- User can suggest new concepts for ontology at any time
- System emails community on suggestion (à la Usenet) and counts votes
81Finding People Collaboration
- Use of personal profiles
- Who else is interested in this document?
- Who else is interested in this topic?
- Encouraging exchange of tacit knowledge
- Discussion threads around shared knowledge
- Adding value to the knowledge stored
82SWAP Semantic Web and Peer-to-Peer
- Distributed Knowledge Management
- Different participants with different conceptualizations of their domain
- Different knowledge sources
- Physically distributed, dynamic environment
- Peer-To-Peer Approach
- Decentralized nature: local control
- Symmetry: everyone is provider and consumer
- P2P networks as a reflection of social networks
- Flexible collaboration beyond hierarchical
83Case Study The Bibster System
- Scenario: sharing of bibliographic metadata in a Peer-to-Peer network
- Bibliographic metadata is created and maintained in a decentralized manner
- Researchers are willing to share their data
- Use of semantics is crucial in this setting
- The Bibster system allows users to
- Easily share bibliographic data
- Save work in finding this data
- Avoid re-typing this data by hand
84Semantic Methods in Bibster
- Semantic representation and querying of metadata
- Extraction and classification from e.g. BibTeX files
- Semantic Web Research Community Ontology and ACM Topic hierarchy as light-weight ontologies
- Peer selection using semantic topologies
- Scalability requires intelligent query routing
- Semantic descriptions of peers' expertise as basis for peer selection
- Semantic duplicate detection
- Highly redundant and inconsistent representation of bibliographic metadata
- Semantic similarity measures to detect duplicates
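The idea of detecting duplicates by similarity rather than exact equality can be sketched in Python. The slides do not say which similarity measures Bibster uses, so this sketch swaps in a plain string similarity from the standard library (`difflib.SequenceMatcher`) over normalised titles; the entry fields, the 0.9 threshold and the function names are all assumptions.

```python
from difflib import SequenceMatcher

def normalise(entry):
    """Lower-case the title, treat hyphens as spaces and collapse whitespace."""
    title = " ".join(entry["title"].lower().replace("-", " ").split())
    return title, entry["year"]

def is_duplicate(a, b, threshold=0.9):
    """Treat two entries as duplicates if the years agree and the titles are near-identical."""
    (title_a, year_a), (title_b, year_b) = normalise(a), normalise(b)
    if year_a != year_b:
        return False
    return SequenceMatcher(None, title_a, title_b).ratio() >= threshold

# Hypothetical entries as two peers might hold them.
e1 = {"title": "Semantic Web and Peer-to-Peer", "year": 2005}
e2 = {"title": "semantic web and  peer to peer", "year": 2005}
e3 = {"title": "Ontology Learning for the Semantic Web", "year": 2005}

print(is_duplicate(e1, e2))  # same work, different typing conventions
print(is_duplicate(e1, e3))  # different work
```

A semantic measure would additionally compare authors, venues and topic classifications; the string comparison here only stands in for that step.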
85Bibster Screenshot
Open Source: http://bibster.sourceforge.net/
86NLG - Summarisation
- NLG takes as input structured data in a knowledge base or ontology and produces natural language text
- Applied to provide automatic documentation of ontologies or generate textual reports from formal knowledge
- Keeps texts constantly up-to-date so they reflect changes in the ontology
- OntoSum, University of Sheffield
87The Property Hierarchy
- Special linguistically-motivated properties introduced to make the NLG modules more generic
- active-action (e.g. works-for)
- passive-action (e.g. published-by)
- attribute (e.g. has-age, has-web-address)
- part-whole (e.g. consists-of)
- All properties from the ontology were made sub-properties of one of these 4
- Attribute properties recognised using heuristics, such as: property name starts with "has"
88Summary Structuring
- Capture regular patterns; can be applied recursively
- Describe-Instance -> Describe-Attributes, Describe-Part-Whole, Describe-Active-Actions, Describe-Passive-Actions
- Describe-Attributes ->
- attribute(Instance, Attribute),
- Describe-Attributes
- Collect all subproperties of Attribute property relating to Instance
- Attribute(John, hasMobileNumber)..
89Ontology-Based Aggregation
- Joining attribute and part-whole properties with the same first argument to have more coherent sentences
- ATTR(Researcher XXX, Appellation Dr)
- ATTR(Researcher XXX, string my_email_at_sheff)
- ATTR(Researcher XXX, string 012344567)
- ATTR(Researcher XXX, string www.mypage.ac.uk)
- Without aggregation: Kalina Bontcheva has a Dr appellation. Kalina Bontcheva has email my_email_at_shef.com. Kalina Bon
- With aggregation: Kalina Bontcheva has a Dr appellation, email my_email_at_shef.com and
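The aggregation step can be sketched in Python. This is a simplification and not OntoSum's code: the `FACTS` list, the attribute name "web page" and the function `verbalise` are assumptions, and the wording of each clause is cruder than the fluent output shown on the slide.

```python
from collections import defaultdict

# Hypothetical attribute facts: (subject, attribute name, value).
FACTS = [
    ("Kalina Bontcheva", "appellation", "Dr"),
    ("Kalina Bontcheva", "email", "my_email_at_shef.com"),
    ("Kalina Bontcheva", "web page", "www.mypage.ac.uk"),
]

def verbalise(facts, aggregate=True):
    """Turn attribute facts into sentences, optionally merging facts that share a subject."""
    if not aggregate:
        return [f"{s} has {a} {v}." for s, a, v in facts]
    grouped = defaultdict(list)
    for s, a, v in facts:
        grouped[s].append(f"{a} {v}")   # facts with the same first argument are grouped
    sentences = []
    for s, parts in grouped.items():
        if len(parts) == 1:
            joined = parts[0]
        else:
            joined = ", ".join(parts[:-1]) + " and " + parts[-1]
        sentences.append(f"{s} has {joined}.")
    return sentences

print(" ".join(verbalise(FACTS, aggregate=False)))  # one sentence per fact
print(" ".join(verbalise(FACTS)))                   # one sentence per subject
```

Grouping on the first argument is what removes the repeated subject and yields a single, more readable sentence.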
90Lexicalisation of Classes Properties
- 3 options
- Specified by ontology engineer
- Same as concept/property name
- Added manually when parameterising OntoSum
91Description of HSBC
Financial Institution
92Description of HSBC
93Innovative aspects
- Can tailor summary to device profile
- Apply length restriction
- e.g. for text message for mobile phone
- Generate HTML for web browser or plain text for email
- See device independence (next!)
- Readability heuristics
- introduce lists when verbalising more than 3 attributes
- Use of ontology mapping rules to run same system on multiple ontologies
94Related work
- Wilcock (Helsinki)
- Fully automatic, no lexicon
- Talking OWLs, ISWC-03
- Some manual input
- More effort, more fluency
- OntoSum based on MIAKT
- Bontcheva, NLDB04
95OntoSum demonstration
96Device Independence
- context-aware tools for access to semantically-annotated knowledge
- search, browse, share, summarise
- integrated into day-to-day business processes
- automatic knowledge delivery based on current context
- activity, location, device, interests
- support multiple end-user devices
97Device independence
- 3 approaches
- Hand-craft different sites for different devices
- Labour intensive, difficult to maintain
- Extend HTML to describe interaction, navigation and selection
- Server software generates output in suitable format using CC/PP
- Inflexible, difficult to control output precisely
- No support for large volume sites
- Unclear what extensions are necessary and sufficient
- SEKT approach
- Use templates to format data content appropriate for each class of device
- Fine control of output based on CC/PP profiles
- can handle large volumes of structured data - XML databases
- device-dependencies coded in the templates, e.g. mouse capability
98Device Profiles in RDF
- CC/PP - W3C RDF standard for describing device characteristics
- CC/PP vocabularies define device components and component attributes
- UAProf is an application of CC/PP adopted by many terminal device manufacturers
- An ontology of devices: inheritance and specialisation
- Profile references and Profile Diffs are sent with an information request
- javax.ccpp package for processing profiles
99User Profiles
- Effective presentation must take user preferences & accessibility issues into account
- Font size
- Colour preference
- Hi res/Lo res
- Device characteristics and preference/accessibility requirements need to be combined
- Effective screen size depends on both physical size and user preferences (e.g. font size)
- Specialisation/extension of UAProf
100Profile Engine
- The Profile engine combines device and user profiles to generate a set of conditions
- The engine can be queried by other applications
- PROLOG is being used as a prototyping language
- Arithmetic calculations of effective screen size (for example) require more than RDF/OWL
- DL (DIG) interface to SWI-Prolog
101Content Adaptation
- The content adaptation engine uses conditions generated by profile engine queries
- Example conditions
- Screen size x font size → number of characters of text
- GraphicsSupported?
- Colour or BW
- Device characteristic or
- Accessibility issue
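How device and user profiles might combine into conditions can be sketched in Python (the slides mention PROLOG as the prototyping language; Python is used here only for illustration). The dictionary keys are hypothetical and are not CC/PP or UAProf vocabulary, and the function `conditions` is an assumption about one plausible way to do the arithmetic.

```python
def conditions(device, user):
    """Combine a device profile and a user profile into adaptation conditions."""
    # Effective text capacity depends on physical screen size and the user's font size.
    chars_per_line = device["screen_width_px"] // user["font_width_px"]
    lines = device["screen_height_px"] // user["font_height_px"]
    return {
        "max_chars": chars_per_line * lines,
        # A capability counts only if the device offers it and the user accepts it.
        "graphics_supported": device["graphics"] and not user["text_only"],
        "colour": device["colour"] and not user["prefers_monochrome"],
    }

# Hypothetical profiles: a small monochrome phone and a user with a large font.
phone = {"screen_width_px": 128, "screen_height_px": 160,
         "graphics": True, "colour": False}
user = {"font_width_px": 8, "font_height_px": 16,
        "text_only": False, "prefers_monochrome": False}

print(conditions(phone, user))
```

The integer division is the kind of calculation the slides note is beyond what RDF/OWL alone can express, which is why a separate engine computes it.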
102Content Generation
- Different content must be generated for different devices
- The current context (set of conditions) will be made available to SEKT applications
- Natural Language Processing techniques are used to generate or modify information
- Mobile phone: 400 character text message
- PC: multimedia document
- NLG: describing ontology-based knowledge in natural language (OntoSum!)
103Device Independence
- A functional presentation of a resource should be available via any suitable device
- Requirements include content selection, layout transformation and style selection
- At present, no one language can be interpreted by all clients
- It follows that content must be formatted for the target device on the server
- Declarative templates are used to format the (XML-based) data
- Context (conditions) can be used to select templates, and sections within templates
- Template 1: WML
- InputEnabled?
- Template 2: HTML
- GraphicsWanted?
- Separation of data storage, processing and display
- W3C working group on device independence
- No standard for templates (yet)
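Template selection driven by context can be sketched with Python's standard `string.Template`. The template texts, the context keys (`markup`, `graphics_wanted`, `max_chars`) and the `render` function are assumptions for illustration, and the WML and HTML shown are deliberately skeletal rather than complete documents.

```python
from string import Template

# One declarative template per class of device (skeletal markup, for illustration).
TEMPLATES = {
    "wml": Template("<wml><card><p>$title: $body</p></card></wml>"),
    "html": Template("<html><body><h1>$title</h1>$image<p>$body</p></body></html>"),
}

def render(data, context):
    """Select a template from the context and fill it with the data."""
    template = TEMPLATES[context["markup"]]
    # A section within the template is included only when the condition holds.
    image = ""
    if context.get("graphics_wanted"):
        image = '<img src="' + data["image"] + '"/>'
    # Apply the length restriction, if the context carries one.
    body = data["body"][: context.get("max_chars", len(data["body"]))]
    return template.substitute(title=data["title"], body=body, image=image)

# Hypothetical data item and two contexts.
data = {"title": "HSBC", "body": "HSBC is a financial institution.", "image": "logo.png"}
print(render(data, {"markup": "wml", "max_chars": 7}))
print(render(data, {"markup": "html", "graphics_wanted": True}))
```

The data stays the same for both calls; only the context changes, which is the separation of storage, processing and display that the slide lists.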
UAProf (RDF(S))
Device Properties
User preferences
Repurposed Information
Raw Information
Profiling engine
Content Adaptation
(syntactic semantic)
106Device Independence demo
107Device Independence Summary
- Device and User profiles need to be combined using a suitable ontology
- A profile reasoning engine is used to generate conditions on the format
- Content can be generated according to the context (set of conditions)
- NLP techniques can be used to generate/summarise text (semantic)
- Templates are used to transform the results to a format suitable for the device at hand (syntactic)
- Semantic Web technology can offer enhancements to a range of KM tools
- Search, Share, Summarise, Deliver
- Also
- Visualisation
- RDF or OWL statements as a graph
- Integration of heterogeneous information
- Outstanding Issues
- Trade-off between reasoning and scalability
- Where does the metadata come from?
- Only KIM starting to address this point
- See also SEKT project (www.sekt-project.com)
- Who will find the killer app?!
- Plenty of topics still on the research agenda
- Peter Haase, University of Karlsruhe
- Kalina Bontcheva, University of Sheffield
- Naso Kiryakov, Ontotext
- Ian Horrocks, University of Manchester
- Tim Glover & Alistair Duke, BT
110Thank you - questions?
- Here's a few for you
- What are the semantic web layers?
- Name 3 ontologies in widespread use today
- Name 3 semantic search tools
- What RDF ontology is used to characterise devices?
- Why use NLG techniques on ontological information?
- What are the advantages of RDF over XML? And OWL over RDF?
- Name 3 trends in search engine development
- Describe briefly the way(s) in which metadata can
WIN A PRIZE!!!!!
John Davies Next Generation Web Research,
BT john.nj.davies_at_bt.com