Title: Ontotext @ JRC
1Ontotext @ JRC
2Semantic Web
- The Semantic Web is the abstract representation
of data on the WWW, based on RDF and other
standards
- SW is being developed by the W3C, in
collaboration with a large number of researchers
and industrial partners
- http://www.w3.org/2001/sw/
- http://www.SemanticWeb.org
3Semantic Web (II)
- "The Semantic Web is an extension of the current
web in which information is given well-defined
meaning, better enabling computers and people to
work in cooperation." (Berners-Lee et al., 2001)
- The spirit
- Automatically processable metadata regarding
- the structure (syntax) and
- the meaning (semantics)
- of the content.
- Presented in a
- standard form
- Dynamic interpretation for unforeseen purposes
4Semantic Web Languages
- RDF(S): the next slides
- SHOE, XOL, etc.: the pioneers
- Topic Maps: a metadata language with limited
impact
- OIL: Ontology Interchange Language, the basis of
the next two, http://www.ontoknowledge.org/oil/
- Description Logics-based multilayered language
- DAML+OIL: the predecessor of OWL, no longer
developed
- OWL: the W3C standard for Semantic Web ontology
language, http://www.w3.org/2001/sw/WebOnt/
- Extends RDF(S), but also constrains it
- Has multiple layers (Lite, DL, Full)
- Transitive/symmetric/etc. properties,
disjointness, cardinality restrictions
5Semantic Web Problems
- Critical mass of metadata is necessary
- Still a lack of consensus on many issues (like
query languages)
- Lack of practices at the proper scale and
complexity
- Lack of robust semantic (these days, RDFS)
repositories
- Should be as flexible, multi-purpose and easy to
use as HTTP servers and
- As efficient in structured knowledge management
as RDBMS
6What are Sirma and Ontotext?
- Established in 1992 as a Bulgarian AI Lab.
- Current structure
- Sirma Group International Corp, Montreal, Canada
- 8 subsidiary companies; the most important ones
follow below.
- Sirma AI, Sofia
- The R&D backbone of the group, with two divisions
- Sirma Solutions: e-Business, banking, C3,
e-Publishing, consultancy
- Ontotext Lab: Knowledge and Language Engineering.
- EngView Systems, Montreal
- CAD/CAM systems and applications.
- WorkLogic.Com, Ottawa
- Web-based collaboration, workflow, e-Gov.
7Software Development and Research since 1992
- Track record of success with large companies and
government organizations in the US, Canada, Western
Europe and Bulgaria
- Top-3 software company in Bulgaria
- About 70 developers
- ISO 2001 Certificate
- 1999 EIST prize winner
8Sirma Businesses and Domains
- Diverse business, ranging from COTS products to
custom projects, consultancy, and outsourcing
services.
- Major areas
- AI expert systems (beside Ontotext)
- b2b market places
- CAD/CAM (for packaging, quality control)
- e-Government, CSCW, Groupware, Workflow
- Banking
- C3/C4 Systems (military, airport traffic)
- VOIP billing systems
- e-Publishing, Proofing tools.
9Ontotext Lab
- An R&D lab of Sirma for
- Knowledge and Language Engineering
- Research and core technology development for
- knowledge discovery, management, and engineering.
- Specialized for applications in Semantic Web,
Knowledge Management, and Web Services.
- Aside from the scientific matters, most of us are
just professional software developers.
10Leading Semantic Web Technology Provider
- Ontotext is a leading Semantic Web technology
provider, being
- the developer of the KIM Semantic Annotation
Platform and
- a co-developer of the GATE language engineering
platform
- a co-developer of the Sesame semantic repository
and the OWLIM high-performance OWL reasoner
- the developer of the WSMO4J semantic web services
API
- a partner in the SWAN Semantic Web Annotator
project.
- Ontotext takes part in most of the major European
research projects in the field; it is the most
successful Bulgarian participant in FP6.
11Mission
- A critical mass of research in a number of AI
areas has made efficient KM almost possible.
- The technology on the market is mostly of two
sorts
- Expensive black boxes
- Academic prototypes
- Our mission is
- To develop and popularize open, skillfully
engineered tools...
- For Information Extraction and Knowledge
Management,
- Which considerably reduce the cost of
implementation and use of KM applications.
12Major Research Areas
- We focus on building cutting-edge expertise and
technology in the following areas
- ontology design, management, and alignment
- knowledge representation, reasoning
- information extraction (IE), applications in IR
- semantic web services
- upper-level ontologies and lexical semantics
- NLP: POS, gazetteers, co-reference resolution,
named entity recognition (NER)
- machine learning (HMM, NN, etc.)
13Academic Technology Partners
- NLP Group, Sheffield University, UK
- Digital Enterprise Research Institute (DERI),
Institut für Informatik, Innsbruck, Austria,
and National University of Ireland, Galway
- Aduna (Aidministrator) b.v., The Netherlands
- Linguistic Modelling Lab., CLPOI, Bulgarian
Academy of Sciences
- British Telecommunications Plc (BT), UK.
- Forschungszentrum Informatik (FZI) and Institut
AIFB, Karlsruhe, Germany.
14Customers
- SemanticEdge GmbH, Berlin, Germany
- QinetiQ Ltd, UK
- Fairway Consultants, UK
15Research Projects
- We were/are part of a number of FP5 research
projects
- On-To-Knowledge - the project which invented
OIL. Ontology Middleware Module and a DAML+OIL
reasoner.
- VISION - Towards Next Generation Knowledge
Management.
- OntoWeb - Ontology-based information exchange for
knowledge management.
- SWWS - Semantic Web enabled Web Services.
16Research Projects (II)
- FP6 integrated projects that started in Jan 2004,
duration 3 years
- SEKT: Semantic Knowledge Technologies. Targeting
a synergy of Ontology and Metadata Technology,
Knowledge Discovery and Human Language
Technology.
- DIP: Data, Information, and Process Integration
with Semantic Web Services.
- PrestoSpace: Preservation towards storage and
access. Standardized Practices for Audiovisual
Contents in Europe.
- Infrawebs: Intelligent Framework for Generating
Open (Adaptable) Development Platforms for
Web-Service Enabled Applications Using Semantic
Web Technologies, Distributed Decision Support
Units and Multi-Agent-Systems
17Introduction to Ontologies
- Despite the formal definitions, ontologies are
- Conceptual models or schemata
- Represented in a formalism which allows
- Unambiguous semantic interpretation
- Inference
- Can be considered a combination of
- DB schema
- XML Schema
- OO-diagram (e.g. UML)
- Subject hierarchy/taxonomy (think of Yahoo)
- Business logic rules
18Introduction to Ontologies (II)
- Imagine a DB storing
- John is a son of Mary.
- It will be able to "answer" just
- Which are the sons of Mary? Whose son is John?
- An ontology with a definition of the family
relationships could infer
- John is a child of Mary (more general)
- Mary is a woman
- Mary is the mother of John (inverse)
- Mary is a relative of John (generalized inverse).
- The above facts would remain "invisible" to a
typical DB, whose model of the world is limited
to data structures of strings and numbers.
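A minimal sketch of the kind of inference described on this slide, assuming three hand-written family rules (the rule set, the predicate names `sonOf`, `childOf`, `motherOf`, `relativeOf`, and the extra fact that Mary is a woman are illustrative assumptions, not taken from any Ontotext tool). Note that the sketch states "Mary is a woman" as a fact rather than inferring it, since that step would need a further axiom.

```python
def infer(facts):
    """Forward-chain three assumed family rules until no new triple appears."""
    closure = set(facts)
    while True:
        new = set()
        for s, p, o in closure:
            if p == "sonOf":
                # sonOf is a special case of childOf (more general relation)
                new.add((s, "childOf", o))
            if p == "childOf":
                # generalized inverse: a parent is a relative of the child
                new.add((o, "relativeOf", s))
                # inverse: a female parent is the mother of the child
                if (o, "type", "Woman") in closure:
                    new.add((o, "motherOf", s))
        if new <= closure:
            return closure
        closure |= new


# The single stored fact from the slide, plus one assumed type assertion.
facts = {("John", "sonOf", "Mary"), ("Mary", "type", "Woman")}
closure = infer(facts)
```

A plain database lookup over `facts` would only find the `sonOf` row; the derived triples exist only in `closure`.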
19Products
- The Ontology Middleware Module (OMM) is an
enterprise back-end for formal KR and KM
applications based on Semantic Web standards
- An extension of the Sesame RDF(S) repository
that adds a Knowledge Control System.
- OMM integration options: Built-In, RMI, SOAP,
HTTP.
20Products
- BOR: a DAML+OIL reasoner.
- Proprietary GATE components
- Hash Gazetteer. A high-performance lookup tool.
- Hidden Markov Model Learner. A stochastic module
for filtering annotations, disambiguation,
etc., based on confidence measures.
- The News Collector is a web service, collecting
and indexing articles from the top-10 global news
wires
- About 1000 articles/day, annotated and indexed
using KIM
- Used to validate the heuristics and resources of
KIM
21Products (II)
- The KIM Platform (the next slides),
http://www.ontotext.kim.
- SWWS Studio (http://swws.ontotext.com)
- Semantic Web Service description development
environment
- Developed in the course of the SWWS project
- Based on WSMO (http://www.wsmo.org)
- WSMO4J (http://wsmo4j.sourceforge.net)
- A WSMO API and a reference implementation
- for building Semantic Web Services applications
- Used in WSMO Studio (http://www.wsmostudio.org/)
- The basis for ORDI, used in OMWG
(http://www.omwg.org)
- Used in projects DIP, SEKT, Infrawebs
22OWLIM
- OWLIM is a high-performance OWL repository
- Storage and Inference Layer (SAIL) for the Sesame
RDF database
- OWLIM performs OWL DLP reasoning
- It uses the IRRE (Inductive Rule Reasoning
Engine) for forward-chaining and total
materialization
- In-memory reasoning and query evaluation
- OWLIM provides reliable persistence, based on
RDF N-Triples
- OWLIM can manage millions of statements on
desktop hardware
- Extremely fast upload and query evaluation even
for huge ontologies and knowledge bases
23Scalability Upload and Reasoning
24Scalability Query Answering
- Q2 Pattern of 12 statement-joins and LIKE
literal constraint
25OWLIM under LUBM Benchmark
- The Lehigh Univ. evaluation (LUBM) is one of the
most comprehensive benchmark experiments published
recently (ISWC 2004, JWS 2005)
- Synthetically generated OWL knowledge bases
- The biggest set generated is LUBM(50,0): 6M
explicit statements
- 14 queries, checking different inferences
- OWLIM on LUBM
- On a desktop machine OWLIM loads LUBM(50,0) in 10
min
- The only other system known to load it takes
12 hours
- All the queries are answered correctly
- Based on this we can claim that
- OWLIM is the fastest OWL repository in the world!
26JOCI
- Jobs and Contacts Intelligence, Innovantage,
Fairway Consultants
- Gathering recruitment-related information from
web-sites of UK organizations
- Offering services on top of this data to
recruitment agencies, job portals, and others.
- JOCI uses KIM for information extraction (IE,
text-mining)
- JOCI makes use of a domain ontology to
- support the IE process,
- structure the knowledge base with the obtained
results, and
- facilitate semantic queries.
- Sirma is a shareholder in Fairway Consultants
27JOCI Dataflow
[Diagram: JOCI dataflow. Labelled components: UK Web
Space, Focused Crawler, Crawler, Classifier,
Information Extraction, KIM Server, Single-Document
IE, Object Consolidation, Semantic Repository,
Document Store, Web UI]
28JOCI Vacancy Consolidation/Matching
[Diagram: vacancy consolidation graph. Nodes:
Consolidated Vacancy, Vacancy 1, Vacancy 2, the job
titles "IT Applications Support Analyst" and
"Support Analyst", Glasgow, Scotland, U.K., and the
classes City, Country, Location. Edge labels:
hasJobTitle, locatedIn, sub-string, subRegionOf,
type, subClassOf]
29JOCI Statistics
- The figures below are indicative and reflect an
old state of the JOCI system
- The actual figures are to be announced after the
launch of JOCI
- Web-sites inspected: 0.5M
- Web-sites with vacancy announcements: 30K
- Extracted vacancies: 100K
30The KIM Platform
- A platform offering
- services and infrastructure for
- (semi-) automatic semantic annotation and
- ontology population
- semantic indexing and retrieval of content
- query and navigation over the formal knowledge
- Based on Information Extraction technology
31KIM What's Inside?
- The KIM Platform includes
- Ontologies (PROTON, KIMSO, KIMLO) and KIM World
KB
- KIM Server with a set of APIs for remote access
and integration
- Front-ends: Web-UI and plug-in for Internet
Explorer.
32The AIM of KIM
- Aim to arm Semantic Web applications
- by providing a metadata generation technology
- in a standard, consistent, and scalable framework
33What does KIM do? Semantic Annotation
34Simple Usage Highlight, Hyperlink, and
35Simple Usage Explore and Navigate
36Simple Usage Enjoy a Hyperbolic Tree View
37KIM is Based On
- KIM is based on the following open-source
platforms
- GATE: the most popular NLP and IE platform in
the world, developed at the University of
Sheffield. Ontotext is its biggest
co-developer. www.gate.ac.uk and
www.ontotext.com/gate
- OWLIM: OWL repository, compliant with the Sesame
RDF database from Aduna B.V.
www.ontotext.com/owlim
- Lucene: an open-source IR engine by Apache.
jakarta.apache.org/lucene/
38How KIM Searches Better
- KIM can match a query like
- "Documents about a telecom company in Europe, John
Smith, and a date in the first half of 2002."
- With a document containing
- "At its meeting on the 10th of May, the board of
Vodafone appointed John G. Smith as CTO"
- Classical IR could not match
- Vodafone with a "telecom in Europe", because it
lacks the knowledge that
- Vodafone is a mobile operator, which is a sort of
telecom
- Vodafone is in the UK, which is a part of Europe.
- 10th of May with a "date in the first half of 2002"
- John G. Smith with John Smith.
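The matching steps listed on this slide can be sketched as follows. This is only an illustration under stated assumptions: the class hierarchy, the region containment, the alias table, and the function names `reaches` and `matches_query` are hypothetical, and the year 2002 for "the 10th of May" is assumed to be resolved from the document's own date. It is not the KIM API.

```python
from datetime import date

# Assumed background knowledge (illustrative names, not the KIM ontology).
SUBCLASS_OF = {"MobileOperator": "Telecom", "Telecom": "Company"}
SUB_REGION_OF = {"UK": "Europe"}
ENTITIES = {"Vodafone": {"cls": "MobileOperator", "located_in": "UK"}}
ALIASES = {"John G. Smith": "John Smith"}


def reaches(start, target, parent_of):
    """Follow parent links from `start`; True if `target` is reached."""
    node = start
    while node is not None:
        if node == target:
            return True
        node = parent_of.get(node)
    return False


def matches_query(orgs, people, dates):
    """Telecom company in Europe + John Smith + date in first half of 2002."""
    telecom_in_europe = any(
        name in ENTITIES
        and reaches(ENTITIES[name]["cls"], "Telecom", SUBCLASS_OF)
        and reaches(ENTITIES[name]["located_in"], "Europe", SUB_REGION_OF)
        for name in orgs
    )
    john_smith = any(ALIASES.get(p, p) == "John Smith" for p in people)
    first_half_2002 = any(
        date(2002, 1, 1) <= d <= date(2002, 6, 30) for d in dates
    )
    return telecom_in_europe and john_smith and first_half_2002


# Annotations an IE pipeline might produce for the sentence on the slide.
hit = matches_query(["Vodafone"], ["John G. Smith"], [date(2002, 5, 10)])
```

A keyword index sees none of the query words in the sentence; the match only succeeds because each annotation is checked against background knowledge.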
39Entity Pattern Search
40Pattern Search Entity Results
41Entity Pattern Search KIM Explorer
42Semantic Metadata in KIM
- Provides a specific metadata schema,
- focusing on named entities (particulars),
- as well as number and time-expressions,
addresses, etc.,
- everything specific, apart from the general
concepts.
- Defines specific tasks for generation and usage
of the metadata which are well-understood and
measurable.
- Why not metadata about general things
(universals)?
- It is too complex
- but we leave the door open.
- The particulars seem to provide a good 80/20
compromise.
43World Knowledge in KIM
- Rationale
- provide common knowledge about world entities
- KIM bets on scale and avoids heavy semantics
- minimum modeling of common-sense, almost no
axioms
- The ontology is encoded in OWL Lite and RDF.
- In addition, a number of rules (generative
axioms) are defined, e.g.
- <X,locatedIn,Y> and <Y,subRegionOf,Z> =>
<X,locatedIn,Z>
- Axioms of this sort are supported by OWLIM and
they provide a consistent mechanism for custom
extensions to the OWL or RDF(S) semantics with
respect to a particular ontology
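The locatedIn/subRegionOf rule above can be materialized with a simple fixpoint loop. The sketch below is an assumption-laden illustration of forward chaining with total materialization in general, not OWLIM's engine; the function name `materialize_located_in` and the sample triples are hypothetical.

```python
def materialize_located_in(triples):
    """Apply <X,locatedIn,Y> and <Y,subRegionOf,Z> => <X,locatedIn,Z>
    repeatedly until no new statement can be derived."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshots of the current statements for each predicate.
        located = [(x, y) for x, p, y in closure if p == "locatedIn"]
        sub_region = [(y, z) for y, p, z in closure if p == "subRegionOf"]
        for x, y in located:
            for y2, z in sub_region:
                if y == y2 and (x, "locatedIn", z) not in closure:
                    closure.add((x, "locatedIn", z))
                    changed = True
    return closure


# Hypothetical explicit statements.
explicit = {
    ("Vacancy1", "locatedIn", "Glasgow"),
    ("Glasgow", "subRegionOf", "Scotland"),
    ("Scotland", "subRegionOf", "UK"),
}
materialized = materialize_located_in(explicit)
```

After materialization, a query for everything located in the UK is a plain lookup, which is the trade-off total materialization makes: more stored statements in exchange for simpler query evaluation.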
44PROTON
- Name. PROTON is an acronym for
- Proto Ontology
- ex-names: BULO (basic upper-level ontology), GO
(generic ontology)
- not a Russian space rocket
- "proto" used in the sense of "primary",
"beginning", "giving rise to", vs. "first in
time" or "oldest"
- connotations: positive, fundamental, elemental,
in favour of, even romantic (like a
science-fiction novel from the 60s)
- Intended usage. A Basic Upper-Level Ontology like
PROTON is used for
- ontology population
- knowledge modelling and integration strategy of a
KM environment
- generation of domain, application, and other
ontologies.
45PROTON Design
- Design principles
- domain-independence
- light-weight logical definitions
- Compliance with popular metadata standards
- good coverage of concrete and/or named entities
(e.g. people, organizations, numbers)
- no specific support for general concepts (such as
"apple", "love", "walk"); however, the design
allows for such extensions
46Some Figures
- PROTON defines about
- 250 classes and 100 properties
- Providing coverage of most of the upper-level
concepts necessary for semantic annotation,
indexing, and retrieval
- A modular architecture, allowing for great
flexibility of usage and extension
- SYSTEM module - contains a few meta-level
primitives (6 classes and 7 properties);
introduces the notion of 'entity', which can have
aliases
- TOP module - the highest, most general,
conceptual level, consisting of about 20 classes
- UPPER module - over 200 general classes of
entities, which often appear in multiple domains.
47PROTON Ontology Language
- The current version of the ontology is encoded in
OWL Lite.
- A few custom entailment rules (axioms) are also
defined for usage in tools that support them, for
instance
- Premise
- <xxx, protont:roleHolder, yyy>
- <xxx, protont:roleIn, zzz>
- <yyy, rdf:type, protont:Agent>
- Consequent
- <yyy, protont:involvedIn, zzz>
- Axioms of this sort are interpreted by OWLIM
- PROTON is portable to any OWL (Lite)-compliant
tool.
- PROTON can also be used without such axioms.
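To make the premise/consequent form concrete, here is a minimal sketch of applying one such rule by pattern matching with variables. It is not how OWLIM interprets rules; the helper names `match` and `apply_rule`, the `?`-prefixed variable convention, the prefix:name spelling of the properties, and the sample individuals are all assumptions made for illustration.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None.
    Terms starting with '?' are variables; all others are constants."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None  # constant mismatch
    return result


def apply_rule(premises, consequent, triples):
    """Return the triples derived by one application of the rule."""
    bindings = [{}]
    for pattern in premises:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return {tuple(b.get(term, term) for term in consequent) for b in bindings}


PREMISES = [
    ("?x", "protont:roleHolder", "?y"),
    ("?x", "protont:roleIn", "?z"),
    ("?y", "rdf:type", "protont:Agent"),
]
CONSEQUENT = ("?y", "protont:involvedIn", "?z")

# Hypothetical statements about a job position.
kb = {
    ("cto_position", "protont:roleHolder", "john_smith"),
    ("cto_position", "protont:roleIn", "vodafone"),
    ("john_smith", "rdf:type", "protont:Agent"),
}
derived = apply_rule(PREMISES, CONSEQUENT, kb)
```

Because the rule is plain data, a tool that does not support such axioms can simply ignore it and still load the OWL Lite part of the ontology.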
48Other Standards Relations
- ADL Feature Type Thesaurus and GNS
- the backbone of the Location branch
- in turn aligned with the geographic feature
designators of the GNS database of NIMA
- PROTON is more coarse-grained, taking about 80
out of 300 types.
- Dublin Core
- the basic element set is available as properties
of the protont:InformationResource and
protont:Document classes
- the resource type vocabulary is mapped to
sub-classes of InformationResource.
- OpenCyc and WordNet: consulted and referred to in
glosses.
- ACE (Automatic Content Extraction): annotation
types covered.
- FOAF: easy mapping assured (e.g. the Account
class was added).
- DOLCE, EuroWordnet Top, and others: consulted to
various extents.
49Other Standards Compliance
- Other models are not directly imported (for
consistency reasons)
- The mapping of the appropriate primitives is
easy, on the basis of
- a compliant design, and
- formal notes in the PROTON glosses, which
indicate the appropriate mappings.
- For instance, in PROTON, a protont:inLanguage
property is defined
- as an equivalent of the dc:language element in
Dublin Core
- with a domain protont:InformationResource
- and a range protont:Language
50KIM World KB
- A quasi-exhaustive coverage of the most popular
entities in the world
- What a person is expected to have heard about
that is beyond the horizons of his country,
profession, and hobbies.
- Entities of general importance, like the ones
that appear in the news
- KIM knows
- Locations: mountains, cities, roads, etc.
- Organizations: all important sorts of business,
international, political, government, sport,
academic
- Specific people, etc.
51KIM World KB Entity Description
- The NEs are represented with their Semantic
Descriptions via
- Aliases (Florida, FL)
- Relations with other entities (Person hasPosition
Position)
- Attributes (latitude and longitude of geographic
entities)
- their proper Class
52The Scale of KIM World KB
                     Small KB     Full KB
RDF Statements
- explicit            444,086   2,248,576
- after inference   1,014,409   5,200,017
Instances
- Entity               40,804     205,287
- Location             12,528      35,590
- Country                 261         261
- Province              4,262       4,262
- City                  4,400       4,417
- Organization          8,339     146,969
- Company               7,848     146,262
- Person                6,022       6,354
- Alias                64,589     429,035
53KIM IE Pipeline
54JAPE Grammars
- JAPE grammars are based on the last MUSE version
- Class/instance information included
- Better class granularity in grammars
- Relation recognition grammars - LocatedIn and
HasPositionWithinOrganization
55Disambiguation Filtering
- Simple disambiguation (longest match), e.g. San
Francisco Journal
- Based on the main alias, e.g. Beijing
- By priority of the class or instance, or by
relative class priority
- E.g. Brand Microsoft vs. Company Microsoft
Corp.
- We assign a priority (1-1000) to each class and
instance
- For pairs of classes we define relative priority
- If the difference between the priorities is
greater than a certain threshold, the possible
reference to the entity with the lower priority
is ignored
- Still to be improved
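The priority-threshold filter described above might look roughly like this. It is a sketch under assumptions: the `Candidate` type, the function name `filter_candidates`, the priority values, and the threshold are invented for illustration, and comparing every candidate against the best one is one possible reading of the pairwise rule on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    entity: str    # candidate entity the text span may refer to
    priority: int  # assumed priority in the range 1-1000


def filter_candidates(candidates, threshold):
    """Drop candidates whose priority trails the best one by more than
    `threshold`; candidates within the threshold remain ambiguous."""
    if not candidates:
        return []
    best = max(c.priority for c in candidates)
    return [c for c in candidates if best - c.priority <= threshold]


# Two readings of the string "Microsoft" with made-up priorities.
readings = [
    Candidate("Brand: Microsoft", 400),
    Candidate("Company: Microsoft Corp.", 900),
]
kept = filter_candidates(readings, threshold=300)
```

With a larger threshold (for example 600) both readings would survive, leaving the decision to a later stage.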
56KIM Scaling on Data
- The Semantic Repository is based on OWLIM
- In our practical tests we observe perfect
performance on top of
- 1.2M entity descriptions
- about 15M explicit statements
- over 30M statements after forward chaining.
- Document and Annotation storage and indexing with
Lucene
- One million docs, processed on a 1000-worth
machine
- retrieval in milliseconds.
57Entity Ranking a sketch for Jan-May 2004
No  Instance                         Label                       Rank
1   Country_T.5                      United States               0.032
4   Country_T.IZ                     Republic of Iraq            0.011
6   Person_T.51                      George W. Bush              0.010
9   Country_T.IS                     State of Israel             0.006
11  DayOfWeek_T.4                    Tuesday                     0.005
12  NewsAgency_T.6                   The Associated Press        0.005
14  InternationalOrganization_T.13   United Nations              0.005
27  Country_T.CH                     People's Republic of China  0.004
32  City_T.3068                      New York                    0.004
36  InternationalOrganization_T.18   European Union              0.004
40  Person_T.115                     Ariel Sharon                0.003
43  Country_T.JA                     Japan                       0.003
44  Country_T.UK                     United Kingdom              0.003
45  CountryCapital_T.93              Baghdad                     0.003
58SWAN/KIM Cluster Architecture
- At present, KIM is used for massive semantic
annotation in the context of the SWAN and SEKT
projects
- Here are some of its features
- support for a virtually unlimited number of
annotators
- centralized ontology storage and querying
- centralized meta-data (annotations) and document
storage, indexing, and querying
- support for multiple crawlers (or other data
sources)
- dynamic reconfiguration of the cluster (e.g.
starting new crawlers or annotators on demand).
59SWAN/KIM Cluster Console
60SWAN Project Semantic Web Annotator
- Large Scale Annotation of human language for the
Semantic Web using Human Language Technology
(HLT).
- Hosted by DERI (NUIG, Galway) and also involves
- the GATE team (from Sheffield University's NLP
Group) and
- Ontotext Lab.
- For more details take a look at
http://deri.ie/projects/swan/
- The current status
- KIM Cluster of 7 servers in DERI
- Above 0.5TB shared storage
- 6 AMD64 Opterons, 6 Xeons, 36GB RAM
61CoreDB Name and Goals
- CoreDB is a component of KIM
- Stands for Co-Occurrence and Ranking of Entities
DB
- In a nutshell, it is designed to allow fast
queries of the sort
- Q1: the number of appearances of "UK" in
documents during Jan 2005
- Q2: all people co-occurring with "John Smith" and
some bank institution in documents from the
second half of 2003
- Q3: Q2, where the documents contain "fraud" and
the name of the institution contains "capital"
62CoreDB Functionality
- It allows asking in a structured manner for
- The number of references to entities in a
(sub-)set of documents
- The entities which co-occur with other
entities
- Entities can be constrained by
- Class (and its sub-classes)
- Keyword/token in one of its names/aliases/labels
- Documents can be constrained according to DC-like
features
- Date (range; could be any date in the doc)
- Type (exact match; could be any string)
- Authors
- Title and Sub-title
- Keyword/token in the content, authors or the
title fields
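The two query types listed above can be illustrated with an in-memory toy index. This sketch is not the CoreDB API or schema: the corpus layout, the entity and class names, and the functions `count_references` and `co_occurring_people` are assumptions, and sub-class handling and keyword constraints are left out for brevity.

```python
from collections import Counter
from datetime import date

# Hypothetical toy corpus: each document has a date and (entity, class) mentions.
DOCS = [
    {"date": date(2003, 7, 14),
     "mentions": [("John Smith", "Person"), ("Jane Doe", "Person"),
                  ("Acme Capital Bank", "Bank")]},
    {"date": date(2003, 9, 2),
     "mentions": [("John Smith", "Person"), ("Bob Roe", "Person")]},
    {"date": date(2004, 1, 5),
     "mentions": [("John Smith", "Person"), ("Jane Doe", "Person"),
                  ("Acme Capital Bank", "Bank")]},
]


def in_range(doc, start, end):
    """True if the document date falls within [start, end]."""
    return start <= doc["date"] <= end


def count_references(docs, entity, start, end):
    """Q1-style query: mentions of `entity` in documents dated in [start, end]."""
    return sum(
        1
        for doc in docs if in_range(doc, start, end)
        for name, _ in doc["mentions"] if name == entity
    )


def co_occurring_people(docs, entity, other_class, start, end):
    """Q2-style query: people appearing together with `entity` and with
    at least one entity of class `other_class`."""
    counts = Counter()
    for doc in docs:
        if not in_range(doc, start, end):
            continue
        names = {name for name, _ in doc["mentions"]}
        classes = {cls for _, cls in doc["mentions"]}
        if entity not in names or other_class not in classes:
            continue
        for name, cls in doc["mentions"]:
            if cls == "Person" and name != entity:
                counts[name] += 1
    return counts


second_half_2003 = (date(2003, 7, 1), date(2003, 12, 31))
references = count_references(DOCS, "John Smith", *second_half_2003)
people = co_occurring_people(DOCS, "John Smith", "Bank", *second_half_2003)
```

A linear scan like this is only workable for a toy corpus; at realistic scale the same questions would need precomputed occurrence tables or indexes.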
63The Scale of Ambition
- The major point is to allow such queries in an
efficient manner over data with the following
cardinality
- 10^6 entities/terms
- 10^7 documents
- 10^2 entities occurring in an average document
- This means managing and querying efficiently 10^9
entity occurrences
- We have tested the current implementation with
10^7 occurrences and it answers the basic
queries in milliseconds.
64CoreDB Applications
- Detection of associative links between
entities, based on co-occurrence in documents
- It is an alternative to the detection of strong
links based on local context parsing
- Ranking (measuring the popularity) of an entity
over a set of documents
- The ranking is only as good/relevant/representative
as the set of documents is
- Computing timelines (changes over time) for
entity ranking or co-occurrence
- How did our popularity in the IT press change
during June? (i.e. what is the effect of this
1.5M Euro media campaign?)
- How does the strength of association between
organization X and RDF change over Q1?
65Implementation
- It is a new component in the architecture of KIM
- Having an API (part of the KIM API) allows
different implementations
- There are now a couple of RDBMS-based
implementations
- Derby (free, open-source, 100% Java, formerly
Cloudscape from IBM)
- ORACLE (v. 10g)
- The Derby implementation does not allow for
efficient searches involving keywords
- The ORACLE implementation is also used for
FTS-style indexing of the document contents
- Makes possible an efficient combination of
semantic and keyword search (which is already
available through the SemanticQuery API)
- In both RDBMS implementations
- Part of the ontology and the KB are replicated
- Same with part of the document and index-related
information
66Ontotext Facts
- Founded: year 2000
- 14 employees (permanent, without the shared
personnel and associates)
- Daily statistics for http://www.ontotext.com:
over 150 visits, 2000 hits
- Number of scientific publications: above 30
- Number of projects running: 9
- More than 20 partners we directly cooperate with
on projects
- Average age: about 28
- Number of servers per developer: 0.7
67Ontotext Lab
- Robust Technology
- and Professional Services for
- Knowledge and Language Engineering
- http://www.ontotext.com