Title: CORPORUM-OntoExtract
1 CORPORUM-OntoExtract: Ontology Extraction Tool
Author: Robert Engels
Company: CognIT a.s
2 Overview
- On-To-Knowledge project
- CORPORUM
- CORPORUM-OntoExtract
- Discussion
- Conclusion
3 What is Knowledge Management?
Knowledge Management is the collection of
processes that govern the creation,
dissemination, and utilization of knowledge.
--- Brian Newman, 1991
4 What is the On-To-Knowledge (OTK) project?
Goal: develop tools and methods for supporting
knowledge management relying on sharable and
reusable knowledge ontologies. The technical
backbone of On-To-Knowledge is the use of
ontologies for the various tasks of information
integration and mediation.
5 What is the On-To-Knowledge (OTK) project?
- European project in the EU Information Society
Technologies (IST) Programme, EU-IST-10132
- Duration: 2.5 years, January 2000 - June 2002
- Total effort and cost: 26 person-years, 2.5 M EUR
- Partners
- CognIT a.s
- AIdministrator
- AIFB (University of Karlsruhe)
- BT Research
- Enersearch
- Swiss Life Information Systems Research Group
6 CognIT a.s
- Established in Halden, Norway in 1996.
- 20 employees, 3 with a PhD
- CORPORUM™
- Develops technology for
- intelligent search by means of agents
- text analysis and extraction
- structuring and fusing data to build knowledge
- knowledge bases and feedback of experience
- data mining and text mining
7 On-To-Knowledge workbench
- CORPORUM-OntoExtract: extracts ontologies from
unstructured documents and represents them in
XML/RDF/OWL
- CORPORUM-OntoWrapper: extracts ontologies from
structured documents and represents them in
XML/RDF/OWL
- RDF-DB (Sesame)
- RDF-Ferret: interface between users and RQL
- OntoEdit (ontology editor)
- RQL engine: queries the RDF-DB
- DAML+OIL: representation language
8 The On-To-Knowledge system architecture
9 Introduction to CORPORUM
CORPORUM is a tool for information retrieval and
extraction developed by CognIT a.s. It can:
- crawl the internet and intranets
- analyze relevance and content
- maintain a knowledge base (RDF-DB)
Features
- focuses on the content
- searches, cataloguing, summaries and extractions
can be performed according to user interests
- founded on CognIT's Mimir technology
10 The overall CORPORUM architecture
11 Introduction to CORPORUM
Core technology: Mimir, which includes
- Linguistic analysis: analyzes text at all linguistic
levels and generates an ontology of interest to the
user in RDF.
- Similarity analysis: retrieves the documents that are
most pertinent to a specific analyzed text
(information retrieval and extraction).
12 Classical natural language processing, decomposed
13 Mimir architecture
14 Introduction to CORPORUM
Information distribution
Histogram showing where the desired content in
the document can be found and to what degree it
is pertinent.
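The idea of such a histogram can be illustrated with a small sketch. Mimir's actual relevance measure is not described in these slides, so plain term counting per segment is used here as a stand-in; the function name `relevance_histogram` and its parameters are hypothetical.

```python
from typing import List


def relevance_histogram(text: str, terms: List[str], segments: int = 5) -> List[int]:
    """Count hits of the desired terms in each equal-sized segment of the text.

    Assumption: simple term counting stands in for the real relevance measure.
    """
    words = text.lower().split()
    wanted = {t.lower() for t in terms}
    # Segment size, rounded up so every word falls into one of the segments.
    size = max(1, -(-len(words) // segments))
    counts = [0] * segments
    for i, word in enumerate(words):
        if word.strip(".,;:!?") in wanted:
            counts[i // size] += 1
    return counts


text = ("ontology tools extract ontology terms from web pages "
        "and store ontology data")
histogram = relevance_histogram(text, ["ontology"], segments=4)

# Print one bar per segment; longer bars mean more pertinent content there.
for index, count in enumerate(histogram):
    print(f"segment {index}: {'#' * count}")
```

Each bar corresponds to one part of the document, so a reader can see where the desired content is concentrated.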
15 CORPORUM-OntoExtract
- The web-based version of CORPORUM
- Uses the same architecture as CORPORUM
- Extracts ontologies from unstructured web pages
- Represents extracted ontologies in XML/RDF/OIL
16 CORPORUM-OntoExtract
- CMOntoBuild: takes care of overall control of the
system and coordinates all information flows
- CMWebHandler: responsible for collecting all
(text) documents from a specific site
- CMCogLib: analyzes texts, extracts information,
and exports a variety of formats
- CMLexEn: language-dependent support module for
CMCogLib
- CMWebInteract: communication component that takes
care of all interaction of CORPORUM-OntoExtract
with the RDF database. Responsible for querying
the RDF-DB, as well as submitting final analysis
results.
- DOMhandler: integrated in CMWebInteract, the
OpenXML DOM handler takes care of the
interpretation of the results that are returned
from the RDF server
17 CORPORUM-OntoExtract performs the following tasks
- CMOntoBuild is invoked by the user
- CMWebHandler is invoked by CMOntoBuild
- CMWebHandler retrieves the specified domain from
the intranet/internet and returns it to CMOntoBuild
- CMOntoBuild passes the texts to CMCogLib, which
analyzes, interprets and extracts information
from these texts and returns a basic RDF
representation to CMOntoBuild
- CMOntoBuild then analyzes the generated RDF and
queries the RDF ontology repository to try to
find knowledge that can augment the previously
generated RDF
- When all possible queries have been performed
and the RDF has been augmented, the final RDF
ontology for a specific document is sent to the
RDF server together with a reference to the
original text
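The task sequence above can be sketched as a small control loop. This is an illustration only: the callback names (`fetch_documents`, `analyse_text`, `query_repository`, `submit`), the triple representation, and the in-memory repository are assumptions made for the example, not the actual CORPORUM-OntoExtract interfaces.

```python
from typing import Callable, Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object); stand-in for RDF


def build_ontologies(
    domain: str,
    fetch_documents: Callable[[str], Dict[str, str]],  # role of CMWebHandler
    analyse_text: Callable[[str], Set[Triple]],        # role of CMCogLib
    query_repository: Callable[[str], Set[Triple]],    # RDF-DB query
    submit: Callable[[str, Set[Triple]], None],        # RDF-DB submission
) -> None:
    """Hypothetical control flow in the role of CMOntoBuild."""
    for url, text in fetch_documents(domain).items():
        rdf = analyse_text(text)  # basic RDF for one document
        seen: Set[str] = set()
        pending = {s for s, _, _ in rdf} | {o for _, _, o in rdf}
        while pending:  # keep querying until nothing new turns up
            term = pending.pop()
            seen.add(term)
            extra = query_repository(term) - rdf
            rdf |= extra  # augment the generated RDF
            for s, _, o in extra:
                pending |= {s, o} - seen
        submit(url, rdf)  # final RDF plus a reference to the source text


# Toy stand-ins for the web site and the ontology repository.
pages = {"http://example.org/a": "cats are animals"}
repository = {"animal": {("animal", "subClassOf", "organism")}}
stored: Dict[str, Set[Triple]] = {}

build_ontologies(
    "example.org",
    fetch_documents=lambda domain: pages,
    analyse_text=lambda text: {("cat", "subClassOf", "animal")},
    query_repository=lambda term: repository.get(term, set()),
    submit=lambda url, rdf: stored.update({url: rdf}),
)
```

Passing the components in as callbacks mirrors the division of labour on the slide: the controller only coordinates, while retrieval, analysis, and repository access live in separate modules.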
18 Client/server-based system architecture of
CORPORUM-OntoExtract
19 The overall CORPORUM architecture
20 CORPORUM-OntoExtract output
- Namespace definitions
- Dublin Core based metadata
- Property definitions
- Ontology
- Facts/instances
- Cross-taxonomic relations
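To make the parts listed above concrete, here is a minimal sketch of how such an output document might be laid out, built with Python's standard `xml.etree.ElementTree`. The RDF, RDF Schema, and Dublin Core namespace URIs are the standard ones; the `http://example.org/onto#` namespace, the class names, the instance `felix`, and the `relatedTo` property are invented for illustration and are not taken from real OntoExtract output. The cross-taxonomic relation is interpreted here as a non-hierarchical link between two classes.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/onto#"  # hypothetical namespace for extracted terms
XML = "http://www.w3.org/XML/1998/namespace"

# Namespace definitions
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("dc", DC), ("ex", EX)):
    ET.register_namespace(prefix, uri)


def q(ns: str, name: str) -> str:
    """Build a qualified name in ElementTree's {namespace}name notation."""
    return f"{{{ns}}}{name}"


root = ET.Element(q(RDF, "RDF"), {q(XML, "base"): "http://example.org/onto"})

# Dublin Core based metadata about the analyzed document
doc = ET.SubElement(root, q(RDF, "Description"),
                    {q(RDF, "about"): "http://example.org/page.html"})
ET.SubElement(doc, q(DC, "title")).text = "Example page"
ET.SubElement(doc, q(DC, "language")).text = "en"

# Property definition
ET.SubElement(root, q(RDF, "Property"), {q(RDF, "ID"): "relatedTo"})

# Ontology: a small class hierarchy
ET.SubElement(root, q(RDFS, "Class"), {q(RDF, "ID"): "Animal"})
cat = ET.SubElement(root, q(RDFS, "Class"), {q(RDF, "ID"): "Cat"})
ET.SubElement(cat, q(RDFS, "subClassOf"), {q(RDF, "resource"): "#Animal"})
ET.SubElement(root, q(RDFS, "Class"), {q(RDF, "ID"): "Veterinarian"})

# Fact/instance: felix is a Cat
ET.SubElement(root, q(EX, "Cat"), {q(RDF, "ID"): "felix"})

# Cross-taxonomic relation (assumed meaning: a link outside the hierarchy)
ET.SubElement(cat, q(EX, "relatedTo"), {q(RDF, "resource"): "#Veterinarian"})

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```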
21 Discussion on use of CORPORUM technology in
OntoExtract
Content in natural language vs. content in
structure
- CORPORUM-OntoExtract can capture content without
considering the layout and structure of the
texts.
- In some cases, the structure of the text has to be
considered, for example in contracts and licenses.
- CORPORUM-OntoWrapper handles such structured
documents.
22 Discussion on use of CORPORUM technology in
OntoExtract
Diversity of web pages (unknown intention)
- Diversity of documents on the web
- It is difficult to analyze a text according to
the intention of the writers
- Combining CORPORUM-OntoExtract with
CORPORUM-OntoWrapper might address some of these
issues
23 Discussion on use of CORPORUM technology in
OntoExtract
Representational issues (A-box vs. T-box
reasoning)
- TBox: consists of (class) concept inclusion
axioms (and/or equivalence axioms), e.g., "C subsumes
D".
- ABox: consists of individual/tuple
membership axioms, e.g., "x is an instance of C"
or "<x,y> is an instance of R".
- Most of the knowledge generated by
CORPORUM-OntoExtract is TBox knowledge.
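The distinction can be illustrated by sorting a handful of statements into the two boxes. This is a minimal sketch: the triple notation and the choice of `subClassOf` and `equivalentClass` as the terminological predicates are assumptions made for the example, not the representation used by CORPORUM-OntoExtract.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Predicates treated as terminological (schema-level); an assumption here.
TBOX_PREDICATES = {"subClassOf", "equivalentClass"}


def split_tbox_abox(triples: Iterable[Triple]) -> Tuple[List[Triple], List[Triple]]:
    """Separate concept-level axioms (TBox) from membership axioms (ABox)."""
    tbox: List[Triple] = []
    abox: List[Triple] = []
    for triple in triples:
        if triple[1] in TBOX_PREDICATES:
            tbox.append(triple)
        else:
            abox.append(triple)
    return tbox, abox


triples = [
    ("D", "subClassOf", "C"),  # TBox: C subsumes D
    ("x", "type", "C"),        # ABox: x is an instance of C
    ("x", "R", "y"),           # ABox: <x,y> is an instance of R
]
tbox, abox = split_tbox_abox(triples)
print("TBox:", tbox)
print("ABox:", abox)
```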
24 Discussion on use of CORPORUM technology in
OntoExtract
Domain specificity of extracted knowledge
- Since the ontologies are extracted from specified
domains, the extracted information is expected to
be restricted to these domains.
- Positive: many of the searches will also be
rather domain-specific, so knowledge about
cross-taxonomic relations can be very useful.
- Negative: one may want to build
domain-independent knowledge bases.
25 Conclusion
- CORPORUM helps the web become more semantic.
- Semantic-based technology.
- Enhances the usability of formal knowledge
representations for end users
- Decreases the initial effort of defining an
ontology in new domains
26 Conclusion
- Dynamicity of the analysis, i.e. ease of use in
dynamic environments
- Offers new ways of navigating knowledge bases and
document sets by visualization of contents and
by means of semantic-based, graphic structures
- Extraction of content-based metadata from
documents, such as important concepts, semantic
structures, etc.
- Ability to offer domain-specific information as
related keywords
27 Comments
- The description is too general, with no examples
or details.
- Weak sentences and complicated sentence structures.
28 Questions