Title: Formal and Informal Approaches to Indexing
1Formal and Informal Approaches to Indexing
Assimilating Data Knowledge
- Presentation for NIH/BCIG
- Sept. 2002
- Gary Berg-Cross
- Knowledge Strategies/SLAG
2The Problem
- "We are drowning in information but starved for knowledge. This level of information is clearly impossible to be handled by present means. Uncontrolled and unorganized information is no longer a resource in an information society; instead, it becomes the enemy." -- John Naisbitt, author of the 1982 bestseller Megatrends
- Common Approaches
- Systematically formalize and standardize
- OR do what is practical even if not well founded.
3Outline
- Thematic Overview
- Access to nuggets
- IR, summarization Portals
- Structured Data, Warehousing Metadata
- XML Knowledge
- Concluding Thoughts
4We are flooded and more is on the way
Which file has my article on Wolfram?
Who's using Topic Maps for healthcare data?
After Tim Finin (UMBC), Intelligent Information
Systems on the Web and in the Aether
5Evolving User Tasks and Technology
- Pull technology
- User requests information in an interactive
manner - 3 access tasks
- Retrieval (classical IR systems)
- Browsing (hypertext)
- Browsing and retrieval (modern digital libraries
and web systems)
- Push technology
- automatic and permanent pushing of information to
user - software agents
- example news service
- filtering (retrieval task) relevant information
for later inspection by user
6Simple Model of Hospital Information System (HIS)
Data Movement
Vertical silo" or "stovepipe" phenomenon At
present most clinical software systems are
closed with little or no interoperability
between them.
MD needed to share this information and know
what it is.
Sharing and exchange of clinical data are
currently impeded by the lack of standards for
electronic health records and the lack of
harmonization between different clinical
computing systems. Clinical data are locked in
a variety of different incompatible databases
7In the Context of such Complexity
- How can we find anything?
- How do we gather information that is distributed over various computer systems and represented using different formats?
- If we find something, how do we know that it is complete?
- How can large amounts of information be analyzed?
- How do we integrate diverse types of info?
- Can I create summaries?
- What information (or derived information) can we believe and trust?
8Evolving Old Devices to Aid Digital Access
- Indexes (weak links, no semantics)
- Glossaries
- Keywords, e.g.
- metadata, metadata architecture, gene map, dynamic query, standard
- Thesauri
- Catalogs
- Cross-references
- Can these be integrated/assimilated?
- To traditional uses we add desire for
- Easy maintenance of links.
- End of cross-reference nightmare.
- So we can focus on semantics of linking, while
machines will do the addressing job.
9General Themes Throughout
- We need bridging work in the middle
- There is something like a continuum from data to knowledge, with lots of action in the middle
- Sharing metadata
- Enterprise architectures
- Data/semantic modeling experience as a baseline for XML
- (Besides continued deepening of existing approaches) Integration and reuse to test reality as we go
- We simplify organizational efforts in 2 competing ways: work approximately (scruffy) or formally (neat)
- There are scruffy approaches, models, etc.
- Granular complexity and heterogeneity drive us to scruffy pragmatics
- Scruffy models may be formalized into neat models and vice versa
- Text coded as HTML or XML
- Ontologies are realized by scruffier data models, glossaries and vocabularies
10Data to Knowledge Continuum
[Diagram: the data-to-knowledge continuum, from un-structured through the middle to structured. Labels: text search (key words); data queries; knowledge management; metadata management and data mining; XML queries and text mining. Scruffy side: few general principles, K is a kludge, inelegant shortcuts necessary. Neat side: theories should be neat, elegant, parsimonious, and we understand exactly how theories behave. Ongoing research to construct formal agent ontologies, like KB development, is difficult.]
11Recent View of Index Management
- Content is a set of document units
- Files, pages, images, sections (now XML elements)
- Each document unit should be reusable outside its creation context
- Classical tools to achieve reusability
- Unique identification and network-addressability of document units
- External indexing using controlled vocabulary, domain-expert indexers, and/or DB-based indexing tools
- Internal annotation by standard metadata and metadata-aware technology
12This Static Integrated Index Management
- Makes strong assumptions ...
- Stability of documents' content and address
- Universally understood vocabulary for index subjects
- Integration of metadata and external indexation of documents
13The Real Document World
- Is dynamic: documents are moving targets
- Index subjects make sense only in a local vocabulary/context
- Internal metadata and external indexing are managed separately and are likely to be inconsistent
Falling behind?
14Information Access Contrasted to Data Access
XML may allow this. Covered later.
Can I integrate these?
15Scruffy Vs. Neat Approaches Computational Models
- Neats believe that engineered programming/logic is king
- Scruffy methods appear sloppy, succeeding by luck, with no insights about how intelligence works
- Neat generally involves provable algorithms and starts top-down, modeling higher-level behavior
- IT 'neats' try to build systems that process/reason in a formal way
- Scruffies favor looser, more ad-hoc methods driven by empirical knowledge
- To a scruffy, neat methods appear to be hung up on formalism irrelevant to the hard-to-capture 'common sense'
- Scruffies (Brooks etc.) favor a bottom-up approach to produce complex behavior from the interactions of simple rules
- 'Scruffies' may mix approaches to see what happens
But we may mix a scruffy method with a formal representation, and vice versa
16Formal Representation Example (Bipartite SM)
Person: Gary -
  (measure) -> height @68 @units: inches
  (measure) -> weight @180
  (part) -> hair -> (has_color) -> color: brown
  (part) -> body -> (part) -> stomach -> (has_thickness) -> thickness: thick
  (part) -> eye -
    (shape) -> circular
    (has_size) -> size: small
    (has_color) -> color: brown
Person: Instance-Of Class, Primitive, Relation, Set, Thing
        Subclass-Of Entity, Individual, Individual-Thing, Thing ...
17Logic is Easy, IM is Hard
- The knowledge needed to solve a commonsense reasoning problem is typically much more extensive and general than the knowledge needed to solve difficult problems. (McCarthy)
- The knowledge needed to solve well-formulated problems in fields such as physics or mathematics is bounded.
- In contrast, there are no a priori limitations to the facts that are needed to solve commonsense problems (manage documents); the given knowledge may be incomplete
- one may have to use approximate concepts and approximate theories
- one will generally have to use nonmonotonic reasoning to reach conclusions, and one will need some ability to reflect upon one's own reasoning processes.
18Explicit vs. Implicit Knowledge
- Commonsense knowledge is often implicit, whereas the knowledge needed to solve well-formulated difficult problems is often explicit.
- E.g., the knowledge needed to solve integrals is explicitly found in a standard calculus book.
- However, the knowledge needed to arrange a meeting or talk exists in vague, implicit form.
- Tacit knowledge must first be made explicit, a time-consuming task requiring a serious knowledge engineering (KE) effort.
19Advances may add Scruffy or Formal Info in formal
or informal ways
[Chart: progress/quality of the knowledge-extraction result plotted against time.]
For example, a formal testing process is fundamentally slow and cannot be conducted exhaustively. Consequently, some argue that the usual case for model testing has to be approximate, non-exhaustive testing, i.e. some subset of the possible tests is chosen and executed.
20Do Granular Differences Limit the Scope of Formal Approaches?
- Movement from primitive to compound/conglomerate
concepts
Complexity challenges complete, formal understanding
Moderately complex: <cc1 ... cc>
Conglomerate 1 <c1 ... cn>
Conglomerate 2 <c2 ... c>
Glue?
Compound 1 <b1, ... bn>
Compound 2 <b2, ... bx>
Compound 3 <b13 ... by ...>
Working Base: b1, b2, b3
Ultimate Base: P1, P2, P3
21My work environment is moving to an enterprise approach
[Diagram: enterprise architecture with data providers and users around a corporate data warehouse. Components include a text base with XML tools, text and analytical processing systems, data entry, printing, electronic documents, source systems, raw tables, multidimensional data, statistics, global metadata (including repositories), an MD repository, a metadata processing system, and generalized/reusable software components.]
22Abstracting Silos and Integration
[Diagram: three silos (a data/DW app, an XML app, and a text app), each with its own silo integration at the app/data level (indexes, RDF/MD, models/MD) and its own tools at the MD level; research spans the silos. Inter-silo integration? Ontologies, lexicons, standard vocabularies; general scruffy knowledge underneath.]
23Focus of Current Work on Integration
[Diagram: current integration work. App/data level: DOD Enterprise Architecture (EI/DS, CHCS ...), MD repository and registries (HDD ...), portals (eBPS), architecture models, XML schemas, docs/web pages. MD level: HTML MD indexes, RDF/MD, and models/MD, supported by modeling tools, portal tools and XML tools, with the Common Warehouse Model and MetaIntegration tools across them. Underneath: standardized knowledge (the scruffy limit), e.g. standard medical dictionaries, HL7 models and vocabularies.]
24Focus and Goals of Work
- Build a scalable, integrated, standardized information system infrastructure to provide
- A library and forum portal with document management
- Explore standards and tools to annotate semi-structured information with concepts obtained from DOD concept-oriented, controlled terminology.
- Help the user to index information
- Tools to help integrate data and model metadata represented/produced by MHS logical and system Enterprise Architectures, which will serve as a prototype for later work including
- DOD HA system integration and development
- DOD/VA enterprise sharing (models before data)
- USHIK repository built on ISO/IEC 11179
- Design a metadata repository with database system infrastructure to store content-dependent metadata, concept-oriented indexing metadata (data annotations), and links of the metadata to the primary data sets in systems and projects
Information and knowledge sharing through common representation languages, ontologies and protocols combines neat and scruffy techniques
25Some Information Access Requirements Documents,
Portals Content etc.
- Text retrieval, search and processing
- Challenges and contests
- TREC (text retrieval), MUC (NLP), TIPSTER (summaries)
- Summarization to simplify browsing and create better content indexes
- Integration of text and data
- Can I have a report please???
26Information Access AKA information retrieval
(IR)
- Goal is to
- Find a needle: help users query documents to satisfy information needs
- make things easy to find
- Existence is assumed in some formal fashion
- Copy theory of knowledge: query reuse (what about implicit content?)
- Context is challenging
Categorization
27Typical Indexing Arrangement
Document Sources
Stands alone
Full text representation: most complete representation, high computational cost
Set of index terms or keywords extracted directly from text, or specified by human subjects (information science): most concise representation, poor quality of retrieval
Text Index
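A minimal sketch of the keyword-index idea above, in Python (the sample documents, stopword list and term extraction are hypothetical illustrations, not a production indexer):

    # Build an inverted index: each extracted term maps back to the documents containing it.
    import re
    from collections import defaultdict

    STOPWORDS = {"the", "of", "a", "and", "for", "in", "on", "is"}

    def extract_terms(text):
        """Lowercase, split on non-letters, and drop stopwords."""
        return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

    def build_index(documents):
        """Map each index term to the set of document ids that contain it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for term in extract_terms(text):
                index[term].add(doc_id)
        return index

    docs = {
        "d1": "Gene map metadata for the dynamic query interface",
        "d2": "Metadata architecture standard for clinical data",
    }
    print(build_index(docs)["metadata"])   # both d1 and d2

This is the "most concise representation" trade-off in miniature: cheap to build and store, but retrieval quality is only as good as the extracted terms.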
28The Retrieval Process After Indexing
Index is just a list, with no integration or summary. What if we formalize content a bit? What if we allow fuzzy concepts?
Coordinated
- Summarizer
- Creates a topic-based (thematic) summary of document content.
- Its outputs might include
- a sentence-extracted summary,
- keywords and phrases for each of the topics in a document.
- Do we have general principles or allow scruffy heuristics?
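A naive sentence-extraction summarizer sketch (illustrative only; scoring sentences by raw word frequency is exactly the kind of scruffy heuristic the last bullet asks about, and it inherits the coherence problems discussed on slide 30):

    import re
    from collections import Counter

    def summarize(text, n_sentences=2):
        """Keep the n highest-scoring sentences, scored by content-word frequency."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"[a-z]+", text.lower()))
        scored = sorted(
            sentences,
            key=lambda s: sum(freq[w] for w in re.findall(r"[a-z]+", s.lower())),
            reverse=True,
        )
        chosen = set(scored[:n_sentences])
        # Re-emit the chosen sentences in their original order to preserve some coherence.
        return " ".join(s for s in sentences if s in chosen)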
29We Now Have Expanded Content Search Techniques
AltaVista automatically surfs and indexes the web. Yahoo catalogues and organizes useful web sites. Fulcrum provides a way to query both full-text and RDB sources from a single user interface. Excite also tracks queries and classifies customers. Firefly builds customer profiles. Alexa collects webpages and their usage. Google ranks the reference importance of web pages. Junglee integrates diverse sources into a VDB (weak summarization). Verity, Infoseek, Inktomi (HotBot) . . . (after Wiederhold)
Yahoo organizes and indexes material but not by general principles; it seems scruffy at the macro level. Is any of the metadata in these efforts reusable beyond keywords?
30Automatic Text Summarization and Indexing to Manage the Explosion and Simplify the Messy Population of Portals
- Historically relied on the extraction of key sentences from the summarized document
- Many tradeoffs
- May convey explanations missing in the original.
- But extracted sentences may contain extraneous information, which stretches the length of the summary and increases the chance of introducing incoherence.
- Because sentences are extracted without context, at best they can be incoherent and at worst they can convey misleading information.
- The summary extract also lacks balance and text structure (Paice, 1990). We may allow human editing.
- In the last 10 years the focus has been to develop summary generation techniques that can do better than naive extraction. Theoretical foundations, including cognitive models, are making this NEAT.
31Making Web Publishing NEATER (Part 1): Dublin Core and the Medical Core Standard
- Metadata model of indices for use with Web content; resulted from a meeting in Dublin (Ohio!)
- Provides a generic model for core components
- Title, Author, Keywords, Description, Publisher, Resource Type, Format, Resource Identifier, Language, etc.
- Dublin Core for Medicine: Medical Core Metadata (MCM) added some resource types
- meeting, pathology images, radiology images, patient educational material, review, practice guidelines, etc.
- Implementation of MeSH information in the Dublin Core Metadata
- Malet G, Munoz F, Appleyard R, Hersh W. J Am Med Inform Assoc 1999 Mar-Apr;6(2):163-72
- NOTE: Dr. Malet published the first "list of medical sites" on the Internet --- it survives today as Medical Matrix (www.medmatrix.org)
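A small sketch of what such a record might look like when emitted as HTML meta tags (the record values are hypothetical; the element names follow the core list above):

    # Emit Dublin Core elements as HTML <meta> tags for a hypothetical guideline page.
    record = {
        "DC.Title": "Breast Cancer Screening Protocol",
        "DC.Creator": "NIH",
        "DC.Subject": "breast neoplasms; mass screening",   # e.g. MeSH-style headings
        "DC.Type": "practice guideline",                     # an MCM-added resource type
        "DC.Format": "text/html",
        "DC.Language": "en",
    }
    meta_tags = "\n".join(
        f'<meta name="{name}" content="{value}">' for name, value in record.items()
    )
    print(meta_tags)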
32This is a Public Document Portal. DOD has its own
Documents by subject with indexes, etc. It
includes structured data and text!!!
33Example of Categorization Push for Portals
- A categorization engine is used for sorting documents into folders based on a taxonomy.
- People try general principles but usually wind up with a hybrid, since this document/knowledge engineering is hard.
- The categorization engine may do this based on metadata in the documents, based on business rules, based on the content of the document, based on search criteria or filters, or some other scheme.
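A minimal sketch of such a hybrid engine (folder names, rules and keyword lists are hypothetical): explicit business rules over document metadata are tried first, with keyword matching over content as the fallback.

    # Hybrid categorization sketch: metadata/business rules first, then content keywords.
    RULES = [
        (lambda meta: meta.get("source") == "pharmacy", "Medications"),
        (lambda meta: meta.get("doc_type") == "lab_report", "Lab Results"),
    ]
    KEYWORDS = {
        "Medications": {"prescription", "dosage", "refill"},
        "Lab Results": {"specimen", "assay", "hematocrit"},
    }

    def categorize(meta, text):
        for rule, folder in RULES:
            if rule(meta):
                return folder
        words = set(text.lower().split())
        best = max(KEYWORDS, key=lambda folder: len(KEYWORDS[folder] & words))
        return best if KEYWORDS[best] & words else "Uncategorized"

    print(categorize({}, "refill the prescription at the listed dosage"))   # Medications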
34Based on Concepts Rather than Words
- Access is increasingly concept-based, in an informal way, to handle context. It's important.
- The words "prices", "prescription", and "patent" are highly likely to co-occur with the medical sense of "drug"
- "Abuse", "paraphernalia", and "illicit" are likely to co-occur with the illegal-drug sense of "drug"
- Church and Liberman, 1991
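The co-occurrence idea can be sketched in a few lines (the cue words come from the slide; everything else is an illustration of the observation, not Church and Liberman's method):

    # Pick the sense of "drug" whose typical neighbours appear in the passage.
    SENSE_CUES = {
        "medical": {"prices", "prescription", "patent"},
        "illegal": {"abuse", "paraphernalia", "illicit"},
    }

    def drug_sense(passage):
        words = set(passage.lower().split())
        scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
        return max(scores, key=scores.get)

    print(drug_sense("prescription drug prices under the new patent rules"))   # medical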
35The Universe of Portals Info using a Scruffy
Information Directory?
[Diagram: an information directory (ID) spanning the universe of portal information: glossary of professional terms, spreadsheets, project catalogs, management metadata, data collection, databases and DWs, publications, data stores/marts, unstructured XML documents, and structured data reports, organized along corporate, region, service and market dimensions. Functions: categorize, summarize, understanding for collaboration.]
36Towards an Intelligent Enterprise
[Diagram: integrated enterprise data models (clinical resource, patient info, marketing resource) serving multiple communities, fed by a data warehouse, ODS and data marts, other apps and external content sources, with BI apps, BI data and BI metadata.]
37Let's Move to Structured Data
Grounding Of Instances
Formally Defined MD
Standards
- Useful to Consider Three Levels
- Integration examples
38Metadata Consistency Issues
- A hierarchy of collection contexts for information (earlier drug example)
- What we mean depends on these contexts.
- Architecture collections
- Data models / functional collections
- Messages / data elements
- Clinical vocabulary domains
A problem is the disparate nature of metadata collection and reuse. Because there is no coordinated MD repository, the same metadata may be defined repeatedly by various groups/departments. Hence, as a field, healthcare metadata are inconsistent.
39Federal Enterprise Architecture Framework (FEAF)
40Architecture Examples
High Level Operational Graphic (OV-1)
Views
Domain and naming are not typically managed between the conceptual products, during the product development cycle, and especially across instances of the Architecture.
41Need for Standard Terms to Support Business Model
Mapping
An activity in Model 1
Different definitions
Activities in Model 2
Too high a granularity, too informal.
42Levels Scope Info Integration Conceptual
Spaces
Each part maintains a body of information, not
easily coordinated or interoperable with other
collections.
Evolved over time without the benefit of strict data standardization policies or enforcement. The need to exchange info and use it requires semantic interoperability and coordination (MD, XML tags, etc.).
43Why Coordination Interoperability is Hard
- Groups found it easier to do each part themselves
- at least the first time; now they are legacies
- Lack of an Enterprise Architecture approach
- meant gaps in architecture, data models, etc.
- No coding scheme is comprehensive
- Drugs, lab tests, signs and symptoms
- Lack of a common business-area data model
- some standards and products are competing for this
- Structure is not coordinated with terms/vocabulary
- Duplication of MD and XML tags
- Proprietary interests
44Data Warehouse Architecture: an opportunity to standardize integration
Load all the data periodically into a
warehouse. Separate operational from decision
support DBMS.
OLAP / Decision support/ Data cubes/ data mining
User queries
Relational database (warehouse)
Data extract, transform load (ETL)
Data cleaning/ scrubbing
MD capture
Data source
Data source
Data source
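A toy version of this flow, assuming nothing about the real warehouse (table layouts, source rows and column names are hypothetical): extract rows from a source, clean and transform them, load them into the warehouse, and capture load metadata (lineage) on the way.

    import sqlite3
    from datetime import datetime, timezone

    source_rows = [("A123", " 98.6", "2002-09-01"), ("A124", "101.2 ", "2002-09-02")]

    def transform(row):
        """Cleaning/scrubbing step: trim strings and coerce types."""
        patient_id, temp, visit_date = row
        return (patient_id.strip(), float(temp), visit_date)

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE vitals (patient_id TEXT, temp_f REAL, visit_date TEXT)")
    warehouse.execute(
        "CREATE TABLE load_metadata (table_name TEXT, source TEXT, loaded_at TEXT, row_count INTEGER)"
    )

    clean = [transform(r) for r in source_rows]
    warehouse.executemany("INSERT INTO vitals VALUES (?, ?, ?)", clean)
    warehouse.execute(               # MD capture: record where the rows came from and when
        "INSERT INTO load_metadata VALUES (?, ?, ?, ?)",
        ("vitals", "clinic_feed_1", datetime.now(timezone.utc).isoformat(), len(clean)),
    )
    warehouse.commit()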
45Kinds of Metadata
- There are several kinds of metadata that people commonly talk about.
- One is structural metadata
- schemas, interface definitions, and other data-structure-like things, which describe how information is put together.
- E.g. a database schema in a database system
- E.g. a Web site map that describes how the Web pages are connected to one another
- Process metadata
- Descriptive / definitional metadata
- As seen in information retrieval.
- It's things like keyword descriptions and other content-oriented descriptions of information.
There are different tools for each major kind, although there is some integration.
46Neat Attempt OMG CWM Metadata Standard
- Metadata is used for building, maintaining, managing, and using DB collections such as data marts and warehouses.
- Most data management, analysis and MD-driven tools have their own infrastructure and use different MD representations
- Metamodels are needed to exchange MD
- The Object Management Group (OMG) developed a standard, the Common Warehouse Metamodel (CWM), to help manage MD
- It provides a framework for representing metadata about data sources, data targets, transformations and analysis, and the processes and operations that create and manage warehouse data and provide lineage information about its use.
- Note, there are new algebras to manipulate MD (P. Bernstein of Microsoft)
47DOD HA Data Warehouse Repository
MHS Metadata Repository as Metadata Hub
48An overall MD Repository process
Had someone already written this Schema?
Outlined in the XML/EDI Group's Repository white
paper in 1999
49Common Warehouse Model (Neat! An attempt to solve
the consistency problems)
Standard definitions on metadata for all these
subjects (UML formalism).
50Example of Areas Covered
51Contact Info
52Keys Index Model
Index: Instances of the Index class represent the ordering of the instances of some other Class; the Index is said to span the Class. Indexes normally have an ordered set of attributes of the Class instances they span that make up the key of the index; this set of relationships is represented by the IndexedFeature class, which indicates how the attributes are used by the Index instance. The Index class is intended primarily as a starting point for tools that require the notion of an index.
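A rough reading of that description in code (a sketch, not the normative CWM class definitions): an Index spans a class and carries an ordered list of IndexedFeatures, each naming one attribute of the spanned class that participates in the key.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class IndexedFeature:
        attribute: str          # attribute of the spanned class that is part of the key
        ascending: bool = True  # how this attribute contributes to the ordering

    @dataclass
    class Index:
        name: str
        spanned_class: str                       # the class whose instances are ordered
        features: List[IndexedFeature] = field(default_factory=list)
        is_unique: bool = False

    patient_name_idx = Index(
        name="ix_patient_name",
        spanned_class="Patient",
        features=[IndexedFeature("last_name"), IndexedFeature("first_name")],
    )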
53About Extensible Markup Language(XML)
- A tag-based data format and simple SGML subset for structured document/web interchange
- XML shares some things in common with the display-format-oriented HTML. Both formats save their information in plain text files.
- XML is focused on document structure rather than on document formatting.
- XML defines a set of tags used for representing text as various pieces of information: an address, a phone number, a price, etc.
- XML creates an environment where text may be communicated as information.
- But language requires syntax, vocabulary, and semantics.
- Tag myopia: only the syntax is formally defined (the other part is HARD)
54XML DTDs/Schemas take a step in the Neat direction
Schemas help by ...
relating common terms between documents by tag labels
<CV> ... private ...
(after Frank van Harmelen and Jim Hendler)
55XML semi-structure
- XML allows for structure. With XML, the embedding of one element in another declares the structure of the data.
- Simply having the Address element as a sub-element of Patient "tells" the receiving application that this address belongs to this person.
- <Patient><Address zip="20854">12 N. Grove Road Potomac MD</Address>    <Address>41 S. Soldier Road Arlington VA</Address></Patient>
- flat files don't easily allow for structure.
- A piece of information like Patient or Address marked by the presence of tags is called an element.
- Elements are further enriched by attaching name-value pairs (for example, zip="20854" in the example above) called attributes.
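A minimal sketch (standard library only) of how a receiving application reads that nesting: because Address is a sub-element of Patient, the parser hands back the addresses already attached to the patient, with zip available as an attribute.

    import xml.etree.ElementTree as ET

    doc = """<Patient><Name>Gary</Name>
      <Address zip="20854">12 N. Grove Road Potomac MD</Address>
      <Address>41 S. Soldier Road Arlington VA</Address>
    </Patient>"""

    patient = ET.fromstring(doc)
    for addr in patient.findall("Address"):
        print(addr.text.strip(), addr.get("zip"))   # element text plus name-value attribute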
56Pure XML -- Schema LanguageDocument Type
Definitions (DTDs)
- <!ELEMENT element-type content-model>
- Defines the content model of an element type
- Element-type is the name of the element (or tag)
- Content-model is a regular expression defining the structure of sub-elements
- Data if a leaf
- <!ATTLIST element-type attribute-name attribute-type>
- Defines, for elements named element-type, associated attributes and their types
- Element-type is the name of the element (or tag)
57XML DTDs exist for things like <Prescription>
- <Prescription>
- <Medication.Name>Amoxicillin</Medication.Name>
- <Form>250 mg. Capsule</Form>
- <Dispense>30</Dispense>
- <Dosage Amount="1">1 cap(s)</Dosage>
- <Instructions>3 times daily until gone</Instructions>
- <Refill Number="0">no refills</Refill>
- <Substitute>can substitute generic equivalent</Substitute>
- </Prescription>
Drug form?
Early tags were not carefully named. This creates a legacy problem. Coding helps make it a processable form, but there are no semantics.
58Can map Relational Data to XML
R
<R> <tuple> <A>a1</A> <B>b1</B> <C>c1</C> </tuple> <tuple> <A>a2</A> <B>b2</B> <C>c2</C> </tuple> </R>
(XML Tree)
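A sketch of that mapping (column names follow the generic R/A/B/C example above): each row becomes a <tuple> element with one child element per column.

    import xml.etree.ElementTree as ET

    columns = ("A", "B", "C")
    rows = [("a1", "b1", "c1"), ("a2", "b2", "c2")]

    root = ET.Element("R")
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, value in zip(columns, row):
            ET.SubElement(tup, col).text = value

    print(ET.tostring(root, encoding="unicode"))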
59Tags are Names; More Work Is Required
- Assuming you have developed a robust markup
language for data exchange, you still need to
perform the following tasks - Metadata Mapping. Like DWs you must understand
the metadata between systems that will
communicate using your new markup language. You
must map the physical storage of the application
to the elements and attributes in the markup
language. - Data Mapping. You must map the data content as
well, e.g. if the source and target systems
expect a different set of valid values for an
element, you must provide the rules for the
translation. You may also have to combine data
mapping and metadata mapping -- for example, when
a set of source data values maps to one place in
the target under one set of conditions, and to a
different place in the target under a different
set of conditions. Data Mapping can be especially
problematic when the source contains many default
values -- but the default value is not valid in
the target. - Null Mapping. If one system allows many values to
be null, but the other system cannot handle
nulls, some allowance must be made for this.
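A hedged sketch of those three mapping tasks (all field names, value sets and defaults are hypothetical): translate valid values, route a value conditionally, and substitute a target-legal default where the source allows nulls.

    VALUE_MAP = {"M": "male", "F": "female"}          # data mapping: source codes -> target codes

    def map_record(src):
        tgt = {}
        tgt["sex"] = VALUE_MAP.get(src.get("gender"), "unknown")
        phone = src.get("phone")
        if phone is None:
            tgt["phone"] = "NOT PROVIDED"              # null mapping: target cannot hold nulls
        elif src.get("phone_type") == "work":
            tgt["work_phone"] = phone                  # conditional routing to a different target field
        else:
            tgt["home_phone"] = phone
        return tgt

    print(map_record({"gender": "F", "phone": None}))  # {'sex': 'female', 'phone': 'NOT PROVIDED'}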
60Making XML Neater Structural Schemas
- Problems with DTDs
- no data types or specialization/extension of types
- no "higher level" modeling (classes, relationships, constraints, etc.)
- Integration schemas
- primitive data types
- integers, dates, and the like, based on our experience with SQL, Java primitives, byte sequences
- cardinality constraints
- Inheritance.
- Making kind-of relations explicit would make both understanding and maintenance easier
- markup languages are now commonly pressed into service as "data modeling languages" and "conceptual modeling languages", although (to some of us) the particular features of (SGML/XML) markup languages render them unsuitable to the task.
61WWW (S or N?)
- Starts with minimal formalism and proceeds to add scruffy complexity
[Chart: growth/effectiveness over time. A neat (N) start with HTML/HTTP in the late 90s, scruffy (S) growth in Web pages, and a qualitative change toward a semantic foundation with XML/RDF/SOAP...]
62XML the Semantic Web Thrust
- The Web builds on a simple but neat start. Now messy.
- "Semantic web", coined by Tim Berners-Lee, entails adding "concept/meaning" information to Web content
- Globalize: link structured collections of information in a general, automatically processable way
- agent systems cooperate to facilitate resource discovery, intelligent browsing, e-commerce, etc.
- Uses a simple ontological approach: structuring relies on the eXtensible Markup Language (XML), the Resource Description Framework (RDF) and RDF Schema.
- The RDF model is like an ERA (entity-relationship-attribute) model, but is open to interpretation: relationships are not rigid definitions and object attributes are not fixed in class definitions.
- Instead they are linked by a Uniform Resource Identifier (URI). Thus anyone can make a relationship to a topic, and anyone can provide such a view of meaning via a URI.
63Semantic Web (SW) would do what?
- Concept-based search
- rather than keyword-based search
- Semantic navigation rather than link-based navigation
- Personalization
- rather than one size fits all
- Query answering
- rather than document retrieval
- Services
- rather than just CGI calls: service-description languages, negotiation, service composition, etc.
After Tim Finin
64Semantic Summarization Tagging
- Metadata would greatly enhance the ability to link Web content semantically --- by meaning, rather than just by keyword or Web-master arrangement
- Automated semantic tagging
- Semantic information has traditionally been added manually --- VERY costly and not practical for the way content is created today
- Example: MeSH "coding" by the NLM
- Unclear whether it can be done consistently among various human indexers
- Scruffy tagging and summarization tools may be beneficial -- even if not perfect
- Return to this in the Topic Map discussion
65Three Layered SW Architecture
Logic Layer: formal semantics and agent reasoning support (DAML+OIL)
RDF Schema Layer (Brickley and Guha, 2000): defines a simple ontological vocabulary (class/sub-class) to help ensure model/MD consistency
Data/Resource Instance Layer: uses a subject, property, object model and statement syntax for metadata (RDF)
66Resource Description Framework (RDF/RDF-Schema)
- Metadata model
- The designer can describe objects, add properties
to define and describe them, and also make
statements about the objects (statements about
relationships between resources). - The specification comes in two sections
- Basic instance model/syntax (viewed as directed,
labeled graphs) - RDF Schemas
67Resource Description Framework (RDF)
- Metadata is useful for information retrieval (esp. if no other schema info or semantics is available)
- Idea: representation-independent encoding of MD as triples (Resource, PropertyType, Value)
- (NIH, Protocol.creator, Cancer Protocol), (Cancer Protocol, Description.Title, breast cancer), ...
[Diagram: www.NIH... --DC:Name--> Cancer Protocol --Description.Title--> Breast Cancer]
Maps into logic
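The same triples written with the third-party rdflib package (an assumption; the triples mirror the slide's example, and the Dublin Core property names are illustrative):

    from rdflib import Graph, Literal, Namespace, URIRef

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    protocol = URIRef("http://www.nih.gov/protocols/cancer-protocol")   # hypothetical resource URI

    g = Graph()
    g.add((protocol, DC.creator, Literal("NIH")))                # (Resource, PropertyType, Value)
    g.add((protocol, DC.title, Literal("Breast Cancer Protocol")))

    print(g.serialize(format="xml"))   # the same metadata rendered as RDF/XML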
68Resource Description Framework Metadata Role
- RDF is essentially an extended layer on top of XML and uses a simple data model, expressed in XML syntax, as the basis for a language for representing properties of Web resources/collections.
- Resources include images, documents and the relationships held between them.
- RDF provides interoperability between applications that exchange information.
- When XML data is in RDF format, applications can understand the data without knowing who sent it.
- XML points to a resource to scope and uniquely identify a set of properties known as the schema.
69Example of an RDF Model
[Diagram: an RDF statement with RDF subject, RDF predicate and RDF object nodes; example labels include the predicate Sanctioned-by and the resource www.HHS/HC-gp.]
70The Challenge of a Semantic Web Semantic web
languages today
- Limited semantics
- Besides RDF there is
- DAML: DARPA Agent Markup Language, http://www.daml.org/ (OIL)
- with another under development by the W3C
- OWL: Web Ontology Language, http://www.w3.org/2001/sw/
- Reasoning limited to inheritance
71Topic Maps (TM) Approach
- One of my favorite examples of a light formalization/reification attempt -- still underway
- TMs are a collection of topics (semantically meaningful) and their relationships, with a standardized notation for interchangeably representing information about the structure of the information resources used to define topics
- Topic Maps link these topics with external referents (objects), such as resources behind URLs
- XTM: an XML-based interchange format for topic maps
- looks like a semantic net
Essentially we have a weak semantic model of indexed document topics
72Recapitulation Why Coordination
Interoperability is Hard
- As before, groups find it easier to do each part themselves
- at least the first time; now they are legacies
- Lack of integrating (Enterprise Architecture) approaches now means gaps between RDF etc.
- No vocabulary scheme is yet comprehensive
- Semantics of drugs, lab tests, signs and symptoms
- Lack of common metadata models
- some standards and products are competing for this
- Structure is not coordinated with terms/vocabulary
- Duplication of MD and now RDF language
- Proprietary interests
73Many Efforts Need Ontologies for Assimilation
- The essence of the interoperability problem is semantics (e.g. as in a shared conceptual model of a particular application domain --- idea: knowledge base), not syntax
- Ontologies provide a vocabulary for representing knowledge about a domain and for describing specific situations in a domain (a tool for defining and describing domain-specific vocabularies) --- idea: language for communication
- For data/knowledge translation and transformation (they provide a solution to the translation problem between different terminologies) and for fusion and refinement of existing knowledge --- idea: interoperation
- As reusable building blocks to build systems that solve particular problems in the application domain --- idea: model reuse
74Some Appropriate Middle Modeling Steps
- Recognize the interplay between S and N and their evolution
- Push to capture essential semantic relationships, e.g. express, constrain, and validate the relations
- Push to free effort from artificial syntax requirements locked into transfer/interchange/import/export (make it concept-based)
- Support principles of semantic transparency as a/the preeminent concern for scruffy needs
- Make models accessible to, and tuned for use by, the principal domain experts and 'end users' as stakeholders
- Work needs to become sufficiently formal to support testing for conceptual integrity
75Mapping Account Glossary Content
Conceptual Graph Model (mappable to RDF (Corby et al., 2000))
Account: A customer, usually an institution or another organization, that purchases a company's products or services.
Account: Customer - (prototype) -> institution, organization
  State: Customer -> (Agent) -> purchase -> product/service (poss) -> institution
Uses a class hierarchy such as Entity > Legal Entity > Organization > Customer Organization ...
Human action: Purchase
76On the Other hand.
- Formal Knowledge Engineering hasn't been a clear victory
- KE may need to be a permanent task
- Continuous Knowledge Engineering is an alternative approach to KE that embraces the philosophy that knowledge systems are open-ended, dynamic artifacts that develop through a learning process in reaction to their environment.
- Implicit knowledge must first be made explicit, which is a time-consuming task requiring a serious knowledge engineering effort.
77Opportunities
- The vast amounts of information with little or no structure published over the World-Wide-Web raise a host of new, challenging problems for data-mining research; examples include web resource discovery and topic distillation, web structure/linkage mining, intelligent web searching and crawling, and personalization of web content.
- Knowledge Discovery in Biological Data Management Systems and Bioinformatics: high-performance data mining tools will play a crucial role in the analysis of the ever-growing databases of genetic sequences accumulated over the course of large bioinformatics efforts (e.g., the Human Genome Project).
78Discussion
- Scruffy vs. Neat Reasoning
- Knowledge Soup: The Chaos and Complexity of the Human Mind. John F. Sowa, http://residentassociates.org/com/Soup2.htm
- Neat vs Scruffy: A Review of Computational Models for Spatial Expressions. Amitabha Mukerjee, Center for Robotics, http://www.cs.albany.edu/amit/review.html
- J. McCarthy, From Here to Human-Level Intelligence, Proceedings of the Fifth International Conference on Principles of Knowledge Representation and Reasoning (KR'96), Cambridge, MA, November 1996, Morgan Kaufmann, San Mateo, CA (1996), pp. 640-646.
- Info Access References
- C. Paice, "Constructing literature abstracts by computer: Techniques and prospects," Information Processing and Management, vol. 26, pp. 171-186, 1990.
- SUMMARIST: Automated Text Summarization, http://www.isi.edu/cyl/summarist/summarist.html
- Church K. and M. Liberman (1991) "A Status Report on the ACL/DCI". In Proc. of the 7th Annual Conference of the UW Centre for the New OED and Text Research: Using Corpora, pp. 84-91.
79Data Models Metadata references
- Phil Bernstein, "Representing and Reasoning About Mappings between Domain Models," 18th National Conference on Artificial Intelligence (AAAI 2002), Edmonton, Canada. http://www.cs.washington.edu/homes/jayant/Pubs/SemanticsAAAI02.pdf
- MetaIntegration tools, http://www.metaintegration.net/
- Implementation of MeSH information in the Dublin Core Metadata. Malet G, Munoz F, Appleyard R, Hersh W. J Am Med Inform Assoc 1999 Mar-Apr;6(2):163-72
- Ontologies and Knowledge Models
- Towards Continuous Knowledge Engineering, by Klaas Schilstra, thesis, Delft University of Technology, http://www.kbs.twi.tudelft.nl/Publications/PhD/2002-Schilstra-PhD.html
- Towards Situated Knowledge Acquisition, Tim Menzies, http://www.phil.canterbury.ac.nz/tom_bestor/etexts/Menzies2020Towards20Situated20Knowledge20Acquisition.htm
- Formal Ontology, Conceptual Modelling, and Knowledge Engineering, http://www.ladseb.pd.cnr.it/infor/ontology/Papers/OntologyPapers.html
- DAML Homepage, http://www.daml.org/
80XML Registries
- W3C: http://www.w3.org/
- US Federal CIO Council: http://xml.coverpages.org/CIO-Council-XML-DevelopersGuidenceVersion1.pdf
- ASC X12 Reference Model: http://www.x12.org/x12org/comments/X12Reference_Model_For_XML_Design.pdf
- Metadata registries (USHIK etc.): http://www.bls.gov/ore/pdf/st000010.pdf
81RDF Metadata Registries
- The Open Metadata Registry Prototypes - Dublin Core Metadata Initiative: http://wip.dublincore.org:8080/registry/Registry
- SCHEMAS Project Registry: http://www.schemas-forum.org/registry/
- DESIRE Registry: http://www.ukoln.ac.uk
- SWAG WebNS Registry: http://webns.net/
- Xmlns.com Registry: http://xmlns.com/
- ULIS Open Metadata Registry: http://avalon.ulis.ac.jp/registry/
82Semantic Web Topic maps
- W3C Semantic Web Site: http://www.w3.org/2001/sw/
- SemanticWeb.org: http://www.semanticweb.org/
- The Emerging Semantic Web: Selected Papers from the First Semantic Web Working Symposium. Edited by Isabel Cruz, Stefan Decker, Jérôme Euzenat, and Deborah McGuinness. Volume 75 in the Frontiers in Artificial Intelligence and Applications series, IOS Press, Amsterdam (NL), 2002. 300pp., hardcover, ISBN 1-58603-255-0 (IOS Press)
- Markup Languages: Comparison and Examples, http://trellis.semanticweb.org/expect/web/semanticweb/comparison.html
- Work in progress on Topic Maps: http://www.topicmaps.net/