Title: Formal and Informal Approaches to Indexing
1Formal and Informal Approaches to Indexing
Assimilating Data Knowledge
- Presentation for NIH/BCIG
- Sept. 2002
- Gary Berg-Cross
- Knowledge Strategies/SLAG
2The Problem
- "We are drowning in information but starved for knowledge. This level of information is clearly impossible to be handled by present means. Uncontrolled and unorganized information is no longer a resource in an information society; instead, it becomes the enemy." -- John Naisbitt, author of the 1982 bestseller Megatrends
- Common Approaches
- Systematically formalize and standardize
- OR do what is practical even if not well founded.
3Outline
- Thematic Overview
- Access to nuggets
- IR, summarization Portals
- Structured Data, Warehousing Metadata
- XML Knowledge
- Concluding Thoughts
4We are flooded and more is on the way
Which file has my article on Wolfram?
Who's using Topic Maps for healthcare data?
After Tim Finin (UMBC), Intelligent Information
Systems on the Web and in the Aether
5Evolving User Tasks and Technology
- Pull technology
- User requests information in an interactive
manner - 3 access tasks
- Retrieval (classical IR systems)
- Browsing (hypertext)
- Browsing and retrieval (modern digital libraries
and web systems)
- Push technology
- automatic and permanent pushing of information to
user - software agents
- example news service
- filtering (retrieval task) relevant information
for later inspection by user
6Simple Model of Hospital Information System (HIS)
Data Movement
Vertical silo" or "stovepipe" phenomenon At
present most clinical software systems are
closed with little or no interoperability
between them.
MD needed to share this information and know
what it is.
Sharing and exchange of clinical data are
currently impeded by the lack of standards for
electronic health records and the lack of
harmonization between different clinical
computing systems. Clinical data are locked in
a variety of different incompatible databases
7In the Context of such Complexity
- How can we find anything?
- How do we gather information that is distributed over various computer systems and represented using different formats?
- If we find something, how do we know that it is complete?
- How can large amounts of information be analyzed?
- How do we integrate diverse types of info?
- Can I create summaries?
- What information (or derived information) can we believe and trust?
8Evolving Old Devices to Aid Digital Access
- Indexes (weak links, no semantics)
- Glossaries
- Keywords, e.g.
- metadata, metadata architecture, gene map, dynamic query, standard
- Thesauri
- Catalogs
- Cross-references
- Can these be integrated/assimilated?
- To traditional uses we add desire for
- Easy maintenance of links.
- End of cross-reference nightmare.
- So we can focus on semantics of linking, while
machines will do the addressing job.
9General Themes Throughout
- We need bridging work in the middle
- There is something like a continuum from data to knowledge, with lots of action in the middle
- Sharing metadata
- Enterprise architectures
- Data/semantic modeling experience as a baseline for XML
- (Besides continued deepening of existing approaches) Integration and reuse to test reality as we go
- We simplify organizational efforts in 2 competing ways: work approximately (scruffy) or formally (neat)
- There are scruffy approaches, models, etc.
- Granular complexity and heterogeneity drive us to scruffy pragmatics
- Scruffy models may be formalized into neat models and vice versa
- Text coded as HTML or XML
- Ontologies are realized by scruffier data models, glossaries and vocabularies
10Data to Knowledge Continuum
[Diagram: the data-to-knowledge continuum, from un-structured through the middle to structured. Labels: text search (key words); data queries; knowledge management; metadata management and data mining; XML queries and text mining. Scruffy side: few general principles, K is a kludge, inelegant shortcuts necessary. Neat side: theories should be neat, elegant, parsimonious, and we understand exactly how theories behave. Ongoing research to construct formal agent ontologies, like KB development, is difficult.]
11Recent View of Index Management
- Content is a set of document units
- Files, pages, images, sections (now XML elements)
- Each document unit should be reusable outside its creation context
- Classical tools to achieve reusability
- Unique identification and network-addressability of document units
- External indexing using controlled vocabulary, domain-expert indexers, and/or DB-based indexing tools
- Internal annotation by standard metadata and metadata-aware technology
12This Static Integrated Index Management
- Makes strong assumptions ...
- Stability of documents' content and address
- Universally understood vocabulary for index subjects
- Integration of metadata and external indexation of documents
13The Real Document World
- Is dynamic: documents are moving targets
- Index subjects make sense only in a local vocabulary/context
- Internal metadata and external indexing are managed separately and are likely to be inconsistent
Falling behind?
14Information Access Contrasted to Data Access
XML may allow this. Covered later.
Can I integrate these?
15Scruffy Vs. Neat Approaches Computational Models
- Neats believe that engineered programming/logic is king
- Scruffy methods appear sloppy, succeeding by luck, with no insights about how intelligence works
- Neat generally involves provable algorithms and starts top-down, modeling higher-level behavior
- IT 'neats' try to build systems that process/reason in a formal way
- Scruffies favor looser, more ad-hoc methods driven by empirical knowledge
- To a scruffy, neat methods appear to be hung up on formalism irrelevant to the hard-to-capture 'common sense'
- Scruffies (Brooks etc.) favor a bottom-up approach to produce complex behavior from the interactions of simple rules
- 'Scruffies' may mix approaches to see what happens
But we may mix a scruffy method with a formal representation, and vice versa
16Formal Representation Example (Bipartite SM)
Person: Gary -
  (measure) -> height @68 @units: inches
  (measure) -> weight @180
  (part) -> hair -> (has_color) -> color: brown
  (part) -> body -> (part) -> stomach -> (has_thickness) -> thickness: thick
  (part) -> eye -
    (shape) -> circular
    (has_size) -> size: small
    (has_color) -> color: brown
Person: Instance-Of Class, Primitive, Relation, Set, Thing
        Subclass-Of Entity, Individual, Individual-Thing, Thing ...
17Logic is Easy, IM is Hard
- The knowledge needed to solve a commonsense reasoning problem is typically much more extensive and general than the knowledge needed to solve difficult problems. (McCarthy)
- The knowledge needed to solve well-formulated problems in fields such as physics or mathematics is bounded.
- In contrast, there are no a priori limitations to the facts that are needed to solve commonsense problems (manage documents); the given knowledge may be incomplete
- one may have to use approximate concepts and approximate theories
- one will generally have to use nonmonotonic reasoning to reach conclusions, and one will need some ability to reflect upon one's own reasoning processes.
18Explicit vs. Implicit Knowledge
- Commonsense knowledge is often implicit, whereas the knowledge needed to solve well-formulated difficult problems is often explicit.
- E.g., the knowledge needed to solve integrals is explicitly found in a standard calculus book.
- However, the knowledge needed to arrange a meeting or talk exists in vague, implicit form.
- Tacit knowledge must first be made explicit, a time-consuming task requiring a serious knowledge engineering (KE) effort.
19Advances may add Scruffy or Formal Info in formal
or informal ways
[Chart: progress/quality of the knowledge-extraction result plotted against time.]
For example, a formal testing process is fundamentally slow and cannot be conducted exhaustively. Consequently, some argue that the usual case for model testing has to be approximate, non-exhaustive testing, i.e. some subset of the possible tests is chosen and executed.
20Do Granular Differences Limit the Scope of Formal Approaches?
- Movement from primitive to compound/conglomerate
concepts
Complexity challenges complete, formal understanding
Moderately complex: <cc1 ... cc>
Conglomerate 1 <c1 ... cn>
Conglomerate 2 <c2 ... c>
Glue?
Compound 1 <b1, ... bn>
Compound 2 <b2, ... bx>
Compound 3 <b13 ... by ...>
Working Base: b1, b2, b3
Ultimate Base: P1, P2, P3
21My work environment is moving to an enterprise approach
[Diagram: enterprise architecture with data providers and users around a corporate data warehouse. Components include a text base with XML tools, text and analytical processing systems, data entry, printing, electronic documents, source systems, raw tables, multidimensional data, statistics, global metadata (including repositories), an MD repository, a metadata processing system, and generalized/reusable software components.]
22Abstracting Silos and Integration
[Diagram: three silos (a data/DW app, an XML app, and a text app), each with its own silo integration at the app/data level (indexes, RDF/MD, models/MD) and its own tools at the MD level; research spans the silos. Inter-silo integration? Ontologies, lexicons, standard vocabularies; general scruffy knowledge underneath.]
23Focus of Current Work on Integration
[Diagram: current integration work. App/data level: DOD Enterprise Architecture (EI/DS, CHCS ...), MD repository and registries (HDD ...), portals (eBPS), architecture models, XML schemas, docs/web pages. MD level: HTML MD indexes, RDF/MD, and models/MD, supported by modeling tools, portal tools and XML tools, with the Common Warehouse Model and MetaIntegration tools across them. Underneath: standardized knowledge (the scruffy limit), e.g. standard medical dictionaries, HL7 models and vocabularies.]
24Focus and Goals of Work
- Build a scalable, integrated, standardized information system infrastructure to provide
- A library and forum portal with document management
- Explore standards and tools to annotate semi-structured information with concepts obtained from DOD concept-oriented, controlled terminology.
- Help the user to index information
- Tools to help integrate data and model metadata represented/produced by MHS logical and system Enterprise Architectures, which will serve as a prototype for later work including
- DOD HA system integration and development
- DOD/VA enterprise sharing (models before data)
- USHIK repository built on ISO/IEC 11179
- Design a metadata repository with database system infrastructure to store content-dependent metadata, concept-oriented indexing metadata (data annotations), and links of the metadata to the primary data sets in systems and projects
Information and knowledge sharing through common representation languages, ontologies and protocols combines neat and scruffy techniques
25Some Information Access Requirements Documents,
Portals Content etc.
- Text retrieval, search and processing
- Challenges and contests
- TREC (text retrieval), MUC (NLP), TIPSTER (summaries)
- Summarization to simplify browsing and create better content indexes
- Integration of text and data
- Can I have a report please???
26Information Access AKA information retrieval
(IR)
- Goal is to
- Find a needle: help users query documents to satisfy information needs
- make things easy to find
- Existence is assumed in some formal fashion
- Copy theory of knowledge: query reuse (what about implicit content?)
- Context is challenging
Categorization
27Typical Indexing Arrangement
Document Sources
Stands alone
Full text representation: most complete representation, high computational cost
Set of index terms or keywords extracted directly from text, or specified by human subjects (information science): most concise representation, poor quality of retrieval
Text Index
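A minimal sketch of the keyword-index idea above, in Python (the sample documents, stopword list and term extraction are hypothetical illustrations, not a production indexer):

    # Build an inverted index: each extracted term maps back to the documents containing it.
    import re
    from collections import defaultdict

    STOPWORDS = {"the", "of", "a", "and", "for", "in", "on", "is"}

    def extract_terms(text):
        """Lowercase, split on non-letters, and drop stopwords."""
        return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

    def build_index(documents):
        """Map each index term to the set of document ids that contain it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for term in extract_terms(text):
                index[term].add(doc_id)
        return index

    docs = {
        "d1": "Gene map metadata for the dynamic query interface",
        "d2": "Metadata architecture standard for clinical data",
    }
    print(build_index(docs)["metadata"])   # both d1 and d2

This is the "most concise representation" trade-off in miniature: cheap to build and store, but retrieval quality is only as good as the extracted terms.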
28The Retrieval Process After Indexing
Index is just a list, with no integration or summary. What if we formalize content a bit? What if we allow fuzzy concepts?
Coordinated
- Summarizer
- Creates a topic-based (thematic) summary of document content.
- Its outputs might include
- a sentence-extracted summary,
- keywords and phrases for each of the topics in a document.
- Do we have general principles or allow scruffy heuristics?
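A naive sentence-extraction summarizer sketch (illustrative only; scoring sentences by raw word frequency is exactly the kind of scruffy heuristic the last bullet asks about, and it inherits the coherence problems discussed on slide 30):

    import re
    from collections import Counter

    def summarize(text, n_sentences=2):
        """Keep the n highest-scoring sentences, scored by content-word frequency."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"[a-z]+", text.lower()))
        scored = sorted(
            sentences,
            key=lambda s: sum(freq[w] for w in re.findall(r"[a-z]+", s.lower())),
            reverse=True,
        )
        chosen = set(scored[:n_sentences])
        # Re-emit the chosen sentences in their original order to preserve some coherence.
        return " ".join(s for s in sentences if s in chosen)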
29We Now Have Expanded Content Search Techniques
AltaVista automatically surfs and indexes the web. Yahoo catalogues and organizes useful web sites. Fulcrum provides a way to query both full-text and RDB sources from a single user interface. Excite also tracks queries and classifies customers. Firefly builds customer profiles. Alexa collects webpages and their usage. Google ranks the reference importance of web pages. Junglee integrates diverse sources into a VDB (weak summarization). Verity, Infoseek, Inktomi (HotBot) . . . (after Wiederhold)
Yahoo organizes and indexes material but not by general principles; it seems scruffy at the macro level. Is any of the metadata in these efforts reusable beyond keywords?
30Automatic Text Summarization and Indexing to Manage the Explosion and Simplify the Messy Population of Portals
- Historically relied on the extraction of key sentences from the summarized document
- Many tradeoffs
- May convey explanations missing in the original.
- But extracted sentences may contain extraneous information, which stretches the length of the summary and increases the chance of introducing incoherence.
- Because sentences are extracted without context, at best they can be incoherent and at worst they can convey misleading information.
- The summary extract also lacks balance and text structure (Paice, 1990). We may allow human editing.
- In the last 10 years the focus has been to develop summary generation techniques that can do better than naive extraction. Theoretical foundations, including cognitive models, are making this NEAT.
31Making Web Publishing NEATER (Part 1): Dublin Core and the Medical Core Standard
- Metadata model of indices for use with Web content; resulted from a meeting in Dublin (Ohio!)
- Provides a generic model for core components
- Title, Author, Keywords, Description, Publisher, Resource Type, Format, Resource Identifier, Language, etc.
- Dublin Core for Medicine: Medical Core Metadata (MCM) added some resource types
- meeting, pathology images, radiology images, patient educational material, review, practice guidelines, etc.
- Implementation of MeSH information in the Dublin Core Metadata
- Malet G, Munoz F, Appleyard R, Hersh W. J Am Med Inform Assoc 1999 Mar-Apr;6(2):163-72
- NOTE: Dr. Malet published the first "list of medical sites" on the Internet --- it survives today as Medical Matrix (www.medmatrix.org)
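A small sketch of what such a record might look like when emitted as HTML meta tags (the record values are hypothetical; the element names follow the core list above):

    # Emit Dublin Core elements as HTML <meta> tags for a hypothetical guideline page.
    record = {
        "DC.Title": "Breast Cancer Screening Protocol",
        "DC.Creator": "NIH",
        "DC.Subject": "breast neoplasms; mass screening",   # e.g. MeSH-style headings
        "DC.Type": "practice guideline",                     # an MCM-added resource type
        "DC.Format": "text/html",
        "DC.Language": "en",
    }
    meta_tags = "\n".join(
        f'<meta name="{name}" content="{value}">' for name, value in record.items()
    )
    print(meta_tags)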
32This is a Public Document Portal. DOD has its own
Documents by subject with indexes, etc. It
includes structured data and text!!!
33Example of Categorization Push for Portals
- A categorization engine is used for sorting documents into folders based on a taxonomy.
- People try general principles but usually wind up with a hybrid, since this document/knowledge engineering is hard.
- The categorization engine may do this based on metadata in the documents, based on business rules, based on the content of the document, based on search criteria or filters, or some other scheme.
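A minimal sketch of such a hybrid engine (folder names, rules and keyword lists are hypothetical): explicit business rules over document metadata are tried first, with keyword matching over content as the fallback.

    # Hybrid categorization sketch: metadata/business rules first, then content keywords.
    RULES = [
        (lambda meta: meta.get("source") == "pharmacy", "Medications"),
        (lambda meta: meta.get("doc_type") == "lab_report", "Lab Results"),
    ]
    KEYWORDS = {
        "Medications": {"prescription", "dosage", "refill"},
        "Lab Results": {"specimen", "assay", "hematocrit"},
    }

    def categorize(meta, text):
        for rule, folder in RULES:
            if rule(meta):
                return folder
        words = set(text.lower().split())
        best = max(KEYWORDS, key=lambda folder: len(KEYWORDS[folder] & words))
        return best if KEYWORDS[best] & words else "Uncategorized"

    print(categorize({}, "refill the prescription at the listed dosage"))   # Medications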
34Based on Concepts Rather than Words
- Access is increasingly concept-based, in an informal way, to handle context. It's important.
- The words "prices", "prescription", and "patent" are highly likely to co-occur with the medical sense of "drug"
- "Abuse", "paraphernalia", and "illicit" are likely to co-occur with the illegal-drug sense of "drug"
- Church and Liberman, 1991
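The co-occurrence idea can be sketched in a few lines (the cue words come from the slide; everything else is an illustration of the observation, not Church and Liberman's method):

    # Pick the sense of "drug" whose typical neighbours appear in the passage.
    SENSE_CUES = {
        "medical": {"prices", "prescription", "patent"},
        "illegal": {"abuse", "paraphernalia", "illicit"},
    }

    def drug_sense(passage):
        words = set(passage.lower().split())
        scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
        return max(scores, key=scores.get)

    print(drug_sense("prescription drug prices under the new patent rules"))   # medical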
35The Universe of Portals Info using a Scruffy
Information Directory?
[Diagram: an information directory (ID) spanning the universe of portal information: glossary of professional terms, spreadsheets, project catalogs, management metadata, data collection, databases and DWs, publications, data stores/marts, unstructured XML documents, and structured data reports, organized along corporate, region, service and market dimensions. Functions: categorize, summarize, understanding for collaboration.]
36Towards an Intelligent Enterprise
[Diagram: integrated enterprise data models (clinical resource, patient info, marketing resource) serving multiple communities, fed by a data warehouse, ODS and data marts, other apps and external content sources, with BI apps, BI data and BI metadata.]
37Let's Move to Structured Data
Grounding Of Instances
Formally Defined MD
Standards
- Useful to Consider Three Levels
- Integration examples
38Metadata Consistency Issues
- A hierarchy of collection contexts for information (earlier drug example)
- What we mean depends on these contexts.
- Architecture collections
- Data models / functional collections
- Messages / data elements
- Clinical vocabulary domains
A problem is the disparate nature of metadata collection and reuse. Because there is no coordinated MD repository, the same metadata may be defined repeatedly by various groups/departments. Hence, as a field, healthcare metadata are inconsistent.
39Federal Enterprise Architecture Framework (FEAF)
40Architecture Examples
High Level Operational Graphic (OV-1)
Views
Domain and naming are not typically managed between the conceptual products, during the product development cycle, and especially across instances of the Architecture.
41Need for Standard Terms to Support Business Model
Mapping
An activity in Model 1
Different definitions
Activities in Model 2
Too high a granularity, too informal.
42Levels Scope Info Integration Conceptual
Spaces
Each part maintains a body of information, not
easily coordinated or interoperable with other
collections.
Evolved over time without the benefit of strict data standardization policies or enforcement. The need to exchange info and use it requires semantic interoperability and coordination (MD, XML tags, etc.).
43Why Coordination Interoperability is Hard
- Groups found it easier to do each part themselves
- at least the first time; now they are legacies
- Lack of an Enterprise Architecture approach
- meant gaps in architecture, data models, etc.
- No coding scheme is comprehensive
- Drugs, lab tests, signs and symptoms
- Lack of a common business-area data model
- some standards and products are competing for this
- Structure is not coordinated with terms/vocabulary
- Duplication of MD and XML tags
- Proprietary interests
44Data Warehouse Architecture: an opportunity to standardize integration
Load all the data periodically into a
warehouse. Separate operational from decision
support DBMS.
OLAP / Decision support/ Data cubes/ data mining
User queries
Relational database (warehouse)
Data extract, transform load (ETL)
Data cleaning/ scrubbing
MD capture
Data source
Data source
Data source
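A toy version of this flow, assuming nothing about the real warehouse (table layouts, source rows and column names are hypothetical): extract rows from a source, clean and transform them, load them into the warehouse, and capture load metadata (lineage) on the way.

    import sqlite3
    from datetime import datetime, timezone

    source_rows = [("A123", " 98.6", "2002-09-01"), ("A124", "101.2 ", "2002-09-02")]

    def transform(row):
        """Cleaning/scrubbing step: trim strings and coerce types."""
        patient_id, temp, visit_date = row
        return (patient_id.strip(), float(temp), visit_date)

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE vitals (patient_id TEXT, temp_f REAL, visit_date TEXT)")
    warehouse.execute(
        "CREATE TABLE load_metadata (table_name TEXT, source TEXT, loaded_at TEXT, row_count INTEGER)"
    )

    clean = [transform(r) for r in source_rows]
    warehouse.executemany("INSERT INTO vitals VALUES (?, ?, ?)", clean)
    warehouse.execute(               # MD capture: record where the rows came from and when
        "INSERT INTO load_metadata VALUES (?, ?, ?, ?)",
        ("vitals", "clinic_feed_1", datetime.now(timezone.utc).isoformat(), len(clean)),
    )
    warehouse.commit()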
45Kinds of Metadata
- There are several kinds of metadata that people commonly talk about.
- One is structural metadata
- schemas, interface definitions, and other data-structure-like things, which describe how information is put together.
- E.g. a database schema in a database system
- E.g. a Web site map that describes how the Web pages are connected to one another
- Process metadata
- Descriptive / definitional metadata
- As seen in information retrieval.
- It's things like keyword descriptions and other content-oriented descriptions of information.
There are different tools for each major kind, although there is some integration.
46Neat Attempt OMG CWM Metadata Standard
- Metadata is used for building, maintaining, managing, and using DB collections such as data marts and warehouses.
- Most data management, analysis and MD-driven tools have their own infrastructure and use different MD representations
- Metamodels are needed to exchange MD
- The Object Management Group (OMG) developed a standard, the Common Warehouse Metamodel (CWM), to help manage MD
- It provides a framework for representing metadata about data sources, data targets, transformations and analysis, and the processes and operations that create and manage warehouse data and provide lineage information about its use.
- Note, there are new algebras to manipulate MD (P. Bernstein of Microsoft)
47DOD HA Data Warehouse Repository
MHS Metadata Repository as Metadata Hub
48An overall MD Repository process
Had someone already written this Schema?
Outlined in the XML/EDI Group's Repository white
paper in 1999
49Common Warehouse Model (Neat! An attempt to solve
the consistency problems)
Standard definitions on metadata for all these
subjects (UML formalism).
50Example of Areas Covered
51Contact Info
52Keys Index Model
Index: Instances of the Index class represent the ordering of the instances of some other Class; the Index is said to span the Class. Indexes normally have an ordered set of attributes of the Class instances they span that make up the key of the index; this set of relationships is represented by the IndexedFeature class, which indicates how the attributes are used by the Index instance. The Index class is intended primarily as a starting point for tools that require the notion of an index.
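A rough reading of that description in code (a sketch, not the normative CWM class definitions): an Index spans a class and carries an ordered list of IndexedFeatures, each naming one attribute of the spanned class that participates in the key.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class IndexedFeature:
        attribute: str          # attribute of the spanned class that is part of the key
        ascending: bool = True  # how this attribute contributes to the ordering

    @dataclass
    class Index:
        name: str
        spanned_class: str                       # the class whose instances are ordered
        features: List[IndexedFeature] = field(default_factory=list)
        is_unique: bool = False

    patient_name_idx = Index(
        name="ix_patient_name",
        spanned_class="Patient",
        features=[IndexedFeature("last_name"), IndexedFeature("first_name")],
    )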
53About Extensible Markup Language(XML)
- A tag-based data format and simple SGML subset for structured document/web interchange
- XML shares some things in common with the display-format-oriented HTML. Both formats save their information in plain text files.
- XML is focused on document structure rather than on document formatting.
- XML defines a set of tags used for representing text as various pieces of information: an address, a phone number, a price, etc.
- XML creates an environment where text may be communicated as information.
- But language requires syntax, vocabulary, and semantics.
- Tag myopia: only the syntax is formally defined (the other part is HARD)
54XML DTDs/Schemas take a step in the Neat direction
Schemas help by ...
relating common terms between documents by tag labels
<CV> ... private ...
(after Frank van Harmelen and Jim Hendler)
55XML semi-structure
- XML allows for structure. With XML, the embedding of one element in another declares the structure of the data.
- Simply having the Address element as a sub-element of Patient "tells" the receiving application that this address belongs to this person.
- <Patient><Address zip="20854">12 N. Grove Road Potomac MD</Address>    <Address>41 S. Soldier Road Arlington VA</Address></Patient>
- flat files don't easily allow for structure.
- A piece of information like Patient or Address marked by the presence of tags is called an element.
- Elements are further enriched by attaching name-value pairs (for example, zip="20854" in the example above) called attributes.
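A minimal sketch (standard library only) of how a receiving application reads that nesting: because Address is a sub-element of Patient, the parser hands back the addresses already attached to the patient, with zip available as an attribute.

    import xml.etree.ElementTree as ET

    doc = """<Patient><Name>Gary</Name>
      <Address zip="20854">12 N. Grove Road Potomac MD</Address>
      <Address>41 S. Soldier Road Arlington VA</Address>
    </Patient>"""

    patient = ET.fromstring(doc)
    for addr in patient.findall("Address"):
        print(addr.text.strip(), addr.get("zip"))   # element text plus name-value attribute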
56Pure XML -- Schema LanguageDocument Type
Definitions (DTDs)
- <!ELEMENT element-type content-model>
- Defines the content model of an element type
- Element-type is the name of the element (or tag)
- Content-model is a regular expression defining the structure of sub-elements
- Data if a leaf
- <!ATTLIST element-type attribute-name attribute-type>
- Defines, for elements named element-type, associated attributes and their types
- Element-type is the name of the element (or tag)
57XML DTDs exist for things like <Prescription>
- <Prescription>
- <Medication.Name>Amoxicillin</Medication.Name>
- <Form>250 mg. Capsule</Form>
- <Dispense>30</Dispense>
- <Dosage Amount="1">1 cap(s)</Dosage>
- <Instructions>3 times daily until gone</Instructions>
- <Refill Number="0">no refills</Refill>
- <Substitute>can substitute generic equivalent</Substitute>
- </Prescription>
Drug form?
Early tags were not carefully named. This creates a legacy problem. Coding helps make it a processable form, but there are no semantics.
58Can map Relational Data to XML
R
<R> <tuple> <A>a1</A> <B>b1</B> <C>c1</C> </tuple> <tuple> <A>a2</A> <B>b2</B> <C>c2</C> </tuple> </R>
(XML Tree)
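A sketch of that mapping (column names follow the generic R/A/B/C example above): each row becomes a <tuple> element with one child element per column.

    import xml.etree.ElementTree as ET

    columns = ("A", "B", "C")
    rows = [("a1", "b1", "c1"), ("a2", "b2", "c2")]

    root = ET.Element("R")
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, value in zip(columns, row):
            ET.SubElement(tup, col).text = value

    print(ET.tostring(root, encoding="unicode"))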
59Tags are Names; More Work Is Required
- Assuming you have developed a robust markup
language for data exchange, you still need to
perform the following tasks - Metadata Mapping. Like DWs you must understand
the metadata between systems that will
communicate using your new markup language. You
must map the physical storage of the application
to the elements and attributes in the markup
language. - Data Mapping. You must map the data content as
well, e.g. if the source and target systems
expect a different set of valid values for an
element, you must provide the rules for the
translation. You may also have to combine data
mapping and metadata mapping -- for example, when
a set of source data values maps to one place in
the target under one set of conditions, and to a
different place in the target under a different
set of conditions. Data Mapping can be especially
problematic when the source contains many default
values -- but the default value is not valid in
the target. - Null Mapping. If one system allows many values to
be null, but the other system cannot handle
nulls, some allowance must be made for this.
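A hedged sketch of those three mapping tasks (all field names, value sets and defaults are hypothetical): translate valid values, route a value conditionally, and substitute a target-legal default where the source allows nulls.

    VALUE_MAP = {"M": "male", "F": "female"}          # data mapping: source codes -> target codes

    def map_record(src):
        tgt = {}
        tgt["sex"] = VALUE_MAP.get(src.get("gender"), "unknown")
        phone = src.get("phone")
        if phone is None:
            tgt["phone"] = "NOT PROVIDED"              # null mapping: target cannot hold nulls
        elif src.get("phone_type") == "work":
            tgt["work_phone"] = phone                  # conditional routing to a different target field
        else:
            tgt["home_phone"] = phone
        return tgt

    print(map_record({"gender": "F", "phone": None}))  # {'sex': 'female', 'phone': 'NOT PROVIDED'}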
60Making XML Neater Structural Schemas
- Problems with DTDs
- no data types or specialization/extension of types
- no "higher level" modeling (classes, relationships, constraints, etc.)
- Integration schemas
- primitive data types
- integers, dates, and the like, based on our experience with SQL, Java primitives, byte sequences
- cardinality constraints
- Inheritance.
- Making kind-of relations explicit would make both understanding and maintenance easier
- markup languages are now commonly pressed into service as "data modeling languages" and "conceptual modeling languages", although (to some of us) the particular features of (SGML/XML) markup languages render them unsuitable to the task.
61WWW (S or N?)
- Starts with minimal formalism and proceeds to add scruffy complexity
[Chart: growth/effectiveness over time. A neat (N) start with HTML/HTTP in the late 90s, scruffy (S) growth in Web pages, and a qualitative change toward a semantic foundation with XML/RDF/SOAP...]
62XML the Semantic Web Thrust
- The Web builds on a simple but neat start. Now messy.
- "Semantic web", coined by Tim Berners-Lee, entails adding "concept/meaning" information to Web content
- Globalize: link structured collections of information in a general, automatically processable way
- agent systems cooperate to facilitate resource discovery, intelligent browsing, e-commerce, etc.
- Uses a simple ontological approach: structuring relies on the eXtensible Markup Language (XML), the Resource Description Framework (RDF) and RDF Schema.
- The RDF model is like an ERA (entity-relationship-attribute) model, but is open to interpretation: relationships are not rigid definitions and object attributes are not fixed in class definitions.
- Instead they are linked by a Uniform Resource Identifier (URI). Thus anyone can make a relationship to a topic, and anyone can provide such a view of meaning via a URI.
63Semantic Web (SW) would do what?
- Concept-based search
- rather than keyword-based search
- Semantic navigation rather than link-based navigation
- Personalization
- rather than one size fits all
- Query answering
- rather than document retrieval
- Services
- rather than just CGI calls: service-description languages, negotiation, service composition, etc.
After Tim Finin
64Semantic Summarization Tagging
- Metadata would greatly enhance the ability to link Web content semantically --- by meaning, rather than just by keyword or Web-master arrangement
- Automated semantic tagging
- Semantic information has traditionally been added manually --- VERY costly and not practical for the way content is created today
- Example: MeSH "coding" by the NLM
- Unclear whether it can be done consistently among various human indexers
- Scruffy tagging and summarization tools may be beneficial -- even if not perfect
- Return to this in the Topic Map discussion
65Three Layered SW Architecture
Logic Layer: formal semantics and agent reasoning support (DAML+OIL)
RDF Schema Layer (Brickley and Guha, 2000): defines a simple ontological vocabulary (class/sub-class) to help ensure model/MD consistency
Data/Resource Instance Layer: uses a subject, property, object model and statement syntax for metadata (RDF)
66Resource Description Framework (RDF/RDF-Schema)
- Metadata model
- The designer can describe objects, add properties
to define and describe them, and also make
statements about the objects (statements about
relationships between resources). - The specification comes in two sections
- Basic instance model/syntax (viewed as directed,
labeled graphs) - RDF Schemas
67Resource Description Framework (RDF)
- Metadata is useful for information retrieval (esp. if no other schema info or semantics is available)
- Idea: representation-independent encoding of MD as triples (Resource, PropertyType, Value)
- (NIH, Protocol.creator, Cancer Protocol), (Cancer Protocol, Description.Title, breast cancer), ...
[Diagram: www.NIH... --DC:Name--> Cancer Protocol --Description.Title--> Breast Cancer]
Maps into logic
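The same triples written with the third-party rdflib package (an assumption; the triples mirror the slide's example, and the Dublin Core property names are illustrative):

    from rdflib import Graph, Literal, Namespace, URIRef

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    protocol = URIRef("http://www.nih.gov/protocols/cancer-protocol")   # hypothetical resource URI

    g = Graph()
    g.add((protocol, DC.creator, Literal("NIH")))                # (Resource, PropertyType, Value)
    g.add((protocol, DC.title, Literal("Breast Cancer Protocol")))

    print(g.serialize(format="xml"))   # the same metadata rendered as RDF/XML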
68Resource Description Framework Metadata Role
- RDF is essentially an extended layer on top of XML and uses a simple data model, expressed in XML syntax, as the basis for a language for representing properties of Web resources/collections.
- Resources include images, documents and the relationships held between them.
- RDF provides interoperability between applications that exchange information.
- When XML data is in RDF format, applications can understand the data without knowing who sent it.
- XML points to a resource to scope and uniquely identify a set of properties known as the schema.
69Example of an RDF Model
[Diagram: an RDF statement with RDF subject, RDF predicate and RDF object nodes; example labels include the predicate Sanctioned-by and the resource www.HHS/HC-gp.]
70The Challenge of a Semantic Web Semantic web
languages today
- Limited semantics
- Besides RDF there is
- DAML: DARPA Agent Markup Language, http://www.daml.org/ (OIL)
- with another under development by the W3C
- OWL: Web Ontology Language, http://www.w3.org/2001/sw/
- Reasoning limited to inheritance
71Topic Maps (TM) Approach
- One of my favorite examples of a light formalization/reification attempt -- still underway
- TMs are a collection of topics (semantically meaningful) and their relationships, with a standardized notation for interchangeably representing information about the structure of the information resources used to define topics
- Topic Maps link these topics with external referents (objects), such as resources behind URLs
- XTM: an XML-based interchange format for topic maps
- looks like a semantic net
Essentially we have a weak semantic model of indexed document topics
72Recapitulation Why Coordination
Interoperability is Hard
- As before, groups find it easier to do each part themselves
- at least the first time; now they are legacies
- Lack of integrating (Enterprise Architecture) approaches now means gaps between RDF etc.
- No vocabulary scheme is yet comprehensive
- Semantics of drugs, lab tests, signs and symptoms
- Lack of common metadata models
- some standards and products are competing for this
- Structure is not coordinated with terms/vocabulary
- Duplication of MD and now RDF language
- Proprietary interests
73Many Efforts Need Ontologies for Assimilation
- The essence of the interoperability problem is semantics (e.g. as in a shared conceptual model of a particular application domain --- idea: knowledge base), not syntax
- Ontologies provide a vocabulary for representing knowledge about a domain and for describing specific situations in a domain (a tool for defining and describing domain-specific vocabularies) --- idea: language for communication
- For data/knowledge translation and transformation (they provide a solution to the translation problem between different terminologies) and for fusion and refinement of existing knowledge --- idea: interoperation
- As reusable building blocks to build systems that solve particular problems in the application domain --- idea: model reuse
74Some Appropriate Middle Modeling Steps
- Recognize the interplay between S and N and their evolution
- Push to capture essential semantic relationships, e.g. express, constrain, and validate the relations
- Push to free effort from artificial syntax requirements locked into transfer/interchange/import/export (make it concept-based)
- Support principles of semantic transparency as a/the preeminent concern for scruffy needs
- Make models accessible to, and tuned for use by, the principal domain experts and 'end users' as stakeholders
- Work needs to become sufficiently formal to support testing for conceptual integrity
75Mapping Account Glossary Content
Conceptual Graph Model (mappable to RDF (Corby et al., 2000))
Account: A customer, usually an institution or another organization, that purchases a company's products or services.
Account: Customer - (prototype) -> institution, organization
  State: Customer -> (Agent) -> purchase -> product/service (poss) -> institution
Uses a class hierarchy such as Entity > Legal Entity > Organization > Customer Organization ...
Human action: Purchase
76On the Other hand.
- Formal Knowledge Engineering hasn't been a clear victory
- KE may need to be a permanent task
- Continuous Knowledge Engineering is an alternative approach to KE that embraces the philosophy that knowledge systems are open-ended, dynamic artifacts that develop through a learning process in reaction to their environment.
- Implicit knowledge must first be made explicit, which is a time-consuming task requiring a serious knowledge engineering effort.
77Opportunities
- The vast amounts of information with little or no structure published over the World-Wide-Web raise a host of new, challenging problems for data-mining research; examples include web resource discovery and topic distillation, web structure/linkage mining, intelligent web searching and crawling, and personalization of web content.
- Knowledge Discovery in Biological Data Management Systems and Bioinformatics: high-performance data mining tools will play a crucial role in the analysis of the ever-growing databases of genetic sequences accumulated over the course of large bioinformatics efforts (e.g., the Human Genome Project).
78Discussion
- Scruffy vs. Neat Reasoning
- Knowledge Soup: The Chaos and Complexity of the Human Mind. John F. Sowa, http://residentassociates.org/com/Soup2.htm
- Neat vs Scruffy: A Review of Computational Models for Spatial Expressions. Amitabha Mukerjee, Center for Robotics, http://www.cs.albany.edu/amit/review.html
- J. McCarthy, From Here to Human-Level Intelligence, Proceedings of the Fifth International Conference on Principles of Knowledge Representation and Reasoning (KR'96), Cambridge, MA, November 1996, Morgan Kaufmann, San Mateo, CA (1996), pp. 640-646.
- Info Access References
- C. Paice, "Constructing literature abstracts by computer: Techniques and prospects," Information Processing and Management, vol. 26, pp. 171-186, 1990.
- SUMMARIST: Automated Text Summarization, http://www.isi.edu/cyl/summarist/summarist.html
- Church K. and M. Liberman (1991) "A Status Report on the ACL/DCI". In Proc. of the 7th Annual Conference of the UW Centre for the New OED and Text Research: Using Corpora, pp. 84-91.
79Data Models Metadata references
- Phil Bernstein, "Representing and Reasoning About Mappings between Domain Models," 18th National Conference on Artificial Intelligence (AAAI 2002), Edmonton, Canada. http://www.cs.washington.edu/homes/jayant/Pubs/SemanticsAAAI02.pdf
- MetaIntegration tools, http://www.metaintegration.net/
- Implementation of MeSH information in the Dublin Core Metadata. Malet G, Munoz F, Appleyard R, Hersh W. J Am Med Inform Assoc 1999 Mar-Apr;6(2):163-72
- Ontologies and Knowledge Models
- Towards Continuous Knowledge Engineering, by Klaas Schilstra, thesis, Delft University of Technology, http://www.kbs.twi.tudelft.nl/Publications/PhD/2002-Schilstra-PhD.html
- Towards Situated Knowledge Acquisition, Tim Menzies, http://www.phil.canterbury.ac.nz/tom_bestor/etexts/Menzies2020Towards20Situated20Knowledge20Acquisition.htm
- Formal Ontology, Conceptual Modelling, and Knowledge Engineering, http://www.ladseb.pd.cnr.it/infor/ontology/Papers/OntologyPapers.html
- DAML Homepage, http://www.daml.org/
80XML Registries
- W3C: http://www.w3.org/
- US Federal CIO Council: http://xml.coverpages.org/CIO-Council-XML-DevelopersGuidenceVersion1.pdf
- ASC X12 Reference Model: http://www.x12.org/x12org/comments/X12Reference_Model_For_XML_Design.pdf
- Metadata registries (USHIK etc.): http://www.bls.gov/ore/pdf/st000010.pdf
81RDF Metadata Registries
- The Open Metadata Registry Prototypes - Dublin Core Metadata Initiative: http://wip.dublincore.org:8080/registry/Registry
- SCHEMAS Project Registry: http://www.schemas-forum.org/registry/
- DESIRE Registry: http://www.ukoln.ac.uk
- SWAG WebNS Registry: http://webns.net/
- Xmlns.com Registry: http://xmlns.com/
- ULIS Open Metadata Registry: http://avalon.ulis.ac.jp/registry/
82Semantic Web Topic maps
- W3C Semantic Web Site: http://www.w3.org/2001/sw/
- SemanticWeb.org: http://www.semanticweb.org/
- The Emerging Semantic Web: Selected Papers from the First Semantic Web Working Symposium. Edited by Isabel Cruz, Stefan Decker, Jérôme Euzenat, and Deborah McGuinness. Volume 75 in the Frontiers in Artificial Intelligence and Applications series, IOS Press, Amsterdam (NL), 2002. 300pp., hardcover, ISBN 1-58603-255-0 (IOS Press)
- Markup Languages: Comparison and Examples, http://trellis.semanticweb.org/expect/web/semanticweb/comparison.html
- Work in progress on Topic Maps: http://www.topicmaps.net/