Title: The Utility of XML
1. The Utility of XML
Martin Doerr
Center for Cultural Informatics
Institute of Computer Science
Foundation for Research and Technology - Hellas
Heraklion, May 25, 2001
2. XML is
- XML is a compromise between databases and free texts.
- It takes the best from both sides without being perfect on either side.
- It is readable. It allows meaning to be disambiguated.
- It is simple.
- It is rich enough to open a new systems paradigm.
3. What is a Document?
- A composite statement: a unit relating known facts, items and categories with new knowledge, expressed linguistically or through other media.
- It has an inner logic: the pure rendered knowledge, independent of language and form.
- It has a meaningful structure: the sequence, arrangement or linking used to render the inner logic.
- It has a presentation: structure and style to assist perception and impression.
4. A document
5. The statements
- Diego Velasquez is Spanish.
- Diego Velasquez lived 1599-1660.
- Diego Velasquez painted Juan de Pareja.
- Juan de Pareja is a painting.
- Juan de Pareja has dimensions 81.3 x 69.9 cm.
- Juan de Pareja is Moorish.
- Juan de Pareja is a painter.
- Philip IV sent Velasquez to Italy.
- ...
6. Another document
7. What's Wrong with HTML
- If written properly, normal HTML may reflect document presentation, but it cannot adequately represent the semantic structure of data.
    <B>MONET, Claude</B><BR> Haystacks at Chailly at Sunrise<BR> 1865<BR>
    Oil on canvas<BR> 30 x 60 cm (11 7/8 x 23 3/4 in.)<BR>
    San Diego Museum of Art<BR> <P> <IMG SRC="http://...">
- Labels on the slide: Artist Name (MONET, Claude), Artifact Title (Haystacks at Chailly at Sunrise), Image Reference (the IMG tag). None of these labels is present in the HTML itself.
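The point of this slide can be checked mechanically. The following is a minimal sketch in Python, assuming the fragment above is transcribed correctly (the closing bold tag is assumed, and the image tag is left out because its URL is truncated in the source). The class name `TextCollector` and the helper `extract_runs` are illustrative names, not part of the presentation.

```python
from html.parser import HTMLParser

# Fragment transcribed from the slide; the IMG tag is omitted (truncated URL).
FRAGMENT = (
    "<B>MONET, Claude</B><BR> Haystacks at Chailly at Sunrise<BR> 1865<BR> "
    "Oil on canvas<BR> 30 x 60 cm (11 7/8 x 23 3/4 in.)<BR> "
    "San Diego Museum of Art<BR>"
)


class TextCollector(HTMLParser):
    """Collects each text run with the tag that encloses it, if any."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag != "br":  # <BR> is a void element and never encloses text
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            enclosing = self.open_tags[-1] if self.open_tags else None
            self.runs.append((enclosing, text))


def extract_runs(html):
    """Return (enclosing_tag, text) pairs in document order."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.runs


runs = extract_runs(FRAGMENT)
for enclosing, text in runs:
    print(enclosing, "|", text)
```

The only label a program recovers is "b" (bold) for the artist; title, date, material, dimensions and location are anonymous strings that can be told apart only by their position.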
8. User Problems / Design Reasons
- Preserving info units: who said that / self-contained
- Entering data:
  - what can I say,
  - what should I say,
  - how can I say it.
- Rendering data: how to tell my child, the public
- Accessing data: querying, mediation
- Reusing data: transmission to other environments, merging, evolution of the local system, preservation for future use.
9. In Technical Terms
- Transformation under preservation of meaning
- Correct adaptation of presentation without knowing meaning
- Packaging information for presentation: 1 document
- Sequencing categories for data input
- Interpretation of intended meaning: searching
- Automatic relating of common meaning: merging of different statements
10. What's wrong with
- Free texts: clear packaging, rendering for one target, not machine processable (poor querying, categories not comprehensible), poorly reusable, no help to enter data, transform data...
- HTML: solves platform-independence of presentation; weak connection between meaning and presentation structure; not far better than free text.
- Databases: clear logical structure, categorization, machine processable, excellent querying; difficult presentation, transformation, merging, evolution; no information units.
- XML: clear packaging, logical structure, machine processable if correctly used, clear separation and relation of meaningful structure and presentation.
- Helpful to enter data, easy to extend, transform, present. Can be queried; structure not independent from user view.
11. XML and databases
- Databases
  - Schema first: prior to data; complete, inflexible analysis of all categories and their relations.
  - Table structures: indexes prepared, excellent consistency enforcement.
- XML
  - Data first: structure is explanatory, can come second, need not be formalized, extensible; DTDs can be combined.
  - Semi-structured: flexible, but reduced guarantee that a question can be answered, reduced consistency enforcement.
  - Embedded schema: each instance carries the schema it uses.
  - Querying by parsing, without index structures; ideal transport format.
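The "semi-structured" trade-off can be illustrated with a small Python sketch. It assumes two hypothetical records wrapped in a made-up `COLLECTION` element; the `PROVENANCE` element is likewise invented to show an extension added without any schema change. The query scans by parsing, with no index, and has no guarantee that every record can answer it.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: COLLECTION and PROVENANCE are not from the slides.
RECORDS = """
<COLLECTION>
  <ARTIFACT>
    <TITLE>Haystacks at Chailly at Sunrise</TITLE>
    <DATE>1865</DATE>
    <MATERIAL>Oil on canvas</MATERIAL>
  </ARTIFACT>
  <ARTIFACT>
    <TITLE>Juan de Pareja</TITLE>
    <PROVENANCE>extra element added without changing any schema</PROVENANCE>
  </ARTIFACT>
</COLLECTION>
"""


def dates_by_title(xml_text):
    """Scan every ARTIFACT by parsing; DATE may simply be absent."""
    root = ET.fromstring(xml_text)
    return {
        artifact.findtext("TITLE"): artifact.findtext("DATE")
        for artifact in root.iter("ARTIFACT")
    }


result = dates_by_title(RECORDS)
print(result)
```

The second record is accepted although it lacks a DATE, so the question "when was it made?" gets `None` for it. A relational table with a NOT NULL column would have rejected the record at entry time instead.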
12. Data First, Embedded Schema
- This document carries the interpretation with it. It is readable without knowledge of the schema.
    <ARTIFACT>
      <ARTIST> <NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME> </ARTIST>
      <TITLE>Haystacks at Chailly at Sunrise</TITLE> <DATE>1865</DATE>
      <MATERIAL>Oil on canvas</MATERIAL>
      <DIM Metric="cm">
        <HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>
      <DIM Metric="in">
        <HEIGHT>11 7/8</HEIGHT><WIDTH>23 3/4</WIDTH></DIM>
      <LOCATION>San Diego Museum of Art</LOCATION>
      <IMAGE File="http://.../hayricks.jpg"/>
    </ARTIFACT>
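To show what "readable without knowledge of the schema" means for a program, here is a minimal Python sketch. It assumes a reconstruction of the record on this slide: closing tags and the centimetre values (taken from the HTML version of the same record) are filled in as assumptions, and the IMAGE element is left out because its URL is truncated. The helper name `labelled_values` is illustrative.

```python
import xml.etree.ElementTree as ET

# Reconstructed record; IMAGE omitted because its URL is truncated in the source.
ARTIFACT_XML = """
<ARTIFACT>
  <ARTIST><NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME></ARTIST>
  <TITLE>Haystacks at Chailly at Sunrise</TITLE>
  <DATE>1865</DATE>
  <MATERIAL>Oil on canvas</MATERIAL>
  <DIM Metric="cm"><HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>
  <LOCATION>San Diego Museum of Art</LOCATION>
</ARTIFACT>
"""


def labelled_values(element, path=""):
    """Yield (path, text) for every leaf, using only the embedded tag names."""
    here = f"{path}/{element.tag}"
    children = list(element)
    if not children:
        yield here, (element.text or "").strip()
    for child in children:
        yield from labelled_values(child, here)


root = ET.fromstring(ARTIFACT_XML)
facts = dict(labelled_values(root))
for path, value in facts.items():
    print(path, "=", value)

# A targeted question, answered from element and attribute names alone.
width_cm = root.findtext("DIM[@Metric='cm']/WIDTH")
print("width in cm:", width_cm)
```

No DTD or external schema is consulted: every value arrives with a path such as `/ARTIFACT/ARTIST/NAME/LAST` that says what it is.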
13. What's important
- Data first: delayed analysis, preserves data.
- Embedded schema: facilitates data transport, readable in the future.
- Separation of semantics and presentation enables information reuse.
- Guides and controls data entry.
- Same meaning can be encoded in multiple formats.
- DTD design depends on purpose: transport, presentation, data entry.
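Separation of semantics and presentation can be sketched as two renderings of one record. This is a minimal Python illustration, assuming a shortened version of the Monet record; the function names `to_html` and `to_caption` and both output layouts are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Shortened record, assumed for illustration.
RECORD = ET.fromstring(
    "<ARTIFACT>"
    "<ARTIST><NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME></ARTIST>"
    "<TITLE>Haystacks at Chailly at Sunrise</TITLE>"
    "<DATE>1865</DATE>"
    "</ARTIFACT>"
)


def _fields(artifact):
    """Pull the semantic fields out once; presentation is decided elsewhere."""
    return (
        artifact.findtext("ARTIST/NAME/FIRST"),
        artifact.findtext("ARTIST/NAME/LAST"),
        artifact.findtext("TITLE"),
        artifact.findtext("DATE"),
    )


def to_html(artifact):
    """Catalogue-style HTML: artist in bold, then title and date."""
    first, last, title, date = _fields(artifact)
    return (
        f"<B>{escape(last.upper())}, {escape(first)}</B><BR> "
        f"{escape(title)}<BR> {escape(date)}<BR>"
    )


def to_caption(artifact):
    """One-line plain-text caption built from the same elements."""
    first, last, title, date = _fields(artifact)
    return f"{first} {last}: {title} ({date})"


html_view = to_html(RECORD)
caption_view = to_caption(RECORD)
print(html_view)
print(caption_view)
```

Both outputs come from the same stored meaning. Going the other way, from the HTML back to labelled fields, would require guessing.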
14. Useful Applications
- Prescription for documentation / input
- Data transfer between systems (middleware)
- Document bases with full query access
- Combine database with XML documents: mission-critical data in tables and DTD, rich extensible structures in DTD only.
- Create data for long-term use: even machine readable from paper!
- Create information sets for multiple presentations
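One possible reading of "combine database with XML documents" is sketched below in Python with an in-memory SQLite database. The table layout, the column names and the key `INV-001` are all hypothetical. The title is held both in a table column and in the XML, while the material lives only in the XML.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifact ("
    " inventory_no TEXT PRIMARY KEY,"  # hypothetical mission-critical key
    " title TEXT NOT NULL,"            # duplicated from the XML for fast querying
    " record_xml TEXT NOT NULL)"       # full, extensible XML document
)

record = (
    "<ARTIFACT><TITLE>Haystacks at Chailly at Sunrise</TITLE>"
    "<MATERIAL>Oil on canvas</MATERIAL><DATE>1865</DATE></ARTIFACT>"
)
title = ET.fromstring(record).findtext("TITLE")
conn.execute("INSERT INTO artifact VALUES (?, ?, ?)", ("INV-001", title, record))


def material_of(connection, inventory_no):
    """Look up the row by its indexed key, then parse the XML for the detail."""
    row = connection.execute(
        "SELECT record_xml FROM artifact WHERE inventory_no = ?",
        (inventory_no,),
    ).fetchone()
    if row is None:
        return None
    return ET.fromstring(row[0]).findtext("MATERIAL")


material = material_of(conn, "INV-001")
print(material)
```

The table enforces the key and the mandatory title; new elements can be added to the XML column later without altering the table.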
15. Final Remark
- How to encode meaning without structure ambiguities? -> use RDF/RDFS
- How to standardize the meaning of element types (tags)? -> use ontologies, e.g. formulated in RDFS!
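The idea behind the RDF recommendation can be sketched with subject-predicate-object triples in plain Python. This is not RDF syntax and uses no RDF library. The `ex:` identifiers are hypothetical placeholders, and the facts are taken from the Velasquez statements earlier in the talk, where "Juan de Pareja" names both a painting and a painter.

```python
# Hypothetical identifiers; "ex:" is a placeholder prefix, not a real namespace.
TRIPLES = [
    ("ex:velasquez", "ex:name", "Diego Velasquez"),
    ("ex:velasquez", "ex:painted", "ex:pareja_portrait"),
    ("ex:pareja_portrait", "rdf:type", "ex:Painting"),
    ("ex:pareja_portrait", "ex:title", "Juan de Pareja"),
    ("ex:pareja_person", "rdf:type", "ex:Painter"),
    ("ex:pareja_person", "ex:name", "Juan de Pareja"),
]


def subjects_labelled(triples, label):
    """All subjects whose name or title equals the given label."""
    return sorted(
        s for s, p, o in triples if p in ("ex:name", "ex:title") and o == label
    )


def types_of(triples, subject):
    """The declared types of one subject."""
    return [o for s, p, o in triples if s == subject and p == "rdf:type"]


matches = subjects_labelled(TRIPLES, "Juan de Pareja")
for subject in matches:
    print(subject, types_of(TRIPLES, subject))
```

The shared label resolves to two distinct identifiers, one typed as a painting and one as a painter, so "is a painting" and "is a painter" no longer contradict each other.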