Title: Introduction to XML
1. Introduction to XML
- COAP 2180, Spring 2007
- Webster University Geneva
- Daniel K. Schneider
- Senior lecturer (MET) at TECFA, University of Geneva
2. Objectives
- History and design rationale for XML
- Markup languages
- Basics of the XML formalism
- XML on the Web
- Sample XML languages / applications
3. History
- 1985: SGML (Standard Generalized Markup Language)
- 1990: HTML (HyperText Markup Language)
- 1995/98: XML (eXtensible Markup Language)
4. The XML standard (1998, 2000)
- T. Bray, J. Paoli, and C. M. Sperberg-McQueen (Eds.), Extensible Markup Language (XML) 1.0, W3C Recommendation 10 February 1998, http://www.w3.org/TR/1998/REC-xml-19980210/
- T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler (Eds.), Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation 6 October 2000, http://www.w3.org/TR/2000/REC-xml-20001006/
5. Why XML? (1)
- Electronic data interchange is critical in today's networked world and needs to be standardized
- Examples
  - Banking: funds transfer
  - Education: e-learning contents
  - Scientific data
    - Chemistry: ChemML
    - Genetics: BSML (Bio-Sequence Markup Language)
- Each application area has its own set of standards for representing information
- XML has become the basis for all new-generation data interchange formats (markups)
6. Why XML? (2)
- Earlier electronic formats were based on plain text with line headers indicating the meaning of fields
  - Does not allow for nested structures, no standard type language
  - Tied too closely to low-level document structure (lines, spaces, etc.)
- Each XML-based standard defines what are valid elements, using XML type specification languages (i.e. grammars) to specify the syntax
  - E.g. DTD (Document Type Definition) or XML Schema
  - Plus textual descriptions of the semantics
- XML allows new tags to be defined as required
- A wide variety of tools is available for parsing, browsing and querying XML documents/data
7. Why XML? (3)
- SGML is more difficult
  - XML implements a subset of its features
- HTML will not do it
  - HTML is very limited in scope; it's a language (vocabulary) for delivering web pages
- XML is extensible, unlike HTML
  - Users can add new tags, and separately specify how the tag should be handled for display
- XML is a formalism for defining vocabularies (i.e. a meta-language), whereas HTML is just an SGML vocabulary
8. Design rationale for XML (1)
- XML must be easily usable over the Internet
- XML must support a wide variety of applications
- XML must be compatible with SGML
- It must be easy to write programs that process XML documents
- The number of optional features in XML must be kept small
9. Design rationale for XML (2)
- XML documents should be clear and easily understood
- The XML design should be prepared quickly
- The design of XML must be exact and concise
- XML documents must be easy to create
- Keeping an XML document size small is of minimal importance
10. XML is a formalism to create markup languages
- Markup
  - text added to the data content of a document in order to convey information about the data
- A marked-up document contains
  - data and
  - information about that data (markup)
- Markup language
  - formalized system for providing markup
- The definition of a markup language specifies
  - what markup is allowed
  - how markup is distinguished from data
  - what markup means
11. Two ways to look at the XML universe
- (1) XML as a formalism to define vocabularies (also called applications)
  - Example DTD
      <!ELEMENT page (title, content, comment?)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT content (#PCDATA)>
      <!ELEMENT comment (#PCDATA)>
  - Example of an XML document
      <page>
        <title>Hello XML friend</title>
        <content>
          Here is some content :)
        </content>
        <comment>
          Written by DKS/Tecfa,
        </comment>
      </page>
- (2) XML as a set of languages for defining
  - Contents
  - Graphics
  - Style
  - Transformations and queries
  - Data exchange protocols
  - ...
12. Kinds of XML-based languages (1)
- XML-related languages can be categorized into the following classes
- XML accessories, e.g. XML Schema
  - Extend the capabilities specified in XML
  - Intended for wide, general use
- XML transducers, e.g. XSLT
  - Convert XML input data into output
  - Associated with a processing model
- XML applications, e.g. XHTML
  - Define constraints for a class of XML data
  - Intended for a specific application area
13. Kinds of XML-based languages (2)
- Less formally speaking, ways to use XML
  - Behind the scenes, as a standard and easily transformed format for information
  - As a transfer syntax, to exchange information in a machine-parsable form
  - As a method of delivery direct to the user, usually in combination with a stylesheet
14. The W3C XML framework for documents
The W3C defines many XML-based languages (details later)
15. XML information structures (1)
Example 1: A possible book structure
- Book
  - FrontMatter
    - BookTitle
    - Author(s)
    - PubInfo
  - Chapter(s)
    - ChapterTitle
    - Paragraph(s)
  - BackMatter
    - References
    - Index
16. XML information structures (2)
- Premise: a text is the sum of its component parts
- A <Book> could be defined as containing <FrontMatter>, <Chapter>s, <BackMatter>
- <FrontMatter> could contain <BookTitle>, <Author>s, <PubInfo>
- A <Chapter> could contain <ChapterTitle>, <Paragraph>s
- A <Paragraph> could contain <Sentence>s or <Table>s or <Figure>s
- Components chosen for a book markup language should reflect anticipated use
17. XML information structures (3)
A corresponding XML fragment (based on a corresponding XML application). Callouts on the slide label a start tag as "begin element" and an end tag as "end element".
    <Book>
      <FrontMatter>
        <BookTitle>XML Is Easy</BookTitle>
        <Author>Tim Cole</Author>
        <Author>Tom Habing</Author>
        <PubInfo>CDP Press, 2002</PubInfo>
      </FrontMatter>
      <Chapter>
        <ChapterTitle>First Was SGML</ChapterTitle>
        <Paragraph>Once upon a time </Paragraph>
      </Chapter>
    </Book>
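
To make the "sum of its component parts" premise concrete, here is a minimal Python sketch that parses the book fragment and prints its element tree as an indented outline. It assumes Python's standard `xml.etree.ElementTree` module; the helper name `outline` is invented for this illustration and is not part of the course material.

```python
import xml.etree.ElementTree as ET

# The book fragment from the slide, as a string.
BOOK_XML = """<Book>
  <FrontMatter>
    <BookTitle>XML Is Easy</BookTitle>
    <Author>Tim Cole</Author>
    <Author>Tom Habing</Author>
    <PubInfo>CDP Press, 2002</PubInfo>
  </FrontMatter>
  <Chapter>
    <ChapterTitle>First Was SGML</ChapterTitle>
    <Paragraph>Once upon a time</Paragraph>
  </Chapter>
</Book>"""


def outline(element, depth=0):
    """Return one indented line per element, mirroring the nesting of the markup."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


book = ET.fromstring(BOOK_XML)
book_outline = outline(book)
print("\n".join(book_outline))

# Repeated components (here: several authors) are simply repeated elements.
authors = [author.text for author in book.iter("Author")]
print(authors)
```

The printed outline has the same shape as the book structure list in Example 1, restricted to the components that actually occur in this fragment.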
18. XML information structures (4)
- Example 2: Movies
- Elements can have attributes (genre and star are attributes of movie)
    <movies>
      <movie genre="action" star="Halle Berry">
        <name>Catwoman</name>
        <date>(2004)</date>
        <length>104 minutes</length>
      </movie>
      <movie genre="horror" star="Halle Berry">
        <name>Gothika</name>
        <date>(2003)</date>
        <length>98 minutes</length>
      </movie>
      <movie genre="drama" star="Halle Berry">
        <name>Monster&apos;s Ball</name>
        <date>(2001)</date>
        <length>111 minutes</length>
      </movie>
    </movies>
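
As a rough illustration of how a program sees the difference between attributes and child elements, the following Python sketch reads the movie list with the standard `xml.etree.ElementTree` module. The variable names are chosen for this example only, and the closing tags of the list are assumed.

```python
import xml.etree.ElementTree as ET

MOVIES_XML = """<movies>
  <movie genre="action" star="Halle Berry">
    <name>Catwoman</name>
    <date>(2004)</date>
    <length>104 minutes</length>
  </movie>
  <movie genre="horror" star="Halle Berry">
    <name>Gothika</name>
    <date>(2003)</date>
    <length>98 minutes</length>
  </movie>
  <movie genre="drama" star="Halle Berry">
    <name>Monster&apos;s Ball</name>
    <date>(2001)</date>
    <length>111 minutes</length>
  </movie>
</movies>"""

movies = ET.fromstring(MOVIES_XML)

# Attributes are read with get(); child element content with findtext().
genre_by_name = {
    movie.findtext("name"): movie.get("genre")
    for movie in movies.findall("movie")
}
print(genre_by_name)

# Attribute values can also be used to select elements.
horror_titles = [
    movie.findtext("name")
    for movie in movies.findall("movie[@genre='horror']")
]
print(horror_titles)
```

Note that the parser resolves the predefined entity `&apos;` to an apostrophe, so the third title is read as "Monster's Ball".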
19. What is an XML document?
- An XML document is content marked up with XML (it can be a file, a string, a message content or any other sort of data storage)
- There are 2 levels of conforming documents
  - Well-formed: respects the XML syntax
  - Valid: in addition, respects one (or more) associated grammars (schemas)
20. What is a well-formed XML document? (1)
- Well-formed documents follow basic syntax rules, e.g.
  - the XML declaration, if present, is on the first line (it is recommended, but optional in XML 1.0)
  - there is a single document root
  - all tags use proper delimiters
  - all elements have start and end tags
    - But can be minimized if empty: <br/> instead of <br></br>
  - all elements are properly nested
      <author> <firstname>Mark</firstname>
        <lastname>Twain</lastname> </author>
  - appropriate use of special characters
  - all attribute values are quoted
      <subject scheme="LCSH">Music</subject>
21. What is a well-formed XML document? (2)
- Good example
    <addressBook>
      <person>
        <name> <family>Wallace</family> <given>Bob</given> </name>
        <email>bwallace@megacorp.com</email>
        <address>Rue de Lausanne, Genève</address>
      </person>
    </addressBook>
- Bad example
    <addressBook>
      <address>Rue de Lausanne, Genève <person></address>
      <name>
        <family>Schneider</family> <firstName>Nina</firstName>
      </name>
      <email>nina@nina.name</email>
      </person>
      <name><family> Muller </family> <name>
    </addressBook>
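
A well-formedness check is exactly what any XML parser performs when it reads a document. The Python sketch below, assuming the standard `xml.etree.ElementTree` module, feeds the good example and a shortened version of the bad example to the parser; `is_well_formed` is a helper name made up for this illustration.

```python
import xml.etree.ElementTree as ET

GOOD = """<addressBook>
  <person>
    <name><family>Wallace</family><given>Bob</given></name>
    <email>bwallace@megacorp.com</email>
    <address>Rue de Lausanne, Genève</address>
  </person>
</addressBook>"""

# Shortened bad example: <person> is opened inside <address> but
# <address> is closed first, so the elements are not properly nested.
BAD = """<addressBook>
  <address>Rue de Lausanne, Genève <person></address>
  <name><family>Schneider</family></name>
  </person>
</addressBook>"""


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        print("not well-formed:", error)
        return False
    return True


print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

This only checks syntax. Checking validity against a DTD or schema needs a validating parser, which Python's standard library does not provide.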
22. What is a valid XML document? (3)
- A parser (i.e. the program that reads the XML) can check the markup of an individual document against rules expressed in a schema (DTD, XML Schema, etc.)
- Typically, a schema (grammar)
  - Defines available elements
  - Defines attributes of elements
  - Defines how elements can be embedded
  - Defines mandatory and optional information
- Authoring tools can usually enforce the rules of a DTD/Schema while the document is edited
23. Document Type Definitions (DTDs 1)
- XML document types can be specified using a DTD
- A DTD constrains the structure of XML data
  - What elements can occur
  - What attributes an element can/must have
  - What subelements can/must occur inside each element, and how many times
- A DTD does not constrain data types
  - All values are represented as strings in XML
- DTD definition syntax
    <!ELEMENT element (subelements-specification) >
    <!ATTLIST element (attributes) >
  - more details later
- Valid XML documents refer to a DTD (or other schema)
24. Document Type Definitions (DTDs 2)
External public DTD declaration (the application should know the DTD)
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE test PUBLIC "-//Webster//DTD test V1.0//EN">
    <test> "test" is a document element </test>
test is the name of the root element. Note that XML 1.0 also requires a system identifier (a file name or URL) after the public identifier.
External DTD declaration referring to a file or a URL (here the DTD is defined in file test.dtd)
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE test SYSTEM "test.dtd">
    <test> "test" is a document element </test>
Internal DTD declaration (the DTD is defined inside the XML document)
    <!DOCTYPE test [ <!ELEMENT test EMPTY> ]>
    <test/>
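
The following Python sketch shows how a program can read the document type declaration without loading or validating against the DTD. It assumes the standard `xml.dom.minidom` module. The public declaration is written with an added system identifier ("test.dtd", an assumption made here so that the declaration is accepted by an XML parser), and `doctype_info` is a hypothetical helper name.

```python
from xml.dom import minidom

# External DTD referenced through a system identifier (file or URL).
SYSTEM_DOC = """<?xml version="1.0"?>
<!DOCTYPE test SYSTEM "test.dtd">
<test>"test" is a document element</test>"""

# External public DTD; the trailing "test.dtd" is an assumed system identifier.
PUBLIC_DOC = """<?xml version="1.0"?>
<!DOCTYPE test PUBLIC "-//Webster//DTD test V1.0//EN" "test.dtd">
<test>"test" is a document element</test>"""


def doctype_info(text):
    """Return (root name, public id, system id) from the DOCTYPE declaration."""
    doctype = minidom.parseString(text).doctype
    return doctype.name, doctype.publicId, doctype.systemId


print(doctype_info(SYSTEM_DOC))
print(doctype_info(PUBLIC_DOC))
```

minidom is a non-validating parser: it records the identifiers but never opens test.dtd, so the sketch runs even though that file does not exist.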
25. XML Schemas
- XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. It supports
  - Typing of values
    - E.g. integer, string, etc.
    - Also, constraints on min/max values
  - User-defined, complex types
  - Many more features, including uniqueness and foreign key constraints, inheritance
- XML Schema is itself specified in XML syntax, unlike DTDs
  - More standard representation, but verbose
- XML Schema is integrated with namespaces
- BUT: XML Schema is significantly more complicated than DTDs
26. XML Namespaces (1)
- Various XML languages can be mixed
- However, there can be naming conflicts: different vocabularies (DTDs) can use the same names for elements. How to avoid confusion?
- Namespaces
  - Qualify element and attribute names with a label (prefix)
      unique_prefix:element_name
  - An XML namespace is a collection of names (elements and attributes of a markup vocabulary)
  - identified by xmlns:prefix="URL reference"
      xmlns:xlink="http://www.w3.org/1999/xlink"
27. XML Namespaces (2)
- Example: use of XLinks requires a namespace definition (the STORY element may contain xlink names)
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <STORY xmlns:xlink="http://www.w3.org/1999/xlink">
      <Title>The Webmaster</Title>
      <INFOS>
        <Date>30 octobre 2003</Date>
        <Author>DKS</Author>
        <A xlink:href="http://jigsaw.w3.org/css-validator/check/referer"
           xlink:type="simple">CSS Validator</A>
      </INFOS>
    </STORY>
- Title has no prefix, so it belongs to the default namespace (none is declared here)
- href belongs to the xlink namespace
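
A small Python sketch, assuming the standard `xml.etree.ElementTree` module, shows what a namespace-aware parser does with the prefixes: the prefix itself disappears and each qualified name is expanded to the namespace URL plus the local name.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

STORY_XML = """<STORY xmlns:xlink="http://www.w3.org/1999/xlink">
  <Title>The Webmaster</Title>
  <INFOS>
    <Date>30 octobre 2003</Date>
    <Author>DKS</Author>
    <A xlink:href="http://jigsaw.w3.org/css-validator/check/referer"
       xlink:type="simple">CSS Validator</A>
  </INFOS>
</STORY>"""

story = ET.fromstring(STORY_XML)
link = story.find("INFOS/A")

# ElementTree spells a qualified name as "{namespace-URL}local-name".
href = link.get("{%s}href" % XLINK)
link_type = link.get("{%s}type" % XLINK)
print(href, link_type)

# Unprefixed names stay as they are, since no default namespace is declared.
print(story.find("Title").tag)
print(sorted(link.attrib))
```

Because the names are compared by URL rather than by prefix, the document could use any other prefix bound to the same URL and the lookup would still work.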
28. Processing instructions
- XML is read by machines
- Processing instructions (PI) can tell a program how to deal with the contents of a given XML document
- E.g. to tell a web browser to use a stylesheet with an XML content, the following PI is used
    <?xml-stylesheet type="style" href="sheet" ?>
- style is the type of style sheet to access and sheet is the name and location of the style sheet
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <?xml-stylesheet href="stepbystep.css" type="text/css"?>
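
To see how a program receives a processing instruction, the sketch below parses a tiny document with Python's standard `xml.dom.minidom` module and lists the PIs found before the root element. The root element `page` and the variable names are assumptions made for this example.

```python
from xml.dom import minidom, Node

DOC = """<?xml version="1.0"?>
<?xml-stylesheet href="stepbystep.css" type="text/css"?>
<page/>"""

document = minidom.parseString(DOC)

# A PI is a node of its own: a target name followed by free-form data.
instructions = [
    node for node in document.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
for pi in instructions:
    print(pi.target, "->", pi.data)

target = instructions[0].target
data = instructions[0].data
```

The XML declaration on the first line looks similar but is not a processing instruction, so it does not appear in the list. The href and type "attributes" are only a convention inside the PI data; it is up to the receiving program (e.g. a browser) to interpret them.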
29. XML on the Web
- Any XML content can be displayed in most modern browsers
- Ways to use XML
  - XHTML: HTML rewritten in XML
  - Any XML document together with a CSS stylesheet or an XSLT transformation
  - Specialized formats like SVG (vector graphics), X3D (3D vector graphics), MathML (formulas)
  - Combinations of the above (more difficult!)
  - A word processor plus output filters
30. XHTML
- XHTML is HTML that respects XML syntax
  - E.g. all tags must be closed
  - Tags are defined in lower-case
- Note: XHTML Strict is HTML without formatting information
  - No attributes like "align"
- Note: Internet Explorer can display XHTML, but it cannot handle XHTML "served as XML" by a server; it doesn't support included vocabularies either.
31. XML + CSS
- Document-centered XML with CSS 2 is easy
- To apply a style sheet to a document, use the following syntax for each element
    selector {attribute1: value1; attribute2: value2}
- selector is the element name from the XML document
- attribute and value are the style attributes and attribute values to be applied to the document
- Example
    ARTIST {color: red; font-weight: bold}
- will display the text of the ARTIST element in a red boldface type
32. XML + XSLT (1)
- XSLT is a transformation language that can translate from XML to anything
- It also works well with Mozilla/Firefox and IE 6/7
- Translated into HTML (as an example)
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
    <html><head><title>Hello Cocoon friend</title></head>
    <body bgcolor="#ffffff">
    <h1 align="center">Hello Cocoon friend</h1>
    <p align="center"> Here is some content :) </p>
    <hr> Written by DKS/Tecfa, adapted from S.M./the Cocoon samples
    </body></html>
- XML source
    <?xml version="1.0"?>
    <page>
      <title>Hello Cocoon friend</title>
      <content>Here is some content :) </content>
      <comment>Written by DKS/Tecfa, adapted from S.M./the Cocoon samples </comment>
    </page>
33. XML + XSLT (2)
- The XSLT stylesheet used for the translation
    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="page">
        .....
        <html> <head> <title> <xsl:value-of select="title"/> </title> </head>
        <body bgcolor="#ffffff"> <xsl:apply-templates/> </body>
        </html>
      </xsl:template>
      <xsl:template match="title">
        <h1 align="center"> <xsl:apply-templates/> </h1>
      </xsl:template>
      <xsl:template match="content">
        <p align="center"> <xsl:apply-templates/> </p>
      </xsl:template>
      <xsl:template match="comment">
34. XML + XSLT + XSL-FO publication framework
- XSLT transforms data (from XML to any XML or even other formats)
- XSL-FO is a style language (mainly used to produce PDF documents)
35. XML in the documentation world
- XML is popular in the documentation world
- Specialized vocabularies to write huge documents (e.g. DocBook or DITA)
- Domain-specific vocabularies to enforce semantics, e.g. legal markup, news syndication
36. SVG
- SVG: Scalable Vector Graphics (as powerful as Flash)
- Partially supported in Firefox, plugin needed for IE
    <?xml version="1.0" standalone="no"?>
    <svg width="270" height="170" xmlns="http://www.w3.org/2000/svg">
      <rect x="5" y="5" width="265" height="165"
            style="fill:none;stroke:blue;stroke-width:2" />
      <rect x="15" y="15" width="100" height="50"
            fill="blue" stroke="black"
            stroke-width="3" stroke-dasharray="9 5"/>
      <rect x="15" y="100" width="100" height="50"
            fill="green" stroke="black"
            stroke-width="3" rx="5" ry="10"/>
      <rect x="150" y="15" width="100" height="50"
            fill="red" stroke="blue"
            stroke-opacity="0.5" fill-opacity="0.3"
            stroke-width="3"/>
      <rect x="150" y="100" width="100" height="50"
            style="fill:red;stroke:blue;stroke-width:1"/>
    </svg>
37. MathML
- Mathematical formulas
- Supported in Firefox, plugin needed for IE
- Example (the cube root of 1 - x/2)
    <mroot>
      <mrow>
        <mn>1</mn>
        <mo>-</mo>
        <mfrac>
          <mi>x</mi>
          <mn>2</mn>
        </mfrac>
      </mrow>
      <mn>3</mn>
    </mroot>
38. Metadata (1)
- Metadata are data about data
- Many repositories rely on metadata since
  - Repository contents are books, images, software, people, whatever
  - Users want to find, identify, select, obtain / use
  - But contents don't have enough information to ensure optimal retrieval
- Metadata can be
  - embedded in a resource
  - a separate entity linked to/from the resource
  - a dissociated database entry
39. Metadata (2)
- Most popular standard: Dublin Core
  - 15 elements, all optional, all repeatable
- Dublin Core (and most other standards) are RDF-based
- RDF: Resource Description Framework (Model and Syntax)
  - Recommendation of W3C, 1999
  - Generic architecture for metadata
  - set of conventions for applications exchanging metadata
  - allows semantics to be defined by different resource description communities
  - accommodates mixing of metadata from diverse sources
- RDF is also the basis of the semantic web (OWL, etc.)
40. XML query languages
- XPath (also part of XSLT)
  - 13 axes (navigation directions in the tree)
  - child (/), descendant (//), following-sibling, following, ...
  - NameTest, predicates
  - E.g.
      doc("bib.xml")//book[title="Harry Potter"]/ISBN
- XQuery (superset of XPath)
  - FLWOR expression
      for $x in doc("bib.xml")//book[title="Harry Potter"]/ISBN,
          $y in doc("imdb.xml")//movie
      where $y//novel/ISBN = $x
      return $y//title
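
The two queries can be approximated in Python with the limited XPath subset of the standard `xml.etree.ElementTree` module. This is only a sketch: the contents of bib.xml and imdb.xml are not given in the slides, so the two small documents, their element layout, the ISBN values and the film titles below are invented for illustration, and the XQuery join is emulated with an ordinary loop rather than run by an XQuery engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for bib.xml and imdb.xml.
BIB_XML = """<bib>
  <book><title>Harry Potter</title><ISBN>isbn-0001</ISBN></book>
  <book><title>XML Is Easy</title><ISBN>isbn-0002</ISBN></book>
</bib>"""

IMDB_XML = """<imdb>
  <movie><title>Film A</title><novel><ISBN>isbn-0001</ISBN></novel></movie>
  <movie><title>Film B</title><novel><ISBN>isbn-0009</ISBN></novel></movie>
</imdb>"""

bib = ET.fromstring(BIB_XML)
imdb = ET.fromstring(IMDB_XML)

# XPath: descendant axis (//), a name test (book), a predicate, a child step.
isbns = [e.text for e in bib.findall(".//book[title='Harry Potter']/ISBN")]
print(isbns)

# FLWOR emulation: for each movie, keep its titles if one of its
# novel ISBNs equals an ISBN selected above (the "where" clause).
titles = []
for movie in imdb.findall(".//movie"):
    novel_isbns = {e.text for e in movie.findall(".//novel/ISBN")}
    if novel_isbns & set(isbns):
        titles.extend(t.text for t in movie.findall(".//title"))
print(titles)
```

A real XQuery processor would evaluate the join declaratively and could optimize it. The loop only illustrates what the for / where / return clauses express.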
41. Data integration and exchange languages
- Web services (SOAP, WSDL, UDDI)
  - Amazon.com, eBay, ...
- Domain-specific data exchange schemas (>1000)
  - legal document exchange languages
  - business information exchange
- RSS: XML news feeds
  - CNN, slashdot, blogs, ...
42. More!
- The languages presented before are just a subset of the XML galaxy!
- In this course we will mainly deal with
  - The XML formalism, editing XML content
  - Defining DTDs
  - Associating CSS stylesheets
  - Transforming XML data with XSLT
43. Summary
- XML has a wide range of applications
- XML is just a formalism (meta-language), unlike HTML
- The W3C framework includes
  - General-purpose (accessory, transducing, ...) languages such as XML Schema, XSLT, XPath, XQuery, XLink, RDF
  - Useful languages for contents (vector graphics, multimedia animation, formulas)
- Other organizations
  - Define domain-specific vocabularies
  - Define alternative XML-based general-purpose languages
- XML is mostly used behind the scenes, but increasingly directly for web contents (mostly via XSLT)
44. References - Slides
- I borrowed contents from several PPT files found on the web, in particular from
  - Frank Tompa and Airi Salminen (2002), University of Waterloo, Introduction to XML
  - John A. Mess, Introduction to XML
  - Marty Kurth (2004), NYLA, A Practical Introduction to XML in Libraries
  - Pete Johnston, UKOLN, University of Bath, http://www.ukoln.ac.uk/
  - Ted Glaza, Introduction to XML
  - Roy Tennant, eScholarship, California Digital Library
  - Avi Silberschatz, Henry F. Korth, S. Sudarshan, Database System Concepts, http://www.db-book.com/
  - Carey, New Perspectives on XML (PPT slides provided by the author of our textbook)
  - Karl Aberer, XML and Semistructured Data, http://lsirpeople.epfl.ch/aberer/