Title: XML 101: A Technical Introduction to XML
1. XML 101: A Technical Introduction to XML
- 20 November 2002
- Bank of Montreal Database Users Group
- Ian GRAHAM
- IT Strategy, IBS, Technology and Solutions, BMO Financial Group
- E: <ian.graham_at_bmo.com>
- T: (416) 513.5656 / F: (416) 513.5590
- To download this talk: http://www.utoronto.ca/ian/talks/
2. Presentation Outline
- What is XML (basic introduction)
- Defining language dialects and constraints
  - DTDs, namespaces, and schemas
- XML processing
  - Parsers and parser interfaces
  - XML processing tools
- XML databases
  - High-level issues, and references
- XML messaging / web services
  - Why, and some issues/example
- Conclusions
3. What is XML?
- A base-level syntax
  - for encoding structured, text-based information (words, characters, ...)
- A text-based syntax
  - XML is written using printable Unicode characters. Explicit binary data is not allowed
- Supports extensible data formats
  - XML lets you define your own elements (essentially data types), within the constraints of the syntax rules
- Designed as a universal format
  - The syntax rules ensure that all XML processing software MUST handle a given piece of XML data identically
  - If you can read and process it, so can anybody else
4. XML: A Simple Example
- XML declaration (this is XML); its encoding attribute flags the character encoding used in the file

    <?xml version="1.0" encoding="iso-8859-1"?>
    <partorders xmlns="http://myco.org/Spec/partorders">
      <order ref="x23-2112-2342" date="25aug1999-12:34:23h">
        <desc> Gold sprockel grommets, with matching hamster </desc>
        <part number="23-23221-a12" />
        <quantity units="gross"> 12 </quantity>
        <deliveryDate date="27aug1999-12:00h" />
      </order>
      <order ref="x23-2112-2342" date="25aug1999-12:34:23h">
        . . . Order something else . . .
      </order>
    </partorders>

- On the original slide, XML tags and markup are shown in black, and encoded text data in blue
5. Example Revisited

    <partorders xmlns="http://myco.org/Spec/partorders">
      <order ref="x23-2112-2342" date="25aug1999-12:34:23h">
        <desc> Gold sprockel grommets, with matching hamster </desc>
        <part number="23-23221-a12" />
        <quantity units="gross"> 12 </quantity>
        <deliveryDate date="27aug1999-12:00h" />
      </order>
      <order ref="x23-2112-2342" date="25aug1999-12:34:23h">
        . . . Order something else . . .
      </order>
    </partorders>
6. XML Data Model - A Tree

    <partorders xmlns="...">
      <order date="..." ref="...">
        <desc> ..text.. </desc>
        <part />
        <quantity />
        <delivery-date />
      </order>
      <order ref=".." .../>
    </partorders>
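As a rough illustration of the tree model, the sketch below parses a small document shaped like the skeleton above and walks the resulting tree. It uses Python's standard `xml.etree.ElementTree`; the attribute values and the helper name `outline` are assumptions made for this example and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Illustrative document modelled on the part-orders example; the values are made up.
DOC = """<partorders>
  <order ref="x23" date="25aug1999">
    <desc>Gold grommets</desc>
    <part number="23-23221-a12"/>
    <quantity units="gross">12</quantity>
  </order>
  <order ref="x24" date="26aug1999"/>
</partorders>"""


def outline(elem, depth=0):
    """Return one indented line per element, depth-first (the document tree)."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each element becomes a node, nested elements become its children, and attributes hang off the node they were written on.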
7. XML Design goals
- Simple but reliable
  - Strict syntax rules, to eliminate syntax errors
  - Syntax defines structure (hierarchically), and names structural parts (element names) -- it is self-describing data
- Extensible and mixable
  - Can create your own language of tags/elements
  - Can mix one language with another, and still reliably separate / process the data
- Designed for a distributed environment
  - Can have remote (webbed) data, and retrieve and use it reliably
8. XML Processing: The XML Parser
- The parser must verify that the XML is syntactically correct
  - Such data is said to be well-formed
  - The minimal requirement to be XML
- A parser MUST stop processing if the data isn't well-formed
  - E.g., stop processing and throw an exception to the XML-based application. The XML 1.0 spec requires this behaviour
[Figure: XML data -> XML parser -> parser interface -> XML-based application]
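The stop-on-error behaviour can be sketched with Python's standard `xml.etree.ElementTree`, which raises `ParseError` as soon as the input is not well-formed. The helper name `parse_or_report` and the sample strings are assumptions for this illustration only.

```python
import xml.etree.ElementTree as ET


def parse_or_report(text):
    """Return (root, None) for well-formed XML, or (None, message) if the parser stops."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        # The parser refuses to hand partial data to the application.
        return None, str(err)


good, good_err = parse_or_report("<order><desc>ok</desc></order>")
bad, bad_err = parse_or_report("<order><desc>oops</order>")  # <desc> is never closed

print(good.tag, good_err)
print(bad, bad_err)
```

The application only ever sees a complete tree or an exception, never a best-guess repair of broken markup.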
9. Special Issues: Characters and Charsets
- The XML specification defines the characters allowed as whitespace in tags, e.g. <element id = "23.112" />
  - You cannot use the EBCDIC character NEL as whitespace
  - Must make sure not to do so!
- What if you want to include characters not defined in the encoding charset (e.g., Greek characters in an ISO-Latin-1 document)?
  - Use character references. For example, &#9824; is the spades character (♠), the 9824th character in the Unicode character set
- Also, a reminder that binary data is forbidden
  - It must be encoded as printable characters (e.g. using Base64)
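Both points can be tried out with the Python standard library. In this minimal sketch the element names `suit` and `payload`, the `encoding="base64"` attribute, and the sample bytes are all assumptions chosen for illustration.

```python
import base64
import xml.etree.ElementTree as ET

# A numeric character reference carries a character that ISO-8859-1 cannot encode.
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><suit>&#9824;</suit>'
suit = ET.fromstring(latin1_doc)
print(ord(suit.text))  # code point of the decoded character

# Binary data is carried as printable Base64 text inside an element.
raw = bytes([0, 255, 16, 128])  # arbitrary binary payload
payload = ET.Element("payload", encoding="base64")  # attribute name is hypothetical
payload.text = base64.b64encode(raw).decode("ascii")
xml_text = ET.tostring(payload, encoding="unicode")

# The receiver parses the XML, then decodes the text back into bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
```

The parser resolves the character reference itself; Base64 decoding is the application's job, since to XML the payload is just text.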
10. Parsers and DTDs
- A DTD can define external parts (entities) to be included in the document
- But ... what if the parser can't find the external parts (firewall?)
- That depends on the type: there are two types of XML parsers
  - one that MUST retrieve all parts
  - one that can ignore them (if it can't find them)
[Figure: XML data and DTD -> parser -> parser interface -> XML-based application]
11. Two types of XML parsers
- Validating
  - Must retrieve all entities and process all of the DTD. Will stop processing and indicate a failure if it cannot
  - It must also test and verify other things in the DTD -- instructions that define syntactic document rules (allowed elements, attributes, etc.)
- Non-validating (well-formed only)
  - Tries to retrieve all parts, but will cease processing the DTD content at the first part (entity) it can't find
  - But this is not an error -- the parser simply makes available the XML data (and the names of any unresolved parts) to the application
- Application behavior will depend on parser type
- Many parsers can operate in either mode (configurable)
13. Defining constraints / languages
- Two ways of doing so
  - XML Document Type Definition (DTD) -- part of the core XML spec
  - XML Schema (often called XSD) -- new specification (2001), which allows for richer constraints on XML documents
- What DTDs and/or schemas specify
  - Allowed element and attribute names, hierarchical nesting rules, element content/type restrictions
- Adding dialect specifications implies two classes of XML data
  - Well-formed: XML that is syntactically correct
  - Valid: XML that is well-formed and consistent with a specific DTD (or Schema)
- Schemas are more powerful than DTDs
  - Often used for type validation, or for defining low-level type constraints (integer, varchar, datetime, etc.) and constraints on values
14. DTD Example

    <!DOCTYPE transfers [
      <!ELEMENT transfers (fundsTransfer) >
      <!ELEMENT fundsTransfer (from, to) >
      <!ATTLIST fundsTransfer
          date CDATA #REQUIRED>
      <!ELEMENT from (amount, transitID?, accountID, acknowledgeReceipt ) >
      <!ATTLIST from
          type (intrabank|internal|other) #REQUIRED>
      <!ELEMENT amount (#PCDATA) >
      . . . Omitted DTD content . . .
      <!ELEMENT to EMPTY >
      <!ATTLIST to
          account CDATA #REQUIRED>
    ]>
    <transfers>
      <fundsTransfer date="20010923T123434Z">
        . . . As with previous example . . .
15. XML Namespaces
- Mechanism for identifying different spaces for XML names
  - That is, element or attribute names
- This is a way of identifying different language dialects, consisting of names that have specific semantic (and processing) meanings
  - For example, <key/> in one language (e.g. a security key) can be distinguished from <key/> in another language (a database key)
- Mechanism uses a special xmlns attribute to define namespaces
  - The namespace is a URL string
  - But the URL does not reference anything in particular (there may be nothing there!)
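The <key/> example can be sketched as follows, using Python's standard `xml.etree.ElementTree`. Both namespace URLs and the prefixes `sec` and `db` are invented for this illustration; as noted above, nothing needs to exist at those addresses.

```python
import xml.etree.ElementTree as ET

# Two dialects that both define an element called "key" (hypothetical namespaces).
DOC = """<doc xmlns:sec="http://example.org/ns/security"
     xmlns:db="http://example.org/ns/database">
  <sec:key>0xA1B2</sec:key>
  <db:key>customer_id</db:key>
</doc>"""

root = ET.fromstring(DOC)

# The parser expands each prefix, so every name carries its namespace with it.
tags = [child.tag for child in root]
print(tags)

# Lookups name the namespace explicitly, so the two <key/> elements never collide.
ns = {"sec": "http://example.org/ns/security", "db": "http://example.org/ns/database"}
security_key = root.find("sec:key", ns).text
database_key = root.find("db:key", ns).text
```

The prefix is only shorthand; what identifies the dialect is the namespace string itself.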
16. Mixing languages together
- Namespaces let you do this relatively easily

    <?xml version="1.0" encoding="utf-8" ?>
    <html xmlns="http://www.w3.org/1999/xhtml1"
          xmlns:mt="http://www.w3.org/1998/mathml">
      <head>
        <title> Title of XHTML Document </title>
      </head><body>
        <div class="myDiv">
          <h1> Heading of Page </h1>
          <mt:mathml>
            <mt:title> ... MathML markup . . .
          </mt:mathml>
          <p> more html stuff goes here </p>
        </div>
      </body>
    </html>

- Default space is xhtml
- mt: prefix indicates space mathml (a different language)
17. XML Schemas
- A specification for defining XML validation rules
  - Specs: http://www.w3.org/XML/Schema
  - Best practice: http://www.xfront.com/BestPracticesHomepage.html
- Uses pure XML (plus namespaces) to do this
- More powerful than DTDs -- can specify things like integer types, date strings, real numbers in a given range, etc.
- Often used for type validation, or for relating database schemas to XML models
- They don't, however, let you declare entities -- those can only be declared in DTDs
- The following slide shows the XML schema equivalent to our DTD
18. XML Schema version of our DTD (Portion)

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               elementFormDefault="qualified">
      <xs:element name="accountID" type="xs:string"/>
      <xs:element name="acknowledgeReceipt" type="xs:string"/>
      <xs:complexType name="amountType">
        <xs:simpleContent>
          <xs:restriction base="xs:string">
            <xs:attribute name="currency" use="required">
              <xs:simpleType>
                <xs:restriction base="xs:NMTOKEN">
                  <xs:enumeration value="USD"/>
                  . . . (some stuff omitted) . . .
                </xs:restriction>
              </xs:simpleType>
            </xs:attribute>
          </xs:restriction>
        </xs:simpleContent>
      </xs:complexType>
      <xs:complexType name="fromType">
        <xs:sequence>
          <xs:element name="amount" type="amountType"/>
          <xs:element ref="transitID" minOccurs="0"/>
          <xs:element ref="accountID"/>
          <xs:element ref="acknowledgeReceipt"/>
        </xs:sequence>
        . . . And still more !!! . . .
20. XML Software
- XML parsers
  - Read in XML data, check for syntactic (and possibly DTD/Schema) constraints, and make the data available to an application. There are three 'generic' parser APIs
    - SAX: Simple API for XML (event-based)
    - DOM: Document Object Model (object/tree based)
      - JDOM: Java Document Object Model (object/tree based; Java-specific)
    - Pull: evolving API (new) (pull-based / object tree)
- Lots of XML parsers and interface software available
  - Unix, Linux, Windows 2000/XP, z/OS, etc.
- SAX-based parsers are fast (often as fast as you can stream data)
- DOM is slower and more memory intensive (creates an in-memory version of the entire document)
- Validating can be much slower than non-validating
21. Parser API: SAX
- A) SAX: Simple API for XML
  - http://www.megginson.com/SAX/index.html
- An event-based interface (a push parser API)
  - Parser reports events whenever it sees a tag/attribute/text node/unresolved external entity/other (driven by the input stream)
  - Programmer attaches event handlers to handle the events
- Advantages
  - Simple to use
  - Very fast (not doing very much before you get the tags and data)
  - Low memory footprint (doesn't read an XML document entirely into memory)
- Disadvantages
  - Not doing very much for you -- you have to do everything yourself
  - Not useful if you have to dynamically modify the document once it's in memory (since you'll have to do all the work to put it in memory yourself!)
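A minimal sketch of the push model, using Python's standard `xml.sax` binding of SAX. The handler class `PartCollector` and the sample document are assumptions for this example; the point is that the parser drives the code by calling the handler's methods as it reads.

```python
import xml.sax


class PartCollector(xml.sax.ContentHandler):
    """Records part numbers and <desc> text as the parser pushes events."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.desc = []
        self._in_desc = False

    def startElement(self, name, attrs):
        if name == "part":
            self.parts.append(attrs.get("number"))
        elif name == "desc":
            self._in_desc = True

    def characters(self, content):
        # Text may arrive in several pieces, so the handler has to accumulate it.
        if self._in_desc:
            self.desc.append(content)

    def endElement(self, name):
        if name == "desc":
            self._in_desc = False


DOC = b"""<partorders><order ref="x23">
<desc>Gold grommets</desc><part number="23-23221-a12"/>
</order></partorders>"""

handler = PartCollector()
xml.sax.parseString(DOC, handler)
description = "".join(handler.desc)
print(handler.parts, description)
```

Nothing is kept unless the handler keeps it, which is where both the low memory footprint and the "you have to do everything yourself" disadvantage come from.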
22. Parser API: DOM
- B) DOM: Document Object Model
  - http://www.w3.org/DOM/
- An object-based interface
  - Parser generates an in-memory tree corresponding to the document
  - DOM interface defines methods for accessing and modifying the tree
- Advantages
  - Very useful for dynamic modification of, and access to, the tree
  - Useful for querying (i.e. looking for data) that depends on the tree structure, e.g. in pseudo-code: element.childNode("2").getAttributeValue("boobie")
  - Same interface for many programming languages (C, Java, ...)
- Disadvantages
  - Can be slow (needs to produce the tree), and may need lots of memory
  - DOM programming interface is a bit awkward, not terribly object oriented
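For comparison with SAX, here is a small sketch of tree access and modification using `xml.dom.minidom`, the lightweight DOM implementation in the Python standard library. The sample document and the added `note` element are assumptions for illustration.

```python
from xml.dom import minidom

DOC = '<order ref="x23"><part number="a12"/><quantity units="gross">12</quantity></order>'

dom = minidom.parseString(DOC)  # the whole document is now an in-memory tree
order = dom.documentElement

# Query by tree position: the second child element of <order>, then one of its attributes.
children = [n for n in order.childNodes if n.nodeType == n.ELEMENT_NODE]
units = children[1].getAttribute("units")

# Modify the tree in place: change an attribute and append a new element.
children[1].setAttribute("units", "dozen")
note = dom.createElement("note")
note.appendChild(dom.createTextNode("rush"))
order.appendChild(note)

result = order.toxml()  # serialise the modified tree back to XML text
print(units)
print(result)
```

The explicit `createElement` / `createTextNode` / `appendChild` steps give a feel for why the DOM interface is often described as awkward.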
23. DOM Parser Processing Model
[Figure: DOM parser processing model]
24. Parser API: JDOM
- B2) JDOM: Java Document Object Model
  - http://www.jdom.org
- A Java-specific object-oriented interface
  - Parser generates an in-memory tree corresponding to the document
  - JDOM interface has methods for accessing and modifying the tree
- Advantages
  - Very useful for dynamic modification of the tree
  - Useful for querying (i.e. looking for data) that depends on the tree structure
  - Much nicer object-oriented programming interface than DOM
- Disadvantages
  - Can be slow (make that tree...), and can take up lots of memory
  - New, and not entirely cooked (but close)
  - Only works with Java
25. Parser API: Pull
- C) Pull Interfaces
  - http://www.xmlpull.org/ (Java); there is also a .NET pull API
- A pull-parser interface
  - API uses expressions / methods to pull specific chunks of XML data, or to iterate over the XML
  - Can be built on top of a DOM model
- Advantages
  - Easier to write applications that need to read in and process XML data (easier model than a push API, in many cases)
  - Has proven a very popular component in the .NET toolkit
- Disadvantages
  - Can be slow if you do lots of iteration over the XML input data
  - No common API across different languages (although xmlpull.org tries to be similar to the .NET API)
  - Not yet a real standard (still being worked on; not part of most commercial environments)
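The pull style can be sketched with `XMLPullParser` from Python's standard `xml.etree.ElementTree`. This is Python's own pull interface, not the xmlpull.org or .NET API mentioned above, and the chunked input and the helper name `drain` are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Input arriving in pieces, e.g. from a network stream (contents are illustrative).
chunks = [
    '<partorders><order ref="x23">',
    '<part number="a12"/></order>',
    '<order ref="x24"/></partorders>',
]

parser = ET.XMLPullParser(events=("start", "end"))
seen = []


def drain(pull_parser, refs):
    """Pull whatever events are ready and record each completed <order>."""
    for event, elem in pull_parser.read_events():
        if event == "end" and elem.tag == "order":
            refs.append(elem.get("ref"))


for chunk in chunks:
    parser.feed(chunk)   # the application decides when to supply data ...
    drain(parser, seen)  # ... and when to ask for events
parser.close()
drain(parser, seen)      # pick up anything reported only at end of input

print(seen)
```

Here the application's loop is in control and asks the parser for the next events, the reverse of the SAX arrangement where the parser calls the application.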
26. XML Processing: XSLT
- D) XSLT: eXtensible Stylesheet Language -- Transformations
  - http://www.w3.org/TR/xslt
- An XML language for processing/transforming XML
  - Does tree transformations -- takes XML and an XSLT style sheet as input, and produces a new XML document with a different structure
- Advantages
  - Very useful for tree transformations -- much easier than DOM or SAX for this purpose
  - Can be used to query a document (XSLT pulls out the part you want)
- Disadvantages
  - Can be slow for large documents or stylesheets
  - Can be difficult to debug stylesheets (poor error detection; much better if you use schemas)
27. XSLT processing model
[Figure: the XML data and the XSLT style sheet each pass through an XML parser (with a schema) to become document objects; the XSLT processor takes the document objects for data and style sheet and produces the output data (XML)]
28. XML Processing Toolkits
- Lots of them
  - Java: JAXP (http://java.sun.com/xml/jaxp/faq.html), dom4j (http://www.dom4j.org)
  - .NET (part of the .NET framework)
  - others
- Provide DOM, SAX, (JDOM) interfaces, plus lots of other useful tools in a standardized way (loading parsers, performing XSLT transformations, etc.)
- JAXP is standard Java, and thus integrated with WebSphere
30. XML and databases
- So where do you stick XML data?
  - Inside a database!?!
  - But how to do this, and which database type to use: RDBMS, ORDBMS, ODB, XML??
- How you do so depends on the use cases you have for the data. Some good-to-ask questions are
  - Am I talking about storing documents, or data?
    - Is the XML format integral to the application (e.g. XHTML, DocBook)?
  - How will the database be queried?
    - Queried by XML structure, or by standard SQL?
    - What parts of the document need to be queried?
    - Do I need a text index?
  - How will the data be used/retrieved?
    - Passed to XML processing tools (e.g. XSLT), or used at the atomic simple-type level?
- The answers drive out
  - What database to choose, how to map XML to tables (O-R or table mappings), whether to store as a BLOB or broken up ...
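The "store whole or broken up" choice can be made concrete with a small sketch using Python's standard `sqlite3` and `xml.etree.ElementTree`. The table names, column layout, and sample order are assumptions for illustration, not a recommended mapping.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = '<order ref="x23"><part number="a12"/><quantity units="gross">12</quantity></order>'

db = sqlite3.connect(":memory:")
# Option 1: keep the document whole; it must be re-parsed before looking inside it.
db.execute("CREATE TABLE order_docs (ref TEXT PRIMARY KEY, doc TEXT)")
# Option 2: break the document up into typed columns that plain SQL can query.
db.execute("CREATE TABLE order_lines (ref TEXT, part TEXT, quantity INTEGER, units TEXT)")

root = ET.fromstring(DOC)
qty = root.find("quantity")
db.execute("INSERT INTO order_docs VALUES (?, ?)", (root.get("ref"), DOC))
db.execute(
    "INSERT INTO order_lines VALUES (?, ?, ?, ?)",
    (root.get("ref"), root.find("part").get("number"), int(qty.text), qty.get("units")),
)

# Whole-document storage returns the exact XML, ready for XSLT or other XML tools.
stored_doc = db.execute("SELECT doc FROM order_docs WHERE ref = ?", ("x23",)).fetchone()[0]
# Broken-up storage answers value-level questions without any XML processing.
total = db.execute("SELECT SUM(quantity) FROM order_lines WHERE units = 'gross'").fetchone()[0]
print(stored_doc, total)
```

Which option fits depends on the questions above: whole storage suits document-centric use, while broken-up storage suits data that is queried at the level of individual values.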
31. XML and databases
- Upcoming technologies
  - XML Query: a query language for querying XML datasets (and databases)
  - Uses XML Schema for type casting and validation
  - Info: http://www.w3.org/XML/Query
- Useful XML database references
  - http://www.xml.com/pub/a/2001/10/31/nativexmldb.html -- Introductory article
  - http://www.rpbourret.com/xml/XMLAndDatabases.htm -- XML and databases
  - http://www.rpbourret.com/xml/XMLDatabaseProds.htm -- Products list
  - http://www.xmldb.org/resources.html -- Docs / resource list
33. XML Messaging
- Use XML as the format for sending messages between systems
- Advantages
  - Common syntax; self-describing (easier to parse)
  - Can use common/existing transport mechanisms to move the XML data (HTTP, HTTPS, SMTP (email), MQ, IIOP/(CORBA), JMS, ...)
- Requirements
  - Shared understanding of dialects for transport (a registry is required [namespaces!] for identifying dialects)
  - Shared acceptance of messaging contract
- Disadvantages
  - Asynchronous transport: no guarantee of delivery, no guarantee that the (external) partner shares acceptance of the contract
  - Messages will be much larger than binary (10x or more); can compress
34. Common messaging model
- XML over HTTP
  - Use HTTP to transport XML messages

    POST /path/to/interface.pl HTTP/1.1
    Referer: http://www.foo.org/myClient.html
    User-agent: db-server-olk
    Accept-encoding: gzip
    Accept-charset: iso-8859-1, utf-8, ucs
    Content-type: application/xml; charset=utf-8
    Content-length: 13221
    . . .

    <?xml version="1.0" encoding="utf-8" ?>
    <message>
      . . . Markup in message . . .
    </message>
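A request of this shape can be assembled with Python's standard `urllib.request`, as in the sketch below. The host `www.example.org` and the message body are hypothetical, and the request is only built, not sent.

```python
import urllib.request

# The XML message becomes the body of the HTTP request (content is illustrative).
body = '<?xml version="1.0" encoding="utf-8" ?><message>hello</message>'.encode("utf-8")

request = urllib.request.Request(
    "http://www.example.org/path/to/interface.pl",  # hypothetical endpoint
    data=body,
    headers={"Content-Type": "application/xml; charset=utf-8"},
    method="POST",
)

content_type = request.get_header("Content-type")
print(request.get_method(), request.full_url, content_type)

# urllib.request.urlopen(request) would transmit it and add Content-length;
# that call is left out so the sketch runs without a network connection.
```

The transport knows nothing about the XML inside; only the Content-type header tells the receiver how to interpret the body.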
35. Some standards for message format
- Define dialects designed to wrap remote invocation messages
- XML-RPC: http://www.xmlrpc.com
  - Very simple way of encoding function/method call name, and passed parameters, in an XML message
- SOAP (Simple Object Access Protocol): http://www.soapware.org
  - More complex wrapper, which lets you specify schemas for interfaces, more complex rules for handling/proxying messages, etc. This is a core component of Microsoft's .NET strategy, and is integrated into more recent versions of WebSphere and other commercial packages. W3C activity (the W3C sets the SOAP spec) is outlined at http://www.w3.org/2000/xp/Group/
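To see what an XML-RPC message looks like, the sketch below uses Python's standard `xmlrpc.client` to encode a call and decode it again. The method name `placeOrder` and its parameters are invented for this example.

```python
import xmlrpc.client

# Encode a method call: the method name plus a tuple of parameters.
call_xml = xmlrpc.client.dumps((12, "gross"), methodname="placeOrder")
print(call_xml)

# The receiving side decodes the same XML back into parameters and method name.
params, method = xmlrpc.client.loads(call_xml)
print(method, params)
```

The printed message is a `methodCall` element containing the method name and typed parameter values, which is essentially all there is to the XML-RPC wrapper.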
36. XML Messaging Processing
- XML as a universal format for data exchange
[Figure: a Factory application places an order (XML/EDI) with its Suppliers using SOAP over HTTP, and the response (XML/EDI) comes back using SOAP over HTTP. Labelled layers: Application, SOAP API / SOAP interface, SOAP, XML/EDI, Transport (HTTP(S), SMTP, other ...)]
37. Web Services Model
- SOAP plus higher-level modeling for how services are advertised, exposed and found
- Uses an XML dialect, WSDL (Web Services Description Language), to define a service
  - WSDL can use XML Schema to define how data is passed between a service provider and requestor
- Uses an XML dialect, UDDI (Universal Description, Discovery and Integration), for
  - Describing services (high-level)
  - Discovering services (registry services, metadata)
  - UDDI is defined using XML Schema
- Core technology for application integration
  - Microsoft .NET
  - IBM WebSphere
  - Oracle
  - ... many others
38. Web Services Code Development
[Figure: web services code development. An automated code generator uses the WSDL and XML schema to produce the WS/SOAP proxy and skeleton code, which exchange SOAP requests/responses. Remaining labels: client code; middle tier code (validation, business logic, routing, logging, more); adapters; product system code; MECH; "Write the Application!"]
40. XML (and related) Specifications
[Figure: map of XML and related specifications. Status legend: W3C rec, W3C draft, industry std, open std. Groupings: XML Core, APIs, Style, Protocols, Web Services, Application areas. Specifications shown: XML 1.0, XML names, XSLT, JDOM, JAXP, DOM 1, DOM 2, XHTML 1.0, XML query, XML schema, SAX 1, SAX 2, SOAP, UDDI, XML-RPC, WSDL]
41. XML 101: A Technical Introduction to XML
- The End.