Title: Effective XML
1Effective XML
- Elliotte Rusty Harold
- elharo_at_metalab.unc.edu
- http://www.cafeconleche.org/
2Part I Syntax
3Stay with XML 1.0
- XML 1.1
- New name characters
- C0 control characters
- C1 control characters
- NEL
- Undeclare namespace prefixes
- Incompatible with
- Most XML parsers
- W3C and RELAX NG schema languages
- XOM, JDOM
4Part II Structure
5The XML Stack
6Allow All XML syntax
- CDATA sections
- Entity references
- Processing instructions
- Comments
- Numeric character references
- Document type declarations
- Different ways of representing the same core content, not different information
7Distinguish text from markup
- A DocBook element:
<programlisting><![CDATA[<value>
<double>28657</double></value>]]></programlisting>
- The content is:
<value> <double>28657</double></value>
- This is the same:
<programlisting>&lt;value&gt;
&lt;double&gt;28657&lt;/double&gt;
&lt;/value&gt;</programlisting>
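A minimal sketch of the point above, using the JDK's built-in DOM parser: the CDATA form and the escaped form should report the same text content. The class and method names are illustrative, and the white space from the slide's example is dropped for brevity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class TextVersusMarkup {

    // Parses a small XML string and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement()
                          .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String cdata =
            "<programlisting><![CDATA[<value><double>28657</double></value>]]></programlisting>";
        String escaped =
            "<programlisting>&lt;value&gt;&lt;double&gt;28657&lt;/double&gt;&lt;/value&gt;</programlisting>";

        // Both documents carry the same text; only the syntax differs.
        System.out.println(rootText(cdata));
        System.out.println(rootText(cdata).equals(rootText(escaped)));
    }
}
```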
8The reverse problem
- Tools that create XML from strings
- Tree-based editors like <Oxygen/> or XML Spy
- WYSIWYG applications like OpenOffice Writer
- Programming APIs such as DOM, JDOM, and XOM
- The tool automatically escapes reserved characters like <, >, or &.
- Just because something looks like an XML tag does not mean it is an XML tag.
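As a rough illustration of an API doing the escaping, the sketch below builds an element through DOM and serializes it with the JDK's identity Transformer. The element name "note" and the class name are made up for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapingDemo {

    // Builds <note>text</note> through the DOM API and serializes it to a string.
    static String serializeNote(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .newDocument();
            Element note = doc.createElement("note");
            note.setTextContent(text);   // plain string in; no manual escaping
            doc.appendChild(note);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The "<b>" in the string is text, not a tag, and is written as text.
        System.out.println(serializeNote("Use <b> for bold & <i> for italic"));
    }
}
```

Passing a string that was already escaped by hand would escape it a second time, which is the reverse problem in practice.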
9White space matters
- Parsers report all white space in element content, including boundary white space
- An xml:space attribute is for the client application only, not the parser
- White space in attribute values is normalized
- Parsers do not report white space in the prolog, epilog, the document type declaration, or tags.
10Make structure explicit through markup
- Bad:
<Transaction>Withdrawal 2003 12 15 200.00</Transaction>
- Better:
<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>
11Store metadata in attributes
- Material the reader doesn't want to see
- URLs
- IDs
- Styles
- Revision dates
- Author's name
- No substructure
- Revision tracking
- Citations
- No multiple elements
12Remember mixed content
- Narrative documents
- Record-like documents
- The RSS problem
<item>
  <title>Xerlin 1.3 released</title>
  <description>
    Xerlin 1.3, an open source XML Editor written in
    Java, has been released. Users can extend the
    application via custom editor interfaces for
    specific DTDs. New features in version 1.3 include
    XML Schema support, WebDAV capabilities, and
    various user interface enhancements. Java 1.2
    or later is required.
  </description>
  <link>http://www.cafeconleche.org/#news2003April7</link>
</item>
13What you really want is this
<description>
  <p><a href="http://www.xerlin.org"><strong>Xerlin 1.3</strong></a>, an open
  source XML Editor written in Java, has been
  released. Users can extend the application via
  custom editor interfaces for specific DTDs. New
  features in version 1.3 include:</p>
  <ul>
    <li>XML Schema support</li>
    <li>WebDAV capabilities</li>
    <li>Various user interface enhancements</li>
  </ul>
  <p>Java 1.2 or later is required.</p>
</description>
14What people do is this
<description>&lt;p&gt;&lt;a href="http://www.xerlin.org"&gt;&lt;strong&gt;Xerlin 1.3&lt;/strong&gt;&lt;/a&gt;, an
open source XML Editor written in Java, has been
released. Users can extend the application via
custom editor interfaces for specific DTDs. New
features in version 1.3 include:&lt;/p&gt;
&lt;ul&gt; &lt;li&gt;XML Schema support&lt;/li&gt;
&lt;li&gt;WebDAV capabilities&lt;/li&gt;
&lt;li&gt;Various user interface enhancements&lt;/li&gt;
&lt;/ul&gt; &lt;p&gt;Java 1.2 or later is
required.&lt;/p&gt; </description>
15Prefer URLs to unparsed entities and notations
- URLs are simple and well understood
- Notations and unparsed entities are confusing and little used
- URLs don't require the DTD to be read
- Many APIs don't even support notations and unparsed entities
16Part III Semantics
17Use processing instructions for process-specific content
- For a very particular, even local, process
- Describes how a particular process acts on the data in the document
- Does not describe or add to the content itself
- A unit that can be treated in isolation
- Content is not XML-like.
- Applies to the entire document
18Processing instructions are not appropriate when
- Content is closely related to the content of the document itself
- Structure extends beyond a single processing instruction
- Needs to be validated
19Include all information in instance documents
- Not all parsers read the DTD
- Especially browsers
- Beware
- Default attribute values
- Parsed entity references
- XInclude
- ID type dependence (XPath, DOM, etc.)
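To make the default-attribute risk concrete, here is a small sketch with the JDK's DOM parser. The DTD is placed in the internal subset so the example is self-contained and the parser is certain to read it; with an external DTD subset, a parser that skips the DTD would report no type attribute at all for the first document. The class name and the attribute default are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DefaultAttributeDemo {

    // Returns the root element's "type" attribute as the parser reports it.
    static String typeOfRoot(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element root = builder.parse(new InputSource(new StringReader(xml)))
                                  .getDocumentElement();
            return root.getAttribute("type");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The value exists only in the DTD: fragile if the DTD is not read.
        String fromDtd =
            "<!DOCTYPE Transaction [<!ATTLIST Transaction type CDATA \"withdrawal\">]>"
            + "<Transaction/>";
        // The value is in the instance document: every parser sees it.
        String inInstance = "<Transaction type=\"withdrawal\"/>";

        System.out.println(typeOfRoot(fromDtd));
        System.out.println(typeOfRoot(inInstance));
    }
}
```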
20Encode binary data using quoted printable and/or Base64
- Quoted printable works well for mostly text
- Base64 for non-text data
- Can you link to the data with a URL instead?
- Can you bundle the data with XML using zip, jar, XOP, or MIME?
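A minimal sketch of the Base64 option, using java.util.Base64 from the JDK (Java 8 and later). The class and method names are illustrative only.

```java
import java.util.Base64;

public class Base64Payload {

    // Encodes arbitrary bytes as Base64 text that is safe to place in element content.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Decodes element text back to the original bytes; surrounding white space is ignored.
    static byte[] decode(String elementText) {
        return Base64.getDecoder().decode(elementText.trim());
    }

    public static void main(String[] args) {
        byte[] binary = {0x00, (byte) 0xFF, 0x10, 0x20};
        String text = encode(binary);
        // The Base64 alphabet contains no <, >, or &, so the text needs no escaping.
        System.out.println("<data encoding=\"base64\">" + text + "</data>");
        System.out.println(decode(text).length);
    }
}
```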
21Use namespaces for modularity and extensibility
- Not hard: simple cases can use one default namespace
- http URIs are normally preferred
- DTD validation is tricky
- Code to namespace URIs, not prefixes
- Avoid namespace prefixes in element content and attribute values
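One way to code to namespace URIs rather than prefixes, sketched with a namespace-aware DOM parser. The SVG namespace and the "rect" element are just a convenient example; the class name is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceLookup {

    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Counts rect elements in the SVG namespace, whatever prefix the document uses.
    static int countRects(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(SVG_NS, "rect").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String prefixed  = "<svg:svg xmlns:svg='" + SVG_NS + "'><svg:rect/></svg:svg>";
        String defaulted = "<svg xmlns='" + SVG_NS + "'><rect/></svg>";
        String other     = "<s:svg xmlns:s='" + SVG_NS + "'><s:rect/></s:svg>";
        System.out.println(countRects(prefixed));
        System.out.println(countRects(defaulted));
        System.out.println(countRects(other));
    }
}
```

Code that looked for the literal name "svg:rect" would find the element only in the first document.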
22Reuse XHTML for generic narrative content
<!ENTITY % xhtml1 SYSTEM "http://www.w3.org/TR/xhtml1/DTD/strict.dtd">
%xhtml1;
<!ELEMENT description %Block;>
23Choose the right schema language for the job
- DTDs
- The W3C XML Schema Language
- RELAX NG
- Schematron
24Use only what you need
- You need
- Well-formed XML 1.0
- A parser
- You probably need
- Namespaces
- You may not need
- DTDs
- Schemas
- XInclude
- SOAP
- WS-Kitchen-Sink
- etc.
25Always use a parser
- Can't use regular expressions
- Detecting encoding
- Comments and processing instructions that contain tags
- CDATA sections
- Unexpected placement of spaces and line breaks within tags
- Default attribute values
- Character and entity references
- Malformed documents
- Internal DTD Subset
- Why not?
- Unfamiliarity with parsers
- Too slow
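A small sketch of why regular expressions go wrong, comparing a naive pattern against the JDK's DOM parser. The sample document is invented to hit three of the cases listed above: a comment containing a tag, a line break inside a tag, and a CDATA section.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParserVersusRegex {

    static final String SAMPLE =
        "<list>"
        + "<!-- <item>old</item> -->"          // commented-out markup
        + "<item>new</item>"
        + "<item\n>split</item>"               // line break inside the start-tag
        + "<![CDATA[<item>text</item>]]>"      // text that looks like markup
        + "</list>";

    // Naive approach: count literal "<item>" strings.
    static int countWithRegex(String xml) {
        Matcher m = Pattern.compile("<item>").matcher(xml);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Parser approach: count actual item elements.
    static int countWithParser(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("item").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("regex:  " + countWithRegex(SAMPLE));   // prints "regex:  3"
        System.out.println("parser: " + countWithParser(SAMPLE));  // prints "parser: 2"
    }
}
```

The pattern counts the comment and the CDATA section and misses the element whose start-tag is split across lines; the parser reports the two real elements.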
26Layer Functionality
27Program to standard APIs
- Easier to deploy in Java 1.4/1.5
- Different implementations have different performance characteristics
- SAX is fast
- DOM interoperates
- Semi-standard
- JDOM
- XOM
- Bleeding edge
- StAX
- JAXB
28Read the complete DTD
- Be conservative in what you generate; liberal in what you accept
- Important content from DTD
- Default attribute values
- Namespace declarations
- Entity references
- ID types
29Navigate with XPath
- More robust against unexpected structure
- Allow optimization by engine
- Easier to code: enhanced programmer productivity
- Might be slower
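A brief sketch of XPath navigation with javax.xml.xpath from the JDK. The Statement wrapper element and the class name are assumptions; the Transaction markup follows the earlier slide.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathDemo {

    // Returns the amount of the first withdrawal, wherever it sits in the tree.
    static String withdrawalAmount(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//Transaction[@type='withdrawal']/Amount",
                                  new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<Statement>"
            + "<Transaction type=\"deposit\"><Amount>50.00</Amount></Transaction>"
            + "<Transaction type=\"withdrawal\">"
            + "<Date>2003-12-15</Date><Amount>200.00</Amount>"
            + "</Transaction>"
            + "</Statement>";
        System.out.println(withdrawalAmount(xml));   // prints "200.00"
    }
}
```

The same expression keeps working if the transactions are later wrapped in another element or gain extra children, which hand-written child-by-child traversal would not.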
30Validate inside your program with schemas
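One possible shape for in-program validation, using javax.xml.validation from the JDK with a W3C XML Schema held in a string. The one-element schema and the class name are invented for the example; other schema languages would need a different library.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class ValidateDemo {

    // Hypothetical schema: a single Amount element holding a decimal number.
    static final String SCHEMA =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"Amount\" type=\"xs:decimal\"/>"
        + "</xs:schema>";

    // Returns true if the document is well-formed and valid against SCHEMA.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // malformed, or does not satisfy the schema
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<Amount>200.00</Amount>"));   // prints "true"
        System.out.println(isValid("<Amount>lots</Amount>"));     // prints "false"
    }
}
```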
31Part IV Implementation
32Write documents in Unicode
- Prefer UTF-8
- Smaller in English
- ASCII compatible
- Normalization
- É, ü, ì and so forth
- NFC
- ICU
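A short sketch of NFC normalization. The slide names ICU; this example swaps in java.text.Normalizer from the JDK, which covers the same basic operation without an extra library. The class name is illustrative.

```java
import java.text.Normalizer;

public class NfcDemo {

    // Converts a string to Unicode Normalization Form C (composed characters).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "E\u0301";   // E followed by a combining acute accent
        String composed   = "\u00C9";    // the single precomposed character

        // The two look identical on screen but compare as different strings.
        System.out.println(decomposed.equals(composed));          // prints "false"
        System.out.println(toNfc(decomposed).equals(composed));   // prints "true"
    }
}
```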
33Avoid Vendor Lock-in: Beware
- Opaque, binary data used in place of marked up text.
- Over-abbreviated, non-obvious names like F17354 and grgyt
- APIs that hide the XML
- Products that focus on the "Infoset"
- Alternate serializations of XML
- Patented formats
34Hang on to your relational database
35Document Namespaces with RDDL
<!DOCTYPE html PUBLIC "-//XML-DEV//DTD XHTML RDDL 1.0//EN"
    "http://www.rddl.org/rddl-xhtml.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:rddl="http://www.rddl.org/">
<head>
  <title>MegaBank Statement Markup Language (MBSML)</title>
</head>
<body>
<p>This is the XML namespace for the
  <a href="http://developer.megabank.com/xml/">MegaBank Statement Markup Language</a>.</p>
<rddl:resource xlink:type="simple"
    xlink:href="http://developer.megabank.com/xml/spec.html"
    xlink:role="http://www.w3.org/TR/html4/"
    xlink:arcrole="http://www.rddl.org/purposes#normative-reference">
  <p>The <a href="http://developer.megabank.com/xml/spec.html">MegaBank
    Statement Markup Language Specification 1.0</a></p>
</rddl:resource>
</body>
</html>
36Pick the correct MIME type
- application/xml
- Not text/xml!
- Don't use charset
- application/mathml+xml
- image/svg+xml
- application/xslt+xml
37TagSoup Your HTML
38Catalog common resources
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.2//EN"
          uri="file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>
39Compress if space is a problem
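import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.xml.serialize.OutputFormat;   // Xerces
import org.apache.xml.serialize.XMLSerializer;  // Xerces
import org.w3c.dom.Document;

// output (fragment: doc is an existing DOM Document)
OutputStream fout = new FileOutputStream("data.xml.gz");
OutputStream out = new GZIPOutputStream(fout);
OutputFormat format = new OutputFormat(doc);
XMLSerializer output = new XMLSerializer(out, format);
output.serialize(doc);
out.close();  // finishes the gzip stream and closes the file

// input (separate fragment: reads the compressed file back)
InputStream fin = new FileInputStream("data.xml.gz");
InputStream in = new GZIPInputStream(fin);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
Document doc = parser.parse(in);
// work with the document...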
40To Learn More
- This Presentation: http://cafeconleche.org/slides/sd2007west/effectivexml
- Effective XML: 50 Specific Ways to Improve Your XML Documents
- Elliotte Rusty Harold
- Addison-Wesley, 2003
- ISBN 0-321-15040-6
- $44.99
- http://cafeconleche.org/books/effectivexml