The Joy of SAX (and DOM, and JDOM)

1
The Joy of SAX (and DOM, and JDOM)
  • Bill MacCartney
  • 11 October 2004

2
Roadmap
  • What are XML APIs good for?
  • Overview of JAXP
  • JAXP XML Parsers
  • SAX vs. DOM
  • SAX
  • SAX Architecture
  • Using SAX
  • SAXExample.java
  • DOM
  • DOM Architecture
  • Using DOM
  • DOMExample.java
  • JDOM

3
What are XML APIs for?
  • You want to read/write data from/to XML files,
    and you don't want to write an XML parser.
  • Applications:
      • processing an XML-tagged corpus
      • saving configs, prefs, parameters, etc. as XML
        files
      • sharing results with outside users in a portable
        format
          • example: typed dependency relations
      • alternative to serialization for persistent
        stores
          • doesn't break with changes to class definition
          • human-readable

4
Overview of JAXP
  • JAXP: Java API for XML Processing
  • Provides a common interface for creating and
    using the standard SAX, DOM, and XSLT APIs in
    Java.
  • All JAXP packages are included as standard in JDK
    1.4. The key packages are:

    javax.xml.parsers      The main JAXP APIs, which provide a common interface for various SAX and DOM parsers.
    org.w3c.dom            Defines the Document interface (a DOM), as well as interfaces for all of the components of a DOM.
    org.xml.sax            Defines the basic SAX APIs.
    javax.xml.transform    Defines the XSLT APIs that let you transform XML into other forms. (Not covered today.)
5
JAXP XML Parsers
  • javax.xml.parsers defines abstract classes
    DocumentBuilder (for DOM) and SAXParser (for
    SAX).
  • It also defines factory classes
    DocumentBuilderFactory and SAXParserFactory. By
    default, these give you the reference
    implementation of DocumentBuilder and SAXParser,
    but they are intended to be vendor-neutral
    factory classes, so that you could swap in a
    different implementation if you preferred.
  • The JDK includes three XML parser implementations
    from Apache:
      • Crimson: The original. Small and fast. Based on
        code donated to Apache by Sun. Standard
        implementation for J2SE 1.4.
      • Xerces: More features. Supports XML Schema.
        Based on code donated to Apache by IBM.
      • Xerces 2: The future. Standard implementation
        for J2SE 5.0.
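
To see which implementation the vendor-neutral factories actually hand out on a given JDK, one can simply print the concrete class names. This is a minimal sketch; the class name ParserImplementations and its methods are made up for illustration, and the output depends on the JDK in use.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper: reports which concrete parser classes the factories return.
class ParserImplementations {
    /** Concrete class name of the DocumentBuilder (DOM) implementation. */
    static String domImplementation() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException("no DOM parser available", e);
        }
    }

    /** Concrete class name of the SAXParser implementation. */
    static String saxImplementation() {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            return parser.getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("DOM: " + domImplementation());
        System.out.println("SAX: " + saxImplementation());
    }
}
```

Swapping in a different implementation is typically done by setting the system properties javax.xml.parsers.DocumentBuilderFactory or javax.xml.parsers.SAXParserFactory to the name of another factory class; the calling code stays the same.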

6
SAX vs. DOM
SAX: Simple API for XML
  • Java-specific
  • interprets XML as a stream of events
  • you supply event-handling callbacks
  • SAX parser invokes your event-handlers as it
    parses
  • doesn't build data model in memory
  • serial access
  • very fast, lightweight
  • good choice when
      • no data model is needed, or
      • natural structure for data model is list, matrix,
        etc.

DOM: Document Object Model
  • W3C standard for representing structured
    documents
  • platform and language neutral (not
    Java-specific!)
  • interprets XML as a tree of nodes
  • builds data model in memory
  • enables random access to data
  • therefore good for interactive apps
  • more CPU- and memory-intensive
  • good choice when data model has natural tree
    structure

There is also JDOM; more on that later.
7
SAX Architecture
8
Using SAX
Here's the standard recipe for starting with SAX:

    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    // get a SAXParser object
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

    // invoke parser using your custom content handler
    // (pick whichever overload matches your input)
    saxParser.parse(inputStream, myContentHandler);
    saxParser.parse(file, myContentHandler);
    saxParser.parse(url, myContentHandler);

(This reflects SAX 1, which you can still use,
but SAX 2 prefers a new incantation.)
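
As a self-contained illustration of the recipe above, here is a minimal sketch that parses XML held in a String. The class SaxRecipe and its ElementCounter handler are invented for this example; only the factory and parse calls come from the recipe.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxRecipe {
    /** Hypothetical handler: counts every start tag the parser reports. */
    static class ElementCounter extends DefaultHandler {
        int count = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    /** Runs the recipe on an in-memory document and returns the number of elements seen. */
    static int countElements(String xml) {
        try {
            // get a SAXParser object
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            // invoke the parser with our handler, using the InputStream overload
            ElementCounter handler = new ElementCounter();
            InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
            saxParser.parse(in, handler);
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```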
9
Using SAX 2
In SAX 2, the following usage is preferred:

    // tell SAX which XML parser you want (here, it's Crimson)
    System.setProperty("org.xml.sax.driver",
                       "org.apache.crimson.parser.XMLReaderImpl");

    // get an XMLReader object
    XMLReader reader = XMLReaderFactory.createXMLReader();

    // tell the XMLReader to use your custom content handler
    reader.setContentHandler(myContentHandler);

    // have the XMLReader parse input from Reader myReader
    reader.parse(new InputSource(myReader));
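
A runnable variant of the XMLReader usage, as a sketch only. It deliberately differs from the slide in one respect: instead of naming the Crimson driver (which current JDKs do not bundle) and calling XMLReaderFactory (marked deprecated in recent JDKs), it obtains the XMLReader from a JAXP SAXParser. The class name XmlReaderRecipe is an assumption for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class XmlReaderRecipe {
    /** Returns the qualified names of all elements, in document order. */
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            // get an XMLReader object (via JAXP rather than XMLReaderFactory)
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

            // tell the XMLReader to use a custom content handler
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    names.add(qName);
                }
            });

            // have the XMLReader parse input from a Reader
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return names;
    }
}
```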

But where does myContentHandler come from?
10
Defining a ContentHandler
  • Easiest route: define a new class which extends
    org.xml.sax.helpers.DefaultHandler.
  • Override event-handling methods from
    DefaultHandler.
  • (All are no-ops in DefaultHandler.)

    startDocument()   // receive notice of start of document
    endDocument()     // receive notice of end of document
    startElement()    // receive notice of start of each element
    endElement()      // receive notice of end of each element
    characters()      // receive a chunk of character data
    error()           // receive notice of recoverable parser error
    // ...plus more...
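
A handler along these lines might look as follows. This is a sketch under stated assumptions: the class name EventLogger and the idea of recording events in a list are illustrative choices, not something taken from the example code that accompanies the talk.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical ContentHandler: extends DefaultHandler and records each event it receives.
class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement " + qName);
    }

    @Override
    public void error(SAXParseException e) {
        events.add("error " + e.getMessage());
    }

    /** Parses the given XML and returns the recorded events. */
    static List<String> log(String xml) {
        EventLogger handler = new EventLogger();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.events;
    }
}
```

Methods that are not overridden (characters(), for instance) keep their do-nothing behaviour from DefaultHandler.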
11
startElement() and endElement()
The SAXParser invokes your callbacks to notify
you of events:

    startElement(String namespaceURI,  // for use w/ namespaces
                 String localName,     // for use w/ namespaces
                 String qName,         // "qualified" name -- use this one!
                 Attributes atts)
    endElement(String namespaceURI,
               String localName,
               String qName)

  • For simple usage, ignore namespaceURI and
    localName, and just use qName (the qualified
    name).
  • XML namespaces are described in an appendix,
    below.
  • startElement() and endElement() events always
    come in pairs:
      • <foo/> will generate calls
            startElement("", "", "foo", atts)   // atts is an empty Attributes object, not null
            endElement("", "", "foo")
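
Because the start and end events pair up, a handler can track nesting with a simple stack. The sketch below (class name DepthTracker is hypothetical) pushes on startElement(), pops on endElement(), and reports the deepest nesting seen.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: uses the start/end pairing to measure element nesting.
class DepthTracker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    private int maxDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);                           // element is now open
        maxDepth = Math.max(maxDepth, open.size());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String expected = open.pop();               // must match the most recent start
        if (!expected.equals(qName)) {
            throw new IllegalStateException("unbalanced: " + expected + " vs " + qName);
        }
    }

    /** Returns the maximum element nesting depth of the given XML. */
    static int maxDepth(String xml) {
        DepthTracker handler = new DepthTracker();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.maxDepth;
    }
}
```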

12
SAX Attributes
  • Every call to startElement() includes an
    Attributes object which represents all the XML
    attributes for that element.
  • Methods in the Attributes interface:

    getLength()             // return number of attributes
    getIndex(String qName)  // look up attribute's index by qName
    getValue(String qName)  // look up attribute's value by qName
    getValue(int index)     // look up attribute's value by index
    // ... and others ...
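
A short sketch of these methods in use: it copies the attributes of the first element with a given name into a map. The class AttributeReader and its method are illustrative names; getQName(int), used here to recover each attribute's name, is one of the "others" in the Attributes interface.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributeReader {
    /** Returns name/value pairs for the first element whose qName matches elementName. */
    static Map<String, String> attributesOf(String xml, String elementName) {
        Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean done = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (done || !qName.equals(elementName)) {
                    return;
                }
                // walk the attributes by index
                for (int i = 0; i < atts.getLength(); i++) {
                    found.put(atts.getQName(i), atts.getValue(i));
                }
                done = true;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return found;
    }
}
```

When only one attribute is of interest, atts.getValue("x") inside startElement() fetches it directly by name.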
13
SAX characters()
The characters() event handler receives
notification of character data (i.e. text content
that is not part of an XML tag):

    public void characters(char[] ch,    // buffer containing chars
                           int start,    // start position in buffer
                           int length)   // num of chars to read

  • May be called multiple times within each block of
    character data; for example, once per line.
  • So, you may want to use calls to characters() to
    accumulate characters in a StringBuffer, and stop
    accumulating at the next call to startElement().
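
The accumulate-then-flush idea can be sketched like this. TextCollector is a made-up name, and flushing on endElement() as well as startElement() is a choice made here so that text just before a closing tag is not lost; a StringBuilder stands in for StringBuffer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers each block of character data into one trimmed string.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> blocks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);   // may be one of several chunks
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    /** Stores the accumulated text, if any, and resets the buffer. */
    private void flush() {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            blocks.add(text);
        }
        buffer.setLength(0);
    }

    /** Returns the non-blank text blocks of the given XML, in document order. */
    static List<String> textBlocks(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.blocks;
    }
}
```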

14
SAXExample Input XML
<?xml version="1.0" encoding="UTF-8"?>
<dots>
  this is before the first dot
  and it continues on multiple lines
  <dot x="9" y="81" />
  <dot x="11" y="121" />
  <flip>
    flip is on
    <dot x="196" y="14" />
    <dot x="169" y="13" />
  </flip>
  flip is off
  <dot x="12" y="144" />
  <extra>stuff</extra>
  <!-- a final comment -->
</dots>
15
SAXExample Code
Please see SAXExample.java
16
SAXExample Input → Output
Input: the XML from slide 14 (SAXExample Input XML).
Output:

    startDocument
    startElement dots (0 attributes)
    characters this is before the first dot and it continues on multiple lines
    startElement dot (2 attributes)
    endElement dot
    startElement dot (2 attributes)
    endElement dot
    startElement flip (0 attributes)
    characters flip is on
    startElement dot (2 attributes)
    endElement dot
    startElement dot (2 attributes)
    endElement dot
    endElement flip
    characters flip is off
    startElement dot (2 attributes)
    endElement dot
    startElement extra (0 attributes)
    characters stuff
    endElement extra
    endElement dots
    endDocument
    Finished parsing input. Got the following dots:
    (9, 81), (11, 121), (14, 196), (13, 169), (12, 144)
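
SAXExample.java itself is not reproduced in this transcript. The final line of the output suggests that dots inside <flip> have their x and y swapped, and a handler producing that list could be sketched as below. Everything here (the class DotCollector, the flipDepth counter, the SAMPLE constant, which is a shortened copy of the input) is a guess for illustration, not the original example code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects dots, swapping coordinates while inside <flip>.
class DotCollector extends DefaultHandler {
    static final String SAMPLE =
            "<dots>\n"
          + "  this is before the first dot\n"
          + "  <dot x=\"9\" y=\"81\" />\n"
          + "  <dot x=\"11\" y=\"121\" />\n"
          + "  <flip>\n"
          + "    flip is on\n"
          + "    <dot x=\"196\" y=\"14\" />\n"
          + "    <dot x=\"169\" y=\"13\" />\n"
          + "  </flip>\n"
          + "  flip is off\n"
          + "  <dot x=\"12\" y=\"144\" />\n"
          + "  <extra>stuff</extra>\n"
          + "</dots>";

    private final List<String> dots = new ArrayList<>();
    private int flipDepth = 0;   // > 0 while inside a <flip> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("flip")) {
            flipDepth++;
        } else if (qName.equals("dot")) {
            String x = atts.getValue("x");
            String y = atts.getValue("y");
            dots.add(flipDepth > 0 ? "(" + y + ", " + x + ")" : "(" + x + ", " + y + ")");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("flip")) {
            flipDepth--;
        }
    }

    /** Parses the XML and returns the collected dots as a comma-separated string. */
    static String collectDots(String xml) {
        DotCollector handler = new DotCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return String.join(", ", handler.dots);
    }
}
```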
17
DOM Architecture
18
DOM Document Structure
XML input: the XML from slide 14 (SAXExample Input XML).
Document structure:

    Document
    ---Element <dots>
       ---Text "this is before the first dot and it continues on multiple lines"
       ---Element <dot>
       ---Text ""
       ---Element <dot>
       ---Text ""
       ---Element <flip>
          ---Text "flip is on"
          ---Element <dot>
          ---Text ""
          ---Element <dot>
          ---Text ""
       ---Text "flip is off"
       ---Element <dot>
       ---Text ""
       ---Element <extra>
          ---Text "stuff"
       ---Text ""
       ---Comment "a final comment"
       ---Text ""

  • There's a text node between every pair of element
    nodes here, even where the text is only whitespace
    (shown as "").
  • XML comments appear in special comment nodes.
  • Element attributes do not appear in the tree; they
    are available through the Element object.
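
A small tree dump makes this structure visible for any input. The sketch below (class name DomTreeDump is an assumption) walks the Document recursively and prints one line per node, trimming text the way the diagram does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomTreeDump {
    /** Parses the XML and returns one indented line per DOM node. */
    static List<String> dump(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> lines = new ArrayList<>();
            walk(doc, "", lines);
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Depth-first walk over a node and all of its children. */
    private static void walk(Node node, String indent, List<String> lines) {
        lines.add(indent + describe(node));
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, indent + "  ", lines);
        }
    }

    /** One-line description; text and comment values are trimmed. */
    private static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "Document";
            case Node.ELEMENT_NODE:  return "Element <" + node.getNodeName() + ">";
            case Node.TEXT_NODE:     return "Text \"" + node.getNodeValue().trim() + "\"";
            case Node.COMMENT_NODE:  return "Comment \"" + node.getNodeValue().trim() + "\"";
            default:                 return "Other " + node.getNodeName();
        }
    }
}
```

Note that attributes never show up in the walk, which matches the last bullet above.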

19
Using DOM
Here's the basic recipe for getting started with
DOM:

    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    // get a DocumentBuilder object
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = null;
    try {
        db = dbf.newDocumentBuilder();
    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    }

    // invoke parser to get a Document
    // (pick whichever overload matches your input)
    Document doc = db.parse(inputStream);
    Document doc = db.parse(file);
    Document doc = db.parse(url);
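
The same recipe as a compilable unit, parsing XML held in a String through the InputStream overload. DomRecipe is a name invented for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DomRecipe {
    /** Parses an in-memory XML document into a DOM Document. */
    static Document parse(String xml) {
        try {
            // get a DocumentBuilder object
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();

            // invoke parser to get a Document
            InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
            return db.parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<dots><dot x=\"9\" y=\"81\"/></dots>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```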

20
DOM Document access idioms
OK, say we have a Document. How do we get at the
pieces of it? Here are some common idioms:

    // get the root of the Document tree
    Element root = doc.getDocumentElement();

    // get nodes in subtree by tag name
    NodeList dots = root.getElementsByTagName("dot");

    // get first dot element
    Element firstDot = (Element) dots.item(0);

    // get x attribute of first dot
    String x = firstDot.getAttribute("x");
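
Put together, the idioms above can pull every dot's coordinates out of a document. This is a sketch with an invented class name (DomAccess); note that getElementsByTagName() searches the whole subtree, so dots nested inside other elements are included too.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomAccess {
    /** Returns "x,y" for every dot element anywhere below the root. */
    static List<String> dotCoordinates(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // get the root of the Document tree
            Element root = doc.getDocumentElement();

            // get nodes in subtree by tag name
            NodeList dots = root.getElementsByTagName("dot");

            List<String> result = new ArrayList<>();
            for (int i = 0; i < dots.getLength(); i++) {
                Element dot = (Element) dots.item(i);
                result.add(dot.getAttribute("x") + "," + dot.getAttribute("y"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```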
21
More Document accessors
Node access methods:

    String   getNodeName()
    short    getNodeType()       // e.g. DOCUMENT_NODE, ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, etc.
    Document getOwnerDocument()
    boolean  hasChildNodes()
    NodeList getChildNodes()
    Node     getFirstChild()
    Node     getLastChild()
    Node     getParentNode()
    Node     getNextSibling()
    Node     getPreviousSibling()
    boolean  hasAttributes()
    ... and more ...

Element extends Node and adds these access methods:

    String   getTagName()
    boolean  hasAttribute(String name)
    String   getAttribute(String name)
    NodeList getElementsByTagName(String name)
    ... and more ...

Document extends Node and adds these access methods:

    Element      getDocumentElement()
    DocumentType getDoctype()
    ... plus getElementsByTagName(), as in Element ...
    ... and more ...
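
One practical use of these accessors is pruning the whitespace-only text nodes that a parsed Document usually contains. The sketch below is illustrative only (WhitespaceStripper is a made-up name); it uses getFirstChild(), getNextSibling() and getNodeType() for the traversal and removeChild() for the pruning.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WhitespaceStripper {
    /** Convenience parser so the example is self-contained. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Removes whitespace-only text nodes below node; returns how many were removed. */
    static int strip(Node node) {
        int removed = 0;
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();   // remember before possibly detaching child
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
                removed++;
            } else if (child.hasChildNodes()) {
                removed += strip(child);
            }
            child = next;
        }
        return removed;
    }
}
```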
22
Creating & manipulating Documents
The DOM API also includes lots of methods for
creating and manipulating Document objects:

    // get new empty Document from DocumentBuilder
    Document doc = db.newDocument();

    // create a new <dots> Element and add to Document as root
    Element root = doc.createElement("dots");
    doc.appendChild(root);

    // create a new <dot> Element and add as child of root
    Element dot = doc.createElement("dot");
    dot.setAttribute("x", "9");
    dot.setAttribute("y", "81");
    root.appendChild(dot);
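
Generalising the snippet above to a list of points gives a small builder. As before this is a sketch: DotsBuilder and its int[][] input format are assumptions made for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DotsBuilder {
    /** Builds a dots document with one dot child per {x, y} pair. */
    static Document build(int[][] points) {
        Document doc;
        try {
            // get new empty Document from DocumentBuilder
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM parser available", e);
        }

        // create the root element and attach it to the Document
        Element root = doc.createElement("dots");
        doc.appendChild(root);

        // create one dot element per point and add it as a child of root
        for (int[] p : points) {
            Element dot = doc.createElement("dot");
            dot.setAttribute("x", Integer.toString(p[0]));
            dot.setAttribute("y", Integer.toString(p[1]));
            root.appendChild(dot);
        }
        return doc;
    }
}
```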
23
More Document manipulators
Node manipulation methods:

    void setNodeValue(String nodeValue)
    Node appendChild(Node newChild)
    Node insertBefore(Node newChild, Node refChild)
    Node removeChild(Node oldChild)
    ... and more ...

Element manipulation methods:

    void setAttribute(String name, String value)
    void removeAttribute(String name)
    ... and more ...

Document manipulation methods:

    Text    createTextNode(String data)
    Comment createComment(String data)
    ... and more ...
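
A brief sketch exercising several of these manipulators on a freshly built document. The class name DomManipulation and the id attributes are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class DomManipulation {
    /** Builds a small tree, rearranges it, and returns a description of the root's children. */
    static List<String> demo() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("no DOM parser available", e);
        }
        Element root = doc.createElement("dots");
        doc.appendChild(root);

        Element second = doc.createElement("dot");
        second.setAttribute("id", "second");
        root.appendChild(second);

        // insertBefore() places the new node ahead of an existing child
        Element first = doc.createElement("dot");
        first.setAttribute("id", "first");
        root.insertBefore(first, second);

        // comments and text are created by the Document, then attached like any node
        root.appendChild(doc.createComment("a final comment"));
        Node stray = root.appendChild(doc.createTextNode("stray"));

        // removeChild() detaches a node again
        root.removeChild(stray);

        List<String> names = new ArrayList<>();
        for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                names.add(c.getNodeName() + ":" + ((Element) c).getAttribute("id"));
            } else {
                names.add(c.getNodeName());
            }
        }
        return names;
    }
}
```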
24
Writing a Document as XML
  • Strangely, since JAXP 1.1, there is no simple,
    documented way to write out a Document object as
    XML.
  • Instead, you can exploit an undocumented trick:
    cast the Document to a Crimson XmlDocument, which
    knows how to write itself out.
  • There is a supported way to write Documents as
    XML via the XSLT library, but it is far more
    clumsy than this two-line trick.
  • Of course, one could just walk the Document tree
    and write XML using printlns.
  • JDOM remedies this with easy XML output!

    import org.apache.crimson.tree.XmlDocument;

    XmlDocument x = (XmlDocument) doc;
    x.write(out, "UTF-8");
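
For comparison, the supported route through the XSLT library (javax.xml.transform) looks roughly like the sketch below: an identity Transformer copies a DOMSource into a StreamResult. The class DomWriter and its sample() document are assumptions for the example, and the exact formatting of the output can differ between implementations.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomWriter {
    /** Serializes a Document to a String using an identity transform. */
    static String toXml(Document doc) {
        try {
            // a Transformer created without a stylesheet just copies its input
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize Document", e);
        }
    }

    /** Hypothetical sample document: a dots root with a single dot child. */
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("dots");
            doc.appendChild(root);
            Element dot = doc.createElement("dot");
            dot.setAttribute("x", "9");
            dot.setAttribute("y", "81");
            root.appendChild(dot);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("no DOM parser available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sample()));
    }
}
```

Unlike the Crimson cast, this does not depend on which parser implementation produced the Document.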
25
DOMExample Code
Please see DOMExample.java
26
JDOM Overview
  • DOM can be awkward for Java programmers
      • Language-neutral → does not use Java features
      • Example: getChildNodes() returns a NodeList,
        which is not a List. (NodeList.iterator() is not
        defined.)
  • JDOM looks like a good alternative
      • open source project, Apache license, late beta
      • builds on top of JAXP, integrates with SAX and
        DOM
      • similar to DOM model, but no shared code
      • API designed to be easy & obvious for Java
        programmers
      • exploits power of Java language: collections,
        method overloading
      • rumored to become integrated in future JDKs
      • XML output is easy!
  • Key packages: org.jdom, org.jdom.transform,
    org.jdom.input, org.jdom.output.
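
The NodeList awkwardness can be worked around in plain DOM with a thin adapter, sketched below. NodeListAdapter is a hypothetical helper, not part of DOM or JDOM; it wraps a NodeList in a read-only java.util.List so that for-each loops work.

```java
import java.io.StringReader;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeListAdapter {
    /** Read-only List view over a NodeList. */
    static List<Node> asList(final NodeList nodes) {
        return new AbstractList<Node>() {
            @Override
            public Node get(int index) {
                Node node = nodes.item(index);   // item() returns null when out of range
                if (node == null) {
                    throw new IndexOutOfBoundsException("index " + index);
                }
                return node;
            }

            @Override
            public int size() {
                return nodes.getLength();
            }
        };
    }

    /** Names of the root element's children, gathered with a for-each loop. */
    static List<String> childNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            for (Node child : asList(doc.getDocumentElement().getChildNodes())) {
                names.add(child.getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```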

27
DOM vs. JDOM
The DOM way:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.newDocument();
    Element root = doc.createElement("root");
    Text text = doc.createTextNode("This is the root");
    root.appendChild(text);
    doc.appendChild(root);

The JDOM way:

    Document doc = new Document();
    Element e = new Element("root");
    e.setText("This is the root");
    doc.addContent(e);
28
Pointers
  • Everything in this tutorial (slides, example
    code, example data) will be archived at
    http://nlp.stanford.edu/local/ for your future
    reference.
  • There's a good JAXP/SAX/DOM tutorial at
    http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/
  • You can learn more about JDOM at
    http://www.jdom.org/docs/faq.html