Title: SAX
1SAX
2the SAX concept
- stream-based processing
- application code applied to each chunk of XML as it is parsed
- incremental processing
- discard irrelevant information immediately
- data-structure flexibility
- flexible architecture for interoperability with other components
- flexible internal data structure design
3SAX packages
- org.xml.sax
- core SAX interfaces and exception classes
- two concrete classes
- SAX1 and SAX2 support
- core to all SAX distributions
- org.xml.sax.helpers
- utility implementations of core interfaces
- org.xml.sax.ext
- SAX extension handlers (extra events)
4SAX2 distributions
- Aelfred2
- SAX2 version of original Aelfred parser
- part of GNU JAXP Library
- Crimson
- reference implementation for JAXP
- distributed with JDK 1.4
- part of Apache Project
- Xerces v2
- also part of Apache project
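Which SAX2 implementation an application actually receives depends on the runtime and classpath. The following is a minimal sketch, assuming the parser is located through JAXP; the class name `WhichParser` is illustrative and not part of any distribution.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class WhichParser {

    /** Returns the class name of the XMLReader that JAXP selected. */
    static String readerClassName() throws ParserConfigurationException, SAXException {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        return reader.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        // The printed name varies with the JDK and the libraries on the classpath.
        System.out.println(readerClassName());
    }
}
```

The output is deliberately not shown, since it differs between installations.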
5Using SAX
[Diagram: producer (parser) and consumer (ContentHandler)]
- initiate a parser (producer)
- stream events to a ContentHandler (consumer)
- may also implement
- ErrorHandler
- DTDHandler
- EntityResolver
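The producer/consumer wiring listed above can be sketched as follows. This is an illustration only: it assumes the parser bundled with the JDK, and the names `ProducerConsumerSketch`, `Consumer` and `run` are made up for the example. `DefaultHandler` (from org.xml.sax.helpers) provides empty implementations of ContentHandler, ErrorHandler, DTDHandler and EntityResolver, so only the callbacks of interest are overridden.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ProducerConsumerSketch {

    /** Consumer: receives content events and error reports from the parser. */
    static class Consumer extends DefaultHandler {
        final List<String> log = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            log.add("element " + qName);
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            log.add("fatal error at line " + e.getLineNumber());
            throw e; // stop parsing: the document is not well-formed
        }
    }

    static List<String> run(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        // Producer: the XMLReader streams events to whatever handlers are registered.
        XMLReader producer = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Consumer consumer = new Consumer();
        producer.setContentHandler(consumer);
        producer.setErrorHandler(consumer);
        try {
            producer.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The ErrorHandler callback has already recorded the problem.
        }
        return consumer.log;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<a><b/></a>"));
        System.out.println(run("<a><b></a>")); // mismatched tags trigger fatalError
    }
}
```

Registering one object for several handler roles is a convenience here; separate classes per role work equally well.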
6SAX Architecture
[Diagram: XML Source -> SAX parser (parse) -> event stream -> ContentHandler Interface, implemented by the Class to Handle Content]
7Event Stream
<?xml version="1.0" encoding="UTF-8"?>
<characters series="fatherted">
  <character>
    Ted Crilly
  </character>
  <character job="curate">
    Dougal
    <Inline>Rollerblading</Inline>
    McGuire
  </character>
</characters>
8ContentHandler Interface
- Methods for handling events generated by parsing the XML include:
- startElement(String uri, String localName, String qName, Attributes atts)
- Fires whenever the start of an element is found
- characters(char[] ch, int start, int length)
- Fires when character data is found
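The two callbacks above can be exercised with a small handler. This is a hedged sketch: `ContentHandlerSketch`, `TraceHandler` and `trace` are names invented for the example. It shows two details of the API: attributes are read by index through the `Attributes` argument, and `characters` must use only the slice of `ch` given by `start` and `length`. A parser may also deliver one run of text in several `characters` calls, so the handler appends rather than assigns.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ContentHandlerSketch {

    static class TraceHandler extends DefaultHandler {
        final StringBuilder trace = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            trace.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                trace.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
            }
            trace.append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only ch[start .. start+length-1] belongs to this event.
            trace.append(ch, start, length);
        }
    }

    static String trace(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        TraceHandler handler = new TraceHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.trace.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(trace(
                "<character job=\"curate\">Dougal <Inline>Rollerblading</Inline> McGuire</character>"));
        // prints: <character job=curate>Dougal <Inline>Rollerblading McGuire
    }
}
```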
9SAX Example
- Write a SAX program to count features in an XML document
- Input
- XML document
- Output
- Number of elements
- Number of attributes
- Number of Processing Instructions
- Number of characters in text data
10SAX example
set up the parser
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;
import java.io.IOException;

public class DocumentStatistics {
    public static void main(String[] args) {
        XMLReader parser;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            parser = factory.newSAXParser().getXMLReader();
        }
11SAX example
handle exceptions
        catch (FactoryConfigurationError e) {
            System.err.println(e.getMessage());
            return;
        } catch (ParserConfigurationException e) {
            System.err.println(e.getMessage());
            return;
        } catch (SAXException e) {
            System.err.println("no SAX parser");
            return;
        }
12SAX example
parse and pass to content handler
        parser.setContentHandler(new XMLCounter());
        for (int i = 0; i < args.length; i++) {
            try {
                parser.parse(args[i]);
            } catch (SAXParseException e) { // not well-formed
                System.out.println(e.getMessage());
            } catch (SAXException e) { // some other error
                System.out.println(e.getMessage());
            } catch (IOException e) { // error at lower level
                System.out.println(e.getMessage());
            }
        }
    }
}
13SAX example
define the content handler
import org.xml.sax.*;

public class XMLCounter implements ContentHandler {
    private int numberOfElements;
    private int numberOfAttributes;
    private int numberOfProcessingInstructions;
    private int numberOfCharacters;

    // reset all counters at the start of each document
    public void startDocument() throws SAXException {
        numberOfElements = 0;
        numberOfAttributes = 0;
        numberOfProcessingInstructions = 0;
        numberOfCharacters = 0;
    }
14SAX example
define the action when an element is detected
    // this method counts the elements and their attributes
    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        numberOfElements++;
        numberOfAttributes += atts.getLength();
    }
15SAX example
define the action when characters are detected
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        numberOfCharacters += length;
    }
16SAX example
define the action when ignorableWhitespace is detected
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        numberOfCharacters += length;
    }
17SAX example
define the action when a processingInstruction is detected
    public void processingInstruction(String target, String data)
            throws SAXException {
        numberOfProcessingInstructions++;
    }
18SAX example
define the action when the end of the document is detected
    public void endDocument() throws SAXException {
        System.out.println("Number of elements " + numberOfElements);
        System.out.println("Number of attributes " + numberOfAttributes);
        System.out.println("Number of characters " + numberOfCharacters);
        System.out.println("Number of processing instructions "
                + numberOfProcessingInstructions);
    }
19SAX example
implement the rest of the ContentHandler interface
    // the remaining callbacks are required by the interface but do nothing here
    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {}
    public void endPrefixMapping(String prefix) throws SAXException {}
    public void setDocumentLocator(Locator locator) {}
    public void skippedEntity(String name) throws SAXException {}
    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {}
}
20XML Programming Models
21different approaches
- XML as text
- inspection, creation, modification using text editors (TextPad, Emacs, Notepad, etc.)
- search and replace using regular expressions (Perl, Java, etc.)
- text processors can be used in conjunction with other tools such as XSLT
- NB: text processors must support Unicode
22different approaches
- XML as a stream of events (e.g. SAX)
- events-based parsers produce an event stream
- use a finite state machine model to process these events
- can be tricky to program
- used for one-pass processing
- do not need to build entire document structure
- fast and efficient
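The finite state machine model mentioned above can be sketched as a handler that keeps an explicit state and changes it on start and end events. This is one possible arrangement, not a prescribed pattern: the class `NameStateMachine`, its `State` values, and the `people` wrapper element in the usage are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class NameStateMachine extends DefaultHandler {

    enum State { OUTSIDE, IN_NAME, IN_GIVEN, IN_FAMILY }

    private State state = State.OUTSIDE;
    private final StringBuilder given = new StringBuilder();
    private final StringBuilder family = new StringBuilder();
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (state == State.OUTSIDE && qName.equals("name")) {
            given.setLength(0);
            family.setLength(0);
            state = State.IN_NAME;
        } else if (state == State.IN_NAME && qName.equals("given")) {
            state = State.IN_GIVEN;
        } else if (state == State.IN_NAME && qName.equals("family")) {
            state = State.IN_FAMILY;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is kept only while the machine is inside <given> or <family>.
        if (state == State.IN_GIVEN) {
            given.append(ch, start, length);
        } else if (state == State.IN_FAMILY) {
            family.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ((state == State.IN_GIVEN && qName.equals("given"))
                || (state == State.IN_FAMILY && qName.equals("family"))) {
            state = State.IN_NAME;
        } else if (state == State.IN_NAME && qName.equals("name")) {
            names.add(given + " " + family);
            state = State.OUTSIDE;
        }
    }

    static List<String> extract(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        NameStateMachine handler = new NameStateMachine();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<people>"
                + "<name><given>Sandy</given><family>Brownlee</family></name>"
                + "<name><given>Ted</given><family>Crilly</family></name>"
                + "</people>"));
        // prints: [Sandy Brownlee, Ted Crilly]
    }
}
```

The bookkeeping grows quickly as the document structure gets deeper, which is the "tricky to program" cost noted in the list.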
23event stream from XML
<name><given>Sandy</given><family>Brownlee</family></name>
might result in the following event stream:
startElement: name
startElement: given
content: Sandy
endElement: given
startElement: family
content: Brownlee
endElement: family
endElement: name
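An event stream like the one above can be reproduced with a handler that records one line per event. This is a minimal sketch; `EventStreamLogger` and the `startElement:` / `content:` / `endElement:` labels are choices made for the example. Text is buffered and flushed at element boundaries because a parser is allowed to split one run of text across several `characters` calls.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EventStreamLogger extends DefaultHandler {

    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    /** Emits buffered text as a single content event, if there is any. */
    private void flushText() {
        if (text.length() > 0) {
            events.add("content: " + text);
            text.setLength(0);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        events.add("startElement: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("endElement: " + qName);
    }

    static List<String> log(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        EventStreamLogger handler = new EventStreamLogger();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<name><given>Sandy</given><family>Brownlee</family></name>";
        for (String event : log(xml)) {
            System.out.println(event);
        }
    }
}
```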
24stream-based processing
- SAX parsers are small and fast because they only process relevant data
- smaller code is more robust and more secure
- scales well to multiple process calls, e.g. on a web server
- fits well with streaming data across a network
25data structure flexibility
- SAX can parse XML data into suitable structures for other components
- e.g. EDI format, messaging formats
- specialised, strongly-typed data structures are essential for many purposes (DOM too generic)
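To illustrate parsing directly into a strongly-typed structure, here is a hedged sketch that turns a made-up order document into a list of typed line items. Everything specific in it is hypothetical: the `order` / `item` vocabulary, the `sku`, `qty` and `price` attributes, and the `OrderSketch` and `LineItem` classes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class OrderSketch {

    /** Hypothetical application type: one order line with typed fields. */
    static final class LineItem {
        final String sku;
        final int quantity;
        final BigDecimal unitPrice;

        LineItem(String sku, int quantity, BigDecimal unitPrice) {
            this.sku = sku;
            this.quantity = quantity;
            this.unitPrice = unitPrice;
        }

        BigDecimal total() {
            return unitPrice.multiply(BigDecimal.valueOf(quantity));
        }
    }

    static List<LineItem> parseOrder(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        final List<LineItem> items = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Convert attribute strings to typed values as soon as they arrive.
                if (qName.equals("item")) {
                    items.add(new LineItem(
                            atts.getValue("sku"),
                            Integer.parseInt(atts.getValue("qty")),
                            new BigDecimal(atts.getValue("price"))));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return items;
    }

    public static void main(String[] args) throws Exception {
        BigDecimal sum = BigDecimal.ZERO;
        for (LineItem item : parseOrder("<order>"
                + "<item sku='A1' qty='2' price='9.99'/>"
                + "<item sku='B7' qty='1' price='5.00'/>"
                + "</order>")) {
            sum = sum.add(item.total());
        }
        System.out.println(sum); // prints: 24.98
    }
}
```

No generic tree is built at any point; the only objects that survive parsing are the ones the application asked for.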
26SAX drawbacks
- no random access to XML data
- forward-only pass in document order
- cannot refer to downstream data
- like referring to objects in client-side JavaScript
- re-scanning acceptable for small files or cached data
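The re-scanning workaround can be sketched as two passes over the same input: the first collects the data that later references need, the second resolves the references, including those that appear before their target. The `doc` / `ref` / `item` vocabulary and the class `TwoPassSketch` are hypothetical, chosen only to show a forward reference.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TwoPassSketch {

    private static void parse(String xml, DefaultHandler handler)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }

    /** Pass 1: remember the label of every item, keyed by its id. */
    static Map<String, String> collectLabels(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        final Map<String, String> labels = new HashMap<>();
        parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("item")) {
                    labels.put(atts.getValue("id"), atts.getValue("label"));
                }
            }
        });
        return labels;
    }

    /** Pass 2: resolve each ref, even when it precedes the item it points to. */
    static List<String> resolveRefs(String xml, final Map<String, String> labels)
            throws ParserConfigurationException, SAXException, IOException {
        final List<String> resolved = new ArrayList<>();
        parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("ref")) {
                    resolved.add(labels.get(atts.getValue("to")));
                }
            }
        });
        return resolved;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><ref to='b'/>"
                + "<item id='a' label='first'/><item id='b' label='second'/>"
                + "<ref to='a'/></doc>";
        System.out.println(resolveRefs(xml, collectLabels(xml)));
        // prints: [second, first]
    }
}
```

The input is read twice, which is why this approach suits small files or data that is already cached.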
27different approaches
- XML as a tree
- well-formed XML has a natural tree structure
- tree programming models provide an API to manipulate the tree
- XPath, DOM, Infoset, PSVI
- manipulable model of the entire document
- easier to program, costly on memory
- navigation can be cumbersome
28 XSLT approach
- XSLT used for format conversions
- May be used in conjunction with SAX and DOM for pre- and post-processing
- XSLT processors available as stand-alone or embeddable components
- cumbersome for complex processing
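As an illustration of an embedded XSLT processor doing a format conversion, the sketch below uses the standard `javax.xml.transform` API. The stylesheet and the class name `XsltSketch` are written for this example; they convert a `name` element into plain text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    // Illustrative stylesheet: emits "family, given" as plain text.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='text'/>"
            + "<xsl:template match='/name'>"
            + "<xsl:value-of select='family'/>, <xsl:value-of select='given'/>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    static String convert(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convert("<name><given>Sandy</given><family>Brownlee</family></name>"));
        // prints: Brownlee, Sandy
    }
}
```

The same API also accepts SAX and DOM sources and results, which is how XSLT can sit before or after other processing stages.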
29DOM approach
- Powerful API for complex data manipulation
- complex data structure processing
- tree-walking
- manipulation of DOM interfaces
- memory-hungry
- large documents create many nodes
- multiple simultaneous access
- tree-structure may not be relevant to processing
- simple processing of data items
- location irrelevant
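For comparison with the event-based code, here is a minimal DOM sketch using the JAXP `DocumentBuilder`. The class name `DomWalkSketch` and its helper methods are illustrative. It shows the two points from the list: a recursive tree walk, and access to nodes in any order once the whole tree is in memory.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DomWalkSketch {

    /** Builds the complete tree in memory before any application code runs. */
    static Document load(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Tree walk: counts the element nodes at or below the given node. */
    static int countElements(Node node) {
        int count = (node.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Document doc = load("<name><given>Sandy</given><family>Brownlee</family></name>");
        // Random access: read <family> before <given>, against document order.
        String family = doc.getElementsByTagName("family").item(0).getTextContent();
        String given = doc.getElementsByTagName("given").item(0).getTextContent();
        System.out.println(family + ", " + given); // prints: Brownlee, Sandy
        System.out.println(countElements(doc));    // prints: 3
    }
}
```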
30SAX vs DOM memory consumption
- typical DOM implementation allocates 10 bytes of memory per byte of XML data to build the DOM tree
- a 3 MB (mid-sized) data file requires 30 MB of memory!
- SAX only puts relevant content into data structures in memory
31XML processing issues
- parser differences
- parsers handle content differently
- omission of comments
- replacement of entity references
- non-validating parsers may not retrieve external DTD
- use of comments
- only for human-readable information
- not for illegitimate content (cf. HTML)
- parsers may well ignore comments
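In SAX, comments are not part of the ContentHandler event stream at all; they are reported only to a `LexicalHandler` from org.xml.sax.ext, and only if one is registered. The sketch below assumes a parser that supports the standard `lexical-handler` property (the parser bundled with the JDK does); the class `CommentSketch` and its `Handler` are illustrative names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class CommentSketch {

    /** DefaultHandler2 adds no-op LexicalHandler methods to DefaultHandler. */
    static class Handler extends DefaultHandler2 {
        final StringBuilder text = new StringBuilder();
        final List<String> comments = new ArrayList<>();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void comment(char[] ch, int start, int length) {
            comments.add(new String(ch, start, length));
        }
    }

    static Handler parse(String xml, boolean reportComments)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Handler handler = new Handler();
        reader.setContentHandler(handler);
        if (reportComments) {
            // Without this property the comment() callback is never invoked.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        }
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a><!-- note -->text</a>";
        System.out.println(parse(xml, false).comments); // prints: []
        System.out.println(parse(xml, true).comments);  // prints: [ note ]
    }
}
```

Either way the text content seen by the application is the same, which is why anything a program depends on should not be placed in comments.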