Title: Introduction to XML and Related Technologies
1Introduction to XML and Related Technologies
- Internet Engineering Course
- University of Tehran
- Sepand Ansari
2History
- The essence of markup languages.
- You may have faced problems where you need to add metadata or tags to a document to describe it.
- Example: plain text vs. rich text documents
- Settings files (e.g., Linux configuration files):
    Section "Device"
        Identifier "ATI Radeon Mobility M6"
        Driver "radeon"
        VendorName "ATI Radeon Mobility M6"
        BoardName "Radeon Mobility M6 LY"
        ChipID 0x4c59
        VideoRam 32768
        BusID "PCI:1:5:0"
        Option "AGPMode" "4"
        Option "noaccel"
    EndSection
3Before standardization
- Several markup languages were developed, each with its own style.
- Problems
- Incompatibility
- No CASE tool could be developed for processing markup files.
- Complexity and deceptions
- Some examples: MS Office files, Linux settings files, business markup data files
4SGML
- Standard Generalized Markup Language (SGML) descends from IBM's GML, developed in the 1960s by Charles Goldfarb, Edward Mosher and Raymond Lorie (whose surname initials also happen to be GML). SGML itself was standardized as ISO 8879 in 1986.
- It is a standard for creating new markup languages.
- But its complexity has prevented widespread application for small-scale, general-purpose use.
- This complexity stems from its generality.
- That much generality is not needed for most uses.
5HTML
- HyperText Markup Language (HTML) is a markup language designed for the creation of web pages and other information viewable in a browser.
- Originally defined by Tim Berners-Lee as a highly simplified application of SGML.
- It is now widely used together with HTTP.
- It is used for the presentation of data (in browsers).
6XML
- SGML is too complex.
- HTML is a simplified application of SGML in which
- Many unused features of SGML are eliminated
- It is well known and widely used.
- So XML was born to
- Do what SGML was originally created to do,
- But be as simple as HTML.
7XML (cont.)
- XML is a metalanguage
- A language used to describe other languages, using markup tags that describe properties of the data
- Designed to be structured
- Strict rules about how data can be formatted
- Designed to be extensible
- Can define own terms and markup
8When is XML used?
- XML aims to accomplish what HTML cannot and to be simpler to use and implement than SGML.
- In XML you can define your own tags.
- And create markup documents based on your tag declarations to describe your data.
- These descriptions are used by an application to extract semantics from your data.
9An example
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <bookstore>
      <book category="CHILDREN">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
      <book category="WEB">
        <title lang="en">XQuery Kick Start</title>
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <author>James Linn</author>
        <price>49.99</price>
      </book>
    </bookstore>
10An odd example
    <?xml version="1.0" encoding="UTF-8"?>
    <Recipe name="bread" prep_time="5 mins" cook_time="3 hours">
      <title>Basic bread</title>
      <ingredient amount="3" unit="cups">Flour</ingredient>
      <ingredient amount="0.25" unit="ounce">Yeast</ingredient>
      <ingredient amount="1.5" unit="cups">Warm Water</ingredient>
      <ingredient amount="1" unit="teaspoon">Salt</ingredient>
      <Instructions>
        <step>Mix all ingredients together</step>
        <step>Leave for one hour in warm room.</step>
        <step>Knead again, and then bake in the oven.</step>
      </Instructions>
    </Recipe>
11XML features
- its simultaneously human- and machine-readable format
- its support for Unicode, allowing almost any information in any human language to be communicated
- its ability to represent the most general computer science data structures (records, lists and trees)
- its self-documenting format that describes structure and field names as well as specific values
- its strict syntax and parsing requirements that allow the necessary parsing algorithms to remain simple, efficient, and consistent
12XML Family
- XML is not a subset of HTML, nor is HTML a subset of XML.
- XML is more general than HTML.
- XML has some constraints (next slide) that HTML doesn't have.
- But if an HTML document satisfies those constraints, we have XHTML.
[Diagram: HTML, XML and SGML]
13HTML vs. XML
[Table: HTML vs. XML comparison]
14HTML vs. XML
[Table: HTML vs. XML comparison]
XHTML documents have all XML properties except these two.
15Working with XML
- First, how to describe tags
- DTD
- XSD
- How to parse XML files
- SAX parsers
- DOM parsers
- XML binding tools
- Transform XML files to (X)HTML or other XML types
- XSLT
- Address a point in an XML file
- XPath
- Query an XML file for specific data
- XQuery
17DTD
- A Document Type Definition (DTD for short) is a set of declarations that conform to a particular markup syntax and that describe a class, or "type", of SGML or XML documents, in terms of constraints on the structure of those documents.
- DTD criticisms
- No support for newer features of XML, most importantly namespaces.
- Lack of expressivity: certain formal aspects of an XML document cannot be captured in a DTD.
- Custom non-XML syntax to describe the schema, inherited from SGML.
18Example of DTD and its sample XML
example.dtd
    <!ELEMENT people_list (person*)>
    <!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT birthdate (#PCDATA)>
    <!ELEMENT gender (#PCDATA)>
    <!ELEMENT socialsecuritynumber (#PCDATA)>
Sample XML
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE people_list SYSTEM "example.dtd">
    <people_list>
      <person>
        <name>Fred Bloggs</name>
        <birthdate>27/11/2008</birthdate>
        <gender>Male</gender>
      </person>
    </people_list>
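A minimal sketch of checking a document against this DTD with the JDK's built-in parser. To keep it self-contained, the declarations are inlined as an internal DTD subset instead of being read from example.dtd, and the content models are written out as `person*` and `socialsecuritynumber?`; the class name `DtdValidationSketch` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // Assumption: the DTD is inlined here rather than loaded from example.dtd.
    static final String DTD =
        "<!DOCTYPE people_list [\n"
      + "<!ELEMENT people_list (person*)>\n"
      + "<!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>\n"
      + "<!ELEMENT name (#PCDATA)>\n"
      + "<!ELEMENT birthdate (#PCDATA)>\n"
      + "<!ELEMENT gender (#PCDATA)>\n"
      + "<!ELEMENT socialsecuritynumber (#PCDATA)>\n"
      + "]>\n";

    /** Returns true if the given document body is valid against the DTD above. */
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity errors are recoverable by default; rethrow them so parsing fails.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<people_list><person><name>Fred Bloggs</name>"
                    + "<birthdate>27/11/2008</birthdate><gender>Male</gender></person></people_list>";
        String bad = "<people_list><person><gender>Male</gender></person></people_list>";
        System.out.println(isValid(good)); // true
        System.out.println(isValid(bad));  // false: the required name element is missing
    }
}
```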
19XSD
- XML Schema Definition (XSD) was published as a W3C Recommendation in May 2001.
- XSD files have the .xsd extension.
- XSD solves problems that DTD has
- Supports namespaces
- XSD is datatype-aware
20Example of XSD and its sample XML
country.xsd
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="country">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="population" type="xs:decimal"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
Sample XML
    <country xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:noNamespaceSchemaLocation="country.xsd">
      <name>France</name>
      <population>59.7</population>
    </country>
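As a rough illustration of what "datatype-aware" buys you, the sketch below validates two country documents with the JDK's `javax.xml.validation` API. It assumes the schema declares the second element as `population` with type `xs:decimal`; the schema is held in a string rather than in country.xsd, and the class name `XsdValidationSketch` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

class XsdValidationSketch {
    // Assumption: the schema is kept in a string instead of the file country.xsd.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='country'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:element name='population' type='xs:decimal'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Returns true if xml is valid against the schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no custom error handler, validate() throws on the first validity error.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<country><name>France</name><population>59.7</population></country>")); // true
        // "lots" is not an xs:decimal, so the datatype check rejects it.
        System.out.println(isValid(
            "<country><name>France</name><population>lots</population></country>")); // false
    }
}
```

A DTD could only say that population contains character data; it could not reject the second document.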
21Well-formed vs. valid XML file
- A well-formed XML file obeys all of XML's syntax rules, such as:
- Proper nesting
- Case sensitivity (opening and closing tag names must match exactly)
- Quoted attribute values
- An XML document that complies with a particular schema, in addition to being well-formed, is said to be valid.
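Well-formedness can be checked without any schema: a non-validating parser simply refuses documents that break the syntax rules. The following is a small sketch using the JDK's SAX parser; the class name `WellFormedSketch` and the tiny test documents are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedSketch {
    /** Returns true if xml parses without a fatal (well-formedness) error. */
    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all events but rethrows fatal errors.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b x=\"1\"/></a>")); // true
        System.out.println(isWellFormed("<a><b></a></b>"));      // false: improper nesting
        System.out.println(isWellFormed("<a></A>"));             // false: tag names differ in case
        System.out.println(isWellFormed("<a x=1/>"));            // false: unquoted attribute value
    }
}
```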
23SAX Parser
- SAX: Simple API for XML
- Parser creates events while traversing the document tree
- Parser calls methods (that you write) to deal with the events.
- Similar to an I/O stream: goes in one direction
24Sample Document
    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>
25SAX Example
    import java.io.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;
    import org.apache.xerces.parsers.SAXParser; // requires Apache Xerces on the classpath

    public class Flour extends DefaultHandler {

        // Called by the parser for every opening tag it encounters.
        public void startElement(String namespaceURI, String localName,
                                 String qName, Attributes atts) {
            if (localName.equals("buy") || localName.equals("sell")) {
                String n = atts.getValue("", "shares");
                System.out.println("number of shares: " + n);
            }
        }
26SAX Example (cont.)
        public static void main(String[] args) {
            Flour f = new Flour();
            SAXParser p = new SAXParser();
            p.setContentHandler(f); // register our callbacks
            try {
                p.parse(args[0]); // path or URI of the XML file
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
27Document as Events
    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>
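To make the "document as events" view concrete, here is a sketch that records every callback the parser makes while reading the transaction document. It uses the JDK's bundled SAX parser rather than Xerces directly; the class name `SaxEventTrace` and the event labels are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventTrace {
    /** Parses xml and returns the SAX events, in order, as readable strings. */
    static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start " + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text run in several chunks; each chunk is logged.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text " + text);
                }
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<transaction><account>89-344</account>"
                   + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
                   + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell></transaction>";
        trace(xml).forEach(System.out::println);
    }
}
```

The events arrive strictly in document order, which is why a SAX application sees "start transaction" first and "end transaction" last and can never go back.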
28Advantages and Disadvantages
- Advantages
- Requires little memory
- Fast
- Disadvantages
- Cannot read backwards
- Does not support transformation of the document, such as cut and paste of fragments
- Difficult to program
29Programming using SAX is Difficult
- In some cases, programming with SAX is difficult
- How can we find, using a SAX parser, an element e1 with ancestor e2?
- How can we find, using a SAX parser, elements e1 that have a descendant element e2?
- What about cases that are even more complex?
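One possible answer to the first question, sketched under the assumption that element names alone identify e1 and e2: because SAX keeps no tree, the handler has to remember the currently open elements itself, here on a stack. The class name `SaxAncestorSketch` and the method `countWithAncestor` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxAncestorSketch {
    /** Counts the elements named e1 that have an ancestor named e2. */
    static int countWithAncestor(String xml, String e1, String e2) {
        Deque<String> open = new ArrayDeque<>(); // names of the elements currently open
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // At this point the stack holds exactly the ancestors of the new element.
                if (qName.equals(e1) && open.contains(e2)) {
                    count[0]++;
                }
                open.push(qName);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                open.pop();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<transaction><account>89-344</account>"
                   + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
                   + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell></transaction>";
        System.out.println(countWithAncestor(xml, "ticker", "buy"));         // 1
        System.out.println(countWithAncestor(xml, "ticker", "transaction")); // 2
    }
}
```

The descendant question is harder because the decision about e1 cannot be made until its end tag is seen, so the handler must also carry per-element state; this bookkeeping is what makes SAX programming difficult.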
30DOM Parser
- DOM: Document Object Model
- Parser creates a tree object out of the document
- User accesses data by traversing the tree
- The API allows for constructing, accessing and manipulating the structure and content of XML documents
31Document as Tree
Methods like getRoot, getChildren, getAttributes, etc.
    transaction
      account: 89-344
      buy (shares = 100)
        ticker (exch = NASDAQ): WEBM
      sell (shares = 30)
        ticker (exch = NYSE): GE
32Node Navigation
- Every node has a specific location in the tree
- The Node interface specifies methods to find surrounding nodes
- Node getFirstChild()
- Node getLastChild()
- Node getNextSibling()
- Node getPreviousSibling()
- Node getParentNode()
- NodeList getChildNodes()
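A short sketch of these navigation methods in use, assuming the transaction document from the SAX slides and the JDK's built-in DOM parser. The class name `DomNavigationSketch` and its helper methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomNavigationSketch {
    /** Parses an XML string into a DOM tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    /** Lists the names of the element children of parent, in document order. */
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) { // skip whitespace text and comments
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<transaction>\n  <account>89-344</account>\n"
            + "  <buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>\n"
            + "  <sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell>\n</transaction>");
        Node root = doc.getDocumentElement();
        System.out.println(childElementNames(root)); // [account, buy, sell]
        Node ticker = doc.getElementsByTagName("ticker").item(0);
        System.out.println(ticker.getParentNode().getNodeName()); // buy
    }
}
```

Note that whitespace between tags shows up as text nodes, so getFirstChild() on the root of an indented document is usually not the first element.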
33Node Manipulation
- Children of a node in a DOM tree can be manipulated: added, edited, deleted, moved, copied, etc.
    Node removeChild(Node oldChild) throws DOMException
    Node insertBefore(Node newChild, Node refChild) throws DOMException
    Node appendChild(Node newChild) throws DOMException
    Node replaceChild(Node newChild, Node oldChild) throws DOMException
    Node cloneNode(boolean deep)
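The sketch below applies three of these methods to the transaction document: it deletes the sell order, deep-copies the buy order, edits the copy, and appends it. The edits themselves are arbitrary choices for illustration, and the class name `DomManipulationSketch` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomManipulationSketch {
    static final String XML =
        "<transaction><account>89-344</account>"
      + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
      + "<sell shares=\"30\"><ticker exch=\"NYSE\">GE</ticker></sell></transaction>";

    /** Parses an XML string into a DOM tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    /** Removes the sell order and appends an edited deep copy of the buy order. */
    static void edit(Document doc) {
        Element root = doc.getDocumentElement();
        Node sell = doc.getElementsByTagName("sell").item(0);
        root.removeChild(sell);                          // delete

        Element buy = (Element) doc.getElementsByTagName("buy").item(0);
        Element copy = (Element) buy.cloneNode(true);    // deep copy, including the ticker child
        copy.setAttribute("shares", "50");               // edit the copy only
        root.appendChild(copy);                          // add as the last child
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        edit(doc);
        System.out.println(doc.getElementsByTagName("sell").getLength()); // 0
        System.out.println(doc.getElementsByTagName("buy").getLength());  // 2
    }
}
```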
34Advantages and Disadvantages
- Advantages
- Natural and relatively easy to use
- Can repeatedly traverse tree
- Disadvantages
- High memory requirements: the whole document is kept in memory
- Must parse the whole document and construct many objects before use
35Which should we use? DOM vs. SAX
- If your document is very large and you only need a few elements, use SAX
- If you need to manipulate (i.e., change) the XML, use DOM
- If you need to access the XML many times, use DOM (assuming the file is not too large)
36XML data binding
- With XML data binding, a Java object is automatically created from the data of an XML document.
- JAXB is Sun Microsystems's specification for XML data binding.
- The Java class can be created manually.
- A mapping file is needed to tell the binding engine how to map XML elements and attributes to class properties.
- Or the Java class can be automatically created by the JAXB compiler.
37Advantages and disadvantages
- Advantages
- JAXB requires a DTD
- Using JAXB ensures the validity of your XML
- A JAXB parser is actually faster than a generic SAX parser
- A tree created by JAXB is smaller than a DOM tree
- It's much easier to use a JAXB tree for application-specific code
- You can modify the tree and save it as XML
- Disadvantages
- JAXB requires a DTD
- Hence, you cannot use JAXB to process generic XML (for example, if you are writing an XML editor or other tool)
- You must do additional work up front to tell JAXB what kind of tree you want it to construct
- But this more than pays for itself by simplifying your application
- JAXB is new: Version 1.0 is due Q4 (fourth quarter) 2002
38JAXB at a glance
39Step 1: Create XML Schema
Demo.xsd
    <xs:element name="Person" type="PersonType"/>
    <xs:complexType name="PersonType">
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="Address" type="AddressType"
                    minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
    <xs:complexType name="AddressType">
      <xs:sequence>
        <xs:element name="Number" type="xs:unsignedInt"/>
        <xs:element name="Street" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
40Step 2: Create XML Document
Demo.xml
    <Person xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:noNamespaceSchemaLocation="C:\JAXB Demo\demo.xsd">
      <Name>Sharon Krisher</Name>
      <Address>
        <Number>57</Number>
        <Street>Iben Gevirol</Street>
      </Address>
      <Address>
        <Number>89</Number>
        <Street>Moshe Sharet</Street>
      </Address>
    </Person>
Check that your XML conforms to the schema: the xs:sequence in AddressType requires Number before Street.
41Step 3: Run the binding compiler
- %JWSDP_HOME%\jaxb\bin\xjc -p demo demo.xsd
- A package named demo is created
- (in the directory demo)
- The package contains (among other things)
- interface AddressType
- interface PersonType
42AddressType and PersonType
    public interface AddressType {
        long getNumber();
        void setNumber(long value);    // must be non-negative
        String getStreet();
        void setStreet(String value);  // must be non-null
    }

    public interface PersonType {
        String getName();
        void setName(String value);    // must be non-null
        /* List of AddressType */
        java.util.List getAddress();   // must contain at least one item; in Java 1.5: List<AddressType>
    }
43Step 4: Create Context
- The context is the entry point to the API
- Contains methods to create Marshaller, Unmarshaller and Validator instances
    JAXBContext context = JAXBContext.newInstance("demo");
The package name is demo (recall: xjc -p demo demo.xsd)
44Step 5: Unmarshal (xml -> objects)
Enable validation of xml according to the schema while unmarshalling
    Unmarshaller unmarshaller = context.createUnmarshaller();
    unmarshaller.setValidating(true);
    PersonType person = (PersonType) unmarshaller.unmarshal(
        new FileInputStream("demo.xml"));
45Step 6: Read
    System.out.println("Person name=" + person.getName());
    AddressType address = (AddressType) person.getAddress().get(0);
    System.out.println("First Address: "
        + " Street=" + address.getStreet()
        + " Number=" + address.getNumber());
46Step 7: Manipulate objects
    // Update
    person.setName("Yoav Zibin");
    // Delete
    List addressList = person.getAddress();
    addressList.clear();
47Step 8: Validate on-demand
    Validator validator = context.createValidator();
    // Check that we have set Street and Number, and that Number is non-negative
    validator.validate(newAddr);
    // Check that we have set Name, and that Address contains at least one item
    validator.validate(person);
48Step 9: Marshal (objects -> xml)
    Marshaller marshaller = context.createMarshaller();
    marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
    marshaller.marshal(person, new FileOutputStream("output.xml"));
output.xml
    <Person>
      <Name>Yoav Zibin</Name>
      <Address>
        <Number>5</Number>
        <Street>Hanoter</Street>
      </Address>
    </Person>
49JAXB compiler is smart enough
- The DTD:
    <!ELEMENT book (title, author, chapter*)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT chapter (#PCDATA)>
- The schema:
    <xml-java-binding-schema>
      <element name="book" type="class" root="true"/>
    </xml-java-binding-schema>
- The results:
    public Book(); // constructor
    public String getTitle();
    public void setTitle(String x);
    public String getAuthor();
    public void setAuthor(String x);
    public List getChapter();
    public void deleteChapter();
    public void emptyChapter();
Note 1: In these slides we only show the class outline, but JAXB creates a complete class for you
Note 2: JAXB constructs names based on yours, with good capitalization style
50Some Implementations of JAXB specification
- The JAXB specification is still in beta (version 0.8).
- JaxMe by the Apache Foundation is an implementation of the JAXB specification
- There are some other packages that are not JAXB implementations but do the same task
- XMLBeans by the Apache Foundation
- Castor XML
52XSLT
- XSLT stands for Extensible Stylesheet Language Transformations
- XSLT is used to transform XML documents into other kinds of documents, usually, but not necessarily, XHTML
- XSLT uses two input files
- The XML document containing the actual data
- The XSL document containing both the framework in which to insert the data, and XSLT commands to do so
53An example
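As one possible illustration of the two inputs, the sketch below turns a cut-down bookstore document into an HTML-style list of titles using the JDK's `javax.xml.transform` API. The stylesheet is an assumption written for this example, both inputs are held in strings instead of files, and the class name `XsltSketch` is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {
    // The XSL document: a <ul> framework plus XSLT commands that pull in each title.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/bookstore'>"
      + "<ul><xsl:for-each select='book'>"
      + "<li><xsl:value-of select='title'/></li>"
      + "</xsl:for-each></ul>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet above to xml and returns the result as a string. */
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The XML document: the actual data.
        String xml = "<bookstore>"
                   + "<book category='CHILDREN'><title lang='en'>Harry Potter</title></book>"
                   + "<book category='WEB'><title lang='en'>XQuery Kick Start</title></book>"
                   + "</bookstore>";
        System.out.println(transform(xml));
    }
}
```

The select attributes in the stylesheet are XPath expressions, which is why XPath is usually learned alongside XSLT.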
54Cocoon framework
- Apache Cocoon is an open source web-based publishing framework written in Java.
- It transforms XML documents to XML, WML or PDF using XSL files.
- Cocoon can be integrated with Tomcat.
- It is used when
- The source of data is in XML format
- You want to completely separate data from presentation
56XPath
- XPath (XML Path Language) is a terse (non-XML) syntax for addressing portions of an XML document.
- XPath further defines a library of standard functions for working with strings, numbers and Boolean expressions, as well as supporting a number of utility operators.
- A typical XPath expression is a location path consisting of a string of element or attribute qualifiers separated by forward slashes ("/").
57Some Examples
- The root element: /*
- All elements everywhere (implementations of this expression can be very slow): //*
- All top-level elements (children of the root element): /*/*
- The fifth child element named "FOOB": FOOB[5]
- The element FOOB whose BAZ attribute is "untrue": FOOB[@BAZ = "untrue"]
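Expressions like these can be tried out with the JDK's `javax.xml.xpath` API. The sketch below is only an illustration: the sample document (with made-up ROOT and OTHER elements around two FOOB elements) and the class name `XPathSketch` are assumptions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    // Hypothetical sample document for the expressions above.
    static final String XML =
        "<ROOT><FOOB BAZ=\"true\">a</FOOB><FOOB BAZ=\"untrue\">b</FOOB><OTHER/></ROOT>";

    /** Returns the number of nodes that expr selects in XML. */
    static int count(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expr, e);
        }
    }

    /** Returns the string value of the first node that expr selects in XML. */
    static String text(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("/*"));                           // 1: the root element
        System.out.println(count("//*"));                          // 4: every element
        System.out.println(count("/*/*"));                         // 3: children of the root
        System.out.println(text("/ROOT/FOOB[2]"));                 // b
        System.out.println(text("/ROOT/FOOB[@BAZ = 'untrue']"));   // b
    }
}
```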
58XQuery
- XQuery is a query and functional programming language standardized by the W3C (XQuery 1.0 became a Recommendation in January 2007) and designed to query collections of XML data.
- XQuery provides a mechanism to extract and manipulate data from XML documents or any data source that can be viewed as XML, such as relational databases or office documents.
- It is semantically similar to SQL.
- XQuery uses XPath syntax to address specific parts of an XML document.