Title: XML Basics
1XML Basics
2What is XML?
- Extensible Markup Language
- A syntax for documents
- A Meta-Markup Language
- A structural and semantic language, not a formatting language
- Not just for Web pages
3XML is a Meta Markup Language
- Not like HTML, troff, LaTeX
- Make up the tags you need as you need them
- The tags you create can be documented in a Document Type Definition (DTD)
- A meta syntax for domain-specific markup languages like MusicML, MathML, and CML
4XML describes structure and semantics, not formatting
- XML documents form a tree
- Element and attribute names reflect the kind of element
- Formatting can be added with a style sheet
5A Song Description in HTML
- <dt>Hot Cop
- <dd> by Jacques Morali, Henri Belolo, and Victor Willis
- <ul>
- <li>Producer: Jacques Morali
- <li>Publisher: PolyGram Records
- <li>Length: 6:20
- <li>Written: 1978
- <li>Artist: Village People
- </ul>
6A Song Description in XML
- <SONG>
- <TITLE>Hot Cop</TITLE>
- <COMPOSER>Jacques Morali</COMPOSER>
- <COMPOSER>Henri Belolo</COMPOSER>
- <COMPOSER>Victor Willis</COMPOSER>
- <PRODUCER>Jacques Morali</PRODUCER>
- <PUBLISHER>PolyGram Records</PUBLISHER>
- <LENGTH>6:20</LENGTH>
- <YEAR>1978</YEAR>
- <ARTIST>Village People</ARTIST>
- </SONG>
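Because each piece of the song is labeled by what it is rather than how it looks, a program can pull out individual fields directly. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the song markup is written out with its angle brackets, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The song description from the slide, with its markup written out in full.
SONG_XML = """<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>"""

song = ET.fromstring(SONG_XML)

# Element names say what each value means, so lookups are by name.
title = song.findtext("TITLE")
composers = [c.text for c in song.findall("COMPOSER")]
year = int(song.findtext("YEAR"))

print(title, composers, year)
```

The same lookup against the HTML version would have to guess which `<li>` holds the year; here the element name answers that question.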
7Style Sheets provide formatting
- SONG { display: block }
- TITLE { display: block; font-family: Helvetica, serif; font-size: 20pt; font-weight: bold }
- COMPOSER { display: block; font-family: Times, Times New Roman, serif; font-size: 14pt; font-style: italic }
- ARTIST { display: block; font-family: Times, Times New Roman, serif; font-size: 14pt; font-weight: bold; font-style: italic }
- PUBLISHER { display: block; font-size: 14pt; font-family: Times, Times New Roman, serif }
- LENGTH { display: block; font-family: Times, Times New Roman, serif; font-size: 14pt }
- YEAR { display: block; font-family: Times, Times New Roman, serif; font-size: 14pt }
8Attaching style sheets to documents
- Processing Instruction
- <?xml-stylesheet type="text/css" href="song.css"?>
- Converter Program
9What is XML used for?
- Domain-Specific Markup Languages
- Self-Describing Data
- Interchange of Data Among Applications
- Structured and Integrated Data
10Domain-Specific Markup Languages
- Non proprietary format
- Don't pay for what you don't use
11Self-Describing Data
- Much data is lost due to format problems
- XML is very simple
- XML is self-describing
- XML is well documented
12
- <PERSON ID="p1100" SEX="M">
- <NAME>
- <GIVEN>Judson</GIVEN>
- <SURNAME>McDaniel</SURNAME>
- </NAME>
- <BIRTH>
- <DATE>21 Feb 1834</DATE>
- </BIRTH>
- <DEATH>
- <DATE>9 Dec 1905</DATE>
- </DEATH>
- </PERSON>
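The PERSON record above can be read without any external format documentation, since attribute and element names carry the meaning. A small sketch with Python's standard library, assuming the record is written out with its full markup (variable names are illustrative):

```python
import xml.etree.ElementTree as ET

PERSON_XML = """<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH><DATE>21 Feb 1834</DATE></BIRTH>
  <DEATH><DATE>9 Dec 1905</DATE></DEATH>
</PERSON>"""

person = ET.fromstring(PERSON_XML)

# Attributes are read with get(); nested elements with a simple path.
person_id = person.get("ID")
sex = person.get("SEX")
full_name = f'{person.findtext("NAME/GIVEN")} {person.findtext("NAME/SURNAME")}'
born = person.findtext("BIRTH/DATE")
died = person.findtext("DEATH/DATE")

print(person_id, sex, full_name, born, died)
```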
13Interchange of Data Among Applications
14Structured and Integrated Data
- Can specify relationships between elements
- Can assemble data from multiple sources
15XML Applications
- A specific markup language that uses the XML meta-syntax is called an XML application
- Different XML applications have their own more constricted syntaxes and vocabularies within the broader XML syntax
- Further syntax can be layered on top of this, e.g. data typing through DCDs or other schemas
16Example XML Applications
- Web Pages
- Mathematical Equations
- Music Notation
- Vector Graphics
- Metadata
- and more
17Mathematical Markup Language
18Channel Definition Format
<?xml version="1.0"?>
<CHANNEL HREF="http://metalab.unc.edu/xml/index.html">
<TITLE>Cafe con Leche</TITLE>
<ITEM HREF="http://metalab.unc.edu/xml/books.html">
<TITLE>Books about XML</TITLE>
</ITEM>
<ITEM HREF="http://metalab.unc.edu/xml/tradeshows.html">
<TITLE>Trade shows and conferences about XML</TITLE>
</ITEM>
<ITEM HREF="http://metalab.unc.edu/xml/lists.htm">
<TITLE>Mailing Lists dedicated to XML</TITLE>
</ITEM>
</CHANNEL>
19Classic Literature
- The Complete Plays of Shakespeare
- The Bible
- The Koran
- The Book of Mormon
20Vector Graphics
- Vector Markup Language (VML)
- Internet Explorer 5.0
- Microsoft Office 2000
- Scalable Vector Graphics (SVG)
21The Resource Description Framework (RDF)
- Meta-data
- Dublin Core
- Better Web searching
22An Example of RDF
- <rdf:RDF
- xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
- xmlns:dc="http://purl.org/DC/">
- <rdf:Description about="http://metalab.unc.edu/xml/">
- <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
- <dc:TITLE>Cafe con Leche</dc:TITLE>
- </rdf:Description>
- </rdf:RDF>
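The RDF example mixes two vocabularies, distinguished by the `rdf` and `dc` namespace prefixes. A hedged sketch of reading it with Python's `xml.etree.ElementTree`: the namespace URIs below are taken from the example as reconstructed (the trailing `#` on the RDF URI is an assumption), and the `NS` mapping is a local convenience, not part of RDF.

```python
import xml.etree.ElementTree as ET

RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/DC/">
  <rdf:Description about="http://metalab.unc.edu/xml/">
    <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
    <dc:TITLE>Cafe con Leche</dc:TITLE>
  </rdf:Description>
</rdf:RDF>"""

# Prefix-to-URI mapping used only for lookups in this script.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/DC/",
}

root = ET.fromstring(RDF_XML)
desc = root.find("rdf:Description", NS)

about = desc.get("about")                           # the resource being described
creator = desc.findtext("dc:CREATOR", namespaces=NS)
title = desc.findtext("dc:TITLE", namespaces=NS)

print(about, creator, title)
```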
23XML for XML
- XSL: The Extensible Stylesheet Language
- DCD: The Document Content Description Schema Language
- XLL: The Extensible Linking Language
24XSL: The Extensible Stylesheet Language
- XSL Transformations
- XSL Formatting Objects
25DCD: The Document Content Description Schema Language
- Data Typing in XML is Weak
- <MONTH>9</MONTH>
- <DCD>
- <ElementDef Type="MONTH"
- Model="Data" Datatype="i1"
- Min="1" Max="12" />
- </DCD>
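Without a schema, the content of MONTH is just text, so any type or range check falls to the application. The sketch below hand-codes the constraint the DCD expresses (an integer from 1 to 12); `month_is_valid` is a hypothetical helper, not part of DCD or any library.

```python
import xml.etree.ElementTree as ET

def month_is_valid(xml_text, low=1, high=12):
    """Return True if the element's text is an integer in [low, high]."""
    text = ET.fromstring(xml_text).text or ""
    try:
        value = int(text.strip())
    except ValueError:
        return False  # not a number at all, e.g. "September"
    return low <= value <= high

print(month_is_valid("<MONTH>9</MONTH>"))
print(month_is_valid("<MONTH>13</MONTH>"))
print(month_is_valid("<MONTH>September</MONTH>"))
```

A schema language moves this check out of every program that reads the data and into a single shared declaration.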
26XLL: The Extensible Linking Language
- Any element can be a link
- Links can be bi-directional
- Links can be separated from the documents they connect
<footnote xlink:form="simple" href="footnote7.xml">7</footnote>
27File Formats, In-house applications, and other behind the scenes uses
- Microsoft Office 2000
- Federal Express Web API
- Netscape What's Related
28Hello XML
<?xml version="1.0" standalone="yes"?>
<FOO>
Hello XML!
</FOO>
- Plain ASCII or UTF-8 text
- .xml is standard file extension
- Any standard text editor will work
29The XML Declaration
<?xml version="1.0" standalone="yes"?>
- version attribute
- required
- always has the value 1.0
- standalone attribute
- yes
- no
- encoding attribute
- UTF-8
- 8859_1
- etc.
30The FOO element
<FOO> Hello XML! </FOO>
- Start tag: <FOO>
- Contents: "Hello XML!"
- End tag: </FOO>
31greeting.xml
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>
32Style sheets
- Separate from the XML document
- Different Languages
- Cascading Style Sheets Level 1 (CSS1)
- Internet Explorer 5.0
- Mozilla 5.0
- Cascading Style Sheets Level 2 (CSS2)
- Internet Explorer 5 (partial)
- Mozilla 5.0 (partial)
- Extensible Style Language (XSL)
- Internet Explorer 5.0 (older draft, buggy)
- LotusXSL, XT, Other non-browser converters
- Document Style and Semantics Language (DSSSL)
- Jade
33xml-stylesheet
- Style sheets are attached via an xml-stylesheet processing instruction in the prolog
- <?xml version="1.0" standalone="yes"?>
- <?xml-stylesheet type="text/css" href="greeting.css"?>
- <GREETING>Hello XML!</GREETING>
- type attribute has the value text/css or text/xsl
- href attribute is a URL to the stylesheet, possibly relative
- Can also use non-browser converters like XT, LotusXSL, and Jade
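A program can find the attached style sheet by looking for the xml-stylesheet processing instruction among the document's top-level nodes. A minimal sketch with Python's `xml.dom.minidom`; the regular expression used to split the `type` and `href` pseudo-attributes is a simplification that assumes double-quoted values.

```python
import re
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0" standalone="yes"?>'
    '<?xml-stylesheet type="text/css" href="greeting.css"?>'
    '<GREETING>Hello XML!</GREETING>'
)

# The prolog's processing instruction is a child of the document node,
# a sibling of the root element rather than part of it.
pi = next(
    node for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
)

target = pi.target
# type="..." and href="..." are pseudo-attributes inside the PI's data string.
pseudo_attrs = dict(re.findall(r'(\w+)="([^"]*)"', pi.data))

print(target, pseudo_attrs)
```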
34greeting.css
- GREETING { display: block; font-size: 24pt; font-weight: bold }
35A larger example: Baseball statistics
- Examine the data
- Design a vocabulary for the data
- Write a style sheet
36Sample statistics
- http://cbs.sportsline.com/u/baseball/mlb/stats.htm
37Organizing the Data
- XML documents are trees.
- XML elements contain other elements as well as text
- Within these limits there's more than one way to organize the data
- Hierarchically
- Relationally
- Objects
38What is the Root Element?
- The League?
- The Season?
- A custom Document element?
39The Root Element
- Choose SEASON for the root element
- Everything else will be a descendant of SEASON
- This is not the only possible choice
<?xml version="1.0"?>
<SEASON>
</SEASON>
40What are the Immediate Children of The root?
- Leagues?
- Teams?
- Players?
- Games?
41Child Elements
- <?xml version="1.0"?>
- <SEASON>
- <YEAR> 1998 </YEAR>
- </SEASON>
42White space in XML is not especially significant
- <?xml version="1.0"?>
- <SEASON><YEAR>1998</YEAR></SEASON>
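In practice the parser hands white space through unchanged, and it is the application that decides to ignore it. The sketch below compares a spread-out SEASON document with the compact one using Python's standard library; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

spaced = ET.fromstring("<SEASON>\n  <YEAR>\n    1998\n  </YEAR>\n</SEASON>")
compact = ET.fromstring("<SEASON><YEAR>1998</YEAR></SEASON>")

raw_spaced = spaced.findtext("YEAR")    # includes the line breaks and indentation
raw_compact = compact.findtext("YEAR")  # just the digits

# The raw strings differ, but they agree once surrounding white space is trimmed.
same_after_strip = raw_spaced.strip() == raw_compact.strip()

print(repr(raw_spaced), repr(raw_compact), same_after_strip)
```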
43Leagues
- Major league baseball is divided into two leagues
- Each league has
- a name
- three divisions
44Divisions
- Each division has
- name
- 4-6 teams
45Teams
- Each team has
- Name
- City
- Players
46Player Data
- Each player has
- First name
- Last name
- Position
- Statistics
47Player Batting Statistics
- SB: Stolen Bases
- CS: Caught Stealing
- SH: Sacrifice Hits
- SF: Sacrifice Flies
- Err: Errors
- PB: Pitcher Balked
- BB: Base on Balls (Walks)
- SO: Strike Outs
- HBP: Hit By Pitch
- G: Games Played
- GS: Games Started
- AB: At Bats
- R: Runs
- H: Hits
- 2B: Doubles
- 3B: Triples
- HR: Home Runs
- RBI: Runs Batted In
48What does a player look like?
- Long names vs. short names
49The Complete 1998 Major League
50A Style Sheet
- 1998shortstats.xml
- baseballstats.css
- <?xml-stylesheet type="text/css" href="baseballstats.css"?>
- styled1998shortstats.xml
51Cascading Style Sheets
- Partially supported by Mozilla and IE 5.0
- Full W3C Recommendation
52The Default Rule
- Not every element needs a rule
- The root element should be at least display: block
- SEASON { font-size: 14pt; background-color: white; color: black; display: block }
53A style rule for the YEAR element
- Make it look like a title
- YEAR { display: block; font-size: 32pt; font-weight: bold; text-align: center }
54Style Rules for Division and League Names
- LEAGUE_NAME { display: block; text-align: center; font-size: 28pt; font-weight: bold }
- DIVISION_NAME { display: block; text-align: center; font-size: 24pt; font-weight: bold }
55Alternate Style Rules for Division and League Names
- LEAGUE_NAME, DIVISION_NAME { display: block; text-align: center; font-weight: bold }
- LEAGUE_NAME { font-size: 28pt }
- DIVISION_NAME { font-size: 24pt }
56Style Rules for Teams
- Team name and Team city must be one title
- Must be inline elements
- Previous and following must be block elements
- TEAM_CITY { font-size: 20pt; font-weight: bold; font-style: italic }
- TEAM_NAME { font-size: 20pt; font-weight: bold; font-style: italic }
- TEAM, PLAYER { display: block }
57Style Rules for Players
TEAM { display: table }
TEAM_CITY { display: table-caption }
TEAM_NAME { display: table-caption }
PLAYER { display: table-row }
SURNAME, GIVEN_NAME, POSITION, GAMES, GAMES_STARTED, AT_BATS, RUNS, HITS,
DOUBLES, TRIPLES, HOME_RUNS, RBI, STEALS, CAUGHT_STEALING, SACRIFICE_HITS,
SACRIFICE_FLIES, ERRORS, WALKS, STRUCK_OUT, HIT_BY_PITCH { display: table-cell }
58Finished Style Sheet
- SEASON { font-size: 14pt; background-color: white; color: black; display: block }
- YEAR { display: block; font-size: 32pt; font-weight: bold; text-align: center }
- LEAGUE_NAME { display: block; text-align: center; font-size: 28pt; font-weight: bold }
- DIVISION_NAME { display: block; text-align: center; font-size: 24pt; font-weight: bold }
- TEAM_CITY { font-size: 20pt; font-weight: bold; font-style: italic }
- TEAM_NAME { font-size: 20pt; font-weight: bold; font-style: italic }
- TEAM { display: block }
- PLAYER { display: block }
59Possible Extensions
- There should be captions like "RBI" or "At Bats".
- Derived numbers like batting averages are not included.
- The titles are short, e.g. "1998" instead of "1998 Major League Baseball".
- The document is so long it's hard to read. Something similar to IE5's collapsible outline view would be nice.
- Pitcher stats should be separated from batter stats.
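Derived numbers do not have to be stored in the document, because a program can compute them from the raw statistics. The sketch below computes a batting average (hits divided by at bats) from a PLAYER element; the player and the numbers are made up for illustration, and `batting_average` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up player and numbers, for illustration only.
PLAYER_XML = """<PLAYER>
  <GIVEN_NAME>Sample</GIVEN_NAME>
  <SURNAME>Player</SURNAME>
  <AT_BATS>400</AT_BATS>
  <HITS>100</HITS>
</PLAYER>"""

def batting_average(player):
    """Hits divided by at bats, or None if the player has no at bats."""
    at_bats = int(player.findtext("AT_BATS", default="0"))
    hits = int(player.findtext("HITS", default="0"))
    return hits / at_bats if at_bats else None

average = batting_average(ET.fromstring(PLAYER_XML))
print(average)
```

Returning `None` for players without at bats (pitchers in particular may have none recorded) avoids a division by zero.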
60Possible Solutions
- CSS Level 2
- XSL
- XSL + JavaScript
61Well-formedness Rules
- Open and close all tags
- Empty tags end with />
- There is a unique root element
- Elements may not overlap
- Attribute values are quoted
- < and & are only used to start tags and entities
- Only the five predefined entity references are used
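Any conforming XML parser rejects a document that breaks these rules, so a quick well-formedness check is simply an attempt to parse. A minimal sketch with Python's standard library; `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<GREETING>Hello XML!</GREETING>"))  # follows every rule
print(is_well_formed("<BR>"))                             # tag never closed
print(is_well_formed("<B><I>text</B></I>"))               # overlapping elements
print(is_well_formed("<A HREF=index.html></A>"))          # unquoted attribute value
print(is_well_formed("<A/><B/>"))                         # two root elements
```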
62Open and close all tags
63Empty tags end with />
- <BR/>, <HR/>, and <IMG/> instead of <BR>, <HR>, and <IMG>
- Web browsers deal inconsistently with these
- Can use <BR></BR> <HR></HR> <IMG></IMG> instead
64There is a unique root element
- One element completely contains all other elements of the document
- This is HTML in HTML files
- XML Declaration is not an element
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>
65Elements may not overlap
- If an element contains a start tag for an element, it must also contain the corresponding end tag
- Empty elements may appear anywhere
- Every non-root element has a parent element
66Attribute values are quoted
- Good
- <A HREF="http://metalab.unc.edu/xml/">
- Bad
- <A HREF=http://metalab.unc.edu/xml/>
67< and & are only used to start tags and entities
- Good: <H1>O'Reilly &amp; Associates</H1>
- Bad: <H1>O'Reilly & Associates</H1>
- Good
- <CODE>for (int i = 0; i &lt; args.length; i++)</CODE>
- Bad
- <CODE>for (int i = 0; i < args.length; i++)</CODE>
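When text is generated by a program, the safest way to follow this rule is to escape it before placing it inside an element. The sketch below uses `escape` from Python's standard `xml.sax.saxutils` module; the enclosing DOC element is a hypothetical wrapper added so the two fragments form one document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() rewrites &, < and > as entity references.
heading = escape("O'Reilly & Associates")
code = escape("for (int i = 0; i < args.length; i++)")

# DOC is a hypothetical wrapper element used only for this demonstration.
doc = f"<DOC><H1>{heading}</H1><CODE>{code}</CODE></DOC>"
parsed = ET.fromstring(doc)

print(heading)
print(code)
print(parsed.findtext("H1"))  # the parser turns &amp; back into &
```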
68Only the five predefined entity references are used
- Bad
- &copy;
- &reg;
- &tm;
- &alpha;
- &eacute;
- &nbsp;
- etc.
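The five predefined references are `&amp;`, `&lt;`, `&gt;`, `&quot;`, and `&apos;`. A parser with no DTD to consult rejects HTML-style names such as `&copy;`, while a numeric character reference for the same character is always allowed. A short sketch of that behavior with Python's standard library (the helper name `parsed_text` is illustrative):

```python
import xml.etree.ElementTree as ET

def parsed_text(text):
    """Return the root element's text, or None if the document is rejected."""
    try:
        return ET.fromstring(text).text
    except ET.ParseError:
        return None

predefined = parsed_text("<P>&amp; &lt; &gt; &quot; &apos;</P>")
html_style = parsed_text("<P>&copy; 1999</P>")   # undefined entity: rejected
numeric = parsed_text("<P>&#169; 1999</P>")      # character reference: accepted

print(predefined, html_style, numeric)
```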
69DTDs and Validity
- A Document Type Definition describes the elements and attributes that may appear in a document
- Validation compares a particular document against a DTD
- Well-formedness is a prerequisite for validity
70What is a DTD?
- a list of the elements, tags, attributes, and entities contained in a document, and their relationship to each other
- internal vs. external DTDs
71The importance of validation
- Ensures that data is correct before feeding it into a program
- Ensures that a format is followed
- Establishes what must be supported
- Not all documents need to be valid; sometimes well-formed is enough
72A DTD for greeting.xml
- greeting.xml
- <?xml version="1.0"?>
- <GREETING>
- Hello XML!
- </GREETING>
- greeting.dtd
- <!ELEMENT GREETING (#PCDATA)>
73Document Type Declarations
- <?xml version="1.0"?>
- <!DOCTYPE GREETING SYSTEM "greeting.dtd">
- <GREETING>
- Hello XML!
- </GREETING>
- specifies the root element
- gives a URL for the DTD
74Invalid Documents
- Valid
- <GREETING>
- various random text but no markup
- </GREETING>
- Invalid: anything else, including
- <GREETING>
- <sometag>various random text</sometag>
- <someEmptyTag/>
- </GREETING>
- or
- <GREETING>
- <GREETING>various random text</GREETING>
- </GREETING>
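Python's standard library does not include a validating parser, so the sketch below approximates by hand the single rule that the greeting DTD expresses: a GREETING root that holds text and no child elements. It is a stand-in for real DTD validation, not an implementation of it, and `greeting_is_valid` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def greeting_is_valid(text):
    """Hand-coded check standing in for <!ELEMENT GREETING (#PCDATA)>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False  # not well-formed, so it cannot be valid
    # (#PCDATA) permits text only: the root must be GREETING with no children.
    return root.tag == "GREETING" and len(root) == 0

print(greeting_is_valid("<GREETING>various random text but no markup</GREETING>"))
print(greeting_is_valid("<GREETING><sometag>various random text</sometag><someEmptyTag/></GREETING>"))
print(greeting_is_valid("<GREETING><GREETING>various random text</GREETING></GREETING>"))
```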
75Validating Tools
- Command line programs like XJParse
- Online validators
- http://www.stg.brown.edu/service/xmlvalid/
- http://www.cogsci.ed.ac.uk/%7Erichard/xml-check.html
- Browsers
76Element Declarations
- Each tag must be declared in a <!ELEMENT> declaration.
- A <!ELEMENT> declaration gives the name and content model of the element
- The content model uses a simple regular expression-like grammar to precisely specify what is and isn't allowed in an element
77Content Specifications
- ANY
- #PCDATA
- Sequences
- Choices
- Mixed Content
- Modifiers
- Empty
78ANY
- <!ELEMENT SEASON ANY>
- A SEASON can contain any child element and/or raw text (parsed character data)
79#PCDATA
- <!ELEMENT YEAR (#PCDATA)>
- Parsed Character Data, i.e. raw text, no markup
80#PCDATA
- Invalid
- <YEAR>
- <MONTH>January</MONTH>
- <MONTH>February</MONTH>
- <MONTH>March</MONTH>
- <MONTH>April</MONTH>
- <MONTH>May</MONTH>
- <MONTH>June</MONTH>
- <MONTH>July</MONTH>
- <MONTH>August</MONTH>
- <MONTH>September</MONTH>
- <MONTH>October</MONTH>
- <MONTH>November</MONTH>
- <MONTH>December</MONTH>
- </YEAR>
- Valid
- <YEAR>1999</YEAR>
- <YEAR>99</YEAR>
- <YEAR>1999 C.E.</YEAR>
- <YEAR>
- The year of our Lord one thousand, nine hundred, and ninety-nine
- </YEAR>
81Child Elements
- To declare that a LEAGUE element must have a LEAGUE_NAME child:
- <!ELEMENT LEAGUE (LEAGUE_NAME)>
- <!ELEMENT LEAGUE_NAME (#PCDATA)>
82Sequences
- Separate multiple required child elements with commas, e.g.
- <!ELEMENT SEASON (YEAR, LEAGUE, LEAGUE)>
- <!ELEMENT LEAGUE (LEAGUE_NAME, DIVISION, DIVISION, DIVISION)>
83One or More Children +
- <!ELEMENT DIVISION_NAME (#PCDATA)>
- <!ELEMENT DIVISION (DIVISION_NAME, TEAM+)>
84Zero or More Children *
- <!ELEMENT TEAM (TEAM_CITY, TEAM_NAME, PLAYER*)>
- <!ELEMENT TEAM_CITY (#PCDATA)>
- <!ELEMENT TEAM_NAME (#PCDATA)>
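Content models behave much like regular expressions over the sequence of child element names. The sketch below makes that analogy concrete for a TEAM, assuming its content model is (TEAM_CITY, TEAM_NAME, PLAYER*), that is, a city, a name, then zero or more players. The regular expression is a hand translation for illustration, not something a DTD processor produces, and the sample team data is invented.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (TEAM_CITY, TEAM_NAME, PLAYER*); each child name is
# followed by a comma so that names cannot run together.
TEAM_MODEL = re.compile(r"TEAM_CITY,TEAM_NAME,(PLAYER,)*")

def matches_team_model(team):
    """Check the order and count of a TEAM element's children."""
    child_names = "".join(child.tag + "," for child in team)
    return TEAM_MODEL.fullmatch(child_names) is not None

# Invented sample data.
no_players = ET.fromstring(
    "<TEAM><TEAM_CITY>Example City</TEAM_CITY><TEAM_NAME>Examples</TEAM_NAME></TEAM>"
)
two_players = ET.fromstring(
    "<TEAM><TEAM_CITY>Example City</TEAM_CITY><TEAM_NAME>Examples</TEAM_NAME>"
    "<PLAYER/><PLAYER/></TEAM>"
)
wrong_order = ET.fromstring(
    "<TEAM><TEAM_NAME>Examples</TEAM_NAME><TEAM_CITY>Example City</TEAM_CITY></TEAM>"
)

print(matches_team_model(no_players))
print(matches_team_model(two_players))
print(matches_team_model(wrong_order))
```

Replacing `*` with `+` in the pattern would require at least one player, and `?` would allow at most one, mirroring the DTD modifiers.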
85Zero or One Children ?
- <!ELEMENT PLAYER (GIVEN_NAME, SURNAME, POSITION,
GAMES, GAMES_STARTED, AT_BATS?, RUNS?, HITS?,
DOUBLES?, TRIPLES?, HOME_RUNS?, RBI?, STEALS?,
CAUGHT_STEALING?, SACRIFICE_HITS?,
SACRIFICE_FLIES?, ERRORS?, WALKS?, STRUCK_OUT?,
HIT_BY_PITCH?, WINS?, LOSSES?, SAVES?,
COMPLETE_GAMES?, SHUT_OUTS?, ERA?, INNINGS?,
EARNED_RUNS?, HIT_BATTER?, WILD_PITCHES?,
BALK?, WALKED_BATTER?, STRUCK_OUT_BATTER?)>
86Finished DTD
87Choices
- <!ELEMENT PAYMENT (CASH | CREDIT_CARD)>
- <!ELEMENT PAYMENT (CASH | CREDIT_CARD | CHECK)>
88Grouping With Parentheses
- Parentheses combine several elements into a single element.
- A parenthesized element can be nested inside other parentheses in place of a single element.
- The parenthesized element can be suffixed with a plus sign, an asterisk, or a question mark.
- <!ELEMENT dl (dt, dd)*>
- <!ELEMENT ARTICLE (TITLE, (P | PHOTO | GRAPH | SIDEBAR | PULLQUOTE | SUBHEAD)*, BYLINE?)>
89Mixed Content
- Both #PCDATA and child elements in a choice
- <!ELEMENT TEAM (#PCDATA | TEAM_CITY | TEAM_NAME | PLAYER)*>
- #PCDATA must come first
- #PCDATA cannot be used in a sequence
90Empty elements
- <!ELEMENT BR EMPTY>
- <!ELEMENT IMG EMPTY>
- <!ELEMENT HR EMPTY>
91Internal DTDs
- <?xml version="1.0"?>
- <!DOCTYPE GREETING [
- <!ELEMENT GREETING (#PCDATA)>
- ]>
- <GREETING>
- Hello XML!
- </GREETING>
92Internal DTD Subsets
- <?xml version="1.0"?>
- <!DOCTYPE GREETING SYSTEM "greeting.dtd" [
- <!ELEMENT GREETING (#PCDATA)>
- ]>
- <GREETING>
- Hello XML!
- </GREETING>
- Internal declarations override external declarations
93Programming with XML
- Java works best
- C, Perl, Python etc. can also be used
- Unicode support is the biggest issue
94SAX, the Simple API for XML
- Event based
- Programs can plug in different parsers
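As a rough illustration of the event-based style, the sketch below uses Python's standard `xml.sax` module: the parser calls back into a handler for each start tag and each run of character data, and no tree is built. `TagCounter` is a hypothetical handler written for this example.

```python
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Counts start tags and collects character data as events arrive."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.text = []

    def startElement(self, name, attrs):
        # Called once for every start tag the parser encounters.
        self.counts[name] = self.counts.get(name, 0) + 1

    def characters(self, content):
        # Called for runs of text between tags (possibly in several pieces).
        self.text.append(content)

handler = TagCounter()
xml.sax.parseString(
    b"<SONG><COMPOSER>Jacques Morali</COMPOSER>"
    b"<COMPOSER>Henri Belolo</COMPOSER></SONG>",
    handler,
)

print(handler.counts)
print("".join(handler.text))
```

The handler depends only on the SAX interface, which is what lets a program swap one parser for another without changing its own code.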
95The Document Object Model (DOM)
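In contrast to SAX events, the DOM loads the whole document into a tree of nodes that a program can walk and modify. A minimal sketch with Python's standard `xml.dom.minidom`; the empty LEAGUE element appended here is only for demonstration.

```python
from xml.dom import minidom

dom = minidom.parseString("<SEASON><YEAR>1998</YEAR></SEASON>")
root = dom.documentElement

# Navigate the tree: the YEAR element's first child is a text node.
year = root.getElementsByTagName("YEAR")[0]
year_text = year.firstChild.data

# Modify the tree in memory, then serialize it back to markup.
league = dom.createElement("LEAGUE")
root.appendChild(league)
serialized = root.toxml()

print(root.tagName, year_text)
print(serialized)
```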
96To Learn More: Books
- XML: Extensible Markup Language
- IDG Books 1998
- ISBN 0-76453-199-9
- The XML Bible
- IDG Books 1999
- ISBN 0-76453-236-7
97Questions?