Title: XML and Information Processing
1. XML and Information Processing
T.B. Rajashekar
National Centre for Science Information
Indian Institute of Science
Bangalore 560 012
(raja_at_ncsi.iisc.ernet.in)
2. XML and Information Processing
- Objective of the presentation
- Current web information architecture and HTML limitations
- Solution: XML (Extensible Markup Language)
- SGML, XML and HTML
- XML components and examples
- XML DTD
- XML applications and vocabularies
- XML and bibliographic information
- Related specifications and standards
- XML resources
3. Objective of the Presentation
- To explain a few key concepts related to XML
- To indicate the relevance of XML for information processing
- To point to related resources
4. Current Web Information Architecture
5. Limitations of HTML
- Let's look at two typical HTML examples: a portion of a full-text paper and a journal TOC (springer toc, libaut.htm)
- Mixes data with presentation
- HTML carries only the layout of the document, not its semantic structure; identifying and extracting structural elements is very difficult, which leads to poor-quality searches
- Has a restricted set of tags, mostly to do with presentation, which cannot be extended by authors
- Data exchange between applications/services is not possible
- Weak support for validation (now overcome with XHTML)
6. XML: Extensible Markup Language
- XML is specifically designed for the web and as a data interchange format
- XML is a simplified subset of Standard Generalized Markup Language (SGML)
- Overcomes limitations of HTML, with several additional advantages
- Published as a W3C Recommendation in February 1998
- The specification provides a set of grammar and syntax rules for semantically describing the structure of data
- Mark the structure of an object (document, database entry, etc.) and use this for sharing and processing
7. XML: Extensible Markup Language
- "ASCII of the Web": OS independent; create once and use in different ways
- Quality searching
  - Can search for field-based content, not just any content
  - e.g. computer (name/model) with price < 700
- Enables data interchange and sharing between applications
- Limitation of XML: it does not convey the meaning of structural elements; applications have to achieve this themselves, also through common XML vocabularies
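The field-based query mentioned above (computers priced below 700) can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration: the catalogue and its element names (`catalogue`, `computer`, `name`, `model`, `price`) are hypothetical and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue; element names and values are illustrative only.
CATALOGUE = """
<catalogue>
  <computer><name>Alpha</name><model>A1</model><price>650</price></computer>
  <computer><name>Beta</name><model>B2</model><price>899</price></computer>
  <computer><name>Gamma</name><model>G3</model><price>499.50</price></computer>
</catalogue>
"""

def computers_cheaper_than(xml_text, limit):
    """Return (name, model) pairs for computers whose price is below limit."""
    root = ET.fromstring(xml_text)
    hits = []
    for computer in root.findall("computer"):
        # XML stores all content as text, so the price is converted explicitly.
        price = float(computer.findtext("price"))
        if price < limit:
            hits.append((computer.findtext("name"), computer.findtext("model")))
    return hits

cheap = computers_cheaper_than(CATALOGUE, 700)
print(cheap)
```

Because the price is marked up as its own element, the query can target that field directly instead of matching the number anywhere in the page text, which is all a plain HTML search could do.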
8. XML Design Goals
- XML shall be usable over the Internet
- XML shall support a variety of applications
- XML shall be compatible with SGML
- It shall be easy to write programs that process XML documents
- Optional features in XML shall be kept to the absolute minimum, ideally zero (compatibility)
- XML documents should be human-legible and reasonably clear
- Design of XML should be prepared quickly
- Design of XML shall be formal and concise
- XML documents shall be easy to create
9. SGML, HTML and XML
- XML is a compromise between the non-extensible, limited capabilities of HTML and the full power and complexity of SGML
- Claimed to be better than SGML
  - 80% of the capabilities, 20% of the complexity
  - Small enough to be supported by browser vendors
  - Explicitly includes hyperlinking
  - Supports style sheets
- Better than HTML
  - Extensible (can create your own tags)
  - Tags identify content, e.g. item and its price
11. XML Document Structure
Three Components
- Prolog
  - XML declaration
  - Document type declaration (optional)
  - Processing instructions and comments
- Document body
  - One or more elements, in the form of a hierarchical tree, some of which may contain data
  - Processing instructions and comments
- Epilog (optional)
  - Processing instructions and comments
12. XML Document Structure
- XML documents should start with the XML declaration: <?xml version="1.0"?>
- The DTD (Document Type Definition), part of the document type declaration in the prolog, provides the vocabulary for the XML documents and defines the document hierarchy, elements, tags and syntactic rules for the document structure
- The DTD can accompany XML documents or reside outside the documents
- Let's take an example and see the two possibilities, using the IE browser (version 5.0 or above) (adbook.xml)
13. XML Document With External DTD
<?xml version="1.0"?>
<!DOCTYPE addressbook SYSTEM "adbook.dtd">
<addressbook>
  <person id="B.WALLACE" gender="male">
    <name>
      <family>Wallace</family>
      <given>Bob</given>
    </name>
    <email>bwallace_at_megacorp.com</email>
    <link manager="C.TUTTLE"/>
  </person>
  <person id="C.TUTTLE" gender="female">
    <name><family>Tuttle</family><given>Claire</given></name>
    <email>ctuttle_at_megacorp.com</email>
    <link subordinates="B.WALLACE"/>
  </person>
</addressbook>
Slide callouts: XML Declaration, Root element, Body
14. DTD Stored in adbook.dtd File
<!-- DTD for a simple address book -->
<!ELEMENT addressbook (person)+>
<!ELEMENT person (name,email,link?)>
<!ATTLIST person id ID #REQUIRED>
<!ATTLIST person gender (male|female) #IMPLIED>
<!ELEMENT name (family,given)>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY>
<!ATTLIST link manager IDREF #IMPLIED
               subordinates IDREFS #IMPLIED>
15. XML Document With Accompanying DTD
<?xml version="1.0"?>
<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person)+>
  <!ELEMENT person (name,email,link?)>
  <!ATTLIST person id ID #REQUIRED>
  <!ATTLIST person gender (male|female) #IMPLIED>
  <!ELEMENT name (family,given)>
  <!ELEMENT family (#PCDATA)>
  <!ELEMENT given (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT link EMPTY>
  <!ATTLIST link manager IDREF #IMPLIED
                 subordinates IDREFS #IMPLIED>
]>
Slide callouts: XML Declaration, DTD
16. XML Document With Accompanying DTD
<addressbook>
  <person id="B.WALLACE" gender="male">
    <name>
      <family>Wallace</family>
      <given>Bob</given>
    </name>
    <email>bwallace_at_megacorp.com</email>
    <link manager="C.TUTTLE"/>
  </person>
  <person id="C.TUTTLE" gender="female">
    <name><family>Tuttle</family><given>Claire</given></name>
    <email>ctuttle_at_megacorp.com</email>
    <link subordinates="B.WALLACE"/>
  </person>
</addressbook>
Slide callout: Body
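As a rough illustration of how an application might consume the address book document, the sketch below parses it with Python's standard `xml.etree.ElementTree` and follows the `manager` reference from one person to another. The helper names `full_name` and `manager_of` are hypothetical. ElementTree does not validate against the DTD, so the ID/IDREF link is resolved by hand here.

```python
import xml.etree.ElementTree as ET

# The address book body from the slides (DTD omitted; ElementTree does not validate).
ADDRESS_BOOK = """<?xml version="1.0"?>
<addressbook>
  <person id="B.WALLACE" gender="male">
    <name><family>Wallace</family><given>Bob</given></name>
    <email>bwallace_at_megacorp.com</email>
    <link manager="C.TUTTLE"/>
  </person>
  <person id="C.TUTTLE" gender="female">
    <name><family>Tuttle</family><given>Claire</given></name>
    <email>ctuttle_at_megacorp.com</email>
    <link subordinates="B.WALLACE"/>
  </person>
</addressbook>
"""

def full_name(person):
    """Build 'Given Family' from a <person> element."""
    return "%s %s" % (person.findtext("name/given"), person.findtext("name/family"))

def manager_of(root, person_id):
    """Return the manager's full name for person_id, or None if there is none."""
    by_id = {p.get("id"): p for p in root.findall("person")}
    link = by_id[person_id].find("link")
    manager_id = link.get("manager") if link is not None else None
    return full_name(by_id[manager_id]) if manager_id in by_id else None

root = ET.fromstring(ADDRESS_BOOK)
print(manager_of(root, "B.WALLACE"))
```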
17. Do We Need a DTD?
- A DTD is not mandatory for XML documents
  - Let's try this out (use adbook.xml)
- Why do we need a DTD then?
  - A DTD defines the vocabulary for a specific class of documents (e.g. reports, theses): document hierarchy, elements, tags and syntactic rules for the document structure
  - Essential for information exchange between applications and services
  - Forms the basis for validating the correctness of XML documents
- Let's see another example: IOP DTD and XML files
19. Inside an XML Document (File)
- An XML document file contains one and only one document root element
- The root element contains all other elements and content of the document
- Document content is marked up using user-defined elements (tags) and element attributes
- Document content may also contain entity references (replacement text, external references, etc.)
- Document content may also contain comments and processing instructions
20. Elements
- XML documents contain data marked up with user-defined tags
- These tags are referred to as elements
- Elements can contain other elements; they should be properly nested
- An element consists of a start-tag/end-tag pair, along with the enclosed content (tag pairs are essential in XML)
- Tags are delimited by the < and > characters (end-tags begin with </): <email>bwallace_at_megacorp.com</email>
- Allows for empty elements, i.e. elements with no enclosed content, marked up using a trailing slash: <name>Ravi Verma<br/></name>
21. Elements
- Element attributes
  - Elements, including empty elements, may have attributes associated with them
  - These are name-value pairs: <book><title language="English">Road Ahead</title><price currency="Euro">420.12</price></book>
  - Values are always in quotes (single or double)
  - Any number of attributes
- Element names are case sensitive
22. Elements
- Element identifiers: XML allows unique identifiers to be assigned to elements, as the value of some attribute
<state id="TN"><sname>Tamil Nadu</sname></state>
<city id="CHN"><cname>Chennai</cname><state-of idref="TN"/></city>
23. Text
- PCDATA (parsed character data)
  - XML documents store data as text, as a string of characters
  - No distinction is made between numeric and character values
  - Denoted as #PCDATA in DTDs
  - Reserved characters (<, >, &, ', ") should not appear literally in text data; < and & in particular are never allowed. Entity references must be used to represent these (e.g. &lt; for <)
- CDATA
  - Used for storing data that may contain the delimiters
  - Delimited by <![CDATA[ and ]]>
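The two ways of carrying reserved characters described above (entity references and CDATA sections) can be tried out with Python's standard library. This is a minimal sketch: the `note` element and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample text containing reserved characters (hypothetical content).
raw = 'price < 700 & model = "A1"'

# Option 1: replace &, < and > with entity references before embedding the text.
escaped = escape(raw)
doc_with_entities = "<note>" + escaped + "</note>"

# Option 2: wrap the text in a CDATA section so it is not parsed as markup.
doc_with_cdata = "<note><![CDATA[" + raw + "]]></note>"

# Either way, the parser hands the application the original characters.
from_entities = ET.fromstring(doc_with_entities).text
from_cdata = ET.fromstring(doc_with_cdata).text

print(escaped)
print(from_entities == raw, from_cdata == raw)
```

Note that `escape` leaves quotation marks alone by default, since they only need escaping inside attribute values.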
24. Entity References
- Entity references within XML documents
  - Delimited by & and ;
  - Internal entity references
    - Simple macros used for defining replacement text in XML documents (e.g. &edi; for Electronic Data Interchange)
  - External entity references
    - The referenced entity lies outside the XML document (e.g. in a database)
- Parameter entity references
  - Used in DTDs; allow grouping of components for easy reference (e.g. %pe; to include elements A, B and C)
  - Delimited by % and ;
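To illustrate the internal entity case, the sketch below declares the `edi` entity from the slide in an internal DTD subset and lets Python's standard `xml.etree.ElementTree` parser expand it. The `memo` element and its sentence are hypothetical. The same reference without a declaration is rejected by the parser.

```python
import xml.etree.ElementTree as ET

# Internal entity declared in the document's own DTD subset.
DOC = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ENTITY edi "Electronic Data Interchange">
]>
<memo>Orders are exchanged using &edi;.</memo>
"""

# The parser substitutes the replacement text for &edi; while reading.
expanded = ET.fromstring(DOC).text
print(expanded)

# Without the declaration, &edi; is an undefined entity and parsing fails.
try:
    ET.fromstring("<memo>Orders are exchanged using &edi;.</memo>")
    undefined_rejected = False
except ET.ParseError:
    undefined_rejected = True
print(undefined_rejected)
```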
25. Comments and Processing Instructions
- Comments
  - Text strings delimited by <!-- and -->
  - Can be found anywhere in the XML document
- Processing instructions
  - Messages for processing applications
  - Example: <?myprog sort="alpha"?>
26. Inside XML DTD
- A DTD defines the document hierarchy in terms of the elements, the elements themselves and their syntactic structure, and entities and rules for their usage
- This is used for validating XML documents
Example:
<!--DTD for Books-->
<!ENTITY cright "&#169;">
<!ELEMENT books (book)+>
<!ELEMENT book (title, isbn, authors, description?, price)>
<!ELEMENT title (#PCDATA)>
<!ATTLIST title lang (english|french) #REQUIRED>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT authors (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ATTLIST price curr (Rs|Dollar) #IMPLIED>
27. Well Formed and Valid XML Document
- Well-formed XML documents must be syntactically correct. What does this mean?
  - There is only one root element and it contains the document's content
  - Start-tags are matched with end-tags (except for empty-element tags)
  - Nested elements never overlap
  - Attribute names are unique within an element and values are in quotes
  - Unless entities are declared in a DTD, the only entity references permitted are &amp; for &, &lt; for <, &gt; for >, &apos; for ', and &quot; for "
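The well-formedness rules listed above can be checked mechanically, since any conforming XML parser must reject a document that breaks them. The sketch below uses Python's standard `xml.etree.ElementTree`. The helper name `is_well_formed` and the sample fragments are made up for illustration, and the check covers well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# One good document and one violation per rule (illustrative fragments).
samples = {
    "well formed": "<a><b>text</b></a>",
    "two root elements": "<a/><b/>",
    "missing end-tag": "<a><b></a>",
    "overlapping elements": "<a><b></a></b>",
    "duplicate attribute": '<a id="1" id="2"/>',
    "unquoted attribute value": "<a id=1/>",
    "undeclared entity": "<a>&nbsp;</a>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, ok)
```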
28. Well Formed and Valid XML Document
- Valid XML documents
  - Valid XML documents are well-formed XML documents which include an XML declaration and a document type declaration
  - Valid documents must also adhere to the DTD
Let's look at an example using a sample Medline XML document
29. Rendering of XML
- How can we view XML documents on the Web?
- XML is about data, not presentation; presentation has to be handled separately
- This can be handled using CSS (Cascading Style Sheets) and XSLT (Extensible Stylesheet Language Transformations)
- CSS and XSLT can be used for rendering XML documents on the Web
- IE 5.0 supports viewing XML document trees and HTML rendering using CSS
- Let's see a CSS example (syllabus2)
30. Creating and Processing XML Documents
- XML documents can be created using any text editor
- XML editors and browsers are also available (e.g. SoftQuad's XMetaL, XMLSPY)
- Complete publishing solutions are also available (e.g. Cocoon in Java)
- Computers (applications) can exchange XML data and process these using the DTD for extraction and validation
- API specifications are also available for simplifying XML document processing (e.g. W3C's DOM specification)
- Most RDBMS packages have begun to support import/export of SQL database content in XML format
31. XML Applications and Vocabularies
- XML has found rapid use in a large number of domains
- Domain-specific vocabularies (DTDs and processing tools, techniques, practices) have been developed
  - Mathematics: MathML
  - Chemistry: CML
  - Instruments: Instrument Markup Language (IML)
  - Bioinformatics: Bioinformatic Sequence Markup Language (BSML)
  - E-Books: specs developed by the Open eBook Forum
  - Medicine: Health Level 7
  - Business/e-commerce: eXML, BizTalk
  - Mobile communications: WML
32. XML and Bibliographic Information
- XML appears to be made for documentary information (bibliographic or full text), since document content is already structured; XSLT can be used for viewing, transforming and presenting the content in different ways
- Several bibliographic databases are now available in XML (e.g. Medline, MARC)
- Journal publishers are also embracing XML (e.g. IOP)
- Digital publishing: books and other full-text documents are marked up once and viewed differently
  - e.g. the "Tobacco War" book on the eScholarship website of the California Digital Library
- Interoperability among digital libraries: the OAI initiative on using MARC XML for interoperability among DLs
33. Related Specifications and Standards
- XSL (Extensible Stylesheet Language): a way to specify stylesheets with XML for presentation and rendering (formatting); also a language for converting documents between different DTDs
- XLL (Extensible Linking Language): XML linking and addressing; a linking language (XLink) and an addressing language (XPointer)
- XHTML (Extensible HyperText Markup Language): a formulation of HTML 4.0 as an XML 1.0 application
- XQL: XML-oriented query language (under formulation)
- XML Schema: a schema language supporting user-defined data types (e.g. integers), something not possible with a DTD
34. Related Specifications and Standards
- DOM (Document Object Model): XML documents will be manipulated by web applications, for parsing requests and building responses. DOM is an API for this. It specifies a set of objects and interfaces for manipulating HTML and XML documents, and provides a tree-structured view of the document, enabling XML parsers to construct a tree of objects in memory.
- Namespaces: web applications may need to work with different XML documents having different DTDs; these might use the same element names for different entities and different names for the same entity, leading to name collisions and heterogeneity. The Namespaces specification is expected to resolve this problem.
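Both ideas can be seen together in a short sketch using `xml.dom.minidom`, the DOM implementation in Python's standard library. Two vocabularies each define a `title` element, and namespaces keep them apart. The namespace URIs, prefixes and element names below are hypothetical, chosen only for illustration.

```python
from xml.dom import minidom

# Hypothetical namespace URIs for two vocabularies that both define <title>.
PUB_NS = "http://example.org/publishing"
HR_NS = "http://example.org/staff"

DOC = """<record xmlns:pub="http://example.org/publishing"
        xmlns:hr="http://example.org/staff">
  <pub:title>Road Ahead</pub:title>
  <hr:title>Librarian</hr:title>
</record>"""

def text_of(node):
    """Concatenate the text children of a DOM element."""
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)

# Parsing builds an in-memory tree of DOM objects.
dom = minidom.parseString(DOC)
root = dom.documentElement

# The same local name is disambiguated by its namespace URI.
book_titles = [text_of(n) for n in dom.getElementsByTagNameNS(PUB_NS, "title")]
job_titles = [text_of(n) for n in dom.getElementsByTagNameNS(HR_NS, "title")]
print(book_titles, job_titles)

# The tree can also be modified through the DOM API, e.g. to build a response.
price = dom.createElementNS(PUB_NS, "pub:price")
price.appendChild(dom.createTextNode("420.12"))
root.appendChild(price)
print(len(root.getElementsByTagNameNS(PUB_NS, "price")))
```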
36. XML Resources
- www.oasis-open.org/cover/ (Robin Cover's SGML/XML page)
- www.xml.com
- www.ucc.ie/xml/ (XML FAQ and links to other topics related to XML)
- www.w3.org/XML/
- IBM developerWorks: www.ibm.com/developer/xml/
- Microsoft XML Developer Center: msdn.microsoft.com/xml/default.asp
- XML4Lib: a discussion forum related to XML use in libraries, http://sunsite.berkeley.edu/XML4Lib/