1
XML and Information Processing
T.B. Rajashekar
National Centre for Science Information
Indian Institute of Science
Bangalore 560 012
(raja_at_ncsi.iisc.ernet.in)
2
XML and Information Processing
  • Objective of the presentation
  • Current web information architecture and HTML
    limitations
  • Solution: XML (Extensible Markup Language)
  • SGML, XML and HTML
  • XML components and examples
  • XML DTD
  • XML applications and vocabularies
  • XML and bibliographic information
  • Related specifications and standards
  • XML resources

3
Objective of the Presentation
  • To explain a few key concepts related to XML
  • To indicate relevance of XML for information
    processing
  • To point to related resources

4
Current Web Information Architecture
5
Limitations of HTML
  • Let's look at two typical HTML examples: a portion
    of a full-text paper and a journal TOC (springer
    toc, libaut.htm)
  • Mixes data with presentation
  • HTML carries only the layout of the document,
    not its semantic structure; identifying and
    extracting structural elements is very
    difficult, which leads to poor-quality searches
  • Has a restricted set of tags, mostly to do with
    presentation, which authors cannot extend
  • Data exchange between applications/services is
    not possible
  • Weak support for validation (now overcome with
    XHTML)

6
XML: Extensible Markup Language
  • XML is specifically designed for the web and as a
    data interchange format
  • XML is a simplified subset of Standard
    Generalized Markup Language (SGML)
  • Overcomes limitations of HTML with several
    additional advantages
  • Formally ratified as a W3C standard in February
    1998
  • Specification provides a set of grammar and
    syntax rules for semantically describing the
    structure of data
  • Mark the structure of an object (document,
    database entry, etc.) and use this for sharing
    and processing.

7
XML: Extensible Markup Language
  • "ASCII of the Web": OS independent; create once
    and use in different ways
  • Quality searching
      • Can search for field-based content, not just any
        content
      • e.g. computer (name/model) with price < 700
  • Enables data interchange and sharing between
    applications
  • Limitation of XML: it does not convey the meaning
    of structural elements; applications have to
    achieve this, in part through common XML
    vocabularies
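
The field-based search mentioned above (computers with price < 700) can be sketched in Python with the standard library's ElementTree module. The catalog, its element names and the prices are invented for illustration; they are not taken from the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: element names and values are assumptions for this sketch.
CATALOG = """
<catalog>
  <computer><name>Alpha</name><model>A1</model><price>650</price></computer>
  <computer><name>Beta</name><model>B2</model><price>899</price></computer>
  <computer><name>Gamma</name><model>G3</model><price>499.50</price></computer>
</catalog>
"""

def computers_cheaper_than(xml_text, limit):
    """Return (name, model) pairs for computers priced below limit."""
    root = ET.fromstring(xml_text)
    matches = []
    for computer in root.iter("computer"):
        # XML stores the price as text, so convert it before comparing.
        price = float(computer.findtext("price"))
        if price < limit:
            matches.append((computer.findtext("name"), computer.findtext("model")))
    return matches

print(computers_cheaper_than(CATALOG, 700))
```

Because the price sits in its own element, the query can target that field directly instead of matching the string "700" anywhere in the page.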

8
XML Design Goals
  • XML shall be usable over the Internet
  • XML shall support a variety of applications
  • XML shall be compatible with SGML
  • It shall be easy to write programs that process
    XML documents
  • Optional features in XML shall be kept to the
    absolute minimum, ideally zero (compatibility)
  • XML documents should be human-legible and
    reasonably clear
  • Design of XML should be prepared quickly
  • Design of XML shall be formal and concise
  • XML documents shall be easy to create

9
SGML, HTML and XML
  • XML is a compromise between the non-extensible,
    limited capabilities of HTML and the full power
    and complexity of SGML
  • Claimed to be better than SGML
      • 80% of the capabilities, 20% of the complexity
      • Small enough to be supported by browser vendors
      • Explicitly includes hyperlinking
      • Supports style sheets
  • Better than HTML
      • Extensible (can create your own tags)
      • Tags identify content, e.g. an item and its price

11
XML Document Structure
Three components:
  • Prolog
      • XML declaration
      • Document type declaration (optional)
      • Processing instructions and comments
  • Document body
      • One or more elements, in the form of a
        hierarchical tree, some of which may contain data
      • Processing instructions and comments
  • Epilog (optional)
      • Processing instructions and comments

12
XML Document Structure
  • XML documents should start with the XML
    declaration: <?xml version="1.0"?>
  • The DTD (Document Type Definition), part of the
    document type declaration in the prolog, provides
    the vocabulary for the XML documents and defines
    the document hierarchy, elements, tags and
    syntactic rules for the document structure
  • The DTD can accompany the XML document or reside
    outside it
  • Let's take an example and see the two
    possibilities, using the IE browser (version 5.0
    or above) (adbook.xml)

13
XML Document With External DTD
    <?xml version="1.0"?>
    <!DOCTYPE addressbook SYSTEM "adbook.dtd">
    <addressbook>
      <person id="B.WALLACE" gender="male">
        <name>
          <family>Wallace</family>
          <given>Bob</given>
        </name>
        <email>bwallace@megacorp.com</email>
        <link manager="C.TUTTLE"/>
      </person>
      <person id="C.TUTTLE" gender="female">
        <name><family>Tuttle</family><given>Claire</given></name>
        <email>ctuttle@megacorp.com</email>
        <link subordinates="B.WALLACE"/>
      </person>
    </addressbook>

Slide labels: XML Declaration, Root element, Body
14
DTD Stored in adbook.dtd File
<!-- DTD for a simple address book -->
<!ELEMENT addressbook (person)+>
<!ELEMENT person (name,email,link?)>
<!ATTLIST person id ID #REQUIRED>
<!ATTLIST person gender (male|female) #IMPLIED>
<!ELEMENT name (family,given)>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY>
<!ATTLIST link manager IDREF #IMPLIED
               subordinates IDREFS #IMPLIED>
15
XML Document With Accompanying DTD
<?xml version="1.0"?>
<!DOCTYPE addressbook [
<!ELEMENT addressbook (person)+>
<!ELEMENT person (name,email,link?)>
<!ATTLIST person id ID #REQUIRED>
<!ATTLIST person gender (male|female) #IMPLIED>
<!ELEMENT name (family,given)>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY>
<!ATTLIST link manager IDREF #IMPLIED
               subordinates IDREFS #IMPLIED>
]>
Slide labels: XML Declaration, DTD
16
XML Document With Accompanying DTD
<addressbook>
  <person id="B.WALLACE" gender="male">
    <name>
      <family>Wallace</family>
      <given>Bob</given>
    </name>
    <email>bwallace@megacorp.com</email>
    <link manager="C.TUTTLE"/>
  </person>
  <person id="C.TUTTLE" gender="female">
    <name><family>Tuttle</family><given>Claire</given></name>
    <email>ctuttle@megacorp.com</email>
    <link subordinates="B.WALLACE"/>
  </person>
</addressbook>
Slide label: Body
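
As a rough illustration of how an application might consume the address book, the sketch below parses a document with an accompanying (internal) DTD and follows the manager link of a person. It uses Python's standard ElementTree module, which reads past the DTD but does not validate against it. The helper name manager_of is hypothetical, and the DTD is shortened to the declarations needed here, with the repetition marker on person assumed.

```python
import xml.etree.ElementTree as ET

# Address book with an internal DTD subset. The DTD is abbreviated for this sketch.
ADDRESS_BOOK = """<?xml version="1.0"?>
<!DOCTYPE addressbook [
<!ELEMENT addressbook (person)+>
<!ELEMENT person (name,email,link?)>
<!ATTLIST person id ID #REQUIRED>
<!ELEMENT name (family,given)>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY>
<!ATTLIST link manager IDREF #IMPLIED subordinates IDREFS #IMPLIED>
]>
<addressbook>
  <person id="B.WALLACE">
    <name><family>Wallace</family><given>Bob</given></name>
    <email>bwallace@megacorp.com</email>
    <link manager="C.TUTTLE"/>
  </person>
  <person id="C.TUTTLE">
    <name><family>Tuttle</family><given>Claire</given></name>
    <email>ctuttle@megacorp.com</email>
    <link subordinates="B.WALLACE"/>
  </person>
</addressbook>"""

def manager_of(root, person_id):
    """Return the manager's full name for person_id, or None if no manager is linked."""
    # Index every person by its id attribute so IDREF values can be resolved.
    people = {person.get("id"): person for person in root.iter("person")}
    link = people[person_id].find("link")
    manager_id = link.get("manager") if link is not None else None
    if manager_id is None:
        return None
    name = people[manager_id].find("name")
    return f"{name.findtext('given')} {name.findtext('family')}"

root = ET.fromstring(ADDRESS_BOOK)
print(manager_of(root, "B.WALLACE"))
```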
17
Do We Need a DTD?
  • A DTD is not mandatory for XML documents
  • Let's try this out (use adbook.xml)
  • Why do we need a DTD then?
      • A DTD defines the vocabulary for a specific
        class of documents (e.g. reports, theses):
        document hierarchy, elements, tags and
        syntactic rules for the document structure
      • Essential for information exchange between
        applications and services
      • Forms the basis for validating the correctness
        of XML documents
  • Let's see another example: the IOP DTD and XML
    files.

18
Do We Need a DTD?
19
Inside an XML Document (File)
  • An XML document file contains one and only one
    document root element
  • The root element contains all of the document's
    content
  • Document content is marked up using user-defined
    elements (tags) and element attributes
  • Document content may also contain entity
    references (replacement text, external
    references, etc.)
  • Document content may also contain comments and
    processing instructions

20
Elements
  • XML documents contain data marked up with
    user-defined tags
  • These are referred to as elements
  • Elements can contain other elements, which must
    be properly nested
  • An element consists of a start-tag/end-tag pair,
    along with the enclosed content (tag pairs are
    essential in XML)
  • Tags are delimited by < and > characters
    (end-tags begin with </):
    <email>bwallace@megacorp.com</email>
  • Empty elements (elements with no enclosed
    content) are marked up using a trailing slash:
    <name>Ravi Verma<br/></name>

21
Elements
  • Element attributes
      • Elements, including empty elements, may have
        attributes associated with them
      • These are name-value pairs:
        <book>
          <title language="English">Road Ahead</title>
          <price currency="Euro">420.12</price>
        </book>
      • Values are always in quotes (single or double)
      • An element may carry any number of attributes
  • Element names are case sensitive
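
A short sketch of how attributes and case sensitivity look from a program's point of view, using Python's standard ElementTree module. The book fragment is the one from this slide, with the attribute quotes written out (one pair single, one pair double).

```python
import xml.etree.ElementTree as ET

BOOK = """
<book>
  <title language="English">Road Ahead</title>
  <price currency='Euro'>420.12</price>
</book>
"""

book = ET.fromstring(BOOK)
title = book.find("title")
price = book.find("price")

# Attributes are exposed as name/value pairs; the quote style does not matter.
print(title.attrib)
print(price.get("currency"))

# Element names are case sensitive: "Title" is not the same element as "title".
print(book.find("Title"))
```

The last lookup finds nothing because no element named Title (capital T) exists in the fragment.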

22
Elements
  • Element identifiers: XML allows unique
    identifiers to be assigned to elements, as the
    value of some attribute:
    <state id="TN"><sname>Tamil Nadu</sname></state>
    <city id="CHN"><cname>Chennai</cname><state-of idref="TN"/></city>

23
Text
  • PCDATA (parsed character data)
      • XML documents store data as text, as a string
        of characters
      • No distinction is made between numeric and
        character values
      • Written as #PCDATA in a DTD
      • Text data may not contain reserved characters
        (<, >, &, ', "). Entity references must be used
        to represent these (e.g. &lt; for <)
  • CDATA
      • Used for storing data that may contain the
        delimiters
      • Delimited by <![CDATA[ and ]]>
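
The two ways of carrying reserved characters can be compared with a small Python sketch using the standard library. The sample string is made up for illustration. Both the escaped form and the CDATA form should give the application back the same text, while the unescaped form is rejected by the parser.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "price < 700 & model > A1"

# Option 1: replace reserved characters with entity references.
escaped = escape(raw)
doc_escaped = f"<note>{escaped}</note>"

# Option 2: wrap the text in a CDATA section so the delimiters are not parsed.
doc_cdata = f"<note><![CDATA[{raw}]]></note>"

text_from_escaped = ET.fromstring(doc_escaped).text
text_from_cdata = ET.fromstring(doc_cdata).text

# Leaving the text unescaped makes the document ill-formed.
try:
    ET.fromstring(f"<note>{raw}</note>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False

print(escaped)
print(text_from_escaped == text_from_cdata == raw, unescaped_ok)
```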

24
Entity References
  • Entity references within XML documents
      • Delimited by & and ;
      • Internal entity references: simple macros used
        for defining replacement text in XML documents
        (e.g. &edi; for Electronic Data Interchange)
      • External entity references: the referred entity
        lies outside the XML document (e.g. database)
  • Parameter entity references
      • Used in the DTD; allow grouping of components
        for easy reference (e.g. %pe; to include
        elements A, B and C)
      • Delimited by % and ;
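
A minimal sketch of an internal entity acting as a macro, using Python's standard ElementTree module. The memo element and its sentence are invented; the entity name edi follows the example on this slide. The parser is expected to substitute the replacement text before the application sees it, and to reject a reference that has no declaration.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset and referenced in the body.
DOC = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ENTITY edi "Electronic Data Interchange">
]>
<memo>Orders are sent using &edi;.</memo>"""

memo = ET.fromstring(DOC)
print(memo.text)

# The same reference without a declaration is an error.
try:
    ET.fromstring("<memo>Orders are sent using &edi;.</memo>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```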

25
Comments and Processing Instructions
  • Comments
      • Text strings delimited by <!-- and -->
      • Can appear anywhere in the XML document
        outside other markup
  • Processing instructions
      • Messages for processing applications
      • Example: <?myprog sortalpha?>
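
To show how an application might pick up comments and processing instructions, here is a small sketch using Python's standard minidom module. The document is invented around the slide's myprog example, and the helper name collect is hypothetical.

```python
from xml.dom import minidom, Node

DOC = """<?xml version="1.0"?>
<!-- address book export -->
<?myprog sortalpha?>
<addressbook><!-- no entries yet --></addressbook>"""

def collect(node, comments, instructions):
    """Walk the tree and gather comment texts and (target, data) pairs of PIs."""
    for child in node.childNodes:
        if child.nodeType == Node.COMMENT_NODE:
            comments.append(child.data.strip())
        elif child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            instructions.append((child.target, child.data))
        collect(child, comments, instructions)

comments, instructions = [], []
collect(minidom.parseString(DOC), comments, instructions)
print(comments)
print(instructions)
```

The XML declaration on the first line looks like a processing instruction but is not reported as one.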

26
Inside XML DTD
  • The DTD defines the document hierarchy in terms
    of the elements, the elements themselves and
    their syntactic structure, entities and rules for
    their usage
  • This is used for validating XML documents.
    Example:
    <!--DTD for Books-->
    <!ENTITY cright "&#169;">
    <!ELEMENT books (book)>
    <!ELEMENT book (title, isbn, authors, description?, price)>
    <!ELEMENT title (#PCDATA)>
    <!ATTLIST title lang (english|french) #REQUIRED>
    <!ELEMENT isbn (#PCDATA)>
    <!ELEMENT authors (#PCDATA)>
    <!ELEMENT description (#PCDATA)>
    <!ELEMENT price (#PCDATA)>
    <!ATTLIST price curr (Rs|Dollar) #IMPLIED>

27
Well Formed and Valid XML Document
  • Well-formed XML documents must be syntactically
    correct. What does this mean?
      • There is only one root element and it contains
        the document's content
      • Start-tags are matched with end-tags (except
        for empty-element tags)
      • Nested elements never overlap
      • Attributes are unique and values are in quotes
      • Without a DTD, the only entity references
        permitted are &amp; for &, &lt; for <, &gt;
        for >, &apos; for ', and &quot; for "
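
The rules above can be tried out with any XML parser, since a conforming parser must reject documents that are not well formed. The sketch below uses Python's standard ElementTree module; the helper name is_well_formed and the sample fragments are made up for illustration. It checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One fragment per rule on the slide; only the first one is well formed.
samples = {
    "ok": "<a><b>text</b><c/></a>",
    "two roots": "<a/><b/>",
    "missing end-tag": "<a><b>text</a>",
    "overlapping elements": "<a><b><c></b></c></a>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "unquoted attribute": "<a x=1/>",
    "undeclared entity": "<a>&copy;</a>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```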

28
Well Formed and Valid XML Document
  • Valid XML documents
      • Valid XML documents are well-formed XML
        documents which include an XML declaration and
        a document type declaration
      • Valid documents must also adhere to the DTD

Let's look at an example using a sample Medline
XML document
29
Rendering of XML
  • How can we view XML documents on the Web?
  • XML is about data, not presentation;
    presentation has to be handled separately
  • Presentation can be handled using CSS (Cascading
    Style Sheets) and XSLT (Extensible Stylesheet
    Language Transformations), both of which can be
    used for rendering XML documents on the Web
  • IE 5.0 supports viewing XML document trees and
    HTML rendering using CSS
  • Let's see a CSS example (syllabus2)
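
The separation of data from presentation can be illustrated without a stylesheet engine. The Python sketch below is only a stand-in for what a CSS or XSLT stylesheet would do in the browser: it turns a data-only XML fragment into an HTML list. The syllabus markup, course codes and the function name render_html are invented for this sketch and are not the syllabus2 example from the talk.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical data-only document: no layout information at all.
SYLLABUS = """
<syllabus>
  <course code="IS101"><title>Information Sources &amp; Services</title></course>
  <course code="IS102"><title>XML Basics</title></course>
</syllabus>
"""

def render_html(xml_text):
    """Produce one possible HTML presentation of the syllabus data."""
    root = ET.fromstring(xml_text)
    items = [
        f"<li><b>{escape(course.get('code'))}</b>: {escape(course.findtext('title'))}</li>"
        for course in root.iter("course")
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

print(render_html(SYLLABUS))
```

A second function could render the same data as a table or as plain text, which is the "create once, use in different ways" idea in miniature.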

30
Creating and Processing XML Documents
  • XML documents can be created using any text
    editor
  • XML editors and browsers are also available (e.g.
    SoftQuad's XMetaL, XMLSPY)
  • Complete publishing solutions are also available
    (e.g. Cocoon in Java)
  • Computers (applications) can exchange XML data
    and process it, using the DTD for extraction and
    validation
  • API specifications are also available for
    simplifying XML document processing (e.g. W3C's
    DOM specification)
  • Most RDBMS packages have begun to support import/
    export of SQL database content in XML format
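
As an example of processing through a DOM-style API, the sketch below uses Python's standard minidom module, one implementation of the W3C DOM interfaces. It parses a small address book, builds a new person subtree in memory and then queries the tree. The data reuses names from the earlier address book example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<addressbook>"
    "<person id='B.WALLACE'><email>bwallace@megacorp.com</email></person>"
    "</addressbook>"
)
root = doc.documentElement

# Build a new <person> subtree node by node and attach it to the root.
person = doc.createElement("person")
person.setAttribute("id", "C.TUTTLE")
email = doc.createElement("email")
email.appendChild(doc.createTextNode("ctuttle@megacorp.com"))
person.appendChild(email)
root.appendChild(person)

# Query the in-memory tree.
ids = [p.getAttribute("id") for p in doc.getElementsByTagName("person")]
emails = [e.firstChild.data for e in doc.getElementsByTagName("email")]
print(ids)
print(emails)
```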

31
XML Applications and Vocabularies
  • XML has found rapid use in a large number of
    domains
  • Domain-specific vocabularies (DTDs and processing
    tools, techniques, practices) have been developed
      • Mathematics: MathML
      • Chemistry: CML
      • Instruments: Instrument Markup Language (IML)
      • Bioinformatics: Bioinformatic Sequence Markup
        Language (BSML)
      • E-books: specs developed by the Open eBook Forum
      • Medicine: Health Level 7
      • Business/e-commerce: eXML, BizTalk
      • Mobile communications: WML

32
XML and Bibliographic Information
  • XML appears to be made for documentary
    information (bibliographic or full text), since
    document content is already structured; XSLT can
    be used for viewing, transforming and presenting
    the content in different ways
  • Several bibliographic databases are now available
    in XML (e.g. Medline, MARC)
  • Journal publishers are also embracing XML (e.g.
    IOP)
  • Digital publishing: books and other full-text
    documents can be marked up once and viewed
    differently
      • e.g. the Tobacco War book on the eScholarship
        website of the California Digital Library
  • Interoperability among digital libraries: the OAI
    initiative uses MARC XML for interoperability
    among DLs

33
Related Specifications and Standards
  • XSL (Extensible Stylesheet Language): a way to
    specify stylesheets with XML for presentation
    and rendering (formatting); also a language for
    conversion of documents with different DTDs
  • XLL (Extensible Linking Language): XML linking
    and addressing, made up of a linking language
    (XLink) and an addressing language (XPointer)
  • XHTML (Extensible HyperText Markup Language): a
    reformulation of HTML 4.0 as an XML 1.0
    application
  • XQL: an XML-oriented query language (under
    formulation)
  • XML Schema: a schema language supporting
    user-defined data types (e.g. integers), something
    not possible with a DTD

34
Related Specifications and Standards
  • DOM (Document Object Model): XML documents will
    be manipulated by web applications, for parsing
    requests and building responses. DOM is an API
    for this. It specifies a set of objects and
    interfaces for manipulating HTML and XML
    documents, and provides a tree-structured view of
    the document, enabling XML parsers to construct a
    tree of objects in memory.
  • Namespaces: web applications may need to work
    with different XML documents having different
    DTDs; these might use the same element names for
    different entities and different names for the
    same entity, leading to name collisions and
    heterogeneity. The Namespaces specification is
    expected to resolve this problem.
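
The name-collision problem and the way namespaces address it can be sketched with Python's standard ElementTree module. The two namespace URIs, the prefixes and the order document are made up for illustration. Both vocabularies use an element called title, but each is qualified by its own namespace.

```python
import xml.etree.ElementTree as ET

# Two vocabularies that both define <title>; the URIs are placeholders.
DOC = """
<order xmlns:bk="http://example.org/books" xmlns:cd="http://example.org/music">
  <bk:title>Road Ahead</bk:title>
  <cd:title>Greatest Hits</cd:title>
</order>
"""

NS = {"bk": "http://example.org/books", "cd": "http://example.org/music"}

root = ET.fromstring(DOC)
book_title = root.findtext("bk:title", namespaces=NS)
music_title = root.findtext("cd:title", namespaces=NS)

# Internally each tag carries its namespace URI, so the two titles stay distinct.
tags = [child.tag for child in root]
print(book_title, "|", music_title)
print(tags)
```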

36
XML Resources
  • www.oasis-open.org/cover/ (Robin Cover's SGML/XML
    page)
  • www.xml.com
  • www.ucc.ie/xml/ (XML FAQ and links to other
    topics related to XML)
  • www.w3.org/XML/
  • IBM developerWorks: www.ibm.com/developer/xml/
  • Microsoft XML Developer Center:
    msdn.microsoft.com/xml/default.asp
  • XML4Lib: a discussion forum related to XML use in
    libraries: http://sunsite.berkeley.edu/XML4Lib/