CSE 636 Data Integration - PowerPoint PPT Presentation

Slides: 43
Provided by: michailpe
Learn more at: https://cse.buffalo.edu
1
CSE 636 Data Integration
XML, Semistructured Data, Document Type Definitions
2
Semistructured Data
  • Another data model, based on trees
  • Motivation: flexible representation of data
  • Often, data comes from multiple sources with
    differences in notation, meaning, etc.
  • Motivation: sharing of documents among systems
    and databases

3
Graphs of Semistructured Data
  • Nodes = objects
  • Labels on arcs (attributes, relationships)
  • Atomic values at leaf nodes (nodes with no arcs
    out)
  • Flexibility: no restriction on
  • Labels out of a node
  • Number of successors with a given label

4
Example Data Graph
[Figure: data graph with a node labeled root. Arc labels: beer (twice), bar, manf, prize, name, year, award, servedAt, addr. Leaf values: A.B., Bud, Gold, 1995, M'lob, Maple, Joe's.]
5
XML
  • HTML
  • Uses tags for formatting the presentation (e.g.,
    "italic")
  • Hard for applications to process
  • XML = Extensible Markup Language
  • Uses tags for semantics (e.g., "this is an
    address")
  • Similar to labels in semistructured data
  • Allows you to invent your own tags
  • Easy for applications to process

6
HTML → XML
<html>
  <body>
    <h1> Bibliography </h1>
    <p> <i>Foundations of Databases</i>
        Abiteboul, Hull, Vianu <br/>
        Addison Wesley, 1995 </p>
    <p> <i> Data on the Web </i>
        Abiteboul, Buneman, Suciu <br/>
        Morgan Kaufmann, 1999 </p>
  </body>
</html>
<?xml version="1.0" standalone="yes" ?>
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>
7
Why XML is of Interest to Us
  • XML is just syntax for data
  • Note: we have no syntax for relational data
  • But XML is not relational: it is semistructured
  • This is exciting because:
  • Can translate any data to XML
  • Can ship XML over the Web (HTTP, SOAP)
  • Can input XML into any application
  • Thus: data sharing and exchange on the Web

8
XML Data Sharing and Exchange
[Figure: XML data is transformed, integrated, and shipped over the Web (HTTP, SOAP) between applications and sources: XML DB, relational DB, warehouse, Web site, Web service.]
9
XML Tags and Elements
  • Tags: book, title, author, ...
  • XML tags are case sensitive
  • Tags, as in HTML, are normally matched pairs
  • <book> ... </book>
  • Start tag: <book>, End tag: </book>
  • Elements: a matched pair of tags and everything
    between them
  • Example 1: <title>Foundations of
    Databases</title>
  • Example 2: <book> <title>Foundations of
    Databases</title> </book>
  • Elements may be nested arbitrarily
  • Empty element: <book></book>
  • Abbreviation: <book/>
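
A minimal sketch of the rules on this slide, using Python's standard-library xml.etree.ElementTree parser. The sample document string, the <note/> element, and the variable names are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Illustrative document based on the slide's book example; <note/> is added
# here only to show an empty element.
doc = "<book><title>Foundations of Databases</title><note/></book>"

book = ET.fromstring(doc)        # root element <book>
title = book.find("title")       # nested element <title>
note = book.find("note")         # empty element written as <note/>

print(book.tag)                  # book
print(title.text)                # Foundations of Databases
print(note.text)                 # None

# <note></note> and its abbreviation <note/> parse to the same empty element.
long_form = ET.fromstring("<note></note>")
short_form = ET.fromstring("<note/>")
same = (long_form.tag, long_form.text, len(long_form)) == (
    short_form.tag, short_form.text, len(short_form))

# Tags are case sensitive: <Book> is not closed by </book>.
try:
    ET.fromstring("<Book></book>")
    case_sensitive = False
except ET.ParseError:
    case_sensitive = True
```

The parser rejects the mismatched pair outright, which is the practical meaning of "matched pairs" and "case sensitive" above.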

10
XML Attributes
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  <year> 1995 </year>
</book>
  • Attributes are alternative ways to represent data

11
Replacing Attributes with Elements
<book>
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  <year> 1995 </year>
  <price> 55 </price>
  <currency> USD </currency>
</book>

12
Elements vs. Attributes
  • Too many attributes make documents hard to read
  • Attributes do not specify document structure
  • Attributes are good for simple information
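
The two encodings of price and currency can be compared side by side. This is a small sketch with Python's xml.etree.ElementTree; the document strings are condensed from the two preceding slides and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

with_attributes = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title>Foundations of Databases</title>"
    "</book>"
)
with_elements = ET.fromstring(
    "<book>"
    "<title>Foundations of Databases</title>"
    "<price>55</price><currency>USD</currency>"
    "</book>"
)

# Attributes: a flat name -> string map attached to the element.
price_a = with_attributes.get("price")
currency_a = with_attributes.get("currency")

# Child elements: ordered nodes that could later grow structure of their own.
price_e = with_elements.findtext("price")
currency_e = with_elements.findtext("currency")

print(price_a, currency_a)   # 55 USD
print(price_e, currency_e)   # 55 USD
```

Both forms carry the same values; the difference is that only the element form can be nested further, which is why attributes suit simple information.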

13
More XML: CDATA Section
  • Syntax: <![CDATA[ .....any text here... ]]>
  • Example:
<example> <![CDATA[ some text here </notAtag> <>]]>
</example>

14
More XML: Entity References
  • Syntax: &entityname;
  • Example: <element> this is less than &lt;
    </element>
  • Some entities:

&lt;     <
&gt;     >
&amp;    &
&apos;   '
&quot;   "
&#38;    Unicode char
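
How a parser treats entity references and CDATA sections can be checked directly. The sketch below uses Python's standard library (xml.etree.ElementTree and xml.sax.saxutils.escape); the input strings and variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Entity references are decoded by the parser.
elem = ET.fromstring("<element> this is less than &lt; </element>")
decoded = elem.text.strip()

# &#38; is a character reference for code point 38, the ampersand.
amp = ET.fromstring("<e>&#38;</e>").text

# A CDATA section passes markup-like text through untouched.
cdata = ET.fromstring(
    "<example><![CDATA[ </notAtag> <> ]]></example>"
).text.strip()

# When writing XML, escape() turns &, < and > into entity references.
escaped = escape("a < b & c > d")

print(decoded)   # this is less than <
print(amp)       # &
print(cdata)     # </notAtag> <>
print(escaped)   # a &lt; b &amp; c &gt; d
```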
15
More XML: Comments
  • Syntax: <!-- .... Comment text... -->
  • Yes, they are part of the data model !!!

16
XML Semantics: a Tree!
<data>
  <person age="25">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address>Thailand</address>
    <phone> 23456 </phone>
  </person>
</data>
[Figure: the tree for this document. The root data has two person children; inner nodes are labeled age, name, address, street, no, city, phone, and the text values sit at the leaves.]
  • Order matters!!!

17
Well-Formed XML
  • Start the document with a declaration, surrounded
    by <?xml ... ?>
  • Normal declaration is:
  • <?xml version="1.0" standalone="yes" ?>
  • standalone="yes" means no (external) DTD is provided
  • Has single root element surrounding nested
    elements
  • Has matching tags
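
A non-validating parser enforces exactly these well-formedness rules. The following sketch wraps Python's xml.etree.ElementTree in a hypothetical helper, is_well_formed, which is not part of any library; the test strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = '<?xml version="1.0" standalone="yes" ?><a><b/></a>'
two_roots = "<a/><b/>"           # violates the single-root rule
mismatched = "<a><b></a></b>"    # tags are not properly matched

print(is_well_formed(good))        # True
print(is_well_formed(two_roots))   # False
print(is_well_formed(mismatched))  # False
```

Well-formedness says nothing about a DTD; it only checks the declaration, the single root, and the tag matching listed above.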

18
XML Data
  • XML is self-describing
  • Schema elements become part of the data
  • Relational schema: person(name, phone)
  • In XML, <person>, <name>, <phone> are part of the
    data, and are repeated many times
  • Consequence: XML is much more flexible
  • XML = semistructured data
  • Well-Formed XML with nested tags is exactly the
    same idea as trees of semistructured data
  • XML also enables nontree structures, as does the
    semistructured data model

19
XML is Semistructured Data
  • Missing attributes:
  • Could represent in a table with nulls

<person> <name> John</name> <phone>1234</phone> </person>
<person> <name>Joe</name> </person>      ← no phone!

name   phone
John   1234
Joe    -
20
XML is Semistructured Data
  • Repeated attributes
  • Impossible in tables

<person> <name>Mary</name>
  <phone>2345</phone>
  <phone>3456</phone>      ← two phones!
</person>

name   phone
Mary   2345   3456   ???
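
Missing and repeated sub-elements are awkward in a table but straightforward in a tree API. A small sketch with Python's xml.etree.ElementTree follows; the <people> wrapper element is an assumption added so that the three person examples from these slides form one document.

```python
import xml.etree.ElementTree as ET

# <people> is a hypothetical root element wrapping the slides' examples.
doc = """
<people>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</people>
"""

# Map each person's name to the list of phone values (possibly empty).
phones = {
    person.findtext("name"): [ph.text for ph in person.findall("phone")]
    for person in ET.fromstring(doc)
}

print(phones)
# {'John': ['1234'], 'Joe': [], 'Mary': ['2345', '3456']}
```

An empty list stands in for the null, and a two-item list for the value that would not fit in a single table cell.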
21
XML is Semistructured Data
  • Attributes with different types in different
    objects
  • Nested collections (no 1NF)
  • Heterogeneous collections:
  • <db> contains both <book>s and <publisher>s

<person>
  <name>
    <first>John</first>
    <last>Smith</last>
  </name>                  ← structured name!
  <phone>1234</phone>
</person>
22
Document Type Definition (DTD)
  • Part of the original XML specification
  • An XML document may have a DTD
  • Valid XML: the document has a DTD and conforms to it
  • Validation is useful in data exchange

23
Very Simple DTD
<!DOCTYPE db [
  <!ELEMENT db ((book|publisher)*)>
  <!ELEMENT book (title,author*,year?)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
]>
24
DTD: The Content Model
  • Content model: <!ELEMENT tag (CONTENT)>
  • Complex = a regular expression over other
    elements
  • Text-only = #PCDATA
  • Empty = EMPTY
  • Any = ANY
  • Mixed content = (#PCDATA | A | B | C)*

25
DTD: Regular Expressions
Construct      DTD
sequence       <!ELEMENT name (firstName, lastName)>
optional       <!ELEMENT name (firstName?, lastName)>
zero or more   <!ELEMENT person (name, phone*)>
one or more    <!ELEMENT person (name, phone+)>
alternation    <!ELEMENT person (name, (phone|email))>
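
Because a content model is a regular expression over child element names, it can be mimicked with an ordinary regex engine. The sketch below is not a DTD validator: it hand-translates three content models into Python regexes over a comma-terminated sequence of child tags. The CONTENT_MODELS table, the children_match helper, and the element name contact are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Each child tag is encoded as "tag,", so "name," can never be confused
# with a longer tag such as "names,".
CONTENT_MODELS = {
    # <!ELEMENT name (firstName?, lastName)>
    "name": re.compile(r"(firstName,)?lastName,"),
    # <!ELEMENT person (name, phone*)>
    "person": re.compile(r"name,(phone,)*"),
    # <!ELEMENT contact (name, (phone|email))>   (hypothetical element)
    "contact": re.compile(r"name,(phone|email),"),
}


def children_match(elem: ET.Element) -> bool:
    """Check elem's sequence of child tags against its content model."""
    child_sequence = "".join(child.tag + "," for child in elem)
    return CONTENT_MODELS[elem.tag].fullmatch(child_sequence) is not None


ok_name = ET.fromstring("<name><lastName/></name>")
bad_name = ET.fromstring("<name><lastName/><firstName/></name>")
ok_person = ET.fromstring("<person><name/><phone/><phone/></person>")
bad_person = ET.fromstring("<person><phone/></person>")
ok_contact = ET.fromstring("<contact><name/><email/></contact>")
```

Only the order and count of direct children are checked here; a real validating parser also handles attributes, #PCDATA, and recursion into the children.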
26
DTD: Attributes
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age    CDATA #REQUIRED
                 height CDATA #IMPLIED>
<person age="25" height="6"> <name> ...</name> ... </person>
27
DTD: Attributes
  • <!ATTLIST tag (name type kind)+>
  • Types:
  • CDATA = string
  • (Mon | Wed | Fri) = enumeration
  • ID = key
  • IDREF = foreign key
  • IDREFS = foreign keys separated by space
  • others = rarely used
  • Kind:
  • #REQUIRED
  • #IMPLIED = optional
  • value = default value
  • #FIXED value = the only value allowed

28
XML IDs and References
  • Attributes can be pointers from one object to
    another
  • Compare to HTML's NAME = "foo" and HREF = "#foo"
  • Allows the structure of an XML document to be a
    general graph, rather than just a tree

29
XML: Creating IDs
  • Give an element E an attribute A of type ID
  • When using tag <E> in an XML document, give its
    attribute A a unique value
  • Example:
  • <E A = "xyz">

30
XML: Creating References
  • To allow objects of type F to refer to another
    object with an ID attribute, give F an attribute
    of type IDREF
  • Or, let the attribute have type IDREFS, so the F
    object can refer to any number of other objects

31
XML IDs and References
<person id="o555">
  <name>Jane</name>
</person>
<person id="o456">
  <name> Mary </name>
  <children idref="o123 o555"/>
</person>
<person id="o123" mother="o456">
  <name>John</name>
</person>
  • IDs and references in XML are just syntax
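
Since IDs and references are just syntax, it is the application that follows the pointers. The sketch below resolves them with a plain dictionary in Python. It assumes, by convention rather than from a DTD, that the attribute id is the ID and that mother and idref hold references; the <people> wrapper and the helper name_of are hypothetical.

```python
import xml.etree.ElementTree as ET

# <people> is a hypothetical root element wrapping the slide's example.
doc = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""
root = ET.fromstring(doc)

# Index every element that carries an id attribute.
by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}


def name_of(ref: str) -> str:
    """Follow one reference and return the target's <name> text."""
    return by_id[ref].findtext("name")


# IDREF: a single pointer.
mother = name_of(by_id["o123"].get("mother"))

# IDREFS: several pointers separated by spaces.
refs = by_id["o456"].find("children").get("idref").split()
children = [name_of(r) for r in refs]

print(mother)     # Mary
print(children)   # ['John', 'Jane']
```

Following the references turns the tree into a general graph: John points up to Mary, and Mary points back down to John and Jane.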

32
DTD: ID and IDREF(S) Attributes
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age     CDATA  #REQUIRED
                 id      ID     #REQUIRED
                 manager IDREF  #REQUIRED
                 manages IDREFS #REQUIRED>
<person age="25" id="p29432"
        manager="p48293" manages="p34982 p423234">
  <name> ....</name> ...
</person>
33
Use of DTDs
  • Set standalone = "no"
  • Either:
  • Include the DTD as a preamble of the XML
    document, or
  • Follow DOCTYPE and the <root tag> by SYSTEM and a
    path to the file where the DTD can be found, or
  • Mix the two... (e.g. to override the external
    definition)

34
Example (a)
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR> ...
</BARS>

35
Example (b)
  • Assume the BARS DTD is in file bar.dtd
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE BARS SYSTEM "bar.dtd">
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME>
      <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME>
      <PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR> ...
</BARS>

36
DTDs as Grammars
<!DOCTYPE db [
  <!ELEMENT db ((book|publisher)*)>
  <!ELEMENT book (title,author*,year?)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
]>
37
DTDs as Grammars
db → (book|publisher)*
book → (title,author*,year?)
title → string
author → string
year → string
publisher → string
  • Same thing as the db DTD, written as grammar rules
  • A DTD is an EBNF (Extended BNF) grammar
  • An XML tree is precisely a derivation tree
  • A valid XML document = a parse tree for that
    grammar

38
DTDs as Grammars
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>
<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> ... </section>
    <section> ... </section>
  </section>
</paper>
  • XML documents can be nested arbitrarily deep

39
DTDs as Schemas
  • Not so well suited:
  • impose unwanted constraints on order
  • <!ELEMENT person (name,phone)>
  • references cannot be constrained
  • an IDREF or IDREFS attribute can reference any ID
  • can be too vague
  • <!ELEMENT person ((name|phone|email)*)>

40
DTDs as Schemas
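  • No context-dependent typing
  • Cannot distinguish between used car ads and new
    car ads
  • Different structure in different contexts

[Figure: tree rooted at dealer with children UsedCars and NewCars, each containing an ad element; the leaf labels under the ads are model, year, year.]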
41
XML APIs
  • Document Object Model - DOM
  • Manipulation of XML Data
  • Provides a representation of an XML Document as a
    tree
  • Reads XML Document into memory
  • http://www.w3.org/DOM
  • Many implementations (Sun JAXP, Apache Xerces, ...)
  • Simple API for XML - SAX
  • Event-based framework for parsing XML data
  • http://www.saxproject.org/
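
The difference between the two API styles shows up even on a tiny document. The sketch below uses Python's standard-library counterparts (xml.dom.minidom and xml.sax) as stand-ins for the Java implementations named above; the sample document and the TitleHandler class are illustrative assumptions.

```python
import xml.sax
from xml.dom import minidom

doc = "<bib><book><title>A</title></book><book><title>B</title></book></bib>"

# DOM: the whole document is read into an in-memory tree, then queried.
dom = minidom.parseString(doc)
dom_titles = [t.firstChild.data for t in dom.getElementsByTagName("title")]


# SAX: the parser fires events; the handler keeps only what it needs.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False


handler = TitleHandler()
xml.sax.parseString(doc.encode("utf-8"), handler)
sax_titles = handler.titles

print(dom_titles)   # ['A', 'B']
print(sax_titles)   # ['A', 'B']
```

DOM is convenient when the data must be navigated or modified; SAX keeps memory use low because no tree is ever built.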

42
References
  • Lecture Slides
  • Jeffrey D. Ullman
  • http://www-db.stanford.edu/~ullman/dscb/pslides/pslides.html
  • Dan Suciu
  • http://www.cs.washington.edu/homes/suciu/COURSES/590DS/02xmlsyntax.htm
  • http://www.cs.washington.edu/homes/suciu/COURSES/590DS/11dtd.htm
  • Alon Levy
  • http://www.cs.washington.edu/education/courses/csep544/02sp/lectures/lecture5cut.ppt
  • BRICS XML Tutorial
  • A. Moeller, M. Schwartzbach
  • http://www.brics.dk/~amoeller/XML/index.html
  • W3C's XML homepage
  • http://www.w3.org/XML
  • XML School: an XML tutorial
  • http://www.w3schools.com/xml