Title: Web Data
1Web Data
- Semistructured data
- XML data
2Classes of XML Documents
- Structured
- Un-normalized relational data
- Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
- Mixed
- Structured data embedded in large text fragments
- Ex: on-line manuals, transcripts, tax forms
- Application may process XML in both classes
- Ex: SOAP messages
- Header is structured; payload is mixed
3Structured Data: HL7 Lab Report
- Health-care industry data-exchange format
- <HL7>
- <PATIENT>
- <PID IDNum="PATID1234">
- <PaNa><FaNa>Jones</FaNa><GiNa>William</GiNa></PaNa>
- <DTofBi><date>1961-06-13</date></DTofBi>
- <Sex>M</Sex>
- </PID>
- <OBX SetID="1">
- <ObsVa>150</ObsVa>
- <ObsId>Na</ObsId>
- <AbnFl>Above high</AbnFl>
- </OBX>
- ...
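Because the lab report is fully structured, it maps directly onto flat records. A minimal sketch using Python's standard `xml.etree.ElementTree`; the closing `</PATIENT>` and `</HL7>` tags and the output field names (`patient_id`, `family_name`, and so on) are assumptions added for illustration, since the slide elides the rest of the report.

```python
import xml.etree.ElementTree as ET

# Closing tags for PATIENT and HL7 are added so the fragment parses;
# the slide elides the remainder of the report.
HL7_FRAGMENT = """
<HL7>
  <PATIENT>
    <PID IDNum="PATID1234">
      <PaNa><FaNa>Jones</FaNa><GiNa>William</GiNa></PaNa>
      <DTofBi><date>1961-06-13</date></DTofBi>
      <Sex>M</Sex>
    </PID>
    <OBX SetID="1">
      <ObsVa>150</ObsVa>
      <ObsId>Na</ObsId>
      <AbnFl>Above high</AbnFl>
    </OBX>
  </PATIENT>
</HL7>
"""


def patient_rows(xml_text):
    """Flatten each PATIENT into one dict per OBX observation."""
    root = ET.fromstring(xml_text)
    rows = []
    for patient in root.iter("PATIENT"):
        pid = patient.find("PID")
        base = {
            "patient_id": pid.get("IDNum"),
            "family_name": pid.findtext("PaNa/FaNa"),
            "given_name": pid.findtext("PaNa/GiNa"),
            "birth_date": pid.findtext("DTofBi/date"),
            "sex": pid.findtext("Sex"),
        }
        for obx in patient.findall("OBX"):
            rows.append({
                **base,
                "set_id": obx.get("SetID"),
                "value": obx.findtext("ObsVa"),
                "observation": obx.findtext("ObsId"),
                "flag": obx.findtext("AbnFl"),
            })
    return rows


rows = patient_rows(HL7_FRAGMENT)
```

Each row repeats the patient fields next to one observation, which is the "un-normalized relational data" shape named on the previous slide.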
4Mixed Data: Library of Congress
- Documents of U.S. Legislation
- <bill bill-stage="Introduction">
- <congress>110th CONGRESS</congress>
- <session>1st Session</session>
- <legis-num>H.R. 133</legis-num>
- <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
- <action date="June 5, 2008">
- <action-desc>
- <sponsor>Mr. English</sponsor> (for himself and <cosponsor>Mr. Coyne</cosponsor>) introduced the following
- bill which was referred to the <committee-name>Committee on Financial Services</committee-name> ...
- </action-desc>
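In mixed content, text and elements interleave, so an application has to track both. A small sketch of walking the `action-desc` element with `xml.etree.ElementTree`, where free text sits in the `text` and `tail` attributes; only that one element is used, the elided part of the slide is left out, and the helper name `segments` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Only <action-desc> is reproduced; the text elided on the slide ("...")
# is omitted, so the sentence stops where the slide stops.
ACTION_DESC = (
    "<action-desc>"
    "<sponsor>Mr. English</sponsor> (for himself and "
    "<cosponsor>Mr. Coyne</cosponsor>) introduced the following bill "
    "which was referred to the "
    "<committee-name>Committee on Financial Services</committee-name>"
    "</action-desc>"
)


def segments(elem):
    """Yield (kind, text) pairs in document order for a mixed-content element."""
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        # The child's own content, then the free text that follows it.
        yield (child.tag, "".join(child.itertext()))
        if child.tail:
            yield ("text", child.tail)


parts = list(segments(ET.fromstring(ACTION_DESC)))
full_text = "".join(text for _, text in parts)
```

The tagged segments (`sponsor`, `cosponsor`, `committee-name`) are the structured data; the `text` segments are the surrounding prose they are embedded in.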
5Where's the XML Data?
6Schemas
Here lies our interest
- Why?
- XML: to describe semantics
- semistructured data: to improve processing
- What?
- semistructured data: foundational
- XML: several concrete proposals
7Schemas
- When?
- semistructured data, XML: a posteriori
- RDBMS: a priori, to interpret binary data
- How?
- semistructured data: schema is independent of the data
- XML: schema is hardwired with the data
8Outline
- schemas for semistructured data
- foundations
- schema extraction
- schemas for XML
- DTD
- XML-Schema
- RDF-Schema
9Schemas: An Example
[Figure: some database]
10Lower-Bound Schemas
[Figure: lower-bound schema graph with the labels Root, person, company, works-for, managed-by, Employee, Company, c.e.o., name, address, string]
11Upper-Bound Schemas
[Figure: upper-bound schema graph with the labels Root, person, company, works-for, managed-by, Employee, Company, c.e.o., employee, name, address, url, phone, position, description, string, Any]
12The Two Questions to Ask
- Conformance: does the data conform to this schema?
- Classification: if so, which objects belong to which classes?
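Both questions can be sketched with graph simulation. The slides do not define conformance formally, so the reading used here is an assumption: data conforms to an upper-bound schema if every data edge is allowed by the schema (data is simulated by the schema), and to a lower-bound schema if every schema edge is present in the data (schema is simulated by the data). The graphs, node ids, and function names below are hypothetical, and atomic values are ignored.

```python
def max_simulation(a, b):
    """Largest relation R where (x, y) in R means: every edge x -l-> x2 in
    graph `a` is matched by some edge y -l-> y2 in graph `b` with (x2, y2) in R.
    Graphs are dicts: node -> list of (label, target)."""
    rel = {(x, y) for x in a for y in b}
    changed = True
    while changed:
        changed = False
        for x, y in list(rel):
            ok = all(
                any(l2 == l and (x2, y2) in rel for l2, y2 in b[y])
                for l, x2 in a[x]
            )
            if not ok:
                rel.discard((x, y))
                changed = True
    return rel


def conforms_upper(data, data_root, schema, schema_root):
    """Upper bound: the data may contain only what the schema allows."""
    return (data_root, schema_root) in max_simulation(data, schema)


def conforms_lower(data, data_root, schema, schema_root):
    """Lower bound: the data must contain at least what the schema requires."""
    return (schema_root, data_root) in max_simulation(schema, data)


def classify(data, schema):
    """Map each data object to the schema classes that simulate it."""
    rel = max_simulation(data, schema)
    return {d: sorted(s for d2, s in rel if d2 == d) for d in data}


# Hypothetical data and schemas, loosely modeled on the figure labels.
DATA = {
    "root": [("person", "p1"), ("company", "c1")],
    "p1": [("name", "n1"), ("works-for", "c1")],
    "c1": [("name", "n2"), ("c.e.o.", "p1")],
    "n1": [], "n2": [],
}
UPPER = {
    "Root": [("person", "Employee"), ("company", "Company")],
    "Employee": [("name", "string"), ("works-for", "Company"), ("phone", "string")],
    "Company": [("name", "string"), ("c.e.o.", "Employee"), ("url", "string")],
    "string": [],
}
LOWER = {
    "Root": [("person", "Employee"), ("company", "Company")],
    "Employee": [("name", "string")],
    "Company": [("name", "string")],
    "string": [],
}

classes = classify(DATA, UPPER)
```

In this sketch a leaf object with no outgoing edges is simulated by every class, because atomic values are not modeled; a fuller version would also compare value types such as string.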
13Schemas for Semistructured XML Data
- Motivations for considering schema
- Optimize query evaluation
- Improve storage efficiency
- Support index construction
- Facilitate the description of database content
- Facilitate query formulation
- Facilitate data integration
14Application 1: Improve Secondary Storage
- Lower-bound schema determines the main storage layout
- Store the rest in an overflow graph
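One way to picture this split, as a sketch only: labels guaranteed by the lower-bound schema become fixed columns of a row, and every other edge goes to an overflow edge list. The slide does not specify a storage format, so the row layout, the function name `split_storage`, and the sample values are all assumptions.

```python
def split_storage(objects, required):
    """objects: {oid: [(label, value), ...]}.
    required: tuple of labels the lower-bound schema guarantees for this class.
    Returns (rows, overflow): fixed-width rows plus leftover edges."""
    rows, overflow = [], []
    for oid, edges in objects.items():
        columns = {}
        for label, value in edges:
            if label in required and label not in columns:
                columns[label] = value          # first occurrence fills the column
            else:
                overflow.append((oid, label, value))  # extras and repeats
        missing = [label for label in required if label not in columns]
        if missing:
            raise ValueError(f"{oid} violates the lower bound, missing {missing}")
        rows.append((oid,) + tuple(columns[label] for label in required))
    return rows, overflow


# Hypothetical Employee objects; name and address are the guaranteed labels.
EMPLOYEES = {
    "e1": [("name", "Ann"), ("address", "12 Main St"), ("phone", "555-0100")],
    "e2": [("name", "Bob"), ("address", "9 Elm Rd"), ("name", "Robert")],
}

rows, overflow = split_storage(EMPLOYEES, ("name", "address"))
```

The rows can be stored like a relational table, while the overflow list keeps the irregular remainder without forcing it into the fixed layout.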
15Application 2: Query Optimization
- Original query:
select X.title from Bib._ X where X.*.zip = 12345
- Rewritten using the upper-bound schema:
select X.title from Bib.book X where X.address.zip = 12345
- [Fernandez, Suciu 1998]
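The rewrite replaces wildcard steps with the only concrete labels the upper-bound schema permits. A rough sketch of that expansion, not the algorithm of the cited paper: the schema graph below is hypothetical, `_` stands for any single label, and `*` (an assumption of this sketch) stands for any sequence of labels, bounded by `max_depth` so that cyclic schemas terminate.

```python
def expand(schema, node, pattern, max_depth=6):
    """Return (label_path, end_node) pairs for concrete schema paths that
    start at `node` and match `pattern` (a tuple of labels, "_" or "*")."""
    if not pattern:
        return [((), node)]
    step, rest = pattern[0], pattern[1:]
    results = []
    if step == "*":
        # Case 1: the star matches zero labels.
        results += expand(schema, node, rest, max_depth)
        # Case 2: the star consumes one label and stays active.
        if max_depth > 0:
            for label, target in schema[node]:
                for path, end in expand(schema, target, pattern, max_depth - 1):
                    results.append(((label,) + path, end))
    else:
        for label, target in schema[node]:
            if step == "_" or step == label:
                for path, end in expand(schema, target, rest, max_depth):
                    results.append(((label,) + path, end))
    return results


def concrete_rewrites(schema, root, from_path, where_path):
    """For each concrete binding of the from-path, list the concrete
    where-paths the schema allows; bindings with none are pruned."""
    out = {}
    for path, node in expand(schema, root, from_path):
        where = [p for p, _ in expand(schema, node, where_path)]
        if where:
            out[path] = where
    return out


# Hypothetical upper-bound schema for a bibliography database.
SCHEMA = {
    "Root": [("Bib", "Bib")],
    "Bib": [("book", "Book"), ("paper", "Paper")],
    "Book": [("title", "string"), ("address", "Address")],
    "Paper": [("title", "string"), ("author", "string")],
    "Address": [("zip", "string"), ("city", "string")],
    "string": [],
}

rewrites = concrete_rewrites(SCHEMA, "Root", ("Bib", "_"), ("*", "zip"))
```

Under this schema only books can reach a `zip`, and only through `address`, so the wildcard query narrows to `Bib.book` with `address.zip`, and papers are never scanned.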
16Schema Extraction (From Data)
- Problem statement
- given data instance D
- find the most specific schema S for D
- In practice, S is too large and needs to be relaxed
- S. Nestorov, S. Abiteboul, and R. Motwani. Inferring Structure in Semistructured Data. In Proc. of the Workshop on Management of Semistructured Data, 1997
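To illustrate why the most specific schema grows large and what relaxing it can mean, here is a simplified sketch. It is not the method of the cited paper: it assigns one type per distinct set of outgoing labels, then greedily merges types whose label sets differ by at most `max_diff` labels. The objects and the threshold are hypothetical.

```python
def exact_types(objects):
    """objects: {oid: set of outgoing labels}.
    Most specific typing in this sketch: one type per distinct label set."""
    types = {}
    for oid, labels in objects.items():
        types.setdefault(frozenset(labels), []).append(oid)
    return types


def relax(types, max_diff=1):
    """Greedily merge types whose label sets differ by at most `max_diff`
    labels. Labels shared by all members become required, others optional."""
    # Largest types first; ties broken by label names for determinism.
    ordered = sorted(types.items(), key=lambda kv: (-len(kv[1]), sorted(kv[0])))
    clusters = []
    for labels, members in ordered:
        for c in clusters:
            if len(c["union"] ^ labels) <= max_diff:
                c["union"] |= labels
                c["required"] &= labels
                c["members"] += members
                break
        else:
            clusters.append({
                "union": set(labels),
                "required": set(labels),
                "members": list(members),
            })
    return [
        {
            "required": c["required"],
            "optional": c["union"] - c["required"],
            "members": sorted(c["members"]),
        }
        for c in clusters
    ]


# Hypothetical objects described only by their outgoing labels.
OBJECTS = {
    "o1": {"name", "address"},
    "o2": {"name", "address"},
    "o3": {"name", "address", "phone"},
    "o4": {"name", "url"},
    "o5": {"name", "url", "description"},
}

exact = exact_types(OBJECTS)
relaxed = relax(exact, max_diff=1)
```

Five objects already produce four exact types; the relaxed version keeps two, at the cost of marking `phone` and `description` as optional.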
17Schema Extraction (From Data)
- Roy Goldman, Jennifer Widom: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB 1997
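A DataGuide-style summary records every label path of the data exactly once. The sketch below builds such a summary by subset construction (each summary node stands for the set of data objects reachable by one label path); it is an illustration under that assumption, not a faithful reproduction of the cited system, and the data graph and function names are hypothetical.

```python
def build_dataguide(data, root):
    """data: {oid: [(label, child_oid), ...]}.
    Returns (guide, targets): guide maps summary-node id -> {label: node id},
    targets maps summary-node id -> frozenset of data oids it stands for."""
    root_set = frozenset([root])
    ids = {root_set: 0}
    guide = {0: {}}
    stack = [root_set]
    while stack:
        current = stack.pop()
        # Group the children of all objects in this target set by label.
        by_label = {}
        for oid in current:
            for label, child in data.get(oid, []):
                by_label.setdefault(label, set()).add(child)
        for label in sorted(by_label):
            target = frozenset(by_label[label])
            if target not in ids:          # new target set, new summary node
                ids[target] = len(ids)
                guide[ids[target]] = {}
                stack.append(target)
            guide[ids[current]][label] = ids[target]
    targets = {node_id: oids for oids, node_id in ids.items()}
    return guide, targets


def lookup(guide, targets, path):
    """Follow a label path from the summary root; empty set if it is absent."""
    node = 0
    for label in path:
        if label not in guide[node]:
            return frozenset()
        node = guide[node][label]
    return targets[node]


# Hypothetical data graph.
DATA = {
    "root": [("person", "p1"), ("person", "p2"), ("company", "c1")],
    "p1": [("name", "n1"), ("works-for", "c1")],
    "p2": [("name", "n2")],
    "c1": [("name", "n3")],
    "n1": [], "n2": [], "n3": [],
}

guide, targets = build_dataguide(DATA, "root")
```

A path query such as `person.name` can then be answered, or rejected as empty, from the summary alone, which is how such a structure supports query formulation and optimization.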
18Schema Extraction (From XML Data)
- Minos N. Garofalakis, Aristides Gionis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim: XTRACT: A System for Extracting Document Type Descriptors from XML Documents. SIGMOD Conference 2000
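To give a feel for the task of inferring DTD content models from example documents, here is a deliberately naive sketch. It is not the XTRACT algorithm: it takes child tags in first-seen order, attaches `?`, `*`, or `+` from observed counts, and falls back to a repeated choice when the examples disagree on order. Mixed content is ignored, and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def infer_dtd(documents):
    """Return {tag: content-model string} inferred from example documents."""
    child_seqs = {}  # tag -> list of child-tag sequences, one per occurrence
    for doc in documents:
        for elem in ET.fromstring(doc).iter():
            child_seqs.setdefault(elem.tag, []).append([c.tag for c in elem])

    models = {}
    for tag, seqs in child_seqs.items():
        order = []  # child tags in first-seen order
        for seq in seqs:
            for child in seq:
                if child not in order:
                    order.append(child)
        if not order:
            models[tag] = "(#PCDATA)"
            continue
        # A sequence model is only valid if every example respects `order`.
        ordered = all(
            [order.index(c) for c in seq] == sorted(order.index(c) for c in seq)
            for seq in seqs
        )
        if not ordered:
            models[tag] = "(" + " | ".join(order) + ")*"
            continue
        parts = []
        for child in order:
            counts = [seq.count(child) for seq in seqs]
            if min(counts) == 0:
                suffix = "*" if max(counts) > 1 else "?"
            else:
                suffix = "+" if max(counts) > 1 else ""
            parts.append(child + suffix)
        models[tag] = "(" + ", ".join(parts) + ")"
    return models


DOCS = [
    "<person><name>Ann</name><phone>1</phone><phone>2</phone></person>",
    "<person><name>Bob</name><address>x</address></person>",
]
models = infer_dtd(DOCS)
unordered = infer_dtd(["<r><a/><b/></r>", "<r><b/><a/></r>"])
```

Each inferred model accepts all the examples it was built from, but it can be far more general or more verbose than a hand-written DTD, which is the gap a dedicated extraction system aims to close.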