Title: Web Data Management
1Web Data Management
- Sanjay Kumar Madria
- Department of Computer Science
- University of Missouri-Rolla
- madrias_at_umr.edu
2WWW
- Huge, widely distributed, heterogeneous
collection of semi-structured multimedia
documents in the form of web pages connected via
hyperlinks.
3World Wide Web
- Web is fast growing
- More business organizations putting information in the Web
- Business on the highway
- Myriad of raw data to be processed for information
4As WWW grows, more chaotic it becomes
- Web is a fast-growing, distributed, non-administered global information resource
- WWW allows access to text, image, video, sound and graphic data
- More business organizations creating web servers
- More chaotic environment to locate information of interest
- Lost in hyperspace syndrome
5Characteristics of WWW
- WWW is a set of directed graphs
- Data in the WWW is heterogeneous, self-describing and schema-less
- Unstructured information, deeply nested
- No central authority to manage information
- Dynamic versus static information
- Web information discoveries - search engines
6Web is Growing!
- In 1994, WWW grew by 1758 !!
- June 1993 - 130
- June 1994 - 1265
- Dec. 1994 - 11,576
- April 1995 - 15,768
- July 1995 - 23,000
- 2000 - !!!!!
7COM domains are increasing!
- As of July 1995, 6.64 million host computers on the Internet
- 1.74 million are com domains
- 1.41 million are edu domains
- 0.30 million are net
- 0.27 million are gov
- 0.22 million are mil
- 0.20 million are org
8Top web countries
- 1. Canada (1) 80
- 2. US (4) 140
- 3. Ireland (3) 110
- 4. Iceland (2) 68
- 5. UK (14) 336
- 6. Malta (5) 155
- 7. Australia (6) 133
- 8. Singapore (11) 207
- 9. New Zealand (7) 101
- 10. Sweden (9) 101
- 11. Israel (12) 112
- 12. Cyprus (8) 72
- 13. Hong Kong (15) 148
- 14. Norway (10) 64
- 15. Switzerland (13) 75
- 16. Denmark (16) 105
9How users find web sites
- Indexes and search engines 75
- UseNet newsgroups 44
- Cool lists 27
- New lists 24
- Listservers 23
- Print ads 21
- Word-of-mouth and e-mail 17
- Linked web advertisement 4
10Limitations of Search Engines
- Do not exploit hyperlinks
- Search is limited to string matching
- Queries are evaluated on archived data rather than up-to-date data (no indexing on current data)
- Low accuracy
- Replicated results
- No further manipulation possible
11Limitations of Search Engines
- ERROR 404!
- No efficient document management
- Query results cannot be further manipulated
- No efficient means for knowledge discovery
12More PROBLEMS
- specifying/understanding what information is wanted
- the high degree of variability of accessible information
- the variability in conceptual vocabulary or ontology used to describe information
- complexity of querying unstructured data
13- complexity of querying structured data
- uncontrolled nature of web-based information content
- determining which information sources to search/query
14- Search Engine Capabilities
- Selection of language
- Keywords with disjunction, adjacency, presence, absence, ...
- Word stemming (Hotbot)
- Similarity search (Excite)
- Natural language (LycosPro)
- Restrict by modification date (Hotbot) or range of dates (AltaVista)
- Restrict result types (e.g., must include images) (Hotbot)
- Restrict by geographical source (content or domain) (Hotbot)
- Restrict within various structured regions of a document: titles or URLs (LycosPro); summary, first heading, title, URL (Opentext)
15SEARCH RETRIEVAL
Search engine     % of web covered
Hotbot            34
AltaVista         28
Northern Light    20
Excite            14
Infoseek          10
Lycos             3
- using several search engines is better than using only one
- Source: Lawrence, S., and Giles, C.L., Searching the World Wide Web, Science 280, pp. 98-100, 1998.
16Key Objectives
- Design a suitable data model to represent web information
- Development of web algebra and query language, query optimization
- Maintenance of Web data - view maintenance
- Development of knowledge discovery and web mining tools
- Web warehouse
- Data integration, secondary storage, indexes
17Web Data Representation
- HTML - Hypertext Markup Language
- fixed grammar, no regular expressions
- Simple representation of data
- good for simple data
- difficult to extract information
- SGML - Standard Generalized Markup Language
- good for publishing deeply structured documents
- XML - Extensible Markup Language - a subset of SGML
18Terminology
- HTML - Hypertext Mark-up Language
- HTTP - Hypertext Transfer Protocol
- URL - Uniform Resource Locator
- example - a <URL> is <protocol>://<host>/<path>/<filename>#<location> where
- <protocol> is http, ftp, gopher
- <host> is the Internet address
- <location> is a textual label in the file.
19 - Links are specified as
- <A HREF="Destination URL">Anchor Text</A>
- Destination URL is the URL of the destination document and Anchor Text is the text that appears as an anchor when displayed.
- Example
- <A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>
- Absolute and relative URLs
- <A HREF="AtlanticStates/NYStats.html">New York</A> is relative
- <A HREF="http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html">NCSA's Beginner's Guide to HTML</A> is an absolute address
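The difference between relative and absolute links can be sketched in a few lines of Python using only the standard library. This is an illustration, not part of the original slides: the page URL `http://www.example.org/stats/index.html` and the names `AnchorCollector` and `resolve_links` are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <A HREF=...> tags."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def resolve_links(page_url, html):
    """Return (anchor text, resolved URL, kind) for every link on the page."""
    collector = AnchorCollector()
    collector.feed(html)
    resolved = []
    for href, text in collector.anchors:
        kind = "absolute" if urlparse(href).scheme else "relative"
        resolved.append((text, urljoin(page_url, href), kind))
    return resolved


# Hypothetical page location, used only to resolve the relative link.
PAGE_URL = "http://www.example.org/stats/index.html"
HTML = (
    '<A HREF="AtlanticStates/NYStats.html">New York</A>\n'
    '<A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>'
)
links = resolve_links(PAGE_URL, HTML)
```

A relative link is resolved against the URL of the page that contains it, while an absolute link is used unchanged.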
20World Wide Web
- Prevalent, persistent and informative
- HTML documents (soon, XML) created by humans or
applications.
- Accessed day in and day out by humans and
applications.
- Persistent HTML documents!!!
Can database technology help?
21Current Research Projects
- Web Query System
- W3QS, WebSQL, AKIRA, NetQL, RAW,
- WebLog, Araneus
- Semistructured Data Management
- LOREL, UnQL, WebOQL, Florid
- Website Management System
- STRUDEL, Araneus
- Web Warehouse
- WHOWEDA
22Main Tasks
- Modeling and Querying the Web
- view web as directed graph
- content and link based queries
- example - find the pages that contain the word "clinton" and have a link from a page containing the word "monica".
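A content-and-link query of this kind can be sketched over a small in-memory web graph. The pages, their text and the function name `pages_linked_from` below are invented for illustration; a real system would fetch and index pages rather than hold them in a dictionary.

```python
# Hypothetical web: URL -> page text and outgoing links.
pages = {
    "a.html": {"text": "monica interview transcript", "links": ["b.html", "c.html"]},
    "b.html": {"text": "president clinton statement", "links": []},
    "c.html": {"text": "weather report", "links": ["b.html"]},
    "d.html": {"text": "clinton biography", "links": []},
}


def pages_linked_from(pages, source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    hits = set()
    for page in pages.values():
        if source_word not in page["text"].split():
            continue  # content condition on the source page
        for dest in page["links"]:  # link condition
            if dest in pages and target_word in pages[dest]["text"].split():
                hits.add(dest)
    return sorted(hits)


result = pages_linked_from(pages, "monica", "clinton")
```

Note that `d.html` mentions the target word but is not returned, because no page containing the source word links to it.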
23 - Information Extraction and integration
- Wrapper - a program that extracts a structured representation of the data (a set of tuples) from HTML pages.
- Mediator - data-integration software that accesses multiple sources through a uniform interface
- Web Site Construction and Restructuring
- creating sites
- modeling the structure of web sites
- restructuring data
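As a rough sketch of the wrapper and mediator roles, the Python below turns HTML table rows into tuples and then unions the tuples of several sources. The class `TableWrapper`, the functions `wrap` and `mediate`, and the two sample sources are assumptions for the example; real wrappers deal with far messier pages.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Extracts one tuple per <tr>, with one field per <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None


def wrap(html):
    """Wrapper: HTML page in, list of tuples out."""
    wrapper = TableWrapper()
    wrapper.feed(html)
    return wrapper.rows


def mediate(sources):
    """Toy mediator: a single interface over the tuples of every wrapped source."""
    return [row for html in sources for row in wrap(html)]


# Two hypothetical sources listing systems and their institutions.
SOURCE_A = "<table><tr><td>Lore</td><td>Stanford</td></tr></table>"
SOURCE_B = (
    "<table><tr><td>WebSQL</td><td>Toronto</td></tr>"
    "<tr><td>W3QS</td><td>Technion</td></tr></table>"
)
tuples = mediate([SOURCE_A, SOURCE_B])
```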
24MEDIATOR ARCHITECTURE
User Interface
Mediator (Query/Search/ Retrieval/Result)
Wrapper
Wrapper
. . .
25What to Model
- Structure of Web sites
- Internal structure of web pages
- Contents of web sites in finer granularities
26Data Representation of Web Data
- Graph Data Models
- Semistructured Data Models (also graph based)
27Graph Data Model
- Labeled graph data model where nodes represent web pages and arcs represent links between pages.
- Labels on arcs can be viewed as attribute names.
- Regular path expression queries
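One simple way to evaluate a regular path expression is to enumerate bounded-length paths in the labeled graph and match the dot-joined arc labels against a regular expression. The site graph, the function `match_paths` and the depth bound below are assumptions for this sketch; production systems use automata-based evaluation instead of enumeration.

```python
import re

# Hypothetical site: node -> list of (arc label, target node).
graph = {
    "home": [("department", "cs"), ("department", "math")],
    "cs": [("course", "db"), ("faculty", "smith")],
    "math": [("course", "algebra")],
    "db": [("instructor", "smith")],
    "algebra": [],
    "smith": [],
}


def match_paths(graph, start, pattern, max_depth=4):
    """Nodes reachable from start by a path whose dot-joined labels match pattern."""
    regex = re.compile(pattern)
    found = set()

    def walk(node, labels, visited):
        if labels and regex.fullmatch(".".join(labels)):
            found.add(node)
        if len(labels) == max_depth:
            return
        for label, target in graph.get(node, []):
            if target not in visited:  # avoid looping on cycles
                walk(target, labels + [label], visited | {target})

    walk(start, [], {start})
    return sorted(found)


courses = match_paths(graph, "home", r"department\.course")
people = match_paths(graph, "home", r"department(\.\w+)*\.(faculty|instructor)")
```

The second pattern shows why regular expressions over labels are useful: it reaches the same person node through paths of different lengths.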
28Semistructured Data Models
- Irregular data structure; no fixed schema is known and it may be implicit in the data
- Schema may be large and may change frequently
- Schema is descriptive rather than prescriptive: it describes the current state of the data, but violations of the schema are still tolerated
29 - Data is not strongly typed: for different objects the values of the same attribute may be of differing types (heterogeneous sources)
- No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes
- Ability to query the schema: arc variables get bound to labels on arcs, rather than to nodes in the graph
30Graph based Query Languages
- Use graph to model databases
- Support regular path expressions and graph construction in queries.
- Examples
- GraphLog for hypertext queries
- graph query language for OO
31Query Languages for Semi-Structured data
- Use labeled graphs
- Query the schema of data
- Ability to accommodate irregularities in the data, such as missing links etc.
- Examples: Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)
32Comparison of Query Systems
33Types of Query Languages
- First Generation
- Second generation
34First Generation Query Languages
- Combine the content-based queries of search engines with structure-based queries
- Combine conditions on text patterns in documents with graph patterns describing link structures
- Examples
- W3QL (Technion, Israel)
- WebSQL (Toronto), WebLOG (Concordia)
35Second generation languages
- Called web data manipulation languages
- Web pages as atomic objects with properties: they contain or do not contain certain text patterns, and they point to other objects
- Useful for data wrapping, transformation, and restructuring
- Useful for web site transformation and restructuring
36How they Differ?
- Provide access to the structure of the web objects they manipulate - return structure
- Model internal structures of web documents as well as the external links that connect them
- Support references to model hyperlinks, and some support ordered collections of records for more natural data representation
- Ability to create new complex structures as a result of a query
37Examples
38W3QS (WWW Query System) at Technion - Israel
- Content queries
- Structural Queries
- Interfacing with user-written programs and standard UNIX utilities
- Uses existing WWW indexes and search services
- Provides view update facility
39W3QS
- Accessible via any WWW browsers
- API can be used by programs running anywhere on the Internet
- Supports queries on the web structure by specifying a starting page, a search domain and the depth of links.
- File content analysis tools and automatic filling of forms
40File Types
- Strict inner structure files, such as Unix environment files - the semantics of the data is clearly linked to the syntax
- Semi-structured files - text files containing formatting codes, such as LaTeX or HTML files - the formatting codes can be used to analyze their semantic content
- Raw files - no relation between the meaning of the file and its inner structure
41Content Queries
- Queries based on the content of a single node of hypertext
- SQLCOND is used to evaluate boolean expressions
- Example - Node.format = Latex and Node.author = Sanjay
42Structure Queries
- Queries the information conveyed by the hypertext organization itself.
- The result is a set of nodes and links from the hypertext structure that satisfy a given graph pattern (a graph whose nodes and edges are annotated with conditions).
- Components are pattern definition, search engines and form completion
43Structure Query
Node2.author Sanjay
Link1. revdoc
Node1.title Good article
Answer URL http//../myarticles.html
URL http///.tex <Title> Good articles</Title>
\author sanjay <A HREF//..revdoc>
44Search for an article
- Select cp n2/ result
- from n1, l2, n2
- where n1 in importantindexes.url
- Fill n1.form as IN importantindexes.fil with Keyword = sanjay
- SQLCOND (n2.format = Latex) and (n2.author = sanjay)
45Query to search hypertext pattern
- Return all the articles cited in the first chapter of the book. Each chapter includes several pointers to the bibliography, for example
- <A HREF="http://cs/references.html#ref2">Relativity</A> means that the link Relativity leads to the label ref2 in the references.html file.
- In the references.html file the labeled link looks like <A HREF="./relative.tex" name="ref2">relativity, sanjay</A>; this link points to relative.tex
46 - Select cp art/ result
- from ind, l1, chap, l2, ref, l3, art
- where SQLCOND (ind.url http//) AND (chap.url /.chapter-1.html/) AND (l2.HREF /.\13.Name/)
- USING BFS.
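47[Figure: hypertext pattern matched by the query. Nodes shown: the index page (Url http//cs.tech/bookindex.html, with entries INDEX, Chapter 1, Chapter 2, References), the chapter page (Url http///Chapter-1.html, with ref 1, ref 2, ref 3), the references page (ref 1, ref 2, ref 3) and the article (http//relative.tex). Link labels shown: l1, l3.]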
48WebSQL-University of Toronto
- Model web as relational database
- Use two relations Document and Anchor
- Document relation has one tuple for each document
in the web and the anchor relation has one tuple
for each anchor in each document
49WebSQL
- SQL-like query language for extracting information from the web.
- Capable of systematic processing of either all the links in a page, all the pages that can be reached from a given URL through paths that match a pattern, or a combination of both.
- Provides transparent access to index servers
50Document
51Anchor
52 - Give the URLs of documents which contain the same title and keyword(s)
- SELECT d1.url, d2.url
- FROM Document d1 SUCH THAT d1 MENTIONS keyword1,
- Document d2 SUCH THAT d2 MENTIONS keyword1
- WHERE d1.title = d2.title
- AND NOT (d1.url = d2.url)
53Find Labels of all Hyperlinks to Postscript Files
SELECT a.label
FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z"
54 Documents about Databases
SELECT d.url, d.title
FROM Document d SUCH THAT "http://www.OtherDoc.html" ->> d
WHERE d.title CONTAINS "databases"
Note: -> is a path of length one within the same server; => is a path of length one to a different server
55Retrieve all the documents on the same server that are pointed to from the document whose URL is given
- SELECT d.url, d.title
- FROM Document d SUCH THAT
- "http://www.Cs.in" -> d
56Find all broken links in a page
- SELECT a.href
- FROM Anchor a SUCH THAT base = "http://the.document.to.test"
- WHERE protocol(a.href) = "http" AND doc(a.href) = null
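To make the two-relation view concrete, the sketch below loads a tiny hand-written snapshot of Document and Anchor tuples into an in-memory SQLite database and expresses two of the queries above in ordinary SQL. This is only an analogy: in WebSQL the relations are virtual and are populated by fetching pages, and the sample URLs under `papers.example` are invented. Here a link counts as broken when its target has no Document tuple.

```python
import sqlite3

BASE = "http://www.SomeDoc.html"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Document (url TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE Anchor (base TEXT, href TEXT, label TEXT);
""")
# Hypothetical snapshot of a few pages and the anchors found in BASE.
conn.executemany("INSERT INTO Document VALUES (?, ?)", [
    (BASE, "Some document"),
    ("http://www.OtherDoc.html", "Notes on databases"),
    ("http://papers.example/paper1.ps.Z", "Paper one"),
])
conn.executemany("INSERT INTO Anchor VALUES (?, ?, ?)", [
    (BASE, "http://www.OtherDoc.html", "other notes"),
    (BASE, "http://papers.example/paper1.ps.Z", "paper one"),
    (BASE, "http://papers.example/missing.html", "dead link"),
])

# Labels of all hyperlinks to Postscript files.
postscript_labels = [row[0] for row in conn.execute(
    "SELECT a.label FROM Anchor a WHERE a.base = ? AND a.href LIKE '%.ps.Z%'",
    (BASE,),
)]

# Broken links: http anchors whose target has no Document tuple.
broken_links = [row[0] for row in conn.execute(
    """SELECT a.href FROM Anchor a
       LEFT JOIN Document d ON d.url = a.href
       WHERE a.base = ? AND a.href LIKE 'http:%' AND d.url IS NULL
       ORDER BY a.href""",
    (BASE,),
)]
```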
57Web OQL (University of Toronto)
- Provides a framework that supports a large class of data restructuring operations.
- Simple semistructured data model for documents and record-based data
- OQL-like syntax and regular expressions
- Serves as a two-way bridge between databases and the Web.
58DATA MODEL
- Hypertrees are ordered arc-labeled trees with two types of arcs: internal and external
- Internal arcs represent structured objects
- External arcs represent references (hyperlinks) among objects.
- Records as labels on the arcs
- Sets of related hypertrees as a Web
59ARCHITECTURE
- Wrappers map all data sources to trees
- The mapping can be done all at once or on
demand
60Example
- Extract from cspapers (paper database) the title and URL of the full version of papers by Smith
- select y.title, y.URL
- from x in cspapers, y in x
- where y.authors ~ "Smith"
61Web Creation
- Create a new page for each research group (using the group name as URL). Each page contains the publications of the corresponding group.
- select x as x.group from x in cspapers
- General form: select q1 as s1, q2 as s2, ..., qm as sm
- where the q's are queries and each s is either a string query or the keyword schema. The "as" clauses create the URLs s1, ..., sm, which are assigned to the new pages resulting from each query.
62ARANEUS
- Data model called ADM for Web documents - nested web objects, page schemas
- Several languages for wrapping, querying, creating and updating web sites - object algebra
- Methods and techniques for Web site design and implementation
- Presentation in SIGMOD99
- Software is available at their home site
63- Wrappers - map logical access to attribute values in a page at the ADM level to physical access to text in the HTML source, using EDITOR
- ULIXES - SQL-like query language
- PENELOPE - manipulation language
- Site integration, semantic heterogeneities
- Materialized views
- http://poincare.dia.uniroma3.it:8080/Araneous
64Lore - motivation
- The data may be irregular and thus not conform to a rigid schema.
- The relational data model has null values, and OO models have inheritance and complex objects. Both have difficulties in designing schemas to incorporate irregular data.
- It may be difficult to decide in advance on a single, correct schema. The structure of the data may evolve rapidly, data elements may change types, or data not conforming to the previous structure may be added.
65- Thus, there is a need for management of semi-structured data!
- The Lore system manages semi-structured data. The data managed by Lore is not confined to a schema and it may be irregular or incomplete.
- OEM (Object Exchange Model) is Lore's data model: a graph-based, self-describing object instance model where nodes are objects, edges are labeled with attribute names and leaf nodes have atomic values
- Lore is a lightweight object repository and Lorel is Lore's query language.
66Object Exchange Model - OEM
- Motivation - information exchange and extraction
- Why a new data model? It is not a new model.
- Each value exchanged is given an explicit label.
- Object <temp-in-Fahrenheit, integer, 80>: temp-in-Fahrenheit is the label. Each object is self-describing, with a label, type and value.
- <set-of-temps, set, {cmpnt1, cmpnt2}>
- cmpnt1 is <temp-in-Fahrenheit, integer, 80>
- cmpnt2 is <temp-in-Celsius, integer, 20>
67Labels
- Play two roles
- identifying an object (component)
- identifying the meaning of an object (component)
<person-record, set, {cmpnt1, cmpnt2, cmpnt3}>
cmpnt1 is <person-name, string, Fred>
cmpnt2 is <office-num-in-bldg-5, integer, 333>
cmpnt3 is <department, string, toy>
- person-name both identifies cmpnt1 and conveys its meaning.
- In relational data this corresponds to .
68Labels - Issues
- What does the label mean?
- Database of labels
- Ontology of labels - within each source
- Labels are relative (more specific) to the source of the data object.
- Similar labels from different sources need to be resolved.
- Labels provide the flexibility in representing
object structure
69Self-describing data models
- Have been in existence for a long time. Why the additional interest now?
- To use the nature of the self-describing data model for information exchange, and to extend the model to include object nesting.
- To provide an appropriate object request language (query facility)
70OEM - Specification
- Each object in OEM has the following structure
- Label: a variable-length character string describing what the object represents.
- Type: the data type of the object's value. Each type is either an atom type or the type set.
- Value: a variable-length value of the object.
- Object-ID: a unique variable-length identifier for the object, or null.
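The four-part structure can be written down as a small Python data class. The class name `OEMObject`, the field names and the helper `describe` are choices made for this sketch; OEM itself prescribes only the label, type, value and object-ID parts. The sample objects are the temperature objects used earlier in these slides.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class OEMObject:
    label: str                      # what the object represents
    type: str                       # "set" or an atom type such as "integer"
    value: Union[int, str, list]    # a list of OEMObject when type is "set"
    oid: Optional[str] = None       # object-ID, or None for null


temp_f = OEMObject("temp-in-Fahrenheit", "integer", 80)
temp_c = OEMObject("temp-in-Celsius", "integer", 20)
temps = OEMObject("set-of-temps", "set", [temp_f, temp_c])


def describe(obj, indent=0):
    """Render an object and its components, one line each, nested by indentation."""
    pad = "  " * indent
    if obj.type == "set":
        lines = [f"{pad}<{obj.label}, set>"]
        for component in obj.value:
            lines.extend(describe(component, indent + 1))
        return lines
    return [f"{pad}<{obj.label}, {obj.type}, {obj.value}>"]


lines = describe(temps)
```

Because every object carries its own label and type, a client can walk the structure without knowing a schema in advance.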
71OEM - Summary
- OEM is an information exchange model. It does not
specify how objects are stored at source.
- OEM does specify how objects are received at a
client, but after objects are received they can
be stored in any way the client likes.
- Each source has a distinguished object with
lexical identifier root.
- Note the schema-less nature of OEM is
particularly useful when a client does not know
in advance the labels or structure of OEM objects.
72Example
- <biblio, set, {doc1, doc2, ..., docn}>
- doc1 is <doc, set, {auths1, topic1, call-no1}>
- auths1 is <auth-set, set, {auth11}>
- auth11 is <auth-ln, string, Ullman>
- topic1 is <topic, string, Databases>
- call-no1 is <internal-call-no, integer, 25>
- doc2 is <doc, set, {auths2, topic2, call-no2}>
- auths2 is <auth-set, set, {auth21, auth22, auth23}>
- auth21 is <auth-ln, string, Aho>
- auth22 is <auth-ln, string, Hopcroft>
- auth23 is <auth-ln, string, Ullman>
- topic2 is <topic, string, Algorithms>
- call-no2 is <dewey-decimal, string, BR273>
- docn is <doc, set, {authsn, topicn, call-non}>
- authsn is <auth, string, Crichton>
- topicn is <topic, string, Dinosaurs>
- call-non is <fictional-call-no, integer, 95>
- biblio is the root object.
73OEM - QL
- SELECT Fetch-expression
- FROM Object
- WHERE Condition
- The result of this query is itself an object, with special label answer
- <answer, set, {obj1, obj2, ..., objn}>
- Each returned obji is a component of the object specified in the From clause of the query, where the component is located by the Fetch-expression and satisfies the Condition.
74Path
- The notion of path is used in both the Fetch-expression in the Select clause and the condition in the Where clause.
- A path describes traversals through an object using subobject structure and labels.
- Example: biblio.doc.auth
- Paths are used in the Fetch-expression to specify which components are returned in the answer object.
- Paths are used in the condition to qualify the fetched objects or other (related) components in the same object structure.
75Queries - Simple
- Retrieve the topic of each document for which Ullman is one of the authors
- SELECT biblio.doc.topic
- FROM root
- WHERE biblio.doc.auth-set.auth-ln = "Ullman"
- Intuitively, the query's where clause finds all paths through the subobject structure with the sequence of labels biblio, doc, auth-set, auth-ln such that the object at the end of the path has value Ullman.
- <answer, set, {obj1, obj2}>
- obj1 is <topic, string, Databases>
- obj2 is <topic, string, Algorithms>
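The evaluation of this query can be imitated in Python over nested (label, type, value) triples. This is a simplified sketch, not the OEM-QL implementation: the functions `follow` and `select_where` are invented here, and the caller states explicitly which path prefix (biblio.doc) is shared by the fetch expression and the condition.

```python
# OEM objects as (label, type, value) triples; a set's value is a list of triples.
doc1 = ("doc", "set", [
    ("auth-set", "set", [("auth-ln", "string", "Ullman")]),
    ("topic", "string", "Databases"),
    ("internal-call-no", "integer", 25),
])
doc2 = ("doc", "set", [
    ("auth-set", "set", [
        ("auth-ln", "string", "Aho"),
        ("auth-ln", "string", "Hopcroft"),
        ("auth-ln", "string", "Ullman"),
    ]),
    ("topic", "string", "Algorithms"),
    ("dewey-decimal", "string", "BR273"),
])
biblio = ("biblio", "set", [doc1, doc2])


def follow(obj, labels):
    """Yield the objects reached by the label sequence; labels[0] must match obj itself."""
    label, kind, value = obj
    if label != labels[0]:
        return
    if len(labels) == 1:
        yield obj
    elif kind == "set":
        for child in value:
            yield from follow(child, labels[1:])


def select_where(root, shared, fetch, condition, wanted):
    """Fetch shared.fetch for every object at `shared` whose shared.condition equals wanted."""
    components = []
    for anchor in follow(root, shared):
        tail = [shared[-1]]
        if any(obj[2] == wanted for obj in follow(anchor, tail + condition)):
            components.extend(follow(anchor, tail + fetch))
    return ("answer", "set", components)


answer = select_where(biblio, ["biblio", "doc"], ["topic"],
                      ["auth-set", "auth-ln"], "Ullman")
```

The result is itself a triple with label answer, mirroring the answer object described above.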
76Queries - wild-cards
- Retrieve all documents with internal call number
- SELECT biblio.?.topic
- FROM root
- WHERE biblio.?.internal-call-no
- The ? label matches any label. For this query, the doc labels can be replaced by any other strings and the query would produce the same result. By convention, two occurrences of ? in the same query must match the same label unless variables are used.
- <answer, set, {obj1}>
- obj1 is <topic, string, Databases>
77Queries - wild-paths
- Retrieve all documents with internal call number
- SELECT *.topic
- FROM root
- WHERE *.internal-call-no
- The symbol * matches any path of length one or more. The use of * followed by a single label is a convenient and common way to locate objects with a certain label in a complex structure. As with ?, two occurrences of * in the same query must match the same sequence of labels, unless variables are used.
- <answer, set, {obj1}>
- obj1 is <topic, string, Databases>
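A minimal matcher for label paths with the two wildcards might look as follows. It handles a single path pattern only and does not enforce the convention that repeated wildcards in one query bind to the same labels; the function name `find` and the list-based pattern format are assumptions of this sketch.

```python
# OEM objects as (label, type, value) triples; a set's value is a list of triples.
biblio = ("biblio", "set", [
    ("doc", "set", [
        ("auth-set", "set", [("auth-ln", "string", "Ullman")]),
        ("topic", "string", "Databases"),
        ("internal-call-no", "integer", 25),
    ]),
    ("doc", "set", [
        ("auth-set", "set", [
            ("auth-ln", "string", "Aho"),
            ("auth-ln", "string", "Hopcroft"),
            ("auth-ln", "string", "Ullman"),
        ]),
        ("topic", "string", "Algorithms"),
        ("dewey-decimal", "string", "BR273"),
    ]),
])


def find(obj, pattern):
    """Yield objects whose label path, starting at obj, matches pattern.

    "?" matches exactly one label; "*" matches a path of one or more labels.
    """
    label, kind, value = obj
    head, rest = pattern[0], pattern[1:]
    children = value if kind == "set" else []
    if head == "*":
        if not rest:
            yield obj
        for child in children:
            if rest:
                yield from find(child, rest)      # "*" stops at this label
            yield from find(child, pattern)       # "*" keeps consuming labels
    elif head == "?" or head == label:
        if not rest:
            yield obj
        else:
            for child in children:
                yield from find(child, rest)


topics = [o[2] for o in find(biblio, ["biblio", "?", "topic"])]
call_numbers = [o[2] for o in find(biblio, ["*", "internal-call-no"])]
last_names = [o[2] for o in find(biblio, ["*", "auth-ln"])]
```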
78Queries - variables
- Retrieve each document for which both Hopcroft and Aho are co-authors
- SELECT biblio.doc
- FROM root
- WHERE biblio.doc.auth-set.auth-ln(a1) = "Aho" and
- biblio.doc.auth-set.auth-ln(a2) = "Hopcroft"
- Here, the query finds all the paths with structure biblio, doc, auth-set, and with two distinct path completions with label auth-ln and values Aho and Hopcroft
- <answer, set, {obj1}>
- obj1 is the complete doc2
79An OEM Database
[Figure: an OEM database drawn as a graph rooted at DBGroup, with objects numbered 1 to 20. Edge labels: Member, Project, Name, Age, Office, Building, Room. Atomic values shown: Clark, Smith, 46, Gates 252, Lore, Tsimmis, Jones, 28, CIS, 411, CIS, 252.]
80Lorel Queries - Simple Path Expression
- Retrieve the offices of members with age greater than 30 years
- Query: SELECT DBGroup.Member.Office
- WHERE DBGroup.Member.Age > 30
- Result: Office "Gates 252"
- Office
- Building "CIS"
- Room "411"
81Queries - General Path Expression
- Query: SELECT DBGroup.Member.Name
- WHERE DBGroup.Member.Office(.Room%|.Cubicle)? like "%252"
- Result: Name "Jones"
- Name "Smith"
- Room% matches all labels starting with Room, like Room68. | stands for disjunction. ? indicates that the label pattern is optional. like "%252" specifies that the data value should end with the string 252.
82Queries - SubQueries
Retrieve Lore project members who work on other projects
Query: SELECT M.Name, (SELECT M.Project.Title WHERE M.Project.Title != "Lore")
FROM DBGroup.Member M
WHERE M.Project.Title = "Lore"
Result: Member
Name "Jones"
Title "Tsimmis"
83Lore - Summary
- Lore facilitates queries and updates on semi-structured databases
- More work has been done on optimization using data guides (VLDB97).
- The system is up and running: http://WWW-DB.Stanford.EDU/lore/demo/
- How is this related to WWW?
- XML-QL and related work provides the answer.
84Extraction and Integration
- OEM and subsequently LORE(L) can be used for extracting information from multiple information sources.
- OEM helps navigate through unknown objects by
- SELECT ?
- FROM root
- This helps browsing and schema discovery
- Efficient implementations are possible using a partial fetch mechanism.
- Push and pull information delivery systems are possible.
- How is this different from WebIR?
85STRUDEL
- Web Site Management System
- web Site from multiple sources
- STruQL - based on OEM, graphs, regular expressions, result as graph
- Example - return all the postscript papers from homepages
- where homepages(p), p -> paper -> q, ispostscript(q)
- collect postscriptpages(p)
- General form: where C1,...,Ck create N1,...,Nn link L1,...,Lp collect G1,...,Gq
86Complex Constructors
Supported by Strudel, a Website Management System with StruQL as query language
where Biblio(X), X -> paper -> P, P -> author -> A, P -> title -> T, P -> year -> Y
create Root(), HomePage(A), YearPage(A,Y), PubPage(P)
link Root() -> person -> HomePage(A),
HomePage(A) -> yearentry -> YearPage(A,Y),
YearPage(A,Y) -> publication -> PubPage(P),
PubPage(P) -> author -> HomePage(A),
PubPage(P) -> title -> T
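The effect of the create and link clauses can be approximated in Python: each constructor such as HomePage(A) yields exactly one page per distinct argument, which a set of tuples models directly. The input records and the function `build_site` are assumptions for this sketch and are not Strudel code.

```python
# Hypothetical bibliography input, one record per paper.
papers = [
    {"id": "p1", "authors": ["Date"],
     "title": "An Introduction to DB Systems", "year": 1995},
    {"id": "p2", "authors": ["Date", "Darwen"],
     "title": "Foundations for OR Databases", "year": 1995},
]


def build_site(papers):
    """Return (nodes, links) of the site graph; links are (source, label, target)."""
    root = ("Root",)
    nodes = {root}
    links = set()
    for paper in papers:
        pub = ("PubPage", paper["id"])
        nodes.add(pub)
        links.add((pub, "title", paper["title"]))
        for author in paper["authors"]:
            home = ("HomePage", author)
            year = ("YearPage", author, paper["year"])
            nodes.update([home, year])   # a set keeps one page per distinct argument
            links.add((root, "person", home))
            links.add((home, "yearentry", year))
            links.add((year, "publication", pub))
            links.add((pub, "author", home))
    return nodes, links


nodes, links = build_site(papers)
```

Although Date appears in two papers, only one HomePage and one YearPage node are created for that author and year.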
87WebDB
- View WWW as multimedia documents in the form of web pages
- WQL supports selection, aggregation, sorting, summary, grouping
- projection on title, URL, keywords, tables, forms, images etc.
88Some More Results
- UnQL - AT&T
- AKIRA- Pennstate
- NoDose - SIGMOD98
89HTML to XML
- HTML documents
- Emerging Web Standards - XML
- XML is good for data interchange across platforms, enterprise wide
- conversion of HTML to XML - IBM, Microsoft
90XML - Motivation
- In HTML, both the tag semantics and the tag set are fixed. There is limited and strict interpretation of tags.
- HTML is widely successful in disseminating documents across the Internet.
- Though data can be disseminated through HTML, its extraction is painful and laborious.
- EDI has been a predominant mode of exchanging data among businesses, but it has a very rigid format that requires highly customized applications.
91XML - Introduction
- XML aims to combine the ease of authoring HTML documents with the ease of data exchange that is possible with EDI.
- Tags are used to mark up documents.
- XML is a meta-language for describing markup languages.
- XML provides a facility to define tags and the structural relationships between them.
- No pre-defined tag set implies no preconceived semantics; the semantics of an XML document will be defined by the applications that process it or by style sheets (XSL).
92XML - Goals
- Straightforward to use over internet
- Support a wide variety of applications: authoring, browsing, content analysis, etc.
- Easy to write programs that process XML documents and validate them.
- XML documents must be human-legible and reasonably clear.
- Design of XML shall be formal and concise - expressed in EBNF (Extended Backus-Naur Form) - amenable to modern compiler tools and techniques.
93XML-features
- Some structure - not rigid
- Extensibility - User defined tags
- nested elements
- validation - documents may specify their own grammar
- DTD (Document Type Definition) - schema exists with data as tag names
- Application - EDI - extraction, conversion, transformation, integration
- can be modeled using DOM
94More terminology
- RDF - Resource Description Framework - a method to describe metadata for XML documents
- XSL - Extensible Stylesheet Language - language for transforming and formatting XML.
- Transformation languages - XSLT, XPath, XPointer
95Example-HTML
- Print - Sanjay Madria
- Web Warehouse Tutorial, ADBIS99
- HTML
- <H2> Sanjay Madria </H2>
- <I> Web Warehouse Tutorial, ADBIS99</I>
- Very difficult to understand, structure is hidden, describes only appearance
96XML
- <Ref>
- <Speaker> <Firstname>Sanjay</Firstname>
- <Lastname>Madria</Lastname>
- </Speaker>
- <Title>Web Warehouse Tutorial</Title>
- <Conference>ADBIS99</Conference>
- <empty/>
- </Ref>
- another format
- <Firstname Value="Sanjay"/>
97XML Data
- <book>
- <title> database systems</title>
- <author> John <lastname> Korth</lastname></author>
- <price currency="USD"> 5.87</price>
- </book>
- DTD
- <!ELEMENT book (title, author, price)>
- <!ELEMENT title (#PCDATA)>
- <!ELEMENT author (#PCDATA | lastname)*>
98- <tr> <td width="20" valign="top"> Firma Karl-Heinz Rosowski </td>
- <td width="20" valign="top"> Maikstraße 14 </td>
- <td width="20" valign="top"> 22041 Hamburg </td>
- <td width="20" valign="top"> 721 99 64 </td>
- <td width="20" valign="top"> 21110111 </td>
</tr>
HTML Version
- <?xml version="1.0"?>
- <Addresses>
- <Address id="12359">
- <Name>Firma Karl-Heinz Rosowski</Name>
- <Street>Maikstraße 14</Street>
- <ZIP>22041</ZIP>
- <City>Hamburg</City>
- <Tel>721 99 64</Tel>
- <Fax>21110111</Fax> <Email/>
- </Address>
- </Addresses>
XML Version
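Because the XML version names every field, a program can read the address without guessing at table positions. The sketch below uses Python's standard `xml.etree.ElementTree` module; the function `address_records` and the choice of a dictionary per address are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

ADDRESSES_XML = """<?xml version="1.0"?>
<Addresses>
  <Address id="12359">
    <Name>Firma Karl-Heinz Rosowski</Name>
    <Street>Maikstraße 14</Street>
    <ZIP>22041</ZIP>
    <City>Hamburg</City>
    <Tel>721 99 64</Tel>
    <Fax>21110111</Fax> <Email/>
  </Address>
</Addresses>"""


def address_records(xml_text):
    """Return one dict per <Address>, keyed by child element name plus the id attribute."""
    root = ET.fromstring(xml_text)
    records = []
    for address in root.findall("Address"):
        record = {"id": address.get("id")}
        for child in address:
            record[child.tag] = child.text or ""   # empty elements become ""
        records.append(record)
    return records


records = address_records(ADDRESSES_XML)
```

In the HTML version the ZIP code and city share one table cell; here they arrive as separate, named fields.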
99XML - Document - Continued
- <?xml version="1.0"?> is the XML declaration.
- Elements: the most common form of markup, <element> ... </element>. For example <name>Jack Lemon</name>
- Attributes are name-value pairs that occur inside start-tags after the element name. For example <Address id="12359"> attaches value 12359 to attribute id of the Address element.
- Entity references handle special characters of XML, such as < (written &lt;), in XML documents.
100- Comments: <!-- this is a comment -->
- CDATA sections: a CDATA (string of characters) section instructs the parser to ignore most markup characters. For example, source code: <![CDATA[ *p = &q; b = (i <= 3); ]]>. Between <![CDATA[ and ]]>, all character data is passed to the application without interpretation.
101XML - DTD - Element Type Declarations
- Element type declarations identify the names of elements and the nature of their content. A typical element type declaration looks like
- <!ELEMENT Address (Name, Street, ZIP?, City, Tel, Fax*, Email?)>
- Address is the element name, and (Name, Street, ZIP?, City, Tel, Fax*, Email?) is the content model. Every address must contain Name, Street, City and Tel. ZIP and Email are optional, and there can be zero or more Fax numbers.
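One way to see what such a content model means is to translate it into a regular expression over the sequence of child element names. The sketch below handles only flat sequence models with the ?, * and + occurrence indicators, and it assumes the Address model is written with Fax* to express "zero or more Fax numbers". The function names are invented; a validating XML parser would normally do this work.

```python
import re


def content_model_to_regex(model):
    """Translate a flat sequence model such as '(Name, ZIP?, Fax*)' into a regex."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        if item[-1] in "?*+":
            name, occurrence = item[:-1], item[-1]
        else:
            name, occurrence = item, ""
        # Each child name is followed by a space so that names cannot run together.
        parts.append(f"(?:{re.escape(name)} ){occurrence}")
    return re.compile("".join(parts))


def is_valid(children, model):
    """True when the sequence of child element names satisfies the content model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


ADDRESS_MODEL = "(Name, Street, ZIP?, City, Tel, Fax*, Email?)"

full = is_valid(["Name", "Street", "ZIP", "City", "Tel", "Fax", "Email"], ADDRESS_MODEL)
minimal = is_valid(["Name", "Street", "City", "Tel"], ADDRESS_MODEL)
two_faxes = is_valid(["Name", "Street", "City", "Tel", "Fax", "Fax"], ADDRESS_MODEL)
no_street = is_valid(["Name", "City", "Tel"], ADDRESS_MODEL)
wrong_order = is_valid(["Street", "Name", "City", "Tel"], ADDRESS_MODEL)
```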
102- The declarations for Name, Street, ZIP, ... must also be given. For example
- <!ELEMENT Name (#PCDATA)>
- Attribute list declarations identify which elements may have attributes, what values the attributes may hold, and what value is the default. Attribute values appear only within start-tags and empty-element tags.
- <Address id="12359">
103XML - Summary
- HTML describes presentation
- XML describes content
- XML vs. HTML
- users define new tags
- arbitrary nesting
- validation is possible
104XML and Semi Structural Data Model
- XML data is fundamentally different from relational and object-oriented data.
- XML is not rigidly structured.
- In the relational and OO data models every data instance has a schema which is separate from and independent of the data.
- XML data is self-describing and can naturally model irregularities that cannot be modeled by the relational or OO data models.
105- For example, data items may have missing elements or multiple occurrences of the same element; elements may have atomic values in some data items and structured values in others; and collections of elements can have heterogeneous structure.
- Even XML data that has an associated DTD is self-describing (the schema is always stored with the data) and, except for very restricted forms of DTDs, may have all the irregularities described above.
- XML is an instance of semistructured data.
106XML-QL
- Regular path expression
- pattern matching
- uses edge-labeled graphs
- extract data from existing XML documents and construct new XML documents
- support for ordered and unordered views on XML documents
- simple and declarative
107XML-QL
- The simplest XML-QL queries extract data from an XML document. Consider the following DTD
- <!ELEMENT book (author+, title, publisher)>
- <!ATTLIST book year CDATA>
- <!ELEMENT article (author+, title, year?, (shortversion|longversion))>
- <!ATTLIST article type CDATA>
- <!ELEMENT publisher (name, address)>
- <!ELEMENT author (firstname?, lastname)>
108XML-QL Example Data
<bib>
<book year="1995">
<title> An Introduction to DB Systems </title>
<author> <lastname> Date </lastname></author>
<publisher> <name> Addison-Wesley</name> </publisher>
</book>
<book year="1995">
<title> Foundations for OR Databases </title>
<author> <lastname> Date </lastname></author>
<author> <lastname> Darwen </lastname></author>
<publisher><name> Addison-Wesley</name> </publisher>
</book>
</bib>
109Matching Data Using Patterns
- XML-QL uses element patterns to match data in an XML
document.
- Find all authors of books whose publisher is
Addison-Wesley in the XML document www.a.b.c/bib.xml:
- WHERE <book>
-   <publisher><name>Addison-Wesley</name></publisher>
-   <title> $t </title>
-   <author> $a </author>
- </book> IN "www.a.b.c/bib.xml"
- CONSTRUCT $a
- This query matches every <book> element in the XML document
that has at least one <title> element, one
<author> element, and one <publisher> element
whose <name> is Addison-Wesley. For each such
match, it binds $t and $a to every title and
author pair.
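As a rough illustration of what the pattern match does, here is a minimal Python sketch over the example data. It is not an XML-QL engine: the helper name match_books is hypothetical, the document is inlined rather than fetched from www.a.b.c/bib.xml, and xml.etree.ElementTree is an assumed stand-in for the matching machinery.

```python
import xml.etree.ElementTree as ET

# The example bibliography, inlined so the sketch is self-contained.
BIB = """
<bib>
  <book year="1995">
    <title>An Introduction to DB Systems</title>
    <author><lastname>Date</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="1995">
    <title>Foundations for OR Databases</title>
    <author><lastname>Date</lastname></author>
    <author><lastname>Darwen</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
</bib>
"""

def match_books(root, publisher_name):
    """Yield one (title, author) element pair per binding of $t and $a."""
    for book in root.iter("book"):
        names = [n.text.strip() for n in book.findall("publisher/name") if n.text]
        if publisher_name not in names:
            continue  # the <publisher><name> part of the pattern failed
        for title in book.findall("title"):
            for author in book.findall("author"):
                yield title, author

root = ET.fromstring(BIB)
# CONSTRUCT $a: keep only the author half of each binding.
authors = [a.findtext("lastname") for _, a in match_books(root, "Addison-Wesley")]
print(authors)
```

The nested loops make the "every title and author pair" semantics explicit: a book with two authors contributes two bindings.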
110XML-QL Constructing XML Data
- Often we would like to format the result.
- Find all authors and titles of books whose
publisher is Addison-Wesley in the XML document
www.a.b.c/bib.xml:
- WHERE <book>
-   <publisher><name>Addison-Wesley</></>
-   <title> $t </title>
-   <author> $a </author>
- </book> IN "www.a.b.c/bib.xml"
- CONSTRUCT <result>
-   <author> $a </>
-   <title> $t </>
- </>
111Constructing XML Data -cont.
Result of the query:
<result>
  <author> <lastname> Date </lastname> </author>
  <title> An Introduction to DB Systems </title>
</result>
<result>
  <author> <lastname> Date </lastname> </author>
  <title> Foundations for OR Databases </title>
</result>
<result>
  <author> <lastname> Darwen </lastname> </author>
  <title> Foundations for OR Databases </title>
</result>
One result for each author, duplicating title information.
112XML-QL Nested Queries.
WHERE <book>
        <title> $t </>
        <publisher><name>Addison-Wesley</></>
      </> CONTENT_AS $p IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
            <title> $t </>
            WHERE <author> $a </> IN $p
            CONSTRUCT <author> $a </>
          </>
Result:
<result>
  <author> <lastname> Date </lastname> </author>
  <title> An Introduction to DB Systems </title>
</result>
<result>
  <author> <lastname> Date </lastname> </author>
  <author> <lastname> Darwen </lastname> </author>
  <title> Foundations for OR Databases </title>
</result>
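The effect of nesting, one result per title with all of its authors grouped inside, can be sketched in Python as follows. This is an illustration under assumptions: the function name group_by_book is hypothetical, the data is inlined, and the inner loop plays the role of the inner WHERE over the book contents bound to $p.

```python
import copy
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book year="1995">
    <title>An Introduction to DB Systems</title>
    <author><lastname>Date</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="1995">
    <title>Foundations for OR Databases</title>
    <author><lastname>Date</lastname></author>
    <author><lastname>Darwen</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
</bib>
"""

def group_by_book(root, publisher_name):
    """Build one <result> per matching title, with that book's authors nested in it."""
    results = []
    for book in root.iter("book"):
        if book.findtext("publisher/name", default="").strip() != publisher_name:
            continue
        for title in book.findall("title"):          # outer binding: $t
            result = ET.Element("result")
            result.append(copy.deepcopy(title))
            for author in book.findall("author"):    # inner query over $p
                result.append(copy.deepcopy(author))
            results.append(result)
    return results

results = group_by_book(ET.fromstring(BIB), "Addison-Wesley")
summary = [(r.findtext("title"), [a.findtext("lastname") for a in r.findall("author")])
           for r in results]
print(summary)
```

Compared with the flat query, the title appears once per book instead of once per author.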
113XML-QL Join Queries
XML-QL queries can express joins by matching two
or more elements that contain the same value. Find
all articles that have at least one author who
has written a book since 1995.
WHERE <article>
        <author>
          <firstname> $f </>   // firstname $f
          <lastname> $l </>    // lastname $l
        </>
      </> CONTENT_AS $a IN "www.a.b.c/bib.xml",
      <book year=$y>
        <author>
          <firstname> $f </>   // join on same firstname $f
          <lastname> $l </>    // join on same lastname $l
        </>
      </> IN "www.a.b.c/bib.xml",
      $y > 1995
CONSTRUCT <article> $a </>
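A value join of this kind can be sketched in Python by comparing (firstname, lastname) pairs. Everything below is illustrative: the records are invented, and author_keys and articles_by_recent_book_authors are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: one recent book, one older book, two articles.
DOC = """
<bib>
  <book year="1998">
    <title>Book One</title>
    <author><firstname>Ann</firstname><lastname>Lee</lastname></author>
  </book>
  <book year="1990">
    <title>Book Two</title>
    <author><firstname>Bob</firstname><lastname>Ray</lastname></author>
  </book>
  <article>
    <author><firstname>Ann</firstname><lastname>Lee</lastname></author>
    <title>Article A</title>
  </article>
  <article>
    <author><firstname>Bob</firstname><lastname>Ray</lastname></author>
    <title>Article B</title>
  </article>
</bib>
"""

def author_keys(element):
    """Return the set of (firstname, lastname) pairs among an element's authors."""
    keys = set()
    for author in element.findall("author"):
        first = author.findtext("firstname")
        last = author.findtext("lastname")
        # Like the pattern, require both names to be present.
        if first is not None and last is not None:
            keys.add((first.strip(), last.strip()))
    return keys

def articles_by_recent_book_authors(root, after_year):
    """Articles sharing at least one author with a book whose year > after_year."""
    recent = set()
    for book in root.iter("book"):
        year = book.get("year")
        if year is not None and int(year) > after_year:
            recent |= author_keys(book)
    return [art for art in root.iter("article") if author_keys(art) & recent]

root = ET.fromstring(DOC)
titles = [a.findtext("title") for a in articles_by_recent_book_authors(root, 1995)]
print(titles)
```

Collecting the book-side keys into a set first turns the join into a hash join rather than a nested loop over all article and book pairs.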
114XML-QL Data Model for XML
- An XML graph is a graph G in which each node is represented by
a unique string called an object identifier (OID),
G's edges are labeled with element tags, G's
nodes are labeled with sets of attribute-value
pairs, G's leaves are labeled with one string
value, and G has a distinguished node called the
root.
115XML-QL Data Model for XML
- The model allows several edges between the same
two nodes, with the following restrictions:
- between any two nodes there can be at most one
edge with a given label
- a node cannot have two leaf children with the
same label and same string value
- XML graphs are not only derived from XML
documents, but are also generated by queries.
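The graph model and its two restrictions can be sketched as a small Python data structure. The class and method names (Node, XMLGraph, add_node, add_edge) are hypothetical, and treating any node that carries a string value as a leaf is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    oid: str                                   # unique object identifier
    attributes: Dict[str, str] = field(default_factory=dict)
    value: Optional[str] = None                # string value (leaves only)


class XMLGraph:
    def __init__(self, root_oid: str) -> None:
        self.root = root_oid                   # distinguished root node
        self.nodes: Dict[str, Node] = {root_oid: Node(root_oid)}
        self.edges: List[Tuple[str, str, str]] = []   # (source, label, target)

    def add_node(self, oid, attributes=None, value=None) -> None:
        if oid in self.nodes:
            raise ValueError(f"duplicate OID {oid!r}")
        self.nodes[oid] = Node(oid, dict(attributes or {}), value)

    def add_edge(self, src, label, dst) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already be nodes of the graph")
        if self.nodes[src].value is not None:
            raise ValueError("a leaf carrying a string value cannot have children")
        # Restriction 1: at most one edge with a given label between two nodes.
        if (src, label, dst) in self.edges:
            raise ValueError("duplicate labeled edge between the same two nodes")
        # Restriction 2: no two leaf children with the same label and value.
        leaf_value = self.nodes[dst].value
        if leaf_value is not None:
            for s, l, d in self.edges:
                if s == src and l == label and self.nodes[d].value == leaf_value:
                    raise ValueError("two leaf children with the same label and value")
        self.edges.append((src, label, dst))


graph = XMLGraph("root")
graph.add_node("b1", attributes={"year": "1995"})
graph.add_node("t1", value="Foundations for OR Databases")
graph.add_edge("root", "book", "b1")
graph.add_edge("b1", "title", "t1")
graph.add_edge("root", "favorite", "b1")   # allowed: same two nodes, different label
print(len(graph.edges))
```

Storing edges as (source, label, target) triples keeps both restrictions easy to check, at the cost of a linear scan per insertion.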
116XML- Element Identity, Ids, and IDREFS
- For element sharing, XML reserves an attribute of
type ID, which allows a unique key to be
associated with an element.
- An attribute of type IDREF allows an element to
refer to another element with the designated key,
and one of type IDREFS may refer to multiple
elements.
117- <!ATTLIST person ID ID #REQUIRED>
- <!ATTLIST article author IDREFS #IMPLIED>
- <person ID="o123">
-   <firstname>John</firstname>
-   <lastname>Smith</lastname>
- </person>
- <person ID="o234">
- . . .
- </person>
- <article author="o123 o234">
-   <title> ... </title>
-   <year> 1995 </year>
- </article>
118XML- Element Identity, Ids, and IDREFS
119The following query produces all lastname, title
pairs by joining the article element's author
attribute value (of type IDREFS) with the person
element's ID attribute value.
WHERE <article author=$i>
        <title> </> ELEMENT_AS $t
      </>,
      <person ID=$i>
        <lastname> </> ELEMENT_AS $l
      </>
CONSTRUCT <result> $t $l </>
The idiom <title></> ELEMENT_AS $t binds $t to a <title>
element with arbitrary contents. The element
expression <title/> matches a <title> element
with empty contents.
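The ID/IDREFS join can be sketched in Python by indexing person elements on their ID and then resolving each reference in the article's author attribute. The second person's name and the article title are placeholders invented for this sketch (the slides elide them), and lastname_title_pairs is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# "Jane Doe" and "Sample Title" are made-up placeholders.
DOC = """
<db>
  <person ID="o123"><firstname>John</firstname><lastname>Smith</lastname></person>
  <person ID="o234"><firstname>Jane</firstname><lastname>Doe</lastname></person>
  <article author="o123 o234"><title>Sample Title</title><year>1995</year></article>
</db>
"""

def lastname_title_pairs(root):
    """Join each article's IDREFS 'author' attribute against person IDs."""
    # Index persons by ID so each reference resolves with one dictionary lookup.
    people = {p.get("ID"): p for p in root.iter("person") if p.get("ID") is not None}
    pairs = []
    for article in root.iter("article"):
        # An IDREFS value is a whitespace-separated list of IDs.
        for ref in article.get("author", "").split():
            person = people.get(ref)
            if person is None:
                continue  # dangling reference: no join partner
            for title in article.findall("title"):
                for lastname in person.findall("lastname"):
                    pairs.append((title.text, lastname.text))
    return pairs

pairs = lastname_title_pairs(ET.fromstring(DOC))
print(pairs)
```

Splitting the attribute value on whitespace is what lets a single article element join with several person elements.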
120XML-QL- Advanced Examples
- Tag variables
- Regular path expressions
- Transforming XML data (from one DTD to another)
- Integrating data from different XML sources
- Embedding queries in data
- For XML-QL, see http://www.w3.org/TR/NOTE-xml-ql
121Summary
- Even before you blink your eye: a lot of work has
gone into web data models and query languages
- Some problems are addressed
- Semi-structured
- semi-structured data model based query languages
- schema inference from the semi-structured data model
- efficient processing of queries on
semi-structured data
- efficient indexing and storage structures
- integration with XML
- Traditional
- WebSQL/WebOQL
- Web Warehousing
- Which way will you go?
122Further issues
- Distributed query processing
- Continuous result processing with push/pull
result replenishment
- Labels, labels everywhere; with XML, more labels
everywhere: how are the semantics of queries across
multiple information sources handled?
- IR gives too many relevant/irrelevant results
- Query processing requires some schema knowledge
that is difficult to handle across multiple
sources
- Can these two be bridged? Cooperative solutions.
- Next: Agents, agents everywhere. What are they
doing? Will it work, or will it be a fad?