Web Data Management - PowerPoint PPT Presentation

Slides: 123
Provided by: skm8
Learn more at: https://web.mst.edu

1
Web Data Management
  • Sanjay Kumar Madria
  • Department of Computer Science
  • University of Missouri-Rolla
  • madrias_at_umr.edu

2
WWW
  • Huge, widely distributed, heterogeneous
    collection of semi-structured multimedia
    documents in the form of web pages connected via
    hyperlinks.

3
World Wide Web
  • Web is fast growing
  • More business organizations putting information
    in the Web
  • Business on the highway
  • Myriad of raw data to be processed for information

4
As the WWW grows, the more chaotic it becomes
  • Web is fast growing, distributed,
    non-administered global information resource
  • WWW allows access to text, image, video, sound
    and graphic data
  • More business organizations creating web servers
  • More chaotic environment to locate information of
    interest
  • Lost in hyperspace syndrome

5
Characteristics of WWW
  • WWW is a set of directed graphs
  • Data in the WWW has a heterogeneous nature,
    self-describing and schema less
  • Unstructured information, deeply nested
  • No central authority to manage information
  • Dynamic versus static information
  • Web information discoveries - search engines

6
Web is Growing!
  • In 1994, WWW grew by 1758 !!
  • June 1993 - 130
  • June 1994 - 1265
  • Dec. 1994 - 11,576
  • April 1995 - 15,768
  • July 1995 - 23,000
  • 2000 - !!!!!

7
COM domains are increasing!
  • As of July 1995, 6.64 million host computers on
    the Internet
  • 1.74 million are com domains
  • 1.41 million are edu domains
  • 0.30 million are net
  • 0.27 million are gov
  • 0.22 million are mil
  • 0.20 million are org

8
Top web countries
  • 1. Canada (1) 80
  • 2. US (4) 140
  • 3. Ireland (3) 110
  • 4. Iceland (2) 68
  • 5. UK (14) 336
  • 6. Malta (5) 155
  • 7. Australia (6) 133
  • 8. Singapore (11) 207
  • 9. New Zealand (7) 101
  • 10. Sweden (9) 101
  • 11. Israel (12) 112
  • 12. Cyprus (8) 72
  • 13. Hong Kong (15) 148
  • 14. Norway (10) 64
  • 15. Switzerland (13) 75
  • 16. Denmark (16) 105

9
How users find web sites
  • Indexes and search engines 75
  • UseNet newsgroups 44
  • Cool lists 27
  • New lists 24
  • Listservers 23
  • Print ads 21
  • Word-of-mouth and e-mail 17
  • Linked web advertisement 4

10
Limitations of Search Engines
  • Do not exploit hyperlinks
  • Search is limited to string matching
  • Queries are evaluated on archived data rather
    than up-to-date data; there is no indexing on
    current data
  • Low accuracy
  • Replicated results
  • No further manipulation possible

11
Limitations of Search Engines
  • ERROR 404!
  • No efficient document management
  • Query results cannot be further manipulated
  • No efficient means for knowledge discovery

12
More PROBLEMS
  • specifying/understanding what information is
    wanted
  • the high degree of variability of accessible
    information
  • the variability in conceptual vocabulary or
    ontology used to describe information
  • complexity of querying unstructured data

13
  • complexity of querying structured data
  • uncontrolled nature of web-based information
    content
  • determining which information sources to
    search/query

14
Search Engine Capabilities
  • Selection of language
  • Keywords with disjunction, adjacency, presence,
    absence, ...
  • Word stemming (Hotbot)
  • Similarity search (Excite)
  • Natural language (LycosPro)
  • Restrict by modification date (Hotbot) or range
    of dates (AltaVista)
  • Restrict result types (e.g., must include images)
    (Hotbot)
  • Restrict by geographical source (content or
    domain) (Hotbot)
  • Restrict within various structured regions of a
    document (titles or URLs) (LycosPro) (summary,
    first heading, title, URL) (Opentext)

15
SEARCH RETRIEVAL
  • Search Engines

Search engine     Web covered
Hotbot            34%
AltaVista         28%
Northern Light    20%
Excite            14%
Infoseek          10%
Lycos             3%
  • using several search engines is better than
    using only one
  • Source: Lawrence, S., and Giles, C.L., Searching
    the World Wide Web, Science 280, pp. 98-100,
    1998.

16
Key Objectives
  • Design a suitable data model to represent web
    information
  • Development of web algebra and query language,
    query optimization
  • Maintenance of Web data - view maintenance
  • Development of knowledge discovery and web mining
    tools
  • Web warehouse
  • Data integration, secondary storage, indexes

17
Web Data Representation
  • HTML - Hypertext Markup Language
  • fixed grammar, no regular expressions
  • Simple representation of data
  • good for simple data
  • difficult to extract information
  • SGML - Standard Generalized Markup Language -
    good for publishing deeply structured
    documents
  • XML - Extensible Markup Language - a subset of SGML

18
Terminology
  • HTML - Hypertext Mark-up Language
  • HTTP - Hypertext Transfer Protocol
  • URL - Uniform Resource Locator
  • example - <URL> = <protocol>://<host>/<path>/<filename>#<location>
    where
  • <protocol> is http, ftp, gopher
  • <host> is the internet address
  • <location> is a textual label in the file.

19
  • Links are specified as
  • <A HREF="Destination URL">Anchor Text</A>
  • Destination URL is the URL of the destination
    document and Anchor Text is the text that appears
    as an anchor when displayed.
  • Example
  • <A HREF="http://www.ntu.edu.sg/">Nanyang
    Technological University</A>
  • Absolute and relative URLs
  • <A HREF="AtlanticStates/NYStats.html">New
    York</A> is a relative address
  • <A HREF="http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html">NCSA's Beginner's Guide
    to HTML</A> is an absolute address
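
As a rough illustration of the anchor syntax above, the following Python sketch collects the destination URL and anchor text of each link in a page and resolves relative addresses against the page's own URL. The base URL http://example.org/census/index.html and the names AnchorCollector and extract_links are assumptions made for this example only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from <A HREF=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # destination of the anchor currently open
        self._text = []     # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Relative addresses are resolved against the base URL;
                # absolute addresses are returned unchanged.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    parser = AnchorCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = (
    '<A HREF="AtlanticStates/NYStats.html">New York</A> '
    '<A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>'
)
links = extract_links(page, "http://example.org/census/index.html")
for url, text in links:
    print(url, "|", text)
```

The relative link is reported relative to the directory of the base page, while the absolute link keeps its own host.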

20
World Wide Web
  • Prevalent, persistent and informative
  • HTML documents (soon, XML) created by humans or
    applications.
  • Accessed day in and day out by humans and
    applications.
  • Persistent HTML documents!!!

Can database technology help?
21
Current Research Projects
  • Web Query System
  • W3QS, WebSQL, AKIRA, NetQL, RAW,
  • WebLog, Araneus
  • Semistructured Data Management
  • LOREL, UnQL, WebOQL, Florid
  • Website Management System
  • STRUDEL, Araneus
  • Web Warehouse
  • WHOWEDA

22
Main Tasks
  • Modeling and Querying the Web
  • view web as directed graph
  • content and link based queries
  • example - find the pages that contain the word
    clinton and have a link from a page containing
    the word monica.
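
A minimal sketch of such a combined content and link query, assuming the web is held in memory as a dictionary from page identifiers to page text and outgoing links. The page identifiers, texts and the function name linked_pages are invented for illustration; a real system would work over crawled or indexed pages.

```python
# Hypothetical three-page web: each page has some text and outgoing links.
pages = {
    "p1": {"text": "a story about monica", "links": ["p2", "p3"]},
    "p2": {"text": "clinton press briefing", "links": []},
    "p3": {"text": "weather report", "links": ["p2"]},
}


def linked_pages(pages, source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    result = set()
    for page in pages.values():
        if source_word in page["text"].split():          # content condition on the source
            for target in page["links"]:                 # link condition
                if target in pages and target_word in pages[target]["text"].split():
                    result.add(target)                   # content condition on the target
    return sorted(result)


hits = linked_pages(pages, "monica", "clinton")
print(hits)
```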

23
  • Information Extraction and integration
  • Wrapper - a program that extracts a structured
    representation of the data (a set of tuples) from
    HTML pages.
  • Mediator - data integration software that
    accesses multiple sources through a uniform interface
  • Web Site Construction and Restructuring
  • creating sites
  • modeling the structure of web sites
  • restructuring data
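
The wrapper and mediator roles described above can be sketched as follows. This is only an illustration under strong assumptions: the two HTML layouts, the regular expressions, and the names wrap_table_source, wrap_list_source and Mediator are all hypothetical, and real wrappers need far more robust extraction than a regular expression.

```python
import re


def wrap_table_source(html):
    """Wrapper for a hypothetical source that lists books as table rows."""
    rows = re.findall(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>", html)
    return [{"title": title, "author": author} for title, author in rows]


def wrap_list_source(html):
    """Wrapper for a hypothetical source that lists 'title by author' items."""
    items = re.findall(r"<li>(.*?) by (.*?)</li>", html)
    return [{"title": title, "author": author} for title, author in items]


class Mediator:
    """Answers one query over several sources through a uniform interface."""

    def __init__(self, sources):
        self.sources = sources  # list of (wrapper function, raw page) pairs

    def query(self, author):
        results = []
        for wrapper, page in self.sources:
            # Each wrapper turns its page into tuples with the same fields.
            results.extend(r for r in wrapper(page) if r["author"] == author)
        return results


source_a = ("<table><tr><td>Databases</td><td>Ullman</td></tr>"
            "<tr><td>Dinosaurs</td><td>Crichton</td></tr></table>")
source_b = "<ul><li>Algorithms by Ullman</li></ul>"

mediator = Mediator([(wrap_table_source, source_a), (wrap_list_source, source_b)])
answer = mediator.query("Ullman")
print(answer)
```

The user of the mediator never sees which source, or which page layout, a tuple came from.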

24
MEDIATOR ARCHITECTURE
User Interface
Mediator (Query/Search/ Retrieval/Result)
Wrapper
Wrapper
. . .
25
What to Model
  • Structure of Web sites
  • Internal structure of web pages
  • Contents of web sites in finer granularities

26
Data Representation of Web Data
  • Graph Data Models
  • Semistructured Data Models (also graph based)

27
Graph Data Model
  • Labeled graph data model where nodes represent
    web pages and arcs represent links between pages.
  • Labels on arcs can be viewed as attribute names.
  • Regular path expression queries

28
Semistructured Data Models
  • Irregular data structure, no fixed schema known
    and may be implicit in the data
  • Schema may be large and may change frequently
  • Schema is descriptive rather than prescriptive:
    it describes the current state of the data, but
    violations of the schema are still tolerated

29
  • Data is not strongly typed: for different objects
    the values of the same attribute may be of
    differing types (heterogeneous sources)
  • No restriction on the set of arcs that emanate
    from a given node in a graph or on the types of
    the values of attributes
  • Ability to query the schema: arc variables
    get bound to labels on arcs, rather than nodes in
    the graph

30
Graph based Query Languages
  • Use graph to model databases
  • Support regular path expressions and graph
    construction in queries.
  • Examples
  • GraphLog for hypertext queries
  • graph query language for OO

31
Query Languages for Semi-Structured data
  • Use labeled graphs
  • Query the schema of data
  • Ability to accommodate irregularities in the
    data, such as missing links etc.
  • Examples: Lorel (Stanford), UnQL (AT&T), StruQL
    (AT&T)

32
Comparison of Query Systems
33
Types of Query Languages
  • First Generation
  • Second generation

34
First Generation Query Languages
  • Combine the content-based queries of search
    engines with structure-based queries
  • Combine conditions on text pattern in documents
    with graph pattern describing link structures
  • Examples - W3QL (TECHNION, Israel)
  • WebSQL (Toronto), WebLOG (Concordia)

35
Second generation languages
  • Called web data manipulation languages
  • Web pages as atomic objects with properties that
    they contain or do not contain certain text
    patterns and they point to other objects
  • Useful for data wrapping, transformation, and
    restructuring
  • Useful for web site transformation and
    restructuring

36
How they Differ?
  • Provide access to the structure of web objects
    they manipulate - return structure
  • Model internal structures of web documents as
    well as the external links that connect them
  • Support references to model hyperlinks, and some
    support ordered collections of records for
    more natural data representation
  • Ability to create new complex structures as a
    result of a query

37
Examples
  • Web OQL
  • STRUQL
  • Florid

38
W3QS (WWW Query System) at Technion - Israel
  • Content queries
  • Structural Queries
  • Interfacing with user written programs and
    standard UNIX utilities
  • Uses existing WWW indexes and search Services
  • Provides view update facility

39
W3QS
  • Accessible via any WWW browsers
  • API can be used by programs running anywhere in
    the Internet
  • Support queries on the web structure by
    specifying starting page, a search domain and
    depth of links.
  • File content analysis tools and filling up of
    forms automatically

40
File Types
  • Strict Inner Structure files such as Unix
    environment files - Semantics of the data is
    clearly linked to the syntax
  • Semi-structured files - text files containing
    formatting codes such as Latex or HTML files-
    possible to use formatting codes to analyze their
    semantic content
  • Raw Files - no relation between meaning of file
    and its inner structure

41
Content Queries
  • Queries based on the content of a single node of
    hypertext
  • SQLCOND is used to evaluate boolean expressions
  • Example - Node.format = Latex and Node.author =
    Sanjay

42
Structure Queries
  • Queries on the information conveyed by the
    hypertext organization itself.
  • The result is a set of nodes and links from the
    hypertext structure that satisfy a given graph
    pattern: a graph whose nodes and edges are
    annotated with conditions.
  • Components are pattern definition, search engines
    and form completion

43
Structure Query
Node2.author Sanjay
Link1. revdoc
Node1.title Good article
Answer URL http//../myarticles.html
URL http///.tex ltTitlegt Good articleslt/Titlegt
\author sanjay A HREF//..revdocgt
44
Search for an article
  • Select cp n2/ result
  • from n1, l2, n2
  • where n1 in importantindexes.url
  • Fill n1.form as in importantindexes.fil with
    Keyword = sanjay
  • SQLCOND (n2.format = Latex)
    and (n2.author = sanjay)

45
Query to search hypertext pattern
  • Return all the articles cited in the first
    chapter of the book. Each chapter includes
    several pointers to the bibliography, for example
  • <A HREF="http://cs/references.html#ref2">
    Relativity</A> means the link Relativity leads to
    the label ref2 in the references.html file.
  • In the references.html file the labeled link
    looks like <A HREF="./relative.tex" name="ref2">
    relativity, sanjay</A>; this link points to
    relative.tex

46
  • Select cp art/ result from Ind,
    l1,chap,l2,ref,l3 art where SQLCOND (ind.url
    http//) And (chap.url /.chapter-1.html/) AND
    l2.HREF /.\13.Name/)
  • USING BFS.

47
Url http//cs.tech/bookindex.html INDEX Chapter
1 Chapter 2 References
Url http///Chapter-1.html ref 1 ref 2 ref 3
l1
http//relative.tex
l3
ref 1 ref 2 ref 3
article
48
WebSQL-University of Toronto
  • Model web as relational database
  • Use two relations Document and Anchor
  • Document relation has one tuple for each document
    in the web and the anchor relation has one tuple
    for each anchor in each document
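
To make the two-relation view concrete, here is a small in-memory sketch. The attribute names url, title, base, href and label follow the names used in the queries on the slides that come next; the text attribute, the sample tuples and the helper labels_of_links are assumptions for this example, and the code only mimics a selection over the Anchor relation rather than WebSQL itself.

```python
from collections import namedtuple

# One tuple per document, and one tuple per anchor in each document.
Document = namedtuple("Document", ["url", "title", "text"])
Anchor = namedtuple("Anchor", ["base", "href", "label"])

documents = [
    Document("http://example.org/a.html", "Page A", "papers on databases"),
    Document("http://example.org/b.html", "Page B", "slides and notes"),
]

anchors = [
    Anchor("http://example.org/a.html", "http://example.org/paper.ps.Z", "Full paper"),
    Anchor("http://example.org/a.html", "http://example.org/b.html", "Next"),
    Anchor("http://example.org/b.html", "http://example.org/slides.ps.Z", "Slides"),
]


def labels_of_links(anchors, base, href_contains):
    """Labels of the anchors in document `base` whose target contains a substring."""
    return [a.label for a in anchors if a.base == base and href_contains in a.href]


ps_labels = labels_of_links(anchors, "http://example.org/a.html", ".ps.Z")
print(ps_labels)
```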

49
WebSQL
  • SQL-like query language for extracting
    information from the web.
  • Capable of systematic processing of either all
    the links in a page, all the pages that can be
    reached from a given URL through paths that match
    a pattern, or a combination of both.
  • Provides transparent access to index servers

50
Document
51
Anchor
52
  • Give the URLs of documents that have the same
    title and mention the same keyword(s)
  • Select d1.url, d2.url from
  • document d1 such that d1 MENTIONS keyword1 and
    document d2 such that d2 MENTIONS keyword1
  • where d1.title = d2.title
  • and NOT (d1.url = d2.url)

53
Find Labels of all Hyperlinks to Postscript Files
SELECT a.label
FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z"

54
Documents about Databases
SELECT Document d.url, d.titleFROM d SUCH THAT
"http//www.OtherDoc.html" -gtgt dWHERE d.title
CONTAINS "databases" Note -gt path of length
one within same servergt path of length of one
but different server

55
Retrieve all the documents on the same server
that are pointed to from the document whose URL
is given
  • Select d.url, d.title from
  • Document d SUCH THAT
  • http://www.Cs.in -> d

56
Find all broken links in a page
  • SELECT a.href
  • FROM Anchor a SUCH THAT base =
    "http://the.document.to.test"
  • WHERE protocol(a.href) = "http" AND doc(a.href) = null

57
Web OQL (University of Toronto)
  • Provides a framework that supports a large class
    of data
  • Restructuring operations.
  • Simple semistructured data model for documents
    and record-based data
  • OQL-like syntax and regular expressions
  • Serves as a two-way bridge between databases and
    the Web.

58
DATA MODEL
  • Hypertrees are ordered, arc-labeled trees with two
    types of arcs: internal and external
  • Internal arcs represent structured objects
  • External arcs represent references (hyperlinks)
    among objects.
  • Records as labels on the arcs
  • Sets of related hypertrees form a Web

59
ARCHITECTURE
  • Wrappers map all data sources to trees
  • The mapping can be done all at once or on
    demand

60
Example
  • Extract from cspapers (paper database) title and
    URL of the full version of papers of Smith
  • select y.title,y.URL
  • from x in cspapers, y in x
  • where y.authors smith

61
Web Creation
  • Create a new page for each research Group (using
    the group name as URL). Each page contains the
    publications of the corresponding group.
  • Select x as x.group from x in cspapers
  • General form: Select q1 as s1, q2 as s2, ..., qm as sm
  • where the qi are queries and each si is either a
    string query or the keyword schema. The as clause
    creates URLs s1, ..., sm, one assigned to each new
    page resulting from each query.

62
ARANEUS
  • Data Model called ADM for Web Documents - nested
    web objects, page schemas
  • Several languages for wrapping, querying,
    creating and updating web sites - object algebra
  • Methods and Techniques for Web Site Design and
    Implementation
  • Presentation in SIGMOD99
  • Software is available at their home site

63
  • Wrappers - map logical access to attribute values
    in a page at the ADM level to physical access to
    text in the HTML source using EDITOR
  • ULIXES - SQL-like query languages
  • PENELOPE - manipulation language
  • Site integration, semantic heterogeneities
  • Materialized views
  • http://poincare.dia.uniroma3.it:8080/Araneous

64
Lore - motivation
  • The data may be irregular and thus not conform to
    a rigid schema.
  • Relational data model has null values, and OO
    models have inheritance and complex objects. Both
    have difficulties in designing schemas to
    incorporate irregular data.
  • It may be difficult to decide in advance on a
    single, correct schema. The structure of the data
    may evolve rapidly, data elements may change
    types, or data not conforming to the previous
    structure may be added.

65
  • Thus, there is a need for management of
  • semi-structured data!
  • Lore system manages semi-structured data. The
    data managed by Lore is not confined to a schema
    and it may be irregular or incomplete.
  • OEM is Lore's data model. OEM - Object
    Exchange Model - is a graph-based, self-describing
    object instance model where nodes are objects,
    edges are labeled with attribute names, and leaf
    nodes have atomic values
  • Lore is a lightweight object repository and Lorel
    is Lore's query language.

66
Object Exchange Model - OEM
  • Motivation - information exchange and extraction
  • Why a new data model? It is not a new model.
  • Each value exchanged is given an explicit label.
  • Object <temp-in-Fahrenheit, integer, 80> -
    temp-in-Fahrenheit is the label. Each object is
    self-describing, with a label, type and value.
  • <set-of-temps, set, {cmpnt1, cmpnt2}>
  • cmpnt1 is <temp-in-Fahrenheit, integer, 80>
  • cmpnt2 is <temp-in-Celsius, integer, 20>
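
A small sketch of how such self-describing objects might be represented in a program. The class name OEMObject and the helper describe are assumptions for illustration; OEM itself only prescribes the label, type and value triple, not any particular implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class OEMObject:
    """A self-describing object: every value travels with its label and type."""
    label: str
    type: str
    value: Union[int, str, List["OEMObject"]]  # a set-typed object holds sub-objects


cmpnt1 = OEMObject("temp-in-Fahrenheit", "integer", 80)
cmpnt2 = OEMObject("temp-in-Celsius", "integer", 20)
temps = OEMObject("set-of-temps", "set", [cmpnt1, cmpnt2])


def describe(obj, indent=0):
    """Return one line per object, indented by nesting depth."""
    pad = "  " * indent
    if obj.type == "set":
        lines = [f"{pad}<{obj.label}, set>"]
        for child in obj.value:
            lines.extend(describe(child, indent + 1))
        return lines
    return [f"{pad}<{obj.label}, {obj.type}, {obj.value}>"]


description = describe(temps)
print("\n".join(description))
```

A client that has never seen this source can still interpret the data, because the labels arrive together with the values.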

67
Labels
  • Plays two roles
  • identifying an object (component)
  • identifying the meaning of an object (component)

<person-record, set, {cmpnt1, cmpnt2, cmpnt3}>
cmpnt1 is <person-name, string, Fred>
cmpnt2 is <office-num-in-bldg-5, integer, 333>
cmpnt3 is <department, string, toy>
  • Person-name both identifies cmpnt1 and conveys its
    meaning.
  • In relational data this corresponds to .

68
Labels - Issues
  • What does the label mean?
  • Database of labels
  • Ontology of labels - within each source
  • Labels are relative (more specific) to the source
    of the data object.
  • Similar labels from different sources need to be
    resolved.
  • Labels provide the flexibility in representing
    object structure

69
Self-describing data models
  • Have been in existence for a long time. Why the
    additional interest now?
  • Use the nature of self-describing data model
    for information exchange, and to extend the model
    to include object nesting.
  • To provide an appropriate object request language
    (query facility)

70
OEM - Specification
  • Each object in OEM has the following structure
  • Label: A variable-length character string describing
    what the object represents.
  • Type: The data type of the object's value. Each
    type is either an atom type or the type set.
  • Value: A variable-length value of the object.
  • Object-ID: A unique variable-length identifier
    for the object, or null.

71
OEM - Summary
  • OEM is an information exchange model. It does not
    specify how objects are stored at source.
  • OEM does specify how objects are received at a
    client, but after objects are received they can
    be stored in any way the client likes.
  • Each source has a distinguished object with
    lexical identifier root.
  • Note the schema-less nature of OEM is
    particularly useful when a client does not know
    in advance the labels or structure of OEM objects.

72
Example
  • <biblio, set, {doc1, doc2, ..., docn}>
  • doc1 is <doc, set, {auths1, topic1, call-no1}>
  • auths1 is <auth-set, set, {auth11}>
  • auth11 is <auth-ln, string, Ullman>
  • topic1 is <topic, string, Databases>
  • call-no1 is <internal-call-no, integer, 25>
  • doc2 is <doc, set, {auths2, topic2, call-no2}>
  • auths2 is <auth-set, set, {auth21, auth22,
    auth23}>
  • auth21 is <auth-ln, string, Aho>
  • auth22 is <auth-ln, string, Hopcroft>
  • auth23 is <auth-ln, string, Ullman>
  • topic2 is <topic, string, Algorithms>
  • call-no2 is <dewey-decimal, string, BR273>
  • docn is <doc, set, {authsn, topicn, call-non}>
  • authsn is <auth, string, Crichton>
  • topicn is <topic, string, Dinosaurs>
  • call-non is <fictional-call-no, integer, 95>
  • biblio is the root object.

73
OEM - QL
  • SELECT Fetch-expression
  • FROM Object
  • WHERE Condition
  • The result of this query is itself an object,
    with special label answer
  • <answer, set, {obj1, obj2, ..., objn}>
  • Each returned obji is a component of object
    specified in the From clause of the query, where
    the component is located by the Fetch-expression
    and satisfies the Condition.

74
Path
  • The notion of path is used in both
    Fetch-Expression in the Select clause and the
    condition in the Where clause.
  • Path describes traversals through an object using
    subobject structure and labels.
  • Example biblio.doc.auth
  • Paths are used in the Fetch-Expression to specify
    which components are returned in the answer
    object.
  • Paths are used in the condition to qualify the
    fetched objects or other (related) components in
    the same object structure.

75
Queries - Simple
  • Retrieve the topic of each document for which
    Ullman is one of the authors
  • SELECT biblio.doc.topic
  • FROM root
  • WHERE biblio.doc.auth-set.auth-ln = Ullman
  • Intuitively, the query's where clause finds all
    paths through the subobject structure with the
    sequence of labels biblio, doc, auth-set, auth-ln
    such that the object at the end of the path has
    the value Ullman.
  • <answer, set, {obj1, obj2}>
  • obj1 is <topic, string, Databases>
  • obj2 is <topic, string, Algorithms>
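
The path-following idea behind this query can be sketched in a few lines. This is an illustrative evaluator, not the OEM query processor: objects are plain dictionaries, the call-number components are left out for brevity, and the helper names obj, follow and topics_by_author are assumptions. The root object here is biblio itself, so paths are followed starting below it.

```python
def obj(label, type_, value):
    """Build an OEM-style object as a dictionary (illustrative encoding)."""
    return {"label": label, "type": type_, "value": value}


biblio = obj("biblio", "set", [
    obj("doc", "set", [
        obj("auth-set", "set", [obj("auth-ln", "string", "Ullman")]),
        obj("topic", "string", "Databases"),
    ]),
    obj("doc", "set", [
        obj("auth-set", "set", [
            obj("auth-ln", "string", "Aho"),
            obj("auth-ln", "string", "Hopcroft"),
            obj("auth-ln", "string", "Ullman"),
        ]),
        obj("topic", "string", "Algorithms"),
    ]),
    obj("doc", "set", [
        obj("auth", "string", "Crichton"),   # irregular: no auth-set here
        obj("topic", "string", "Dinosaurs"),
    ]),
])


def follow(objects, labels):
    """Objects reached from `objects` by following the given label sequence."""
    for label in labels:
        objects = [child
                   for o in objects if o["type"] == "set"
                   for child in o["value"] if child["label"] == label]
    return objects


def topics_by_author(root, name):
    """Topics of the documents that have an auth-set.auth-ln equal to `name`."""
    answer = []
    for doc in follow([root], ["doc"]):
        authors = follow([doc], ["auth-set", "auth-ln"])
        if any(a["value"] == name for a in authors):
            answer.extend(follow([doc], ["topic"]))
    return obj("answer", "set", answer)


result = topics_by_author(biblio, "Ullman")
print([t["value"] for t in result["value"]])
```

The third document has no auth-set component, so the path simply does not match there; no error is raised, which is the tolerance for irregular data that the model is designed for.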

76
Queries - wild-cards
  • Retrieve all documents with internal call number
  • SELECT biblio.?.topic
  • FROM root
  • WHERE biblio.?.internal-call-no
  • The ? label matches any label. For this query,
    the doc labels can be replaced by any other
    strings and the query would produce the same result.
    By convention, two occurrences of ? in the same
    query must match the same label unless variables
    are used.
  • <answer, set, {obj1}>
  • obj1 is <topic, string, Databases>

77
Queries - wild-paths
  • Retrieve all documents with internal call number
  • SELECT *.topic
  • FROM root
  • WHERE *.internal-call-no
  • The symbol * matches any path of length one or
    more. The use of * followed by a single label is
    a convenient and common way to locate objects
    with a certain label in a complex structure.
    Similar to ?, two occurrences of * in the same
    query must match the same sequence of labels,
    unless variables are used.
  • <answer, set, {obj1}>
  • obj1 is <topic, string, Databases>

78
Queries - variables
  • Retrieve each document for which both
    Hopcroft and Aho are co-authors
  • SELECT biblio.doc
  • FROM root
  • WHERE biblio.doc.auth-set.auth-ln(a1) = Aho and
  • biblio.doc.auth-set.auth-ln(a2) = Hopcroft
  • Here, the query finds all the paths with
    structure biblio, doc, auth-set, and with two
    distinct path completions with label auth-ln with
    values Aho and Hopcroft
  • <answer, set, {obj1}>
  • obj1 is the complete doc2
79
An OEM Database
[Figure: an OEM database drawn as a labeled graph rooted at DBGroup. Edges labeled Member, Project, Name, Age, Office, Building and Room connect object nodes numbered 1 to 20. Atomic values shown: Clark, Smith, 46, Gates 252, Lore, Tsimmis, Jones, 28, CIS, 411, CIS, 252.]
80
Lorel Queries - Simple Path Expression
  • Retrieve the offices of members with age greater
    than 30 years
  • Query: SELECT DBGroup.Member.Office
  • WHERE DBGroup.Member.Age > 30
  • Result: Office "Gates 252"
  • Office
  • Building "CIS"
  • Room "411"

81
Queries - General Path Expression
  • Query: SELECT DBGroup.Member.Name
  • WHERE DBGroup.Member.Office(.Room%|.Cubicle)?
  • like "%252"
  • Result: Name "Jones"
  • Name "Smith"
  • Room% matches all labels starting with Room, like
    Room68. | stands for disjunction. ? indicates
    that the label pattern is optional. like "%252"
    specifies that the data value should end with the
    string 252.
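
One way to picture how such a general path expression is evaluated is to translate it into an ordinary regular expression over dotted label paths. The sketch below does that by hand for this single query. The member data is hypothetical (it is not a transcription of the figure), and the names leaves, PATH and names_with_office_like are assumptions; Lore's actual evaluation strategy is not described here.

```python
import re

# Hypothetical members: an Office is either an atomic string or a nested object.
members = [
    {"Name": "Smith", "Office": "Gates 252"},
    {"Name": "Jones", "Office": {"Building": "CIS", "Room": "252"}},
    {"Name": "Clark", "Office": {"Building": "CIS", "Room": "411"}},
]


def leaves(node, path=""):
    """Yield (dotted label path, atomic value) pairs found below `node`."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaves(child, f"{path}.{label}")
    else:
        yield path, node


# Office(.Room%|.Cubicle)? : the label Office, optionally followed by a label
# that starts with "Room" or by the label "Cubicle".
PATH = re.compile(r"\.Office(\.Room[^.]*|\.Cubicle)?")


def names_with_office_like(members, suffix):
    """Names of members with a matching path whose value ends with `suffix`."""
    names = []
    for member in members:
        for path, value in leaves({"Office": member.get("Office")}):
            if PATH.fullmatch(path) and str(value).endswith(suffix):
                names.append(member["Name"])
                break
    return names


matches = names_with_office_like(members, "252")
print(matches)
```

Smith matches through the bare Office value and Jones through Office.Room, which is the point of making the trailing label pattern optional.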

82
Queries - SubQueries
  • Retrieve Lore project members who work on other
    projects
  • Query: SELECT M.Name, (SELECT M.Project.Title
    WHERE M.Project.Title != "Lore")
  • FROM DBGroup.Member M
  • WHERE M.Project.Title = "Lore"
  • Result: Member
  • Name "Jones"
  • Title "Tsimmis"
83
Lore - Summary
  • Lore facilitates queries and updates on
    semistructured databases
  • There has been more work done on optimization
    using DataGuides (VLDB97).
  • The system is up and running: http://WWW-DB.Stanford.EDU/lore/demo/
  • How is this related to WWW?
  • XML-QL and related work provides the answer.

84
Extraction and Integration
  • OEM and subsequent LORE(L) can be used for
    extracting information from multiple information
    sources.
  • OEM helps navigate through unknown objects by
  • SELECT ?
  • FROM root
  • This helps browsing and schema discovery
  • Efficient implementations are possible using
    partial fetch mechanism.
  • Push and Pull information delivery systems are
    possible.
  • How is this different from WebIR?

85
STRUDEL
  • Web Site Management System
  • web Site from multiple sources
  • StruQL - based on OEM, graphs, regular
    expressions, result as graph
  • Example - return all the postscript papers from
    homepages
  • where homepages(p), p -> paper -> q,
  • ispostscript(q) collect postscriptpages(q)
  • General form: where C1, ..., Ck create N1, ..., Nn
    link L1, ..., Lp collect G1, ..., Gq

86
Complex Constructors
Supported by Strudel, a Website Management System
with StruQL as its query language
where Biblio(X), X -> paper -> P, P -> author -> A,
      P -> title -> T, P -> year -> Y
create Root(), HomePage(A), YearPage(A,Y), PubPage(P)
link Root() -> person -> HomePage(A),
     HomePage(A) -> yearentry -> YearPage(A,Y),
     YearPage(A,Y) -> publication -> PubPage(P),
     PubPage(P) -> author -> HomePage(A),
     PubPage(P) -> title -> T
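
The effect of the create and link clauses can be imitated with a short sketch: each binding of (P, A, T, Y) produced by the where clause contributes nodes and labeled edges to the site graph, and a constructor applied to the same arguments always denotes the same page. The bibliography rows, the tuple encoding of nodes and the function build_site are assumptions made for illustration; this is not how Strudel itself is implemented.

```python
# Hypothetical bindings of (P, A, T, Y) from the where clause, one per row.
bindings = [
    {"id": "p1", "author": "Aho", "title": "Algorithms", "year": 1974},
    {"id": "p2", "author": "Ullman", "title": "Databases", "year": 1988},
    {"id": "p3", "author": "Ullman", "title": "Algorithms", "year": 1974},
]


def build_site(bindings):
    """Return the site graph as a set of (source, label, target) edges."""
    edges = set()
    for b in bindings:
        # Tuples play the role of the page constructors: equal arguments
        # give equal tuples, so a page is created only once.
        root = ("Root",)
        home = ("HomePage", b["author"])
        year_page = ("YearPage", b["author"], b["year"])
        pub = ("PubPage", b["id"])
        edges.add((root, "person", home))
        edges.add((home, "yearentry", year_page))
        edges.add((year_page, "publication", pub))
        edges.add((pub, "author", home))
        edges.add((pub, "title", b["title"]))
    return edges


site = build_site(bindings)
people = sorted(target[1] for source, label, target in site if label == "person")
print(len(site), people)
```

Ullman appears in two bindings but gets a single home page, because the set of edges collapses the repeated Root to HomePage link.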
87
WebDB
  • View WWW as multimedia documents in the form of
    web pages
  • WQL supports selection, aggregation, sorting,
    summary, grouping
  • projection on title , URL, keywords, tables,
    forms, images etc.

88
Some More Results
  • UnQL - ATT
  • AKIRA- Pennstate
  • NoDose - SIGMOD98

89
HTML to XML
  • HTML documents
  • Emerging Web Standards - XML
  • XML good for data interchange across platforms
    enterprise wide
  • conversion HTML to XML - IBM, Microsoft

90
XML - Motivation
  • In HTML, both the tag semantics and tags are
    fixed. There is limited and strict interpretation
    of tags.
  • HTML is widely successful in disseminating
    documents across internet.
  • Though data can be disseminated through HTML, its
    extraction is painful, and laborious.
  • EDI has been a predominant mode of exchanging
    data among businesses. But it has very rigid
    format that requires highly customized
    applications.

91
XML - Introduction
  • XML aims to combine the ease of authoring HTML
    documents with the ease of data exchange that is
    possible with EDI.
  • Tags are used to markup documents.
  • XML is a meta-language for describing markup
    languages.
  • XML provides a facility to define tags and
    structural relationships between them.
  • No pre-defined tag set implies no preconceived
    semantics; the semantics of an XML document is
    defined by the applications that process it or by
    style sheets (XSL).

92
XML - Goals
  • Straightforward to use over internet
  • Support wide variety of applications, authoring,
    browsing, content analysis, etc.
  • Easy to write programs that process XML documents
    and validate them.
  • XML documents must be human-legible and
    reasonably clear.
  • Design of XML shall be formal and concise -
    expressed as EBNF (extended Backus Naur Form) -
    amenable to modern compiler tools and techniques.

93
XML-features
  • Some structure - not rigid
  • Extensibility - User defined tags
  • nested elements
  • validation - documents may specify their own
    grammar
  • DTD (Document Type Definition) - schema exists
    with data as tag names
  • Application - EDI - extraction, conversion,
    transformation, integration
  • can be modeled using DOM

94
More terminology
  • RDF - Resource Description Framework - a method
    to describe metadata for XML documents
  • XSL - Extensible Stylesheet Language - language
    for transforming and formatting XML.
  • Transformation Language - XSLT, XPath, XPointer

95
Example-HTML
  • Print - Sanjay Madria
  • Web Warehouse Tutorial, ADBIS99
  • HTML
  • <H2> Sanjay Madria </H2>
  • <I> Web Warehouse Tutorial, ADBIS99 </I>
  • Very difficult to understand, structure is
    hidden, describes only appearance

96
XML
  • <Ref>
  • <Speaker> <Firstname> Sanjay</Firstname>
  • <Lastname> Madria</Lastname>
  • </Speaker>
  • <Title> Web Warehouse Tutorial</Title>
  • <Conference> ADBIS99</Conference>
  • <empty/>
  • </Ref>
  • another format
  • <Firstname Value="Sanjay"/>
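
Because the structure is explicit in the tags, a program can pull out individual fields without guessing from the layout. A minimal sketch using Python's standard xml.etree.ElementTree module follows; the dictionary keys are chosen for this example only.

```python
import xml.etree.ElementTree as ET

doc = """
<Ref>
  <Speaker>
    <Firstname>Sanjay</Firstname>
    <Lastname>Madria</Lastname>
  </Speaker>
  <Title>Web Warehouse Tutorial</Title>
  <Conference>ADBIS99</Conference>
</Ref>
"""

ref = ET.fromstring(doc)
speaker = ref.find("Speaker")

# Each field is addressed by element name, not by position or appearance.
record = {
    "first": speaker.findtext("Firstname"),
    "last": speaker.findtext("Lastname"),
    "title": ref.findtext("Title"),
    "conference": ref.findtext("Conference"),
}
print(record)
```

The same extraction from the HTML version would have to rely on the fact that the speaker happens to be in an H2 element and the title in italics.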

97
XML Data
  • <book>
  • <title> database systems</title>
  • <author> John <lastname> Korth</lastname></author>
  • <price currency="USD"> 5.87</price>
  • </book>
  • DTD
  • <!ELEMENT book (title, author, price)>
  • <!ELEMENT title (#PCDATA)>
  • <!ELEMENT author (#PCDATA | lastname)*>

98
  • <tr> <td width="20" valign="top"> Firma
    Karl-Heinz Rosowski </td>
  • <td width="20" valign="top"> Maikstraße 14 </td>
  • <td width="20" valign="top"> 22041 Hamburg </td>
  • <td width="20" valign="top"> 721 99 64 </td>
  • <td width="20" valign="top"> 21110111 </td>
    </tr>

HTML Version
  • <?xml version="1.0"?>
  • <Addresses>
  • <Address id="12359">
  • <Name>Firma Karl-Heinz Rosowski</Name>
  • <Street>Maikstraße 14</Street>
  • <ZIP>22041</ZIP>
  • <City>Hamburg</City>
  • <Tel>721 99 64</Tel>
  • <Fax>21110111</Fax> <Email/>
  • </Address>
  • </Addresses>

XML Version
99
XML - Document - Continued
  • <?xml version="1.0"?> is the XML declaration.
  • Elements: the most common form of markup,
    <element> </element>. For example <name>Jack Lemon</name>
  • Attributes are name-value pairs that occur
    inside start-tags after the element name. For
    example <Address id="12359"> attaches the value
    12359 to the attribute id of the Address element.
  • Entity references handle special characters
    of XML, like &lt; for <, in XML documents.

100
  • Comments: <!-- this is a comment -->
  • CDATA sections: a CDATA (string of characters)
    section instructs the parser to ignore most
    markup characters. For example, for source code,
    <![CDATA[ *p = &q; b = (i <= 3); ]]>. Between
    <![CDATA[ and ]]>, all character data is passed to
    the application without interpretation.
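
The following sketch, using Python's standard library, compares the two ways of carrying markup characters in element content: a CDATA section and entity references. The code fragment inside the element and the helper is_well_formed are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

source = "b = (i <= 3) && q;"

# 1. Wrapped in a CDATA section: the parser does not interpret < or &.
with_cdata = ET.fromstring("<code><![CDATA[" + source + "]]></code>")

# 2. Written with entity references: escape() turns & into &amp; and < into &lt;.
with_entities = ET.fromstring("<code>" + escape(source) + "</code>")


def is_well_formed(text):
    """True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(with_cdata.text)
print(with_entities.text)
print(is_well_formed("<code>" + source + "</code>"))
```

Both forms deliver the same character data to the application, while the unescaped version is rejected by the parser.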

101
XML - DTD - Element Type Declarations
  • Element type declarations identify the names of
    elements and the nature of their content. A
    typical element type declaration looks like
  • <!ELEMENT Address (Name, Street, ZIP?, City,
    Tel, Fax*, Email?)>
  • Address is the element name, and (Name, Street,
    ZIP?, City, Tel, Fax*, Email?) is the content
    model. Every address must contain Name, Street,
    City and Tel. ZIP and Email are optional, whereas
    there can be zero or more Fax numbers.
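
A content model is essentially a regular expression over child element names. Python's standard library does not validate against DTDs, so the sketch below hand-translates the Address content model into a regular expression instead. It assumes the model reads (Name, Street, ZIP?, City, Tel, Fax*, Email?), with Fax* standing for the "zero or more Fax numbers" of the description; ADDRESS_MODEL and matches_address_model are names invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# (Name, Street, ZIP?, City, Tel, Fax*, Email?) as a regular expression over
# the sequence of child tag names, each followed by one space.
ADDRESS_MODEL = re.compile(r"Name Street (ZIP )?City Tel (Fax )*(Email )?")


def matches_address_model(xml_text):
    """Check the order and multiplicity of the children of an Address element."""
    element = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in element)
    return element.tag == "Address" and ADDRESS_MODEL.fullmatch(children) is not None


full = ('<Address id="12359"><Name>Firma Karl-Heinz Rosowski</Name>'
        '<Street>Maikstraße 14</Street><ZIP>22041</ZIP><City>Hamburg</City>'
        '<Tel>721 99 64</Tel><Fax>21110111</Fax><Email/></Address>')
no_zip_two_faxes = ('<Address><Name>N</Name><Street>S</Street><City>C</City>'
                    '<Tel>1</Tel><Fax>2</Fax><Fax>3</Fax></Address>')
missing_tel = ('<Address><Name>N</Name><Street>S</Street><City>C</City>'
               '</Address>')

print(matches_address_model(full))
print(matches_address_model(no_zip_two_faxes))
print(matches_address_model(missing_tel))
```

This only checks element order and multiplicity; a validating parser would also check attribute declarations and the content of each child.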

102
  • The declarations for Name, Street, ZIP and the
    other child elements must also be given. For example
  • <!ELEMENT Name (#PCDATA)>
  • Attribute list declarations identify which
    elements may have attributes, what values the
    attributes may hold, and what value is the default.
    Attribute values appear only within start-tags
    and empty-element tags.
  • <Address id="12359">
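
Python's standard library does not validate documents against a DTD, so the following is only a hand-written sketch of what the Address content model expresses. It assumes the model allows zero or more Fax elements, as the slide text states, and rewrites it as a regular expression over the sequence of child tag names; the function name and the sample elements are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content model (Name, Street, ZIP?, City, Tel, Fax*, Email?) rewritten as a
# regular expression over a sequence of child tag names, each followed by a space.
ADDRESS_MODEL = re.compile(r"Name Street (ZIP )?City Tel (Fax )*(Email )?")


def matches_address_model(element):
    """Return True if the children of `element` fit the Address content model."""
    child_tags = "".join(child.tag + " " for child in element)
    return ADDRESS_MODEL.fullmatch(child_tags) is not None


# Valid: ZIP and Email omitted, two Fax numbers.
valid = ET.fromstring(
    "<Address id='12359'><Name>N</Name><Street>S</Street><City>C</City>"
    "<Tel>T</Tel><Fax>1</Fax><Fax>2</Fax></Address>"
)
# Invalid: the mandatory Street element is missing.
invalid = ET.fromstring(
    "<Address id='1'><Name>N</Name><City>C</City><Tel>T</Tel></Address>"
)

print(matches_address_model(valid), matches_address_model(invalid))
```

A validating parser performs this kind of check for every declared element type; the sketch covers only the one declaration shown on the slide.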

103
XML - Summary
  • HTML describes presentation
  • XML describes content
  • XML vs. HTML
  • users define new tags
  • arbitrary nesting
  • validation is possible

104
XML and the Semi-Structured Data Model
  • XML data is fundamentally different from
    relational and object-oriented data.
  • XML is not rigidly structured.
  • In the relational and OO data models, every data
    instance has a schema that is separate from and
    independent of the data.
  • XML data is self-describing and can naturally
    model irregularities that cannot be modeled by
    the relational or OO data models.

105
  • For example, data items may have missing elements
    or multiple occurrences of the same element;
    elements may have atomic values in some data
    items and structured values in others; and
    collections of elements can have heterogeneous
    structure.
  • Even XML data that has an associated DTD is
    self-describing (the schema is always stored
    with the data) and, except for very restricted
    forms of DTDs, may have all the irregularities
    described above.
  • XML is an instance of semistructured data.

106
XML-QL
  • regular path expressions
  • pattern matching
  • uses edge-labeled graphs
  • extracts data from existing XML documents and
    constructs new XML documents
  • support for ordered and unordered views of an XML
    document
  • simple and declarative

107
XML-QL
  • The simplest XML-QL queries extract data from an
    XML document. Consider the following DTD
  • <!ELEMENT book (author+, title, publisher)>
  • <!ATTLIST book year CDATA>
  • <!ELEMENT article (author+, title, year?,
    (shortversion|longversion))>
  • <!ATTLIST article type CDATA>
  • <!ELEMENT publisher (name, address)>
  • <!ELEMENT author (firstname?, lastname)>

108
XML-QL Example Data
<bib>
  <book year="1995">
    <title> An Introduction to DB Systems </title>
    <author> <lastname> Date </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
  <book year="1995">
    <title> Foundations for OR Databases </title>
    <author> <lastname> Date </lastname> </author>
    <author> <lastname> Darwen </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
</bib>
109
Matching Data Using Patterns
  • XML-QL uses element patterns to match data in an XML
    document.
  • Find all authors of books whose publisher is
    Addison-Wesley in the XML document www.a.b.c/bib.xml
  • WHERE <book>
  • <publisher><name>Addison-Wesley</name></publisher>
  • <title> $t </title>
  • <author> $a </author>
  • </book> IN "www.a.b.c/bib.xml"
  • CONSTRUCT $a
  • The query matches every <book> element in the XML
    document that has at least one <title> element, one
    <author> element, and one <publisher> element
    whose <name> is Addison-Wesley. For each such
    match it binds $t and $a to every title and
    author pair.
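
The matching step can be imitated outside XML-QL. The following Python sketch is not an XML-QL implementation; it walks the example bibliography with xml.etree.ElementTree and collects one author binding per (title, author) pair of every matching book. The function name is hypothetical, and the whitespace around the text values in the slide data has been trimmed.

```python
import xml.etree.ElementTree as ET

# Example data from the slides, with surrounding whitespace trimmed.
BIB_XML = """<bib>
  <book year="1995">
    <title>An Introduction to DB Systems</title>
    <author><lastname>Date</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="1995">
    <title>Foundations for OR Databases</title>
    <author><lastname>Date</lastname></author>
    <author><lastname>Darwen</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
</bib>"""


def authors_by_publisher(bib, publisher_name):
    """Return one <author> element per (title, author) binding of each matching book."""
    bindings = []
    for book in bib.iter("book"):
        names = [(n.text or "").strip() for n in book.findall("publisher/name")]
        if publisher_name not in names:
            continue  # the <publisher><name>...</name></publisher> pattern fails
        for _title in book.findall("title"):      # binds the title variable
            for author in book.findall("author"):  # binds the author variable
                bindings.append(author)
    return bindings


bib = ET.fromstring(BIB_XML)
authors = authors_by_publisher(bib, "Addison-Wesley")
lastnames = [a.findtext("lastname").strip() for a in authors]
print(lastnames)
```

Date appears twice in the output because the pattern is matched once per book, which mirrors the "every title and author pair" binding described on the slide.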

110
XML-QL Constructing XML Data
  • Often we would like to format the result.
  • Find all authors and titles of books whose
    publisher is Addison-Wesley in the XML document
    www.a.b.c/bib.xml
  • WHERE <book>
  • <publisher><name>Addison-Wesley</></>
  • <title> $t </title>
  • <author> $a </author>
  • </book> IN "www.a.b.c/bib.xml"
  • CONSTRUCT <result>
  • <author> $a </>
  • <title> $t </>
  • </>

111
Constructing XML Data -cont.
Result of the query:
<result>
  <author> <lastname> Date </lastname> </author>
  <title> An Introduction to DB Systems </title>
</result>
<result>
  <author> <lastname> Date </lastname> </author>
  <title> Foundations for OR Databases </title>
</result>
<result>
  <author> <lastname> Darwen </lastname> </author>
  <title> Foundations for OR Databases </title>
</result>
One result for each author, duplicating title information.
112
XML-QL Nested Queries.
WHERE <book>
        <title> $t </>
        <publisher><name>Addison-Wesley</></>
      </> CONTENT_AS $p IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
            <title> $t </>
            WHERE <author> $a </> IN $p
            CONSTRUCT <author> $a </>
          </>
Result of the query:
<result>
  <title> An Introduction to DB Systems </title>
  <author> <lastname> Date </lastname> </author>
</result>
<result>
  <title> Foundations for OR Databases </title>
  <author> <lastname> Date </lastname> </author>
  <author> <lastname> Darwen </lastname> </author>
</result>
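
The grouping effect of the nested query can be sketched in Python as two nested loops: the outer loop plays the role of the outer WHERE clause, and the inner loop plays the role of the sub-query over the content of the matched book. This is an illustration with xml.etree.ElementTree rather than an XML-QL engine, and the function name is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Example data from the slides, with surrounding whitespace trimmed.
BIB_XML = """<bib>
  <book year="1995">
    <title>An Introduction to DB Systems</title>
    <author><lastname>Date</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
  <book year="1995">
    <title>Foundations for OR Databases</title>
    <author><lastname>Date</lastname></author>
    <author><lastname>Darwen</lastname></author>
    <publisher><name>Addison-Wesley</name></publisher>
  </book>
</bib>"""


def nested_results(bib, publisher_name):
    """Build one <result> per matching title, with all of that book's authors inside."""
    results = []
    for book in bib.iter("book"):
        names = [(n.text or "").strip() for n in book.findall("publisher/name")]
        if publisher_name not in names:
            continue
        for title in book.findall("title"):
            result = ET.Element("result")
            result.append(copy.deepcopy(title))
            # Inner query: runs over the content of the same book only.
            for author in book.findall("author"):
                result.append(copy.deepcopy(author))
            results.append(result)
    return results


results = nested_results(ET.fromstring(BIB_XML), "Addison-Wesley")
summary = [
    (r.findtext("title").strip(),
     [a.findtext("lastname").strip() for a in r.findall("author")])
    for r in results
]
print(summary)
```

Each title now appears once, with its authors grouped under it, instead of once per author.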
113
XML-QL Join Queries
XML-QL queries can express joins by matching two
or more elements that contain the same value. Find
all articles that have at least one author who
has written a book since 1995.
WHERE <article>
        <author>
          <firstname> $f </>   // firstname $f
          <lastname> $l </>    // lastname $l
        </>
      </> CONTENT_AS $a IN "www.a.b.c/bib.xml",
      <book year=$y>
        <author>
          <firstname> $f </>   // join on same firstname $f
          <lastname> $l </>    // join on same lastname $l
        </>
      </> IN "www.a.b.c/bib.xml",
      $y > 1995
CONSTRUCT <article> $a </>
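
One way to picture this join is as a lookup on (firstname, lastname) pairs. The sketch below uses made-up data, because the example bibliography in these slides contains no articles or first names, and the helper names are hypothetical. Unlike XML-QL, which produces one result per variable binding, it returns each qualifying article once.

```python
import xml.etree.ElementTree as ET

# Hypothetical data for illustration only.
BIB_XML = """<bib>
  <book year="1998">
    <title>Book A</title>
    <author><firstname>Ann</firstname><lastname>Lee</lastname></author>
  </book>
  <book year="1990">
    <title>Book B</title>
    <author><firstname>Bob</firstname><lastname>Ray</lastname></author>
  </book>
  <article>
    <author><firstname>Ann</firstname><lastname>Lee</lastname></author>
    <title>Article 1</title>
  </article>
  <article>
    <author><firstname>Bob</firstname><lastname>Ray</lastname></author>
    <title>Article 2</title>
  </article>
</bib>"""


def author_key(author):
    """Return (firstname, lastname), or None if the pattern cannot match."""
    first = author.findtext("firstname")
    last = author.findtext("lastname")
    if first is None or last is None:
        return None  # the pattern requires both child elements
    return (first.strip(), last.strip())


def articles_by_recent_book_authors(bib, min_year):
    # Collect the names of authors of books whose year is greater than min_year.
    recent_authors = set()
    for book in bib.iter("book"):
        year = book.get("year")
        if year is None or int(year) <= min_year:
            continue
        for author in book.findall("author"):
            key = author_key(author)
            if key is not None:
                recent_authors.add(key)
    # Keep every article that shares at least one author name with that set.
    return [
        article for article in bib.iter("article")
        if any(author_key(a) in recent_authors for a in article.findall("author"))
    ]


matches = articles_by_recent_book_authors(ET.fromstring(BIB_XML), 1995)
titles = [m.findtext("title") for m in matches]
print(titles)
```

Only the first article qualifies, because its author also wrote a book with a year greater than 1995.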
114
XML-QL Data Model for XML
  • An XML graph G in which each node is represented by
    a unique string called an object identifier (OID),
    G's edges are labeled with element tags, G's
    nodes are labeled with sets of attribute-value
    pairs, G's leaves are labeled with one string
    value, and G has a distinguished node called the
    root.

115
XML-QL Data Model for XML
  • The model allows several edges between the same
    two nodes, with the following restrictions:
  • between any two nodes there can be at most one
    edge with a given label
  • a node cannot have two leaf children with the
    same label and same string value
  • XML graphs are not only derived from XML
    documents, but are also generated by queries.
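
A minimal sketch of this data model, assuming a synthetic root node whose single outgoing edge is labeled with the document element's tag (a simplification chosen for this illustration). The class and function names are hypothetical, OIDs are generated as o0, o1, and so on, and the restriction on duplicate leaf children is not enforced.

```python
import itertools
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    oid: str                                          # unique object identifier
    attributes: dict = field(default_factory=dict)    # attribute-value pairs
    value: Optional[str] = None                       # string value, leaves only


@dataclass
class XMLGraph:
    nodes: dict = field(default_factory=dict)   # oid -> Node
    edges: list = field(default_factory=list)   # (parent oid, element tag, child oid)
    root: str = ""                              # OID of the distinguished root node


def build_graph(xml_text):
    """Turn an XML document into an edge-labeled graph (a tree in this sketch)."""
    graph = XMLGraph()
    counter = itertools.count()

    def new_node(attributes=None, value=None):
        oid = f"o{next(counter)}"
        graph.nodes[oid] = Node(oid, dict(attributes or {}), value)
        return oid

    def visit(element, parent_oid):
        is_leaf = len(element) == 0
        text = (element.text or "").strip()
        oid = new_node(element.attrib, text if is_leaf else None)
        graph.edges.append((parent_oid, element.tag, oid))  # edge carries the tag
        for child in element:
            visit(child, oid)

    graph.root = new_node()
    visit(ET.fromstring(xml_text), graph.root)
    return graph


graph = build_graph("<bib><book year='1995'><title>T</title></book></bib>")
print(graph.edges)
```

Element tags end up on the edges, attributes on the nodes, and character data on the leaves, which is the labeling described on the slide.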

116
XML- Element Identity, Ids, and IDREFS
  • For element sharing XML reserves an attribute of
    type ID which allows a unique key to be
    associated with an element.
  • An attribute of type IDREF allows an element to
    refer to another element with the designated key,
    and one of the type IDREFS may refer to multiple
    elements.

117
  • <!ATTLIST person ID ID #REQUIRED>
  • <!ATTLIST article author IDREFS #IMPLIED>
  • <person ID="o123">
  • <firstname>John</firstname>
  • <lastname>Smith</lastname>
  • </person>
  • <person ID="o234">
  • . . .
  • </person>
  • <article author="o123 o234">
  • <title> ... </title>
  • <year> 1995 </year>
  • </article>

118
XML- Element Identity, Ids, and IDREFS
119
The following query produces all lastname, title
pairs by joining the article element's author
(IDREFS) attribute values with the person element's
ID attribute value.
WHERE <article author=$i>
        <title> </> ELEMENT_AS $t
      </>,
      <person ID=$i>
        <lastname> </> ELEMENT_AS $l
      </>
CONSTRUCT <result> $t $l </>
The idiom <title></> ELEMENT_AS $t binds $t to a <title>
element with arbitrary contents. The element
expression <title/> matches a <title> element
with empty contents.
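
The same join can be sketched in Python by indexing person elements on their ID attribute and splitting the article's author attribute on whitespace, since xml.etree.ElementTree does not resolve IDREFS by itself. The <db> wrapper, the second person's name and the article title are placeholders invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Data modeled on the slide; the <db> wrapper, the second person's name and
# the article title are hypothetical placeholders.
DATA = """<db>
  <person ID="o123"><firstname>John</firstname><lastname>Smith</lastname></person>
  <person ID="o234"><firstname>Jane</firstname><lastname>Doe</lastname></person>
  <article author="o123 o234"><title>Some Title</title><year>1995</year></article>
</db>"""


def lastname_title_pairs(root):
    """Join each article's author IDREFS with the ID attribute of person elements."""
    people = {p.get("ID"): p for p in root.iter("person")}
    pairs = []
    for article in root.iter("article"):
        # An IDREFS value is a whitespace-separated list of IDs.
        for ref in article.get("author", "").split():
            person = people.get(ref)
            if person is None:
                continue  # dangling reference: no person carries this ID
            for title in article.findall("title"):
                for lastname in person.findall("lastname"):
                    pairs.append(((lastname.text or "").strip(),
                                  (title.text or "").strip()))
    return pairs


pairs = lastname_title_pairs(ET.fromstring(DATA))
print(pairs)
```

Each referenced person contributes one (lastname, title) pair for the article that cites it.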
120
XML-QL- Advanced Examples
  • Tag variables
  • Regular path expressions
  • Transforming XML data (from one DTD to another)
  • Integrating data from different XML sources
  • Embedding queries in data
  • For XML-QL, check http://www.w3.org/TR/NOTE-xml-ql
121
Summary
  • Even before you blink your eye, a lot of work has
    gone into web data models and query languages
  • Some problems are addressed
  • Semi-structured
  • semi-structured data model based query languages
  • schema inference from the semi-structured data model
  • efficient processing of queries on
    semi-structured data
  • efficient indexing and storage structures
  • integration with XML
  • Traditional
  • WebSQL/WebOQL
  • Web Warehousing
  • Which way will you go?

122
Further issues
  • Distributed query processing
  • Continuous result processing with push/pull
    result replenishment
  • Labels, labels everywhere; with XML, more labels
    everywhere. How are the semantics of queries across
    multiple information sources handled?
  • IR gives too many relevant/irrelevant results
  • Query Processing requires some schema knowledge
    that is difficult to handle across multiple
    sources
  • Can these two be bridged? Cooperative solutions.
  • Next: agents, agents everywhere. What are they
    doing? Will it work, or will it be a fad?