WebOQL - PowerPoint PPT Presentation

About This Presentation
Title:

WebOQL

Description:

WebOQL A Web Object Query Language – PowerPoint PPT presentation

Slides: 69
Provided by: sanj114
Learn more at: https://web.mst.edu

Transcript and Presenter's Notes

Title: WebOQL


1
WebOQL
  • A Web Object Query Language

2
OVERVIEW
  • Data model supports abstractions for modeling
    record-based data, structured documents and
    hypertexts
  • Supports querying small databases represented as
    documents (such as catalogs), restructuring
    single pages (for example, converting a large
    page into smaller pages), and restructuring sets
    of pages (for example, creating an index page
    containing a hyperlink to each page and adding to
    each page a hyperlink back to the index page).
  • Supports restructuring the content of a web site
    in order to show the same content in another view

3
Data Model
The WebOQL data model introduces the hypertree, a
tree-based data model for representing structured
documents that contain hyperlinks.
Hypertrees are ordered, arc-labeled trees with two
kinds of arcs: internal and external.
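A minimal Python sketch of this structure, under stated assumptions: the class names Arc and Hypertree are invented for illustration, a label is modeled as a dictionary, and the sample fragment only loosely follows the example on the next slide (its exact arrangement is assumed).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Arc:
    label: Dict[str, str]                  # record labelling the arc
    external: bool = False                 # True for a hyperlink (external) arc
    target: Optional["Hypertree"] = None   # subtree reached by an internal arc


@dataclass
class Hypertree:
    arcs: List[Arc] = field(default_factory=list)  # list order is significant


# Hypothetical fragment: one group, one student, one link.
arik = Hypertree([
    Arc({"Label": "arik home page", "URL": "www/index.html"}, external=True),
])
students = Hypertree([Arc({"Name": "arik", "Sem": "8"}, target=arik)])
example = Hypertree([Arc({"Group": "students"}, target=students)])

# Labels of the external arcs hanging below arik.
external_labels = [a.label["Label"] for a in arik.arcs if a.external]
```

Keeping the arcs in a list preserves their order, and the external flag is the only thing that separates a hyperlink from ordinary nesting.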
4
Data Model
Example
Group: students
Group: professors
Name: oded, Seniority: 8
Name: moshe, Sem: 5
Name: arik, Sem: 8
Label: arik home page, URL: www/index.html
Label: seminar in www, URL: www/s.html
Label: databases, URL: www/index.html
Label: moshe home page, URL: www/index.html
5
Data Model
  • Hypertrees are a useful data structure because
    they support three important abstractions:
  • Collections
  • Nesting
  • Ordering
  • The notion of reference, which is very important
    to the structure of the web, is captured through
    the distinction between internal and external
    arcs.
  • Because the nodes have no type, a tree can hold
    heterogeneous records on its arcs.

6
Data Abstractions
WEB
A pair (t, F), where t is a hypertree (the schema)
and F is the browsing function
PAGE
F(u), where u is a URL
7
Tree operators
Definitions
Tails: the tails of a tree t are the trees obtained
by chopping prefixes off t.
Simple trees: the simple trees of a tree t are the
trees composed of one arc that stems from the root
of t together with its subtree.
Subtrees: the subtrees of t are the trees at the
ends of the arcs that stem from the root of t.
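The three definitions can be sketched in Python as follows. This is an illustration only: a tree is assumed to be a list of (label, subtree) pairs, labels are plain strings for brevity, and the placement of the empty subtree under Label3 is an assumption about the figure on the next slide.

```python
# A tree is a list of (label, subtree) arcs; [] is the empty (null) tree.

def simple_trees(t):
    """One tree per root arc: the arc together with its subtree."""
    return [[arc] for arc in t]


def subtrees(t):
    """The trees at the ends of the arcs that stem from the root."""
    return [sub for _label, sub in t]


def tails(t):
    """Trees left after chopping a prefix of root arcs (t itself included)."""
    return [t[i:] for i in range(len(t))]


t = [
    ("Label1", [("A1", []), ("A2", [])]),
    ("Label2", [("B1", [])]),
    ("Label3", []),
]
```

With this t, simple_trees(t) and tails(t) each have three elements, and the last element of subtrees(t) is the empty tree.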
8
Tree t
[Figure: tree t with arcs Label1, Label2 and Label3 and nodes A1, A2 and B1]
Simple trees of t
[Figure: the same arcs and nodes, one tree per root arc]
Subtrees of t
[Figure: null, A1, A2, B1]
9
Tails of T ! (prefixes)
[Figure: three tails of t, starting at Label1, at Label2 and at Label3, over the nodes A1, A2 and B1]
10
Tree operators
Concatenate
Tree1 + Tree2
Connects two trees by their roots
11
Tree operators
Hang
Arc1 / Tree1
Hangs the tree from a new arc.
12
Tree operators
Prime
Tree'
The first subtree of the argument.
13
Tree operators
Head
Tree x
The first x simple trees of the argument; if x is
not specified, only the first simple tree.
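The four operators above can be sketched in Python. This is not the WebOQL implementation: a tree is assumed to be a list of (label, subtree) pairs, the function names are invented, and the sample records reuse the list items from the HANG slide.

```python
def concatenate(t1, t2):
    """Connect two trees by their roots: the root arcs of t1, then those of t2."""
    return t1 + t2


def hang(label, t):
    """Hang tree t from a new arc carrying the given label."""
    return [(label, t)]


def prime(t):
    """The first subtree of the argument (the empty tree if t has no arcs)."""
    return t[0][1] if t else []


def head(t, x=1):
    """The first x simple trees of t, kept together as one tree."""
    return t[:x]


items = concatenate(
    hang({"Tag": "LI", "Text": "First Child"}, []),
    hang({"Tag": "LI", "Text": "Second Child"}, []),
)
ul = hang({"Tag": "UL"}, items)
```

Here prime(ul) gives back the two-item list, and head(items) keeps only the first item.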
14
  • q4
  • q5
  • q5!
  • q52

q5
q6
q7
15
HANG
  • Label: papers from smith, Format: ps.Z / q1
  • Tag: UL / Tag: LI, Text: First Child
  • Tag: LI, Text: Second Child
  • Tag: LI, Text: Third Child
  • Url: http://a.b.c., Label: Click Here

Label: Papers from smith, Format: ps.Z
Title: Recent.., Url: http://..
Title: Are.., Url: http://www.
HANG concatenate
Url: http://a.b.c., Label: Click Here
Tag: UL
Tag: LI, Text: First Child


16
Tree operators
Peek
Arc.field
Extracts a field from an arc's label, e.g.
Example.Group can have the value students.
If the field does not exist, the value nil is
returned.
IsField
Arc?field
Tests for the presence of a field in an arc's
label, e.g. Example?Group evaluates to true,
while Example?Name evaluates to false.
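As a rough Python analogy, assuming a label is represented as a dictionary and using None in place of nil (the function names peek and is_field are invented for this sketch):

```python
def peek(label, field):
    """Arc.field: the value of the field, or None (standing in for nil)."""
    return label.get(field)


def is_field(label, field):
    """Arc?field: True when the label has the field."""
    return field in label


example = {"Group": "students"}

group = peek(example, "Group")
has_group = is_field(example, "Group")
has_name = is_field(example, "Name")
```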
17
  • Page: a hypertree that has an associated URL
    identifying it.
  • Web: a collection of interrelated pages.
  • External arc: each external arc of a page is a
    link in the web.
  • Schema: a web can optionally have a
    distinguished page that provides an entry point
    to the web.

18
  • No schema: one must know the URL of one or more
    pages

http://a.b.c./three.html
http://a.b.c./one.html
http://a.b.c./two.html
19
WebOQL query
Web Web Schema
New page
http://a.b.c./three.html
http://a.b.c./one.html
http://a.b.c./four.html
http://a.b.c./two.html
20
  • <UL>
  • <LI> First Child
  • <LI> Second Child
  • <LI> Third Child
  • </UL>
  • <A HREF=http://a.b.c.> Click Here </A>

21
Url: http://a.b.c., Label: Click here
Tag: LI, Text: First Child
Tag: LI, Text: Second Child
Tag: LI, Text: Third Child
Tree representing an HTML document consisting of a
list and a hyperlink
  • Trees are ordered
  • Arcs are labeled with records rather than
    atomic values

22
Group: DBMS
Group: Card
Group: ProgLang
Title: Recent, Authors: Smith, Publications: Tech
Title: Are, Authors: Smith, Publications: ACM
Label: Abstract, Url: www
Label: Full Papers, Url: www
Paper Database: CS papers
23
SELECT - FROM - WHERE
This familiar query language construct is the
main construct of WebOQL queries.
SELECT: the query to evaluate
y.Label, y.URL
FROM: the definition of variables
x in example, y in x!
WHERE: a boolean condition
x.Seniority = 8
24
SELECT - FROM - WHERE
For each instantiation of the variables in the
from clause, check the condition in the where
clause; if it is true, evaluate the query in
the select clause and append it to the result.
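The evaluation rule just described can be sketched as a plain loop. This is a simplified illustration, not the WebOQL evaluator: the helper name select_from_where, the environment dictionaries, and the sample people data (including which links belong to whom) are assumptions.

```python
def select_from_where(instantiations, where, select):
    """Keep the select result of every instantiation that passes the where test."""
    result = []
    for env in instantiations:
        if where(env):
            result.append(select(env))
    return result


# Hypothetical data: each person record comes with a list of link records.
people = [
    ({"Name": "oded", "Seniority": 8},
     [{"Label": "databases", "URL": "www/index.html"}]),
    ({"Name": "moshe", "Sem": 5},
     [{"Label": "moshe home page", "URL": "www/index.html"}]),
]

# The from clause: every (x, y) pair, generated left to right.
instantiations = [
    {"x": person, "y": link} for person, links in people for link in links
]

rows = select_from_where(
    instantiations,
    where=lambda env: env["x"].get("Seniority") == 8,
    select=lambda env: (env["y"]["Label"], env["y"]["URL"]),
)
```

Only the instantiation whose x has Seniority 8 survives, so rows holds a single (Label, URL) pair.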
25
  • select y.Title, y.Publication
  • from x in csPapers, y in x
  • Missing data
  • Publication: undefined

26
  • Compute a listing of the papers' publication data
    grouped by title.
  • select x.Title /
  • select z.Publication from y in csPapers, z in
    y where x.Title = z.Title
  • from w in csPapers, x in w

27
  • Schema: a distinguished hypertree
  • Browsing function: maps strings (URLs) to
    hypertrees. It defines a graph in which the nodes
    are pages and there is an arc from node a to node
    b if the content of the page at node a contains
    an external arc whose Url attribute is the URL of
    the page at node b.
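A small sketch of the graph induced by a browsing function, under assumptions: the browsing function is modeled as a Python dictionary from URL to tree, each arc is a (label, external, subtree) triple, and the links between the three sample pages are invented.

```python
def external_urls(tree):
    """Yield the Url field of every external arc found anywhere in the tree."""
    for label, external, subtree in tree:
        if external:
            yield label["Url"]
        else:
            yield from external_urls(subtree)


def link_graph(pages):
    """Edges (a, b): page a contains an external arc whose Url is page b's URL."""
    return {
        (a, b)
        for a, tree in pages.items()
        for b in external_urls(tree)
        if b in pages
    }


# Hypothetical browsing function over three pages.
pages = {
    "http://a.b.c./one.html": [
        ({"Tag": "UL"}, False, [
            ({"Label": "two", "Url": "http://a.b.c./two.html"}, True, []),
        ]),
    ],
    "http://a.b.c./two.html": [
        ({"Label": "three", "Url": "http://a.b.c./three.html"}, True, []),
    ],
    "http://a.b.c./three.html": [],
}

graph = link_graph(pages)
```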

28
  • Analogy with Relational database
  • Hypertrees => relations
  • Webs => databases
  • Schema of a web => catalog of a database

29
  • select x.Tag
  • from x in
  • browse(http://www.cs.toronto.edu)

Tag: body
Tag: head
30
  • A select-from-where query creates a web
  • Select the titles and URLs of papers authored by
    Smith.
  • select y.Title, y.URL as schema
  • from x in csPapers, y in x
  • where y.Authors ~ Smith

31
  • Create a web page with URL Group Names whose
    content is the list of group names (assume that
    there is no such page in the current web)
  • Select x.Group as Group Names from x in
    csPapers

32
  • Create several pages one for each research
    group (using the group name as URL). Each page
    contains the publications of the corresponding
    group
  • Select x as x.Group from x in csPapers

33
Data Model
  • Records as Labels on Arcs
  • Internal and External Arcs

Tag: UL, Text: one of the
Tag: H1, Text: City Overview
Tag: LI, Text: If you are interested
Tag: LI, Text: One of the
Tag: LI, Text: All the hotels
Tag: XYZ, Text: If you are
Tag: XYZ, Text:
Label: Theatres Online, Url: http://www, Base: http://www, Text: This page contains...
Tag: XYZ, Text: Contains
Label: Sports Zone, Url: http://www, Base: http://www, Text: Sports Zone
Tag: XYZ, Text: One of the
Label: All the Hotels, Url: http://www, Base: http://www, Text: These are all
34
Query: list elements containing ticket
  • doc http://www.citynet.com/overview.html
  • Tag: UL /
  • select y
  • from y in doc!
  • where y.Text ~ ticket

Tag: UL
Tag: LI
Tag: LI
Tag: XYZ, Text:
Label: Theatres Online, Url: http://www, Base: http://www, Text: This page contains...
Label: Sports Zone, Url: http://www, Base: http://www, Text: Sports Zone
Tag: XYZ, Text: If you are
Tag: XYZ, Text: One of the
35
Web restructuring
Using these tree operators we have shown how a
tree can be restructured.
To restructure a web we need a function that maps
one web to another. The new web has some hypertree
as its schema, while its browsing function is an
extension of the old web's browsing function that
also targets URLs which were not previously
targeted.
In WebOQL this is done with the as clause.
36
Web restructuring
In general, the select clause of WebOQL has the
form
select q1 as s1, q2 as s2, ..., qn as sn
Each si can be either the keyword schema or a
string query.
An as clause that evaluates to schema defines
the schema of the web.
Title: y.Group as schema
37
Web restructuring
An as clause that evaluates to a string defines
a page, and the string is treated as the page's URL.
x.Name as y.Group
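A hedged sketch of how the as targets could assemble a web. It assumes that results sharing a target are appended to the same page, which is how the later research-group example reads. The names build_web and SCHEMA and the sample rows are invented for illustration.

```python
SCHEMA = object()  # sentinel standing in for the keyword schema


def build_web(results):
    """Collect (tree, target) results into a schema tree and a URL -> page map."""
    schema, pages = [], {}
    for tree, target in results:
        if target is SCHEMA:
            schema.extend(tree)                        # part of the schema
        else:
            pages.setdefault(target, []).extend(tree)  # part of the page at that URL
    return schema, pages


# Hypothetical results of evaluating "q as s" for several instantiations.
results = [
    ([({"Title": "students"}, [])], SCHEMA),
    ([({"Name": "arik"}, [])], "students"),
    ([({"Name": "moshe"}, [])], "students"),
    ([({"Name": "oded"}, [])], "professors"),
]

schema, pages = build_web(results)
```

The two results targeted at "students" end up on one page, while "professors" gets a page of its own.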
38
Web restructuring
After a web is created there are two
possibilities: either query it further
(restructure it) or return it to the host
application.
If we want to return the web to the host
application so that it can be shown in a
browser, we must format the pages in an
HTML-compliant way. This is easily done by
restructuring the web using HTML tags as labels.
39
Document restructuring
Web documents are a perfect example of
semistructured data, since they do not have a
fixed schema and can have various irregularities.
In an HTML document most of the tags may appear
any number of times or not at all.
WebOQL uses a wrapper which creates abstract
syntax trees (ASTs) from arbitrary HTML
documents. This is easily done since the markup
tags of HTML reflect the logical
relationships between the various information
items.
Example: <UL> <LI> item 1. </LI> <LI>
item 2. </LI> <LI> item 2. </LI> </UL>
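A toy wrapper in Python, given only as a sketch of the idea. It uses the standard html.parser module, handles only properly closed paired tags, and the class name TreeBuilder, the helper parse, and the (label, children) tuple layout are assumptions rather than the WebOQL wrapper itself.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a list of (label, children) arcs from well-nested paired tags."""

    def __init__(self):
        super().__init__()
        self.root = []
        self.open = [({}, self.root)]  # stack of (label, children); bottom is the root

    def handle_starttag(self, tag, attrs):
        arc = ({"Tag": tag.upper(), "Text": ""}, [])
        self.open[-1][1].append(arc)   # attach under the innermost open element
        self.open.append(arc)

    def handle_endtag(self, tag):
        if len(self.open) > 1:
            self.open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text of a paired tag: everything enclosed by it, markup excluded.
            for label, _children in self.open[1:]:
                label["Text"] = (label["Text"] + " " + text).strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


tree = parse("<UL><LI>item 1.</LI><LI>item 2.</LI></UL>")
```

Non-paired tags such as BR would need extra handling, because this sketch never closes an element that has no end tag.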
40
  • Generate a web consisting of a page for each
    research group, containing the title and authors
    of all its publications, and an index page that
    lists all the groups and provides links to their
    pages
  • newWeb select unique Name: x.Group, Url:
    x.Group as schema
  • y.Title, y.Authors as x.Group
  • from x in csPapers, y in x

41
Name: Card Punching, Url: Card Punching
Name: .., Url: ..
As schema
Name: Prog. Lang, Url: Prog. Lang..
Prog. Lang.
Card Punching
Titles: Assembly Lan, Authors: John,..
Titles: Cobol, Authors: James J
Titles: Recent, Authors: Smith
Titles: Arc, Authors: Smith
As x.Group
42
  • NewerWeb < newWeb
  • select Tag: H3, Text: y.Title
  • Tag: BR, Text: y.Publication
  • Tag: BR, Text: y.Authors
  • Tag: P
  • as x.Name
  • from x in schema, y in x.Name
  • select Tag: H2, Text: Publications of the
    x.Name Group x.Name
  • Tag: A, Label: To Index, Url:
    http://a.b.c/Index of Projects.html
  • as http://a.b.c/ x.Name .html
  • from x in schema

43
  • select Url: http://a.b.c/Index of
    Projects.html as schema,
  • Tag: H2, Text: Index of Projects
  • Tag: UL /
  • select Tag: LI /
  • Tag: A, Label: x.Name,
  • Url: http://a.b.c/ x.Name .html
  • from x in schema
  • as http://a.b.c/Index of Projects.html

44
  • <H2> Index of Projects </H2>
  • <UL>
  • <LI> <A HREF=http://a.b.c./cardpunching.html>
  • Card Punching
  • </A>
  • </LI>
  • <LI> <A HREF=http://a.b.c./programminglanguages.html>
  • Programming Languages
  • </A>
  • </LI>
  • <LI> ..
  • </UL>

Index Page
45
  • <H2>Publications of the Card Punching group </H2>
  • <H3> Recent Discoveries in Card Punching </H3>
  • <BR> Technical Report TROIS
  • <BR> Peter Smith, John Brown
  • <P>
  • <H3> Are Magnetic Media Better ? </H3>
  • <BR> ACM TOCP Vol 3 No. (1942) pp.2337
  • <BR> Peter Smith, John Brown
  • <P>
  • <A HREF=http://a.b.c./IndexnProject.html>
  • To index
  • </A>

Group Pages
46
Document restructuring
Navigation patterns
In the examples so far, the variables used in the
queries ranged over the simple trees of the tree
being queried. On the WWW, however, variables may
range over several linked subtrees whose structure
is not fully known to us.
select x.Text from x in someones.html via
^ Tag = H2
^ - record predicate which is true for every
internal arc.
Tag = H2 - record predicate which is true for
every arc which has an H2 tag.
47
Document restructuring
Navigation patterns
select x.Text from x in someones.html via
> not(Tag = H2)
> - record predicate which is true for every
external arc.
not(Tag = H2) - record predicate which is true
for every arc which does not have an H2 tag.
48
Document restructuring
Navigation patterns
When the navigation pattern is omitted, the
query is treated as if it had a navigation
pattern that always evaluates to true.
Variables are instantiated left to right, in
depth-first or breadth-first order. Breadth-first
is the default; to use depth-first, the keyword
viadfs is used instead of via.
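The difference between the two visiting orders can be illustrated with a short Python sketch. It assumes a tree is a list of (label, subtree) pairs, uses single letters as labels, and ignores navigation patterns entirely; the function names are invented.

```python
from collections import deque


def arcs_bfs(tree):
    """Labels in breadth-first, left-to-right order (the default described above)."""
    queue = deque(tree)
    while queue:
        label, subtree = queue.popleft()
        yield label
        queue.extend(subtree)


def arcs_dfs(tree):
    """Labels in depth-first, left-to-right order (the viadfs order)."""
    for label, subtree in tree:
        yield label
        yield from arcs_dfs(subtree)


tree = [("a", [("c", [])]), ("b", [])]

bfs_order = list(arcs_bfs(tree))
dfs_order = list(arcs_dfs(tree))
```

Breadth-first reaches b before descending to c, while depth-first follows a down to c first.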
49
Navigation Pattern
  • not(Tag = A) - paths of any length composed
    of arcs that do not have a Tag attribute with
    value A.
  • Tag = LI Tag = A - paths of length 2
  • > - all paths in a tree that lead from the root
    to an external arc
  • select x.Url
  • from x in http://a.b.c./index.html
  • via not(Tag = Table) >
  • All the external arcs in the document at the
    given URL that do not occur within a table

50
  • select x.Url, x.Text
  • from x in http://a.b.c./root.html
  • via (Labled Next >)
  • What will this query produce?

51
Tail Variables
  • Variable in upper case iterates over tails plus
    simple trees

52
Tag: H3, Text: Price
Tag: H3, Text: Price
Tag: UL
Tag: UL
Tag: LI
Tag: LI
Tag: LI
select X! from X in http://a.b.c./large.html
via Tag = H3 where X!.Tag = UL and X.Text ~
Price
53
Tag: H2, Text: Publications of the
Label: To index, Url, Base: http://a.b.c./cardpunching.html, Text: indexofprojects
Tag: P, Text:
Tag: H3, Text:
Tag: BR, Text: y
Tag: BR, Text:
Tag: BR, Text:
Tag: H3, Text:
Tag: BR, Text: y
Tag: P, Text:
Tree generated by the query
Tag: OL / select Tag: LI / X3 from X in
http://a.b.c./cardpunching.html! where X.Tag =
H3
Tag: OL
Tag: LI
Tag: LI
Tag: H3
54
  • Tag: OL / select Tag: LI /
  • select y
  • from y in X while not y.Tag = P
  • from X in http://a.b.c.//IrregularDoc.html!
  • where X.Tag = H3

55
  • Project web select x.proj name, x.proj descr as
    projects
  • x.emp name, x.emp phone as people
  • x.emp name as x.proj name
  • x.proj name as x.emp name
  • from x in SQLDb. Select proj name, emp name,
  • emp phone, proj descr from proj, emp, works in
  • where emp.id = worksIn.empid and
  • proj.id = worksIn.projId
  • Generate a web containing a page for each
    project, a page for each person and two index
    pages listing all the projects and all the
    people. A person's page contains pointers to the
    projects in which he or she is involved, and a
    project page contains pointers to the pages of
    the people involved in it.

56
Tag: UL, Text:
Tag: H1, Text: Publications of Research
.
Tag: H2, Text: Databases
Tag: UL, Text: Recent
Tag: H2, Text: Card Punching
Tag: H2, Text: Programming
Tag: UL, Text: Cobol in AI Sam James
Tag: LI, Text: Recent
Tag: LI, Text: Are Magnetic
Tag: LI, Text: Assembly for
Tag: LI, Text: Cobol in
.
.
Tag: CITE, Text: Are Magnetic
.
Tag: XYZ, Text: Are Magnetic
Tag: BR, Text:
Tag: BR, Text:
Label: Full Version, Url: http://www/paper2.ps.z, Base: http://www/cspapers.html, Text: 1k098k79
Label: Abstract, Url: http://www/abstr2.html, Base: http://www/cspapers.html, Text: Are Magnetic Media
Tag: BR, Text:
Tag: B, Text: Peter Smith
Tag: BR, Text: ACM TOCP Vol. 3 No. (1942) pp 23-37
57
  • select Title: y.Text,
  • Author: y!!.Text
  • from x in http://www.a.b.c./paper.html,
  • y in x where x.Tag = UL
  • Query: retrieve the titles and authors of each paper
  • x ranges over simple trees and y over the
    elements under UL

58
  • select title: y.Text,
  • authors: y!!.Text,
  • Publications: y!3.Text
  • ps-url: y!4.Url
  • abstract-url: y!!.Url
  • as pubsdb insert
  • from X in http://www.a.b.c./paper.html
  • y in X!
  • where X.Tag = H2

59
Tag: H1, Text: Reports in
Tag: H2, Text: John Smith

Tag: HR, Text:
Tag: HR, Text:
Tag: CITE, Text: Indexing
Tag: XYZ, Text: CS-TR-0120..
Tag: BR, Text:
Tag: H2, Text: David Rice
Tag: P, Text:
Tag: CITE, Text: Efficient
Tag: BR, Text:
Tag: BR, Text:
Tag: P, Text:
Tag: XYZ, Text: CS-TR-0029..
Tag: XYZ, Text: CS-TR-0327..
Label: Indexing Sound, Url: http://www/pl.ps.gz, Base: http://www./trs.html, Text: sd..sGhj9870.
Label: Efficient Clustering., Url: http://www/p2.ps.gz, Base: http://www./trs.html, Text: .fHjs9))fujs.
Label: Abstract Available Online, Url: http://www/pl.html, Base: http://www./trs.html, Text: Indexing Sound.
Label: Temporal Constraints, Url: http://www/p3.ps.gz, Base: http://www./trs.html, Text: -9ivm27813nd.
60
  • select title: Y.Text
  • author: X.Text
  • publications: Y!!.Text
  • PS-Url: YUrl
  • abstract-url: Y!4.Url
  • as PubsDb insert
  • from X in http://www.x.y.z./papers.html
  • Y in X! while not (Y.Tag = HR)
  • where X.Tag = H2
  • and Y.Tag = CITE

61
Tag: UL, Text:
Tag: H1, Text: Publications of Research
.
X
Tag: H2, Text: Databases
Tag: UL, Text: Recent
Tag: H2, Text: Programming
Tag: H2, Text: Card Punching
Tag: UL, Text: Cobol in AI Sam James
y
Tag: LI, Text: Are Magnetic
Tag: LI, Text: Recent
y
Tag: LI, Text: Assembly for
Tag: LI, Text: Cobol in
.
.
y
.
Tag: CITE, Text: Recent
Tag: BR, Text:
Label: Full Version, Url: http://www/paperl.ps.z, Base: http://www/cspapers.html, Text: hH6YiaP.
Tag: BR, Text:
Tag: XYZ, Text: Recent.
Tag: BR, Text: Technical.
Label: Abstract, Url: http://www/abstrl.html, Base: http://www/cspapers.html, Text: It is company
Tag: BR, Text:
Tag: B, Text: Peter Brown
Figure 5.6: Instantiation of Variables in Query 4
62
  • Query 4
  • csPapers select Group: X.Text /
  • select Title: y.Text,
  • Authors: y!!.Text,
  • Publication: y!3.Text /
  • Label: Abstract, Url: y!!.Url
  • Label: Full Version, Url: y!4.Url
  • from y in X!
  • from X in http://www.a.b.c./papers.html
  • where X.Tag = H2

63
Architecture
64
  • Parser Rules
  • Each node corresponds either to a subdocument
    enclosed in an occurrence of a paired tag (for
    example, the root node corresponds to the
    subdocument enclosed between <html> and </html>)
    or to a subdocument enclosed between an
    occurrence of a non-paired tag and the tag that
    follows it
  • Arcs leading to nodes that correspond to the <a>
    tag, and for which the protocol of the associated
    URL is http, are external. All other arcs are
    internal.

65
  • The incoming arc to a node contains the
    attributes of the subdocument represented by this
    node.
  • Internal arcs are labeled with a record
    containing two fields: Tag and Text.
  • Tag is the HTML tag corresponding to the subtree
    that is the destination of the arc.
  • The value of Text depends on whether Tag is
    paired or non-paired.
  • If Tag is paired, the value of Text is the text
    enclosed between <Tag> and </Tag>, excluding
    markup.
  • If Tag is non-paired, the value of Text is
    the text between <Tag> and the tag that comes
    after it in the document.

66
  • External arcs are labeled with a record
    containing four fields: Label, Url, Base and
    Text.
  • Label is the label of the hyperlink, that is,
    the text enclosed between the <a href> and </a>
    tags; Url is the value of the href attribute;
    Base is the URL of the document being processed;
    and Text is the text of the referred document,
    excluding markup.
  • A dummy tag named <xyz> is used to enclose pieces
    of text that are not explicitly tagged.
  • The rules are applied recursively to the text
    inside occurrences of paired tags.
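The arc-labeling rules can be sketched in Python. This is a simplified illustration, not the actual parser: the function names and the referred_text parameter are assumptions, and fetching the referred document is left out.

```python
from urllib.parse import urlparse


def is_external(tag, href=None):
    """An arc is external when it leads to an <a> tag whose URL uses http."""
    return tag.lower() == "a" and bool(href) and urlparse(href).scheme == "http"


def arc_label(tag, text, href=None, base=None, referred_text=""):
    """Build the record that labels the arc leading to the given element."""
    if is_external(tag, href):
        # Four fields; referred_text stands in for the text of the linked page.
        return {"Label": text, "Url": href, "Base": base, "Text": referred_text}
    # Internal arcs carry two fields.
    return {"Tag": tag.upper(), "Text": text}


link = arc_label("a", "Abstract", href="http://www/abstr2.html",
                 base="http://www/cspapers.html")
mail = arc_label("a", "Contact", href="mailto:someone@example.org")
item = arc_label("li", "Are Magnetic")
```

Under the rule as stated, a link with any protocol other than http stays an internal arc, so the mailto anchor is labeled with Tag and Text only.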

67
  • <HTML>
  • <H1> Publications of Research Groups at Cs
    Dept</H1>
  • <H2> Card Punching </H2>
  • <UL>
  • <LI>
  • <CITE> Recent Advances in Card Punching <BR>
  • <B> Peter Smith, John Brown</B><BR>
  • Technical Report TR015</CITE><BR>
  • <A HREF=http://../abstract.html> Abstract
    </A><BR>

68
  • <A HREF=http://../paper.ps.Z> Full version</A>
  • </LI>
  • <LI>
  • <CITE> Are magnetic Media Better?<BR>
  • <B> Peter Smith, John Brown, Tom</B><BR>
  • ACM TOCP Vol. 3, No. , pp</CITE><BR>
  • <A HREF=HTTP://../abst2.html> Abstract</A><BR>
  • <A HREF=http://../paper2.ps.Z> Full version</A>
  • </LI>
  • </UL>
  • <H2> Programming lang</H2>