Title: WebOQL
1WebOQL
- A Web Object Query Language
2OVERVIEW
- Data model supports abstractions for modeling
record-based data, structured documents and
hypertexts
- Supports querying small databases represented as
documents (such as catalogs), restructuring
single pages (converting a large page into
smaller pages), and restructuring sets of pages, for
example, creating an index page containing a
hyperlink to each of them and adding to each page
a hyperlink to the index page.
- Restructuring the content of a web site in order
to show the same content in another view
3Data Model
The WebOQL data model introduces the hypertree, a
tree-based data model representing structured
documents containing hyperlinks.
Hypertrees are ordered arc-labeled trees with two
kinds of arcs: internal and external.
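The hypertree described above can be sketched in Python as follows. This is only an illustrative model, not WebOQL's own implementation: the class names `Arc` and `Hypertree` are hypothetical, and the way the example records are nested is an assumption, since the slide only lists the arc labels.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    label: dict              # record labeling the arc, e.g. {"Group": "students"}
    subtree: "Hypertree"     # tree at the end of the arc
    external: bool = False   # True for an external arc (a hyperlink)

@dataclass
class Hypertree:
    arcs: list = field(default_factory=list)  # ordered outgoing arcs of the root

def leaf():
    """An empty tree, used where an arc has nothing below it."""
    return Hypertree()

# Assumed nesting: groups at the top, people below, links below people.
example = Hypertree([
    Arc({"Group": "professors"}, Hypertree([
        Arc({"Name": "oded", "Seniority": 8}, Hypertree([
            Arc({"Label": "databases", "URL": "www/index.html"}, leaf(), external=True),
        ])),
    ])),
    Arc({"Group": "students"}, Hypertree([
        Arc({"Name": "moshe", "Sem": 5}, leaf()),
        Arc({"Name": "arik", "Sem": 8}, leaf()),
    ])),
])
```

Keeping the arcs in a list preserves their order, and the `external` flag captures the distinction between internal and external arcs.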
4Data Model
Example (hypertree figure; arc labels only)
Group: students
Group: professors
Name: oded, Seniority: 8
Name: moshe, Sem: 5
Name: arik, Sem: 8
Label: arik home page, URL: www/index.html
Label: seminar in www, URL: www/s.html
Label: databases, URL: www/index.html
Label: moshe home page, URL: www/index.html
5Data Model
- Hypertrees are a useful data structure because
they support three important abstractions
- Collections
- Nesting
- Ordering
- The notion of reference, which is very important to
the structure of the web, is captured through the
distinction between internal and external arcs.
- Because the nodes have no type, the tree can hold
heterogeneous records on its arcs.
6Data Abstractions
WEB
a pair (t, F) where t is a hypertree (the
schema) and F is the browsing function
PAGE
F(u) where u is a URL
7Tree operators
Definitions
Tails: the tails of a tree t are the trees obtained by
chopping prefixes of t.
Simple trees: the simple trees of a tree t are the trees
that are composed of an arc that stems from the
root of t together with its subtree.
Subtrees: the subtrees of t are the trees at the end
of the arcs which stem from the root of t.
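The three definitions above can be made concrete with a small Python sketch. It assumes a tree is an ordered list of `(label, subtree)` pairs. That representation and the function names are illustrative choices, not part of WebOQL.

```python
def tails(t):
    """Trees obtained by chopping a prefix of arcs off t (t itself included)."""
    return [t[i:] for i in range(len(t))]

def simple_trees(t):
    """One tree per arc stemming from the root: the arc plus its subtree."""
    return [[arc] for arc in t]

def subtrees(t):
    """The trees found at the end of the arcs stemming from the root."""
    return [subtree for _label, subtree in t]

# A tree with three arcs; A1, A2 and B1 stand for arbitrary subtrees.
A1, A2, B1 = [("a1", [])], [("a2", [])], [("b1", [])]
t = [("Label1", A1), ("Label2", A2), ("Label3", B1)]
```

With this `t`, `tails(t)` yields three trees (all arcs, the last two arcs, the last arc), while `simple_trees(t)` and `subtrees(t)` each yield one entry per arc.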
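8[Figure: a tree t whose root has three arcs labeled Label1, Label2 and Label3, with subtrees A1, A2 and B1, shown together with the simple trees of t and the subtrees of t]
9[Figure: the tails of t, obtained by chopping prefixes: the arcs Label1, Label2, Label3; then Label2, Label3; then Label3]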
10Tree operators
Concatenate
Tree1 Tree2
Connects two trees by their roots
11Tree operators
Hang
Arc1 / Tree1
Hangs the tree from a new arc.
12Tree operators
Prime
Tree
The first subtree of the argument.
13Tree operators
Head
Tree x
The first x simple trees of the argument, if x is
not specified then only the first simple tree.
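The four operators above (Concatenate, Hang, Prime, Head) can be sketched in Python under the assumption that a tree is an ordered list of `(label, subtree)` pairs. The function names are hypothetical stand-ins for the WebOQL operator symbols, and the example data mirrors the list-plus-hyperlink document used elsewhere in these slides.

```python
def concatenate(t1, t2):
    """Connect two trees by their roots: arcs of t1 followed by arcs of t2."""
    return t1 + t2

def hang(label, t):
    """Hang tree t from a single new arc carrying the given label record."""
    return [(label, t)]

def prime(t):
    """First subtree of t (the empty tree if t has no arcs)."""
    return t[0][1] if t else []

def head(t, x=1):
    """First x simple trees of t; only the first one when x is not given."""
    return t[:x]

# Three list items, concatenated and then hung from a UL arc.
items = concatenate(
    concatenate(hang({"Tag": "LI", "Text": "First Child"}, []),
                hang({"Tag": "LI", "Text": "Second Child"}, [])),
    hang({"Tag": "LI", "Text": "Third Child"}, []),
)
doc = concatenate(
    hang({"Tag": "UL"}, items),
    hang({"Url": "http://a.b.c.", "Label": "Click Here"}, []),
)
```

Here `prime(doc)` returns the list items again, and `head(items, 2)` keeps the first two of them.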
15HANG
- Label papers from smith, Format ps.Z/q1
- Tag UL/Tag LI, Text First Child
- Tag LI, Text Second Child
- Tag LI, Text Third Child
- Url http//a.b.c., Label Click Here
LabelPapers from smith Formatps.Z
TitleRecent.. Urlhttp//..
Title Are.. Urlhttp//www.
HANG concatenate
Url http//a.b.c., Label Click Here
TagUL
TagLI TextFirstChild
16Tree operators
Peek
Arc.field
Extracts a field from an arc's label, e.g.
Example.Group can have a value of students.
If this field does not exist, a value of nil is
returned.
IsField
Arc?field
Tests for the presence of a field in an arc's
label, e.g. Example?Group evaluates to true,
while Example?Name evaluates to false.
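A minimal Python sketch of Peek and IsField, assuming an arc label is a plain dictionary and using `None` in the role of nil. The function names are illustrative.

```python
def peek(label, field):
    """Arc.field: the value of the field, or None (nil) if it is absent."""
    return label.get(field)

def is_field(label, field):
    """Arc?field: True when the label record contains the field."""
    return field in label

example_label = {"Group": "students"}
group = peek(example_label, "Group")        # "students"
missing = peek(example_label, "Name")       # None, the field does not exist
has_group = is_field(example_label, "Group")
has_name = is_field(example_label, "Name")
```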
17- Page: a hypertree that has an associated URL
identifying it.
- Web: a collection of interrelated pages.
- Each external arc of a page is a link in the web.
- Schema: a web can optionally have a
distinguished page that provides an entry point to the
web
18- No schema: one must know the URL of one or more
pages
http://a.b.c./three.html
http://a.b.c./one.html
http://a.b.c./two.html
19Weboql query
Web Web Schema
New page
http//a.b.c./three.html
http//a.b.c./one.html
http//a.b.c./four.html
http//a.b.c./two.html
20- <UL>
- <LI> First Child
- <LI> Second Child
- <LI> Third Child
- </UL>
- <A HREF="http://a.b.c."> Click Here </A>
21Url: http://a.b.c., Label: Click here
Tag: LI, Text: First Child
Tag: LI, Text: Second Child
Tag: LI, Text: Third Child
Tree representing an HTML document consisting of a
list and a hyperlink
- Trees are ordered
- Arcs are labeled with records rather than atomic
values
22Paper database: CS papers (figure; arc labels only)
Group: DBMS
Group: Card
Group: ProgLang
Title: Recent, Authors: Smith, Publications: Tech
Title: Are, Authors: Smith, Publications: ACM
Label: Abstract, Url: www
Label: Full Papers, Url: www
23SELECT - FROM - WHERE
This familiar query language construct is the
main construct of WebOQL queries.
Query to evaluate:
y.Label, y.URL
Definition of variables:
x in example, y in x!
A boolean condition:
x.Seniority = 8
24SELECT - FROM - WHERE
For each instantiation of the variables in the
from clause, check the condition in the where
clause. If it is true, evaluate the query in
the select clause and append it to the result.
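The evaluation rule above amounts to a filtered loop over variable bindings. The Python sketch below illustrates that reading. The function `select_from_where`, the `(label, subtree)` tree representation, and the placement of the two papers under the Card group are all assumptions made for the example.

```python
def select_from_where(bindings, where, select):
    """Evaluate select for every variable binding that satisfies where."""
    result = []
    for env in bindings():
        if where(env):
            result.append(select(env))
    return result

# Assumed layout of the paper database: groups on top, papers below.
cs_papers = [
    ({"Group": "Card"}, [
        ({"Title": "Recent", "Authors": "Smith", "Publications": "Tech"}, []),
        ({"Title": "Are", "Authors": "Smith", "Publications": "ACM"}, []),
    ]),
    ({"Group": "DBMS"}, []),
]

def bindings():
    """from x in csPapers, y in the arcs below x."""
    for x in cs_papers:
        for y in x[1]:
            yield {"x": x, "y": y}

result = select_from_where(
    bindings,
    where=lambda env: env["y"][0]["Publications"] == "ACM",
    select=lambda env: env["y"][0]["Title"],
)
```

The from clause becomes the nested loops in `bindings`, the where clause is the filter, and each select result is appended in the order the bindings are produced.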
25- Select y.Title, y.Publication
- From x in csPapers, y in x
- Missing data
- Publication: undefined
26- Compute a listing of the papers' publication data
grouped by title.
- Select x.Title /
- (Select z.Publication from y in csPapers, z in
y where x.Title = z.Title)
- From w in csPapers, x in w
27- Schema: a distinguished hypertree.
- Browsing function: maps strings (URLs) to
hypertrees. It defines a graph where the nodes are
pages and there is an arc from node a to node b if
the content of the page at node a contains an
external arc whose url attribute is the URL of
the page at node b.
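The graph induced by a browsing function can be sketched in Python as follows. Pages are assumed to be lists of `(label, subtree, external)` triples, and the browsing function is modeled as a dictionary lookup. The three URLs reuse the example addresses from these slides, but the links between them are invented for illustration.

```python
def external_urls(tree):
    """Collect the Url field of every external arc in a page, in order."""
    found = []
    for label, subtree, external in tree:
        if external:
            found.append(label["Url"])
        else:
            found.extend(external_urls(subtree))
    return found

def link_graph(browse, urls):
    """Map each page URL to the URLs its external arcs point to."""
    return {url: external_urls(browse(url)) for url in urls}

ONE, TWO, THREE = ("http://a.b.c./one.html",
                   "http://a.b.c./two.html",
                   "http://a.b.c./three.html")

# Hypothetical link structure: one.html links to the other two pages.
pages = {
    ONE: [({"Tag": "UL"}, [
        ({"Label": "two", "Url": TWO}, [], True),
        ({"Label": "three", "Url": THREE}, [], True),
    ], False)],
    TWO: [({"Label": "back", "Url": ONE}, [], True)],
    THREE: [],
}

graph = link_graph(pages.get, [ONE, TWO, THREE])
```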
28- Analogy with relational databases
- Hypertrees -> relations
- Webs -> databases
- Schema of a web -> catalog of a database
29- Select x.Tag
- From x in browse("http://www.cs.toronto.edu")
Tag: body
Tag: head
30- SFW creates a web
- Select the titles and URLs of papers authored by
Smith.
- Select y.Title, y.URL as schema
- From x in csPapers, y in x
- Where y.authors smith
31- Create a web page with URL "Group Names" whose
content is the list of group names (assume that
there is no such page in the current web)
- Select x.Group as "Group Names" from x in
csPapers
32- Create several pages, one for each research
group (using the group name as URL). Each page
contains the publications of the corresponding
group
- Select x as x.Group from x in csPapers
33Data Model
- Records as Labels on Arcs
- Internal and External Arcs
Tag UL Text one of the
Tag H1, Text City Overview
Tag L1, Text If you are interested
Tag LI, Text One of the
Tag L1, Text All the hotels
Tag XYZ, Text If you are
Tag XYZ, Text
Label Theatres Online, Url http//www, Base
http//www, Text This page contains...
Tag XYZ, Text Contains
Label Sports Zone, Url http//www, Base
http//www, Text Sports Zone
Tag XYZ, Text One of the
Label All the Hotels, Url http//www, Base
http//www, Text These are all
34Query list elements containing ticket
- doc http//www.citynet.com/overview.html
- tag UL/
- Select y
- from y in doc !
- where y.text ticket
Tag UL
Tag LI
Tag LI
Tag XYZ, Text
Label Theatres Online, Url http//www, Base
http//www, Text This page contains...
Label Sports Zone, Url http//www, Base
http//www, Text Sports Zone
Tag XYZ, Text If you are
Tag XYZ, Text One of the
35Web restructuring
Using these tree operators we have shown how a
tree can be restructured.
To restructure a web we must have a function
which maps one web to another. The new web has
some hypertree as its schema, while its browsing
function is an extension of the old web's
browsing function that targets URLs which were not
previously targeted.
In WebOQL this is done by using the as
clause.
36Web restructuring
Generally, the select clause of WebOQL has the
form
Select q1 as s1, q2 as s2, ..., qn as sn
Each si can be either the keyword schema or a string
query.
An as clause which evaluates to schema defines
the schema of the web.
Title: y.Group as schema
37Web restructuring
An as clause which evaluates to a string defines
a page and is treated as the URL for it.
x.Name as y.Group
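The effect of the as clause can be sketched in Python: each result is routed either to the schema or to the page named by the string the clause evaluates to. The function `restructure`, the flat binding dictionaries, and the sample groups and titles are illustrative assumptions. Skipping repeated values is a simplification added so that each group appears once in the schema.

```python
SCHEMA = "schema"

def restructure(bindings, clauses):
    """Build a web (schema, browsing function) from (query, target) clauses.

    Each target returns SCHEMA or a string that is used as the page URL.
    """
    schema, pages = [], {}
    for env in bindings:
        for query, target in clauses:
            key = target(env)
            dest = schema if key == SCHEMA else pages.setdefault(key, [])
            value = query(env)
            if value not in dest:  # assumption: repeated values are kept once
                dest.append(value)
    return schema, lambda url: pages.get(url)

# Hypothetical bindings: one entry per (group, paper) combination.
bindings = [
    {"group": "Card Punching", "title": "Recent"},
    {"group": "Card Punching", "title": "Are"},
    {"group": "Prog. Lang", "title": "Cobol"},
]

schema, browse = restructure(bindings, [
    (lambda e: {"Name": e["group"], "Url": e["group"]}, lambda e: SCHEMA),
    (lambda e: {"Title": e["title"]}, lambda e: e["group"]),
])
```

The returned pair plays the role of (t, F): `schema` is the distinguished hypertree and `browse` maps a URL to the page created for it, or to `None` when no page has that URL.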
38Web restructuring
After a web is created there are two
possibilities: either query it further
(restructure it) or return it to the host
application.
If we want to return the web to the host
application so that it can be shown in a
browser, we must format the pages in an
HTML-compliant way. This is done by
restructuring it using HTML tags as labels.
39Document restructuring
Web documents are a perfect example of
semi-structured data, since they do not have a fixed
schema and can have various irregularities. In
an HTML document most of the tags may appear any
number of times or not at all.
WebOQL uses a wrapper which creates abstract
syntax trees (ASTs) from arbitrary HTML
documents. This is feasible because the markup
tags of HTML reflect the logical
relationships between the various information
items.
Example: <UL> <LI> item 1. </LI> <LI>
item 2. </LI> <LI> item 2. </LI> </UL>
40- Generate a web consisting of a page for each
research group containing the title and authors of
all its publications, and an index web page
that lists all the groups and provides links to
their pages
- newWeb Select unique Name: x.Group, Url:
x.Group as schema,
- y.Title, y.Authors as x.Group
- From x in csPapers, y in x
41Name Card Punching Url Card Punching
Name Url..
As Schema
Name Prog. Lang Url Prog.Lang..
Prog. Lang.
Card Punching
Titles Assembly Lan Authors John,..
Titles Cobol Authors James J
Titles Recent Authors Smith
Titles Arc Authors Smith
As x. group
42- NewerWeb lt newWeb
- select Tag H3, Text y.Title
- Tag BR, Text y.Publication
- Tag BR, Text y.Authors
- Tag P
- as x.Name
- from x in schema, y in x.Name
-
- select Tag H2, Text Publications of the
x.Name Group x.Name - Tag A, Label To Index, Url
http//a.b.c/Index of Projects.html - as http//a.b.c/ x.Name .html
- from x in schema
43-
- select Url http//a.b.c/Index of
Projects.html as schema, - Tag H2, Text Index of Projects
- Tag UL /
- select Tag LI /
- Tag A, Label x.Name,
- Url http//a.b.c/ x.name .html
-
-
- from x in schema
- as http//a.b.c/Index of Projects.html
44- <H2> Index of Projects </H2>
- <UL>
- <LI> <A HREF="http://a.b.c./cardpunching.html">
- Card Punching
- </A>
- </LI>
- <LI> <A HREF="http://a.b.c./programminglanguages.html">
- Programming Languages
- </A>
- </LI>
- <LI> ..
- </UL>
Index Page
45- <H2>Publications of the Card Punching group </H2>
- <H3> Recent Discoveries in Card Punching </H3>
- <BR> Technical Report TR015
- <BR> Peter Smith, John Brown
- <P>
- <H3> Are Magnetic Media Better ? </H3>
- <BR> ACM TOCP Vol 3 No. (1942) pp. 23-37
- <BR> Peter Smith, John Brown
- <P>
- <A HREF="http://a.b.c./IndexnProject.html">
- To index
- </A>
Group Pages
46Document restructuring
Navigation patterns
In the examples we have seen, the variables used
in the queries ranged over the simple trees of the
tree we queried. On the WWW, however, variables may
range over several linked subtrees
whose structure is not fully known to us.
select x.text from x in someones.html via
Tag H2
- record predicate which is true for every
internal arc.
TagH2 - record predicate which is true for
every arc which has an H2 tag.
47Document restructuring
Navigation patterns
select x.text from x in someones.html via
>not(Tag = H2)
> : record predicate which is true for every
external arc.
not(Tag = H2) : record predicate which is true
for every arc which does not have an H2 tag.
48Document restructuring
Navigation patterns
When navigation patterns are omitted, the
query is treated as if there were a navigation
pattern which always evaluates to true.
Variables are instantiated in left-to-right order
using depth-first or breadth-first search. Since the
default is breadth-first, the keyword viadfs is
used instead of via to request depth-first search.
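The difference between the two instantiation orders can be sketched in Python. This is a deliberate simplification: a navigation pattern is reduced to a single predicate that every arc on the path must satisfy, the tree is a list of `(label, subtree)` pairs, and the function name `traverse` and the sample labels are hypothetical.

```python
from collections import deque

def traverse(tree, predicate=lambda label: True, depth_first=False):
    """Yield arc labels left to right, breadth-first unless depth_first is set.

    Arcs failing the predicate are skipped together with everything below them.
    """
    pending = deque(tree)
    while pending:
        label, subtree = pending.popleft()
        if not predicate(label):
            continue
        yield label
        if depth_first:
            # Put the children in front, preserving their left-to-right order.
            pending.extendleft(reversed(subtree))
        else:
            pending.extend(subtree)

tree = [
    ("a", [("a1", []), ("a2", [])]),
    ("b", [("b1", [])]),
]

breadth_first = list(traverse(tree))                   # like via
depth_first = list(traverse(tree, depth_first=True))   # like viadfs
without_a = list(traverse(tree, predicate=lambda label: label != "a"))
```

Omitting the predicate corresponds to a pattern that is always true, so every arc is visited.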
49Navigation Pattern
- Not (Tag A): path of any length composed
of arcs not having an attribute tag with value
A.
- Tag LI Tag A: path of length 2
- >: all paths in a tree that lead from the root to
an external arc
- Select x.url
- from x in http://a.b.c./index.html
- Via not (tag Table)>
- All the external arcs in the document pointed to
by the URL that do not occur within a table
50- Select x.url,x.text
- From x in http//a.b.c./root.html
- Via (Labled Nextgt)
- What will this query produce?
51Tail Variables
- Variable in upper case iterates over tails plus
simple trees
52Tag H3, Text Price
Tag H3, Text Price
Tag UL
Tag UL
Tag LI
Tag LI
Tag LI
Select X ! From X in http//a.b.c./large.html
via Tag H3 Where X!.TagUL and X.Text
Price
53Tag H2, Text Publications of the
Label To index, Url Base http//a.b.c./cardpu
nching.html, Text indexofprojects
Tag P, Text
Tag H3, Text
Tag BR, Text y
Tag BR, Text
Tag BR, Text
Tag H3, Text
Tag BR, Text y
Tag P, Text
Tree generated by Query
Tag OL/Select Tag LI / X3 from X in
http//a.b.c./cardpunching.html! where X.tag
H3
Tag OL
Tag LI
Tag LI
Tag H3
54- Tag OL/Select Tag LI/
- Select y
- from y in X while not y.Tagp
- From X in http//a.b.c.//IrregularDoc.html!
- where X.tag H3
55- Project web select x.proj name, x.proj descr as
projects - x.emp name, x.emp phone as people
- x.emp name as x.proj name
- x.proj name as x.emp name
- From x in SQLDb. Select proj name, emp name,
- emp phone, proj descr from proj, emp, works in
- where Emp.id worksIn.empid and
- proj.id worksIn.projId
- Generate a web containing a page for each
project, a page for each person, and two index
pages listing all the projects and all the
people. A person's page contains pointers to the
projects in which he/she is involved, and a
project page contains pointers to the pages of
the people involved in it.
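The web described above can be sketched in Python by routing each joined row to four pages. The function `project_web`, the use of plain names as page URLs, and the sample rows are illustrative assumptions. The rows stand in for the result of the SQL join in the query.

```python
def project_web(rows):
    """Build pages from (proj_name, proj_descr, emp_name, emp_phone) rows."""
    pages = {"projects": [], "people": []}

    def add(url, item):
        page = pages.setdefault(url, [])
        if item not in page:  # list each entry once per page
            page.append(item)

    for proj_name, proj_descr, emp_name, emp_phone in rows:
        add("projects", (proj_name, proj_descr))  # index of all projects
        add("people", (emp_name, emp_phone))      # index of all people
        add(proj_name, emp_name)                  # project page points to people
        add(emp_name, proj_name)                  # person page points to projects
    return pages

# Hypothetical rows, invented for illustration only.
rows = [
    ("Alpha", "First project", "Ann", "555-0101"),
    ("Alpha", "First project", "Bob", "555-0102"),
    ("Beta", "Second project", "Ann", "555-0101"),
]

web = project_web(rows)
```

A real web would need distinct URL spaces so that a project and a person with the same name do not share a page. The sketch ignores that case.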
56Tag UL, Text
Tag H1, Text Publications of Research
.
Tag H2, Text Databases
Tag UL, Text Recent
Tag H2, Text Card Punching
Tag H2, Text Programming
Tag UL, Text Cobol in AI Sam James
Tag LI, Text Recent
Tag LI, Text Are Magnetic
Tag LI, Text Assembly for
Tag LI, Text Cobol in
.
.
Tag CITE, Text Are Magnetic
.
Tag XYZ, Text Are Magnetic
Tag BR, Text
Tag BR, Text
Label Full Version, Url http//www/paper2.ps.
z, Base http//www/cspapers.html, Text
1k098k79
Label Abstract, Url http//www/abstr2.html, B
ase http//www/cspapers.html, Text Are
Magnetic Media
Tag BR, Text
Tag B, Text Peter Smith
Tag BR, Text ACM TOCP Vol. 3 No. (1942) pp
23-37
57- Select Title y.Text,
- Author y!!.text
- From x in http//www.a.b.c./paper.html,
- y in x Where x.Tag UL
- Query: retrieve the titles and authors of each paper
- x ranges over simple trees and y over elements
under UL
58- Select title y.Text,
- authors y!!.text,
- Publications y!3.Text
- ps-url y!4.url
- abstract-urly!!.url
- as pubsdb insert
- From X in http//www.a.b.c./paper.html
- y in X!
- Where X.tag H2
59Tag H1, Text Reports in
Tag H2, Text John Smith
Tag HR, Text
Tag HR, Text
Tag CITE, Text Indexing
Tag XYZ, TextCS-TR-0120..
Tag BR, Text
Tag H2, Text David Rice
Tag P, Text
Tag CITE, TextEfficient
Tag BR, Text
Tag BR, Text
Tag P, Text
Tag XYZ, TextCS-TR-0029..
Tag XYZ, TextCS-TR-0327..
Label Indexing Sound, Url http//www/pl.ps.gz
, Base http//www./trs.html, Text
sd..sGhj9870.
Label Efficient Clustering., Url
http//www/p2.ps.gz, Base http//www./trs.html,
Text .fHjs9))fujs.
LabelAbstract Available Online, Url
http//www/pl.html, Base http//www./trs.html,
Text Indexing Sound.
LabelTemporal Constraints, Url
http//www/p3.ps.gz, Base http//www./trs.html,
Text -9ivm27813nd.
60- Select title Y.text
- author X.text
- publications Y!!.Text
- PS-Url YUrl
- abstract-urlY!4.Url
- as PubsDb insert
- From X in http//www.x.y.z./papers.html
- Y in X! while not (Y.Tag HR)
- where X.Tag H2
- and Y.TagCITE
61Tag UL, Text
Tag H1, Text Publications of Research
.
X
Tag H2, Text Databases
Tag UL, Text Recent
Tag H2, Text Programming
Tag H2, Text Card Punching
Tag UL, Text Cobol in AI Sam James
y
Tag LI, Text Are Magnetic
Tag LI, Text Recent
y
Tag LI, Text Assembly for
Tag LI, Text Cobol in
.
.
y
.
Tag CITE, Text Recent
Tag BR, Text
Label Full Version, Url http//www/paperl.ps.
z, Base http//www/cspapers.html, Text
hH6YiaP.
Tag BR, Text
Tag XYZ, Text Recent.
Tag BR, Text Technical.
Label Abstract, Url http//www/abstrl.html, B
ase http//www/cspapers.html, Text It is
company
Tag BR, Text
Tag B, Text Peter Brown
Figure 5.6 Instantiation of Variables in Query 4
62- Query 4
- csPapers selectGroup X.Text /
- selectTitle y.Text ,
- Authors y!!.Text,
- Publicationy!3.Text /
- Label Abstract,Urly!!.Url
- Label Full Version,Urly!4.Url
-
- from y in X!
-
- from X in http//www.a.b.c./papers.html
- where X.Tag H2
63Architecture
64- Parser Rules
- Each node corresponds either to a subdocument
enclosed in an occurrence of a paired tag (for
example, the root node corresponds to the subdocument
enclosed between <html> and </html>) or to a
subdocument enclosed between an occurrence of a
non-paired tag and the tag that follows it.
- Arcs leading to nodes corresponding to the <a>
tag and for which the protocol of the associated
URL is http are external. All other arcs are
internal.
65- The incoming arc to a node contains the
attributes of the subdocument represented by this
node.
- Internal arcs are labeled with a record
containing two fields: Tag and Text.
- Tag is the HTML tag corresponding to the subtree
that is the destination of the arc.
- The value of Text depends on whether Tag is
paired or non-paired.
- If Tag is paired, the value of Text is the text
that is enclosed between <Tag> and </Tag>,
excluding markup.
- If Tag is non-paired, the value of Text is
the text between <Tag> and the tag that comes
after it in the document.
66- External arcs are labeled with a record
containing four fields: Label, Url, Base and
Text.
- Label is the label of the hyperlink (the text
enclosed between the <a href=...> and </a> tags),
Url is the value of the href attribute, Base is
the URL of the document being processed, and Text
is the text of the referred document excluding
markup.
- A dummy tag named <xyz> is used to enclose pieces
of text that are not explicitly tagged.
- Rules are applied recursively to the text inside
occurrences of paired tags.
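A simplified version of these parser rules can be sketched with Python's standard `html.parser` module. This is not the WebOQL wrapper itself. The class name `Wrapper` and the set of non-paired tags are assumptions, the input is assumed to be well nested, the Text field of external arcs (the text of the referred document) is omitted because it would require fetching that document, and the `<xyz>` dummy tag is not modeled.

```python
from html.parser import HTMLParser

NON_PAIRED = {"br", "hr", "img"}  # assumption: tags treated as non-paired

class Wrapper(HTMLParser):
    """Builds a tree of (label, subtree) arcs from well-nested HTML."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.root = []            # arcs of the whole document
        self.stack = [self.root]  # subtrees of the currently open paired tags
        self.open_labels = []     # labels of the currently open paired tags
        self.pending = None       # non-paired tag still waiting for its text

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("http:"):
            # External arc: hyperlink whose URL uses the http protocol.
            label = {"Label": "", "Url": href, "Base": self.base}
        else:
            label = {"Tag": tag.upper(), "Text": ""}
        subtree = []
        self.stack[-1].append((label, subtree))
        if tag in NON_PAIRED:
            self.pending = label
        else:
            self.pending = None
            self.stack.append(subtree)
            self.open_labels.append(label)

    def handle_endtag(self, tag):
        self.pending = None
        if tag not in NON_PAIRED and len(self.stack) > 1:
            self.stack.pop()
            self.open_labels.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.pending is not None:
            # Text between a non-paired tag and the tag that follows it.
            self.pending["Text"] = text
            self.pending = None
        for label in self.open_labels:
            # The text belongs to every enclosing paired tag, markup excluded.
            key = "Label" if "Url" in label else "Text"
            label[key] = (label[key] + " " + text).strip()

wrapper = Wrapper("http://a.b.c./one.html")
wrapper.feed('<ul><li>First Child</li><li>Second Child</li></ul>'
             '<a href="http://a.b.c.">Click Here</a>')
wrapper.close()
tree = wrapper.root
```

In the resulting `tree`, the first arc is internal (labeled with Tag and Text) and carries the two list items as its subtree, while the second arc is external (labeled with Label, Url and Base).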
67- <HTML>
- <H1> Publications of Research Groups at Cs
Dept</H1>
- <H2> Card Punching </H2>
- <UL>
- <LI>
- <CITE> Recent Advances in Card Punching <BR>
- <B> Peter Smith, John Brown</B><BR>
- Technical Report TR015</CITE><BR>
- <A HREF="http://../abstract.html"> Abstract
</A><BR>
68- <A HREF="http://../paper.ps.Z"> Full version</A>
- </LI>
- <LI>
- <CITE> Are magnetic Media Better?<BR>
- <B> Peter Smith, John Brown, Tom</B><BR>
- ACM TOCP Vol. 3, No. , pp</CITE><BR>
- <A HREF="HTTP://../abst2.html"> Abstract</A><BR>
- <A HREF="http://../paper2.ps.Z"> Full version</A>
- </LI>
- </UL>
- <H2> Programming lang</H2>