Title: Representation of Web Data in a Web Warehouse
1 Representation of Web Data in a Web Warehouse
- Ragini A.S.
- Shipra Dutta
- November 20th, 2001
2 Overview
- Need for a web warehouse
- WHOM: the data model for WHOWEDA
- Concept of node and link
- Model for representing metadata, structure and content of web documents and hyperlinks: NMT and LMT
- Modeling of structural and textual content of web documents: NDT and LDT
- Advantages
- Disadvantages
- Conclusion and future work
3 Need for a Web Warehouse
- Rapid growth of the WWW, which is a distributed global information resource.
- Applications must be able to harness and analyze web data.
- Emergence of mobile users.
- Necessity to exploit historical web data.
- Traditional information retrieval techniques and search engines are not satisfactory.
4 Data Model of WHOWEDA: WHOM (Warehouse Object Model)
- Consists of two components:
- (1) a set of web objects
- (2) a set of web operators
- Centered on the notion of a web table, which is a set of web tuples.
- A web tuple is a set of directed graphs, each consisting of a set of nodes and links, that satisfies a web schema.
- A set of operators such as global web coupling, web join and web select is used to manipulate the web data.
5 Metadata Associated with HTML and XML Documents (Web Documents)
- HTML or XML documents may have metadata such as:
- URL
- Format, size (in bytes), date of last modification
- Information about the author
- A hyperlink in a web document may have metadata such as:
- Source URL
- Target URL
- Type of hyperlink (interior, local or global)
6 Node Metadata Attribute
- Represented using the data type node metadata-attribute
- A meta-attribute may be either atomic or complex
- Example of a complex attribute: URL
- A URL can be decomposed into server, port, protocol, path and filename
- Example of an atomic attribute: size, as it cannot be decomposed further
7 Set of Node Metadata Attributes
Figure 1
8 Node Metadata Tree (NMT)
- Represents an instance of a node metadata-attribute
- Internal vertices of the tree are meta-attribute names
- Leaf vertices of the tree are the values of the metadata attributes
9 Example
10 Example of NMT
- Consider:
- URL: http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm
- Last modification date: Thursday, 15th July 1999, 04:50:53
- Size: 10761K
- The attribute URL has the following attribute/value pairs:
- (Protocol, http), (Server, www.ninds.nih.gov), (Path, patients/Disorder/Alexander) and (Filename, Alexander.htm)
- The attribute Server has the following sub-attribute/value pairs:
- (Name, www.ninds.nih), (Domain name, gov)
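The decomposition in this example can be sketched in a few lines of Python. This is only an illustration, not the WHOWEDA implementation: the nested dictionary layout and the function names `url_to_subtree` and `build_nmt` are assumptions made here. Dictionary keys play the role of internal vertices (meta-attribute names) and the string values play the role of leaf vertices.

```python
import posixpath
from urllib.parse import urlsplit


def url_to_subtree(url):
    """Decompose a URL (a complex meta-attribute) into nested attribute/value pairs."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Split the host at its last dot: everything before it is the name,
    # the last label is the domain name, as in the slide example.
    name, _, domain = host.rpartition(".")
    path, filename = posixpath.split(parts.path)
    return {
        "Protocol": parts.scheme,
        "Server": {"Name": name, "Domain name": domain},
        "Path": path.lstrip("/"),
        "Filename": filename,
    }


def build_nmt(url, last_modified, size):
    """Assemble a node metadata tree: atomic attributes map directly to leaf values."""
    return {
        "URL": url_to_subtree(url),
        "Last modification date": last_modified,
        "Size": size,
    }


nmt = build_nmt(
    "http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm",
    "Thursday, 15th July 1999",
    "10761K",
)
```

Size and modification date stay as plain values because they are atomic, while URL and Server become subtrees because they are complex.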
11 Node Metadata Tree
Figure 2
12 Link Metadata Attribute
- Represented using the data type link metadata-attribute
- Each attribute can be either atomic or complex
- Example of a complex attribute: source URL or target URL, which can be decomposed into server, port, protocol, path and filename
- Example of an atomic attribute: link type (local, global or interior)
13 Set of Link Metadata Attributes
Figure 3
14 Link Metadata Tree (LMT)
- Represents an instance of a link metadata-attribute
- Corresponds to the link metadata attribute/value pairs of hyperlinks
- Internal vertices of the tree are meta-attribute names of hyperlinks
- Leaf vertices of the tree are the values of the metadata attributes
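A link metadata tree can be sketched the same way. The slides name three link types but do not define them, so the rules below are an assumption: a link is treated as interior when the target is the same document (differing at most by fragment), local when it is a different document on the same server, and global otherwise. The names `link_type` and `build_lmt` and the example.com URLs are hypothetical.

```python
from urllib.parse import urldefrag, urlsplit


def link_type(source_url, target_url):
    """Classify a hyperlink as interior, local or global (assumed definitions)."""
    src_doc, _ = urldefrag(source_url)
    tgt_doc, _ = urldefrag(target_url)
    if src_doc == tgt_doc:
        return "interior"  # same document, possibly a different fragment
    if urlsplit(src_doc).netloc.lower() == urlsplit(tgt_doc).netloc.lower():
        return "local"  # different document on the same server
    return "global"  # document on a different server


def build_lmt(source_url, target_url):
    """Assemble a flat link metadata tree; the URLs are left undecomposed for brevity."""
    return {
        "Source URL": source_url,
        "Target URL": target_url,
        "Link type": link_type(source_url, target_url),
    }


lmt = build_lmt("http://a.example.com/x.html", "http://b.example.com/index.html")
```

In a fuller sketch the source and target URLs would themselves be subtrees, since they are complex attributes.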
15 Figure 4
16 Example of LMT
Figure 5
17 Issues for Modeling Structure and Content
- Web data embedded within an HTML or XML document should be written in compliance with the HTML or XML specification, respectively
- Modeling tags and tagless data
- Modeling hierarchical structure
- Attribute/value pairs associated with tags
- Order of text
- Location information of a portion of tagless data
18 Components of Node Structural Attribute
- Name
- Attribute_list
- Content
- Identifier
- Location_attribute
19 Node Data Tree (NDT)
- Represents the structure and content of a web page
- Node structural objects, which are instances of node structural attributes, satisfy some dependency constraints; together they can be visualized as a rooted, directed tree, which is an NDT.
20 Figure 6
22 Figure 6
23 Features of NDT
- Rooted, directed tree
- Loss of structural information
- No loss of content data
- Exclusion of anchor tags
24 Components of NDT
- Name
- Attribute_list
- Identifier
- Content
- Location_attribute
25 Definition of Dependency Constraints
26 NDT for HTML Documents
- Classification of HTML tags:
- Non-noisy tags: HTML tags which are considered in the node data tree.
- Noisy tags: HTML tags which are ignored while mapping an HTML document to an NDT. The three types of noisy tags that our model considers are:
- 1. Tags used for formatting purposes
- 2. Tags used to represent a hyperlink
- 3. Tags that specify executable content
27 Representation of Non-noisy Tags in NDT
- Classified into three types:
- Type 1 tags
- Type 2 tags
- Type 3 tags
28 Noisy and Non-noisy Tag Attributes
- Noisy attributes: attributes which are ignored while generating an NDT from an HTML document.
- The three types of noisy attributes are:
- Attributes used for formatting purposes
- Attributes used to represent the behavior of the web document
- Attributes which specify executable content
- Non-noisy attributes: attributes considered important in the context of modeling an HTML document; they are represented in the NDT.
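The mapping from an HTML document to an NDT, with noisy tags and noisy attributes dropped, can be sketched with Python's standard `html.parser`. This is a minimal illustration under stated assumptions: the sets `NOISY_TAGS`, `NOISY_ATTRS` and `VOID_TAGS` are small illustrative samples chosen here, not the model's actual classification, the `NodeObject` and `NDTBuilder` names are hypothetical, and the location attribute is omitted. Text inside a formatting or hyperlink tag is kept in the enclosing element so that content is not lost, while script text is discarded.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Illustrative samples only (assumptions, not the model's full lists).
NOISY_TAGS = {"b", "i", "u", "font", "center", "a", "script"}
NOISY_ATTRS = {"align", "bgcolor", "color", "style", "onclick", "onload"}
VOID_TAGS = {"br", "hr", "img", "meta", "input"}  # elements that never have an end tag


@dataclass
class NodeObject:
    """One node structural object: a vertex of the NDT."""
    name: str
    attribute_list: dict
    identifier: int
    content: str = ""
    children: list = field(default_factory=list)


class NDTBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = NodeObject("document", {}, 0)
        self.stack = [self.root]
        self.next_id = 1
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag in NOISY_TAGS:
            self.in_script = self.in_script or tag == "script"
            return
        kept = {k: v for k, v in attrs if k not in NOISY_ATTRS}
        node = NodeObject(tag, kept, self.next_id)
        self.next_id += 1
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        if tag in NOISY_TAGS or tag in VOID_TAGS:
            return
        # Close the nearest open element with this name; the root is never popped.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not self.in_script:
            node = self.stack[-1]
            node.content = (node.content + " " + text).strip()


def html_to_ndt(html):
    builder = NDTBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

The sketch does not try to repair elements whose end tags are omitted; handling those cases is a separate concern.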
29 Representation of Content and Structure in XML Documents
- XML documents are mapped into a node data tree.
- Node structural objects, which are instances of node structural attributes, satisfy some dependency constraints; together they can be visualized as a rooted, directed tree, which is an NDT.
30 Issues Related to Generating NDT from XML Documents
- XML documents do not have a fixed set of tags and attributes like HTML.
- The tags and attributes are defined by the user.
- XML does not encounter the problem of elements with no end tags and elements whose tags may be omitted.
- Thus there is no need to address the issues related to Type 2 and Type 3 tags while generating NDTs from XML documents.
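Because every XML element is explicitly closed, the mapping to an NDT reduces to a plain recursive walk. The following sketch uses the standard `xml.etree.ElementTree` module; the dictionary layout, the function name `xml_to_ndt` and the sample document are assumptions made for illustration, and the location attribute is again left out.

```python
import itertools
import xml.etree.ElementTree as ET


def xml_to_ndt(element, ids=None):
    """Recursively map an XML element to a nested node structural object."""
    ids = ids if ids is not None else itertools.count(1)
    node = {
        "name": element.tag,
        "attribute_list": dict(element.attrib),
        "identifier": next(ids),
        "content": (element.text or "").strip(),
        "children": [],
    }
    for child in element:
        node["children"].append(xml_to_ndt(child, ids))
        # Text that follows a child element belongs to the parent's content.
        tail = (child.tail or "").strip()
        if tail:
            node["content"] = (node["content"] + " " + tail).strip()
    return node


# Hypothetical sample document with user-defined tags.
sample = '<disease id="d1"><name>Alexander disease</name><symptom>Seizures</symptom></disease>'
ndt = xml_to_ndt(ET.fromstring(sample))
```

No tag filtering is applied here; any filtering of attributes would be layered on top of this walk.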
31 Example of NDT Generated from an XML Document
32 Representing Structure and Content of Hyperlinks
- A hyperlink is an explicit relationship between two or more data objects or portions of data objects.
- A hyperlink is defined by the data type Link type.
- A Link type consists of three components: a set of metadata attributes, a set of link structural attributes and a reference identifier.
- Link structural attributes are used to express the structure and content of hyperlinks, and the reference identifier is used to specify the location of hyperlinks in web documents.
33 Issues for Modeling Hyperlinks
- The <a> tag and the attributes href or name are used to specify hyperlinks in HTML documents. XML links are specified by the attribute xml:link. Possible values are simple and extended, as well as locator, group and document.
- When authors add a hyperlink to a document D, they include a description of D in addition to its URL; both are important.
- The location of hyperlinks is important, as we may need to impose constraints in a query to follow only those links located in a particular portion of a web page.
34 Link Structural Attributes
- Similar to a node structural attribute, a link structural attribute consists of three components:
- Name: corresponds to the start tag of the HTML or XML link.
- Attribute_list: a finite, possibly empty set of attributes associated with the tag.
- Content: the text between the start and end tags.
35 Reference Identifier
- It is a unique identifier that references an identifier in a node structural attribute. For example, consider the web page in
36 Node Data Tree for an XML Document
37 Link Data Tree
- It can be represented by a set of instances of link structural attributes.
- HTML or XML documents are mapped into instances of link structural attributes called link structural objects; these objects and the dependency constraints can be visualized as a rooted, directed tree called a link data tree.
- The internal vertices represent tagged elements containing tag names and a list of attribute/value pairs; the leaf vertices represent the label of the link.
38 Link Data Tree for HTML Documents
- The <a> tag marks a block of an HTML document as a hypertext link.
- <a> can take several attributes, such as href or name, which specify the destination of the hypertext link or indicate that the marked text can be the target of a hypertext link.
39 Example of LDT for an HTML Document
- Consider a code snippet:
- <a href="http://www.rxlist.com/cgi/generic/index.html">all RxList monographs (Nearly 300 of them)</a>
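A link structural object for an anchor like this one can be extracted with Python's standard `html.parser`. The sketch below is illustrative only: the class name `LinkExtractor` and the dictionary layout are assumptions, the reference identifier is omitted, and anchors nested inside other anchors are not handled. The tag name and attribute list correspond to the internal vertex, and the collected text corresponds to the leaf vertex holding the link label.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect one link structural object per <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._open = None  # the anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = {"name": "a", "attribute_list": dict(attrs), "content": ""}

    def handle_data(self, data):
        if self._open is not None:
            self._open["content"] += data  # accumulate the label of the link

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self.links.append(self._open)
            self._open = None


snippet = ('<a href="http://www.rxlist.com/cgi/generic/index.html">'
           'all RxList monographs (Nearly 300 of them)</a>')
extractor = LinkExtractor()
extractor.feed(snippet)
extractor.close()
ldt = extractor.links[0]
```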
40 The Link Data Tree of the Web Page
41 Illustration of LDT of Hyperlinks Which Contain Images
42 Link Data Tree for XML Documents
- There are no fixed link tags to express links in XML data, so an element is an XML link if either it has an xml:link attribute or the element and all of its attributes and content adhere to the syntactic requirements.
- Two types of links are to be considered: simple and extended links.
- A simple link, when mapped to an LDT, is always a linear tree.
43 Example of Extended XML Link
44 Noisy Tags and Attributes
- XML tags are user-defined and are not used for formatting purposes. Thus, there are no noisy tags to be ignored while generating LDTs.
- The attributes which specify link behavior, such as show and actuate, are however ignored.
45 Advantages
- It provides location-independent information to mobile users.
- It is used in building a web data repository that supports historical web data.
- It is an effective and efficient information retrieval technique.
46 Disadvantages
- Information retrieval is complex.
- It does not handle the executable content of web documents.
- Attributes used to represent the behavior of the web document are not considered in this model.
47 Conclusions
- This model logically separates the hyperlinks from the web documents.
- It aids in the representation of metadata, content and structure of HTML and XML data as a tree-like structure.