Representation of Web Data in a Web Warehouse

Slides: 48

Provided by: UMR

Learn more at: https://web.mst.edu

Transcript and Presenter's Notes

1
Representation of Web Data in a Web Warehouse

Ragini A.S.
Shipra Dutta
November 20th, 2001

2
Overview

Need for a web warehouse
WHOM: data model for WHOWEDA
Concept of node and link
Model for representing metadata, structure and content of web documents and hyperlinks
NMT and LMT
Modeling of structural and textual content of web documents
NDT and LDT
Advantages
Disadvantages
Conclusion and future work

3
Need of a Web Warehouse

Rapid growth of the WWW, which is a distributed global information resource.
Applications must be able to harness and analyze web data.
Emergence of mobile users.
Necessity to exploit historical web data.
Traditional information retrieval techniques and search engines are not satisfactory.

4
Data Model of WHOWEDA: WHOM (Warehouse Object Model)

Consists of two components:
(1) a set of web objects
(2) a set of web operators
Centered on the notion of a web table, which is a set of web tuples.
A web tuple is a set of directed graphs, each consisting of a set of nodes and links, that satisfies a web schema.
A set of operators such as global web coupling, web join and web select is used to manipulate the web data.

5
Metadata associated with HTML and XML documents (web documents)

HTML or XML documents may have metadata such as:
URL
Format, size (in bytes), date of last modification
Information about the author
A hyperlink in a web document may have metadata such as:
Source URL
Target URL
Type of hyperlink (interior, local or global)

6
Node Meta Data Attribute

Represented using the data type node metadata-attribute.
A meta-attribute may be either atomic or complex.
E.g. of a complex attribute: URL, which can be decomposed into server, port, protocol, path and filename.
E.g. of an atomic attribute: size, as it cannot be further decomposed.

7
Set of node meta data attributes
Figure 1
8
Node Meta data Tree(NMT)

Representation of an instance of a node meta-attribute.
Internal vertices of the tree are meta-attribute names.
Leaf vertices of the tree are values of the metadata attributes.

9
Example
10
Example of NMT

Consider:
URL: http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm
Last modification date: Thursday, 15th July, 1999, 04:50:53
Size: 10761K
Attribute URL has the following attribute/value pairs:
(Protocol, http), (Server, www.ninds.nih.gov), (Path, patients/Disorder/Alexander) and (Filename, Alexander.htm)
Attribute Server has the following sub-attribute/value pairs:
(Name, www.ninds.nih), (Domain name, gov)
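
The decomposition above can be sketched in Python. This is only an illustration, not the WHOWEDA implementation: the nested-dict layout and the helper names build_nmt and leaves are assumptions made here. Dict keys stand in for internal vertices (meta-attribute names) and plain strings stand in for leaf vertices (values).

```python
from urllib.parse import urlsplit
import posixpath


def build_nmt(url, last_modified, size):
    """Return a node metadata tree as nested dicts (hypothetical layout)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Split the server into name and domain name, as in the slide example.
    name, _, domain = host.rpartition(".")
    directory, filename = posixpath.split(parts.path)
    return {
        "URL": {
            "Protocol": parts.scheme,
            "Server": {"Name": name, "Domain name": domain},
            "Path": directory.strip("/"),
            "Filename": filename,
        },
        "Last modification date": last_modified,
        "Size": size,
    }


def leaves(tree, prefix=()):
    """Yield (path, value) for every leaf vertex of the tree."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from leaves(value, prefix + (key,))
        else:
            yield prefix + (key,), value


nmt = build_nmt(
    "http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm",
    "Thursday, 15th July, 1999",
    "10761K",
)
for path, value in leaves(nmt):
    print(" / ".join(path), "=", value)
```

Complex attributes (URL, Server) become inner dicts, while atomic attributes such as Size stay as single leaves.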

11
Node Meta Data Tree
Figure 2
12
Link Meta Data Attribute

Represented using the data type link metadata-attribute.
Each attribute can be either atomic or complex.
E.g. of a complex attribute: source URL or target URL, which can be decomposed into server, port, protocol, path and filename.
E.g. of an atomic attribute: link type (local, global or interior).

13
Set of link meta data attributes
Figure 3
14
Link metadata tree (LMT)

Representation of an instance of a link metadata-attribute.
Corresponds to the link metadata attribute/value pairs of hyperlinks.
Internal vertices of the tree are meta-attribute names of hyperlinks.
Leaf vertices of the tree are values of the metadata attributes.
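
A rough Python sketch of a link metadata tree follows. The slides do not define the three link types, so the rules used here are assumptions: interior means source and target are the same document, local means a different document on the same server, and global means a different server. The helper names and the example.org URLs are hypothetical.

```python
from urllib.parse import urldefrag, urlsplit


def link_type(source_url, target_url):
    """Classify a hyperlink using the assumed definitions in the lead-in."""
    src_doc, _ = urldefrag(source_url)
    tgt_doc, _ = urldefrag(target_url)
    if src_doc == tgt_doc:
        return "interior"
    if urlsplit(src_doc).netloc.lower() == urlsplit(tgt_doc).netloc.lower():
        return "local"
    return "global"


def url_subtree(url):
    """Decompose a URL (a complex attribute) into sub-attributes."""
    parts = urlsplit(url)
    return {
        "Protocol": parts.scheme,
        "Server": parts.hostname or "",
        "Port": str(parts.port) if parts.port else "",
        "Path": parts.path,
    }


def build_lmt(source_url, target_url):
    """Return a link metadata tree as nested dicts (hypothetical layout)."""
    return {
        "Source URL": url_subtree(source_url),
        "Target URL": url_subtree(target_url),
        "Link type": link_type(source_url, target_url),  # atomic attribute
    }


lmt = build_lmt("http://example.org/a.html", "http://example.com/b.html")
print(lmt["Link type"])
```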

15
Figure 4
16
Example of LMT
Figure 5
17
Issues for modeling Structure and Content

Web data embedded within an HTML or XML document should be written in compliance with the HTML and XML specifications, respectively.
Modeling tags and tagless data
Modeling hierarchical structure
Attribute/value pairs associated with tags
Order of text
Location information of a portion of tagless data

18
Components of Node structural attribute

Name
Attribute_list
Content
Identifier
Location_attribute
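
The five components can be pictured as fields of a record. The Python dataclass below is only a sketch: the field types, the defaults and the comments describing each field are assumptions, since the slides list the component names without further detail.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NodeStructuralObject:
    """One instance of a node structural attribute (field meanings assumed)."""
    name: str                                  # tag name, e.g. "title"
    attribute_list: Dict[str, str] = field(default_factory=dict)
    content: Optional[str] = None              # text enclosed by the tag, if any
    identifier: str = ""                       # unique id within the document
    location_attribute: Optional[str] = None   # position of tagless data


# Hypothetical objects for a tiny page; the values are illustrative only.
title = NodeStructuralObject(name="title", content="Example page", identifier="n2")
body = NodeStructuralObject(name="body", attribute_list={"lang": "en"}, identifier="n3")
print(title)
```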

19
Node Data Tree(NDT)

Represents the structure and content of a web page.
Node structural objects, which are instances of node structural attributes, satisfy some dependency constraints; together they can be visualized as a rooted, directed tree, which is an NDT.

20
Figure 6
21
(No Transcript)
22
Figure 6
23
Features Of NDT

Rooted, directed tree
Loss of structural information
No loss of content data
Exclusion of anchor tags

24
Components of NDT

Name
Attribute_list
Identifier
Content
Location_attribute

25
Definition of Dependency Constraints
26
NDT for HTML Documents

Classification of HTML tags:
Non-noisy tags: HTML tags which are considered in the node data tree.
Noisy tags: HTML tags which are ignored while mapping an HTML document to an NDT. The three types of noisy tags that our model considers are:
1. Tags used for formatting purposes
2. Tags used to represent a hyperlink
3. Tags with specification of executable content
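
A minimal sketch of this filtering in Python, using the standard html.parser module. The concrete tag sets below are assumptions (the slides name only the three categories), and the dict-based tree layout and the build_ndt helper are hypothetical. Text inside formatting and anchor tags is kept so that no content is lost, while executable content is dropped. Noisy attributes are not filtered in this sketch.

```python
from html.parser import HTMLParser

# Assumed members of the three noisy categories; not taken from the slides.
FORMATTING = {"b", "i", "u", "em", "strong", "font", "center", "br", "hr"}
HYPERLINK = {"a"}
EXECUTABLE = {"script", "applet", "object"}
NOISY = FORMATTING | HYPERLINK | EXECUTABLE
VOID = {"img", "meta", "link", "input"}  # elements that never have an end tag


class NDTBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = {"name": "#document", "attrs": {}, "children": []}
        self.stack = [self.root]
        self.skip_depth = 0  # greater than 0 while inside executable content

    def handle_starttag(self, tag, attrs):
        if tag in EXECUTABLE:
            self.skip_depth += 1
            return
        if tag in NOISY or self.skip_depth:
            return
        node = {"name": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in EXECUTABLE:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag in NOISY or self.skip_depth:
            return
        # Pop back to the matching open element, tolerating omitted end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["name"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.stack[-1]["children"].append(text)


def build_ndt(html_text):
    builder = NDTBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


page = (
    "<html><head><title>Drugs</title><script>var x = 1;</script></head>"
    '<body><p>All <b>RxList</b> <a href="index.html">monographs</a></p></body></html>'
)
ndt = build_ndt(page)
print(ndt["children"][0]["children"][1])
```

The b and a tags do not appear as vertices, but their text still ends up under the enclosing p vertex.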

27
Representation of non-noisy tags in NDT

Classified into three types:
Type 1 tags
Type 2 tags
Type 3 tags

28
Noisy and Non-noisy tag attributes

Noisy attributes: attributes which are ignored while generating an NDT from an HTML document.
The three types of noisy attributes are:
Attributes used for formatting purposes
Attributes used to represent the behavior of the web document
Attributes which specify executable content
Non-noisy attributes: attributes considered important in the context of modeling an HTML document; they are represented in the NDT.

29
Representation of Content and Structure in XML Documents

XML documents are mapped into a node data tree.
Node structural objects, which are instances of node structural attributes, satisfy some dependency constraints; together they can be visualized as a rooted, directed tree, which is an NDT.

30
Issues related to generating NDT from XML Documents

XML documents do not have a fixed set of tags and attributes like HTML.
The tags and attributes are defined by the user.
XML does not encounter the problem of elements with no end tags and elements whose tags may be omitted.
Thus there is no need to address the issues related to type 2 and type 3 tags while generating NDTs from XML documents.
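
Because every XML element is explicitly closed, the mapping reduces to a plain recursive walk. The Python sketch below uses the standard xml.etree.ElementTree module. The dict layout, the generated identifiers (n1, n2, ...) and the sample document are assumptions made for illustration. As a simplification, text that follows a child element inside mixed content is not captured.

```python
import xml.etree.ElementTree as ET
from itertools import count


def xml_to_ndt(xml_text):
    """Map a well-formed XML document to a nested-dict NDT (hypothetical layout)."""
    ids = count(1)

    def convert(elem):
        return {
            "identifier": f"n{next(ids)}",      # assigned in document order
            "name": elem.tag,                   # user-defined tag name
            "attribute_list": dict(elem.attrib),
            "content": (elem.text or "").strip(),
            "children": [convert(child) for child in elem],
        }

    return convert(ET.fromstring(xml_text))


# Hypothetical document; the tags and values are illustrative only.
doc = "<drug id='d1'><name>Aspirin</name><use>Pain relief</use></drug>"
xml_ndt = xml_to_ndt(doc)
print(xml_ndt["name"], [c["name"] for c in xml_ndt["children"]])
```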

31
Example of NDT generated from XML Document.
32
Representing Structure and Contents of Hyperlinks

A hyperlink is an explicit relationship between two or more data objects or portions of data objects.
A hyperlink is defined by the data type link type.
A link type consists of three components: a set of metadata attributes, a set of link structural attributes and a reference identifier.
Link structural attributes are used to express the structure and content of hyperlinks, and the reference identifier is used to specify the location of hyperlinks in web documents.

33
Issues for modeling hyperlinks

The <a> tag and the attributes href or name are used to specify hyperlinks in HTML documents. XML links are specified by the use of an attribute named xml:link. Possible values are simple and extended, as well as locator, group and document.
When authors add a hyperlink to a document D, they include a description of the document in addition to the URL; both are important.
The location of hyperlinks is important, as we may need to impose constraints in a query to follow only those links which are located in a particular portion of a web page.

34
Link Structural Attributes

Similar to node structural attributes, a link structural attribute consists of three components:
Name: corresponds to the start tag of an HTML or XML link.
Attribute_list: a finite, possibly empty, set of attributes associated with the tag.
Content: the text between the start and end tags.

35
Reference Identifier

It is a unique identifier that references an
identifier in node structural attribute. For
example consider the web page in

36
Node Data tree for XML Document.
37
Link data Tree

It can be represented by a set of instances of link structural attributes.
HTML or XML documents are mapped into instances of link structural attributes called link structural objects. These objects and the dependency constraints can be visualized as a rooted, directed tree called a link data tree.
The internal vertices represent tagged elements containing tag names and a list of attribute/value pairs; the leaf vertices represent the label of the link.

38
Link Data Tree for HTML Documents

The <a> tag marks a block of an HTML document as a hypertext link.
<a> can take several attributes, such as href or name, which specify the destination of the hypertext link or indicate that the marked text can be the target of a hypertext link.

39
Example of LDT for HTML Document

Consider a code snippet:
<a href="http://www.rxlist.com/cgi/generic/index.html">all RxList monographs(Nearly 300 of them)</a>
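
As a rough illustration, the anchor in this snippet can be turned into a link structural object with Python's standard html.parser. The AnchorCollector class and the dict keys are hypothetical names chosen here: the internal vertex holds the tag name and its attribute/value pairs, and the leaf holds the link label. Anchors that wrap images or other tags are not handled by this sketch.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect one dict per <a> element: tag name, attributes and label."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._open = None  # the anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = {"name": "a", "attribute_list": dict(attrs), "label": ""}

    def handle_data(self, data):
        if self._open is not None:
            self._open["label"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            # Normalize whitespace in the label before storing the link.
            self._open["label"] = " ".join(self._open["label"].split())
            self.links.append(self._open)
            self._open = None


snippet = (
    '<a href="http://www.rxlist.com/cgi/generic/index.html">'
    "all RxList monographs(Nearly 300 of them)</a>"
)
collector = AnchorCollector()
collector.feed(snippet)
collector.close()
print(collector.links[0])
```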

40
The Link data tree of the web page
41
Illustration of LDT of hyperlinks which contain
image.

42
Link Data Tree for XML Documents

There are no fixed link tags to express links in XML data, so an element is an XML link if either it has an xml:link attribute or the element and all of its attributes and content adhere to the syntactic requirements.
Two types of links are to be considered: simple and extended links.
A simple link, when mapped to an LDT, is always a linear tree.
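
The linear shape of a simple link can be sketched in Python with xml.etree.ElementTree. The element name cite, the example URL and the dict layout are hypothetical. The sketch assumes the link is marked with xml:link="simple"; ElementTree exposes that attribute under the XML namespace URI. The behavior attributes show and actuate are dropped here as noisy.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace URI.
XML_LINK = "{http://www.w3.org/XML/1998/namespace}link"
BEHAVIOR = {"show", "actuate"}  # link behavior attributes, ignored in the LDT


def simple_link_to_ldt(xml_text):
    """Map one simple XML link to a two-vertex linear tree (hypothetical layout)."""
    elem = ET.fromstring(xml_text)
    if elem.attrib.get(XML_LINK) != "simple":
        raise ValueError("not a simple XML link")
    attrs = {
        key: value
        for key, value in elem.attrib.items()
        if key != XML_LINK and key not in BEHAVIOR
    }
    attrs["xml:link"] = "simple"
    # One internal vertex (tag plus attributes) with a single leaf (the label).
    return {
        "name": elem.tag,
        "attribute_list": attrs,
        "label": (elem.text or "").strip(),
    }


link_xml = (
    '<cite xml:link="simple" href="http://example.org/ref.xml" '
    'show="new" actuate="user">Reference</cite>'
)
ldt = simple_link_to_ldt(link_xml)
print(ldt)
```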

43
Example of Extended XML Link
44
Noisy Tags and Attributes

XML tags are user defined and are not used for
formatting purpose. Thus, there are no noisy tags
to be ignored while generating LDTs.
The attributes which specify link behavior such
as show and actuate are however ignored.

45
Advantages

It provides location-independent information to mobile users.
It is used in building a web data repository that supports historical web data.
It is an effective and efficient information retrieval technique.

46
Disadvantages

Information retrieval is complex.
It does not handle the executable contents of the
web documents.
Attributes used to represent behavior of web
document are not considered in this model.

47
Conclusions

This model logically separates the hyperlinks
from the web documents
Aids in the representation of metadata, contents
and structure of HTML and XML data as a tree-like
structure.
