Representation of Web Data in a Web Warehouse - PowerPoint PPT Presentation

Slides: 48
Provided by: UMR
Learn more at: https://web.mst.edu

1
Representation of Web Data in a Web Warehouse
  • Ragini A.S.
  • Shipra Dutta
  • November 20th, 2001

2
Overview
  • Need for a web warehouse
  • WHOM: the data model for WHOWEDA
  • Concept of node and link
  • Model for representing the metadata, structure
    and content of web documents and hyperlinks
  • NMT and LMT
  • Modeling of the structural and textual content
    of web documents
  • NDT and LDT
  • Advantages
  • Disadvantages
  • Conclusion and future work

3
Need for a Web Warehouse
  • Rapid growth of the WWW, which is a distributed
    global information resource.
  • Applications must be able to harness and analyze
    web data.
  • Emergence of mobile users.
  • Necessity to exploit historical web data.
  • Traditional information retrieval techniques
    and search engines are not satisfactory.

4
Data Model of WHOWEDA: WHOM (Warehouse Object
Model)
  • Consists of two components:
  • (1) a set of web objects
  • (2) a set of web operators
  • Centered on the notion of a web table, which is
    a set of web tuples
  • A web tuple is a set of directed graphs, each
    consisting of a set of nodes and links, that
    satisfies a web schema
  • A set of operators, such as global web coupling,
    web join and web select, is used to manipulate
    the web data

5
Metadata associated with HTML and XML documents
(web documents)
  • HTML or XML documents may have metadata such as:
  • URL
  • Format, size (in bytes), date of last modification
  • Information about the author
  • A hyperlink in a web document may have metadata
    such as:
  • Source URL
  • Target URL
  • Type of hyperlink (interior, local or global)

6
Node Metadata Attribute
  • Represented using the data type node
    metadata-attribute
  • A meta-attribute may be either atomic or complex
  • E.g. a complex attribute: URL
  • A URL can be decomposed into server, port,
    protocol, path and filename
  • E.g. an atomic attribute: size, as it cannot be
    further decomposed

7
Set of node meta data attributes
Figure 1
8
Node Metadata Tree (NMT)
  • Representation of an instance of a node
    meta-attribute
  • Internal vertices of the tree are meta-attribute
    names
  • Leaf vertices of the tree are values of the
    metadata attributes

9
Example
10
Example of NMT
  • Consider:
  • URL: http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm
  • Last modification date: Thursday, 15th July
    1999, 04:50:53
  • Size: 10761K
  • The attribute URL has the following
    attribute/value pairs:
  • (Protocol, http), (Server, www.ninds.nih.gov),
    (Path, patients/Disorder/Alexander) and
    (Filename, Alexander.htm)
  • The attribute Server has the following
    sub-attribute/value pairs:
  • (Name, www.ninds.nih), (Domain name, gov)
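
The decomposition on this slide can be sketched in Python. This is only an illustration, not the WHOWEDA implementation: the function name build_nmt and the nested-dictionary layout are assumptions made here, with dictionary keys standing in for internal vertices (meta-attribute names) and plain values standing in for leaf vertices.

```python
from urllib.parse import urlsplit
import posixpath


def build_nmt(url, last_modified, size):
    """Build a nested dict that mimics a node metadata tree (hypothetical layout)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Split the server into a name and a domain name at the last dot,
    # following the (Name, Domain name) pairs shown on the slide.
    name, _, domain = host.rpartition(".")
    directory, filename = posixpath.split(parts.path)
    return {
        "URL": {
            "Protocol": parts.scheme,
            "Server": {"Name": name, "Domain name": domain},
            "Path": directory.strip("/"),
            "Filename": filename,
        },
        # Atomic attributes are stored as leaves, unchanged.
        "Last modification date": last_modified,
        "Size": size,
    }


nmt = build_nmt(
    "http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm",
    "Thursday, 15th July 1999, 04:50:53",
    "10761K",
)
print(nmt["URL"]["Server"])
```

Complex attributes such as URL and Server become inner dictionaries, while atomic attributes such as Size stay as single values.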

11
Node Meta Data Tree
Figure 2
12
Link Metadata Attribute
  • Represented using the data type link
    metadata-attribute
  • Each attribute can be either atomic or complex
  • E.g. a complex attribute: the source URL or
    target URL, which can be decomposed into server,
    port, protocol, path and filename
  • E.g. an atomic attribute: link type (local,
    global or interior)
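
The atomic link type attribute could be derived from the source and target URLs roughly as follows. The slides do not define the three types, so this sketch assumes a common reading: interior means the target lies in the same document, local means a different document on the same server, and global means a different server. The function name link_type is hypothetical, absolute URLs are assumed, and the index.htm path in the usage lines is made up for illustration.

```python
from urllib.parse import urldefrag, urlsplit


def link_type(source_url, target_url):
    """Classify a hyperlink as interior, local or global (assumed definitions)."""
    # Drop the #fragment so that links within one page compare equal.
    source_doc = urldefrag(source_url).url
    target_doc = urldefrag(target_url).url
    if source_doc == target_doc:
        return "interior"
    # Same server but a different document.
    if urlsplit(source_doc).netloc.lower() == urlsplit(target_doc).netloc.lower():
        return "local"
    return "global"


page = "http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm"
interior = link_type(page, page + "#top")
local = link_type(page, "http://www.ninds.nih.gov/index.htm")
global_ = link_type(page, "http://www.rxlist.com/cgi/generic/index.html")
print(interior, local, global_)
```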

13
Set of link meta data attributes
Figure 3
14
Link Metadata Tree (LMT)
  • Representation of an instance of a link
    metadata-attribute
  • Corresponds to the link metadata attribute/value
    pairs of hyperlinks
  • Internal vertices of the tree are the
    meta-attribute names of hyperlinks
  • Leaf vertices of the tree are values of the
    metadata attributes

15
Figure 4
16
Example of LMT
Figure 5
17
Issues in Modeling Structure and Content
  • Web data embedded within an HTML or XML document
    should be written in compliance with the HTML or
    XML specification, respectively
  • Modeling tags and tagless data
  • Modeling hierarchical structure
  • Attribute/value pairs associated with tags
  • Order of text
  • Location information of a portion of tagless
    data

18
Components of Node structural attribute
  • Name
  • Attribute_list
  • Content
  • Identifier
  • Location_attribute

19
Node Data Tree (NDT)
  • Represents the structure and content of a web
    page
  • Node structural objects, which are instances of
    node structural attributes, satisfy some
    dependency constraints; together these can be
    visualized as a rooted, directed tree, which is
    the NDT.

20
Figure 6
22
Figure 6
23
Features Of NDT
  • Rooted, directed tree
  • Loss of structural information
  • No loss of content data
  • Exclusion of anchor tags

24
Components of NDT
  • name
  • Attribute_list
  • Identifier
  • Content
  • Location_attribute

25
Definition of Dependency Constraints
26
NDT for HTML Documents
  • Classification of HTML tags:
  • Non-noisy tags: HTML tags that are represented
    in the node data tree.
  • Noisy tags: HTML tags that are ignored while
    mapping an HTML document to an NDT. The three
    types of noisy tags that our model considers are:
  • 1. Tags used for formatting purposes
  • 2. Tags used to represent a hyperlink
  • 3. Tags that specify executable content
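
A rough sketch of mapping an HTML document to a tree while skipping noisy tags is given below, using Python's standard html.parser module. The slides name the three categories of noisy tags but not the tags themselves, so the tag sets here are assumed examples. The class name NDTBuilder and the dictionary layout are also hypothetical. As a simplification, text inside formatting and anchor tags is kept under the enclosing element, text inside executable content is dropped, and void elements such as br are added without children.

```python
from html.parser import HTMLParser

# Assumed example members of each noisy-tag category.
FORMATTING = {"b", "i", "u", "font", "center"}
HYPERLINK = {"a"}
EXECUTABLE = {"script", "applet"}
NOISY = FORMATTING | HYPERLINK | EXECUTABLE
VOID = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag


class NDTBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = {"name": "document", "attribute_list": {}, "children": []}
        self.stack = [self.root]
        self.skip_depth = 0  # greater than 0 while inside executable content

    def handle_starttag(self, tag, attrs):
        if tag in EXECUTABLE:
            self.skip_depth += 1
        if tag in NOISY or self.skip_depth:
            return  # noisy tags do not become vertices
        node = {"name": tag, "attribute_list": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in EXECUTABLE:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag in NOISY or self.skip_depth:
            return
        # Pop back to the matching open element, tolerating omitted end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["name"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.stack[-1]["children"].append(text)  # leaf vertex


builder = NDTBuilder()
builder.feed(
    "<html><body><h1>Title</h1>"
    '<p>Some <b>bold</b> text <a href="x.htm">link</a></p>'
    "<script>var x = 1;</script></body></html>"
)
body = builder.root["children"][0]["children"][0]
print([child["name"] for child in body["children"]])
```

In this sketch the b, a and script tags leave no vertices in the tree, while the h1 and p elements do.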

27
Representation of non-noisy tags in NDT
  • Classified into three types:
  • Type 1 tags
  • Type 2 tags
  • Type 3 tags

28
Noisy and Non-noisy Tag Attributes
  • Noisy attributes: attributes that are ignored
    while generating an NDT from an HTML document.
  • The three types of noisy attributes are:
  • Attributes used for formatting purposes
  • Attributes used to represent the behavior of the
    web document
  • Attributes that specify executable content
  • Non-noisy attributes: attributes considered
    important in the context of modeling an HTML
    document; these are represented in the NDT

29
Representation of Content and Structure in XML
Documents
  • XML documents are mapped into a node data tree.
  • Node structural objects, which are instances of
    node structural attributes, satisfy some
    dependency constraints; together these can be
    visualized as a rooted, directed tree, which is
    the NDT.

30
Issues related to generating NDT from XML
Documents
  • XML documents do not have a fixed set of tags
    and attributes, unlike HTML.
  • The tags and attributes are defined by the user.
  • XML does not encounter the problem of elements
    with no end tags or elements whose tags may be
    omitted.
  • Thus there is no need to address the issues
    related to type 2 and type 3 tags while
    generating NDTs from XML documents.
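
Because every XML element is explicitly closed, the mapping to a tree can be sketched as a direct recursive walk. The example below uses Python's xml.etree.ElementTree. The function name xml_to_ndt, the dictionary keys and the sample drug document are all assumptions made for illustration. Text that follows a child element (mixed content) is dropped here for brevity.

```python
import xml.etree.ElementTree as ET
from itertools import count


def xml_to_ndt(xml_text):
    """Map an XML document to nested dicts resembling node structural objects."""
    ids = count(1)  # identifiers are assigned in document (pre-order) sequence

    def visit(elem):
        return {
            "identifier": next(ids),
            "name": elem.tag,
            "attribute_list": dict(elem.attrib),
            "content": (elem.text or "").strip(),
            "children": [visit(child) for child in elem],
        }

    return visit(ET.fromstring(xml_text))


# Hypothetical sample document with user-defined tags.
ndt = xml_to_ndt('<drug id="d1"><name>Aspirin</name><use>Pain relief</use></drug>')
print(ndt["name"], [child["name"] for child in ndt["children"]])
```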

31
Example of NDT generated from XML Document.
32
Representing Structure and Content of
Hyperlinks
  • A hyperlink is an explicit relationship between
    two or more data objects or portions of data
    objects.
  • A hyperlink is defined by the data type Link
    type.
  • A Link type consists of three components: a
    set of metadata attributes, a set of link
    structural attributes and a reference identifier.
  • Link structural attributes express the structure
    and content of hyperlinks; the reference
    identifier specifies the location of hyperlinks
    in web documents.

33
Issues in Modeling Hyperlinks
  • The <a> tag and its attributes href or name are
    used to specify hyperlinks in HTML documents. XML
    links are specified by an attribute named
    xml:link. Possible values are simple and
    extended, as well as locator, group and document.
  • When authors add a hyperlink to a document D,
    they include a description of the document in
    addition to its URL; both are important.
  • The location of hyperlinks is important, as we
    may need to impose constraints in a query to
    follow only those links located in a particular
    portion of a web page.

34
Link Structural Attributes
  • Similar to a node structural attribute, a link
    structural attribute consists of three components:
  • Name: corresponds to the start tag of the HTML or
    XML link.
  • Attribute_list: a finite, possibly empty, set of
    attributes associated with the tag.
  • Content: the text between the start and end tags.

35
Reference Identifier
  • It is a unique identifier that references an
    identifier in node structural attribute. For
    example consider the web page in

36
Node Data tree for XML Document.
37
Link Data Tree
  • It can be represented by a set of instances of
    link structural attributes.
  • HTML or XML documents are mapped into instances
    of link structural attributes, called link
    structural objects. These objects and their
    dependency constraints can be visualized as a
    rooted, directed tree called a link data tree
    (LDT).
  • The internal vertices represent tagged elements
    containing tag names and a list of
    attribute/value pairs; the leaf vertices
    represent the label of the link.

38
Link Data Tree for HTML Documents
  • The <a> tag marks a block of an HTML document as
    a hypertext link.
  • <a> can take several attributes, such as href or
    name, which specify the destination of the
    hypertext link or indicate that the marked text
    can be the target of a hypertext link.

39
Example of LDT for HTML Document
  • Consider the code snippet:
  • <a href="http://www.rxlist.com/cgi/generic/index.html">all
    RxList monographs(Nearly 300 of them)</a>
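
One way to picture the link structural object for this snippet is to extract the tag name, attribute list and content with Python's html.parser. The class name LinkExtractor and the dictionary keys are assumptions for illustration. The sketch does not handle nested markup inside the anchor, such as an image.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect one dict per <a> element: name, attribute_list and content."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"name": "a", "attribute_list": dict(attrs), "content": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["content"] += data  # label of the link (leaf vertex)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


snippet = (
    '<a href="http://www.rxlist.com/cgi/generic/index.html">'
    "all RxList monographs(Nearly 300 of them)</a>"
)
extractor = LinkExtractor()
extractor.feed(snippet)
link = extractor.links[0]
print(link["attribute_list"]["href"], "|", link["content"])
```

Here the internal vertex corresponds to the a element with its href attribute/value pair, and the leaf vertex is the link label.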

40
The Link data tree of the web page
41
Illustration of LDT of hyperlinks which contain
image.

42
Link Data Tree for XML Documents
  • There are no fixed link tags to express links in
    XML data, so an element is an XML link if either
    it has an xml:link attribute or the element and
    all of its attributes and content adhere to the
    syntactic requirements.
  • Two types of links are considered: simple and
    extended links.
  • A simple link, when mapped to an LDT, is always
    a linear tree.

43
Example of Extended XML Link
44
Noisy Tags and Attributes
  • XML tags are user defined and are not used for
    formatting purposes. Thus, there are no noisy
    tags to be ignored while generating LDTs.
  • The attributes that specify link behavior, such
    as show and actuate, are however ignored.
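
Detecting XML links and dropping the behavior attributes might look like the sketch below. It only covers the first criterion, an element carrying an xml:link attribute. The function name link_objects, the dictionary keys and the sample page document are hypothetical. It relies on xml.etree.ElementTree exposing attributes with the predefined xml prefix under the namespace http://www.w3.org/XML/1998/namespace.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace.
XML_LINK = "{http://www.w3.org/XML/1998/namespace}link"
NOISY_ATTRIBUTES = {"show", "actuate"}  # link-behavior attributes named on the slide


def link_objects(xml_text):
    """Return one dict per element that carries an xml:link attribute."""
    found = []
    for elem in ET.fromstring(xml_text).iter():
        if XML_LINK not in elem.attrib:
            continue  # not an XML link under this criterion
        attrs = {
            key: value
            for key, value in elem.attrib.items()
            if key != XML_LINK and key not in NOISY_ATTRIBUTES
        }
        found.append({
            "name": elem.tag,
            "kind": elem.attrib[XML_LINK],  # e.g. simple or extended
            "attribute_list": attrs,
            "content": (elem.text or "").strip(),
        })
    return found


# Hypothetical sample document.
links = link_objects(
    '<page><ref xml:link="simple" '
    'href="http://www.rxlist.com/cgi/generic/index.html" '
    'show="replace" actuate="user">all RxList monographs</ref>'
    "<note>plain text</note></page>"
)
print(links)
```

Only the ref element is reported, and its show and actuate attributes are left out of the attribute list.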

45
Advantages
  • It provides location independent information
    to the mobile users.
  • It is used in building web data repository that
    supports historical web data.
  • It is an effective and efficient information
    retrieval technique.

46
Disadvantages
  • Information retrieval is complex.
  • It does not handle the executable contents of the
    web documents.
  • Attributes used to represent behavior of web
    document are not considered in this model.

47
Conclusions
  • This model logically separates the hyperlinks
    from the web documents
  • Aids in the representation of metadata, contents
    and structure of HTML and XML data as a tree-like
    structure.