WHOWEDA: Warehouse of Web Data - PowerPoint PPT Presentation

About This Presentation
Title:

WHOWEDA: Warehouse of Web Data

Description:

Title: Web Warehousing : Design and Issues Author: skm Last modified by: skm Created Date: 9/16/1998 7:53:23 AM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 105
Provided by: skm8
Learn more at: https://web.mst.edu
Transcript and Presenter's Notes

Title: WHOWEDA: Warehouse of Web Data


1
  • WHOWEDA: Warehouse of Web Data
  • Sanjay Kumar Madria
  • Department of Computer Science
  • Purdue University, West Lafayette, IN 47907
  • skm@cs.purdue.edu

2
www.is.a.mess
3
WWW
  • collection of multimedia documents in the form of
    web pages connected via hyperlinks.

4
Characteristics of WWW
  • WWW is a set of directed graphs
  • data in the WWW has a heterogeneous nature
  • unstructured versus structured information
  • no central authority to manage information
  • Dynamic versus static information
  • Web information discoveries - search engines

5
As the WWW grows, the more chaotic it becomes
  • Web is fast growing, distributed,
    non-administered global information resource
  • WWW allows access to text, image, video, sound
    and graphic data
  • more business organizations creating web servers
  • more chaotic environment to locate information of
    interest
  • lost in hyperspace syndrome

6
Does it affect the corporate world?
  • Lack of credibility of data
  • Different sites with different data
  • Same site different data
  • Historical information is not available
  • Previous versions of web data
  • How does web data change with time
  • Summarization over time
  • Data to information
  • Reduction in productivity
  • Analysis is manual

7
How users find web sites
  • Indexes and search engines: 75
  • UseNet newsgroups: 44
  • Cool lists: 27
  • New lists: 24
  • Listservers: 23
  • Print ads: 21
  • Word-of-mouth and e-mail: 17
  • Linked web advertisement: 4

8
Limitations of Search Engines
  • Do not exploit hyperlinks
  • search is limited to string matching
  • Queries are evaluated on archived data rather
    than up-to-date data; no indexing on current data
  • low accuracy
  • replicated results
  • no further manipulation possible

9
Limitations of Search Engines
  • ERROR 404!
  • No efficient document management
  • Query results cannot be further manipulated
  • No efficient means for knowledge discovery

10
Current Research Projects
  • Web Query System
  • W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog
  • Semistructured Data
  • LOREL, UnQL, WebOQL
  • Website Management System
  • STRUDEL
  • Web Warehouse
  • WHOWEDA

11
WHOWEDA - Key Objectives
  • Design a suitable data model to represent web
    information
  • development of web algebra and query language
  • Maintenance of Web data
  • Development of knowledge discovery and web mining
    tools
  • Web warehouse

12
WHOWEDA - What?
  • WareHouse Of Web Data
  • Subject - oriented
  • Integrated
  • Temporal
  • Granularity - Lower, higher
  • Some summary
  • Not updatable
  • Alternative information sources

13
What is a Web Warehouse?
  • Subject-oriented, integrated, time-variant,
    non-volatile repository of web data for direct
    querying and analysis for some sort of decision
    making
  • A process whereby organizations or individuals
    extract value from their Web informational assets
    through the use of special stores called web
    warehouses

14
WHOWEDA! www.cais.ntu.edu.sg:8000/whoweda
  • A WareHouse Of WEb DAta
  • Web Information Coupling Model (WICM)
  • Web Objects
  • Web Schema
  • Web Information Coupling Algebra
  • Web Information Maintenance
  • Web Mining and Knowledge discovery

15
(Diagram: WHOWEDA architecture. Labels shown: User; WWW; Warehouse Concept Mart; Web Querying and Analysis Component; Web Information Coupling System; Web Information Maintenance System; Web Information Mining System; Web Warehouse; four Web Marts)
16
(Diagram: web operators in the warehouse. Labels shown: User; WWW; Web Query Display; Warehouse Concept Mart; Web Warehouse; Global Web Manipulation; Global Web Coupling; Pre processing; Global Ranking; Data Visualization; Schema Tightness; Web Union; Web Select; Web Intersection; Web Project; Local Web Manipulation; Local Web Coupling; Local Ranking; Schema Search; Web Join; Schema Match)
17
Web Objects
  • Node - url, title, format, size, date, text
  • Link - source-url, target-url, label, link-type
  • Web tuple
  • Web table
  • Web schema
  • Web database
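
The object hierarchy above can be sketched in Python as plain data classes. This is only an illustration, not WHOWEDA's actual storage format: the attribute names follow the slide, while the class names, the default values, the sample `/diseases` URL and the choice of dictionaries keyed by variable name are assumptions made here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A web document, with the node attributes listed on the slide."""
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink, with the link attributes listed on the slide."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""  # the value "local" appears in the deck's query example


@dataclass
class WebTuple:
    """One instance of a schema: variable names bound to nodes and links."""
    nodes: dict = field(default_factory=dict)  # node variable -> Node
    links: dict = field(default_factory=dict)  # link variable -> Link


@dataclass
class WebTable:
    """A named collection of web tuples that share a web schema."""
    name: str
    tuples: list = field(default_factory=list)


# Illustrative data only; the second URL is made up for this sketch.
home = Node(url="http://www.panacea.org/", title="Panacea")
page = Node(url="http://www.panacea.org/diseases", title="List of Diseases")
link = Link(home.url, page.url, label="List of Diseases", link_type="local")

diseases = WebTable("Diseases")
diseases.tuples.append(WebTuple(nodes={"x": home, "y": page}, links={"e": link}))
```

A web database would then simply be a collection of such web tables together with their schemas.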

18
Web Schema
  • Metadata in the warehouse
  • Structural summary of web table
  • Information Coupling using a Query graph
  • Query graph -> Web schema
  • Directed graph represented by an ordered 4-tuple
  • Set of node variables
  • Set of link variables
  • Connectivities
  • Predicates

19
(No Transcript)
20
(No Transcript)
21
url contains headlines
22
(No Transcript)
23
Schema- example
  • Node variables: Xn = {x, y, z, w}
  • Link variables: Xl = {e, f, g}
  • Connectivities: C = x<e>y and x<fg->z and
    x<fh->w
  • The symbol represents an anonymous node
    variable, a node variable not restricted by any
    predicate.

24
  • Predicates
  • P = { x.url = http://www.mediacity.com.sg/i-square,
  • y.url CONTAINS "headlines",
  • e.target_url CONTAINS "article",
  • f.target_url CONTAINS "newshub/specials",
  • g.label CONTAINS "Local News",
  • z.url CONTAINS "local",
  • h.label CONTAINS "World News",
  • w.url CONTAINS "world" }
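
As a rough illustration of the ordered 4-tuple (node variables, link variables, connectivities, predicates), the sketch below encodes a reduced version of this schema and checks its predicates against one candidate binding. The class names, the `EQUALS`/`CONTAINS` evaluation rules and the sample binding values are assumptions for this sketch; connectivities are stored but not verified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    var: str    # node or link variable, e.g. "x" or "g"
    attr: str   # attribute name, e.g. "url" or "label"
    op: str     # "EQUALS" or "CONTAINS"
    value: str

    def holds(self, binding):
        actual = binding[self.var][self.attr]
        if self.op == "EQUALS":
            return actual == self.value
        if self.op == "CONTAINS":
            return self.value in actual
        raise ValueError(f"unsupported operator: {self.op}")


@dataclass(frozen=True)
class WebSchema:
    node_vars: frozenset
    link_vars: frozenset
    connectivities: tuple  # (source variable, link-variable path, target variable)
    predicates: tuple

    def satisfied_by(self, binding):
        """True when every predicate holds for the bound nodes and links."""
        return all(p.holds(binding) for p in self.predicates)


# Reduced version of the slide's schema: only x, z and the link path f, g.
schema = WebSchema(
    node_vars=frozenset({"x", "z"}),
    link_vars=frozenset({"f", "g"}),
    connectivities=(("x", ("f", "g"), "z"),),
    predicates=(
        Predicate("x", "url", "EQUALS", "http://www.mediacity.com.sg/i-square"),
        Predicate("g", "label", "CONTAINS", "Local News"),
        Predicate("z", "url", "CONTAINS", "local"),
    ),
)

# Made-up binding; the g label and the z URL are illustrative values only.
binding = {
    "x": {"url": "http://www.mediacity.com.sg/i-square"},
    "g": {"label": "Local News Today"},
    "z": {"url": "http://www.mediacity.com.sg/i-square/local/1.html"},
}
matches = schema.satisfied_by(binding)
```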

25
Query Graph - Example 1
  • Query graph - same as a schema except that it has
    one more parameter to control the results
    returned.
  • Informally, it is a directed connected graph
    consisting of nodes, links and keywords imposed on
    them.
  • Produce a list of diseases with their symptoms,
    evaluation procedures and treatment starting from
    the web site at http://www.panacea.org/
  • Web table: Diseases

26
(Diagram: query graph for web table Diseases. URL shown: http://www.panacea.org/. Variables shown: x, y, z, w, p, q, e, f, g. Labels shown: Issues; List of Diseases; Symptoms list; Symptoms; Evaluation; Treatment list; Treatment)
27
(Diagram: one web tuple of web table Diseases. URL shown: http://www.panacea.org/. Instances shown: x0, y1, z1, w1, p2, q1, e1, f1, g1. Labels shown: Issues; List of Diseases; AIDS; Symptoms list; Symptoms; Evaluation; Elisa Test; Treatment list; Treatment)
28
Example 2
  • Produce a list of drugs, and their uses and side
    effects starting from the web site at
    http://www.panacea.org/
  • Web table: Drugs

29
(No Transcript)
30
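(Diagram: one web tuple of web table Drugs. URL shown: http://www.panacea.org/. Instances shown: a0, b1, c1, d1, r1, s1, k1. Labels shown: Issues; List of Diseases; AIDS; Drug list; Indavir; Side effects; Side effects of Indavir; Use; Uses of Indavir)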
31
Query Language
  • Starting from the CS department home page at NTU,
    find all documents that are linked through paths of
    length less than two containing only local links,
    and that have "database" in their text.

32
  • COUPLE WEBTABLE W FROM WWW
  • SUCH THAT NODE I, j IN WWW AND LINK e, f, g IN WWW
    AND I<ef,g>j WHERE I.url EQUALS
    "http://www.ntu.edu.sg" AND j.text CONTAINS
    "database" AND f.link-type EQUALS "local" AND
    g.link-type EQUALS "local"
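
To make the intent of such a query concrete, here is a minimal sketch of evaluating it over a tiny in-memory stand-in for the web. It is not the WHOWEDA query processor: the function name `couple`, the dictionary layout, the sample pages and the `max_links` bound are all assumptions made for illustration.

```python
from collections import deque


def couple(web, start_url, keyword, max_links, link_type="local"):
    """Return URLs reachable from start_url through at most max_links links
    of the given type whose text contains the keyword."""
    results = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth > 0 and keyword in web[url]["text"]:
            results.append(url)
        if depth == max_links:
            continue  # do not follow links beyond the path-length bound
        for target, kind in web[url]["links"]:
            if kind == link_type and target in web and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return results


# Made-up pages; only the starting URL comes from the slide.
start = "http://www.ntu.edu.sg"
web = {
    start: {
        "text": "School home page",
        "links": [(start + "/research", "local"),
                  ("http://elsewhere.example/db", "global")],
    },
    start + "/research": {
        "text": "database research group",
        "links": [(start + "/research/papers", "local")],
    },
    start + "/research/papers": {
        "text": "papers on database systems",
        "links": [],
    },
    "http://elsewhere.example/db": {"text": "database vendor", "links": []},
}

one_link = couple(web, start, "database", max_links=1)
two_links = couple(web, start, "database", max_links=2)
```

The breadth-first traversal visits each page at its shortest link distance, so the path-length bound is applied correctly, and the page behind the global link is never visited even though its text matches.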

33
Web Algebra
  • Formal foundation of data representation and
    manipulation in a web warehouse
  • Web operators
  • Information access operator
  • Information manipulation operators
  • Web schema operators
  • Data visualization operators

34
Information access operator
  • Global Web Coupling

35
Information Manipulation
  • Web select
  • Web project
  • Local web coupling
  • Web join
  • Web cartesian product
  • Web union
  • Web intersect

36
Web Select
  • Extracts web tuples from web tables satisfying
    certain conditions on node and link variables and
    on connectivities
  • Input is a web table and a select schema
  • Output is a web table satisfying the select schema

37
  • Select the W1 tuples that contain world news about
    Indonesia since May 1, 1998.
  • σ_Ms(W1), where
  • Ms = <Xsn, Xsl, Cs, Ps>,
  • Xsn = {x, w}, Xsl = {},
  • Cs = {},
  • Ps = {x.date > "1May1998", w.text CONTAINS
    "Indonesia"}
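
A web select of this kind can be sketched as a filter over web tuples. This is an assumption-laden illustration: tuples are modelled as dictionaries keyed by node variable, the function name `web_select` and the sample tuples are made up, and the date in the predicate is represented with `datetime.date`.

```python
from datetime import date


def web_select(table, predicates):
    """Keep the tuples for which every (variable, attribute, test) holds."""
    return [
        t for t in table
        if all(test(t[var][attr]) for var, attr, test in predicates)
    ]


# Made-up instances of W1; only the variables x and w are modelled.
w1 = [
    {"x": {"date": date(1998, 5, 3)}, "w": {"text": "Indonesia talks resume"}},
    {"x": {"date": date(1998, 4, 20)}, "w": {"text": "Indonesia election"}},
    {"x": {"date": date(1998, 5, 10)}, "w": {"text": "Markets in Europe"}},
]

# Ps from the slide: x.date > 1 May 1998 and w.text CONTAINS "Indonesia".
select_predicates = [
    ("x", "date", lambda d: d > date(1998, 5, 1)),
    ("w", "text", lambda s: "Indonesia" in s),
]

selected = web_select(w1, select_predicates)
```

Only the first tuple satisfies both predicates: the second is too old and the third does not mention Indonesia.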

38
  • Xn = {x, y, z, w}, Xl = {e, f, g}
  • C = x<e>y and x<fg->z and x<fh->w
  • P = { x.url = http://www.mediacity.com.sg/i-square,
    x.date > "1May1998",
  • e.target_url CONTAINS "article", f.target_url
    CONTAINS "newshub/specials",
  • g.label CONTAINS "Local News",
  • z.url CONTAINS "local",
  • h.label CONTAINS "World News",
  • w.url CONTAINS "world",
  • w.text CONTAINS "Indonesia" }

39
Web Information Coupling System
  • A database system to couple related web
    information
  • Global web Coupling and Local Web Coupling

40
Global Coupling - Information Access
  • To integrate data from the Web
  • To create historical data
  • To couple related information from the WWW
    satisfying a query graph
  • Operator to create web tables
  • From web with no schema to web table with web
    schema

41
Why local web coupling?
  • Directly querying the WWW to gather this
    information is an expensive and repetitive affair
  • Web documents containing similar information can
    reside in different web tables in a web warehouse
  • A mechanism to gather this similar information
    by additional manipulation of the materialized
    web tables

42
Local Web Couple operator
  • Two web tuples can be coupled if there exists at
    least one pair of nodes, one from each tuple,
    that contain similar information.

43
Local Web Couple operator
  • The web couple operator is basically a web
    cartesian product followed by web select
  • We denote web couple by the symbol

44
Web Coupling
45
  • M2 = <Xn, Xl, C, P> for W2
  • Xn = {s, t, u}, Xl = {k, l, m, n},
  • C = s<kl>t and s<mn>u,
  • P = { s.url = http://www.asia1.com.sg/straitstimes/,
  • k.label = REGION,
  • l.target_url = http://www.asia1.com.sg/straitstimes/pages/sea.html,
  • m.label = WORLD,
  • n.target_url = http://www.asia1.com.sg/straitstimes/pages/wrld.html }

46
  • W1 coupled with W2 under condition θ, where
  • θ = (x.date = s.date) and (w.text CONTAINS
    "Indonesia") and (t.text CONTAINS "Indonesia")

47
  • Xn = {x, y, z, w, s, t, u}, Xl = {e, f,
    g, k, l, m, n}, C = x<e>y and x<fg->z and
    x<fh->w and s<kl>t and s<mn>u
  • P = { x.url = http://www.mediacity.com.sg/i-square,
    e.target_url CONTAINS "article",
  • f.target_url CONTAINS "newshub/specials",
  • g.label CONTAINS "Local News",
  • z.url CONTAINS "local",
  • h.label CONTAINS "World News",
  • w.url CONTAINS "world",
  • s.url = http://www.asia1.com.sg/straitstimes/,

48
  • k.label = REGION,
  • l.target_url = http://www.asia1.com.sg/straitstimes/pages/sea.html,
  • m.label = WORLD,
  • n.target_url = http://www.asia1.com.sg/straitstimes/pages/wrld.html,
  • x.date = s.date,
  • w.text CONTAINS "Indonesia",
  • t.text CONTAINS "Indonesia" }

49
Local Web Coupling
  • Initiated explicitly by the user
  • User provides the pair of node variables and the
    keyword set based on which coupling is to be
    performed
  • Coupling nodes in each pair of web tuples in the
    input web tables must satisfy one of the coupling
    conditions

50
Construction of coupled table
  • First perform a web cartesian product on the two
    web tables
  • For each web tuple in the resultant web table
  • the specified instances of node variables are
    inspected to determine whether the web tuple
    satisfy coupling compatibility condition(s)

51
Construction of coupled table
  • If a pair of nodes satisfy none of the
    conditions, the corresponding web tuple is
    rejected
  • Otherwise, the web tuple is stored in a separate
    web table
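
The construction just described (web cartesian product, then a check of the coupling nodes in each resulting tuple) can be sketched as follows. The slides do not spell out the coupling compatibility conditions, so this sketch assumes one simple condition: both coupling nodes contain at least one keyword from the user's keyword set. The function name, the tuple layout and the sample texts are also assumptions.

```python
from itertools import product


def local_web_couple(table1, table2, node_pair, keywords):
    """Cartesian product of two web tables, keeping only the tuples whose
    coupling nodes share at least one keyword (assumed coupling condition)."""
    var1, var2 = node_pair
    coupled = []
    for t1, t2 in product(table1, table2):
        text1, text2 = t1[var1]["text"], t2[var2]["text"]
        if any(k in text1 and k in text2 for k in keywords):
            # Variable names of the two tables are assumed to be disjoint.
            coupled.append({**t1, **t2})
        # Otherwise the concatenated tuple is rejected.
    return coupled


# Made-up tuples; w and t are the coupling nodes, as in the earlier example.
w1 = [
    {"w": {"text": "Indonesia unrest continues"}},
    {"w": {"text": "Local sports results"}},
]
w2 = [
    {"t": {"text": "Talks on Indonesia economy"}},
    {"t": {"text": "Weather in the region"}},
]

coupled = local_web_couple(w1, w2, ("w", "t"), {"Indonesia"})
```

With two tuples in each input table the product has four candidates, and only the pair in which both coupling nodes mention Indonesia is kept.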

52
Types of web coupling
  • System-driven web coupling: the system decides
    which node variables are to be coupled (coupling
    nodes). If at least one pair of coupling nodes
    cannot be identified, the web tables cannot be
    coupled.

53
Types of web coupling
  • User-driven web coupling: the user decides which
    node variables are to be coupled (coupling
    nodes).
  • Coupling is performed only on those user
    specified node variable(s).

54
Types of web coupling
  • Attribute-driven web coupling: the user
    specifies the coupling attributes.
  • Coupling is performed only on those user
    specified coupling attribute(s).

55
Attribute driven web coupling
  • COUPLE TABLE3
  • FROM TABLE1 AND TABLE2
  • ON ATTRIBUTE TEXT
  • AT SCHEMA/TUPLE(optional)

56
Types of web coupling
  • Value-driven web coupling: the user specifies the
    values of the attributes of the nodes on which
    coupling should be performed.
  • Coupling is performed only on those user
    specified attribute values.

57
Value driven web coupling
  • COUPLE TABLE3
  • FROM TABLE1 AND TABLE2
  • ON VALUE "Software Agents"
  • AT SCHEMA/TUPLE(optional)

58
Schema level web coupling
  • We inspect the schemas to decide whether the two
    web tables can be coupled.
  • If coupling conditions cannot be identified then
    the two web tables cannot be coupled.
  • We do not inspect the web tuples in the web
    table.
  • The number of web tuples coupled will be n × m
    (for input tables of n and m tuples).

59
Tuple level web coupling
  • We inspect the web tuples of the two input web
    tables to identify nodes with similar
    information.
  • The number of web tuples in the coupled web table
    is at most n × m

60
Why two levels?
  • A schema does not capture all the information of
    the web documents in a web table, so it is not
    always possible to identify a coupling condition
    by inspecting the schemas.
  • It is possible to find coupling nodes that are
    not defined in the schemas.

61
Why two levels?
  • Tuple level coupling gives us a means to correlate
    web documents containing similar information from
    the web tables (that cannot be identified from
    their schemas) at the expense of additional
    processing.

62
Join Processing in Web Databases

63
Web Join
  • Concatenate tuples based on identical nodes or
    documents
  • Inputs are two web tables and their schemas
  • Output is a joined table
  • Types
  • Pi-web join, theta-web join, outer joins, web
    composition, semi web join

64
Web Join
  • Used for combining related data from various web
    tables
  • Mechanism to detect changes
  • Mechanism to find alternative web document in
    case of Document Not Found error

65
Web Join Operator
  • Information manipulation operator
  • Manipulate information residing in a web database
    to derive additional information
  • Harness useful, composite information from two
    web tables
  • Capitalize on the reuse of retrieved data from
    the WWW in order to reduce execution time of
    queries

66
Joinable Nodes
  • Node variables participating in the web join
    process
  • Expressed as a pair
  • The two nodes in the pair should have identical URLs

67
Web Join
  • Combine two web tables by concatenating a web
    tuple of one web table with a web tuple of other
    web table whenever there exist joinable nodes
  • Joinable nodes are identified from the schemas of
    the two web tables
  • URLs of the joinable nodes are identical
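
A minimal sketch of this join, assuming web tuples are dictionaries keyed by variable name and that the joinable pair has already been identified from the schemas. The function name, the choice of `x` and `a` as the joinable pair and the sample URLs other than http://www.panacea.org/ are assumptions for illustration.

```python
from collections import defaultdict


def web_join(table1, table2, joinable):
    """Concatenate tuples of two web tables whose joinable nodes have
    identical URLs."""
    var1, var2 = joinable
    # Index the second table by the URL of its joinable node.
    by_url = defaultdict(list)
    for t2 in table2:
        by_url[t2[var2]["url"]].append(t2)
    # Variable names of the two tables are assumed to be disjoint.
    return [
        {**t1, **t2}
        for t1 in table1
        for t2 in by_url.get(t1[var1]["url"], [])
    ]


# Made-up tuples standing in for the Diseases and Drugs web tables.
diseases = [
    {"x": {"url": "http://www.panacea.org/"},
     "z": {"url": "http://www.panacea.org/aids"}},
]
drugs = [
    {"a": {"url": "http://www.panacea.org/"},
     "b": {"url": "http://www.panacea.org/indavir"}},
    {"a": {"url": "http://other.example/"},
     "b": {"url": "http://other.example/drug"}},
]

joined = web_join(diseases, drugs, ("x", "a"))
```

Indexing one table by URL avoids comparing every pair of tuples, which matters once the web tables grow beyond a handful of tuples.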

68
(Diagram: schemas of the two web tables to be joined. URL shown: http://www.panacea.org/. Variables shown: x, y, z, w, p, q, e, f, g and b, c, d, r, s, k. Labels shown: Issues; List of Diseases; Symptoms list; Symptoms; Evaluation; Treatment list; Treatment; Drug list; Side effects; Use; Uses)
69
(Diagram: a joined web tuple. URL shown: http://www.panacea.org/. Instances shown: x0, y1, z1, w1, p2, q1, e1, f1, g1 and b1, c1, d1, r1, s1, k1. Labels shown: AIDS; Symptoms of AIDS; Evaluation; Elisa Test; AIDS treatment; Indavir; Side effects of Indavir; Uses of Indavir)
70
Join Existence
  • Given two web tables, we determine if these two
    web tables are joinable
  • Inspect the schemas of the web tables
  • Satisfy joinability conditions based on
  • node predicates
  • link predicates
  • node and link predicates
  • locus of a node relative to a joinable node

71
Join Construction
  • To construct a joined schema, we construct
  • node set
  • link set
  • connectivity set
  • predicate set
  • Construction of joined table
  • Concatenating the web tuples of the two input
    tables over the joinable nodes

72
Web Bags
  • Existence of identical web tuples.
  • Created due to web project operation.
  • Structure based mining
  • Used for discovering
  • Visible nodes
  • Luminous nodes
  • Luminous paths

73
Definitions
  • Visibility of a web document or node D in a web
    table W measures the number of different web
    documents in W that have links to D
  • Luminosity - Reverse of visibility, the number of
    other distinct documents that are linked from D
  • Luminous paths - a set of inter-linked nodes
    that occurs a number of times in a web table
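
Read literally, the first two definitions are in-degree and out-degree counts over distinct documents. The sketch below computes both from a list of (source, target) link pairs; the function name, the input layout and the decision to ignore self-links are assumptions made here.

```python
from collections import defaultdict


def visibility_and_luminosity(links):
    """links: iterable of (source document, target document) pairs.
    Visibility counts distinct documents linking to a document;
    luminosity counts distinct documents it links to."""
    incoming = defaultdict(set)
    outgoing = defaultdict(set)
    for source, target in links:
        if source != target:  # self-links are ignored in this sketch
            incoming[target].add(source)
            outgoing[source].add(target)
    visibility = {doc: len(srcs) for doc, srcs in incoming.items()}
    luminosity = {doc: len(tgts) for doc, tgts in outgoing.items()}
    return visibility, luminosity


# Made-up links between documents A, B and C; one link is repeated.
links = [("A", "C"), ("B", "C"), ("A", "C"), ("A", "B")]
visibility, luminosity = visibility_and_luminosity(links)
```

Using sets means that a repeated link between the same two documents is counted once, which matches the "different web documents" wording of the definition.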

74
Steps to find visible nodes
  • Input Web table W, node variable x, visibility
    threshold v
  • Output Set of visible nodes
  • Create a web table from W where each web tuple
    contains distinct instances of node x and the
    preceding node that links to x
  • Eliminate the nodes linked to x in each tuple of
    the web table using web project

76
Steps to find visible nodes
  • Check if the collection of web tuples of node x
    thus created is a web bag by comparing their URLs
  • Create multiplets for each collection of
    identical nodes
  • For each multiplet calculate the node visibility
  • Determine the multiplets with node visibility
    greater than the threshold
  • Create the visible node set
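
These steps can be sketched with a bag of URLs and a counter. The sketch assumes the input has already been reduced to (preceding node URL, x URL) pairs, treats node visibility as the raw size of each multiplet, and uses made-up preceding URLs; the x URLs are taken from the cancer example later in the deck.

```python
from collections import Counter


def visible_nodes(pairs, threshold):
    """pairs: (URL of the preceding node, URL of the x instance)."""
    distinct_pairs = set(pairs)                    # distinct (preceding, x) tuples
    bag = [x_url for _, x_url in distinct_pairs]   # web project: drop the preceding node
    multiplets = Counter(bag)                      # group identical nodes by URL
    # Node visibility is taken here as the size of each multiplet.
    return {url for url, count in multiplets.items() if count > threshold}


pairs = [
    ("http://a.example/", "http://www.cancer.org/desc.html"),
    ("http://b.example/", "http://www.cancer.org/desc.html"),
    ("http://c.example/", "http://www.cancer.org/desc.html"),
    ("http://a.example/", "http://www.cancer.org/desc.html"),  # repeated pair
    ("http://a.example/", "http://www.disease.com/cancer/skin.htm"),
]

visible = visible_nodes(pairs, threshold=2)
```

The repeated pair is collapsed before counting, so http://www.cancer.org/desc.html has visibility 3 and is the only node above the threshold of 2.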

77
Steps to find luminous nodes
  • Input Web table W, node variable x, luminosity
    threshold l
  • Output Set of luminous nodes
  • Steps are similar to those of visible node
    discovery
  • We consider the nodes linked from x in place of
    the nodes linked to x

79
Steps to find luminous paths
  • Create the collection of multiplets
  • Compute path luminosity for each multiplet
  • If the path luminosity value of a multiplet is
    greater than or equal to the threshold, then a
    path in the multiplet is a luminous path
  • Otherwise, we create a collection of linear web
    tuples from the above collection of web tuples

80
Steps to find luminous paths
  • This is to identify whether there exists a subset
    of inter-linked nodes between x and y that are
    luminous paths
  • We repeat the procedure to compute path
    luminosity for this set of inter-linked nodes

81
Web Schema
(Diagram: web schema. URL shown: http://www.panacea.org/. Variables shown: x, y, z, e, f. Labels shown: Diseases; Cancer)
82
Web Table
(Diagram: web table with five web tuples. Each tuple shows x0, y0, e0, f0 and the labels Diseases and Cancer; the z instances are z1, z1, z2, z1 and z4. URLs shown: http://www.panacea.org/; http://www.cancer.org/desc.html)
83
Projected schema
84
Cancer
Web Table after eliminating x and y
85
Projected schema
(Diagram: projected schema. URL shown: http://www.panacea.org/. Variables shown: x, y, z, e. Labels shown: Diseases; Cancer)
86
Web Bag
(Diagram: web bag. Instances shown: x0, y0, z4. Labels shown: Diseases; Cancer. URLs shown: http://www.panacea.org/; http://www.cancer.org/desc.html (three times); http://www.disease.com/cancer/skin.htm; http://www.jhu.edu/medical/research/cancer.htm)
87
After removal of identical tuples
http://www.cancer.org/desc.html
88
(Diagram. Instances shown: z1 (twice). Label shown: Cancer. URLs shown: http://www.cancer.org/desc.html (three times); http://www.disease.com/cancer/skin.htm; http://www.jhu.edu/medical/research/cancer.htm)
89
http://www.cancer.org/desc.html
90
Visible Nodes
(Diagram: four nodes labelled Cancer. z1: http://www.cancer.org/desc.html; z2: http://www.disease.com/cancer/skin.htm; z1: http://www.cancer.org/desc.html; z4: http://www.jhu.edu/medical/research/cancer.htm)
91
Luminous Paths
92
More Operators . . .
  • Web schema operators
  • Schema tightness operator, Schema match operator,
    Schema search operator
  • Data visualization operators
  • Ranking operators (Global and Local), Web Nest, Web
    Un-nest, Web Coalesce, Web Expand, Web Pack, Web
    Unpack, Web Sort

93
Partitioning of web tables
  • Partitioning web tables
  • restructured easily
  • indexed easily
  • monitored easily
  • reorganized easily
  • By
  • time
  • schema tree structure
  • keywords

94
Warehouse Concept Mart (WCMart)
  • Subject oriented
  • Concept generation.
  • Manually -> Autonomous.
  • Used for
  • Ranking tuples
  • Global web coupling
  • Content based mining

95
Mining in Web Warehouse
  • Web Structure Mining
  • Web Content Mining
  • Web usage Mining

96
Web Data Refinement
  • Improve web schema - schema tightness operator
  • Partition web tables based on content and
    structure

97
Partitioning of web tables
  • Partitioning web tables
  • restructured easily
  • indexed easily
  • monitored easily
  • reorganized easily
  • By
  • time
  • schema tree structure
  • keywords

98
(Diagram. Labels shown: WWW; Warehouse Concept Mart; Global Web Coupling; Webtable (Jan); Webtable (Feb); Webtable (Mar); Webtable (Apr))
99
(Diagram. Labels shown: Webtable (Jan); Webtable (Feb); Webtable (Mar); Webtable (Apr); Lower-level Granularity; Web Information Manipulation Operators; Higher level Granularity; Summarized data)
100
(Diagram. Labels shown: User; WWW; Warehouse Concept Mart; Web Querying and Analysis Component; Web Information Coupling System; Web Information Mining System; Web Warehouse)
101
What type of information can be summarized?
  • Structural
  • Content-based
  • time-variant analysis
  • snapshot analysis
  • compare one period with another
  • trend analysis

102
Structural Summarization
  • Most volatile documents
  • Sites which change frequently
  • Rate of change over time
  • a pointer to directly access documents which
    change rapidly
  • Most visible nodes, luminous nodes, luminous
    paths
  • Change with time
  • Decrease or increase - Analyze the reason

103
Content Summarization
  • What can be aggregated in a web page?
  • Number of links with identical labels
  • Number of keywords
  • Changes in content with time
  • Comparing the changes
  • Open question
  • XML will improve the ability of analysis of web
    data

104
Summary
  • Current status
  • Mechanism for accessing and manipulating web
    information in WHOWEDA
  • Implementing various web operators and query
    language
  • Future research
  • What types of information can be summarized?
  • What types of knowledge can be mined?
  • Refine web warehouse architecture
  • www.cais.ntu.edu.sg:8000/whoweda