WHOWEDA: Warehouse of Web Data

About This Presentation

Description:

Title: Web Warehousing: Design and Issues
Author: skm
Last modified by: skm
Created Date: 9/16/1998 7:53:23 AM
Document presentation format: On-screen Show (PowerPoint PPT presentation)

Slides: 105

Provided by: skm8

Learn more at: https://web.mst.edu

Transcript and Presenter's Notes

1

WHOWEDA Warehouse of Web Data
Sanjay Kumar Madria
Department of Computer Science
Purdue University, West Lafayette, IN 47907
skm@cs.purdue.edu

2
www.is.a.mess
3
WWW

collection of multimedia documents in the form of
web pages connected via hyperlinks.

4
Characteristics of WWW

WWW is a set of directed graphs
data in the WWW has a heterogeneous nature
unstructured versus structured information
no central authority to manage information
Dynamic versus static information
Web information discoveries - search engines

5
As the WWW grows, the more chaotic it becomes

Web is fast growing, distributed,
non-administered global information resource
WWW allows access to text, image, video, sound
and graphic data
more business organizations creating web servers
more chaotic environment to locate information of
interest
lost in hyperspace syndrome

6
Does it affect the corporate world?

Lack of credibility of data
Different sites with different data
Same site different data
Historical information is not available
Previous versions of web data
How does web data change with time
Summarization over time
Data to information
Reduction in productivity
Analysis is manual

7
How users find web sites

Indexes and search engines 75
UseNet newsgroups 44
Cool lists 27
New lists 24
Listservers 23
Print ads 21
Word-of-mouth and e-mail 17
Linked web advertisement 4

8
Limitations of Search Engines

Do not exploit hyperlinks
search is limited to string matching
Queries are evaluated on archived data rather
than up-to-date data
no indexing on current data
low accuracy
replicated results
no further manipulation possible

9
Limitations of Search Engines

ERROR 404!
No efficient document management
Query results cannot be further manipulated
No efficient means for knowledge discovery

10
Current Research Projects

Web Query System
W3QS, WebSQL, AKIRA, NetQL, RAW,
WebLog
Semistructured Data
LOREL, UnQL, WebOQL
Website Management System
STRUDEL
Web Warehouse
- WHOWEDA

11
WHOWEDA -Key Objectives

Design a suitable data model to represent web
information
development of web algebra and query language
Maintenance of Web data
Development of knowledge discovery and web mining
tools
Web warehouse

12
WHOWEDA - What?

WareHouse Of Web Data
Subject - oriented
Integrated
Temporal
Granularity - Lower, higher
Some summary
Not updatable
Alternative information sources

13
What is a Web Warehouse?

Subject-oriented, integrated, time-variant,
non-volatile repository of web data for direct
querying and analysis for some sort of decision
making
A process whereby organizations or individuals
extract value from their Web informational assets
through the use of special stores called web
warehouses

14
WHOWEDA! www.cais.ntu.edu.sg:8000/whoweda

A WareHouse Of WEb DAta
Web Information Coupling Model (WICM)
Web Objects
Web Schema
Web Information Coupling Algebra
Web Information Maintenance
Web Mining and Knowledge discovery

15
[Architecture diagram. Components shown: User; WWW; Warehouse Concept Mart; Web Querying Analysis Component; Web Information Coupling System; Web Information Maintenance System; Web Information Mining System; Web Warehouse; four Web Marts]
16
[Architecture diagram. Components shown: User; WWW; Web Query Display; Warehouse Concept Mart; Web Warehouse]
Global Web Manipulation: Global Web Coupling, Pre processing, Global Ranking, Data Visualization, Schema Tightness
Local Web Manipulation: Local Web Coupling, Web Select, Web Project, Web Join, Web Union, Web Intersection, Local Ranking, Schema Tightness, Schema Search, Schema Match, Data Visualization
17
Web Objects

Node - url, title, format, size, date, text
Link - source-url, target-url, label, link-type
Web tuple
Web table
Web schema
Web database
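
The web objects listed above can be pictured as plain records. The following is a minimal sketch, assuming Python dataclasses; the class names, the `WebTable` container, and the `diseases.html` URL are illustrative assumptions and are not taken from WHOWEDA itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    # Attributes named on the slide: url, title, format, size, date, text.
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    # Attributes named on the slide: source-url, target-url, label, link-type.
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""


@dataclass
class WebTuple:
    # Maps variable names to the node and link instances bound to them.
    nodes: dict = field(default_factory=dict)
    links: dict = field(default_factory=dict)


@dataclass
class WebTable:
    # A named collection of web tuples (hypothetical container).
    name: str
    tuples: list = field(default_factory=list)


home = Node(url="http://www.panacea.org/")
# The page below is a made-up URL used only for illustration.
diseases = Node(url="http://www.panacea.org/diseases.html", title="List of Diseases")
e = Link(source_url=home.url, target_url=diseases.url,
         label="List of Diseases", link_type="local")
table = WebTable("Diseases", [WebTuple({"x": home, "y": diseases}, {"e": e})])
```

A web database would then simply be a collection of such web tables together with their schemas.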

18
Web Schema

Metadata in the warehouse
Structural summary of web table
Information Coupling using a Query graph
Query graph -> Web schema
directed graph represented by Ordered 4-tuple
Set of node variables
Set of link variables
Connectivities
Predicates

19
(No Transcript)
20
(No Transcript)
21
url contains headlines
22
(No Transcript)
23
Schema- example

Node variables: Xn = {x, y, z, w}
Link variables: Xl = {e, f, g, h}
Connectivities: C = { x<e>y and x<fg->z and x<fh->w }
The symbol represents an anonymous node
variable, a node variable not restricted by any
predicate.

Predicates:
P = { x.url = "http://www.mediacity.com.sg/i-square",
y.url CONTAINS "headlines",
e.target_url CONTAINS "article",
f.target_url CONTAINS "newshub/specials",
g.label CONTAINS "Local News",
z.url CONTAINS "local",
h.label CONTAINS "World News",
w.url CONTAINS "world" }
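
To make the ordered 4-tuple concrete, here is a minimal sketch of a schema and a predicate check, assuming Python. The `WebSchema` class, the `holds` and `satisfies` helpers, the use of "=" for equality, and the concrete article URL are illustrative assumptions; only a fragment of the schema above is modelled.

```python
from dataclasses import dataclass


@dataclass
class WebSchema:
    node_vars: set
    link_vars: set
    connectivities: list  # (source node var, tuple of link vars, target node var)
    predicates: list      # (variable, attribute, operator, value)


def holds(predicate, binding):
    """Evaluate one predicate against the attributes bound to its variable."""
    var, attr, op, value = predicate
    actual = binding[var][attr]
    if op == "=":
        return actual == value
    if op == "CONTAINS":
        return value in actual
    raise ValueError(f"unsupported operator: {op}")


def satisfies(schema, binding):
    """A binding satisfies the schema if every predicate holds."""
    return all(holds(p, binding) for p in schema.predicates)


schema = WebSchema(
    node_vars={"x", "y"},
    link_vars={"e"},
    connectivities=[("x", ("e",), "y")],
    predicates=[
        ("x", "url", "=", "http://www.mediacity.com.sg/i-square"),
        ("y", "url", "CONTAINS", "headlines"),
        ("e", "target_url", "CONTAINS", "article"),
    ],
)

# Hypothetical page URL, invented for this example.
article = "http://www.mediacity.com.sg/i-square/headlines/article1.html"
binding = {
    "x": {"url": "http://www.mediacity.com.sg/i-square"},
    "y": {"url": article},
    "e": {"target_url": article},
}
```

The sketch checks predicates only; verifying that the bound links actually realise the connectivities would be a separate step.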

25
Query Graph - Example 1

Query graph: same as a schema except that it has
one more parameter to control the results
returned.
Informally, it is a directed connected graph
consisting of nodes, links, and keywords imposed
on them.
Produce a list of diseases with their symptoms,
evaluation procedures and treatment, starting from
the web site at http://www.panacea.org/
Web table: Diseases

26
[Figure: query graph for Example 1, starting at http://www.panacea.org/. Variables shown: x, y, z, w, p, q, e, f, g. Keywords shown: Issues, List of Diseases, Symptoms list, Symptoms, Evaluation, Treatment list, Treatment]
27
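[Figure: an instance of the Example 1 query graph, starting at http://www.panacea.org/. Instances shown: x0, y1, z1, w1, p2, q1, e1, f1, g1. Labels shown: Issues, List of Diseases, AIDS, Symptoms list, Symptoms, Evaluation, Elisa Test, Treatment list, Treatment]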
28
Example 2

Produce a list of drugs, and their uses and side
effects, starting from the web site at
http://www.panacea.org/
Web table: Drugs

29
(No Transcript)
30
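[Figure: an instance of the Example 2 query graph, starting at http://www.panacea.org/. Instances shown: a0, b1, c1, d1, k1, r1, s1. Labels shown: Issues, List of Diseases, AIDS, Drug list, Indavir, Use, Uses of Indavir, Side effects, Side effects of Indavir]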
31
Query Language

Starting from the CS department home page at NTU,
find all documents that are linked through paths
of length less than two containing only local
links, and that have "database" in their text.

COUPLE WEBTABLE W FROM WWW
SUCH THAT NODE i, j IN WWW AND LINK e, f, g IN WWW
AND i<ef,g>j
WHERE i.url EQUALS "http://www.ntu.edu.sg"
AND j.text CONTAINS "database"
AND f.link-type EQUALS "local"
AND g.link-type EQUALS "local"
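
The intent of this query can be sketched against a small in-memory graph standing in for the WWW. This is an illustration only, assuming Python; the function name `couple_from`, the `max_hops` parameter, and every page other than http://www.ntu.edu.sg are hypothetical, and nothing here reflects how WHOWEDA actually evaluates a COUPLE statement.

```python
def couple_from(pages, links, start_url, keyword, max_hops):
    """Return URLs reachable from start_url over at most max_hops local
    links whose text contains keyword."""
    frontier = {start_url}
    seen = {start_url}
    hits = []
    for _ in range(max_hops):
        next_frontier = set()
        for source, target, link_type in links:
            # Follow only local links leaving the current frontier.
            if source in frontier and link_type == "local" and target not in seen:
                seen.add(target)
                next_frontier.add(target)
                if keyword in pages.get(target, ""):
                    hits.append(target)
        frontier = next_frontier
    return sorted(hits)


home = "http://www.ntu.edu.sg"
# Toy pages and links, invented for this example.
pages = {
    home: "home page",
    home + "/a": "database research group",
    home + "/b": "courses",
    home + "/b/c": "database systems course",
    "http://example.org/d": "database vendor",
}
links = [
    (home, home + "/a", "local"),
    (home, home + "/b", "local"),
    (home + "/b", home + "/b/c", "local"),
    (home, "http://example.org/d", "global"),
]

one_hop = couple_from(pages, links, home, "database", max_hops=1)
two_hops = couple_from(pages, links, home, "database", max_hops=2)
```

The path-length bound is left as a parameter so that either reading of the bound in the query text can be tried.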

33
Web Algebra

Formal foundation of data representation and
manipulation in a web warehouse
Web operators
Information access operator
Information manipulation operators
Web schema operators
Data visualization operators

34
Information access operator

Global Web Coupling

35
Information Manipulation

- Web select
Web project
Local web coupling
Web join
Web cartesian product
Web union
Web intersect
Local Web coupling

36
Web Select

Extracts web tuples from web tables satisfying
certain conditions on node and link variables and
on connectivities
Input is select Schema
Output is a web table satisfying the select schema

Select the W1 tuples that contain world news about
Indonesia since May 1, 1998.
σ_Ms(W1) where
Ms = < Xsn, Xsl, Cs, Ps >,
Xsn = {x, w}, Xsl = {},
Cs = {},
Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }

Schema of the selected web table:
Xn = {x, y, z, w}, Xl = {e, f, g, h}
C = { x<e>y and x<fg->z and x<fh->w }
P = { x.url = "http://www.mediacity.com.sg/i-square",
x.date > "1May1998",
e.target_url CONTAINS "article",
f.target_url CONTAINS "newshub/specials",
g.label CONTAINS "Local News",
z.url CONTAINS "local",
h.label CONTAINS "World News",
w.url CONTAINS "world",
w.text CONTAINS "Indonesia" }
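
The selection in this example can be sketched as a filter over web tuples. This is a minimal sketch, assuming Python and assuming that each web tuple is a dict from variable name to an attribute dict; `web_select` and the three sample tuples are hypothetical.

```python
from datetime import date


def web_select(table, conditions):
    """Keep the web tuples for which every (variable, test) condition holds."""
    return [t for t in table if all(test(t[var]) for var, test in conditions)]


# Toy web table W1, invented for this example.
w1 = [
    {"x": {"date": date(1998, 5, 3)}, "w": {"text": "Talks resume in Indonesia"}},
    {"x": {"date": date(1998, 4, 20)}, "w": {"text": "Indonesia rupiah falls"}},
    {"x": {"date": date(1998, 5, 10)}, "w": {"text": "World Cup preview"}},
]

# Conditions on node variables x and w from the select schema.
selected = web_select(w1, [
    ("x", lambda node: node["date"] > date(1998, 5, 1)),
    ("w", lambda node: "Indonesia" in node["text"]),
])
```

Only the first tuple passes both conditions; the second is too old and the third does not mention Indonesia.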

39
Web Information Coupling System

A database system to couple related web
information
Global web Coupling and Local Web Coupling

40
Global Coupling - Information Access

To integrate data from the Web
To create historical data
To couple related information from the WWW
satisfying a query graph
Operator to create web tables
From web with no schema to web table with web
schema

41
Why local web coupling?

Directly querying the WWW to gather these
information is an expensive and repetitive affair
Web documents containing similar information can
reside in different web tables in a web warehouse
A mechanism to gather these similar information
by additional manipulation of the materialized
web tables

42
Local Web Couple operator

Two web tuples can be coupled if there exists at
least one pair of nodes, one from each tuple,
that contain similar information.

43
Local Web Couple operator

The web couple operator is basically a web
cartesian product followed by web select
We denote web couple by the symbol

44
Web Coupling
45

M2 = < Xn, Xl, C, P > for W2
Xn = {s, t, u}, Xl = {k, l, m, n},
C = { s<kl>t and s<mn>u },
P = { s.url = "http://www.asia1.com.sg/straitstimes/",
k.label = "REGION",
l.target_url = "http://www.asia1.com.sg/straitstimes/pages/sea.html",
m.label = "WORLD",
n.target_url = "http://www.asia1.com.sg/straitstimes/pages/wrld.html" }

W1 coupled with W2 under condition θ, where
θ = (x.date = s.date) AND (w.text CONTAINS
"Indonesia") AND (t.text CONTAINS "Indonesia")

Xn = {x, y, z, w, s, t, u}, Xl = {e, f, g, h, k, l, m, n},
C = { x<e>y and x<fg->z and x<fh->w and s<kl>t and s<mn>u }
P = { x.url = "http://www.mediacity.com.sg/i-square",
e.target_url CONTAINS "article",
f.target_url CONTAINS "newshub/specials",
g.label CONTAINS "Local News",
z.url CONTAINS "local",
h.label CONTAINS "World News",
w.url CONTAINS "world",
s.url = "http://www.asia1.com.sg/straitstimes/",
k.label = "REGION",
l.target_url = "http://www.asia1.com.sg/straitstimes/pages/sea.html",
m.label = "WORLD",
n.target_url = "http://www.asia1.com.sg/straitstimes/pages/wrld.html",
x.date = s.date,
w.text CONTAINS "Indonesia",
t.text CONTAINS "Indonesia" }

49
Local Web Coupling

Initiated explicitly by the user
User provides the pair of node variables and the
keyword set based on which coupling is to be
performed
Coupling nodes in each pair of web tuples in the
input web tables must satisfy one of the coupling
conditions

50
Construction of coupled table

First perform a web cartesian product on the two
web tables
For each web tuple in the resultant web table,
the specified instances of node variables are
inspected to determine whether the web tuple
satisfies the coupling compatibility condition(s)

51
Construction of coupled table

If a pair of nodes satisfies none of the
conditions, the corresponding web tuple is
rejected
Otherwise, the web tuple is stored in a separate
web table
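
The construction just described (cartesian product, then a check on the coupling nodes) can be sketched as follows. This is a hedged illustration in Python: "similar information" is approximated here by both nodes containing the same keyword, which is an assumption, and `local_web_couple` and the sample tuples are hypothetical.

```python
from itertools import product


def local_web_couple(table1, table2, var1, var2, keywords):
    """Couple two web tables on a user-specified pair of node variables."""
    coupled = []
    # Step 1: web cartesian product of the two tables.
    for t1, t2 in product(table1, table2):
        text1 = t1[var1]["text"]
        text2 = t2[var2]["text"]
        # Step 2: keep the combined tuple only if the coupling nodes share
        # at least one keyword; otherwise it is rejected.
        if any(k in text1 and k in text2 for k in keywords):
            # Assumes the two tables use disjoint variable names.
            coupled.append({**t1, **t2})
    return coupled


# Toy tables, invented for this example.
table1 = [
    {"w": {"text": "Unrest in Indonesia continues"}},
    {"w": {"text": "Local weather report"}},
]
table2 = [
    {"t": {"text": "Indonesia talks with IMF"}},
    {"t": {"text": "Thailand baht steadies"}},
]
coupled = local_web_couple(table1, table2, "w", "t", ["Indonesia"])
```

Of the four combined tuples, only the one pairing the two Indonesia documents is stored in the coupled table.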

52
Types of web coupling

System-driven web coupling: in this case the
system decides which node variables are to be
coupled (coupling nodes). If at least one pair of
coupling nodes cannot be identified, then the web
tables cannot be coupled.

53
Types of web coupling

User-driven web coupling: in this case the user
decides which node variables are to be
coupled (coupling nodes).
Coupling is performed only on those
user-specified node variable(s).

54
Types of web coupling

Attribute-driven web coupling: in this case the
user specifies the coupling attributes.
Coupling is performed only on those
user-specified coupling attribute(s).

55
Attribute driven web coupling

COUPLE TABLE3
FROM TABLE1 AND TABLE2
ON ATTRIBUTE TEXT
AT SCHEMA/TUPLE (optional)

56
Types of web coupling

Value-driven web coupling: in this case the user
specifies the values of the attributes of the
nodes on which coupling should be performed.
Coupling is performed only on those
user-specified attribute values.

57
Value driven web coupling

COUPLE TABLE3
FROM TABLE1 AND TABLE2
ON VALUE "Software Agents"
AT SCHEMA/TUPLE (optional)

58
Schema level web coupling

We inspect the schemas to decide whether the two
web tables can be coupled.
If coupling conditions cannot be identified then
the two web tables cannot be coupled.
We do not inspect the web tuples in the web
table.
Number of web tuples coupled will be n × m.

59
Tuple level web coupling

We inspect the web tuples of the two input web
tables to identify nodes with similar
information.
The number of web tuples in the coupled web table
is < n × m

60
Why two levels?

A schema does not capture all the information of
the web documents in a web table
It is not always possible to identify a coupling
condition by inspecting the schemas.
It is possible to find coupling nodes which are
not defined in the schemas.

61
Why two levels?

Tuple level coupling gives us a means to correlate
web documents containing similar information from
the web tables (that cannot be identified from
their schemas) at the expense of additional
processing.

62
Join Processing in Web Databases

63
Web Join

Concatenate tuples based on identical nodes or
documents
Inputs are two web tables and their schemas
Output is a joined table
Types
Pi-web join, theta-web join, outer joins, web
composition, semi web join

64
Web Join

Used for combining related data from various web
tables
Mechanism to detect changes
Mechanism to find alternative web document in
case of Document Not Found error

65
Web Join Operator

Information manipulation operator
Manipulate information residing in a web database
to derive additional information
Harness useful, composite information from two
web tables
Capitalize on the reuse of retrieved data from
the WWW in order to reduce execution time of
queries

66
Joinable Nodes

Node variables participating in the web join
process
Expressed as a pair
Each node in the pair should have identical URLs

67
Web Join

Combine two web tables by concatenating a web
tuple of one web table with a web tuple of other
web table whenever there exist joinable nodes
Joinable nodes are identified from the schemas of
the two web tables
URLs of the joinable nodes are identical
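
A web join over joinable nodes can be sketched as a nested loop that concatenates tuples whose joinable nodes have identical URLs. This is a minimal sketch, assuming Python and a dict-per-tuple representation; `web_join`, the requirement that every joinable pair matches, and the page URLs other than http://www.panacea.org/ are assumptions made for illustration.

```python
def web_join(table1, table2, joinable):
    """Concatenate tuples of two web tables over joinable node pairs.

    joinable is a list of (variable in table1, variable in table2) pairs.
    """
    right_vars = {b for _, b in joinable}
    joined = []
    for t1 in table1:
        for t2 in table2:
            # Joinable nodes must have identical URLs.
            if all(t1[a]["url"] == t2[b]["url"] for a, b in joinable):
                merged = dict(t1)
                # Keep one copy of each joinable node (the table1 side).
                merged.update((v, n) for v, n in t2.items() if v not in right_vars)
                joined.append(merged)
    return joined


# Toy Diseases and Drugs tables; the sub-page URLs are made up.
diseases = [
    {"x": {"url": "http://www.panacea.org/"},
     "z": {"url": "http://www.panacea.org/aids.html"}},
]
drugs = [
    {"a": {"url": "http://www.panacea.org/"},
     "d": {"url": "http://www.panacea.org/indavir.html"}},
    {"a": {"url": "http://example.org/"},
     "d": {"url": "http://example.org/other.html"}},
]
joined = web_join(diseases, drugs, [("x", "a")])
```

Here x and a are joinable because both are bound to http://www.panacea.org/, so the joined tuple carries the disease node z and the drug node d together.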

68
[Figure: the Diseases and Drugs query graphs, both starting at http://www.panacea.org/. Variables shown: x, y, z, w, p, q, e, f, g and b, c, d, k, r, s. Keywords shown: Issues, List of Diseases, Symptoms list, Symptoms, Evaluation, Treatment list, Treatment, Drug list, Side effects, Use, Uses]
69
[Figure: a joined web tuple, starting at http://www.panacea.org/. Instances shown: x0, y1, z1, w1, p2, q1, e1, f1, g1 and b1, c1, d1, k1, r1, s1. Labels shown: AIDS, Symptoms of AIDS, Evaluation, Elisa Test, AIDS treatment, Indavir, Uses of Indavir, Side effects of Indavir]
70
Join Existence

Given two web tables, we determine if these two
web tables are joinable
Inspect the schemas of the web tables
Satisfy joinability conditions based on
node predicates
link predicates
node and link predicates
locus of a node relative to a joinable node

71
Join Construction

To construct a joined schema, we construct
node set
link set
connectivity set
predicate set
Construction of joined table
Concatenating the web tuples of the two input
tables over the joinable nodes

72
Web Bags

Existence of identical web tuples.
Created due to web project operation.
Structure based mining
Used for discovering
Visible nodes
Luminous nodes
Luminous paths

73
Definitions

Visibility of a web document or node D in a web
table W measures the number of different web
documents in W that have links to D
Luminosity - Reverse of visibility, the number of
other distinct documents that are linked from D
Luminous paths - a set of inter-linked nodes
which occurs a number of times in a web table
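
As a rough illustration of the first two definitions, visibility and luminosity can be counted from a list of (source, target) links. This sketch assumes Python and uses raw counts of distinct documents; any normalisation the original measures may apply is not modelled, and the function name and sample links are hypothetical.

```python
from collections import defaultdict


def visibility_and_luminosity(links):
    """Count distinct in-linking and out-linked documents per document."""
    incoming = defaultdict(set)
    outgoing = defaultdict(set)
    for source, target in links:
        if source != target:  # ignore self-links
            incoming[target].add(source)
            outgoing[source].add(target)
    visibility = {doc: len(sources) for doc, sources in incoming.items()}
    luminosity = {doc: len(targets) for doc, targets in outgoing.items()}
    return visibility, luminosity


# Toy link list; the repeated ("a", "d") link is counted once.
links = [("a", "d"), ("b", "d"), ("a", "d"), ("d", "c")]
visibility, luminosity = visibility_and_luminosity(links)
```

Document d is linked from two distinct documents, so its visibility is 2, while each of a, b and d links to one distinct document.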

74
Steps to find visible nodes

Input: Web table W, node variable x, visibility
threshold v
Output: Set of visible nodes
Create a web table from W where each web tuple
contains distinct instances of node x and the
preceding node which is linked to x
Eliminate the nodes linked to x in each tuple of
the web table using web project

76
Steps to find visible nodes

Check if the collection of web tuples of node x
thus created is a web bag by comparing their URLs
Create multiplets for each collection of
identical nodes
For each multiplet calculate the node visibility
Determine the multiplets with node visibility
greater than the threshold
Create the visible node set
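
The steps for finding visible nodes can be sketched as follows. This is a simplified illustration in Python: node visibility is taken to be the size of each multiplet, web project is approximated by dropping the preceding node, and the function name and the predecessor URLs (the `example` hosts) are hypothetical.

```python
from collections import Counter


def visible_nodes(table, x, pred, threshold):
    """Return URLs of instances of x whose visibility exceeds threshold."""
    # Distinct (preceding node, x) instance pairs.
    pairs = {(t[pred]["url"], t[x]["url"]) for t in table}
    # Web project: drop the preceding node, leaving a bag of x instances.
    bag = [x_url for _, x_url in pairs]
    # Multiplets: groups of identical nodes, compared by URL.
    multiplets = Counter(bag)
    # Keep multiplets whose node visibility is greater than the threshold.
    return {url for url, visibility in multiplets.items() if visibility > threshold}


# Toy web table; the predecessor pages are invented.
table = [
    {"y": {"url": "http://a.example/"}, "z": {"url": "http://www.cancer.org/desc.html"}},
    {"y": {"url": "http://b.example/"}, "z": {"url": "http://www.cancer.org/desc.html"}},
    {"y": {"url": "http://b.example/"}, "z": {"url": "http://www.cancer.org/desc.html"}},
    {"y": {"url": "http://c.example/"}, "z": {"url": "http://www.cancer.org/desc.html"}},
    {"y": {"url": "http://a.example/"}, "z": {"url": "http://www.disease.com/cancer/skin.htm"}},
]
visible = visible_nodes(table, "z", "y", threshold=2)
```

Luminous nodes could be found the same way by pairing x with the nodes it links to rather than the nodes that link to it.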

77
Steps to find luminous nodes

Input: Web table W, node variable x, luminosity
threshold l
Output: Set of luminous nodes
Steps are similar to those of visible node
discovery
We consider the nodes linked from x in place of
nodes linked to x

79
Steps to find luminous paths

Create the collection of multiplets
Compute path luminosity for each multiplet
If the path luminosity value of a multiplet is
greater than or equal to threshold then a path
in the multiplet is a luminous path
Otherwise, we create a collection of linear web
tuples from the above collection of web tuples

80
Steps to find luminous paths

This is to identify whether there exists a subset
of inter-linked nodes between x and y that are
luminous paths
We repeat the procedure to compute path
luminosity for this set of inter-linked nodes

81
Web Schema
[Figure: web schema graph starting at http://www.panacea.org/. Variables shown: x, y, z, e, f. Keywords shown: Diseases, Cancer]
82
[Figure: web table of five web tuples. Each tuple shows x0 (http://www.panacea.org/), y0, e0, f0 and the labels Diseases and Cancer, with a z instance that is z1, z1, z2, z1 and z4 in turn; http://www.cancer.org/desc.html appears with the z1 tuples]
Web Table
83
Projected schema
84
Cancer
Web Table after eliminating x and y
85
Projected schema
[Figure: projected schema graph starting at http://www.panacea.org/. Variables shown: x, y, z, e. Keywords shown: Diseases, Cancer]
86
[Figure: web bag of Cancer nodes with the URLs
http://www.cancer.org/desc.html (three occurrences)
http://www.disease.com/cancer/skin.htm
http://www.jhu.edu/medical/research/cancer.htm
Other labels shown: http://www.panacea.org/, Diseases, Cancer, x0, y0, z4]
Web Bag
87
After removal of identical tuples
http://www.cancer.org/desc.html
88
[Figure: Cancer nodes (z1) with the URLs
http://www.cancer.org/desc.html (three occurrences)
http://www.disease.com/cancer/skin.htm
http://www.jhu.edu/medical/research/cancer.htm]
89
http://www.cancer.org/desc.html
90
Visible Nodes
Cancer z1: http://www.cancer.org/desc.html
Cancer z2: http://www.disease.com/cancer/skin.htm
Cancer z1: http://www.cancer.org/desc.html
Cancer z4: http://www.jhu.edu/medical/research/cancer.htm
91
Luminous Paths
92
More Operators . . .

Web schema operators
Schema tightness operator, Schema match operator,
Schema search operator
Data visualization operators
Ranking operators (Global and Local), Web Nest,
Web Un-nest, Web Coalesce, Web Expand, Web Pack,
Web Unpack, Web Sort

93
Partitioning of web tables

Partitioning web tables
restructured easily
indexed easily
monitored easily
reorganized easily
By
time
schema tree structure
keywords
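
Partitioning by time can be pictured as grouping web tuples by the month of a chosen node's date. This is a small sketch, assuming Python and a dict-per-tuple representation; `partition_by_month` and the sample dates are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(table, var):
    """Group web tuples by the (year, month) of the date of node var."""
    partitions = defaultdict(list)
    for web_tuple in table:
        d = web_tuple[var]["date"]
        partitions[(d.year, d.month)].append(web_tuple)
    return dict(partitions)


# Toy web table with invented retrieval dates.
table = [
    {"x": {"url": "http://www.panacea.org/", "date": date(1998, 1, 5)}},
    {"x": {"url": "http://www.panacea.org/", "date": date(1998, 1, 20)}},
    {"x": {"url": "http://www.panacea.org/", "date": date(1998, 2, 3)}},
]
by_month = partition_by_month(table, "x")
```

Each partition can then be indexed, monitored, or reorganized on its own, and partitioning by schema structure or keywords would swap in a different grouping key.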

94
Warehouse Concept Mart (WCMart)

Subject oriented
Concept generation.
Manually -> Autonomous.
Used for
Ranking tuples
Global web coupling
Content based mining

95
Mining in Web Warehouse

Web Structure Mining
Web Content Mining
Web usage Mining

96
Web Data Refinement

Improve web schema - schema tightness operator
Partition web tables based on content and
structure

98
WWW
Warehouse Concept Mart
Global Web Coupling
Webtable (Jan)
Webtable (Feb)
Webtable (Mar)
Webtable (Apr)
99
Webtable (Jan)
Webtable (Feb)
Webtable (Mar)
Webtable (Apr)
Lower-level Granularity
Web Information Manipulation Operators
Higher level Granularity
Summarized data
100
User
WWW
Warehouse Concept Mart
Web Querying Analysis Component
Web Information Coupling System
Web Information Mining System
Web Warehouse
101
What type of information can be summarized?