Title: WEDAGEN: A Synthetic Web Database Generator
1 WEDAGEN: A Synthetic Web Database Generator
2 Presentation Outline
- Existing WWW search mechanisms
- WHOWEDA: A Warehouse of Web Data
- Modular structure of WEDAGEN
- Configuration parameters
- Performance evaluation
- Summary and future work
5 Existing W3 Search Mechanisms
- Time delay in manual navigation of the web
- Overwhelming results and unwanted information
- No tool for organizing and storing harnessed information for further manipulation
- Search engines and browsers are not always the best ways to systematically harness information from the web
- The WHOWEDA approach at NTU
6 Overview of WHOWEDA
- A web warehousing system to store and manipulate web information
- Stores extracted information as web tables and provides web operators to manipulate web tables
- To extract information from W3, the user defines a query graph
- The result of an extraction is a set of web tuples; each tuple instantiates the query graph
- More information
  - http://www.cais.ntu.edu.sg:8000/whoweda
7 Example: Query graph (web schema)
8 Example: Query results
10 Objectives
- Need to perform systematic evaluation of web operators during WHOWEDA development
- Limitations of testing using real web data
- To design a testbed that is controllable, comprehensive and systematic for evaluating web database systems
- To control the quantity and quality of synthetic web tuples by allowing users to specify configuration parameters and web schemas
- WEDAGEN: A Web Database Generator
11 System Architecture of WEDAGEN
12 Configuration Input Parameters
- NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerTitle, NumWordsPerHostName, LocalGlobalLink
- NodeSelectivity, TableSelectivity
- Web Schema
- NumTuples, NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerHostName, NumWordsPerTitle, LocalGlobalLink
- Fan-In
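One way to picture the user-specified parameters is as a single validated record. The sketch below is only an illustration: the parameter names come from the slide, but the class name `WedagenConfig`, the field types, and the assumption that LocalGlobalLink, NodeSelectivity and TableSelectivity are ratios in [0, 1] are hypothetical and not taken from WEDAGEN itself.

```python
from dataclasses import dataclass

# Assumption: these three parameters are ratios in [0, 1]; the slides do not say.
RATIO_FIELDS = ("local_global_link", "node_selectivity", "table_selectivity")


@dataclass(frozen=True)
class WedagenConfig:
    """Hypothetical container for the parameter names listed on the slide."""
    num_source_node_instances: int       # NumSourceNodeInstances
    fan_out: int                         # FanOut
    num_keywords_per_node_instance: int  # NumKeyWordsPerNodeInstance
    num_words_per_node_instance: int     # NumWordsPerNodeInstance
    num_words_per_link_label: int        # NumWordsPerLinkLabel
    num_words_per_title: int             # NumWordsPerTitle
    num_words_per_host_name: int         # NumWordsPerHostName
    local_global_link: float             # LocalGlobalLink (assumed ratio)
    node_selectivity: float              # NodeSelectivity (assumed ratio)
    table_selectivity: float             # TableSelectivity (assumed ratio)

    def __post_init__(self) -> None:
        # Reject values that could not describe a non-empty web table.
        for name, value in vars(self).items():
            if name in RATIO_FIELDS:
                if not 0.0 <= value <= 1.0:
                    raise ValueError(f"{name} must be in [0, 1], got {value}")
            elif value < 1:
                raise ValueError(f"{name} must be at least 1, got {value}")


config = WedagenConfig(10, 3, 5, 100, 4, 6, 2, 0.5, 0.2, 0.1)
print(config.fan_out)
```

Keeping the record immutable means a test configuration cannot drift between generation runs, which suits a testbed meant to be controllable.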
13 Parameter Values Suggestion
Flowchart labels:
- Generate specific parameter values
- Start
- Calculate NumSourceNodeInstances to generate the specified number of tuples
- Calculate max. no. of tuples to be generated
- Is the calculated value > NumTuples?
- Store suggested values in file
- User changes specific parameters
- Invoke instance generation module
- End
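The two calculations named in the flowchart can be illustrated with a small sketch. The slides do not give the formulas, so this assumes the simplest case: a linear schema in which every node instance has exactly FanOut outgoing links, so each source node instance yields FanOut^k tuples for a schema with k links. The function names and the `num_links` argument are hypothetical.

```python
def max_tuples(num_sources: int, fan_out: int, num_links: int) -> int:
    """Maximum number of tuples under the uniform fan-out assumption."""
    return num_sources * fan_out ** num_links


def suggest_num_sources(num_tuples: int, fan_out: int, num_links: int) -> int:
    """Smallest NumSourceNodeInstances that can yield at least num_tuples."""
    tuples_per_source = fan_out ** num_links
    # Integer ceiling division avoids floating-point rounding.
    return -(-num_tuples // tuples_per_source)


needed = suggest_num_sources(num_tuples=1000, fan_out=3, num_links=2)
print(needed, max_tuples(needed, fan_out=3, num_links=2))
```

Under these assumptions, 1000 requested tuples with a fan-out of 3 over 2 links need 112 source node instances, which can yield up to 1008 tuples.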
14 Instance Generation Module (IGM)
15 Directed Graph Output from IGM
16 Tuple Extraction Module (TEM)
- IGM generates all node and link instances interconnected as directed graph(s)
- TEM extracts and constructs individual web tuples from the directed graph(s)
- Node and link instances have IDs assigned
- Web tuples are stored in a web table file
- The resulting web table is complete with node, link and tuple information
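The division of labour between IGM and TEM can be sketched as follows. This is a simplified illustration, not the WEDAGEN implementation: it assumes a linear schema, so that a web tuple is one source-to-leaf path, and the function names, the `depth` argument and the dictionary layout are hypothetical.

```python
from itertools import count


def generate_instances(num_sources, fan_out, depth):
    """IGM-like step: build a layered directed graph with ID-tagged instances.

    Returns (sources, links), where links maps a node ID to a list of
    (link_id, target_node_id) pairs.
    """
    node_ids, link_ids = count(1), count(1)
    sources = [next(node_ids) for _ in range(num_sources)]
    links = {}
    frontier = sources
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            links[node] = []
            for _ in range(fan_out):
                target = next(node_ids)
                links[node].append((next(link_ids), target))
                next_frontier.append(target)
        frontier = next_frontier
    return sources, links


def extract_tuples(sources, links):
    """TEM-like step: turn every source-to-leaf path into one tuple."""
    tuples = []

    def walk(node, nodes, used_links):
        outgoing = links.get(node, [])
        if not outgoing:  # leaf reached: the path is a complete tuple
            tuples.append({"nodes": nodes, "links": used_links})
            return
        for link_id, target in outgoing:
            walk(target, nodes + [target], used_links + [link_id])

    for source in sources:
        walk(source, [source], [])
    return tuples


sources, links = generate_instances(num_sources=2, fan_out=2, depth=2)
web_table = extract_tuples(sources, links)
print(len(web_table), web_table[0])
```

With 2 sources, a fan-out of 2 and 2 link levels, the sketch yields 8 tuples, each holding 3 node IDs and 2 link IDs. A tree-shaped schema would need tuples that are subgraphs rather than paths.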
17 Extracted Web Tuples
18 Preliminary Evaluation
- Elapsed time is used to measure the overhead of web table generation
- A set of sample test configurations was identified, consisting of typical combinations of 4 web schemas and input parameters
- Performance is measured with respect to
  - Complexity of schema
  - Total number of node instances and total number of tuples
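An elapsed-time measurement of this kind can be sketched with a small timing harness. The generator below is a stand-in workload, not WEDAGEN, and the table sizes are arbitrary placeholders rather than the sizes used in the evaluation.

```python
import time


def time_generation(generate, sizes):
    """Return {size: elapsed_seconds} for one generation run per size."""
    timings = {}
    for size in sizes:
        start = time.perf_counter()
        generate(size)
        timings[size] = time.perf_counter() - start
    return timings


def stand_in_generator(num_tuples):
    # Placeholder workload only; a real run would invoke the table generator.
    return [(i, i + 1) for i in range(num_tuples)]


timings = time_generation(stand_in_generator, [1_000, 10_000, 100_000])
for size, seconds in timings.items():
    print(f"{size:>7} tuples: {seconds:.4f} s")
```

Repeating each run several times and averaging would reduce noise; the sketch takes a single measurement per size for brevity.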
19 Four Test Schemas
20 Three Table Sizes
21 Elapsed Time vs. No. of Tuples
22 Experimental Findings
- Time elapsed in generating a web table increases with the size of the table
- Rate of growth differs between schemas, i.e., schema complexity affects elapsed time
- Generating a table of the tree schema (schema 2) takes longer than that of the linear schema (schema 1)
- Generating a table of schema 2 takes longer than that of schema 4
23 Summary
- Parameters for creating web data of different sizes and complexities were successfully identified
- WEDAGEN was designed, implemented and successfully integrated into the WHOWEDA system
- WEDAGEN scales up well with increasing web schema complexity and web table size
- Time and effort required to evaluate web database system performance can be reduced with WEDAGEN
24 Future Work
- Inclusion of more parameters
  - Minimum and maximum depth of a tuple
  - Average ratio of bound and unbound nodes in a tuple
- Apply WEDAGEN to other database systems similar to WHOWEDA
- Develop WHOWEDA into a full-fledged benchmark toolkit