WEDAGEN: A Synthetic Web Database Generator

Slides: 25
Provided by: hfl1
Learn more at: https://web.mst.edu
1
WEDAGEN: A Synthetic Web Database Generator
2
Presentation Outline
  • Existing WWW search mechanisms
  • WHOWEDA: A Warehouse of Web Data
  • Modular structure of WEDAGEN
  • Configuration parameters
  • Performance evaluation
  • Summary and future work

5
Existing W3 Search Mechanisms
  • Time delay in manual navigation of the web
  • Overwhelming results and unwanted information
  • No tool for organizing and storing harnessed
    information for further manipulation
  • Search engines and browsers are not always the
    best ways to systematically harness information
    from the web
  • The WHOWEDA approach at NTU

6
Overview of WHOWEDA
  • A web warehousing system to store and manipulate
    web information
  • Store extracted information as web tables and
    provide web operators to manipulate web tables
  • To extract information from W3, user defines a
    query graph
  • The result of extraction is a set of web tuples;
    each tuple instantiates the query graph
  • More information:
    http://www.cais.ntu.edu.sg:8000/whoweda

7
Example Query graph (web schema)
8
Example Query results
10
Objectives
  • Need to perform systematic evaluation of web
    operators during WHOWEDA development
  • Limitations of testing using real web data
  • To design a testbed that is controllable,
    comprehensive and systematic for evaluating web
    database systems
  • To control the quantity and quality of synthetic
    web tuples by allowing users to specify
    configuration parameters and web schemas
  • WEDAGEN: A Web Database Generator

11
System Architecture of WEDAGEN
12
Configuration Input Parameters
  • NumSourceNodeInstances, FanOut,
    NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance,
    NumWordsPerLinkLabel, NumWordsPerTitle,
    NumWordsPerHostName, LocalGlobalLink
  • NodeSelectivity, TableSelectivity
  • Web Schema
  • NumTuples, NumSourceNodeInstances, FanOut,
    NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance,
    NumWordsPerLinkLabel, NumWordsPerHostName,
    NumWordsPerTitle, LocalGlobalLink
  • Fan-In
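
One way to picture the configuration parameters named on this slide is as a single settings record. The sketch below is illustrative only: the field names mirror the slide, but the types, default values, the validate() helper, and the reading of LocalGlobalLink as a fraction between 0 and 1 are all assumptions, not WEDAGEN's actual input format. NodeSelectivity, TableSelectivity and Fan-In are left out because the slide does not say how they are used.

```python
from dataclasses import dataclass


@dataclass
class WedagenConfig:
    """Hypothetical container for the parameters listed on the slide.

    Types and defaults are illustrative guesses, not WEDAGEN's real values.
    """
    num_tuples: int = 1000
    num_source_node_instances: int = 10
    fan_out: int = 3
    num_keywords_per_node_instance: int = 5
    num_words_per_node_instance: int = 100
    num_words_per_link_label: int = 3
    num_words_per_title: int = 6
    num_words_per_host_name: int = 2
    # Assumption: share of links that stay local (same host) vs. global.
    local_global_link: float = 0.5

    def validate(self) -> None:
        """Reject non-positive counts and an out-of-range link ratio."""
        for name, value in vars(self).items():
            if name == "local_global_link":
                continue
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")
        if not 0.0 <= self.local_global_link <= 1.0:
            raise ValueError("local_global_link must lie between 0 and 1")
```

Calling WedagenConfig(fan_out=4).validate() would accept the record, while a fan-out of 0 would be rejected before any generation starts.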
13
Parameter Values Suggestion
  • User changes specific parameters
  • Generate specific parameter values
  • Start
  • Calculate NumSourceNodeInstances to generate
    specified number of tuples
  • Calculate max. no. of tuples to be generated
  • Is calculated value > NumTuples?
  • Store suggested values in file
  • User changes specific parameters
  • Invoke instance generation module
  • End
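
The slides do not give the formulas behind the two "Calculate" boxes, so the following is only a sketch under a stated assumption: a linear web schema with num_links links, where every node instance has exactly FanOut outgoing links. Under that assumption each source node instance yields FanOut ** num_links tuples. The function names and the num_links parameter are hypothetical.

```python
def max_tuples(num_source_node_instances: int, fan_out: int, num_links: int) -> int:
    """Upper bound on tuples for a linear schema (assumed model, see lead-in)."""
    return num_source_node_instances * fan_out ** num_links


def suggest_source_instances(num_tuples: int, fan_out: int, num_links: int) -> int:
    """Smallest NumSourceNodeInstances that can yield at least num_tuples."""
    per_source = fan_out ** num_links
    # Integer ceiling division avoids floating-point rounding.
    return (num_tuples + per_source - 1) // per_source


def suggest_parameters(num_tuples: int, num_source_node_instances: int,
                       fan_out: int, num_links: int) -> dict:
    """Keep the user's value if it already suffices, otherwise raise it."""
    if max_tuples(num_source_node_instances, fan_out, num_links) < num_tuples:
        num_source_node_instances = suggest_source_instances(
            num_tuples, fan_out, num_links)
    return {
        "NumTuples": num_tuples,
        "NumSourceNodeInstances": num_source_node_instances,
        "FanOut": fan_out,
    }
```

For example, with FanOut = 3 and two links, 10 source instances give at most 90 tuples, so a request for 1000 tuples would raise the suggestion to 112 source instances.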
14
Instance Generation Module (IGM)
15
Directed Graph Output from IGM
16
Tuple Extraction Module (TEM)
  • IGM generates all node and link instances
    interconnected as directed graph(s)
  • TEM extracts and constructs individual web tuples
    from the directed graph(s)
  • Node and link instances have IDs assigned
  • Web tuples stored in a web table file
  • A web table has been constructed that is complete
    with node, link and tuple information
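
A minimal sketch of what tuple extraction could look like for a linear schema: each web tuple is taken to be one path from a source node instance to a node with no outgoing links. The graph representation, the ID strings, and the function name are assumptions for illustration; the slides do not describe TEM's actual algorithm, and tree-shaped schemas would need tuples that are subgraphs rather than paths.

```python
from typing import Dict, List, Tuple

# Assumed representation: node ID -> list of (link ID, target node ID).
Graph = Dict[str, List[Tuple[str, str]]]


def extract_tuples(graph: Graph, sources: List[str]) -> List[List[str]]:
    """Enumerate every source-to-leaf path as an alternating node/link ID list.

    Assumes the directed graph is acyclic; a cycle would recurse forever.
    """
    tuples: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        outgoing = graph.get(node, [])
        if not outgoing:
            # Leaf reached: the accumulated path is one complete tuple.
            tuples.append(path)
            return
        for link_id, target in outgoing:
            walk(target, path + [link_id, target])

    for source in sources:
        walk(source, [source])
    return tuples


if __name__ == "__main__":
    graph = {"n1": [("l1", "n2"), ("l2", "n3")], "n2": [("l3", "n4")]}
    for tuple_id, web_tuple in enumerate(extract_tuples(graph, ["n1"]), start=1):
        print(tuple_id, web_tuple)
```

Numbering the extracted paths, as in the usage lines, is one simple way to attach tuple IDs before writing them to a web table file.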

17
Extracted Web Tuples
18
Preliminary Evaluation
  • Elapsed time used to measure overhead of web
    table generation
  • A set of sample test configurations identified
    consisting of typical combinations of 4 web
    schemas and input parameters
  • Performance measured with respect to:
      • Complexity of schema
      • Total number of node instances and total number
        of tuples
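
The slides do not say how elapsed time was recorded. As a rough sketch, a wall-clock measurement around a generation call could be taken as below; time_generation and its generate argument are hypothetical names, and taking the best of several runs is a design choice made here to damp noise, not something stated in the presentation.

```python
import time
from typing import Any, Callable


def time_generation(generate: Callable[..., Any], *args: Any, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        generate(*args)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in workload; a real run would call the table generator instead.
    elapsed = time_generation(lambda n: sum(range(n)), 100_000)
    print(f"elapsed: {elapsed:.6f} s")
```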

19
Four Test Schemas
20
Three Table Sizes
21
Elapsed Time vs. No. of Tuples
22
Experimental Findings
  • Time elapsed in generating web table increases
    with size of table
  • Rate of growth differs across schemas, i.e.,
    schema complexity affects elapsed time
  • Generating table of tree schema (schema 2) takes
    longer than that of linear schema (schema 1)
  • Generating table of schema 2 takes longer than
    that of schema 4

23
Summary
  • Parameters for creating web data of different
    sizes and complexities were successfully
    identified
  • WEDAGEN was designed, implemented, and
    successfully integrated into the WHOWEDA system
  • WEDAGEN scales well with increasing web schema
    complexity and web table size
  • Time and effort required to evaluate web database
    system performance can be reduced with WEDAGEN

24
Future Work
  • Inclusion of more parameters:
      • Minimum and maximum depth of a tuple
      • Average ratio of bound and unbound nodes in a
        tuple
  • Apply WEDAGEN to other database systems similar
    to WHOWEDA
  • Develop WHOWEDA into a full-fledged benchmark
    toolkit