Web Information Extraction - PowerPoint PPT Presentation

1
Web Information Extraction
  • Liu Zehua
  • Centre for Advanced Information Systems
  • School of Computer Engineering
  • Nanyang Technological University
  • Email: aszhliu_at_ntu.edu.sg

2
Presentation Outline
  • Introduction
  • Web Information Extraction
  • The WICCAP System
  • Summary

3
Introduction
  • The Web for Information Delivery
  • Information overload
  • Web pages are designed for human browsing
  • Solution
  • Information Extraction from the Web
  • Data Model to represent information
  • Agent to perform extraction

4
Introduction
  • Problem
  • Requirement of in-depth technical knowledge
  • Ordinary users unable to use IE systems
  • Solution
  • Automatic wrapper generation
  • Separate information modeling and information
    extraction

5
Introduction
  • Problem
  • New sites appearing and existing sites changing
    frequently
  • Manual creation of data models is too slow
  • Solution
  • A formal process for data model generation
  • Assistant tools to automate the process

6
Web IE Systems
  • Information Sources
  • Free Text
  • Structured Text
  • Semi-Structured Text
  • Information from the Web
  • Mostly in HTML pages
  • Structured Text
  • Inter-Linked, Dynamic, Ill-formed

7
Web IE Systems
  • Wrappers
  • Extract information from web pages and convert it
    into explicit data structures
  • Main Components
  • Extraction Rules
  • Code for executing the rules
  • Wrapper Generation
  • Manual
  • Semi-Automatic
  • Automatic
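
As an illustration of the two wrapper components named on this slide, the sketch below keeps the extraction rules as plain data and puts the code that executes them in one small function. The `ExtractionRule` class, the regex-based rule format, and the sample page are assumptions made for this example; the slides do not say how WICCAP or any other system encodes its rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRule:
    """Hypothetical rule: a field name plus a regex with one capture group."""
    field: str
    pattern: str


def run_wrapper(html, rules):
    """Execute every rule against the page and collect the captured text."""
    return {
        rule.field: [m.strip() for m in re.findall(rule.pattern, html, re.S)]
        for rule in rules
    }


# Made-up page fragment, not taken from any real site.
PAGE = """
<div class="story"><a href="/a1.html">First headline</a><p>First summary.</p></div>
<div class="story"><a href="/a2.html">Second headline</a><p>Second summary.</p></div>
"""

# The rules are data; changing the target site means changing only this list.
RULES = [
    ExtractionRule("title", r'<a href="[^"]*">(.*?)</a>'),
    ExtractionRule("link", r'<a href="([^"]*)">'),
    ExtractionRule("description", r"<p>(.*?)</p>"),
]

record = run_wrapper(PAGE, RULES)
```

Keeping the rules separate from the executing code is what makes manual, semi-automatic, or automatic rule generation possible without touching the executor.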

8
Web IE Systems
  • Key Issues
  • Defining a flexible data model for representing
    the information from the Web
  • Extracting the information according to the data
    model
  • Presenting the retrieved information in an
    accessible manner for further processing or
    integration

9
Web IE Systems
  • Data Model
  • Intuitive to ordinary users
  • Flexible enough to work with heterogeneous
    categories of websites
  • Open enough to allow other applications to
    inter-operate

10
Web IE Systems
  • Applications
  • Viewing information
  • Software agent
  • Integration of multiple sources
  • Query processing
  • Rebuild websites

11
The WICCAP System
  • WWW Information Collection, Collaging And
    Programming
  • Is a software system for the generation of
    logical views of web resources, and the
    extraction and presentation of desired
    information
  • Deals with the logical structure of information
  • Makes information from the Web accessible in a
    simple and open manner

12
Three-Layer Architecture
  • Generation of logical data model of information
  • Extraction of information from the source
    according to the data model
  • Presentation of the extracted data in an animated
    and programmable manner

13
WICCAP System Architecture
(Architecture diagram. Labels: WICCAP System, WWW, WWW Information Source, Mapping Wizard, Mapping Rule, Network Extraction Agent, Extracted Content, Presentation Toolkit, Formatted Content, User)
14
Decoupling of Layers
  • The input and output of each layer are stored in
    XML with pre-defined XML Schemas to allow
    decoupling of layers and reuse of intermediate
    output
  • Expert users specify how to extract
  • Ordinary users specify what to extract
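
A minimal sketch of how XML hand-offs can decouple layers: one function stands in for the extraction layer and emits XML, another stands in for the presentation layer and reads only that XML. The element names (`ExtractedContent`, `Article`, `Title`, `Link`) and both function names are assumptions for illustration; the actual WICCAP schemas are not shown in the slides.

```python
import xml.etree.ElementTree as ET


def extraction_layer(records):
    """Serialize extracted records to an XML string (this layer's output)."""
    root = ET.Element("ExtractedContent")  # assumed element name
    for rec in records:
        article = ET.SubElement(root, "Article")
        for name, value in rec.items():
            ET.SubElement(article, name).text = value
    return ET.tostring(root, encoding="unicode")


def presentation_layer(xml_text):
    """Read the XML produced upstream and format it for display."""
    root = ET.fromstring(xml_text)
    return [
        f"{a.findtext('Title')} <{a.findtext('Link')}>"
        for a in root.findall("Article")
    ]


# The only contract between the two layers is the XML text itself.
xml_text = extraction_layer([
    {"Title": "Example headline", "Link": "http://example.org/1"},
])
lines = presentation_layer(xml_text)
```

Because the intermediate XML can be saved, the presentation step could be rerun on stored output without repeating the extraction.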

15
Logical View of Website
  • BBC_News
  • World
  • ArticleList
  • Article
  • Title
  • Link
  • Description
  • Asia-Pacific
  • Europe
  • Sports
  • Business
  • Education

Relates the information from a website in terms
of its commonly perceived logical structure, instead
of the physical directory location.
An example from BBC News Online: http://news.bbc.co.uk
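
The logical view listed on this slide can be pictured as a tree whose paths are independent of any URL or directory. The sketch below is one possible reading: the transcript lost the slide's indentation, so the nesting chosen here (sections under `BBC_News`, an `ArticleList` under `World`) is an assumption, and the `Node` class and `paths` helper are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of the logical view."""
    name: str
    children: list = field(default_factory=list)


def paths(node, prefix=""):
    """Yield the logical path of every node, depth first."""
    here = f"{prefix}/{node.name}"
    yield here
    for child in node.children:
        yield from paths(child, here)


# Assumed nesting; only the element names come from the slide.
article = Node("Article", [Node("Title"), Node("Link"), Node("Description")])
world = Node("World", [Node("ArticleList", [article])])
site = Node("BBC_News", [world, Node("Business"), Node("Sports")])

all_paths = list(paths(site))
```

A path such as `/BBC_News/World/ArticleList/Article/Title` names a piece of information by its logical position, without saying where it sits on the server.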
16
Logical Data Model
  • WICCAP Data Model (WDM) defines the basic data
    elements and how these elements are organized to
    form the logical view of a website
  • WDM Schema: defines the logical organization of
    the information
  • Mapping Rule: defines the mapping between the
    logical view of a specific website and its
    physical location
  • Defined using XML and XML Schema
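
To make the schema/mapping split concrete, the sketch below keeps a list of logical element names (standing in for a schema) apart from a table that says where each element is found in a page (standing in for a mapping rule). The `Locator` class, its begin/end marker format, and the sample page are assumptions for illustration; the real WDM Schema and Mapping Rule are XML documents whose format the slides do not show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    """Hypothetical physical location: the text between two markers."""
    begin: str
    end: str


def apply_mapping(page, mapping):
    """Resolve each logical element name to the text found at its locator."""
    result = {}
    for element, loc in mapping.items():
        start = page.find(loc.begin)
        if start == -1:
            result[element] = None  # begin marker missing from this page
            continue
        start += len(loc.begin)
        stop = page.find(loc.end, start)
        result[element] = page[start:stop] if stop != -1 else None
    return result


# Logical organization (what exists), independent of any website.
SCHEMA = ["Title", "Description"]

# Site-specific mapping (where it lives); made-up markers for a made-up page.
MAPPING = {
    "Title": Locator("<h1>", "</h1>"),
    "Description": Locator('<p class="summary">', "</p>"),
}
PAGE = '<h1>Example headline</h1><p class="summary">Example summary.</p>'

# Every element named in the schema should have a locator in the mapping.
assert set(MAPPING) == set(SCHEMA)
extracted = apply_mapping(PAGE, MAPPING)
```

When a site changes its layout, only the mapping table would need updating; the schema and anything built on it stay the same.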

17
Mapping Wizard
(WICCAP system architecture diagram, repeated from slide 13)
18
Mapping Wizard
  • The goal
  • Facilitates and automates the process of
    producing a Mapping Rule for a given website
  • Defines the logical data model
  • Formalizes the data model generation process
  • Provides a set of tools to automate the process

19
Formal Process
20
Mapping Wizard
21
Assistant Tools
  • Aim: To automate the formal process as much as
    possible
  • Impossible to fully automate the whole process
  • Approach
  • Identify bottlenecks that slow down the process
  • Build tools to accelerate those parts one by one
  • Efficiency of the overall process is improved

22
Network Extraction Agent
(WICCAP system architecture diagram, repeated from slide 13)
23
Network Extraction Agent
  • Performs extraction of information from websites
    based on Mapping Rules
  • Provides post-processing features, such as
    filtering and consolidation
  • Capable of handling HTML forms
  • Scheduler for extraction job scheduling
  • Information Storage Management for storing the
    extracted information
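
The post-processing features mentioned above, filtering and consolidation, might look like the following sketch. The record layout (dictionaries with `title` and `link` keys), the keyword filter, and the merge-by-link policy are all assumptions for illustration; the slides do not describe how the Network Extraction Agent implements either step.

```python
def filter_records(records, keyword):
    """Keep records whose title mentions the keyword (case-insensitive)."""
    return [r for r in records if keyword.lower() in r["title"].lower()]


def consolidate(records):
    """Merge records sharing a link, keeping the first value seen per field."""
    merged = {}
    for rec in records:
        target = merged.setdefault(rec["link"], {})
        for key, value in rec.items():
            target.setdefault(key, value)
    return list(merged.values())


# Made-up extraction output containing a duplicate entry.
records = [
    {"title": "Markets rally", "link": "/b1"},
    {"title": "Markets rally", "link": "/b1", "description": "Shares rose."},
    {"title": "Election results", "link": "/w1"},
]
cleaned = consolidate(filter_records(records, "markets"))
```

Here the two entries for `/b1` collapse into one record that carries the description from the second, and the unrelated story is filtered out.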

24
Network Extraction Agent
25
Presentation Toolkit - WIPAP
(WICCAP system architecture diagram, repeated from slide 13)
26
Presentation Toolkit - WIPAP
  • Web Information Programmer And Player (WIPAP)
  • Player - Presents the extracted information to
    end users in different ways
  • Programmer - Enables users to control what
    information is presented, how it is presented,
    and when it is presented

27
Presentation Toolkit - WIPAP
28
Summary
  • WICCAP provides a set of simple yet powerful
    software tools for Information Extraction from
    the WWW
  • Three-Layer System Architecture
  • Important Features
  • WICCAP Data Model to represent logical views of
    information
  • Mapping Wizard for auto-generation of Mapping
    Rules
  • Agent for automatic extraction and
    post-processing
  • WIPAP for programmable presentation of
    information

29
Thank You