Conceptual-Model-Based Web Data Extraction by Example

Slides: 24
Provided by: yuch3
Learn more at: https://www.deg.byu.edu
Transcript and Presenter's Notes

1
Conceptual-Model-Based Web Data Extraction by Example
Yuanqiu (Joe) Zhou
Data Extraction Group, Brigham Young University
Sponsored by NSF
2
Motivation
  • Data-rich Websites in abundance
  • Conceptual-Model-Based Methodology is resilient
  • By Example approach is user-friendly

3
By Example Approach
  • Web users specify desired information by creating
    a form
  • Users collect sample pages on the Web
  • An ontology generator learns the task by
    analyzing the form and the sample pages
  • Interactions may be needed to improve or complete
    the ontology
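
One way to picture the learning step above is as a lookup: for each form field, test the user's example value against a library of known patterns and keep the ones that match. The sketch below is only an illustration under stated assumptions. The library entries, the helper `candidate_patterns`, and the sample values for CCDResolution and ImageResolution are hypothetical and do not come from the presentation.

```python
import re

# Hypothetical data frame library: pattern name -> regular expression.
# These entries are illustrative assumptions, not the project's real library.
DATA_FRAME_LIBRARY = {
    "decimal_number": r"\b\d(\.\d{1,2})?\b",
    "dimensions": r"\b\d{3,4}\s?x\s?\d{3,4}\b",
    "camera_brand": r"\b(Nikon|Canon|Olympus|Minolta|Sony)\b",
}

def candidate_patterns(sample_value):
    """Return the names of library patterns that match the whole sample value."""
    return [name for name, pattern in DATA_FRAME_LIBRARY.items()
            if re.fullmatch(pattern, sample_value, flags=re.IGNORECASE)]

# A user-filled form: field name -> value copied from a sample page (assumed values).
form = {"Brand": "Canon", "CCDResolution": "4.0", "ImageResolution": "2272 x 1704"}
suggestions = {field: candidate_patterns(value) for field, value in form.items()}
print(suggestions)
```

A field with no matching pattern, or with several, would be a natural point for the user interaction mentioned in the last bullet.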

4
Architecture
[Architecture diagram. Components shown: Data Frame Libraries, Sample Pages, User Created Form, GUI, Ontology Generator, Extraction Engine, Target Pages, Populated Database]
5
Sample Web Page and User Created Form
[Figure: a sample Web page and the user-created form; values shown include "Canon" and "PowerShot G2"]
6
Extraction Ontology
  • Relationship Set and Constraints
  • Extraction Patterns
  • Keywords
  • Context Expressions

7
Relationship Set and Constraints
  • DigitalCamera - object
  • DigitalCamera 01 has Brand 1
  • DigitalCamera 01 has Model 1
  • DigitalCamera 01 has CCDResolution 1
  • DigitalCamera 01 has ImageResolution 1
  • DigitalCamera 01 has OpticalZoom 1
  • DigitalCamera 01 has DigitalZoom 1
  • Primary Object Name
  • Other Objects Names
  • Participation Constraints
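
The relationship sets above can be modeled as plain data and used to check extracted records. This is a minimal sketch, not the project's representation. It assumes that the "01" after DigitalCamera means each camera has zero or one value for the attribute; the class `Relationship` and the function `violations` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relationship:
    primary: str     # primary object name, e.g. "DigitalCamera"
    attribute: str   # other object name, e.g. "Brand"
    min_count: int   # assumed lower bound on values per primary object
    max_count: int   # assumed upper bound on values per primary object

ATTRIBUTES = ["Brand", "Model", "CCDResolution",
              "ImageResolution", "OpticalZoom", "DigitalZoom"]
# Assumption: every attribute takes zero or one value per camera.
RELATIONSHIPS = [Relationship("DigitalCamera", a, 0, 1) for a in ATTRIBUTES]

def violations(record):
    """List the attributes whose number of extracted values breaks the constraint."""
    problems = []
    for rel in RELATIONSHIPS:
        count = len(record.get(rel.attribute, []))
        if not rel.min_count <= count <= rel.max_count:
            problems.append(rel.attribute)
    return problems

record = {"Brand": ["Canon"], "Model": ["PowerShot G2"], "OpticalZoom": ["3", "4"]}
print(violations(record))
```

Here the record carries two OpticalZoom values, so that attribute is reported as violating its constraint.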

11
Extraction Patterns from Data Frame Libraries
  • Data Frame Libraries
    • Lexicons
    • Synonym Dictionary
    • Regular Expressions
  • Extraction Pattern
    • Lexicons for Brand and Model
    • Regular Expressions for numbers and Image resolution
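
As a rough illustration of how a lexicon could become extraction patterns for Brand, the sketch below wraps each lexicon entry in word boundaries, the form the brand patterns take later in the deck. The function names `lexicon_to_patterns` and `find_brands` are hypothetical; the brand names are the ones listed in the extraction ontology.

```python
import re

def lexicon_to_patterns(lexicon):
    """Turn each lexicon entry into a word-bounded regular expression."""
    return [r"\b" + re.escape(entry) + r"\b" for entry in lexicon]

BRAND_LEXICON = ["Nikon", "Canon", "Olympus", "Minolta", "Sony"]
patterns = lexicon_to_patterns(BRAND_LEXICON)

def find_brands(text):
    """Return every lexicon entry found in the text, in lexicon order."""
    return [m.group(0) for p in patterns for m in re.finditer(p, text)]

print(find_brands("The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD"))
```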

12
Extraction Patterns
Data Frame Libraries
  • Features a high-quality 4.0 Megapixel
    Resolution CCD
  • The new Nikon Coolpix 995 offers a boasting
    3.34 Megapixel CCD
  • 3 effective megapixel

CCDResolution matches 20 constant
  extract "\b\d(\.\d{1,2})?\b"
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b"

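The CCDResolution rule pairs an extract pattern with keywords. A plausible reading is that a number counts as a CCD resolution only when a keyword appears close to it. The sketch below implements that reading, assuming the quantifier in the extract pattern is `{1,2}`. The window size, the case-insensitive keyword matching, and the function name `ccd_resolution` are assumptions, not details from the slides.

```python
import re

EXTRACT = re.compile(r"\b\d(\.\d{1,2})?\b")
KEYWORDS = [re.compile(p, re.IGNORECASE)
            for p in (r"\bMegapixel\b", r"\bCCD\b", r"\bResolution\b")]
WINDOW = 30  # assumed: characters after the value in which a keyword must appear

def ccd_resolution(text):
    """Return the first extract match that is closely followed by a keyword."""
    for m in EXTRACT.finditer(text):
        nearby = text[m.end():m.end() + WINDOW]
        if any(k.search(nearby) for k in KEYWORDS):
            return m.group(0)
    return None

samples = [
    "Features a high-quality 4.0 Megapixel Resolution CCD",
    "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD",
    "3 effective megapixel",
]
for s in samples:
    print(ccd_resolution(s))
```

The model number 995 in the second sample is not picked up because the extract pattern allows only one digit before the decimal point.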
15
Keywords
  • Features a high-quality 4.0 Megapixel Resolution
    CCD
  • The new Nikon Coolpix 995 offers a boasting 3.34
    Megapixel CCD
  • 3 effective megapixel

CCDResolution matches 20 constant
  extract "\b\d(\.\d{1,2})?\b"
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b"

16
Context Expressions
  • 3.5x optical zoom (2.5x digital)
  • a superior 4x Optical Zoom Nikkor lens, plus 4x
    stepless digital zoom
  • optical 3X /digital 6X zoom

OpticalZoom matches 10 constant
  extract "\b\d(\.\d)?"
  context "\b\d(\.\d)?(x)\b"
  keyword "\boptical\b"
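
The OpticalZoom rule adds a context expression: the context pattern finds a span such as "3.5x", and the extract pattern pulls the number out of it. The sketch below is one possible way to combine the three parts, choosing the context match that lies closest to the keyword "optical". That nearest-keyword policy, the case-insensitive matching, and the names `gap` and `optical_zoom` are assumptions made for illustration.

```python
import re

CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", re.IGNORECASE)
EXTRACT = re.compile(r"\b\d(\.\d)?")
KEYWORD = re.compile(r"\boptical\b", re.IGNORECASE)

def gap(a, b):
    """Number of characters between two match spans (0 if they touch or overlap)."""
    return max(a.start() - b.end(), b.start() - a.end(), 0)

def optical_zoom(text):
    """Extract the value from the context match nearest to the keyword."""
    keywords = list(KEYWORD.finditer(text))
    contexts = list(CONTEXT.finditer(text))
    if not keywords or not contexts:
        return None
    best = min(contexts, key=lambda c: min(gap(c, k) for k in keywords))
    return EXTRACT.search(best.group(0)).group(0)

samples = [
    "3.5x optical zoom (2.5x digital)",
    "a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom",
    "optical 3X /digital 6X zoom",
]
for s in samples:
    print(optical_zoom(s))
```

In the third sample the keyword comes before the value, which is why the sketch measures distance in both directions.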
17
Extraction Ontology
DigitalCamera - object
DigitalCamera 01 has Brand 1
Brand matches 10 constant
  extract "\bNikon\b",
  extract "\bCanon\b",
  extract "\bOlympus\b",
  extract "\bMinolta\b",
  extract "\bSony\b"
end
DigitalCamera 01 has CCDResolution 1
CCDResolution matches 20 constant
  extract "\b\d(\.\d{1,2})?\b"
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b"
end
DigitalCamera 01 has ImageResolution 1
ImageResolution matches 20 constant
  extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b",
  extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"
  keyword "\bResolution\b", "\bImage\b"
end
DigitalCamera 01 has OpticalZoom 1
OpticalZoom matches 10 constant
  extract "\b\d"
  context "\b\d(x)\b"
  keyword "\boptical\b"
end
19
Extraction Ontology
DigitalCamera - object
DigitalCamera 01 has Brand 1
Brand matches 10 constant
  extract "\bNikon\b",
  extract "\bCanon\b",
  extract "\bOlympus\b",
  extract "\bMinolta\b",
  extract "\bSony\b"
end
DigitalCamera 01 has CCDResolution 1
CCDResolution matches 20 constant
  extract "\b\d(\.\d{1,2})?\b"
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b"
end
DigitalCamera 01 has ImageResolution 1
ImageResolution matches 20 constant
  extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b",
  extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"
  keyword "\bResolution\b", "\bImage\b"
end
DigitalCamera 01 has OpticalZoom 1
OpticalZoom matches 10 constant
  extract "\b\d(\.\d)"
  context "\b\d(\.\d)?(x)\b"
  keyword "\boptical\b"
end
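
The Extraction Ontology slides show two versions of the OpticalZoom rule: an earlier one with single-digit patterns and this one with an optional decimal part in the context. The sketch below applies both to one of the sample phrases to show why the change matters. The helper `extract_zoom` is hypothetical, and taking the first context match is a simplifying assumption.

```python
import re

def extract_zoom(context_pattern, extract_pattern, text):
    """Apply the extract pattern inside the first context match, if any."""
    context = re.search(context_pattern, text, flags=re.IGNORECASE)
    if context is None:
        return None
    value = re.search(extract_pattern, context.group(0))
    return value.group(0) if value else None

sample = "3.5x optical zoom (2.5x digital)"

# Earlier version of the rule: single digit only.
initial = extract_zoom(r"\b\d(x)\b", r"\b\d", sample)
# Later version of the rule, patterns copied as transcribed.
refined = extract_zoom(r"\b\d(\.\d)?(x)\b", r"\b\d(\.\d)", sample)

print(initial, refined)
```

With the earlier patterns the context lands on "5x" and the value comes out as 5; the later patterns recover 3.5. As transcribed, the later extract pattern lacks the `?` that the Context Expressions slide shows, so under this sketch a whole-number zoom such as "4x" would yield no value unless the decimal part is made optional.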
21
Results (Same Site)
22
Results (Different Site)
23
Summary and Future Work
  • The example indicates that the approach is
    feasible
  • Some open questions need to be explored