An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites

1
An Unsupervised Framework for Extracting and
Normalizing Product Attributes from Multiple Web
Sites
  • Tak-Lam Wong
  • Dept. of Computer Science and Engineering
  • The Chinese University of Hong Kong
  • Wai Lam, Tik-Shun Wong
  • Dept. of Systems Engineering and Engineering
    Management
  • The Chinese University of Hong Kong

@ SIGIR 2008, Singapore
2
Presentation Outline
  • Introduction
  • Problem Definition
  • Our Model
  • Inference Method
  • Experimental Results
  • Conclusions

3
Motivation
(Source: http://www.crayeon3.com)
(Source: http://www.superwarehouse.com)
4
Information Extraction
  • To extract product attributes
  • Prior knowledge about content
  • Effective sensor resolution
  • Layout format
  • White balance, shutter speed
  • Mutual influence
  • Light sensitivity

5
Attribute Normalization
  • Samples of extracted text fragments from a page
  • Cloudy, daylight, etc.
  • What do they refer to?
  • A text fragment extracted from another page
  • white balance auto, daylight, cloudy, tungsten, ...
  • Attribute normalization
  • To cluster text fragments into the same group
  • Better indexing for product search
  • Easier understanding and interpretation

6
Existing Works
  • Supervised wrapper induction
  • They need training examples.
  • The wrapper learned from a Web site cannot be
    applied to other sites.
  • Template-independent extraction (Zhu et al.,
    2007)
  • They cannot handle previously unseen attributes.
  • Unsupervised wrapper learning (Crescenzi et al.,
    2001)
  • Extracted data are not normalized.

7
Contributions
  • Unsupervised learning framework for jointly
    extracting and normalizing product attributes
    from multiple Web sites.
  • Our framework considers page-independent content
    information and page-dependent layout
    information.
  • Can extract an unlimited number of product
    attributes (Dirichlet process)
  • Can visualize the semantic meaning of each
    product attribute

9
Problem Definition (1)
  • A product domain
  • E.g., the digital camera domain
  • A set of reference attributes
  • E.g., resolution, white balance, etc.
  • A special element representing
    not-an-attribute
  • A collection of Web pages from any Web sites,
    each of which contains a single product
  • Consider any text fragment from a Web page

10
Problem Definition (2)
Line separator
<TR> <TD> <P> <SPAN> White
balance </SPAN> </P> </TD> <TD>
<P> <SPAN> Auto, daylight, cloudy,
tungsten, fluorescent, fluorescent H,
custom </SPAN> </P> </TD> </TR> <TR>
Line separator
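
The slide does not say which tags count as line separators, so the following is only a minimal sketch of how a page might be cut into text fragments. The tag set `LINE_SEPARATORS` and the names `FragmentCollector` and `extract_fragments` are assumptions made for illustration, not part of the presented framework.

```python
from html.parser import HTMLParser

# Assumption: these tags act as line separators; the slides do not list them.
LINE_SEPARATORS = {"tr", "br", "li", "hr", "div"}


class FragmentCollector(HTMLParser):
    """Gather the visible text found between consecutive line separators."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._words = []

    def _flush(self):
        # Close the current fragment, if it contains any text.
        if self._words:
            self.fragments.append(" ".join(self._words))
            self._words = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag in LINE_SEPARATORS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in LINE_SEPARATORS:
            self._flush()

    def handle_data(self, data):
        # Other tags (TD, P, SPAN, ...) do not split a fragment.
        self._words.extend(data.split())


def extract_fragments(html):
    collector = FragmentCollector()
    collector.feed(html)
    collector.close()
    collector._flush()
    return collector.fragments


page = (
    "<TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> "
    "<TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, "
    "fluorescent H, custom </SPAN> </P> </TD> </TR> "
    "<TR> <TD> View larger image </TD> </TR>"
)
fragments = extract_fragments(page)
```

Under these assumptions each table row becomes one text fragment, so the attribute name and its values stay together in a single fragment.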
11
Problem Definition (3)
  • Information extraction
  • Attribute normalization
  • Joint attribute extraction and normalization

Attribute information
Target information
Layout information
Content information
12
Problem Definition (4)
  • White balance Auto, daylight, cloudy, tungsten,
    fluorescent, fluorescent H, custom
  • T = 1
  • A = white balance
  • Cloudy, daylight
  • T = 1
  • A = white balance
  • View larger image
  • T = 0
  • A = not-an-attribute

14
Our Model
Dirichlet Process Prior (Infinite Mixture Model)
S: different Web sites
N: text fragments
k-th component proportion
Content info. generation
Target info. generation
15
Generation Process
16
Generation Process
  • The joint probability of generating a particular
    text fragment given the model parameters
  • Inference
  • Intractable

18
Variational Method (1)
  • Finding the exact posterior is intractable
  • Our goal: design a tractable distribution that
    is as close to the true posterior as possible
  • KL divergence
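
The symbols on this slide were lost in the transcript, so the notation below is an assumption: x stands for the observed text fragments, h for the hidden variables, P(h | x) for the true posterior, and Q(h) for the tractable distribution. With that notation, the standard definition of the KL divergence and its relation to the log-likelihood can be written as:

```latex
\mathrm{KL}(Q \,\|\, P) = \sum_{h} Q(h) \log \frac{Q(h)}{P(h \mid x)},
\qquad
\log P(x) = \underbrace{\sum_{h} Q(h) \log \frac{P(h, x)}{Q(h)}}_{\text{lower bound}}
            + \mathrm{KL}(Q \,\|\, P)
```

Because log P(x) does not depend on Q, making the KL divergence small is equivalent to making the lower bound large, which is the usual way a variational method is turned into an optimization problem.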

19
Variational Method (2)
  • Truncated stick-breaking process (Ishwaran and
    James, 2001)
  • Replace infinity with a truncation level K
  • Max
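
As a rough illustration of the truncation idea, the sketch below draws mixture proportions from a stick-breaking prior truncated at level K. The function name `truncated_stick_breaking` and the concentration parameter `alpha` are assumptions for this example; the slides do not give the parameter values used in the framework.

```python
import random


def truncated_stick_breaking(alpha, K, rng):
    """Draw K mixture proportions from a stick-breaking prior truncated at K."""
    proportions = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for k in range(K):
        # Fixing the last break at 1 hands the rest of the stick to
        # component K, so the K proportions sum to one.
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        proportions.append(v * remaining)
        remaining *= 1.0 - v
    return proportions


pi = truncated_stick_breaking(alpha=1.0, K=10, rng=random.Random(0))
```

Components beyond K receive zero weight, which is what makes the infinite mixture manageable in practice.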

20
Variational Method (3)
  • One important variational parameter
  • How likely is it that a text fragment comes from
    the k-th component?
  • Attribute normalization!

21
Variational Method (4)
  • Another important variational parameter
  • How likely is it that a text fragment should be
    extracted?
  • Attribute extraction!
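
The slides do not state how the two variational parameters are turned into final decisions. The sketch below shows one plausible reading, assuming a threshold on the extraction probability and an argmax over component probabilities. The names `phi`, `extract_prob`, and `threshold`, and all the numbers, are hypothetical.

```python
def normalize_and_extract(phi, extract_prob, threshold=0.5):
    """Return a component index per fragment, or None if it is not extracted.

    phi[n][k]       -- how likely fragment n comes from the k-th component
    extract_prob[n] -- how likely fragment n should be extracted
    """
    decisions = []
    for responsibilities, p in zip(phi, extract_prob):
        if p < threshold:
            # Treated as not-an-attribute (assumed decision rule).
            decisions.append(None)
        else:
            # Normalization: pick the most likely component.
            best = max(range(len(responsibilities)),
                       key=responsibilities.__getitem__)
            decisions.append(best)
    return decisions


# Hypothetical values for three text fragments and three components.
phi = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
extract_prob = [0.9, 0.8, 0.05]
decisions = normalize_and_extract(phi, extract_prob)
```

In this toy input the first two fragments are extracted and fall into the same component, which is how extraction and normalization would be read off jointly.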

22
Unsupervised Approach
  • What should be extracted?
  • Make use of the prior knowledge about a domain.
  • Only a few terms about the product attributes
  • E.g., resolution, light sensitivity, shutter
    speed, etc.
  • Can be easily obtained, for example, by just
    highlighting the attributes of a Web page
  • Initialization

24
Experiments
  • We have conducted experiments on four different
    domains
  • Digital camera: 85 Web pages from 41 different
    sites
  • MP3 player: 96 Web pages from 62 different sites
  • Camcorder: 111 Web pages from 61 different sites
  • Restaurant: 29 Web pages from LA-Weekly
    Restaurant Guide
  • In each domain, we conducted 10 runs of
    experiments.
  • In each run, we randomly selected a Web page and
    used the attributes inside as prior knowledge.

25
Evaluation on Attribute Normalization
  • Baseline approach
  • Agglomerative clustering
  • Edit distance between text fragments
  • Evaluation metrics
  • Pairwise recall (R)
  • Pairwise precision (P)
  • Pairwise F1-measure (F)
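
The baseline and the metrics above can be sketched as follows. The slides do not specify the linkage criterion or the stopping rule, so single-link merging and the `max_distance` cut-off are assumptions, as are the function names and the toy fragments.

```python
from itertools import combinations


def edit_distance(a, b):
    """Character-level Levenshtein distance (standard dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def agglomerative(fragments, max_distance):
    """Single-link clustering; stop when the closest pair exceeds max_distance."""
    clusters = [[i] for i in range(len(fragments))]

    def link(c1, c2):
        return min(edit_distance(fragments[i], fragments[j])
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[x], clusters[y]) > max_distance:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because y > x

    labels = [0] * len(fragments)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def pairwise_prf(predicted, gold):
    """Pairwise precision, recall and F1 over all pairs of fragments."""
    pairs = list(combinations(range(len(gold)), 2))
    same_pred = {p for p in pairs if predicted[p[0]] == predicted[p[1]]}
    same_gold = {p for p in pairs if gold[p[0]] == gold[p[1]]}
    correct = len(same_pred & same_gold)
    precision = correct / len(same_pred) if same_pred else 0.0
    recall = correct / len(same_gold) if same_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy fragments and gold clusters, made up for illustration.
fragments = ["white balance auto", "white balance: auto",
             "shutter speed 1/2000", "shutter speed 1/4000"]
gold = [0, 0, 1, 1]
labels = agglomerative(fragments, max_distance=3)
scores = pairwise_prf(labels, gold)
```

A pair counts as correct when both the prediction and the gold standard place its two fragments in the same cluster, so the metrics do not depend on how the clusters are numbered.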

26
Results of Attribute Normalization
27
Visualize the Normalized Attributes
  • The top five weighted terms in the ten largest
    normalized attributes in the digital camera
    domain

28
Evaluation on Attribute Extraction
  • Surprisingly, in the restaurant domain, our
    framework achieves a performance (0.95
    F1-measure) comparable to that of the supervised
    method (Muslea et al., 2001)

30
Conclusions (1)
  • We aim at simultaneously extracting and
    normalizing product attributes from Web pages
    collected from different sites.
  • Our method considers page-independent content
    information and the page-dependent layout
    information.
  • We have developed a graphical model, which
    employs a Dirichlet process prior, to model the
    generation of text fragments in Web pages.

31
Conclusions (2)
  • An unsupervised inference algorithm based on
    a variational method is designed.
  • We formally show that content and layout
    information can collaborate and improve both
    extraction and normalization performance under
    our model.
  • Experiments on four different domains have been
    conducted to show the robustness and
    effectiveness of our approach.

32
Questions and Answers