An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites


1
An Unsupervised Framework for Extracting and
Normalizing Product Attributes from Multiple Web
Sites

Tak-Lam Wong
Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Wai Lam, Tik-Shun Wong
Dept. of Systems Engineering and Engineering
Management
The Chinese University of Hong Kong

@ SIGIR 2008, Singapore
2
Presentation Outline

Introduction
Problem Definition
Our Model
Inference Method
Experimental Results
Conclusions

3
Motivation
(Source: http://www.crayeon3.com)
(Source: http://www.superwarehouse.com)
4
Information Extraction

To extract product attributes
Prior knowledge about content
Effective sensor resolution
Layout format
White balance, shutter speed
Mutual influence
Light sensitivity

5
Attribute Normalization

Samples of extracted text fragments from a page
Cloudy, daylight, etc.
What do they refer to?
A text fragment extracted from another page
White balance: auto, daylight, cloudy, tungsten, ...
Attribute normalization
To cluster text fragments into the same group
Better indexing for product search
Easier understanding and interpretation

6
Existing Works

Supervised wrapper induction
They need training examples.
The wrapper learned from a Web site cannot be
applied to other sites.
Template-independent extraction (Zhu et al.,
2007)
They cannot handle previously unseen attributes.
Unsupervised wrapper learning (Crescenzi et al.,
2001)
Extracted data are not normalized.

7
Contributions

Unsupervised learning framework for jointly
extracting and normalizing product attributes
from multiple Web sites.
Our framework considers page-independent content
information and page-dependent layout
information.
Can extract an unlimited number of product
attributes (Dirichlet process)
Can visualize the semantic meaning of each
product attribute

8
Presentation Outline

Introduction
Problem Definition
Our Model
Inference Method
Experimental Results
Conclusions

9
Problem Definition (1)

A product domain
E.g., the digital camera domain
A set of reference attributes
E.g., resolution, white balance, etc.
A special element representing
not-an-attribute
A collection of Web pages from any Web sites,
each of which contains a single product
Any text fragment from a Web page is considered

10
Problem Definition (2)
Line separator
<TR> <TD> <P> <SPAN> White
balance </SPAN> </P> </TD> <TD>
<P> <SPAN> Auto, daylight, cloudy,
tungstem, fluorescent, fluorescent H,
custom </SPAN> </P> </TD> </TR> <TR>
Line separator
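
A minimal sketch of how a page could be cut into text fragments at line separators. The slide does not say which tags act as separators, so treating <TR> and </TR> as the separators is an assumption, and the names ROW_SEPARATOR and split_into_fragments are hypothetical.

```python
import re
from html import unescape

# Any HTML tag, used to strip markup from a fragment.
TAG = re.compile(r"<[^>]+>")
# Assumption: table-row tags act as line separators.
ROW_SEPARATOR = re.compile(r"</?TR\b[^>]*>", re.IGNORECASE)


def split_into_fragments(html: str) -> list[str]:
    """Split HTML at line separators and return the non-empty text fragments."""
    fragments = []
    for chunk in ROW_SEPARATOR.split(html):
        text = unescape(TAG.sub(" ", chunk))  # drop the remaining tags
        text = " ".join(text.split())         # collapse runs of whitespace
        if text:
            fragments.append(text)
    return fragments


sample = (
    "<TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> <TD> <P> <SPAN> "
    "Auto, daylight, cloudy, tungstem, fluorescent, fluorescent H, custom "
    "</SPAN> </P> </TD> </TR> <TR>"
)
print(split_into_fragments(sample))
```

On the sample row this yields a single fragment holding both the attribute name and its values, which is the unit the later slides label.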
11
Problem Definition (3)

Information extraction
Attribute normalization
Joint attribute extraction and normalization

Attribute information
Target information
Layout information
Content information
12
Problem Definition (4)

White balance Auto, daylight, cloudy, tungstem,
fluorescent, fluorescent H, custom
T = 1
A = white balance
Cloudy, daylight
T = 1
A = white balance
View larger image
T = 0
A = not-an-attribute

13
Presentation Outline

Introduction
Problem Definition
Our Model
Inference Method
Experimental Results
Conclusions

14
Our Model
Dirichlet Process Prior (Infinite Mixture Model)
S: different Web sites
N: text fragments
k-th component proportion
Content info. generation
Target info. generation
15
Generation Process
16
Generation Process

The joint probability for generating a particular
text fragment given the model parameters
Inference: intractable

17
Presentation Outline

Introduction
Problem Definition
Our Model
Inference Method
Experimental Results
Conclusions

18
Variational Method (1)

Finding the exact posterior is intractable
Our goal: design a tractable distribution that is
as close to the true posterior as possible.
Closeness is measured by KL divergence
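
As a rough illustration of the KL divergence criterion, the sketch below compares two candidate distributions against a target over three outcomes. The discrete setting, the function name kl_divergence, and the example numbers are assumptions for illustration only; the model in the talk works with a far richer posterior.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions over the same outcomes."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0.0:
            if pi <= 0.0:
                raise ValueError("p must be positive wherever q is positive")
            total += qi * math.log(qi / pi)
    return total


# Hypothetical target distribution and two candidate approximations.
p = [0.7, 0.2, 0.1]
q_close = [0.6, 0.3, 0.1]
q_far = [0.1, 0.2, 0.7]

print(kl_divergence(q_close, p))
print(kl_divergence(q_far, p))
```

The candidate with the smaller divergence is the better approximation, and the divergence is zero only when the two distributions coincide.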

19
Variational Method (2)

Truncated stick-breaking process (Ishwaran and
James, 2001)
Replace infinity with a truncation level K
Max
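
A minimal sketch of a truncated stick-breaking construction, assuming the standard form in which each break is drawn from Beta(1, alpha) and the last break at truncation level K takes all of the remaining stick. The concentration parameter name alpha and the function name are assumptions, not taken from the slides.

```python
import random


def truncated_stick_breaking(alpha, K, rng):
    """Draw K mixture weights that sum to 1 from a truncated stick-breaking process."""
    weights = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for k in range(K):
        # The final break takes everything left, so the weights sum to 1.
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


weights = truncated_stick_breaking(alpha=1.0, K=10, rng=random.Random(0))
print(weights)
print(sum(weights))
```

Forcing the final break to 1 is what replaces the infinite sequence of components with a finite one.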

20
Variational Method (3)

One important variational parameter
How likely is it that a text fragment comes from
the k-th component?
Attribute normalization!

21
Variational Method (4)

Another important variational parameter
How likely is it that a text fragment should be
extracted?
Attribute extraction!
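
One way the two variational parameters could be read off after inference, as a hedged sketch: each fragment goes to its most likely component (normalization) and is kept when its extraction probability is high enough (extraction). The input layout, the 0.5 threshold, and the function name are assumptions; the slides do not state the decision rule.

```python
def normalize_and_extract(responsibilities, extract_prob, threshold=0.5):
    """Return (component index, extract?) for each text fragment.

    responsibilities[n][k]: how likely fragment n comes from component k.
    extract_prob[n]: how likely fragment n should be extracted.
    """
    results = []
    for phi, prob in zip(responsibilities, extract_prob):
        component = max(range(len(phi)), key=lambda k: phi[k])  # argmax over k
        results.append((component, prob > threshold))
    return results


# Hypothetical values for three fragments and three components.
responsibilities = [
    [0.8, 0.1, 0.1],  # e.g. "White balance Auto, daylight, ..."
    [0.7, 0.2, 0.1],  # e.g. "Cloudy, daylight"
    [0.2, 0.2, 0.6],  # e.g. "View larger image"
]
extract_prob = [0.9, 0.8, 0.1]

print(normalize_and_extract(responsibilities, extract_prob))
```

Fragments that land in the same component are treated as the same normalized attribute.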

22
Unsupervised Approach

What should be extracted?
Make use of the prior knowledge about a domain.
Only a few terms about the product attributes
E.g., resolution, light sensitivity, shutter
speed, etc.
Can be easily obtained, for example, by just
highlighting the attributes of a Web page
Initialization

23
Presentation Outline

Introduction
Problem Definition
Our Model
Inference Method
Experimental Results
Conclusions

24
Experiments

We have conducted experiments on four different
domains
Digital camera: 85 Web pages from 41 different
sites
MP3 player: 96 Web pages from 62 different sites
Camcorder: 111 Web pages from 61 different sites
Restaurant: 29 Web pages from LA-Weekly
Restaurant Guide
In each domain, we conducted 10 runs of
experiments.
In each run, we randomly selected a Web page and
used the attributes inside as prior knowledge.

25
Evaluation on Attribute Normalization

Baseline approach
Agglomerative clustering
Edit distance between text fragments
Evaluation metrics
Pairwise recall (R)
Pairwise precision (P)
Pairwise F1-measure (F)
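
The pairwise metrics can be sketched as follows, assuming the usual definition in which every pair of text fragments is counted as correct when both the predicted clustering and the reference labels put the pair in the same group. The slides do not spell out the definition, and the function name and example labels are hypothetical.

```python
from itertools import combinations


def pairwise_scores(predicted, gold):
    """Return pairwise (precision, recall, F1) of a clustering against gold labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1  # pair grouped together, correctly
        elif same_pred:
            fp += 1  # pair grouped together, but should be apart
        elif same_gold:
            fn += 1  # pair kept apart, but should be together
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical reference attributes and predicted cluster ids for five fragments.
gold = ["white balance", "white balance", "white balance", "resolution", "resolution"]
predicted = [0, 0, 1, 1, 1]

print(pairwise_scores(predicted, gold))
```

Because the scores depend only on which pairs share a group, the predicted cluster ids do not need to match the reference attribute names.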

26
Results of Attribute Normalization
27
Visualize the Normalized Attributes

The top five weighted terms in the ten largest
normalized attributes in the digital camera
domain

28
Evaluation on Attribute Extraction

Surprisingly, in the restaurant domain, our
framework achieves a performance (0.95
F1-measure) which is comparable to the supervised
method (Muslea et al., 2001)

29
Presentation Outline

Introduction
Problem Definition
Our Model
Inference Method
Experimental Results
Conclusions

30
Conclusions (1)

We aim at simultaneously extracting and
normalizing product attributes from Web pages
collected from different sites.
Our method considers page-independent content
information and the page-dependent layout
information.
We have developed a graphical model, which
employs a Dirichlet process prior, to model the
generation of text fragments in Web pages.

31
Conclusions (2)

An unsupervised inference algorithm based on
a variational method is designed.
We formally show that content and layout
information can collaborate and improve both
extraction and normalization performance under
our model.
Experiments on four different domains have been
conducted to show the robustness and
effectiveness of our approach.

32
Questions and Answers
