Title: An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites
1An Unsupervised Framework for Extracting and
Normalizing Product Attributes from Multiple Web
Sites
- Tak-Lam Wong
- Dept. of Computer Science and Engineering
- The Chinese University of Hong Kong
- Wai Lam, Tik-Shun Wong
- Dept. of Systems Engineering and Engineering Management
- The Chinese University of Hong Kong
At SIGIR 2008, Singapore
2Presentation Outline
- Introduction
- Problem Definition
- Our Model
- Inference Method
- Experimental Results
- Conclusions
3Motivation
(Source: http://www.crayeon3.com)
(Source: http://www.superwarehouse.com)
4Information Extraction
- To extract product attributes
- Prior knowledge about content
- Effective sensor resolution
- Layout format
- White balance, shutter speed
- Mutual influence
- Light sensitivity
5Attribute Normalization
- Samples of extracted text fragments from a page
- Cloudy, daylight, etc.
- What do they refer to?
- A text fragment extracted from another page
- white balance: auto, daylight, cloudy, tungsten,
- Attribute normalization
- To cluster text fragments into the same group
- Better indexing for product search
- Easier understanding and interpretation
6Existing Works
- Supervised wrapper induction
- They need training examples.
- The wrapper learned from a Web site cannot be applied to other sites.
- Template-independent extraction (Zhu et al., 2007)
- They cannot handle previously unseen attributes.
- Unsupervised wrapper learning (Crescenzi et al., 2001)
- Extracted data are not normalized.
7Contributions
- Unsupervised learning framework for jointly extracting and normalizing product attributes from multiple Web sites.
- Our framework considers page-independent content information and page-dependent layout information.
- Can extract an unlimited number of product attributes (Dirichlet process)
- Can visualize the semantic meaning of each product attribute
9Problem Definition (1)
- A product domain
- E.g., the digital camera domain
- A set of reference attributes
- E.g., resolution, white balance, etc.
- A special element representing not-an-attribute
- A collection of Web pages from any Web sites, each of which contains a single product
- Any text fragment from a Web page
10Problem Definition (2)
Line separator
<TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD>
<TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </SPAN> </P> </TD> </TR>
<TR>
Line separator
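A minimal sketch of how a row like the one on this slide could be split into text fragments. It assumes that each table cell yields one fragment; the slides do not specify the segmentation rule, and the class name `FragmentCollector` is hypothetical.

```python
from html.parser import HTMLParser


class FragmentCollector(HTMLParser):
    """Collect the text of each table cell as one text fragment.

    Assumption: <td> delimits fragments. The slides do not state this.
    """

    def __init__(self):
        super().__init__()
        self.fragments = []   # finished fragments, in document order
        self._buffer = None   # text pieces of the open cell, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "td":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join(" ".join(self._buffer).split())
            if text:
                self.fragments.append(text)
            self._buffer = None


row = (
    "<TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> "
    "<TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, "
    "fluorescent H, custom </SPAN> </P> </TD> </TR>"
)
collector = FragmentCollector()
collector.feed(row)
collector.close()
fragments = collector.fragments
```

Here `fragments` holds two strings: the attribute name and its list of values.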
11Problem Definition (3)
- Information extraction
- Attribute normalization
- Joint attribute extraction and normalization
Attribute information
Target information
Layout information
Content information
12Problem Definition (4)
- White balance: Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom
- T = 1
- A = white balance
- Cloudy, daylight
- T = 1
- A = white balance
- View larger image
- T = 0
- A = not-an-attribute
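The three examples on this slide can be written down as a small data structure. This is only an illustrative sketch: the names `LabeledFragment` and `NOT_AN_ATTRIBUTE` are hypothetical, not taken from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass

NOT_AN_ATTRIBUTE = "not-an-attribute"  # hypothetical name for the special element


@dataclass(frozen=True)
class LabeledFragment:
    text: str        # the text fragment itself
    target: int      # T: 1 if the fragment should be extracted, else 0
    attribute: str   # A: a reference attribute, or NOT_AN_ATTRIBUTE


examples = [
    LabeledFragment(
        "White balance Auto, daylight, cloudy, tungsten, fluorescent, "
        "fluorescent H, custom",
        target=1,
        attribute="white balance",
    ),
    LabeledFragment("Cloudy, daylight", target=1, attribute="white balance"),
    LabeledFragment("View larger image", target=0, attribute=NOT_AN_ATTRIBUTE),
]

# Extraction keeps the fragments with T = 1; normalization groups them by A.
groups = defaultdict(list)
for fragment in examples:
    if fragment.target == 1:
        groups[fragment.attribute].append(fragment.text)
```

Both extracted fragments end up in the same "white balance" group even though they share few words, which is the point of normalization.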
14Our Model
Dirichlet process prior (infinite mixture model)
S: different Web sites
N: text fragments
k-th component proportion
Content info. generation
Target info. generation
15Generation Process
16Generation Process
- The joint probability for generating a particular text fragment given the model parameters
- Inference
- Intractable
18Variational Method (1)
- Exact inference is intractable
- Our goal: design a tractable distribution that is as close as possible to the true distribution
- KL divergence
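For reference, the standard form of this objective is given below. The symbols are assumptions, since the slide's own notation did not survive: X stands for the observed text fragments, Z for the hidden variables, and Q for the tractable distribution.

```latex
\begin{align}
\mathrm{KL}\bigl(Q(Z)\,\|\,P(Z \mid X)\bigr)
  &= \mathbb{E}_{Q}\bigl[\log Q(Z)\bigr]
   - \mathbb{E}_{Q}\bigl[\log P(Z \mid X)\bigr] \\
\log P(X)
  &= \underbrace{\mathbb{E}_{Q}\bigl[\log P(X, Z)\bigr]
   - \mathbb{E}_{Q}\bigl[\log Q(Z)\bigr]}_{\text{lower bound } \mathcal{L}(Q)}
   + \mathrm{KL}\bigl(Q(Z)\,\|\,P(Z \mid X)\bigr)
\end{align}
```

Because log P(X) does not depend on Q, minimizing the KL divergence is the same as maximizing the lower bound.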
19Variational Method (2)
- Truncated stick-breaking process (Ishwaran and James, 2001)
- Replace infinity with a truncation level K
- Max
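A minimal sketch of drawing mixture proportions from a stick-breaking process truncated at level K. The concentration parameter `alpha` and the function name are assumptions for illustration; the slides give no values.

```python
import random


def truncated_stick_breaking(alpha, truncation, rng):
    """Draw K mixture proportions from a truncated stick-breaking process."""
    proportions = []
    remaining = 1.0  # length of the stick not yet assigned
    for k in range(truncation):
        # The last break takes the whole remaining stick, so the K
        # proportions sum to one.
        if k == truncation - 1:
            fraction = 1.0
        else:
            fraction = rng.betavariate(1.0, alpha)
        proportions.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return proportions


rng = random.Random(0)  # fixed seed so the draw is repeatable
pi = truncated_stick_breaking(alpha=1.0, truncation=10, rng=rng)
```

Forcing the final fraction to one is what makes the truncated model a proper finite mixture over K components.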
20Variational Method (3)
- One important variational parameter
- How likely is it that a text fragment comes from the k-th component?
- Attribute normalization!
21Variational Method (4)
- Another important variational parameter
- How likely is it that a text fragment should be extracted?
- Attribute extraction!
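A sketch of how the two variational parameters could be read off once inference has converged. The names `phi` and `extract_prob`, the numbers, and the 0.5 threshold are all assumptions for illustration; the slides do not state a decision rule.

```python
def normalize_and_extract(phi, extract_prob, threshold=0.5):
    """Turn per-fragment probabilities into decisions.

    phi[n][k]       -- how likely fragment n comes from component k
    extract_prob[n] -- how likely fragment n should be extracted
    Returns one (component, extracted) pair per fragment.
    """
    decisions = []
    for weights, p in zip(phi, extract_prob):
        # Normalization: assign the fragment to its most likely component.
        component = max(range(len(weights)), key=weights.__getitem__)
        # Extraction: keep the fragment if its probability is high enough.
        decisions.append((component, p >= threshold))
    return decisions


# Made-up values for three fragments and K = 3 components.
phi = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.3, 0.5],
]
extract_prob = [0.90, 0.85, 0.10]
decisions = normalize_and_extract(phi, extract_prob)
```

The first two fragments fall into the same component and are extracted; the third is assigned elsewhere and dropped.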
22Unsupervised Approach
- What should be extracted?
- Make use of the prior knowledge about a domain.
- Only a few terms about the product attributes
- E.g., resolution, light sensitivity, shutter speed, etc.
- Can be easily obtained, for example, by just highlighting the attributes of a Web page
- Initialization
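One way the few prior terms might seed the initialization is sketched below. This is an assumption, not the authors' stated procedure: fragments that mention a seed term start with a high extraction belief, all others with a low one. The function name and the 0.9 / 0.1 values are hypothetical.

```python
def initial_extraction_belief(fragments, seed_terms, high=0.9, low=0.1):
    """Give each fragment a starting belief that it should be extracted."""
    seeds = [term.lower() for term in seed_terms]
    beliefs = []
    for text in fragments:
        lowered = text.lower()
        # Case-insensitive substring match against every seed term.
        mentions_seed = any(term in lowered for term in seeds)
        beliefs.append(high if mentions_seed else low)
    return beliefs


seed_terms = ["resolution", "light sensitivity", "shutter speed"]
fragments = [
    "Shutter speed",
    "View larger image",
    "Effective sensor resolution",
]
beliefs = initial_extraction_belief(fragments, seed_terms)
```

These beliefs are only a starting point; inference is then free to revise them using layout and content information.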
24Experiments
- We have conducted experiments on four different domains
- Digital camera: 85 Web pages from 41 different sites
- MP3 player: 96 Web pages from 62 different sites
- Camcorder: 111 Web pages from 61 different sites
- Restaurant: 29 Web pages from LA-Weekly Restaurant Guide
- In each domain, we conducted 10 runs of experiments.
- In each run, we randomly selected a Web page and used the attributes inside as prior knowledge.
25Evaluation on Attribute Normalization
- Baseline approach
- Agglomerative clustering
- Edit distance between text fragments
- Evaluation metrics
- Pairwise recall (R)
- Pairwise precision (P)
- Pairwise F1-measure (F)
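A sketch of the baseline and the metrics named on this slide. Several details are assumptions, because the slides do not give them: single-link merging, the stopping threshold `max_distance`, and the usual pairwise definitions (a pair counts as positive when both fragments share a cluster). The sample strings are illustrative only.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def agglomerative(texts, max_distance):
    """Single-link agglomerative clustering; returns a cluster id per text."""
    clusters = [[i] for i in range(len(texts))]
    while True:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = min(edit_distance(texts[i], texts[j])
                    for i in clusters[x] for j in clusters[y])
            if d <= max_distance and (best is None or d < best[0]):
                best = (d, x, y)
        if best is None:
            break  # no pair of clusters is close enough to merge
        _, x, y = best
        clusters[x].extend(clusters.pop(y))
    labels = [0] * len(texts)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


def pairwise_scores(predicted, gold):
    """Pairwise precision, recall and F1 over all pairs of fragments."""
    same_pred = same_gold = same_both = 0
    for i, j in combinations(range(len(gold)), 2):
        p = predicted[i] == predicted[j]
        g = gold[i] == gold[j]
        same_pred += p
        same_gold += g
        same_both += p and g
    precision = same_both / same_pred if same_pred else 0.0
    recall = same_both / same_gold if same_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


texts = ["white balance", "White balance", "shutter speed",
         "shutter speeds", "exposure time"]
gold = [0, 0, 1, 1, 1]  # illustrative labels: last three are one attribute
labels = agglomerative(texts, max_distance=2)
precision, recall, f1 = pairwise_scores(labels, gold)
```

In this toy case the baseline never merges "exposure time" with "shutter speed", because the two strings look nothing alike. Precision stays at 1.0 while recall drops to 0.5, which illustrates the weakness of purely string-based clustering.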
26Results of Attribute Normalization
27Visualize the Normalized Attributes
- The top five weighted terms in the ten largest
normalized attributes in the digital camera
domain
28Evaluation on Attribute Extraction
- Surprisingly, in the restaurant domain, our framework achieves an F1-measure of 0.95, which is comparable to the supervised method (Muslea et al., 2001)
30Conclusions (1)
- We aim at simultaneously extracting and normalizing product attributes from Web pages collected from different sites.
- Our method considers page-independent content information and the page-dependent layout information.
- We have developed a graphical model, which employs a Dirichlet process prior, to model the generation of text fragments in Web pages.
31Conclusions (2)
- An unsupervised inference algorithm based on the variational method is designed.
- We formally show that content and layout information can collaborate and improve both extraction and normalization performance under our model.
- Experiments on four different domains have been conducted to show the robustness and effectiveness of our approach.
32Questions and Answers