Adaptive Web Sites: Automatically Synthesizing Web Pages

Transcript and Presenter's Notes

Title: Adaptive Web Sites: Automatically Synthesizing Web Pages


1
Adaptive Web Sites: Automatically Synthesizing Web Pages
  • Mike Perkowitz and Oren Etzioni
  • www.cs.washington.edu/homes/map/adaptive/

2
Adaptive Web Sites
  • Web sites that automatically reconfigure their
    organization and presentation by learning from
    user access patterns.
  • (Perkowitz & Etzioni, IJCAI-97)

3
Adaptive Web Sites
  • Individual customization: the site learns you like
    sports
  • Group transformation: the site learns most sports
    lovers also read Tank McNamara and cross-links
    them

4
Group Transformations
  • Our approach: history-based
  • Previously: simple transformations (Perkowitz &
    Etzioni, WWW6)
  • Goal: change in view

5
machines.hyperreal.org
6
Drum Machine Samples
8
Index Page Synthesis
  • Find groups of related documents at the site and
    create new pages linking to those documents.
  • Input: web site, access log
  • Output: pages of links to related pages
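
As a rough illustration of the output side, the sketch below turns one group of related pages into a page of links. The function name `render_index_page`, the page title, and the example paths are hypothetical; the slides do not say how the synthesized pages are laid out.

```python
from html import escape


def render_index_page(title, paths):
    """Return a minimal HTML page linking to each path (hypothetical layout)."""
    items = "\n".join(
        f'  <li><a href="{escape(path, quote=True)}">{escape(path)}</a></li>'
        for path in sorted(paths)
    )
    return (
        f"<html><head><title>{escape(title)}</title></head><body>\n"
        f"<h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul>\n</body></html>"
    )


# Example paths are made up for illustration.
page = render_index_page("Related pages", {"/samples/a.html", "/samples/b.html"})
print(page)
```

Links are sorted only to make the output deterministic; ordering, titles, and link labels are open design questions.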

9
Questions
  • What links are on the index page?
  • How are the contents ordered?
  • What is the title?
  • How are links labeled?
  • How do we make the index comprehensive?

10
Outline
  • Motivation
  • Plausible approaches
    • Clustering
    • Frequent sets
  • Our approach: Cluster Mining
  • Algorithm: PageGather
  • Evaluation

11
Clustering
  • (Voorhees-86, Willet-88, Rasmussen-92)
  • Similarity metric over documents
  • Cluster items close together, far from others
  • Algorithms:
    • Hierarchical Agglomerative Clustering (HAC)
    • K-means clustering
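
To make the HAC idea concrete, here is a minimal sketch that repeatedly merges the two most similar clusters until a target count is reached. Single-link merging, the `hac` function name, and the toy numeric data are assumptions made for illustration; the slides do not specify a linkage rule.

```python
def hac(items, similarity, num_clusters):
    """Agglomerative clustering: start with singletons, merge the closest pair."""
    clusters = [{item} for item in items]

    def link(a, b):
        # Single-link (an assumption): similarity of the closest pair of members.
        return max(similarity(x, y) for x in a for y in b)

    while len(clusters) > num_clusters:
        # Exhaustively search all cluster pairs for the most similar one.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Toy data: numbers on a line, similarity = negative distance.
clusters = hac([1, 2, 10, 11, 50], lambda x, y: -abs(x - y), num_clusters=3)
print([sorted(c) for c in clusters])  # [[1, 2], [10, 11], [50]]
```

Every item ends up in exactly one cluster, and each merge scans all pairs of clusters.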

12
Clustering
  • Visit: set of pages accessed by an individual
  • Document: page
  • Similarity: co-occurrence in visits
  • Cluster → index page contents
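
One simple way to read "co-occurrence in visits" is to count, for each pair of pages, how many visits contain both. The sketch below does that; the function name `cooccurrence_counts` and the example paths are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(visits):
    """Count how many visits contain each unordered pair of pages."""
    counts = Counter()
    for visit in visits:
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(visit)), 2):
            counts[(a, b)] += 1
    return counts


# Each visit is the set of pages one visitor accessed (made-up data).
visits = [
    {"/a", "/b", "/c"},
    {"/a", "/b"},
    {"/b", "/c"},
]
counts = cooccurrence_counts(visits)
print(counts[("/a", "/b")])  # 2
```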

13
Clustering Problems
  • Clustering induces a partition over data
  • Clustering can be slow

14
Frequent Sets
  • (Agrawal, Imielinski & Swami-93)
  • Set of transactions, each a basket of items
  • Find all frequently-occurring itemsets
  • Algorithm:
    • A priori

15
Frequent Sets
  • Visit: set of pages accessed by an individual
  • Item: page
  • Transaction: visit
  • Frequent set → index page contents
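
The mapping above can be sketched as a level-wise search in the style of A priori: grow candidate sets one page at a time and keep only those that appear in enough visits. The function name `frequent_sets`, the `min_support` parameter, and the data are illustrative assumptions, not the configuration used in the talk.

```python
from itertools import combinations


def frequent_sets(visits, min_support):
    """Return {itemset: support} for every set found in >= min_support visits."""
    visits = [frozenset(v) for v in visits]

    def support(itemset):
        return sum(1 for v in visits if itemset <= v)

    pages = {page for v in visits for page in v}
    level = {frozenset([p]) for p in pages if support(frozenset([p])) >= min_support}
    result = {}
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        # Join step: combine sets from this level that differ by a single page.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        # Prune step: every smaller subset must already be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in result for s in combinations(c, len(c) - 1))
            and support(c) >= min_support
        }
    return result


visits = [{"/a", "/b", "/c"}, {"/a", "/b"}, {"/b", "/c"}, {"/a", "/b", "/c"}]
result = frequent_sets(visits, min_support=2)
print(len(result), result[frozenset({"/a", "/b", "/c"})])  # 7 2
```

Even this tiny input yields seven overlapping sets, most of them subsets of one another.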

16
Frequent Sets Problems
  • Frequent Item Problem
  • Finds many similar itemsets
  • Low minimum frequency → high running time

17
Idea: Cluster Mining
  • Find only high-quality clusters
  • Not a partition
  • Clusters may overlap

18
The PageGather Algorithm
  • Graph-based representation
  • Nodes: pages
  • Edges: P1 and P2 are linked if P(P1|P2) and P(P2|P1) are both high
  • Fast and accurate
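
A minimal sketch of the graph construction, assuming the edge condition means both conditional probabilities of visiting one page given the other must reach a cutoff. The `threshold` value, the function name `build_graph`, and the example visits are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph(visits, threshold):
    """Link two pages when both conditional visit probabilities reach threshold."""
    page_count = Counter()
    pair_count = Counter()
    for visit in visits:
        pages = sorted(set(visit))
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))

    graph = defaultdict(set)
    for (p1, p2), both in pair_count.items():
        p1_given_p2 = both / page_count[p2]  # estimate of P(P1 | P2)
        p2_given_p1 = both / page_count[p1]  # estimate of P(P2 | P1)
        if min(p1_given_p2, p2_given_p1) >= threshold:
            graph[p1].add(p2)
            graph[p2].add(p1)
    return dict(graph)


visits = [{"/a", "/b"}, {"/a", "/b"}, {"/a", "/c"}, {"/b"}]
graph = build_graph(visits, threshold=0.6)  # 0.6 is an arbitrary cutoff
print(graph)
```

Here "/a" and "/b" are linked (both probabilities are 2/3), while "/a" and "/c" are not: everyone who saw "/c" also saw "/a", but only a third of "/a" visitors saw "/c".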

19
(Slide shows an excerpt of the raw server access log from 1997/07/03, with requests to www.hyperreal.org and www.apache.org. Each entry records the requesting host, the request, content type, status, timestamp, referrer, and user agent.)
Pipeline: Log → Visits → Co-occurrence → Graph → Clique/CC → New Page

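
The first step of the pipeline, turning log entries into visits, might look like the sketch below. It assumes the log has already been parsed into `(host, timestamp_in_seconds, path)` tuples and that a visit ends after a period of inactivity. The slides do not state how visits are delimited, so the 600-second `timeout`, the function name `log_to_visits`, and the sample requests are all assumptions.

```python
from collections import defaultdict


def log_to_visits(requests, timeout=600):
    """Group requests by host, splitting into a new visit after a long gap."""
    by_host = defaultdict(list)
    for host, timestamp, path in requests:
        by_host[host].append((timestamp, path))

    visits = []
    for hits in by_host.values():
        hits.sort()  # chronological order per host
        current, last_time = set(), None
        for timestamp, path in hits:
            if last_time is not None and timestamp - last_time > timeout:
                visits.append(current)  # gap too long: close the visit
                current = set()
            current.add(path)
            last_time = timestamp
        visits.append(current)
    return visits


# Made-up parsed requests: (host, seconds since start, path).
requests = [
    ("h1", 0, "/a"),
    ("h1", 30, "/b"),
    ("h2", 10, "/a"),
    ("h1", 5000, "/c"),
]
visits = log_to_visits(requests)
print(visits)
```

Identifying a visitor by host name is itself a simplification, since proxies and caches can hide several users behind one host.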
20
PageGather
  • Implement with Cliques or CCs
  • Find all candidates, return best
  • Clique: maximal cliques of size ≤ k
  • Clique and CC versions comparable in time and
    performance
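
The two candidate-finding variants can be sketched as follows. The graph is a dict mapping each page to the set of its neighbors. Bron-Kerbosch is used here as a standard maximal-clique enumeration; the slides do not say which clique algorithm PageGather uses, and treating the size bound as an upper limit (`max_size`) is an assumption.

```python
def connected_components(graph):
    """Return the connected components of an undirected graph."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


def maximal_cliques(graph, max_size):
    """Bron-Kerbosch enumeration, keeping maximal cliques of at most max_size nodes."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            if len(clique) <= max_size:
                cliques.append(clique)
            return
        for node in list(candidates):
            expand(clique | {node}, candidates & graph[node], excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return cliques


# Made-up graph: a triangle a-b-c, a tail c-d, and a separate pair e-f.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c"}, "e": {"f"}, "f": {"e"},
}
components = connected_components(graph)
cliques = maximal_cliques(graph, max_size=3)
print(sorted(sorted(c) for c in components))
print(sorted(sorted(c) for c in cliques))
```

On this graph the components are {a, b, c, d} and {e, f}, while the cliques are {a, b, c}, {c, d}, and {e, f}: cliques are smaller and may overlap (page c appears twice), whereas components never do.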

21
Experiments
  • machines.hyperreal.org
  • Site gets 1200 visitors/day (10k hits)
  • Site contains 2500 distinct documents
  • Training: a month of access data
  • Testing: ten days of data

22
Performance Metric
  • Are index pages helpful to users?
  • How well do clusters predict user navigation?
  • Q(C): given that a user visits one page in
    cluster C, how likely is she to visit any other?
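
One way to estimate Q(C) from held-out data is to take the visits that touch at least one page of C and measure the fraction that touch two or more. This reading of the metric, the function name `cluster_quality`, and the test visits are assumptions for illustration.

```python
def cluster_quality(cluster, visits):
    """Fraction of visits touching the cluster that touch at least two of its pages."""
    cluster = set(cluster)
    touched = [len(cluster & set(visit)) for visit in visits]
    at_least_one = sum(1 for n in touched if n >= 1)
    at_least_two = sum(1 for n in touched if n >= 2)
    return at_least_two / at_least_one if at_least_one else 0.0


# Made-up test visits.
test_visits = [{"/a", "/b"}, {"/a"}, {"/c", "/d"}, {"/d"}, {"/a", "/b", "/c"}]
q = cluster_quality({"/a", "/b", "/c"}, test_visits)
print(q)  # 0.5
```

Four of the five visits touch the cluster and two of those reach a second page, giving 0.5.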

23
Cluster Mining vs. Clustering
  • PageGather using
    • Clique → 10 clusters, 105 min
  • HAC → 10 clusters, 48 hours
  • K-means → 10 clusters, 335 min

24
Cluster Mining vs. Clustering
  • PageGather using
    • Clique → 10 clusters, 105 min
  • HAC → 10 clusters, 48 hours
  • K-means → 10 clusters, 335 min
  • HAC → 8 clusters, 2155 min
  • (threshold, less data, mining)

25
Cluster Mining vs. Clustering
  • PageGather using
    • Clique → 10 clusters, 105 min
  • HAC → 10 clusters, 48 hours
  • K-means → 10 clusters, 335 min
  • HAC → 7 clusters, 29308 min
  • (threshold, less data, mining)

26
Cluster Mining vs. Clustering
(Chart: Q for the top 10 clusters)
27
Cluster Mining vs. Clustering
(Chart: Q for the top 10 clusters)
28
Cluster Mining vs. Clustering
(Chart: Q for the top 10 clusters)
29
PageGather vs. Frequent Sets
  • PG/Clique → 10 clusters, 105 min
  • A priori → 10 frequent sets, 141 min

30
PageGather vs. Frequent Sets
(Chart: Q for the top 10 clusters)
31
Contributions
  • Motivating problem: Web page synthesis
  • Method: Cluster mining
    • well suited for discovery of coherent sets
    • comparison to clustering, frequent sets
  • Algorithm: PageGather
    • graph-based, fast and accurate

32
Clique vs. Conn-component
(Chart: Q for the top 10 clusters)
33
Clique vs. Conn-component
  • Comparable accuracy
  • Clique finds fewer, smaller clusters than CC
  • Clique more accurate (at first)
  • Comparable running time (in practice)

34
Future Directions
  • Meta-information to improve coherence
  • Conceptual clustering
    • Improve coherence
    • Naming pages
  • Cluster mining to generate association rules