Website Clustering

About This Presentation

Slides: 10

Provided by: ray76

Learn more at: https://nlp.stanford.edu

Tags: arthur | clustering | website

Transcript and Presenter's Notes

1
Website Clustering

2
Traditional Lexical Features

Traditional website clustering uses lexical data
parsed from each webpage to classify the websites
into different categories.
Regular text
<TITLE> tags
<META> tags (description, keywords, author)
What if the webpage consists mainly of content
generated automatically by scripts?
What if the webpage is an empty frame page with
two or more frames?

3
AOL Clickthrough Data

Back in August 2006, AOL released 2.2 GB of
search logs, which include queries, clicked
websites, and the rank of each clicked website in
the search results.
Query                   Rank  Clicked website
brochures for business  5     http://www.hp.com
brochures for business  6     http://www.hansonmarketing.com
brochures for business  8     http://www.smallbusinessbrief.com
brochures for business  10    http://www.quickbrochures.com
brochures for business  9     http://www.smallbusinessbrief.com
brochures for business  7     http://www.printingforless.com

4
Query-Website Graph

We parsed a subset of this data to generate a
query-document bipartite graph, where each link
in the graph is weighted by the number of times
the query led to the website being clicked.
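
The graph construction can be sketched roughly as follows. This is a minimal illustration, not the presenters' code: the function name build_click_graph and the (query, rank, url) record shape are assumptions modeled on the sample rows from the AOL slide.

```python
from collections import defaultdict


def build_click_graph(records):
    """Return {query: {website: click_count}} from (query, rank, url) records.

    The record layout is an assumption for illustration only.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for query, _rank, url in records:
        # Each click adds one to the weight of the query-website link.
        graph[query][url] += 1
    return graph


# Rows shaped like the sample on the AOL slide: query, rank, clicked website.
records = [
    ("brochures for business", 5, "http://www.hp.com"),
    ("brochures for business", 8, "http://www.smallbusinessbrief.com"),
    ("brochures for business", 9, "http://www.smallbusinessbrief.com"),
]
graph = build_click_graph(records)
```

Here the link from "brochures for business" to smallbusinessbrief.com gets weight 2, because that website was clicked twice for the query.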

[Figure: bipartite graph linking queries Q1-Q5 to documents D1-D5]

5
Query-Website Graph

A graph like this is most likely too sparse to be
useful.
There are many unobserved clicks between
queries and other related webpages.
We use an iterative process to smooth out the
bipartite relationship between queries and
websites, based on two observations:
Documents are considered similar to some extent
if they have been reached from the same query.
Queries are considered similar to some extent
if they lead to the same document.
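
The slides do not spell out the smoothing procedure, so the following is only one plausible sketch under stated assumptions: similarities are cosine similarities, and each iteration mixes the observed click weights with weights propagated through similar queries and similar documents. The names smooth_clicks, alpha, and iterations are hypothetical.

```python
import numpy as np


def row_normalize(m):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    m = np.asarray(m, dtype=float)
    totals = m.sum(axis=1, keepdims=True)
    return np.divide(m, totals, out=np.zeros_like(m), where=totals > 0)


def cosine_similarity(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = np.divide(m, norms, out=np.zeros_like(m), where=norms > 0)
    return unit @ unit.T


def smooth_clicks(clicks, alpha=0.5, iterations=3):
    """Assumed smoothing scheme: blend observed and propagated weights."""
    observed = row_normalize(clicks)
    weights = observed.copy()
    for _ in range(iterations):
        query_sim = cosine_similarity(weights)    # queries alike if they share documents
        doc_sim = cosine_similarity(weights.T)    # documents alike if they share queries
        propagated = query_sim @ weights @ doc_sim
        weights = (1 - alpha) * observed + alpha * row_normalize(propagated)
    return weights


# Rows are queries Q1-Q3, columns are documents D1-D3 (toy click counts).
clicks = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
smoothed = smooth_clicks(clicks)
```

In this toy input Q2 never clicked D1, but it shares D2 with Q1, so the smoothed matrix gives the Q2-D1 link a nonzero weight. Q3 shares nothing with the other queries and stays disconnected from D1 and D2.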

6
Query-Website Graph

This will produce a more realistic query-website
bipartite relationship.
We can then use a list of queries associated with
each website as a semantic feature vector.

[Figure: query-website graph over documents D1-D3 and queries Q1-Q2]

7
Combined Feature Vectors

We have three sets of feature vectors for each
document:
Lexical features (text and different
HTML tags from the webpage itself)
Semantic features (query
information related to each webpage)
Combination of both
With 10,000 words and 2,000 queries, there are too
many features.
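
One simple way to form the combined vectors is to concatenate the two feature blocks per website. The sketch below uses made-up toy matrices, and the per-block L2 normalization is an assumption added so that neither block dominates; the slides do not say how the combination was done.

```python
import numpy as np


def l2_normalize(m):
    """Scale each row to unit length; all-zero rows stay zero."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return np.divide(m, norms, out=np.zeros_like(m), where=norms > 0)


# Toy stand-ins: 3 websites, 4 word features, 2 query features.
lexical = np.array([[2, 0, 1, 0],
                    [0, 3, 0, 1],
                    [1, 1, 0, 0]], dtype=float)
semantic = np.array([[0.8, 0.2],
                     [0.0, 1.0],
                     [0.5, 0.5]])

# One row per website: word features followed by query features.
combined = np.hstack([l2_normalize(lexical), l2_normalize(semantic)])
```

At the scale described on the slide, the same concatenation would give 10,000 + 2,000 = 12,000 columns per website.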

8
Latent Semantic Analysis

We then apply Latent Semantic Analysis to reduce
the 12,000 features to a lower-rank approximation
of 30 virtual concepts.
Chicken, Beef, Apple, Oranges -> Meat, Fruits
Each website is transformed from the original
vector of features into a new vector of virtual
concepts.
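
Latent Semantic Analysis is usually computed with a truncated singular value decomposition. The sketch below shows that idea on a toy matrix; the function name lsa_reduce and the data are illustrative assumptions, and the real setting would use 12,000 columns and 30 concepts.

```python
import numpy as np


def lsa_reduce(features, n_concepts):
    """Project each row onto the top n_concepts singular directions."""
    u, s, _vt = np.linalg.svd(features, full_matrices=False)
    # Keep only the strongest concepts; each row becomes a short concept vector.
    return u[:, :n_concepts] * s[:n_concepts]


# Toy website-by-term counts; columns: chicken, beef, apple, oranges.
features = np.array([[3, 2, 0, 0],
                     [2, 3, 0, 0],
                     [0, 0, 5, 1],
                     [0, 0, 1, 5]], dtype=float)
concepts = lsa_reduce(features, n_concepts=2)
```

With two concepts kept, the two meat-heavy websites map to the same concept vector and the two fruit-heavy websites map to another, which mirrors the Meat and Fruits example above.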

9
K-Means Results

We then apply K-means on this new vector space to
classify websites into different categories.
Results show that using only the semantic
query vector performs worse than using only the
lexical feature vector, but combining both gives
slightly better clustering performance
(F1 0.50 versus 0.47).
Features                  F1
Lexical + semantic query  0.50
Lexical only              0.47
Queries only              0.30
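
The K-means step on this slide can be sketched with the standard Lloyd iteration. This is a generic textbook version on toy 2-D points, not the presenters' implementation; the parameters k, iterations, and seed are assumptions, and in the setting above each point would be a 30-dimensional concept vector.

```python
import numpy as np


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every point to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy concept vectors forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(points, k=2)
```

The first two points end up in one cluster and the last two in the other. Which cluster is numbered 0 depends on the random initialization.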