Website Clustering - PowerPoint PPT Presentation

Slides: 10
Provided by: ray76
Learn more at: https://nlp.stanford.edu

Transcript and Presenter's Notes

1
Website Clustering
  • Combining Website Lexical Data and Query Semantic
    Data
  • Nana Huang, Ray Li

2
Traditional Lexical Features
  • Traditional website clustering uses lexical data
    parsed from each webpage to classify the websites
    into different categories.
  • Regular text
  • <TITLE> tags
  • <META> tags (description, keywords, author)
  • What if the webpage consists mainly of content
    generated automatically by scripts?
  • What if the webpage is an empty frameset page with
    two or more frames?

3
AOL Clickthrough Data
  • Back in August 2006, AOL released 2.2 GB of
    search logs, which include queries, clicked
    websites, and the rank of each clicked website
    in the result list.
  • Sample records (query, rank, clicked website):
  • brochures for business 5 http://www.hp.com
  • brochures for business 6 http://www.hansonmarketing.com
  • brochures for business 8 http://www.smallbusinessbrief.com
  • brochures for business 10 http://www.quickbrochures.com
  • brochures for business 9 http://www.smallbusinessbrief.com
  • brochures for business 7 http://www.printingforless.com

4
Query-Website Graph
  • We parsed a subset of this data to generate a
    query-document bipartite graph, where each edge
    is weighted by the number of times a query led
    to a website being clicked.
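As a rough illustration of this step, the sketch below builds the weighted query-website graph as a nested dictionary. It assumes records shaped like the AOL samples (query, numeric field, clicked URL); the function name `build_click_graph` and the "business cards" record are hypothetical and only there for illustration.

```python
from collections import defaultdict

# Records in the (query, number, clicked URL) layout of the sample log lines.
records = [
    ("brochures for business", 5, "http://www.hp.com"),
    ("brochures for business", 8, "http://www.smallbusinessbrief.com"),
    ("brochures for business", 9, "http://www.smallbusinessbrief.com"),
    ("business cards", 1, "http://www.hp.com"),  # made-up record for illustration
]

def build_click_graph(records):
    """Return {query: {url: click_count}}, i.e. the weighted bipartite graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for query, _number, url in records:  # the numeric field is ignored here
        graph[query][url] += 1
    return {query: dict(urls) for query, urls in graph.items()}

graph = build_click_graph(records)
```

Here the two clicks on smallbusinessbrief.com for the same query collapse into a single edge of weight 2.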

[Figure: bipartite graph linking queries Q1-Q5 to documents D1-D5]

5
Query-Website Graph
  • A graph like this is most likely too sparse to be
    useful.
  • Many plausible clicks between queries and other
    related webpages are never observed.
  • We use an iterative process to smooth out the
    bipartite relationship between queries and
    websites, based on two observations:
  • Documents are considered similar to some extent
    if they have been clicked for the same query.
  • Queries are considered similar to some extent
    if they lead to the same document.
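The slides do not give the update rule, so the following is only one possible sketch of such an iterative smoothing. It assumes cosine similarity between queries (rows) and between documents (columns), and blends the observed click matrix with clicks propagated through both similarities; `alpha`, `rounds`, and the function names are hypothetical choices, not the authors' method.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cosine_rows(m):
    """Cosine similarity between every pair of rows of m."""
    norms = [math.sqrt(sum(x * x for x in row)) or 1.0 for row in m]
    return [[sum(x * y for x, y in zip(r1, r2)) / (n1 * n2)
             for r2, n2 in zip(m, norms)]
            for r1, n1 in zip(m, norms)]

def smooth(clicks, alpha=0.5, rounds=2):
    """Blend observed clicks with clicks propagated via similar queries and documents."""
    w = [list(row) for row in clicks]
    for _ in range(rounds):
        q_sim = cosine_rows(w)                           # queries that share documents
        d_sim = cosine_rows([list(c) for c in zip(*w)])  # documents that share queries
        spread = matmul(matmul(q_sim, w), d_sim)
        w = [[(1 - alpha) * x + alpha * s for x, s in zip(w_row, s_row)]
             for w_row, s_row in zip(w, spread)]
    return w

# Rows are queries Q1-Q3, columns are documents D1-D3 (made-up counts).
clicks = [[2, 1, 0],
          [0, 1, 0],
          [0, 0, 3]]
smoothed = smooth(clicks)
```

In this toy matrix Q2 never clicked D1, but it shares D2 with Q1, so the smoothed weight for (Q2, D1) becomes positive, while Q3 and D3 stay disconnected from the rest.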

6
Query-Website Graph
  • The smoothing produces a more realistic
    query-website bipartite relationship.
  • We can then use the list of queries associated
    with each website as its semantic feature vector.

[Figure: query-website graph over documents D1-D3 and queries Q1-Q2]

7
Combined Feature Vectors
  • We have three sets of feature vectors for each
    document:
  • Lexical features (text and HTML tags from the
    webpage itself)
  • Semantic features (queries related to each
    webpage)
  • Combination of both
  • With 10,000 words and 2,000 queries, there are
    too many features.
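One plausible reading of "combination of both" is simple concatenation, which is consistent with 10,000 + 2,000 = 12,000 features. The sketch below assumes that reading and also L2-normalizes each block first so that neither dominates; the normalization and the names `l2_normalize` and `combine` are assumptions, not something the slides state.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def combine(lexical, semantic):
    """Concatenate the normalized lexical and semantic blocks into one vector."""
    return l2_normalize(lexical) + l2_normalize(semantic)

# Toy example: 2 word features and 3 query features for one website.
combined = combine([3, 4], [0, 2, 0])
```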

8
Latent Semantic Analysis
  • We then apply Latent Semantic Analysis to reduce
    the 12,000 features to a lower-rank
    approximation with 30 virtual concepts.
  • Chicken, Beef, Apple, Oranges -> Meat, Fruits
  • Each website is transformed from the original
    vector of features into a new vector of virtual
    concepts.
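To make the transformation concrete, here is a small pure-Python sketch of the idea behind Latent Semantic Analysis: find the top singular directions of the website-by-feature matrix and project each website onto them. It uses power iteration only to stay self-contained (a real pipeline would more likely call a library SVD routine); the matrix values, `top_concepts`, and `project` are made up for illustration.

```python
import math

def top_concepts(rows, k, iters=200):
    """Approximate the top-k right singular vectors of `rows` by power iteration."""
    n_features = len(rows[0])
    concepts = []
    for c in range(k):
        # Deterministic, non-zero starting vector.
        v = [1.0 + 0.1 * ((j + c) % n_features) for j in range(n_features)]
        for _ in range(iters):
            scores = [sum(a * b for a, b in zip(row, v)) for row in rows]  # A v
            v = [sum(row[j] * s for row, s in zip(rows, scores))
                 for j in range(n_features)]                               # A^T (A v)
            for u in concepts:  # remove directions that were already found
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        concepts.append(v)
    return concepts

def project(rows, concepts):
    """Map each website from feature space into concept space."""
    return [[sum(a * b for a, b in zip(row, u)) for u in concepts] for row in rows]

# Columns: Chicken, Beef, Apple, Oranges (made-up counts for four websites).
websites = [[2, 3, 0, 0],
            [1, 2, 0, 0],
            [0, 0, 3, 2],
            [0, 0, 1, 3]]
concepts = top_concepts(websites, k=2)
reduced = project(websites, concepts)
```

With this toy data one concept loads only on Apple and Oranges and the other only on Chicken and Beef, so each website ends up with a two-dimensional "fruit, meat" vector instead of four word counts.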

9
K-Means Results
  • We then apply K-means on this new vector space to
    group websites into different categories.
  • Results show that, while the semantic query
    vector alone performs worse than the lexical
    feature vector alone, combining both features
    gives slightly better clustering performance.
  • Lexical + semantic query: F1 = 0.50
  • Lexical only: F1 = 0.47
  • Queries only: F1 = 0.30
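The K-means step itself can be sketched in a few lines. This is a minimal version of the standard Lloyd iteration, assuming squared Euclidean distance and, for reproducibility, centers initialized from the first k points (random or k-means++ initialization is more usual). The points are made-up two-dimensional concept vectors, and the slides do not say how F1 was computed, so no evaluation is shown.

```python
def kmeans(points, k, iters=20):
    """Cluster points with Lloyd's algorithm; returns one cluster label per point."""
    centers = [list(p) for p in points[:k]]  # assumption: seed with the first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up concept vectors: two "meat" sites followed by two "fruit" sites.
points = [(3.6, 0.0), (2.5, 0.1), (0.0, 3.5), (0.1, 2.9)]
labels = kmeans(points, k=2)
```

Even though both initial centers come from the "meat" sites, the update step pulls one center toward the "fruit" sites and the two groups separate after a couple of iterations.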