Measuring Contribution of HTML Features in Web Document Clustering - PowerPoint PPT Presentation

Provided by: oldemarr

1
Measuring Contribution of HTML Features in
Web Document Clustering
  • Oldemar Rodríguez
  • School of Mathematics, UCR
  • and Predisoft
  • Esteban Meneses
  • Computing Research Center, ITCR

2
Motivation
3
Motivation
  • Which HTML feature is the most important to
    provide good clustering results?
  • Using symbolic objects to cluster web documents.
  • 15th World Wide Web Conference (2006)

4
HTML Document Clustering
  • Find meaningful groups from a web document
    collection.
  • Effectively represent web document clusters for
    further analysis.

5
HTML Document
6
(No Transcript)
7
Classical Representations
  • Different approaches for representing a web
    document.

<5, 22, 19, 4, ..., 38>
8
Vectorial Representation
  • Every document is represented by a vector
    in n-dimensional space.
  • Bag of words scheme. Each variable represents the
    relative weight of a term in the document.
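
A minimal sketch of the bag-of-words scheme described above. The slide does not say how the relative weight is computed, so this example assumes plain relative term frequency over a fixed vocabulary; the function name `bag_of_words`, the vocabulary, and the sample text are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return one relative weight per vocabulary term (assumed: relative frequency)."""
    # Lowercase the text and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Normalize by the number of vocabulary hits so the weights sum to 1.
    total = sum(counts[term] for term in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[term] / total for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["cluster", "document", "html", "web"]
doc = "Web document clustering groups each web document by its HTML content."
vector = bag_of_words(doc, vocabulary)
print(vector)
```

Note that "clustering" does not match the vocabulary term "cluster" here: without stemming, each surface form is a separate variable, which is one reason real systems usually normalize terms first.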

9
Symbolic Objects
  • Real-life objects are too complex to be
    represented by points in a vectorial space
    [Bock & Diday, 2000].
  • Symbolic objects overcome this limitation by
    representing concepts rather than individuals.
  • In a symbolic data array each variable can have
    one of many data types: sets, intervals,
    histograms, trees, graphs, functions, fuzzy data,
    etc.

10
Symbolic Data Table
11
Symbolic Data Table
From relational databases to symbolic databases

Relational data (millions of individuals): multivariate numeric analysis
Individual  Age  Profession  Wage      Location
3457        36   Lawyer      2,500.00  San José
1251        28   Teacher     1,750.00  Alajuela
3245        39   Doctor      2,400.00  San José
7635        33   Teacher     1,900.00  Alajuela
3245        35   Engineer    1,850.00  Alajuela
5367        27   Engineer    1,900.00  Heredia
6486        34   Manager     1,600.00  Heredia

Symbolic data (hundreds of concepts): multivariate symbolic analysis
Location  Age       Profession                     Wage (thousands)
San José  [36, 39]  Lawyer (50%), Doctor (50%)     [2.4, 2.5]
Alajuela  [28, 35]  Teacher (66%), Engineer (33%)  [1.75, 1.9]
Heredia   [27, 34]  Engineer (50%), Manager (50%)  [1.6, 1.9]
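
The step from individuals to concepts can be sketched as a group-by on location, turning numeric columns into intervals and the categorical column into a histogram. This is only an illustration built on the rows of the relational table above; the function name `to_symbolic` and the dictionary layout are assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

# Rows of the relational table: (age, profession, wage, location).
rows = [
    (36, "Lawyer", 2500.0, "San José"),
    (28, "Teacher", 1750.0, "Alajuela"),
    (39, "Doctor", 2400.0, "San José"),
    (33, "Teacher", 1900.0, "Alajuela"),
    (35, "Engineer", 1850.0, "Alajuela"),
    (27, "Engineer", 1900.0, "Heredia"),
    (34, "Manager", 1600.0, "Heredia"),
]


def to_symbolic(rows):
    """Aggregate individuals into one symbolic object per location."""
    groups = defaultdict(list)
    for age, profession, wage, location in rows:
        groups[location].append((age, profession, wage))

    table = {}
    for location, members in groups.items():
        ages = [m[0] for m in members]
        wages = [m[2] for m in members]
        jobs = Counter(m[1] for m in members)
        table[location] = {
            # Numeric variables become [min, max] intervals.
            "age": (min(ages), max(ages)),
            "wage": (min(wages), max(wages)),
            # The categorical variable becomes a relative-frequency histogram.
            "profession": {job: n / len(members) for job, n in jobs.items()},
        }
    return table


symbolic = to_symbolic(rows)
for location, concept in symbolic.items():
    print(location, concept)
```

Seven individuals collapse into three concepts, which is the size reduction the slide points at.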
12
Symbolic Data Base
  • Relational data base: 100% of the knowledge, 15 gigabytes
  • Symbolic data base: 90% of the knowledge, 10.3 megabytes
13
Symbolic Representations
  • A complex representation that takes into account
    term frequency, word order and phrases.

14
The K-Means Clustering Method
15
But there are some problems.
16
Distance Measures
17
Theorem: Fisher's Equality
  • Total inertia = between-class inertia +
    within-class inertia
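
Fisher's equality says that total inertia splits into between-class inertia plus within-class inertia. A small numeric check on classical (point) data, assuming squared Euclidean distance and equal weights for all points; the helper names and the toy clusters are hypothetical.

```python
def mean(points):
    """Center of gravity (vector of means) of a list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertias(clusters):
    """Return (total, between-class, within-class) inertia."""
    all_points = [p for c in clusters for p in c]
    n = len(all_points)
    g = mean(all_points)  # global center of gravity
    total = sum(sq_dist(p, g) for p in all_points) / n
    # Each class center is weighted by the class size.
    between = sum(len(c) * sq_dist(mean(c), g) for c in clusters) / n
    within = sum(sq_dist(p, mean(c)) for c in clusters for p in c) / n
    return total, between, within


# Toy data: two well-separated clusters in the plane.
clusters = [
    [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)],
    [(8.0, 9.0), (9.0, 8.0)],
]
total, between, within = inertias(clusters)
print(total, between + within)  # the two values agree up to rounding
```

Because the total is fixed for a given data set, minimizing within-class inertia (as k-means does) is the same as maximizing between-class inertia. The check relies on each class being summarized by its vector of means.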

18
Problems in the Symbolic Case
  1. Representing a class by its center of gravity,
    that is, by its vector of means.
  2. What is the center of gravity?

19
What is the center of gravity?
20
(No Transcript)
21
Evaluation Criteria
  • Rand Index
  • Mutual Information
  • F-Measure
  • Entropy
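
As an illustration of the first criterion, the Rand index can be sketched as the fraction of document pairs on which the clustering and the reference labels agree (both place the pair together, or both place it apart). The slides do not give the exact formulas used in the experiments, so this is the standard textbook definition; the labels below are made up.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of pairs treated the same way by both labelings."""
    if len(labels_true) != len(labels_pred) or len(labels_true) < 2:
        raise ValueError("need two labelings of the same length (>= 2)")
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical reference categories and cluster assignments for five documents.
truth = ["sports", "sports", "news", "news", "tech"]
found = [0, 0, 1, 2, 2]
score = rand_index(truth, found)
print(score)
```

A score of 1.0 means the clustering reproduces the reference partition exactly; cluster identifiers themselves do not need to match the category names.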

22
Experiments
23
Experiments
24
Experiments
25
Experiments
26
Conclusions
  • Symbolic representations are richer and more
    flexible than classical representations.
  • The text in the HTML document appears to be the
    most important factor for clustering HTML documents.

27
  • Thank you!