Title: Measuring Contribution of HTML Features in Web Document Clustering
1. Measuring Contribution of HTML Features in Web Document Clustering
- Oldemar Rodríguez
- School of Mathematics, UCR
- and Predisoft
- Esteban Meneses
- Computing Research Center, ITCR
2. Motivation
3. Motivation
- Which HTML feature is the most important for obtaining good clustering results?
- Using symbolic objects to cluster web documents.
- 15th World Wide Web Conference (2006)
4. HTML Document Clustering
- Find meaningful groups in a web document collection.
- Effectively represent web document clusters for further analysis.
5. HTML Document
7. Classical Representations
- Different approaches for representing a web document.
<5, 22, 19, 4, ..., 38>
8. Vectorial Representation
- Every document is represented by a vector in n-dimensional space.
- Bag-of-words scheme: each variable represents the relative weight of a term in the document.
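As a small illustration of the bag-of-words scheme, the sketch below maps each document to a vector of relative term weights. It assumes plain relative frequency as the weight and a naive lowercase tokenizer; the slides do not say which weighting was used, and the names `bag_of_words`, `docs`, and `vocabulary` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, vocabulary):
    # One coordinate per vocabulary term; the value is the term's
    # relative frequency among the in-vocabulary tokens of the document.
    counts = Counter(tokenize(text))
    total = sum(counts[term] for term in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[term] / total for term in vocabulary]


# Toy documents, invented for illustration.
docs = [
    "Web document clustering groups web pages",
    "Symbolic objects describe document clusters",
]
vocabulary = sorted({term for doc in docs for term in tokenize(doc)})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]
```

Every document ends up as a point in a space with one dimension per vocabulary term, which is the representation the symbolic approach later generalizes.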
9. Symbolic Objects
- Real-life objects are too complex to be represented by points in a vectorial space [Bock & Diday, 2000].
- Symbolic objects overcome this limitation by representing concepts rather than individuals.
- In a symbolic data array each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.
10. Symbolic Data Table
11. Symbolic Data Table
From relational databases to symbolic databases

Data (millions): Multivariate Numeric Analysis

Individual  Age  Profession  Wage      Location
3457        36   Lawyer      2,500.00  San José
1251        28   Teacher     1,750.00  Alajuela
3245        39   Doctor      2,400.00  San José
7635        33   Teacher     1,900.00  Alajuela
3245        35   Engineer    1,850.00  Alajuela
5367        27   Engineer    1,900.00  Heredia
6486        34   Manager     1,600.00  Heredia

Concepts (hundreds): Multivariate Symbolic Analysis

Location  Age       Profession                     Wage (thousands)
San José  [36, 39]  Lawyer (50%), Doctor (50%)     [2.4, 2.5]
Alajuela  [28, 35]  Teacher (66%), Engineer (33%)  [1.75, 1.9]
Heredia   [27, 34]  Engineer (50%), Manager (50%)  [1.6, 1.9]
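The step from the table of individuals to the table of concepts can be sketched as a group-by on Location, with numeric columns summarized as intervals and the categorical column as a frequency distribution. This is only an illustration under those assumptions, not the authors' implementation; `to_symbolic` and the dictionary layout are hypothetical.

```python
from collections import Counter, defaultdict

# Rows copied from the table of individuals:
# (individual, age, profession, wage, location)
individuals = [
    (3457, 36, "Lawyer", 2500.00, "San José"),
    (1251, 28, "Teacher", 1750.00, "Alajuela"),
    (3245, 39, "Doctor", 2400.00, "San José"),
    (7635, 33, "Teacher", 1900.00, "Alajuela"),
    (3245, 35, "Engineer", 1850.00, "Alajuela"),
    (5367, 27, "Engineer", 1900.00, "Heredia"),
    (6486, 34, "Manager", 1600.00, "Heredia"),
]


def to_symbolic(rows):
    # Group individuals by location; each group becomes one concept.
    groups = defaultdict(list)
    for _individual, age, profession, wage, location in rows:
        groups[location].append((age, profession, wage))

    concepts = {}
    for location, members in groups.items():
        ages = [m[0] for m in members]
        wages = [m[2] for m in members]
        professions = Counter(m[1] for m in members)
        concepts[location] = {
            # Interval-valued variables: (min, max) over the group.
            "age": (min(ages), max(ages)),
            "wage": (min(wages), max(wages)),
            # Histogram-valued variable: relative frequency per category.
            "profession": {p: c / len(members) for p, c in professions.items()},
        }
    return concepts


concepts = to_symbolic(individuals)
```

Seven individual rows collapse into three concept rows here, which is the size reduction the slide points to when it contrasts millions of records with hundreds of concepts.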
12. Symbolic Data Base
- Relational Data Base: 100% knowledge, 15 Gigabytes
- Symbolic Data Base: 90% knowledge, 10.3 Megabytes
13. Symbolic Representations
- A complex representation that takes into account term frequency, word order, and phrases.
14. The K-Means Clustering Method
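The slide names k-means without spelling out the procedure. As a reminder, here is a minimal sketch of the standard iteration on numeric vectors, assuming squared Euclidean distance and initial centers supplied by the caller. The function names are illustrative and are not taken from the presentation.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    # Coordinate-wise mean (center of gravity) of a non-empty list of points.
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def nearest(point, centers):
    # Index of the center closest to the point.
    return min(range(len(centers)), key=lambda k: sq_dist(point, centers[k]))


def kmeans(points, centers, iterations=10):
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[k] for k, c in enumerate(clusters)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Toy data, invented for illustration.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (8.0, 9.0)]
centers, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The update step relies on a mean being defined for the data, which is straightforward for numeric vectors and less obvious for symbolic descriptions.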
15. But there are some problems.
16. Distance Measures
17. Fisher's Equality Theorem
- Total inertia = between-class inertia + within-class inertia
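The decomposition can be checked numerically. The sketch below assumes equal weights 1/n for all points and squared Euclidean distance, the usual setting for this identity; the slide itself does not state the weighting. `inertia_decomposition` is a hypothetical helper name and the data are invented.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    # Center of gravity of a non-empty list of points.
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def inertia_decomposition(points, labels):
    n = len(points)
    g = mean(points)  # global center of gravity

    # Total inertia: mean squared distance to the global center.
    total = sum(sq_dist(p, g) for p in points) / n

    clusters = {}
    for p, k in zip(points, labels):
        clusters.setdefault(k, []).append(p)

    between = 0.0
    within = 0.0
    for members in clusters.values():
        gk = mean(members)  # class center of gravity
        # Between-class: class weight times squared distance of its center to g.
        between += len(members) / n * sq_dist(gk, g)
        # Within-class: squared distances of members to their own center.
        within += sum(sq_dist(p, gk) for p in members) / n
    return total, between, within


points = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0), (5.0, 5.0)]
labels = [0, 0, 1, 1, 0]
total, between, within = inertia_decomposition(points, labels)
```

Because the total is fixed for a given data set, minimizing within-class inertia is equivalent to maximizing between-class inertia.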
18. Problems in the symbolic case
- Representing a class by its center of gravity, that is, by its vector of means.
- What is the center of gravity?
19. What is the center of gravity?
21. Evaluation Criteria
- Rand Index
- Mutual Information
- F-Measure
- Entropy
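Two of these criteria are easy to state in code. The sketch below uses the textbook definitions of the (unadjusted) Rand index and of cluster entropy weighted by cluster size; the slides do not give the exact variants used in the experiments, so treat these as assumptions. The label lists are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def rand_index(truth, predicted):
    # Fraction of item pairs on which the two partitions agree:
    # both put the pair together, or both put it apart.
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def clustering_entropy(truth, predicted):
    # Size-weighted average, over clusters, of the entropy of the true
    # class distribution inside each cluster (0 means pure clusters).
    n = len(truth)
    result = 0.0
    for cluster in set(predicted):
        members = [t for t, p in zip(truth, predicted) if p == cluster]
        counts = Counter(members)
        h = -sum(c / len(members) * log2(c / len(members)) for c in counts.values())
        result += len(members) / n * h
    return result


truth = ["a", "a", "a", "b", "b", "b"]
predicted = [0, 0, 1, 1, 1, 1]
ri = rand_index(truth, predicted)
ent = clustering_entropy(truth, predicted)
```

A higher Rand index and a lower entropy both indicate closer agreement with the reference classes.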
22. Experiments
23. Experiments
24. Experiments
25. Experiments
26. Conclusions
- Symbolic representations are richer and more flexible than classical representations.
- The text in the HTML document seems to be the most important factor for clustering HTML documents.