Title: Clustering High Dimensional Data Using SVM
1. Clustering High Dimensional Data Using SVM
- Tsau Young Lin and Tam Ngo
- Department of Computer Science
- San José State University
2. Overview
- Introduction
- Support Vector Machine (SVM)
- What is SVM
- How SVM Works
- Data Preparation Using SVD
- Singular Value Decomposition (SVD)
- Analysis of SVD
- The Project
- Conceptual Exploration
- Result Analysis
- Conclusion
- Future Work
3. Introduction
- World Wide Web
- No. 1 place for information
- contains billions of documents
- impossible to classify by humans
- Project Goals
- Cluster documents
- Reduce document size
- Get reasonable results compared to human classification
4. Support Vector Machine (SVM)
- a supervised learning machine
- outperforms many popular methods for text classification
- used for bioinformatics, signature/handwriting recognition, image and text classification, pattern recognition, and e-mail spam categorization
5. Motivation for SVM
- How do we separate these points?
- with a hyperplane
Source: Authors' Research
6. SVM Process Flow
[Figure: data mapped from the input space to the feature space]
Source: DTREG
7. Convex Hulls
Source: Bennett, K. P., Campbell, C., 2000
8. Simple SVM Example
Class X1
1 0
-1 1
-1 2
1 3
- How would SVM separate these points?
- Use the kernel trick
- Φ(X1) = (X1, X1²)
- The data becomes 2-dimensional
Source: Authors' Research
9. Simple Points in Feature Space
Class X1 X1²
1 0 0
-1 1 1
-1 2 4
1 3 9
- All points here are support vectors.
Source: Authors' Research
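A minimal Java sketch of the Φ(X1) = (X1, X1²) mapping used on the last two slides (class and method names here are ours, purely illustrative): each 1-D input is lifted into 2-D feature space, which is what makes the four points linearly separable.

    // Lift each 1-D input x into the 2-D feature space (x, x^2).
    public class KernelTrickDemo {
        static double[] phi(double x) {
            return new double[] { x, x * x };
        }

        public static void main(String[] args) {
            double[] inputs = { 0, 1, 2, 3 };
            int[] labels = { 1, -1, -1, 1 };
            for (int i = 0; i < inputs.length; i++) {
                double[] f = phi(inputs[i]);
                System.out.printf("class %2d: (%.0f, %.0f)%n", labels[i], f[0], f[1]);
            }
        }
    }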
10. SVM Calculation
- Positive: ⟨w · x⟩ + b = 1
- Negative: ⟨w · x⟩ + b = -1
- Hyperplane: ⟨w · x⟩ + b = 0
- Find the unknowns w and b
- Expanding the equations:
- w1x1 + w2x2 + b = 1
- w1x1 + w2x2 + b = -1
- w1x1 + w2x2 + b = 0
11. Use Linear Algebra to Solve for w and b
- w1x1 + w2x2 + b = 1
- → w1(0) + w2(0) + b = 1
- → w1(3) + w2(9) + b = 1
- w1x1 + w2x2 + b = -1
- → w1(1) + w2(1) + b = -1
- → w1(2) + w2(4) + b = -1
- Solution: w1 = -3, w2 = 1, b = 1
- The SVM algorithm finds the solution that returns a hyperplane with the largest margin
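As a quick sanity check (our own sketch, not part of LIBSVM), plugging w = (-3, 1) and b = 1 into ⟨w · x⟩ + b for the four feature-space points should give exactly +1 for the positive class and -1 for the negative class:

    // Verify that w = (-3, 1), b = 1 places every training point on its margin plane.
    public class CheckSolution {
        public static void main(String[] args) {
            double[] w = { -3, 1 };
            double b = 1;
            double[][] points = { { 0, 0 }, { 1, 1 }, { 2, 4 }, { 3, 9 } };
            int[] labels = { 1, -1, -1, 1 };
            for (int i = 0; i < points.length; i++) {
                double value = w[0] * points[i][0] + w[1] * points[i][1] + b;
                System.out.printf("(%.0f, %.0f): <w, x> + b = %.0f, label = %d%n",
                        points[i][0], points[i][1], value, labels[i]);
            }
        }
    }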
12. Use the Solutions to Draw the Planes
Positive plane: ⟨w · x⟩ + b = 1 → w1x1 + w2x2 + b = 1 → -3x1 + 1x2 + 1 = 1 → x2 = 3x1
Negative plane: ⟨w · x⟩ + b = -1 → w1x1 + w2x2 + b = -1 → -3x1 + 1x2 + 1 = -1 → x2 = 3x1 - 2
Hyperplane: ⟨w · x⟩ + b = 0 → w1x1 + w2x2 + b = 0 → -3x1 + 1x2 + 1 = 0 → x2 = 3x1 - 1
Positive plane (x2 = 3x1):
X1 X2
0 0
1 3
2 6
3 9
Negative plane (x2 = 3x1 - 2):
X1 X2
0 -2
1 1
2 4
3 7
Hyperplane (x2 = 3x1 - 1):
X1 X2
0 -1
1 2
2 5
3 8
Source: Authors' Research
13. Simple Data Separated by a Hyperplane
Source: Authors' Research
14. LIBSVM and Parameter C
- LIBSVM: A Java Library for SVM
- When C is very small, SVM only cares about maximizing the margin, and points may fall on the wrong side of the plane.
- When C is very large, SVM forces the slack variables to be very small, to make sure that all data points in each group are separated correctly.
15. Choosing Parameter C
Source: LIBSVM
16. Four Basic Kernel Types
- LIBSVM implements 4 basic kernel types: linear, polynomial, radial basis function, and sigmoid
- 0 -- linear: u'v
- 1 -- polynomial: (gamma*u'v + coef0)^degree
- 2 -- radial basis function: exp(-gamma*|u - v|^2)
- 3 -- sigmoid: tanh(gamma*u'v + coef0)
- We use the radial basis function with a large parameter C for this project (a configuration sketch follows below).
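A minimal sketch of how that choice can be expressed with LIBSVM's Java API (the gamma and C values below are illustrative placeholders, not the values tuned for the project):

    import libsvm.svm_parameter;

    // Configure a C-SVC with the radial basis function kernel and a large C.
    public class RbfConfig {
        static svm_parameter makeParameter() {
            svm_parameter param = new svm_parameter();
            param.svm_type = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.RBF;  // exp(-gamma * |u - v|^2)
            param.gamma = 0.5;                      // illustrative kernel width
            param.C = 1000;                         // large C penalizes misclassified points heavily
            param.cache_size = 100;                 // kernel cache size in MB
            param.eps = 1e-3;                       // stopping tolerance
            return param;
        }
    }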
17. Data Preparation Using SVD
- SVM is excellent for text classification, but it requires labeled documents for training
- Singular Value Decomposition (SVD)
- separates a matrix into three parts: left eigenvectors, singular values, and right eigenvectors
- decomposes data such as images and text
- reduces data size
- We will use SVD to cluster
18. SVD Example of 4 Documents
- D1 Shipment of gold damaged in a fire
- D2 Delivery of silver arrived in a silver truck
- D3 Shipment of gold arrived in a truck
- D4 Gold Silver Truck
Source: Garcia, E., 2006
19. Matrix A = USVᵀ
D1 D2 D3 D4
a 1 1 1 0
arrived 0 1 1 0
damaged 1 0 0 0
delivery 0 1 0 0
fire 1 0 0 0
gold 1 0 1 1
in 1 1 1 0
of 1 1 1 0
shipment 1 0 1 0
silver 0 2 0 1
truck 0 1 1 1
Given a matrix A, we can factor it into three parts: U, S, and Vᵀ.
Source: Garcia, E., 2006
20. Using JAMA to Decompose Matrix A
- U
- 0.3966 -0.1282 -0.2349 0.0941
- 0.2860 0.1507 -0.0700 0.5212
- 0.1106 -0.2790 -0.1649 -0.4271
- 0.1523 0.2650 -0.2984 -0.0565
- 0.1106 -0.2790 -0.1649 -0.4271
- 0.3012 -0.2918 0.6468 -0.2252
- 0.3966 -0.1282 -0.2349 0.0941
- 0.3966 -0.1282 -0.2349 0.0941
- 0.2443 -0.3932 0.0635 0.1507
- 0.3615 0.6315 -0.0134 -0.4890
- 0.3428 0.2522 0.5134 0.1453
- S
- 4.2055 0.0000 0.0000 0.0000
- 0.0000 2.4155 0.0000 0.0000
- 0.0000 0.0000 1.4021 0.0000
- 0.0000 0.0000 0.0000 1.2302
Source: JAMA (MathWorks and the National Institute of Standards and Technology (NIST))
21. Using JAMA to Decompose Matrix A (continued)
- V
- 0.4652 -0.6738 -0.2312 -0.5254
- 0.6406 0.6401 -0.4184 -0.0696
- 0.5622 -0.2760 0.3202 0.7108
- 0.2391 0.2450 0.8179 -0.4624
- VT
- 0.4652 0.6406 0.5622 0.2391
- -0.6738 0.6401 -0.2760 0.2450
- -0.2312 -0.4184 0.3202 0.8179
- -0.5254 -0.0696 0.7108 -0.4624
- Matrix A can be reconstructed by multiplying U, S, and Vᵀ
Source: JAMA
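The decomposition shown on the two slides above takes only a few lines with JAMA (a sketch; the signs of individual singular vectors may come out flipped relative to the values above, which does not affect the clustering):

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    // Decompose the 11 x 4 term-document matrix A into U, S, and V with JAMA.
    public class SvdDemo {
        public static void main(String[] args) {
            double[][] a = {
                { 1, 1, 1, 0 },  // a
                { 0, 1, 1, 0 },  // arrived
                { 1, 0, 0, 0 },  // damaged
                { 0, 1, 0, 0 },  // delivery
                { 1, 0, 0, 0 },  // fire
                { 1, 0, 1, 1 },  // gold
                { 1, 1, 1, 0 },  // in
                { 1, 1, 1, 0 },  // of
                { 1, 0, 1, 0 },  // shipment
                { 0, 2, 0, 1 },  // silver
                { 0, 1, 1, 1 },  // truck
            };
            SingularValueDecomposition svd = new Matrix(a).svd();
            svd.getU().print(8, 4);  // left singular vectors
            svd.getS().print(8, 4);  // singular values on the diagonal
            svd.getV().print(8, 4);  // right singular vectors, one row per document
        }
    }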
22. Rank 2 Approximation (Reduced U, S, and V Matrices)
- U
- 0.3966 -0.1282
- 0.2860 0.1507
- 0.1106 -0.2790
- 0.1523 0.2650
- 0.1106 -0.2790
- 0.3012 -0.2918
- 0.3966 -0.1282
- 0.3966 -0.1282
- 0.2443 -0.3932
- 0.3615 0.6315
- 0.3428 0.2522
- S
- 4.2055 0.0000
- 0.0000 2.4155
- V
- 0.4652 -0.6738
- 0.6406 0.6401
- 0.5622 -0.2760
- 0.2391 0.2450
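With JAMA, the rank-2 truncation shown above is one getMatrix call per factor; this sketch (method name ours) keeps the first k columns of U and V and the top-left k x k block of S:

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    // Truncate U, S, and V to rank k, giving the approximation A_k = U_k * S_k * V_k^T.
    public class RankKTruncation {
        static Matrix[] truncate(Matrix a, int k) {
            SingularValueDecomposition svd = a.svd();
            Matrix uk = svd.getU().getMatrix(0, a.getRowDimension() - 1, 0, k - 1);
            Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);
            Matrix vk = svd.getV().getMatrix(0, a.getColumnDimension() - 1, 0, k - 1);
            return new Matrix[] { uk, sk, vk };
        }
    }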
23. Use Matrix V to Calculate Cosine Similarities
- Calculate cosine similarities between each pair of documents (a Java sketch follows below)
- sim(Di, Dj) = (Di · Dj) / (|Di| |Dj|)
- Example: calculate the similarities for D1
- sim(D1, D2) = (D1 · D2) / (|D1| |D2|)
- sim(D1, D3) = (D1 · D3) / (|D1| |D3|)
- sim(D1, D4) = (D1 · D4) / (|D1| |D4|)
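A small Java sketch of this computation (the two hard-coded rows are the rank-2 vectors for D1 and D3 from the truncated V matrix):

    // Cosine similarity between two document vectors (rows of the truncated V matrix).
    public class CosineSimilarity {
        static double sim(double[] di, double[] dj) {
            double dot = 0, normI = 0, normJ = 0;
            for (int k = 0; k < di.length; k++) {
                dot += di[k] * dj[k];
                normI += di[k] * di[k];
                normJ += dj[k] * dj[k];
            }
            return dot / (Math.sqrt(normI) * Math.sqrt(normJ));
        }

        public static void main(String[] args) {
            double[] d1 = { 0.4652, -0.6738 };
            double[] d3 = { 0.5622, -0.2760 };
            System.out.println(sim(d1, d3));  // about 0.87, the highest match for D1
        }
    }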
24. Results for Cosine Similarities
- Example results for D1:
- sim(D1, D2) = ((0.4652 × 0.6406) + (-0.6738 × 0.6401)) / (√(0.4652² + (-0.6738)²) × √(0.6406² + 0.6401²)) = -0.1797
- sim(D1, D3) = ((0.4652 × 0.5622) + (-0.6738 × -0.2760)) / (√(0.4652² + (-0.6738)²) × √(0.5622² + (-0.2760)²)) = 0.8727
- sim(D1, D4) = ((0.4652 × 0.2391) + (-0.6738 × 0.2450)) / (√(0.4652² + (-0.6738)²) × √(0.2391² + 0.2450²)) = -0.1921
- D3 returns the highest value, so pair D1 with D3
- Do the same for D2, D3, and D4.
25. Result of Simple Data Set
- label 1
- D1 Shipment of gold damaged in a fire
- D3 Shipment of gold arrived in a truck
- label 2
- D2 Delivery of silver arrived in a silver truck
- D4 Gold Silver Truck
26. Check the Clusters Using SVM
- Now that we have labels, we can use them to train SVM
- SVM input format for the original data (a Java sketch of building these instances follows below):
- 1 1:1.00 2:0.00 3:1.00 4:0.00 5:1.00 6:1.00 7:1.00 8:1.00 9:1.00 10:0.00 11:0.00
- 2 1:1.00 2:1.00 3:0.00 4:1.00 5:0.00 6:0.00 7:1.00 8:1.00 9:0.00 10:2.00 11:1.00
- 1 1:1.00 2:1.00 3:0.00 4:0.00 5:0.00 6:1.00 7:1.00 8:1.00 9:1.00 10:0.00 11:1.00
- 2 1:0.00 2:0.00 3:0.00 4:0.00 5:0.00 6:1.00 7:0.00 8:0.00 9:0.00 10:1.00 11:1.00
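A sketch of how such rows become training instances through LIBSVM's Java API rather than the text format (class and method names here are ours; feature indices are 1-based, matching the lines above):

    import libsvm.svm_node;
    import libsvm.svm_problem;

    // Build LIBSVM's sparse representation from labeled document vectors.
    public class LibsvmInput {
        static svm_node[] toNodes(double[] features) {
            svm_node[] nodes = new svm_node[features.length];
            for (int i = 0; i < features.length; i++) {
                nodes[i] = new svm_node();
                nodes[i].index = i + 1;       // LIBSVM feature indices start at 1
                nodes[i].value = features[i];
            }
            return nodes;
        }

        static svm_problem toProblem(double[] labels, double[][] vectors) {
            svm_problem prob = new svm_problem();
            prob.l = labels.length;           // number of training instances
            prob.y = labels;                  // cluster labels from the SVD step
            prob.x = new svm_node[vectors.length][];
            for (int i = 0; i < vectors.length; i++) {
                prob.x[i] = toNodes(vectors[i]);
            }
            return prob;
        }
    }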
27. Results from SVM's Prediction
Results from SVM's Prediction on the Original Data
Documents Used for Training / Document to Predict / SVM Prediction Result / SVD Cluster Result
D1, D2, D3 D4 1.0 2
D1, D2, D4 D3 1.0 1
D1, D3, D4 D2 2.0 2
D2, D3, D4 D1 1.0 1
Source: Authors' Research
28. Using the Truncated V Matrix
- We want to reduce the data size, so it is more practical to use the truncated V matrix
- SVM input format (truncated V matrix; a sketch that writes this format follows below):
- 1 1:0.4652 2:-0.6738
- 2 1:0.6406 2:0.6401
- 1 1:0.5622 2:-0.2760
- 2 1:0.2391 2:0.2450
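A small sketch (file path, class, and method names ours) that writes the truncated V rows in exactly this text format, so the file can be passed straight to LIBSVM's trainer:

    import java.io.IOException;
    import java.io.PrintWriter;

    // Write truncated V rows as LIBSVM-format lines: "<label> 1:<v1> 2:<v2> ...".
    public class WriteTruncatedV {
        static void write(String path, int[] labels, double[][] v) throws IOException {
            try (PrintWriter out = new PrintWriter(path)) {
                for (int i = 0; i < v.length; i++) {
                    StringBuilder line = new StringBuilder(String.valueOf(labels[i]));
                    for (int j = 0; j < v[i].length; j++) {
                        line.append(' ').append(j + 1).append(':')
                            .append(String.format("%.4f", v[i][j]));
                    }
                    out.println(line);
                }
            }
        }
    }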
29. SVM Results from the Truncated V Matrix
Results from SVM's Prediction on the Reduced Data
Documents Used for Training / Document to Predict / SVM Prediction Result / SVD Cluster Result
D1, D2, D3 D4 2.0 2
D1, D2, D4 D3 1.0 1
D1, D3, D4 D2 2.0 2
D2, D3, D4 D1 1.0 1
Using the truncated V matrix gives better results.
Source: Authors' Research
30. Document Vectors on a Graph
[Figure: the rank-2 document vectors D1, D2, D3, and D4 plotted as points on a graph]
Source: Authors' Research
31. Analysis of the Rank Approximation
Cluster Results from Different Rank Approximations (each document is listed with its closest match, followed by the resulting cluster labels)
Rank 1: D1→D4, D2→D4, D3→D4, D4→D3; label 1 = {D1, D4, D2, D3}
Rank 2: D1→D3, D2→D4, D3→D1, D4→D2; label 1 = {D1, D3}, label 2 = {D2, D4}
Rank 3: D1→D3, D2→D3, D3→D1, D4→D3; label 1 = {D1, D3, D2, D4}
Rank 4: D1→D2, D2→D3, D3→D2, D4→D2; label 1 = {D1, D2, D3, D4}
Source: Authors' Research
32. Program Process Flow
- Use the previous methods on larger data sets
- Compare the results with human classification
[Figure: program process flow diagram]
33. Conceptual Exploration
- Reuters-21578
- a collection of newswire articles that have been human-classified by Carnegie Group, Inc. and Reuters, Ltd.
- the most widely used data set for text categorization
34. Result Analysis
Clustering with SVD vs. Human Classification: First Data Set
First Data Set from Reuters-21578 (200 x 9928)
Rank / No. of Naturally Formed Clusters using SVD / SVD Cluster Accuracy (%) / SVM Prediction Accuracy (%)
Rank 002 80 75.0 65.0
Rank 005 66 81.5 82.0
Rank 010 66 60.5 54.0
Rank 015 64 52.0 51.5
Rank 020 67 38.0 46.5
Rank 030 72 60.0 54.0
Rank 040 72 62.5 58.5
Rank 050 73 54.5 51.5
Rank 100 75 45.5 58.5
Source: Authors' Research
35. Result Analysis
Clustering with SVD vs. Human Classification: Second Data Set
Second Data Set from Reuters-21578 (200 x 9928)
Rank / No. of Naturally Formed Clusters using SVD / SVD Cluster Accuracy (%) / SVM Prediction Accuracy (%)
Rank 002 76 67.0 84.5
Rank 005 73 67.0 84.5
Rank 010 64 70.0 85.5
Rank 015 64 63.0 81.0
Rank 020 67 59.5 50.0
Rank 030 69 68.5 83.5
Rank 040 69 59.0 79.0
Rank 050 76 44.5 25.5
Rank 100 71 52.0 47.0
Source: Authors' Research
36. Result Analysis
- The highest accuracy for SVD clustering is 81.5%
- A lower rank value seems to give better results
- SVM prediction accuracy is about the same as the SVD cluster accuracy
37. Result Analysis: Why are the results not higher?
- Human classification is more subjective than a program's
- Reducing many small clusters to only 2 clusters by computing the average may decrease the accuracy
38. Conclusion
- Showed how SVM works
- Explored the strengths of SVM
- Showed how SVD can be used for clustering
- Analyzed simple and complex data
- the method seems to cluster data reasonably
- Our method is able to
- reduce data size (by using truncated V matrix)
- cluster data reasonably
- classify new data efficiently (based on SVM)
- By combining known methods, we created a form of
unsupervised SVM
39. Future Work
- Extend SVD to very large data sets that can only be stored in secondary storage
- Look for more efficient SVM kernels
40. Thank You!
41. References
- Bennett, K. P., Campbell, C. (2000). Support Vector Machines: Hype or Hallelujah? ACM SIGKDD Explorations, Vol. 2, No. 2, 1-13.
- Chang, C., Lin, C. (2006). LIBSVM: a library for support vector machines. Retrieved November 29, 2006, from http://www.csie.ntu.edu.tw/~cjlin/libsvm
- Cristianini, N. (2001). Support Vector and Kernel Machines. Retrieved November 29, 2005, from http://www.support-vector.net/icml-tutorial.pdf
- Cristianini, N., Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press.
- Garcia, E. (2006). SVD and LSI Tutorial 4: Latent Semantic Indexing (LSI) How-to Calculations. Retrieved November 28, 2006, from http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
- Guestrin, C. (2006). Machine Learning. Retrieved November 8, 2006, from http://www.cs.cmu.edu/~guestrin/Class/10701/
- Hicklin, J., Moler, C., Webb, P. (2005). JAMA: A Java Matrix Package. Retrieved November 28, 2006, from http://math.nist.gov/javanumerics/jama/
42. References (continued)
- Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf
- Joachims, T. (2004). Support Vector Machines. Retrieved November 28, 2006, from http://svmlight.joachims.org/
- Reuters-21578 Text Categorization Test Collection. Retrieved November 28, 2006, from http://www.daviddlewis.com/resources/testcollections/reuters21578/
- SVM - Support Vector Machines. DTREG. Retrieved November 28, 2006, from http://www.dtreg.com/svm.htm
- Vapnik, V. N. (2000, 1995). The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc.