Title: Scalable Clustering for Vision using GPUs
 1Scalable Clustering for Vision using GPUs
- K Wasif Mohiuddin, P J Narayanan
- Center for Visual Information Technology, International Institute of
 Information Technology (IIIT), Hyderabad
2Publications
- 1) K Wasif Mohiuddin and P J Narayanan,
 Scalable Clustering using Multiple GPUs.
 HiPC 2011 (Conference on High Performance
 Computing), Bangalore, India.
- 2) K Wasif Mohiuddin and P J Narayanan,
 GPU Assisted Video Organizing Application.
 ICCV 2011, Workshop on GPU in Computer Vision
 Applications, Barcelona, Spain.
3Presentation Flow
- Scalable Clustering on Multiple GPUs 
- GPU assisted Personal Video Organizer
4Introduction
- Classification of data desired for meaningful 
 representation.
- Unsupervised learning for finding hidden 
 structure.
- Applications in computer vision and data mining, such as:
- Image Classification 
- Document Retrieval 
- K-Means algorithm 
5Clustering
[Figure: clustering diagram with stages labeled Select Centers, Labeling, Mean Evaluation, and Relabeling]
6Need for High Performance Clustering
- Clustering 125k vectors of 128 dimensions into 2k
 clusters took nearly 8 minutes per iteration on
 the CPU.
- A fast, efficient clustering implementation is
 needed to deal with large data, high
 dimensionality, and a large number of centers.
- In computer vision, SIFT (128 dimensions) and GIST
 features are common. Feature counts can run into
 several millions.
- Bag of Words vocabulary generation using SIFT
 vectors.
7Challenges and Contributions
- Computational: $O(n^{dk+1} \log n)$
- Growing n, k, d for large scale applications.
- Contributions: a complete GPU-based
 implementation with:
- Exploitation of intra-vector parallelism 
- Efficient Mean evaluation 
- Data Organization for coalesced access 
- Multi GPU framework 
8Related Work
- General Improvements 
- KD-trees [Moore et al., SIGKDD 1999]
- Triangle Inequality [Elkan, ICML 2003]
- Distributed Systems [Dhillon et al., LSPDM 2000]
- Pre-CUDA GPU Efforts
- Fragment Shader [Hall et al., SIGGRAPH 2004]
9Related Work (cont)
- Recent GPU efforts 
- Mean on CPU [Che et al., JPDC 2008]
- Mean on CPU & GPU [Hong et al., WCCSIE 2009]
- GPU Miner [Wenbin et al., HKUSTCS 2008]
- HPK-Means [Wu et al., UCHPC 2009]
- Divide & Rule [Li et al., ICCIT 2010]
- One thread assigned per vector; parallelism
 within a data object is not exploited.
- Inefficient mean evaluation.
- Proposed techniques are parameter dependent.
10K-Means
- Objective function: $\sum_{i} \sum_{j} \lVert x_i^{(j)} - c_j \rVert^2$
- $1 \le i \le n$, $1 \le j \le k$
- K random centers are initially chosen from input 
 data objects.
- Steps 
- Membership Evaluation 
- New Mean Evaluation 
- Convergence 
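The three steps above can be sketched as a plain sequential loop. This is a minimal CPU illustration, not the GPU implementation described in this talk; the function names and the optional `init_centers` argument are assumptions introduced here so the example can be run deterministically.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100, init_centers=None, seed=0):
    """Plain K-Means: membership evaluation, new mean evaluation, convergence check."""
    if init_centers is None:
        # K random centers are chosen from the input data objects.
        init_centers = random.Random(seed).sample(points, k)
    centers = [list(c) for c in init_centers]
    labels = [-1] * len(points)
    for _ in range(max_iters):
        # Membership evaluation: label each point with its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Convergence: stop when no label changes.
        if new_labels == labels:
            break
        labels = new_labels
        # New mean evaluation: average the members of each cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

For example, `kmeans(points, 2, init_centers=[points[0], points[3]])` clusters six 2-D points into two groups starting from two chosen data objects.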
11GPU Architecture
- The Fermi architecture has 16 Streaming
 Multiprocessors (SMs).
- Each SM has 32 cores, for 512 CUDA cores
 overall.
- Kernels launch many threads that perform a
 task in Single Instruction Multiple Data (SIMD)
 fashion.
- Each SM has registers divided equally amongst its
 threads. Each thread has a private local memory.
- Single unified memory request path for loads and
 stores, using an L1 cache per SM and an L2 cache
 that services all operations.
- Double precision, faster context switching,
 faster atomic operations, and multiple kernel
 execution.
12K-Means on GPU
- Membership Evaluation 
- Involves Distance and Minima evaluation. 
- Single thread per component of vector 
- Parallel computation done on d components of 
 input and center vectors stored in row major
 format.
- Log summation (a parallel reduction in a logarithmic number of steps) for distance evaluation.
- For each input vector we traverse across all 
 centers.
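The membership step can be illustrated on the CPU as follows. This is a sketch under assumptions: the inner loop of `tree_sum` stands in for threads that would run concurrently on the GPU, and all function names are hypothetical. The square root is omitted because it does not change which center is nearest.

```python
def tree_sum(values):
    """Sum values by pairwise reduction in a logarithmic number of passes."""
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        # On a GPU, all additions in this pass would happen in parallel.
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
    return vals[0] if vals else 0.0


def squared_distance(vec, center):
    # One "thread" per component computes a squared difference...
    diffs = [(x - c) ** 2 for x, c in zip(vec, center)]
    # ...and the d partial results are combined by log summation.
    return tree_sum(diffs)


def nearest_center(vec, centers):
    """Traverse all centers and return the index of the closest one."""
    best_index, best_dist = -1, float("inf")
    for j, center in enumerate(centers):
        dist = squared_distance(vec, center)
        if dist < best_dist:
            best_index, best_dist = j, dist
    return best_index
```

With d components, the reduction needs about log2(d) passes instead of d sequential additions, which is where the intra-vector parallelism pays off for large d.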
13Membership on GPU
[Figure: an input vector i is compared against centers 1 through k to produce its label]
14Membership on GPU (Cont)
- Data objects stored in row major format 
- Provides coalesced access 
- Distance evaluation using shared memory. 
- Square root finding avoided 
15K-Means on GPU (Cont)
- Mean Evaluation Issues 
- Random reads and writes 
- Concurrent writes 
- Non uniform distribution of data objects per 
 label.
[Figure: threads reading and writing data; labels: Write, Read/Write, Threads, Data]
16Mean Evaluation on GPU
- Store labels and index in 64 bit records 
- Group data objects with same membership using 
 Splitsort operation.
- We split using labels as key 
- Gather primitive used to rearrange input in order 
 of labels.
- Sorted global index of input vectors is 
 generated.
Splitsort: [Suryakant & Narayanan, IIIT-H, TR 2009]
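The grouping step can be sketched on the CPU as below. This is an illustration only: Python's built-in `sorted` stands in for the Splitsort primitive, and the record layout (label in the upper 32 bits, index in the lower 32 bits) is an assumption, since the slides only state that labels and indices share a 64-bit record.

```python
def pack(label, index):
    """Pack a label and a vector index into one 64-bit record (assumed layout)."""
    return (label << 32) | index


def unpack(record):
    """Recover (label, index) from a packed record."""
    return record >> 32, record & 0xFFFFFFFF


def group_by_label(vectors, labels):
    """Sort records by label, then gather the vectors in that order."""
    records = sorted(pack(lab, i) for i, lab in enumerate(labels))
    order = [unpack(r)[1] for r in records]          # sorted global index
    sorted_labels = [unpack(r)[0] for r in records]
    gathered = [vectors[i] for i in order]           # gather primitive
    return gathered, sorted_labels, order
```

Because the label occupies the high bits, sorting the packed records groups vectors by label while keeping the original indices available for the gather.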
17Splitsort & Transpose Operation
18Mean Evaluation on GPU (cont)
- Row major storage of vectors enabled coalesced 
 access.
- Segmented scan followed by compact operation for 
 histogram count.
- Transpose operation before rearranging input 
 vectors.
- Using segmented scan again we evaluated mean of 
 rearranged vectors as per labels.
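A sequential sketch of this step is given below, assuming the vectors have already been rearranged in label order. The loop in `segmented_inclusive_scan` stands in for the parallel segmented scan, and the function names and the dictionary return type are choices made for this illustration.

```python
def segmented_inclusive_scan(values, head_flags):
    """Running sum that restarts wherever head_flags marks a new segment."""
    out = []
    running = 0
    for value, is_head in zip(values, head_flags):
        running = value if is_head else running + value
        out.append(running)
    return out


def means_by_segment(sorted_vectors, sorted_labels):
    """Per-label counts and means from label-sorted vectors."""
    n = len(sorted_labels)
    heads = [i == 0 or sorted_labels[i] != sorted_labels[i - 1] for i in range(n)]
    tails = [i == n - 1 or sorted_labels[i] != sorted_labels[i + 1] for i in range(n)]
    # Histogram count: scan a vector of ones within each segment.
    count_scan = segmented_inclusive_scan([1] * n, heads)
    # Transpose so that each dimension is scanned as one contiguous row.
    columns = list(zip(*sorted_vectors))
    sum_scans = [segmented_inclusive_scan(col, heads) for col in columns]
    means = {}
    for i in range(n):
        if tails[i]:  # compact: keep only the last entry of each segment
            count = count_scan[i]
            means[sorted_labels[i]] = [scan[i] / count for scan in sum_scans]
    return means
```

The last element of each segment holds that segment's total, so the compact step reads both the member count and the per-dimension sums from the same positions.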
19Implementation Details
- Tesla 
- 2 vectors per block , 2 centers at a time 
- Centers accessed via texture memory 
- Fermi 
- 2 vectors per block, 4 centers at a time 
- Centers accessed via global memory using L2 cache 
- More shared memory for distance evaluation 
- Occupancy of 83% achieved on both Fermi and
 Tesla.
20Limitations of a GPU device
- Highly compute- and memory-intensive
 algorithm.
- Overloading on GPU device 
- Limited Global and Shared memory on a GPU device. 
- Handling of large data vectors 
- Scalability of the algorithm 
21Multi GPU Approach
- Partition input data into chunks proportional to 
 number of cores.
- Broadcast k centers to all the nodes. 
- Perform membership and partial mean evaluation on
 each GPU, for the chunk sent to its node.
22Multi GPU Approach (cont)
- Nodes direct partial sums to Master node. 
- New means evaluated by Master node for next 
 iteration.
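One iteration of this scheme can be sketched on a single machine as follows. This is a hedged illustration: the real framework uses MPI across GPU nodes, while here each "node" is just a function call, and the names `split_proportional`, `partial_sums`, and `master_update` are hypothetical.

```python
def split_proportional(n, cores):
    """Split n items into contiguous chunks sized in proportion to core counts."""
    total = sum(cores)
    bounds, acc = [0], 0
    for c in cores:
        acc += c
        bounds.append(n * acc // total)
    return [(bounds[i], bounds[i + 1]) for i in range(len(cores))]


def partial_sums(chunk, centers):
    """Work done on one node: membership plus per-center sums and counts."""
    k, d = len(centers), len(centers[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        j = min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    return sums, counts


def master_update(partials, centers):
    """Master node: add up the partial sums and divide by the total counts."""
    k, d = len(centers), len(centers[0])
    new_centers = []
    for j in range(k):
        count = sum(cnt[j] for _, cnt in partials)
        if count == 0:
            new_centers.append(list(centers[j]))  # empty cluster: keep old center
        else:
            new_centers.append(
                [sum(s[j][t] for s, _ in partials) / count for t in range(d)])
    return new_centers
```

Only k sums and k counts travel from each node to the master per iteration, so the communication cost is independent of the number of input vectors n.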
[Figure: Nodes A, B, ..., Z send partial sums Sa, Sb, ..., Sz to the Master Node, which forms S = Sa + Sb + ... + Sz and produces the new centers]
23Results
- Generated Gaussian SIFT vectors.
- Variation in parameters n, d, k.
- Performance on CPU (1 GB RAM, 2.7 GHz), Tesla T10,
 GTX 480, and 8600, tested up to n_max = 4 million,
 k_max = 8000, d_max = 256.
- Multi-GPU (4xT10 + GTX 480) using MPI:
 n_max = 32 million, k_max = 8000, d_max = 256.
- Comparison with previous GPU implementations.
24Overall Results
N, K       CPU      Tesla T10   GTX 480   4xT10
10K, 80    1.3      0.119       0.18      0.097
50K, 800   71.3     2.73        1.73      0.891
125K, 2K   463.6    14.18       7.71      2.47
250K, 4K   1320     38.5        27.7      7.45
1M, 8K     28936    268.6       170.6     68.5
Times of K-Means on CPU and GPUs, in seconds, for
d = 128.
25Performance on GPUs
Performance of 8600 (32 cores), Tesla (240 cores), and
GTX 480 (480 cores) for d = 128 and k = 1,000.
26Performance vs n
Linear in n, with d = 128 and k = 4,000.
27Overall Performance
- Multi-GPU provided linear speedup.
- Speedup of up to 170x on GTX 480.
- 6 million vectors of 128 dimensions clustered in
 just 136 sec per iteration.
- Low-end GPUs provide a speedup of nearly
 10-20x.
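The headline speedup can be checked against the largest row of the Overall Results table (n = 1M, k = 8K, d = 128). The short script below only redoes that arithmetic; the variable names are chosen for this illustration.

```python
# Times in seconds for n = 1M, k = 8K, d = 128, copied from the Overall Results table.
cpu_time = 28936.0
gpu_times = {"Tesla T10": 268.6, "GTX 480": 170.6, "4xT10": 68.5}


def speedup(baseline, accelerated):
    """Ratio of the baseline time to the accelerated time."""
    return baseline / accelerated


speedups = {name: speedup(cpu_time, t) for name, t in gpu_times.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
```

The GTX 480 ratio rounds to the 170x quoted above.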
28Comparison
N           K     D     Li et al   Wu et al   Our K-Means
2 Million   400   8     1.23       4.53       1.27
4 Million   100   8     0.689      4.95       0.734
4 Million   400   8     2.26       9.03       2.4
51,200      32    64    0.403      -          0.191
51,200      32    128   0.475      -          0.262
Up to 2x speedup over the best previous
GPU implementation, on GTX 280.
29Multi GPU Results
N       Dim   1 Tesla   4xTesla   4xTesla + GTX 480
1 M     128   120.4     33.6      22.8
1.5 M   128   181.7     47.2      34.8
3 M     128   364.2     95.67     67.4
6 M     128   -         183.8     136.7
16 M    16    220.4     57.8      40.9
32 M    16    -         116       84.3
Scalable with the number of cores in a multi-GPU setup.
Results on Tesla and GTX 480, in seconds, for d = 128,
k = 4000.
30Time Division
Time on GTX 480 device. Mean evaluation reduced
to 6% of the total time for large inputs of
high-dimensional data.
 31Presentation Flow
- Scalable Clustering on Multiple GPUs 
- GPU assisted Personal Video Organizer
33Motivation
- Many and varied videos in everyone's collection,
 and growing every day.
- Sports, TV shows, movies, home events, etc.
- Categorizing them based on content is useful.
- No effective tools for video (or images).
- Existing efforts are very category specific.
- Can't require heavy training or large clusters of
 computers.
- Goal: personal categorization performed on
 personal machines.
- Training and testing on a personal scale.
34Challenges and Contributions
- Algorithmic: extend image classification to
 videos.
- Data: use a small number of personal videos
 spanning a wide range of categories.
- Computational: need to do it on laptops or
 personal workstations.
- Contributions: a video organization scheme with:
- Learning categories from user-labelled data
- Fast category assignment for the collection
- Exploiting the GPU for computation
- Good performance even on personal machines
35Related Work
- Image Categorization 
- ACDSee, Dbgallery, Flickr, Picasa, etc 
- Image Representation 
- SIFT [Lowe, IJCV04], GIST [Torralba, IJCV01], HOG
 [Dalal & Triggs, CVPR05], etc.
- Key Frame extraction
- Difference of Histograms [Gianluigi, SPIE05]
36Related Work (contd)
- Genre Classification 
- SVM [Ekenel et al., AIEMPro 2010]
- HMM [Haoran et al., ICICS 2003]
- GMM [Truong et al., ICPR 2000]
- Motion and color [Chen et al., JVCIR 2011]
- Spatio-temporal behavior [Rea et al., ICIP 2000]
- Involved extensive learning of categories for a
 specific type of video.
- Not suitable for personal collections that vary
 greatly.
37Video Classification Steps
- Category Determination 
- User tags videos separately for each class 
- Learning done using these videos 
- Cluster centers derived for each class 
- Category Assignment 
- Use the trained categories on remaining videos 
- Final assigning done based on scoring 
- Ambiguities resolved by user 
38Category Determination
- Segmentation, Thresholding
- Keyframe extraction, PHOG Features
- K-Means
39Work Division
- Less intensive steps processed on CPU. 
- Computationally expensive steps moved onto GPU. 
- Steps like key frame extraction, feature 
 extraction and clustering are time consuming.
40Key frame Extraction
- Segmentation 
- Compute color histogram for all the frames. 
- Divide video into shots using the score of 
 difference of histograms across consecutive
 frames.
- Thresholding 
- Shots having more than 60 frames selected. 
- Four equidistant frames chosen as key frames from 
 every shot.
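The segmentation and thresholding steps can be sketched as follows. Several details are assumptions made for this illustration: the histogram difference is taken as a sum of absolute bin differences, `cut_threshold` is a hypothetical parameter (the slides do not give the value used to declare a shot boundary), and "equidistant" is read as four evenly spaced interior frames.

```python
def histogram_difference(h1, h2):
    """Sum of absolute bin differences between two color histograms (assumed score)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def split_into_shots(histograms, cut_threshold):
    """Return (start, end) frame ranges, cutting where consecutive frames differ a lot."""
    if not histograms:
        return []
    shots, start = [], 0
    for i in range(1, len(histograms)):
        if histogram_difference(histograms[i - 1], histograms[i]) > cut_threshold:
            shots.append((start, i))
            start = i
    shots.append((start, len(histograms)))
    return shots


def key_frames(shots, min_frames=60, per_shot=4):
    """Pick equidistant key frames from every shot longer than min_frames."""
    frames = []
    for start, end in shots:
        length = end - start
        if length > min_frames:  # thresholding: skip short shots
            step = length / (per_shot + 1)
            frames.extend(start + round(step * (j + 1)) for j in range(per_shot))
    return frames
```

For a 100-frame shot starting at frame 0, this selects frames 20, 40, 60, and 80; a 30-frame shot contributes nothing.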
41PHOG
- Edge contours extracted using the Canny edge
 detector.
- Orientation gradients computed with a 3 x 3 Sobel
 mask without Gaussian smoothing.
- HOG descriptor discretized into K orientation
 bins.
- HOG vector is computed for each grid cell at each
 pyramid resolution level [Bosch et al., CIVR 2007].
42Final Representation
- Cluster the accumulated key frames separately for 
 every category.
- Grouping of similar frames into single cluster. 
- Meaningful representation of key frames for each 
 category is achieved.
- Reduced search space for the test videos.
43K-Means
- Partitions n data objects into k partitions 
- Clustering of extracted training key frames. 
- Separately for each of the categories. 
- Represent each category with meaningful cluster 
 centers.
- For instance grouping frames consisting of pitch, 
 goal post, etc.
- 30 clusters per category generated. 
44PHOG on GPU
- HoG computed using previous code [Prisacariu et
 al., 2009].
- Gradients evaluated using convolution kernels
 from the NVIDIA CUDA SDK.
- One thread per pixel; the thread block size is
 16 x 16.
- Each thread computes its own histogram.
- PHOG descriptors computed by applying HOG for 
 different scales and merging them.
- Downsample the image and send to HoG. 
45Category Assignment
- Segmentation, Thresholding, keyframes 
- Extract keyframes from untagged videos. 
- Compute PHOG for each keyframe 
- Classify each keyframe independently 
- K-Nearest Neighbor classifier 
- Allot each keyframe to the nearest k clusters 
- Final scoring for category assignment
46K-Nearest Neighbor
- Classification done based on closest training 
 samples.
- K nearest centers evaluated for each frame. 
- Euclidean distance used as distance metric. 
47KNN on GPU
- Each block handles L new key frames at a time and
 loops over all key frames.
- Find distances for each key frame against all
 centers sequentially.
- Handle each dimension in parallel using one thread.
- Find the vector distance using a log summation.
- Write back to global memory.
- Sort by distance, used as the key, for each key frame.
- Keep the top k values.
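A sequential sketch of the nearest-center search for one key frame is shown below. The function name is hypothetical, and Python's built-in sort stands in for the GPU sort; on the GPU, the per-dimension work inside the distance would be spread across threads.

```python
import math


def k_nearest_centers(frame, centers, k):
    """Return the k closest centers as (distance, center_index) pairs, nearest first."""
    distances = []
    for index, center in enumerate(centers):
        # Euclidean distance between the key frame descriptor and one center.
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(frame, center)))
        distances.append((dist, index))
    distances.sort()       # distance is the sort key
    return distances[:k]   # keep the top k values
```

For instance, `k_nearest_centers([0, 0], [[3, 4], [0, 1], [6, 8]], 2)` returns the centers at indices 1 and 0, in that order.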
48Scoring
- Use the distance ratio r = d1 / d2 of the distances
 d1 and d2 to the two nearest neighbors.
- If r < threshold, allot a single membership to
 the keyframe. Threshold used: 0.6.
- Assign multiple memberships otherwise. We assign
 to the top c/2 categories.
- Final category
- Count the votes for each category for the video.
- If the top category is a clear winner, assign to
 it (a score 20% higher than the next).
- Seek manual assignment otherwise.
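The scoring rule can be sketched as below, under stated assumptions: c is read as the number of categories, each key frame comes with a distance-sorted list of (distance, category) neighbors, and "20 more score" is read as a vote count at least 20 percent above the runner-up. The function names and the `None` return for "ask the user" are choices made for this illustration.

```python
from collections import Counter


def frame_memberships(neighbors, num_categories, ratio_threshold=0.6):
    """neighbors: (distance, category) pairs sorted by distance; needs at least two."""
    (d1, c1), (d2, _) = neighbors[0], neighbors[1]
    if d2 > 0 and d1 / d2 < ratio_threshold:
        return [c1]  # confident match: single membership
    # Ambiguous: collect distinct categories in order of proximity.
    ranked = []
    for _, category in neighbors:
        if category not in ranked:
            ranked.append(category)
    return ranked[: max(1, num_categories // 2)]  # top c/2 categories


def assign_video(per_frame_neighbors, num_categories, margin=0.20):
    """Vote over all key frames; return a category, or None for manual assignment."""
    votes = Counter()
    for neighbors in per_frame_neighbors:
        votes.update(frame_memberships(neighbors, num_categories))
    ranked = votes.most_common()
    if not ranked:
        return None
    if len(ranked) == 1 or ranked[0][1] >= (1 + margin) * ranked[1][1]:
        return ranked[0][0]  # clear winner
    return None              # ambiguous: seek manual assignment
```

Returning `None` keeps the user in the loop for close calls, matching the manual-assignment fallback in the last bullet.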
49Results
- Selected four popular Sport categories 
- Cricket, Football, Tennis, Table Tennis 
- Collected a dataset of about 100 videos of 10 to
 15 minutes each.
- The user tags 3 videos per category. 
- Rest of the videos used for testing. 
- 4 frames considered to represent a shot. 
- Roughly 200 key frames per category. 
50Keyframes (Football) 
 51Keyframes (Cricket) 
 52Category Labeling
Final key frames for tagged multiple Cricket 
videos 
Final key frames for tagged multiple Football 
videos 
Clubbing of key frames from various tagged videos 
for each category. 
 53Category Labeling
Final key frames for tagged multiple Tennis 
videos 
Final key frames for tagged multiple Table Tennis 
videos 
 54Frame classification per category
- Variation of K nearest neighbors 
- Evaluated using 12 tagged videos, 3 per category. 
- Reduction in error percentage for certain
 categories using 3-NN vs. just NN.
- 64% to 73% for cricket
- 58% to 66% for football
- Achieved overall accuracy of nearly 96%.
55Category Determination
GPU Device   No. of Videos   Keyframes   Segmentation (sec)   PHOG Features (sec)   K-Means (sec)
8600         4               756         182.7                139.6                 3.94
8600         12              2432        584.3                468.4                 14.3
280          4               756         24.8                 19.2                  0.59
280          12              2432        76.9                 61.8                  1.97
580          4               756         11.8                 9.1                   0.26
580          12              2432        37.91                30.2                  0.89
About 80 sec per video on the 8600; about 5 sec per video on the GTX 580.
Time taken to process the Category Labeling phase
on NVIDIA 8600, GTX 280, and GTX 580 cards.
56Category Assignment
- Videos of total duration 1375 minutes are
 processed in less than 10 minutes.
- Time share for K-NN in seconds
GPU Device   No. of Videos   Keyframes   K-NN (sec)
8600         88              16946       40.33
280          88              16946       5.39
580          88              16946       2.46
Processing time per 10-15 minute video: 5 sec on
GTX 580, 80 sec on an 8600.
57Conclusions
- Complete GPU-based implementation.
- Achieved a speedup of up to 170x on a single NVIDIA
 Fermi GPU.
- High performance for large d due to processing
 each vector in parallel.
- Scalable in problem size (n, d, k) and number of
 cores.
- Use of operations like Splitsort and Transpose for
 coalesced memory access.
- Large datasets clustered using the multi-GPU
 framework.
58Conclusions (contd)
- Achieved accuracy up to 96%.
- Involving user for ambiguous videos reduced 
 misclassification rate.
- Exploited the computational power of GPU for 
 vision algorithms.
- Effective training with variations in a single 
 category.
- Could be extended to other class of sport 
 categories as well as other genres of video.
- More sophisticated classification algorithms can 
 help accuracy.
59Future Work
- With evolving GPU architecture the approach may 
 be altered to enhance the performance.
- Improve Multi GPU framework by message passing. 
- Target applications in computer vision which use 
 extensive amount of clustering.
- Explore for more categories of video and 
 effective training.
60Thank You