Title: Collective Collaborative Tagging System
1Collective Collaborative Tagging System
- Jong Y. Choi, Joshua Rosen, Siddharth Maini, Marlon E. Pierce, and Geoffrey C. Fox
- Community Grids Laboratory
- Indiana University
2People-Powered Knowledge
Bookmark
Tags
Social Networks
People-generated
3People-Powered Knowledge
- Collaborative Tagging
- Online bookmarking with annotations
- Create social networks
- Utilize the power of people's knowledge
- Pros and cons
- High-quality classifier by using human intelligence
- But lack of control or authority
4Motivations
- Distributed and fragmented knowledge
- → Need a unified data set
- → More accurate and richer information
- No flexibility in choosing different information retrieval (IR) algorithms
- → Need a playground to experiment with various IR techniques
- → Help to discover hidden knowledge
5Proposed System
Collective Collaborative Tagging (CCT) System
(Diagram: CCT System architecture, with these labeled parts)
- Distributed Tagging Data
- Data Importer (RDF, RSS, Atom, HTML)
- Data Coordinator
- Populate bookmarks/tags
- Repository
- Query with various options
- User Service
- Search result (SOAP, REST, ...)
6Development Plan and Progress
- 1st - Service and algorithm development
- Identify services and algorithms
- 2nd - Interface development
- Web 2.0 style interface
- REST, SOAP, ...
- 3rd - Export/import service development
- Merging distributed data sets
- Export data to build mash-up sites
- So far, we are mainly in the 1st stage, with some experiments in the 2nd stage
7Prototype
Different Data Sources
Various IR algorithms
Flexible Options
Result Comparison
8Service Types and Algorithms
Type I - Searching
- Description: Given input tags, return the most relevant X (X = URLs, tags, or users)
- Algorithm: Latent Semantic Indexing (LSI), FolkRank
Type II - Recommendation
- Description: Given indirect input tags, return undiscovered X
Type III - Clustering
- Description: Community discovery; find a group or a community with similar interests
- Algorithm: K-Means, Deterministic Annealing Clustering
Type IV - Trend detection
- Description: Analyze tagging activities as time series and detect abnormalities
- Algorithm: Time Series Analysis
9Data Models (I)
- Vector-space model (bag-of-words model)
- Assume n URLs and q tags
- A URL can be represented by a q-dimensional vector, d_i = (t_1, t_2, ..., t_q)
- A total data set can be represented by an n-by-q matrix
- Pairwise Dissimilarity Matrix
- n-by-n symmetric matrix
- Distance (Euclidean, Manhattan, ...)
- Angles, cosine, sine, ...
- O(n^2) complexity
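As a minimal sketch of the vector-space model and the pairwise dissimilarity matrix described on this slide: the bookmark data, the function names (`build_matrix`, `cosine_dissimilarity`), and the choice of raw tag counts with a cosine-based dissimilarity are all assumptions made for illustration, not details of the CCT system.

```python
import numpy as np

# Hypothetical bookmarks: each URL maps to the tags users gave it.
bookmarks = {
    "url_a": ["python", "numpy", "python"],
    "url_b": ["python", "web"],
    "url_c": ["cooking", "recipes"],
}

def build_matrix(bookmarks):
    """Return (urls, tags, X) where X[i, j] counts tag j on URL i."""
    urls = sorted(bookmarks)
    tags = sorted({t for ts in bookmarks.values() for t in ts})
    col = {t: j for j, t in enumerate(tags)}
    X = np.zeros((len(urls), len(tags)))  # n-by-q matrix
    for i, u in enumerate(urls):
        for t in bookmarks[u]:
            X[i, col[t]] += 1.0
    return urls, tags, X

def cosine_dissimilarity(X):
    """Return the n-by-n symmetric matrix of 1 - cos(angle) between rows."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid dividing by zero for untagged URLs
    U = X / norms
    return 1.0 - U @ U.T

urls, tags, X = build_matrix(bookmarks)
D = cosine_dissimilarity(X)
```

Filling `D` compares every pair of URLs, which is where the O(n^2) cost comes from.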
10Data Models (II)
- Graph model
- Building a graph with nodes and edges
- Edges indicate relationships
- Becoming complex networks (tag graph)
- Dissimilarity
- Related to path distance
- Finding paths is important (shortest path problem)
- Naive approach: O(n^3) complexity
(Source: MSI-CIEC)
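One way to picture the O(n^3) naive approach is the Floyd-Warshall all-pairs shortest-path algorithm, sketched below. The slide does not say which algorithm was used, so this is an assumption; the tag graph, its edge weights, and the function name are hypothetical.

```python
import math

def all_pairs_shortest_paths(nodes, edges):
    """Floyd-Warshall on an undirected weighted graph; O(n^3) time."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for u, v, w in edges:
        i, j = index[u], index[v]
        dist[i][j] = min(dist[i][j], w)
        dist[j][i] = min(dist[j][i], w)
    # Allow each node k in turn as an intermediate stop on a path.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical tag graph: an edge links two tags that co-occur on a URL.
nodes = ["python", "numpy", "web", "cooking"]
edges = [("python", "numpy", 1.0), ("python", "web", 1.0), ("numpy", "web", 3.0)]
dist = all_pairs_shortest_paths(nodes, edges)
```

The resulting `dist` table can serve as a path-based dissimilarity; unreachable pairs stay at infinity.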
11Searching
- Latent Semantic Indexing
- Using vector-space model, find the URLs most similar to the user's query tags
- Dimension reduction from high q to low d (q >> d)
- Removing noisy terms, extracting latent concepts
(Figure: precision vs. recall plot with an ideal line; legend: 2 terms, 4 terms, 8 terms, 20 dim. reduction, None)
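A minimal sketch of LSI-style searching, assuming a truncated SVD of the n-by-q URL-tag matrix and cosine similarity in the reduced space. The toy matrix, the function names (`lsi_fit`, `lsi_rank`), and the particular way the query is projected are illustrative assumptions; other LSI variants scale the query differently.

```python
import numpy as np

def lsi_fit(X, d):
    """Truncated SVD of the n-by-q matrix X, keeping d latent concepts."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # URL coordinates in concept space (n x d) and concept basis (d x q).
    return U[:, :d] * s[:d], Vt[:d]

def lsi_rank(doc_coords, Vt_d, query):
    """Rank URLs by cosine similarity to the query in concept space."""
    qv = Vt_d @ query  # project the q-dim query into d dims
    num = doc_coords @ qv
    den = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(qv)
    sims = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.argsort(-sims), sims

# Hypothetical data; columns: python, numpy, scipy, cooking, recipes.
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],  # never tagged "python"
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
], dtype=float)
doc_coords, Vt_d = lsi_fit(X, d=2)
query = np.array([1, 0, 0, 0, 0], dtype=float)  # query tag: "python"
order, sims = lsi_rank(doc_coords, Vt_d, query)
```

In this toy case the third URL scores high for the query "python" even though it never carries that tag, because it shares a latent concept with the URLs that do.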
12Clustering
- Discover the group structures of URLs
- Non-parametric learning algorithm
- Non-trivial optimization problem
- Should avoid local minima/maxima solutions
13Deterministic Annealing Clustering
- Deterministically avoid local minima
- Tracing the global solution by changing the level of energy
- Analogy to physical annealing process (High → Low)
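The sketch below is a simplified, fixed-K clustering loop in the spirit of deterministic annealing: soft assignments at a temperature T, center updates as weighted means, and gradual cooling. It is not the authors' implementation. The starting temperature, cooling rate, small random nudge, and the function name `da_cluster` are all assumptions chosen for illustration.

```python
import numpy as np

def da_cluster(X, k, t_min=1e-3, cooling=0.9, inner=30, seed=0):
    """Soft clustering while lowering the temperature T from high to low."""
    rng = np.random.default_rng(seed)
    # All centers start at the data mean; at high T this is the only solution.
    centers = np.tile(X.mean(axis=0), (k, 1))
    T = 4.0 * X.var(axis=0).sum()  # assumed start, above the first split
    while T > t_min:
        # Tiny nudge so coincident centers can separate once T is low enough.
        centers += 1e-4 * rng.standard_normal(centers.shape)
        for _ in range(inner):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            logits = -d2 / T
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)  # soft assignment p(k | x)
            mass = P.sum(axis=0)
            new = P.T @ X
            ok = mass > 1e-12  # leave a center in place if it has no mass
            centers[ok] = new[ok] / mass[ok, None]
        T *= cooling
    labels = d2.argmin(axis=1)
    return centers, labels

# Hypothetical data: two well-separated groups of URLs in 2 dimensions.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
centers, labels = da_cluster(X, k=2)
```

At high T every point belongs almost equally to every center, so the centers coincide; as T falls they separate and the assignments harden, which is the "High → Low" analogy on the slide.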
14More Machine Learning Algorithms
- Classification
- To respond more quickly to users' requests
- Training on data based on users' input and answering questions based on the training results
- Artificial Neural Network, Support Vector Machine, ...
- Trend Detection
- Can be used for prediction/forecasting
- Time-series analysis of tagging activities
- Markov chain model, Fourier transform, ...
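As a small illustration of the Fourier-transform idea for tagging activity, the sketch below finds the dominant period in a series of daily tag counts. The synthetic data and the function name `dominant_period` are assumptions; a real trend detector would need more than this, for example handling noise and flagging abnormal points.

```python
import numpy as np

def dominant_period(counts):
    """Return the strongest period (in samples) of a tag-count series."""
    x = np.asarray(counts, dtype=float)
    x = x - x.mean()  # drop the constant (zero-frequency) part
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    k = 1 + int(np.argmax(power[1:]))  # skip index 0 (zero frequency)
    return 1.0 / freqs[k]

# Synthetic example: 8 weeks of daily counts with a weekly rhythm.
days = np.arange(56)
daily_counts = 20 + 8 * np.sin(2 * np.pi * days / 7)
period = dominant_period(daily_counts)
```

For this synthetic series the recovered period is about 7 days.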
15Conclusion
- The goal of our Collective Collaborative Tagging (CCT) system
- Utilize various data sets
- Provide various information retrieval (IR) algorithms
- Help to utilize people-powered knowledge
- Currently various models and algorithms are being investigated
- Service interfaces and import/export function will be added soon
16Thank you!! Questions?
jychoi_at_cs.indiana.edu
17Vector-space Vs. Graph
Representation
- Vector-space model: q-dimensional vector; q-by-n matrix
- Graph model: G(V, E); V: URL, tags, users
Dissimilarity
- Vector-space model: distances, cosine, ...; O(N^2) complexity
- Graph model: paths, hops, connectivity, ...; O(N^3) complexity
Algorithm
- Vector-space model: Latent Semantic Indexing; dimension reduction schemes; PCA
- Graph model: PageRank, FolkRank, ...; pairwise clustering; MDS
18Pairwise Dissimilarity
- Pairwise clustering
- Input from vector-based model vs. graph model
- How to avoid local minima/maxima? (e.g., K-Means)
Vector-space model
Graph model