Title: EVENT IDENTIFICATION IN SOCIAL MEDIA
1 EVENT IDENTIFICATION IN SOCIAL MEDIA
- Hila Becker, Luis Gravano (Columbia University)
- Mor Naaman (Rutgers University)
2 Social Media Sites Host Many Event Documents
- Event: something that occurs at a certain time in a certain place [Yang et al. '99]
- Popular, widely known events: Presidential Inauguration, Thanksgiving Day Parade
- Smaller events, without traditional news coverage: local food drive, street fair
- Photo-sharing: Flickr
- Video-sharing: YouTube
- Social networking: Facebook
- Figure: social media documents for the All Points West festival, Liberty State Park, New Jersey, 8/8/08
3 Identifying Events and Associated Social Media Documents
- Applications
  - Event search and browsing
  - Local search
- General approach: group similar documents via clustering. Each cluster corresponds to one event and its associated social media documents
4 Event Identification Challenges
- Uneven data quality
  - Missing, short, uninformative text
  - ... but revealing structured context available: tags, date/time, geo-coordinates
- Scalability
- Dynamic data stream of event information
- Unknown number of events
  - Necessary for many clustering algorithms
  - Difficult to estimate
5 Clustering Social Media Documents
- Social media document representation
- Social media document similarity
- Social media document clustering
  - Clustering task definition
  - Ensemble algorithm: combining multiple clustering results
- Preliminary evaluation
6 Social Media Document Representation
- Title
- Description
- Tags
- Date/Time
- Location
- All-Text
7 Social Media Document Similarity
- Text: tf-idf weights, cosine similarity
- Time: proximity in minutes
- Location: geo-coordinate proximity
- Figure: two documents, A and B, compared feature by feature (Title, Description, Tags, Date/Time-Keywords, Date/Time-Proximity, Location-Keywords, Location-Proximity, All-Text)
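The three similarity types listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the smoothed idf formula, the one-year time window, and the 100 km distance cap are all assumptions made for this example, and every function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse tf-idf vectors (dicts) for a small list of texts."""
    tokenized = [t.lower().split() for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (an assumption): terms found in every document keep a positive weight.
        vectors.append({
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def time_similarity(minutes_a, minutes_b, window_minutes=365 * 24 * 60):
    """1.0 for identical timestamps, falling linearly to 0.0 at window_minutes apart.

    The one-year window is an assumed default, not a value from the slides.
    """
    gap = abs(minutes_a - minutes_b)
    return max(0.0, 1.0 - gap / window_minutes)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def location_similarity(coord_a, coord_b, max_km=100.0):
    """1.0 at zero distance, falling linearly to 0.0 at max_km (assumed cap)."""
    return max(0.0, 1.0 - haversine_km(*coord_a, *coord_b) / max_km)
```

Each function returns a score in [0, 1], so scores for different features can be compared or combined on a common scale.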
8 Social Media Document Clustering Framework
- Figure: social media documents, document feature representation, event clusters
9 Clustering Ensemble Algorithm
- Clusterers: Ctitle, Ctags, Ctime
- Weights: Wtitle, Wtags, Wtime (learned in a training step)
- Consensus function f(C,W): combine ensemble similarities
- Output: ensemble clustering solution
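One way to read f(C,W) is as a weighted average of each clusterer's binary "same cluster" vote for a pair of documents, which is how the experimental setup describes the consensus function. The sketch below assumes that reading; the function names, the dictionary layout, and the weight values are hypothetical.

```python
def same_cluster_votes(assignments, doc_a, doc_b):
    """Return each clusterer's binary vote: 1 if it put both documents in one cluster.

    assignments maps a clusterer name to a {document id: cluster id} dict.
    """
    return {
        name: int(labels[doc_a] == labels[doc_b])
        for name, labels in assignments.items()
    }


def consensus_score(votes, weights):
    """Weighted average of binary votes, in [0, 1] for non-negative weights."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(weights[name] * votes[name] for name in weights) / total


# Hypothetical cluster assignments from three clusterers over three photos.
assignments = {
    "title": {"p1": 0, "p2": 1, "p3": 1},
    "tags": {"p1": 0, "p2": 0, "p3": 1},
    "time": {"p1": 0, "p2": 0, "p3": 1},
}
# Hypothetical weights; in the talk they are learned in a training step.
weights = {"title": 0.2, "tags": 0.5, "time": 0.3}

score_p1_p2 = consensus_score(same_cluster_votes(assignments, "p1", "p2"), weights)
```

Here the tags and time clusterers agree that p1 and p2 belong together while the title clusterer disagrees, so the score is the combined weight of the two agreeing clusterers divided by the total weight.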
10 Clustering: Measuring Quality
- Metric: Normalized Mutual Information (NMI), the shared information between the clustering solution and the ground truth
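A minimal sketch of NMI over two label lists follows. The slides do not state which normalization is used; this version assumes the common choice of dividing the mutual information by the arithmetic mean of the two entropies, and the function name is hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(predicted, truth):
    """NMI between two labelings, normalized by the mean of the two entropies."""
    n = len(predicted)
    pred_counts = Counter(predicted)
    truth_counts = Counter(truth)
    joint_counts = Counter(zip(predicted, truth))

    # Mutual information: sum over (cluster, event) pairs that co-occur.
    mutual_info = 0.0
    for (c, e), n_ce in joint_counts.items():
        mutual_info += (n_ce / n) * math.log(n * n_ce / (pred_counts[c] * truth_counts[e]))

    def entropy(counts):
        return -sum((k / n) * math.log(k / n) for k in counts.values())

    h_pred, h_truth = entropy(pred_counts), entropy(truth_counts)
    if h_pred == 0.0 and h_truth == 0.0:
        # Both labelings put everything in one group, so they agree trivially.
        return 1.0
    return 2.0 * mutual_info / (h_pred + h_truth)
```

The score does not depend on the cluster ids themselves: a clustering that matches the ground truth up to renaming scores 1, and one that is statistically independent of it scores 0.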
11 Experimental Setup
- Data: >270K Flickr photos
  - Event labels from Yahoo!'s Upcoming event database
  - Split into 3 parts for training/validation/testing
- Clusterers: single-pass algorithm with centroid similarity
- Weighting scheme: Normalized Mutual Information (NMI) scores on validation set
- Consensus function: weighted average of clusterers' binary predictions
- Final prediction step: single-pass clustering algorithm
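The single-pass algorithm with centroid similarity can be sketched as below. This is an illustrative version under several assumptions: documents are sparse vectors stored as dicts, similarity is cosine, the centroid is the running mean of a cluster's members, and the threshold value is chosen by the caller. None of these details are confirmed by the slides.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def single_pass_cluster(vectors, threshold):
    """Assign each vector, in arrival order, to the most similar centroid.

    A new cluster is opened when no centroid reaches the threshold, so the
    number of clusters does not have to be known in advance.
    """
    centroids, sizes, labels = [], [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < threshold:
            centroids.append(dict(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
            continue
        # Update the winning centroid as the running mean of its members.
        centroid, size = centroids[best], sizes[best]
        for term in set(centroid) | set(vec):
            centroid[term] = (centroid.get(term, 0.0) * size + vec.get(term, 0.0)) / (size + 1)
        sizes[best] = size + 1
        labels.append(best)
    return labels


# Hypothetical tag vectors for four photos from two events.
photos = [
    {"festival": 1.0, "music": 1.0},
    {"festival": 1.0, "music": 0.8},
    {"parade": 1.0},
    {"parade": 1.0, "float": 0.5},
]
labels = single_pass_cluster(photos, threshold=0.5)
```

Each document is visited once and compared only against the current centroids, which suits a growing stream of documents and avoids fixing the number of events up front.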
12 Preliminary Evaluation Results
- Individual clusterer performance
  - Highest NMI: Tags, All-Text
  - Lowest NMI: Description, Title
- Ensemble performance, compared against all individual clusterers
  - Highest overall performance in terms of NMI
  - More homogeneous clusters; each event is spread over fewer clusters
- Details in paper
13 Future Work: Alternative Choices
- Document similarity metric
- Ensemble approach
  - Weight assignment
  - Choice of clusterers
- Train a classifier to predict document similarity
  - Features correspond to similarity scores
  - All-text, title, tags, time, location, etc.
  - Numeric values in [0,1]
  - State-of-the-art classifiers: SVM, Logistic Regression, ...
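The classifier idea can be illustrated with a small logistic regression trained by batch gradient descent, where each training row holds the similarity scores for one pair of documents. This is only a sketch of the proposed direction: the training pairs, the choice of three features, the learning rate, and the epoch count are all invented for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, epochs=2000, learning_rate=0.5):
    """Fit weights and a bias with batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for i, v in enumerate(x):
                grad_w[i] += error * v
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def same_event_probability(weights, bias, x):
    """Predicted probability that a document pair belongs to the same event."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Hypothetical training pairs: [all-text, tags, time] similarity scores in [0, 1],
# labeled 1 if both documents belong to the same event and 0 otherwise.
train_x = [[0.9, 0.8, 0.95], [0.8, 0.9, 0.7], [0.1, 0.2, 0.05], [0.3, 0.1, 0.2]]
train_y = [1, 1, 0, 0]
weights, bias = train_logistic_regression(train_x, train_y)
```

The predicted probability could then stand in for a hand-built similarity score in the clustering step, so that the relative importance of each feature is learned from labeled pairs.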
14 Future Work: Alternative Choices
- Final clustering step
  - Apply graph partitioning algorithms
  - Requires estimating the number of clusters
- Evaluation metrics beyond NMI
- Datasets
  - Flickr, LastFM, YouTube
- Exploit social network connections
15 Conclusions
- Identified events and their corresponding social media documents
- Proposed a clustering solution
  - Leveraged different representations of social media documents
  - Employed various social media similarity metrics
  - Developed a weighted ensemble clustering approach
- Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs