Title: Extracting Relevant and Trustworthy Information from Microblogs
1 Extracting Relevant and Trustworthy Information from Microblogs
- Joint work with Bimal Viswanath, Farshad Kooti, Saptarshi Ghosh, Naveen Sharma, Niloy Ganguly, and Fabricio Benevenuto
- MPI-SWS, Germany; IIT Kharagpur, India; UFOP, Brazil
2 My research: The big picture
- Three fundamental trends and challenges in the social Web:
- 1. User-generated content sharing
- Can we protect the privacy of users sharing personal data?
- 2. Word-of-mouth based content exchange
- Can we understand and leverage word-of-mouth better?
- 3. Crowd-sourcing content rating and ranking
- Can we find trustworthy and relevant content sources?
3 Twitter microblogging site
- An important source for real-time Web content
- 500 million active users posting 400 million tweets daily
- Quality of tweets / content varies widely
- Anyone can post tweets
- Celebrities, politicians, news media, academics, spammers
- Challenge: Finding relevant and trustworthy content
- Trustworthy: Thwart spammers and their spam
- Relevant: Identify authoritative experts on specific topics
4 Thwarting Spammers in Twitter (WWW 2012)
Part 1
5 Background: How spammers operate
- Twitter spammers try to gain lots of followers
- To promote spam directly
- To gain influence in the network
- Search engines rank tweets based on how influential the user is
- Most metrics depend on the user's network connectivity
- More followers help a user to gain influence
This incentivizes spammers to acquire links to gain influence
6 Acquiring followers via link farming
- Unrelated users exchange links with each other
- To gain more influence based on network connectivity
[Figure: link exchange among users Alice, Bob, Charlie, and David]
Influence based on connectivity is improved
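To make the effect concrete, here is a minimal sketch (not from the talk) of how mutual following can raise connectivity-based influence. It assumes a plain power-iteration Pagerank with a damping factor of 0.85 and a tiny hypothetical follow graph; the accounts X, Y, Z and every edge are invented for illustration.

```python
# Illustrative sketch only: a hypothetical follow graph, not data from the talk.
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration Pagerank over (follower, followee) edges."""
    nodes = sorted({user for edge in edges for user in edge})
    following = {n: [] for n in nodes}
    for follower, followee in edges:
        following[follower].append(followee)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A user who follows nobody spreads their score over everyone.
            targets = following[n] or nodes
            share = damping * score[n] / len(targets)
            for t in targets:
                nxt[t] += share
        score = nxt
    return score

farmers = ["Alice", "Bob", "Charlie", "David"]
# Before farming: the four users follow others but nobody follows them.
base_edges = [("X", "Y"), ("Y", "Z"), ("Z", "X"),
              ("Alice", "X"), ("Bob", "Y"), ("Charlie", "Z"), ("David", "X")]
# Link farming: the four unrelated users all follow each other.
farm_edges = [(a, b) for a in farmers for b in farmers if a != b]

before = pagerank(base_edges)
after = pagerank(base_edges + farm_edges)
for name in farmers:
    print(name, round(before[name], 4), "->", round(after[name], 4))
```

In this toy graph each of the four users starts with only the baseline score, and the exchanged links raise all four scores even though nobody outside the group follows them.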
7 To thwart spammers
- We need to:
- 1. Understand link farming activity in Twitter
- 2. Combat link farming activity in Twitter
- Prior work: Focused on detecting spammers
- Via their characteristics, e.g., follower-to-following ratios
- Rat race between spammers and spam fighters
- We focus on the spammer support network
8 Identifying spammers
- Used the Twitter network gathered in a previous study (ICWSM 2010)
- Data collected in August 2009
- 54M nodes, 1.9B links, 1.7B tweets
- Identified accounts suspended by Twitter
- An account could be suspended for various reasons
- Found suspended users that posted blacklisted URLs
- Includes 41,352 such spammers
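The selection rule above (suspended by Twitter and posted a blacklisted URL) can be sketched as a simple filter. This is only an illustration: the function name, the data layout, and the example hosts are hypothetical, and matching on the URL host is an assumption about how a blacklist lookup might work.

```python
from urllib.parse import urlparse

def find_spammers(suspended_users, urls_by_user, blacklisted_hosts):
    """Return suspended users who posted at least one blacklisted URL."""
    spammers = set()
    for user in suspended_users:
        hosts = {urlparse(url).hostname for url in urls_by_user.get(user, [])}
        if hosts & blacklisted_hosts:
            spammers.add(user)
    return spammers

# Hypothetical toy data.
suspended = {"u1", "u2"}
urls = {
    "u1": ["http://spam.example/win"],
    "u2": ["http://news.example/story"],   # suspended for some other reason
    "u3": ["http://spam.example/win"],     # posted spam but was not suspended
}
spammers = find_spammers(suspended, urls, {"spam.example"})
```

Requiring both conditions keeps accounts suspended for unrelated reasons out of the spammer set.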
9 Spammers farm links at large scale
- Spam-targets: 27% of all users are followed by at least one of the 40,000 spammers!
- Spam-followers: 82% of all followers have been targeted
- Spammers have more followers than random users
- Avg. follower count: Spammers 234, Random users 36
10 Who responds to links from spammers?
- A small number of followers respond most of the time
We call these users link farmers
Top 100k users account for 60% of all links to spammers
Top 100k followers exhibit high reciprocation of 0.8 on avg.
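A reciprocation value such as 0.8 can be computed per user from the follow graph. The sketch below assumes reciprocation means the fraction of a user's followers that the user follows back; the talk does not spell out the definition, and the edges are hypothetical.

```python
def reciprocation(user, edges):
    """Fraction of the accounts following `user` that `user` follows back."""
    followers = {a for a, b in edges if b == user}
    followings = {b for a, b in edges if a == user}
    if not followers:
        return 0.0
    return len(followers & followings) / len(followers)

# Hypothetical edges as (follower, followee) pairs.
edges = [(f"s{i}", "farmer") for i in range(5)]     # five accounts follow "farmer"
edges += [("farmer", f"s{i}") for i in range(4)]    # "farmer" follows four of them back
rate = reciprocation("farmer", edges)
print(rate)
```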
11 Are link farmers real users or spammers?
- To find out if they are spammers or real users, we:
- 1. Checked if they were suspended by Twitter
- 76% of users not suspended, 235 of them verified by Twitter
- 2. Manually verified 100 random users
- 86% of users are real, with legitimate links in their tweets
- 3. Analyzed their profiles
- More active in updating their profiles than random users
12 Are link farmers lay or popular users?
- Conventional wisdom:
- Lay users are more likely to follow back due to social etiquette
- Popular users might be more conservative in following others
Finding: the probability of following back increases with user popularity
Link farmers are popular users with lots of followers
13 Are link farmers lay or popular users?
- Top 5 link farmers according to Pagerank:
- 1. Barack Obama: Obama 2012 campaign staff
- 2. Britney Spears
- 3. NPR Politics: Political coverage and conversation
- 4. UK Prime Minister: PM's office
- 5. JetBlue Airways
Link farmers include legitimate, popular users and organizations
14 What possibly motivates link farmers?
- One explanation:
- Link farmers have similar incentives as spammers
- They seek to amass social capital and influence in the network
- Link farmers rank among the top 5% of influential Twitter users
- In terms of various metrics like Pagerank and Followerrank
15 Combating link farming
- Key challenge:
- Real, popular and active users are involved in link farming
- Detecting and suspending spammers alone will not help
- Insight:
- Discourage users from following others carelessly
- Penalize users following anyone found to be bad
- Lower the influence scores of users following spammers
This incentivizes users to be more careful about who they link to
16 Collusionrank
- Borrows ideas from spam defense strategies for the Web (WWW 2005)
- A low Collusionrank score for a user indicates:
- Heavy linking to spammers or spam-followers
- Requires a seed set of known spammers
- The Twitter operator periodically identifies and updates spammers
17 Collusionrank
Algorithm:
1. Negatively bias the initial scores of the set of known spammers
2. In Pagerank style, iteratively penalize users who follow spammers or those who follow spam-followers
Collusionrank is based on the scores of a user's followings, because a user is penalized based on whom they follow
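The two steps can be sketched as a biased, Pagerank-style iteration. This is a minimal sketch under stated assumptions, not the authors' exact algorithm: the decay factor `alpha`, the fixed iteration count, and the choice to split a followee's score evenly among its followers are all assumptions, and the graph is hypothetical.

```python
def collusionrank(edges, seed_spammers, alpha=0.85, iterations=50):
    """Sketch: negative scores flow from followees back to their followers.

    edges are (follower, followee) pairs; scores are <= 0, lower is worse.
    """
    users = sorted({u for edge in edges for u in edge} | set(seed_spammers))
    followings = {u: [] for u in users}
    follower_count = {u: 0 for u in users}
    for follower, followee in edges:
        followings[follower].append(followee)
        follower_count[followee] += 1

    # Step 1: negatively bias the known spammers; everyone else starts at 0.
    bias = {u: (-1.0 / len(seed_spammers) if u in seed_spammers else 0.0)
            for u in users}
    score = dict(bias)

    # Step 2: a user inherits a share of the score of everyone they follow.
    for _ in range(iterations):
        score = {
            u: alpha * sum(score[v] / follower_count[v] for v in followings[u])
               + (1.0 - alpha) * bias[u]
            for u in users
        }
    return score

# Hypothetical graph: "farmer" follows a spammer, "fan" follows the farmer,
# while "careful" and "friend" only follow each other.
edges = [("farmer", "spammer"), ("fan", "farmer"),
         ("careful", "friend"), ("friend", "careful")]
scores = collusionrank(edges, seed_spammers={"spammer"})
for user, value in sorted(scores.items(), key=lambda item: item[1]):
    print(user, round(value, 4))
```

In this toy graph the penalty fades with distance from the seed spammer, and users with no path to a spammer through their followings keep a score of zero.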
18 Evaluating Collusionrank
- Goal:
- To penalize spammers and spam-followers
- Should not penalize users who are not following spammers
- Used a small subset of 600 spammers as the seed set
- Compare ranks between:
- Pagerank
- Pagerank + Collusionrank
- Measures influence after accounting for link farming activity
19 Effect of Collusionrank on spammers
40% of spammers appear in the top 20% according to Pagerank
Most of the spammers get pushed to the last 10% of positions based on Collusionrank
20 Effect on link farmers
98% of the link farmers get pushed to the last 10% of positions based on Collusionrank
87% of link farmers are in the top 2% of users according to Pagerank
21 Effect on normal users
- Focus on top 100,000 users according to Pagerank
- Analyze the percentile difference in ranks between:
- Pagerank (P) and Pagerank + Collusionrank (PC)
- Percentile difference = ((PC - P) / N) x 100
Only 20% of users get demoted heavily
Heavily demoted users follow many more spammers than others
Collusionrank selectively filters out spammers and spam-followers
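As a worked example of the percentile difference, the sketch below assumes P and PC are rank positions (1 = most influential) and N is the number of ranked users; the slide does not define these symbols, and the numbers are hypothetical.

```python
def percentile_difference(rank_p, rank_pc, n_users):
    """((PC - P) / N) x 100: positive values mean the user was demoted."""
    return (rank_pc - rank_p) / n_users * 100

# Hypothetical user: position 100 under Pagerank, position 50,100 under
# Pagerank + Collusionrank, out of 100,000 ranked users.
demotion = percentile_difference(rank_p=100, rank_pc=50_100, n_users=100_000)
unchanged = percentile_difference(rank_p=100, rank_pc=100, n_users=100_000)
print(demotion, unchanged)
```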
22 Summary: Thwarting spammers
- Spammers infiltrate the Twitter network by farming links
- Link farming helps them gain influence to promote spam
- Search involves ranking users based on connectivity-based influence
- Analyzed link farming in Twitter by studying spammers
- Top link farmers are real, active and popular users
- Proposed an algorithm, Collusionrank, to limit link farming
- Incentivizes users to be careful about who they connect with
23 Finding Topic Experts in Twitter (WOSN 2012, SIGIR 2012)
Part 2
24 Topic experts in Twitter
- Twitter is now an important source of current news
- 500 million users post 400 million tweets daily
- Quality of tweets posted by different users varies widely
- News, pointless babble, conversational tweets, spam, ...
- Challenge: to find topic experts
- Sources of authoritative information on specific topics
25 Identifying topic experts in Twitter
- Existing approaches:
- Research studies: Pal (WSDM 2011), Weng (WSDM 2010)
- Application systems: Twitter Who-To-Follow, Wefollow, ...
- Existing approaches primarily rely on information provided by the user herself
- Bio, contents of tweets, network features, e.g., followers
- We rely on the wisdom of the Twitter crowd
- How do others describe a user?
26 Twitter Lists
- A feature to organize tweets received from the people whom a user is following
- Create a List, add a name and description, add Twitter users to the List
- List meta-data offers cues for who-is-who
- Tweets from all listed users will be available as a separate List stream
28 Mining Lists to infer expertise
- Collect Lists containing a given user U
- Identify U's topics from List meta-data
- Basic NLP techniques
- Extract nouns and adjectives
- Extracted words collected to obtain a topic document for the user
- movies tv hollywood stars entertainment
- celebrity hollywood
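Building a topic document from List meta-data can be sketched as follows. Note the swap: the talk extracts nouns and adjectives, which needs a part-of-speech tagger, whereas this self-contained sketch only drops a small hypothetical stopword list. The third List description is invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list standing in for noun/adjective extraction.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to",
             "my", "i", "who", "are", "people"}

def topic_document(list_texts):
    """Count the words used in the names/descriptions of Lists containing a user."""
    counts = Counter()
    for text in list_texts:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

lists_containing_user = [
    "movies tv hollywood stars entertainment",   # from the slide
    "celebrity hollywood",                       # from the slide
    "People who are funny",                      # hypothetical
]
doc = topic_document(lists_containing_user)
print(doc.most_common(3))
```

Words that many List curators use independently, such as "hollywood" here, rise to the top of the topic document.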
29 Lists vs. other features
Profile bio: [not captured in transcript]
Most common words from tweets: Fallon, happy, love, fun, video, song, game, hope, fjoln, fallonmono
Most common words from Lists: celeb, funny, humor, music, movies, laugh, comics, television, entertainers
30 Dataset
- Collected Lists of 55 million Twitter users who joined before or in 2009
- Our analysis infers topics for 1.3 million users who are included in 10 or more Lists
31 Evaluating inference quality
- Quality metrics
- Is the inference accurate?
- Is the inference informative?
- Evaluation of popular users
- Celebrities, News media sources, US Senators
- Using user feedback
32 Popular users set 1: Celebrities
- The inferred attributes accurately capture:
- Biographical information
- Topics of expertise
- Popular perception about the user
Biographical Tags | Topics of Expertise | Popular Perception
government, president, USA, democrat | politics, government | celebs, leader, famous, current events
sports, cyclist, athlete | tdf, triathlon, cancer | celebs, influential, famous, inspiration
33 Popular users set 2: News media sources
- The inferred attributes indicate:
- Primary topics of the media source
- Perceived political bias (verified using ADA scores)
Media | Biographical Tags | Topics of Expertise | Popular Perception
CNN | media, journalist, bloggers | politics, sports, tech, weather, current | influential outlets
The Nation | media, journalist, magazines, blogs | politics, government | progressive, liberal
Townhall.com | media, bloggers, commentary, journalists | politics | conservative, republican
GuardianFilm | journalists, reviews | movies, cinema, actors, theatre, hollywood | film critics
34 Popular users set 3: US Senators
- Out of the 100 US senators, 84 have Twitter accounts
- The inferred attributes correctly identify:
- Their political party
- The state represented by them
- Their gender
- "Female" or "Women" for all 15 female senators
- Their political ideology
- progressive/liberal/conservative/tea-party
- The Senate committees to which they belong
35 Popular users set 3: US Senators
Senator | Biographical Tags | Senate Committees | Perception
Chuck Grassley | politics, senator, republican, iowa, gop | health, food, agriculture | conservative
Claire McCaskill | politics, democrats, missouri, women | tech, security, power, health, commerce | progressive, liberal
Jim Inhofe | politics, congress, oklahoma, republican | army, energy, climate, foreign | conservative
John Kerry | politics, senate, democrats, boston | health, climate, tech | progressive
36 User feedback
Response | Accurate | Informative
Total evaluations | 345 | 342
Yes | 274 | 277
No | 18 | 20
Can't tell | 53 | 45
- Ignoring "can't tell" responses:
- Accuracy: 94%
- Informative: 93%
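The two summary figures follow from the table once the "can't tell" responses are dropped. A short check of the arithmetic, assuming each percentage is Yes / (Yes + No) rounded to the nearest whole number:

```python
def share_yes(yes, no):
    """Percentage of Yes among the responses that took a position."""
    return 100 * yes / (yes + no)

accuracy = share_yes(274, 18)       # "Accurate" column
informative = share_yes(277, 20)    # "Informative" column
print(round(accuracy), round(informative))
```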
37 Evaluating inference coverage
- What fraction of Twitter can our method of inference be applied to?
A large fraction of popular Twitter users are covered
38 Evaluating inference coverage
- We could also infer attributes of less popular users
- 6% of users with follower ranks between 1 and 10 million
- They are often experts on niche topics
User | Twitter bio | Followers | Listed | Inferred Attributes
spacespin | news on robotic space exploration | 56 | 11 | science, space exploration, nasa, astronomy, planets
laithm | Al-jazeera network battle cameraman | 201 | 16 | journalists, photographer, al-jazeera, media
HumphreysLab | Stem Cell, Regenrative Biology of Kidney | 119 | 17 | science, stem cell, genetics, cancer, physicians, biotech, nephrologist
39 Cognos
- Search system for topic experts in Twitter
- Given a query (topic)
- Identify experts on the topic using Lists
- Rank identified experts
40 Ranking experts
- Used a ranking scheme solely based on Lists
- Two components of ranking user U w.r.t. query Q:
- Relevance of user to query: cover density ranking between the topic document T_U of the user and Q
- Popularity of user: number of Lists including the user
Ranking score: Topic relevance(T_U, Q) x log(number of Lists including U)
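The combination of relevance and popularity can be sketched as below. Two assumptions are worth flagging: the two components are multiplied, and `relevance` is a simple term-frequency stand-in, not cover density ranking. The user names, topic documents, and List counts are hypothetical.

```python
import math

def relevance(topic_doc, query):
    """Stand-in relevance: share of topic-document words that match query terms."""
    terms = query.lower().split()
    total = sum(topic_doc.values())
    if not terms or total == 0:
        return 0.0
    return sum(topic_doc.get(term, 0) for term in terms) / total

def rank_experts(users, query):
    """Order users by relevance(T_U, Q) x log(number of Lists including U)."""
    scored = [
        (relevance(doc, query) * math.log(list_count), name)
        for name, (doc, list_count) in users.items()
        if list_count > 0
    ]
    return [name for score, name in sorted(scored, reverse=True)]

# Hypothetical users: name -> (topic document, number of Lists including the user).
users = {
    "lab": ({"stem": 5, "cell": 5, "science": 2}, 17),
    "news": ({"news": 10, "stem": 1}, 2000),
    "chef": ({"food": 8}, 500),
}
ranking = rank_experts(users, "stem cell")
print(ranking)
```

Taking the logarithm of the List count damps raw popularity, so in this example a niche account with a highly relevant topic document outranks a far more widely listed general account.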
41 Cognos results for "stem cell"
42 Evaluation of Cognos
- System deployed and evaluated in the wild
- Evaluators were students and researchers from the three home institutes of the authors
43 Cognos vs. Twitter Who-To-Follow
44 Cognos vs. Twitter Who-To-Follow
- Considering 27 distinct queries asked at least twice
- Judgment by majority voting
- Cognos judged better on 12 queries
- Computer science, Linux, Mac, Apple, iPad, Internet, Windows phone, photography, political journalist, ...
- Twitter Who-To-Follow judged better on 11 queries
- Music, Sachin Tendulkar, Angelina Jolie, Harry Potter, Metallica, cloud computing, IIT Kharagpur, ...
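Majority voting per query can be sketched as follows. The vote labels and the handling of ties are assumptions; the talk does not say how tied queries were counted.

```python
from collections import Counter

def majority_winner(votes):
    """Return the system most evaluators preferred, or "tie" if there is no majority."""
    if not votes:
        return "tie"
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]

# Hypothetical judgments for two queries that were each asked more than once.
judgments = {
    "linux": ["cognos", "cognos", "who-to-follow"],
    "music": ["who-to-follow", "cognos"],
}
winners = {query: majority_winner(votes) for query, votes in judgments.items()}
print(winners)
```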
45 Results for query "music"
46 Summary: Finding topic experts in Twitter
- Developed and deployed Cognos
- Uses Lists to infer topics of expertise and rank users
- Competes favorably with Twitter Who-To-Follow
- Lists are vital in searching for topic experts in Twitter
- Future work:
- Make the inference methodology robust against List spam
- Key insight: Unlike follow-links, experts do not List non-expert users
47 Twitter microblogging site
- An important source for real-time Web content
- 500 million active users posting 400 million tweets daily
- Quality of tweets / content varies widely
- Anyone can post tweets
- Celebrities, politicians, news media, academics, spammers
- Challenge: Finding relevant and trustworthy content
- Trustworthy: Thwart spammers and their spam
- Relevant: Identify authoritative experts on specific topics
48 Higher-level takeaway
- Links mean different things in different real-world social networks
- In fact, every social network offers different types of links
- They are backed by different social interactions
- Many links are implicit
- Important to differentiate and leverage domain-specific usage of social links
49 Thank You
- You can try Cognos at:
- http://twitter-app.mpi-sws.org/whom-to-follow/
- http://twitter-app.mpi-sws.org/who-is-who/