Title: Collection of general data mining briefings
1 Social Networking, Security and Privacy
Dr. Bhavani Thuraisingham
April 21, 2008
2 Social Networks (http://www.flairandsquare.com/archives/167)
- A social network site allows people who share interests to build a trusted network or online community. It usually provides various ways for users to interact, such as IM (chat/instant messaging), email, video sharing, file sharing, blogging, and discussion groups.
- The main types of social networking sites are built around a theme: they let users connect through online image or video collections (like Flickr or YouTube) or music (like MySpace or Last.fm).
- Most contain libraries or directories of some categories, such as former classmates or old work colleagues (like Facebook, Friends Reunited, or LinkedIn).
- They provide a means to connect with friends (by allowing users to create a detailed profile page), and recommender systems linked to trust.
3 Popular Social Networks
- Facebook - A social networking website. Membership was initially restricted to students of Harvard University. It was originally based on the "face book" given to first-year students as a way to get to know other students on campus. As of July 2007, there were over 34 million active members worldwide. From September 2006 to September 2007 it rose from the 60th to the 6th most visited web site, and it was the number one site for photos in the United States.
- Twitter - A free social networking and micro-blogging service that allows users to send updates (text-based posts, up to 140 characters long) via SMS, instant messaging, email, the Twitter website, or an application/widget within a space of your choice, like MySpace, Facebook, a blog, or an RSS aggregator/reader.
- MySpace - A popular social networking website offering an interactive, user-submitted network of friends, personal profiles, blogs, groups, photos, music and videos internationally. According to Alexa Internet, MySpace is currently the world's sixth most popular English-language website, the sixth most popular website in any language, and the third most popular website in the United States, though it has topped the chart on various weeks. As of September 7, 2007, there are over 200 million accounts.
4 Social Networks: More formal definition
- A structural approach to understanding social interaction.
- Networks consist of Actors and the Ties between them.
- We represent social networks as graphs whose vertices are the actors and whose edges are the ties.
- Edges are usually weighted to show the strength of the tie.
- In the simplest networks, an Actor is an individual person.
- A tie might be "is acquainted with". Or it might represent the amount of email exchanged between persons A and B.
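The graph view above can be sketched in a few lines of Python. This is only an illustration: the actor names and tie weights are made up, and the weight is assumed to stand for something like the number of emails exchanged.

```python
from collections import defaultdict

def build_network(ties):
    """Build an undirected weighted graph from (actor_a, actor_b, weight) ties."""
    graph = defaultdict(dict)
    for a, b, weight in ties:
        # Repeated ties between the same pair accumulate strength.
        graph[a][b] = graph[a].get(b, 0) + weight
        graph[b][a] = graph[b].get(a, 0) + weight
    return dict(graph)

# Hypothetical data: weight = number of emails exchanged.
ties = [("Ann", "Bob", 12), ("Bob", "Cat", 3), ("Ann", "Bob", 5)]
network = build_network(ties)
print(network["Ann"]["Bob"])  # 17
```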
5 Social Network Examples
- Effects of urbanization on individual well-being
- World political and economic system
- Community elite decision-making
- Social support, Group problem solving
- Diffusion and adoption of innovations
- Belief systems, Social influence
- Markets, Sociology of science
- Exchange and power
- Email, Instant messaging, Newsgroups
- Co-authorship, Citation, Co-citation
- SocNet software, Friendster
- Blogs and diaries, Blog quotes and links
6 History
- Sociograms were invented in 1933 by Moreno.
- In a sociogram, the actors are represented as points in a two-dimensional space. The location of each actor is significant. E.g. a central actor is plotted in the center, and others are placed in concentric rings according to distance from this actor.
- Actors are joined with lines representing ties, as in a social network. In other words, a social network is a graph, and a sociogram is a particular 2D embedding of it.
- These days, sociograms are rarely used (most examples on the web are not sociograms at all, but networks). But methods like MDS (Multi-Dimensional Scaling) can be used to lay out Actors, given a vector of attributes about them.
- Social Networks were studied early by researchers in graph theory (Harary et al., 1950s). Some social network properties can be computed directly from the graph.
- Others depend on an adjacency matrix representation (Actors index the rows and columns of a matrix, and matrix elements represent the tie strength between them).
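As a rough sketch of the adjacency matrix representation, the following Python builds a symmetric matrix from a list of ties and reads each actor's weighted degree off the row sums. The actors and weights are hypothetical, and ties are assumed to be undirected.

```python
def adjacency_matrix(actors, ties):
    """Rows and columns are indexed by actor; entries hold tie strength."""
    index = {actor: i for i, actor in enumerate(actors)}
    n = len(actors)
    matrix = [[0] * n for _ in range(n)]
    for a, b, weight in ties:
        i, j = index[a], index[b]
        matrix[i][j] = weight
        matrix[j][i] = weight  # undirected tie, so the matrix is symmetric
    return matrix

actors = ["Ann", "Bob", "Cat"]
ties = [("Ann", "Bob", 2), ("Bob", "Cat", 1)]
A = adjacency_matrix(actors, ties)

# Row sums give each actor's weighted degree.
degrees = [sum(row) for row in A]
print(degrees)  # [2, 3, 1]
```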
7 Social Networks: Basic Questions
- Balance: important in exchange networks.
- In a two-person network (dyad), exchange of goods, services and cash should be balanced.
- More generally, exchanges of favors or support are likely to be quite balanced.
- Role: what role does the actor perform in the network?
- Role is defined in terms of actors' neighborhoods.
- The neighborhood is the set of ties and actors connected directly to the current actor.
- Actors with similar or identical neighborhoods are assigned the same role.
- What is the related idea from semiotics?
- Paradigm: interchangeability. Actors with the same role are interchangeable in the network.
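One simple way to make the role idea concrete is to group actors whose neighborhoods are exactly identical. The sketch below does only that; handling "similar" neighborhoods would need a similarity measure that the slides do not specify. The example network (one manager, three workers) is hypothetical.

```python
from collections import defaultdict

def assign_roles(neighbors):
    """Group actors that have exactly the same set of neighbors."""
    roles = defaultdict(list)
    for actor, nbrs in neighbors.items():
        # A frozenset is hashable, so it can serve as the role key.
        roles[frozenset(nbrs)].append(actor)
    return sorted(sorted(group) for group in roles.values())

# Hypothetical star network: every worker is tied only to the manager.
neighbors = {
    "M": {"W1", "W2", "W3"},
    "W1": {"M"},
    "W2": {"M"},
    "W3": {"M"},
}
roles = assign_roles(neighbors)
print(roles)  # [['M'], ['W1', 'W2', 'W3']]
```

The three workers share a role and are interchangeable in the sense used above; the manager is in a role of its own.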
8 Social Networks: Basic Questions
- Prestige: how important is the actor in the network?
- Related notions are status and centrality.
- Centrality reifies the notion of peripheral vs. central participation from communities of practice.
- Key notions of centrality were developed in the 1970s, e.g. eigenvector centrality by Bonacich.
- Most of these measures were rediscovered as quality measures for web pages:
- Indegree
- PageRank: eigenvector centrality
- HITS: two-mode eigenvector centrality
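To illustrate two of the measures listed, here is a minimal Python sketch of indegree and of PageRank computed by power iteration. The damping factor 0.85, the iteration count, and the four-node example graph are assumptions chosen for illustration; every link target is assumed to appear as a key in `links`.

```python
def indegree(links):
    """Count incoming links per node; links maps node -> list of targets."""
    counts = {node: 0 for node in links}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a small directed graph."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical graph: A, B and D all point to C; C points back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
in_counts = indegree(links)
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C
```

Note that A has only one incoming link yet ranks well above B and D, because its single link comes from the most prestigious node. That is the difference between indegree and an eigenvector-style measure.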
9 Social Network Concepts
- Actor
- An actor is a basic component for SNs. Actors can be:
- Individual people, Corporations, Nation-States, Social groups
- Modes
- If all the actors are of the same type, the network is called a one-mode network. If there are two groups of actors, it is a two-mode network.
- E.g. an affiliation network is a two-mode network. One mode is individuals, the other is groups to which they belong. Ties represent the relation "person A is a member of group B".
- Ties
- A tie is the relation between two actors. Common types of ties include:
- Friendship, Amount of communication, Goods exchanged, Familial relation (kinship), Institutional relations
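A two-mode affiliation network is often collapsed into a one-mode network by linking people who share a group. The Python below is a sketch of that projection under the assumption that tie strength is the number of shared groups; the people and groups are invented.

```python
from collections import Counter
from itertools import combinations

def project_to_people(memberships):
    """Turn person -> groups affiliations into person-person ties
    weighted by the number of groups the two people share."""
    members = {}
    for person, groups in memberships.items():
        for group in groups:
            members.setdefault(group, set()).add(person)
    ties = Counter()
    for people in members.values():
        for a, b in combinations(sorted(people), 2):
            ties[(a, b)] += 1
    return dict(ties)

# Hypothetical affiliation data.
memberships = {
    "Ann": {"Chess", "Choir"},
    "Bob": {"Chess", "Choir"},
    "Cat": {"Choir"},
}
one_mode = project_to_people(memberships)
print(one_mode[("Ann", "Bob")])  # 2
```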
10 Practical issues: Boundaries and Samples
- Because human relations are rich and unbounded, drawing meaningful boundaries for network analysis is a challenge.
- There are two main approaches:
- Realist: boundaries perceived by the actors themselves, e.g. gang members or ACM members.
- Nominalist: boundaries created by the researcher, e.g. people who publish in ACM CHI.
- To deal with large networks, sampling is necessary. Unfortunately, randomly sampled graphs will typically have completely different structure. Why?
- One approach to this is snowballing. You start with a random sample. Then extend with all actors connected by a tie. Then extend with all actors connected to the previous set by a tie, and so on.
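Snowballing as described can be sketched as repeated one-step expansion from a random seed set. In this Python sketch the function name, the `waves` parameter, and the five-actor chain are illustrative assumptions, not part of the original material.

```python
import random

def snowball_sample(neighbors, seed_size, waves, rng=None):
    """Start from a random seed set, then add everyone tied to the
    previous wave, repeating for the given number of waves."""
    rng = rng or random.Random()
    sample = set(rng.sample(sorted(neighbors), seed_size))
    frontier = set(sample)
    for _ in range(waves):
        reached = set()
        for actor in frontier:
            reached.update(neighbors[actor])
        frontier = reached - sample   # only actors not sampled yet
        sample |= frontier
    return sample

# Hypothetical chain of five actors: a - b - c - d - e.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
full = snowball_sample(chain, seed_size=1, waves=4, rng=random.Random(0))
print(sorted(full))  # ['a', 'b', 'c', 'd', 'e']
```

Unlike an independent random sample of nodes, each wave keeps the ties that connect sampled actors, so local structure survives in the sample.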
11 Social Network Analysis of 9/11 Terrorists (www.orgnet.com)
Early in 2000, the CIA was informed of two
terrorist suspects linked to al-Qaeda. Nawaf
Alhazmi and Khalid Almihdhar were photographed
attending a meeting of known terrorists in
Malaysia. After the meeting they returned to Los
Angeles, where they had already set up
residence in late 1999.
12 Social Network Analysis of 9/11 Terrorists
- What do you do with these suspects? Arrest or deport them immediately? No, we need to use them to discover more of the al-Qaeda network.
- Once suspects have been discovered, we can use their daily activities to uncloak their network. Just like they used our technology against us, we can use their planning process against them. Watch them, and listen to their conversations to see...
- who they call / email
- who visits with them locally and in other cities
- where their money comes from
- The structure of their extended network begins to emerge as data is discovered via surveillance.
13 Social Network Analysis of 9/11 Terrorists
A suspect being monitored may have many contacts -- both accidental and intentional. We must always be wary of 'guilt by association'. Accidental contacts, like the mail delivery person, the grocery store clerk, and the neighbor may not be viewed with investigative interest.
Intentional contacts are like the late afternoon visitor, whose car license plate is traced back to a rental company at the airport, where we discover he arrived from Toronto (got to notify the Canadians) and his name matches a cell phone number (with a Buffalo, NY area code) that our suspect calls regularly. This intentional contact is added to our map and we start tracking his interactions -- where do they lead? As data comes in, a picture of the terrorist organization slowly comes into focus.
How do investigators know whether they are on to something big? Often they don't. Yet in this case there was another strong clue that Alhazmi and Almihdhar were up to no good -- the attack on the USS Cole in October of 2000. One of the chief suspects in the Cole bombing, Khallad, was also present along with Alhazmi and Almihdhar at the terrorist meeting in Malaysia in January 2000.
Once we have their direct links, the next step is to find their indirect ties -- the 'connections of their connections'. Discovering the nodes and links within two steps of the suspects usually starts to reveal much about their network. Key individuals in the local network begin to stand out. In viewing the network map in Figure 2, most of us will focus on Mohammed Atta because we now know his history. The investigator uncloaking this network would not be aware of Atta's eventual importance. At this point he is just another node to be investigated.
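Finding everything "within two steps" of a set of starting nodes is a bounded breadth-first search. The Python below is a generic sketch of that graph operation; the node labels and the tiny example graph are placeholders, not data from the orgnet.com study.

```python
from collections import deque

def within_steps(neighbors, start_nodes, max_steps=2):
    """Return {node: distance} for every node reachable from any
    start node in at most max_steps ties."""
    distance = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the step limit
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

# Placeholder graph: S1 - A - B - C, with S2 isolated.
neighbors = {"S1": ["A"], "A": ["S1", "B"], "B": ["A", "C"], "C": ["B"], "S2": []}
nearby = within_steps(neighbors, ["S1", "S2"], max_steps=2)
print(nearby)  # {'S1': 0, 'S2': 0, 'A': 1, 'B': 2}
```

Node C sits three steps away, so it falls outside the two-step map.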
14 Social Network Analysis of 9/11 Terrorists
Figure 2 shows the two suspects and
15 Social Network Analysis of 9/11 Terrorists
Figure 3 shows the direct
16 Social Network Analysis of 9/11 Terrorists
- We now have enough data for two key conclusions:
- All 19 hijackers were within 2 steps of the two original suspects uncovered in 2000!
- Social network metrics reveal Mohammed Atta emerging as the local leader.
- With hindsight, we have now mapped enough of the 9-11 conspiracy to stop it. Again, the investigators are never sure they have uncovered enough information while they are in the process of uncloaking the covert organization. They also have to contend with superfluous data. This data was gathered after the event, so the investigators knew exactly what to look for. Before an event it is not so easy.
- As the network structure emerges, a key dynamic that needs to be closely monitored is the activity within the network. Network activity spikes when a planned event approaches. Is there an increase of flow across known links? Are new links rapidly emerging between known nodes? Are money flows suddenly going in the opposite direction? When activity reaches a certain pattern and threshold, it is time to stop monitoring the network, and time to start removing nodes.
- The author argues that this bottom-up approach of uncloaking a network is more effective than a top-down search for the terrorist needle in the public haystack -- and it is less invasive of the general population, resulting in far fewer "false positives".
17 Social Network Analysis of Steroid Usage in Baseball (www.orgnet.com)
When the Mitchell Report on steroid use in Major
League Baseball (MLB) was published, people were
surprised at who and how many players were
mentioned. The diagram below shows a human
network created from data found in the Mitchell
Report. Baseball players are shown as green
nodes. Those who were found to be providers of
steroids and other illegal performance enhancing
substances appear as red nodes. The links reveal
the flow of chemicals -- from provider to player.
18 Knowledge Management Examples
- Managing the 21st Century Organization
- Networks of Adaptive/Agile Organizations
- Best Practice Organizational Network Mapping
- Discovering Communities of Practice
- Data-Mining E-mail
- Finding Leaders on your Team
- Post-Merger Integration
- Knowledge Sharing in Organizations
- Innovation happens at the Intersections
- Partnerships and Alliances in Industry
- Decision-Making in Organizations
- New Organizational Structures
19 Knowledge Sharing in Organizations: Finding Experts
20 Knowledge Sharing Network: Finding Experts (www.orgnet.com)
Organizational leaders are preparing for the
potential loss of expertise and knowledge flow
due to turnover, downsizing, outsourcing, and the
coming retirements of the baby boom generation.
The model network (previous chart) is used to
illustrate the knowledge continuity analysis
process. Each node in this sample network
(previous chart) represents a person that works
in a knowledge domain. Some people have more /
different knowledge than others. Employees who
will retire in 2 years or less have their nodes
colored red. Those who will retire in 3-4 years
are colored yellow. Those retiring in 5 years or
later are colored green. A gray, directed line
is drawn from the seeker of knowledge to the
source of expertise. A --> B indicates that A seeks
expertise / advice from B. Those with many
arrows pointing to them are sought often for
assistance. The top subject matter experts --
SMEs -- in this group are nodes 29, 46, 100, 41,
36 and 55. The SMEs were discovered using a
network metric in InFlow that is similar to how
the Google search engine ranks web pages --
using both direct and indirect links. Of the top
six SMEs in this group, half are colored red (100)
or yellow (46, 55). The loss of person 46 has the
greatest potential for knowledge loss. 90% of the
network is within 3 steps of accessing this key
knowledge source.
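The "within 3 steps" figure can be read as a reachability measure on the seeker-to-source graph. The Python sketch below computes the share of people who can reach a given expert by following advice-seeking links in at most three steps. This is an assumed interpretation, not InFlow's actual metric, and the small network is hypothetical.

```python
from collections import deque

def reach_fraction(seeks, expert, max_steps=3):
    """Fraction of other people who can reach `expert` by following
    seeker -> source links in at most max_steps steps."""
    # Reverse the seeker -> source edges so we can walk outward from the expert.
    asked_by = {}
    for seeker, sources in seeks.items():
        for source in sources:
            asked_by.setdefault(source, set()).add(seeker)
    people = set(seeks) | set(asked_by)
    distance = {expert: 0}
    queue = deque([expert])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue
        for seeker in asked_by.get(node, ()):
            if seeker not in distance:
                distance[seeker] = distance[node] + 1
                queue.append(seeker)
    others = people - {expert}
    return (len(distance) - 1) / len(others) if others else 0.0

# Hypothetical chain: p4 asks p3, p3 asks p2, p2 asks p1, p1 asks the expert e.
seeks = {"p1": ["e"], "p2": ["p1"], "p3": ["p2"], "p4": ["p3"], "e": []}
coverage = reach_fraction(seeks, "e", max_steps=3)
print(coverage)  # 0.75
```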
21 Social Networks: Security and Privacy Issues (European Network and Information Security Agency)
- The European Network and Information Security Agency (ENISA) has released its first issue paper, "Security Issues and Recommendations for Online Social Networks".
- http://www.enisa.europa.eu/doc/pdf/deliverables/enisa_pp_social_networks.pdf
- Four groups of threats: privacy related threats, variants of traditional network and information security threats, identity related threats, social threats.
- Recommendations are given for governments (oversight and adaptation of existing data protection legislation), companies that run such networks, technology developers, and research and standardisation bodies.
- Some concerns: the recommendation to use automated filters against "offensive, litigious or illegal content" raises potential freedom of speech issues. European Digital Rights has started a campaign against a similar recommendation by the Council of Europe. The portability of profiles and social graphs is also addressed. However, what is missing is that information about social links concerns not only one user but also the others to whom that user is linked. They have to agree if this information is moved to different platforms.
22 Social Networks: Security and Privacy Issues (Microsoft Recommendations, http://www.microsoft.com/protect/yourself/personal/communities.mspx)
- Online communities require you to provide personal information. Profiles are public. Comments you post are permanently recorded on the community site. You might even mention when you plan to be out of town.
- E-mail and phishing scammers count on the appealing sense of trust that is often fostered in online communities to steal your personal information. The more you reveal in profiles and posts, the more vulnerable you are to scams, spam, and identity theft.
- Here are some features to look for when you're considering joining an online community:
- Privacy policies that explain exactly what information the service will collect and how it might be used.
- User guidelines that outline a basic code of conduct for users on their sites. Sites have the option to penalize reported violators with account suspension or termination.
- Special provisions for children and their parents, such as family-friendly options geared towards protecting children under a certain age.
- Password protection to help keep your account secure.
- E-mail address hiding, which lets you display only part of your e-mail address on the site's membership lists.
- Filtering options: offered on blogging sites, these tools let you choose which subscribers can see what you've written.
23 Appendix: Social Networks and Surveillance: Evaluating Suspicion by Association
Ryan Layfield (PhD Student), Prof. Bhavani Thuraisingham
September 2006
24 Overview
- Introduction
- Our Goal
- System Design
- Social Networks
- Threat Detection
- Correlation Analysis
- The Experiment
- Setup
- Current Results
- Issues
- Future Work
- Directions
25 Introduction
- Automated message surveillance is essential to communication monitoring
- Widespread use of electronic communication
- Exponential data growth
- Impossible to sift through all by hand
- Going beyond basic surveillance
- Identifying groups rather than individuals
- Monitoring conversations rather than messages
26 Our Approach
- Design new techniques and apply existing algorithms to:
- Create a machine-understandable model of existing social networks
- Identify abnormal conversations and behavior
- Monitor a given communications system in real-time
- Continuously learn and adapt to a dynamic environment
- System Design: three major components
- Social Network Modeler
- Initial Activity Detector
- Correlated Activity Investigator
- Assumptions
- Individuals engaged in suspicious or undesirable behavior rarely act alone
- We can infer that those associated with a person positively identified as suspicious have a high probability of being either:
- Accomplices (participants in suspicious activity)
- Witnesses (observers of suspicious activity)
- Making these assumptions, we create a context of association between users of a communication network
27 Social Networks
- Within our model
- Every node is a unique user
- Every message creates or strengthens a link between nodes
- Over time, the network changes
- Frequent communication leads to stronger links
- Intermittent messaging implies weakening social ties
- The strength of the link implies how strong an association between individuals is
- From this data, we can theoretically identify
- Hubs
- Groups
- Liaisons
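A link model of this kind could be sketched as follows in Python. The slides do not say how links strengthen or weaken, so the additive update, the multiplicative decay factor, and the class and method names are all assumptions made for illustration; links are treated as undirected.

```python
class LinkModel:
    """Toy model: messages strengthen a link, idle time weakens it."""

    def __init__(self, decay=0.9):
        self.decay = decay      # assumed per-time-step decay factor
        self.strength = {}      # (user_a, user_b) -> link strength

    def record_message(self, sender, recipient, weight=1.0):
        # Sort the pair so that a->b and b->a update the same link.
        key = tuple(sorted((sender, recipient)))
        self.strength[key] = self.strength.get(key, 0.0) + weight

    def tick(self):
        # One time step passes: every link weakens a little.
        for key in self.strength:
            self.strength[key] *= self.decay

model = LinkModel(decay=0.5)
model.record_message("alice", "bob")
model.record_message("bob", "alice")   # strength is now 2.0
model.tick()                           # decays to 1.0
model.record_message("alice", "carol")
print(model.strength[("alice", "bob")])  # 1.0
```

Hubs could then be approximated as users whose links have the largest total strength, though the slides do not state how hubs, groups, or liaisons were actually identified.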
28 Social Networks
29 Threat Detection and Correlation Analysis
- Every message sent is scrutinized in the interest of identifying suspicious communication
- Keyword analysis
- Prior context (i.e. previous message content)
- When a detection algorithm yields a strong result, a token is created
- The token is created at the origin and passed to the recipient(s)
- Existing tokens, if any, are cloned instead
- The result is a web that potentially reflects the dissemination of suspicious information activity
- Future messages with similar suspicious topics are not always identifiable with the same initial techniques
- Quick replies
- Pronoun use
- Assumption that recipient is aware of topic
- If a token is present at the sender when a message is sent
- The message the token is associated with and the new message are analyzed
- If analysis yields a strong match, the token is further cloned and passed to the recipient
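The token mechanism can be illustrated with a small Python sketch. Everything here is a stand-in: the keyword check, the Jaccard word-overlap similarity, the 0.3 threshold, and the function names are hypothetical, and a token is represented simply by the text of the message that created it.

```python
def is_suspicious(text, keywords):
    """Stand-in detector: flag a message containing any keyword."""
    return any(word in keywords for word in text.lower().split())

def similarity(text_a, text_b):
    """Jaccard overlap of word sets; a stand-in for the correlation step."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def process_message(tokens, sender, recipients, text, keywords, threshold=0.3):
    """tokens maps user -> list of message texts that earned a token."""
    held = tokens.setdefault(sender, [])
    if is_suspicious(text, keywords):
        if text not in held:
            held.append(text)      # new token created at the origin
        to_pass = [text]
    else:
        # No direct detection: compare against messages behind held tokens.
        to_pass = [t for t in held if similarity(t, text) >= threshold]
    for recipient in recipients:
        bucket = tokens.setdefault(recipient, [])
        for token in to_pass:
            if token not in bucket:
                bucket.append(token)   # clone the token to the recipient
    return to_pass

tokens = {}
keywords = {"shipment"}
process_message(tokens, "u1", ["u2"], "the shipment arrives friday", keywords)
process_message(tokens, "u2", ["u3"], "ok friday it arrives", keywords)
process_message(tokens, "u3", ["u4"], "lunch tomorrow?", keywords)
print(tokens["u3"])  # ['the shipment arrives friday']
```

The second message contains no keyword, but it overlaps enough with the message behind the sender's token to propagate that token. The third message does not, so u4 receives nothing.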
30 The Experiment
- A rare set of words shared between two or more messages is a candidate for keyword analysis, but such words are not always easily sifted from noise
- Noise within text-based messages comes in a variety of forms
- Misspelled words
- Unusual word choice
- Incompatible variations of the same language (e.g. British vs. American English)
- Unexpected language
- However, we do not want to eliminate potential keywords
- Document names
- Terminology specific to a subject
- Buzz words
- We proposed an experiment that attempts to eliminate false positives due to noisy data while strengthening and expanding our correlation techniques
31 Setup
- Tools
- Running word rank database
- Implementation of word set theory infrastructure
- JAMA Matrix Library
- Singular Value Decomposition
- Our Approach
- Apply SVD noise filtering based on 100 messages
- Analyze word frequency correlation between current message and prior suspicious messages
- Generate a score based on the results
32 Setup
- Construct a matrix based on the last 100 messages
[Figure: matrix with messages along one axis and words along the other, ordered from more common to less common]
33 Setup
$A = U \Sigma V^{T}$
Eliminate weak singular values
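Written out, eliminating weak singular values amounts to a truncated singular value decomposition of the message-word matrix. The formulas below are the standard textbook form; the cutoff $k$ is a hypothetical parameter, since the slides do not say how many singular values were kept.

```latex
A = U \Sigma V^{T}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \qquad k < r
```

Here $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$. Keeping only the $k$ largest singular values gives the best rank-$k$ approximation of $A$ in the least-squares sense, and the discarded terms are treated as noise.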
34 Setup
Pulled from messages j and k
Raw total score for word wi
Pulled from running word database
Counts only intersection of words
Predefined fixed threshold
35 Current Results
- Method is not currently accurate
- Large fluctuations
- Correlation easily swayed by plethora of common words
- Uncommon words not given enough weight
1000 messages evaluated, first 100 used to seed word ranks.
36 Issues and Directions
- Word frequencies fluctuate wildly during beginning of experiment (0.0 to 10.0)
- Extreme cost for current construction methods and computation
- Filtering context limited to recent global history
- Affected by large bodies of text
- Future Directions include
- Tap potential of existing matrix for further analysis
- Adaptive filtering feedback algorithms
- Speed improvements to accommodate real-time streams
- Flexible communication platform monitoring
- Addition of pipe architecture for modular threat detection and correlation