Collection of general data mining briefings - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Collection of general data mining briefings


1
Social Networking, Security and Privacy
Dr. Bhavani Thuraisingham
April 21, 2008

2
Social Networks (http://www.flairandsquare.com/archives/167)
  • A social network site allows people who share
    interests to build a trusted network/ online
    community. A social network site will usually
    provide various ways for users to interact, such
    as IM (chat/ instant messaging), email, video
    sharing, file sharing, blogging, discussion
    groups, etc.
  • The main types of social networking sites are
    organized around a theme: they allow users to
    connect through image or video collections
    online (like Flickr or YouTube) or music (like
    MySpace, Last.fm). Most contain libraries/
    directories of some categories, such as former
    classmates, old work colleagues, and so on (like
    Facebook, Friends Reunited, LinkedIn, etc.). They
    provide a means to connect with friends (by
    allowing users to create a detailed profile
    page), and recommender systems linked to trust.

3
Popular Social Networks
  • Facebook - A social networking website.
    Initially the membership was restricted to
    students of Harvard University. It was originally
    based on what first-year students were given,
    called the face book, which was a way to get to
    know other students on campus. As of July 2007,
    there were over 34 million active members
    worldwide. From September 2006 to September 2007
    it rose from the 60th to the 6th most visited
    web site, and was the number one site for photos
    in the United States.
  • Twitter - A free social networking and
    micro-blogging service that allows users to send
    updates (text-based posts, up to 140 characters
    long) via SMS, instant messaging, email, to the
    Twitter website, or an application/ widget within
    a space of your choice, like MySpace, Facebook, a
    blog, an RSS Aggregator/reader.
  • MySpace - A popular social networking website
    offering an interactive, user-submitted network
    of friends, personal profiles, blogs, groups,
    photos, music and videos internationally.
    According to Alexa Internet, MySpace is currently
    the world's sixth most popular English-language
    website and the sixth most popular website in any
    language, and the third most popular website in
    the United States, though it has topped the chart
    on various weeks. As of September 7, 2007, there
    are over 200 million accounts.

4
Social Networks: More formal definition
  • A structural approach to understanding social
    interaction.
  • Networks consist of Actors and the Ties between
    them.
  • We represent social networks as graphs whose
    vertices are the actors and whose edges are the
    ties.
  • Edges are usually weighted to show the strength
    of the tie.
  • In the simplest networks, an Actor is an
    individual person.
  • A tie might be "is acquainted with". Or it might
    represent the amount of email exchanged between
    persons A and B.
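
The graph definition above can be sketched in a few lines of Python. This is only an illustration: the actor names, the email counts, and the helper name build_network are made up, and ties are assumed to be undirected.

```python
from collections import defaultdict

def build_network(ties):
    """Build an undirected weighted graph from (actor, actor, weight) ties."""
    graph = defaultdict(dict)
    for a, b, weight in ties:
        # Repeated ties between the same pair accumulate into one edge weight.
        graph[a][b] = graph[a].get(b, 0) + weight
        graph[b][a] = graph[b].get(a, 0) + weight
    return dict(graph)

# Hypothetical data: number of emails exchanged between pairs of people.
ties = [("Ann", "Bob", 12), ("Bob", "Cat", 3), ("Ann", "Bob", 5)]
network = build_network(ties)

# The strongest tie of an actor is the neighbor with the largest edge weight.
strongest = max(network["Bob"], key=network["Bob"].get)
```

Storing the graph as a dictionary of dictionaries keeps both the vertices (actors) and the weighted edges (ties) directly readable.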

5
Social Network Examples
  • Effects of urbanization on individual well-being
  • World political and economic system
  • Community elite decision-making
  • Social support, Group problem solving
  • Diffusion and adoption of innovations
  • Belief systems, Social influence
  • Markets, Sociology of science
  • Exchange and power
  • Email, Instant messaging, Newsgroups
  • Co-authorship, Citation, Co-citation
  • SocNet software, Friendster
  • Blogs and diaries, Blog quotes and links

6
History
  • Sociograms were invented in 1933 by Moreno.
  • In a sociogram, the actors are represented as
    points in a two-dimensional space. The location
    of each actor is significant. E.g. a central
    actor is plotted in the center, and others are
    placed in concentric rings according to
    distance from this actor.
  • Actors are joined with lines representing ties,
    as in a social network. In other words a social
    network is a graph, and a sociogram is a
    particular 2D embedding of it.
  • These days, sociograms are rarely used (most
    examples on the web are not sociograms at all,
    but networks). But methods like MDS
    (Multi-Dimensional Scaling) can be used to lay
    out Actors, given a vector of attributes about
    them.
  • Social Networks were studied early by researchers
    in graph theory (Harary et al. 1950s). Some
    social network properties can be computed
    directly from the graph.
  • Others depend on an adjacency matrix
    representation (Actors index rows and columns of
    a matrix, matrix elements represent the tie
    strength between them).
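
As a rough sketch of the adjacency matrix representation, assuming symmetric ties and invented actors and strengths (the function name adjacency_matrix is hypothetical):

```python
def adjacency_matrix(actors, ties):
    """Rows and columns are indexed by actors; entries hold tie strength."""
    index = {actor: i for i, actor in enumerate(actors)}
    n = len(actors)
    matrix = [[0] * n for _ in range(n)]
    for a, b, strength in ties:
        i, j = index[a], index[b]
        matrix[i][j] = strength
        matrix[j][i] = strength  # assumption: ties are symmetric
    return matrix

actors = ["A", "B", "C"]
ties = [("A", "B", 2), ("B", "C", 1)]
matrix = adjacency_matrix(actors, ties)

# One property that can be read straight off the matrix: each actor's degree.
degree = [sum(1 for strength in row if strength > 0) for row in matrix]
```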

7
Social Networks: Basic Questions
  • Balance: important in exchange networks
  • In a two-person network (dyad), exchange of
    goods, services and cash should be balanced.
  • More generally, exchanges of favors or
    support are likely to be quite balanced.
  • Role: what role does the actor perform in the
    network?
  • Role is defined in terms of actors'
    neighborhoods.
  • The neighborhood is the set of ties and actors
    connected directly to the current actor.
  • Actors with similar or identical neighborhoods
    are assigned the same role.
  • What is the related idea from semiotics?
  • Paradigm: interchangeability. Actors with the
    same role are interchangeable in the network.
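
A minimal sketch of role assignment, under the simplifying assumption that only actors with exactly identical neighbor sets share a role (the "similar neighborhoods" case is not handled). The example network and the name assign_roles are hypothetical.

```python
from collections import defaultdict

def assign_roles(graph):
    """Group actors whose sets of direct neighbors are identical."""
    roles = defaultdict(list)
    for actor, neighbors in graph.items():
        roles[frozenset(neighbors)].append(actor)
    # Sort for a stable, readable result.
    return sorted(sorted(group) for group in roles.values())

# Hypothetical network: two clerks who are each tied only to the manager.
graph = {
    "manager": {"clerk1", "clerk2"},
    "clerk1": {"manager"},
    "clerk2": {"manager"},
}
roles = assign_roles(graph)
```

The two clerks land in the same group, which matches the idea that actors with the same role are interchangeable.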

8
Social Networks: Basic Questions
  • Prestige: how important is the actor in the
    network?
  • Related notions are status and centrality.
  • Centrality reifies the notion of peripheral vs.
    central participation from communities of
    practice.
  • Key notions of centrality were developed in the
    1970s, e.g. eigenvector centrality by Bonacich.
  • Most of these measures were rediscovered as
    quality measures for web pages:
  • Indegree
  • PageRank ~ eigenvector centrality
  • HITS ~ two-mode eigenvector centrality
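
One way to compute an eigenvector-style centrality is power iteration on the adjacency matrix. The sketch below is an illustration only, not the exact formulation used by Bonacich or by PageRank; it adds the identity to the matrix (an assumption made here so the iteration also converges on bipartite graphs, without changing the eigenvectors).

```python
import math

def eigenvector_centrality(matrix, iterations=100):
    """Power iteration on (A + I); returns a unit-length score vector."""
    n = len(matrix)
    scores = [1.0] * n
    for _ in range(iterations):
        # Each actor's new score is its own score plus its neighbors' scores.
        updated = [
            scores[i] + sum(matrix[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in updated))
        scores = [x / norm for x in updated]
    return scores

# Three actors in a line: the middle actor should score highest.
path = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
centrality = eigenvector_centrality(path)
```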

9
Social Network Concepts
  • Actor
  • An actor is a basic component of a social
    network. Actors can be:
  • Individual people, Corporations, Nation-States,
    Social groups
  • Modes
  • If all the actors are of the same type, the
    network is called a one-mode network. If there
    are two groups of actors, then it is a two-mode
    network.
  • E.g. an affiliation network is a two-mode
    network. One mode is individuals, the other is
    groups to which they belong. Ties represent the
    relation "person A is a member of group B".
  • Ties
  • A tie is the relation between two actors. Common
    types of ties include:
  • Friendship, Amount of communication, Goods
    exchanged, Familial relation (kinship),
    Institutional relations
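
A two-mode affiliation network can be held as a mapping from groups to members. The sketch below also projects it onto a one-mode person-to-person network by counting shared groups; that projection step is a common follow-up that the slides do not describe, and the groups, people, and function name are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def project_to_people(memberships):
    """Count, for every pair of people, how many groups they share."""
    shared = Counter()
    for members in memberships.values():
        for a, b in combinations(sorted(members), 2):
            shared[(a, b)] += 1
    return shared

# Two-mode data: one mode is groups, the other is individuals.
memberships = {
    "chess club": {"Ann", "Bob"},
    "choir": {"Ann", "Bob", "Cat"},
}
co_membership = project_to_people(memberships)
```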

10
Practical issues: Boundaries and Samples
  • Because human relations are rich and unbounded,
    drawing meaningful boundaries for network
    analysis is a challenge.
  • There are two main approaches:
  • Realist: boundaries perceived by actors
    themselves, e.g. gang members or ACM members.
  • Nominalist: boundaries created by the researcher,
    e.g. people who publish in ACM CHI.
  • To deal with large networks, sampling is
    necessary. Unfortunately, randomly sampled graphs
    will typically have completely different
    structure. Why?
  • One approach to this is snowballing. You start
    with a random sample. Then extend with all actors
    connected by a tie. Then extend with all actors
    connected to the previous set by a tie, and so
    on.
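
Snowballing as described above might look like the following sketch. The parameter names (seed_size, waves) and the toy chain network are assumptions made for the example.

```python
import random

def snowball_sample(graph, seed_size, waves, rng=None):
    """Start from a random seed set, then add tied actors wave by wave."""
    if rng is None:
        rng = random.Random(0)
    sample = set(rng.sample(sorted(graph), seed_size))
    frontier = set(sample)
    for _ in range(waves):
        next_frontier = set()
        for actor in frontier:
            next_frontier.update(graph[actor])
        next_frontier -= sample      # keep only newly discovered actors
        sample |= next_frontier
        frontier = next_frontier
    return sample

# Toy network: five actors tied in a chain a-b-c-d-e.
chain = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
seeds_only = snowball_sample(chain, seed_size=1, waves=0)
full = snowball_sample(chain, seed_size=1, waves=4)
```

Unlike a uniform random sample of actors, every wave keeps the ties between sampled actors intact, so local structure survives.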

11
Social Network Analysis of 9/11 Terrorists
(www.orgnet.com)
Early in 2000, the CIA was informed of two
terrorist suspects linked to al-Qaeda. Nawaf
Alhazmi and Khalid Almihdhar were photographed
attending a meeting of known terrorists in
Malaysia. After the meeting they returned to Los
Angeles, where they had already set up
residence in late 1999.
12
Social Network Analysis of 9/11 Terrorists
  • What do you do with these suspects? Arrest or
    deport them immediately? No, we need to use them
    to discover more of the al-Qaeda network.
  • Once suspects have been discovered, we can use
    their daily activities to uncloak their network.
    Just like they used our technology against us, we
    can use their planning process against them.
    Watch them, and listen to their conversations to
    see...
  • who they call / email
  • who visits with them locally and in other cities
  • where their money comes from
  • The structure of their extended network begins to
    emerge as data is discovered via surveillance.

13
Social Network Analysis of 9/11 Terrorists
A suspect being monitored may have many contacts
-- both accidental and intentional. We must
always be wary of 'guilt by association'.
Accidental contacts, like the mail delivery
person, the grocery store clerk, and the neighbor,
may not be viewed with investigative interest.

Intentional contacts are like the late
afternoon visitor, whose car license plate is
traced back to a rental company at the airport,
where we discover he arrived from Toronto (got to
notify the Canadians) and his name matches a cell
phone number (with a Buffalo, NY area code) that
our suspect calls regularly. This intentional
contact is added to our map and we start tracking
his interactions -- where do they lead? As data
comes in, a picture of the terrorist organization
slowly comes into focus.

How do investigators know whether they are on to
something big? Often they don't. Yet in this case
there was another strong clue that Alhazmi and
Almihdhar were up to no good -- the attack on the
USS Cole in October of 2000. One of the chief
suspects in the Cole bombing, Khallad, was also
present along with Alhazmi and Almihdhar at the
terrorist meeting in Malaysia in January 2000.

Once we have their direct links, the next step is
to find their indirect ties -- the 'connections of
their connections'. Discovering the nodes and
links within two steps of the suspects usually
starts to reveal much about their network. Key
individuals in the local network begin to stand
out. In viewing the network map in Figure 2, most
of us will focus on Mohammed Atta because we now
know his history. The investigator uncloaking
this network would not be aware of Atta's
eventual importance. At this point he is just
another node to be investigated.
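
Finding every node within two steps of the known suspects is a breadth-first search with a depth limit. The sketch below uses placeholder labels (S1, S2, A, B, ...) rather than any real data, and the function name within_steps is an assumption.

```python
from collections import deque

def within_steps(graph, sources, max_steps):
    """Return {node: distance} for all nodes within max_steps of any source."""
    distance = {source: 0 for source in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

# Placeholder contact map: S1 and S2 are the two original suspects.
contacts = {
    "S1": ["A", "B"],
    "S2": ["B"],
    "A": ["S1", "C"],
    "B": ["S1", "S2"],
    "C": ["A", "D"],
    "D": ["C"],
}
reach = within_steps(contacts, ["S1", "S2"], max_steps=2)
```

Here D is three steps away, so it stays outside the two-step map.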
14
Social Network Analysis of 9/11 Terrorists
[Figure 2: network map showing the two suspects and ...]
15
Social Network Analysis of 9/11 Terrorists
[Figure 3 shows the direct ...]
16
Social Network Analysis of 9/11 Terrorists
  • We now have enough data for two key conclusions:
  • All 19 hijackers were within 2 steps of the two
    original suspects uncovered in 2000!
  • Social network metrics reveal Mohammed Atta
    emerging as the local leader
  • With hindsight, we have now mapped enough of the
    9-11 conspiracy to stop it. Again, the
    investigators are never sure they have uncovered
    enough information while they are in the process
    of uncloaking the covert organization. They also
    have to contend with superfluous data. This data
    was gathered after the event, so the
    investigators knew exactly what to look for.
    Before an event it is not so easy.
  • As the network structure emerges, a key dynamic
    that needs to be closely monitored is the
    activity within the network. Network activity
    spikes when a planned event approaches. Is there
    an increase of flow across known links? Are new
    links rapidly emerging between known nodes? Are
    money flows suddenly going in the opposite
    direction? When activity reaches a certain
    pattern and threshold, it is time to stop
    monitoring the network, and time to start
    removing nodes.
  • The author argues that this bottom-up approach of
    uncloaking a network is more effective than a top
    down search for the terrorist needle in the
    public haystack -- and it is less invasive of the
    general population, resulting in far fewer "false
    positives".

17
Social Network Analysis of Steroid Usage in
Baseball (www.orgnet.com)
When the Mitchell Report on steroid use in Major
League Baseball (MLB) was published, people were
surprised at who and how many players were
mentioned. The diagram below shows a human
network created from data found in the Mitchell
Report. Baseball players are shown as green
nodes. Those who were found to be providers of
steroids and other illegal performance enhancing
substances appear as red nodes. The links reveal
the flow of chemicals -- from provider to player.
18
Knowledge Management Examples
  • Managing the 21st Century Organization
  • Networks of Adaptive/Agile Organizations
  • Best Practice Organizational Network Mapping
  • Discovering Communities of Practice
  • Data-Mining E-mail
  • Finding Leaders on your Team
  • Post-Merger Integration
  • Knowledge Sharing in Organizations
  • Innovation happens at the Intersections
  • Partnerships and Alliances in Industry
  • Decision-Making in Organizations
  • New Organizational Structures

19
Knowledge Sharing in Organizations: Finding Experts

20
Knowledge Sharing Network: Finding Experts
(www.orgnet.com)

Organizational leaders are preparing for the
potential loss of expertise and knowledge flow
due to turnover, downsizing, outsourcing, and the
coming retirements of the baby boom generation.
The model network (previous chart) is used to
illustrate the knowledge continuity analysis
process. Each node in this sample network
represents a person who works in a knowledge
domain. Some people have more / different
knowledge than others.

Employees who will retire in 2 years or less have
their nodes colored red. Those who will retire in
3-4 years are colored yellow. Those retiring in 5
years or later are colored green. A gray, directed
line is drawn from the seeker of knowledge to the
source of expertise. A --> B indicates that A
seeks expertise / advice from B. Those with many
arrows pointing to them are sought often for
assistance.

The top subject matter experts -- SMEs -- in this
group are nodes 29, 46, 100, 41, 36 and 55. The
SMEs were discovered using a network metric in
InFlow that is similar to how the Google search
engine ranks web pages -- using both direct and
indirect links. Of the top six SMEs in this group,
half are colored red (100) or yellow (46, 55). The
loss of person 46 has the greatest potential for
knowledge loss: 90% of the network is within 3
steps of accessing this key knowledge source.
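
The "within 3 steps" figure can be approximated by following the seek-advice arrows backwards from an expert. The sketch below is not InFlow's metric; it is a plain reverse breadth-first search over an invented toy network, and the node labels and function name are hypothetical.

```python
from collections import deque

def reach_fraction(seeks, expert, max_steps=3):
    """Fraction of other people who can reach `expert` in max_steps or fewer.

    seeks maps each seeker to the people they ask for advice (A --> B).
    """
    sought_by = {}
    people = set()
    for seeker, sources in seeks.items():
        people.add(seeker)
        for source in sources:
            people.add(source)
            sought_by.setdefault(source, set()).add(seeker)

    # Walk the arrows backwards, starting from the expert.
    distance = {expert: 0}
    queue = deque([expert])
    while queue:
        person = queue.popleft()
        if distance[person] == max_steps:
            continue
        for seeker in sought_by.get(person, ()):
            if seeker not in distance:
                distance[seeker] = distance[person] + 1
                queue.append(seeker)

    others = len(people) - 1
    return (len(distance) - 1) / others if others else 0.0

# Toy advice network: n1 asks n46, n2 asks n1, and so on; n5 asks nobody.
seeks = {
    "n1": ["n46"],
    "n2": ["n1"],
    "n3": ["n2"],
    "n4": ["n3"],
    "n5": [],
}
fraction = reach_fraction(seeks, "n46", max_steps=3)
```

In this toy network n1, n2 and n3 are within 3 steps of n46, while n4 (4 steps) and n5 (no path) are not.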
21
Social Networks: Security and Privacy Issues
European Network and Information Security Agency
  • The European Network and Information Security
    Agency (ENISA) has released its first issue paper,
    "Security Issues and Recommendations for Online
    Social Networks".
  • http://www.enisa.europa.eu/doc/pdf/deliverables/enisa_pp_social_networks.pdf
  • Four groups of threats: privacy related threats,
    variants of traditional network and information
    security threats, identity related threats,
    social threats.
  • Recommendations are given for governments
    (oversight and adaptation of existing data
    protection legislation), companies that run such
    networks, technology developers, and research and
    standardisation bodies.
  • Some concerns: the recommendation to use
    automated filters against "offensive, litigious
    or illegal content" brings potential freedom of
    speech issues. European Digital Rights has
    started a campaign against a similar
    recommendation by the Council of Europe. The
    issue of portability of profiles and social
    graphs is also addressed. However, what is
    missing is that information about social links
    is not about only one user, but also about the
    others to whom he is linked. They have to agree
    if this information is moved to different
    platforms.

22
Social Networks: Security and Privacy Issues
Microsoft Recommendations
(http://www.microsoft.com/protect/yourself/personal/communities.mspx)
  • Online communities require you to provide
    personal information. Profiles are public.
    Comments you post are permanently recorded on the
    community site. You might even mention when you
    plan to be out of town.
  • E-mail and phishing scammers count on the
    appealing sense of trust that is often fostered
    in online communities to steal your personal
    information. The more you reveal in profiles and
    posts, the more vulnerable you are to scams,
    spam, and identity theft.
  • Here are some features to look for when you're
    considering joining an online community:
  • Privacy policies that explain exactly what
    information the service will collect and how it
    might be used.
  • User guidelines that outline a basic code of
    conduct for users on their sites. Sites have the
    option to penalize reported violators with
    account suspension or termination.
  • Special provisions for children and their
    parents, such as family-friendly options geared
    towards protecting children under a certain age.
  • Password protection to help keep your account
    secure.
  • E-mail address hiding, which lets you display
    only part of your e-mail address on the site's
    membership lists.
  • Filtering options: offered on blogging sites,
    these tools let you choose which subscribers can
    see what you've written.

23
Appendix: Social Networks and Surveillance
Evaluating Suspicion by Association
Ryan Layfield, PhD Student; Prof. Bhavani
Thuraisingham
September 2006

24
Overview
  • Introduction
  • Our Goal
  • System Design
  • Social Networks
  • Threat Detection
  • Correlation Analysis
  • The Experiment
  • Setup
  • Current Results
  • Issues
  • Future Work
  • Directions

25
Introduction
  • Automated message surveillance is essential to
    communication monitoring
  • Widespread use of electronic communication
  • Exponential data growth
  • Impossible to sift through all by hand
  • Going beyond basic surveillance
  • Identifying groups rather than individuals
  • Monitoring conversations rather than messages

26
Our Approach
  • Design new techniques and apply existing
    algorithms to:
  • Create a machine-understandable model of existing
    social networks
  • Identify abnormal conversations and behavior
  • Monitor a given communications system in
    real-time
  • Continuously learn and adapt to a dynamic
    environment
  • System Design: three major components
  • Social Network Modeler
  • Initial Activity Detector
  • Correlated Activity Investigator
  • Assumptions
  • Individuals engaged in suspicious or undesirable
    behavior rarely act alone
  • We can infer that those associated with a person
    positively identified as suspicious have a high
    probability of being either
  • Accomplices (participants in suspicious activity)
  • Witnesses (observers of suspicious activity)
  • Making these assumptions, we create a context of
    association between users of a communication
    network

27
Social Networks
  • Within our model
  • Every node is a unique user
  • Every message creates or strengthens a link
    between nodes
  • Over time, the network changes
  • Frequent communication leads to stronger links
  • Intermittent messaging implies weakening social
    ties
  • The strength of the link implies how strong an
    association between individuals is
  • From this data, we can theoretically identify
  • Hubs
  • Groups
  • Liaisons
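
A minimal sketch of such a model follows. The slides do not say how links strengthen or weaken, so the rules here are assumptions: each message adds 1.0 to the link, and every link is multiplied by a decay factor at the end of each time period. The class and method names are hypothetical.

```python
class LinkModel:
    """Every node is a user; messages strengthen links, time weakens them."""

    def __init__(self, decay=0.5):
        self.decay = decay    # assumed multiplicative decay per period
        self.strength = {}    # frozenset({user, user}) -> link strength

    def record_message(self, sender, recipient):
        key = frozenset((sender, recipient))
        self.strength[key] = self.strength.get(key, 0.0) + 1.0

    def end_period(self):
        # Links that are not refreshed by new messages fade over time.
        for key in self.strength:
            self.strength[key] *= self.decay

    def link(self, a, b):
        return self.strength.get(frozenset((a, b)), 0.0)

model = LinkModel(decay=0.5)
model.record_message("alice", "bob")
model.record_message("alice", "bob")
model.record_message("alice", "carol")
model.end_period()
model.record_message("bob", "alice")

frequent = model.link("alice", "bob")
intermittent = model.link("alice", "carol")
```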

28
Social Networks
29
Threat Detection and Correlation Analysis
  • Every message sent is scrutinized in the interest
    of identifying suspicious communication
  • Keyword analysis
  • Prior context (i.e. previous message content)
  • When a detection algorithm yields a strong
    result, a token is created
  • The token is created at the origin and passed to
    the recipient(s)
  • Existing tokens, if any, are cloned instead
  • The result is a web that potentially reflects the
    dissemination of suspicious information activity
  • Future messages with similar suspicious topics
    are not always identifiable with the same
    initial techniques
  • Quick replies
  • Pronoun use
  • Assumption that recipient is aware of topic
  • If a token is present at the sender when a
    message is sent
  • The message the token is associated with and the
    new message are analyzed
  • If analysis yields a strong match, the token is
    further cloned and passed to the recipient
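
The token mechanism can be sketched roughly as follows. The slides do not specify the detection or correlation algorithms, so both are stand-ins chosen for this example: detection is a lookup against a hypothetical keyword list, and correlation is word overlap (Jaccard similarity) against a made-up threshold.

```python
SUSPICIOUS_KEYWORDS = {"detonator", "payload"}  # hypothetical watch list

def words(text):
    return set(text.lower().split())

def is_suspicious(text):
    """Stand-in for the initial detection algorithm."""
    return bool(words(text) & SUSPICIOUS_KEYWORDS)

def correlated(text, earlier, threshold=0.2):
    """Stand-in correlation test: Jaccard overlap of the two word sets."""
    a, b = words(text), words(earlier)
    return bool(a | b) and len(a & b) / len(a | b) >= threshold

def process(messages):
    """messages: list of (sender, recipients, text). Returns user -> tokens."""
    tokens = {}  # each token is represented by the message text it came from
    for sender, recipients, text in messages:
        passed = []
        if is_suspicious(text):
            passed.append(text)        # new token created at the origin
        for earlier in tokens.get(sender, []):
            if correlated(text, earlier):
                passed.append(earlier)  # existing token is cloned
        for token in passed:
            for user in [sender] + recipients:
                held = tokens.setdefault(user, [])
                if token not in held:
                    held.append(token)
    return tokens

messages = [
    ("u1", ["u2"], "the detonator is at the warehouse"),
    ("u2", ["u3"], "is it still at the warehouse"),  # no keyword, but related
    ("u4", ["u5"], "lunch at noon tomorrow"),
]
tokens = process(messages)
```

The second message contains no watch-list keyword, yet the token still reaches u3 because the message overlaps with the one u2's token is tied to. Note that much of that overlap comes from common words, which is the weakness reported later in the results.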

30
The Experiment
  • Rare words shared between two or more messages
    are candidates for keyword analysis, but they are
    not always easily sifted from noise
  • Noise within text-based messages comes in a
    variety of forms
  • Misspelled words
  • Unusual word choice
  • Incompatible variations of the same language
    (e.g. British vs. American English)
  • Unexpected language
  • However, we do not want to eliminate potential
    keywords
  • Document names
  • Terminology specific to a subject
  • Buzz words
  • We proposed an experiment that attempts to
    eliminate false positives due to noisy data while
    strengthening and expanding our correlation
    techniques

31
Setup
  • Tools
  • Running word rank database
  • Implementation of word set theory infrastructure
  • JAMA Matrix Library
  • Singular Value Decomposition
  • Our Approach
  • Apply SVD noise filtering based on 100 messages
  • Analyze word frequency correlation between
    current message and prior suspicious messages
  • Generate a score based on the results

32
Setup
  • Construct a matrix based on the last 100 messages

[Figure: matrix of words by messages; one axis indexes messages, the other orders words from more common to less common]
33
Setup
  • Decompose and rebuild

A = U Σ V^T
Eliminate weak singular values
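
The decompose-and-rebuild step amounts to a low-rank approximation: keep the strongest singular values and drop the rest. The slides use the JAMA library; the sketch below is instead a dependency-free Python illustration that finds singular triplets one at a time by power iteration with deflation. It assumes the starting vector is not orthogonal to the dominant direction and that the kept singular values are distinct, and the toy matrix is invented.

```python
import math

def top_singular_triplet(a, iterations=200):
    """Power iteration on A^T A: returns (sigma, u, v) for the largest sigma."""
    rows, cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(a[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma else [0.0] * rows
    return sigma, u, v

def low_rank_rebuild(a, keep):
    """Rebuild A from its `keep` strongest singular values only."""
    rows, cols = len(a), len(a[0])
    residual = [row[:] for row in a]
    rebuilt = [[0.0] * cols for _ in range(rows)]
    for _ in range(keep):
        sigma, u, v = top_singular_triplet(residual)
        for i in range(rows):
            for j in range(cols):
                term = sigma * u[i] * v[j]
                rebuilt[i][j] += term
                residual[i][j] -= term  # deflate: remove the found component
    return rebuilt

# Toy matrix with one strong pattern (3.0) and one weak pattern (1.0).
counts = [
    [3.0, 0.0],
    [0.0, 1.0],
]
filtered = low_rank_rebuild(counts, keep=1)
restored = low_rank_rebuild(counts, keep=2)
```

With keep=1 the weak component is eliminated; with keep=2 the original matrix is recovered.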
34
Setup
[Scoring formula shown as an image; its annotations:]
  • Pulled from messages j and k
  • Raw total score for word w_i
  • Pulled from running word database
  • Counts only intersection of words
  • Predefined fixed threshold
35
Current Results
  • Method is not currently accurate
  • Large fluctuations
  • Correlation easily swayed by plethora of common
    words
  • Uncommon words not given enough weight
  • 1000 messages evaluated, first 100 used to seed
    word ranks.
36
Issues and Directions
  • Word frequencies fluctuate wildly during
    beginning of experiment (0.0 to 10.0)
  • Extreme cost for current construction methods and
    computation
  • Filtering context limited to recent global
    history
  • Affected by large bodies of text
  • Future Directions include:
  • Tap potential of existing matrix for further
    analysis
  • Adaptive filtering feedback algorithms
  • Speed improvements to accommodate real-time
    streams
  • Flexible communication platform monitoring
  • Addition of pipe architecture for modular threat
    detection and correlation