Preventing Information Leaks in Email - PowerPoint PPT Presentation

Slides: 32
Provided by: vitorrocha
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes


1
Preventing Information Leaks in Email
  • Vitor
  • Text Learning Group Meeting
  • Jan 18, 2007 SCS/CMU

2
Outline
  • Motivation
  • Idea and method
  • Leak Criteria, text-based baselines
  • Crossvalidation, network features
  • Results
  • Finding Real Leaks in the Enron Data
  • Predicting Real leaks in the Enron Data
  • Smoothing the leak criteria
  • Related Work
  • Conclusions

3
Information Leaks
  • What's being leaked?
  • Credit card numbers, new product information
  • Social Security Numbers
  • Software pre-release versions
  • Business Strategy, Health records, etc.
  • Multi-million dollar industry (ILDP)
  • Anonymity and Privacy of data
  • Information Leakage Detection and Prevention
    (from Wikipedia)

4
Information Leak using Email
  • Hard to estimate, but according to PortAuthority
    Technologies:

[Figure: How data is being leaked]
5
Email leaks make good headlines. Just Google it.
  • California Power-Buying Data Disclosed in
    Misdirected E-Mail
  • Leaked email exposes MS charity as PR exercise
  • Bush Glad FEMA Took Blame for Katrina, According
    to Leaked Email

6
More email leaks in the headlines
  • Dell leaked email shows channel plans - Direct
    threat haunts dealers - A leaked email reveals
    Dell wants to get closer to UK resellers.
  • Business group say Liberals handled leaked email
    badly.
  • Is Leaked eMail a SCO-Microsoft Connection?
  • Leaked email may be behind Morgan Stanley's Asia
    economist's sudden resignation

7
Detecting Email Leaks
Email leak: an email accidentally sent to the wrong
person
  • Idea
  • Goal: to detect emails accidentally sent to the
    wrong person
  • Generate artificial leaks: email leaks may be
    simulated by various criteria, such as a typo,
    similar last names, identical first names,
    aggressive auto-completion of addresses, etc.
  • Method: LOOK FOR OUTLIERS.

8
Avoiding Expensive Email Errors
  • Method
  • Create simulated/artificial email recipients
  • Build a model for (msg.recipients): train a
    classifier on real data to detect synthetically
    created outliers (added to the true recipient
    list).
  • Features: textual (subject, body) and network
    (frequencies, co-occurrences, etc.).
  • Rank potential outliers: detect the outlier and
    warn the user based on confidence.

P(rec_t): probability that recipient t is an outlier,
given the message text and the other recipients in
the message.
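
A minimal sketch of the rank-and-warn step on this slide. The scores stand in for P(rec_t); the function name flag_outliers, the threshold value and the example addresses are assumptions made for illustration and do not come from the original system.

```python
def flag_outliers(outlier_scores, threshold=0.8):
    """Rank recipients from most to least suspicious and pick those to warn about.

    outlier_scores: dict mapping recipient address -> estimated P(rec_t),
    the probability that the recipient is an outlier (hypothetical values).
    threshold: assumed confidence level at or above which the user is warned.
    """
    ranked = sorted(outlier_scores.items(), key=lambda kv: kv[1], reverse=True)
    warnings = [addr for addr, score in ranked if score >= threshold]
    return ranked, warnings


# Hypothetical scores for a three-recipient message.
scores = {
    "alice@example.com": 0.05,
    "bob@example.com": 0.92,
    "carol@example.com": 0.10,
}
ranked, warnings = flag_outliers(scores)
```

With these made-up scores only the second address crosses the assumed threshold, so it is the only one the user would be warned about.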
9
Method
[Figure: Flowchart of the leak detection application]
  • Input: all messages sent and received by the user.
  • The leak classifier is trained with real email data
    combined with simulated outliers.
  • A separate module produces the simulated outliers.
    It simulates the most common types of mistake
    that can cause a leak: for instance, email
    addresses with the same initial letters, or
    addresses with very close spelling.
10
Leak Criteria: how to generate (artificial)
outliers
  • Several options
  • Frequent typos, same/similar last names,
    identical/similar first names, aggressive
    auto-completion of addresses, etc.
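  • In this paper, we adopted the 3g-address
    criterion
  • On each trial, one of the msg recipients is
    randomly chosen and an outlier is generated
    according to the rules in the figure

[Figure: 3g-address rules, illustrated with the address Marina.wang@enron.com. Numbered rules 1, 2 and 3 are not transcribed. Else: randomly select an address book entry.]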
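
The outlier generation step can be sketched as below. The three numbered rules of the 3g-address criterion did not survive in this transcript, so the sketch assumes a simplified reading: prefer an address book entry that shares its first three characters with the chosen recipient, else fall back to a random address book entry. The function name and the sample addresses (other than the one on the slide) are hypothetical.

```python
import random


def generate_outlier(recipients, address_book, rng=random):
    """Return (chosen_recipient, simulated_leak_address).

    Assumed reading of the 3g-address criterion: prefer an address-book
    entry whose first three characters match those of a randomly chosen
    true recipient; otherwise pick any address-book entry at random.
    """
    chosen = rng.choice(sorted(recipients))
    # A simulated leak must not already be a true recipient of the message.
    candidates = [a for a in sorted(address_book) if a not in recipients]
    if not candidates:
        return chosen, None
    prefix = chosen.lower()[:3]
    similar = [a for a in candidates if a.lower().startswith(prefix)]
    pool = similar if similar else candidates
    return chosen, rng.choice(pool)


address_book = {
    "marina.wang@enron.com",
    "mark.smith@enron.com",
    "maria.lopez@enron.com",
    "john.doe@enron.com",
}
recipients = {"marina.wang@enron.com"}
chosen, leak = generate_outlier(recipients, address_book, random.Random(0))
```

Passing a seeded random.Random instance keeps a trial reproducible, which helps when a different set of outliers has to be generated on each trial.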
12
Dataset: Enron Email Collection
  • Why?
  • Large, thousands of messages
  • Natural email, not email lists
  • Real work environment
  • Free
  • No privacy concerns
  • More than 100 users (with sent+received msgs)

13
Enron Data Preprocessing 1
  • Set up a realistic temporal setting
  • For each user, 10 (most recent) sent messages
    will be used as test
  • An Address Book was extracted for every user:
    the list of all recipients in that user's sent
    messages.
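
A small sketch of the temporal setup and the address book extraction described on this slide. The message layout (dicts with "timestamp" and "recipients" keys) and the function names are assumptions; the slide does not say whether the address book is built from training messages only, so the helper simply takes whichever messages it is given.

```python
def temporal_split(sent_messages, n_test=10):
    """Split one user's sent messages into (train, test) by time.

    sent_messages: list of dicts with hypothetical keys "timestamp" and
    "recipients". The n_test most recent messages form the test set.
    """
    ordered = sorted(sent_messages, key=lambda m: m["timestamp"])
    if n_test <= 0:
        return ordered, []
    return ordered[:-n_test], ordered[-n_test:]


def extract_address_book(sent_messages):
    """Address book: every recipient seen in the given sent messages."""
    book = set()
    for msg in sent_messages:
        book.update(msg["recipients"])
    return book


# 25 made-up messages addressed to three rotating recipients.
msgs = [
    {"timestamp": t, "recipients": ["u%d@example.com" % (t % 3)]}
    for t in range(25)
]
train, test = temporal_split(msgs)
book = extract_address_book(msgs)
```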

14
Enron Data Preprocessing 2
  • ISI version of Enron
  • Remove repeated messages and inconsistencies
  • Disambiguate Main Enron addresses
  • List provided by Corrada-Emmanuel from UMass
  • Bag-of-words
  • Messages were represented as the union of BOW of
    body and BOW of subject
  • Some stop words removed
  • Self-addressed messages were removed
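
One way to picture the bag-of-words representation is the sketch below. The tokenizer, the tiny stop word list and the choice to add term counts from subject and body are all assumptions; the slide only says that the two bags were combined and that some stop words were removed.

```python
import re
from collections import Counter

# Assumed stop word list; the list used in the original work is not given.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}


def bag_of_words(text):
    """Lowercase, tokenize and count terms, skipping stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def message_bow(subject, body):
    """Combine the subject and body bags into one bag for the message."""
    return bag_of_words(subject) + bag_of_words(body)


bow = message_bow("Meeting", "The meeting is moved")
```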

15
Experiments using Textual Features only
  • Three Baseline Methods
  • Random
  • Rank recipient addresses randomly
  • Cosine or TfIdf Centroid
  • Create a TfIdf centroid for each user in the
    Address Book. The centroid of user1 is the sum of
    all training messages (in TfIdf vector format)
    that were addressed to user1. For testing, rank
    according to the cosine similarity between the
    test message and each centroid.
  • Knn-30
  • Given a test msg, get 30 most similar msgs in
    training set. Rank according to sum of
    similarities of a given user on the 30-msg set.

16
Experiments using Textual Features only
Email leak prediction results: Prec@1 over 10
trials.
On each trial, a different set of outliers is
generated.
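
For reference, Prec@1 can be computed as sketched below, assuming each test message yields a list of recipients ranked from most to least suspicious and exactly one simulated leak. The function name and the sample data are hypothetical.

```python
def precision_at_1(rankings, true_leaks):
    """Fraction of messages whose top-ranked recipient is the true leak.

    rankings: one list per test message, most suspicious recipient first.
    true_leaks: the simulated leak address for each test message.
    """
    if not true_leaks:
        return 0.0
    hits = sum(
        1 for ranking, leak in zip(rankings, true_leaks)
        if ranking and ranking[0] == leak
    )
    return hits / len(true_leaks)


rankings = [["x", "a", "b"], ["a", "y", "c"], ["z", "a"], ["b", "w"]]
true_leaks = ["x", "y", "z", "w"]
p1 = precision_at_1(rankings, true_leaks)
```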
17
Network Features
  • How frequently a recipient was addressed
  • How these recipients co-occurred in the training
    set

18
Using Network Features
  • Frequency features
  • Number of received messages (from this user)
  • Number of sent messages (to this user)
  • Number of sent+received messages
  • Co-Occurrence Features
  • Number of times a user co-occurred with all other
    recipients. Co-occur means two recipients were
    addressed in the same message in the training
    set
  • Max3g features
  • For each recipient R, find Rm (address with max
    score from 3g-address list of R), then use
    score(R)-score(Rm) as feature. Scores come from
    the CV10 procedure. Leak-recipient scores are
    likely to be smaller than their 3g-address
    highest score.
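
The network features on this slide can be sketched as follows. The message layout, the function names, the prefix-based definition of the 3g-address list, and the value 0.0 used when a recipient has no 3g-address neighbors are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequency_features(sent_msgs, received_msgs):
    """Count, per address, messages sent to it and messages received from it."""
    sent_to = Counter(r for m in sent_msgs for r in set(m["recipients"]))
    received_from = Counter(m["sender"] for m in received_msgs)
    return sent_to, received_from


def cooccurrence_counts(sent_msgs):
    """Count how often each pair of addresses appears in the same message."""
    pairs = Counter()
    for m in sent_msgs:
        for a, b in combinations(sorted(set(m["recipients"])), 2):
            pairs[(a, b)] += 1
    return pairs


def cooccurrence_feature(candidate, other_recipients, pairs):
    """Total co-occurrence of a candidate with the other recipients of a message."""
    return sum(
        pairs[tuple(sorted((candidate, o)))]
        for o in other_recipients if o != candidate
    )


def max3g_feature(candidate, scores, address_book):
    """score(R) - score(Rm), where Rm is the best-scoring address that shares
    its first three characters with R (assumed meaning of the 3g-address list)."""
    prefix = candidate.lower()[:3]
    rivals = [
        scores.get(a, 0.0) for a in address_book
        if a != candidate and a.lower().startswith(prefix)
    ]
    if not rivals:
        return 0.0  # assumed default when there is no similar address
    return scores.get(candidate, 0.0) - max(rivals)


sent = [{"recipients": ["ann", "bob"]}, {"recipients": ["ann", "bob", "carl"]}]
received = [{"sender": "ann"}]
sent_to, received_from = frequency_features(sent, received)
pairs = cooccurrence_counts(sent)
carl_cooc = cooccurrence_feature("carl", ["ann", "bob"], pairs)
m3g = max3g_feature("marina", {"marina": 0.2, "mark": 0.9}, ["marina", "mark", "john"])
```

A negative Max3g value means a similarly spelled address scores higher than the candidate, which matches the slide's remark that leak-recipient scores tend to be smaller than their highest 3g-address score.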

19
To combine textual features with network
features: crossvalidation
  • Training
  • Use Knn-30 on 10-Fold crossvalidation setting to
    get textual score of each user for all training
    messages
  • Turn each train example into R binary examples,
    where R is the number of recipients of the
    message.
  • R-1 positive (the real recipients)
  • 1 negative (leak-recipient)
  • Augment textual score with network features
  • Quantize features
  • Train a classifier (VP5): classification-based
    ranking scheme
  • (VP5 = Voted Perceptron with 5 passes over the
    training set)
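
The training steps above can be sketched with a small voted perceptron. This is a generic textbook-style version written for illustration, not the authors' implementation; the two-feature vectors (textual score, frequency), the bias term and all names are assumptions, and feature quantization is left out.

```python
def make_binary_examples(features_by_recipient, leak):
    """One example per recipient: +1 for real recipients, -1 for the leak."""
    return [(x, -1 if r == leak else 1) for r, x in features_by_recipient.items()]


def train_voted_perceptron(examples, passes=5):
    """Return a list of (weights, bias, survival_count) tuples."""
    dim = len(examples[0][0])
    w, b, c = [0.0] * dim, 0.0, 0
    survivors = []
    for _ in range(passes):
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:
                # Mistake: store the current vector with its vote count, then update.
                survivors.append((w, b, c))
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                c = 1
            else:
                c += 1
    survivors.append((w, b, c))
    return survivors


def voted_score(survivors, x):
    """Weighted vote of all stored perceptrons; low values suggest a leak."""
    total = 0
    for w, b, c in survivors:
        activation = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += c if activation > 0 else -c
    return total


# Hypothetical features per recipient: [textual score, frequency].
feats = {"ann": [0.9, 5.0], "bob": [0.8, 3.0], "zed": [0.1, 0.0]}
examples = make_binary_examples(feats, leak="zed")
model = train_voted_perceptron(examples, passes=5)
most_suspicious = min(feats, key=lambda r: voted_score(model, feats[r]))
```

Using the vote total as a score, rather than only its sign, gives the ranking that a classification-based ranking scheme needs.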

20
Results: Textual + Network Features
21
Finding Real Leaks in Enron
  • How can we find it?
  • Grep for "mistake", "sorry" or "accident". We
    were looking for sentences like "Sorry. Sent this
    to you by mistake. Please disregard.", "I
    accidentally send you this reminder", etc.
  • How many can we find?
  • Dozens of cases.
  • Unfortunately, most of these cases originated
    from non-Enron email addresses or from an
    Enron email address that is not one of the 151
    Enron users whose messages were collected
  • Our method requires a collection of sent
    (received) messages from a user. Only 150 Enron
    users.
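
The keyword search can be approximated with a regular expression, as in the sketch below. The pattern, the function name and the message IDs are assumptions; the original search was a plain grep over the corpus, and every hit still needs manual inspection.

```python
import re

# Matches "mistake", "sorry", or any word starting with "accident".
LEAK_PATTERN = re.compile(r"\b(mistake|sorry|accident\w*)\b", re.IGNORECASE)


def find_candidate_leaks(messages):
    """Return the IDs of messages whose body contains one of the keywords.

    messages: iterable of (message_id, body) pairs (hypothetical layout).
    """
    return [msg_id for msg_id, body in messages if LEAK_PATTERN.search(body)]


messages = [
    ("m1", "Sorry. Sent this to you by mistake. Please disregard."),
    ("m2", "See you at the meeting."),
    ("m3", "I accidentally send you this reminder"),
]
hits = find_candidate_leaks(messages)
```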

22
Finding Real Leaks in Enron
  • Found 2 good cases
  • Message germanyc/sent/930: 20 recipients; the
    leak is alex.perkins@
  • Message kitchen-l/sent items/497: 44 recipients;
    the leak is rita.wynne@
  • Prepared training data accordingly (90/10 split),
    with no simulated leaks added

23
Results: Finding Real Leaks in Enron
  • Very Disappointing!!
  • Reason: alex.perkins@ and rita.wynne@ were never
    observed in the training set!

[Table: Prec@1 and Average Rank over 100 trials; values not transcribed]
24
Smoothing the leak generation
  • Sampling from random unseen recipients with
    probability a

[Figure: two branches, taken with probabilities 1-a and a. With probability a: generate a random email address NOT in the Address Book.]
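
A sketch of the smoothed generation step. With probability a it returns a made-up address that is not in the address book; otherwise it falls back to a plain random address book entry, which here stands in for whatever leak criterion is otherwise in use. The placeholder address format and all names are assumptions.

```python
import random


def random_unseen_address(address_book, rng):
    """Generate a placeholder address that is guaranteed not to be in the book."""
    while True:
        addr = "unseen%06d@example.com" % rng.randrange(10**6)
        if addr not in address_book:
            return addr


def generate_smoothed_outlier(recipients, address_book, a, rng=random):
    """With probability a return an unseen address, else an address-book entry."""
    if rng.random() < a:
        return random_unseen_address(address_book, rng)
    candidates = sorted(set(address_book) - set(recipients))
    return rng.choice(candidates) if candidates else None


book = {"ann@enron.com", "bob@enron.com", "carl@enron.com"}
recips = {"ann@enron.com"}
always_unseen = generate_smoothed_outlier(recips, book, 1.0, random.Random(1))
never_unseen = generate_smoothed_outlier(recips, book, 0.0, random.Random(1))
```

Setting a to 0 recovers generation from the address book only, while larger values expose the classifier to recipients it has never seen in training.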
25
Some Results
  • Kitchen-l has 4 unseen addresses out of the 44
    recipients,
  • Germany-c has only one.

26
Mixture parameter a
27
Mixture parameter a
28
Back to the simulated leaks
29
What's next
  • Modeling
  • Better, more elegant model
  • Email Server side application
  • Predict based on all users on mail server
  • In companies, use info from all email users
  • Privacy issues
  • Integration with cc-prediction

30
Related Work
  • Email Privacy Enforcement System
  • Boufaden et al. (CEAS-2005): used information
    extraction techniques and domain knowledge to
    detect privacy breaches via email in a university
    environment. Breaches: student names, student
    grades and student IDs.
  • CC Prediction
  • Pal & McCallum (CEAS-06): the counterpart problem,
    prediction of the most likely intended recipients
    of an email msg. One single user, limited
    evaluation, not public data
  • Expert finding in Email
  • Dom et al. (SIGMOD-03), Campbell et al. (CIKM-03)
  • Balog & de Rijke (WWW-06), Balog et al.
    (SIGIR-06)
  • Soboroff, Craswell, de Vries (TREC-Enterprise
    2005-06-07): Expert finding task on the W3C
    corpus

31
Thanks!
  • Questions?
  • Comments?
  • Ideas?

33
http://www.workshare.com/company/blog/default.aspx?11postid18titleData-Leak-Bank-Loses-IPO-Role
  • Data Leak: Bank Loses IPO
  • Deutsche Bank has lost its spot among the
    underwriters of Hertz Global Holdings Inc.'s
    initial public offering after several e-mails
    discussing the $1.5 billion initial public
    offering were inadvertently sent by the bank.
    This security breach will not only affect them
    financially, but will no doubt weaken their
    ability to capture new business in the future. A
    simple data security policy within Protect
    Enterprise Suite would have stopped this leak
    from occurring.
  • Source: Bloomberg.com

34
http://hrwatch.counciloned.com/060705/Email.htm
35
Cases of Malicious Leaks
  • In October 2002, an email sent from Merrill Lynch
    to Standard & Poor's, in which it requested an
    assessment of Commerzbank, was leaked, causing
    the latter to issue a statement regarding its
    financial robustness.
  • In October 2002, an internal Dell Computer
    document regarding its plan to enter the PDA
    market was leaked and posted on a French Web
    site.
  • In February 2004, portions of the Windows 2000
    and Windows NT 4 source code databases were
    leaked, apparently by one of its outsourcers for
    code development.
  • In September 2004, a former helpdesk employee at
    Teledata Communications pleaded guilty to a
    scheme to steal and sell 30,000 consumer credit
    reports of the company's customers.
  • In October 2004, confidential information about
    145,000 American residents was leaked from
    identification and credential verification
    services provider ChoicePoint. The company
    registered $11.4 million in charges related to
    this incident.
  • In December 2004, Apple filed a lawsuit against
    three members of its Apple Developer Connection
    network, who allegedly distributed a pre-release
    version of "Tiger," the company's next major Mac
    OS X release, through the P2P filesharing network
    BitTorrent.

36
http://flagrantharbour.com/?p=206
  • Sun Hung Kai email leak
  • Another classic example of a stupendous security
    breach, this time by Sun Hung Kai's online
    brokerage operation, SHK Online.
  • Hundreds of account holders received the
    following email, sent on April 4:
  • Dear Client,
  • We notice that there has been no securities
    trading activities in your account with SHK
    ONLINE (SECURITIES) LIMITED for a long time and
    the account is currently showing a ZERO balance.
    As part of our company's regular account
    maintenance, we will classify such accounts as
    INACTIVE. Should the account remain in such
    INACTIVE status without any balance and/or
    securities trading activities by 4 May 2006, your
    account will be closed automatically without
    further notice.
  • Should you wish to reactivate your account,
    please contact our customer service hotline at
    (852) 2822 5001 or email us at
    enquiry@shkonline.com as soon as possible for
    assistance.
  • Regards,
  • Customer Service Department
  • SHK Online (Securities) Ltd, Level 11, One Pacific
    Place, 88 Queensway, Hong Kong.
  • Tel: (852) 2822-5001, Fax: (852) 2822-5998
  • "What's wrong with this?" I hear you ask. And how
    do I know how many people received it?
  • What is wrong is that the email addresses of the
    recipients were all congregated together in the
    TO field. The recipients were not BCC'd, nor were
    they sent this newsletter-style using a mailer.