Title: Preventing Information Leaks in Email
1Preventing Information Leaks in Email
- Vitor
- Text Learning Group Meeting
- Jan 18, 2007 SCS/CMU
2Outline
- Motivation
- Idea and method
- Leak Criteria, text-based baselines
- Crossvalidation, network features
- Results
- Finding Real Leaks in the Enron Data
- Predicting Real leaks in the Enron Data
- Smoothing the leak criteria
- Related Work
- Conclusions
3Information Leaks
- What's being leaked?
- Credit card, New products information
- Social Security Numbers
- Software pre-release versions
- Business Strategy, Health records, etc.
- Multi-million dollar industry (ILDP)
- Anonymity and Privacy of data
- Information Leakage Detection and Prevention
(from Wikipedia)
4Information Leak using Email
- Hard to estimate, but according to PortAuthority Technologies:
[Figure: How data is being leaked]
5Email Leaks make good headlines. Just google it
- California Power-Buying Data Disclosed in Misdirected E-Mail
- Leaked email exposes MS charity as PR exercise
- Bush Glad FEMA Took Blame for Katrina, According to Leaked Email
6More Email leaks in the headlines
- Dell leaked email shows channel plans - Direct threat haunts dealers - A leaked email reveals Dell wants to get closer to UK resellers.
- Business group say Liberals handled leaked email badly.
- Is Leaked eMail a SCO-Microsoft Connection?
- Leaked email may be behind Morgan Stanley's Asia economist's sudden resignation
7Detecting Email Leaks
Email Leak: an email accidentally sent to the wrong person
- Idea
- Goal: to detect emails accidentally sent to the wrong person
- Generate artificial leaks: email leaks may be simulated by various criteria, such as a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc.
- Method: look for outliers.
8Avoiding Expensive Email Errors
- Method
- Create simulated/artificial email recipients
- Build model for (msg.recipients): train a classifier on real data to detect synthetically created outliers (added to the true recipient list).
- Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
- Rank potential outliers
- Detect the outlier and warn the user based on confidence.
P(rec_t): probability that recipient t is an outlier, given the message text and the other recipients in the message.
9Method
All messages sent and received by the user
Leak classifier is trained with real email data
combined with simulated outliers.
This module produces the simulated outliers.
It simulates the most common types of mistake
that can cause a leak. For instance, email
addresses with the same initial letters, or
addresses with very close spelling, etc.
Flowchart of Leak Detection Application
10Leak Criteria: how to generate (artificial) outliers
- Several options
- Frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc.
- In this paper, we adopted the 3g-address criteria
- On each trial, one of the msg recipients is randomly chosen and an outlier is generated according to the diagram:
[Diagram: example address Marina.wang_at_enron.com with numbered steps 1, 2, 3 (step text not recoverable); else branch: randomly select an address book entry]
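The outlier generation step can be sketched roughly as follows. The slide does not spell out the 3g-address steps, so this sketch assumes that a "3g-address" is an address-book entry sharing at least one character 3-gram with the chosen recipient, and that the fallback is a random address-book entry. The function names and the example addresses other than Marina Wang's are hypothetical, and addresses are written with "@" instead of "_at_".

```python
import random

def char_trigrams(address):
    """Character 3-grams of the local part (before '@') of an address."""
    local = address.lower().split("@")[0]
    return {local[i:i + 3] for i in range(len(local) - 2)}

def three_gram_candidates(target, address_book):
    """Address-book entries, other than target, sharing a 3-gram with target."""
    grams = char_trigrams(target)
    return [a for a in address_book
            if a != target and grams & char_trigrams(a)]

def generate_outlier(recipients, address_book, rng=random):
    """Pick one true recipient at random and return a simulated leak address.

    Assumption: prefer a look-alike (3-gram sharing) address; otherwise
    fall back to a random address-book entry, as in the slide's else branch.
    """
    target = rng.choice(recipients)
    pool = [a for a in three_gram_candidates(target, address_book)
            if a not in recipients]
    if not pool:
        pool = [a for a in address_book if a not in recipients]
    return rng.choice(pool) if pool else None

if __name__ == "__main__":
    book = ["marina.wang@enron.com", "marina.guzman@enron.com",
            "bob.lee@enron.com"]
    print(generate_outlier(["marina.wang@enron.com"], book))
```

Passing a seeded `random.Random` as `rng` makes each trial reproducible, which helps when a different set of outliers is generated per trial.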
12Dataset Enron Email Collection
- Why?
- Large, thousands of messages
- Natural email, not email lists
- Real work environment
- Free
- No privacy concerns
- More than 100 users (with sent+received msgs)
13Enron Data Preprocessing 1
- Set up a realistic temporal split
- For each user, 10 (most recent) sent messages will be used as test
- All users had their Address Books extracted
- List of all recipients in the sent messages.
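A minimal sketch of this preprocessing, assuming each sent message is a dict with a sortable "date" and a "recipients" list (both field names are hypothetical). The default of 10 test messages follows the number on the slide.

```python
def temporal_split(sent_messages, n_test=10):
    """Split one user's sent messages into (train, test) by time.

    The n_test most recent messages become the test set, so the
    classifier is never trained on messages newer than the test ones.
    """
    ordered = sorted(sent_messages, key=lambda m: m["date"])
    cut = max(0, len(ordered) - n_test)
    return ordered[:cut], ordered[cut:]

def address_book(sent_messages):
    """All distinct recipients that appear in the given sent messages."""
    return sorted({r for m in sent_messages for r in m["recipients"]})

if __name__ == "__main__":
    msgs = [{"date": d, "recipients": ["user%d" % (d % 3)]} for d in range(25)]
    train, test = temporal_split(msgs)
    print(len(train), len(test), address_book(train))
```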
14Enron Data Preprocessing 2
- ISI version of Enron
- Remove repeated messages and inconsistencies
- Disambiguate Main Enron addresses
- List provided by Corrada-Emmanuel from UMass
- Bag-of-words
- Messages were represented as the union of BOW of body and BOW of subject
- Some stop words removed
- Self-addressed messages were removed
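The bag-of-words step might look like the sketch below. It assumes "union" means that subject and body token counts are added together, and it uses a tiny illustrative stop-word list; neither detail is stated on the slide.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption; the actual list is not given).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}

def bag_of_words(subject, body, stop_words=STOP_WORDS):
    """Token counts over subject + body, lowercased, stop words dropped."""
    tokens = re.findall(r"[a-z0-9]+", (subject + " " + body).lower())
    return Counter(t for t in tokens if t not in stop_words)

if __name__ == "__main__":
    print(bag_of_words("Meeting notes", "The notes for the meeting"))
```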
15Experiments using Textual Features only
- Three Baseline Methods
- Random
- Rank recipient addresses randomly
- Cosine or TfIdf Centroid
- Create a TfIdf centroid for each user in the Address Book. A user1-centroid is the sum of all training messages (in TfIdf vector format) that were addressed to user1. For testing, rank according to cosine similarity between the test message and each centroid.
- Knn-30
- Given a test msg, get the 30 most similar msgs in the training set. Rank according to the sum of similarities of a given user on the 30-msg set.
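The two text-based baselines can be sketched in plain Python as below. The TfIdf weighting (raw count times log(N/df) + 1) and all function names are assumptions made for illustration; the slides do not give the exact weighting. In both methods, the recipient of a message with the lowest score is treated as the most likely outlier.

```python
import math
from collections import Counter, defaultdict

def to_tfidf(tokens, idf):
    """Sparse TfIdf vector (dict) for one token list."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}

def fit_tfidf(docs):
    """docs: list of token lists. Returns (vectors, idf table)."""
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(len(docs) / n) + 1.0 for t, n in df.items()}
    return [to_tfidf(d, idf) for d in docs], idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid_scores(train_vecs, train_recipients, test_vec):
    """Cosine between the test message and each recipient's centroid
    (the sum of the training vectors addressed to that recipient)."""
    centroids = defaultdict(Counter)
    for vec, recips in zip(train_vecs, train_recipients):
        for r in recips:
            centroids[r].update(vec)  # Counter.update adds the weights
    return {r: cosine(test_vec, c) for r, c in centroids.items()}

def knn_scores(train_vecs, train_recipients, test_vec, k=30):
    """Sum, per recipient, of similarities over the k nearest messages."""
    sims = sorted(((cosine(test_vec, v), rs)
                   for v, rs in zip(train_vecs, train_recipients)),
                  key=lambda p: p[0], reverse=True)[:k]
    scores = defaultdict(float)
    for sim, recips in sims:
        for r in recips:
            scores[r] += sim
    return dict(scores)

def rank_outliers(scores, recipients):
    """Order a message's recipients from most to least suspicious."""
    return sorted(recipients, key=lambda r: scores.get(r, 0.0))
```

A recipient that never appears in the training set gets a score of 0 under both methods and is therefore ranked as most suspicious.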
16Experiments using Textual Features only
Email Leak Prediction Results: Prec@1 in 10 trials.
On each trial, a different set of outliers is generated.
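For reference, the evaluation measures can be computed as in this sketch. It assumes the usual definitions: Prec@1 is the fraction of test messages whose true leak is ranked first, and average rank is the mean 1-based position of the true leak. The function names are hypothetical.

```python
def leak_rank(ranked_recipients, leak):
    """1-based position of the true leak in a most-suspicious-first list."""
    return ranked_recipients.index(leak) + 1

def evaluate(rankings):
    """rankings: list of (ranked_recipients, leak) pairs, one per message.

    Returns (Prec@1, average rank) over all test messages.
    """
    ranks = [leak_rank(ranked, leak) for ranked, leak in rankings]
    prec_at_1 = sum(1 for k in ranks if k == 1) / len(ranks)
    return prec_at_1, sum(ranks) / len(ranks)

if __name__ == "__main__":
    print(evaluate([(["x", "a", "b"], "x"), (["a", "y", "b"], "y")]))
```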
17Network Features
- How frequently a recipient was addressed
- How these recipients co-occurred in the training
set
18Using Network Features
- Frequency features
- Number of received messages (from this user)
- Number of sent messages (to this user)
- Number of sent+received messages
- Co-Occurrence Features
- Number of times a user co-occurred with all other recipients. "Co-occur" means two recipients were addressed in the same message in the training set
- Max3g features
- For each recipient R, find Rm (the address with max score from the 3g-address list of R), then use score(R)-score(Rm) as a feature. Scores come from the CV10 procedure. Leak-recipient scores are likely to be smaller than their 3g-address highest score.
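These network features could be computed along the following lines. This is a sketch under assumed data shapes: sent messages are given as recipient lists, received messages as sender addresses, and textual scores as a dict from address to score. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequency_counts(sent_recipient_lists, received_senders):
    """Per-address counts: messages sent to, received from, and their sum."""
    sent_to = Counter(r for recips in sent_recipient_lists for r in set(recips))
    received_from = Counter(received_senders)
    return sent_to, received_from, sent_to + received_from

def cooccurrence_counts(sent_recipient_lists):
    """How often each unordered pair was addressed in the same message."""
    pairs = Counter()
    for recips in sent_recipient_lists:
        for a, b in combinations(sorted(set(recips)), 2):
            pairs[(a, b)] += 1
    return pairs

def cooccurrence_feature(candidate, other_recipients, pairs):
    """Total training co-occurrences of candidate with the other recipients."""
    return sum(pairs[tuple(sorted((candidate, o)))]
               for o in other_recipients if o != candidate)

def max3g_feature(candidate, scores, three_gram_list):
    """score(R) - score(Rm), where Rm is the best-scoring 3g-address of R."""
    if not three_gram_list:
        return 0.0  # assumption: neutral value when R has no 3g-addresses
    best = max(scores.get(a, 0.0) for a in three_gram_list)
    return scores.get(candidate, 0.0) - best
```

A strongly negative `max3g_feature` means a look-alike address fits the message much better than the candidate does, which is the pattern expected for a leak.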
19To combine textual features with network features: Crossvalidation
- Training
- Use Knn-30 in a 10-fold crossvalidation setting to get the textual score of each user for all training messages
- Turn each train example into R binary examples, where R is the number of recipients of the message.
- R-1 positive (the real recipients)
- 1 negative (leak-recipient)
- Augment textual score with network features
- Quantize features
- Train a classifier: VP5, a classification-based ranking scheme
- (VP5 = Voted Perceptron with 5 passes over the training set)
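The training step can be illustrated with a small voted perceptron, following the standard Freund and Schapire formulation. This is a sketch, not the authors' implementation: feature quantization is omitted, the feature vectors are assumed to be plain lists of floats ending in a constant bias term, and the function names are hypothetical.

```python
def binary_examples(true_recipients, leak, features):
    """R examples per message: +1 for each real recipient, -1 for the leak.

    features maps an address to its feature vector (list of floats).
    """
    return ([(features[r], +1) for r in true_recipients]
            + [(features[leak], -1)])

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_voted_perceptron(examples, passes=5):
    """Return a list of (weights, survival_count) pairs."""
    w, c, voted = [0.0] * len(examples[0][0]), 0, []
    for _ in range(passes):
        for x, y in examples:
            if y * dot(w, x) <= 0:      # mistake: store old vector, update
                voted.append((w, c))
                w = [wi + y * xi for wi, xi in zip(w, x)]
                c = 1
            else:                       # correct: current vector survives
                c += 1
    voted.append((w, c))
    return voted

def vote_score(voted, x):
    """Survival-weighted vote; higher means 'more like a real recipient'."""
    return sum(c * (1 if dot(w, x) > 0 else -1) for w, c in voted)

if __name__ == "__main__":
    # Toy features: [textual score, bias]; values are made up.
    feats = {"ann": [0.9, 1.0], "bob": [0.8, 1.0], "eve": [0.1, 1.0]}
    model = train_voted_perceptron(binary_examples(["ann", "bob"], "eve", feats))
    print({name: vote_score(model, x) for name, x in feats.items()})
```

Ranking the recipients of a new message by `vote_score` in ascending order puts the most likely leak first.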
20Results: Textual + Network Features
21Finding Real Leaks in Enron
- How can we find it?
- Grep for "mistake", "sorry" or "accident". We were looking for sentences like "Sorry. Sent this to you by mistake. Please disregard.", "I accidentally send you this reminder", etc.
- How many can we find?
- Dozens of cases.
- Unfortunately, most of these cases originated from non-Enron email addresses or from an Enron email address that is not one of the 151 Enron users whose messages were collected
- Our method requires a collection of sent (received) messages from a user. Only 150 Enron users.
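The grep step could be reproduced with a short script such as the one below. It is only a sketch: the three keywords come from the slide, while the message ids and the input format, a list of (message_id, body_text) pairs, are assumptions. Matches still need manual review, since most hits are not real leaks.

```python
import re

# Substring match, like grep: "accident" also catches "accidentally".
LEAK_HINT = re.compile(r"mistake|sorry|accident", re.IGNORECASE)

def find_leak_candidates(messages):
    """Yield (message_id, line) for every line containing a hint word."""
    for msg_id, body in messages:
        for line in body.splitlines():
            if LEAK_HINT.search(line):
                yield msg_id, line.strip()

if __name__ == "__main__":
    sample = [("m1", "Sorry. Sent this to you by mistake.\nPlease disregard."),
              ("m2", "See you at lunch.")]
    for hit in find_leak_candidates(sample):
        print(hit)
```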
22Finding Real Leaks in Enron
- Found 2 good cases
- Message germanyc/sent/930: message has 20 recipients, leak is alex.perkins_at_
- kitchen-l/sent items/497: it has 44 recipients, leak is rita.wynne_at_
- Prepared training data accordingly (90/10 split) and no simulated leak added
23Results Finding Real Leaks in Enron
- Very Disappointing!!
- Reason: alex.perkins_at_ and rita.wynne_at_ were never observed in the training set!
Prec@1, Average Rank, 100 trials
24Smoothing the leak generation
- Sampling from random unseen recipients with probability a
[Diagram: with probability a, generate a random email address NOT in the Address Book; the other branch is taken with probability 1-a]
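A minimal sketch of this smoothing step follows. It assumes that the 1-a branch falls back to the original leak criterion, passed in here as a hypothetical `base_generator` callable, and that an "unseen" address is simply a random string not present in the address book. The domain name is a placeholder.

```python
import random
import string

def random_unseen_address(address_book, rng=random, domain="example.com"):
    """Random address that is guaranteed not to be in the address book."""
    while True:
        local = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        address = f"{local}@{domain}"
        if address not in address_book:
            return address

def smoothed_outlier(address_book, base_generator, a, rng=random):
    """With probability a return an unseen address, else the usual outlier.

    base_generator: zero-argument callable implementing the original
    leak criterion (assumption about the 1-a branch).
    """
    if rng.random() < a:
        return random_unseen_address(address_book, rng)
    return base_generator()

if __name__ == "__main__":
    book = ["ann@enron.com", "bob@enron.com"]
    print(smoothed_outlier(book, lambda: "bob@enron.com", a=0.5))
```

With a = 0 this reduces to the original outlier generation; larger values of a expose the classifier to leak recipients it has never seen in training.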
25Some Results
- Kitchen-l has 4 unseen addresses out of the 44 recipients,
- Germany-c has only one.
26Mixture parameter a
27Mixture parameter a
28Back to the simulated leaks
29What's next
- Modeling
- Better, more elegant model
- Email Server side application
- Predict based on all users on mail server
- In companies, use info from all email users
- Privacy issues
- Integration with cc-prediction
30Related Work
- Email Privacy Enforcement System
- Boufaden et al. (CEAS-2005) used information extraction techniques and domain knowledge to detect privacy breaches via email in a university environment. Breaches: student names, student grades and student IDs.
- CC Prediction
- Pal & McCallum (CEAS-06): the counterpart problem, prediction of the most likely intended recipients of an email msg. One single user, limited evaluation, not public data
- Expert finding in Email
- Dom et al. (SIGMOD-03), Campbell et al. (CIKM-03)
- Balog & de Rijke (WWW-06), Balog et al. (SIGIR-06)
- Soboroff, Craswell, de Vries (TREC-Enterprise 2005-06-07): expert finding task on the W3C corpus
31Thanks!
- Questions?
- Comments?
- Ideas?
33http://www.workshare.com/company/blog/default.aspx?11postid18titleData-Leak-Bank-Loses-IPO-Role
- Data Leak Bank Loses IPO
- Deutsche Bank has lost its spot among the underwriters of Hertz Global Holdings Inc.'s initial public offering after several e-mails discussing the $1.5 billion initial public offering were inadvertently sent by the bank. This security breach will not only affect them financially, but will no doubt weaken their ability to capture new business in the future. A simple data security policy within Protect Enterprise Suite would have stopped this leak from occurring.
- Source: Bloomberg.com
34http://hrwatch.counciloned.com/060705/Email.htm
35Cases of Malicious Leaks
- In October 2002, an email sent from Merrill Lynch to Standard & Poor's in which it requested an assessment of Commerzbank was leaked, causing the latter to issue a statement regarding its financial robustness.
- In October 2002, an internal Dell Computer document regarding its plan to enter the PDA market was leaked and posted on a French Web site.
- In February 2004, portions of the Windows 2000 and Windows NT 4 source code databases were leaked, apparently by one of its outsourcers for code development.
- In September 2004, a former helpdesk employee at Teledata Communications pleaded guilty to a scheme to steal and sell 30,000 consumer credit reports of the company's customers.
- In October 2004, confidential information about 145,000 American residents was leaked from identification and credential verification services provider ChoicePoint. The company registered $11.4 million in charges related to this incident.
- In December 2004, Apple filed a lawsuit against three members of its Apple Developer Connection network, who allegedly distributed a pre-release version of "Tiger," the company's next major Mac OS X release, through the P2P filesharing network BitTorrent.
36http://flagrantharbour.com/?p=206
- Sun Hung Kai email leak
- Another classic example of a stupendous security breach, this time by Sun Hung Kai's online brokerage operation, SHK Online.
- Hundreds of account holders received the following email, sent on April 4:
- Dear Client,
- We notice that there has been no securities trading activities in your account with SHK ONLINE (SECURITIES) LIMITED for a long time and the account is currently showing a ZERO balance. As part of our company's regular account maintenance, we will classify such accounts as INACTIVE. Should the account remain in such INACTIVE status without any balance and/or securities trading activities by 4 May 2006, your account will be closed automatically without further notice.
- Should you wish to reactivate your account, please contact our customer service hotline at (852) 2822 5001 or email us at enquiry_at_shkonline.com as soon as possible for assistance.
- Regards,
- Customer Service Department
- SHK Online (Securities) Ltd, Level 11, One Pacific Place, 88 Queensway, Hong Kong.
- Tel: (852) 2822-5001 Fax: (852) 2822-5998
- "What's wrong with this?" I hear you ask. "And how do I know how many people received it?"
- The wrong is that the email addresses of the recipients were all congregated together in the TO field. The recipients were not BCC'd, nor were they sent this newsletter-style using a mailer.