Title: Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset
1. Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset
- Yingjie Zhou, Research Assistant, RPI
- Mark Goldberg, Professor, RPI
- Malik Magdon-Ismail, Associate Professor, RPI
- William A. Wallace, Professor, RPI
- Supported by the NSF Grants 0324947, 0323324,
0634875, 0522672, and by the ONR Grant
N00014-06-1-0466
2. Outline
- Introduction
- Properties of Organizational Emails
- Difficulties in Cleaning Organizational Emails
- Procedures of Cleaning Organizational Emails
- Introduction to Enron Email Dataset
- Application of Cleaning Procedures to Enron Email Dataset
- Results
- Conclusions and Future Work
3. Introduction
- Emails
- Organizational emails
  - Inter-organizational emails
  - Intra-organizational emails
- The features of organizational email data make it a promising source for various studies.
- Email data has its own problems and is noisy.
4. Properties of Organizational Emails
- Emails are formatted, and the format is usually defined and followed.
- Emails are normally stored on a server and can be easily collected.
- Emails are unobtrusive.
- Emails are time stamped.
- In addition,
  - The senders and recipients of the emails are employees of the organization.
  - Each employee is normally assigned one or more unique email addresses within the organizational domain.
5. Difficulties in Cleaning Organizational Emails
- Multiple email addresses, names, or IDs exist for the same person.
- Duplicate emails exist.
- The content of the email is difficult to extract.
6. Procedures of Cleaning Organizational Emails
- Map aliases to employees
  - Parse last name, first name, and email ID in headers
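A minimal sketch of the name-parsing step, assuming header names arrive either as "last, first" or as "first last", optionally followed by a bracketed address. The helper names `normalize_name` and `email_id` are hypothetical and are not taken from the original procedure.

```python
import re


def normalize_name(raw):
    """Return a lowercase (first, last) tuple, or None if no name is found."""
    # Drop any bracketed address part, e.g. "<pallen@enron.com>".
    name = re.sub(r"<[^>]*>", "", raw)
    # Drop quotes and surrounding whitespace.
    name = name.replace('"', "").strip().lower()
    if not name or "@" in name:
        return None
    if "," in name:
        # "last, first [middle]" form
        last, _, rest = name.partition(",")
        parts = rest.replace(".", " ").split()
        first = parts[0] if parts else ""
        return (first, last.strip())
    # "first [middle] last" form
    parts = name.replace(".", " ").split()
    if len(parts) < 2:
        return None
    return (parts[0], parts[-1])


def email_id(address):
    """Return the part of an address before the @ sign, lowercased."""
    return address.strip().lower().partition("@")[0]
```

With this sketch, `phillip k allen` and `allen, phillip k.` both reduce to `('phillip', 'allen')`, which can then serve as a key when mapping aliases to employees. Dropping the middle initial is a design choice that may merge distinct people who share first and last names.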
7. Procedures of Cleaning Organizational Emails (Cont'd)
- Remove duplicate emails
  - Match on content, date, and recipients
- Consolidate date and time
  - Convert to machine time
- Extract email content
  - Signatures
  - Features of parent email message
  - Greetings and names
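The duplicate-removal and date-consolidation steps could be sketched as follows. This assumes each message is a dict with the hypothetical keys `body`, `date`, and `to`, and that dates follow the usual RFC 2822 header format. Hashing a whitespace-normalized body is an assumption made for this example; the slides only state that content, date, and recipients are considered.

```python
import hashlib
from email.utils import parsedate_to_datetime


def to_machine_time(date_header):
    """Convert a Date header to whole seconds since the Unix epoch (UTC)."""
    return int(parsedate_to_datetime(date_header).timestamp())


def dedup_key(body, date_header, recipients):
    """Build a key from content, date, and recipients."""
    normalized_body = " ".join(body.split())  # collapse whitespace differences
    normalized_rcpts = ",".join(sorted(r.strip().lower() for r in recipients))
    raw = "\n".join(
        [normalized_body, str(to_machine_time(date_header)), normalized_rcpts]
    )
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def remove_duplicates(messages):
    """Keep the first message seen for each key; input order is preserved."""
    seen = set()
    unique = []
    for msg in messages:
        key = dedup_key(msg["body"], msg["date"], msg["to"])
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Converting the date to machine time before hashing means that two copies of a message whose Date headers express the same instant in different time zones still produce the same key.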
8. Introduction to Enron Email Dataset
- The Federal Energy Regulatory Commission (FERC) posted the Enron email dataset on the web in May of 2002
  - 619,446 emails
- Professor Leslie Kaelbling from MIT purchased the dataset
  - SRI - integrity and security
- Professor William W. Cohen - CMU dataset
  - 150 user folders
  - 517,431 emails
  - 400 MB
9. Introduction to Enron Email Dataset (Cont'd)
- Message-ID: <1017199.1075849811346.JavaMail.evans@thyme>
- Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
- From: eugenio.perez@enron.com
- To: sally.beck@enron.com
- Subject: Self Evaluation - Short Version
- Mime-Version: 1.0
- Content-Type: text/plain; charset=us-ascii
- Content-Transfer-Encoding: 7bit
- X-From: Eugenio Perez
- X-To: Sally Beck
- X-cc:
- X-bcc:
- X-Folder: \Sally_Beck_Nov2001\Notes Folders\All documents
- X-Origin: BECK-S
- X-FileName: sbeck.nsf
- Please let me know if you need anything else.

- Sender
- Receiver/Receivers
- Date and Time
- Subject
- Body
  - Forwarded or replied text
  - Signature
- Attachment
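One way to pull these fields out of a raw message is Python's standard `email` package. The sketch below runs on an abbreviated copy of the sample message; `extract_fields` and its dictionary keys are hypothetical names chosen for illustration.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr, parsedate_to_datetime

# Abbreviated copy of the sample message (some headers omitted).
RAW = """\
Message-ID: <1017199.1075849811346.JavaMail.evans@thyme>
Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
From: eugenio.perez@enron.com
To: sally.beck@enron.com
Subject: Self Evaluation - Short Version
X-From: Eugenio Perez
X-To: Sally Beck

Please let me know if you need anything else.
"""


def extract_fields(raw_text):
    """Return sender, recipients, date, subject, and body of one message."""
    msg = message_from_string(raw_text)
    recipients = []
    for header in ("To", "Cc", "Bcc"):
        values = msg.get_all(header, [])
        recipients.extend(
            addr.lower() for _, addr in getaddresses(values) if addr
        )
    return {
        "sender": parseaddr(msg.get("From", ""))[1].lower(),
        "recipients": recipients,
        "date": parsedate_to_datetime(msg["Date"]),
        "subject": msg.get("Subject", ""),
        "body": msg.get_payload().strip(),  # plain-text, single-part messages
    }


fields = extract_fields(RAW)
```

The body returned here still contains any forwarded or replied text and any signature; separating those parts is a later step.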
10. Introduction to Enron Email Dataset (Cont'd)
- From, To, Cc, Bcc
- X-From, X-To, X-cc, X-bcc
Wrong!
- Example 1: davis-d\deleted_items\101
  - From: dana.davis@enron.com
  - To: dana.davis@enron.com
  - X-From: Davis, Mark Dana </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MDAVIS>
  - X-To: Davis, Dana </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Ddavis>
- Example 2: cash-m\sent_items\505
  - From: michelle.cash@enron.com
  - To: legal <.taylor@enron.com>
  - X-From: Cash, Michelle </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MCASH>
  - X-To: Taylor, Mark E (Legal) </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mtaylo1>
Doesn't make sense!
11. Application of Cleaning Procedures to Enron Email Dataset
- phillip k allen
- phillip allen
- allen, phillip
- allen, phillip k.
- phillip k allen <phillip k allen/hou/ect@ect>
- allen, phillip </o=enron/ou=na/cn=recipients/cn=notesaddr/cn=ba4cd662-58db2db2-862564b8-5b412a>
- allen, phillip k. </o=enron/ou=na/cn=recipients/cn=pallen>
- phillip.k.allen@enron.com
- phillip.allen@enron.com
- pallen@enron.com
- pallen70@hotmail.com
- pallen@ect.enron.com
- pallen@hotmail.com
pallen@enron.com
pallen@enron.com
phillip allen <pallen@enron.com>
"pallen@enron.com" <pallen@enron.com>
phillip <pallen@enron.com>
phillip allen <pallen@enron.com>
"allen, phillip k" <pallen@enron.com>
<pallen@enron.com>
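The header variants above can be reduced to one address with `email.utils.parseaddr`, after which a lookup table maps each known address to an employee. In this sketch the table holds the addresses listed on this slide, and the employee key `allen-p` is a hypothetical identifier chosen for the example.

```python
from email.utils import parseaddr

# Addresses listed for one employee; "allen-p" is a hypothetical employee ID.
ALIASES = {
    "phillip.k.allen@enron.com": "allen-p",
    "phillip.allen@enron.com": "allen-p",
    "pallen@enron.com": "allen-p",
    "pallen70@hotmail.com": "allen-p",
    "pallen@ect.enron.com": "allen-p",
    "pallen@hotmail.com": "allen-p",
}


def canonical_address(header_value):
    """Reduce a header value to its bare, lowercased address."""
    _, addr = parseaddr(header_value)
    return addr.lower()


def employee_for(header_value):
    """Return the employee ID for a header value, or None if unknown."""
    return ALIASES.get(canonical_address(header_value))
```

Display names that contain a comma need to be quoted, as in `"allen, phillip k" <pallen@enron.com>`, for the address to be parsed as a single mailbox. Values in the X.500 style (`/o=enron/ou=na/...`) carry no SMTP address and would need a separate rule.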
12. Application of Cleaning Procedures to Enron Email Dataset (Cont'd)
- 150 folders => 156 employees
- 517,431 emails => 252,830 unique emails
- All emails are from the same time zone, and emails with wrong dates are discarded
- 22,241 emails among 156 employees from Nov. 1998 to Jun. 2002
- Original Message, Forwarded by, Thanks, Regards, etc.
- Signatures
Susan S. Bailey
Senior Legal Specialist
Enron Wholesale Services
Legal Department
1400 Smith Street, Suite 3803A
Houston, Texas 77002
phone: (713) 853-4737
fax: (713) 646-3490
email: susan.bailey@enron.com
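A rough sketch of content extraction using the markers listed above. The regular expressions are assumptions about how those markers appear in message bodies, and `extract_new_content` is a hypothetical helper, not the authors' implementation.

```python
import re

# Lines that introduce a parent message, e.g. " -----Original Message-----"
# or "------ Forwarded by ...". The exact patterns are assumptions.
PARENT_MARKER = re.compile(
    r"^[ \t]*-+[ \t]*(?:original message|forwarded by)\b",
    re.IGNORECASE | re.MULTILINE,
)

# A closing word alone on its own line, e.g. "Thanks," or "Regards".
CLOSING_LINE = re.compile(
    r"^[ \t]*(?:thanks|thank you|regards|best regards)[ \t]*[,.!]?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_new_content(body):
    """Return the text written by the sender of this message only."""
    # Cut everything from the first parent-message marker onward.
    match = PARENT_MARKER.search(body)
    if match:
        body = body[: match.start()]
    # Cut the closing line and whatever follows it (typically a signature).
    match = CLOSING_LINE.search(body)
    if match:
        body = body[: match.start()]
    return body.strip()
```

A signature that is not preceded by a closing line, such as the address block above, is not removed by this sketch and would need a separate rule, for example matching lines against known employee names.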
13. Conclusions and Future Work
- Conclusions
  - In general, the procedures are practical and served well in cleaning the Enron emails.
- Future Work
  - Name disambiguation
  - Misdirected email detection
  - Broadcast email removal
  - Various analyses