Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset

Slides: 15
Learn more at: http://www.cs.rpi.edu

1
Strategies for Cleaning Organizational Emails
with an Application to Enron Email Dataset
  • Yingjie Zhou, Research Assistant, RPI
  • Mark Goldberg, Professor, RPI
  • Malik Magdon-Ismail, Associate Professor, RPI
  • William A. Wallace, Professor, RPI
  • Supported by the NSF Grants 0324947, 0323324,
    0634875, 0522672, and by the ONR Grant
    N00014-06-1-0466

2
Outline
  • Introduction
  • Properties of Organizational Emails
  • Difficulties in Cleaning Organizational Emails
  • Procedures of Cleaning Organizational Emails
  • Introduction to Enron Email Dataset
  • Application of Cleaning Procedures to Enron Email
    Dataset
  • Results
  • Conclusions and Future Work

3
Introduction
  • Emails
  • Organizational emails
  • Inter-organizational emails
  • Intra-organizational emails
  • The features of organizational email data make it a
    promising source for various studies
  • Email data has its own problems and is noisy

4
Properties of Organizational Emails
  • Emails are formatted, and the format is usually
    defined and followed.
  • Emails are normally stored in a server and can be
    easily collected.
  • Emails are unobtrusive.
  • Emails are time stamped.
  • In addition:
      • The senders and recipients of the emails are
        employees of the organization.
      • Each employee is normally assigned one or more
        unique email addresses within the organizational
        domain.

5
Difficulties in Cleaning Organizational Emails
  • Multiple email addresses, names, or IDs exist for
    the same person.
  • Duplicate emails exist.
  • The content of the email is difficult to extract.

6
Procedures of Cleaning Organizational Emails
  • Map aliases to employees
      • Parse last name, first name, and email ID in
        headers
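The parsing step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `parse_alias` is hypothetical, and it assumes header values take one of the forms seen in the Enron headers ("first last", "last, first", a bare address, or a display name followed by an address or an X.500-style `/O=.../CN=<id>` path in angle brackets).

```python
import re

# Text between angle brackets: either an SMTP address or an X.500-style path.
ANGLE = re.compile(r"<([^>]*)>")

def parse_alias(raw):
    """Return (first, last, email_id) from one header value.

    Hypothetical helper: fields that cannot be recovered come back as "".
    """
    raw = raw.strip().strip('"').lower()
    match = ANGLE.search(raw)
    bracket = match.group(1) if match else ""
    name = ANGLE.sub("", raw).strip().strip('"').strip()

    # A bare address with no display name, e.g. "pallen@enron.com".
    if "@" in name and " " not in name:
        bracket, name = name, ""

    email_id = ""
    if "@" in bracket:
        email_id = bracket.split("@", 1)[0]       # local part of the address
    elif "/cn=" in bracket:
        email_id = bracket.rsplit("/cn=", 1)[1]   # last CN component

    if "," in name:                               # "last, first [middle]"
        last, _, rest = name.partition(",")
        parts = rest.split()
    else:                                         # "first [middle] last"
        parts = name.split()
        last = parts.pop() if parts else ""       # a single token is treated as a last name
    first = parts[0] if parts else ""
    return first, last.strip(), email_id
```

Records that share a last name, a first name, or an email ID can then be grouped as candidate aliases of one employee; how conflicts are resolved is a separate decision that this sketch does not address.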

7
Procedures of Cleaning Organizational Emails
(Contd)
  • Remove duplicate emails
      • Match on content, date, and recipients
  • Consolidate date and time
      • Convert to machine time
  • Extract email content
      • Signatures
      • Features of parent email message
      • Greetings and names
8
Introduction to Enron Email Dataset
  • The Federal Energy Regulatory Commission (FERC)
    posted the Enron email dataset on the web in May
    of 2002
      • 619,446 emails
  • Professor Leslie Kaelbling from MIT purchased the
    dataset
  • SRI - integrity and security
  • Professor William W. Cohen - CMU dataset
      • 150 user folders
      • 517,431 emails
      • 400 MB

9
Introduction to Enron Email Dataset (Contd)
  • Message-ID: <1017199.1075849811346.JavaMail.evans@thyme>
  • Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
  • From: eugenio.perez@enron.com
  • To: sally.beck@enron.com
  • Subject: Self Evaluation - Short Version
  • Mime-Version: 1.0
  • Content-Type: text/plain; charset=us-ascii
  • Content-Transfer-Encoding: 7bit
  • X-From: Eugenio Perez
  • X-To: Sally Beck
  • X-cc:
  • X-bcc:
  • X-Folder: \Sally_Beck_Nov2001\Notes Folders\All
    documents
  • X-Origin: BECK-S
  • X-FileName: sbeck.nsf
  • Please let me know if you need anything else.

  • Sender
  • Receiver/Receivers
  • Date and time
  • Subject
  • Body
      • Forwarded or replied text
      • Signature
  • Attachment
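The split into sender, receivers, date, subject, and body can be sketched with Python's standard `email` package. This is an illustration only: the function name `split_components` and the returned dictionary layout are assumptions, and the sample text is the message shown on this slide with its standard header punctuation.

```python
from email import message_from_string

RAW = """Message-ID: <1017199.1075849811346.JavaMail.evans@thyme>
Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
From: eugenio.perez@enron.com
To: sally.beck@enron.com
Subject: Self Evaluation - Short Version
X-From: Eugenio Perez
X-To: Sally Beck

Please let me know if you need anything else.
"""

def split_components(raw_text):
    """Return the main components of one plain-text email as a dict."""
    msg = message_from_string(raw_text)
    # Collect addresses from To, Cc, and Bcc; a missing header yields nothing.
    # Splitting on commas is enough for bare addresses, not for quoted names.
    receivers = [
        addr.strip()
        for field in ("To", "Cc", "Bcc")
        for addr in (msg.get(field) or "").split(",")
        if addr.strip()
    ]
    return {
        "sender": msg.get("From", ""),
        "receivers": receivers,
        "date": msg.get("Date", ""),
        "subject": msg.get("Subject", ""),
        "body": msg.get_payload(),   # a string for non-multipart messages
    }
```

Separating forwarded or replied text and signatures from the body is a further step that works on the `body` string.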

10
Introduction to Enron Email Dataset (Contd)
  • From, To, Cc, Bcc
  • X-From, X-To, X-cc, X-bcc

Wrong!
  • Example 1: davis-d\deleted_items\101
      • From: dana.davis@enron.com
      • To: dana.davis@enron.com
      • X-From: Davis, Mark Dana </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MDAVIS>
      • X-To: Davis, Dana </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Ddavis>
  • Example 2: cash-m\sent_items\505
      • From: michelle.cash@enron.com
      • To: legal <.taylor@enron.com>
      • X-From: Cash, Michelle </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MCASH>
      • X-To: Taylor, Mark E (Legal) </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mtaylo1>

Doesn't make sense!

11
Application of Cleaning Procedures to Enron Email
Dataset
  • phillip k allen
  • phillip allen
  • allen, phillip
  • allen, phillip k.
  • phillip k allen <phillip k allen/hou/ect@ect>
  • allen, phillip </o=enron/ou=na/cn=recipients/cn=notesaddr/cn=ba4cd662-58db2db2-862564b8-5b412a>
  • allen, phillip k. </o=enron/ou=na/cn=recipients/cn=pallen>
  • phillip.k.allen@enron.com
  • phillip.allen@enron.com
  • pallen@enron.com
  • pallen70@hotmail.com
  • pallen@ect.enron.com
  • pallen@hotmail.com

pallen@enron.com
pallen@enron.com
phillip allen <pallen@enron.com>
"pallen@enron.com" <pallen@enron.com>
phillip <pallen@enron.com>
phillip allen <pallen@enron.com>
"allen, phillip k" <pallen@enron.com>
<pallen@enron.com>

12
Application of Cleaning Procedures to Enron Email
Dataset (Contd)
  • 150 folders -> 156 employees
  • 517,431 emails -> 252,830 unique emails
  • All emails are from the same time zone, and
    emails with wrong dates are discarded
  • 22,241 emails among 156 employees from Nov. 1998
    to Jun. 2002
  • Original Message, Forwarded by, Thanks,
    Regards, etc.
  • Signatures
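One way the markers listed above could be used to isolate the newly written text is sketched below. The slides name the markers but not the rules, so the two regular expressions and the function name `extract_new_text` are assumptions chosen for illustration.

```python
import re

# Start of a quoted parent message, e.g. "-----Original Message-----" or
# "------ Forwarded by ...". Assumed patterns, not the authors' rules.
PARENT_RE = re.compile(r"^\s*-+\s*(original message|forwarded by)\b", re.IGNORECASE)

# A closing line that usually precedes the signature block.
SIGNOFF_RE = re.compile(
    r"^\s*(thanks|thank you|regards|best regards)\s*[,.!]*\s*$", re.IGNORECASE
)

def extract_new_text(body):
    """Return the text written by the sender, cut at the first marker line."""
    kept = []
    for line in body.splitlines():
        if PARENT_RE.match(line) or SIGNOFF_RE.match(line):
            break   # everything after this is parent text or signature
        kept.append(line)
    return "\n".join(kept).strip()
```

Cutting at the first sign-off line drops the signature without having to recognize its layout, at the cost of losing any text that follows a mid-message "Thanks,".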

Susan S. Bailey
Senior Legal Specialist
Enron Wholesale Services
Legal Department
1400 Smith Street, Suite 3803A
Houston, Texas 77002
phone (713) 853-4737
fax (713) 646-3490
email susan.bailey@enron.com

13
Conclusions and Future Work
  • Conclusions
      • In general, the procedures are practical and
        served well in cleaning the Enron emails.
  • Future Work
      • Name disambiguation
      • Misdirected email detection
      • Broadcast email removal
      • Various analyses

14
  • Thank you!
  • Any Comments?