Enron Corpus: A New Dataset for Email Classification

Provided by: Willi427

Learn more at: http://sifaka.cs.uiuc.edu

1
Enron Corpus: A New Dataset for Email Classification

By Bryan Klimt and Yiming Yang
CEAS 2004
Presented by Will Lee

2
Introduction

Motivation
Related Works
The Enron Corpus
Methods
Evaluation
Thread Information
Conclusion

3
Motivation

Other corpuses focus on newsgroups or personal email data
Lack of a common data set for evaluating the performance of email classification
Previous research uses different personal data sets
Hard to find data on actual email use within a company
Obviously, companies do not like to share their internal emails
Privacy concerns for people working for the company

4
Related Works

Other corpuses
20 Newsgroups
http://people.csail.mit.edu/people/jrennie/20Newsgroups/
Related Papers
Y. Diao, H. Lu, and D. Wu, "A Comparative Study of Classification Based Personal E-mail Filtering" (PAKDD '00)
I. Androutsopoulos et al., "An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages" (SIGIR '00)
T. Payne, "Learning Email Filtering Rules with Magi" (Thesis, 1994)

5
20 Newsgroups

Collection of approximately 20,000 newsgroup
documents, spread out evenly across 20 different
newsgroups
Sample newsgroups
comp.graphics, rec.motorcycles,
rec.sport.baseball, sci.electronics,
talk.politics.misc, talk.religion.misc, etc.
Used originally in Ken Lang's "NewsWeeder: Learning to Filter Netnews" paper (ICML 1995)
Built from newsgroup data, so probably not very useful for research in personal information management

6
Enron Dataset

619,446 messages (200,399 after cleaning) by 158 users
Average of 757 messages per user
Shows that most users do use folders to organize their email
Folder information can be used to evaluate the effectiveness of folder classification

7
Enron Corpus Characteristics

Number of messages per user varies from a few messages to 10K messages
Upper bound on the number of folders seems to correlate with log(# of messages)
Number of messages does not correlate with the lower bound (a user can have many messages but only a few folders)
Question: How can we use this kind of information?

8
Email Classification Features

Constructive text
BOW approach, feature used the most
Some fields are more important than the others
Stemming, stop word removal used, effectiveness
not proven
Categorical text
to and from fields
BOW, useful for classification, but not as useful
as constructive text
Numeric data
Size of message, number of replies, number of
words, etc.
Not very useful
Thread information
Indicates how messages relate to each other
Not fully exploited
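The four feature groups listed above can be pulled out of a raw message with Python's standard `email` package. This is only a minimal sketch under stated assumptions, not the authors' pipeline: the `extract_features` helper, the sample message, and the whitespace tokenizer are all hypothetical and chosen for illustration.

```python
from collections import Counter
from email import message_from_string


def extract_features(raw: str) -> dict:
    """Split one raw message into four feature groups (hypothetical helper)."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    words = body.lower().split()  # naive whitespace tokenizer, an assumption
    subject_words = msg.get("Subject", "").lower().split()
    return {
        # Free text ("constructive text" on the slide): bag of words
        "text_bow": Counter(subject_words + words),
        # Categorical text: sender and recipient fields
        "categorical": {"from": msg.get("From", ""), "to": msg.get("To", "")},
        # Numeric data: simple size measures
        "numeric": {"num_chars": len(body), "num_words": len(words)},
        # Thread information: ids that link this message to earlier ones
        "thread": {
            "message_id": msg.get("Message-ID", ""),
            "in_reply_to": msg.get("In-Reply-To", ""),
        },
    }


# Made-up message, for illustration only
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Re: first lecture\n"
    "Message-ID: <2@example.com>\n"
    "In-Reply-To: <1@example.com>\n"
    "\n"
    "The first lecture was today.\n"
)
features = extract_features(raw_message)
```

Keeping the groups separate makes it possible to train on one field at a time, or to weight fields differently when they are combined.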

9
Email Features (Example)
Labeled parts: Numeric data, Categorical text, Thread information, Contextual text

From: Mark Hills <mhills_at_cs.uiuc.edu>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <cglafaf3o1_at_dcs-news1.cs.uiuc.edu>
References: <cgl09cbll1_at_dcs-news1.cs.uiuc.edu>
In-Reply-To: <cgl09cbll1_at_dcs-news1.cs.uiuc.edu>

Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
>
> Thanks
>
> Josh

The first lecture was today, during the normally scheduled time.

Mark

10
Classification Method

Vector space model with SVM
Vector weight w_i is computed using ltc weighting
(http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means
l: new-tf = ln(tf) + 1.0
t: new-wt = new-tf * log(num-docs / coll-freq-of-term)
c: divide each new-wt by sqrt(sum of (new-wts squared))
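The three ltc steps can be written out in a few lines of Python. This is a sketch under assumptions, not the SMART implementation: the `ltc_weights` function and the toy documents are hypothetical, and `coll-freq-of-term` is read here as the number of documents that contain the term.

```python
import math
from collections import Counter


def ltc_weights(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    num_docs = len(docs)
    # Assumption: coll-freq-of-term = number of documents containing the term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # l: new-tf = ln(tf) + 1.0, then t: new-wt = new-tf * log(num-docs / doc-freq)
        wt = {
            term: (math.log(count) + 1.0) * math.log(num_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # c: cosine normalization, divide by sqrt(sum of squared weights)
        norm = math.sqrt(sum(w * w for w in wt.values()))
        vectors.append({term: (w / norm if norm else 0.0) for term, w in wt.items()})
    return vectors


# Toy tokenized documents, for illustration only
docs = [
    ["meeting", "schedule", "meeting"],
    ["gas", "price", "schedule"],
    ["gas", "contract"],
]
vectors = ltc_weights(docs)
```

After the `c` step every non-empty vector has unit length, so a dot product between two vectors is their cosine similarity. A term that occurs in every document gets weight zero, because log(1) = 0.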

11
Classification Method (Cont.)

Sort messages in chronological order, then split into train and test sets
Run SVM on term-weighted vectors of
From
Subject
Body
To, CC
All fields
Linear regression on all fields seems to have the best performance
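The chronological split in the first step can be sketched as follows. The `chronological_split` helper, the message dictionaries, and the 50/50 `train_fraction` are assumptions for illustration; the slides do not state the split ratio, and classifier training itself is left out.

```python
from datetime import datetime


def chronological_split(messages, train_fraction=0.5):
    """Sort by timestamp, then train on the earliest part and test on the rest."""
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)  # assumed ratio, not from the slides
    return ordered[:cut], ordered[cut:]


# Made-up messages, for illustration only
messages = [
    {"id": "c", "date": datetime(2001, 3, 1), "folder": "deals"},
    {"id": "a", "date": datetime(2001, 1, 5), "folder": "deals"},
    {"id": "d", "date": datetime(2001, 4, 9), "folder": "hr"},
    {"id": "b", "date": datetime(2001, 2, 2), "folder": "hr"},
]
train, test = chronological_split(messages)
```

Splitting by time rather than at random mirrors real use: a filter only ever sees mail that arrived before the message it has to file.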

12
Clustering Effectiveness

13
Number of Messages vs. F1

The number of messages does not directly correlate with accuracy
Question: What about the case where the user has only one folder, which makes classification trivial?

14
Number of Folders vs. F1

There is a correlation between the number of folders and the F1 score.
Question: Is this trivial as well?
Some elements in the messages are not modeled, since SVM has more messages to train on.

15
Thread Information

200,399 messages, 101,786 threads, 71,696 threads with only one message
61.63% of the messages in the corpus are in a thread.
Average thread size is 4.1 messages
Average number of folders per thread is 1.37 (meaning most messages of a thread stay in one folder)
Question: It is not clear how threads are detected. How can we use this information?
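A quick arithmetic check shows how these figures fit together. It assumes, and this is a reading of the slide rather than something it states, that "in a thread" counts only messages in threads with more than one message.

```python
total_messages = 200_399
total_threads = 101_786
single_message_threads = 71_696
share_in_thread = 0.6163  # the 61.63 figure from the slide, read as a percentage

# Threads that contain at least two messages
multi_message_threads = total_threads - single_message_threads

# Assumption: the percentage counts only messages in multi-message threads
messages_in_threads = share_in_thread * total_messages

avg_size_multi = messages_in_threads / multi_message_threads
avg_size_all = total_messages / total_threads

print(multi_message_threads)
print(round(avg_size_multi, 1))
print(round(avg_size_all, 2))
```

Under that assumption there are 30,090 multi-message threads averaging about 4.1 messages each, which matches the slide. Averaging over all 101,786 threads would instead give roughly 1.97, so the 4.1 figure appears to exclude single-message threads.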

16
More Thread

D. Lewis et al., "Threading Electronic Mail: A Preliminary Study" (1997)
Lewis studied finding the parent message using a BOW, TF/IDF-weighted, vector space approach on constructive text
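The general idea of finding a parent by text similarity can be sketched with a generic TF-IDF and cosine setup. This is not Lewis's exact weighting: the functions `tfidf_vectors`, `cosine`, and `best_parent` and the toy messages are hypothetical, and the reply is simply treated as a query against all earlier messages.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build simple TF-IDF vectors: raw tf times log inverse document frequency."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def best_parent(reply_index, vectors):
    """Pick the earlier message most similar to the reply (needs reply_index >= 1)."""
    return max(range(reply_index), key=lambda i: cosine(vectors[reply_index], vectors[i]))


# Toy messages in arrival order, for illustration only
messages = [
    "when is the first lecture".split(),
    "gas contract price review".split(),
    "the first lecture was today".split(),
]
vectors = tfidf_vectors(messages)
parent = best_parent(2, vectors)
```

Here the third message shares "the", "first", and "lecture" with the first message and nothing with the second, so the first message is chosen as its parent.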

Formulas on slide: document weight, query weight, similarity

17
More Thread (Cont.)

Lewis's work assumes that the thread information in the message header is incomplete.
This may not be the case.
The algorithm by Jamie Zawinski, widely used in the original Netscape 4.x (maybe in recent Mozilla as well?), can group threaded messages effectively.
http://www.jwz.org/doc/threading.htm
Questions
How can we leverage the thread information in email messages more effectively?
Does this model extend to more recent forms of conversation, such as blogs and web forums, as well?
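When the headers are complete, threads can be recovered from them directly. The sketch below is a much simplified illustration of header-based grouping, not Zawinski's full algorithm: the `group_threads` function and the message dictionaries (with `references` and `in_reply_to` keys) are assumptions made for this example.

```python
from collections import defaultdict


def group_threads(messages):
    """Group message ids into threads by following parent links up to a root id."""
    parent = {}
    for m in messages:
        # Prefer the last References entry, else In-Reply-To (assumed layout)
        refs = m.get("references", [])
        parent[m["id"]] = refs[-1] if refs else m.get("in_reply_to")

    def root(msg_id):
        seen = set()
        # Walk upward until no parent is known; `seen` stops the walk on a loop
        while parent.get(msg_id) and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent[msg_id]
        return msg_id

    threads = defaultdict(list)
    for m in messages:
        threads[root(m["id"])].append(m["id"])
    return dict(threads)


# Made-up headers, for illustration only
messages = [
    {"id": "<1>", "references": [], "in_reply_to": None},
    {"id": "<2>", "references": ["<1>"], "in_reply_to": "<1>"},
    {"id": "<3>", "references": ["<1>", "<2>"], "in_reply_to": "<2>"},
    {"id": "<4>", "references": [], "in_reply_to": None},
]
threads = group_threads(messages)
```

Replies whose parent is missing from the collection are grouped under the missing parent's id, so siblings of a lost message still end up in the same thread.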

18
Conclusion

Pros
Introduces a new corpus that can be useful in evaluating classification performance on a large collection of personal mail
Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company
Cons
Details on performing SVM and the linear weights for the various fields are missing
It is not clear how threads are detected