Title: Enron Corpus: A New Dataset for Email Classification
1 Enron Corpus: A New Dataset for Email Classification
- By Bryan Klimt and Yiming Yang
- CEAS 2004
- Presented by Will Lee
2 Introduction
- Motivation
- Related Works
- The Enron Corpus
- Methods
- Evaluation
- Thread Information
- Conclusion
3 Motivation
- Other corpora focus on newsgroups or personal email data
  - No common data set exists for evaluating the performance of email classification
  - Previous research used different personal data sets
- It is difficult to observe actual email use within a company
  - Companies do not like to share their internal emails
  - Privacy concerns for people working for the company
4 Related Works
- Other corpora
  - 20 Newsgroups
    - http://people.csail.mit.edu/people/jrennie/20Newsgroups/
- Related papers
  - Y. Diao, H. Lu, and D. Wu, "A Comparative Study of Classification Based Personal E-mail Filtering" (PAKDD '00)
  - I. Androutsopoulos et al., "An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages" (SIGIR '00)
  - T. Payne, "Learning Email Filtering Rules with Magi" (Thesis, 1994)
5 20 Newsgroups
- Collection of approximately 20,000 newsgroup documents, spread evenly across 20 different newsgroups
- Sample newsgroups
  - comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc.
- Originally used in Ken Lang's "NewsWeeder: Learning to Filter Netnews" paper (ICML 1995)
- A newsgroup dataset, probably not very useful for research in personal information management
6 Enron Dataset
- 619,446 messages (200,399 after cleaning) by 158 users
- Average 757 messages per user
- Shows most users do use folders to organize emails
- Can use folder information to evaluate effectiveness for folder classification
7 Enron Corpus Characteristics
- Number of messages per user varies from a few messages to 10K messages
- The upper bound on the number of folders seems to correlate with log(# of messages)
- The number of messages does not correlate with the lower bound (a user can have many messages but only a few folders)
- Question: how can we use this kind of information?
8 Email Classification Features
- Constructive text
  - BOW approach, feature used the most
  - Some fields are more important than others
  - Stemming and stop word removal used, effectiveness not proven
- Categorical text
  - "to" and "from" fields
  - BOW, useful for classification, but not as useful as constructive text
- Numeric data
  - Size of message, number of replies, number of words, etc.
  - Not very useful
- Thread information
  - Indicates how messages relate to each other
  - Not fully exploited
9 Email Features (Example)
Callout labels: Numeric data, Categorical text, Thread information, Contextual text

From: Mark Hills <mhills_at_cs.uiuc.edu>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <cglafaf3o1_at_dcs-news1.cs.uiuc.edu>
References: <cgl09cbll1_at_dcs-news1.cs.uiuc.edu>
In-Reply-To: <cgl09cbll1_at_dcs-news1.cs.uiuc.edu>

Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
>
> Thanks
>
> Josh

The first lecture was today, during the normally scheduled time.

Mark
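The feature groups from the previous slide can be pulled out of a message like this one with Python's standard `email` package. This is a minimal sketch, not the paper's pipeline: the function name `extract_features`, the grouping keys, and the shortened message are illustrative assumptions.

```python
from email import message_from_string

# Shortened version of the example message (header values as shown on the slide).
RAW_MESSAGE = """\
From: Mark Hills <mhills_at_cs.uiuc.edu>
Subject: Re: When is the first lecture?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <cglafaf3o1_at_dcs-news1.cs.uiuc.edu>
References: <cgl09cbll1_at_dcs-news1.cs.uiuc.edu>
In-Reply-To: <cgl09cbll1_at_dcs-news1.cs.uiuc.edu>

The first lecture was today, during the normally scheduled time.
"""


def extract_features(raw):
    """Group one message's fields into the four feature types (hypothetical helper)."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a non-multipart message
    return {
        # free text: subject and body
        "text": {"subject": msg.get("Subject", ""), "body": body},
        # categorical text: sender and recipients
        "categorical": {"from": msg.get("From", ""), "to": msg.get("To", "")},
        # numeric data: here only a word count of the body
        "numeric": {"body_words": len(body.split())},
        # thread information: IDs linking this message to earlier ones
        "thread": {
            "message_id": msg.get("Message-ID", ""),
            "in_reply_to": msg.get("In-Reply-To", ""),
            "references": msg.get("References", "").split(),
        },
    }


features = extract_features(RAW_MESSAGE)
print(features["thread"]["references"])
```

Headers that are absent (such as To in this message) come back as empty strings, so the same function works on messages with incomplete headers.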
10 Classification Method
- Vector space model with SVM
- Vector weight w_i is evaluated using ltc (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means
  - l: new-tf = ln(tf) + 1.0
  - t: new-wt = new-tf * log(num-docs / coll-freq-of-term)
  - c: divide each new-wt by sqrt(sum of (new-wts squared))
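The three ltc steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: `ltc_vectors` is a hypothetical name, documents are assumed to be pre-tokenized lists of terms, coll-freq-of-term is taken as the number of documents containing the term, and the logarithm is taken as natural.

```python
import math
from collections import Counter


def ltc_vectors(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    num_docs = len(docs)
    # Assumption: coll-freq-of-term = number of documents that contain the term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # l: new-tf = ln(tf) + 1.0
        # t: new-wt = new-tf * log(num-docs / coll-freq-of-term)
        raw = {
            term: (math.log(freq) + 1.0) * math.log(num_docs / doc_freq[term])
            for term, freq in tf.items()
        }
        # c: divide each new-wt by sqrt(sum of squared new-wts)
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors


docs = [
    ["meeting", "schedule", "meeting"],
    ["gas", "price"],
    ["meeting", "gas"],
]
vectors = ltc_vectors(docs)
print(vectors[0])
```

A term that occurs in every document gets weight 0 from the t step, and the c step gives every non-zero vector unit length, so message length does not dominate the comparison.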
11 Classification Method (Cont.)
- Sort messages in chronological order, split into train and test set
- Run SVM on term-weighted vectors of
  - From
  - Subject
  - Body
  - To, CC
  - All fields
- Linear regression on all fields seems to have the best performance
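The chronological split can be sketched as follows. The slide does not state the split ratio, so `train_fraction` is an assumed parameter, and `chronological_split` and the sample messages are hypothetical.

```python
from datetime import datetime


def chronological_split(messages, train_fraction=0.5):
    """Sort by date, then use the earliest messages for training (hypothetical helper)."""
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)  # assumed ratio, not from the slides
    return ordered[:cut], ordered[cut:]


# Made-up messages for illustration only.
messages = [
    {"date": datetime(2001, 5, 3), "folder": "deals"},
    {"date": datetime(2000, 11, 20), "folder": "personal"},
    {"date": datetime(2001, 1, 15), "folder": "deals"},
    {"date": datetime(2001, 8, 9), "folder": "meetings"},
]
train, test = chronological_split(messages)
print([m["folder"] for m in train], [m["folder"] for m in test])
```

Splitting by time rather than at random mimics real use, where a classifier only ever sees past mail. It also means a folder created late (here "meetings") can appear in the test set without any training examples.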
12 Clustering Effectiveness
13 Number of Messages vs. F1
- The number of messages does not directly correlate with accuracy
- Question: what about the case where the user has only one folder, which makes classification trivial?
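For reference, the per-folder F1 score is the harmonic mean of precision and recall. The sketch below is a generic implementation, not the paper's evaluation code; `f1_score` and the sample labels are hypothetical. It also shows why the single-folder case is trivial.

```python
def f1_score(true_folders, predicted_folders, folder):
    """F1 for one folder, given parallel lists of true and predicted labels."""
    pairs = list(zip(true_folders, predicted_folders))
    tp = sum(1 for t, p in pairs if t == folder and p == folder)
    fp = sum(1 for t, p in pairs if t != folder and p == folder)
    fn = sum(1 for t, p in pairs if t == folder and p != folder)
    if tp == 0:
        return 0.0  # no correct predictions for this folder
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A user with a single folder: any classifier that outputs that folder is perfect.
single = f1_score(["inbox"] * 4, ["inbox"] * 4, "inbox")

# A user with two folders and one misfiled message.
true_labels = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
f1_a = f1_score(true_labels, predicted, "a")
f1_b = f1_score(true_labels, predicted, "b")
print(single, f1_a, f1_b)
```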
14 Number of Folders vs. F1
- There's a correlation between the number of folders and the F1 score.
- Question: is this trivial as well?
- Some elements in the messages are not modeled, since the SVM has more messages to train on.
15 Thread Information
- 200,399 messages, 101,786 threads, 71,696 threads with only one message
- 61.63% of the messages in the corpus are in a thread.
- Average thread size is 4.1 messages
- Average number of folders per thread is 1.37 (meaning most messages of a thread stay in one folder)
- Question: it is not clear how threads are detected. How can we use this information?
16 More Thread
- D. Lewis et al., "Threading Electronic Mail: A Preliminary Study" (1997)
- Lewis studied finding the parent message using a BOW, TF/IDF-weighted vector space approach on constructive text
[Equations: document weight, query weight, similarity]
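The general idea of finding a parent by vector-space similarity can be sketched as below. The slide's document-weight and query-weight formulas are not reproduced here: plain term counts stand in for them, and `cosine`, `find_parent`, and the sample messages are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two {term: weight} mappings."""
    dot = sum(w * b.get(term, 0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_parent(reply_tokens, candidates):
    """Return the ID of the candidate message most similar to the reply.

    candidates maps message ID -> list of tokens and must be non-empty.
    Raw term counts are an assumption; any term weighting can be swapped in.
    """
    query = Counter(reply_tokens)
    return max(candidates, key=lambda mid: cosine(query, Counter(candidates[mid])))


candidates = {
    "m1": "when is the first lecture".split(),
    "m2": "gas price report for august".split(),
}
parent = find_parent("the first lecture was today".split(), candidates)
print(parent)
```

The reply is treated as the query and each earlier message as a document, so the predicted parent is simply the highest-scoring earlier message.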
17 More Thread (Cont.)
- Lewis's work assumes that the thread information in the message header is incomplete.
  - May not be the case.
- The algorithm by Jamie Zawinski, widely used in the original Netscape 4.x (maybe in recent Mozilla as well?), can group threaded messages effectively.
  - http://www.jwz.org/doc/threading.htm
- Questions
  - How can we leverage the thread information in email messages more effectively?
  - Does this model extend to more recent forms of conversation, such as blogs and web forums, as well?
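When headers are complete, threads can be recovered from message IDs alone. The sketch below is a simplified union-find grouping over Message-ID and referenced IDs. It is not Jamie Zawinski's full algorithm and not necessarily how the corpus authors detected threads; `group_threads` and the sample IDs are hypothetical.

```python
def group_threads(messages):
    """Group messages that are linked, directly or indirectly, by referenced IDs.

    messages: list of dicts with an "id" and an optional "references" list
    (IDs taken from the References or In-Reply-To headers).
    Returns a list of sets of message IDs, one set per thread.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for msg in messages:
        find(msg["id"])
        for ref in msg.get("references", []):
            union(msg["id"], ref)  # ref may be a message missing from the corpus

    threads = {}
    for msg in messages:  # only messages actually present are counted
        threads.setdefault(find(msg["id"]), set()).add(msg["id"])
    return list(threads.values())


# Made-up messages: "x" is referenced but not present in the collection.
messages = [
    {"id": "a"},
    {"id": "b", "references": ["a"]},
    {"id": "c", "references": ["a", "b"]},
    {"id": "d"},
    {"id": "e", "references": ["x"]},
    {"id": "f", "references": ["x"]},
]
threads = group_threads(messages)
print(sorted(len(t) for t in threads))
```

Two replies to a message that is missing from the collection still end up in the same thread, because both point at the same referenced ID.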
18 Conclusion
- Pros
  - Introduces a new corpus that can be useful in evaluating classification performance on a large collection of personal mail
  - Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company
- Cons
  - Details on performing SVM and the linear weights for the various fields are missing
  - Not clear how threads are detected