Enron Corpus: A New Dataset for Email Classification

1
Enron Corpus: A New Dataset for Email Classification
  • By Bryan Klimt and Yiming Yang
  • CEAS 2004
  • Presented by Will Lee

2
Introduction
  • Motivation
  • Related Works
  • The Enron Corpus
  • Methods
  • Evaluation
  • Thread Information
  • Conclusion

3
Motivation
  • Other corpora focus on newsgroups or personal
    email data
  • Lack of a common data set for evaluating the
    performance of email classification
  • Previous research uses different personal data
    sets
  • Difficult to find data on actual email use within
    a company
  • Obviously, companies do not like to share their
    internal emails
  • Privacy concerns for people working for the
    company

4
Related Works
  • Other corpora
  • 20 Newsgroups
  • http://people.csail.mit.edu/people/jrennie/20Newsgroups/
  • Related papers
  • Y. Diao, H. Lu, and D. Wu, "A Comparative Study of
    Classification Based Personal E-mail Filtering"
    (PAKDD '00)
  • I. Androutsopoulos, et al., "An Experimental
    Comparison of Naïve Bayesian and Keyword-Based
    Anti-Spam Filtering with Personal E-mail Messages"
    (SIGIR '00)
  • T. Payne, "Learning Email Filtering Rules with
    Magi" (Thesis 1994)

5
20 Newsgroups
  • Collection of approximately 20,000 newsgroup
    documents, spread out evenly across 20 different
    newsgroups
  • Sample newsgroups
  • comp.graphics, rec.motorcycles,
    rec.sport.baseball, sci.electronics,
    talk.politics.misc, talk.religion.misc, etc.
  • Originally used in Ken Lang's "NewsWeeder:
    Learning to Filter Netnews" paper (ICML 1995)
  • The dataset consists of newsgroup data, so it is
    probably not very useful for research in personal
    information management

6
Enron Dataset
  • 619,446 messages (200,399 after cleaning) by 158
    users
  • Average 757 messages per user
  • Shows most users do use folders to organize
    emails
  • Can use folder information to evaluate
    effectiveness for folder classification

7
Enron Corpus Characteristics
  • Number of messages per user varies from a few
    messages to 10K messages
  • The upper bound on the number of folders seems to
    correlate with log(# of messages)
  • The number of messages does not correlate with the
    lower bound (a user can have many messages but
    only a few folders)
  • Question: How can we use this kind of information?

8
Email Classification Features
  • Constructive text
  • Bag-of-words (BOW) approach; the most commonly
    used feature
  • Some fields are more important than others
  • Stemming and stop word removal are used, but their
    effectiveness is not proven
  • Categorical text
  • To and From fields
  • BOW, useful for classification, but not as useful
    as constructive text
  • Numeric data
  • Size of message, number of replies, number of
    words, etc.
  • Not very useful
  • Thread information
  • Indicates how messages relate to each other
  • Not fully exploited

9
Email Features (Example)
Callouts on the slide: Numeric data, Categorical text, Thread information, Contextual text

From: Mark Hills <mhills_at_cs.uiuc.edu>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <cglafaf3o1_at_dcs-news1.cs.uiuc.edu>
References: <cgl09cbll1_at_dcs-news1.cs.uiuc.edu>
In-Reply-To: <cgl09cbll1_at_dcs-news1.cs.uiuc.edu>

Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
>
> Thanks
>
> Josh

The first lecture was today, during the normally scheduled time.

Mark
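The four kinds of features labelled on this slide could be pulled out of a raw message roughly as follows. This is only a sketch using Python's standard `email` package: the sample message and its `example.edu` addresses are made up, the function name `extract_features` is hypothetical, and treating Subject and body as the free-text group is an assumption rather than something the slides state.

```python
from email import message_from_string

# Made-up sample message (addresses and IDs are placeholders).
RAW = """\
From: Mark Hills <mhills@example.edu>
To: class@example.edu
Subject: Re: When is the first lecture?
Message-ID: <reply-1@example.edu>
In-Reply-To: <root-1@example.edu>
References: <root-1@example.edu>

The first lecture was today, during the normally scheduled time.
"""

def extract_features(raw):
    """Split one raw message into the four feature groups from the slide."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a simple, non-multipart message
    return {
        # Categorical text: sender and recipient fields.
        "categorical": {"from": msg.get("From", ""), "to": msg.get("To", "")},
        # Numeric data: simple size measures of the body.
        "numeric": {"num_chars": len(body), "num_words": len(body.split())},
        # Thread information: headers linking this message to earlier ones.
        "thread": {
            "message_id": msg.get("Message-ID", ""),
            "in_reply_to": msg.get("In-Reply-To", ""),
            "references": msg.get("References", "").split(),
        },
        # Free text (assumed grouping): subject line and body.
        "text": {"subject": msg.get("Subject", ""), "body": body},
    }

features = extract_features(RAW)
print(features["thread"]["references"])
```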
10
Classification Method
  • Vector space model with SVM
  • Vector weight w_i is evaluated using ltc
    (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html),
    which means
  • l: new-tf = ln(tf) + 1.0
  • t: new-wt = new-tf * log(num-docs / coll-freq-of-term)
  • c: divide each new-wt by sqrt(sum of (new-wts
    squared))
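The three ltc steps can be sketched in a few lines of Python. This is an illustration and not the authors' code: it assumes that coll-freq-of-term means the number of documents containing the term, that the logarithm is natural, and that documents arrive already tokenized. The function name `ltc_vectors` is hypothetical.

```python
import math
from collections import Counter

def ltc_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    num_docs = len(docs)
    # Assumed reading of coll-freq-of-term: number of documents containing the term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        weights = {}
        for term, tf in Counter(doc).items():
            new_tf = math.log(tf) + 1.0                              # l step
            weights[term] = new_tf * math.log(num_docs / doc_freq[term])  # t step
        # c step: cosine normalization, so each vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors

docs = [["meeting", "budget", "budget"], ["meeting", "lunch"]]
vectors = ltc_vectors(docs)
print(vectors[0])
```

A term that occurs in every document ("meeting" here) gets weight 0 from the t step, so it contributes nothing to the classifier input.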

11
Classification Method (Cont.)
  • Sort messages in chronological order, then split
    into train and test sets
  • Run SVM on term-weighted vectors of
  • From
  • Subject
  • Body
  • To, CC
  • All fields
  • Linear regression on all fields seems to have the
    best performance
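The chronological split described above might look like the following sketch. The slides do not give the split ratio, so `train_fraction=0.5` is an assumption, and the message dictionaries and the function name `chronological_split` are hypothetical.

```python
from datetime import datetime

def chronological_split(messages, train_fraction=0.5):
    """Sort by date, train on the earlier part, test on the later part."""
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Made-up messages, deliberately out of order.
messages = [
    {"id": "c", "date": datetime(2001, 3, 1), "folder": "deals"},
    {"id": "a", "date": datetime(2001, 1, 1), "folder": "deals"},
    {"id": "d", "date": datetime(2001, 4, 1), "folder": "personal"},
    {"id": "b", "date": datetime(2001, 2, 1), "folder": "personal"},
]
train, test = chronological_split(messages)
print([m["id"] for m in train], [m["id"] for m in test])
```

Splitting by time rather than at random keeps the test messages strictly later than the training messages, which mirrors how a folder classifier would be used on incoming mail.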

12
Clustering Effectiveness
13
Number of Messages vs. F1
  • The number of messages does not directly correlate
    with accuracy
  • Question: What about the case where the user has
    only one folder, which makes classification
    trivial?
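For reference, a per-folder F1 score can be computed as in the sketch below. It is a generic definition (harmonic mean of precision and recall), not the paper's evaluation code, and the function name `f1_for_folder` and the sample labels are made up. The first call shows the trivial single-folder case raised in the question.

```python
def f1_for_folder(true_folders, predicted_folders, folder):
    """F1 for one folder: harmonic mean of precision and recall."""
    pairs = list(zip(true_folders, predicted_folders))
    tp = sum(1 for t, p in pairs if t == folder and p == folder)
    fp = sum(1 for t, p in pairs if t != folder and p == folder)
    fn = sum(1 for t, p in pairs if t == folder and p != folder)
    if tp == 0:
        return 0.0  # no correct predictions for this folder
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A user with a single folder: every prediction is trivially correct.
single = f1_for_folder(["inbox"] * 5, ["inbox"] * 5, "inbox")

# Two folders, with one message from "a" wrongly filed under "b".
mixed = f1_for_folder(["a", "a", "b", "b"], ["a", "b", "b", "b"], "b")
print(single, mixed)
```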

14
Number of Folders vs. F1
  • There is a correlation between the number of
    folders and the F1 score.
  • Question: Is this trivial as well?
  • Some elements in the messages are not modeled,
    since the SVM has more messages to train on.

15
Thread Information
  • 200,399 messages, 101,786 threads, 71,696 threads
    with only one message
  • 61.63% of the messages in the corpus are in a
    thread.
  • Average thread size is 4.1 messages
  • Average number of folders per thread is 1.37
    (meaning most messages of a thread stay in one
    folder)
  • Question: It is not clear how threads are
    detected. How can we use this information?

16
More Thread
  • D. Lewis, et al., "Threading Electronic Mail: A
    Preliminary Study" (1997)
  • Lewis studied finding the parent message using a
    BOW, TF/IDF-weighted vector space approach on
    constructive text

[Formulas shown on slide: document weight, query weight, similarity]
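The idea of finding a parent message by vector-space similarity can be sketched as follows. The slide's weighting formulas are not available in this transcript, so this sketch swaps in a plain tf * log(N/df) weight with cosine similarity. It is not Lewis's exact formulation, and the function names and sample messages are hypothetical.

```python
import math
from collections import Counter

def tfidf(tokens, doc_freq, num_docs):
    """Plain tf * idf weights; terms unseen in the candidates are dropped."""
    counts = Counter(tokens)
    return {t: c * math.log(num_docs / doc_freq[t])
            for t, c in counts.items() if doc_freq.get(t)}

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def find_parent(reply, candidates):
    """Return the index of the candidate most similar to the reply."""
    docs = [c.lower().split() for c in candidates]
    doc_freq = Counter(t for d in docs for t in set(d))
    n = len(docs)
    query_vec = tfidf(reply.lower().split(), doc_freq, n)
    scores = [cosine(query_vec, tfidf(d, doc_freq, n)) for d in docs]
    return max(range(n), key=scores.__getitem__)

candidates = [
    "when is the first lecture",
    "lunch order for friday",
    "budget review meeting notes",
]
parent_index = find_parent("the first lecture was today", candidates)
print(candidates[parent_index])
```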
17
More Thread (Cont.)
  • Lewis's work assumes that the thread information
    is incomplete in the message header.
  • May not be the case.
  • The algorithm by Jamie Zawinski, widely used in
    the original Netscape 4.x (maybe in recent Mozilla
    as well?), can group threaded messages
    effectively.
  • http://www.jwz.org/doc/threading.htm
  • Questions:
  • How can we leverage the thread information in
    email messages more effectively?
  • Does this model extend to more recent forms of
    conversation, such as blogs and web forums, as
    well?
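When headers are intact, threads can be grouped from the References and In-Reply-To fields alone. The sketch below does this with a small union-find. It is a much simplified stand-in, not Zawinski's algorithm (which also builds the reply tree and handles missing messages and subject matching), and the message dictionaries and the function name `group_threads` are hypothetical.

```python
def group_threads(messages):
    """messages: dicts with 'id' and 'refs' (IDs taken from References/In-Reply-To).

    Returns a sorted list of threads, each a list of message IDs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every message to each ID it references, even if that
    # referenced message is missing from the collection.
    for m in messages:
        find(m["id"])
        for ref in m["refs"]:
            union(ref, m["id"])

    threads = {}
    for m in messages:
        threads.setdefault(find(m["id"]), []).append(m["id"])
    return sorted(threads.values())

messages = [
    {"id": "m1", "refs": []},
    {"id": "m2", "refs": ["m1"]},
    {"id": "m3", "refs": ["m1", "m2"]},
    {"id": "m4", "refs": []},
    {"id": "m5", "refs": ["m0"]},  # replies to a message not in the collection
]
threads = group_threads(messages)
print(threads)
```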

18
Conclusion
  • Pros
  • Introduces a new corpus that can be useful in
    evaluating classification performance on a large
    collection of personal mail
  • Unlike small collections of personal mail, the
    corpus can also be used to analyze behavior within
    a company
  • Cons
  • Details on performing SVM and the linear weight
    for various fields are missing
  • Not clear how threads are detected