Introduction to Automatic Email Classification

1
Introduction to Automatic Email Classification

Shih-Wen (George) Ke
7th Dec 2005

2
Overview

Introduction to Enron Corpus
Traditional Text Classification vs Email Classification
Recent Work on Enron Corpus
Our Work on Enron Corpus
Summary
Future Research Directions in Information Retrieval
Further Discussion

3
Overview

The nature of email classification is very different from that of traditional text classification tasks.
Email is time-dependent, poorly structured and written informally, and no standard ways of preparing and evaluating email datasets have been proposed.

4
Introduction

Automatic email classification dates back to the mid-1990s
Email classification received little attention until recently because no standard email dataset was available
The Enron Email Corpus became available in March 2004

5
Introduction - Enron Corpus

Distributed by William Cohen at Carnegie Mellon University
Consists of 517,431 messages belonging to 150 users of Enron Corporation
Most users use folders to categorise their emails
The upper bound on the number of folders appears to be the log of the number of messages (Klimt & Yang, 2004)

6
Email Classification Assumptions

Categorise email into folders, a.k.a. email foldering
Only personal and professional emails are considered here
Assume that users use folders to organise their emails
Other methods of organising emails, e.g. flags or labels, are not considered here, although they may provide more information for email classification

7
Recent Work on Enron Corpus

Bekkerman et al. (2004); Klimt & Yang (2004)
Mono Multiple-classification Multiple-classification
Accuracy (TP/N); PR, Micro & Macro F1
SVM performed best in most cases, but the differences were not statistically significant
Newly created folders adversely affect performance
Performance does not necessarily improve as the training set size grows
Incoming emails are more related to those recently received than to those received long ago
Enron is suitable for email classification evaluation
Body field is the most useful feature, followed by From
Email threads can be a valuable asset to email classification, but they are difficult to detect and evaluate
Foldering strategies differ individually
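
The evaluation measures named above can be illustrated with a small sketch. It assumes single-label foldering (each message has exactly one correct folder); the function names and the toy folder labels are hypothetical and are not taken from either study.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def evaluate(gold, predicted):
    """gold/predicted: parallel lists of folder names, one per message."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # wrong folder gains a false positive
            fn[g] += 1   # correct folder gains a false negative
    folders = sorted(set(gold) | set(predicted))
    accuracy = sum(tp.values()) / len(gold)                      # TP / N
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in folders) / len(folders)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {"accuracy": accuracy, "macro_f1": macro, "micro_f1": micro}

# Toy data: one large folder and two single-message folders.
gold = ["work", "work", "work", "work", "personal", "travel"]
pred = ["work", "work", "work", "work", "work", "travel"]
scores = evaluate(gold, pred)
```

In this toy case the single miss falls on a small folder, so macro F1 drops well below micro F1, which equals accuracy when every message has exactly one folder. This is why macro-averaging is the more sensitive measure for small folders.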
8
Our Work on Enron Corpus - Introduction

Users sometimes forget which folders they have created or which folder they should file an email under
So users tend to create new (duplicate) folders
Newly created folders adversely affect performance (Bekkerman et al., 2004)
Reduce the likelihood of users creating duplicate folders by improving the accuracy of assigning incoming emails to folders that already exist
Compare state-of-the-art classifiers (kNN, SVM) and our own classifier, PERC, in a simulation of a real-time situation under various parameter settings
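
One way to approximate a real-time situation is to split each user's mail chronologically, so that the classifier only ever trains on messages older than the ones it is tested on. The sketch below is an assumption about how such a simulation could be set up, not the procedure used in this work; the message fields (`date`, `folder`) and both function names are hypothetical.

```python
from datetime import datetime

def chronological_split(messages, train_fraction=0.5):
    """Sort messages by date; the oldest go to training, the rest to testing."""
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

def unseen_folders(train, test):
    """Folders that appear in the test period but never in the training period."""
    seen = {m["folder"] for m in train}
    return {m["folder"] for m in test} - seen

# Toy mailbox, deliberately out of date order.
messages = [
    {"date": datetime(2001, 3, 15), "folder": "projects"},
    {"date": datetime(2001, 1, 5), "folder": "projects"},
    {"date": datetime(2001, 4, 20), "folder": "travel"},
    {"date": datetime(2001, 2, 10), "folder": "hr"},
]
train, test = chronological_split(messages)
new_folders = unseen_folders(train, test)
```

Folders returned by `unseen_folders` correspond to the newly created folders discussed above: a classifier trained only on the earlier period has no examples for them.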

9
Our Work on Enron Corpus - The PERC

The PERC Classifier (PERsonal email Classifier)

Find a centroid ci for each category Ci
For each test document x:
Find the k nearest neighbouring training documents to x
The similarity between x and each neighbouring training document dj is added to the similarity between x and ci
Sort the similarity scores sim(x,Ci) in descending order
The decision to assign x to Ci can be made using various thresholding strategies
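
The steps above can be sketched in a few lines. This is one reading of the slide, not the author's implementation: it assumes cosine similarity over sparse term-weight vectors, centroids computed as the mean vector of each folder, and a score that sums, over the k nearest neighbours in a folder, the neighbour similarity plus the similarity to that folder's centroid. All function names are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroids(training):
    """Mean term-weight vector of each folder's training documents."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, folder in training:
        counts[folder] += 1
        for term, weight in vec.items():
            sums[folder][term] += weight
    return {f: {t: w / counts[f] for t, w in terms.items()}
            for f, terms in sums.items()}

def perc_scores(x, training, k=3):
    """Rank folders for test vector x under the assumed scoring rule."""
    cents = centroids(training)
    neighbours = sorted(training, key=lambda d: cosine(x, d[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, folder in neighbours:
        # Neighbour similarity is added to the similarity with its folder's centroid.
        scores[folder] += cosine(x, vec) + cosine(x, cents[folder])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

training = [
    ({"budget": 1.0, "meeting": 1.0}, "finance"),
    ({"budget": 1.0, "report": 1.0}, "finance"),
    ({"flight": 1.0, "hotel": 1.0}, "travel"),
]
ranked = perc_scores({"budget": 1.0, "report": 1.0}, training, k=2)
```

Taking the top-ranked folder from `ranked` is the simplest thresholding choice; other strategies would apply a cut-off to the scores instead.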

10
Our Work on Enron Corpus - The PERC

The PERC Classifier (PERsonal email Classifier)
where y(dj,Ci) ∈ {0,1} is the classification for training document dj with respect to category Ci; sim(x,dj) is the similarity between test document x and training document dj; and sim(x,ci) is the similarity between test document x and the centroid ci of the category that dj belongs to.
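
The scoring formula that these definitions describe is not reproduced in the text. One form consistent with the definitions and with the steps listed for PERC is given below; it is an assumption, not necessarily the author's exact equation, and the notation kNN(x) for the set of k nearest training documents is introduced here for illustration.

```latex
\[
\mathrm{sim}(x, C_i) \;=\; \sum_{d_j \in \mathrm{kNN}(x)} y(d_j, C_i)\,
\bigl[\,\mathrm{sim}(x, d_j) + \mathrm{sim}(x, c_i)\,\bigr],
\qquad y(d_j, C_i) \in \{0, 1\}
\]
```

Under this reading, only neighbours that belong to Ci contribute to its score, and each contribution combines a local (neighbour) term with a global (centroid) term.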

11
Rationale for the Hybrid Approach

Centroid method overcomes data sparseness: emails tend to be short.
kNN allows the topic of a folder to drift over time. Considering the vector space locally allows matching against features which are currently dominant.

12
Our Work on Enron Corpus - Results

SVM1 (c=1, j=1), SVM2 (c=0.01, j=1)
Micro-averaged and macro-averaged F1 over all users, with standard deviation, for kNN, SVM and PERC
For macro-averaged evaluations, PERC significantly outperformed kNN (t=2.786, p=0.032), SVM1 (t=2.533, p=0.044) and SVM2 (t=5.926, p=0.001)
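
The slide does not say which significance test produced these t values. A minimal sketch, assuming a paired t-test over per-user macro F1 scores, is shown below; the per-user numbers are invented for illustration and are not the reported results.

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-user score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-user macro F1 values (illustration only).
perc = [0.52, 0.61, 0.47, 0.58, 0.66, 0.49, 0.55]
knn = [0.48, 0.55, 0.46, 0.51, 0.60, 0.50, 0.49]
t, df = paired_t(perc, knn)
```

The p-value is then read from a t distribution with `df` degrees of freedom; a positive `t` means the first system scored higher on average across users.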
13
Our Work on Enron Corpus - Conclusions

PERC has the highest accuracy in assigning test documents to small folders
kNN and PERC performed better with smaller k
SVM parameters can be sensitive to the number of training documents available
Various parameter settings and training/test set splits will be investigated
Use of time will be investigated
A questionnaire-based study is being conducted to investigate how real users behave in email management

14
Future Research Directions in IR

Use of time information
Training/test set splits
Feature extraction, selection
Document representation
Qualitative evaluation
Thread detection, TDT (topic detection and tracking) for email
Mining sequential patterns
Burst of activity (Kleinberg, 2002)

15
References

Bekkerman, R., McCallum, A. and Huang, G. (2004) Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora. Technical Report IR-418, CIIR, University of Massachusetts.
Kleinberg, J. (2002) Bursty and Hierarchical Structure in Streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Klimt, B. and Yang, Y. (2004) The Enron Corpus: A New Dataset for Email Classification Research. European Conference on Machine Learning.