Introduction to Automatic Email Classification

Description:

Introduction to Automatic Email Classification. Shih-Wen (George) Ke. 7th Dec 2005. Overview ... Email is time-dependent, poorly structured and written in ... – PowerPoint PPT presentation

Transcript and Presenter's Notes

1
Introduction to Automatic Email Classification
  • Shih-Wen (George) Ke
  • 7th Dec 2005

2
Overview
  • Introduction to Enron Corpus
  • Traditional Text Classification vs Email
    Classification
  • Recent Work on Enron Corpus
  • Our Work on Enron Corpus
  • Summary
  • Future Research Directions in Information
    Retrieval
  • Further Discussion

3
Overview
  • The nature of email classification is very
    different to that of traditional text
    classification tasks.
  • Email is time-dependent, poorly structured and
    written in an informal style.
  • No standard ways of preparing and evaluating
    email datasets have been proposed.

4
Introduction
  • Automatic Email Classification dates back to the
    mid-1990s
  • Email Classification received little attention
    until recently because there was no standard
    email dataset available
  • The Enron Email Corpus became available in March
    2004

5
Introduction: Enron Corpus
  • Distributed by William Cohen at Carnegie Mellon
    University
  • Consists of 517,431 messages that belong to 150
    users of Enron Corporation
  • Most users use folders to categorise their emails
  • Upper bound for the number of folders appears to
    be the log of the number of messages (Klimt &
    Yang, 2004)

6
Email Classification Assumptions
  • Categorise email into folders a.k.a. email
    foldering
  • Only personal and professional emails are
    considered here
  • Assume that users use folders to organise their
    emails
  • Other methods of organising emails, e.g. flag or
    label, are not considered here although they may
    provide more information in Email Classification

7
Recent Work on Enron Corpus
Bekkerman et al. (2004) Klimt & Yang (2004)
Mono Multiple-classification Multiple-classification
Accuracy (TP/N) P/R, Micro and Macro F1
  • SVM performed best in most cases, but the
    difference was not statistically significant
  • Newly created folders adversely affect
    performance
  • Performance does not necessarily improve as the
    training set size grows
  • Incoming emails are more related to those
    recently received than those received long ago
  • Enron is suitable for email classification
    evaluation
  • Body field is the most useful feature, followed
    by From
  • Email threads can be a valuable asset to email
    classification but they are difficult to detect
    and evaluate
  • Foldering strategies differ individually

8
Our Work on Enron Corpus - Introduction
  • Users sometimes forget which folders they have
    created or which folder they should file an
    email under
  • So users tend to create new (duplicate) folders
  • Newly created folders adversely affect
    performance (Bekkerman et al., 2004)
  • Aim: reduce the likelihood of users creating
    duplicate folders by improving the accuracy of
    assigning incoming emails to folders that
    already exist
  • Compare state-of-the-art classifiers (kNN, SVM)
    with our own classifier, PERC, in a simulation
    of a real-time situation using various parameter
    settings
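
A real-time simulation of this kind can be sketched as a chronological split: messages are sorted by timestamp, the classifier is trained on the earliest messages, and the later ones are used for testing. The function name `chronological_split`, the `(timestamp, folder, text)` tuple layout and the 0.5 training fraction are illustrative assumptions, not the settings used in the experiments.

```python
from typing import List, Tuple

# Hypothetical record layout: (timestamp, folder, text).
Email = Tuple[int, str, str]


def chronological_split(emails: List[Email], train_fraction: float = 0.5):
    """Split emails so that every training message predates every test message."""
    ordered = sorted(emails, key=lambda e: e[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy data, invented for illustration only.
emails = [
    (3, "projects", "status update on the pipeline"),
    (1, "meetings", "agenda for monday"),
    (4, "projects", "pipeline schedule"),
    (2, "meetings", "minutes from monday"),
]
train, test = chronological_split(emails)
print([e[0] for e in train], [e[0] for e in test])
```

In this toy data the `projects` folder appears only in the test half, which mirrors the newly created folder problem noted above: a classifier trained on the earlier messages has never seen that folder.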

9
Our Work on Enron Corpus - The PERC
  • The PERC Classifier (PERsonal email Classifier)
  • Find a centroid ci for each category Ci
  • For each test document x:
    • Find the k nearest neighbouring training
      documents to x
    • Similarity between x and the training document
      dj is added to similarity between x and ci
    • Sort similarity scores sim(x,Ci) in descending
      order
    • Decision to assign x to Ci can be made using
      various thresholding strategies
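
The steps above can be sketched in a few lines of Python. This is one reading of the slide, not the authors' implementation: the bag-of-words term counts, the cosine similarity, the choice to start every category's score from its centroid similarity, and the names `vectorise`, `centroids` and `perc_scores` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def vectorise(text):
    """Bag-of-words term counts (an assumed, deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroids(training):
    """Average the term vectors of each category's training documents."""
    sums, counts = defaultdict(Counter), Counter()
    for vec, cat in training:
        sums[cat].update(vec)
        counts[cat] += 1
    return {cat: Counter({t: v / counts[cat] for t, v in sums[cat].items()}) for cat in sums}


def perc_scores(x, training, k=3):
    """Rank categories by centroid similarity plus the similarities of x's k nearest neighbours."""
    # Start each category's score with the similarity between x and its centroid.
    scores = {cat: cosine(x, c) for cat, c in centroids(training).items()}
    # Add each nearest neighbour's similarity to the score of the neighbour's own category.
    neighbours = sorted(training, key=lambda d: cosine(x, d[0]), reverse=True)[:k]
    for vec, cat in neighbours:
        scores[cat] += cosine(x, vec)
    # Sort the category scores in descending order.
    return sorted(scores.items(), key=lambda s: s[1], reverse=True)


# Toy training set, invented for illustration only.
training = [
    (vectorise("meeting agenda monday"), "meetings"),
    (vectorise("meeting minutes room"), "meetings"),
    (vectorise("pipeline project status"), "projects"),
    (vectorise("project budget status"), "projects"),
]
ranked = perc_scores(vectorise("agenda for the monday meeting"), training, k=2)
print(ranked[0][0])  # prints: meetings
```

Taking the top-ranked category, as the last line does, is only the simplest possible thresholding strategy; a score cut-off or a rank cut-off could be applied to `ranked` instead.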

10
Our Work on Enron Corpus - The PERC
  • The PERC Classifier (PERsonal email Classifier)
  • where y(dj,Ci) ∈ {0,1} is the classification
    for training document dj with respect to category
    Ci; sim(x,dj) is the similarity between test
    document x and training document dj; and
    sim(x,ci) is the similarity between test document
    x and the centroid ci of the category that dj
    belongs to.
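
The formula on this slide did not survive in the transcript. One form consistent with the symbol definitions above, and with the description of adding neighbour similarities to the centroid similarity, is sketched below. It is a reconstruction under that reading, not the slide's original equation, and kNN(x) (the set of the k nearest training documents to x) is notation introduced here.

```latex
\mathrm{sim}(x, C_i) = \mathrm{sim}(x, c_i) + \sum_{d_j \in \mathrm{kNN}(x)} \mathrm{sim}(x, d_j)\, y(d_j, C_i)
```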

11
Rationale for the Hybrid Approach
  • Centroid method overcomes data sparseness:
    emails tend to be short.
  • kNN allows the topic of a folder to drift over
    time. Considering the vector space locally allows
    matching against features which are currently
    dominant.

12
Our Work on Enron Corpus - Results
  • SVM1 (c=1, j=1), SVM2 (c=0.01, j=1)
  • Micro-averaged and macro-averaged F1 over all
    users, with standard deviation, for kNN, SVM and
    PERC
  • For macro-averaged evaluations, PERC
    significantly outperformed kNN (t=2.786,
    p=0.032), SVM1 (t=2.533, p=0.044) and SVM2
    (t=5.926, p=0.001)
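
To make the difference between the two averages concrete, here is a minimal single-label sketch. Macro-averaging computes F1 per folder and averages the results, so small folders count as much as large ones; micro-averaging pools the true positive, false positive and false negative counts across folders first. The folder names and labels are made-up toy data, not results from the experiments.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, predicted):
    """Return (micro-averaged F1, macro-averaged F1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted folder gains a false positive
            fn[g] += 1  # the true folder gains a false negative
    folders = set(gold) | set(predicted)
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in folders) / len(folders)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# One large folder classified perfectly, one small folder missed entirely.
gold = ["inbox", "inbox", "inbox", "inbox", "legal"]
pred = ["inbox", "inbox", "inbox", "inbox", "inbox"]
micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 3), round(macro, 3))
```

Here the large folder dominates the micro average (0.8), while the missed small folder pulls the macro average down to about 0.44. Macro-averaged F1 is therefore the measure that reflects performance on small folders.
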
13
Our Work on Enron Corpus - Conclusions
  • PERC has the highest accuracy in assigning test
    documents to small folders
  • kNN and PERC performed better with smaller k
  • Parameters of SVM can be sensitive to the number
    of training documents available
  • Investigate various parameter settings and
    training/test set splits
  • Use of time will be investigated
  • A questionnaire-based study is being conducted
    to find out how real users behave in email
    management

14
Future Research Directions in IR
  • Use of time information
  • Training/test set splits
  • Feature extraction, selection
  • Document representation
  • Qualitative evaluation
  • Thread detection, TDT (topic detection and
    tracking) for email
  • Mining sequential patterns
  • Burst of activity (Kleinberg, 2002)

15
References
  • Bekkerman, R., McCallum, A. and Huang, G. (2004)
    Automatic Categorization of Email into Folders:
    Benchmark Experiments on Enron and SRI Corpora.
    Technical Report IR-418, CIIR, University of
    Massachusetts.
  • Kleinberg, J. (2002) Bursty and Hierarchical
    Structure in Streams. In ACM SIGKDD International
    Conference on Knowledge Discovery and Data
    Mining.
  • Klimt, B. and Yang, Y. (2004) The Enron Corpus: A
    New Dataset for Email Classification Research.
    European Conference on Machine Learning.