Title: Modeling Intention in Email - Vitor R. Carvalho
1. Modeling Intention in Email
Vitor R. Carvalho
- Ph.D. Thesis Defense
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University
- July 22nd, 2008
- Thesis Committee: William W. Cohen (chair), Tom M. Mitchell, Robert E. Kraut, Lise Getoor (Univ. of Maryland)
2. Outline
- Motivation
- Email Acts
- Preventing Email Information Leaks
- Recommending Email Recipients
- Learning Robust Ranking Models
- User Study
3. Why Email?
- The most successful e-communication application.
- Great tool for collaboration, especially across different time zones.
- Very cheap, fast, and convenient.
- Multiple uses: task manager, contact manager, document archive, to-do list, etc.
- Increasingly popular
  - The Clinton administration left 32 million emails to the National Archives
  - The Bush administration is expected to leave more than 100 million in 2009
- Visible impact
  - Office workers in the U.S. spend at least 25% of the day on email, not counting handheld use
Shipley & Schwalbe, 2007
4. Hard to manage
- People get overwhelmed
  - Costly interruptions
  - Serious impacts on work productivity
  - Increasingly difficult to manage requests, negotiate shared tasks, and keep track of different commitments
- People make horrible mistakes
  - Send messages to the wrong persons
  - Forget to address intended recipients
  - "Oops, did I just hit reply-to-all?"
Dabbish & Kraut, CSCW-2006; Bellotti et al., HCI-2005
5. Thesis
- We present evidence that email management can potentially be improved by the effective use of machine learning techniques to model different aspects of user intention.
6. Outline
- Motivation
- Email Acts ←
- Preventing Email Information Leaks
- Recommending Email Recipients
- Learning Robust Ranking Models
- User Study
7. Classifying Email into Acts
Cohen, Carvalho & Mitchell, EMNLP-04
- An act is described as a verb-noun pair (e.g., propose meeting, request information)
- Not all pairs make sense
- A single email message may contain multiple acts
- The taxonomy tries to describe commonly observed behaviors, rather than all possible speech acts in English
- It also includes non-linguistic usage of email (e.g., delivery of files)
[Diagram: taxonomy of act verbs and nouns]
8. Data & Features
- Data: Carnegie Mellon MBA students competition
  - Semester-long project for CMU MBA students. Total of 277 students, divided into 50 teams (4 to 6 students/team). Rich in task negotiation.
  - 1700 messages (from 5 teams) were manually labeled. One of the teams was double-labeled, and the inter-annotator agreement ranges from 0.72 to 0.83 (Kappa) for the most frequent acts.
- Features
  - N-grams: 1-gram, 2-gram, 3-gram, 4-gram, and 5-gram
  - Pre-processing (sketched below)
    - Remove signature files and quoted lines (in-reply-to) with the Jangada package
    - Entity normalization and substitution patterns: "Sunday", "Monday", ... → day; number:number → hour; "me", "her", "him", "us", "them" → me; "after", "before", "during" → time; etc.
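A minimal sketch of this kind of preprocessing and n-gram feature extraction in Python (scikit-learn), assuming plain-text message bodies; the regular expressions and placeholder tokens are illustrative, and the Jangada signature-removal step is omitted.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative substitution patterns (placeholder tokens, not the exact thesis patterns).
SUBSTITUTIONS = [
    (re.compile(r"\b(sunday|monday|tuesday|wednesday|thursday|friday|saturday)\b"), "_day_"),
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "_hour_"),
    (re.compile(r"\b(me|her|him|us|them)\b"), "_me_"),
    (re.compile(r"\b(after|before|during)\b"), "_time_"),
]

def preprocess(body: str) -> str:
    # Drop quoted reply lines; signature removal (done with Jangada in the thesis) is omitted here.
    lines = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    text = " ".join(lines).lower()
    for pattern, token in SUBSTITUTIONS:
        text = pattern.sub(token, text)
    return text

# Bag of 1- to 5-grams over the normalized text, as on the slide.
vectorizer = CountVectorizer(preprocessor=preprocess, ngram_range=(1, 5), binary=True)
X = vectorizer.fit_transform(["Are we meeting Monday at 3:30?", "> quoted line\nI will send it to them."])
```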
9. Classification Performance
Carvalho & Cohen, HLT-ACTS-06; Cohen, Carvalho & Mitchell, EMNLP-04
5-fold cross-validation over 1716 emails, SVM with linear kernel (see sketch below)
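A minimal sketch of the evaluation setup, assuming one binary linear SVM per act; the corpus below is a tiny stand-in for the labeled CMU data, so it uses 2 folds instead of the 5 folds reported on the slide.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny stand-in for the labeled CMU corpus: binary labels for one act (e.g., "Request").
messages = [
    "could you send me the draft by friday",
    "here is the final report, attached",
    "please review section 2 and reply",
    "i will take care of the slides tonight",
]
labels = np.array([1, 0, 1, 0])

# One linear-kernel SVM per act; the slide reports 5-fold CV over 1716 emails,
# cv=2 is used here only because this toy corpus is tiny.
model = make_pipeline(CountVectorizer(ngram_range=(1, 5)), LinearSVC())
print(cross_val_score(model, messages, labels, cv=2).mean())
```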
10. Predicting Acts from Surrounding Acts
Carvalho & Cohen, SIGIR-05
[Diagram: example of an email thread sequence, with messages labeled Request, Propose, Deliver, and Commit]
- Strong correlation between the acts of previous and next messages
- Both context and content have predictive value for email act classification
- Collective classification problem → Dependency Network
11. Collective Classification with Dependency Networks (DN)
Carvalho & Cohen, SIGIR-05
- In DNs, the full joint probability distribution is approximated with a set of conditional distributions that can be learned independently. The conditional probabilities are calculated for each node given its Markov blanket.
- Inference: temperature-driven Gibbs sampling (see sketch below)
Heckerman et al., JMLR-00; Neville & Jensen, JMLR-07
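A minimal sketch of temperature-driven Gibbs sampling over a dependency network; cond_prob stands in for the independently learned local conditional model (a classifier over a message's own features plus its neighbors' current act labels) and is an assumption, not the thesis implementation.

```python
import numpy as np

def gibbs_infer(nodes, neighbors, cond_prob, n_iters=50, t0=4.0):
    """Temperature-driven Gibbs sampling sketch for a dependency network.
    nodes:     per-message feature objects
    neighbors: dict node index -> indices in its Markov blanket (parent/child in the thread)
    cond_prob: hypothetical local model, cond_prob(features, neighbor_labels) -> prob. vector over acts
    """
    rng = np.random.default_rng(0)
    n_acts = len(np.asarray(cond_prob(nodes[0], []), dtype=float))
    labels = [int(rng.integers(n_acts)) for _ in nodes]          # random initialization
    for it in range(n_iters):
        temperature = max(t0 * (1.0 - it / n_iters), 1e-3)       # anneal toward a greedy assignment
        for i in range(len(nodes)):
            nb_labels = [labels[j] for j in neighbors.get(i, [])]
            p = np.asarray(cond_prob(nodes[i], nb_labels), dtype=float)
            p = p ** (1.0 / temperature)                          # sharpen as the temperature drops
            p = p / p.sum()
            labels[i] = int(rng.choice(n_acts, p=p))
    return labels
```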
12. Act-by-Act Comparative Results
- Modest improvements over the baseline
- Improvements only on acts related to negotiation: Request, Commit, Propose, Meet, Commissive, etc.
[Chart: Kappa values with and without collective classification, averaged over the four team test sets in the leave-one-team-out experiment]
13. Key Ideas
- Summary
  - Introduced a new taxonomy of acts tailored to email communication
  - Good levels of inter-annotator agreement
  - Showed that the categorization can be automated
  - Proposed a collective classification algorithm for threaded messages
- Related Work
  - Speech Act Theory [Austin, 1962; Searle, 1969], the Coordinator system [Winograd, 1987], Dialog Acts for speech recognition, machine translation, and other dialog-based systems [Stolcke et al., 2000; Levin et al., 2003], etc.
- Related applications
  - Focus message in threads/discussions [Feng et al., 2006], action-item discovery [Bennett & Carbonell, 2005], task-focused email summarization [Corston-Oliver et al., 2004], predicting social roles [Leuski, 2004], etc.
14. Applications of Email Acts
- Iterative Learning of Email Tasks and Email Acts (Kushmerick & Khousainov, IJCAI-05)
- Predicting Social Roles and Group Leadership (Leuski, SIGIR-04; Carvalho et al., CEAS-07)
- Detecting Focus on Threaded Discussions (Feng et al., HLT/NAACL-06)
- Semantically Enhanced Email (Scerri et al., DEXA-07)
- Email Act Taxonomy Refinements (Lampert et al., AAAI-2008 EMAIL workshop)
15. Outline
- Motivation
- Email Acts
- Preventing Email Information Leaks ←
- Recommending Email Recipients ←
- Learning Robust Ranking Models
- User Study
16. [Image-only slide: no transcript]
17. [Image-only slide: no transcript]
18. http://www.sophos.com/
19. Preventing Email Info Leaks
Carvalho & Cohen, SDM-07
- Email leak: an email accidentally sent to the wrong person
  - Similar first or last names, aliases, etc.
  - Aggressive auto-completion of email addresses
  - Typos
  - Keyboard settings
- Disastrous consequences: expensive lawsuits, brand reputation damage, negotiation setbacks, etc.
20. Preventing Email Info Leaks
Carvalho & Cohen, SDM-07
- Method
  - Create simulated/artificial leak recipients (similar first or last names, aliases, aggressive auto-completion of email addresses, typos, keyboard settings, etc.)
  - Build a model for (message, recipients): train a classifier on real data to detect synthetically created outliers (added to the true recipient list)
  - Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.)
  - Detect potential outliers and warn the user based on confidence
21. Simulating Email Leaks
- Several options
  - Frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc.
- We adopted the 3g-address criteria
  - On each trial, one of the message recipients is randomly chosen and an outlier is generated as follows (see sketch below):
    - with probability a: pick a similar address from the Address Book (e.g., sharing a character 3-gram with the chosen recipient's address)
    - with probability 1 - a: generate a random email address NOT in the Address Book
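A minimal sketch of this kind of leak simulation, assuming a flat list of known addresses; the trigram-matching rule and the mixing probability alpha are illustrative stand-ins for the exact 3g-address settings.

```python
import random
import string

def char_trigrams(address: str) -> set:
    return {address[i:i + 3] for i in range(len(address) - 2)}

def simulate_leak(true_recipients, address_book, alpha=0.8, seed=0):
    """Pick one true recipient and return a plausible simulated 'leak' address."""
    rng = random.Random(seed)
    target = rng.choice(true_recipients)
    if rng.random() < alpha:
        # An address-book entry sharing at least one character 3-gram with the chosen recipient.
        candidates = [a for a in address_book
                      if a not in true_recipients and char_trigrams(a) & char_trigrams(target)]
        if candidates:
            return rng.choice(candidates)
    # Otherwise fabricate a random address that is NOT in the address book.
    while True:
        fake = "".join(rng.choices(string.ascii_lowercase, k=8)) + "@example.com"
        if fake not in address_book:
            return fake

# The returned address is appended to the true recipient list, and the classifier is trained
# to single it out as the outlier.
```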
22. Data and Baselines
- Enron email dataset, with a realistic setting
  - For each user, the 10 most recent sent messages were used as the test set
  - Some basic preprocessing
- Baseline methods: textual similarity, common baselines in IR (sketched below)
  - Rocchio/TFIDF Centroid (1971): create a TFIDF centroid for each user in the Address Book; for testing, rank users according to the cosine similarity between the test message and each centroid.
  - Knn-30 (Yang & Chute, 1994): given a test message, get the 30 most similar messages in the training set; rank each user according to the sum of its similarities over the 30-message set.
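A minimal sketch of the two IR baselines (TFIDF/Rocchio centroid and Knn-30), assuming each training message comes with its list of recipient addresses; function and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fit_baselines(train_msgs, train_rcpts):
    """train_msgs: message texts; train_rcpts: recipient-address list for each message."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(train_msgs)
    centroids = {}
    for rcpt in {r for rs in train_rcpts for r in rs}:
        rows = [i for i, rs in enumerate(train_rcpts) if rcpt in rs]
        centroids[rcpt] = np.asarray(X[rows].mean(axis=0))   # TFIDF centroid for this recipient
    return vec, X, centroids

def rank_rocchio(test_msg, vec, centroids):
    """Rank Address Book entries by cosine similarity to their TFIDF centroid."""
    q = vec.transform([test_msg])
    scores = {r: cosine_similarity(q, c)[0, 0] for r, c in centroids.items()}
    return sorted(scores, key=scores.get, reverse=True)

def rank_knn30(test_msg, vec, X_train, train_rcpts, k=30):
    """Rank recipients by summed similarity over the k most similar training messages."""
    sims = cosine_similarity(vec.transform([test_msg]), X_train).ravel()
    top = np.argsort(sims)[::-1][:k]
    scores = {}
    for i in top:
        for r in train_rcpts[i]:
            scores[r] = scores.get(r, 0.0) + sims[i]
    return sorted(scores, key=scores.get, reverse=True)
```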
23. Using Non-Textual Features
- Frequency features
  - Number of received, sent, and sent+received messages (from/to this user)
- Co-occurrence features
  - Number of times a user co-occurred with all other recipients
- Auto features
  - For each recipient R, find Rm (the address with the maximum score from R's 3g-address list), then use score(R) - score(Rm) as a feature
- Combine the text-based feature (KNN-30 score or TFIDF score) with the non-textual features using perceptron-based reranking, trained on simulated leaks (see sketch below)
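A minimal sketch of perceptron-based reranking over the combined features, assuming each training pair holds the feature vectors of a genuine recipient and of a simulated leak from the same message; the thesis's exact reranker (e.g., averaging or voting variants) may differ.

```python
import numpy as np

def train_rerank_perceptron(pairs, n_features, epochs=10, lr=1.0):
    """Pairwise perceptron sketch. Each pair is (x_true, x_leak): combined feature vectors
    (text score plus non-textual features) for a genuine recipient and a simulated leak
    added to the same message. Learns w so that w . x_true > w . x_leak."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x_true, x_leak in pairs:
            if w @ x_true <= w @ x_leak:        # misranked pair
                w += lr * (x_true - x_leak)     # standard perceptron update
    return w

# At send time, score every recipient of the outgoing message with w @ x; the lowest-scoring
# recipient is flagged as a potential leak when its score falls below a confidence threshold.
```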
24. Email Leak Results
Carvalho & Cohen, SDM-07
25. Finding Real Leaks in Enron
- "Sorry. Sent this to you by mistake.", "I accidentally sent you this reminder"
- How can we find them?
  - Grep for "mistake", "sorry", or "accident"
  - Note: must be from one of the Enron users
  - Found 2 valid cases
    - Message germany-c/sent/930: the message has 20 recipients, the leak is alex.perkins_at_
    - Message kitchen-l/sent items/497: it has 44 recipients, the leak is rita.wynne_at_
- Prediction results
  - The proposed algorithm was able to find both of these leaks
26. Another Email Addressing Problem
Sometimes people just forget an intended recipient
27. Forgetting an intended recipient
- Particularly in large organizations, it is not uncommon to forget to CC an important collaborator: a manager, a colleague, a contractor, an intern, etc.
- More frequent than expected (from the Enron Collection)
  - At least 9.27% of the users have forgotten to add a desired email recipient.
  - At least 20.52% of the users were not included as recipients (even though they were intended recipients) in at least one received message.
- The cost of errors in task management can be high
  - Communication delays, missed deadlines
  - Wasted opportunities, costly misunderstandings, task delays
Carvalho & Cohen, ECIR-2008
28. Data and Features
- Easy to obtain labeled data
- Two ranking problems
  - Predicting TO+CC+BCC
  - Predicting CC+BCC
- Features & methods
  - Textual: Rocchio (TFIDF) and KNN
  - Non-textual: frequency, recency, and co-occurrence (sketched below)
    - Number of messages received and/or sent (from/to this user)
    - How often a particular user was addressed in the last 100 messages
    - Number of times a user co-occurred with all other recipients ("co-occur" means two recipients were addressed in the same message in the training set)
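A minimal sketch of the frequency, recency, and co-occurrence statistics, assuming sent_history is a chronologically ordered list of recipient lists; the exact feature definitions in the thesis (e.g., received-message counts) may differ.

```python
from collections import Counter
from itertools import combinations

def recipient_stats(sent_history, recency_window=100):
    """sent_history: chronologically ordered list of recipient lists from the user's sent messages."""
    frequency = Counter(r for rcpts in sent_history for r in rcpts)                   # how often addressed
    recency = Counter(r for rcpts in sent_history[-recency_window:] for r in rcpts)   # last 100 messages
    cooccurrence = Counter()
    for rcpts in sent_history:
        for a, b in combinations(sorted(set(rcpts)), 2):
            cooccurrence[(a, b)] += 1
    return frequency, recency, cooccurrence

# For a candidate recipient c of a partially addressed message, a co-occurrence feature can be
# the sum of cooccurrence[(min(c, r), max(c, r))] over the recipients r already on the message.
```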
29. Email Recipient Recommendation
Carvalho & Cohen, ECIR-08
[Chart: results over 36 Enron users (about 44,000 queries, avg. 1,267 queries/user); MRR around 0.5]
30. Rank Aggregation
Aslam & Montague, 2001; Ogilvie & Callan, 2003; Macdonald & Ounis, 2006
- Many data fusion methods, of two types
  - Normalized scores: CombSUM, CombMNZ, etc.
  - Unnormalized scores: BordaCount, Reciprocal Rank Sum, etc.
- Reciprocal Rank: the sum of the inverse of the rank of the document in each ranking (see sketch below)
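A minimal sketch of Reciprocal Rank Sum fusion; the candidate names and the two input rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings):
    """Each input ranking is an ordered list of candidates; a candidate's fused score is
    the sum of 1/rank over the rankings that contain it."""
    scores = {}
    for ranking in rankings:
        for position, candidate in enumerate(ranking, start=1):
            scores[candidate] = scores.get(candidate, 0.0) + 1.0 / position
    return sorted(scores, key=scores.get, reverse=True)

# Example: fusing a TFIDF-based and a frequency-based recipient ranking.
fused = reciprocal_rank_fusion([["ann", "bob", "carl"], ["ann", "dave", "bob"]])
# fused == ['ann', 'bob', 'dave', 'carl']  (ann: 1 + 1 = 2.0, bob: 1/2 + 1/3 = 0.83, ...)
```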
31. Rank Aggregation Results
32. Intelligent Email Auto-completion
Carvalho & Cohen, ECIR-08
[Charts: auto-completion results for TO+CC+BCC and CC+BCC prediction]
33. Related Work
- Email leaks
  - Boufaden et al., 2005 proposed a privacy-enforcement system to monitor specific privacy breaches (student names, student grades, IDs)
  - Lieberman and Miller, 2007 prevent leaks based on faces
- Recipient recommendation
  - Pal & McCallum, 2006 (CC prediction problem); Dredze et al., 2008 (recipient prediction based on summary keywords)
- Expert search in email
  - Dom et al., 2003; Campbell et al., 2003; Balog & de Rijke, 2006; Balog et al., 2006; Soboroff, Craswell & de Vries (TREC Enterprise 2005-2007)
34. Outline
- Motivation
- Email Acts
- Preventing Email Information Leaks
- Recommending Email Recipients
- Learning Robust Ranking Models ←
- User Study
35. Can we learn a better ranking function?
- Learning to Rank: machine learning to improve ranking
- Many recently proposed methods
  - RankSVM (Joachims, KDD-02)
  - RankBoost (Freund et al., 2003)
  - Committee of Perceptrons (Elsas, Carvalho & Carbonell, WSDM-08)
- Meta-learning method
  - Learn robust ranking models in the pairwise-based framework
36. Pairwise-based Ranking
- Goal: induce a ranking function f(d) such that, for a query q with documents d1, d2, d3, ..., dT, f(di) > f(dj) whenever di should be ranked above dj
- We assume a linear function f(d) = w . x_d
- Constraints: w . x_i > w . x_j for every preference pair, i.e., w . (x_i - x_j) > 0
- Paired instance: each preference pair becomes a single training instance with feature vector (x_i - x_j) (see sketch below)
- Problem: O(n) mislabels produce O(n^2) mislabeled pairs
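A minimal sketch of how a ranked list is turned into paired instances for the pairwise framework; variable names are illustrative.

```python
import numpy as np

def pairwise_instances(X, relevance):
    """Turn one query's feature matrix X (n x d) and relevance labels into paired
    instances z = x_i - x_j, one per preference pair where item i should outrank item j."""
    pairs = []
    n = len(relevance)
    for i in range(n):
        for j in range(n):
            if relevance[i] > relevance[j]:
                pairs.append(X[i] - X[j])
    return np.array(pairs)

# A linear ranker f(d) = w . x_d is then trained to satisfy w . z > 0 for every paired
# instance z; a single mislabeled item can corrupt up to n-1 of these pairs.
```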
37. Effect of Pairwise Outliers
[Plot: RankSVM results on the SEAL-1 dataset with pairwise outliers]
38. Effect of Pairwise Outliers
[Plot: RankSVM loss function vs. pairwise score Pl]
39. Effect of Pairwise Outliers
[Plot: loss function vs. pairwise score Pl; a bounded loss is robust to outliers, but not convex]
40. Ranking Models in 2 Stages
[Diagram: base ranking model → SigmoidRank → final model]
- Stage 1, Base Ranker: any base ranking model, e.g., RankSVM, Perceptron, ListNet, etc.
- Stage 2, SigmoidRank (non-convex): minimize (a very close approximation of) the empirical error, i.e., the number of misranks. Robust to outliers (label noise).
41. Learning
- SigmoidRank loss: a smooth sigmoid approximation of the 0/1 misrank loss on each paired instance
- Learning with gradient descent (see sketch below)
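A minimal sketch of a sigmoid-shaped pairwise loss minimized by gradient descent, in the spirit of the SigmoidRank stage above; the exact loss form, scaling constant s, and learning-rate schedule used in the thesis may differ.

```python
import numpy as np

def sigmoid_rank_loss(w, pairs, s=1.0):
    """pairs: array of paired instances (x_i - x_j) that should score positively.
    Per-pair loss = 1 / (1 + exp(s * w.z)), a bounded approximation of the misrank indicator."""
    z = pairs @ w
    return np.mean(1.0 / (1.0 + np.exp(s * z)))

def sigmoid_rank_grad(w, pairs, s=1.0):
    z = pairs @ w
    sig = 1.0 / (1.0 + np.exp(s * z))
    # d/dw of 1/(1+exp(s*z)) = -s * sig * (1 - sig) * (x_i - x_j)
    return (-s * (sig * (1.0 - sig))[:, None] * pairs).mean(axis=0)

def train_sigmoid_rank(pairs, w0, lr=0.1, epochs=200, s=1.0):
    """Start from a base ranker's weights w0 (stage 1) and refine with gradient descent (stage 2)."""
    w = w0.copy()
    for _ in range(epochs):
        w -= lr * sigmoid_rank_grad(w, pairs, s)
    return w
```

Because the per-pair loss is bounded, badly misranked pairs (outliers) contribute at most a constant and almost no gradient, which is the source of the robustness; the price is non-convexity, hence the warm start from the base ranker.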
42. Email Recipient Results
Carvalho, Elsas, Cohen & Carbonell, SIGIR-2008 LR4IR workshop
[Chart: results over 36 Enron users (about 44,000 queries, avg. 1,267 queries/user); relative gains between -0.09 and 13.2, with significance values ranging from p<0.01 to p=0.74]
43. Email Recipient Results
Carvalho, Elsas, Cohen & Carbonell, SIGIR-2008 LR4IR workshop
44. Email Recipient Results
Carvalho, Elsas, Cohen & Carbonell, SIGIR-2008 LR4IR workshop
45. Set Expansion (SEAL) Results
Wang & Cohen, ICDM-2007; Carvalho et al., SIGIR-2008 LR4IR workshop
[Chart: 18 features, 120/60 train/test splits, half relevant]
46. LETOR Results
Carvalho et al., SIGIR-2008 LR4IR workshop
[Charts: LETOR datasets with (queries/features) of (106/25), (50/44), and (75/44)]
47. Related Work
- Classification with non-convex loss functions: trade-offs among outlier robustness, accuracy, scalability, etc.
  - Perez-Cruz et al., 2003; Xu et al., 2006; Zhan & Shen, 2005; Collobert et al., 2006; Liu et al., 2005; Yang and Hu, 2008
- Ranking with other non-convex loss functions
  - FRank (Tsai et al., 2007): a fidelity-based loss function optimized in the boosting framework; query normalization may be interfering with performance gains; not a general second-stage (meta) learner
48. Outline
- Motivation
- Email Acts
- Preventing Email Information Leaks
- Recommending Email Recipients
- Learning Robust Ranking Models
- User Study ←
49. User Study
- Choosing an email system
  - Gmail, Yahoo! Mail, etc.: widely adopted, but interface/compatibility issues
  - Develop a new client: perfect control, but longer development and low adoption
  - Mozilla Thunderbird: open-source community, easy mechanism to install extensions, millions of users
50. User Study: Cut Once
Balasubramanyan, Carvalho & Cohen, AAAI-2008 EMAIL workshop
- Cut Once: a Mozilla Thunderbird extension for leak detection and recipient recommendation
- A few issues
  - Poor documentation and limited customization of the interface
  - JavaScript is slow: imposed computational restrictions
    - Disregard rare words and rare users
    - Implement two lightweight ranking methods: 1) TFIDF; 2) MRR (frequency, recency, TFIDF)
51. Cut Once Screenshots
[Screenshot: main window after installation]
52. Cut Once Screenshots
[Screenshot]
53. Cut Once Screenshots
[Screenshot: logged Cut Once usage - time, confidence, and rank position of clicked recommendations; baseline ranking method]
54. User Study Description
- 4-week-long study; most subjects from Pittsburgh
- After 1 week, qualified users were invited to continue; 20% of the compensation was paid after 1 week
- After 4 weeks, users were fully compensated (plus a final questionnaire)
- 26 subjects finished the study
  - 4 female and 22 male; median age 28.5
  - Total of 2315 sent messages; an average of 113 address book entries
  - Mostly students; a few sys admins, 1 professor, 2 staff members
- Subjects were randomly assigned to the two different ranking methods: TFIDF and MRR
55. Recipient Suggestions
- 17 subjects used the recommendation functionality (in 5.28% of their sent messages)
- Average of 1 accepted suggestion per 24.37 sent messages
56. Comparison of Ranking Methods
- MRR better than TFIDF
  - Average rank: 3.14 versus 3.69
  - Rank quality: 3.51 versus 3.43
- The difference is not statistically significant
  - Rough estimate: a factor of 5.5 more data would be needed, i.e., 5.5 x 4 = 22 weeks of user study, or 5.5 x 26 = 143 subjects for 4 weeks
[Chart: distribution of clicked rank]
57. Results: Leak Detection
- 18 subjects used the leak-deletion feature (in 2.75% of their sent messages)
- The most frequently reported use was to clean up the addressee list
  - Removing unwanted people (inserted by Reply-all)
  - Removing themselves (automatically added)
- 5 real leaks were reported, from 4 different users
  - These users did not use Cut Once to remove the leaks: they clicked the cancel button and removed the address manually
  - Reasons: uncomfortable or unfamiliar with the interface; under pressure because of the 10-second timer
58. Results: Leak Detection
- 5 leaks from 4 users
  - Network admin: two users with similar userIDs
  - System admin: wrong auto-completion in 2 or 3 situations
  - Undergrad: two acquaintances with similar names
  - Grad student: reply-all case
- Correlations
  - 2 users used TFIDF, 2 used MRR
  - No significant correlation with the size of the Address Book or the number of sent messages
  - Correlation with non-student occupations (95% confidence)
- Estimate: 1 leak every 463 sent messages
  - Assuming a binomial distribution with p = 5/2315, then 1066 messages are required to send at least one leak (with 90% confidence); see the worked check below
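A quick check of the estimate above, assuming each sent message independently leaks with probability p = 5/2315: the smallest n with 1 - (1 - p)^n >= 0.9 comes out near the slide's figure.

```python
import math

p = 5 / 2315                                    # estimated per-message leak probability (about 1 in 463)
n = math.ceil(math.log(0.1) / math.log(1 - p))  # smallest n with 1 - (1 - p)**n >= 0.9
print(n)                                        # 1065, close to the ~1066 quoted on the slide
```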
59. Final Questionnaire
[Chart: questionnaire responses (higher is better)]
60. Frequent Complaints
[Chart: complaints grouped into "Training and Optimization" and "Interface"]
61. Final Questionnaire
- Compose-then-address instead of address-then-compose behavior
62. Conclusions
- Email acts
  - A taxonomy of intentions in email communication
  - Categorization can be automated
- Addressed two types of email addressing mistakes
  - Email leaks (accidentally adding non-intended recipients)
  - Recipient recommendation (forgetting intended recipients)
  - Framed both as supervised learning problems
  - Introduced new methods for leak detection
  - Proposed several models for recipient recommendation, including combinations of base methods
- Proposed a new general-purpose ranking algorithm
  - Robust to outliers; outperformed state-of-the-art rankers on the recipient recommendation task (and other ranking tasks)
- User study using a Mozilla Thunderbird extension
  - Caught 5 real leaks and showed reasonably good prediction quality
  - Showed clear potential to be adopted by a large number of email users
63. Proof: non-existence of a better advisor
Given the finite set of good advisors An [equations not transcribed]
64. Proof: non-existence of a better advisor
Given the finite set of advisors An
Assuming [equations not transcribed]
Q.E.D.
65. Thank you.