Title: Summarizing Threads of Email Conversations: Using QA Pairs Detection to Improve Extractive Summaries
1Summarizing Threads of Email Conversations Using
QA Pairs Detection to Improve Extractive Summaries
2Reasons for Summarizing Email
- Email has become a primary means of business and personal communication.
- Conversations take place and decisions are made entirely through email.
- Given the high volume of email each individual accumulates, how can we efficiently retrieve information from our email archives?
3Summarizing Email vs. Summarizing Newswire
- Email has interactive structure
- Email can have informal language
- Email does not consist of different, independent documents about the same topic (not multi-document summarization)
4Contributions
- Email-specific features can be used for machine-learning-based extractive summarization of email threads
- A novel approach to question-answer pair detection
- Integration of QA pair sentences with extractive sentences improves summaries.
5Overview
- Related Work
- Corpus
- Approach 1 Sentence Extraction
- Approach 2 Question-Answer Pairs Detection
- Approach 3 Integration
- Outlook Email Client
- Conclusion
6Related Work
- Summarizing individual emails
- Derek Lam, Steven L. Rohall, Chris Schmandt, and Mia K. Stern. 2002.
- Sentence extraction
- Smaranda Muresan, Evelyne Tzoukermann, and Judith Klavans. 2001.
- Key phrase extraction
- Summarizing discussion lists
- Ani Nenkova and Amit Bagga. 2003.
- Sentence extraction
- Paula Newman and John Blitzer. 2003.
- Thread topic clustering and sentence extraction
- Summarizing speech dialogues
- Klaus Zechner. 2002.
- Sentence extraction and QA pairs
8Corpus
- Columbia ACM chapter executive board mailing list
- Approximately 10 regular participants
- 300 Threads, 1000 Messages
- Threads include scheduling and planning of meetings and events, question and answer, general discussion and chat.
- Annotated by human annotators
- Hand-written summary
- Categorization of threads and messages
- Highlighting important information (such as question-answer pairs)
9Sample Hand-Written Summary for Thread
- Annotator 1 summary: Alexander McCaughly asks the group if he can reschedule his C-session for Wednesday night. Raju Gupta tells McCaughly that he is able to reschedule his C-session. Reema Ramachandran reminds McCaughly that he scheduled an MS Office Session for November 14, and she asks McCaughly to confirm that he can be at that session.
11Sentence Extraction
- Machine learning approach to extractive summarization of email threads
- Create training data
- Learn extractive rules
- Use rules to generate summary
12Sentence Extraction Creating Training Data
- Use human-generated summaries to create a model extractive summary
- Compare thread sentences with human summary sentences using SimFinder
- Given a summary size, select highly ranked sentences
- Represent each sentence with a vector of features and the class
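The labeling steps above can be sketched as follows. This is only an illustration under stated assumptions: SimFinder itself is not reproduced, so a plain Jaccard word overlap (the hypothetical `overlap` function) stands in for its similarity score, and the summary size is assumed to be a fraction of the thread's sentences.

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for SimFinder's richer features."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap(a, b):
    """Jaccard word overlap between two sentences (assumed similarity measure)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def label_sentences(thread_sentences, summary_sentences, summary_fraction):
    """Label the top-ranked fraction of thread sentences 'Y' and the rest 'N'."""
    # Score each thread sentence by its best match among the human summary sentences.
    scores = [max(overlap(s, h) for h in summary_sentences) for s in thread_sentences]
    k = max(1, round(summary_fraction * len(thread_sentences)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return ["Y" if i in chosen else "N" for i in range(len(thread_sentences))]

thread = [
    "Guys, I can't come tonight.",
    "Can I reschedule my C session for Wednesday night?",
    "Are you sure you want to do it then?",
]
summary = ["He asks the group if he can reschedule his C-session for Wednesday night."]
print(label_sentences(thread, summary, 0.34))
```

Each labeled sentence would then be paired with its feature vector to form one training example.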
13SimFinder in Action
- Guys, I can't come tonight.
- Can I reschedule my C session for Wednesday night, 11/8, at 8:00?
- If that's cool with you guys, please reserve me a room.
- Sure we can, but that's the day after Election Day.
- Are you sure you want to do it then?
- alex, a reminder that your scheduled to do an MSOffice session on Nov. 14, at 7pm in 252Mudd.
- --please confirm that you can do that session/posters
- Confirmed. Intro to MS Office, then there will be three more where we'll work on the individual programs for full sessions
- Alexander McCaughly asks the group if he can reschedule his C-session for Wednesday night.
- Raju Gupta tells McCaughly that he is able to reschedule his C-session.
- Reema Ramachandran reminds McCaughly that he scheduled an MS Office Session for November 14, and she asks McCaughly to confirm that he can be at that session.
14SimFinder in Action
SimFinder 0.0038
15SimFinder in Action
SimFinder 0.0028
16SimFinder in Action
SimFinder 0.0028
17SimFinder in Action
SimFinder 0.0028
18SimFinder in Action
SimFinder 0.983
19SimFinder in Action
SimFinder 0.563
20SimFinder in Action
- Guys, I can't come tonight.
- Can I reschedule my C session for Wednesday night, 11/8, at 8:00?
- If that's cool with you guys, please reserve me a room.
- Sure we can, but that's the day after Election Day.
- Are you sure you want to do it then?
- dan, a reminder that your scheduled to do an MSOffice session on Nov. 14, at 7pm in 252Mudd.
- --please confirm that you can do that session/posters
- Confirmed. Intro to MS Office, then there will be three more where we'll work on the individual programs for full sessions
- Daniel Kestin asks the group if he can reschedule his C-session for Wednesday night.
- Janak Parekh tells Medina that he is able to reschedule his C-session.
- Christy Lauridsen reminds Medina that he scheduled an MS Office Session for November 14, and she asks Kestin to confirm that he can be at that session.
SimFinder 0.0038, 0.983, 0.0038, 0.0038, 0.0038, 0.752, 0.221, 0.368
21Determining Summary Size
- Determine the summary size the human summarizers used
- Create gold-standard data manually
- Select about 10 of ACM threads
- gold-standard threads
- Manually classify sentences in gold-standard threads
- positive if content reflected in human summary
- negative otherwise
- Compare SimFinder-derived classifications at various summary sizes with gold-standard classifications
22Determining Summary Size
- Results
- Use a summary size of 45
- Verifies the use of SimFinder
Summary size  20     30     40     45     50     55     60
Recall        0.268  0.500  0.625  0.768  0.803  0.821  0.857
Precision     0.750  0.824  0.833  0.827  0.803  0.780  0.750
F-score       0.394  0.622  0.714  0.796  0.803  0.80   0.80
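The F-score row can be checked against the precision and recall rows. The sketch below assumes the balanced F-score (harmonic mean of precision and recall), which is consistent with the reported figures up to rounding; the function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (summary size, recall, precision) taken from the table above
rows = [
    (20, 0.268, 0.750), (30, 0.500, 0.824), (40, 0.625, 0.833), (45, 0.768, 0.827),
    (50, 0.803, 0.803), (55, 0.821, 0.780), (60, 0.857, 0.750),
]
for size, recall, precision in rows:
    print(size, round(f_score(precision, recall), 3))
```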
23Result: Sentences Marked as in Summary/not in Summary
- [N] Guys, I can't come tonight.
- [Y] Can I reschedule my C session for Wednesday night, 11/8, at 8:00?
- [N] If that's cool with you guys, please reserve me a room.
- [N] Sure we can, but that's the day after Election Day.
- [N] Are you sure you want to do it then?
- [Y] alex, a reminder that your scheduled to do an MSOffice session on Nov. 14, at 7pm in 252Mudd.
- [N] --please confirm that you can do that session/posters
- [Y] Confirmed. Intro to MS Office, then there will be three more where we'll work on the individual programs for full sessions
- Alexander McCaughly asks the group if he can reschedule his C-session for Wednesday night.
- Raju Gupta tells McCaughly that he is able to reschedule his C-session.
- Reema Ramachandran reminds McCaughly that he scheduled an MS Office Session for November 14, and she asks McCaughly to confirm that he can be at that session.
24Sentence Features: Thread as a document
- Length: number of words in sentence
- TF-IDF scores: highest, sum, and mean
- Centroid similarity
- Subject similarity
- Relative position in thread
- Is question?
25Sentence Features: Email-Specific Features
- Number of responses to the email
- Number of recipients of email
- Has sender names: does the sentence contain the name of the senders of messages in the thread?
- Email contains forwarded message?
- Features derived from quoted material
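A few of the features from the two slides above can be sketched as a feature dictionary. This is a minimal illustration, not the feature extractor used in this work: the `Sentence` record and every field name are hypothetical, TF-IDF, centroid and subject similarity are omitted, and a trailing question mark is used only as a crude proxy for the question feature.

```python
import re
from dataclasses import dataclass

@dataclass
class Sentence:
    text: str
    position_in_thread: int  # 0-based index among all sentences of the thread
    num_recipients: int      # recipients of the email containing the sentence
    num_responses: int       # replies to the email containing the sentence

def sentence_features(sentence, thread_length, sender_names):
    """Build a small feature dictionary for one sentence (illustrative subset)."""
    words = re.findall(r"\w+", sentence.text.lower())
    return {
        "length": len(words),
        "t_rel_pos": sentence.position_in_thread / max(1, thread_length - 1),
        "is_question": sentence.text.rstrip().endswith("?"),  # crude proxy only
        "num_recipients": sentence.num_recipients,
        "num_responses": sentence.num_responses,
        "has_sender_name": any(name.lower() in words for name in sender_names),
    }

example = Sentence("alex, can you confirm?", position_in_thread=1,
                   num_recipients=8, num_responses=2)
print(sentence_features(example, thread_length=3, sender_names=["Alex", "Raju"]))
```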
26Learn extractive rules: Results
- Using full feature set, 5-fold cross-validation with Ripper
- Baseline scores are obtained with random classification
Data Set     Precision  Recall  F1-score  Baseline F1-score
Annotator 1  0.550      0.516   0.532     0.422
Annotator 2  0.514      0.468   0.490     0.392
27Sample Ruleset Nice Rules
- IF centroid_sim_local ? 0.32 AND thread_line_num ? 4 AND isQuestion = 1 AND tfidfavg ? 0.21 AND tfidfavg ? 0.30 THEN Y.
- IF centroid_sim ? 0.72 AND numOfRecipients ? 8 THEN Y.
- IF centroid_sim_local ? 0.31 AND thread_line_num ? 4 AND tfidfmax ? 0.61 AND m_rel_pos ? 0.36 AND t_rel_pos ? 0.18 THEN Y.
- IF centroid_sim_local ? 0.31 AND centroid_sim ? 0.76 AND centroid_sim ? 0.79 AND tfidfavg ? 0.19 THEN Y.
- IF subject_sim ? 0.33 AND tfidfsum ? 2.84 AND tfidfsum ? 2.64 AND tfidfmax ? 0.68 THEN Y.
- ELSE N
28Automatically Generated Sample Summary
- Regarding "meeting tonight...", on Oct 30, 2000, Alexander Max McCaughly wrote: Can I reschedule my C session for Wednesday night, 11/8, at 8:00?
- Responding to this on Oct 30, 2000, Raju J Gupta wrote: Are you sure you want to do it then?
- Responding to this on Oct 30, 2000, Reema Ramachandran wrote: alex, a reminder that your scheduled to do an MSOffice session on Nov. 14, at 7pm in 252Mudd.
29Overview
- Summarizing Email
- Corpus Development
- Approach 1 Sentence Extraction
- Approach 2 Question-Answer Pairs Detection
- Approach 3 Integration
- Outlook Email Client
- Conclusion
30The Problem
- Question-answer exchanges common in email
- Multiple questions in one thread in one message
- Multiple, possibly contradictory, answers to a single question
- If a summary contains a question and the answer is in the thread, the summary should contain the answer
31Questions in Email Summaries
- Complete summary from our rule-based sentence extractor
- Regarding "acm home/bjarney", on Apr 9, 2001, Muriel Danslop wrote:
- Two things: Can someone be responsible for the press releases for Stroustrup?
- Responding to this on Apr 10, 2001, Theresa Feng wrote:
- I think Phil, who is probably a better writer than most of us, is writing up something for dang and Dave to send out to various ACM chapters. Phil, we can just use that as our "press release", right?
- In another subthread, on Apr 12, 2001, Kevin Danquoit wrote:
- Are you sending out upcoming events for this week?
32Approach
- Same machine learning as before: supervised rule induction
- Ripper (Cohen, 96)
- Same email corpus as before
- ACM Corpus
33Detection of Questions
- Detecting questions is non-trivial
- Informal use of question mark
- Question marks are used in cases other than questions: to denote uncertainty or to make a suggestion.
- I am on with Monday - perhaps some time in the afternoon or evening?
- I suggest 7pm?
- If it's better for ppl we could also have shorter lunch meetings (mon,tues,thurs)?
- Question marks are omitted after posing a question
- Who can we get in touch with at your organization regarding these services.
- The work we present here is based on the detection of interrogative questions (inverted subject-verb order).
34Detection of Questions
- Training Corpus - Speech
- Switchboard corpus annotated with DAMSL tags.
- 5000 positive examples, 5000 negative examples
- negative examples: "statement-opinion" and "statement-non-opinion".
- positive examples: "yes-no-question", "Wh-question", and "rhetorical-question"
- Test Corpus - Email
- manually extracted from the ACM corpus
- 300 positive examples, 300 negative examples.
35Detection of Questions
- Features
- POS tags for the first five terms
- POS tags for the last five terms
- length of the utterance
- most discriminating POS-bigrams
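The feature list above can be sketched as a function over an already POS-tagged utterance. The tagger is left out, Penn Treebank style tags are assumed, and `selected_bigrams` is a hypothetical stand-in for the "most discriminating" bigrams, which would have to be chosen from training data.

```python
def question_features(tagged, selected_bigrams=(), n=5, pad="NONE"):
    """tagged: list of (token, pos_tag) pairs produced by any POS tagger."""
    tags = [tag for _, tag in tagged]
    first = (tags[:n] + [pad] * n)[:n]    # pad short utterances on the right
    last = ([pad] * n + tags[-n:])[-n:]   # pad short utterances on the left
    bigrams = {f"{a}_{b}" for a, b in zip(tags, tags[1:])}
    features = {f"first_{i}": tag for i, tag in enumerate(first)}
    features.update({f"last_{i}": tag for i, tag in enumerate(last)})
    features["length"] = len(tags)
    # One binary indicator per pre-selected bigram.
    for bigram in selected_bigrams:
        features[f"has_{bigram}"] = bigram in bigrams
    return features

tagged = [("Can", "MD"), ("I", "PRP"), ("reschedule", "VB"), ("?", ".")]
print(question_features(tagged, selected_bigrams=["MD_PRP"]))
```

A modal followed by a pronoun at the start of the utterance (`MD_PRP` here) is the kind of pattern that inverted subject-verb order produces.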
36Detection of Questions
- Results
Recall 0.56
Precision 0.96
F-measure 0.70
- Recall low because
- Questions in ACM corpus start with a declarative clause
- So, if you're available, do you want to come?
- if you don't mind, could you post this to the class bboard?
- Results without declarative clause
Recall 0.72
Precision 0.96
F-measure 0.82
37Detection of Answers
- Detection difficult
- Multiple topics discussed in parallel
- Threads that begin with a single topic may spin off different ones
- Use of the reply function to answer a question asked earlier in the thread.
- We show how various features derived from the structure of email threads can improve upon lexical similarity between message segments
38Detection of Answers
- ACM Corpus
- Annotators were asked to
- Highlight and link Question and Answer pairs.
- Annotator 1: 200 threads, 81 QA threads
- Annotator 2: 138 threads, 62 QA threads
- Inter-Annotator Agreement (Kappa statistic)
- Question Detection: 0.68
- Answer Detection (given question): 0.81
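For reference, a kappa statistic corrects raw agreement for agreement expected by chance. The slide does not say which variant was used; the sketch below assumes Cohen's kappa for two annotators, and the label lists are made-up illustrations rather than corpus data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels: 1 = sentence marked as a question, 0 = not marked.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))
```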
39Detection of Answers
- Methods
- Use human-annotated data to generate training data
- Textual Unit
- use message segments rather than individual sentences to reduce lexical gap between questions and candidate answers
- Learn a classifier that predicts whether a segment following a question segment answers it
- Represent each question and candidate answer segment by a feature vector
40Detection of Answers
- Features Used
- Standard: word counts, word overlap (cosine, Euclidean)
- Based on thread structure
- is candidate answer the first
- number of emails between the question and the answer segments
- the number of emails in the thread before the question segment
- Based on other candidate answer segments
- is candidate the most similar
- relative position of the candidate among other candidates
- number of other candidates
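The three feature groups above can be sketched for one question segment and its candidate answer segments. This is an assumption-laden illustration: the function and key names are hypothetical, similarity is computed over raw word counts, and the thread-structure counts are passed in rather than derived from real message headers.

```python
import math
import re
from collections import Counter

def word_counts(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def euclidean(a, b):
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in set(a) | set(b)))

def answer_features(question, candidates, index, emails_between, emails_before_question):
    """Features for candidates[index] as a possible answer to the question segment."""
    q = word_counts(question)
    sims = [cosine(q, word_counts(c)) for c in candidates]
    return {
        # standard lexical features
        "cosine": sims[index],
        "euclidean": euclidean(q, word_counts(candidates[index])),
        # thread-structure features
        "is_first_candidate": index == 0,
        "emails_between": emails_between,
        "emails_before_question": emails_before_question,
        # features relative to the other candidates
        "is_most_similar": sims[index] == max(sims),
        "relative_position": index / max(1, len(candidates) - 1),
        "num_other_candidates": len(candidates) - 1,
    }

question = "Can I reschedule my session?"
candidates = ["Sure we can reschedule.", "See you tonight."]
print(answer_features(question, candidates, 0, emails_between=0, emails_before_question=0))
```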
41Detection of Answers
- Experiments and Results
- 5-fold cross-validation using Ripper (Cohen, 96)
Data Set    Precision  Recall  F1-score
Union       0.698      0.619   0.656
Union lt 2  0.879      0.921   0.899
Union gt 2  0.631      0.619   0.625
Composite   0.728      0.732   0.730
45Integrating extractive summaries with QA pairs
Approaches
- Use QA pairs as features
- Add corresponding answers to extracted questions and corresponding questions to extracted answers
- Add extractive sentences to QA pairs
- Use all QA pairs detected as basis for summary
- Use machine learning technique to identify QA pairs to be included in summary
46Integrating extractive summaries with QA pairs
First Approach
- Use QA pairs as features
- Each sentence in the thread is represented by a feature vector
- Relative position of the sentence in email and thread
- TF-IDF weights
- Is question?
- ...
- Is answer?
47Integrating extractive summaries with QA pairs
First Approach
- Use QA pairs as features
- Number of rules learned with this augmented set of features: 1397
- Number of rules that include the answer feature: 54
- Maximum number of rules that any feature is included in: 160
48Integrating extractive summaries with QA pairs
Second Approach
- Add corresponding answers to extracted questions
- Alex -- since you're in OS, what do you think? Do you think students will be working on the 15th?
- I'm in OS, and yeah, I'm pretty sure people will be working on the weekend of a week before.
- Add corresponding questions to extracted answers
- Sure we can, but that's the day after Election Day.
- Can I reschedule my C session for Wednesday night, 11/8, at 8:00?
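The second approach can be sketched as a post-processing step on the extractive summary. The representation is an assumption made for illustration: sentences are identified by their index in the thread, and detected QA pairs are given as (question index, answer index) tuples.

```python
def add_partners(extracted, qa_pairs):
    """Add the missing half of every QA pair that the extractor selected one side of."""
    extracted = set(extracted)
    result = set(extracted)
    for question, answer in qa_pairs:
        # Only pairs touched by the original extraction are completed.
        if question in extracted or answer in extracted:
            result.update((question, answer))
    return sorted(result)  # keep thread order

# Sentence 1 is an extracted question and sentence 5 an extracted answer;
# the pair (7, 8) is untouched by the extractor and stays out.
print(add_partners([1, 5], [(1, 3), (4, 5), (7, 8)]))
```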
49Integrating extractive summaries with QA pairs
Third Approach
- Augment QA pair sentences with extractive sentences
- Automatically detect QA segment pairs in a thread
- Select the question sentence from each question segment
- Select an answer sentence from each answer segment
- Add extractive sentences if they are not in any automatically detected QA segment pair
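The steps of the third approach can be sketched as follows. The `QAPair` record is hypothetical, and how the question and answer sentences are chosen within their segments is not specified on the slide, so they are taken as given inputs here.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question_segment: range  # sentence indices covered by the question segment
    answer_segment: range    # sentence indices covered by the answer segment
    question_sentence: int   # sentence selected from the question segment
    answer_sentence: int     # sentence selected from the answer segment

def integrate(qa_pairs, extractive_sentences):
    """Start from QA pair sentences, then add extractive sentences outside QA segments."""
    summary, covered = set(), set()
    for pair in qa_pairs:
        summary.update((pair.question_sentence, pair.answer_sentence))
        covered.update(pair.question_segment)
        covered.update(pair.answer_segment)
    # Extractive sentences inside any detected QA segment are skipped.
    summary.update(i for i in extractive_sentences if i not in covered)
    return sorted(summary)  # keep thread order

pairs = [QAPair(range(0, 2), range(3, 5), question_sentence=1, answer_sentence=3)]
print(integrate(pairs, extractive_sentences=[0, 2, 4, 6]))
```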
50Integrating extractive summaries with QA pairs
Third Approach
- Example summary: adding questions
Regarding "ACM / CUSFS Film Cosponsorship (fwd)", on Wed Aug 16 10:01:56 EDT 2000, Raju J Gupta wrote: Are you all around before September?
In a subsequent message in the same thread, on Thu Aug 17 14:22:11 EDT 2000, Raju J Gupta wrote: Well, shall we do this the weekend before classes? How about Monday, the labor day before class?
Responding to this on Thu Aug 17 20:55:24 EDT 2000, Justin Liu wrote: I am on with Monday - perhaps some time in the afternoon or evening?
51Integrating extractive summaries with QA pairs
Third Approach
- Example summary: adding answers
Regarding "ACM / CUSFS Film Cosponsorship (fwd)", on Wed Aug 16 10:01:56 EDT 2000, Raju J Gupta wrote: Are you all around before September?
Responding to this on Wed Aug 16 12:05:41 EDT 2000, Manij Ali wrote: however, i will be around the following week and i'll be able to make any meeting that does not conflict with any orientation event
In another subthread, on Thu Aug 17 14:22:11 EDT 2000, Raju J Gupta wrote: Well, shall we do this the weekend before classes? How about Monday, the labor day before class?
Responding to this on Thu Aug 17 20:55:24 EDT 2000, Justin Liu wrote: I am on with Monday - perhaps some time in the afternoon or evening?
Responding to this on Fri Aug 18 11:31:25 EDT 2000, Manij Ali wrote: so only under the condition that the time does not conflict with anything that i might have been scheduled for will monday afternoon be okay.
52Integrating extractive summaries with QA pairs
Third Approach
- Example summary: adding extractive sentences
Regarding "ACM / CUSFS Film Cosponsorship (fwd)", on Wed Aug 16 10:01:56 EDT 2000, Raju J Gupta wrote: Are you all around before September? You guys realize that this means it's time for the 1st meeting.
Responding to this on Wed Aug 16 12:05:41 EDT 2000, Manij Ali wrote: however, i will be around the following week and i'll be able to make any meeting that does not conflict with any orientation event. i won't be around next week.
In another subthread, on Thu Aug 17 04:01:49 EDT 2000, Ritu Shetty wrote: I won't be back on campus till Sept. 3
In another subthread, on Thu Aug 17 09:30:40 EDT 2000, Daniel Max Kestin wrote: I am back on campus on the 27th.
Responding to this on Thu Aug 17 14:22:11 EDT 2000, Raju J Gupta wrote: Well, shall we do this the weekend before classes? How about Monday, the labor day before class? ... Alex (Markov), when you get back from wherever you are it should be your responsibility to organize these :)
Responding to this on Thu Aug 17 20:55:24 EDT 2000, Justin Liu wrote: I am on with Monday - perhaps some time in the afternoon or evening?
Responding to this on Fri Aug 18 11:31:25 EDT 2000, Manij Ali wrote: so only under the condition that the time does not conflict with anything that i might have been scheduled for will monday afternoon be okay.
56Integrating extractive summaries with QA pairs
Results
Approach                                               Precision  Recall  F-score
Baseline                                               0.55       0.52    0.53
QA features                                            0.591      0.506   0.545
Add answers and questions to extractive sentences      0.561      0.571   0.566
Add extractive sentences to QA pair sentences          0.534      0.617   0.573
59What is SUMUI?
- User interface that exposes Natural Language Processing functionalities through an email client such as MS Outlook.
- NLP functionalities
- Summarization of email
- Categorization of email
- Summarization of email thread
- Categorization of email thread
- Email clustering and topic detection
- Summarization of mailbox
- Functionalities in italics are work in progress.
60Components
61MS Outlook Client Add-On
62Conclusion
- Email-specific features can be used for machine-learning-based extractive summarization of email threads.
- We presented our novel approach to question-answer pair detection with high accuracy.
- We showed how integration of QA pair sentences with extractive sentences improves summaries.