Implicit Queries for Email - PowerPoint PPT Presentation

About This Presentation

Title:

Implicit Queries for Email

Slides: 18

Provided by: vit3

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Implicit Queries for Email

Vitor R. Carvalho
(Joint work with Joshua Goodman, at Microsoft
Research)

2
Search Email

Email is the number 1 activity on the internet
Fast, easy and cheap
Search is number 2
Highly lucrative (billion-dollar market for targeted ads)
Why not put them together?
Make users happy
Make more money

4
Implicit Queries for Email

Find good search keywords in email messages
1 Click (or less) for users to do search
Lots of possible User Interfaces
Add hyperlinks to words in message
List keywords in a sidebar
Perform search automatically and show results
(Gmail)
Closely related to finding keywords for
advertising

5
Main Contributions

Extract Keyphrases
Similar to Information Extraction
Several features
Rank/Display
Maxent probability estimates
Select/Filter
Restrict to MSN Query Logs (7.5 million entries)

6
Email Dataset

20 Hotmail volunteers (not MS employees)
Spam, 'subs' and 'wanted' folders
6 human annotators labeled 1143 messages according to
the following instructions:

These are mail messages from real Hotmail users.
Imagine that you were the recipient of each
message. If your email program were to
automatically perform a query to a search engine
like MSN Search or Google for you, what words or
phrases would you want the engine to search for?
In some messages, there may be no words worth
searching for. In others, there may be several.
When possible, the words or phrases should
actually occur in the messages you annotate.
7
TF-IDF baseline

Extract all possible keyphrases from email (up to
5 tokens)
Rank keyphrases by TF-IDF scores
TF = term frequency: number of times each
keyphrase occurs in the email message
IDF = 1/DF, where DF is the number of corpus
documents in which the keyphrase occurs
Top1: percentage of rank-1 keyphrases that
were labeled as relevant
Top10: number of keyphrases in the top-10 ranks
that were labeled as relevant, normalized by the
total number of relevant keyphrases (no message
had more than 10 relevant keyphrases)
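The baseline above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: whitespace tokenization, lowercasing, the toy corpus, and the names `candidate_phrases` and `rank_by_tfidf` are all assumptions made for the example.

```python
from collections import Counter

MAX_TOKENS = 5  # the slide caps candidate keyphrases at 5 tokens


def candidate_phrases(text, max_tokens=MAX_TOKENS):
    """Return every contiguous run of 1..max_tokens tokens as a phrase."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    phrases = []
    for n in range(1, max_tokens + 1):
        for i in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[i:i + n]))
    return phrases


def rank_by_tfidf(message, corpus):
    """Rank the phrases of `message` by TF * (1 / DF), highest first."""
    # DF: number of corpus documents containing each phrase at least once.
    df = Counter()
    for doc in corpus:
        df.update(set(candidate_phrases(doc)))
    # TF: number of times each phrase occurs in this message.
    tf = Counter(candidate_phrases(message))
    # Assumption: phrases unseen in the corpus get DF = 1 to avoid dividing by zero.
    scores = {p: count / max(df[p], 1) for p, count in tf.items()}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda p: (-scores[p], p))


# Toy data, invented for illustration only.
corpus = ["the lake is cold", "the park is open", "the lake crescent lodge"]
message = "lake crescent is near the park"
ranked = rank_by_tfidf(message, corpus)
```

Because "the" occurs in every toy corpus document, it receives the lowest score and ends up last in `ranked`.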

8
First Improvement: Constrain Results to Query Log
File

Query log file: top 7.5 million queries to MSN
Search
Only return keyphrases from an email if they
occur in the query log file
Faster: only process keyphrases in the message that
occur in the query log file
Creates some errors
Removes some errors, such as "occur in the"
Works better!
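A rough sketch of the query-log constraint follows. The set-based lookup, the case-insensitive matching, and the helper names `load_query_log` and `filter_to_query_log` are assumptions for illustration; the slides do not say how the 7.5 million queries were stored or matched.

```python
def load_query_log(lines):
    """Build a lookup set from query-log lines (one query per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def filter_to_query_log(phrases, query_log):
    """Keep only phrases that appear verbatim in the query log, preserving order."""
    return [p for p in phrases if p.lower() in query_log]


# Toy stand-in for the real query log, invented for illustration.
query_log = load_query_log(["olympic national park", "port angeles", "atlanta"])
candidates = ["occur in the", "Port Angeles", "in the", "Olympic National Park"]
kept = filter_to_query_log(candidates, query_log)
# kept == ["Port Angeles", "Olympic National Park"]
```

A hash-set membership test is constant time on average, which is one plausible way the filter could also make processing faster.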

9
Adding More Features

Query Log Frequency
Frequency and log(frequency) of keyphrase
Capitalization
Word capitalized before/after, capitalized
initials in phrase, capitalized letters in
phrase, etc
Phrase Length
Number of characters and number of tokens
TF-IDF based features
TF, IDF, from Body and from Subject
Punctuation and Alphanumeric
Punct before/after, has no alpha, has numbers
only, etc
Email Specific
Has FW in subject, has RE in subject

10
Maximum Entropy Learner (a.k.a. Logistic
Regression)

Computes $P(y = 1 \mid x) = \frac{1}{1 + \exp(-w \cdot x)}$ (the standard logistic regression form)
y is 1 if keyphrase is relevant
x is the feature vector (previous slide
features)
Weight vector w learned using a type of
Generalized Iterative Scaling algorithm (SCGIS)
Rank and cutoff based on probability estimate
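As a sketch of the scoring step, a binary logistic model maps a weighted sum of features to a probability with the sigmoid function, 1 / (1 + exp(-w · x)). The function below assumes sparse dict inputs and an optional bias term; the weights in the example are made up, and training (SCGIS in the slides) is not shown.

```python
import math


def relevance_probability(weights, features, bias=0.0):
    """Return P(y = 1 | x) = sigmoid(w . x + bias) for dict-valued w and x."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical weights and feature values, for illustration only.
weights = {"capitalized_initials": 1.5, "num_tokens": -0.2}
features = {"capitalized_initials": 1.0, "num_tokens": 2.0}
p = relevance_probability(weights, features)
```

Features missing from `weights` contribute nothing to the score, so the same function works when only some features have learned weights.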

11
Rank and cutoff based on probability

Keyphrase                    Probability
Port Angeles                 0.121
Lake Crescent                0.105
Olympic National Park        0.034
Atlanta                      0.031
Mt. Baker                    0.022
Hurricane Ridge              0.012
Marymere Fall                0.009
Beaches on the west coast    0.004

Cutoff 10
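The ranking and cutoff on this slide can be sketched as below, using the keyphrases and probabilities shown. The transcript's "Cutoff 10" is ambiguous, so the hypothetical helper `rank_and_cut` supports both a probability threshold and a top-k limit; treating the cutoff as a 0.10 probability threshold in the example is an assumption, not something the slide confirms.

```python
def rank_and_cut(scored, min_prob=None, top_k=None):
    """Sort (phrase, probability) pairs by probability and apply optional cutoffs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if min_prob is not None:
        ranked = [(p, s) for p, s in ranked if s >= min_prob]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked


# Values taken from the slide.
scored = [
    ("Port Angeles", 0.121),
    ("Lake Crescent", 0.105),
    ("Olympic National Park", 0.034),
    ("Atlanta", 0.031),
    ("Mt. Baker", 0.022),
    ("Hurricane Ridge", 0.012),
    ("Marymere Fall", 0.009),
    ("Beaches on the west coast", 0.004),
]

# Assumption: read the cutoff as a probability threshold of 0.10.
shown = rank_and_cut(scored, min_prob=0.10)
# shown == [("Port Angeles", 0.121), ("Lake Crescent", 0.105)]
```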
12
Performance Analysis
10-fold cross-validation on the 1143 email
messages
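The Top1 and Top10 scores defined on the TF-IDF baseline slide could be computed for each held-out fold roughly as follows. Two details are assumptions, since the slides do not spell them out: Top1 is averaged only over messages that produced at least one ranked keyphrase, and Top10 pools counts over all messages before normalizing.

```python
def top1(ranked_lists, relevant_sets):
    """Fraction of messages whose rank-1 keyphrase was labeled relevant."""
    pairs = [(r, rel) for r, rel in zip(ranked_lists, relevant_sets) if r]
    if not pairs:
        return 0.0
    hits = sum(1 for r, rel in pairs if r[0] in rel)
    return hits / len(pairs)


def top10(ranked_lists, relevant_sets):
    """Relevant keyphrases found in the top 10, over all relevant keyphrases."""
    total = sum(len(rel) for rel in relevant_sets)
    if total == 0:
        return 0.0
    found = sum(len(set(r[:10]) & rel) for r, rel in zip(ranked_lists, relevant_sets))
    return found / total


# Toy predictions and labels for three messages, invented for illustration.
ranked_lists = [["port angeles", "the"], ["sale", "atlanta"], []]
relevant_sets = [{"port angeles", "lake crescent"}, {"atlanta"}, set()]
top1_score = top1(ranked_lists, relevant_sets)
top10_score = top10(ranked_lists, relevant_sets)
```

In this toy case one of the two non-empty rankings has a relevant keyphrase in first place, and two of the three labeled keyphrases appear in a top-10 list.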
13
Performance Analysis
14
Using Other Learning Algorithms
15
Opportunities for Future Work