Implicit Queries for Email (presentation transcript)
1
Implicit Queries for Email
  • Vitor R. Carvalho
  • (Joint work with Joshua Goodman, at Microsoft
    Research)

2
Search Email
  • Email is the number 1 activity on the internet
  • Fast, easy and cheap
  • Search is number 2
  • Highly lucrative (billion-dollar market for
    targeted ads)
  • Why not put them together?
  • Make users happy
  • Make more money

3
(No Transcript)
4
Implicit Queries for Email
  • Find good search keywords in email messages
  • 1 Click (or less) for users to do search
  • Lots of possible User Interfaces
  • Add hyperlinks to words in message
  • List keywords in a sidebar
  • Perform search automatically and show results
    (Gmail)
  • Closely related to finding keywords for
    advertising

5
Main Contributions
  • Extract Keyphrases
  • Similar to Information Extraction
  • Several features
  • Rank/Display
  • Maxent probability estimates
  • Select/Filter
  • Restrict to MSN Query Logs (7.5 million entries)

6
Email Dataset
  • 20 Hotmail volunteers (not MS employees)
  • Spam, "subs" and "wanted" folders
  • 6 human annotators labeled 1,143 messages
    according to the following instructions:

These are mail messages from real Hotmail users.
Imagine that you were the recipient of each
message. If your email program were to
automatically perform a query to a search engine
like MSN Search or Google for you, what words or
phrases would you want the engine to search for?
In some messages, there may be no words worth
searching for. In others, there may be several.
When possible, the words or phrases should
actually occur in the messages you annotate.
7
TF-IDF baseline
  • Extract all possible keyphrases from email (up to
    5 tokens)
  • Rank keyphrases by TF-IDF scores (a minimal
    sketch follows this slide)
  • TF = term frequency = number of times each
    keyphrase occurs in the email message
  • IDF = 1/DF, where DF = number of documents in
    the corpus that contain the keyphrase
  • Top1 = percentage of rank-1 keyphrases that
    were labeled as relevant
  • Top10 = number of keyphrases in the top-10 rank
    that were labeled as relevant, normalized by the
    total number of relevant keyphrases (no message
    had more than 10 relevant keyphrases)
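Purely as an illustration of the baseline above, here is a minimal Python sketch of TF-IDF keyphrase ranking; the function names and the toy doc_freq table are assumptions, not the authors' code.

```python
# Minimal sketch of the TF-IDF baseline: enumerate all n-grams up to 5 tokens
# and score each by TF * (1/DF). Illustrative only, not the original system.
from collections import Counter

def candidate_phrases(tokens, max_len=5):
    """Yield every contiguous token span of length 1 to max_len."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def rank_by_tfidf(message_tokens, doc_freq):
    """Rank candidate keyphrases of one message by TF * (1/DF)."""
    tf = Counter(candidate_phrases(message_tokens))
    scored = {}
    for phrase, freq in tf.items():
        df = doc_freq.get(phrase, 0)
        if df > 0:                      # skip phrases never seen in the corpus
            scored[phrase] = freq * (1.0 / df)
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical usage: doc_freq maps each phrase to the number of corpus
# documents that contain it.
tokens = "meet me at olympic national park near port angeles".split()
doc_freq = {"olympic national park": 3, "port angeles": 5, "meet": 900, "at": 2000}
print(rank_by_tfidf(tokens, doc_freq)[:10])
```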

8
First Improvement: Constrain Results to Query Log
File
  • Query log file = top 7.5 million queries to MSN
    Search
  • Only return keyphrases from an email if they
    occur in the query log file (sketched below)
  • Faster: only process keyphrases in the message
    that occur in the query log file
  • Creates some errors
  • Removes some errors, such as "occur in the"
  • Works better!
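A small sketch of the query-log restriction, assuming the log is a plain text file with one query per line (the file name and format here are assumptions); candidates not found in the log are simply dropped.

```python
# Sketch of the query-log restriction: keep only candidates that also occur as
# whole queries in the search log, which removes fragments like "occur in the".
def load_query_log(path):
    """Load a query log, assumed to contain one query per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def restrict_to_query_log(candidates, query_set):
    """Keep only candidate keyphrases that appear verbatim in the query log."""
    return [p for p in candidates if p.lower() in query_set]

# Hypothetical usage, building on the TF-IDF sketch above:
# queries = load_query_log("msn_top_queries.txt")   # ~7.5M entries fit in a set
# kept = restrict_to_query_log(rank_by_tfidf(tokens, doc_freq), queries)
```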

9
Adding More Features
  • Query Log Frequency
  • Frequency and log(frequency) of keyphrase
  • Capitalization
  • Word capitalized before/after, capitalized
    initials in phrase, capitalized letters in
    phrase, etc
  • Phrase Length
  • Number of characters and number of tokens
  • TF- and IDF-based features
  • TF and IDF, from body and from subject
  • Punctuation and Alphanumeric
  • Punct. before/after, has no alpha, has numbers
    only, etc
  • Email Specific
  • Has "FW:" in subject, has "RE:" in subject (a
    sketch of these features follows this slide)
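The following sketch shows how the feature groups listed on this slide could be packed into one vector per candidate keyphrase; the feature names and exact definitions are assumptions for illustration, not the paper's feature set.

```python
# Illustrative feature row for one candidate keyphrase, following the groups
# above (query-log frequency, capitalization, length, TF/IDF, punctuation,
# email-specific). Assumed definitions; the original features may differ.
import math
import re

def phrase_features(phrase, subject, query_log_freq, tf_body, idf):
    """Build one feature row for a candidate keyphrase."""
    freq = query_log_freq.get(phrase.lower(), 0)
    tokens = phrase.split()
    return {
        "query_log_freq": freq,
        "query_log_log_freq": math.log(freq + 1),
        "any_token_capitalized": any(t[:1].isupper() for t in tokens),
        "all_letters_capitalized": phrase.isupper(),
        "num_chars": len(phrase),
        "num_tokens": len(tokens),
        "tf_body": tf_body,
        "idf": idf,
        "in_subject": phrase.lower() in subject.lower(),
        "digits_only": phrase.replace(" ", "").isdigit(),
        "no_alpha": re.search(r"[A-Za-z]", phrase) is None,
        "subject_has_re": bool(re.match(r"\s*re:", subject, re.I)),
        "subject_has_fw": bool(re.match(r"\s*fw:", subject, re.I)),
    }
```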

10
Maximum Entropy Learner (a.k.a. Logistic
Regression)
  • Computes P(y = 1 | x) = 1 / (1 + exp(-w · x))
  • y is 1 if the keyphrase is relevant
  • x is the feature vector (the features from the
    previous slide)
  • Weight vector w is learned using a type of
    Generalized Iterative Scaling algorithm (SCGIS)
  • Rank and cutoff based on the probability
    estimate (see the sketch after this slide)
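A minimal sketch of the rank-and-cutoff step. The paper trains the weights with SCGIS; here scikit-learn's LogisticRegression stands in for the learner (same conditional model, different optimizer), and the 0.01 threshold and placeholder training data are assumptions.

```python
# Logistic-regression (maxent) ranking sketch: estimate P(relevant | features),
# sort candidates by that probability, and drop those below a display cutoff.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_and_cutoff(model, phrases, feature_matrix, threshold=0.01):
    """Return (phrase, P(relevant)) pairs above the cutoff, highest first."""
    probs = model.predict_proba(feature_matrix)[:, 1]
    ranked = sorted(zip(phrases, probs), key=lambda pair: pair[1], reverse=True)
    return [(ph, pr) for ph, pr in ranked if pr >= threshold]

# Placeholder training data: X holds one feature row per labeled keyphrase,
# y holds 0/1 relevance labels from the annotators.
X = np.random.rand(200, 13)
y = np.random.randint(0, 2, 200)
model = LogisticRegression(max_iter=1000).fit(X, y)
```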

11
Rank and cutoff based on probability
  Keyphrase                     Probability
  • Port Angeles                0.121
  • Lake Crescent               0.105
  • Olympic National Park       0.034
  • Atlanta                     0.031
  • Mt. Baker                   0.022
  • Hurricane Ridge             0.012
  • Marymere Fall               0.009
  • Beaches on the west coast   0.004
Cutoff = 0.010
12
Performance Analysis
10-fold cross-validation on the 1,143 email
messages
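A sketch of how the Top1 and Top10 measures defined on slide 7 could be computed per message and then averaged over the test messages of each fold; the data structures (one ranked list and one set of relevant labels per message) are assumptions.

```python
# Per-message evaluation measures, averaged over the test messages in a fold.
def top1(ranked, relevant):
    """1.0 if the highest-ranked keyphrase was labeled relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0

def top10(ranked, relevant):
    """Fraction of this message's relevant keyphrases appearing in the top 10."""
    if not relevant:
        return 0.0
    return len(set(ranked[:10]) & set(relevant)) / len(relevant)

def macro_average(metric, rankings, labels):
    """Average a per-message metric over a collection of test messages."""
    scores = [metric(r, lbl) for r, lbl in zip(rankings, labels)]
    return sum(scores) / len(scores) if scores else 0.0
```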
13
Performance Analysis
14
Using Other Learning Algorithms
15
Opportunities for Future Work
  • Relax the Query Log restriction
  • Use real advertisement data
  • Use feedback from users (user can be annoyed,
    etc)
  • Use personalization (age, gender, place, etc)

16
Conclusions
  • Implicit Query task → finding good search
    keywords
  • Use of a large query log from MSN Search
  • Maxent to combine features and output
    probabilities → ranking and display cutoff
  • Most meaningful features are associated with
    query frequency and capitalization
  • Results several times better than the baseline
    TF-IDF (Top1 and Top10 scores)

17
Thank you