Transcript and Presenter's Notes

Title: Distributions and Distributional Lexical Semantics for Stop Lists


1
Distributions and Distributional Lexical
Semantics for Stop Lists
Corpus Profiling 2008, BCS, London
Neil Cooke, BSc DMS CEng FIET, PhD Student, CCSR
Dr Lee Gillam, Computer Science Department
2
Contents
  • Introduction
  • Finding Enron's Confidential Information
  • Lexical Semantic techniques
  • Archaeological remains of Context
  • Choosing the right stop words
  • Lexical Semantic Similarity
  • Questions

3
Introduction
  • Our domain of research
  • Security and intellectual property protection
  • Context-sensitive checking of outgoing emails
    to remove false positives
  • The search is for accidental stupidity,
    not for the professional spy

4
Introduction
  • Zipfian Expectations

[Figure: Zipfian plot of frequency (fr) against log rank]
5
Introduction
  • Zipfian Expectations
  • Low frequency words
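
A minimal sketch of the Zipfian expectation, assuming a simple list of tokens as input. The function names (rank_frequency, zipf_slope, hapax_share) and the synthetic frequencies are illustrative and do not come from the slides. If frequency is roughly proportional to 1/rank, a least-squares fit of log frequency against log rank should give a slope near -1. The low-frequency tail can be summarised by the share of word types that occur only once.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)

def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def hapax_share(tokens):
    """Share of word types that occur exactly once."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Synthetic, exactly Zipfian frequencies: f(r) = 1200 / r for ranks 1..6.
ideal = [1200 // r for r in range(1, 7)]
slope = zipf_slope(ideal)
```

On a real corpus the fitted slope would be expected to drift away from -1 at both ends of the rank range, which is where duplicated text and rare words show up.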

6
Introduction
  • Sources of corpora variance
    • Typos and spelling mistakes
    • Duplication
      • Straight / exact copy
      • Reworded copy
  • Sources of Enron variance
    • Straight duplicate emails (52)
    • Near duplicate emails (2)
    • Specialist machine email formatting
    • Specialist text: business, power generation,
      social
    • Straight and reworded text duplication: banners
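
One way the two kinds of duplication could be detected is sketched below. This is an assumption about method, not the procedure used for the Enron figures: exact copies are matched by hashing whitespace- and case-normalised text, and reworded copies are scored with Jaccard overlap of word trigrams. All function names and the sample strings are hypothetical.

```python
import hashlib

def normalise(text):
    """Lower-case and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def fingerprint(text):
    """Hash of the normalised text; equal hashes mean a straight copy."""
    return hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of overlapping k-word sequences from the normalised text."""
    words = normalise(text).split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard overlap of word shingles; values near 1 suggest a reworded copy."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

original = "Please treat this message as confidential"
recased = "please  treat this message as CONFIDENTIAL"
reworded = "Please treat this email as confidential"
```

Here fingerprint(original) equals fingerprint(recased), while jaccard(original, reworded) falls strictly between 0 and 1. Any threshold for calling a pair a near duplicate would have to be tuned on the corpus.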

7
Introduction
[Figure panels: Enron Raw, Enron Clean]

8
Finding Enron's Confidential Information
  • Key word: Confidential
  • Banner or real text?

DISCLAIMER: This e-mail message is intended only
for the named recipient(s) above and may contain
information that is privileged, confidential
and/or exempt from disclosure under applicable
law. If you have received this message in error,
or are not the named recipient(s), please
immediately notify the sender and delete this
e-mail message.
9
Banner Context Vector Space
25 users, 94005 emails, 4608 confidential emails
  • Finding using size

3223 banner instances, 22 key words
2663 body instances, 22 key words
10
Choosing the right words
  • Collocates with low entropy tend to flat-line
  • Collocates with high entropy tend to peak
  • Kurtosis: a bit hard to do and use

Energy
  • Can do this on two axes
  • Collocate: Q_peak
  • Nucleate: Q_test
  • Q_test = Sum(Q_peak) / number of collocates
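
To make the flat-line versus peak distinction concrete, the sketch below computes Pearson kurtosis for a collocate's positional histogram around a nucleate. It assumes a window of five positions on each side, mapped to indices 0..9, and uses invented counts. The slides do not define Q_peak or Q_test, so no attempt is made to implement them here.

```python
def kurtosis(counts):
    """Pearson kurtosis of positions 0..n-1, weighted by the counts."""
    total = sum(counts)
    mean = sum(i * c for i, c in enumerate(counts)) / total
    m2 = sum(c * (i - mean) ** 2 for i, c in enumerate(counts)) / total
    m4 = sum(c * (i - mean) ** 4 for i, c in enumerate(counts)) / total
    return m4 / m2 ** 2

# Hypothetical positional histograms (positions -5..-1, +1..+5).
flat = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]      # collocate spread evenly
peaked = [0, 0, 0, 0, 48, 1, 1, 0, 0, 0]   # collocate locked to one slot

flat_k = kurtosis(flat)
peaked_k = kurtosis(peaked)
```

The peaked histogram gives a much larger kurtosis than the flat one. The measure becomes unstable when a collocate has very few occurrences, which may be one reason it is awkward to use in practice.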

11
Choosing the right words
  • Should be able to identify stop words

The top 2000 BNC words are used as the stop word reference
list, of which 1262 match the top 3992 collocates
of energy
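
The overlap figure is a set intersection between the reference list and the candidate list. A minimal sketch with tiny made-up word lists (the real BNC and collocate lists are not reproduced here), followed by the proportion implied by the numbers on the slide:

```python
def overlap(reference, candidates):
    """Number and share of reference words that also appear among the candidates."""
    ref, cand = set(reference), set(candidates)
    matched = ref & cand
    return len(matched), len(matched) / len(ref)

# Hypothetical miniature lists, for illustration only.
toy_reference = ["the", "of", "and", "to"]
toy_candidates = ["the", "energy", "of", "power"]
toy_result = overlap(toy_reference, toy_candidates)

# Figures from the slide: 1262 of the 2000 reference words were matched.
slide_share = 1262 / 2000
```

With the slide's figures the matched share is 0.631, so about 63% of the reference stop words are recovered.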
12
Lexical Semantic Similarity
  • Should be able to use it to identify similarity

[Figure panels: Dice, Cosine]
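
For reference, the two similarity measures named on this slide can be sketched as follows. The Dice coefficient is computed over sets of collocates and cosine similarity over sparse count vectors. The example words and counts are invented, and both functions assume non-empty inputs.

```python
import math

def dice(a, b):
    """Dice coefficient between two sets of collocates."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors held as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

# Hypothetical collocate counts for two words.
u = {"private": 2, "legal": 1}
v = {"private": 2, "notice": 1}
dice_score = dice(u, v)      # uses only which collocates are present
cosine_score = cosine(u, v)  # also uses how often each occurs
```

Dice ignores the counts and gives 0.5 here, while cosine weights the shared high-count collocate and gives approximately 0.8.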
13
Lexical Semantic Similarity
  • Depreciating common or stop words
  • Appreciating rare words
  • Salton G., A. Wong, C.S. Yang, 1975, A Vector
    space model for automatic indexing,
    Communications of the ACM, 18(11), 613-620.
  • Terms with medium document frequency should be used
    directly
  • Terms with high document frequency should be
    moved to the left by transforming them into
    entities of lower frequency
  • Terms with low document frequency should be moved
    to the right on the document frequency spectrum
    by transforming them into entities of higher
    frequency

[Figure: document frequency spectrum, marking poor and good discriminators along the frequency axis]
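
A common stand-in for depreciating common words and appreciating rare ones is inverse document frequency (IDF) weighting. The sketch below uses IDF as a plainly swapped-in technique: it is not the term transformation described in the bullets, and the toy documents are invented.

```python
import math

def idf(term, documents):
    """Inverse document frequency: log(N / df), or 0.0 if the term never occurs."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

# Hypothetical documents, each reduced to its set of word types.
documents = [
    {"the", "contract", "is", "confidential"},
    {"the", "meeting", "is", "today"},
    {"the", "pipeline", "report"},
    {"the", "lunch"},
]
weights = {t: idf(t, documents) for t in ("the", "is", "contract")}
```

A word that appears in every document ("the") gets weight 0, a word in half of them ("is") gets log 2, and a word in one document ("contract") gets log 4, so poor discriminators are pushed down and rarer terms are pushed up.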
14
Lexical Semantic Similarity
  • Width of collocate window reduces precision
  • Shape is important
  • It's a broadband/narrowband signal-to-noise
    ratio issue

Bullinaria J.A., J. P. Levy, 2006, Extracting
Semantic Representations from Word Cooccurrence
Statistics: A Computational Study,
[Figure: signal and noise against window size]
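
The effect of window width can be illustrated with a toy count. The sketch below tallies co-occurrences of a target word at two window sizes and reports what share of the counts goes to one strongly associated collocate. The share is only an illustrative proxy for signal against noise, not the measure used in the slides, and the sentence is invented.

```python
from collections import Counter

def cooccurrences(tokens, target, window):
    """Count words within `window` positions on either side of each target hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

def share(counts, collocate):
    """Fraction of all co-occurrence counts taken by one collocate."""
    total = sum(counts.values())
    return counts[collocate] / total if total else 0.0

tokens = ("power plant output rose while the power plant was offline "
          "and the grid needed power").split()
narrow = share(cooccurrences(tokens, "power", 1), "plant")
wide = share(cooccurrences(tokens, "power", 3), "plant")
```

With a window of 1 the collocate "plant" takes half of the counts. With a window of 3 it takes one sixth, because the wider window admits more unrelated words.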
15
Further Work to do
  • Is it better or worse than other methods?
  • Carry out a synonym test using the TOEFL data set.
  • Compare the Qw approach against the frequency-based
    cosine approach

Bullinaria J.A., J. P. Levy, 2006, Extracting
Semantic Representations from Word Cooccurrence
Statistics: A Computational Study,
TOEFL test data provided by Tom Landauer,
Institute of Cognitive Science, University of
Colorado Boulder
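
A synonym test of this kind is usually scored by picking, for each stem word, the candidate whose vector is most similar to the stem's, then reporting the fraction answered correctly. The sketch below assumes cosine similarity over co-occurrence vectors. The vectors and the single question are invented placeholders, not TOEFL data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors held as dicts."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def answer(stem, choices, vectors):
    """Pick the choice whose vector is closest to the stem's vector."""
    return max(choices, key=lambda c: cosine(vectors[stem], vectors[c]))

def accuracy(questions, vectors):
    """Fraction of questions for which the closest choice is the keyed answer."""
    correct = sum(answer(q["stem"], q["choices"], vectors) == q["answer"]
                  for q in questions)
    return correct / len(questions)

# Hypothetical co-occurrence vectors and one made-up question.
vectors = {
    "secret": {"keep": 3, "document": 2, "private": 4},
    "confidential": {"keep": 2, "document": 3, "private": 4},
    "public": {"announce": 4, "document": 1},
    "pipeline": {"gas": 5, "flow": 3},
    "lunch": {"eat": 4, "noon": 2},
}
questions = [{"stem": "secret",
              "choices": ["confidential", "public", "pipeline", "lunch"],
              "answer": "confidential"}]
score = accuracy(questions, vectors)
```

Swapping the cosine function for another similarity measure while keeping the same questions would let two approaches be compared on equal terms.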
16
Show End
  • Any questions?