Title: Distributions and Distributional Lexical Semantics for Stop Lists
1. Distributions and Distributional Lexical Semantics for Stop Lists
Corpus Profiling 2008 BCS London
Neil Cooke BSc DMS CEng FIET PhD Student CCSR
Dr Lee Gillam Computer Science Department
2. Contents
- Introduction
- Finding Enron's Confidential Information
- Lexical Semantic techniques
- Archaeological remains of Context
- Choosing the right stop words
- Lexical Semantic Similarity
- Questions
3. Introduction
- Our domain of research
- Security and intellectual property protection
- Context-sensitive checking of outgoing emails to remove false positives
- The search for accidental stupidity, not for the professional spy
4. Introduction
[Figure: plot with axis labels "fr" and "Log rank"]
5. Introduction
- Zipfian Expectations
- Low frequency words
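Under Zipf's law, a word's frequency f is roughly inversely proportional to its rank r, so the product f × r should stay roughly constant across ranks. The sketch below shows one way to tabulate that product for a token list. The function name `zipf_table` and the toy sentence are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter

def zipf_table(tokens):
    """Return (rank, word, frequency, frequency * rank) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, freq * rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy input only (assumption); a meaningful check needs a large corpus.
tokens = "the cat and the dog and the bird".split()
rows = zipf_table(tokens)
for row in rows:
    print(row)
```

On a real corpus, the low frequency words at the tail are where the product tends to drift furthest from the Zipfian expectation.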
6. Introduction
- Sources of corpora variance
- Typos, spelling mistakes
- Duplication
- Straight / exact copy
- Reworded copy
- Sources of Enron variance
- Straight duplicate emails (52)
- Near-duplicate emails (2)
- Specialist machine email formatting
- Specialist text: business, power generation, social
- Straight and reworded text duplication: banners
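The slides do not say how duplicates were detected. As one possible approach, straight duplicates can be found by hashing normalised text, and reworded or near duplicates by comparing sets of word shingles with the Jaccard coefficient. Everything in this sketch (function names, the shingle size of 3, the sample messages) is an assumption for illustration.

```python
import hashlib
import re

def normalise(text):
    """Lower-case and collapse whitespace so trivial formatting differences vanish."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def fingerprint(text):
    """Hash of the normalised text; equal fingerprints indicate a straight duplicate."""
    return hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of k-word shingles taken from the normalised text."""
    words = normalise(text).split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Shared shingles divided by all shingles; 1.0 means identical shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Invented sample messages (assumption), not taken from the Enron corpus.
a = "Please review the attached contract before Friday"
b = "Please  review the attached contract before friday"
c = "Please review the attached contract before Monday"

print(fingerprint(a) == fingerprint(b))  # straight duplicate after normalisation
print(jaccard(a, c))                     # high but below 1.0: a near duplicate
```

A threshold on the Jaccard score would then separate near duplicates from unrelated messages; the right value depends on the corpus.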
7. Introduction
8. Finding Enron's Confidential Information
- Keyword: "confidential"
- Banner or real text?
DISCLAIMER This e-mail message is intended only
for the named recipient(s) above and may contain
information that is privileged, confidential
and/or exempt from disclosure under applicable
law. If you have received this message in error,
or are not the named recipient(s), please
immediately notify the sender and delete this
e-mail message.
9. Banner Context Vector Space
- 25 users, 94005 emails, 4608 confidential emails
- 3223 banner instances, 22 key words
- 2663 body instances, 22 key words
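A context vector for a keyword can be sketched as the counts of words that occur within a fixed window around each occurrence of that keyword. Comparing the vector built from banner instances with the one built from body instances is one way to tell the two uses apart. The function name `context_vector`, the window of 2, and the body sentence are assumptions; the banner text is abridged from the disclaimer quoted earlier.

```python
from collections import Counter

def context_vector(tokens, keyword, window=5):
    """Count collocates within `window` tokens either side of each keyword hit."""
    vector = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        vector.update(left + right)
    return vector

# Abridged banner wording and an invented body sentence (assumptions).
banner = ("this message may contain information that is privileged "
          "confidential and exempt from disclosure").split()
body = "please keep the pricing model confidential until the deal closes".split()

banner_vec = context_vector(banner, "confidential", window=2)
body_vec = context_vector(body, "confidential", window=2)
print(sorted(banner_vec))
print(sorted(body_vec))
```

In this toy case the two vectors share no collocates, which is the kind of separation that would let banner uses of the keyword be filtered out as false positives.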
10. Choosing the right words
- Collocates with low entropy tend to flat-line
- Collocates with high entropy tend to peak
- Kurtosis is a bit hard to compute and use
Energy
- Can do this on two axes
- Collocate: Q_peak
- Nucleate: Q_test
- Q_test = Sum(Q_peak) / number of collocates
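The slides do not define Q_peak, so the sketch below swaps in a simple stand-in: for each collocate, the highest count at any position around the nucleate divided by the mean count over all positions. A flat positional profile then scores 1.0 and a peaked one scores higher. The nucleate-level score averages these values over the collocates. All names, the stand-in formula, and the toy text are assumptions, not the authors' measure.

```python
from collections import defaultdict

def positional_counts(tokens, nucleate, window=3):
    """Map each collocate to its counts at offsets -window..-1 and +1..+window."""
    offsets = [o for o in range(-window, window + 1) if o != 0]
    counts = defaultdict(lambda: dict.fromkeys(offsets, 0))
    for i, token in enumerate(tokens):
        if token != nucleate:
            continue
        for o in offsets:
            j = i + o
            if 0 <= j < len(tokens):
                counts[tokens[j]][o] += 1
    return counts

def peak_score(position_counts):
    """Stand-in peakedness (assumption): highest positional count over the mean count."""
    values = list(position_counts.values())
    return max(values) / (sum(values) / len(values))

def nucleate_score(counts):
    """Sum of the collocate peak scores divided by the number of collocates."""
    return sum(peak_score(c) for c in counts.values()) / len(counts)

# Toy text (assumption) with "energy" as the nucleate.
tokens = "the energy trading the energy trading desk the energy trading".split()
counts = positional_counts(tokens, "energy", window=2)
for word, profile in counts.items():
    print(word, profile, peak_score(profile))
print(nucleate_score(counts))
```

Unlike kurtosis, a ratio of this kind needs no higher moments, which is one reason a peak-based measure may be easier to compute and use.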
11. Choosing the right words
- Should be able to identify stop words
- The top 2000 BNC words were used as the stop word reference list, of which 1262 match the top 3992 collocates of "energy"
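For scale, 1262 matches out of a 2000-word reference list is 1262 / 2000 = 63.1% of the list. The comparison itself amounts to a set intersection, sketched below with tiny invented lists. The function name and the sample words are assumptions.

```python
def stop_word_overlap(reference, collocates):
    """Return the reference words found among the collocates, and the matched fraction."""
    ref = set(reference)
    matched = ref & set(collocates)
    return sorted(matched), len(matched) / len(ref)

# Invented toy lists (assumption); the slides use the top 2000 BNC words
# and the top 3992 collocates of "energy".
reference = ["the", "of", "and", "to"]
collocates = ["the", "trading", "of", "power", "and"]
matched, fraction = stop_word_overlap(reference, collocates)
print(matched, fraction)

# The figure reported on the slide, as a fraction of the reference list.
print(1262 / 2000)
```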
12. Lexical Semantic Similarity
- Should be able to use it to identify similarity
[Figure: panels labelled "Dice" and "Cosine"]
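Dice and cosine are two standard similarity measures. Dice compares which collocates two words share, 2|A ∩ B| / (|A| + |B|), while cosine compares their collocate frequency vectors. The sketch below implements both over plain count dictionaries. The function names and toy vectors are assumptions, and the slides do not state which variant of either measure was used.

```python
import math
from collections import Counter

def dice(a, b):
    """Dice coefficient on the sets of collocates: 2|A & B| / (|A| + |B|)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy collocate counts (assumption).
v1 = Counter({"power": 3, "trading": 4})
v2 = Counter({"power": 3, "gas": 4})
print(dice(v1, v2))
print(cosine(v1, v2))
```

Dice ignores how often a collocate occurs, so the two measures can rank the same pair of words differently.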
13. Lexical Semantic Similarity
- Depreciating common or stop words
- Appreciating rare words
- Salton G., A. Wong, C.S. Yang, 1975, A vector space model for automatic indexing, Communications of the ACM, 18(11), 613-620.
- Terms with medium document frequency used directly
- Terms with high document frequency should be moved to the left by transforming them into entities of lower frequency
- Terms with low document frequency should be moved to the right on the document frequency spectrum by transforming them into entities of higher frequency
[Figure: document frequency spectrum with regions labelled "Poor Discriminator" and "Good Discriminator"; axis: Frequency]
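One common way to depreciate frequent terms and appreciate rare ones is inverse document frequency (IDF) weighting. It is shown here as an illustration of the idea, not as the method used in the slides and not as the term transformations described in the bullets above. The function names and toy documents are assumptions.

```python
import math
from collections import Counter

def document_frequency(documents):
    """Number of documents in which each term appears at least once."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return df

def idf_weights(documents):
    """log(N / df): 0 for terms in every document, larger for rarer terms."""
    n = len(documents)
    return {term: math.log(n / count)
            for term, count in document_frequency(documents).items()}

# Toy tokenised documents (assumption).
docs = [["the", "power", "deal"], ["the", "gas", "deal"], ["the", "memo"]]
weights = idf_weights(docs)
for term in sorted(weights, key=weights.get):
    print(term, round(weights[term], 3))
```

A term such as "the" that appears in every document gets weight 0, which acts like a soft stop list.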
14. Lexical Semantic Similarity
- Widening the collocate window reduces precision
- Shape is important
- It's a broadband/narrowband signal-to-noise ratio issue
Bullinaria J.A., J.P. Levy, 2006, Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study.
[Figure: signal and noise plotted against window size]
15. Further Work to Do
- Is it better or worse than other methods?
- Carry out a synonym test using the TOEFL data set.
- Compare the Qw approach against the frequency-based cosine approach
Bullinaria J.A., J.P. Levy, 2006, Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study.
TOEFL test data provided by Tom Landauer, Institute of Cognitive Science, University of Colorado Boulder.