Distributions and Distributional Lexical Semantics for Stop Lists

1
Distributions and Distributional Lexical
Semantics for Stop Lists
Corpus Profiling 2008, BCS London
Neil Cooke, BSc DMS CEng FIET, PhD Student, CCSR
Dr Lee Gillam, Computer Science Department
2
Contents

Introduction
Finding Enron's Confidential Information
Lexical Semantic techniques
Archaeological remains of Context
Choosing the right stop words
Lexical Semantic Similarity
Questions

3
Introduction

Our domain of research
Security and intellectual property protection
Context-sensitive checking of outgoing emails to remove false positives
The search for accidental stupidity, not for the professional spy

4
Introduction

Zipfian Expectations

[Figure: Zipfian plot with axes labelled "fr" and "Log rank"]
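
As a minimal sketch of what a Zipfian expectation means in practice, the Python below ranks words by frequency and reports frequency times rank, which Zipf's law predicts stays roughly constant across ranks in a large corpus. The toy text and the function name are illustrative assumptions, not the Enron data or the authors' code.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, frequency * rank) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, breaking ties alphabetically for stable output.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, freq * rank)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Hypothetical toy input; a real check needs a corpus of realistic size.
toy_tokens = "the cat and the dog and the bird".split()
rows = rank_frequency(toy_tokens)
for row in rows:
    print(row)
```

On real text, plotting log frequency against log rank and looking for departures from a straight line is one way to spot the kinds of corpus variance discussed on the following slides.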
5
Introduction

Zipfian Expectations
Low frequency words

6
Introduction

Sources of corpora variance
Typos, spelling mistakes
Duplication
Straight / exact copy
Reworded copy
Sources of Enron variance
Straight duplicate emails (52)
Near duplicate emails (2)
Specialist machine: email formatting
Specialist text: business, power generation, social
Straight / reworded text duplication: banners
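
One hedged way to separate straight duplicates from reworded or near duplicates is to hash whitespace-normalised text for exact copies and to compare word-shingle sets with Jaccard similarity for the rest. The slides do not say which method was used; the shingle size, the threshold, and all helper names below are assumptions for illustration.

```python
import hashlib

def normalise(text):
    """Lower-case and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())

def fingerprint(text):
    """Hash of the normalised text; equal hashes indicate a straight duplicate."""
    return hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of overlapping k-word sequences (k=3 is an assumed default)."""
    words = normalise(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of the two texts' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def classify(a, b, threshold=0.5):
    """Label a pair of texts; the 0.5 threshold is an assumption, not a tuned value."""
    if fingerprint(a) == fingerprint(b):
        return "straight duplicate"
    if jaccard(a, b) >= threshold:
        return "near duplicate"
    return "distinct"

print(classify("Please delete this e-mail message",
               "please   delete this e-mail MESSAGE"))
```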

7
Introduction

[Figure: two panels, "Enron Raw" and "Enron Clean"]

8
Finding Enron's Confidential Information

Key word: "confidential"
Banner or real text?

DISCLAIMER This e-mail message is intended only
for the named recipient(s) above and may contain
information that is privileged, confidential
and/or exempt from disclosure under applicable
law. If you have received this message in error,
or are not the named recipient(s), please
immediately notify the sender and delete this
e-mail message.
9
Banner Context Vector Space
25 users, 94005 emails, 4608 confidential emails

Finding using size

3223 banner instances
22 key words
2663 body instances
22 key words
10
Choosing the right words

Collocates with low entropy tend to flat line
Collocates with high entropy tend to peak
Kurtosis: a bit hard to do and use

Energy

Can do this in two axes:
Collocate: Q_peak
Nucleate: Q_test
Q_test = Sum(Q_peak) / (number of collocates)
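
The slide does not define Q_peak, so the sketch below assumes a simple stand-in: the "energy" of a collocate's positional distribution, taken as the sum of squared relative frequencies over the window positions. Under that assumption a flat profile scores low and a peaked profile scores high. Reading the slide's Q_test line as the sum of Q_peak divided by the number of collocates, Q_test for a nucleate is then the mean over its collocates. All function names and counts are hypothetical.

```python
def q_peak(position_counts):
    """Assumed energy measure: sum of squared relative frequencies across window positions."""
    total = sum(position_counts)
    if total == 0:
        return 0.0
    return sum((count / total) ** 2 for count in position_counts)

def q_test(collocate_profiles):
    """Mean of q_peak over all collocates of one nucleate."""
    if not collocate_profiles:
        return 0.0
    scores = [q_peak(profile) for profile in collocate_profiles.values()]
    return sum(scores) / len(scores)

# Hypothetical counts of each collocate at four window positions around a nucleate.
profiles = {
    "flat": [2, 2, 2, 2],    # evenly spread across positions
    "peaked": [0, 8, 0, 0],  # concentrated at one position
}
flat_score = q_peak(profiles["flat"])
peaked_score = q_peak(profiles["peaked"])
nucleate_score = q_test(profiles)
print(flat_score, peaked_score, nucleate_score)
```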

11
Choosing the right words

Should be able to identify Stop words

Top 2000 BNC words used as the stop word reference list, of which 1262 match the top 3992 collocates of energy
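
The reported match works out to 1262 / 2000 = 63.1% of the reference list. A minimal sketch of that kind of comparison follows, with tiny hypothetical word lists standing in for the BNC reference list and the ranked collocates; the function name is an assumption.

```python
def stop_list_overlap(reference_stop_words, ranked_collocates, top_n):
    """Return the reference words found in the top_n collocates and the fraction covered."""
    reference = set(reference_stop_words)
    candidates = set(ranked_collocates[:top_n])
    matched = reference & candidates
    coverage = len(matched) / len(reference) if reference else 0.0
    return matched, coverage

# Hypothetical miniature lists, not the BNC or Enron data.
reference = ["the", "of", "and", "to"]
collocates = ["the", "enron", "and", "power", "of", "to"]
matched, coverage = stop_list_overlap(reference, collocates, top_n=4)
print(sorted(matched), coverage)

# The figures quoted on the slide.
reported_coverage = 1262 / 2000
print(reported_coverage)
```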
12
Lexical Semantic Similarity

Should be able to use it to identify similarity

Dice, Cosine
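
For reference, the two measures named here can be sketched with their textbook definitions: Dice on sets of collocates, and cosine on collocate count vectors. Which variants and weightings the authors used is not stated, so the functions and sample lists below are illustrative assumptions.

```python
import math
from collections import Counter

def dice(a, b):
    """Dice coefficient on sets: 2 * |A & B| / (|A| + |B|)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def cosine(a, b):
    """Cosine similarity between raw count vectors built from two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical collocate lists for two words.
x = ["confidential", "information", "privileged"]
y = ["confidential", "information", "report"]
dice_xy = dice(x, y)
cosine_xy = cosine(x, y)
print(dice_xy, cosine_xy)
```

With every count equal to one, as in this toy input, the two measures coincide; they diverge once collocate frequencies differ.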
13
Lexical Semantic Similarity

Depreciating common or stop words
Appreciating rare words
Salton G., A. Wong, C.S. Yang, 1975, A vector space model for automatic indexing, Communications of the ACM, 18(11): 613-620.

Terms with medium document frequency used directly
Terms with high document frequency should be moved to the left by transforming them into entities of lower frequency
Terms with low document frequency should be moved to the right on the document frequency spectrum by transforming them into entities of higher frequency

[Figure: regions labelled "Poor Discriminator" and "Good Discriminator" along a "Frequency" axis]
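
A common way to depreciate frequent terms and appreciate rare ones is inverse document frequency weighting. The sketch below is a standard idf calculation offered for illustration only; it is not the transformation into lower- or higher-frequency entities described in the quotation above, and the toy documents are made up.

```python
import math

def inverse_document_frequency(documents):
    """Map each term to log(N / df), where df is the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical three-document collection.
docs = [
    ["the", "confidential", "report"],
    ["the", "power", "plant"],
    ["the", "confidential", "memo"],
]
idf = inverse_document_frequency(docs)
for term in ("the", "confidential", "report"):
    print(term, idf[term])
```

A term that appears in every document, such as "the" here, gets weight zero, which is the sense in which a stop word is a poor discriminator.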
14
Lexical Semantic Similarity

Width of collocate window reduces precision
Shape is important
It's a broadband/narrowband signal-to-noise ratio issue

Bullinaria J.A., J. P. Levy, 2006, Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study.
[Figure: signal and noise plotted against window size]
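
To make the window-size effect concrete, here is a minimal sketch that collects the collocates of a nucleate within a symmetric window of tokens. A wider window picks up more words, including ones further from the nucleate, which is one plausible reading of the precision loss noted above. The function name and sample sentence are hypothetical.

```python
from collections import Counter

def collocates(tokens, nucleate, window):
    """Count words occurring within `window` tokens either side of each nucleate occurrence."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != nucleate:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the nucleate itself
                counts[tokens[j]] += 1
    return counts

tokens = "this message may contain confidential information that is privileged".split()
narrow = collocates(tokens, "confidential", window=1)
wide = collocates(tokens, "confidential", window=3)
print(sorted(narrow))
print(sorted(wide))
```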
15
Further Work to do

Is it better or worse than other methods?
Carry out a synonym test using the TOEFL data set.
Compare the Qw approach against the frequency-based cosine approach

Bullinaria J.A., J. P. Levy, 2006, Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study.
TOEFL test data provided by Tom Landauer, Institute of Cognitive Science, University of Colorado Boulder
16
Show End