Title: Stylometry Project
1. Stylometry Project
Pace's Research Day
2. TEAM MEMBERS
- Rob Goodman, Programmer
  - Currently working at KPMG
  - Completing MS in Computer Science in December 2008
- Matt Hahn, Quality Assurance
  - Currently working at Affiliated Computer Services, Inc.
  - Completing MS in Information Technologies in May 2007
- Madhuri Marella, Programmer
  - Completing MS in Computer Science in May 2007
- Chris Ojar, Team Leader
  - Currently working at Pace's Evening Support Office in Pleasantville
  - Completing MS in Internet Technologies in May 2007
3. WHAT IS STYLOMETRY?
- The study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship
- Used to attribute authorship to anonymous or disputed documents, with legal as well as academic and literary applications
- Uses statistical analysis, pattern recognition, and artificial intelligence techniques
- For features, stylometry typically analyzes word frequencies and identifies patterns in common parts of speech
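As an illustration of the word-frequency features mentioned above, the following is a minimal sketch. The function name `word_frequencies` and the tokenization rule (lowercase runs of letters and apostrophes) are assumptions made for this example and are not taken from the project.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return the relative frequency of each word in `text`.

    Assumption: a "word" is a run of letters or apostrophes, compared
    case-insensitively. Other tokenization rules are equally plausible.
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}


if __name__ == "__main__":
    freqs = word_frequencies("The cat and the hat")
    for word, freq in sorted(freqs.items()):
        print(word, round(freq, 2))
```

Relative rather than absolute frequencies are used here so that texts of different lengths can be compared.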
4. THE PROGRAM
- A pattern recognition system to identify the author of arbitrary email using stylometry features
- Phase 1: Data Collection
  - Raw data from Keystroke Biometric Project
  - Plain text emails
- Phase 2: Feature Extraction
  - Measurements of punctuation, content format, and keystrokes when applicable
  - Normalize features to 0-1 range
- Phase 3: Classification
  - k-Nearest-Neighbor using Euclidean distance
  - Defaulted to 10
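The normalization step in Phase 2 can be sketched as min-max scaling applied to each feature column. This is one common way to map features to the 0-1 range; the slides do not say which method the project used, and the name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(rows):
    """Scale each feature column of `rows` to the range [0, 1].

    `rows` is a list of equal-length feature vectors, one per document.
    Assumption: a column whose values are all identical is mapped to 0.0,
    since it carries no information for distinguishing authors.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


if __name__ == "__main__":
    features = [[0, 10], [5, 10], [10, 10]]
    print(min_max_normalize(features))
```

Scaling matters for the classification phase because Euclidean distance would otherwise be dominated by features with large raw counts, such as the number of white spaces.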
5. RAW DATA EXAMPLES
File Name: Goodman-email.txt
Dear Ms. Sanderson
I enjoyed our conversation on February
18th at the Family and Child Development seminar
on teaching young children and appreciated your
personal input about helping children attend
school for the first time. This letter is to
follow-up about the Fourth Grade Teacher position
as discussed at the seminar. I will be
completing my Bachelor of Science Degree in
Family and Child Development with a concentration
in Early Childhood Education at Pace in May of
2007, and will be available for employment at
that time
6. DIRTY DATA EXAMPLE
<Shift> I'm on my second take and <Shift> I'm
still writing about the same book <Shift>
<Shift> " <Shift> A <Shift> Million <Shift>
Little <Shift> Pieces. <Backspace>
<Backspace> <Shift> " <Shift> I'm not sure if
<Shift> I am supposed to be typing the same this
<Backspace> ng <Shift> I typed on submit
<Backspace> ssion <Shift> 1 as <Shift> I am on
sb <Backspace> ubmission <Shift> 2, but since
<Shift> my sister is skiing in <Shift> Vermont,
<Shift> I'll just continued <Backspace> .
<Shift> In any event, as a <Backspace> soon as
<Shift> I found out the book was not true,
<Shift> I couldn't pick it up for a few days.
<Shift> Then, it got the best of me. <Shift>
It is tu <Backspace> <Backspace> a fact that
<Shift> James <Shift> Frey is a great ri
<Backspace> <Backspace> writer. <Shift> He
holds your interest and attention a <Backspace>
so <Shift> I go <Backspace> t b <Backspace>
past the fact the <Backspace> <Backspace> at he
lied, and continued on. <Shift> I have to say
<Shift> I endj <Backspace> <Backspace> joyed the
book a lot better as a non-fiction book than
<Shift> I did as a fiction novel.
7. CLEAN DATA EXAMPLE
I'm on my second take and I'm still writing about
the same book "A Million Little Pieces." I'm
not sure if I am supposed to be typing the same
thing I typed on submission 1 as I am on
submission 2, but since my sister is skiing in
Vermont, I'll just continue. In any event, as
soon as I found out the book was not true, I
couldn't pick it up for a few days. Then, it got
the best of me. It is a fact that James Frey is
a great writer. He holds your interest and
attention so I got past the fact that he lied,
and continued on. I have to say I enjoyed the
book a lot better as a non-fiction book than I
did as a fiction novel.
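The step from dirty to clean data can be sketched as replaying the keystroke log. This is only an illustration under stated assumptions: key events are written as angle-bracket tags such as `<Shift>` and `<Backspace>`, tags are embedded directly in the character stream with no padding spaces, a Backspace deletes the previously typed character, and every other tagged key produces no text. The function name `clean_keystrokes` is hypothetical, and the project's actual cleaning rules are not shown in the slides.

```python
import re

# Assumption: a key event looks like <Name>, e.g. <Shift> or <Backspace>.
KEY_TAG = re.compile(r"<([A-Za-z0-9 ]+)>")


def clean_keystrokes(raw):
    """Replay a keystroke log and return the text that was left on screen."""
    typed = []
    position = 0
    for match in KEY_TAG.finditer(raw):
        # Ordinary characters typed before this key event.
        typed.extend(raw[position:match.start()])
        if match.group(1) == "Backspace" and typed:
            typed.pop()  # Backspace removes the last character typed.
        # Any other tagged key (Shift, Ctrl, ...) is dropped here.
        position = match.end()
    typed.extend(raw[position:])
    return "".join(typed)


if __name__ == "__main__":
    log = "<Shift>I typed on submit<Backspace>ssion <Shift>1"
    print(clean_keystrokes(log))
```

A real log would need extra care for text that legitimately contains angle brackets, which this sketch would misread as key events.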
9. LIST OF 62 FEATURES MEASURED
- Number of Accents
- Number of Left curly braces
- Number of Right curly braces
- Number of Vertical lines
- Number of Tildes
- Number of Windows keys
- Number of Up keys
- Number of Left Shift keys
- Number of Right Shift keys
- Number of Page Down keys
- Number of Insert keys
- Number of Home keys
- Number of End keys
- Number of Down keys
- Number of Ctrl keys
- Number of Context menu keys
- Number of Caps Lock keys
- Number of Alt keys
- Number of F12 keys
- Number of Right keys
- Number of Backspace keys
- Number of Enter keys
- Number of Delete keys
- Number of Tab keys
- Number of words
- Number of sentences
- Average words per sentence
- Number of paragraphs
- Average words per paragraph
- Average word length
- Number of sentences beginning with upper case
- Number of sentences beginning with lower case
- Number of White spaces
- Number of exclamation points
- Number of Number signs
- Number of Dollar signs
- Number of percent signs
- Number of Ampersands
- Number of Single quotes
- Number of Left parentheses
- Number of Right parentheses
- Number of Asterisks
- Number of Plus signs
- Number of Commas
- Number of Dashes
- Number of Periods
- Number of Forward slashes
- Number of Colons
- Number of Semi-colons
- Number of Less than signs
- Number of Equal signs
- Number of Greater than signs
- Number of Question marks
- Number of multiple question marks
- Number of multiple exclamation marks
- Number of ellipses
- Number of At signs
- Number of Left square brackets
- Number of Back slashes
- Number of Right square brackets
- Number of Caret signs
- Number of Underscores
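A handful of the features listed above can be computed from plain text as in the sketch below. The function name `extract_features`, the dictionary keys, and the counting rules are assumptions for illustration: words are whitespace-separated tokens, sentences end at runs of `.`, `!`, or `?`, and "multiple" marks mean two or more in a row. The slides do not give the project's exact definitions.

```python
import re


def extract_features(text):
    """Compute a small subset of the stylometry features for `text`."""
    words = text.split()
    # Assumption: a sentence is any non-empty span between runs of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "avg_words_per_sentence": (
            len(words) / len(sentences) if sentences else 0.0
        ),
        "commas": text.count(","),
        "question_marks": text.count("?"),
        # Runs of two or more question marks, counted once per run.
        "multiple_question_marks": len(re.findall(r"\?{2,}", text)),
        # Runs of three or more periods, counted once per run.
        "ellipses": len(re.findall(r"\.{3,}", text)),
    }


if __name__ == "__main__":
    sample = "Hello, world. Are you there?? Yes..."
    for name, value in extract_features(sample).items():
        print(name, value)
```

The keystroke features (Shift, Backspace, and so on) would be counted from the raw keystroke log rather than from the cleaned text, which is presumably why the slides qualify them with "when applicable".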
11. k-NEAREST NEIGHBOR USING EUCLIDEAN DISTANCE
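A minimal sketch of k-nearest-neighbor classification with Euclidean distance follows. The names `euclidean` and `knn_classify` are hypothetical, the majority-vote rule is the textbook version rather than anything confirmed by the slides, and the default `k=10` is an assumption that reads the earlier "Defaulted to 10" bullet as the value of k.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, training, k=10):
    """Return the most common author among the k nearest training samples.

    `training` is a list of (feature_vector, author) pairs whose features
    are assumed to be already normalized to the 0-1 range.
    Assumption: k=10 is a guess at what "Defaulted to 10" refers to.
    """
    nearest = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(author for _, author in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    samples = [
        ((0.0, 0.0), "author_a"),
        ((0.1, 0.1), "author_a"),
        ((1.0, 1.0), "author_b"),
        ((0.9, 0.9), "author_b"),
        ((0.8, 1.0), "author_b"),
    ]
    print(knn_classify((0.05, 0.05), samples, k=3))
```

With few samples per author, a k larger than the number of samples from the true author can let other authors outvote it, so k is worth tuning against the size of the training set.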
12. CLASSIFICATION PHASE
13. DESIGN MODEL
14. ANALYSIS MODEL
15. PROJECT HOME PAGE
http://utopia.csis.pace.edu/cs615/2006-2007/team2/
16. QUESTIONS
Contact cojar_at_pace.edu or ctappert_at_pace.edu for more information, or visit http://utopia.csis.pace.edu/cs615/2006-2007/team2