Title: Lemons Into Lemonade: Using Spam as a Teaching Tool
Dr. Megan S. Conklin, Elon University --
mconklinATelonDOTedu
Introduction
This poster details ideas for using
unsolicited commercial/bulk email (UCE/UBE),
commonly known as spam email, as a teaching
tool in several classes within the computing
sciences curriculum. Spam is free, delivered
daily in mass quantity, easy to explain, and easy
to understand. The eradication of spam is a
problem most students are easily motivated to
help solve. They are further inspired by the
status of the spam problem as an open question
suitable for undergraduate research. This
combination of qualities makes spam an appealing
project topic for many different course objectives.
Networking
Email headers can be used for
instruction in Internet topology, how the SMTP
protocol works, and how to read an RFC. Students
can learn how to spot forged headers and how
message IDs are generated by mail servers. Using
problem-based learning, students can propose
their own changes to the SMTP protocol and debate
the pros and cons of other proposed
changes. Students can compare unauthenticated
protocols like SMTP to authenticated
protocols such as POP3 and IMAP.
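As one possible starting exercise for reading headers, the sketch below uses Python's standard `email` package to pull the `Received` headers out of a message and list the relay hops in the order the message travelled. The sample message, its host names, and the helper names `received_hops` and `hop_endpoints` are all hypothetical, and the regular expression is a simplification rather than a full RFC-compliant parser.

```python
import re
from email import message_from_string

# Hypothetical message; hosts and addresses are invented for illustration.
RAW = """\
Received: from relay.example.net (192.0.2.10) by mx.example.org with SMTP id abc123; Thu, 18 Sep 2003 19:43:13 -0400
Received: from 198.51.100.7 by relay.example.net with ESMTP id def456; Thu, 18 Sep 2003 19:41:59 -0500
From: "Sender" <sender@example.com>
To: student@example.org
Subject: test

body
"""

# Simplified pattern: the claimed sending host and the receiving host.
HOP_RE = re.compile(r"from\s+(\S+).*?\bby\s+(\S+)", re.S)


def received_hops(raw_message):
    """Return the Received headers ordered from origin to destination."""
    msg = message_from_string(raw_message)
    # Each server prepends its own Received header, so the list is newest-first.
    return list(reversed(msg.get_all("Received", [])))


def hop_endpoints(hop):
    """Return (claimed_sender, receiver) for one Received header, or None."""
    match = HOP_RE.search(hop)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    for hop in received_hops(RAW):
        print(hop_endpoints(hop))
```

Students can then check whether the receiver named in one hop matches the sender claimed in the next; a mismatch is one hint that a header may have been forged.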
From ???_at_??? Thu Sep 18 19:46:48 2003
Status: U
Return-Path:
Received: from 207.69.200.106 (63.105.205.40) by aaron.mail.atl.earthlink.net (Earthlink Mail Service) with SMTP id 1a08rz4sq3Nl3qa0; Thu, 18 Sep 2003 19:43:13 -0400 (EDT)
Received: from 84.210.145.37 by 207.69.200.106 with ESMTP id ED4F0DBC61F; Thu, 18 Sep 2003 19:41:59 -0500
Message-ID:
From: "Arron Logan"
Reply-To: "Arron Logan"
To: megansmith_at_gate.net
Cc: , , , ,
Subject: 3 yrs. or 36k miles auto warranty viydprlolzluogt
Date: Thu, 18 Sep 03 19:41:59 GMT
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0A...7D_D9E"
X-Priority: 3
X-MSMail-Priority: Normal

Content-Type: text/html

Hi,
I know you want to refinance your home so here is the website you wanted where lenders compete for your business.
href="http://www.smj8i9jfdsa.flippindeals.com" Go Here
Thanks,
Jack Johnson
Database Design & Data Modeling
Students can
easily design and build a database to hold a
corpus of spam emails. A typical design would
use a unique identifier for each email received
as the primary key, with a column for each
email header.
Students will have to think through 1:M and M:M
relationships for each email header. Tricky
headers, such as the optional and variable
X- headers, will provide an additional challenge
for problem-solving. Once a database of emails
is constructed, students can devise an interface
for adding new emails to the database (using Web
programming and SQL, for instance). And once the
database is stabilized, students can use SQL to
perform basic data retrieval, analysis, and
reporting tasks.
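One way such a design could look is sketched below with Python's built-in `sqlite3` module. The table and column names, and the helpers `build_db`, `add_email`, and `xmailer_counts`, are assumptions made for illustration and are not taken from the poster. Headers are stored as rows in a 1:M child table so that repeated headers (such as `Received`) and optional X- headers need no schema change, and recipients use an M:M link table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE email (
    email_id   INTEGER PRIMARY KEY,
    subject    TEXT,
    sent_date  TEXT
);
-- 1:M: one email has many header rows (covers repeated and X- headers).
CREATE TABLE header (
    header_id  INTEGER PRIMARY KEY,
    email_id   INTEGER NOT NULL REFERENCES email(email_id),
    name       TEXT NOT NULL,
    value      TEXT NOT NULL
);
-- M:M: an email has many recipients; an address appears on many emails.
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    addr       TEXT NOT NULL UNIQUE
);
CREATE TABLE email_recipient (
    email_id   INTEGER NOT NULL REFERENCES email(email_id),
    address_id INTEGER NOT NULL REFERENCES address(address_id),
    kind       TEXT NOT NULL,
    PRIMARY KEY (email_id, address_id, kind)
);
"""


def build_db():
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_email(conn, subject, sent_date, headers, recipients):
    """Insert one email; headers are (name, value), recipients are (kind, addr)."""
    cur = conn.execute(
        "INSERT INTO email (subject, sent_date) VALUES (?, ?)",
        (subject, sent_date),
    )
    email_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO header (email_id, name, value) VALUES (?, ?, ?)",
        [(email_id, name, value) for name, value in headers],
    )
    for kind, addr in recipients:
        # Reuse an existing address row if this address was seen before.
        conn.execute("INSERT OR IGNORE INTO address (addr) VALUES (?)", (addr,))
        (address_id,) = conn.execute(
            "SELECT address_id FROM address WHERE addr = ?", (addr,)
        ).fetchone()
        conn.execute(
            "INSERT INTO email_recipient VALUES (?, ?, ?)",
            (email_id, address_id, kind),
        )
    return email_id


def xmailer_counts(conn):
    """Example report: how often each X-Mailer value appears in the corpus."""
    return conn.execute(
        "SELECT value, COUNT(*) FROM header WHERE name = 'X-Mailer' "
        "GROUP BY value ORDER BY COUNT(*) DESC, value"
    ).fetchall()
```

Whether each header deserves its own column or a row in a generic header table is itself a useful design question for students to debate.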
Ethics
Aside from the myriad debates that could
be waged over whether it is ethical to send UCE
or UBE (especially with forged headers), and
what the legal ramifications of sending spam
should be, it can also be instructive for
CIS/MIS students to discuss spam email in terms
of the free Internet. For instance, why do
spammers compare spam to television commercials,
and state that being marketed to is the cost of
a free Internet? Is it possible to use email for
any type of marketing at all? How does the
offshoring of the spam industry impact foreign
policy and international law? Can U.S. policy
extend to other countries, and if not, how can
the spam problem best be solved?
Machine Learning & Classification
Spam filtering
represents a rich, open problem in machine
learning and classification. Students have been
successful in using a neural net and Bayesian
learning to build a spam classifier. Students
get experience building pattern-matching tools
for structured data (i.e., email headers) and
semi-structured data (i.e., email bodies).
Clustering is another interesting problem to
investigate with spam emails. Is it possible to
find spam messages that likely come from the
same sender, despite obfuscated header
information? Students can first try
to identify manually which emails they think
come from the same sender, and then build a
tool to perform this job automatically with
machine learning techniques. In both cases, the
corpus of spam emails serves as a rich source of
data and a compelling problem to solve.
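To make the Bayesian approach concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing, written in plain Python. It is not the students' implementation. The class name `NaiveBayes`, the tokenizer, and the training sentences are invented for illustration, and the code assumes at least one training example per class.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9$']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods of the tokens."""
        vocab_size = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Assumes every label has been seen at least once during training.
        score = math.log(self.doc_counts[label] / total_docs)
        for word in tokenize(text):
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


def demo_classifier():
    """Train on a tiny, made-up corpus (far too small for real use)."""
    model = NaiveBayes()
    model.train("refinance your home lenders compete", "spam")
    model.train("auto warranty go here", "spam")
    model.train("meeting notes for the database class", "ham")
    model.train("your homework is due thursday", "ham")
    return model
```

With a real corpus, students can compare this baseline against a neural network, or reuse the same token counts as features for the clustering experiment.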
Web Programming
Most spam email uses HTML to
format text. HTML is used both as a markup
language (to catch the attention of the reader)
and as a tool for obfuscating the intent of the
message (to fool spam classifiers). In the
example shown here, the spammer attempts to use
HTML comments to hide the presence of certain
high-probability spam words. Some spam also uses
JavaScript to take advantage of certain security
holes in the browser or email software. Spammers
have also tried to exploit known security
problems with certain HTML tags and ActiveX.
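The comment trick described above can be undone before classification. The sketch below is one simple way to do it, assuming regular expressions are good enough for the exercise; a real HTML parser would be more robust. The function name `visible_text` and the sample markup are hypothetical.

```python
import re
from html import unescape

COMMENT_RE = re.compile(r"<!--.*?-->", re.S)
TAG_RE = re.compile(r"<[^>]+>")


def visible_text(html):
    """Approximate the text a reader would see in an HTML email body."""
    # Remove comments with no replacement so split words rejoin,
    # e.g. "re<!-- x -->finance" becomes "refinance".
    without_comments = COMMENT_RE.sub("", html)
    # Replace tags with a space so adjacent words stay separated.
    without_tags = TAG_RE.sub(" ", without_comments)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(unescape(without_tags).split())


if __name__ == "__main__":
    sample = "<p>I know you want to re<!-- qzx -->fin<!-- k2 -->ance your home</p>"
    print(visible_text(sample))
```

Running a classifier on both the raw markup and the visible text lets students measure how much the obfuscation actually helps the spammer.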
Conclusions
Although spam serves to annoy and
bewilder the vast majority of Internet users,
undergraduate students seem particularly
enthralled by this topic area. Perhaps it is the
plethora of titillating keywords used in so many
spam emails that attracts students, or perhaps it
is the vaguely criminal nature of the scams and
snake oil sales pitches that draws them in. In
either case, the application of the spam
problem to more traditional projects can direct
the students' natural fascination with this
subject area into valuable opportunities for
undergraduate study.