Title: Henry S. Baird
1 A Highly Legible CAPTCHA that Resists Segmentation Attacks
- Henry S. Baird
- Michael A. Moll
- Sui-Yu Wang
2 Some Typical CAPTCHAs
AltaVista
eBay/PayPal
Yahoo!
PARC's PessimalPrint
3 All These Are Vulnerable to Segment-then-Recognize Attack
- Effective strategy of attack
- Segment image into characters
- Apply aggressive OCR to isolated chars
- If it's known (or guessed) that the word is spellable (e.g. legal English), use the lexicon to constrain interpretations
- Patrice Simard (MS Research) et al. report that this breaks many widely used CAPTCHAs
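The lexicon step of this attack can be illustrated with a minimal sketch. It assumes the attacker's OCR returns, for each isolated character, a dictionary of candidate letters with scores; the function name `best_lexicon_word`, the scores, and the tiny lexicon are hypothetical and are not taken from the attacks reported by Simard et al.

```python
import math

def best_lexicon_word(char_scores, lexicon, floor=1e-6):
    """Return the lexicon word that best explains per-character OCR scores.

    char_scores: one dict per segmented character, mapping letter -> score.
    floor: score assumed for letters the OCR did not propose (an assumption).
    """
    best_word, best_logp = None, -math.inf
    for word in lexicon:
        if len(word) != len(char_scores):
            continue  # the segmentation fixes the word length
        logp = sum(math.log(scores.get(ch, floor))
                   for ch, scores in zip(word, char_scores))
        if logp > best_logp:
            best_word, best_logp = word, logp
    return best_word

# Hypothetical OCR output for three isolated characters.
scores = [
    {"e": 0.55, "c": 0.45},
    {"o": 0.60, "a": 0.40},
    {"t": 0.70, "l": 0.30},
]
lexicon = ["cat", "cot", "eat", "dog", "coal"]

# The top-scoring letter at each position, ignoring the lexicon.
raw = "".join(max(s, key=s.get) for s in scores)
print(raw, "->", best_lexicon_word(scores, lexicon))
```

Taking the best letter at each position independently yields a string that is not a word, while the lexicon constraint selects a legal word from the same scores. This is why challenges built from dictionary words are easier to break once segmentation succeeds.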
4 We try to generate word-images that will be hard to segment into characters
- Slice characters up: vertical cuts, then horizontal cuts
- Set size of cuts to constant within a word
- Choose positions of cuts randomly
- Force pieces to drift apart: scatter horizontally and vertically
- Change intercharacter space
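A rough sketch of the cut-and-scatter idea on a single binary glyph follows. It is not the ScatterType implementation: the function `scatter_glyph`, its parameters, the Gaussian displacement, and the rule that blocks move away from the glyph centre are all assumptions made for illustration.

```python
import random

def scatter_glyph(glyph, cut, h_mean, v_mean, sd=0.5, seed=0):
    """Cut a binary glyph into square blocks and let the blocks drift apart.

    glyph: list of rows, each a list of 0/1 pixels.
    cut: block size in pixels (held constant, as within one word).
    h_mean, v_mean: mean horizontal / vertical displacement in pixels.
    """
    rng = random.Random(seed)
    height, width = len(glyph), len(glyph[0])
    # Random phase of the cutting grid, so cut positions vary.
    x0, y0 = rng.randrange(cut), rng.randrange(cut)
    # Margin that holds displaced blocks; displacements are clamped to it.
    pad = int(max(h_mean, v_mean) + 4 * sd) + 1
    out = [[0] * (width + 2 * pad) for _ in range(height + 2 * pad)]
    offsets = {}
    for y in range(height):
        for x in range(width):
            if not glyph[y][x]:
                continue
            block = ((x + x0) // cut, (y + y0) // cut)
            if block not in offsets:
                # Centre of this block in glyph coordinates.
                cx = (block[0] + 0.5) * cut - x0
                cy = (block[1] + 0.5) * cut - y0
                # Assumption: blocks move away from the glyph centre.
                sx = 1 if cx >= width / 2 else -1
                sy = 1 if cy >= height / 2 else -1
                dx = sx * min(pad, round(abs(rng.gauss(h_mean, sd))))
                dy = sy * min(pad, round(abs(rng.gauss(v_mean, sd))))
                offsets[block] = (dx, dy)
            dx, dy = offsets[block]
            out[y + pad + dy][x + pad + dx] = 1
    return out

# A crude 6x6 "L" shape used as a stand-in glyph.
glyph = [[1, 1, 0, 0, 0, 0] for _ in range(4)] + [[1] * 6 for _ in range(2)]
for row in scatter_glyph(glyph, cut=2, h_mean=2, v_mean=1):
    print("".join("#" if p else "." for p in row))
```

Blocks on the same side of the centre move in the same direction by different amounts, so pieces can overlap as well as separate. Applying the same procedure to every character of a word, with reduced spacing between characters, would let fragments of neighbouring characters mix.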
5 Character fragments can interpenetrate
- Not only is it hard to segment the word into characters, ...
- it can be hard to recombine character fragments into characters
7 Nonsense Words
- We use nonsense (but English-like) words (as in BaffleText)
- generated pseudorandomly by a stochastic variable-length character n-gram model
- trained on the Brown corpus
- this protects against lexicon-driven attacks
- Why not use random strings?
- We want to help human readers feel confident they have made a plausible choice, so they'll put up with severe image degradations (cf. research in psychophysics of reading)
M. Chew and H. S. Baird, "BaffleText: a Human Interactive Proof," Proc., 10th SPIE/IS&T Document Recognition and Retrieval Conf. (DRR2003), Santa Clara, CA, January 23-24, 2003.
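To make the word-generation idea concrete, here is a small sketch of a character n-gram generator. It deliberately simplifies: a fixed-order model (two characters of context) stands in for the variable-length model, and a six-word inline list stands in for the Brown corpus. The names `train` and `generate` are hypothetical.

```python
import random
from collections import defaultdict

def train(words, order=2):
    """Record which character follows each `order`-character context."""
    model = defaultdict(list)
    for w in words:
        padded = "^" * order + w + "$"  # ^ marks word start, $ marks word end
        for i in range(order, len(padded)):
            model[padded[i - order:i]].append(padded[i])
    return model

def generate(model, order=2, max_len=10, rng=random):
    """Sample one pseudo-word, one character at a time."""
    context, out = "^" * order, []
    while len(out) < max_len:
        ch = rng.choice(model[context])  # sampled in proportion to counts
        if ch == "$":
            break
        out.append(ch)
        context = (context + ch)[-order:]
    return "".join(out)

# Stand-in training data (assumption: the real model uses a large corpus).
words = ["legible", "segment", "scatter", "typeface", "fragment", "character"]
model = train(words)
rng = random.Random(1)
print([generate(model, rng=rng) for _ in range(5)])
```

Every generated string is built from letter sequences seen in training, which is what makes the output look pronounceable. With so little training data many outputs will simply reproduce training words; a large corpus gives far more variety, and a real generator would also need to reject outputs that are dictionary words.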
8 How Well Can People Read These?
- We carried out a human legibility trial with the help of
- 60 volunteers: students, faculty, staff at Lehigh Univ.
- plus colleagues at Avaya Labs Research
9 Subjects were told they got it right/wrong after they rated its difficulty
10 Subjective difficulty ratings were correlated with objective difficulty
- People often know when they've done well
- This can be used to ensure that challenges aren't too hard (frustrating, angering)
11 The same data, graphically
[Figure. Legend: difficulty rating 1 (Easy) to 5 (Impossible); Right / Wrong.]
12 People Rated These Easy (1/5)
- aferatic
- memmari
- heiwho
- nampaign
13 Rated Medium Hard (3/5)
- overch / ovorch
- wouwould
- atlager / adager
- weland / wejund
14 Rated Impossible (5/5)
- acchown / echaeva
- gualing / gealthas
- bothere / beadave
- caquired / engaberse
15 Why is ScatterType legible?
- Does it surprise you that this is legible?
- I speculate that we can read it because
- we exploit typeface consistency
- the evidence is small details of local shape
- this ability seems largely unconscious
16 Ensuring that ScatterType is Legible
- We mapped the domain of legibility as a function of engineering choices
- typefaces
- characters in the alphabet
- cutting & scattering parameters: cut fraction, expansion fraction, horizontal scatter mean, vertical scatter mean, h & v scatter variance, character separation
17 Some typefaces remain legible while others degrade quickly
18 Raising Legibility by Pruning Typefaces
19 Some Characters Quickly Become Confusable
overch - o, e, c confusions
20 Raising Accuracy by Omitting Characters
21 Ensuring Legibility
- Pruning characters & typefaces
- raised legibility in the top two difficulty levels to 90%
- Next step
- restrict the range of cutting & scatter parameters
22 Mean Horizontal Scatter vs Mean Vertical Scatter
[Figure: mean horizontal scatter vs mean vertical scatter. Legend: difficulty rating 1 (Easy) to 5 (Impossible); Right / Wrong.]
Mirage data analysis tool, Tin Kam Ho, Bell Labs.
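23 Cut Fraction Histogram
[Figure: histogram of cut fraction. Legend: difficulty rating 1 (Easy) to 5 (Impossible); Right / Wrong.]
24 Character Separation Histogram
[Figure: histogram of character separation. Legend: difficulty rating 1 (Easy) to 5 (Impossible); Right / Wrong.]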
25 Finding Parameter Ranges for High Legibility
d = Euclidean distance from the origin of Mean Horiz Scatter vs Mean Vertical Scatter
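Read literally, d is the length of the (mean horizontal scatter, mean vertical scatter) vector, d = sqrt(h^2 + v^2). The short sketch below computes it and keeps only challenges under a cutoff. The dictionary keys, the helper names, and the cutoff value are hypothetical, since the slides do not state the ranges that were chosen.

```python
import math

def scatter_distance(mean_h, mean_v):
    """Euclidean distance d from the origin of the scatter-mean plane."""
    return math.hypot(mean_h, mean_v)

def within_range(challenges, d_max):
    """Keep only challenges whose scatter distance is at most d_max."""
    return [c for c in challenges
            if scatter_distance(c["mean_h"], c["mean_v"]) <= d_max]

# Hypothetical parameter settings for two challenges.
challenges = [
    {"mean_h": 3.0, "mean_v": 4.0},
    {"mean_h": 1.0, "mean_v": 1.0},
]
kept = within_range(challenges, d_max=2.0)  # d_max is an assumed cutoff
print(kept)
```

Collapsing the two scatter means into one number makes it possible to pick a single threshold from trial data instead of a two-dimensional region.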
26 Guided by this Analysis, We Can Define Legibility Regimes
Trivial - large cut fraction and small expansion
Simple - character separation also decreases
Easy - in original trial, correct 81% of time
Medium Hard - larger scatter distances degrade legibility noticeably
27 Other Examples - Easy
wexped - difficult to segment e, x and p. Shows difficulty of achieving 100% legibility
veral - same parameters as above but different font. Not as difficult to segment
28 Other Examples - Too Hard
thern - difficult to read, but easier than most with the same parameter values. Font makes a big difference.
wezre - satisfactorily illegible, though probably segmentable
29 Next Steps
- By judicious restrictions on engineering parameters, attempt to ensure human legibility better than 99.5%
- Similarly, attempt to ensure 90% of challenges have low subjective difficulty ratings (e.g. 1-3 out of 5)
- You are welcome to try out ScatterType
- arcturus.cse.lehigh.edu/CAPTCHAs
- Also, we invite you to attack it
- We'll send you large batches, with ground-truth
- Try to train a classifier to break it!
30 Future Work
- We have exhausted the experimental data from the 1st trial
- How can we automatically create images with given difficulty?
- We have generated many images that seem difficult to segment automatically, but we don't understand how to guarantee this
- We need to understand the effects of typefaces on ScatterType legibility
- We want to study character-confusion pairs more
- Attacking ScatterType
- Testing on best OCR systems
- Invite attacks from other researchers
- Is it credible if we attack it ourselves, and fail?
31 Contacts
- Henry S. Baird
- baird_at_cse.lehigh.edu
- Michael Moll
- mam7_at_lehigh.edu