On Combining Multiple Segmentations in Scene Text Recognition - PowerPoint PPT Presentation

1
On Combining Multiple Segmentations in Scene Text
Recognition
  • Lukáš Neumann and Jiří Matas
  • Centre for Machine Perception, Department of
    Cybernetics
  • Czech Technical University, Prague

2
Talk Overview
  • End-to-End Scene Text Recognition - Problem
    Introduction
  • The TextSpotter System
  • Character Detection as Extremal Region (ER)
    Selection
  • Line Formation and Character Recognition
  • Character Ordering
  • Optimal Sequence Selection
  • Experiments

3
End-to-End Scene Text Recognition
  • Input: digital image (BMP, JPG, PNG) / video
    (AVI)
  • Lexicon-free method
  • Output: set of words in the image; each word is a
    (horizontal) rectangular bounding box plus its text
    content

4
System Overview
  • Multi-scale Character Detection [1]
    with Gaussian Pyramid (new)
  • Text Line Formation [2]
  • Character Recognition [3]
  • Optimal Sequence Selection (new)

[1] L. Neumann, J. Matas, Real-time scene text localization and recognition, CVPR 2012
[2] L. Neumann, J. Matas, Text localization in real-world images using efficiently pruned exhaustive search, ICDAR 2011
[3] L. Neumann, J. Matas, A method for text localization and recognition in real-world images, ACCV 2010
5
Character Detection - Thresholding
Input image (PNG, JPEG, BMP)
1D projection to $\langle 0, 255 \rangle$ (grey scale, hue, ...)
Extremal regions with threshold $\theta$ ($\theta = 50, 100,
150, 200$)
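A minimal sketch of the thresholding step, assuming the image has already been projected to a single channel with values in $\langle 0, 255 \rangle$ and using the 4-neighbourhood as the adjacency relation. Every 4-connected component of $\{p : I(p) < \theta\}$ is an extremal region, because a boundary pixel with $I(q) < \theta$ would have joined the component. The function name `extremal_regions_at` and the toy image are hypothetical, and a real detector would not re-run a flood fill for every threshold.

```python
from collections import deque


def extremal_regions_at(image, theta):
    """Return the 4-connected components of {p : I(p) < theta}.

    `image` is a list of rows of ints in <0, 255> (one projected channel);
    each component is returned as a frozenset of (row, col) pixels.
    """
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or image[r][c] >= theta:
                continue
            # Breadth-first flood fill over the 4-neighbourhood.
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and image[ny][nx] < theta):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(frozenset(pixels))
    return regions


# Toy image: two dark strokes (40 and 90) on a bright (200) background.
img = [
    [200, 200, 200, 200, 200],
    [200,  40, 200,  90, 200],
    [200,  40, 200,  90, 200],
    [200, 200, 200, 200, 200],
]
for theta in (50, 100, 150, 250):
    print(theta, sorted(len(q) for q in extremal_regions_at(img, theta)))
```

At $\theta = 50$ only the darker stroke forms a region, from $\theta = 100$ both strokes do, and at $\theta = 250$ the whole image is a single region.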
6
Extremal Regions (ER)
  • Let image $I$ be a mapping $I: \mathbb{Z}^2 \to S$, where $S$ is a
    totally ordered set, e.g. $\langle 0, 255 \rangle$
  • Let $A$ be an adjacency relation (e.g.
    4-neighbourhood)
  • Region $Q$ is a contiguous subset w.r.t. $A$
  • (Outer) region boundary $\partial Q$ is the set of pixels
    adjacent to, but not belonging to, $Q$
  • Extremal region is a region for which there exists a
    threshold $\theta$ that separates the region from its
    boundary: $\forall p \in Q, \forall q \in \partial Q: I(p) < \theta \le I(q)$

[Figure: example extremal region, $\theta = 32$]
  • Assuming a character is an ER, three parameters still
    have to be determined:
  • Threshold
  • Mapping to a totally ordered set (colour space
    projection)
  • Adjacency relation
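The extremal-region definition can be checked directly. The sketch below (hypothetical helper names, 4-neighbourhood assumed, contiguity of $Q$ assumed rather than checked) returns the interval of thresholds $\theta$ with $\max_{p \in Q} I(p) < \theta \le \min_{q \in \partial Q} I(q)$, or `None` when no such threshold exists, i.e. when $Q$ is not an extremal region.

```python
def outer_boundary(region, shape):
    """Outer boundary dQ: pixels 4-adjacent to `region` but not in it."""
    h, w = shape
    boundary = set()
    for y, x in region:
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in region:
                boundary.add((ny, nx))
    return boundary


def separating_thresholds(image, region):
    """Interval (lo, hi] of thresholds theta with I(p) < theta <= I(q)
    for all p in Q and q in dQ, or None if Q is not an extremal region."""
    lo = max(image[y][x] for y, x in region)
    boundary = outer_boundary(region, (len(image), len(image[0])))
    if not boundary:  # Q is the whole image: any theta above lo separates it
        return (lo, float("inf"))
    hi = min(image[y][x] for y, x in boundary)
    return (lo, hi) if lo < hi else None


img = [
    [200, 200, 200, 200, 200],
    [200,  40, 200,  90, 200],
    [200,  40, 200,  90, 200],
    [200, 200, 200, 200, 200],
]
print(separating_thresholds(img, {(1, 1), (2, 1)}))  # (40, 200): an ER
print(separating_thresholds(img, {(1, 1)}))          # None: neighbour (2, 1) is also 40
```

The second region fails because one of its boundary pixels has the same value as the region itself, so no threshold can separate them.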

7
ER Detection - Threshold Selection
  • Character boundaries are often fuzzy
  • It is very difficult to determine the threshold
    value locally; the typical document processing
    pipeline (image binarization → OCR) leads to
    inferior results
  • Thresholds that most probably correspond to a
    character segmentation are selected using a CSER
    classifier [1]; multiple hypotheses for each
    character are generated

[1] L. Neumann and J. Matas, Real-time scene text localization and recognition, CVPR 2012
8
ER Detection - Threshold Selection
  • $p(r \mid \text{character})$ estimated at each threshold for
    each region
  • Only regions corresponding to local maxima of
    this probability are selected by the detector
  • Incrementally computed descriptors used for
    classification [1]:
  • Aspect ratio
  • Compactness
  • Number of holes
  • Horizontal crossings
  • AdaBoost classifier with decision trees, calibrated
    to output probabilities
  • Linear complexity, real-time performance (300 ms
    on an 800×600 px image)

[1] L. Neumann and J. Matas, Real-time scene text localization and recognition, CVPR 2012
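A minimal sketch of the local-maxima selection, assuming $p(r \mid \text{character})$ has already been computed for the nested regions obtained at increasing thresholds (one branch of the component tree). The cut-off `p_min` and the function name are hypothetical, flat plateaus are not selected in this simplified version, and the actual detector may apply further criteria not shown on the slide.

```python
def local_maxima(probs, p_min=0.2):
    """Indices i where probs[i] is a strict local maximum of the sequence
    (missing neighbours at either end count as -inf) and probs[i] >= p_min."""
    picked = []
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else float("-inf")
        right = probs[i + 1] if i + 1 < len(probs) else float("-inf")
        if p > left and p > right and p >= p_min:
            picked.append(i)
    return picked


# Hypothetical p(r|character) for one region as the threshold increases.
thresholds = [50, 100, 150, 200]
probs = [0.1, 0.8, 0.3, 0.6]
print([thresholds[i] for i in local_maxima(probs)])  # [100, 200]
```

Both maxima survive, which is how several segmentation hypotheses of the same character can be passed on to the later stages.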
9
ER Detection - Color Space Projection
  • Color space projection maps a color image into a
    totally ordered set
  • Trade-off between recall and speed (although the
    channels can easily be processed in parallel)
  • Standard channels (R, G, B, H, S, I) of RGB / HSI
    color space
  • 85.6% of characters are detected in the intensity
    channel; combining all channels increases the
    recall to 94.8%

[Figure: source image; intensity channel (no threshold exists for the letter A); red channel]
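A sketch of projecting one RGB pixel onto the six channels, using the common textbook RGB to HSI conversion. Whether the method uses exactly these formulas, and the rescaling of H and S to $\langle 0, 255 \rangle$, are assumptions; the names `project` and `channel` are hypothetical.

```python
import math


def project(rgb):
    """Project one 8-bit (R, G, B) pixel onto the R, G, B, H, S and I channels,
    each expressed in the totally ordered set <0, 255>."""
    r, g, b = rgb
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:  # grey pixel: hue is undefined, map it to 0
        h = 0.0
    else:
        h = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        if b > g:
            h = (360.0 - h) % 360.0
    return {"R": r, "G": g, "B": b,
            "H": round(h / 360.0 * 255), "S": round(s * 255), "I": round(i)}


def channel(image, name):
    """Project a whole image (rows of (R, G, B) tuples) onto one channel."""
    return [[project(px)[name] for px in row] for row in image]


print(project((255, 0, 0)))  # pure red
print(project((0, 0, 255)))  # pure blue
```

Each channel produced by `channel` can then be thresholded independently, which is what makes the per-channel processing easy to parallelize.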
10
ER Detection - Gaussian Pyramid
  • Pre-processing with a Gaussian pyramid alters the
    adjacency relation
  • At each level of the pyramid only a certain
    interval of character stroke widths is amplified
  • Not a major overhead: each level is 4 times
    faster than the previous one, so the total processing
    takes 4/3 of the first level ($1 + \frac{1}{4} + \frac{1}{4^2} + \dots = \frac{4}{3}$)

[Figure: characters formed of multiple small regions; multiple characters joined together]
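A sketch of one pyramid level and of the cost argument. The 5-tap binomial kernel [1, 4, 6, 4, 1]/16 (as in the classic Burt and Adelson pyramid) and edge replication at the border are assumptions, not details from the slides. Since every level has a quarter of the pixels of the previous one, the relative cost of all levels is a geometric series bounded by 4/3.

```python
def pyr_down(image):
    """One Gaussian pyramid level: blur with the 5x5 binomial kernel (outer
    product of [1, 4, 6, 4, 1] / 16 with itself), then keep every second
    row and column."""
    kernel = (1, 4, 6, 4, 1)
    h, w = len(image), len(image[0])

    def at(y, x):
        # Replicate edge pixels outside the image.
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            acc = 0
            for dy, ky in zip(range(-2, 3), kernel):
                for dx, kx in zip(range(-2, 3), kernel):
                    acc += ky * kx * at(y + dy, x + dx)
            row.append(acc / 256.0)  # kernel weights sum to 16 * 16
        out.append(row)
    return out


def relative_cost(levels):
    """Pixels processed over `levels` pyramid levels, relative to level 0:
    1 + 1/4 + 1/4**2 + ..., which approaches 4/3."""
    return sum(0.25 ** k for k in range(levels))


print(relative_cost(1), relative_cost(2), relative_cost(10))
```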
11
Character Recognition
  • Regions agglomerated into text line hypotheses
    by exhaustive search [1]
  • Each segmentation (region) labeled by a FLANN
    classifier trained on synthetic data [2]
  • Each text line hypothesis contains multiple
    mutually exclusive segmentations with different
    label(s)

[1] L. Neumann, J. Matas, Text localization in real-world images using efficiently pruned exhaustive search, ICDAR 2011
[2] L. Neumann, J. Matas, A method for text localization and recognition in real-world images, ACCV 2010
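FLANN performs approximate nearest-neighbour search. The sketch below swaps in exact brute-force k-nearest-neighbour voting to show how a region could receive a label together with a confidence; the feature vectors, `k`, the function name and the confidence definition (fraction of votes) are all hypothetical.

```python
import math
from collections import Counter


def knn_label(query, train, k=3):
    """Label `query` by majority vote of its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs. Returns (label,
    confidence), where confidence is the fraction of the k votes won.
    Exact brute-force search stands in for FLANN's approximate search.
    """
    nearest = sorted(train, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


# Hypothetical 2D features for the easily confused glyphs "l" and "I".
train = [((0, 0), "l"), ((0, 1), "l"), ((0.2, 0.5), "l"), ((5, 5), "I"), ((5, 6), "I")]
print(knn_label((4, 5), train))
```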
12
Character Ordering
  • Region A is a predecessor of a region B if A
    immediately precedes B in a text line
  • Approximated by a heuristic function based on
    text direction and mutual overlap
  • The relation induces a directed graph for each
    text line
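A sketch of one possible predecessor heuristic for boxes in a horizontal, left-to-right text line. The box representation (x0, y0, x1, y1), the thresholds `min_overlap` and `max_gap`, and all function names are hypothetical; the slide only states that the heuristic is based on text direction and mutual overlap.

```python
def vertical_overlap(a, b):
    """Overlap of the y-extents of boxes a and b, relative to the shorter box."""
    top, bottom = max(a[1], b[1]), min(a[3], b[3])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(0, bottom - top) / shorter if shorter > 0 else 0.0


def is_predecessor(a, b, min_overlap=0.5, max_gap=0.5):
    """Heuristic: does box a = (x0, y0, x1, y1) directly precede box b?"""
    cx_a, cx_b = (a[0] + a[2]) / 2, (b[0] + b[2]) / 2
    height = max(a[3] - a[1], b[3] - b[1])
    return (b[0] > cx_a and a[2] < cx_b                # b to the right, limited horizontal overlap
            and vertical_overlap(a, b) >= min_overlap  # same text line
            and b[0] - a[2] <= max_gap * height)       # small gap: "immediately" precedes


def predecessor_graph(boxes):
    """Directed graph {i: [j, ...]} with an edge i -> j if box i precedes box j."""
    return {i: [j for j, b in enumerate(boxes) if j != i and is_predecessor(a, b)]
            for i, a in enumerate(boxes)}


# Boxes 1 and 2 are alternative (mutually exclusive) segmentations; box 4 is far away.
boxes = [(0, 0, 10, 20), (12, 0, 22, 20), (12, 2, 30, 20), (24, 0, 34, 20), (100, 0, 110, 20)]
print(predecessor_graph(boxes))
```

Alternative segmentations of the same character become parallel branches of the graph rather than successors of each other.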

13
Optimal Sequence Selection
  • The final region sequence of each text line is
    selected as an optimal path in the graph,
    maximizing the total score
  • Unary terms
  • Text line positioning (prefers regions which fit
    well into the text line)
  • Character recognition confidence
  • Binary terms (region-pair compatibility score)
  • Threshold interval overlap (prefers neighboring
    regions with similar thresholds)
  • Language model transition probability (2nd-order
    character model)

[Figure: example result, "Accommodation"]
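Because the predecessor relation points in the text direction, the graph is acyclic, and a maximum-score path can be found by dynamic programming over a topological order in time linear in the number of nodes and edges. The sketch assumes the unary and binary terms have already been combined into one number per node and per edge (the weighting is not given on the slides); `best_sequence` is a hypothetical name, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def best_sequence(graph, unary, binary):
    """Highest-scoring path in a DAG.

    graph:  {node: [successor, ...]}
    unary:  {node: score}    combined per-region terms
    binary: {(a, b): score}  combined pairwise terms, one per edge a -> b
    Returns (total_score, path).
    """
    # TopologicalSorter expects predecessors, so invert the successor lists.
    preds = {n: set() for n in graph}
    for a, succs in graph.items():
        for b in succs:
            preds.setdefault(b, set()).add(a)
    best = {}  # node -> (best score of a path ending at node, previous node)
    for node in TopologicalSorter(preds).static_order():
        score, prev = unary[node], None  # a path may start at any node
        for p in preds[node]:
            cand = best[p][0] + binary[(p, node)] + unary[node]
            if cand > score:
                score, prev = cand, p
        best[node] = (score, prev)
    end = max(best, key=lambda n: best[n][0])
    path, node = [], end
    while node is not None:  # follow back-pointers to the start of the path
        path.append(node)
        node = best[node][1]
    return best[end][0], path[::-1]


graph = {0: [1, 2], 1: [3], 2: [], 3: [], 4: []}
unary = {0: 1.0, 1: 0.5, 2: 0.9, 3: 0.8, 4: 0.2}
binary = {(0, 1): 0.3, (0, 2): 0.1, (1, 3): 0.4}
print(best_sequence(graph, unary, binary))
```

Here the path 0 → 1 → 3 wins over 0 → 2 even though region 2 alone scores higher than region 1, because the pairwise terms reward the longer, consistent sequence.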
14
Experiments
ICDAR 2011 Dataset - Text Localization
Pipeline     Recall (%)   Precision (%)   F (%)   Time / image
SM + SS      45.9         69.8            55.4    1.87 s
SM + MS      55.5         75.2            63.8    2.35 s
SWT + SS     38.0         66.0            48.0    0.60 s
SWT + MS     41.0         80.0            54.0    0.84 s
MLM + SS     62.1         85.9            72.0    2.52 s
MLM + MS     67.5         85.4            75.4    3.10 s
SM: Single Maximum, the segmentation with the highest CSER score
MLM: Multiple Local Maxima, segmentations which correspond to local maxima of the CSER score
SWT: Stroke Width Transform, reimplementation of the character detector of Epshtein et al. [1]
SS: Single Scale; MS: Multiple Scales (Gaussian Pyramid)
[1] B. Epshtein, E. Ofek, and Y. Wexler, Detecting text in natural scenes with stroke width transform, CVPR 2010
15
Experiments
ICDAR 2011 Dataset - Text Localization
Method                                 Recall (%)   Precision (%)   F (%)
Proposed method                        67.5         85.4            75.4
Shi's method [1]                       63.1         83.3            71.8
Kim's method [2] (ICDAR 2011 winner)   62.5         83.0            71.3
Neumann & Matas [3]                    64.7         73.1            68.7
Yi's method [4]                        58.1         67.2            62.3
TH-TextLoc System [5]                  57.7         67.0            62.0
[1] C. Shi, C. Wang, B. Xiao, Y. Zhang, and S. Gao, Scene text detection using graph model built upon maximally stable extremal regions, Pattern Recognition Letters, 2013
[2] A. Shahab, F. Shafait, and A. Dengel, ICDAR 2011 robust reading competition challenge 2: Reading text in scene images, ICDAR 2011
[3] L. Neumann and J. Matas, Real-time scene text localization and recognition, CVPR 2012
[4] C. Yi and Y. Tian, Text string detection from natural scenes by structure-based partition and grouping, IEEE Transactions on Image Processing, 2011
[5] S. M. Hanif and L. Prevost, Text detection and localization in complex scene images using constrained AdaBoost algorithm, ICDAR 2009
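The f column is the harmonic mean $f = 2RP / (R + P)$ of recall $R$ and precision $P$. A quick check against the rows of the localization table: the recomputed values agree with the reported ones to within about 0.1, which is consistent with the recall and precision inputs being rounded to one decimal. The helper name `f_measure` is hypothetical.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (both in %)."""
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported f) from the ICDAR 2011 localization table.
rows = {
    "Proposed method": (67.5, 85.4, 75.4),
    "Shi's method": (63.1, 83.3, 71.8),
    "Kim's method": (62.5, 83.0, 71.3),
    "Neumann & Matas": (64.7, 73.1, 68.7),
    "Yi's method": (58.1, 67.2, 62.3),
    "TH-TextLoc System": (57.7, 67.0, 62.0),
}
for name, (r, p, f) in rows.items():
    print(f"{name:20s} reported {f:5.1f}  recomputed {f_measure(r, p):5.1f}")
```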
16
Experiments
ICDAR 2011 Dataset - End-to-End Text Recognition
Method                            Recall (%)   Precision (%)   F (%)
Proposed method                   37.8         39.4            38.5
Neumann & Matas (CVPR 2012) [1]   37.2         37.1            36.5
Percentage of words correctly recognized without any error
(case-sensitive comparison, ICDAR 2003 protocol)
[1] L. Neumann and J. Matas, Real-time scene text localization and recognition, CVPR 2012
17
Sample Results on the ICDAR 2011 Dataset
18
Conclusions
  • Multi-scale processing / Gaussian Pyramid
    improves text localization results without a
    significant impact on speed
  • Combining several channels and postponing the
    decision about character detection parameters
    (e.g. binarization threshold) to a later stage
    improves localization and OCR accuracy
  • Current state of the method:
  • Placed second in the ICDAR 2013 Text
    Localization competition, 1.4% behind the
    winner in f-measure (unfortunately, end-to-end
    text recognition is not part of the competition)
  • Online demo available at http://www.textspotter.org/
  • OpenCV implementation of the character detector
    in progress by the open-source community
  • Future work:
  • OCR accuracy improvement
  • Overcoming limitations of connected-component (CC)
    based methods (e.g. non-linearity, non-robustness
    caused by a single pixel)