Title: Extraction of Vectorized Graphical Information from Scientific Chart Images
1 Extraction of Vectorized Graphical Information from Scientific Chart Images
Ruizhe Liu, Weihua Huang, Chew Lim Tan
School of Computing, National University of Singapore
2 Outline
- Introduction
- Related works
- The proposed approach
- Experimental results
- Conclusion
3 Introduction
- What is represented in a scientific chart?
- Recognition and interpretation are the reverse process: going from the chart image back to what it represents
4 Introduction
- The significance of recognition and interpretation of scientific chart images
- Automatic document processing: convert the charts into machine-readable form
- Information retrieval and extraction: recover the tabular data and the intended message
5 Introduction
- Major steps in the system
6 Introduction
- The role of vectorization
- Converts pixels into graphical primitives
- Forms the basis for graphical symbol construction [1]
- The proposed approach
- Directional single-connected chains (DSCC) and curve fitting
8 Related Works
- Point-based vectorization
- A set of points from the contour or the skeleton of the lines is chosen
- Curve fitting methods are applied
- Limitations
- A proper point set is required; otherwise the fit easily shifts from the true curve when the lines are distorted
- Difficult to handle line intersections
9 Related Works
- Point-based vectorization
Straight line fitting
Ellipse fitting
10 Related Works
- Segment-based vectorization
- Segments obtained from thinning, contour tracking, run-length coding or medial-axis tracking
- Make use of geometric features
- Limitations
- Difficult to extract geometric features for complex curves
- Some still have difficulty with line intersections
11 Related Works
- Segment-based vectorization
Thinning based method
Sparse pixel tracking method
Progressive simplification based tracking method
13 The Proposed Approach
- Major steps
- It is a 2-pass vectorization process
- Run-lengths break lines and curves at each intersection
- Broken lines and curves are rejoined during post-processing
14 The Proposed Approach
- The directional single-connected chain (DSCC)
- A chain of run-lengths following a single direction, which can be treated as a segment
- Originally proposed for form recognition
- The average length of the run-lengths estimates the thickness of the chain
- The set of mid-points of the run-lengths in a chain allows curve fitting to the chain
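The two quantities named on this slide can be sketched in a few lines. The sketch below is only an illustration: the tuple layout `(x, y_start, y_end)` for a vertical run-length and the function names are assumptions, not something the slides specify.

```python
from statistics import mean

# Assumed representation (not from the slides): a vertical run-length is
# (x, y_start, y_end), a maximal run of black pixels in column x, ends inclusive.

def run_length(run):
    """Number of pixels in one run-length."""
    _, y0, y1 = run
    return y1 - y0 + 1

def chain_thickness(chain):
    """Average run-length, used as the estimate of the chain's thickness."""
    return mean(run_length(r) for r in chain)

def chain_midpoints(chain):
    """Mid-point of every run-length; these are the points passed to curve fitting."""
    return [(x, (y0 + y1) / 2) for x, y0, y1 in chain]

chain = [(0, 10, 12), (1, 10, 12), (2, 11, 13), (3, 11, 14)]
print(chain_thickness(chain))   # 3.25
print(chain_midpoints(chain))   # [(0, 11.0), (1, 11.0), (2, 12.0), (3, 12.5)]
```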
15 The Proposed Approach
- Construction of DSCC (straight-line)
Horizontal run-length
Vertical run-length
- Chain formed by run-lengths that
- have similar length
- keep the direction of the chain (run-length does not shift too much)
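A minimal sketch of how such chains might be built from vertical run-lengths follows. It is an assumption-laden illustration, not the authors' implementation: the image format (rows of 0/1), the neighbor test (vertical overlap in adjacent columns), and the thresholds `max_len_diff` and `max_shift` are all hypothetical.

```python
def vertical_runs(image):
    """Per column, the list of (y_start, y_end) runs of 1-pixels (ends inclusive)."""
    height, width = len(image), len(image[0])
    columns = []
    for x in range(width):
        runs, y = [], 0
        while y < height:
            if image[y][x]:
                start = y
                while y < height and image[y][x]:
                    y += 1
                runs.append((start, y - 1))
            else:
                y += 1
        columns.append(runs)
    return columns

def overlaps(r, s):
    """True if two runs in adjacent columns share at least one row."""
    return r[0] <= s[1] and s[0] <= r[1]

def build_chains(columns, max_len_diff=1, max_shift=1.0):
    """Greedy left-to-right chaining; both thresholds are illustrative guesses."""
    chains, open_chains = [], {}   # open_chains: run index in previous column -> chain
    for x, runs in enumerate(columns):
        prev_runs = columns[x - 1] if x > 0 else []
        next_open = {}
        for j, run in enumerate(runs):
            chain = None
            left = [i for i, p in enumerate(prev_runs) if overlaps(p, run)]
            if len(left) == 1:                      # single-connected on the left
                p = prev_runs[left[0]]
                right_of_p = [s for s in runs if overlaps(p, s)]
                similar = abs((run[1] - run[0]) - (p[1] - p[0])) <= max_len_diff
                steady = abs((run[0] + run[1]) / 2 - (p[0] + p[1]) / 2) <= max_shift
                if len(right_of_p) == 1 and similar and steady:
                    chain = open_chains[left[0]]    # extend the existing chain
            if chain is None:                       # otherwise start a new chain
                chain = []
                chains.append(chain)
            chain.append((x, run[0], run[1]))
            next_open[j] = chain
        open_chains = next_open
    return chains

# A horizontal line (row 3) crossed by a vertical line (column 3).
image = [[1 if (y == 3 or x == 3) else 0 for x in range(7)] for y in range(7)]
chains = build_chains(vertical_runs(image))
print([len(c) for c in chains])   # [3, 1, 3]
```

In this toy image the horizontal line is broken into two chains at the crossing, which is the behavior the slides attribute to run-lengths at intersections.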
16 The Proposed Approach
- Construction of DSCC (arc)
- Chain formed by run-lengths that
- have similar length
- keep the direction of the chain (run-length does not shift too much)
- number of neighbors 2
17 The Proposed Approach
- Post-processing
- Filtering: remove run-lengths with length 1 and 1 neighbors
- Smoothing: combine two run-lengths in the same column/row if the blank area between them is less than threshold T
Before
After
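The smoothing rule can be sketched as follows, assuming the run-lengths of one column (or row) are given as sorted, non-overlapping `(start, end)` pairs with inclusive ends. The function name and the example threshold are hypothetical; the slides do not give a value for T.

```python
def smooth_runs(runs, gap_threshold):
    """Merge neighboring runs whose blank gap is smaller than gap_threshold.

    runs: sorted, non-overlapping (start, end) pairs from one column or row.
    """
    merged = []
    for start, end in runs:
        # Number of blank pixels between the previous run and this one.
        if merged and start - merged[-1][1] - 1 < gap_threshold:
            merged[-1] = (merged[-1][0], end)   # close the gap
        else:
            merged.append((start, end))
    return merged

print(smooth_runs([(0, 4), (6, 9), (15, 18)], gap_threshold=2))
# [(0, 9), (15, 18)]
```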
18 The Proposed Approach
- Post-processing
- Splitting: iteratively divide a DSCC at the turning point, that is, the run-length with maximum distance to the line formed by the start and end points of the DSCC
P1
P3
P2
Each DSCC is now a straight line, an arc, or a polyline
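The splitting step uses the same "farthest point from the chord" rule as the split phase of the Ramer-Douglas-Peucker algorithm. The sketch below applies it to a list of run-length mid-points. The stopping `tolerance` is an assumption, since the slides do not say when the division stops.

```python
import math

def point_line_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length

def split_chain(points, tolerance):
    """Recursively split at the point farthest from the start-end line."""
    if len(points) < 3:
        return [points]
    a, b = points[0], points[-1]
    dists = [point_line_distance(p, a, b) for p in points[1:-1]]
    k = max(range(len(dists)), key=dists.__getitem__) + 1   # index of turning point
    if dists[k - 1] <= tolerance:        # assumed stopping rule
        return [points]
    # The turning point is shared by both halves.
    return split_chain(points[:k + 1], tolerance) + split_chain(points[k:], tolerance)

corner = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
pieces = split_chain(corner, tolerance=0.5)
print(pieces)
# [[(0, 0), (1, 0), (2, 0), (3, 0)], [(3, 0), (3, 1), (3, 2), (3, 3)]]
```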
19 The Proposed Approach
- Apply ellipse fitting to each DSCC (A. Fitzgibbon, 1999)
Minimize the squared algebraic distance
$F(A, X) = A \cdot X = ax^2 + bxy + cy^2 + dx + ey + f = 0$,
where $A = [a\ b\ c\ d\ e\ f]^T$ and $X = [x^2\ xy\ y^2\ x\ y\ 1]^T$.
$F(A, X_i)$ is the algebraic distance of a point $(x_i, y_i)$ to the conic $F(A, X) = 0$.
$SA = \lambda CA$, $A^T C A = 1$
$S$ is the scatter matrix $D^T D$. $D = [X_1\ X_2\ \dots\ X_n]^T$ is called the design matrix. $C$ is the matrix that expresses the constraint. $\lambda$ is the Lagrange multiplier.
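As an illustration of the eigenvalue problem above, here is a minimal NumPy sketch. It does not solve the 6x6 system directly; it uses the block-partitioned reformulation by Halir and Flusser, which solves the same constrained problem but stays usable when the points lie exactly on an ellipse and the scatter matrix is singular. The constraint enforced is $4ac - b^2 > 0$, the ellipse condition in Fitzgibbon's formulation. Function and variable names are assumptions.

```python
import numpy as np

def fit_ellipse(x, y):
    """Least-squares conic fit constrained to be an ellipse (4ac - b^2 > 0).

    Returns [a, b, c, d, e, f] for ax^2 + bxy + cy^2 + dx + ey + f = 0.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    D1 = np.column_stack([x * x, x * y, y * y])     # quadratic columns of D
    D2 = np.column_stack([x, y, np.ones_like(x)])   # linear columns of D
    S1, S2, S3 = D1.T @ D1, D1.T @ D2, D2.T @ D2    # blocks of S = D^T D
    T = -np.linalg.solve(S3, S2.T)                  # gives [d, e, f] from [a, b, c]
    M = S1 + S2 @ T                                 # reduced 3x3 scatter matrix
    # Premultiply by the inverse of the 3x3 constraint block [[0,0,2],[0,-1,0],[2,0,0]].
    M = np.array([M[2] / 2.0, -M[1], M[0] / 2.0])
    _, vecs = np.linalg.eig(M)
    vecs = np.real(vecs)
    cond = 4.0 * vecs[0] * vecs[2] - vecs[1] ** 2   # 4ac - b^2 for each eigenvector
    a1 = vecs[:, np.argmax(cond)]                   # the eigenvector describing an ellipse
    return np.concatenate([a1, T @ a1])

# Half of an ellipse with center (3, -1), radii 5 and 2, rotated by 0.5 rad.
t = np.linspace(0.0, np.pi, 25)
px = 3.0 + 5.0 * np.cos(t) * np.cos(0.5) - 2.0 * np.sin(t) * np.sin(0.5)
py = -1.0 + 5.0 * np.cos(t) * np.sin(0.5) + 2.0 * np.sin(t) * np.cos(0.5)
conic = fit_ellipse(px, py)
```

In the pipeline described here, `px` and `py` would be the mid-points of the run-lengths in one DSCC.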
20 The Proposed Approach
- Result of ellipse fitting
- Classification and verification
- Straight line: max_radius / min_radius > T
- Circular arc: max_radius / min_radius ≈ 1
- Elliptic arc: max_radius / min_radius < T
- Polyline: through error checking
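The radius-ratio test can be sketched from the quadratic coefficients of a fitted conic alone, because max_radius / min_radius equals the square root of the ratio of the eigenvalues of the matrix [[a, b/2], [b/2, c]]. The sketch assumes that a very elongated ellipse indicates a straight line and a ratio near 1 indicates a circular arc. The values of `line_threshold` (standing in for T) and `circle_tolerance` are hypothetical, and the polyline check is not covered.

```python
import math

def axis_ratio(a, b, c):
    """max_radius / min_radius of the conic ax^2 + bxy + cy^2 + ... = 0."""
    mean = (a + c) / 2.0
    spread = math.hypot((a - c) / 2.0, b / 2.0)
    lam_small, lam_large = abs(mean) - spread, abs(mean) + spread
    if lam_small <= 0.0:          # not an ellipse (degenerate or open conic)
        return math.inf
    return math.sqrt(lam_large / lam_small)

def classify(ratio, line_threshold=20.0, circle_tolerance=0.1):
    """Label a fitted DSCC by its radius ratio; thresholds are illustrative."""
    if ratio > line_threshold:
        return "straight line"
    if ratio <= 1.0 + circle_tolerance:
        return "circular arc"
    return "elliptic arc"

print(classify(axis_ratio(1.0, 0.0, 1.0)))        # circular arc
print(classify(axis_ratio(1 / 25, 0.0, 1 / 4)))   # elliptic arc
print(classify(axis_ratio(1e-4, 0.0, 1.0)))       # straight line
```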
21 The Proposed Approach
- Combine straight lines
- The two lines must be within the connected area.
- The angle between the two lines should be less than 10 degrees.
- Combine arcs
- The two arcs must be within the connected area.
- The two arcs have a common center and radius.
- The angle between the tangent lines at the starting or ending points of the two arcs should be less than 10 degrees.
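The straight-line rule can be sketched as an angle test between two segments. The connected-area condition is reduced here to a boolean flag `same_component` that the caller is assumed to supply; that flag and the function names are hypothetical.

```python
import math

def direction_angle_deg(seg_a, seg_b):
    """Smallest angle (0 to 90 degrees) between the lines through two segments."""
    (ax0, ay0), (ax1, ay1) = seg_a
    (bx0, by0), (bx1, by1) = seg_b
    ang_a = math.atan2(ay1 - ay0, ax1 - ax0)
    ang_b = math.atan2(by1 - by0, bx1 - bx0)
    diff = abs(ang_a - ang_b) % math.pi      # ignore segment orientation
    return math.degrees(min(diff, math.pi - diff))

def can_combine_lines(seg_a, seg_b, same_component, max_angle_deg=10.0):
    """Both conditions from the slide: same connected area and angle below 10 degrees."""
    return same_component and direction_angle_deg(seg_a, seg_b) < max_angle_deg

left = ((0, 0), (10, 0))
right = ((12, 0), (20, 1))       # about 7.1 degrees off horizontal
upright = ((12, 0), (12, 10))    # perpendicular to `left`
print(can_combine_lines(left, right, same_component=True))     # True
print(can_combine_lines(left, upright, same_component=True))   # False
```

The same angle test could be applied to the tangent directions at the end points of two arcs, together with a check that their fitted centers and radii agree.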
23 Experimental Results
- Dataset
- 200 chart images, including 2D and 3D bar charts, 2D and 3D pie charts, and 2D line charts
- Multi-level ground truth information is available for performance evaluation, including vector-level information on straight lines and circular and elliptic arcs
24 Experimental Results
- Evaluation criteria
- s: the overlapping segment between an extracted vector vd and the corresponding vector vg in the ground truth
- Coverage(s, vi): the length of s divided by the length of vi, where vi is either vd or vg
- Correct: if both Coverage(s, vd) and Coverage(s, vg) are at least 90%
- Broken: if Coverage(s, vd) is at least 90% but Coverage(s, vg) is not
- Wrong: if both Coverage(s, vd) and Coverage(s, vg) are below 90%
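The criteria can be written as a small decision function. This sketch assumes the 90% threshold applies as "at least 90%" and takes the three lengths as given. The function names and the label for the case the slide does not name are hypothetical.

```python
def coverage(overlap_len, vector_len):
    """Length of the overlapping segment s divided by the length of a vector."""
    return overlap_len / vector_len

def judge(overlap_len, detected_len, truth_len, threshold=0.9):
    """Label one extracted vector against its ground-truth vector."""
    cov_d = coverage(overlap_len, detected_len)   # Coverage(s, vd)
    cov_g = coverage(overlap_len, truth_len)      # Coverage(s, vg)
    if cov_d >= threshold and cov_g >= threshold:
        return "correct"
    if cov_d >= threshold:
        return "broken"        # vd matches, but covers only part of vg
    if cov_g < threshold:
        return "wrong"
    return "unlabelled"        # cov_d below, cov_g above: not named on the slide

print(judge(95, 100, 100))   # correct
print(judge(40, 42, 100))    # broken
print(judge(10, 100, 100))   # wrong
```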
25 Experimental Results
27 Conclusion
- A method for obtaining vector information from scientific chart images is introduced.
- The method is based on construction of DSCC and ellipse fitting.
- The resulting vectors are to be used to construct graphical symbols for further recognition and interpretation purposes.
28 Thank you!