Data Mining, Information Theory and Image Interpretation

About This Presentation

Title:

Data Mining, Information Theory and Image Interpretation

Description:

Uncertainty in ZIP Code when City, State or a digit is known ... Information from S is more effective for determining a ZIP Code ...

Slides: 35

Provided by: wenjan

Learn more at: https://cedar.buffalo.edu

Transcript and Presenter's Notes

1
Data Mining, Information Theory and Image
Interpretation

Sargur N. Srihari
Center of Excellence for Document Analysis and
Recognition
and
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260
USA

2
Data Mining

Search for Valuable Information in Large Volumes
of Data
Knowledge Discovery in Databases (KDD)
Discovery of Hidden Knowledge, Unexpected
Patterns and new rules from Large Databases

3
Information Theory

Definitions of Information
Communication Theory
Entropy (Shannon-Weaver)
Stochastic Uncertainty
Bits
Information Science
Data of Value in Decision Making

4
Image Interpretation

Use of knowledge in assigning meaning to an image
Pattern Recognition using Knowledge
Processing Atoms (Physical) as Bits (Information)

5
Address Interpretation Model
6
Typical American Address
Address Directory Size: 139 million records
7
Assignment Strategies
Typical street address
Database query
Address encoding
Results
Word Recognizer selects (after lexicon expansion)
Delivery point 142213557
8
Australian Address
Delivery Point ID: 66568882
Postal Directory Size: 9.4 million records
9
Canadian Address
Postal code: H1X 3B3
Postal Directory Size: 12.7 million records
10
United Kingdom Address
Postcode: TN23 1EU (unique postcode)
Delivery Point Suffix: 1A (default)
Address Directory Size: 26 million records
11
Motivation for Information Theoretic Study

Understand information interaction in postal
address fields to overcome uncertainty in fields
Compare the efficiency of assignment strategies
Rank processing priority for determining a
component value
Select most effective component to help recover
an ambiguous component

12
Address Fields in a US Postal Address

Address fields

Sargur N. Srihari
f1 city name
f2 state abbr.
f3 5-digit ZIP Code
f4 4-digit ZIP+4 add-on
f5 primary number
f6 street name
f7 secondary designator abbr.
f8 secondary number

Delivery point 142282583

13
Probability Distribution of Street Name Lexicon Size (f6)
14
Number of Address Records for Different Countries
15
Definitions

A component c is an address field fi, a portion of fi (e.g., a digit), or a combination of components.
1. Entropy H(x): information provided by component x (assuming a uniform distribution over its values)
$$H(x) = \log_2 |x| \text{ bits}$$
where |x| is the number of distinct values of x.
2. Conditional entropy Hx(y): uncertainty of component y when component x is known
$$H_x(y) = -\sum_i \sum_j p_{ij} \log_2 \frac{p_{ij}}{\sum_k p_{ik}}$$
where xi is a value of component x, yj is a value of component y, and pij = p(xi, yj) is their joint probability.
3. Redundancy of component x to y
$$R_x(y) = \frac{H(x) + H(y) - H(x, y)}{H(y)}, \quad 0 \le R_x(y) \le 1$$
A higher value of Rx(y) indicates that more of the information in y is shared by x.
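The three measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes probabilities are estimated as relative frequencies over a list of records, and the function names and toy values are hypothetical. When every value of x is equally frequent, `entropy` reduces to log2 of the number of distinct values, matching definition 1.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """Hx(y) = -sum_ij p_ij * log2(p_ij / p_i), estimated from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))      # counts of (x_i, y_j) pairs
    marginal_x = Counter(xs)          # counts of x_i alone
    # p_ij / p_i simplifies to count(x_i, y_j) / count(x_i)
    return -sum((c / n) * math.log2(c / marginal_x[x])
                for (x, _), c in joint.items())

def redundancy(xs, ys):
    """Rx(y) = (H(x) + H(y) - H(x, y)) / H(y); undefined when H(y) is 0."""
    h_xy = entropy(list(zip(xs, ys)))
    return (entropy(xs) + entropy(ys) - h_xy) / entropy(ys)

# Hypothetical toy components: x has 2 values, y has 3 values
xs = ["a", "a", "b", "b"]
ys = ["1", "2", "3", "3"]
print(entropy(xs), conditional_entropy(xs, ys), redundancy(xs, ys))
```

On this toy data, knowing x leaves half a bit of uncertainty in y, and two thirds of the information in y is shared by x.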

16
Example of Information Measure
Value sets
pa10 = 1/5, pae = 2/5, etc.
Address records
Information measure
17
Measure of Information from National City State
File, D1 (July 1997)

Measure
H(x), where x is any combination of f1, f2, and f3i (the i-th digit of f3)
Hx(f3), where x is any combination of f1, f2, and f3i

18
Measure of Information from Delivery Point Files,
D2 (July 1997)

Measure
H(x), where x is any combination of f3, f4, f5, f6, f7, f8, and f9
Hx(f4), where x is f3 with any combination of f3 f9

19
Measure of Information from D
Uncertainty in ZIP Code when City, State or a
digit is known
Uncertainty in component

To determine f3 (5-digit ZIP) from f1, f2 and
f3i
- City name reduces uncertainty the most

20
Propagation of Uncertainty for Assignment
Strategies
21
Ranking Processing Priority for Confirming ZIP
Code
f1: City name, f2: State abbreviation, f3: ZIP Code
Processing flow: city, 5th, 4th, 3rd, state (ordinals denote ZIP Code digits)
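A ranking of this kind can be sketched by computing the conditional entropy Hx(f3) for each candidate component x and sorting in ascending order, so that the component leaving the least uncertainty about the ZIP Code comes first. The sketch below is an illustration under stated assumptions: the four-record directory is a hypothetical toy sample, every record is treated as equally likely, and the helper names are invented. The resulting order on toy data is not expected to match the flow reported on the slide.

```python
import math
from collections import Counter

def conditional_entropy(xs, ys):
    """Hx(y) in bits, estimated from paired samples with equal record weights."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    marginal_x = Counter(xs)
    return -sum((c / n) * math.log2(c / marginal_x[x])
                for (x, _), c in joint.items())

# Hypothetical toy directory of (city, state, ZIP Code) records
RECORDS = [
    ("Buffalo", "NY", "14260"),
    ("Buffalo", "NY", "14214"),
    ("Albany", "NY", "12203"),
    ("Austin", "TX", "78701"),
]

def rank_components(records):
    """Return (component, Hx(ZIP)) pairs, lowest remaining uncertainty first."""
    zips = [z for _, _, z in records]
    components = {
        "city": [c for c, _, _ in records],
        "state": [s for _, s, _ in records],
    }
    for i in range(5):
        components[f"digit {i + 1}"] = [z[i] for z in zips]
    scores = {name: conditional_entropy(vals, zips)
              for name, vals in components.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])

for name, h in rank_components(RECORDS):
    print(f"{name}: {h:.3f} bits")
```

In this toy sample the city leaves less uncertainty about the ZIP Code than the state does, which is the kind of comparison the ranking relies on.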
22
Modeling Processing Cost

For component y
Location rate l(y), 0 ≤ l(y) ≤ 1
Recognition rate r(y), 0 ≤ r(y) ≤ 1
Processing speed s(y) in msec
Existence rate e(y), 0 ≤ e(y) ≤ 1
Patron rate p(y), 0 ≤ p(y) ≤ 1
Lexicon size of y, given x: $|y_x| = 2^{H(x,y) - H(x)}$
Cost of processing component y given component x
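The lexicon-size relation can be checked with a one-line calculation. The sketch below covers only that relation, since the cost formula itself is not reproduced in this transcript. The function name and the example entropies are hypothetical.

```python
def effective_lexicon_size(h_xy: float, h_x: float) -> float:
    """Lexicon size of y given x: 2 ** (H(x, y) - H(x)), entropies in bits."""
    return 2.0 ** (h_xy - h_x)

# Hypothetical values: H(x, y) = 12 bits and H(x) = 9 bits leave 3 bits of
# uncertainty in y, i.e. an effective lexicon of 8 candidate values.
print(effective_lexicon_size(12.0, 9.0))
```

A smaller effective lexicon means fewer candidates for the recognizer to discriminate between once x is known.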

23
Example Cost Table
24
Ranking Processing Priority for Confirming ZIP Code Based on Cost
Processing flow based on cost: 2nd, city, 5th, 4th, 3rd, 1st
Processing flow based on Hx(y): city, 5th, 4th, 3rd, state
25
Recovery of 1st ZIP-Code Digit, f31, from State
Abbr. (f2) and Other ZIP-Code Digits (f32-f35)

Usage: If recognition of a component (e.g., f31) fails, it is most likely to be recovered from the known component that has the largest redundancy with it (here, f2).
There are 62 state abbreviations. For 60 of them, the 1st ZIP-Code digit is unique.
For NY and TX, there are two valid 1st ZIP-Code digits.
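The recovery step described above amounts to a lookup from state abbreviation to the set of valid first ZIP-Code digits. The following is a minimal sketch with a handful of illustrative (state, ZIP Code) samples. A real system would build the table from the full directory, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative samples only; NY and TX each appear with two first digits
DIRECTORY = [
    ("NY", "14260"),
    ("NY", "00501"),
    ("CA", "90210"),
    ("TX", "75001"),
    ("TX", "88510"),
]

def first_digit_candidates(directory):
    """Map each state abbreviation to the set of first ZIP digits seen for it."""
    table = defaultdict(set)
    for state, zip_code in directory:
        table[state].add(zip_code[0])
    return table

def recover_first_digit(state, table):
    """Return the first digit if the state determines it uniquely, else None."""
    candidates = table.get(state, set())
    return next(iter(candidates)) if len(candidates) == 1 else None

table = first_digit_candidates(DIRECTORY)
print(recover_first_digit("CA", table))   # unique first digit
print(recover_first_digit("NY", table))   # ambiguous: two candidates
```

When the state is ambiguous, as for NY and TX, the candidate set still narrows the recognizer's choice from ten digits to two.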

26
Measure of Information from Mail Stream, S

Eighteen sets of mail pieces, each from one mail processing site
We measure:
Information provided: H(f2), H(f3i)
Uncertainty of f3: Hf2(f3), Hf3i(f3)
Each set is measured separately
Results are reported as the average over these sets

27
Comparison of ZIP-Code Uncertainty from D and S
28
Comparison of Results from D and S

ZIP-Code uncertainty
from S
Information from S is more effective for
determining a ZIP Code
The most effective processing flow for determining f3 from f3i and f2
(consistent between S and D) is
f2 - f35 - f34 - f33 - f32 - f31

29
UK Address Interpretation: Field Recognition and Database Query

Fields of interest
Locality
Post town
County
Outward postcode
Target
Outward postcode
Control flow
Based on data mining

30
UK Address Interpretation: Last Line Parsing Resolution
31
Discussion (Reliability of Information)

When selecting an effective processing flow for address interpretation, the prediction is most accurate when the information is representative of the current processing situation
Using unreliable information to determine a candidate value may cause errors.
A processing flow chosen from unreliable information is less effective.

32
Reliability of information

Measure of information from D
Does not reflect the current processing situation
Fully covers all valid values
Measure of information from S
Assumes that the site-specific preceding history represents the current processing situation
Mail distribution could be season-specific
Should consider the coverage of valid samples
Should consider information bias if the valid samples come from the address interpretation (AI) engine

33
Complexity of collecting mail information (S)

Information from mail streams should be collected automatically, and only high-confidence information should be collected
Address interpretation is not ideal
Some error cases would be collected
Address interpretation may always reject certain patterns of mail pieces, resulting in biased collected information

34
Conclusion

Information content of postal addresses can be
measured
The efficiency of assignment strategies can be
compared
Redundancy of two components can be measured
An uncertain component has higher probability of
recovery when another component with larger
redundancy is known
Information measure can suggest most effective
processing flow
Information Theory is an effective tool for Data
Mining