Title: Data Mining, Information Theory and Image Interpretation
1Data Mining, Information Theory and Image
Interpretation
- Sargur N. Srihari
- Center of Excellence for Document Analysis and Recognition
- and
- Department of Computer Science and Engineering
- State University of New York at Buffalo
- Buffalo, NY 14260
- USA
2Data Mining
- Search for Valuable Information in Large Volumes of Data
- Knowledge Discovery in Databases (KDD)
- Discovery of Hidden Knowledge, Unexpected Patterns and New Rules from Large Databases
3Information Theory
- Definitions of Information
- Communication Theory
- Entropy (Shannon-Weaver)
- Stochastic Uncertainty
- Bits
- Information Science
- Data of Value in Decision Making
4Image Interpretation
- Use of knowledge in assigning meaning to an image
- Pattern Recognition using Knowledge
- Processing Atoms (Physical) as Bits (Information)
5Address Interpretation Model
6Typical American Address
Address Directory Size: 139 million records
7Assignment Strategies
Typical street address
Database query
Address encoding
Results
Word Recognizer selects (after lexicon expansion)
Delivery point 142213557
8Australian Address
Delivery Point ID: 66568882
Postal Directory Size: 9.4 million records
9Canadian Address
Postal code: H1X 3B3
Postal Directory Size: 12.7 million records
10United Kingdom Address
Postcode: TN23 1EU (unique postcode)
Delivery Point Suffix: 1A (default)
Address Directory Size: 26 million records
11Motivation for Information Theoretic Study
- Understand information interaction in postal address fields to overcome uncertainty in fields
- Compare the efficiency of assignment strategies
- Rank processing priority for determining a component value
- Select the most effective component to help recover an ambiguous component
12Address Fields in a US Postal Address
Sargur N. Srihari
f6 street name
f7 secondary designator abbr.
f5 primary number
f8 secondary number
f2 state abbr.
f3 5-digit ZIP Code
f4 4-digit ZIP4 add-on
f1 city name
13Probability Distribution of Street Name Lexicon Size (f6)
14Number of Address Records for Different Countries
15Definitions
- A component c is an address field $f_i$, a portion of $f_i$ (e.g., a digit), or a combination of components.
- 1. Entropy $H(x)$: information provided by component x (assuming uniform distribution)
- $H(x) = \log_2 |x|$ bits, where $|x|$ is the number of values of x
- 2. Conditional entropy $H_x(y)$: uncertainty of component y when component x is known
- $H_x(y) = -\sum_i \sum_j p_{ij} \log_2 \frac{p_{ij}}{p(x_i)}$
- where $x_i$ is a value of component x, $y_j$ is a value of component y, and $p_{ij} = p(x_i, y_j)$ is their joint probability
- 3. Redundancy of component x to y
- $R_x(y) = (H(x) + H(y) - H(x, y)) / H(y)$
- $0 \le R_x(y) \le 1$
- A higher value of $R_x(y)$ indicates that more of the information in y is shared by x.
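The three measures defined on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that every address record is equally likely and estimates probabilities from record counts, so for a component whose values are equally frequent the entropy reduces to log2 of the number of values. The function names and the toy records are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())


def conditional_entropy(xs, ys):
    """H_x(y) = H(x, y) - H(x): uncertainty left in y once x is known."""
    return entropy(list(zip(xs, ys))) - entropy(xs)


def redundancy(xs, ys):
    """R_x(y) = (H(x) + H(y) - H(x, y)) / H(y), a value between 0 and 1."""
    h_y = entropy(ys)
    if h_y == 0:
        return 0.0  # y has a single value, so there is nothing to share
    return (entropy(xs) + h_y - entropy(list(zip(xs, ys)))) / h_y


# Toy records (made up): x fully determines y in the first pair of columns,
# and x is independent of y in the second pair.
dependent_x, dependent_y = ["NY", "NY", "CA", "CA"], ["1", "1", "9", "9"]
independent_x, independent_y = ["a", "a", "b", "b"], ["0", "1", "0", "1"]

print(redundancy(dependent_x, dependent_y))      # y is fully shared by x
print(redundancy(independent_x, independent_y))  # y is not shared by x at all
```

When x determines y the redundancy is 1 and the conditional entropy is 0; when the two components are independent the redundancy is 0 and knowing x leaves the full uncertainty of y.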
16Example of Information Measure
Value sets
pa10 1/5, pae 2/5, etc.
Address records
Information measure
17Measure of Information from National City State
File, D1 (July 1997)
- Measure
- H(x), where x is any combination of f1, f2, and f3i (f3i is the i-th digit of f3)
- Hx(f3), where x is any combination of f1, f2, and f3i
18Measure of Information from Delivery Point Files,
D2 (July 1997)
- Measure
- H(x), where x is any combination of f3, f4, f5, f6, f7, f8, and f9
- Hx(f4), where x is f3 with any combination of f3-f9
19Measure of Information from D
Uncertainty in ZIP Code when City, State or a
digit is known
Uncertainty in component
- To determine f3 (5-digit ZIP) from f1, f2 and f3i
- City name reduces uncertainty the most
20Propagation of Uncertainty for Assignment
Strategies
21Ranking Processing Priority for Confirming ZIP
Code
f1: City name, f2: State abbreviation, f3: ZIP Code
Processing flow: city, 5th, 4th, 3rd, state
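One way such a processing order could be derived is a greedy ranking: at each step, pick the component that leaves the least uncertainty in the target once it is added to the components already known. The sketch below is an assumption about the procedure, not something the slides spell out; the record layout, the field names, and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def cond_entropy(records, known, target):
    """Uncertainty of `target` when all fields listed in `known` are given."""
    xs = [tuple(r[k] for k in known) for r in records]
    xy = [(x, r[target]) for x, r in zip(xs, records)]
    return entropy(xy) - entropy(xs)


def rank_components(records, candidates, target):
    """Greedily order candidates by how much each reduces target uncertainty."""
    order, remaining = [], list(candidates)
    while remaining:
        best = min(
            remaining,
            key=lambda c: cond_entropy(records, order + [c], target),
        )
        order.append(best)
        remaining.remove(best)
    return order


# Toy directory (made up): city pins down the ZIP value, state only halves it.
toy_records = [
    {"city": "A", "state": "S1", "zip": "1"},
    {"city": "B", "state": "S1", "zip": "2"},
    {"city": "C", "state": "S2", "zip": "3"},
    {"city": "D", "state": "S2", "zip": "4"},
]
print(rank_components(toy_records, ["state", "city"], "zip"))
```

On this toy directory the city is ranked ahead of the state, because knowing the city leaves 0 bits of uncertainty in the ZIP value while knowing the state leaves 1 bit.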
22Modeling Processing Cost
- For component y
- Location rate $l(y)$, $0 \le l(y) \le 1$
- Recognition rate $r(y)$, $0 \le r(y) \le 1$
- Processing speed $s(y)$ in msec
- Existence rate $e(y)$, $0 \le e(y) \le 1$
- Patron rate $p(y)$, $0 \le p(y) \le 1$
- Lexicon size of y, given x: $|y_x| = 2^{H(x,y) - H(x)}$
- Cost of processing component y given component x
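The lexicon-size term of the cost model can be illustrated on its own. Under the uniform-distribution assumption used in the definitions, H(x) is log2 of the number of distinct values of x and H(x, y) is log2 of the number of distinct (x, y) pairs, so 2^(H(x,y) - H(x)) is the average number of y candidates per value of x. The sketch below shows only that term; the cost formula itself is not reproduced here, and the helper names and toy data are hypothetical.

```python
from math import log2


def uniform_entropy(values):
    """Entropy in bits when every distinct value is equally likely."""
    return log2(len(set(values)))


def lexicon_size(xs, ys):
    """Average number of y candidates once x is known: 2 ** (H(x, y) - H(x))."""
    h_xy = uniform_entropy(list(zip(xs, ys)))
    h_x = uniform_entropy(xs)
    return 2 ** (h_xy - h_x)


# Toy directory (made up): ZIP value "Z1" has 2 street names, "Z2" has 4.
zips = ["Z1", "Z1", "Z2", "Z2", "Z2", "Z2"]
streets = ["MAIN", "ELM", "MAIN", "OAK", "PINE", "LAKE"]

print(lexicon_size(zips, streets))  # about 3 street names per ZIP value
```

A smaller lexicon means fewer candidates for the word recognizer to separate, which is why this term belongs in a per-component processing cost.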
23Example Cost Table
24Ranking Processing Priority for Confirming ZIP Code Based on Cost
Processing flow based on cost: 2nd, city, 5th, 4th, 3rd, 1st
Processing flow based on Hx(y): city, 5th, 4th, 3rd, state
25Recovery of 1st ZIP-Code Digit, f31, from State
Abbr. (f2) and Other ZIP-Code Digits (f32-f35)
- Usage: if recognition of a component (e.g., f31) fails, it is most likely to be recovered from the known component with the largest redundancy (here f2).
- There are 62 state abbreviations. For 60 of them, the 1st ZIP digit is unique.
- For NY and TX, there are two valid 1st ZIP-Code digits.
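The recovery step can be sketched as a directory lookup: collect every first digit that is valid for the known state, and accept the result only when exactly one candidate remains. When the state alone is ambiguous, the other four digits can narrow the candidates. The directory below is a small toy list used for illustration, not the actual postal file, and the function name is hypothetical.

```python
def recover_first_digit(directory, state, rest=None):
    """Return the 1st ZIP digit if the known fields leave one candidate.

    directory: iterable of (state_abbr, zip5) pairs
    state:     the recognized state abbreviation (f2)
    rest:      optional string holding ZIP digits 2 to 5 (f32-f35)
    """
    candidates = {
        zip_code[0]
        for st, zip_code in directory
        if st == state and (rest is None or zip_code[1:] == rest)
    }
    return candidates.pop() if len(candidates) == 1 else None


# Toy directory for illustration only.
toy_directory = [
    ("CA", "90001"),
    ("NY", "10001"),
    ("NY", "00501"),
    ("TX", "75001"),
    ("TX", "88510"),
]

print(recover_first_digit(toy_directory, "CA"))          # single candidate
print(recover_first_digit(toy_directory, "NY"))          # ambiguous: None
print(recover_first_digit(toy_directory, "NY", "0001"))  # resolved by digits
```

Returning `None` for an ambiguous case keeps the caller from committing to a digit that the known components do not actually determine.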
26Measure of Information from Mail Stream, S
- Eighteen sets of mail pieces, each from a mail processing site
- We measure
- Information provided: H(f2), H(f3i)
- Uncertainty of f3: Hf2(f3), Hf3i(f3)
- Each set is measured separately
- The results are reported as the average over these sets
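The per-site measurement followed by averaging might look like the sketch below: the uncertainty of the ZIP Code given the state is computed separately for each mail-stream set, and the per-set values are then averaged. It assumes each set is a list of (state, ZIP) pairs read from mail pieces; the two toy sets stand in for the eighteen sites and are made up.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def conditional_entropy(pairs):
    """Uncertainty of the second item of each pair when the first is known."""
    return entropy(pairs) - entropy([x for x, _ in pairs])


def mean_conditional_entropy(sets):
    """Measure each set separately, then average the per-set results."""
    per_set = [conditional_entropy(s) for s in sets]
    return sum(per_set) / len(per_set)


# Two toy mail-stream sets (made up), one per processing site.
site_a = [("S1", "Z1"), ("S1", "Z2")]  # state known, ZIP still 1 bit uncertain
site_b = [("S1", "Z1"), ("S2", "Z2")]  # state determines ZIP
print(mean_conditional_entropy([site_a, site_b]))
```

Measuring each set on its own keeps a large site from dominating the estimate, at the price of noisier values for small sets.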
27Comparison of ZIP-Code Uncertainty from D and S
28Comparison of Results from D and S
- ZIP-Code uncertainty
- from S
- Information from S is more effective for determining a ZIP Code
- The most effective processing flow of using f3i and f2 to determine f3 is (consistent between S and D)
- f2 - f35 - f34 - f33 - f32 - f31
29UK Address Interpretation: Field Recognition, Database Query
- Fields of interest
- Locality
- Post town
- County
- Outward postcode
- Target
- Outward postcode
- Control flow
- Based on data mining
30UK Address Interpretation: Last Line Parsing Resolution
31Discussion (Reliability of information)
- When selecting an effective processing flow for address interpretation, the prediction is accurate when the information is representative of the current processing situation
- Use of unreliable information for determining a candidate value may cause errors.
- A processing flow chosen using unreliable information is less effective.
32Reliability of information
- Measure of information from D
- Not reflecting the current processing situation
- Full coverage of all valid values
- Measure of information from S
- Assuming that site-specific preceding history represents the current processing situation
- Mail distribution could be season-specific
- Should consider the coverage of valid samples
- Should consider the information bias if valid samples are from AI engine
33Complexity of collecting mail information (S)
- Information from mail streams should be collected automatically, and only high-confidence information should be collected
- Address interpretation is not ideal
- Some error cases would be collected
- Address interpretation may always reject certain patterns of mail pieces, resulting in biased collected information
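The collection rule and the bias it can introduce can be made concrete with a small sketch. It assumes each interpretation result carries a confidence score and a coarse pattern label; both fields, the 0.9 threshold, and the toy results are hypothetical. Comparing acceptance rates per pattern is one simple way to notice that a pattern is being rejected systematically.

```python
from collections import Counter


def collect_high_confidence(results, threshold=0.9):
    """Keep only results whose confidence reaches the threshold."""
    return [r for r in results if r["confidence"] >= threshold]


def acceptance_rate_by_pattern(results, threshold=0.9):
    """Fraction of results kept for each pattern; low values hint at bias."""
    seen = Counter(r["pattern"] for r in results)
    kept = Counter(
        r["pattern"] for r in results if r["confidence"] >= threshold
    )
    return {pattern: kept[pattern] / n for pattern, n in seen.items()}


# Toy interpretation results (made up).
toy_results = [
    {"pattern": "street", "confidence": 0.95},
    {"pattern": "street", "confidence": 0.97},
    {"pattern": "po_box", "confidence": 0.40},
    {"pattern": "po_box", "confidence": 0.50},
]

print(len(collect_high_confidence(toy_results)))
print(acceptance_rate_by_pattern(toy_results))
```

In this toy run every "po_box" piece is rejected, so statistics gathered from the accepted pieces would say nothing about that pattern.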
34Conclusion
- Information content of postal addresses can be measured
- The efficiency of assignment strategies can be compared
- Redundancy of two components can be measured
- An uncertain component has a higher probability of recovery when another component with larger redundancy is known
- Information measures can suggest the most effective processing flow
- Information Theory is an effective tool for Data Mining