Title: Overview
1Overview
Market Leader
Intelligent Capture & Exchange Solutions
2Information comes in many forms
- Structured Content
- Information is predictable
- Location of information is predictable
- Examples
- Waybill
- Traffic Citations
- Tax Forms
- Mail Order Forms
- Applications
- Insurance Claims
3Information comes in many forms
- Semi-Structured Content
- Information is predictable
- Location of information is NOT predictable
- Examples
- Accounts Payable
- Accounts Receivable
- Transportation
- Bills of Lading
- Medical Billing
4Information comes in many forms
- Unstructured Content
- Information is NOT predictable
- Location of information is NOT predictable
- Examples
- Mortgage Folders
- Medical Records
- Email Classification
- Digital Mailroom
- Litigation Support
5Where Did Kofax Classification / Separation
Originate?
The technology was funded by In-Q-Tel, the venture
capital group owned by the CIA.
6Enabling the automation of Document
Classification Processes
- Processing millions of captured foreign documents
- Automating the categorization of content to
expedite linguistic activities
- Connecting to an internal content management
solution
7Kofax Transformation - Advanced Document
Separation
- Automatically identify document type and
individual document boundaries (start/end) within
a batch of multiple documents
- Goal: Perform separation/recognition just as if
physical separator sheets were inserted between
each document
- Uses multiple classification and separation
approaches in a waterfall
8KTM Advanced Document Separation Process
KTM Advanced Document Separation: Typical Process Flow
[Process flow diagram. Stages shown: Scan/Import,
Separate, Classify, Extraction, Document Review,
Data Validation, Release]
9Vector Space Machines Under the Hood
Warning: The following slides may require pocket
protectors.
10Classification
- CLASSIFICATION METHODS
- Each classification method can be used
independently or in combination.
- Mark Registration
- Text Registration Image
- Image
- Advanced Text
- Manual Rules
Template Technology
Self-Learning Document Classification
NEW: Powered by INDICIUS
NEW: Used to augment Advanced Text
11Classification Waterfall
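[Flowchart: each document is tried against Template,
Image, Advanced Text, and Rules classification in
turn. At each stage, if the document type is set
(Yes), classification ends with a Document Type; if
not (No), the document moves to the next stage. A
document that no stage can classify goes to
Document Review / Completion.]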
12Learn-By-Example Approach
Advanced Text and Separation
Subject experts provide example documents
text
Learning Algorithm
Engine
Attributes
- Training and Model Development Phase
- Requires example documents correctly placed in
each category
- Generates a small-footprint model used by the
run-time engine
- No rules to write!
- Run-time Operation
- Uses the model to apply patterns learned in the
training phase to new, incoming documents
- Analyzes document attributes and determines the
appropriate category
- Provides an associated confidence score for each
result
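The two phases above can be illustrated with a toy example. This is a minimal sketch, not the product's algorithm: it assumes a simple bag-of-words model with cosine similarity standing in for the real learning engine, and the names `train`, `classify`, and `EXAMPLES` are hypothetical. The example texts are short snippets adapted from the sample news excerpts in this deck.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Training phase: build one term-count vector per category."""
    model = defaultdict(Counter)
    for category, text in examples:
        model[category].update(tokenize(text))
    return dict(model)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(model, text):
    """Run-time phase: return (best_category, confidence) for new text."""
    doc = Counter(tokenize(text))
    scores = {cat: cosine(doc, vec) for cat, vec in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical training set: (category, example text) pairs.
EXAMPLES = [
    ("Sports", "Ray Durham hit a leadoff homer and Brett Tomko pitched four innings"),
    ("Entertainment", "the hit show The Apprentice has made a huge difference "
                      "on Thursday nights for NBC"),
]

model = train(EXAMPLES)
category, confidence = classify(model, "pitching five solid innings in a victory")
```

No rules are written by hand here: the only input is labeled examples, and the confidence returned with each result is what a later stage could compare against a threshold.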
13Advanced Text Training Phase Model Builder
Advanced Text Classification and Separation is
trained on a directory of files that contains
sample documents, placed in page order and in
the correct categories.
HUD - Page 1
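As an illustration of what such a training directory might look like to a program, here is a minimal sketch. The layout `root/<category>/<document>/<page>.txt` and the function name `load_training_set` are assumptions made for this example; the deck does not specify the actual folder structure or file types.

```python
from pathlib import Path


def load_training_set(root):
    """Return a list of (category, [page_text, ...]) samples found under root.

    Assumed (hypothetical) layout: root/<category>/<document>/<page>.txt,
    with page files named so that sorting them gives the page order.
    """
    samples = []
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for doc_dir in sorted(p for p in category_dir.iterdir() if p.is_dir()):
            pages = [f.read_text(encoding="utf-8")
                     for f in sorted(doc_dir.glob("*.txt"))]
            if pages:
                samples.append((category_dir.name, pages))
    return samples
```

Keeping pages in order matters because separation training needs to know which page is the first, middle, or last page of each sample document.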
14Training on Text Patterns Learning Algorithm
PHOENIX (AP) - Ray Durham hit a leadoff homer and
Brett Tomko pitched four innings, helping the San
Francisco Giants beat the Milwaukee Brewers 9-2
Tuesday. Milwaukee center fielder Scott
Podsednik, returned to the lineup and doubled
twice....
SARASOTA, Fla. (AP) - Jose Acevedo made a strong
bid for a spot in Cincinnati's rotation, pitching
five solid innings Tuesday night in a 5-0 victory
over the Boston Red Sox. Adam Dunn returned to
the starting lineup and went 1-for-2 with a
sacrifice fly....
Sports
NEW YORK (AP) - A spinoff of its hit cartoon
"Dora the Explorer" and a comedy that stars Julia
Roberts' niece are among nine new programs
ordered by Nickelodeon for the network's next
season. "Go, Diego, Go" will feature Dora's
rough-and-tumble....
NEW YORK (AP) - If Regis Philbin once saved ABC,
Donald Trump has certain bragging rights at NBC.
In two months, the hit show "The Apprentice" has
made a huge difference on Thursday nights for
NBC, an evening the network....
Entertainment
15Using Spatial Vectors for Text Classification
Sports
Simple Vector Approach
Tech.
Ent.
Mohomine Approach: Space Vector Machine algorithm
confidence (sports category):
0.3 pitch, 0.4 inning, 0.7 lineup, 0.1 hit
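One way to read the weights on this slide is as coefficients of a linear score, where each term that appears in a document contributes its weight to the sports confidence. That reading is an assumption, and the sketch below is only an illustration of it; `SPORTS_WEIGHTS` and `sports_score` are hypothetical names.

```python
# Weights copied from the slide. Treating them as coefficients of a
# linear score is an assumption made for illustration.
SPORTS_WEIGHTS = {"pitch": 0.3, "inning": 0.4, "lineup": 0.7, "hit": 0.1}


def sports_score(text, weights=SPORTS_WEIGHTS):
    """Sum the weights of the listed terms that occur in the text."""
    words = set(text.lower().split())
    return sum(weight for term, weight in weights.items() if term in words)
```

Under this reading, a text containing "lineup" and "hit" scores 0.7 + 0.1, while a text with none of the four terms scores 0.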
16Separation
- Strict Rule Separation
- For example, the solution creates a new
document every time a page of type X is seen
- Advanced Separation
- Uses probabilities to infer the most likely
document structure from the page classifications
- Rule-based Separation
- Implemented through scripting (Ascent Capture
platform only, introduced in version 4)
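The strict rule described above ("a new document every time a page of type X is seen") is simple enough to sketch directly. This is an illustration only; `separate_strict` and its arguments are hypothetical names, and page types are assumed to be already known for each page.

```python
def separate_strict(page_types, start_type):
    """Split a batch into documents, starting a new one at each start_type page.

    page_types: list of per-page type labels in scan order.
    Returns a list of documents, each a list of page indexes.
    """
    documents = []
    for index, page_type in enumerate(page_types):
        # Open a new document on a start page, or for the very first page.
        if page_type == start_type or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents
```

For a batch typed `["X", "A", "A", "X", "B"]` with start type `"X"`, this yields two documents: pages 0-2 and pages 3-4. The rule cannot recover from a misclassified start page, which is the gap the probabilistic approach is meant to close.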
17Advanced Text Separation Methodology
- 1st Pass: Determine if given page is the first,
middle, or last page of a known form (each page
may receive multiple answers)
- 2nd Pass: Individual page assignments are used
to identify most likely form order and separation
points within document (using context of pages
before/after each page)
[Figure: pages 1-9 with their classifier results
(First, Middle, and Last pages of Form X and
Form Y, First and Last pages of Form Z, and one
unresolved "?" page) and the most likely grouping:
Form X, Form Y, Form Z]
19Thank You!
Kofax Confidential
20Automatic Document ID and Indexing
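[Figure: page labels with scores, in slide order:
S 90, E 90, S 65, M 70, E 85, S 72, E 80,
S 85, E 50, S 55, M 65, E 70, S 70, E 75,
E 22, M 15, S 12, M 10, E 65, S 12, E 30.
Resulting page sequence: S, E, S, M, E, S, E]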
21Automatic Document ID and Indexing
[Figure: Page Identification, Document Separation,
and Index stages applied to the page sequence
S, E, S, M, E, S, E]
23Classification Waterfall Technique
Using multiple classification engines
- Performance is optimized by attempting the fastest
classification techniques first and accepting
results only if very confident
- Mohomine text classification is used as the
catch-all method: very accurate with the widest
reach, but dependent on full-page OCR
[Figure: pages 1-8 classified by successive engines
with per-page costs of 1 ms, 20 ms, 200 ms, and
1000 ms. Results shown: First Form X, First Form Z,
First Form Y, Last Form X, Last Form Z, Last Form Y,
Middle Form X, Middle Form Z]
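The fastest-first strategy on this slide can be sketched as a loop over classifiers that stops at the first confident answer. This is a minimal sketch under stated assumptions: the classifier interface (a function returning a type and a confidence), the 0.9 threshold, and the stub classifiers are all hypothetical and stand in for the real engines.

```python
def classify_waterfall(page, stages, threshold=0.9):
    """Try classifiers in order and stop at the first confident result.

    stages: list of (name, classifier) pairs ordered fastest first; each
    classifier takes a page and returns (doc_type, confidence).
    Returns (doc_type, stage_name), or (None, "review") if no stage
    produces a result at or above the threshold.
    """
    for name, classifier in stages:
        doc_type, confidence = classifier(page)
        if doc_type is not None and confidence >= threshold:
            return doc_type, name
    return None, "review"


# Hypothetical stand-ins for a fast template matcher and a slower,
# OCR-dependent text classifier.
def template_stub(page):
    return ("Form X", 0.95) if "FORM X" in page else (None, 0.0)


def text_stub(page):
    return ("Form Y", 0.92) if "invoice" in page.lower() else ("Form Y", 0.4)


STAGES = [("template", template_stub), ("advanced_text", text_stub)]
```

With these stubs, a page containing "FORM X" is settled by the cheap template stage and never reaches the text stage, while a page no stage is confident about falls through to manual review.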
24How do we actually build a model?
Business
Dictionary
NEW YORK (Reuters) - Former WorldCom Inc. finance
chief Scott Sullivan, who has become the star
witness against Bernard Ebbers, admitted on
Wednesday to a history of lies, saying he had
deceived shareholders, analysts and the board
while his staff undertook an $11 billion
accounting fraud. Sharply questioned by the lead
attorney for Ebbers, the one-time chief executive
officer
SAN JOSE, Calif. (AP) -- One week after firing
its top executive, Hewlett-Packard Co. reported
quarterly earnings that were essentially flat,
and the interim chief executive acknowledged,
"There is work to be done." For the three
months ended Jan. 31, HP reported a profit of
$943 million, or 32 cents per share, only 0.7
percent higher than the $936 million, or 30 cents
per share, it earned in the first fiscal
quarter
Sports
PARIS (AP) -- Still hungry to race but wary he is
not in the best shape, Lance Armstrong wants to
take his Tour de France record to even mightier
heights: He will try for a seventh straight title
this summer. Armstrong had left open the
possibility he wouldn't compete this year in
cycling's showcase event to pursue other races.
But in an announcement Wednesday on the Web site
of his Discovery Channel team the Tour's only
six-time winner
Saying this was a "sad, regrettable day,"
Commissioner Gary Bettman announced today that
the National Hockey League was canceling the
season because negotiators had failed to come to
an agreement with the players' union on salary
caps. With his announcement, the N.H.L. becomes
the first major pro sports league in North
America to lose an entire season to a labor
dispute
Technology
SAN FRANCISCO, Feb. 15 - Late in the summer of
1973, two young scientists in the nascent field
of computer networks hunkered down in a
conference room of the Cabana Hyatt Hotel in Palo
Alto, Calif., a clean but bland stopping place
for salesmen and the parents of students at
nearby Stanford University. Their goal was to
thrash out a way to make different, isolated
computer networks talk to each other.
A new battery-powered Etch A Sketch will rely on
digital electronics for a speedy interpretation
of each knob twist. It is designed, its makers
say, to transmit data along a wire plugged into a
television set that will display every line and
detail in real time, with accompanying sounds and
optional color. It will cost $20, twice the price
of the traditional Etch A Sketch. "I think the
kids are becoming more advanced in
25The Problem Document Separation
- Separation of unstructured documents is a
significant expense for a high-volume capture
system
- Typical structured recognition technologies are
not applicable
- Manual insertion of separator sheets is the
primary solution today
- 50% of document preparation labor is spent sorting
documents and inserting separator pages
Where does one document stop and the next begin?
Here?
Here?
Here?
26How Document Separation Works
Page: mC Result
Page 1: First Form X (97)
Page 2: Middle Form X (92)
Page 3: Last Form X (95)
Page 4: First Form Y (84)
Page 5: Last Form Y (95)
FSM Constraints
- A First page must be followed by Middle or
Last of same type
- After a Last page must come a First
- Custom Business Rules
Best Path Analysis
Form X
Form Y
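The combination of per-page results, FSM constraints, and best path analysis can be sketched as a Viterbi-style search. This is an illustrative sketch, not the product's implementation: scoring a path by the sum of page confidences, allowing Middle to be followed by Middle or Last of the same type, and the names `allowed` and `best_path` are all assumptions. Each page may carry several candidate labels, as in the first pass of the separation methodology.

```python
def allowed(prev, cur):
    """FSM constraints from the slide, plus an assumed rule for Middle."""
    (p_pos, p_form), (c_pos, c_form) = prev, cur
    if p_pos in ("First", "Middle"):
        # Must continue the same form with a Middle or Last page.
        return c_pos in ("Middle", "Last") and c_form == p_form
    return c_pos == "First"  # after a Last page must come a First


def best_path(pages):
    """Find the highest-scoring label sequence that obeys the constraints.

    pages: non-empty list with one dict per page, mapping
    (position, form) -> confidence.
    Returns the chosen labels, or None if no legal sequence exists.
    """
    # best[label] = (total score, path) for legal paths ending in label.
    best = {lab: (score, [lab])
            for lab, score in pages[0].items() if lab[0] == "First"}
    for candidates in pages[1:]:
        step = {}
        for cur, score in candidates.items():
            options = [(total + score, path + [cur])
                       for prev, (total, path) in best.items()
                       if allowed(prev, cur)]
            if options:
                step[cur] = max(options, key=lambda o: o[0])
        best = step
    finals = [v for lab, v in best.items() if lab[0] == "Last"]
    return max(finals, key=lambda o: o[0])[1] if finals else None


# Top results from the slide, plus two hypothetical competing candidates.
PAGES = [
    {("First", "X"): 97},
    {("Middle", "X"): 92, ("First", "Y"): 30},
    {("Last", "X"): 95},
    {("First", "Y"): 84, ("Middle", "X"): 88},
    {("Last", "Y"): 95},
]
```

On page 4 the hypothetical "Middle Form X" candidate scores higher than "First Form Y", but the constraints rule it out because page 3 ended Form X, so the search still returns the Form X, Form Y grouping.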
27Customer Success Story
- Residential mortgage processing, 12 million
images/month
- Each customer folder: 100 pages, 60-80 doc types
- Before automatic document separation
- 60 people doing document separation and
preparation
- 16 people to review (QC) a customer folder
- 8.25 minutes per folder to review
- With automatic document separation
- 10 people doing document separation and
preparation
- 3 people to review (exceeded goal to reduce staff
to 8)
- 2 minutes per folder to review
- Exceeded processing goal targets at each step
- $420,000 annual savings in labor
- $100,000 annual savings in separator sheet
consumables
28Capabilities Overview
- Classification
- Content (text)
- Layout (topography)
- Combination of the above
- Extraction
- Rules (format, database)
- Learn-by-example
- Templates
- Any document
- Structured (inc. legacy forms)
- Semi-structured, e.g. invoices
- Unstructured documents, e.g. correspondence
29Key Applications/Use Cases
- Invoices (AP automation)
- Speed up AP process and reduce manual keying
- Pre-configured solution already available
- Sales Orders
- Improve sales order process and accuracy
- Mailroom applications/Workflow automation
- Automatic classification and routing
- Indexing (< 3 fields) for archive
- No need for pre-sorting
- Image to archive automation
- Automatic classification and indexing for storage
in DM system
- Better, quicker, more accurate batch capture
- Business process automation
- Full data capture
- Straight-through processing
- Semi-structured and unstructured documents
- Invoices and credit notes
- Correspondence
- Reports
30Kofax KTM Differentiators
- Integrated with Kofax Capture (offering HA, xx)
- Learn-by-example extraction
- Learn-by-example classification
- Continuous supervised learning in production
- Single product for all document types that is
upgradable
31Kofax Solution Strengths
- Market leader
- Out-of-the-box
- Unlimited import options
- VRS integrated with QC Later
- Better Recognition/Multiple Document Types
- API Integrated export
- Secure handling of images and data
- Out-of-the-box reports
- You won't outgrow it