An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis

1
An Experimental Workflow Development Platform for
Historical Document Digitisation and Analysis
  • Clemens Neudecker, KB National Library of the
    Netherlands
  • Research Meeting, Amsterdam 3 November 2011

2
Background
  • > 20 individual software components for specific
    challenges
  • Prototyping new algorithms, improving commercial
    solutions
  • Different frameworks (C, C, Java, etc.),
    platforms (Win/Linux)
  • Extensible with 3rd party applications
  • → IMPACT Interoperability Framework (IIF)

3
Main requirements
  • Behavioural
  • Minimize integration effort
  • Minimize deployment effort
  • Maximize usability
  • Maximize scalability
  • Functional
  • Modular
  • Transparent
  • Expandable
  • Open source
  • Platform independent

4
Architecture
  • Java
  • Web Services
  • Apache
  • Taverna
  • Open source, available at
    https://github.com/impactcentre
  • Free Hackathon 14/15 November, University of
    Manchester
  • http://impact-mygrid-taverna-hackathon.wikispaces.com/

5
Integration
  • Only requirement: command line executable
  • Generic command line wrapper: produces web service
  • Web service exposed as workflow module
    with documentation
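
The wrapping idea on this slide can be sketched in a few lines. This is a minimal illustration, not the IIF wrapper itself: the function name `wrap_command`, the `{text}` placeholder syntax, and the response fields are assumptions made for this example, and the step of mounting the handler behind a SOAP or REST endpoint is left out.

```python
import subprocess
import sys
from typing import Callable, Dict, List


def wrap_command(template: List[str]) -> Callable[[Dict[str, str]], Dict[str, object]]:
    """Turn a command line template into a request handler (hypothetical helper).

    Placeholders such as "{text}" in the template are filled from the
    request parameters before the executable is run.
    """
    def handler(params: Dict[str, str]) -> Dict[str, object]:
        # Build the argument vector; no shell is involved, so parameters
        # are passed to the tool as plain arguments.
        argv = [part.format(**params) for part in template]
        done = subprocess.run(argv, capture_output=True, text=True, timeout=60)
        return {
            "exit_code": done.returncode,
            "stdout": done.stdout,
            "stderr": done.stderr,
        }
    return handler


# Stand-in "tool": the Python interpreter upper-casing its first argument.
upper_service = wrap_command(
    [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", "{text}"]
)
```

Calling `upper_service({"text": "ocr"})` runs the stand-in tool and returns its exit code and output as a dictionary, which a web service layer could then serialise for the caller.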

6
Generic Web Service Wrapper
  • → Easy integration: developers can focus on
    their application and worry less about
    integration, resulting in higher quality
    software components

7
Workflows
  • OCR workflow: data pipeline
  • Building blocks: processing modules (nodes)
  • Integration: interaction between nodes
    (mashups)
  • → Collaboration with
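
The view of a workflow as a data pipeline built from processing nodes can be sketched as plain function composition. The node names (`binarise`, `segment`, `recognise`) and the dictionary passed between them are placeholders chosen for this example; real nodes would call the wrapped web services.

```python
from functools import reduce
from typing import Callable, Dict, List

# A node takes the data produced so far and returns an extended copy.
Node = Callable[[Dict[str, object]], Dict[str, object]]


def make_node(name: str) -> Node:
    """Create a placeholder node that only records that it ran."""
    def node(data: Dict[str, object]) -> Dict[str, object]:
        steps = list(data.get("steps", []))
        steps.append(name)
        return {**data, "steps": steps}
    return node


# Hypothetical building blocks of an OCR workflow.
binarise = make_node("binarise")
segment = make_node("segment")
recognise = make_node("recognise")


def run_workflow(nodes: List[Node], data: Dict[str, object]) -> Dict[str, object]:
    """Feed the output of each node into the next one, in order."""
    return reduce(lambda current, node: node(current), nodes, data)
```

Running `run_workflow([binarise, segment, recognise], {"image": "page.tif"})` passes the input through the three nodes in sequence; swapping or adding a node changes the workflow without touching the others.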

9
Workflow management
  • Web 2.0 style registry: myExperiment
  • Local client: Taverna Workbench
  • Web client: project website

10
Local client: Taverna Workbench
  • Background: biosciences
  • Developed and maintained by myGrid, UK
  • Open source
  • GUI for design and execution of web services
    workflows

11
Remote client: Portal
  • SOAP/REST API
  • Remote execution of web services workflows

12
Community
  • Web 2.0 style workflow registry
  • Community of experts
  • Sharing of resources
  • Knowledge exchange
  • A central meeting point for users and
    researchers

13
Scalability
  • Central ESB proxy manages multiple service
    copies
  • Process parallelization, load distribution,
    failover, security
  • Served > 2M requests
  • Throughput improvements of 94% with every
    additional instance
  • Tested on Dutch Cloud (Enlighten Your Research)
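
Load distribution and failover across service copies can be illustrated with a small round-robin dispatcher. This is a sketch of the general technique only, not the ESB configuration used in the project; the class name `ServiceProxy` and the modelling of service copies as plain callables are assumptions for this example.

```python
import itertools
from typing import Callable, Optional, Sequence


class ServiceProxy:
    """Round-robin dispatcher with failover over identical service copies."""

    def __init__(self, instances: Sequence[Callable[[str], str]]) -> None:
        if not instances:
            raise ValueError("at least one service copy is required")
        self._instances = list(instances)
        # Endless cycle over instance indices: 0, 1, ..., n-1, 0, 1, ...
        self._order = itertools.cycle(range(len(self._instances)))

    def call(self, request: str) -> str:
        last_error: Optional[Exception] = None
        # Try each copy at most once per request, starting at the next in turn.
        for _ in range(len(self._instances)):
            instance = self._instances[next(self._order)]
            try:
                return instance(request)
            except Exception as error:  # failover: move on to the next copy
                last_error = error
        raise RuntimeError("all service copies failed") from last_error
```

With one failing and one healthy copy registered, every request still succeeds, because the proxy moves on to the next copy whenever a call raises an exception.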

14
Dataset
  • Access to a representative and annotated dataset
    of significant size, with metadata, ground truth
    and search facilities

15
Evaluation features
  • Text-based comparison of result with ground
    truth, using the Levenshtein distance method
  • Layout-based comparison of result with ground
    truth, using the Page Analysis and Ground Truth
    Elements (PAGE) framework
  • Example
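
The text-based comparison can be illustrated with a standard dynamic-programming implementation of the Levenshtein distance. The `character_accuracy` helper is an assumption added for this example (one common way to turn the distance into a score); the slides do not say how the platform normalises the result.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(result: str, ground_truth: str) -> float:
    """Hypothetical score: 1 minus the distance relative to the ground truth length."""
    if not ground_truth:
        return 1.0 if not result else 0.0
    distance = levenshtein(result, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))
```

For instance, an OCR result "Amsterdarn" against the ground truth "Amsterdam" has distance 2 (one substitution and one deletion).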

16
The PAGE Format Framework
  • Two-level architecture: root structure and
    task-specific sub-formats
  • Separate XML Schema definitions
  • Format identification via namespaces
  • Mapping of dependencies, process chains and
    alternative processing steps
  • Linking via IDs
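
Format identification via namespaces and linking via IDs can be sketched with the standard library XML parser. The namespace URIs, element names and the `ref` attribute below are invented for this example and are not the actual PAGE schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical namespace URIs, one per task-specific sub-format.
SUB_FORMATS = {
    "http://example.org/page/layout": "layout",
    "http://example.org/page/binarisation": "binarisation",
}

# Hypothetical document: region r2 points at region r1 through its ID.
SAMPLE = """<Root xmlns="http://example.org/page/layout">
  <Region id="r1" type="paragraph"/>
  <Region id="r2" type="caption" ref="r1"/>
</Root>"""


def identify_format(xml_text: str) -> str:
    """Name the sub-format from the namespace of the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree writes qualified tags as "{namespace}localname".
    namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""
    return SUB_FORMATS.get(namespace, "unknown")


def resolve_links(xml_text: str) -> Dict[str, str]:
    """Map each referencing element's ID to the type of the element it points to."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    return {
        el.get("id"): by_id[el.get("ref")].get("type")
        for el in root.iter()
        if el.get("ref") in by_id
    }
```

Here `identify_format(SAMPLE)` selects the sub-format from the root namespace alone, and `resolve_links(SAMPLE)` follows the ID reference from the caption region to the paragraph it belongs to.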

17
Ground-Truthing Tools
  • Aletheia
  • FineReader
  • PAGE Exporter
  • GT Validator
  • GT Normalizer

18
Profile: Full Text Recognition
  • Evaluation for general text recognition

Measure Weights      Region Type Weights
Merge                Text
Allowable Merge      Image
Split                Graphic
Allowable Split      Chart
Miss                 Table
Partial Miss         Separator
Misclassification    Maths
False Detection      Noise
19
Measures: Segmentation Errors
[Figure: ground truth vs. segmentation result for caption and paragraph regions, illustrating merge, split, miss, partial miss and misclassification errors]
20
OCR Accuracy
21
Outlook
  • Online service for testing/evaluation
  • Specification Guidelines
  • Extending the scope: workflows for linguistic
    analysis (CLARIN), workflows for preservation
    (SCAPE)
  • Even better scalability: Map/Reduce
  • Supported by a community of developers and
    practitioners

22
  • Anyway, the thing about progress is that it
    always seems greater than it really is.
  • Ludwig Wittgenstein, Philosophical
    Investigations (quoting Johann Nestroy)