Scruffy Validation - PowerPoint PPT Presentation

1
Scruffy Validation
  • Alex Brown, Griffin Brown Digital Publishing Ltd.

2
Menu
  • Background
  • A Case Study using the NLM DTD
  • Limitations of neat grammars
  • Possible solutions
  • XPath-based validation
  • The user perspective
  • Seeing it work
  • Other scruffy validation problems

3
Background
  • Neat vs. Scruffy: a common opposition in
    computing
  • Neat: algorithmic, function-based, mathematical
  • Scruffy: heuristic, procedural, narrative

4
Neat Validation
  • DTDs, RELAX NG (for example): grammar-based
    schema languages
  • Syntax-directing, so suitable for editor
    implementations
  • In practice, models can be represented
    diagrammatically

5
Scruffy Validation
  • Includes many non-grammatical tests, shading into
    application-specific data testing (aka business
    rules)
  • Sees validation not as a boolean test but as a
    more nuanced report on an instance's state ("good
    enough")
  • Includes the concept of validation management from
    the user's perspective

6
A Simple Example
  • NCBI/NLM DTD family: http://dtd.nlm.nih.gov/publishing/
  • Journal Publishing application enjoying some
    traction
  • State of the art DTD

7
NCBI/NLM Publishing DTD

8
NCBI/NLM Publishing DTD

9
Data Rules
  • The <fpage> and <lpage> elements should have
    numeric content
  • If one exists, both must exist
  • The number contained by <lpage> must be equal to
    or greater than that contained by <fpage>

10
The DTD cannot enforce this
  • Content model is (#PCDATA | fpage | lpage | ...)*
  • Enforcement of the data rules is beyond the
    abilities of the schema language.

11
Solution 1: Programming
  • Either using a specialised XML processing language,
    or a standards-based approach (SAX)
  • Pseudocode:
  • When <fpage> is encountered, check its value and
    store it
  • When <lpage> is encountered, check its format and
    compare with the stored <fpage> value
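
The pseudocode above can be sketched in Python with the standard library's xml.sax module. This is only an illustration under stated assumptions: the names PageRangeHandler and check_pages, the fault wording, and the reading of "numeric" as ASCII digits are not from the presentation. The comparison is deferred until the parent element closes, so that a missing partner element can be reported as well.

```python
import re
import xml.sax

DIGITS = re.compile(r"[0-9]+")  # assumption: "numeric" means ASCII digits only


class PageRangeHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects page-range faults while streaming."""

    def __init__(self):
        super().__init__()
        self.faults = []      # fault messages
        self._stack = []      # one dict of page values per open parent element
        self._capture = None  # "fpage" or "lpage" while inside one
        self._text = []

    def startElement(self, name, attrs):
        if name in ("fpage", "lpage"):
            self._capture, self._text = name, []
        else:
            self._stack.append({})

    def characters(self, content):
        if self._capture:
            self._text.append(content)

    def endElement(self, name):
        if name in ("fpage", "lpage"):
            value = "".join(self._text).strip()
            self._capture = None
            if not DIGITS.fullmatch(value):
                self.faults.append(f"<{name}> is not numeric: {value!r}")
            if self._stack:
                self._stack[-1][name] = value  # store for the parent to compare
            return
        pages = self._stack.pop()
        first, last = pages.get("fpage"), pages.get("lpage")
        if (first is None) != (last is None):
            self.faults.append(f"<{name}> has only one of <fpage>/<lpage>")
        elif first is not None and DIGITS.fullmatch(first) and DIGITS.fullmatch(last):
            if int(last) < int(first):
                self.faults.append(f"<{name}>: lpage {last} is less than fpage {first}")


def check_pages(xml_text):
    """Stream the document once and return the list of fault messages."""
    handler = PageRangeHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.faults
```

Even for three small rules the handler has to manage its own state (a stack and a text buffer), which gives a flavour of the coding effort involved.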

12
Advantages of SAX-based validation
  • Powerful: the limits are those of the
    implementing system
  • Fast execution: stream-based parsing is fast
  • Memory efficient: no need to build an in-memory
    tree; may be the only route for certain applications

13
Disadvantages of SAX-based validation
  • Can be cumbersome to code
  • Therefore difficult to maintain: moves
    validation firmly into the realm of software
    development
  • Conceptually messy: it feels like there's a
    generalisation to be made (API?)

14
Solution 2: Schema Language
  • Schematron, XMLProbe/SILCN, CLiX, BI-ICS4J, XSLT
  • Generally a higher-level expression of rules, in
    XML! (well ...)
  • Generally, built around an in-memory tree
    representation of XML to be tested

15
Using XPath to express a constraint
  • //lpage[match-regexp("(\d)", .) and
    match-regexp("(\d)", preceding-sibling::fpage)]
    [not(number(.) >= number(preceding-sibling::fpage))]
  • N.B. XPath has here been extended (thus we have a
    route to the power of the implementing system)
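
For comparison, a tree-based version of the same constraint can be sketched in Python. It rests on two assumptions: the standard library's xml.etree.ElementTree is used, which supports only a subset of XPath (no preceding-sibling axis and no regular-expression extension), so the sibling comparison is written in ordinary Python; and the function name bad_lpages is hypothetical. Like the expression above, it returns the offending <lpage> nodes rather than a yes/no answer.

```python
import re
import xml.etree.ElementTree as ET

DIGITS = re.compile(r"[0-9]+")  # assumption: page numbers are ASCII digits


def bad_lpages(xml_text):
    """Return <lpage> elements whose number is below a preceding sibling <fpage>."""
    root = ET.fromstring(xml_text)  # builds the whole tree in memory
    bad = []
    for parent in root.iter():
        first = None
        for child in parent:
            if child.tag == "fpage" and first is None:
                # Keep the first <fpage> in document order, which is the node
                # XPath 1.0 number() would use from a node-set.
                first = child
            elif child.tag == "lpage" and first is not None:
                a = (first.text or "").strip()
                b = (child.text or "").strip()
                # Only compare when both values are numeric, as in the rule.
                if DIGITS.fullmatch(a) and DIGITS.fullmatch(b) and not int(b) >= int(a):
                    bad.append(child)
    return bad
```

The result is a list of nodes, which is what makes it possible to point at the exact place in the instance where the rule failed.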

16
Advantages of XPath-based validation
  • In harmony with the XML model; extensible (and needs
    to be!)
  • Visual tools available for expression testing
  • Maintainable by non-programmers

17
Disadvantages of XPath-based validation
  • Needs an in-memory tree
  • Not suitable for certain kinds of testing, where
    programming is required, and the result is more
    complex than a node list
  • N.B. where XPath is, XQuery will be

18
Fitting Scruffy Validation into a Process
  • Well-formed? (XML 1.0/1.1)
  • Neat validation (DTD, RELAX NG, W3C Schema)
  • Scruffy validation (Schematron, XSLT, XMLProbe)

19
Finding interesting nodes

20
The User Perspective: Reporting Data Faults
  • Users ideally want to know:
  • What is wrong
  • Why (or how much) it's wrong
  • Where it's wrong
  • How to fix it
  • Not good enough to say "this is a quality of
    implementation issue"

21
Zooming Out: Deviations from Neatness
  • Some defects may be acceptable
  • Correctness may ultimately only be decidable by a
    human
  • Different audiences may want different views onto
    data validation reports

22
A Comparison with Software Development
  • lint, Purify: augmenting syntax checking
  • For many users XML validation is stuck in the 80s
    (generally)
  • This is partly a tools issue, but XML can help

23
XML Can Help
  • XML validation tools should emit reports in XML
    (preferably in a well-defined language)
  • Constraints should have identity
  • Bad nodes should be located within the tested
    instance, and associated with the constraint they
    have violated
  • Formal statement: "this has gone wrong here"
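
A report of this kind might be produced as in the following Python sketch. Everything named here is an assumption for illustration: the <report> and <fault> element names, the constraint and location attributes, and the helper functions locate and report do not come from any particular tool. Each constraint has an identity (its key), and each bad node is located by a positional path into the tested instance.

```python
import xml.etree.ElementTree as ET


def locate(root):
    """Yield (element, path) pairs with positional paths such as /c[1]/lpage[1]."""
    def walk(elem, path):
        yield elem, path
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    yield from walk(root, f"/{root.tag}[1]")


def report(xml_text, constraints):
    """Build a hypothetical XML report.

    constraints maps a constraint id to (message, test), where test(elem)
    returns True when elem violates the constraint.
    """
    root = ET.fromstring(xml_text)
    out = ET.Element("report")
    for elem, path in locate(root):
        for cid, (message, test) in constraints.items():
            if test(elem):
                # One <fault> per bad node: which constraint, and where.
                fault = ET.SubElement(out, "fault", constraint=cid, location=path)
                fault.text = message
    return out
```

Calling ET.tostring(report(...), encoding="unicode") serialises the result, so the report is itself XML and can be processed with the same tools as the instance.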

24
"this has gone wrong here"
  • Enables filtering (and other transformation) of
    results
  • Enables annotation of source instance
  • Enables onward machine processing of all kinds
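
As a small illustration of filtering and onward processing, the Python sketch below assumes a report made of <fault> elements that carry constraint and location attributes. That format, and the function names filter_report and summarise, are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def filter_report(report_xml, keep_ids):
    """Return a new report containing only faults for the given constraint ids."""
    src = ET.fromstring(report_xml)
    out = ET.Element(src.tag)
    for fault in src.findall("fault"):
        if fault.get("constraint") in keep_ids:
            out.append(fault)
    return out


def summarise(report_xml):
    """Count faults per constraint id, e.g. for a management-level view."""
    root = ET.fromstring(report_xml)
    return Counter(fault.get("constraint") for fault in root.iter("fault"))
```

Different audiences can then be given different views of the same report: an editor might see only the faults for one constraint, while a production manager sees the counts.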

25
In Practice
  • An instance
  • A rule
  • Processing
  • Result (XML)
  • Result (styled)

26
Other scruffiness
  • Content-mandated structure, e.g.
  • URLs should be marked up:
    //*[not(name()='uri') and not(name()='ext-link')]
    /text()[match-regexp("\bwww.", .) or match-regexp("\bhttp", .)]
  • Conformance to non-XML specifications (e.g. ISO
    formats)
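
The URL check above can be approximated in Python with the standard library. This is a sketch under assumptions: the function name unmarked_urls is hypothetical, the dot after "www" is escaped here so that it matches only a literal dot, and ElementTree's text and tail attributes stand in for XPath text nodes.

```python
import re
import xml.etree.ElementTree as ET

URL_HINT = re.compile(r"\bwww\.|\bhttp")  # text that looks like the start of a URL
MARKUP = {"uri", "ext-link"}              # elements that legitimately hold URLs


def unmarked_urls(xml_text):
    """Return (parent tag, text) pairs for URL-like text outside uri/ext-link."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        if elem.tag in MARKUP:
            continue
        # The text nodes that are direct children of elem are its own
        # leading text plus the tail that follows each child element.
        texts = [elem.text] + [child.tail for child in elem]
        for text in texts:
            if text and URL_HINT.search(text):
                hits.append((elem.tag, text.strip()))
    return hits
```

A heuristic like this will produce some false positives (for example, prose that merely mentions "http"), which is one reason the result is better treated as a report for a human than as a pass/fail verdict.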

27
Other scruffiness (contd.)
  • Co-location tests (long-range grammar)
  • Lexical checks (on CDATA marked sections, entity
    use, PI format, comments, e.g.)
  • External resource availability / type (moving
    beyond validation?)

28
Other scruffiness (contd.)
  • Tables!
  • Constrained vocabulary checking
  • Weird stuff (no numeric character entities?)
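
A lexical check such as "no numeric character entities" has to look at the raw source text, because a parser resolves the references before any tree or event stream is seen. A minimal Python sketch follows; the function name numeric_references is hypothetical, and the pattern is deliberately naive: it would also flag matching text inside comments and CDATA sections, where it is not a reference at all.

```python
import re

# Decimal (&#233;) or hexadecimal (&#xE9;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:x[0-9A-Fa-f]+|[0-9]+);")


def numeric_references(source):
    """Return (line, column, reference) for each match in raw XML text.

    Line and column numbers are 1-based.
    """
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in NUMERIC_REF.finditer(line):
            found.append((lineno, match.start() + 1, match.group()))
    return found
```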

29
Conclusions
  • XML datasets have data quality problems beyond
    the scope of neat validation languages
  • There is scope to generalise beyond custom
    programming
  • Most XML processing pipelines need one!
  • Consider unchecked data faulty

30
Any Questions?