Scruffy Validation - PowerPoint PPT Presentation

Title: Scruffy Validation

1
Scruffy Validation

Alex Brown, Griffin Brown Digital Publishing Ltd.

2
Menu

Background
A Case Study using the NLM DTD
Limitations of neat grammars
Possible solutions
XPath-based validation
The user perspective
Seeing it work
Other scruffy validation problems

3
Background

Neat vs. Scruffy: a common opposition in computing
Neat: algorithmic, function-based, mathematical
Scruffy: heuristic, procedural, narrative

4
Neat Validation

DTDs, RELAX NG (for example): grammar-based schema languages
Syntax-directing, so suitable for editor implementations
In practice, models can be represented diagrammatically

5
Scruffy Validation

Includes many non-grammatical tests, shading into
application-specific data testing (aka business
rules)
Sees validation not as a boolean test but as a more nuanced report on an instance's state ("good enough")
Includes the concept of validation management from the user's perspective

6
A Simple Example

NCBI/NLM DTD family: http://dtd.nlm.nih.gov/publishing/
Journal Publishing application enjoying some traction
State-of-the-art DTD

7
NCBI/NLM Publishing DTD

8
NCBI/NLM Publishing DTD

9
Data Rules

The <fpage> and <lpage> elements should have numeric content
If one exists, both must exist
The number contained by <lpage> must be equal to or greater than that contained by <fpage>

10
The DTD cannot enforce this

Content model is (#PCDATA | fpage | lpage | ...)*
Enforcement of the data rules is beyond the abilities of the schema language.

11
Solution 1: Programming

Either using a specialised XML processing language, or a standards-based approach (SAX)
Pseudocode:
When <fpage> is encountered, check its value and store it
When <lpage> is encountered, check its format and compare with the stored <fpage> value
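
The pseudocode above could be sketched with Python's standard xml.sax module roughly as follows. This is an illustration only, not the code used in the talk: the class name, the fault messages and the choice to scope the "both must exist" rule to the immediate parent element are all assumptions.

```python
import re
import xml.sax

PAGE_ELEMENTS = ("fpage", "lpage")


class PageRangeHandler(xml.sax.ContentHandler):
    """Streams through a document and records fpage/lpage faults (hypothetical sketch)."""

    def __init__(self):
        super().__init__()
        self.faults = []
        self._stack = []      # one dict per open element: page values of its children
        self._current = None  # page element whose text is being collected
        self._text = []

    def startElement(self, name, attrs):
        if name in PAGE_ELEMENTS and self._current is None:
            self._current, self._text = name, []
        self._stack.append({})

    def characters(self, content):
        if self._current is not None:
            self._text.append(content)

    def endElement(self, name):
        pages = self._stack.pop()
        if name == self._current:
            # Check the format, then store the value on the parent's record.
            value = "".join(self._text).strip()
            self._current = None
            numeric = re.fullmatch(r"[0-9]+", value) is not None
            if not numeric:
                self.faults.append(f"<{name}> is not numeric: {value!r}")
            if self._stack:
                self._stack[-1][name] = int(value) if numeric else None
        elif len(pages) == 1:
            present = next(iter(pages))
            self.faults.append(f"<{name}> has <{present}> but not its partner")
        elif len(pages) == 2:
            first, last = pages["fpage"], pages["lpage"]
            if first is not None and last is not None and last < first:
                self.faults.append(f"<{name}>: lpage {last} is less than fpage {first}")


def validate_pages(xml_text):
    """Parse xml_text as a stream and return the list of fault messages."""
    handler = PageRangeHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.faults


for fault in validate_pages(
    "<ref><citation><fpage>20</fpage><lpage>10</lpage></citation></ref>"
):
    print(fault)
```

Even for three simple rules the handler has to manage its own state (a stack and a text buffer), which is the kind of bookkeeping that makes this approach cumbersome to code and maintain.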

12
Advantages of SAX-based validation

Powerful: the limits are those of the implementing system
Fast execution: stream-based parsing is fast
Memory efficient: no need to build an in-memory tree; may be the only route for certain applications

13
Disadvantages of SAX-based validation

Can be cumbersome to code
Therefore difficult to maintain: moves validation firmly into the realm of software development
Conceptually messy: it feels like there's a generalisation to be made (API?)

14
Solution 2: Schema Language

Schematron, XMLProbe/SILCN, CLiX, BI-ICS4J XSLT
Generally, higher-level expression of rules, in XML! (well ...)
Generally built around an in-memory tree representation of the XML to be tested

15
Using XPath to express a constraint

//lpage[match-regexp("(\d)", .) and match-regexp("(\d)", preceding-sibling::fpage)][not(number(.) >= number(preceding-sibling::fpage))]
N.B. XPath has here been extended (thus we have a
route to the power of the implementing system)
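
As a rough illustration of the tree-based approach, the same constraint can be evaluated against an in-memory tree in Python. The standard xml.etree.ElementTree module supports only a subset of XPath (no preceding-sibling axis and no extension functions), so this sketch emulates the predicate in ordinary code; a fuller XPath engine could take the expression directly. The function name is hypothetical, and treating the regular expression as a whole-string match is an assumption.

```python
import re
import xml.etree.ElementTree as ET

NUMERIC = re.compile(r"\d+")


def bad_lpages(root):
    """Return <lpage> elements that are numeric, follow a numeric <fpage>
    sibling, and are not greater than or equal to that <fpage>."""
    bad = []
    for parent in root.iter():
        fpage = None  # nearest preceding-sibling <fpage> seen so far
        for child in parent:
            if child.tag == "fpage":
                fpage = child
            elif child.tag == "lpage" and fpage is not None:
                first = (fpage.text or "").strip()
                last = (child.text or "").strip()
                both_numeric = NUMERIC.fullmatch(first) and NUMERIC.fullmatch(last)
                if both_numeric and not int(last) >= int(first):
                    bad.append(child)
    return bad


doc = ET.fromstring(
    "<refs>"
    "<citation><fpage>20</fpage><lpage>10</lpage></citation>"
    "<citation><fpage>3</fpage><lpage>3</lpage></citation>"
    "</refs>"
)
for node in bad_lpages(doc):
    print("bad lpage:", node.text)
```

The result is a node list, which is what makes it possible to point at the offending nodes in a later report.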

16
Advantages of XPath-based validation

In harmony with the XML model; extensible (and needs to be!)
Visual tools available for expression testing
Maintainable by non-programmers

17
Disadvantages of XPath-based validation

Needs an in-memory tree
Not suitable for certain kinds of testing, where programming is required and the result is more complex than a node list
N.B. where XPath is, XQuery will be

18
Fitting Scruffy Validation into a Process

Well-formed? XML 1.0/1.1
Neat validation: DTD, RELAX NG, W3C Schema
Scruffy validation: Schematron, XSLT, XMLProbe

19
Finding interesting nodes

20
The User Perspective: Reporting Data Faults

Users ideally want to know:
What is wrong
Why (or how much) it's wrong
Where it's wrong
How to fix it
Not good enough to say "this is a quality-of-implementation issue"

21
Zooming Out: Deviations from Neatness

Some defects may be acceptable
Correctness may ultimately only be decidable by a
human
Different audiences may want different views onto
data validation reports

22
A Comparison with Software Development

lint, purify: augmenting syntax checking
For many users, XML validation is stuck in the '80s (generally)
This is partly a tools issue, but XML can help

23
XML Can Help

XML validation tools should emit reports in XML (preferably in a well-defined language)
Constraints should have identity
Bad nodes should be located within the tested instance, and associated with the constraint they have violated
Formal statement: "this has gone wrong here"
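
A minimal sketch of such a report, in Python with the standard library, might look like this. The report vocabulary (report, fault, constraint, location) and the constraint id "page-order" are invented for illustration and are not taken from any of the tools named in this talk; the location is a simple positional path to the bad node.

```python
import xml.etree.ElementTree as ET


def node_path(root, target):
    """Build a positional path such as /refs[1]/citation[2]/lpage[1]."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parents.get(node)
        siblings = [node] if parent is None else [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{siblings.index(node) + 1}]")
        node = parent
    return "/" + "/".join(reversed(steps))


def build_report(root, findings):
    """findings: iterable of (constraint_id, message, element) tuples."""
    report = ET.Element("report")
    for constraint_id, message, element in findings:
        fault = ET.SubElement(
            report,
            "fault",
            constraint=constraint_id,           # which rule was violated
            location=node_path(root, element),  # where in the instance
        )
        fault.text = message                    # what has gone wrong
    return report


doc = ET.fromstring(
    "<refs><citation><fpage>20</fpage><lpage>10</lpage></citation></refs>"
)
bad_node = doc.find("./citation/lpage")
report = build_report(doc, [("page-order", "lpage is less than fpage", bad_node)])
print(ET.tostring(report, encoding="unicode"))
```

Each fault ties a constraint identity to a node in the tested instance, so the report says both what went wrong and where.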

24
"This has gone wrong here"

Enables filtering (and other transformation) of
results
Enables annotation of source instance
Enables onward machine processing of all kinds
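
To illustrate the filtering point: once the report is XML, narrowing it to the constraints one audience cares about takes only a few lines. The sketch below is in Python; the report format and the constraint ids are hypothetical, made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical report format: one <fault> per violated constraint.
REPORT = """<report>
  <fault constraint="page-order" location="/refs[1]/citation[1]/lpage[1]">lpage is less than fpage</fault>
  <fault constraint="unmarked-url" location="/refs[1]/citation[2]">URL not marked up</fault>
</report>"""


def filter_report(report, wanted):
    """Return a new <report> holding only faults for the wanted constraint ids."""
    filtered = ET.Element(report.tag)
    filtered.extend(
        [f for f in report.findall("fault") if f.get("constraint") in wanted]
    )
    return filtered


only_pages = filter_report(ET.fromstring(REPORT), {"page-order"})
print(ET.tostring(only_pages, encoding="unicode"))
```

The same report could equally be transformed with XSLT or used to annotate the source instance; the point is that identified constraints and located nodes make all of these routine.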

25
In Practice

An instance
A rule
Processing
Result (XML)
Result (styled)

26
Other scruffiness

Content-mandated structure, e.g. URLs should be marked up:
//*[not(name()='uri') and not(name()='ext-link')]/text()[match-regexp("\bwww.", .) or match-regexp("\bhttp", .)]
Conformance to non-XML specifications (e.g. ISO formats)
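
The URL rule above could be approximated in Python with the standard library as follows. This is a sketch under assumptions: the function name is hypothetical, the dot in the first pattern is escaped here (the slide leaves it unescaped), and ElementTree's text/tail model is used to stand in for XPath text() nodes.

```python
import re
import xml.etree.ElementTree as ET

URL_HINT = re.compile(r"\bwww\.|\bhttp")
URL_MARKUP = {"uri", "ext-link"}


def unmarked_urls(root):
    """Return (parent_tag, text) pairs for text nodes that look like URLs
    but whose parent element is neither <uri> nor <ext-link>."""
    hits = []
    for elem in root.iter():
        if elem.tag in URL_MARKUP:
            continue
        # Text nodes whose parent is elem: its leading text plus each child's tail.
        texts = [elem.text] + [child.tail for child in elem]
        for text in texts:
            if text and URL_HINT.search(text):
                hits.append((elem.tag, text.strip()))
    return hits


para = ET.fromstring(
    "<p>See www.example.org or <uri>http://example.org</uri>"
    " and also http://example.com.</p>"
)
for tag, text in unmarked_urls(para):
    print(f"<{tag}>: {text}")
```

A test like this is heuristic by nature: it flags text that looks like a URL, and a human may still need to decide whether each hit is a real fault.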

27
Other scruffiness (contd.)

Co-location tests (long-range grammar)
Lexical checks (on CDATA marked sections, entity
use, PI format, comments, e.g.)
External resource availability / type (moving
beyond validation?)

28
Other scruffiness (contd.)

Tables!
Constrained vocabulary checking
Weird stuff (no numeric character entities?)

29
Conclusions

XML datasets have data quality problems beyond
the scope of neat validation languages
There is scope to generalise beyond custom
programming
Most XML processing pipelines need one!
Consider unchecked data faulty

30
Any Questions?
