Title: Scruffy Validation
1 Scruffy Validation
- Alex Brown, Griffin Brown Digital Publishing Ltd.
2 Menu
- Background
- A Case Study using the NLM DTD
- Limitations of neat grammars
- Possible solutions
- XPath-based validation
- The user perspective
- Seeing it work
- Other scruffy validation problems
3 Background
- Neat vs. Scruffy: a common opposition in computing
- Neat: algorithmic, function-based, mathematical
- Scruffy: heuristic, procedural, narrative
4 Neat Validation
- DTDs, RELAX NG (for example): grammar-based schema languages
- Syntax-directing, so suitable for editor implementations
- In practice, models can be represented diagrammatically
5 Scruffy Validation
- Includes many non-grammatical tests, shading into application-specific data testing (aka business rules)
- Sees validation not as a boolean test but as a more nuanced report on an instance's state: "good enough"
- Includes the concept of validation management from the user perspective
6 A Simple Example
- NCBI/NLM DTD family: http://dtd.nlm.nih.gov/publishing/
- Journal Publishing application enjoying some traction
- State-of-the-art DTD
7 NCBI/NLM Publishing DTD
8 NCBI/NLM Publishing DTD
9 Data Rules
- The <fpage> and <lpage> elements should have numeric content
- If one exists, both must exist
- The number contained by <lpage> must be equal to or greater than that contained by <fpage>
10 The DTD cannot enforce this
- Content model is (#PCDATA | ... | fpage | lpage | ...)*
- Enforcement of the data rules is beyond the abilities of the schema language.
11 Solution 1: Programming
- Either using a specialised XML processing language, or a standards-based approach (SAX)
- Pseudocode:
- When <fpage> is encountered, check its value and store it
- When <lpage> is encountered, check its format and compare it with the stored <fpage> value
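The pseudocode above can be sketched with Python's standard xml.sax module. This is a minimal illustration, not the presenter's implementation: the class name PageRangeHandler, the function check_page_ranges and the wording of the fault messages are all assumptions. It covers only the numeric-content and ordering rules. The "if one exists, both must exist" rule is not checked, and the stored <fpage> value is not reset at the end of the enclosing element.

```python
import re
import xml.sax

NUMERIC = re.compile(r"[0-9]+")


class PageRangeHandler(xml.sax.ContentHandler):
    """Collects page-range faults while streaming through a document."""

    def __init__(self):
        super().__init__()
        self.faults = []      # human-readable fault messages
        self._current = None  # name of the page element being read, if any
        self._text = []       # character chunks seen inside that element
        self._fpage = None    # stored <fpage> value awaiting its <lpage>

    def startElement(self, name, attrs):
        if name in ("fpage", "lpage"):
            self._current = name
            self._text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._current is not None:
            self._text.append(content)

    def endElement(self, name):
        if name != self._current:
            return
        value = "".join(self._text).strip()
        self._current = None
        if not NUMERIC.fullmatch(value):
            self.faults.append(f"<{name}> is not numeric: {value!r}")
            value = None
        if name == "fpage":
            self._fpage = value  # store for comparison with <lpage>
        elif value is not None and self._fpage is not None:
            if int(value) < int(self._fpage):
                self.faults.append(
                    f"<lpage> {value} is less than <fpage> {self._fpage}"
                )
            self._fpage = None


def check_page_ranges(xml_text):
    """Parse a string of XML and return the list of page-range faults."""
    handler = PageRangeHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.faults
```

Even for this one rule the handler has to keep its own state across callbacks, which is the maintenance burden a stream-based approach tends to carry.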
12 Advantages of SAX-based validation
- Powerful: the limits are those of the implementing system
- Fast execution: stream-based parsing is fast
- Memory efficient: no need to build an in-memory tree; may be the only route for certain applications
13 Disadvantages of SAX-based validation
- Can be cumbersome to code
- Therefore difficult to maintain: moves validation firmly into the realm of software development
- Conceptually messy: it feels like there's a generalisation to be made (API?)
14 Solution 2: Schema Language
- Schematron, XMLProbe/SILCN, CLiX, BI-ICS4J XSLT
- Generally higher-level expression of rules in XML! (well ...)
- Generally built around an in-memory tree representation of the XML to be tested
15 Using XPath to express a constraint
- //lpage[match-regexp("(\d)",.) and match-regexp("(\d)",preceding-sibling::fpage)][not(number(.) >= number(preceding-sibling::fpage))]
- N.B. XPath has here been extended (thus we have a route to the power of the implementing system)
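For comparison with the streaming approach, the same constraint can be sketched against an in-memory tree. The standard library's xml.etree.ElementTree supports only a small XPath subset (no preceding-sibling axis and no extension functions), so this sketch walks the tree directly and does the regular-expression and numeric tests in Python. Those Python tests stand in for the extended match-regexp function and are an assumption, as is the function name bad_lpages. The sketch compares each <lpage> with the nearest preceding <fpage> sibling and requires the whole value to be digits.

```python
import re
import xml.etree.ElementTree as ET

NUMERIC = re.compile(r"[0-9]+")


def bad_lpages(root):
    """Return <lpage> elements whose value is lower than the preceding <fpage>."""
    bad = []
    for parent in root.iter():
        fpage = None  # most recent <fpage> sibling seen under this parent
        for child in parent:
            if child.tag == "fpage":
                fpage = child
            elif child.tag == "lpage" and fpage is not None:
                first = (fpage.text or "").strip()
                last = (child.text or "").strip()
                # Compare only when both values are numeric, as in the XPath.
                if NUMERIC.fullmatch(first) and NUMERIC.fullmatch(last):
                    if not int(last) >= int(first):
                        bad.append(child)
    return bad
```

The result is a list of offending nodes, which is the "node list" shape of answer that XPath-based tools return.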
16 Advantages of XPath-based validation
- In harmony with the XML model; extensible (and needs to be!)
- Visual tools available for expression testing
- Maintainable by non-programmers
17 Disadvantages of XPath-based validation
- Needs an in-memory tree
- Not suitable for certain kinds of testing, where programming is required and the result is more complex than a node list
- NB: where XPath is, XQuery will be
18 Fitting Scruffy Validation into a Process
- Well-formed? (XML 1.0/1.1)
- Neat validation (DTD, RELAX NG, W3C Schema)
- Scruffy validation (Schematron, XSLT, XMLProbe)
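The three stages can be sketched as a small pipeline in which each stage runs only if the previous one passes. This is an illustrative outline under stated assumptions. The function name validate and the shape of its return value are hypothetical. The standard library has no DTD or RELAX NG validator, so neat validation is passed in as a caller-supplied function rather than implemented here.

```python
import xml.etree.ElementTree as ET


def validate(xml_text, neat_check, scruffy_checks):
    """Run well-formedness, neat and scruffy checks in order.

    neat_check and each item of scruffy_checks take the parsed root element
    and return a list of fault messages (empty when nothing is wrong).
    """
    # Stage 1: well-formedness. Parsing fails on malformed input.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return {"stage": "well-formed", "faults": [str(err)]}

    # Stage 2: neat (grammar-based) validation, supplied by the caller.
    neat_faults = neat_check(root)
    if neat_faults:
        return {"stage": "neat", "faults": neat_faults}

    # Stage 3: scruffy validation. Run every check and gather all faults.
    faults = []
    for check in scruffy_checks:
        faults.extend(check(root))
    return {"stage": "scruffy", "faults": faults}
```

The scruffy stage collects every fault instead of stopping at the first, in keeping with the idea of validation as a report and not a boolean test.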
19 Finding interesting nodes
20 The User Perspective: Reporting Data Faults
- Users ideally want to know
- What is wrong
- Why (or how much) it's wrong
- Where it's wrong
- How to fix it
- Not good enough to say "this is a quality of implementation issue"
21 Zooming Out: Deviations from Neatness
- Some defects may be acceptable
- Correctness may ultimately only be decidable by a human
- Different audiences may want different views onto data validation reports
22 A Comparison with Software Development
- lint, Purify: augmenting syntax checking
- For many users XML validation is stuck in the 80s (generally)
- This is partly a tools issue, but XML can help
23 XML Can Help
- XML validation tools should emit reports in XML (preferably in a well-defined language)
- Constraints should have identity
- Bad nodes should be located within the tested instance, and associated with the constraint they have violated
- Formal statement: "this has gone wrong here"
24 "this has gone wrong here"
- Enables filtering (and other transformation) of results
- Enables annotation of the source instance
- Enables onward machine processing of all kinds
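A report of this kind might look like the following sketch, which records each fault as an XML element carrying the identity of the violated constraint and the location of the bad node. The element and attribute names (report, fault, constraint, location), the function names and the sample paths are all hypothetical. No particular report vocabulary is implied.

```python
import xml.etree.ElementTree as ET


def build_report(faults):
    """Build an XML report from (constraint_id, location, message) tuples."""
    report = ET.Element("report")
    for constraint_id, location, message in faults:
        fault = ET.SubElement(
            report, "fault", constraint=constraint_id, location=location
        )
        fault.text = message
    return report


def filter_report(report, constraint_id):
    """Return only the faults raised by one constraint."""
    return [f for f in report.findall("fault") if f.get("constraint") == constraint_id]


# Example: two faults from two different constraints (illustrative data).
report = build_report([
    ("page-range", "/article/back/ref[2]/lpage[1]", "lpage is less than fpage"),
    ("url-markup", "/article/body/p[4]", "URL is not marked up"),
])
print(ET.tostring(report, encoding="unicode"))
```

Because each fault names its constraint and its node, the same report can be filtered per audience, styled for reading, or used to annotate the source instance.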
25 In Practice
- An instance
- A rule
- Processing
- Result (XML)
- Result (styled)
26 Other scruffiness
- Content-mandated structure, e.g.
- URLs should be marked up: //*[not(name()='uri') and not(name()='ext-link')]/text()[match-regexp("\bwww.", .) or match-regexp("\bhttp", .)]
- Conformance to non-XML specifications (e.g. ISO formats)
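The URL mark-up test above can be approximated with the standard library alone. This sketch assumes ElementTree's model, in which text nodes appear as an element's text and as the tail of each child. The function name unmarked_urls is hypothetical, and the pattern escapes the dot after "www", which is a small tightening of the expression shown on the slide.

```python
import re
import xml.etree.ElementTree as ET

URL_LIKE = re.compile(r"\bwww\.|\bhttp")
LINK_ELEMENTS = {"uri", "ext-link"}


def unmarked_urls(root):
    """Return text fragments that look like URLs but sit outside link elements."""
    hits = []
    for element in root.iter():
        if element.tag in LINK_ELEMENTS:
            continue  # text inside <uri> or <ext-link> is already marked up
        # element.text is the text node before the first child.
        if element.text and URL_LIKE.search(element.text):
            hits.append(element.text.strip())
        # Each child's tail is a text node of this element that follows the child.
        for child in element:
            if child.tail and URL_LIKE.search(child.tail):
                hits.append(child.tail.strip())
    return hits
```

A check like this is heuristic: it reports candidates for a human or a later process to judge, not a definitive list of errors.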
27 Other scruffiness (contd.)
- Co-location tests (long-range grammar)
- Lexical checks (e.g. on CDATA marked sections, entity use, PI format, comments)
- External resource availability / type (moving beyond validation?)
28 Other scruffiness (contd.)
- Tables!
- Constrained vocabulary checking
- Weird stuff (no numeric character entities?)
29 Conclusions
- XML datasets have data quality problems beyond the scope of neat validation languages
- There is scope to generalise beyond custom programming
- Most XML processing pipelines need one!
- Consider unchecked data faulty
30 Any Questions?