Unified Validation Reporting in HTML Renderings

Gerrit Imsieke, le-tex publishing services GmbH

@gimsieke, @letexml

Multi-Step Conversion Pipelines

IDML synthesis from docx via XHTML


Multi-Step Conversion Pipelines


IDML → XML → EPUB conversion

Unanchored frames
Unexpected styles

Dangling figure captions
Skipped headings

Uncited literature
Typography

Missing metadata

epubcheck
broken links
CMYK JPEGs

Pipelines

  • XProc orchestration
  • XSLT transformations
  • RELAX NG validation
  • Schematron validation

transpect

XProc/XSLT Libraries
+ Methodology:

  • Cascaded configuration (global, per imprint, per book series, per work, etc.)
  • Consolidated HTML reports
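A minimal sketch of one way to read "cascaded configuration", assuming that each level is a plain dictionary of settings and that a more specific level overrides a more general one. The setting names (epub_version, report_language) and the function resolve_cascade are hypothetical; they are not taken from transpect, whose actual mechanism is not described on this slide.

```python
def resolve_cascade(levels):
    """Merge settings from the most general to the most specific level.

    `levels` is ordered global -> imprint -> series -> work; later
    (more specific) levels override earlier ones.
    """
    merged = {}
    for level in levels:
        merged.update(level)
    return merged


# Hypothetical settings for illustration only.
global_cfg = {"epub_version": "3.0", "report_language": "en"}
imprint_cfg = {"report_language": "de"}
series_cfg = {}
work_cfg = {"epub_version": "2.0"}

effective = resolve_cascade([global_cfg, imprint_cfg, series_cfg, work_cfg])
print(effective)
```

Settings that a specific level does not mention are inherited from the next more general level.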

Libraries

  • IDML → flat Hub XML
  • docx → flat Hub XML
  • XHTML → IDML
  • DocBook → docx
  • CSS parser
  • HTML → EPUB
  • Hub/DocBook → TEI
  • Hub/DocBook → JATS/BITS/ISOSTS
  • TEI → HTML
  • JATS/BITS/ISOSTS → HTML
  • xlsx → XHTML
  • EPUB → Hub XML
  • Image analysis/resizing

Error reporting

Error reporting classification

xmllint

epubcheck

Spellcheck

LanguageTool

oXygen

transpect
htmlreports

Sample report (docx→Hub→TEI→HTML→EPUB)

Unionsverlag Testbuch

Relating source locations to error messages

srcpaths in Hub XML

Prerequisite: @srcpath added at first conversion step
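The prerequisite can be illustrated with a small sketch, assuming that a srcpath is simply the source file URI plus a positional XPath to the element. The function name add_srcpaths and the sample document are made up for this illustration; real input would carry namespaces (for example w:p in docx), which this simplified version does not handle.

```python
import xml.etree.ElementTree as ET


def add_srcpaths(root, base_uri):
    """Annotate every element with a positional XPath pointing to itself."""

    def visit(elem, path):
        elem.set("srcpath", f"{base_uri}?xpath={path}")
        counts = {}  # position counter per element name among siblings
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            visit(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    visit(root, f"/{root.tag}[1]")
    return root


doc = ET.fromstring("<body><p><r/><r/></p><p/></body>")
add_srcpaths(doc, "file:/tmp/document.xml")
second_run = doc.find("p").findall("r")[1]
print(second_run.get("srcpath"))
# file:/tmp/document.xml?xpath=/body[1]/p[1]/r[2]
```

Because the attribute is written once, at the first step, every later step only has to carry it along.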

Relating source locations to error messages

srcpaths in Schematron

Including the closest @srcpath as <span class="srcpath"> in Schematron messages
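A rough sketch of the lookup, assuming that "closest" means the nearest ancestor-or-self element that still carries a srcpath attribute. In the pipeline this would happen inside Schematron/XSLT; the Python functions closest_srcpath and message_with_srcpath are hypothetical stand-ins that only show the logic.

```python
import xml.etree.ElementTree as ET
from html import escape


def closest_srcpath(root, target):
    """Return the srcpath of `target` or of its nearest ancestor that has one."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        if node.get("srcpath"):
            return node.get("srcpath")
        node = parent_of.get(node)  # None once the root has been passed
    return None


def message_with_srcpath(text, srcpath):
    """Prefix a message with the srcpath wrapped in a span."""
    return f'<span class="srcpath">{escape(srcpath)}</span> {escape(text)}'


doc = ET.fromstring(
    '<para srcpath="doc.xml?xpath=/p[3]"><phrase><emphasis/></phrase></para>'
)
target = doc.find("phrase/emphasis")  # has no srcpath of its own
found = closest_srcpath(doc, target)
print(message_with_srcpath("Unexpected style", found))
```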

Patching reports into HTML

srcpaths

Suffice it to say that it involves xsl:namespace-alias.

Schema validation

  • Jing patch that outputs an XPath for each validation error
  • Look up the closest @srcpath for the given XPath
  • Emit SVRL with srcpath spans:
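The last step might look roughly like the following sketch. It assumes that the srcpath has already been resolved and that the span is an XHTML element inside svrl:text; the function failed_assert and the sample values are hypothetical, and the result is not meant to be a schema-complete SVRL document.

```python
import xml.etree.ElementTree as ET

SVRL = "http://purl.oclc.org/dsdl/svrl"
XHTML = "http://www.w3.org/1999/xhtml"
ET.register_namespace("svrl", SVRL)
ET.register_namespace("html", XHTML)


def failed_assert(location, srcpath, message):
    """Build an svrl:failed-assert whose text starts with a srcpath span."""
    fa = ET.Element(f"{{{SVRL}}}failed-assert", {"location": location})
    text = ET.SubElement(fa, f"{{{SVRL}}}text")
    span = ET.SubElement(text, f"{{{XHTML}}}span", {"class": "srcpath"})
    span.text = srcpath
    span.tail = " " + message  # the message text follows the span
    return fa


fa = failed_assert(
    "/dbk:hub[1]/dbk:para[3]",          # XPath reported by the validator
    "doc.xml?xpath=/w:document[1]/w:body[1]/w:p[3]",
    "element para not allowed here",
)
print(ET.tostring(fa, encoding="unicode"))
```

With validation errors in the same shape as Schematron output, one reporting step can handle both.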

srcpaths

Drawbacks of @srcpath

  • XSLT developers have to take care that no srcpaths get lost in any conversion step
  • They clutter intermediate files (a p:delete will fix that)

Alternative to @srcpath:
Processing instructions

  • PIs may get lost during conversion, too
  • No rich (XML) markup within PIs: no links to additional info, no lists, no tables, …
  • Documents become even more cluttered

Benefit of hierarchical @srcpaths

as opposed to more compact versions based on CRC checksums, for example '75A16B0' instead of the full

'file:/C:/cygwin/home/gerrit/Unionsverlag/content/unionsverlag/standard/39000/docx/UV_STD_00000_39000_DOCX_TestBuch.docx.tmp/word/document.xml?xpath=/w:document[1]/w:body[1]/w:p[995]/w:r[13]'

Hierarchical srcpaths allow backtracking: if /w:document[1]/w:body[1]/w:p[995]/w:r[13] was unwrapped or discarded, try /w:document[1]/w:body[1]/w:p[995]
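The backtracking can be sketched as follows, assuming srcpaths of the form "URI?xpath=/step[n]/step[n]/…" with purely positional steps (no slashes inside predicates). The function backtrack_srcpath and the shortened file name doc.xml are illustrative only.

```python
def backtrack_srcpath(srcpath, known_srcpaths):
    """Return the longest prefix of `srcpath` that is in `known_srcpaths`.

    Drops the last XPath step until a surviving srcpath is found.
    Returns None if not even the root element's srcpath is known.
    """
    base, sep, xpath = srcpath.partition("?xpath=")
    steps = [step for step in xpath.split("/") if step]
    while steps:
        candidate = f"{base}{sep}/" + "/".join(steps)
        if candidate in known_srcpaths:
            return candidate
        steps.pop()  # give up the innermost step and retry with the parent
    return None


# srcpaths that are still present in the rendered HTML (shortened example)
surviving = {"doc.xml?xpath=/w:document[1]/w:body[1]/w:p[995]"}
hit = backtrack_srcpath(
    "doc.xml?xpath=/w:document[1]/w:body[1]/w:p[995]/w:r[13]", surviving
)
print(hit)
# doc.xml?xpath=/w:document[1]/w:body[1]/w:p[995]
```

A checksum such as '75A16B0' offers no comparable fallback, because nothing about the parent can be derived from it.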

Complaints / improvements

  • Noise (irrelevant/incomprehensible messages)
    ⇒ improve Schematron (matches, tests, and texts)
  • Incomprehensible categories (docx2hub, flat, evolved, tei, …)
    ⇒ <span class="category"> (before/after)
  • Unclear affordances
    ⇒ in-place messages
  • Unavailable content
  • Language switching in the report

Thank you