le-tex transpect

Conversion & Checking Framework

Gerrit Imsieke, le-tex publishing services
XML Prague, 2014-02-14

@gimsieke, @letexml

Book Publishing Workflows in 2014

  • less standardized than journal publishing or tech doc
  • predominant tools/formats: Word and InDesign (even in STM)
  • EPUB/KF8 output alongside print mandatory
  • XML output often required
  • poor EPUB & XML support in Word or InDesign
  • poor support for mapping Word manuscripts to InDesign
  • authors, copy editors, typesetters not trained in XML & data conversion
  • frequent requirement: need Word data from previous edition

Example: Conventional-wisdom InDesign ↔ XML workflow

  1. Create XML from Word (by conversion tool or offshore service vendor)
  2. Import XML
  3. Beautify layout
  4. Carry out author corrections
  5. Notice that layout & tagging have diverged
  6. Try to fix embedded tagging despite crappy editor, insufficient XML knowledge and lack of feedback from the app as to the XML’s appropriateness

#fail

  • We can’t fix the tools
  • We can’t make them use other tools
  • We don’t want to create plugins for the apps
    ⇒ compatibility / support / small vendor dependency / functionality / API longevity issues
Wrong life cannot be lived rightly.  ― Adorno

XML stack to the rescue

  • Their tools generate well-defined, long-term stable, XML-based container formats (OOXML a.k.a. .docx, IDML)
  • XSLT 2.0 is well suited to up-convert the flat structures
  • Schematron is well suited to detect conversion obstacles or input ambiguities on many levels

Our approach

  • Typesetter typesets
  • adheres to conventions (styles, anchoring, …)
  • but no exposure to tags
  • exports IDML & uploads it to conversion/checking service
  • IDML is converted to Hub XML → XHTML → EPUB
    (much XSLT 2, so XProc, many passes, wow)
  • Schematron checks on some of the intermediate XML docs
  • Visualization of the errors in an HTML rendering
  • Typesetter corrects mistakes in InDesign

A Framework?

Should I keep working on this app/library/framework?

Don’t worry

There are also apps (standalone tools, demo) and libs (the modules):

Detailed list of transpect modules and standalone tools.

And people are actually using it.

Example: Hogrefe Verlag

Anatomy of a transpect project: Externals

transpect project repository with svn externals and own files


svn:externals
  schema/Hub -r37 https://github.com/gimsieke/Hub/trunk/
  schema/Hub/css -r42 https://github.com/gimsieke/CSSa/trunk/
  schema/rdfa -r179 https://subversion.le-tex.de/common/schema/rdfa/trunk/
  calabash -r1344 https://subversion.le-tex.de/common/calabash/
  converter -r1411 https://subversion.le-tex.de/common/pubcoach/trunk/
  crossref -r1461 https://subversion.le-tex.de/common/crossref/trunk/
  evolve-hub -r1418 https://subversion.le-tex.de/common/evolve-hub/
  hobots2html -r1416 https://subversion.le-tex.de/common/hobots2html_simple/trunk
  hub2hobots -r1465 https://subversion.le-tex.de/common/hub2bits/trunk
  htmlreports -r1299 https://subversion.le-tex.de/common/htmlreports/trunk
  idml2xml -r404 https://subversion.le-tex.de/idmltools/trunk/idml2xml
  epubtools -r1435 https://subversion.le-tex.de/common/epubtools/
  fontlib/dejavu-sans -r673 https://subversion.le-tex.de/common/fontlib/dejavu-sans
  fontlib/xmlcatalog -r908 https://subversion.le-tex.de/common/fontlib/xmlcatalog
  css-expand -r1362 https://subversion.le-tex.de/common/css-expand/
  css-generate -r1265 https://subversion.le-tex.de/common/css-generate/
  xproc-util/store-debug -r1183 https://subversion.le-tex.de/common/xproc-util/store-debug
  xproc-util/xml-model -r1356 https://subversion.le-tex.de/common/xproc-util/xml-model
  xproc-util/xslt-mode -r1357 https://subversion.le-tex.de/common/xproc-util/xslt-mode
  xslt-util/colors -r1408 https://subversion.le-tex.de/common/letex-util/colors/
  xslt-util/hex -r966 https://subversion.le-tex.de/common/letex-util/hex/
  xslt-util/lengths -r1393 https://subversion.le-tex.de/common/letex-util/lengths/
  xslt-util/resolve-uri -r1161 https://subversion.le-tex.de/common/letex-util/resolve-uri/
  xslt-util/mime-type -r1391 https://subversion.le-tex.de/common/letex-util/mime-type/
  xslt-util/functx/XML_Elements_and_Attributes/XML_Document_Structure -r465 https://subversion.le-tex.de/common/functx/XML_Elements_and_Attributes/XML_Document_Structure
  xslt-util/functx/Sequences/Positional -r461 https://subversion.le-tex.de/common/functx/Sequences/Positional
  xslt-util/functx/xmlcatalog -r468 https://subversion.le-tex.de/common/functx/xmlcatalog
  infrastructure/epubcheck -r1170 https://subversion.le-tex.de/common/epubcheck

Note: svn externals work with GitHub repos, too (in case we move everything to GitHub)

Anatomy of a transpect project: XML Catalog

${pdu}/xmlcatalog/catalog.xml

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <rewriteURI uriStartString="http://customers.le-tex.de/generic/book-conversion/" rewritePrefix="../"/>
  
  <nextCatalog catalog="content-repo.catalog.xml"/>

  <nextCatalog catalog="../idml2xml/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../xproc-util/store-debug/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../xproc-util/xslt-mode/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../xproc-util/xml-model/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../htmlreports/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../hobots2html/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../hub2hobots/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../evolve-hub/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../converter/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../epubtools/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../css-expand/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../css-generate/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../schema/Hub/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../fontlib/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../xslt-util/colors/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../xslt-util/hex/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../xslt-util/lengths/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../xslt-util/resolve-uri/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../xslt-util/functx/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../xslt-util/mime-type/xmlcatalog/catalog.xml"/>
  
  <!-- content-repo.catalog.xml is unversioned.
    Use this single-entry catalog for local configuration, like this: 
  <rewriteURI uriStartString="http://cms.publisher.com/Books/" rewritePrefix="file:///c:/cygwin/home/user/Publisher/BookWorkflow/content/"/>
    where http://cms.publisher.com/Books/ is a placeholder for what is actually in /publisher-conf/@content-base-uri
    You may also provide a versioned file, content-repo.default.catalog.xml. It will be used if
    no unversioned file is available.
    Please note that due to shortcomings of Saxon’s use of catalogs, we’ll have to do catalog 
    resolution by hand with fixed file names in pubcoach/xsl/paths.xsl. So please use only
    content-repo.catalog.xml and content-repo.default.catalog.xml because otherwise it will break.
  -->
  <nextCatalog catalog="content-repo.catalog.xml"/>
  <nextCatalog catalog="content-repo.default.catalog.xml"/>

  <rewriteURI uriStartString="http://hobots.hogrefe.com/" rewritePrefix="../" />
  <rewriteURI uriStartString="https://hobots.hogrefe.com/" rewritePrefix="../" />
  
</catalog>

${pdu}/idml2xml/xmlcatalog/catalog.xml

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI uriStartString="http://transpect.le-tex.de/idml2xml/" 
    rewritePrefix="../"/>
</catalog>
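As a rough illustration of what such a catalog entry does, here is a minimal Python sketch of rewriteURI resolution. It assumes the OASIS XML Catalogs rules that the longest matching uriStartString wins and that a relative rewritePrefix is resolved against the catalog file's own location. The function name rewrite_uri, the checkout location /project/ and the file xpl/example.xpl are hypothetical; nextCatalog delegation is left out.

```python
from urllib.parse import urljoin

def rewrite_uri(uri, catalog_uri, entries):
    """Apply rewriteURI entries; the longest matching start string wins.

    entries: list of (uriStartString, rewritePrefix) pairs.
    catalog_uri: location of the catalog file; relative prefixes resolve against it.
    Returns the rewritten URI, or None if no entry matches.
    """
    matches = [(start, prefix) for start, prefix in entries if uri.startswith(start)]
    if not matches:
        return None
    start, prefix = max(matches, key=lambda m: len(m[0]))
    # Resolve the (possibly relative) prefix, then append the unmatched remainder
    return urljoin(catalog_uri, prefix) + uri[len(start):]

# Hypothetical checkout location of the idml2xml module catalog
catalog = "file:///project/idml2xml/xmlcatalog/catalog.xml"
entries = [("http://transpect.le-tex.de/idml2xml/", "../")]

local = rewrite_uri("http://transpect.le-tex.de/idml2xml/xpl/example.xpl", catalog, entries)
unmapped = rewrite_uri("http://example.org/other.xsl", catalog, entries)
print(local, unmapped)
```

Because every module ships its own one-entry catalog, code can always refer to the canonical http: URI, wherever the module has been checked out.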

Anatomy of a transpect project

  • a directory in an svn repo
  • svn externals for the modules
  • one of the modules is calabash
  • an XML catalog in xmlcatalog/catalog.xml
  • an adaptions directory for adaptations (XSLT, XProc, Schematron, CSS, …)
  • one or more front-end XProc pipelines, typically in adaptions/common/xpl
  • adaptions may be on several levels (e.g., common, imprint, series, work)

Customization levels

  • common
  • publisher
  • series
  • work

Number and names of customization levels may be customized (in principle).

The values for each level are supplied either as XProc options or via input file name parsing (see .xpl and .xsl for project-specific filename parsing).
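File name parsing can be pictured with the following Python sketch. The naming convention (numeric prefix, work number, series code) is only inferred from the sample file 101026_00042_RAT.idml that appears in the paths document; the regular expression and the function parse_levels are hypothetical and not the project's actual parsing code.

```python
import re

# Hypothetical naming convention: <prefix>_<work>_<series>.<ext>
FILENAME_RE = re.compile(
    r"^(?P<prefix>\d+)_(?P<work>\d+)_(?P<series>[A-Za-z]+)\.(?P<ext>\w+)$"
)

def parse_levels(filename, publisher):
    """Derive the customization-level values from an input file name."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"file name does not follow the convention: {filename}")
    return {
        "publisher": publisher,
        "series": m.group("series"),
        "work": m.group("work"),
        "work-basename": filename.rsplit(".", 1)[0],
    }

levels = parse_levels("101026_00042_RAT.idml", publisher="hogrefe.de")
print(levels)
```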

The paths document

<c:param-set xmlns:c="http://www.w3.org/ns/xproc-step">
  <c:param name="debug" value="yes"/>
  <c:param name="debug-dir-uri"
          value="file:///data/davrails/data/davbase/generic/HoBoTS_a6d6d3c094a8/gui/101026_00042_RAT.idml/out/debug"/>
  <c:param name="srcpaths" value="yes"/>
  <c:param name="pipeline" value="idml2hobots.xpl"/>
  <c:param name="publisher" value="hogrefe.de"/>
  <c:param name="series" value="RAT"/>
  <c:param name="work" value="00042"/>
  <c:param name="progress" value="yes"/>
  <c:param name="progress-to-stdout" value="no"/>
  <c:param name="common-path"
          value="file:/data/davrails/code/converter/generic/HoBoTS_a6d6d3c094a8/BookTagSet/trunk/adaptions/common/"/>
  <c:param name="publisher-path"
          value="file:/data/davrails/code/converter/generic/HoBoTS_a6d6d3c094a8/BookTagSet/trunk/adaptions/hogrefe.de/"/>
  <c:param name="series-path"
          value="file:/data/davrails/code/converter/generic/HoBoTS_a6d6d3c094a8/BookTagSet/trunk/adaptions/hogrefe.de/RAT/"/>
  <c:param name="file"
          value="/data/davrails/data/davbase/generic/HoBoTS_a6d6d3c094a8/gui/101026_00042_RAT.idml/out/101026_00042_RAT.idml"/>
  <c:param name="file-uri"
          value="file:///data/davrails/data/davbase/generic/HoBoTS_a6d6d3c094a8/gui/101026_00042_RAT.idml/out/101026_00042_RAT.idml"/>
  <c:param name="work-path"
          value="file:/data/davrails/code/converter/generic/HoBoTS_a6d6d3c094a8/BookTagSet/content//hogrefe.de/RAT/00042/"/>
  <c:param name="work-basename" value="101026_00042_RAT"/>
  <c:param name="interface-language" value="de"/>
</c:param-set>
			    

bc:load-cascaded, bc:load-cascaded-binary

<bc:load-cascaded name="lc" 
  required="no" 
  filename="hobots2html/hobots2html.xsl" 
  fallback="http://transpect.le-tex.de/hobots2html/xsl/hobots2html.xsl">
  <p:input port="paths">
    <p:pipe port="paths" step="hobots2html"/>
  </p:input>
</bc:load-cascaded>

This looks for hobots2html/hobots2html.xsl in $work-path, $series-path, $publisher-path, and then $common-path. The first match is returned. If none is available, the fallback URI is used.
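The lookup order can be sketched in a few lines of Python. This is an illustration of the cascade only, not the step's implementation: it works on plain directories instead of file: URIs, and the function load_cascaded and the temporary directory layout are assumptions made for the demo.

```python
import os
import tempfile

def load_cascaded(filename, paths, fallback):
    """Return the first existing candidate, from most to least specific level."""
    for level in ("work-path", "series-path", "publisher-path", "common-path"):
        base = paths.get(level)
        if base is None:
            continue
        candidate = os.path.join(base, filename)
        if os.path.isfile(candidate):
            return candidate
    return fallback

# Throwaway directories standing in for the customization levels
root = tempfile.mkdtemp()
paths = {}
for level in ("common", "publisher", "series", "work"):
    directory = os.path.join(root, level)
    os.makedirs(os.path.join(directory, "hobots2html"))
    paths[level + "-path"] = directory

# Only the series level provides an override in this demo
override = os.path.join(paths["series-path"], "hobots2html", "hobots2html.xsl")
with open(override, "w") as f:
    f.write("<!-- series-level stylesheet -->")

FALLBACK = "http://transpect.le-tex.de/hobots2html/xsl/hobots2html.xsl"
found = load_cascaded("hobots2html/hobots2html.xsl", paths, FALLBACK)
missing = load_cascaded("hobots2html/other.xsl", paths, FALLBACK)
print(found, missing)
```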

xsl:import a more generic stylesheet

…/content/hogrefe.com/ADHOC/00376/hobots2html/hobots2html.xsl (a per-work customization)

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:jats="http://jats.nlm.nih.gov"
  xmlns:dbk="http://docbook.org/ns/docbook"
  xmlns:css="http://www.w3.org/1996/css"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="css jats dbk xs"
  version="2.0">
  <xsl:import href="http://customers.le-tex.de/generic/book-conversion/adaptions/common/hobots2html/hobots2html.xsl"/>
  <xsl:template match="boxed-text/sec[p[@specific-use eq 'EpubAlternative']]/title/inline-graphic" mode="hobots2html"
    priority="2"/>
  <xsl:template match="inline-graphic[contains(@xlink:href, 'dot_matrix')]" mode="hobots2html">
    <code>T7253j989c</code>
  </xsl:template>
  <xsl:template match="inline-graphic[matches(@xlink:href, 'fig_(tree)')]" mode="hobots2html"/>
</xsl:stylesheet>

Why load-cascaded?

  • No need to contaminate your core module code with conditional code for imprints, series, or works
  • Predictable, canonical locations for the overriding code

Multi-pass conversions: XProc Orchestration

idml2xml: 20 XSLT passes in different modes
docx2hub: 7 XSLT passes
hobots2html: 3 XSLT passes

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" 
  xmlns:c="http://www.w3.org/ns/xproc-step"  
  xmlns:cx="http://xmlcalabash.com/ns/extensions" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xmlns:bc="http://transpect.le-tex.de/book-conversion"
  xmlns:transpect="http://www.le-tex.de/namespace/transpect"  
  xmlns:din="http://din.de/namespace"
  xmlns:letex="http://www.le-tex.de/namespace"
  version="1.0"
  name="hobots2html"
  >
  
  <p:option name="debug" required="false" select="'no'"/>
  <p:option name="debug-dir-uri" />
  
  <p:input port="source" primary="true"/>
  <p:input port="parameters" kind="parameter" primary="true"/>
  <p:input port="stylesheet"/>
  <p:output port="result" primary="true"/>
  
  <p:import href="http://xmlcalabash.com/extension/steps/library-1.0.xpl" />
  <p:import href="http://transpect.le-tex.de/xproc-util/xslt-mode/xslt-mode.xpl"/>
  
  <letex:xslt-mode prefix="hobots2html/02" mode="epub-alternatives">
    <p:input port="source">
      <p:pipe step="hobots2html" port="source"/>
    </p:input>
    <p:input port="stylesheet"><p:pipe step="hobots2html" port="stylesheet"/></p:input>
    <p:input port="models"><p:empty/></p:input>
    <p:with-option name="debug" select="$debug"/>
    <p:with-option name="debug-dir-uri" select="$debug-dir-uri"/>
  </letex:xslt-mode>

  <letex:xslt-mode prefix="hobots2html/05" mode="hobots2html">
    <p:input port="stylesheet"><p:pipe step="hobots2html" port="stylesheet"/></p:input>
    <p:input port="models"><p:empty/></p:input>
    <p:with-option name="debug" select="$debug"/>
    <p:with-option name="debug-dir-uri" select="$debug-dir-uri"/>
  </letex:xslt-mode>

  <letex:xslt-mode prefix="hobots2html/20" mode="clean-up">
    <p:input port="stylesheet"><p:pipe step="hobots2html" port="stylesheet"/></p:input>
    <p:input port="models"><p:empty/></p:input>
    <p:with-option name="debug" select="$debug"/>
    <p:with-option name="debug-dir-uri" select="$debug-dir-uri"/>
  </letex:xslt-mode>
  
</p:declare-step>
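The orchestration idea, a fixed sequence of passes that can each leave a debug snapshot behind, can be sketched in Python as follows. The toy string replacements merely stand in for XSLT modes, and the assumption that the prefix names the debug output of a pass is mine, not something the pipeline above states.

```python
def run_passes(doc, passes, debug=False):
    """Run transformation passes in order; optionally keep every intermediate result.

    passes: list of (prefix, mode_name, function) tuples.
    Returns (final_document, snapshots); snapshots maps "prefix.mode" to that pass's output.
    """
    snapshots = {}
    for prefix, mode, transform in passes:
        doc = transform(doc)
        if debug:
            snapshots[f"{prefix}.{mode}"] = doc
    return doc, snapshots

# Toy string transformations standing in for the three XSLT modes
passes = [
    ("hobots2html/02", "epub-alternatives", lambda d: d.replace("<alt/>", "")),
    ("hobots2html/05", "hobots2html", lambda d: d.replace("sec", "div")),
    ("hobots2html/20", "clean-up", lambda d: d.strip()),
]

result, snapshots = run_passes(" <sec><alt/>text</sec> ", passes, debug=True)
print(result)
print(sorted(snapshots))
```

Keeping the intermediate results addressable by a sortable prefix is what makes a 20-pass conversion debuggable: you can diff the output of any two consecutive passes.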

Cascaded XProc pipeline loading

What if a book needs a special, additional pipeline step, or if a series is known to not contain lists, tables, or figures?

  • evolve-hub: 3–30 XSLT passes, depending on project
  • evolve-hub is more of a toolkit

⇒ load the inner xpl for that step dynamically, too

⇒ cx:eval

⇒ bc:dynamic-transformation-pipeline

where both xpl and xsl are (independently) bc:load-cascaded

<bc:dynamic-transformation-pipeline 
  load="evolve-hub/driver" 
  fallback-xpl="http://transpect.le-tex.de/evolve-hub/xpl/fallback.xpl" 
  fallback-xsl="http://transpect.le-tex.de/evolve-hub/evolve-hub.xsl">
  <p:with-option name="debug" select="$debug"/>
  <p:with-option name="debug-dir-uri" select="$debug-dir-uri"/>
  <p:input port="additional-inputs"><p:empty/></p:input>
  <p:input port="options"><p:empty/></p:input>
</bc:dynamic-transformation-pipeline>

There’s no p:import like xsl:import

(unfortunately)

You don’t want to copy&paste 29 of the 30 steps of another pipeline just because you want to skip an expensive but non-essential step in the middle.

You don’t want additional levels in the cascade just to be able to store common configurations.

You want…

.xpl.xsl

…new levels of indirection

…XSLT FTW

The whole evolve-hub pipeline, except certain steps

.xsl for generating .xpl

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs"
  version="2.0">
  
  <xsl:template name="main">
    <xsl:apply-templates select="document('http://customers.le-tex.de/generic/book-conversion/adaptions/common/evolve-hub/driver.xpl')"/>
  </xsl:template>

  <xsl:template match="@* | *">
    <xsl:copy copy-namespaces="yes">
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="*:xslt-mode[matches(@mode, '(split|figure|complex|preprocess-hierarchy|right-tab)')]"/>
    
</xsl:stylesheet>
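For readers less familiar with XSLT, the same "copy everything except certain steps" idea looks like this in Python. The tiny DRIVER pipeline and its mode names are hypothetical stand-ins for the real driver.xpl; only the regular expression is taken from the stylesheet above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the common driver.xpl
DRIVER = """<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:letex="http://www.le-tex.de/namespace" version="1.0">
  <letex:xslt-mode mode="preprocess"/>
  <letex:xslt-mode mode="split"/>
  <letex:xslt-mode mode="lists"/>
  <letex:xslt-mode mode="figure"/>
  <letex:xslt-mode mode="clean-up"/>
</p:declare-step>"""

SKIP = re.compile(r"(split|figure|complex|preprocess-hierarchy|right-tab)")

def drop_steps(pipeline_xml, skip=SKIP):
    """Copy the pipeline, leaving out xslt-mode steps whose mode matches skip."""
    root = ET.fromstring(pipeline_xml)
    # Materialize the element list first so removals do not disturb the traversal
    for parent in list(root.iter()):
        for child in list(parent):
            local_name = child.tag.rsplit("}", 1)[-1]
            if local_name == "xslt-mode" and skip.search(child.get("mode", "")):
                parent.remove(child)
    return root

filtered = drop_steps(DRIVER)
remaining = [
    step.get("mode")
    for step in filtered.iter("{http://www.le-tex.de/namespace}xslt-mode")
]
print(remaining)
```

As in the stylesheet, the match is a substring match, so "preprocess" survives while a mode named "preprocess-hierarchy" would be dropped.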

Validation

  • Schematron or Relax NG checks may be inserted after each step
  • Schematron checks are named/categorized (“idml”, “flat-hub”, “evolved-hub”, etc.). This category is called the check “family” (because “group” and “phase” were already taken)
  • Within each family, the checks are consolidated by common ID prior to execution: only the most specific check with a given ID wins.
  • Results (as SVRL with @srcpath information) are grouped by @srcpath
  • and then patched into an HTML rendering (which also carries @srcpath attributes)
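The consolidation and grouping rules can be sketched in Python as below. The data structures are invented for the illustration (real checks are Schematron patterns, real results are SVRL), and the level order assumes the work/series/publisher/common cascade; the check ID empty_para and the shortened srcpath strings are hypothetical.

```python
from collections import defaultdict

# Most specific level first, mirroring the configuration cascade
LEVELS = ["work", "series", "publisher", "common"]

def consolidate(checks):
    """Keep one check per ID: the one defined at the most specific level."""
    winners = {}
    for check in checks:
        current = winners.get(check["id"])
        if current is None or LEVELS.index(check["level"]) < LEVELS.index(current["level"]):
            winners[check["id"]] = check
    return winners

def group_by_srcpath(messages):
    """Collect all messages that point at the same source location."""
    grouped = defaultdict(list)
    for message in messages:
        grouped[message["srcpath"]].append(message)
    return dict(grouped)

checks = [
    {"id": "wrong_bibliography_heading2", "level": "common", "severity": "error"},
    {"id": "wrong_bibliography_heading2", "level": "series", "severity": "warning"},
    {"id": "empty_para", "level": "common", "severity": "warning"},
]
active = consolidate(checks)

messages = [
    {"srcpath": "Story_u221.xml?xpath=/ParagraphStyleRange[89]", "id": "wrong_bibliography_heading2"},
    {"srcpath": "Story_u221.xml?xpath=/ParagraphStyleRange[89]", "id": "RNG_hobots"},
    {"srcpath": "Story_u221.xml?xpath=/ParagraphStyleRange[90]", "id": "empty_para"},
]
grouped = group_by_srcpath(messages)
print(active["wrong_bibliography_heading2"]["level"], {k: len(v) for k, v in grouped.items()})
```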

Schematron check results grouped by srcpath

<bc:messages srcpath="file:/C:/cygwin/home/gerrit/Hogrefe/BookTagSet/content/hogrefe.de/RAT/00042/idml/101026_00042_RAT.idml.tmp/Stories/Story_u221.xml?xpath=/idPkg:Story[1]/Story[1]/ParagraphStyleRange[89]">
      <bc:message srcpath="file:/C:/cygwin/home/gerrit/Hogrefe/BookTagSet/content/hogrefe.de/RAT/00042/idml/101026_00042_RAT.idml.tmp/Stories/Story_u221.xml?xpath=/idPkg:Story[1]/Story[1]/ParagraphStyleRange[89]"
                  xml:id="BC_d29424e2426"
                  severity="warning"
                  type="evolve-hub warning wrong_bibliography_heading2"
                  rendered-key="L"
                  occurrence="3">
         <svrl:text xmlns:schold="http://www.ascc.net/xml/schematron"
                    xmlns:iso="http://purl.oclc.org/dsdl/schematron"
                    xmlns:xhtml="http://www.w3.org/1999/xhtml"
                    xmlns:dbk="http://docbook.org/ns/docbook"
                    xmlns:css="http://www.w3.org/1996/css"
                    xmlns:xlink="http://www.w3.org/1999/xlink">
            <span xmlns="http://purl.oclc.org/dsdl/schematron" class="srcpath">file:/C:/cygwin/home/gerrit/Hogrefe/BookTagSet/content/hogrefe.de/RAT/00042/idml/101026_00042_RAT.idml.tmp/Stories/Story_u221.xml?xpath=/idPkg:Story[1]/Story[1]/ParagraphStyleRange[89]</span> 
        There is a style of a bibliography title used but the title neither contains 'References', 'Bibliography',
        'Discography'. It's style is: 'hog_paragraphs_headings_sec_p_h_sec3_supplementary_bibliography'.</svrl:text>
         <svrl:diagnostic-reference xmlns:schold="http://www.ascc.net/xml/schematron"
                                    xmlns:iso="http://purl.oclc.org/dsdl/schematron"
                                    xmlns:xhtml="http://www.w3.org/1999/xhtml"
                                    xmlns:dbk="http://docbook.org/ns/docbook"
                                    xmlns:css="http://www.w3.org/1996/css"
                                    xmlns:xlink="http://www.w3.org/1999/xlink"
                                    diagnostic="wrong_bibliography_heading2_de"
                                    xml:lang="de">

      Hier wurde eine Überschrift für ein Literaturverzeichnis verwendet, aber der Inhalt der Überschrifte ist weder »Literatur«,
      »Bibliografie«, »References« oder »Diskografie«. Ihr Format lautet: 'hog_paragraphs_headings_sec_p_h_sec3_supplementary_bibliography'.
    </svrl:diagnostic-reference>
      </bc:message>
      <bc:message srcpath="file:/C:/cygwin/home/gerrit/Hogrefe/BookTagSet/content/hogrefe.de/RAT/00042/idml/101026_00042_RAT.idml.tmp/Stories/Story_u221.xml?xpath=/idPkg:Story[1]/Story[1]/ParagraphStyleRange[89]"
                  xml:id="BC_d29424e2683"
                  severity="error"
                  type="RNG error RNG_hobots"
                  href="#BC_d29424e2688"
                  rendered-key="Q"
                  occurrence="4">
         <svrl:text xmlns="http://purl.oclc.org/dsdl/svrl">
            <span xmlns="http://purl.oclc.org/dsdl/schematron" class="srcpath">file:/C:/cygwin/home/gerrit/Hogrefe/BookTagSet/content/hogrefe.de/RAT/00042/idml/101026_00042_RAT.idml.tmp/Stories/Story_u221.xml?xpath=/idPkg:Story[1]/Story[1]/ParagraphStyleRange[89]</span>/book/front-matter[2]/foreword[1]/back[1]/ref-list[1]/ref-list[4]/title[1] element "sec" not allowed here; expected the element end-tag or element "ack", "address", "alternatives", "answer", "answer-set", "array", "boxed-text", "chem-struct-wrap", "code", "def-list", "disp-formula", "disp-formula-group", "disp-quote", "fig", "fig-group", "graphic", "list", "media", "ns:math", "p", "preformat", "question", "question-wrap", "ref", "ref-list", "related-article", "related-object", "speech", "statement", "supplementary-material", "table-wrap", "table-wrap-group", "tex-math", "verse-group" or "x" (with xmlns:ns="http://www.w3.org/1998/Math/MathML")</svrl:text>
      </bc:message>
   </bc:messages>

Indirection fun fact: An XSLT stylesheet is generated from the grouped messages. It processes the HTML rendering, matching the @srcpath values for which messages exist.
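The effect of the patching step can be approximated in Python. Note that this sketch swaps in a direct dictionary lookup for the generated stylesheet; the markup of the inserted marker (a span with a message class) and the sample data are assumptions, not the format of the actual HTML reports.

```python
import xml.etree.ElementTree as ET

def patch_messages(html, grouped):
    """Insert a marker into every element whose srcpath has messages attached."""
    root = ET.fromstring(html)
    # Materialize the element list so inserted markers are not visited themselves
    for element in list(root.iter()):
        messages = grouped.get(element.get("srcpath"))
        if messages:
            marker = ET.Element("span", {"class": "message " + messages[0]["severity"]})
            marker.text = "; ".join(m["text"] for m in messages)
            # Put the marker in front of the element's original text
            marker.tail = element.text
            element.text = None
            element.insert(0, marker)
    return root

html = '<div><p srcpath="s1">Literature</p><p srcpath="s2">Text</p></div>'
grouped = {"s1": [{"severity": "warning", "text": "wrong bibliography heading"}]}

out = ET.tostring(patch_messages(html, grouped), encoding="unicode")
print(out)
```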

Other transpect projects

  • Demo (.docx → EPUB)
  • .docx → Hub → XHTML → IDML
  • .docx → Hub → linguistic analysis → .docx

Summary: The Framework’s Methodology

  • Externals
  • Catalogs
  • Configuration Cascade
  • Paths document of the parameter kind
  • Dynamic XProc generation & execution
  • @srcpath
  • HTML reports

Service offerings

  • Customizing
  • Hosting
  • Compatibility checks (Continuous Integration server)

Outlook / To Do

  • documentation (maybe also based on xprocdoc)
  • move to github?
  • Maven packages?