Suppose you imported XML data into an InDesign document. Suppose that the layout should convey the markup’s semantics: keywords in italics, proper names in small caps, block quotes indented, and so on.
There are several ways to map the markup to the layout. But it is important to know: no matter how you’ve mapped it, once the mapping has taken place, markup and layout information may evolve in totally different directions. And this is dangerous. For real-world XML document types and real-world typesetting, mapping is a one-way street, leading from markup to layout, and not the other way round, as we’ll see later. If you trust that after author corrections have been carried out, everything that looks like a keyword will be a keyword in the exported XML, you may be proven wrong later. Or the two paragraphs that you see in InDesign may still be a single one in the XML, because the paragraph has been split only visually after import and the markup hasn’t been updated accordingly.
One of the main rationales behind XML-first workflows is that the correctness of the tagging should be verified by proofreading the main rendering, which is traditionally the print rendering. This is based on two major assumptions: that every relevant tagging will be rendered distinctively, and that everything that is rendered in a certain way is tagged in a corresponding manner. In popular InDesign XML workflows, there is no guarantee that these assumptions hold, especially the latter.
Consider a publisher’s workflow where manuscripts are normalized in Word, then converted to XML and imported into InDesign. An equivalent scenario is a workflow where manuscripts are edited in an online editor, saved as XML and then imported. A typesetter then refines the page makeup, a proof (PDF or printout) is sent to the author, and the corrections are carried out in InDesign.
Some typical author corrections:
- make italic words upright, or vice versa;
- split a paragraph into two, or join two paragraphs, or insert a new paragraph;
- add a new image with a caption.
These corrections will be carried out meticulously, with the following results:
- the words in italics lack the corresponding XML markup;
- in terms of tagging, the split paragraph still looks like a single one;
- the image and its caption don’t appear at all in the XML export.
What’s going wrong here? Apparently formatting changes and newly created text frames don’t find their way to the underlying XML structures automatically. This is because mapping occurs during import, from XML to formatting, but not the other way round.
Technically it’s because layout and markup information are two independent sets of information that sit on top of the textual content. The software designers of InDesign probably thought that a typesetter’s styling doesn’t affect the markup. In database publishing, for example, this may be a valid assumption. Therefore InDesign doesn’t even issue warnings when there is untagged content or when character or paragraph formatting deviates from the formatting that was created initially according to the mapping instructions. A warning in InDesign could look like the pink background color that indicates missing fonts. The lack of such a warning is an omission that could be fixed easily if it were given some priority.
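The “two independent sets of information” can be pictured with a small model. The following is only a sketch under our own assumptions: the `Span` class, the character offsets and the `STYLE_FOR_TAG` mapping are invented for illustration and have nothing to do with InDesign’s internal data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    name: str   # tag name or style name


text = "The term markup is a keyword."

# State right after import: the mapping rendered the tagged word in italics.
tags = [Span(9, 15, "keyword")]
styles = [Span(9, 15, "italic")]

# Author correction in the layout: another word is set in italics.
# Only the formatting layer changes; nobody touches the tag layer.
styles.append(Span(21, 28, "italic"))

# Hypothetical mapping instructions: which style a tag is rendered with.
STYLE_FOR_TAG = {"keyword": "italic"}


def untagged_styled_spans(tags, styles, style_for_tag):
    """Return styled spans that no tag accounts for under the mapping."""
    expected = {
        (t.start, t.end, style_for_tag[t.name])
        for t in tags
        if t.name in style_for_tag
    }
    return [s for s in styles if (s.start, s.end, s.name) not in expected]


drift = untagged_styled_spans(tags, styles, STYLE_FOR_TAG)
for s in drift:
    print(repr(text[s.start:s.end]), "looks like", s.name, "but has no tag")
```

The check reports the second italic word only. This is the kind of warning that the text above finds missing in InDesign.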
To give full credit, the developers indeed conceived a way of mapping styles to element names (“tags”). This will work for named paragraph or character styles only – fair enough. But this mapping is so absurdly limited that it cannot accommodate real-life document types with all kinds of attributes that distinguish different paragraph types, or with content grouped in boxes or in a hierarchy. You could only use this mapping to create a trivial XML format that will have to be transformed into the real thing later, for example by applying XSLT on export. This in turn is limited to XSLT 1.0, where important concepts such as grouping are not as readily available as in XSLT 2. In addition, the typesetter will still be able to use explicit ad-hoc formatting that looks as if there were some semantics behind the appearance, but actually there are none, because plain italics won’t map to the keyword tag. For these reasons, we think that this rather trivial mapping during editing, and the whole idea of carrying simplified XML with you during editing, could be dispensed with, in favor of processing an IDML export of the final document, as described below.
But let’s return to the case where you have mapped an article’s real-life complex XML to InDesign formatting and you’ve started applying author corrections. Even if InDesign warned about diverging XML and formatting, somebody would have to assume the responsibility of making appearance and tagging coherent. And this is extra work, vulgo costs, that these InDesign XML-first workflows entail. It’s not exactly the dreaded “double data set” redundancy, but it’s a “double piggyback information” problem. Corrections to the main text’s wording don’t have to be carried out twice, fortunately, but corrections to semantic or structural meta-information have to be carried out in both piggyback information sets, namely formatting and XML.
So what can be done to eliminate the extra work? There are two principal paths: either
- change your workflows, or
- do some scripting.
Neither of them is a walk in the park.
Note: Updating the underlying XML and re-importing is not an option if substantial manual makeup corrections have already been carried out. Nobody wants to redo the page makeup of 400 pages for 40 corrections that affect tagging.
When going the scripting path, autocorrecting markup that deviates from layout, there are two principal alternatives:
- native scripting against InDesign’s API, or
- IDML processing (XSLT 1, XSLT 2, or APIs such as IDMLlib).
In any case, there is no assurance that the XML data you export reflect all alterations made during the proof stage. But you can cover common cases such as untagged stories, stories in unanchored frames, character styles that deviate from what would be mapped, split or joined paragraphs, or tabbed tables.
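One of these common cases, a character style that is applied without the tag it would normally come from, can be sketched as follows. The story fragment is hand-written and simplified. It borrows the element and attribute names of IDML story files (`XMLElement`, `MarkupTag`, `CharacterStyleRange`, `AppliedCharacterStyle`), but real exports nest these elements in more ways than shown here, and the `STYLE_TO_TAG` table is a hypothetical stand-in for the mapping instructions.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment modelled on an IDML story (assumption).
STORY = """
<Story Self="u1">
  <XMLElement Self="di2" MarkupTag="XMLTag/para">
    <ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/body">
      <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/$ID/[No character style]">
        <Content>A </Content>
      </CharacterStyleRange>
      <XMLElement Self="di3" MarkupTag="XMLTag/keyword">
        <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/keyword">
          <Content>tagged</Content>
        </CharacterStyleRange>
      </XMLElement>
      <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/keyword">
        <Content>untagged</Content>
      </CharacterStyleRange>
    </ParagraphStyleRange>
  </XMLElement>
</Story>
"""

# Hypothetical mapping: which tag a character style is supposed to stem from.
STYLE_TO_TAG = {"CharacterStyle/keyword": "XMLTag/keyword"}


def find_untagged(node, enclosing_tags=()):
    """Collect (text, missing tag) pairs for styled ranges without their tag."""
    findings = []
    if node.tag == "XMLElement":
        enclosing_tags = enclosing_tags + (node.get("MarkupTag"),)
    if node.tag == "CharacterStyleRange":
        wanted = STYLE_TO_TAG.get(node.get("AppliedCharacterStyle"))
        if wanted is not None and wanted not in enclosing_tags:
            text = "".join(c.text or "" for c in node.iter("Content"))
            findings.append((text, wanted))
    for child in node:
        findings.extend(find_untagged(child, enclosing_tags))
    return findings


findings = find_untagged(ET.fromstring(STORY))
for text, tag in findings:
    print(f"{text!r} is styled like {tag} but not tagged")
```

An autocorrection step would wrap the reported ranges in the missing tag instead of printing them. The sketch only looks at tags that enclose a range, so a tag placed inside the range would have to be handled separately.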
When deciding whether to use native API scripting or IDML processing, there are two important things to consider: maintainability and performance.
Maintainability means that you don’t want to hand-craft and test your autocorrection script for every single layout and XML tag set. The script should introspect the layout and the mapping instructions, and autocorrect as much of the tagging as possible.
There are several different mapping methods around. We are using mostly the aid attribute approach and IDML synthesis with prior aid attribute generation. The latter is the most versatile way to generate layout from markup. In contrast to other mapping methods, it permits scriptless assignment of object styles in IDML files, where “scriptless” should be read here as “without native scripting”. And we prefer XSLT 2 processing of IDML exports for autocorrection. One of the reasons is that it is very maintainable: you can read all the style definitions in the IDML file as well as the actually applied styles and the styling that was originally requested by the XML tags’ aid attributes. If there’s deviating styling or no tagging at all, we may auto-insert the appropriate XML tagging that corresponds to the styling in place.
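As an illustration of aid attribute generation, here is a minimal sketch that decorates source elements with `aid:pstyle` and `aid:cstyle` before import. The element names, the style names and the two mapping tables are assumptions made up for this example. In our own workflows this step is done in XSLT 2, so the Python below is a stand-in and not the actual implementation.

```python
import xml.etree.ElementTree as ET

# Namespace that InDesign's XML import reads the aid attributes from.
AID_NS = "http://ns.adobe.com/AdobeInDesign/4.0/"
ET.register_namespace("aid", AID_NS)

# Hypothetical mapping instructions: element name -> InDesign style name.
PSTYLE = {"para": "body", "blockquote": "quote"}
CSTYLE = {"keyword": "keyword", "name": "smallcaps"}


def add_aid_attributes(root):
    """Request a paragraph or character style for every mapped element."""
    for el in root.iter():
        if el.tag in PSTYLE:
            el.set(f"{{{AID_NS}}}pstyle", PSTYLE[el.tag])
        elif el.tag in CSTYLE:
            el.set(f"{{{AID_NS}}}cstyle", CSTYLE[el.tag])
    return root


source = (
    "<article><para>A <keyword>term</keyword> coined by "
    "<name>Knuth</name>.</para></article>"
)
root = add_aid_attributes(ET.fromstring(source))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the requested style stays on the element as an attribute, a later check can compare it with the style that is actually applied in the IDML export.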
Another important reason for us not to use InDesign’s API is performance. Given the hours that we waited for CS4 or CS5 to finish scripted postprocessing of a 700-page XML import, an issue that ultimately forced us to switch to IDML synthesis, we cannot imagine that native scripting performs any better at checking and autocorrecting the tagging of 10,000 or more text extents.
We have never used IDMLlib, but I think that it compares to native scripting as the Aspose.Words library compares to Word VBA scripting. Aspose.Words is a Java lib that runs on the file format and not within the native application, as does IDMLlib, and it is really fast. But in the .docx case, too, I prefer XSLT 2 processing to Java processing. This is because when you break down a complex transformation into multiple steps, you will have to design an intermediate object model in Java for each step, and the transformations are also easier to code in XSLT 2. So you have much shorter and still very comprehensible code there. So maintainability and performance are the main reasons why we are using XSLT 2 processing of IDML exports for autocorrection.
Change your workflows?
So far, we’ve only been occupied with the “do some scripting” remedy. How about changing the workflows?
We saw that if there were no author corrections after fine-tuning page makeup, we wouldn’t have to worry about syncing deviating layout and markup information.
A possible approach is to make the author carry out all corrections in the source data prior to XML conversion or import. For example, make the author correct a Word manuscript before conversion to XML, or correct XML chunks in a content management system. But many authors cannot do without the impression of the final print layout, and often this is actually important because of page count estimations or because the typesetters have messed up manuscripts before and the authors need to check the conversion quality.
A solution to this is to generate galley proofs with a rough-cut page makeup, by just importing the XML or opening a generated IDML in InDesign, and then perhaps applying a few scripts, just for figure placement or for splitting text boxes that extend below the type area. An author will discover almost all issues of missing keywords or images, overly long paragraphs that need to be split, half a page that needs to be eliminated for page count reasons, …
The corrections will be carried out in some master data (for example, Word or XML), and only then will the typesetter start to fine-tune the layout. In most cases these layout tunings will be alterations that won’t affect tagging. So, apart from the inevitable last-minute corrections that will have to be carried out both in the underlying XML data and in the laid-out InDesign file, there’s no necessity to export a correctly tagged XML from InDesign after the print PDF has been finished. This is still a valid XML-first workflow.
All other renderings, such as EPUB (which, btw, you still don’t want InDesign to generate) or online database content, will be derived from the same quality-assured XML that flows into InDesign, not from the quality-dubious XML export that the currently practiced XML-first or round-trip workflows naïvely spit out.
Another strong case for this “rough-cut first” approach is again performance. Two recent projects have been so large (> 700 pages without hard page break, > 300k XML elements) that InDesign simply couldn’t finish XML import. It couldn’t even open an IDML file if it contained piggyback XML data. So due to the lack of imported XML, exporting XML after carrying out author corrections was a “noption” and we generated the IDML without the tagging.
Does it have to be XML first at all? Our typesetting staff, who execute the XML-first workflow described above for relatively shallow-structured fiction books, are dissatisfied with all the coordination overhead involved when another department (technical copy editors who work on the Word files) normalizes the manuscript and yet another department (XSLT people) converts it to DocBook with aid attributes. They asked whether we could replace the Word normalization, XML import/export process with an InDesign normalization, XML export process. And I think given the heterogeneity of input data and the high degree of normalization they achieve, this isn’t a bad idea. The ex-post XML generation process is conceptually the same as autocorrecting the tagging heuristically according to the actual formatting, with the starting condition that there is no tagging at all initially.
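To make the idea of ex-post XML generation concrete, here is a rough sketch that derives tagging from nothing but the applied styles. It rests on several assumptions: the story fragment is hand-written, each `ParagraphStyleRange` is taken to hold exactly one paragraph (real IDML may put several paragraphs, separated by `Br` elements, into one range), and the `P_TAG` and `C_TAG` tables and the `chapter` wrapper are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from accepted style names to element names.
P_TAG = {"ParagraphStyle/heading1": "title", "ParagraphStyle/body": "para"}
C_TAG = {"CharacterStyle/keyword": "keyword"}

NO_STYLE = "CharacterStyle/$ID/[No character style]"

# Simplified, untagged story modelled on IDML (assumption).
STORY = f"""
<Story Self="u1">
  <ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/heading1">
    <CharacterStyleRange AppliedCharacterStyle="{NO_STYLE}"><Content>Chapter One</Content></CharacterStyleRange>
  </ParagraphStyleRange>
  <ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/body">
    <CharacterStyleRange AppliedCharacterStyle="{NO_STYLE}"><Content>A </Content></CharacterStyleRange>
    <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/keyword"><Content>term</Content></CharacterStyleRange>
    <CharacterStyleRange AppliedCharacterStyle="{NO_STYLE}"><Content>.</Content></CharacterStyleRange>
  </ParagraphStyleRange>
</Story>
"""


def story_to_xml(story):
    """Build tagged XML from the formatting alone."""
    out = ET.Element("chapter")
    for psr in story.iter("ParagraphStyleRange"):
        # Unknown paragraph styles fall back to a plain paragraph here.
        para = ET.SubElement(out, P_TAG.get(psr.get("AppliedParagraphStyle"), "para"))
        last = None  # last inline element; later plain text becomes its tail
        for csr in psr.iter("CharacterStyleRange"):
            text = "".join(c.text or "" for c in csr.iter("Content"))
            tag = C_TAG.get(csr.get("AppliedCharacterStyle"))
            if tag:
                last = ET.SubElement(para, tag)
                last.text = text
            elif last is None:
                para.text = (para.text or "") + text
            else:
                last.tail = (last.tail or "") + text
    return out


generated = ET.tostring(story_to_xml(ET.fromstring(STORY)), encoding="unicode")
print(generated)
```

The same loop could serve for autocorrection if it started from existing tagging instead of an empty result tree.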
In such a workflow it’s pivotal that the typesetters only use accepted named styles and accepted custom formatting. This may be checked within the XSLT 2 autocorrection stylesheet, by a separate custom Schematron rule set or by a purpose-built IDMLlib program. These routines that check manuscript or typeset data for “structurability” are a subject for another article.
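Such a check for accepted styles and accepted custom formatting could look roughly like the sketch below. It is written in Python instead of XSLT 2 or Schematron to keep it self-contained. The allow lists express a hypothetical house policy, and the story fragment is a simplified, hand-written imitation of an IDML story in which local overrides appear as extra attributes on `CharacterStyleRange`.

```python
import xml.etree.ElementTree as ET

NO_STYLE = "CharacterStyle/$ID/[No character style]"

# Hypothetical house rules.
ACCEPTED_PSTYLES = {"ParagraphStyle/body", "ParagraphStyle/heading1"}
ACCEPTED_CSTYLES = {NO_STYLE, "CharacterStyle/keyword"}
ACCEPTED_OVERRIDES = {"Tracking"}  # local formatting that carries no semantics

STORY = f"""
<Story Self="u1">
  <ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/body">
    <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/keyword"><Content>fine</Content></CharacterStyleRange>
    <CharacterStyleRange AppliedCharacterStyle="{NO_STYLE}" FontStyle="Italic"><Content>ad-hoc italics</Content></CharacterStyleRange>
  </ParagraphStyleRange>
  <ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/my_special_para">
    <CharacterStyleRange AppliedCharacterStyle="{NO_STYLE}" Tracking="-10"><Content>unknown style</Content></CharacterStyleRange>
  </ParagraphStyleRange>
</Story>
"""


def check_story(story):
    """List every style or local override that is not on the allow lists."""
    problems = []
    for psr in story.iter("ParagraphStyleRange"):
        style = psr.get("AppliedParagraphStyle")
        if style not in ACCEPTED_PSTYLES:
            problems.append(f"paragraph style not accepted: {style}")
    for csr in story.iter("CharacterStyleRange"):
        style = csr.get("AppliedCharacterStyle")
        if style not in ACCEPTED_CSTYLES:
            problems.append(f"character style not accepted: {style}")
        for attr, value in csr.attrib.items():
            if attr != "AppliedCharacterStyle" and attr not in ACCEPTED_OVERRIDES:
                problems.append(f"local override not accepted: {attr}={value}")
    return problems


problems = check_story(ET.fromstring(STORY))
for problem in problems:
    print(problem)
```

The ad-hoc italics are flagged because they look semantic without being backed by a named style, whereas the tracking override passes as a pure makeup adjustment.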
Finally finally, does it have to be InDesign at all? Consider alternative typesetting systems. In LaTeX, XSL-FO or 3B2 many aspects of the layout may be influenced by processing instructions (PIs). In LaTeX and XSL-FO, you don’t have any other means than PIs for adding custom layout instructions. This means that because the XML input file is the only source of wisdom for adjusting the layout, there is no room for layout data that will evolve differently than the tagging. On the other hand, for example in LaTeX, you could do all kinds of nasty things in processing instructions, up to replacing the entirety of the default rendering with custom layout and content that is included in a processing instruction. These close-to-criminal offences happen in InDesign, too, and in our view, there’s no other way to suppress them than to visually compare a genuine rendering of the actual unspoiled XML content against the print rendering.
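Since all layout tweaks live in the XML source in such a setup, they can at least be listed and screened mechanically. The following sketch collects the processing instructions of a document and flags those whose target is not on an allow list. The PI targets `layout` and `latex`, their contents and the allow list are hypothetical; actual PI conventions depend on the typesetting system and the converter in use.

```python
from xml.dom import minidom

# Hand-written sample source with invented PI conventions (assumption).
SOURCE = r"""<chapter>
  <para>First paragraph.<?layout pagebreak?></para>
  <para>Second <?layout tracking="-5"?>paragraph.</para>
  <?latex \vspace{-2mm}?>
</chapter>"""

# Hypothetical policy: only declarative layout hints are tolerated.
ALLOWED_TARGETS = {"layout"}


def processing_instructions(node):
    """Return (target, data) for every PI below node, in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.PROCESSING_INSTRUCTION_NODE:
            found.append((child.target, child.data))
        else:
            found.extend(processing_instructions(child))
    return found


pis = processing_instructions(minidom.parseString(SOURCE))
flagged = [(target, data) for target, data in pis if target not in ALLOWED_TARGETS]

for target, data in flagged:
    print(f"suspicious processing instruction: <?{target} {data}?>")
```

A listing like this narrows down where to look, but it doesn’t replace the visual comparison against an unspoiled rendering.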
So why does it have to be InDesign so often? The kind of books I’ve been talking about rarely use the full range of layout tweaks that InDesign offers. So a classic XML-aware book typesetting system should be acceptable to the publishers. The choice of the renderer for print output shouldn’t be that important, provided that there’s a choice of service providers that support this typesetting system.
I think the preference towards InDesign that I perceive among many production editing executives is mainly a cultural phenomenon, as many people’s preference for MS Word has been for a long time.