Mar 11, 2013
 

CrossRef members can use the XML batch query interface to look up DOIs for citations. Here’s an XProc client for that interface that takes JATS-like input.

Assuming that there’s a calabash.sh somewhere on your system, a sample invocation looks like this:

calabash.sh \
  -i source=bits.xml \
  -o qb=query_batch.xml \
  crossref/xpl/jats-submit-crossref-query.xpl \
  email=X@Y \
  user=USER pass=PASS \
  xpath='(//sec)[last()]'

It takes some BITS/JATS/HoBoTS XML on the source port, and it needs your CrossRef credentials and a return email address to which the resolved XML will be sent after batch processing.

If you want to restrict the query to only some citations in your source XML, you can supply an XPath expression. In the example above, only the ref elements below the last sec will be included in the query. In addition, only refs with an id attribute will be processed, and refs that already have a DOI (pub-id[@pub-id-type eq 'doi']) will be excluded. Currently, only mixed-citation elements are processed.
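The selection rules above can be illustrated outside XProc. The following is a minimal Python sketch, not the pipeline's actual XSLT: the sample document, the function name select_refs and the way the last sec is located are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; only r1 satisfies every rule.
SAMPLE = """<book>
  <sec><ref-list>
    <ref id="r0"><mixed-citation>Skipped: not in the last sec.</mixed-citation></ref>
  </ref-list></sec>
  <sec><ref-list>
    <ref id="r1"><mixed-citation>Doe J. A title. 2012.</mixed-citation></ref>
    <ref><mixed-citation>Skipped: no id attribute.</mixed-citation></ref>
    <ref id="r3"><mixed-citation>Skipped: has a DOI.
      <pub-id pub-id-type="doi">10.1000/xyz</pub-id></mixed-citation></ref>
    <ref id="r4"><element-citation>Skipped: no mixed-citation.</element-citation></ref>
  </ref-list></sec>
</book>"""


def select_refs(scope):
    """Return the ref elements below `scope` that would go into the query."""
    selected = []
    for ref in scope.iter("ref"):
        if ref.get("id") is None:
            continue  # only refs with an id attribute are processed
        if ref.find("mixed-citation") is None:
            continue  # only mixed-citation is handled for now
        if ref.find(".//pub-id[@pub-id-type='doi']") is not None:
            continue  # refs that already carry a DOI are excluded
        selected.append(ref)
    return selected


root = ET.fromstring(SAMPLE)
last_sec = root.findall(".//sec")[-1]  # rough equivalent of (//sec)[last()]
ids = [ref.get("id") for ref in select_refs(last_sec)]
print(ids)
```

Running this on the sample prints a list containing only r1, the one ref that has an id, contains a mixed-citation and has no DOI yet.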

There is a JATS-specific part (.xpl, .xsl) that generates a query body compliant with CrossRef Query Input 2.0. A generic part then wraps this body, together with the return email address and a generated batch ID, in a query_batch element and validates it against the XSD. Finally, it wraps the batch, together with the credentials, in a POST request for p:http-request and submits the query to CrossRef.
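To give an idea of what the wrapped batch might look like, here is a hedged Python sketch. The namespace URI and the element names head, email_address, doi_batch_id, body and query are recalled from the CrossRef query schema and should be treated as assumptions to check against the XSD. Using the ref id as the query key, and the function name build_query_batch, are also assumptions. No schema validation is performed here.

```python
import uuid
import xml.etree.ElementTree as ET

QNS = "http://www.crossref.org/qschema/2.0"  # assumed namespace URI
ET.register_namespace("", QNS)  # serialize with a default namespace


def build_query_batch(citations, email, batch_id=None):
    """Wrap (key, citation text) pairs in a query_batch element."""
    batch = ET.Element(f"{{{QNS}}}query_batch", {"version": "2.0"})
    head = ET.SubElement(batch, f"{{{QNS}}}head")
    ET.SubElement(head, f"{{{QNS}}}email_address").text = email
    # Generate a batch ID when the caller does not supply one.
    ET.SubElement(head, f"{{{QNS}}}doi_batch_id").text = batch_id or uuid.uuid4().hex
    body = ET.SubElement(batch, f"{{{QNS}}}body")
    for key, text in citations:
        query = ET.SubElement(body, f"{{{QNS}}}query", {"key": key})
        ET.SubElement(query, f"{{{QNS}}}unstructured_citation").text = text
    return batch


batch = build_query_batch([("r1", "Doe J. A title. 2012.")], "X@Y", "batch-0001")
xml_text = ET.tostring(batch, encoding="unicode")
print(xml_text)
```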

If other input formats were to be supported, it would be nice to have a single front-end script that selects and executes the format-specific query generation pipeline dynamically, with cx:eval.

The main obstacles that had to be overcome were all related to the HTTP POST request in wrap-query.xpl:

  <p:template>
    <p:input port="template">
      <p:inline>
        <c:request 
          method="POST" 
          href="http://doi.crossref.org/servlet/deposit?operation=doQueryUpload&amp;login_id={$user}&amp;login_passwd={$pass}">
          <c:multipart content-type="multipart/form-data" boundary="=-=-=-=-=">
            <c:body content-type="application/xml" disposition='form-data; name="fname"; filename="hobots-refs.xml"'>
              {/*}
            </c:body>
          </c:multipart>
        </c:request>
      </p:inline>
    </p:input>
    <p:input port="source">
      <p:pipe step="wrap" port="result"/>
    </p:input>
    <p:input port="parameters">
      <p:pipe step="vars" port="result"/>
    </p:input>
  </p:template>
  
  <p:http-request omit-xml-declaration="false" encoding="US-ASCII"/>

The href attribute of c:request carries a GET-style query string (operation, login_id, login_passwd) that looks out of place on a POST, but it seems to work alongside the POSTed payload. The payload has to be in a multipart chunk, and it’s important that it has name="fname" in its Content-Disposition field. The filename is static at this point, but it could be made dynamic with a curly-braces expression. That might get awkward if the expression contained single quotes, because the whole disposition attribute is already enclosed in single quotes; I’ve avoided this problem for the time being.
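For readers who want to see the shape of the resulting HTTP message without running Calabash, here is a rough Python sketch that builds an equivalent request but does not send it. The URL, the boundary, the fname part name and the filename are taken from the excerpt above. The function name build_upload_request is hypothetical, and this only approximates what p:http-request produces on the wire.

```python
import urllib.parse
import urllib.request


def build_upload_request(xml_bytes, user, password, filename="hobots-refs.xml"):
    """Build (but do not send) a multipart/form-data POST with one "fname" part."""
    boundary = "=-=-=-=-="
    # The credentials travel in the query string, as in the c:request href.
    query = urllib.parse.urlencode({
        "operation": "doQueryUpload",
        "login_id": user,
        "login_passwd": password,
    })
    url = "http://doi.crossref.org/servlet/deposit?" + query
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="fname"; filename="{filename}"\r\n'
        "Content-Type: application/xml\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return urllib.request.Request(
        url,
        data=head + xml_bytes + tail,
        # The boundary is quoted because it contains "=" characters.
        headers={"Content-Type": f'multipart/form-data; boundary="{boundary}"'},
        method="POST",
    )


request = build_upload_request(b"<query_batch/>", "USER", "PASS")
print(request.full_url)
print(request.data.decode("ascii"))
```

Because the filename is just a function argument here, making it dynamic is trivial. A filename that itself contains double quotes would still need escaping, which this sketch does not attempt.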

The attributes on p:http-request itself are just as important. Without omit-xml-declaration="false", the posted data won’t be interpreted as XML. It was hard for me to find out that this serialization option belongs on p:http-request itself, rather than on c:body.

After I had successfully posted my first batch, I saw that nothing was recognized. I quickly discovered that en dashes and other “special characters” had been converted to plain question marks when the query batch was sent as UTF-8. So I switched the serialization to US-ASCII, hoping that everything beyond ASCII would be transmitted as numeric character references, and that did the trick.
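The effect of the two serialization settings can be reproduced in a few lines of Python. This is only an illustration with ElementTree, not what Calabash does internally, and the sample citation text is made up.

```python
import xml.etree.ElementTree as ET

citation = ET.Element("unstructured_citation")
# \u2013 is an en dash, as commonly found in page ranges.
citation.text = "Doe J. Some title. J Ex. 2012;5:10\u201320."

# UTF-8 keeps the en dash as a multi-byte sequence.
as_utf8 = ET.tostring(citation, encoding="utf-8", xml_declaration=True)

# US-ASCII cannot represent it, so the serializer falls back to a
# numeric character reference; the XML declaration is kept as well.
as_ascii = ET.tostring(citation, encoding="us-ascii", xml_declaration=True)

print(as_utf8)
print(as_ascii)
```

In the US-ASCII output the en dash appears as &#8211;, so the payload contains only 7-bit characters and survives any transport that mangles non-ASCII bytes.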

Another problem that I discovered later: although i and b formatting is allowed in an unstructured_citation, virtually no citation would resolve. I noticed that all these formatting elements were missing in the returned, unresolved query. So they probably didn’t resolve because essential text was discarded prior to resolution. I dissolved all markup within unstructured_citation, and the results were much better.
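Dissolving the markup amounts to keeping the text nodes and dropping the element boundaries. In the pipeline this is done in XSLT; the Python sketch below only illustrates the idea. The function name flatten_citation and the whitespace normalization are assumptions that go beyond what is described above.

```python
import xml.etree.ElementTree as ET


def flatten_citation(citation):
    """Return the citation's text with all inline markup dissolved."""
    text = "".join(citation.itertext())  # text nodes in document order
    return " ".join(text.split())  # collapse whitespace (an extra assumption)


source = ET.fromstring(
    "<unstructured_citation>Doe J. <i>A <b>bold</b> title</i>.\n"
    "  J Ex. 2012.</unstructured_citation>"
)
flat = flatten_citation(source)
print(flat)
```

For the sample element, this prints "Doe J. A bold title. J Ex. 2012." with no i or b elements left.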

  One Response to “XProc-based CrossRef DOI lookup for unstructured citations”

  1. […] Another example is the XProc-based CrossRef DOI resolver by my colleague Gerrit Imsie… […]
