Alexander Gromnitsky's Blog

nokogiri vs. xsltproc

Before 鋸, a common approach to working with XML was XSLT, which is a DSL invented for converting XML into various desired formats (typically HTML, text, or another XML).

A modern programmer rarely thinks of XSLT. If he encounters data in an XML format, the 1st thing that comes to his mind is "How do I convert it to JSON?"

I've heard a little about XSLT, but never used it. Sometimes when I post a snippet of code that uses Nokogiri elsewhere, one of the responses is often "This is fine, but can I do this without installing Ruby?" Usually I ignore such affronting remarks (apparently, there are people who don't use Ruby as a sh replacement? This is extraordinarily shocking), but the question comes up too often nonetheless.

Does a typical Linux box have a utility to handle XML? There is xsltproc, a command-line interface to the libxslt C library. The latter is fairly popular; Google Chrome (still) uses it, for example.

XSLT itself went through several revisions, but libxslt stuck with v1.0, which the W3C specified in 1999.

In this post I use 3 examples in which a chunk of Ruby code gets rewritten in XSLT to answer the question: are there any cases where xsltproc can serve as a Nokogiri replacement?

Google autocomplete

Google has an (unofficial?) API for its "autocomplete" service, which returns results as XML. To play with it, we'll use 2 scripts:

  1. google-autocomplete.bash that takes 2 arguments: country_code query;
  2. google-autocomplete-parse.sh that parses XML.

$ ./google-autocomplete.bash us how do | ./google-autocomplete-parse.sh
["how does groundhog day work",
 "how do you say",
 "how do you get pink eye",
 "how does the world cup work",
 "how do you get strep throat",
 "how do i say goodbye",
 "how does golo work",
 "how do you pronounce qatar",
 "how do you get pneumonia",
 "how do you get ringworm"]

You can change the us argument to ie, for example, to see the results for Ireland.

$ cat google-autocomplete.bash
#!/usr/bin/env bash
curl 'https://google.com/complete/search?output=toolbar' \
     --data-urlencode "gl=$1" \
     --data-urlencode "q=${*:2}" -Gs

The XML that we are parsing looks like

<?xml version="1.0"?>
<toplevel>
  <CompleteSuggestion>
    <suggestion data="how does groundhog day work"/>
  </CompleteSuggestion>
  ...
</toplevel>

The Nokogiri script is so tiny it can be a shell alias:

$ cat google-autocomplete-parse.sh
#!/bin/sh
nokogiri -e 'pp $_.css("suggestion").map {|v| v["data"]}' "$@"

Now, if we rewrite it into a .xsl file, the pipeline returns

$ ./google-autocomplete.bash us how do | xsltproc google-autocomplete.xsl -
how does groundhog day work
how do you say
how do you get pink eye
how does the world cup work
how do you get strep throat
how do i say goodbye
how does golo work
how do you pronounce qatar
how do you get pneumonia
how do you get ringworm

Alas, I find even such a small chunk of XSLT unpleasant to read:

$ cat google-autocomplete.xsl
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="text()" /> <!-- mute text nodes -->
  
  <xsl:template match="suggestion">
    <xsl:value-of select="@data" />
    <xsl:text>&#xa;</xsl:text>  <!-- insert a newline -->
  </xsl:template>
</xsl:stylesheet>

A one-liner in Ruby, despite being imperative, is immediately obvious. To understand this XSLT code, you need to know that XSLT 1.0 has a "hidden" built-in template rule that copies text through:

<xsl:template match="text()|@*">
  <xsl:value-of select="."/>
</xsl:template>

Because of it, we need to suppress the printing of text nodes; otherwise the result may be filled with junk we didn't select via the <xsl:template match="suggestion"> rule.
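To see where the junk comes from, here is a minimal Ruby sketch that imitates how a processor walks the tree for the stylesheet above. It is a toy model, not libxslt: the Element struct, the transform function, and the mute_text flag are all hypothetical names, and text nodes are represented as plain strings.

```ruby
# Toy model of an XML tree: elements are Structs, text nodes are plain Strings.
Element = Struct.new(:name, :attrs, :children)

# Imitates the tree walk for google-autocomplete.xsl.
# mute_text: true stands for having <xsl:template match="text()" />.
def transform(node, mute_text:, out: +'')
  case node
  when String                      # a text node
    out << node unless mute_text   # built-in rule: copy the text through
  when Element
    if node.name == 'suggestion'   # our explicit template
      out << node.attrs.fetch('data') << "\n"
    else                           # built-in rule: recurse into the children
      node.children.each { |c| transform(c, mute_text: mute_text, out: out) }
    end
  end
  out
end

# The indentation between tags is made of whitespace-only text nodes.
doc = Element.new('toplevel', {}, [
  "\n  ",
  Element.new('CompleteSuggestion', {}, [
    "\n    ",
    Element.new('suggestion', { 'data' => 'how do you say' }, []),
    "\n  "
  ]),
  "\n"
])

muted   = transform(doc, mute_text: true)
unmuted = transform(doc, mute_text: false)
p muted    # only the selected attribute values
p unmuted  # the same, plus every indentation text node
```

In this input the "junk" is only whitespace, but any stray character data between the tags would leak into the output the same way.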

On the other hand, you can use it in a browser via XSLTProcessor! If you have nothing better to do, examine a vanilla JS example of Google autocomplete XML parsing that uses only native browser APIs.

Renaming tags

Suppose we have a bunch of .xhtml files & we want to promote h2/h3 headings to h1/h2 respectively, or vice versa, to demote h1/h2 → h2/h3. This sometimes happens when you edit an epub.

A test file:

$ cat foo.xhtml
<?xml version="1.0"?>
<!-- comment 1 -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>foo</title></head>
  <body>
    <h1 class="important">bar &amp; <b class="uber">baz</b></h1>
    <p>word1 <!-- comment 2 --> word2</p>

    <h2>subheading</h2>
    <p>paragraph</p>
  </body>
</html>

Any other nodes (including comments) we want to preserve as-is.

A Ruby script is straightforward--walk through the node tree, look for matched tags, perform modifications in-place, serialise result.

$ cat headings
#!/usr/bin/env -S ruby -r nokogiri

doc = Nokogiri::XML STDIN
doc.traverse do |node|
  if ARGV[0] && node.name.match(ARGV[0])
    node.name = 'h' + (node.name[1..].to_i + ARGV[1].to_i).to_s
  end
end
puts doc.to_xml

Running it without any arguments produces an exact copy of the original XML, but

$ ./headings 'h1|h2' 1 < foo.xhtml

demotes h1/h2 to h2/h3. Easy-peasy.
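One edge the script leaves open: nothing stops the arithmetic from leaving the h1..h6 range, so shifting h1 by -1 yields h0. A minimal sketch of a guarded rename, assuming that clamping is the behaviour you want (shift_heading is a hypothetical helper, not part of the script above):

```ruby
# Hypothetical helper: shift a heading level, staying within h1..h6.
# Names that are not h1..h6 are returned unchanged.
def shift_heading(name, delta)
  m = /\Ah([1-6])\z/.match(name)
  return name unless m

  level = (m[1].to_i + delta).clamp(1, 6)
  "h#{level}"
end

puts shift_heading('h1', 1)   # one level down
puts shift_heading('h1', -1)  # already at the top, stays put
puts shift_heading('p', 1)    # not a heading, untouched
```

Clamping silently merges levels at the edges (h5 and h6 both end up as h6 after a shift by 1), so raising an error instead may suit some books better.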

To write an XSLT script we ought first to think about how to pass custom command-line arguments to xsltproc. For that it has the --stringparam option, & we can set default values for the parameters like this:

<xsl:param name="h">h1,h2,h3,h4,h5,h6</xsl:param>
<xsl:param name="l">0</xsl:param>

$h is not a regex, for XSLT v1.0 doesn't support regular expressions. We just check if a node name can be found in the h1,h2,h3,h4,h5,h6 string or in whatever value the user provided.
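Note that contains() in XPath 1.0 is a plain substring test, so it matches more than whole list items. The Ruby sketch below models the difference; xpath_contains and in_list? are hypothetical helper names, and the stricter variant corresponds to the usual XPath 1.0 idiom contains(concat(',', $h, ','), concat(',', local-name(), ',')).

```ruby
# XPath 1.0 contains(haystack, needle) is a plain substring test.
def xpath_contains(haystack, needle)
  haystack.include?(needle)
end

# Stricter variant: wrap both sides in the delimiter, so that only whole
# comma-separated items can match.
def in_list?(list, name)
  xpath_contains(",#{list},", ",#{name},")
end

h = 'h1,h2,h3,h4,h5,h6'
puts xpath_contains(h, 'h2')  # a listed name
puts xpath_contains(h, 'h')   # merely a substring of the list
puts in_list?(h, 'h2')
puts in_list?(h, 'h')
```

XHTML has no element named h, so the plain contains() check is harmless for the files in this post; the stricter form only matters for vocabularies where one element name can be a substring of another.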

The general approach is similar to the Ruby version: traverse all nodes, rename those that match our $h parameter, and copy the rest unchanged.

$ cat headings.xsl
<?xml version="1.0"?>
<!-- $ xsltproc -\-stringparam h h2 -\-param l 1 headings.xsl file.xml -->

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="l">0</xsl:param>
  <xsl:param name="h">h1,h2,h3,h4,h5,h6</xsl:param>

  <xsl:template match="@*|node()">
    <xsl:choose>
      <xsl:when test="self::* and contains($h, local-name())">
        <xsl:element name="{concat('h', substring(local-name(),2,1) + $l)}">
          <!-- copy the rest untouched -->
          <xsl:apply-templates select="@*|node()" />
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <!-- identity transformation -->
        <xsl:copy><xsl:apply-templates select="@*|node()" /></xsl:copy>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>

Do you like it? I don't.

A clumsy XSLT conditional

<xsl:choose>
  <xsl:when ...>  ... </xsl:when>
  <xsl:otherwise> ... </xsl:otherwise>
</xsl:choose>

also doesn't make it look particularly elegant. I say it is visually appalling from any angle.

Generating Table of Contents

Staying with the epub theme, another very common task is automatic TOC generation from a bunch of .xhtml files. For simplicity's sake our TOC is going to be 1 level deep.

The final result may look like

<nav epub:type="toc">
  <h1>Contents</h1>
  <ol>
    <li><a href="ch01.xhtml">Chapter 1</a></li>
    <li><a href="ch02.xhtml">Chapter 2</a></li>
  </ol>
</nav>

A Ruby script usage:

$ ./toc path/to/files h2

It's Nokogiri again but with ERB:

$ cat toc
#!/usr/bin/env -S ruby -rnokogiri -rerb

selector = ARGV[1] || 'h2'

anchors = Dir.glob(File.join ARGV[0] || './', '*.xhtml').map do |file|
  doc = Nokogiri::XML File.read file
  doc.css(selector).map {|n| {href: File.basename(file), text: n.text }}
end.filter {|v| v.size > 0}.flatten

puts ERB.new(DATA.read, trim_mode: '-').result(binding)

__END__
<nav epub:type="toc">
  <h1>Contents</h1>
  <ol>
    <% anchors.each do |a| -%>
      <li><a href="<%= a[:href] %>"><%= a[:text] %></a></li>
    <% end %>
  </ol>
</nav>

An XSLT version works quite differently. We

  1. collect a list of files outside of xsltproc;
  2. run xsltproc for each .xhtml file;
  3. run xsltproc with a TOC stylesheet into which we inject the result from #2.
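Purely to make the data flow of these 3 steps explicit (driving xsltproc from Ruby obviously defeats the no-Ruby premise), here is a hedged sketch. li.xsl and toc.xsl are the stylesheets from this post; li_command, wrap_items and toc are hypothetical names, and the sketch assumes xsltproc is in PATH and both stylesheets sit in the current directory.

```ruby
require 'open3'

# argv for step 2: extract <li> items from one file, passing its base name
# to the stylesheet as the href parameter.
def li_command(file, stylesheet: 'li.xsl')
  ['xsltproc', '--stringparam', 'href', File.basename(file), stylesheet, file]
end

# Wrap the collected <li> fragments into one well-formed document for step 3.
def wrap_items(items)
  %(<?xml version="1.0"?>\n<ol>\n#{items.join}</ol>\n)
end

def toc(dir)
  files = Dir.glob(File.join(dir, '*.xhtml')).sort          # step 1
  items = files.map do |f|                                  # step 2
    out, status = Open3.capture2(*li_command(f))
    raise "xsltproc failed on #{f}" unless status.success?
    out
  end
  out, status = Open3.capture2('xsltproc', 'toc.xsl', '-',  # step 3
                               stdin_data: wrap_items(items))
  raise 'xsltproc failed on toc.xsl' unless status.success?
  out
end
```

Calling puts toc('path/to/files') would then print the finished nav element, provided the stylesheets behave as described in the rest of this section.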

Extracting headings from an .xhtml file isn't hard, if you forget that there's no way to get the file name of an XML file from within the stylesheet. xsltproc could, for example, set some non-standard variable like __FILE__, but it doesn't, hence the only way to set an href in an anchor is to pass a custom parameter with the --stringparam command-line option:

$ cat li.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" />
  <xsl:template match="text()" /> <!-- mute text nodes -->
  <xsl:template match="h:h2">
    <li><a href="{$href}"><xsl:value-of select="text()" /></a></li>
  </xsl:template>
</xsl:stylesheet>

There's also a bunch of new little annoyances, like the need to declare the xhtml namespace; otherwise our h2 rule won't match anything.

To collect the results in a single intermediate XML fragment, we could write a sh-script or a simple Makefile (to irritate you even more):

$ cat li.mk
d :=
chapter := $(sort $(patsubst %.xhtml, %.chapter, $(wildcard $(d)/*.xhtml)))

footer: $(chapter)
    @echo '</ol>'

$(chapter): header

header:
    @echo '<?xml version="1.0"?>'
    @echo '<ol>'

%.chapter: %.xhtml
    @xsltproc --stringparam href "$(notdir $<)" li.xsl $<

li.xsl + li.mk together work like this:

$ make -f li.mk d=path/to/files
<?xml version="1.0"?>
<ol>
<li><a href="ch01.xhtml">Chapter 1</a></li>
<li><a href="ch02.xhtml">Chapter 2</a></li>
</ol>

The final (#3) step is the easiest one:

$ make -f li.mk d=path/to/files | xsltproc toc.xsl -

The toc.xsl stylesheet simply injects the incoming XML into the desired location:

$ cat toc.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:epub="http://www.idpf.org/2007/ops"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <nav epub:type="toc">
      <h1>Contents</h1>
      <xsl:copy-of select="." />
    </nav>
  </xsl:template>
</xsl:stylesheet>

Do I recommend this to anyone? No. Let XSLT stay a curiosity, a forgotten DSL from the 2000s.

You can say that some of the hurdles can be obviated by using a modern XSLT v3.0 processor, but that defeats the whole point of the "living off the land" approach, where you don't modify the user's environment.


Tags: ойті
Authors: ag