nokogiri vs. xsltproc
Before Nokogiri, a common approach to working with XML was XSLT, a DSL
invented for converting XML into various desired formats (typically
HTML, text, or other XML).
A modern programmer rarely thinks of XSLT. When they encounter data in
an XML format, the first thing that comes to mind is "How do I convert
it to JSON?"
I've heard about XSLT a little, but never used it. Sometimes, when I
post a snippet of Nokogiri code elsewhere, one of the responses is
often "This is fine, but can I do this without installing Ruby?"
Usually I ignore such affronting remarks (apparently, there are people
who don't use Ruby as a sh replacement? This is extraordinarily
shocking), but the question comes up, nevertheless, too often to
dismiss.
Does a typical Linux box have a utility to handle XML? There is
xsltproc, a command-line interface to the libxslt C library. The
latter is fairly popular; Google Chrome (still) uses it, for example.
XSLT itself went through several revisions, but libxslt stuck with
v1.0, which the W3C specified in 1999.
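A quick check that the tool is actually there:

$ xsltproc --version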
In this post I go through three examples in which a chunk of Ruby
code gets rewritten in XSLT, to answer the question: are there any
cases where xsltproc can serve as a Nokogiri replacement?
Google autocomplete
Google has an (unofficial?) API for its "autocomplete" service, which
returns results in XML form. To play with it, we'll use 2 scripts:
- google-autocomplete.bash, which takes 2 arguments: country_code query;
- google-autocomplete-parse.sh, which parses the XML.
$ ./google-autocomplete.bash us how do | ./google-autocomplete-parse.sh
["how does groundhog day work",
"how do you say",
"how do you get pink eye",
"how does the world cup work",
"how do you get strep throat",
"how do i say goodbye",
"how does golo work",
"how do you pronounce qatar",
"how do you get pneumonia",
"how do you get ringworm"]
You can change the us argument to ie, for example, to see the results
for Ireland.
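For instance (the same pipeline, different country code):

$ ./google-autocomplete.bash ie how do | ./google-autocomplete-parse.sh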
$ cat google-autocomplete.bash
#!/usr/bin/env bash
curl 'https://google.com/complete/search?output=toolbar' \
--data-urlencode "gl=$1" \
--data-urlencode "q=${*:2}" -Gs
The XML that we are parsing looks like
<?xml version="1.0"?>
<toplevel>
<CompleteSuggestion>
<suggestion data="how does groundhog day work"/>
</CompleteSuggestion>
...
</toplevel>
The Nokogiri script is so tiny it can be a shell alias (the nokogiri
CLI parses its input and exposes the document as $_):
$ cat google-autocomplete-parse.sh
#!/bin/sh
nokogiri -e 'pp $_.css("suggestion").map {|v| v["data"]}' "$@"
Now, if we rewrite it into a .xsl file, the pipeline returns
$ ./google-autocomplete.bash us how do | xsltproc google-autocomplete.xsl -
how does groundhog day work
how do you say
how do you get pink eye
how does the world cup work
how do you get strep throat
how do i say goodbye
how does golo work
how do you pronounce qatar
how do you get pneumonia
how do you get ringworm
Alas, I find even such a small chunk of XSLT unpleasant to read:
$ cat google-autocomplete.xsl
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" />
<xsl:template match="text()" /> <!-- mute text nodes -->
<xsl:template match="suggestion">
<xsl:value-of select="@data" />
<xsl:text>
</xsl:text> <!-- insert a newline -->
</xsl:template>
</xsl:stylesheet>
A one-liner in Ruby, despite being imperative, is immediately
obvious. To understand the XSLT code, you need to know that XSLT 1.0
has a "hidden" built-in template rule that copies text through:

<xsl:template match="text()|@*">
  <xsl:value-of select="."/>
</xsl:template>

Because of it, we need to suppress the printing of text nodes,
otherwise the result may be filled with junk we didn't select via the
<xsl:template match="suggestion"> rule.
On the other hand, you can use the same stylesheet in a browser via XSLTProcessor!
If you have nothing better to do, examine a vanilla JS example of
Google autocomplete XML parsing that uses only native browser APIs.
Renaming tags
Suppose we have a bunch of .xhtml files & we want to promote h2/h3
headings to h1/h2 correspondingly, or vice-versa, to demote h1/h2 →
h2/h3. This sometimes happens when you edit an epub.
A test file:
$ cat foo.xhtml
<?xml version="1.0"?>
<!-- comment 1 -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>foo</title></head>
<body>
<h1 class="important">bar & <b class="uber">baz</b></h1>
<p>word1 <!-- comment 2 --> word2</p>
<h2>subheading</h2>
<p>paragraph</p>
</body>
</html>
Any other nodes (including comments) we want to preserve as-is.
The Ruby script is straightforward: walk the node tree, look for
matching tags, modify them in place, serialise the result.
$ cat headings
#!/usr/bin/env -S ruby -r nokogiri
doc = Nokogiri::XML STDIN
doc.traverse do |node|
if ARGV[0] && node.name.match(ARGV[0])
node.name = 'h' + (node.name[1..].to_i + ARGV[1].to_i).to_s
end
end
puts doc.to_xml
Running it without any arguments produces an exact copy of the
original XML, but
$ ./headings 'h1|h2' 1 < foo.xhtml
demotes h1/h2 to h2/h3. Easy-peasy.
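Going the other way, a negative offset should promote h2/h3 back to
h1/h2:

$ ./headings 'h2|h3' -1 < foo.xhtml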
To write an XSLT version we first have to figure out how to pass
custom command-line arguments to xsltproc. For that it has the
--stringparam option, & we can set default values for the parameters:

<xsl:param name="h">h1,h2,h3,h4,h5,h6</xsl:param>
<xsl:param name="l">0</xsl:param>

$h is not a regex, for XSLT v1.0 doesn't support regexes. We just
check whether a node name can be found in the h1,h2,h3,h4,h5,h6
string, or in whatever value a user provided.
The general approach is similar to the Ruby version: traverse all
nodes, modify the names of those we are interested in, and copy
through untouched those that don't match our $h parameter.
$ cat headings.xsl
<?xml version="1.0"?>
<!-- $ xsltproc -\-stringparam h h2 -\-param l 1 headings.xsl file.xml -->
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:param name="l">0</xsl:param>
<xsl:param name="h">h1,h2,h3,h4,h5,h6</xsl:param>
<xsl:template match="@*|node()">
<xsl:choose>
<xsl:when test="self::* and contains($h, local-name())">
<xsl:element name="{concat('h', substring(local-name(),2,1) + $l)}">
<!-- copy the rest untouched -->
<xsl:apply-templates select="@*|node()" />
</xsl:element>
</xsl:when>
<xsl:otherwise>
<!-- identity transformation -->
<xsl:copy><xsl:apply-templates select="@*|node()" /></xsl:copy>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
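A run that mirrors ./headings 'h1|h2' 1 from above (h is the
comma-separated list of names to match, l is the offset to add):

$ xsltproc --stringparam h h1,h2 --param l 1 headings.xsl foo.xhtml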
Do you like it? I don't.
A clumsy XSLT conditional
<xsl:choose>
<xsl:when ...> ... </xsl:when>
<xsl:otherwise> ... </xsl:otherwise>
</xsl:choose>
also doesn't make it look particularly elegant. I say it is visually
appalling from any angle.
Generating Table of Contents
Staying with the epub theme, another very common task is an automatic
TOC generation from a bunch of .xhtml files. For simplicity's sake
our TOC is going to be 1 level deep.
The final result may look like
<nav epub:type="toc">
<h1>Contents</h1>
<ol>
<li><a href="ch01.xhtml">Chapter 1</a></li>
<li><a href="ch02.xhtml">Chapter 2</a></li>
</ol>
</nav>
The Ruby script's usage:
$ ./toc path/to/files h2
It's Nokogiri again but with ERB:
$ cat toc
#!/usr/bin/env -S ruby -rnokogiri -rerb
selector = ARGV[1] || 'h2'
anchors = Dir.glob(File.join ARGV[0] || './', '*.xhtml').map do |file|
doc = Nokogiri::XML File.read file
doc.css(selector).map {|n| {href: File.basename(file), text: n.text }}
end.filter {|v| v.size > 0}.flatten
puts ERB.new(DATA.read, trim_mode: '-').result(binding)
__END__
<nav epub:type="toc">
<h1>Contents</h1>
<ol>
<% anchors.each do |a| -%>
<li><a href="<%= a[:href] %>"><%= a[:text] %></a></li>
<% end %>
</ol>
</nav>
An XSLT version works quite differently. We
- collect a list of files outside of xsltproc;
- run xsltproc for each .xhtml file;
- run xsltproc with a TOC stylesheet into which we inject the result
  from step 2.
Extracting headings from an .xhtml file isn't hard, if you overlook
the fact that there's no way to get the name of the XML file from
within the stylesheet. xsltproc could, for example, set some
non-standard variable like __FILE__, but it doesn't, hence the only
way to set an href in an anchor is to pass a custom parameter using
the --stringparam command-line option:
$ cat li.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:h="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="h"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" />
<xsl:template match="text()" /> <!-- mute text nodes -->
<xsl:template match="h:h2">
<li><a href="{$href}"><xsl:value-of select="text()" /></a></li>
</xsl:template>
</xsl:stylesheet>
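For a single chapter it works like this (assuming
path/to/files/ch01.xhtml contains an <h2>Chapter 1</h2> heading, as in
the sample output below):

$ xsltproc --stringparam href ch01.xhtml li.xsl path/to/files/ch01.xhtml
<li><a href="ch01.xhtml">Chapter 1</a></li>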
There's also a bunch of new little annoyances, like the necessity of
declaring the XHTML namespace, without which our h2 rule won't match
anything.
To collect the results into a single intermediate XML fragment, we
could write a sh script or a simple Makefile (to irritate you even
more):
$ cat li.mk
d :=
chapter := $(sort $(patsubst %.xhtml, %.chapter, $(wildcard $(d)/*.xhtml)))

footer: $(chapter)
	@echo '</ol>'

$(chapter): header

header:
	@echo '<?xml version="1.0"?>'
	@echo '<ol>'

%.chapter: %.xhtml
	@xsltproc --stringparam href "$(notdir $<)" li.xsl $<
li.xsl + li.mk together work like this:
$ make -f li.mk d=path/to/files
<?xml version="1.0"?>
<ol>
<li><a href="ch01.xhtml">Chapter 1</a></li>
<li><a href="ch02.xhtml">Chapter 2</a></li>
</ol>
The final (#3) step is the easiest one:
$ make -f li.mk d=path/to/files | xsltproc toc.xsl -
The toc.xsl stylesheet simply injects the incoming XML into the
desired location:
$ cat toc.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:epub="http://www.idpf.org/2007/ops"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<nav epub:type="toc">
<h1>Contents</h1>
<xsl:copy-of select="." />
</nav>
</xsl:template>
</xsl:stylesheet>
Do I recommend this to anyone? No. Let XSLT stay a curiosity, a
forgotten DSL from the 2000s.
You could say that some of the hurdles can be obviated by using a
modern XSLT v3.0 processor, but that defeats the whole point of the
"living off the land" approach, where you don't modify the user's
environment.
Tags: IT
Authors: ag