xslt-utf8-decode

Overview

xslt-utf8-decode provides UTF-8 decoding functionality to XSLT 2.0 stylesheets. UTF-8 strings are represented as sequences of integers, with each integer representing an octet.
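To make the input representation concrete, here is a minimal sketch in Python (not XSLT, and not part of this library) of what "a sequence of integers, one per octet" looks like. The helper name to_octets is hypothetical and used only for illustration.

```python
def to_octets(text):
    """Return the UTF-8 encoding of text as a list of integers, one per octet."""
    return list(text.encode("utf-8"))


# "é" (U+00E9) takes two octets in UTF-8, so a 4-character string gives 5 integers.
print(to_octets("caf\u00e9"))  # [99, 97, 102, 195, 169]
```

A sequence like this is the kind of input the decoder takes; its output is the corresponding string.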

My God, why?

I wanted to decode percent-encoded strings of the kind you might find in a URI query string, and I wanted that to support UTF-8-encoded non-ASCII characters.

After spending too many hours on this, I decided that this was probably a really bad idea for long-term maintainability. Nevertheless, I didn't want to throw this work away.

Security

The decoder tries to faithfully follow RFC 3629 and its security warnings. Overlong UTF-8 sequences and encoded UTF-16 surrogates are detected and replaced with U+FFFD REPLACEMENT CHARACTER (�), as are octets that don't belong to any decodable sequence.
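The checks described above can be sketched roughly as follows. This is Python, not the XSLT implementation, and it rests on assumptions: the function name decode_utf8 is hypothetical, and the replacement policy (one U+FFFD for a well-formed but forbidden sequence, one U+FFFD per octet that cannot start or continue a sequence) is a guess that may differ from what the stylesheet actually emits.

```python
REPLACEMENT = "\ufffd"


def decode_utf8(octets):
    """Decode a sequence of integers (0-255) as UTF-8, replacing bad input with U+FFFD."""
    out = []
    i = 0
    while i < len(octets):
        lead = octets[i]
        if lead < 0x80:                      # single-octet (ASCII) character
            out.append(chr(lead))
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:
            length, minimum, cp = 2, 0x80, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, minimum, cp = 3, 0x800, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            length, minimum, cp = 4, 0x10000, lead & 0x07
        else:                                # stray continuation octet or invalid lead
            out.append(REPLACEMENT)
            i += 1
            continue
        tail = octets[i + 1:i + length]
        if len(tail) != length - 1 or any((t & 0xC0) != 0x80 for t in tail):
            out.append(REPLACEMENT)          # truncated or malformed: skip the lead only
            i += 1
            continue
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        # Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if cp < minimum or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            out.append(REPLACEMENT)
        else:
            out.append(chr(cp))
        i += length
    return "".join(out)
```

For instance, under these assumptions decode_utf8([0xC0, 0x80]) (an overlong encoding of U+0000) and decode_utf8([0xED, 0xA0, 0x80]) (an encoded surrogate) each yield a single U+FFFD.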

If you discover an error that you believe might have security implications, please contact me immediately at the e-mail address below. You are encouraged to sign your e-mail with OpenPGP and encrypt it to the public key identified below.

Correctness

A trivial test was performed: the decoder correctly decodes a UTF-8 sequence (4,382,557 octets in length) containing each of the 1,112,033 characters allowed in XML, once, in order from the lowest-numbered codepoint to the highest. On a dual 32-bit Xeon machine at 2.4GHz, with Saxon-B 9.0.0.4 on Sun's Java 6 JRE, this required about 1GB of RAM and took a few minutes.
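The two figures above can be re-derived with a short Python sketch. It assumes "characters allowed in XML" means the Char production of XML 1.0 (the stated count matches that definition); the names below are illustrative only.

```python
# Codepoint ranges of the XML 1.0 "Char" production (inclusive bounds).
XML_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]


def utf8_length(cp):
    """Number of octets UTF-8 uses to encode codepoint cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


char_count = sum(hi - lo + 1 for lo, hi in XML_CHAR_RANGES)
octet_count = sum(
    utf8_length(cp) for lo, hi in XML_CHAR_RANGES for cp in range(lo, hi + 1)
)

print(char_count)   # 1112033
print(octet_count)  # 4382557
```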

Query String Parameters

Included is an example library which decodes a query string and represents the ampersand-delimited parameters as XML elements.

An XSLT parameter is defined to hold a query string. When this parameter is given, an XSLT variable holds the XML representation of the decoded string.

Alternatively, an XPath function is provided to decode any query string and represent the result in the same XML form.

Each ampersand-delimited parameter is represented in XML as an element in the http://www.thoughtcrime.us/ns/xslt-query-params namespace. The localname of the element is the parameter "key": the part of the query parameter before the first equal sign, or the whole parameter if there is no equal sign. (If the key can't be represented as an XML element localname, the parameter is omitted.) If there is a parameter "value" (the part after the first equal sign), it appears as a text node under the element; the element is empty otherwise.

Keys and values are "plus-decoded" ("+" becomes "%20"), "percent-decoded" to sequences of UTF-8 octets (%21 becomes "!"), and UTF-8 decoded to strings.

Elements with the same key may appear more than once.

For example, given this query string:

ebat%c3%b5en%c3%a4oline-n%c3%a4ide=dead+beef+caf%c3%A9%21&boolean-param&repeating-param=foo&repeating-param=bar&non-unicode-heart=%3C3

The included XPath function qp:decode-query-string would return a sequence of these elements (each in the http://www.thoughtcrime.us/ns/xslt-query-params namespace):

<ebatõenäoline-näide>dead beef café!</ebatõenäoline-näide>
<boolean-param/>
<repeating-param>foo</repeating-param>
<repeating-param>bar</repeating-param>
<non-unicode-heart>&lt;3</non-unicode-heart>
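The key/value splitting and decoding described in this section can be approximated in Python with the standard library. This is only a sketch, not the XSLT function: decode_query_string here is a hypothetical Python helper that returns (key, value) pairs instead of XML elements, uses None for a parameter with no equal sign, skips empty parameters (an assumption), and does not drop keys that are invalid as XML element names.

```python
from urllib.parse import unquote_plus


def decode_query_string(query):
    """Split a query string into (key, value) pairs; value is None if there is no '='."""
    params = []
    for part in query.split("&"):
        if not part:
            continue  # assumption: empty parameters (e.g. from "a&&b") are ignored
        # Split at the first equal sign only; later ones belong to the value.
        raw_key, sep, raw_value = part.partition("=")
        # unquote_plus turns "+" into a space, then percent-decodes as UTF-8,
        # substituting U+FFFD for undecodable octets.
        key = unquote_plus(raw_key, encoding="utf-8", errors="replace")
        value = (
            unquote_plus(raw_value, encoding="utf-8", errors="replace")
            if sep else None
        )
        params.append((key, value))
    return params
```

Applied to the example query string, this yields five pairs in order, with "boolean-param" paired with None and "repeating-param" appearing twice. Note that Python's handling of malformed UTF-8 may not match this library's replacement behaviour octet for octet.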

Download

The most recent release is 0.1, released 2009-09-11. It has no external dependencies.

I sign all my software with OpenPGP, key ID 0xE979FFBEA002D20F, fingerprint A87B 1C5A 28C4 03BD 54BA CE8E E979 FFBE A002 D20F. (Releases were previously signed with 0x80555CED7394F948, which has been revoked, but not compromised. See my OpenPGP transition statement.)

Copying

This software is licensed under permissive, BSD-like terms, copied from the ISC license:

Copyright (c) 2009, Jean-Paul Guy Larocque

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Contact

I'm reachable via e-mail for feedback, questions, help requests, and bug reports: jpl-software at thoughtcrime.us

— J.P. Larocque