Whitespace and RELAX NG Native Datatypes


RELAX NG includes a native type system, but this type library has been kept minimal by design because more complete type libraries are available. It consists of just two datatypes (token and string) that differ only in the whitespace processing applied before validation. The whole RELAX NG datatype system can be seen as a mechanism for adding validating transformations to text nodes. These transformations convert text nodes into a canonical format: a single normalized form to which all the different representations of the same value are reduced. The two native datatypes don't detect format errors (their formats are broad enough to allow any value), but they still transform text nodes into their canonical forms, which can make a difference for enumerations. Other datatype libraries, covered in Chapter 8, can detect format errors.

Enumerations are the first place you can see datatypes at work. A datatype is applied to an enumeration value by adding a type attribute to the value pattern. Up to now, we haven't specified any datatype when writing value elements, so they have had the default type, token, from the built-in library. Text values of this datatype receive full whitespace normalization similar to that performed by the XPath normalize-space() function: every sequence of one or more whitespace characters, that is, #x20 (space), #x9 (tab), #xA (linefeed), or #xD (carriage return), is replaced by a single space, and any leading and trailing space is then trimmed.
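The normalization described above can be sketched in a few lines of Python. This is only an illustration of the rule, not code taken from any RELAX NG processor; the function name normalize_token is an assumption made for this example.

```python
import re

# The four whitespace characters named in the text:
# #x20 (space), #x9 (tab), #xA (linefeed), #xD (carriage return).
_WHITESPACE_RUN = re.compile(r"[\x20\x09\x0A\x0D]+")


def normalize_token(text: str) -> str:
    """Hypothetical helper: collapse each whitespace run to a single
    space, then trim the leading and trailing space."""
    collapsed = _WHITESPACE_RUN.sub(" ", text)
    return collapsed.strip(" ")


print(repr(normalize_token("  on\t\n hold \r\n")))  # 'on hold'
```

A regular expression is used rather than str.split() because Python's default split also treats other characters (such as form feed) as whitespace, which goes beyond the four characters listed above.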

Reconsidering previous examples, writing:

<attribute name="available">
 <choice>
  <value>available</value>
  <value>checked out</value>
  <value>on hold</value>
 </choice>
</attribute>

or:

attribute available {"available"|"checked out"|"on hold"}

uses the default datatype (token) and is equivalent to the following:

<attribute name="available">
 <choice>
  <value type="token">available</value>
  <value type="token">checked out</value>
  <value type="token">on hold</value>
 </choice>
</attribute>

or:

attribute available {token "available"|token "checked out"|token "on hold"}

When the token datatype is used, whitespace normalization is applied both to the value defined in the schema and to the value found in the instance document. The comparison is done on the normalized results, which explains why "on hold" matched " on hold " with spaces or tabs added before, between, and after the words.
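To make the two-sided normalization concrete, here is a minimal Python sketch of the comparison. It models only the behavior described in this section; the names normalize_token and token_value_matches are hypothetical and don't come from any RELAX NG implementation.

```python
import re


def normalize_token(text: str) -> str:
    """Collapse runs of space, tab, linefeed, and carriage return to one
    space, then trim (hypothetical helper for this sketch)."""
    return re.sub(r"[\x20\x09\x0A\x0D]+", " ", text).strip(" ")


def token_value_matches(schema_value: str, instance_value: str) -> bool:
    """Compare the two values after normalizing BOTH of them."""
    return normalize_token(schema_value) == normalize_token(instance_value)


# Extra whitespace in the instance document is ignored.
print(token_value_matches("on hold", " on\t hold "))    # True
# Extra whitespace in the schema value is ignored as well.
print(token_value_matches("\n  on hold\n", "on hold"))  # True
# Normalization collapses whitespace; it never removes it between words.
print(token_value_matches("on hold", "onhold"))         # False
```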

	Note
The name of the `token` datatype, borrowed from W3C XML Schema, is highly confusing. In IT jargon, a token is a piece of a string between two delimiters, or what plain English calls a "word." The `token` datatype doesn't denote a word; if it did, "on" and "hold" would be valid tokens, but "on hold" wouldn't. The `token` datatype is more of a "token-ized" datatype, in the sense that it's a string that can easily be cut into tokens once nonsignificant whitespace is removed. This confusion is dangerous because it can lead you to use the `string` datatype when what you need is `token`. (You'll see later in this chapter that the `string` datatype should be reserved for select cases.)

To suppress this normalization, you can specify the second built-in datatype, string, which doesn't perform any transformation on the values before comparing them to the specified value:

 <attribute name="available">
  <choice>
   <value type="string">available</value>
   <value type="string">checked out</value>
   <value type="string">on hold</value>
  </choice>
 </attribute>

or:

attribute available {string "available"|string "checked out"|string "on hold"}

With this new definition, the value of the attribute must exactly match one of the values specified in the schema: available, checked out, or on hold. No extra whitespace is permitted.
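The stricter behavior of the string datatype can be sketched the same way. The following Python fragment only models the exact comparison described above; ALLOWED_VALUES and string_value_matches are names invented for this illustration.

```python
# The three enumeration values from the schema, stored untouched.
ALLOWED_VALUES = ("available", "checked out", "on hold")


def string_value_matches(instance_value: str) -> bool:
    """With the string datatype, no transformation is applied:
    the instance value must equal a schema value character for character."""
    return instance_value in ALLOWED_VALUES


print(string_value_matches("on hold"))    # True
print(string_value_matches(" on hold "))  # False: leading and trailing space
print(string_value_matches("on  hold"))   # False: two spaces between the words
```

The last two values would be accepted if the enumeration used the token datatype, since they normalize to "on hold".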

	Tip
	The native `token` and `string` datatypes have the same basic definition as the W3C XML Schema `token` and `string` datatypes. The difference is that the additional restrictions that can be applied to the W3C XML Schema datatypes using `param` elements aren't available with RELAX NG's native datatypes. More details are provided in Chapter 8.
