Common Patterns

RELAX NG by Eric van der Vlist will be published by O'Reilly & Associates (ISBN: 0596004214)

After this overview of the syntax used by pattern facets, let's look at some common pattern facets that you can use as is, adapt for your schemas, or simply treat as examples.

String Datatypes

Regular expressions treat information in its textual form. This makes them an excellent mechanism for constraining strings.

Unicode blocks

Unicode is one of XML's greatest assets. However, few applications can process and display all the characters of the Unicode set correctly, and even fewer users can read them! If you need to check that the characters in your string values belong to one (or more) Unicode blocks, you can use these pattern facets:

  <define name="BasicLatinToken">
    <data type="token">
      <param name="pattern">\p{IsBasicLatin}*</param>
    </data>
  </define>
	
  <define name="Latin-1Token">
    <data type="token">
      <param name="pattern">[\p{IsBasicLatin}\p{IsLatin-1Supplement}]*</param>
    </data>
  </define>

Or:

  BasicLatinToken = xsd:token {pattern = "\p{IsBasicLatin}*"}

  Latin-1Token = xsd:token {pattern = "[\p{IsBasicLatin}\p{IsLatin-1Supplement}]*"}

Note that such pattern facets do not impose a character encoding on the document itself. For instance, the Latin-1Token datatype would validate instance documents using UTF-8, UTF-16, ISO-8859-1, or any other encoding, provided the characters used in the string belong to the two Unicode blocks BasicLatin and Latin-1Supplement. In other words, even the lexical space reflects some processing done by the parser, below the level you can control with a schema.
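As a rough illustration of this point, the following Python sketch decodes the same text from two different encodings and applies an approximation of the two checks to the resulting characters. It is not RELAX NG: Python's re module has no \p{IsBasicLatin} block escapes, so the blocks are written as code point ranges, re.fullmatch stands in for the implicit anchoring of pattern facets, and the function names are hypothetical.

```python
import re

# Assumption: the BasicLatin block is U+0000-U+007F and the
# Latin-1Supplement block is U+0080-U+00FF, written here as ranges.
BASIC_LATIN = re.compile(r"[\x00-\x7F]*")
LATIN_1 = re.compile(r"[\x00-\xFF]*")


def is_basic_latin_token(value: str) -> bool:
    # fullmatch mimics the implicit anchoring of a pattern facet.
    return BASIC_LATIN.fullmatch(value) is not None


def is_latin1_token(value: str) -> bool:
    return LATIN_1.fullmatch(value) is not None


# The same characters, read back from two different encodings.
from_utf8 = "caf\u00e9".encode("utf-8").decode("utf-8")
from_latin1 = "caf\u00e9".encode("iso-8859-1").decode("iso-8859-1")

# The check sees characters, not bytes, so both spellings pass.
print(is_latin1_token(from_utf8), is_latin1_token(from_latin1))  # True True
print(is_basic_latin_token(from_utf8))                           # False
print(is_latin1_token("c\u0153ur"))                              # False
```

The last value fails because U+0153 belongs to neither of the two blocks.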

Counting words

The pattern facet can be used to limit the number of words in a text block. To do so, we will define an atom, which is a sequence of one or more "word" characters (\w+) followed by one or more nonword characters (\W+), and thus control the number of occurrences of this atom. If we are not very strict about punctuation, we also need to allow an arbitrary number of nonword characters at the beginning of our value and deal with the possibility of a value ending with a word (without further separation). One of the ways to avoid any ambiguity at the end of the string is to dissociate the last occurrence of a word, making the trailing separator optional:

  <define name="story100-200words">
    <data type="token">
      <param name="pattern">\W*(\w+\W+){99,199}\w+\W*</param>
    </data>
  </define>

Or:

  story100-200words = xsd:token {pattern = "\W*(\w+\W+){99,199}\w+\W*"}
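To see how the bounds behave, here is a minimal Python sketch that builds the same kind of expression for an arbitrary word range. The helper word_count_pattern is hypothetical, and Python's re is only an approximation of W3C XML Schema regular expressions: its \w and \W are defined differently (the two disagree about the underscore, for instance), and re.fullmatch is used to reproduce the implicit anchoring of pattern facets.

```python
import re


def word_count_pattern(min_words: int, max_words: int) -> re.Pattern:
    # The last word is matched separately so that its trailing
    # separator can be optional; the repeated atom therefore
    # covers one word fewer than each bound.
    return re.compile(
        rf"\W*(\w+\W+){{{min_words - 1},{max_words - 1}}}\w+\W*"
    )


STORY_100_200 = word_count_pattern(100, 200)


def is_story(text: str) -> bool:
    return STORY_100_200.fullmatch(text) is not None


def words(n: int) -> str:
    # n copies of a dummy word separated by single spaces.
    return " ".join(["word"] * n)


print(is_story(words(99)))                # False: one word short
print(is_story(words(100)))               # True
print(is_story('"' + words(200) + '."'))  # True: punctuation is ignored
print(is_story(words(201)))               # False: one word too many
```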

URIs

The xsd:anyURI datatype does not turn relative URI references into absolute ones, so both forms are accepted. In some cases it is wise to require absolute URIs, which are easier to process. It can also be useful for some applications to limit the accepted URI schemes. Both constraints can easily be expressed with pattern facets such as:

  <define name="httpURI">
    <data type="anyURI">
      <param name="pattern">http://.*</param>
    </data>
  </define>

Or:

  httpURI = xsd:anyURI {pattern = "http://.*"}
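The effect of this facet can be checked quickly outside a validator. The sketch below is an approximation in Python, with re.fullmatch playing the role of the implicit anchoring of pattern facets; the function name and the sample URIs are made up for the illustration.

```python
import re

HTTP_URI = re.compile(r"http://.*")


def is_http_uri(value: str) -> bool:
    # The whole value must match, as with a pattern facet.
    return HTTP_URI.fullmatch(value) is not None


print(is_http_uri("http://example.org/chapter9"))  # True
print(is_http_uri("https://example.org/"))         # False: other scheme
print(is_http_uri("chapter9.html"))                # False: relative reference
print(is_http_uri("HTTP://example.org/"))          # False: match is case-sensitive
```

The last line is worth keeping in mind: the pattern compares characters literally, so an upper-case spelling of the scheme is rejected.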

Numeric and Float Types

While numeric types aren't strictly text, pattern facets can still be used to constrain their lexical form and effectively their content.

Leading zeros

Getting rid of leading zeros is quite simple but requires some precautions if we want to keep the optional sign and the number "0" itself. This can be done using pattern facets such as:

  <define name="noLeadingZeros">
    <data type="integer">
      <param name="pattern">[+-]?([1-9][0-9]*|0)</param>
    </data>
  </define>

Or:

  noLeadingZeros = xsd:integer {pattern = "[+-]?([1-9][0-9]*|0)"}

Note that in this pattern facet, we chose to redefine all the lexical rules that apply to an integer. Applied to an xsd:token datatype, it would yield the same lexical space as it does on xsd:integer. We could also have relied on the expectations of the base datatype and written:

  <define name="noLeadingZeros">
    <data type="integer">
      <param name="pattern">[+-]?([1-9].*|0)</param>
    </data>
  </define>

Or:

  noLeadingZeros = xsd:integer {pattern = "[+-]?([1-9].*|0)"}

Relying on the base datatype in this manner can produce simpler pattern facets, but the result can also be more difficult to interpret, since we have to combine the lexical rules of the base datatype with the rules expressed by the pattern facet to understand it.
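The difference between the two styles can be made concrete with a small Python sketch. It is an approximation only: INTEGER_LEXICAL is an assumed stand-in for the lexical rules of xsd:integer (an optional sign followed by digits, ignoring whitespace normalization), the first digit is written as [1-9] in both patterns, and the function names are hypothetical.

```python
import re

# Assumed stand-in for the lexical space of xsd:integer.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

# Self-contained pattern: restates every lexical rule itself.
STRICT = re.compile(r"[+-]?([1-9][0-9]*|0)")

# Shorter pattern: leaves everything after the first digit
# to the base datatype.
LOOSE = re.compile(r"[+-]?([1-9].*|0)")


def valid_strict(value: str) -> bool:
    return STRICT.fullmatch(value) is not None


def valid_loose(value: str) -> bool:
    # A value must satisfy the base datatype AND the facet.
    return (INTEGER_LEXICAL.fullmatch(value) is not None
            and LOOSE.fullmatch(value) is not None)


for value in ["0", "+7", "-120", "007", "-05", "1abc"]:
    print(value, valid_strict(value), valid_loose(value))
```

On its own, the shorter pattern accepts "1abc"; only the combination with the base datatype rejects it, which is exactly the extra reasoning step a reader of the schema has to perform.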

Fixed format

The maximum number of digits can be fixed using xsd:totalDigits and xsd:fractionDigits. However, these facets only set upper limits and work on the value space. If we want to fix the format of the lexical space to be, for instance, "DDDD.DD", we can write a pattern facet such as:

  <define name="fixedDigits">
    <data type="decimal">
      <param name="pattern">[+-]?[0-9]{4}\.[0-9]{2}</param>
    </data>
  </define>

Or:

  fixedDigits = xsd:decimal {pattern = "[+-]?[0-9]{4}\.[0-9]{2}"}
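A quick way to check a "DDDD.DD" constraint is to try it on a few values. The following Python sketch assumes the digits are spelled out as [0-9] so that the pattern does not depend on the base datatype; re.fullmatch approximates the implicit anchoring of pattern facets, and the function name is hypothetical.

```python
import re

# Optional sign, exactly four digits, a literal dot, two digits.
FIXED_DIGITS = re.compile(r"[+-]?[0-9]{4}\.[0-9]{2}")


def is_fixed_digits(value: str) -> bool:
    return FIXED_DIGITS.fullmatch(value) is not None


print(is_fixed_digits("1234.56"))   # True
print(is_fixed_digits("-0012.50"))  # True: zeros pad the lexical form
print(is_fixed_digits("12.5"))      # False: same value as 0012.50, wrong form
print(is_fixed_digits("-123.45"))   # False: the sign is not a digit
print(is_fixed_digits("1234.567"))  # False: too many fraction digits
```

The second and third lines show the lexical space at work: both strings denote the same decimal value, but only one has the required form.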

Datetimes

Dates and times have complex lexical representations. Patterns can give developers extra control over how they are used.

Time zones

The time zone support of W3C XML Schema is quite controversial and needs some additional constraints to avoid comparison problems. These pattern facets can be kept relatively simple since the syntax of the datetime is already checked by the schema validator and only simple additional checks need to be added. Applications which require that their datetimes specify a time zone may use the following template which checks that the time part ends with a "Z" or contains a sign:

  <define name="dateTimeWithTimezone">
    <data type="dateTime">
      <param name="pattern">.+T.+(Z|[+-].+)</param>
    </data>
  </define>

Or:

  dateTimeWithTimezone = xsd:dateTime {pattern = ".+T.+(Z|[+-].+)"}

Simpler applications that want to make sure that none of their datetime values specify a time zone may simply check that the time part doesn't contain the characters "+", "-", or "Z":

  <define name="dateTimeWithoutTimezone">
    <data type="dateTime">
      <param name="pattern">.+T[^Z+-]+</param>
    </data>
  </define>

Or:

  dateTimeWithoutTimezone = xsd:dateTime {pattern = ".+T[^Z+-]+"}
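The two checks are complementary, which a short Python sketch can confirm on a few sample values. This is an approximation under stated assumptions: the inputs are taken to be valid xsd:dateTime values already (the sketch does not verify that), the "with time zone" expression is written as .+T.+(Z|[+-].+), re.fullmatch mimics the implicit anchoring of pattern facets, and the function names are hypothetical.

```python
import re

# After the "T", either a final "Z" or a sign followed by an offset.
WITH_TZ = re.compile(r".+T.+(Z|[+-].+)")

# After the "T", no "Z" and no sign at all.
WITHOUT_TZ = re.compile(r".+T[^Z+-]+")


def has_timezone(value: str) -> bool:
    return WITH_TZ.fullmatch(value) is not None


def has_no_timezone(value: str) -> bool:
    return WITHOUT_TZ.fullmatch(value) is not None


samples = [
    "2003-06-01T12:00:00Z",
    "2003-06-01T12:00:00+02:00",
    "2003-06-01T12:00:00-04:00",
    "2003-06-01T12:00:00",
]
for value in samples:
    print(value, has_timezone(value), has_no_timezone(value))
```

The hyphens of the date part do not interfere, because both expressions only look for signs after the "T".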

In these two datatypes, we used the separator "T". This is convenient, since the signs cannot appear after this delimiter except in the time zone definition. This delimiter would be missing if we wanted to constrain dates instead of datetimes, but, in this case, we can detect the time zones on their ":" or "Z" instead:

  <define name="dateWithTimezone">
    <data type="date">
      <param name="pattern">.+[:Z].*</param>
    </data>
  </define>
  <define name="dateWithoutTimezone">
    <data type="date">
      <param name="pattern">[^:Z]*</param>
    </data>
  </define>

Or:

  dateWithTimezone = xsd:date {pattern = ".+[:Z].*"}
  dateWithoutTimezone = xsd:date {pattern = "[^:Z]*"}
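The same kind of spot check works for dates. In this Python sketch, the inputs are assumed to be valid xsd:date values, the "without time zone" expression is written with a repetition ([^:Z]*) so that it can cover the whole value, and re.fullmatch approximates the implicit anchoring of pattern facets. The function names are hypothetical.

```python
import re

# A date carries a time zone if it contains a ":" (numeric offset)
# or a "Z" (UTC).
DATE_WITH_TZ = re.compile(r".+[:Z].*")

# Every character of the value must be something else.
DATE_WITHOUT_TZ = re.compile(r"[^:Z]*")


def date_has_timezone(value: str) -> bool:
    return DATE_WITH_TZ.fullmatch(value) is not None


def date_has_no_timezone(value: str) -> bool:
    return DATE_WITHOUT_TZ.fullmatch(value) is not None


for value in ["2003-06-01", "2003-06-01Z", "2003-06-01+02:00", "2003-06-01-05:00"]:
    print(value, date_has_timezone(value), date_has_no_timezone(value))
```

A test on the sign alone would not be enough here, since a plain date already contains hyphens.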

Applications may also simply impose a set of time zones to use:

  <define name="dateTimeInMyTimezones">
    <data type="dateTime">
      <param name="pattern">.+(\+02:00|\+01:00|\+00:00|Z|-04:00)</param>
    </data>
  </define>

Or:

  dateTimeInMyTimezones = xsd:dateTime {
    pattern = ".+(\+02:00|\+01:00|\+00:00|Z|-04:00)"
  }
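When the list of accepted time zones is long or changes often, the alternation can be generated rather than typed by hand. The Python sketch below is one possible way to do that; the helper timezone_pattern is hypothetical, re.escape takes care of the "+" signs, and the inputs are assumed to be valid xsd:dateTime values.

```python
import re


def timezone_pattern(zones: list) -> re.Pattern:
    # Escape each zone literally and join them into one alternation.
    alternatives = "|".join(re.escape(zone) for zone in zones)
    return re.compile(".+(" + alternatives + ")")


MY_TIMEZONES = timezone_pattern(["+02:00", "+01:00", "+00:00", "Z", "-04:00"])


def in_my_timezones(value: str) -> bool:
    # fullmatch mimics the implicit anchoring of a pattern facet.
    return MY_TIMEZONES.fullmatch(value) is not None


print(in_my_timezones("2003-06-01T12:00:00+02:00"))  # True
print(in_my_timezones("2003-06-01T12:00:00Z"))       # True
print(in_my_timezones("2003-06-01T12:00:00-05:00"))  # False: zone not listed
print(in_my_timezones("2003-06-01T12:00:00"))        # False: no time zone
```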

We can also constrain xsd:duration to a couple of subsets that can be reliably compared. The first datatype will consist of durations expressed only in months and years, and the second will consist of durations expressed only in days, hours, minutes, and seconds. The criterion used for the test can be the presence of a "D" (for day) or a "T" (the time delimiter). If neither of those characters is detected, then the datatype uses only year and month parts. The test for the other type cannot be based on the absence of "Y" and "M", since there is also an "M" in the time part. Instead, we can test that, after an optional sign, the first field is either the day part or the "T" delimiter:

  <define name="YMduration">
    <data type="duration">
      <param name="pattern">[^TD]+</param>
    </data>
  </define>
  <define name="DHMSduration">
    <data type="duration">
      <param name="pattern">-?P((\d+D)|T).*</param>
    </data>
  </define>

Or:

  YMduration = xsd:duration {pattern = "[^TD]+"}
  DHMSduration = xsd:duration {pattern = "-?P((\d+D)|T).*"}

It may seem tricky, but this is a powerful tool for resolving complex problems simply.
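To make the two duration tests less abstract, here is a minimal Python sketch that sorts a few values into the two subsets. It assumes the inputs are valid xsd:duration values, uses re.fullmatch to approximate the implicit anchoring of pattern facets, and introduces a hypothetical classify_duration helper with made-up labels.

```python
import re

# No "T" and no "D" anywhere: only year and month parts remain.
YM_DURATION = re.compile(r"[^TD]+")

# After the optional sign and the "P", the first field is either
# the day part or the "T" delimiter.
DHMS_DURATION = re.compile(r"-?P((\d+D)|T).*")


def classify_duration(value: str) -> str:
    if YM_DURATION.fullmatch(value):
        return "year-month"
    if DHMS_DURATION.fullmatch(value):
        return "day-time"
    # Anything else mixes the two families of fields.
    return "mixed"


for value in ["P1Y2M", "-P1Y", "P3DT4H", "PT30M", "P1Y2D", "P1Y2MT3H"]:
    print(value, classify_duration(value))
```

"PT30M" lands in the day-time subset even though it contains an "M", which is why the second test looks at the first field rather than at the absence of "Y" and "M".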

All text is copyright Eric van der Vlist, Dyomedea. During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation GFDL.
