by Eric van der Vlist is published by O'Reilly & Associates (ISBN: 0596004214)


Common Patterns

After this overview of the syntax used by pattern facets, let's see some common pattern facets you may have to use (or adapt) in your schemas or just consider as examples.

String Datatypes

Regular expressions treat information in its textual form. This makes them an excellent mechanism for constraining strings.

Unicode blocks

Unicode is one of XML's greatest assets. However, few applications can process and display all the characters of the Unicode set correctly, and still fewer users can read them! If you need to check that the characters of a string belong to one (or more) Unicode blocks, you can use these pattern facets:

<define name="BasicLatinToken">
  <data type="token">
    <param name="pattern">\p{IsBasicLatin}*</param>
  </data>
</define>

<define name="Latin-1Token">
  <data type="token">
    <param name="pattern">[\p{IsBasicLatin}\p{IsLatin-1Supplement}]*</param>
  </data>
</define>

or:

BasicLatinToken = xsd:token {pattern = "\p{IsBasicLatin}*"}

Latin-1Token = xsd:token {pattern = "[\p{IsBasicLatin}\p{IsLatin-1Supplement}]*"}

Note that such pattern facets don't impose a character encoding on the document itself. The Latin-1Token datatype, for instance, validates instance documents encoded in UTF-8, UTF-16, ISO-8859-1, or any other encoding, as long as the characters used in the string belong to the two Unicode blocks BasicLatin and Latin-1Supplement. In other words, even the lexical space reflects some processing done by the parser, below the level you can control with a schema.
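
To experiment with these two classes outside a validator, the block names can be spelled out as code point ranges: BasicLatin covers U+0000 to U+007F and Latin-1Supplement covers U+0080 to U+00FF. The following Python sketch is only an approximation. Python's re module has no \p{Is...} block escapes, so the ranges are written explicitly, and the function names are invented for this illustration.

```python
import re

# Explicit ranges standing in for \p{IsBasicLatin} and
# [\p{IsBasicLatin}\p{IsLatin-1Supplement}].
BASIC_LATIN = re.compile(r"[\u0000-\u007F]*")
LATIN_1 = re.compile(r"[\u0000-\u00FF]*")


def is_basic_latin_token(value: str) -> bool:
    """Approximate BasicLatinToken: every character is in U+0000..U+007F."""
    return BASIC_LATIN.fullmatch(value) is not None


def is_latin1_token(value: str) -> bool:
    """Approximate Latin-1Token: every character is in U+0000..U+00FF."""
    return LATIN_1.fullmatch(value) is not None


# The same characters round-tripped through three encodings give one string,
# so the check does not depend on how the document was encoded.
text = "café"
decoded = {text.encode(enc).decode(enc)
           for enc in ("utf-8", "utf-16", "iso-8859-1")}

print(is_basic_latin_token("cafe"))  # plain ASCII
print(is_basic_latin_token(text))    # "é" lies outside BasicLatin
print(is_latin1_token(text))         # "é" lies in Latin-1Supplement
print(is_latin1_token("€5"))         # U+20AC lies in neither block
print(len(decoded))                  # number of distinct decoded strings
```

The check runs on decoded characters, which mirrors the point made above: the pattern sees the string after the parser has dealt with the encoding.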

Counting words

The pattern facet can limit the number of words in a text block. To do so, we define an atom, a sequence of one or more word characters (\w+) followed by one or more nonword characters (\W+), and control the number of occurrences of this atom. If you're not very strict about punctuation, you also need to allow an arbitrary number of nonword characters at the beginning of the value and deal with the possibility of a value ending with a word (without further separation). One way to avoid any ambiguity at the end of the string is to dissociate the last occurrence of a word, making the trailing separator optional. The repetition count is therefore one less than the word limits: 99 to 199 atoms plus the final word give 100 to 200 words:

<define name="story100-200words">
  <data type="token">
    <param name="pattern">\W*(\w+\W+){99,199}\w+\W*</param>
  </data>
</define>

or:

story100-200words= xsd:token {pattern = "\W*(\w+\W+){99,199}\w+\W*"}
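
The boundaries of this pattern are easy to get wrong by one, so it can help to try them out. The sketch below reuses the same expression with Python's re module, using fullmatch to mimic the implicit anchoring of pattern facets. It is an approximation: Python's \w and \W are not defined exactly like their W3C XML Schema counterparts, and the helper names are hypothetical.

```python
import re

# Same expression as the pattern facet above.
STORY = re.compile(r"\W*(\w+\W+){99,199}\w+\W*")


def is_story(value: str) -> bool:
    """True when the value holds between 100 and 200 words."""
    return STORY.fullmatch(value) is not None


def words(n: int) -> str:
    """Build a test value made of n copies of 'word' separated by spaces."""
    return " ".join(["word"] * n)


for count in (99, 100, 200, 201):
    print(count, is_story(words(count)))

# Leading and trailing punctuation is tolerated by the \W* parts.
print(is_story("... " + words(100) + "!"))
```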

URIs

The xsd:anyURI datatype doesn't care about making relative URI references into absolute URI references. In some cases, it is wise to require the usage of absolute URIs, which are easier to process. Furthermore, it can also be useful for some applications to limit the accepted URI schemes, which can easily be done by a set of pattern facets such as:

<define name="httpURI">
  <data type="anyURI">
    <param name="pattern">http://.*</param>
  </data>
</define>

or:

httpURI= xsd:anyURI {pattern = "http://.*"}
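
As a quick check of what this pattern accepts, here is a minimal Python sketch. The first expression is the one shown above. The second, which accepts several schemes through an alternation, is a hypothetical variant added for illustration, as are the function names.

```python
import re

# Pattern from the text: absolute http URIs only.
HTTP_URI = re.compile(r"http://.*")
# Hypothetical variant accepting a small set of schemes.
WEB_OR_FTP_URI = re.compile(r"(http|https|ftp)://.*")


def is_http_uri(value: str) -> bool:
    """True when the value starts with the http scheme and authority marker."""
    return HTTP_URI.fullmatch(value) is not None


def is_web_or_ftp_uri(value: str) -> bool:
    """True when the value uses one of the schemes listed in the variant."""
    return WEB_OR_FTP_URI.fullmatch(value) is not None


for uri in ("http://example.org/ns", "https://example.org/",
            "ftp://example.org/file", "../relative/path"):
    print(uri, is_http_uri(uri), is_web_or_ftp_uri(uri))
```

Relative references fail both checks because they carry no scheme at all, which is the property the text is after.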

Numeric and Float Types

While numeric types aren't strictly text, pattern facets can still be used to constrain their lexical form and, effectively, their content.

Leading zeros

Getting rid of leading zeros is quite simple but requires some precautions if you want to keep the optional sign and the number itself. This can be done using pattern facets such as:

<define name="noLeadingZeros">
  <data type="integer">
    <param name="pattern">[+-]?([1-9][0-9]*|0)</param>
  </data>
</define>

or:

noLeadingZeros= xsd:integer {pattern = "[+-]?([1-9][0-9]*|0)"}

Note that in this pattern facet, I chose to redefine all the lexical rules that apply to an integer. It gives the same lexical space whether it is applied to an xsd:token or to an xsd:integer datatype. You can also rely on the expectations of the base datatype and write:

<define name="noLeadingZeros">
  <data type="integer">
    <param name="pattern">[+-]?([^0+-].*|0)</param>
  </data>
</define>

or:

noLeadingZeros= xsd:integer {pattern = "[+-]?([^0+-].*|0)"}

Relying on the base datatype in this manner can produce simpler pattern facets, but the result can also be more difficult to interpret because you have to combine the lexical rules of the base datatype with the rules expressed by the pattern facet to understand it.
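
The difference between the two approaches can be sketched in Python. The base datatype is approximated here by a separate expression (an optional sign followed by digits), which is an assumption made for this illustration. The loose pattern is written with the sign characters excluded from the first position ([^0+-]) so that a sign cannot stand in for the first digit, and the strict pattern spells out every allowed character with [1-9]. All names are hypothetical.

```python
import re

# Lexical rules of the base datatype, approximated: optional sign, then digits.
BASE_INTEGER = re.compile(r"[+-]?[0-9]+")
# Loose pattern: only bans a leading zero and leaves the rest to the base type.
LOOSE = re.compile(r"[+-]?([^0+-].*|0)")
# Strict pattern: describes the whole lexical form by itself.
STRICT = re.compile(r"[+-]?([1-9][0-9]*|0)")


def valid_loose(value: str) -> bool:
    """Base datatype check combined with the loose pattern."""
    return bool(BASE_INTEGER.fullmatch(value)) and bool(LOOSE.fullmatch(value))


def valid_strict(value: str) -> bool:
    """The strict pattern alone already implies a well-formed integer."""
    return bool(STRICT.fullmatch(value))


for sample in ("0", "42", "-7", "+100", "007", "-012", "1a"):
    print(sample, valid_loose(sample), valid_strict(sample),
          bool(LOOSE.fullmatch(sample)))
```

The last column shows the loose pattern taken alone: it lets a value such as 1a through, and only the combination with the base datatype rejects it.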

Fixed format

The maximum number of digits can be fixed using xsd:totalDigits and xsd:fractionDigits. However, these facets are only maximum numbers and work on the value space. If you want to fix the format of the lexical space to be, for instance, DDDD.DD, you can write a pattern facet such as:

<define name="fixedDigits">
  <data type="decimal">
    <param name="pattern">[+-]?\d{4}\.\d{2}</param>
  </data>
</define>

or:

fixedDigits= xsd:decimal {pattern = "[+-]?\d{4}\.\d{2}"}
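
The distinction between value space and lexical space can be made concrete with a short Python sketch. It assumes a DDDD.DD format written with explicit digit classes, and uses the standard decimal module to stand in for the value space. The function name is hypothetical.

```python
import re
from decimal import Decimal

# Optional sign, exactly four digits, a literal dot, exactly two digits.
FIXED = re.compile(r"[+-]?\d{4}\.\d{2}")


def is_fixed_format(value: str) -> bool:
    """True when the lexical form is DDDD.DD with an optional sign."""
    return FIXED.fullmatch(value) is not None


# Two lexical forms of the same value: equal in the value space...
print(Decimal("0042.00") == Decimal("42"))
# ...but only one of them has the required lexical form.
print(is_fixed_format("0042.00"), is_fixed_format("42"))

for sample in ("1234.50", "-0042.00", "123.45", "1234.5", "12345.00"):
    print(sample, is_fixed_format(sample))
```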

Datetimes

Dates and times have complex lexical representations. Patterns give you extra control over how they are used.

Time zones

The time-zone support of W3C XML Schema is quite controversial and needs some additional constraints to avoid comparison problems. These pattern facets can be kept relatively simple because the syntax of the datetime is already checked by the schema validator, and only simple additional checks need to be added. Applications that require their datetimes to specify a time zone may use the following template that checks if the time part ends with a Z or contains a sign:

<define name="dateTimeWithTimezone">
  <data type="dateTime">
    <param name="pattern">.+T.+(Z|[+-].+)</param>
  </data>
</define>

or:

dateTimeWithTimezone= xsd:dateTime {pattern = ".+T.+(Z|[+-].+)"}

Simpler applications that want to make sure that none of their datetime values specify a time zone can simply check that the time part doesn't contain the characters + - or Z:

<define name="dateTimeWithoutTimezone">
  <data type="dateTime">
    <param name="pattern">.+T[^Z+-]+</param>
  </data>
</define>

or:

dateTimeWithoutTimezone= xsd:dateTime {pattern = ".+T[^Z+-]+"}
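
These two patterns can be tried on a few sample values with Python's re module. The sketch below only applies the patterns. It does not check that the samples are valid xsd:dateTime values, a job that the text leaves to the schema validator, and the function names are hypothetical.

```python
import re

# Patterns copied from the two definitions above.
WITH_TZ = re.compile(r".+T.+(Z|[+-].+)")
WITHOUT_TZ = re.compile(r".+T[^Z+-]+")


def has_timezone(value: str) -> bool:
    """True when the time part ends with Z or carries a signed offset."""
    return WITH_TZ.fullmatch(value) is not None


def lacks_timezone(value: str) -> bool:
    """True when nothing after the T separator is a Z, a plus or a minus."""
    return WITHOUT_TZ.fullmatch(value) is not None


for sample in ("2004-02-29T12:00:00Z",
               "2004-02-29T12:00:00+01:00",
               "2004-02-29T12:00:00-05:00",
               "2004-02-29T12:00:00"):
    print(sample, has_timezone(sample), lacks_timezone(sample))
```

The hyphens of the date part do not disturb either test because both patterns only look for signs after the T.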

Both datatypes rely on the T separator. This separator is convenient because the characters +, -, and Z cannot occur after it except in the time-zone part. No such delimiter is available if you want to constrain dates instead of datetimes, but in that case you can detect the time zone by its colon (or by a Z) instead:

<define name="dateWithTimezone">
  <data type="date">
    <param name="pattern">.+[:Z].*</param>
  </data>
</define>
<define name="dateWithoutTimezone">
  <data type="date">
    <param name="pattern">[^:Z]+</param>
  </data>
</define>

or:

dateWithTimezone= xsd:date {pattern = ".+[:Z].*"}
dateWithoutTimezone= xsd:date {pattern = "[^:Z]+"}
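
The colon test can be checked in the same way. In this Python sketch, the pattern for dates without a time zone is written with a repetition operator ([^:Z]+) so that it covers the whole value rather than a single character. The samples are assumed to be valid xsd:date values already, and the function names are hypothetical.

```python
import re

WITH_TZ = re.compile(r".+[:Z].*")
WITHOUT_TZ = re.compile(r"[^:Z]+")


def date_has_timezone(value: str) -> bool:
    """True when the date contains a colon (offset) or a Z."""
    return WITH_TZ.fullmatch(value) is not None


def date_lacks_timezone(value: str) -> bool:
    """True when the date contains neither a colon nor a Z."""
    return WITHOUT_TZ.fullmatch(value) is not None


for sample in ("2004-02-29", "2004-02-29Z",
               "2004-02-29+01:00", "2004-02-29-05:00"):
    print(sample, date_has_timezone(sample), date_lacks_timezone(sample))
```

A sign cannot be used as the marker here because a plain date already contains hyphens, which is why the colon of the offset is tested instead.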

Applications may also impose a set of time zones to use:

<define name="dateTimeInMyTimezones">
  <data type="dateTime">
    <param name="pattern">.+(\+02:00|\+01:00|\+00:00|Z|-04:00)</param>
  </data>
</define>

or:

dateTimeInMyTimezones= xsd:dateTime {
        pattern = ".+(\+02:00|\+01:00|\+00:00|Z|-04:00)"
}
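
A short Python sketch shows which values pass this list of time zones. The expression is copied from the definition above, the samples are assumed to be valid xsd:dateTime values, and the function name is hypothetical.

```python
import re

# Only these five time-zone suffixes are accepted.
MY_TIMEZONES = re.compile(r".+(\+02:00|\+01:00|\+00:00|Z|-04:00)")


def in_my_timezones(value: str) -> bool:
    """True when the value ends with one of the listed time zones."""
    return MY_TIMEZONES.fullmatch(value) is not None


for sample in ("2004-02-29T12:00:00+02:00",
               "2004-02-29T12:00:00Z",
               "2004-02-29T12:00:00-04:00",
               "2004-02-29T12:00:00-05:00",
               "2004-02-29T12:00:00"):
    print(sample, in_my_timezones(sample))
```

Note that a value without any time zone is rejected as well, since every alternative requires an explicit suffix.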

You can also constrain xsd:duration to two subsets that can be reliably compared. The first datatype consists of durations expressed only in years and months, and the second consists of durations expressed only in days, hours, minutes, and seconds. The first test can rely on the presence of a D (for day) or a T (the time delimiter): if neither character is present, the duration uses only year and month parts. The test for the second type can't be based on the absence of Y and M because there is also an M in the time part. Instead, you can check that, after an optional sign and the leading P, the first field is either the day part or the T delimiter:

<define name="YMduration">
  <data type="duration">
    <param name="pattern">[^TD]+</param>
  </data>
</define>
<define name="DHMSduration">
  <data type="duration">
    <param name="pattern">-?P((\d+D)|T).*</param>
  </data>
</define>

or:

YMduration= xsd:duration {pattern = "[^TD]+"} 
DHMSduration= xsd:duration {pattern = "-?P((\d+D)|T).*"}

It may seem tricky, but this is a powerful tool for resolving complex problems simply.
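
To see how the two duration patterns split the lexical space, the following Python sketch classifies a few sample durations. It assumes the samples are valid xsd:duration values already. The function name and the three labels are invented for this illustration.

```python
import re

# Patterns copied from the YMduration and DHMSduration definitions.
YM = re.compile(r"[^TD]+")
DHMS = re.compile(r"-?P((\d+D)|T).*")


def classify_duration(value: str) -> str:
    """Return which comparable subset the duration belongs to, if any."""
    if YM.fullmatch(value):
        return "year-month"
    if DHMS.fullmatch(value):
        return "day-time"
    return "mixed"


for sample in ("P1Y2M", "P18M", "P3DT4H", "PT5M", "-P2D", "P1Y2DT3H"):
    print(sample, classify_duration(sample))
```

A duration such as P1Y2DT3H mixes both kinds of fields, so it matches neither pattern and falls outside the two comparable subsets.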


This text is released under the Free Software Foundation GFDL.