RELAX NG by Eric van der Vlist will be published by O'Reilly & Associates (ISBN: 0596004214)

You are welcome to use our annotation system to give your feedback.


W3C XML Schema type library

W3C XML Schema simple types required several chapters in my XML Schema book to explain completely, but I'll try to give a brief overview here so that you can use the most basic features within RELAX NG schemas. You will find additional detail about the simple types' definitions in Chapter 9: W3C XML Schema Regular Expressions and in Chapter 19: W3C XML Schema Datatypes, and you are of course welcome to read chapters 4, 5, 6 and 16 of my W3C XML Schema book to get a deeper understanding of their behavior.

The W3C XML Schema datatypes which can be used in a RELAX NG schema are the predefined W3C XML Schema types, that is, those which are defined in the W3C XML Schema Recommendation itself. User-defined types, which are derived from the predefined types using the W3C XML Schema language, cannot be used from a RELAX NG schema. We will see that restrictions (called facets in the terminology of W3C XML Schema) can be applied to these datatypes using the RELAX NG param pattern, so some customization is possible.

Note

RELAX NG's support for named patterns makes it effectively possible to derive types from W3C XML Schema simple types despite RELAX NG's lack of support for the W3C XML Schema type derivation system. This might be a bit confusing right now but it will become clearer with examples; RELAX NG borrows the most basic part of W3C XML Schema datatypes without borrowing its syntax and derivation methods.

The W3C XML Schema predefined datatypes are divided into primitive and derived types. Primitive types are basic types which do not share a common foundation of meaning and behave differently from each other. Derived types are built on the foundations of primitive types and share the semantics of their primitive type. Derived types are provided for the convenience of users, since it is expected that they will be commonly used and shouldn't need constant reinvention.

The other idea which needs to be introduced before we start is the concept of lexical and value spaces: the lexical space is the string as it appears in the XML document (after whitespace normalization), while the value space is the matching value as interpreted by the datatype library. The distinction is important since all the facets save one (the pattern facet, which will be covered in depth in the next chapter, "Chapter 9: W3C XML Schema Regular Expressions") act on the value space.
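To make the distinction concrete, here is a minimal sketch written in Python (used here purely as an illustration language; a RELAX NG processor does this work inside its datatype library). It assumes that Python's decimal.Decimal is a reasonable stand-in for the value space of the W3C XML Schema decimal type, and the helper name to_value is hypothetical.

```python
from decimal import Decimal

def to_value(lexical: str) -> Decimal:
    """Map a lexical form (the string found in the document) to a value."""
    # Leading and trailing whitespace is removed before interpretation.
    return Decimal(lexical.strip())

# Four different strings in the lexical space...
lexical_forms = ["1.5", "01.50", "+1.5", "  1.500 "]

# ...which all denote one single member of the value space.
values = {to_value(s) for s in lexical_forms}
print(len(set(lexical_forms)), "lexical forms,", len(values), "value")
```

A facet that acts on the value space sees these four strings as the same thing, while the pattern facet sees four different strings.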

The next few sections will give a brief presentation of the datatypes, organized by their primitive types.

The string datatypes include:

string

This is the only datatype for which no whitespace normalization is done. There is no restriction on the lexical or value spaces of this datatype which is identical to the string RELAX NG built-in type. The difference is that restrictions can be applied through param patterns on the W3C XML Schema string type.

normalizedString

A string to which an intermediate level of whitespace processing is applied: each occurrence of a whitespace character (tabs (#x9), linefeeds (#xA), and carriage returns (#xD)) is replaced by a space (#x20), so the number of characters is unchanged, but no space collapsing or trimming is performed. Just as for the string datatype, there is no restriction on the lexical or value spaces of this datatype.

token

This datatype is similar to the built-in token datatype: whitespaces are normalized--all the sequences of whitespaces are replaced by a single space and the leading and trailing spaces are removed. After string and normalizedString, this is the third and last datatype which has no constraint on its value or lexical spaces. (We must also note that all the datatypes except string and normalizedString follow the same normalization rules as the token datatype.)

language

This was created to accept all the language codes standardized by RFC 1766. Some valid values for this datatype are en, en-US, fr, or fr-FR.

NMTOKEN

This corresponds to the XML 1.0 Nmtoken (Name token) production, which is a single token (a set of characters without spaces) composed of characters allowed in an XML name. Some examples of valid values for this datatype are "Snoopy", "CMS", "1950-10-04", or "0836217462". Invalid values include "brought classical music to the Peanuts strip" (spaces are forbidden) or "bold,brash" (commas are forbidden).

NMTOKENS

The lexical and value spaces of NMTOKENS are whitespace-separated lists of NMTOKEN values.

Name

This is similar to NMTOKEN with the additional restriction that the values must start either with a letter or with one of the characters ":" or "_". This datatype conforms to the XML 1.0 definition of a Name. Some examples of valid values for this datatype are "Snoopy", "CMS", or "_1950-10-04-10:00". Invalid values include "0836217462" (cannot start with a digit) or "bold,brash" (commas are forbidden). This datatype should not be used for names that may be qualified by a namespace prefix. There is another datatype (QName) that has a specific semantic for these values.

NCName

This is a noncolonized name as defined by Namespaces in XML 1.0: a Name without any colons (":"). As such, this datatype is probably the predefined datatype that is closest to the notion of a name in most programming languages (some characters such as "-" or "." may still be a problem in many cases). Some valid values for this datatype are "Snoopy", "CMS", "_1950-10-04-10-00", or "bold_brash". Invalid values include "_1950-10-04:10-00" or "bold:brash" (colons are forbidden).

ID

The lexical space of ID is the same as the lexical space of NCName. As defined by the W3C XML Schema recommendation, there is one constraint added to its value space: there must not be any duplicate values in a document. RELAX NG doesn't allow datatype libraries to perform this type of check. This is a job for the "DTD compatibility feature", as we will see at the end of this chapter. Its specification asks RELAX NG processors supporting this feature to enforce ID uniqueness for W3C XML Schema ID datatypes. Other implementations will just check its lexical space as an NCName.

IDREF

The lexical space of IDREF is the same as the lexical space of NCName. Just as for ID, W3C XML Schema adds the constraint that it must match an ID defined in the same document. RELAX NG makes this behavior optional for RELAX NG processors supporting the W3C XML Schema type library without supporting the DTD compatibility feature.

IDREFS

The lexical space of IDREFS is the whitespace separated list of NCName. Just as for ID and IDREF, W3C XML Schema adds the constraint that each of the values must match an ID defined in the same document. RELAX NG makes this behavior optional for RELAX NG processors supporting the W3C XML Schema type library without supporting the DTD compatibility feature.

ENTITY

The lexical space of ENTITY is the same as the lexical space of NCName. This datatype is also provided for compatibility with XML 1.0 DTDs: an ENTITY value must match an unparsed entity defined in a DTD.

ENTITIES

The lexical and value spaces of ENTITIES are whitespace-separated lists of ENTITY values.
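The three levels of whitespace processing described above for string, normalizedString and token can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names preserve, replace and collapse are hypothetical labels for the three behaviors, and the carriage return (#xD) is treated like tabs and linefeeds, as the W3C XML Schema Recommendation does.

```python
import re

def preserve(text: str) -> str:
    """string: the text is left untouched."""
    return text

def replace(text: str) -> str:
    """normalizedString: each tab, linefeed or carriage return becomes one space."""
    return re.sub(r"[\t\n\r]", " ", text)

def collapse(text: str) -> str:
    """token (and most other datatypes): replace, then collapse runs of spaces and trim."""
    return re.sub(r" +", " ", replace(text)).strip(" ")

raw = "  Charles\tM.\n Schulz "
print(repr(preserve(raw)))
print(repr(replace(raw)))
print(repr(collapse(raw)))
```

Note that replace keeps the number of characters unchanged, while collapse may shorten the string.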

Up to now, we have only briefly mentioned XML namespaces. We will focus on them in Chapter 11: Namespaces but we need to use some of their concepts right now. If you're not familiar with namespaces, you can probably skip this section safely: you can be quite sure that you don't need qualified names quite yet. Even if you are an XML namespace guru, I wouldn't recommend using qualified names in content, which I consider poor practice.

What we're talking about here is different from using qualified names for element and attribute names. Using qualified names for element and attribute names is required by the recommendation "Namespaces in XML 1.0" and there isn't much debate left on the subject. Here, we are speaking of using qualified names in element or attribute values--this is much more controversial since it's creating a dependency between markup and its content.

Because of this dependency, you cannot consider a qualified name as a string datatype, since its prefix is only a shortcut to the associated namespace URI. The value space of a qualified name is thus not what we see, but a tuple composed of the associated namespace URI (replacing the prefix) and its local part (i.e. what is after the prefix and the colon).

For instance, if the xsd prefix has been associated with the namespace URI http://www.w3.org/2001/XMLSchema, a qualified name (QName) xsd:language would thus have a value which is the tuple {http://www.w3.org/2001/XMLSchema, language}. It can be considered equal to a QName foo:language if the prefix foo has been associated with http://www.w3.org/2001/XMLSchema, or to language if http://www.w3.org/2001/XMLSchema has been defined as the default namespace.
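The following Python sketch illustrates why two different strings can denote the same qualified name. It is not how any particular RELAX NG processor is implemented: the function qname_value and its parameters are hypothetical, and the namespace declarations in scope are simply passed in as a dictionary.

```python
from typing import Dict, Tuple

XSD_NS = "http://www.w3.org/2001/XMLSchema"

def qname_value(lexical: str, prefixes: Dict[str, str],
                default_ns: str = "") -> Tuple[str, str]:
    """Return the (namespace URI, local part) tuple for a lexical QName."""
    prefix, colon, local = lexical.partition(":")
    if not colon:
        # No prefix: the default namespace applies.
        return (default_ns, lexical)
    # An undeclared prefix raises KeyError in this sketch.
    return (prefixes[prefix], local)

a = qname_value("xsd:language", {"xsd": XSD_NS})
b = qname_value("foo:language", {"foo": XSD_NS})
c = qname_value("language", {}, default_ns=XSD_NS)
print(a == b == c)
```

The three lexical forms differ, but their values are the same tuple, which is exactly what makes the value depend on the surrounding markup.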

There are two QName datatypes, which RELAX NG treats as equivalent:

XML 1.0 is not designed to store binary content: binary content must be encoded as some form of string before it can be included in an XML document. W3C XML Schema has defined two primitive datatypes to support two encodings, one that is commonly used (base64Binary) and one which is newer (hexBinary). These encodings may be used to include any binary content, including text formats whose content may be incompatible with the XML markup. Other binary text encodings could also be used in XML (such as uuXXcode, Quoted-Printable, BinHex, aencode, or base85, to name a few), but their values would not be recognized by W3C XML Schema.

Note

The W3C XML Schema Recommendation missed the fact that RFC 2045 requests a line break every 76 characters. This should be clarified in an erratum. The consequence of these line breaks being thought of as optional by W3C XML Schema is that the lexical and value spaces of base64Binary cannot be considered identical.
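A short Python sketch shows what this means in practice: the same bytes can be written as one long base64 line, as base64 wrapped at 76 characters, or as hexadecimal digits. The sample payload is arbitrary, and the standard library functions are used only as stand-ins for what a datatype library would do.

```python
import base64
import binascii

payload = bytes(range(60))  # arbitrary binary content, 60 bytes

# hexBinary-style lexical form: two hexadecimal digits per byte.
hex_form = binascii.hexlify(payload).decode("ascii")

# base64Binary-style lexical forms, without and with RFC 2045 line breaks.
b64_one_line = base64.b64encode(payload).decode("ascii")
b64_wrapped = base64.encodebytes(payload).decode("ascii")

print(len(hex_form), len(b64_one_line), b64_wrapped.count("\n"))
# Both base64 strings decode to the same value (line breaks are ignored).
print(base64.b64decode(b64_wrapped) == base64.b64decode(b64_one_line))
```

The two base64 strings are different members of the lexical space but the same member of the value space, which is why a length facet on a binary datatype counts bytes rather than characters.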

Numeric datatypes are built on top of four primitive datatypes: decimal for all the decimal types (including the integer datatypes, which are treated as decimals without a fractional part), float and double for single and double precision floats, and boolean for Booleans.

The first family of numeric datatypes is derived from the primitive type decimal:

decimal

This datatype represents decimal numbers. The number of digits can be arbitrarily long (the datatype doesn't impose any restrictions), but obviously, since an XML document has an arbitrary but finite length, the number of digits of the lexical representation of a decimal value needs to be finite. Although the number of digits is not limited, the next section (concerning facets) shows how the author of a schema can limit the number of digits if needed. Leading and trailing zeros are not considered significant and may be trimmed. The decimal separator is always a dot ("."), and a leading sign ("+" or "-") may be used, but any characters other than the ten digits, zero through nine, are forbidden, including whitespace inside the value. Allowed values for decimal include "123.456", "+1234.456", "-.456" or "-456".

integer

This datatype is a subset of decimal, representing numbers which don't have any fractional digits in their lexical or value spaces. The characters that are accepted are reduced to the digits zero through nine, with an optional leading sign. Like its base datatype, integer doesn't impose any limitation on the number of digits, and leading zeros are not significant. Note that the decimal separator is forbidden even if the digits following it are omitted or are all zeros.

nonPositiveInteger

Contains an integer which is negative or zero (zero is included because it is neither positive nor negative).

negativeInteger

Contains an integer whose value is less than zero.

nonNegativeInteger

Contains a positive or zero integer value.

positiveInteger

Contains an integer whose value is greater than zero.

long

Contains an integer between -9223372036854775808 and 9223372036854775807, i.e., the values that can be stored in a 64-bit word.

int

Contains an integer between -2147483648 and 2147483647 (32 bits).

short

Contains an integer between -32768 and 32767 (16 bits).

byte

Contains an integer between -128 and 127 (8 bits).

unsignedLong

Contains an unsigned integer between 0 and 18446744073709551615, i.e., the values that can be stored in a 64-bit word.

unsignedInt

Contains an integer between 0 and 4294967295 (32 bits).

unsignedShort

Contains an integer between 0 and 65535 (16 bits).

unsignedByte

Contains an integer between 0 and 255 (8 bits).
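The bounds quoted above for the fixed-size integer datatypes all follow from the number of bits. The Python sketch below recomputes them; the names signed_range, unsigned_range, RANGES and in_value_space are hypothetical, and Python's int() is used as a rough stand-in for the integer lexical-to-value mapping (it is more lenient than W3C XML Schema, for instance with underscores).

```python
def signed_range(bits: int):
    """Bounds of a two's-complement signed integer of the given width."""
    return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)

def unsigned_range(bits: int):
    """Bounds of an unsigned integer of the given width."""
    return (0, 2 ** bits - 1)

RANGES = {
    "long": signed_range(64), "int": signed_range(32),
    "short": signed_range(16), "byte": signed_range(8),
    "unsignedLong": unsigned_range(64), "unsignedInt": unsigned_range(32),
    "unsignedShort": unsigned_range(16), "unsignedByte": unsigned_range(8),
}

def in_value_space(type_name: str, lexical: str) -> bool:
    """Check a lexical form against the bounds of one of the types above."""
    low, high = RANGES[type_name]
    return low <= int(lexical) <= high

for name, (low, high) in RANGES.items():
    print(name, low, high)
```

Because the check is made on the value, a lexical form with leading zeros such as "0255" is still within the bounds of unsignedByte.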

The second family is made of the float and double datatypes, which represent IEEE single (32 bits) and double (64 bits) precision floating-point types. These store values in the form of a mantissa and an exponent of a power of 2 (m x 2^e), allowing a large range of numbers in a storage that has a fixed length. Fortunately, the lexical space doesn't require that we use powers of 2 (in fact, it doesn't accept powers of 2), but instead uses a traditional scientific notation based on integer powers of 10. Since the value spaces (powers of 2) don't exactly match the values from the lexical space (powers of 10), the recommendation specifies that the closest value is taken. The consequence of this approximate matching is that most float and double values cannot be considered exact.
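The approximation is easy to observe with a few lines of Python. This sketch assumes that Python's float (an IEEE double) behaves like the double datatype, and it uses the struct module to round to an IEEE single; the helper name as_float32 is hypothetical.

```python
import struct
from decimal import Decimal

def as_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit IEEE float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

exact = Decimal("0.1")                  # what the lexical form says
as_double = Decimal(float("0.1"))       # closest double, written out exactly
as_single = Decimal(as_float32(0.1))    # closest single, written out exactly

print(as_double)
print(as_single)
# 0.5 is a power of 2 and is therefore stored exactly.
print(Decimal(float("0.5")) == Decimal("0.5"))
```

Neither the single nor the double precision value is equal to one tenth, and they are not equal to each other either.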

These datatypes accept several special values: positive zero (0), negative zero (-0) (which is less than positive 0 but greater than any negative value), infinity (INF), which is greater than any value, negative infinity (-INF), which is less than any value, and "not a number" (NaN).

The last member of the numeric types family is boolean, a primitive datatype that can take the values "true" and "false" (or "1" and "0", which are considered to be equivalent).

Dates and times are probably the most controversial aspect of W3C XML Schema datatypes. In order to meet the requirements of dates on the web, the W3C XML Schema Working Group attempted to define a value space for a subset of the ISO 8601 date formats--a syntactical specification of how dates should be exchanged on the web.

The result is complex and yet fails to satisfy the experts of date and time representations, doesn't support any other calendar system than Gregorian, and has no support for localization.

One of the fuzziest aspects of these datatypes is that many of them (such as dateTime which we'll introduce in a moment) accept values with and without timezones. This creates two classes of values which cannot be reliably and accurately compared.

Let's take a closer look at this important distinction before we present the details of these datatypes. Two dateTime values which include a timezone can be compared easily. W3C XML Schema states that a dateTime value without a timezone has an undetermined timezone but that you can still compare two of these to each other. Things get fuzzy when you want to compare a dateTime value with a timezone and a dateTime value without. All you know about the dateTime value that has an undetermined timezone is that it can be in an interval from 14 hours before UTC to 14 hours after UTC. You can never conclude that the two dateTime values are equal. You can only say that one value comes before the other when they are different enough.
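The comparison rule described above can be sketched in Python as follows. This is a simplified illustration, not the algorithm text of the Recommendation: the function name compare_with_untimezoned and its return strings are hypothetical, and timezones are written as numeric offsets (such as +00:00) because older versions of datetime.fromisoformat do not accept the letter "Z".

```python
from datetime import datetime, timedelta, timezone

MAX_OFFSET = timedelta(hours=14)

def compare_with_untimezoned(zoned: str, unzoned: str) -> str:
    """Compare a dateTime that has a timezone with one that has none."""
    p = datetime.fromisoformat(zoned)
    q = datetime.fromisoformat(unzoned)
    if p.tzinfo is None or q.tzinfo is not None:
        raise ValueError("expected a zoned value, then an unzoned value")
    p_utc = p.astimezone(timezone.utc).replace(tzinfo=None)
    # The unzoned value may lie anywhere within 14 hours of its face value.
    if p_utc < q - MAX_OFFSET:
        return "before"
    if p_utc > q + MAX_OFFSET:
        return "after"
    return "indeterminate"

print(compare_with_untimezoned("2001-10-26T21:32:52+02:00", "2001-10-26T21:32:52"))
print(compare_with_untimezoned("2001-10-25T00:00:00+00:00", "2001-10-26T21:32:52"))
# Two zoned values, by contrast, compare without any ambiguity.
print(datetime.fromisoformat("2001-10-26T21:32:52+02:00")
      == datetime.fromisoformat("2001-10-26T19:32:52+00:00"))
```

Note that the function never returns "equal": as stated above, equality can never be concluded between a value with a timezone and a value without one.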

Why 14 hours? No, that's not a typo! National regulations have some level of flexibility with the timezones used in their countries so that the timezone they use can vary from their geographical timezone. This variation can even change throughout the year with many countries having winter and summer times. As a result, when the W3C published the W3C XML Schema recommendation, the maximum number of hours of difference in timezones was not between -12 and +12 hours from UTC but between -13 and +12 hours. And since the W3C doesn't expect that national authorities would ask their permission or send prior notification if they wanted to enlarge this interval, they have added a safety margin and written a -14/+14 hour interval into their recommendation.

Note

Since fuzziness isn't what computers like best, it is certainly a very good practice to use timezones with your dateTime values!

The date, time, and related datatypes defined by W3C XML Schema are:

dateTime

This datatype is defined as representing a "specific instant of time". This is a subset of what ISO 8601 calls a "moment of time." Its lexical value follows the format "CCYY-MM-DDThh:mm:ss," in which all the fields must be present and may optionally be preceded by a sign and leading figures, if needed, and also followed by fractional digits for the seconds and a time zone. The time zone may be specified using the letter "Z," Zulu, which identifies UTC, or by the difference of time with UTC. As we've seen, a value such as "2001-10-26T21:32:52" which is defined without a timezone can't be compared to "2001-10-26T21:32:52+02:00" or "2001-10-26T19:32:52Z" which have a timezone. The last two values which have a timezone are considered as equal since they identify the same moment.

date

This datatype has the same lexical space as the date part of dateTime with an optional timezone and represents a period of one day in its time zone, "independent of how many hours this day has." The consequence of this definition is that two dates defined in a different time zone cannot be equal except if they designate the same interval (2001-10-26+12:00 and 2001-10-25-12:00, for instance). Another consequence is that, as with dateTime, the order relation between a date with a time zone and a date without a time zone can only be partially determined.

gYearMonth

A Gregorian calendar month: a period of one calendar month in its timezone. Its format is the format of date but leaving out the entry for the day: "2001-10", "2001-10+02:00" or "2001-10Z" for instance. ("g" stands for Gregorian)

gYear

A Gregorian calendar year: a period of one calendar year in its timezone. Its format is the format of gYearMonth without its month part: "2001", "2001+02:00" or "2001Z", for instance (note that these three values identify three different periods and are not considered equal).

time

The lexical space of time is identical to the time part of dateTime. The semantic of time represents a point in time that recurs every day; the meaning of "01:20:15" is "the point in time recurring each day at 01:20:15 am." Like date and dateTime, time accepts an optional time zone definition. The same issue arises when comparing times with and without time zones.

gDay

The lexical space of gDay is "---DD" with an optional time zone specification and it represents a recurring period of one day in the specified time zone occurring each Gregorian calendar month. "---01" represents for instance the first day of each month with an undetermined timezone. Days are pinned down depending on the number of days of each month: in February for instance, "---31Z" would occur on February 28th (or 29th for leap years).

gMonthDay

The lexical space of gMonthDay is "--MM-DD" with an optional time zone specification and it represents a recurring period of one day in the specified time zone occurring each Gregorian calendar year. Christmas day in the UK would, for instance, be "--12-25Z".

gMonth

The lexical space of gMonth should have been "--MM" with an optional timezone, but a typo in the W3C XML Schema recommendation has specified it as "--MM--" which you can still find in some tools even though an erratum has corrected it to "--MM". It represents a recurring period of a calendar month in its timezone. The months of January in Paris, for instance, would be represented as "--01+01:00".

duration

The lexical space of duration is "PnYnMnDTnHnMnS". Each part (except the leading "P") is optional. A significant amount of complexity comes from the fact that you can mix quantities expressed as months (which have a variable number of days) with quantities expressed as days, as in "P1Y2M8DT123S", which means a duration of 1 year, 2 months, 8 days and 123 seconds. We will not enter into the detail of the algorithms here, but this leads to a partial order relation between durations, which makes it difficult to process this datatype when all its parts are used.
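To see why the order relation between durations is only partial, consider comparing a duration of one month (P1M) with a duration of thirty days (P30D). The Python sketch below counts how many days one month lasts from several starting dates. The starting dates are arbitrary picks chosen for illustration (they are not the reference dates used by the Recommendation), and the function names are hypothetical.

```python
from datetime import date

def add_one_month(start: date) -> date:
    """Add one calendar month (assumes the day of month exists in the result)."""
    if start.month == 12:
        return start.replace(year=start.year + 1, month=1)
    return start.replace(month=start.month + 1)

def days_in_p1m(start: date) -> int:
    """Number of days covered by P1M when counted from the given date."""
    return (add_one_month(start) - start).days

def compare_p1m_to_days(days: int, starts) -> str:
    """Tell whether P1M is shorter, longer, equal or not comparable to PnD."""
    lengths = {days_in_p1m(s) for s in starts}
    if all(n < days for n in lengths):
        return "shorter"
    if all(n > days for n in lengths):
        return "longer"
    if lengths == {days}:
        return "equal"
    return "indeterminate"

starts = [date(2001, 2, 1), date(2001, 4, 1), date(2001, 10, 1)]
print([days_in_p1m(s) for s in starts])
print(compare_p1m_to_days(30, starts))
```

One month is always longer than 27 days and always shorter than 32 days, but its relation to 30 days depends on where you start counting.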

After that long and dense enumeration of types, let's see how we could add W3C XML Schema datatypes in our first schema. The most natural choices seem to be:

Our first schema could thus be rewritten as:

 <element xmlns="http://relaxng.org/ns/structure/1.0" name="library"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <oneOrMore>
   <element name="book">
    <attribute name="id">
     <data type="NMTOKEN"/>
    </attribute>
    <attribute name="available">
     <data type="boolean"/>
    </attribute>
    <element name="isbn">
     <data type="NMTOKEN"/>
    </element>
    <element name="title">
     <attribute name="xml:lang">
      <data type="language"/>
     </attribute>
     <data type="token"/>
    </element>
    <zeroOrMore>
     <element name="author">
      <attribute name="id">
       <data type="NMTOKEN"/>
      </attribute>
      <element name="name">
       <data type="token"/>
      </element>
      <element name="born">
       <data type="date"/>
      </element>
      <optional>
       <element name="dead">
        <data type="date"/>
       </element>
      </optional>
     </element>
    </zeroOrMore>
    <zeroOrMore>
     <element name="character">
      <attribute name="id">
       <data type="NMTOKEN"/>
      </attribute>
      <element name="name">
       <data type="token"/>
      </element>
      <element name="born">
       <data type="date"/>
      </element>
      <element name="qualification">
       <data type="token"/>
      </element>
     </element>
    </zeroOrMore>
   </element>
  </oneOrMore>
 </element>

or:

 element library {
  element book {
   attribute id {xsd:NMTOKEN},
   attribute available {xsd:boolean},
   element isbn {xsd:NMTOKEN},
   element title {attribute xml:lang {xsd:language}, xsd:token},
   element author {
    attribute id {xsd:NMTOKEN},
    element name {xsd:token},
    element born {xsd:date},
    element dead {xsd:date}?}*,
   element character {
    attribute id {xsd:NMTOKEN},
    element name {xsd:token},
    element born {xsd:date},
    element qualification {xsd:token}}*
  } +
 }

Note the declaration of the datatypeLibrary in the XML version, while the W3C XML Schema datatype library has the special privilege of having its prefix built into the compact syntax: I have used the xsd prefix without needing to declare any datatype library! We will see later on that this isn't the case for the DTD compatibility type library.

The previous chapter explained that a datatype declaration applies only to the data pattern which carries it and is not inherited by child patterns. Let's illustrate this now that we have a richer set of datatypes at hand.

In the schema which we've just written, we have defined the available attribute as boolean but in our instance documents, we have only used one of the two syntaxes for boolean ("true" or "false") and not used the other equivalent one ("0" or "1"). We may want to exclude this second syntax for boolean (for instance if our applications haven't been designed to support it). In this case, we can just exclude these two values:

    <attribute name="available">
     <data type="boolean">
      <except>
       <value>0</value>
       <value>1</value>
      </except>
     </data>
    </attribute>

or:

   attribute available {xsd:boolean - ("0"|"1")}

This looks rather natural, but why does it work? It works because RELAX NG forgets that the type of the attribute is boolean as soon as we've left the data pattern and instead uses the default type (RELAX NG's built-in token type) to test that the value is neither "0" nor "1". If RELAX NG did not forget the type of the attribute, the schema would have removed the entire lexical space of "boolean" and would have been impossible to use, since "0" and "false" are equivalent (and "1" and "true" too).
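The difference between the two behaviors can be simulated in Python. This is only a model of the reasoning above, not of a real validator: the table BOOLEAN_VALUES and both function names are hypothetical, and the second function implements the behavior that RELAX NG does not have.

```python
BOOLEAN_VALUES = {"true": True, "1": True, "false": False, "0": False}

def accepted_token_compare(text: str) -> bool:
    """Actual behavior: the excluded values are compared as tokens (strings)."""
    token = " ".join(text.split())
    return token in BOOLEAN_VALUES and token not in {"0", "1"}

def accepted_typed_compare(text: str) -> bool:
    """Hypothetical behavior: the excluded values are compared as booleans."""
    token = " ".join(text.split())
    if token not in BOOLEAN_VALUES:
        return False
    excluded = {BOOLEAN_VALUES["0"], BOOLEAN_VALUES["1"]}
    return BOOLEAN_VALUES[token] not in excluded

print([t for t in BOOLEAN_VALUES if accepted_token_compare(t)])
print([t for t in BOOLEAN_VALUES if accepted_typed_compare(t)])
```

With the token comparison, "true" and "false" survive the exclusion; with a typed comparison, nothing would be left.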

We have seen a situation where we rely on the fact that the types used in the data and value patterns are different. There are also situations where we would like them to be the same. In that case, we need to repeat the type attribute. If our applications are designed to accept both formats for the available attributes and if we need to test that the books are available, we might prefer to use the same type for both patterns. In this case we can write:

    <attribute name="available">
     <data type="boolean">
      <except>
       <value type="boolean">false</value>
      </except>
     </data>
    </attribute>

or

   attribute available {xsd:boolean - (xsd:boolean "false")}

We now rely on the datatype boolean to exclude both "0" and "false", which are equivalent. Of course, in the case of booleans, the number of possible values is limited. We could have simplified our schema to:

    <attribute name="available">
     <value type="boolean">true</value>
    </attribute>

or

   attribute available {xsd:boolean "true"}

but this wouldn't have made the point I wanted to make. This also works for other datatypes.

The restrictions (known as facets) that the W3C XML Schema recommendation lets a user apply to predefined datatypes can also be applied in a RELAX NG schema. This is done by using a pattern named param. The param patterns are included directly within data patterns and appear before the optional except pattern covered in the previous chapter. Each param pattern has a name attribute which identifies a facet, and its text content is the value of the facet. When several param patterns are included, all the constraints must be met (in other words, the result is a logical "and" of all the conditions). Also note that the same facet can't be repeated, except for the facet named pattern.
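The way several param patterns combine can be pictured as a list of independent checks that must all succeed. The Python sketch below models this; the facet values (a maximum length of 16 and an illustrative regular expression) are made up for the example, and the names Facet, max_length, pattern and satisfies are hypothetical.

```python
import re
from typing import Callable, List

Facet = Callable[[str], bool]

def max_length(limit: int) -> Facet:
    """Model of a maxLength param: at most `limit` characters."""
    return lambda value: len(value) <= limit

def pattern(regex: str) -> Facet:
    """Model of a pattern param: the whole value must match the expression."""
    compiled = re.compile(regex)
    return lambda value: compiled.fullmatch(value) is not None

def satisfies(value: str, facets: List[Facet]) -> bool:
    """All the params must be met: a logical "and" of the conditions."""
    return all(facet(value) for facet in facets)

facets = [max_length(16), pattern(r"[a-z]+[0-9]*")]
for candidate in ["snoopy1", "Snoopy", "a" * 17]:
    print(candidate, satisfies(candidate, facets))
```

A value is rejected as soon as one of the conditions fails, whichever it is.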

The vocabularies used by RELAX NG and W3C XML Schema are slightly different. What RELAX NG calls param is called facet by W3C XML Schema, while what RELAX NG calls a pattern should not be confused with the W3C XML Schema facet named pattern. Also note that, as we have seen previously, what RELAX NG calls whitespace normalization is not the same as the whitespace processing applied to the W3C XML Schema normalizedString datatype.

The facets defined by W3C XML Schema are:

whiteSpace

This somewhat controversial facet cannot be used in RELAX NG.

enumeration

This facet cannot be used in RELAX NG since it is equivalent to RELAX NG's own enumerations--RELAX NG's should be used instead.

pattern

This is the only facet which applies to the lexical space; all the other facets work on the value space only. This facet checks whether the data matches a regular expression. It is covered in the next chapter, Chapter 9: W3C XML Schema Regular Expressions. For the moment, let's just say that its syntax is close to that of Perl regular expressions, with some differences: expressions are implicitly anchored to the beginning and the end of the values to match, the POSIX-style character classes defined in Perl are not supported, and the syntax includes a few XML goodies, supports all the Unicode classes and blocks, and defines a special construct to define differences between character classes.

length

This facet is available only for string, binary and list datatypes. For string (and string-like) types, it defines the number of Unicode characters; for binary datatypes (i.e. hexBinary and base64Binary), it defines a number of bytes; and for list datatypes (ENTITIES, IDREFS and NMTOKENS), it defines the number of tokens in the list.

maxLength

Same meaning and restrictions as length, but defines a maximum length.

minLength

Same meaning and restrictions as length, but defines a minimum length.

maxExclusive

Applies only to decimal, integer (and derived), float and double and all the date time and duration datatypes. It defines a maximum value that cannot be reached. Note that, for date times and duration datatypes, the relation of order between two values is partial and that the result cannot always be determined.

minExclusive

Same restriction as maxExclusive, but defines a minimum value that cannot be reached.

maxInclusive

Same restriction as maxExclusive, but defines a maximum value that can be reached.

minInclusive

Same restriction as maxExclusive, but defines a minimum value that can be reached.

totalDigits

Applies to decimal, integer and derived types to define the maximum number of digits (after and before the decimal point). Like all the facets except pattern, this facet works on the value space, and "000001.10000000", for instance, would be considered as having only 2 digits.

fractionDigits

Applies to decimal to define the maximum number of fractional digits (those after the decimal point). Like all the facets except pattern, this facet works on the value space, and "000001.10000000", for instance, would be considered as having only 1 fractional digit.
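The digit counts used by totalDigits and fractionDigits can be approximated in Python by first dropping the zeros that are not significant. This is a simplified sketch: the function name digit_counts is hypothetical, decimal.Decimal stands in for the decimal value space, and corner cases such as "0.05" (where implementations may count differently) are not addressed.

```python
from decimal import Decimal

def digit_counts(lexical: str):
    """Return (total digits, fraction digits) counted on the value."""
    # normalize() strips trailing zeros; leading zeros vanish on parsing.
    value = Decimal(lexical).normalize()
    _sign, digits, exponent = value.as_tuple()
    if exponent >= 0:
        # Integral value such as 100, stored as 1E+2 after normalization.
        return (len(digits) + exponent, 0)
    return (len(digits), -exponent)

for sample in ["000001.10000000", "123.456", "100"]:
    print(sample, digit_counts(sample))
```

The first sample is the one used in the text: once the insignificant zeros are gone, the value 1.1 has two digits in total and one fractional digit.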

Again, after this enumeration of facets, let's see how we could apply some of these to improve our library schema:

The corresponding schema would be:

 <element xmlns="http://relaxng.org/ns/structure/1.0"
  name="library" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <oneOrMore>
   <element name="book">
    <attribute name="id">
     <data type="NMTOKEN">
       <param name="maxLength">16</param>
     </data>
    </attribute>
    <attribute name="available">
     <data type="boolean"/>
    </attribute>
    <element name="isbn">
     <data type="NMTOKEN">
       <param name="pattern">[0-9]{9}[0-9x]</param>
     </data>
    </element>
    <element name="title">
     <attribute name="xml:lang">
      <data type="language">
       <param name="length">2</param>
      </data>
     </attribute>
     <data type="token">
       <param name="maxLength">255</param>
     </data>
    </element>
    <zeroOrMore>
     <element name="author">
      <attribute name="id">
       <data type="NMTOKEN">
        <param name="maxLength">16</param>
       </data>
      </attribute>
      <element name="name">
       <data type="token">
        <param name="maxLength">255</param>
       </data>
      </element>
      <element name="born">
       <data type="date">
        <param name="minInclusive">1900-01-01</param>
        <param name="maxInclusive">2099-12-31</param>
        <param name="pattern">[0-9]{4}-[0-9]{2}-[0-9]{2}</param>
       </data>
      </element>
      <optional>
       <element name="dead">
        <data type="date">
         <param name="minInclusive">1900-01-01</param>
         <param name="maxInclusive">2099-12-31</param>
         <param name="pattern">[0-9]{4}-[0-9]{2}-[0-9]{2}</param>
        </data>
       </element>
      </optional>
     </element>
    </zeroOrMore>
    <zeroOrMore>
     <element name="character">
      <attribute name="id">
       <data type="NMTOKEN">
        <param name="maxLength">16</param>
       </data>
      </attribute>
      <element name="name">
       <data type="token">
        <param name="maxLength">255</param>
       </data>
      </element>
      <element name="born">
       <data type="date">
        <param name="minInclusive">1900-01-01</param>
        <param name="maxInclusive">2099-12-31</param>
        <param name="pattern">[0-9]{4}-[0-9]{2}-[0-9]{2}</param>
       </data>
      </element>
      <element name="qualification">
       <data type="token">
        <param name="maxLength">255</param>
       </data>
      </element>
     </element>
    </zeroOrMore>
   </element>
  </oneOrMore>
 </element>

or:

 element library {
  element book {
   attribute id {xsd:NMTOKEN {maxLength = "16"}},
   attribute available {xsd:boolean},
   element isbn {xsd:NMTOKEN {pattern = "[0-9]{9}[0-9x]"}},
   element title {
     attribute xml:lang {xsd:language {length="2"}},
     xsd:token {maxLength="255"}
   },
   element author {
    attribute id {xsd:NMTOKEN {maxLength = "16"}},
    element name {xsd:token {maxLength = "255"}},
    element born {xsd:date {
      minInclusive = "1900-01-01"
      maxInclusive = "2099-12-31"
      pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
    }},
    element dead {xsd:date {
      minInclusive = "1900-01-01"
      maxInclusive = "2099-12-31"
      pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
    }}?}*,
   element character {
    attribute id {xsd:NMTOKEN {maxLength = "16"}},
    element name {xsd:token {maxLength = "255"}},
    element born {xsd:date {
      minInclusive = "1900-01-01"
      maxInclusive = "2099-12-31"
      pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
    }},
    element qualification {xsd:token {maxLength = "255"}}}*
  } +
 }

Note the usage of regular expressions in the pattern facets. The set of facets of W3C XML Schema isn't particularly rich, so the pattern facet acts as a Swiss army knife helping you to do all the tricky tasks that other facets can't do. Regular expressions and pattern will be explained in the next chapter.
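The two expressions used in the schema above are simple enough to mean the same thing in Python's re module and in W3C XML Schema (this is not true of every expression), so they can be tried out in a few lines of Python. In this sketch, fullmatch stands in for the implicit anchoring of the pattern facet, and the helper name matches is hypothetical.

```python
import re

ISBN = re.compile(r"[0-9]{9}[0-9x]")
PLAIN_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def matches(regex, lexical: str) -> bool:
    """True if the whole lexical form matches (implicit anchoring)."""
    return regex.fullmatch(lexical) is not None

for isbn in ["0836217462", "083621746x", "083621746X", "08362174621"]:
    print(isbn, matches(ISBN, isbn))

# The pattern works on the lexical space: forms with a timezone are rejected
# even though they are valid dates.
for born in ["1922-11-26", "1922-11-26Z", "1922-11-26+01:00"]:
    print(born, matches(PLAIN_DATE, born))
```

Note that the date pattern alone would also accept a string such as "9999-99-99": it is the combination with the date datatype and the other facets that makes the check complete.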

Also note that facets only define restrictions. You cannot extend the lexical space of a datatype.


All text is copyright Eric van der Vlist, Dyomedea. During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation GFDL.