by Eric van der Vlist is published by O'Reilly & Associates (ISBN: 0596004214)
W3C XML Schema simple types required several chapters in my XML Schema (O'Reilly) book to explain completely, but I'll try to give a brief overview here so that you can use the basic features within RELAX NG schemas. You will find additional detail about the simple types' definitions in Chapter 9 and Chapter 19, and you are of course welcome to read Chapter 4, Chapter 5, Chapter 6, and Chapter 16 of my XML Schema book to get a deeper understanding of their behavior.
The W3C XML Schema datatypes that can be used in a RELAX NG schema are the predefined W3C XML Schema types—those defined in the W3C XML Schema Recommendation itself as opposed to user-defined types, which are derived from the predefined types using the W3C XML Schema language and can't be used from a RELAX NG schema. You'll see that restrictions (called facets in the terminology of W3C XML Schema) can be applied to these datatypes using the RELAX NG param pattern, so some customization is possible.
The W3C XML Schema predefined datatypes are divided into primitive and derived types. Primitive types are basic types that don't share a common foundation of meaning and behave differently from each other. Derived types are built on the foundations of primitive types and share the semantics of their primitive type. Derived types are provided for the convenience of users, since it is expected that they will be commonly used and shouldn't need constant reinvention.
The other idea that needs to be introduced before we start is the concept of lexical and value spaces: lexical space is the string as it appears in the XML document (after whitespace normalization), while value space is the matching value as interpreted by the datatype library. The distinction is important because all the facets save one (the pattern facet, which is covered in depth in Chapter 9) act on the value space.
The next few sections give a brief presentation of the datatypes, organized by their primitive types.
string: This is the only datatype for which no whitespace normalization is done. There is no restriction on the lexical or value spaces of this datatype, which is identical to the string RELAX NG built-in type. The difference is that restrictions can be applied through param patterns on the W3C XML Schema string type.
normalizedString: A string on which intermediate whitespace processing is performed: each tab (#x9), linefeed (#xA), and carriage return (#xD) is replaced by a space (#x20), but no space-collapsing or trimming is performed. Just as for the string datatype, there are no restrictions on the lexical or value spaces of this datatype.
token: This datatype is similar to the built-in token datatype: whitespace is normalized, i.e., every sequence of whitespace characters is replaced by a single space, and the leading and trailing spaces are removed. After string and normalizedString, this is the third and last datatype that has no constraint on its value or lexical spaces. (Also note that all the datatypes except string and normalizedString follow the same normalization rules as the token datatype.)
language: This datatype was created to accept all the language codes standardized by RFC 1766. Some valid values for this datatype are en, en-US, fr, or fr-FR.
NMTOKEN: This datatype corresponds to the XML 1.0 Nmtoken (Name token) production, which is a single token (a set of characters without spaces) composed of characters allowed in an XML name. Some examples of valid values for this datatype are "Snoopy", "CMS", "1950-10-04", or "0836217462". Invalid values include "brought classical music to the Peanuts strip" (spaces are forbidden) or "bold,brash" (commas are forbidden).
NMTOKENS: The lexical and value spaces of NMTOKENS are whitespace-separated lists of NMTOKEN components.
Name: This datatype is similar to NMTOKEN with the additional restriction that the values must start either with a letter or the characters ":" or "_". This datatype conforms to the XML 1.0 definition of a Name. Some examples of valid values for this datatype are "Snoopy", "CMS", or "_1950-10-04-10:00". Invalid values include "0836217462" (can't start with a number) or "bold,brash" (commas are forbidden). This datatype shouldn't be used for names that may be qualified by a namespace prefix; another datatype, QName, has a specific semantic for these values.
NCName: This is a noncolonized name as defined by Namespaces in XML 1.0: a Name without any colons. As such, this datatype is probably the predefined datatype that is closest to the notion of a name in most programming languages (some characters such as "-" or "." may still be a problem in many cases). Valid values for this datatype include "Snoopy", "CMS", or "_1950-10-04-10-00". Invalid values include "_1950-10-04:10-00" or "bold:brash" (colons are forbidden) and "1950-10-04" (can't start with a digit).
ID: The lexical space of ID is the same as the lexical space of NCName. As defined by the W3C XML Schema recommendation, there is one constraint added to its value space: there must not be any duplicate values in a document. RELAX NG doesn't allow datatype libraries to perform this type of check. This is a job for the DTD compatibility feature, as you will see at the end of this chapter. Its specification asks RELAX NG processors supporting this feature to enforce ID uniqueness for W3C XML Schema ID datatypes. Other implementations just check that its lexical space is that of an NCName.
IDREF: The lexical space of IDREF is the same as the lexical space of NCName. Just as for ID, W3C XML Schema adds the constraint that it must match an ID defined in the same document. RELAX NG makes this behavior optional for RELAX NG processors supporting the W3C XML Schema type library without supporting the DTD compatibility feature.
IDREFS: The lexical space of IDREFS is a whitespace-separated list of NCName values. Just as for ID and IDREF, W3C XML Schema adds the constraint that each of the values must match an ID defined in the same document. RELAX NG makes this behavior optional for RELAX NG processors supporting the W3C XML Schema type library without supporting the DTD compatibility feature.
ENTITY: The lexical space of ENTITY is the same as the lexical space of NCName. Also provided for compatibility with XML 1.0 DTDs, an ENTITY value must match an unparsed entity defined in a DTD.
ENTITIES: The lexical and value spaces of ENTITIES are whitespace-separated lists of ENTITY components.
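The whitespace processing and the name rules described above can be sketched in a few lines of Python. This is only an illustration, not what a datatype library actually does: the function names are mine, and the three regular expressions are ASCII-only approximations of the XML 1.0 productions, which accept many more characters.

```python
import re

def replace_whitespace(text):
    """normalizedString-style processing: tab, linefeed, and carriage
    return each become one space; nothing is collapsed or trimmed."""
    return re.sub(r"[\t\n\r]", " ", text)

def collapse_whitespace(text):
    """token-style processing: replace, collapse runs of spaces, trim."""
    return re.sub(r" +", " ", replace_whitespace(text)).strip(" ")

# ASCII-only approximations (assumption: real XML names allow far more).
NMTOKEN = re.compile(r"[A-Za-z0-9._:-]+")
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_nmtoken(text):
    return NMTOKEN.fullmatch(collapse_whitespace(text)) is not None

def is_name(text):
    return NAME.fullmatch(collapse_whitespace(text)) is not None

def is_ncname(text):
    return NCNAME.fullmatch(collapse_whitespace(text)) is not None
```

With these helpers, "1950-10-04" is accepted as an NMTOKEN but rejected as a Name or an NCName because of its leading digit, and "_1950-10-04-10:00" is a Name but not an NCName because of its colon.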
Strictly speaking, anyURI, the only member of this family, isn't considered a string because its value can be different from its lexical representation to compensate for the differences of format between XML and URIs, as specified in RFCs 2396 and 2732. These RFCs aren't very friendly toward non-ASCII characters and require many character escapes that aren't necessary in XML.
As an example of this transformation, the href attribute of an XHTML link written as:
<a href="http://dmoz.org/World/Français/"> World/Français </a>
is converted to the value:
http://dmoz.org/World/Fran%C3%A7ais/
in the value space.
Also note that the anyURI datatype doesn't pay attention to xml:base attributes that may have been defined in the document.
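The escaping applied when going from the lexical space to the value space can be approximated with Python's urllib. This is a rough sketch under my own assumptions (the function name and the list of characters left untouched are mine), not the exact algorithm of the RFCs:

```python
from urllib.parse import quote

def any_uri_value(lexical):
    """Escape characters that a URI can't carry as UTF-8 octets written %HH.
    ASCII URI punctuation and existing %-escapes are left untouched
    (assumption: this safe list is a simplification)."""
    return quote(lexical, safe=":/?#[]@!$&'()*+,;=%")

print(any_uri_value("http://dmoz.org/World/Français/"))
```

The non-ASCII "ç" is first encoded in UTF-8 as two octets, and each octet is then written as a percent sign followed by two hexadecimal digits.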
Up to now, I have only briefly mentioned XML namespaces. I'll focus on them in Chapter 11, but we need to use some of their concepts right now. If you're not familiar with namespaces, you can skip this section: you don't need qualified names quite yet. Even if you are an XML namespace guru, I wouldn't recommend that you use qualified names in values, as they complicate many kinds of processing enormously.
What we're talking about here is different from using qualified names for element and attribute names. Using qualified names for element and attribute names is defined by the recommendation "Namespaces in XML 1.0" (you can find it at http://www.w3.org/TR/REC-xml-names), and there isn't much debate left on the subject. Here, I am speaking of using qualified names in element or attribute values. This usage is much more controversial because it creates a dependency between markup and its content.
Because of this dependency, you can't consider a qualified name to be a string datatype, as its prefix is only a shortcut to the associated namespace URI. The value space of a qualified name is thus not what you see, but a tuple composed of the associated namespace URI (replacing the prefix) and its local part (i.e., what is after the prefix and the colon).
For instance, if the xsd prefix has been associated with the namespace URI http://www.w3.org/2001/XMLSchema, a qualified name (QName) xsd:language would thus have a value that is the tuple {http://www.w3.org/2001/XMLSchema, language}. It can be considered equal to a QName foo:language if the prefix foo has been associated with http://www.w3.org/2001/XMLSchema or language if http://www.w3.org/2001/XMLSchema has been defined as the default namespace.
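To make the tuple idea concrete, here is a minimal Python sketch. The resolve_qname function and the dictionary standing in for the namespace declarations in scope are hypothetical names of mine; a real processor gets the declarations from the XML parser.

```python
XSD_NS = "http://www.w3.org/2001/XMLSchema"

def resolve_qname(lexical, namespaces):
    """Map a lexical QName to its (namespace URI, local name) value.

    `namespaces` maps prefixes to URIs; the empty string is used as the
    key of the default namespace. An undeclared prefix raises KeyError,
    mirroring the rule that the prefix must be declared in scope.
    """
    prefix, colon, local = lexical.rpartition(":")
    uri = namespaces[prefix] if colon else namespaces.get("")
    return (uri, local)

a = resolve_qname("xsd:language", {"xsd": XSD_NS})
b = resolve_qname("foo:language", {"foo": XSD_NS})
c = resolve_qname("language", {"": XSD_NS})
print(a == b == c)
```

The three lexical forms differ, yet they compare equal once resolved, which is exactly why the value can't be processed without knowing the markup around it.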
There are two QName datatypes, which RELAX NG treats as equivalent:
QName: A namespace-qualified name. The lexical space is the set of names consisting of either a prefix and a local name separated by a colon, or a local name only if no prefix is used. The value space is the set of tuples {namespace URI, local name} as explained previously. Note that for a QName to be considered valid, the prefix must be defined through a namespace declaration in the scope of the location where it is used.
NOTATION: In W3C XML Schema, NOTATION is a QName that is used as a notation in a W3C XML Schema. Because RELAX NG has no equivalent syntax for declaring notations, RELAX NG processors treat NOTATION as a synonym for QName.
XML 1.0 isn't designed to store binary content: binary content must be encoded as some form of string before it can be included in an XML document. W3C XML Schema has defined two primitive datatypes to support two encodings: one that is commonly used (base64) and one that is newer (hexBinary). These encodings may include any binary content, including text formats whose content may be incompatible with the XML markup. Other binary text encodings can also be used in XML (such as uuXXcode, Quoted-Printable, BinHex, aencode, or base85, to name a few), but their values aren't recognized by W3C XML Schema.
hexBinary: This datatype defines a simple way to code binary content as a character string by translating the value of each binary octet into two hexadecimal digits. (This encoding shouldn't be confused with the encoding method called BinHex, introduced by Apple and described by RFC 1741, which includes a mechanism to compress repetitive characters.) A UTF-8 XML header such as <?xml version="1.0" encoding="UTF-8"?> encoded in hexBinary is:
3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e
base64Binary: This datatype uses the encoding known as base64, which is described in RFC 2045. It maps groups of 6 bits into an array of 64 printable characters. The same header, followed by a carriage return and a linefeed, encoded in base64Binary is:
PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4NCg==
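Both encodings are easy to reproduce with the Python standard library, which is a handy way to check values by hand. This sketch only shows the two encodings themselves; the variable names are mine.

```python
import base64

header = b'<?xml version="1.0" encoding="UTF-8"?>'

# hexBinary: two hexadecimal digits per octet, in document order.
hex_form = header.hex()

# base64Binary: groups of 6 bits mapped to 64 printable characters.
b64_form = base64.b64encode(header).decode("ascii")

# The same header followed by a carriage return and a linefeed.
b64_with_crlf = base64.b64encode(header + b"\r\n").decode("ascii")

print(hex_form)
print(b64_form)
print(b64_with_crlf)

# Both encodings are reversible.
assert bytes.fromhex(hex_form) == header
assert base64.b64decode(b64_form) == header
```

Note that the hexadecimal form is twice as long as the binary content, while base64 needs only four characters for every three octets.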
Numeric datatypes are built on top of four primitive datatypes: decimal for all the decimal types (including the integer datatypes, which are treated as decimals without a fractional part), float and double for single- and double-precision floats, and boolean for Booleans.
The first family of numeric datatypes is derived from the primitive type decimal:
decimal: This datatype represents decimal numbers. The number of digits can be arbitrarily long (the datatype doesn't impose any restrictions), but obviously, since an XML document has an arbitrary but finite length, the number of digits of the lexical representation of a decimal value needs to be finite. Although the number of digits isn't limited, the next section (concerning facets) shows how the author of a schema can derive user-defined datatypes with a limited number of digits if needed. Leading and trailing zeros aren't considered significant and may be trimmed. The decimal separator is always a dot (.), and a leading sign (+ or -) may be used, but any characters other than the 10 digits zero through nine are forbidden, including whitespace inside the value. Allowed values for decimal include 123.456, +1234.456, -.456 or -456.
integer: This datatype is a subset of decimal, representing numbers that don't have any fractional digits in their lexical or value spaces. The characters that are accepted are reduced to the digits zero through nine, with an optional leading sign. Like its base datatype, integer doesn't impose any limitation on the number of digits, and leading zeros aren't significant. Note that the decimal separator is forbidden even if the digits following it are omitted or are zeros.
nonPositiveInteger: Contains an integer that is negative or zero (zero is neither positive nor negative).
negativeInteger: Contains an integer whose value is less than zero.
nonNegativeInteger: Contains a positive or zero integer value.
positiveInteger: Contains an integer whose value is greater than zero.
long: Contains an integer between -9223372036854775808 and 9223372036854775807; i.e., the values that can be stored in a 64-bit word.
int: Contains an integer between -2147483648 and 2147483647 (32 bits).
short: Contains an integer between -32768 and 32767 (16 bits).
byte: Contains an integer between -128 and 127 (8 bits).
unsignedLong: Contains an unsigned integer between 0 and 18446744073709551615; i.e., the values that can be stored in a 64-bit word.
unsignedInt: Contains an integer between 0 and 4294967295 (32 bits).
unsignedShort: Contains an integer between 0 and 65535 (16 bits).
unsignedByte: Contains an integer between 0 and 255 (8 bits).
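The bounds of the sized integer types all follow from their width in bits, as this small Python sketch shows. The helper names are mine, and int() is slightly more liberal than the W3C XML Schema lexical space (it accepts underscores between digits, for instance), so treat this as an approximation.

```python
def signed_range(bits):
    """Bounds of a two's-complement integer stored in `bits` bits."""
    return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)

def unsigned_range(bits):
    """Bounds of an unsigned integer stored in `bits` bits."""
    return (0, 2 ** bits - 1)

RANGES = {
    "long": signed_range(64), "int": signed_range(32),
    "short": signed_range(16), "byte": signed_range(8),
    "unsignedLong": unsigned_range(64), "unsignedInt": unsigned_range(32),
    "unsignedShort": unsigned_range(16), "unsignedByte": unsigned_range(8),
}

def in_range(type_name, lexical):
    """Check a lexical value against the bounds of a sized integer type."""
    low, high = RANGES[type_name]
    return low <= int(lexical) <= high

print(RANGES["long"])
print(in_range("byte", "128"))
```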
The second family is made of the float and double datatypes, which represent IEEE single (32 bits) and double (64 bits) precision floating-point types. These store the values in the form of a mantissa and an exponent of a power of 2 (m × 2^e), allowing a large scale of numbers in a storage that has a fixed length. Fortunately, the lexical space doesn't require powers of 2 (in fact, it doesn't accept powers of 2), but instead uses a traditional scientific notation based on integer powers of 10. Because the value spaces (powers of 2) don't exactly match the values from the lexical space (powers of 10), the recommendation specifies that the closest value is taken. The consequence of this approximate matching is that float datatypes are the domain of approximation; most of the float values can't be considered exact and are approximate.
These datatypes accept several special values: positive zero (0), negative zero (-0) (which is less than positive 0 but greater than any negative value); infinity (INF), which is greater than any value; negative infinity (-INF), which is less than any value; and "not a number" (NaN).
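The approximation is easy to observe from Python, whose own float is an IEEE double. In this sketch (the as_float32 helper is mine), struct is used to round a value to single precision; note that Python's float() accepts more spellings of the special values than W3C XML Schema does.

```python
import math
import struct
from decimal import Decimal

def as_float32(lexical):
    """Parse a lexical value and round it to IEEE single precision."""
    return struct.unpack("!f", struct.pack("!f", float(lexical)))[0]

# The lexical value 0.1 has no exact binary representation: the double
# and the single closest to it are two different approximations.
print(Decimal(float("0.1")))
print(Decimal(as_float32("0.1")))

# The special values of the lexical space are understood by float().
print(float("INF"), float("-INF"), float("NaN"))
print(math.isnan(float("NaN")))
```

Printing the values through Decimal shows the exact number stored in each case, which differs from 0.1 in both precisions.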
The last member of the numeric types family is boolean, a primitive datatype that can take the values true and false (or 1 and 0, which are considered equivalent).
Dates and times are probably the most controversial aspect of W3C XML Schema datatypes. In order to meet the requirements of dates on the Web, the W3C XML Schema Working Group attempted to define a value space for a subset of the ISO 8601 date formats—a syntactical specification of how dates should be exchanged on the Web.
The result is complex and yet fails to satisfy the experts of date and time representations, doesn't support any other calendar system than Gregorian, and has no support for localization.
One of the fuzziest aspects of these datatypes is that many of them (such as dateTime, which I'll introduce in a moment) accept values with and without time zones. This creates two classes of values, which can't be reliably and accurately compared.
Let's take a closer look at this important distinction before I present the details of these datatypes. Two dateTime values that include a time zone can be compared easily. W3C XML Schema states that a dateTime value without a time zone has an undetermined time zone, but that you can still compare two of these to each other. Things get fuzzy when you want to compare a dateTime value with a time zone and a dateTime value without. All you know about the dateTime value that has an undetermined time zone is that it can be in an interval from 14 hours before UTC to 14 hours after UTC. You can never conclude that the two dateTime values are equal. You can say only that one value comes before the other when they are different enough.
Why 14 hours? No, that's not a typo! National regulations have some level of flexibility with the time zones used in their countries, so that the time zone they use can vary from their geographical time zone. This variation can even change throughout the year, with many countries having winter and summer times. As a result, when the W3C published the W3C XML Schema recommendation, the maximum number of hours of difference in time zones was not between -12 and +12 hours from UTC but between -13 and +12 hours. And because the W3C doesn't expect that national authorities will ask their permission or send prior notification if they want to enlarge this interval, they have added a security margin and written the -14/+14 hours interval into their recommendation.
Warning: Because computers aren't fond of fuzziness, it is certainly a very good practice to use time zones with your dateTime values!
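The comparison rules just described can be sketched in Python. This is my own simplified reading of them, limited to the plain CCYY-MM-DDThh:mm:ss form with an optional time zone; the function names are hypothetical, and None stands for "can't be determined".

```python
from datetime import datetime, timedelta, timezone

MAX_SHIFT = timedelta(hours=14)  # the -14/+14 hours interval

def parse_datetime(lexical):
    # fromisoformat() accepts a trailing "Z" only in recent Python
    # versions, so it is rewritten as an explicit offset first.
    if lexical.endswith("Z"):
        lexical = lexical[:-1] + "+00:00"
    return datetime.fromisoformat(lexical)

def utc_bounds(value):
    """Earliest and latest UTC instants a value may designate."""
    if value.tzinfo is None:
        as_utc = value.replace(tzinfo=timezone.utc)
        return as_utc - MAX_SHIFT, as_utc + MAX_SHIFT
    return value, value

def compare(a, b):
    """Return -1, 0, or 1, or None when the order is undetermined."""
    x, y = parse_datetime(a), parse_datetime(b)
    if (x.tzinfo is None) == (y.tzinfo is None):
        return (x > y) - (x < y)
    x_low, x_high = utc_bounds(x)
    y_low, y_high = utc_bounds(y)
    if x_high < y_low:
        return -1
    if x_low > y_high:
        return 1
    return None

print(compare("2001-10-26T21:32:52+02:00", "2001-10-26T19:32:52Z"))
print(compare("2001-10-26T21:32:52", "2001-10-26T19:32:52Z"))
```

The first comparison involves two values with time zones and gives a definite answer; the second mixes both classes of values and stays undetermined because the two instants are less than 14 hours apart.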
Here are the date, time, and related datatypes defined by W3C XML Schema:
dateTime: This datatype is defined as representing a "specific instant of time." This instant is a subset of what ISO 8601 calls a "moment of time." Its lexical value follows the format CCYY-MM-DDThh:mm:ss, in which all the fields must be present and may optionally be preceded by a sign and leading figures, if needed, and also followed by fractional digits for the seconds and a time zone. The time zone may be specified using the letter "Z," Zulu, which identifies UTC, or by the difference of time with UTC. As you've seen, a value such as 2001-10-26T21:32:52 that's defined without a time zone can't be compared to 2001-10-26T21:32:52+02:00 or 2001-10-26T19:32:52Z, which have a time zone. The last two values, which have a time zone, are considered equal because they identify the same moment.
date: This datatype has the same lexical space as the date part of dateTime with an optional time zone and represents a period of one day in its time zone, "independent of how many hours this day has." The consequence of this definition is that two dates defined in a different time zone can't be equal, except if they designate the same interval (2001-10-26+12:00 and 2001-10-25-12:00, for instance). Another consequence is that, as with dateTime, the order relation between a date with a time zone and a date without a time zone can be only partially determined.
gYearMonth: A Gregorian calendar month, i.e., a period of one calendar month in its time zone. Its format is the format of date but leaving out the entry for the day: 2001-10, 2001-10+02:00, or 2001-10Z for instance ("g" stands for Gregorian).
gYear: A Gregorian calendar year, i.e., a period of one calendar year in its time zone. Its format is the format of gYearMonth without its month part: 2001, 2001+02:00 or 2001Z, for instance (note that these three values identify three different periods and aren't considered equal).
time: The lexical space of time is identical to the time part of dateTime. The semantic of time represents a point in time that recurs every day; the meaning of "01:20:15" is "the point in time recurring each day at 01:20:15 am." Like date and dateTime, time accepts an optional time-zone definition. The same issue arises when comparing times with and without time zones.
gDay: The lexical space of gDay is ---DD with an optional time zone specification, and it represents a recurring period of one day in the specified time zone occurring each Gregorian calendar month. ---01 represents, for instance, the first day of each month with an undetermined time zone. Dates are pinned down depending on the number of days of each month; in February, for instance, ---31Z occurs on February 28th (or 29th for leap years).
gMonthDay: The lexical space of gMonthDay is --MM-DD with an optional time-zone specification, and it represents a recurring period of one day in the specified time zone occurring each Gregorian calendar year. For instance, Christmas day in the United Kingdom is --12-25Z.
gMonth: The lexical space of gMonth should have been --MM with an optional time zone, but a typo in the W3C XML Schema recommendation has specified it as --MM-- which you can still find in some tools even though an erratum has corrected it to --MM. It represents a recurring period of a calendar month in its time zone. The months of January in Paris, for instance, are represented as --01+01:00.
duration: The lexical space of duration is PnYnMnDTnHnMnS. Each part (except the leading "P") is optional. A significant amount of complexity comes from the fact that you can mix quantities expressed as months (which have a variable number of days) with quantities expressed as days, such as, for instance, P1Y2M8DT123S, which means a duration of 1 year, 2 months, 8 days and 123 seconds. I won't enter into the detail of the algorithms here, but this mixing leads to a partial order relation between durations, which makes this datatype difficult to process when all its parts are used.
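To give an idea of where the partial order on durations comes from, here is a rough Python sketch of the approach I understand the recommendation to take: both durations are added to a few reference instants, and the order is only determined if all the results agree. The four reference dates are, to the best of my knowledge, the ones listed by the recommendation; the function names and the representation of a duration as a dictionary of months, days, and seconds are my own simplifications.

```python
from datetime import datetime, timedelta

# Reference instants chosen to cover months of 28, 29, 30, and 31 days.
REFERENCE_STARTS = [
    datetime(1696, 9, 1), datetime(1697, 2, 1),
    datetime(1903, 3, 1), datetime(1903, 7, 1),
]

def add_duration(start, months=0, days=0, seconds=0):
    """Add calendar months first, then days and seconds.
    Every reference start falls on the first of a month, so shifting
    the month never produces an invalid day."""
    month_index = start.year * 12 + (start.month - 1) + months
    shifted = start.replace(year=month_index // 12,
                            month=month_index % 12 + 1)
    return shifted + timedelta(days=days, seconds=seconds)

def compare_durations(a, b):
    """Return -1, 0, or 1, or None when the order is undetermined."""
    outcomes = set()
    for start in REFERENCE_STARTS:
        end_a = add_duration(start, **a)
        end_b = add_duration(start, **b)
        outcomes.add((end_a > end_b) - (end_a < end_b))
    return outcomes.pop() if len(outcomes) == 1 else None

print(compare_durations({"months": 1}, {"days": 30}))   # P1M vs P30D
print(compare_durations({"months": 1}, {"days": 27}))   # P1M vs P27D
```

One month is always longer than 27 days and always shorter than 32 days, but how it compares with 30 days depends on the month it starts in, so that comparison has no answer.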
After that long and dense enumeration of types, let's see how to add W3C XML Schema datatypes to our first schema. The most natural choices seem to be:
id attributes: If we use the ID datatype for IDs, their uniqueness will be checked by RELAX NG processors that support the DTD compatibility feature.
xml:lang attribute: The natural candidate for xml:lang is language.
available attribute: We can use boolean for this attribute.
born and died elements: date seems the right choice, since we have been lucky enough to have ISO 8601 dates in our instance documents.
title, name, and qualification elements: We have no reason to preserve whitespace in these elements and will use token datatypes for all of them.
Our first schema could thus be rewritten as:
<element xmlns="http://relaxng.org/ns/structure/1.0" name="library"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <oneOrMore>
    <element name="book">
      <attribute name="id"><data type="ID"/></attribute>
      <attribute name="available"><data type="boolean"/></attribute>
      <element name="isbn"><data type="NMTOKEN"/></element>
      <element name="title">
        <attribute name="xml:lang"><data type="language"/></attribute>
        <data type="token"/>
      </element>
      <zeroOrMore>
        <element name="author">
          <attribute name="id"><data type="ID"/></attribute>
          <element name="name"><data type="token"/></element>
          <element name="born"><data type="date"/></element>
          <optional>
            <element name="died"><data type="date"/></element>
          </optional>
        </element>
      </zeroOrMore>
      <zeroOrMore>
        <element name="character">
          <attribute name="id"><data type="ID"/></attribute>
          <element name="name"><data type="token"/></element>
          <element name="born"><data type="date"/></element>
          <element name="qualification"><data type="token"/></element>
        </element>
      </zeroOrMore>
    </element>
  </oneOrMore>
</element>
or:
element library {
  element book {
    attribute id {xsd:ID},
    attribute available {xsd:boolean},
    element isbn {xsd:NMTOKEN},
    element title {attribute xml:lang {xsd:language}, xsd:token},
    element author {
      attribute id {xsd:ID},
      element name {xsd:token},
      element born {xsd:date},
      element died {xsd:date}?
    }*,
    element character {
      attribute id {xsd:ID},
      element name {xsd:token},
      element born {xsd:date},
      element qualification {xsd:token}
    }*
  }+
}
Note the declaration of the datatypeLibrary in the XML version, while the W3C XML Schema datatype library has the special privilege of having its prefix built into the compact syntax: I have used the xsd prefix without needing to declare any datatype library! You will see later on that this isn't the case for the DTD compatibility type library.
The previous chapter explained that datatype declarations are kind of a transition to a data pattern and aren't inherited by child patterns. I'll illustrate this now that we have a richer set of datatypes at hand.
In the schema just written, I have defined the available attribute as boolean but our instance documents have used only one of the two syntaxes for boolean (true or false) and not used the other equivalent one (0 or 1). We may want to exclude this second syntax for boolean (for instance, if our application hasn't been designed to support it). In this case, we can just exclude these two values:
<attribute name="available">
  <data type="boolean">
    <except>
      <value>0</value>
      <value>1</value>
    </except>
  </data>
</attribute>
or:
attribute available {xsd:boolean - ("0"|"1")}
This looks rather natural, but why does it work? It works because RELAX NG forgets that the type of the attribute is boolean as soon as we've left the data pattern and instead uses the default type (RELAX NG's built-in token type) to test that the value is neither 0 nor 1. If RELAX NG didn't forget the type of the attribute, the schema would have removed the entire lexical space of "boolean" and would have been impossible to use because 0 and false are equivalent (and 1 and true too).
You have seen a situation where we rely on the fact that the types used in the data and value patterns are different. You will find other situations in which you will want them to be the same. In that case, you need to repeat the type attribute. If your applications are designed to accept both formats for the available attributes, and if you need to test that the books are available, you might prefer to use the same type for both patterns. In this case, you can write:
<attribute name="available">
  <data type="boolean">
    <except>
      <value type="boolean">false</value>
    </except>
  </data>
</attribute>
or:
attribute available {xsd:boolean - (xsd:boolean "false")}
You can now rely on the datatype boolean to exclude both 0 and false, which are equivalent. Of course, in the case of booleans, the number of possible values is limited. You can simplify the schema to:
<attribute name="available">
  <value type="boolean">true</value>
</attribute>
or:
attribute available {xsd:boolean "true"}
but this doesn't make my point. This trick also works for other datatypes.
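The difference between the two kinds of exclusion can be simulated in a few lines of Python. This is only a model of the behavior described above, with names of my own; it isn't how a RELAX NG processor is implemented.

```python
# Lexical space of boolean mapped to its value space.
BOOLEAN_VALUES = {"true": True, "1": True, "false": False, "0": False}

def as_token(lexical):
    """Whitespace normalization of the built-in token type."""
    return " ".join(lexical.split())

def boolean_value(lexical):
    """Value of a lexical boolean, or None if it isn't one."""
    return BOOLEAN_VALUES.get(as_token(lexical))

def matches_token_except(lexical):
    """Model of xsd:boolean - ("0"|"1"): the excluded values are
    compared as tokens, i.e., as normalized strings."""
    return (boolean_value(lexical) is not None
            and as_token(lexical) not in {"0", "1"})

def matches_typed_except(lexical):
    """Model of xsd:boolean - (xsd:boolean "false"): the excluded
    value is compared in the value space of boolean."""
    value = boolean_value(lexical)
    return value is not None and value != BOOLEAN_VALUES["false"]

for text in ["true", "false", "1", "0"]:
    print(text, matches_token_except(text), matches_typed_except(text))
```

The first function accepts true and false and rejects 1 and 0; the second accepts true and 1 and rejects false and 0, because 0 and false are the same value once the type is taken into account.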
The restrictions (known as facets) that the W3C XML Schema recommendation lets a user apply to its predefined datatypes can also be applied in a RELAX NG schema. This is done using an element named param. The param elements are included directly within data patterns and appear before the optional except pattern covered in the previous chapter. These param elements have a name attribute, which identifies a facet, and their text content is the value of the facet. When several param elements are included, all the constraints must be met (in other words, the result is a logical "and" of all the conditions). Also note that the same facet can't be repeated, except for the facet named pattern.
The vocabularies used by RELAX NG and W3C XML Schema are slightly different. What RELAX NG calls param is called facet by W3C XML Schema, while what is called a pattern by RELAX NG shouldn't be confused with the facet named pattern by W3C XML Schema. Also note that, as you have seen previously, what RELAX NG calls whitespace normalization isn't the same as the whitespace processing applied to the W3C XML Schema normalizedString datatype.
The facets defined by W3C XML Schema are:
whiteSpace: This somewhat controversial facet can't be used in RELAX NG.
enumeration: This facet can't be used in RELAX NG because it is equivalent to RELAX NG's own enumerations; RELAX NG's should be used instead.
pattern: This is the only facet that is applied to the lexical space. All the other facets work in the value space only. This facet checks whether the data matches a regular expression. This facet is covered in Chapter 9. For the moment, let's just say that it is closely modeled on Perl regular expressions (implicitly anchored to the beginning and the end of the values to match), and that it doesn't support the POSIX-style character classes defined in Perl. It includes a few XML goodies, supports all the Unicode classes and blocks, and defines a special construct to define differences between character classes.
length: Available only for string, binary, and list datatypes. For string (and string-like) types, this defines the number of Unicode characters; for binary (i.e., hexBinary and base64Binary) datatypes, it defines a number of bytes; and for list datatypes (ENTITIES, IDREFS, and NMTOKENS), it defines the number of tokens in the list.
maxLength: Same meaning and restrictions as length but defines a maximum length.
minLength: Same meaning and restrictions as length but defines a minimum length.
maxExclusive: Applies only to decimal, integer (and derived), float, and double and all the date time and duration datatypes. It defines a maximum value that can't be reached. Note that, for date times and duration datatypes, the relation of order between two values is partial and that the result can't always be determined.
minExclusive: Same restriction as maxExclusive but defines a minimum value that can't be reached.
maxInclusive: Same restriction as maxExclusive but defines a maximum value that can be reached.
minInclusive: Same restriction as maxExclusive but defines a minimum value that can be reached.
totalDigits: Applies to decimal, integer, and derived types to define the maximum number of digits (after and before the decimal point). As all the facets do (except pattern), this facet works on the value space; "000001.10000000" (for instance) would be considered to have only two digits.
fractionDigits: Applies to decimal types to define the maximum number of fractional digits (those after the decimal point). As all the facets do (except pattern), this facet works on the value space; "000001.10000000" (for instance) would be considered to have only one fractional digit.
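The way the two digit-counting facets work on the value space rather than on the lexical space can be illustrated with Python's decimal module. The digit_counts function is a hypothetical helper of mine, and it assumes values short enough to fit in the default precision of the decimal context (28 digits), beyond which normalize() would round.

```python
from decimal import Decimal

def digit_counts(lexical):
    """Return (total digits, fraction digits) of a decimal value,
    ignoring leading and trailing zeros of the lexical form."""
    _sign, digits, exponent = Decimal(lexical).normalize().as_tuple()
    fraction = max(-exponent, 0)
    # Integer digits include the zeros implied by a positive exponent;
    # small fractions such as 0.001 count their leading fraction zeros.
    total = max(len(digits) + max(exponent, 0), fraction)
    return total, fraction

print(digit_counts("000001.10000000"))
print(digit_counts("100"))
```

A value checked against totalDigits or fractionDigits is thus first reduced to its canonical form, which is why padding a number with zeros never makes it invalid.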
Again, after this enumeration of facets, let's see how to apply some of them to improve our library schema:
xml:lang attribute: We might want to ignore the regional differences and accept only two-character codes using the length facet.
isbn element: There would be much more to check on an ISBN, but we might want to use a pattern to confirm that it's composed of nine digits terminated by a character that is either a digit or the character "x".
born and died elements: Assuming that our library is interested only in recent books, we could check that the dates belong to the 20th or 21st centuries (in other words, between 1900 and 2099). We might also want to confirm that our dates don't specify a time zone, since we've seen that comparing dates with and without time zones is fuzzy and that the instance documents seen up to now have no time zones.
id attributes and text elements: The maximum length can be constrained using a maxLength facet.
Here's the corresponding schema:
<element xmlns="http://relaxng.org/ns/structure/1.0" name="library"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <oneOrMore>
    <element name="book">
      <attribute name="id">
        <data type="ID"><param name="maxLength">16</param></data>
      </attribute>
      <attribute name="available"><data type="boolean"/></attribute>
      <element name="isbn">
        <data type="NMTOKEN"><param name="pattern">[0-9]{9}[0-9x]</param></data>
      </element>
      <element name="title">
        <attribute name="xml:lang">
          <data type="language"><param name="length">2</param></data>
        </attribute>
        <data type="token"><param name="maxLength">255</param></data>
      </element>
      <zeroOrMore>
        <element name="author">
          <attribute name="id">
            <data type="ID"><param name="maxLength">16</param></data>
          </attribute>
          <element name="name">
            <data type="token"><param name="maxLength">255</param></data>
          </element>
          <element name="born">
            <data type="date">
              <param name="minInclusive">1900-01-01</param>
              <param name="maxInclusive">2099-12-31</param>
              <param name="pattern">[0-9]{4}-[0-9]{2}-[0-9]{2}</param>
            </data>
          </element>
          <optional>
            <element name="died">
              <data type="date">
                <param name="minInclusive">1900-01-01</param>
                <param name="maxInclusive">2099-12-31</param>
                <param name="pattern">[0-9]{4}-[0-9]{2}-[0-9]{2}</param>
              </data>
            </element>
          </optional>
        </element>
      </zeroOrMore>
      <zeroOrMore>
        <element name="character">
          <attribute name="id">
            <data type="ID"><param name="maxLength">16</param></data>
          </attribute>
          <element name="name">
            <data type="token"><param name="maxLength">255</param></data>
          </element>
          <element name="born">
            <data type="date">
              <param name="minInclusive">1900-01-01</param>
              <param name="maxInclusive">2099-12-31</param>
              <param name="pattern">[0-9]{4}-[0-9]{2}-[0-9]{2}</param>
            </data>
          </element>
          <element name="qualification">
            <data type="token"><param name="maxLength">255</param></data>
          </element>
        </element>
      </zeroOrMore>
    </element>
  </oneOrMore>
</element>
or:
element library {
  element book {
    attribute id {xsd:ID {maxLength = "16"}},
    attribute available {xsd:boolean},
    element isbn {xsd:NMTOKEN {pattern = "[0-9]{9}[0-9x]"}},
    element title {
      attribute xml:lang {xsd:language {length = "2"}},
      xsd:token {maxLength = "255"}
    },
    element author {
      attribute id {xsd:ID {maxLength = "16"}},
      element name {xsd:token {maxLength = "255"}},
      element born {xsd:date {
        minInclusive = "1900-01-01"
        maxInclusive = "2099-12-31"
        pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
      }},
      element died {xsd:date {
        minInclusive = "1900-01-01"
        maxInclusive = "2099-12-31"
        pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
      }}?
    }*,
    element character {
      attribute id {xsd:ID {maxLength = "16"}},
      element name {xsd:token {maxLength = "255"}},
      element born {xsd:date {
        minInclusive = "1900-01-01"
        maxInclusive = "2099-12-31"
        pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
      }},
      element qualification {xsd:token {maxLength = "255"}}
    }*
  }+
}
Note the usage of regular expressions in the pattern facets. The set of facets provided by W3C XML Schema isn't particularly rich, so the pattern facet acts as a Swiss Army knife, helping you to do all the tricky tasks other facets can't do. Regular expressions and pattern are explained in Chapter 9.
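If you want to get a feeling for what these facets accept before running a RELAX NG processor, the checks on the isbn and date elements can be mimicked in Python. This is only an approximation under my own assumptions: the function names are mine, Python's re module stands in for W3C XML Schema regular expressions (the two dialects agree on these simple expressions), and fullmatch reproduces the implicit anchoring of the pattern facet.

```python
import re
from datetime import date

ISBN_PATTERN = re.compile(r"[0-9]{9}[0-9x]")
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
MIN_DATE, MAX_DATE = date(1900, 1, 1), date(2099, 12, 31)

def valid_isbn(text):
    """pattern facet: nine digits followed by a digit or "x"."""
    return ISBN_PATTERN.fullmatch(text.strip()) is not None

def valid_date(text):
    """pattern facet (no time zone allowed) combined with the
    minInclusive and maxInclusive facets."""
    text = text.strip()
    if DATE_PATTERN.fullmatch(text) is None:
        return False
    try:
        value = date.fromisoformat(text)
    except ValueError:  # e.g., February 30th
        return False
    return MIN_DATE <= value <= MAX_DATE

print(valid_isbn("0836217462"), valid_isbn("083621746X"))
print(valid_date("1922-11-26"), valid_date("1899-12-31"))
```

Note that the pattern as written accepts only a lowercase "x" as the last character of an ISBN, and that the pattern on dates is what rejects values carrying a time zone.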
Also note that facets only define restrictions. You can't extend the lexical space of a datatype through a facet (though you can create a choice between two types to merge their lexical space).
This text is released under the Free Software Foundation GFDL.