RELAX NG by Eric van der Vlist will be published by O'Reilly & Associates (ISBN: 0596004214)

You are welcome to use our annotation system to give your feedback.


The downsides of ambiguous and non-deterministic content models

Again, if you only want to use a RELAX NG schema for validation, which after all is the primary goal of RELAX NG, it is perfectly fine to design and use non-deterministic and even ambiguous schemas. The downsides of ambiguous schemas appear when we want to use RELAX NG schemas to add validation information to instance documents or to drive guided editing. The downsides of non-deterministic schemas appear only when we want to be able to translate our schemas into W3C XML Schema.

For the purposes of RELAX NG, instance annotation is the ability to attach information gathered during validation to the instance document in order to facilitate its processing. Instance annotation is one of the more promising paths to automating XML document processing. Its applications range from datatype assignment (the basis of XQuery 1.0, XPath 2.0 and XSLT 2.0), to data binding (automating the creation of objects from XML documents and the creation of XML documents from objects), to XML guided editing.

Some tools may have more stringent requirements depending on their algorithms (for instance, a SAX-based streaming tool might require deterministic schemas), but in theory (and in general), it is sufficient for the applications of instance annotation that the annotations be consistent. This can be achieved if the schema is unambiguous.

Note that even this freedom from ambiguity isn't always required. The requirements are application dependent. Consider a data binding application which needs to know the content model of each element. This application might have trouble determining which content model to use if it finds a pattern such as the following and an element foo whose only child is a second element, since that content matches both alternatives:

 start = element foo { first?, second }
       | element foo { second, third? }
 first = element first { xsd:integer }
 second = element second { xsd:token }
 third = element third { xsd:boolean }

Should it bind the contents of foo to an object allowing an optional first or to an object allowing an optional third? Such ambiguity is likely to be a problem for this application. On the other hand, if all you need to do is perform simple type assignment, this schema is perfectly fine. Even though it is ambiguous, there is no ambiguity as far as datatype assignment is concerned.
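To make the distinction concrete, here is a minimal Python sketch. It is not how a RELAX NG processor works: the two alternatives are simply modelled as regular expressions over the names of the children of foo, and the helper names (matching_alternatives, assigned_datatypes) are invented for this illustration. It shows that a lone second child matches both alternatives, while its datatype is the same whichever alternative is chosen.

```python
import re

# Assumption for illustration: each alternative of the choice is modelled as a
# regular expression over the space-separated names of foo's child elements.
ALTERNATIVES = {
    "first?, second": re.compile(r"(first )?second"),
    "second, third?": re.compile(r"second( third)?"),
}

# Datatype of each child element, copied from the named patterns of the schema.
DATATYPES = {
    "first": "xsd:integer",
    "second": "xsd:token",
    "third": "xsd:boolean",
}


def matching_alternatives(children):
    """Return the alternatives whose content model accepts the child sequence."""
    sequence = " ".join(children)
    return [name for name, rx in ALTERNATIVES.items() if rx.fullmatch(sequence)]


def assigned_datatypes(children):
    """Return the datatype of each child; it does not depend on the alternative."""
    return [DATATYPES[name] for name in children]


# A lone <second> matches both alternatives: the content model is ambiguous.
print(matching_alternatives(["second"]))
# With <first> present, only one alternative is left.
print(matching_alternatives(["first", "second"]))
# The datatype of <second> is xsd:token in either case.
print(assigned_datatypes(["second"]))
```

A data binding tool would need the first function to return exactly one alternative; a tool doing simple type assignment only needs the second function, which never has to choose.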

As a bottom line, we can say that being aware of ambiguity in your RELAX NG schemas is good practice. If you want to support instance annotation applications, you must also check the tools which you will be using since they can have either more stringent or more relaxed requirements.

I promised to give an example of an unambiguous pattern which is not deterministic and can't be rewritten in a deterministic form. Here it is! Let's consider a pattern describing a book as a sequence of odd and even pages:

      <zeroOrMore>
        <ref name="odd"/>
        <ref name="even"/>
      </zeroOrMore>
      <optional>
        <ref name="odd"/>
      </optional>

or:

 (odd, even)*, odd?

This pattern is not ambiguous. Given any valid combination of odd and even pages, it is possible to know which pattern has matched each of the pages. It can't be deterministic, however: when you read an odd page, you can't tell whether it matches the odd inside the repeated group or the optional trailing odd without looking ahead to see whether it is the last page, and thus whether an even page is required in the next position.
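The following Python sketch illustrates both properties. It is a hand-written matcher for this one pattern, not a general algorithm, and the function name attribute_pages and the labels it returns are invented for the example. Each page receives exactly one label (the pattern is unambiguous), but the label of an odd page can only be chosen after peeking at what follows it (the pattern is not deterministic).

```python
def attribute_pages(pages):
    """Label each page with the particle of (odd, even)*, odd? that matched it.

    Raises ValueError if the sequence is not valid against the pattern.
    """
    labels = []
    i = 0
    while i < len(pages):
        if pages[i] != "odd":
            raise ValueError(f"expected an odd page at position {i}")
        # Look ahead: the odd page alone does not tell us which particle it is.
        if i + 1 == len(pages):
            # Nothing follows, so this is the optional trailing odd page.
            labels.append("trailing odd")
            i += 1
        elif pages[i + 1] == "even":
            # An even page follows, so both belong to the repeated group.
            labels.extend(["odd in group", "even in group"])
            i += 2
        else:
            raise ValueError(f"expected an even page at position {i + 1}")
    return labels


print(attribute_pages(["odd", "even", "odd"]))
```

Note the two tests on position i + 1: they are the one-page lookahead that a deterministic content model is not allowed to need.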

W3C XML Schema requires deterministic content models through its "Unique Particle Attribution" and "Element Declarations Consistent" rules. These rules forbid this simple and useful content model!

Another example of a non-deterministic pattern is:

 <choice>
  <element name="foo">
   <attribute name="bar"/>
  </element>
  <element name="foo">
   <element name="bar">
    <text/>
   </element>
  </element>
 </choice>

or:

 element foo { attribute bar { text } } | element foo { element bar { text } }

This one would seem easier to translate. At least, it can be factored and rewritten as a deterministic pattern in RELAX NG as:

 <element name="foo">
  <choice>
   <attribute name="bar"/>
   <element name="bar">
    <text/>
   </element>
  </choice>
 </element>

or:

 element foo { attribute bar { text } | element bar { text } }

Unfortunately, this doesn't help us translate our schema into W3C XML Schema, since W3C XML Schema doesn't know how to express constraints that mix child elements and attributes, except through difficult hacks with key definitions which don't work in all cases.
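To spell out what the factored RELAX NG pattern demands, and what a W3C XML Schema content model can't say, here is a minimal Python sketch of the same constraint written as an application-level check. It uses only the standard xml.etree.ElementTree module; the function name foo_is_valid is invented for this illustration, and the check is an approximation of the pattern, not a substitute for a RELAX NG validator.

```python
import xml.etree.ElementTree as ET


def foo_is_valid(xml_text):
    """Accept foo with either a bar attribute or a text-only bar child, never both."""
    foo = ET.fromstring(xml_text)
    if foo.tag != "foo":
        return False
    children = list(foo)
    # foo itself may only contain whitespace around its children.
    stray_text = (foo.text or "") + "".join(child.tail or "" for child in children)
    if stray_text.strip():
        return False
    # First branch of the choice: a bar attribute and no child elements.
    attribute_branch = set(foo.attrib) == {"bar"} and not children
    # Second branch: no attributes and a single bar child holding only text.
    element_branch = (
        not foo.attrib
        and len(children) == 1
        and children[0].tag == "bar"
        and not children[0].attrib
        and len(children[0]) == 0
    )
    return attribute_branch or element_branch


print(foo_is_valid('<foo bar="x"/>'))
print(foo_is_valid('<foo><bar>x</bar></foo>'))
print(foo_is_valid('<foo bar="x"><bar>x</bar></foo>'))
```

The point of the sketch is the last line of the function: the choice spans an attribute and an element, which is exactly the kind of co-occurrence constraint that does not fit in a W3C XML Schema complex type.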

Making sure that your schemas are deterministic is thus a good practice when you plan to translate your schemas into W3C XML Schema. Unfortunately there's no guarantee that they will translate gracefully. The only rule I can give if you want to make sure that your schemas will be easy to translate is to check the result of translation frequently as you write your schema. Also hope that James Clark will continue to improve the Trang conversion algorithm!

Nevertheless, W3C XML Schema deals nicely with datatype ambiguities. We can re-examine our example of datatype ambiguity:

 element foo{xsd:boolean|xsd:integer}

You may be surprised to learn that it translates gracefully into:

  <xs:element name="foo">
    <xs:simpleType>
      <xs:union memberTypes="xs:boolean xs:integer"/>
    </xs:simpleType>
  </xs:element>

This is not considered ambiguous because W3C XML Schema has added a rule. When several datatypes are grouped "by union", which is effectively what our choice between datatypes does, a processor should stop at the first member type which matches, in the order in which the types are listed, and not evaluate the remaining alternatives.
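The first-match rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the two member types are approximated by regular expressions over their lexical forms (true, false, 1 and 0 for xs:boolean; an optionally signed run of digits for xs:integer), and the function name assign_member_type is invented for the example.

```python
import re

# Member types in the order in which they are listed in the union.
# Assumption: each type is approximated by a regular expression on its lexical form.
MEMBER_TYPES = [
    ("xs:boolean", re.compile(r"true|false|1|0")),
    ("xs:integer", re.compile(r"[+-]?[0-9]+")),
]


def assign_member_type(value):
    """Return the first member type whose lexical space accepts the value."""
    collapsed = " ".join(value.split())  # collapse surrounding whitespace
    for name, lexical in MEMBER_TYPES:
        if lexical.fullmatch(collapsed):
            return name  # stop here: later member types are not tried
    return None


for sample in ["true", "1", "42", "maybe"]:
    print(sample, assign_member_type(sample))
```

The value 1 is valid against both member types, yet it is always assigned xs:boolean because that type comes first; swapping the order in memberTypes would change the assignment to xs:integer.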


All text is copyright Eric van der Vlist, Dyomedea. During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation GFDL.