by Eric van der Vlist is published by O'Reilly & Associates (ISBN: 0596004214)


The Downsides of Ambiguous and Nondeterministic Content Models

Again, if you're interested in using a RELAX NG schema only for validation—which, after all, is the primary goal of RELAX NG—it is perfectly fine to design and use nondeterministic and even ambiguous schemas. The downsides of ambiguous schemas appear when using RELAX NG schemas for adding validation information to instance documents or using a RELAX NG schema for guided editing. The downsides of nondeterministic schemas appear only when translating schemas into a W3C XML Schema.

Instance Annotations

For the purposes of RELAX NG, instance annotation is the ability to attach information gathered during validation to an instance document in order to facilitate its processing. Instance annotation is one of the more promising paths to automating XML document processing. Its applications cover domains from datatype assignment (the basis of XQuery 1.0, XPath 2.0, and XSLT 2.0), to data binding (automating the creation of objects from XML documents and the creation of XML documents from objects), to XML guided editing.

Some tools may have more stringent requirements, depending on their algorithms (for instance, a SAX-based streaming tool might require deterministic schemas), but in theory (and in general), it is sufficient for the applications of instance annotations to ensure that the annotations are consistent. Consistency can be achieved if the schema is unambiguous.

Note that even this freedom from ambiguity isn't always required; the requirements are application-dependent. Consider a data binding application that needs to know the content model of each element. This application might have trouble determining which content model to use if it finds a pattern such as the following and an element foo whose content consists of a single second element:

 element foo {first?,second}
 |element foo {second,third?}

 first=element first{xsd:integer}
 second=element second{xsd:token}
 third=element third{xsd:boolean}

Should it bind the contents of foo to an object allowing an optional first or to an object allowing an optional third? Such ambiguity is likely to be a problem for this application. On the other hand, if all you need to do is perform simple type assignment, this schema is perfectly fine. Even though it is ambiguous, there is no ambiguity as far as datatype assignment is concerned.
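
To make the ambiguity concrete, here is a minimal Python sketch that lists which of the two foo alternatives accept a given sequence of child element names. It is only an illustration: the regular expressions and the helper name matching_alternatives are assumptions made for this example, datatypes are ignored, and nothing here is part of RELAX NG or of any validator.

```python
import re

# Hypothetical stand-ins for the two alternatives of the foo pattern.
# Child element names are joined with single spaces; datatypes are ignored.
ALTERNATIVES = {
    "first?, second": re.compile(r"(first )?second"),
    "second, third?": re.compile(r"second( third)?"),
}

def matching_alternatives(children):
    """Return the alternatives whose content model accepts the child names."""
    content = " ".join(children)
    return [name for name, rx in ALTERNATIVES.items() if rx.fullmatch(content)]

for children in (["first", "second"], ["second", "third"], ["second"]):
    print(children, "->", matching_alternatives(children))
```

A foo containing only second is accepted by both alternatives, which is exactly the case that leaves a data binding tool without a single content model to bind to. Since second is an xsd:token under either alternative, the datatype assignment is the same in both cases.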

Being aware of ambiguity in your RELAX NG schemas is good practice. If you want to support instance annotation applications, you must also check the tools you will be using because they can have either more stringent or more relaxed requirements.

Compatibility with W3C XML Schema

I promised to give an example of an unambiguous pattern that isn't deterministic and can't be rewritten in a deterministic form. Here it is! Consider a pattern describing a book as a sequence of odd and even pages:

      <zeroOrMore>
        <ref name="odd"/>
        <ref name="even"/>
      </zeroOrMore>
      <optional>
        <ref name="odd"/>
      </optional>

or:

 (odd, even)*, odd?

This pattern isn't ambiguous. Given any valid combination of odd and even pages, it is possible to know which pattern has matched each page. It can't be deterministic, however, because for each odd page, you need to look ahead to the next page to see whether the current one is the last before knowing whether an even page is required in the next position.
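
The lookahead can be sketched in a few lines of Python. The function below attributes each page to the particle it matches; the function name and the labels are hypothetical, chosen only for this illustration, and pages are represented as plain strings.

```python
def attribute_pages(pages):
    """Attribute each page of (odd, even)*, odd? to the particle it matches.

    Returns a list of labels, or None if the sequence is not valid.
    """
    labels = []
    i = 0
    while i < len(pages):
        if pages[i] != "odd":
            return None
        # Lookahead: the current odd page belongs to the repeated group only
        # if an even page follows it; if nothing follows, it is the optional
        # final odd page.
        if i + 1 == len(pages):
            labels.append("final odd")
            i += 1
        elif pages[i + 1] == "even":
            labels += ["group odd", "group even"]
            i += 2
        else:
            return None
    return labels

print(attribute_pages(["odd", "even", "odd"]))
```

Every valid sequence gets exactly one labeling, which is why the pattern is unambiguous, but no odd page can be labeled without first inspecting what comes after it.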

W3C XML Schema requires deterministic content models through its "Unique Particle Attribution" and "Element Declarations Consistent" rules. These rules forbid this simple and useful content model!
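
One way to see why a determinism rule rejects this content model is to compute, for each point in the model, which particles may come next, and then look for two particles with the same name competing at the same point. The Python sketch below does this with first and follow sets. It is a simplified model and not the normative definition from the W3C XML Schema Recommendation; the tuple encoding of patterns and the names analyse, clashes, and BOOK are assumptions made for this example.

```python
from collections import defaultdict

def analyse(node, follow):
    """Return (nullable, first, last) for node and fill the follow table.

    Positions are (name, index) pairs, so each particle stays distinct.
    """
    kind = node[0]
    if kind == "sym":
        pos = (node[1], node[2])
        return False, {pos}, {pos}
    if kind in ("opt", "star"):
        _, first, last = analyse(node[1], follow)
        if kind == "star":
            # After the end of one iteration, another iteration may start.
            for p in last:
                follow[p] |= first
        return True, first, last
    n1, f1, l1 = analyse(node[1], follow)
    n2, f2, l2 = analyse(node[2], follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    # kind == "seq": whatever ends the left part may be followed by the
    # start of the right part.
    for p in l1:
        follow[p] |= f2
    return (n1 and n2,
            (f1 | f2) if n1 else f1,
            (l1 | l2) if n2 else l2)

def clashes(node):
    """List (context, name) pairs where two particles compete for one name."""
    follow = defaultdict(set)
    _, first, _ = analyse(node, follow)
    contexts = [("start", first)]
    contexts += [(f"after {name}#{index}", positions)
                 for (name, index), positions in sorted(follow.items())]
    found = []
    for context, positions in contexts:
        names = [name for name, _ in positions]
        for name in sorted(set(names)):
            if names.count(name) > 1:
                found.append((context, name))
    return found

# (odd, even)*, odd?  with the three particles numbered 1, 2 and 3
BOOK = ("seq",
        ("star", ("seq", ("sym", "odd", 1), ("sym", "even", 2))),
        ("opt", ("sym", "odd", 3)))

print(clashes(BOOK))
```

For the book pattern, the two odd particles compete both at the start and after each even page, so the model is reported as nondeterministic. The repeated group (odd, even)* on its own produces no clash.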

Another example of a nondeterministic pattern is:

 <choice>
  <element name="foo">
   <attribute name="bar"/>
  </element>
  <element name="foo">
   <element name="bar">
    <text/>
   </element>
  </element>
 </choice>

or:

 element foo {attribute bar} | element foo {element bar {text}}

This one seems easier to translate. At least, it can be factorized and rewritten as a deterministic pattern in RELAX NG as:

 <element name="foo">
  <choice>
   <attribute name="bar"/>
   <element name="bar">
    <text/>
   </element>
  </choice>
 </element>

or:

 element foo {attribute bar| element bar {text}}

Unfortunately, this doesn't help to translate our schema into a W3C XML Schema because W3C XML Schema doesn't know how to handle the mixing of constraints on subelements and attributes except by using difficult hacks with key definitions, which don't work in all cases.
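
To spell out the constraint that is so hard to carry over, here is a hand-written Python check of the factorized foo pattern using the standard xml.etree.ElementTree module. It is a sketch for illustration only, not a RELAX NG validator; the function name foo_variant is hypothetical, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET

def foo_variant(xml_text):
    """Return 'attribute', 'element', or None for an invalid foo."""
    foo = ET.fromstring(xml_text)
    if foo.tag != "foo":
        return None
    children = list(foo)
    only_whitespace = not (foo.text or "").strip()
    # Alternative 1: a bar attribute and nothing else.
    if set(foo.attrib) == {"bar"} and not children and only_whitespace:
        return "attribute"
    # Alternative 2: no attributes and a single bar child holding only text.
    if (not foo.attrib and only_whitespace and len(children) == 1
            and children[0].tag == "bar" and not children[0].attrib
            and len(children[0]) == 0
            and not (children[0].tail or "").strip()):
        return "element"
    return None

print(foo_variant('<foo bar="x"/>'))
print(foo_variant('<foo><bar>x</bar></foo>'))
print(foo_variant('<foo bar="x"><bar>y</bar></foo>'))
```

The point of the sketch is that a single decision spans both the attributes and the children of foo: the presence of the attribute excludes the subelement, and the reverse.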

Making sure your schemas are deterministic is thus a good practice when you plan to translate your schemas into W3C XML Schemas. Unfortunately there's no guarantee that they will translate gracefully. The only rule I can give if you want to make sure that your schemas will be easy to translate is to check the result of translation frequently as you write your schema. Also hope that James Clark will continue to improve the Trang conversion algorithm!

Nevertheless, W3C XML Schema deals nicely with datatype ambiguities. You can re-examine our example of datatype ambiguity:

 element foo{xsd:boolean|xsd:integer}

and you may be surprised to know that it translates gracefully into:

  <xs:element name="foo">
    <xs:simpleType>
      <xs:union memberTypes="xs:boolean xs:integer"/>
    </xs:simpleType>
  </xs:element>

This isn't considered ambiguous because W3C XML Schema has added a rule: when several datatypes are grouped "by union," which is effectively what our choice between datatypes does, a processor should stop after the first type that matches and not evaluate the remaining alternatives.
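
The first-match rule is easy to sketch in Python. The lexical checks below are simplified approximations of xs:boolean and xs:integer written for this example, and the names MEMBER_TYPES and first_matching_type are hypothetical; a real processor applies the full datatype definitions.

```python
import re

# Member types in the order given by memberTypes="xs:boolean xs:integer".
# The regular expressions are simplified lexical checks, not the full rules.
MEMBER_TYPES = [
    ("xs:boolean", re.compile(r"true|false|1|0")),
    ("xs:integer", re.compile(r"[+-]?[0-9]+")),
]

def first_matching_type(value):
    """Return the first member type whose lexical check accepts the value."""
    text = value.strip()  # rough stand-in for whitespace collapsing
    for name, lexical in MEMBER_TYPES:
        if lexical.fullmatch(text):
            return name  # stop here; later member types are not evaluated
    return None

for value in ("true", "1", "42", "maybe"):
    print(repr(value), "->", first_matching_type(value))
```

Because 1 and 0 are valid lexical forms of xs:boolean, the value 1 is typed as a boolean here even though it would also be a valid integer. The order of the member types therefore decides the outcome.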


This text is released under the Free Software Foundation GFDL.