by Eric van der Vlist is published by O'Reilly & Associates (ISBN: 0596004214)
First, I will clarify the concepts of ambiguity and determinism. They're blurred in many papers and discussions, yet aren't as obscure as people often think.
The first distinction is the difference between ambiguity and determinism. These two terms have been given precise definitions by regular expressions and hedge grammar theoreticians, and in this chapter I will follow the usage of these terms as defined by Murata Makoto. Part of the confusion about these notions comes from their frequent misuse.
A schema is said to be ambiguous when a document may be valid when its contents match multiple different pattern alternatives. A trivial example is:
element foo{empty} | element foo{empty} |
When an empty element named foo is found in an instance document, there is no way to say whether it is valid per the left definition or the right definition of element foo{empty} in the schema.
There are, of course, more complex cases of ambiguity, and you'll see some of them in the next sections. Still, this is the general idea behind ambiguity.
Ambiguity is independent of any implementation or algorithm. It's a property of the schema itself; a schema is either ambiguous or not ambiguous.
Determinism has been introduced to facilitate the implementation of schema processors. The basic idea behind determinism is that at each point, when matching an instance document against a schema, the schema processor has at most one possible choice. This makes life easier for implementers, who can safely rely on well-known algorithms such as automatons (also called Finite State Machines or FSMs). Thus, they can be sure that their computation times will not grow exponentially. Because of the need to avoid exponential growth in processing times, schema languages often impose determinism on schema authors.
An ambiguous schema is never deterministic, but the opposite is far from being true. Consider, for instance:
element foo{empty} | (element foo{empty}, element bar{empty}) |
This example isn't ambiguous. A schema processor can say whether the right or left alternative is used (or neither, if the document is invalid) after having read the element that follows the empty element named foo. This is, however, also nondeterministic, since when a schema processor matches an empty element named foo, it has two different choices. It can't make the choice between them without looking ahead to the next element.
Ambiguous schemas aren't a problem as far as validation is concerned. The schema's validation reports are consistent, and we don't care why a document is valid or not as long as the answer (valid or invalid) is reliable. The only downside to ambiguous schemas is for applications performing datatype assignment—or, more generally, instance document annotation—using validation. You will see more about these issues in the next sections of this chapter.
The main problem with schema languages that require deterministic schemas is that some content models are fundamentally nondeterministic and can't be rewritten in a deterministic form. Such deterministic schema languages not only create restrictions to the forms used to write a schema, but limit their expressive power. They can't describe all the content models allowed in well-formed XML. You will see examples of content models that are impossible to describe in a deterministic form in the section about compatibility with W3C XML Schema.
In a RELAX NG schema, you can distinguish four different types of ambiguity: regular expression ambiguities, hedge grammar ambiguities, name class ambiguities, and datatype ambiguities. I'll briefly introduce each of them, as they have slightly different behaviors.
After a schema has been simplified, we can make a clear distinction between the definition of each element (embedded in its own named pattern) and the grammar that combines these definitions. What's called a regular expression ambiguity is an ambiguity that resides within the definition of an element.
Theoreticians have demonstrated that ambiguous regular expressions may be rewritten in an unambiguous way. The ambiguities may be considered merely as unlucky variations over unambiguous schemas.
A basic example of such a choice between a pattern and itself is:
<choice> <ref name="pattern"/> <ref name="pattern"/> </choice> |
or:
pattern|pattern |
The unambiguous form is somewhat difficult to find when the ambiguous pattern gets more complex. For instance, the following pattern:
<choice> <group> <optional> <ref name="first"/> </optional> <ref name="second"/> </group> <group> <ref name="second"/> <optional> <ref name="third"/> </optional> </group> </choice> |
or:
(first?,second)|(second,third?) |
is ambiguous because an instance nodeset that matches only the named pattern second without first or third is valid per the two alternatives. It can be rewritten by removing the option of matching only the second pattern from one of the alternatives:
<group> <optional> <ref name="first"/> </optional> <ref name="second"/> </group> <group> <ref name="second"/> <ref name="third"/> </group> </choice> |
or:
(first?,second)|(second,third) |
Note that explicit choices aren't the only pattern that can lead to ambiguous schemas. Consider this simple pattern:
<group> <optional> <ref name="pattern"/> </optional> <optional> <ref name="pattern"/> </optional> <group> |
or:
pattern?, pattern? |
If you have a content model that matches only one pattern, you can't know if it will match it for the first or the second occurrence of the pattern. Therefore, this schema can be considered ambiguous. Here's how to rewrite it as an unambiguous schema:
<optional> <ref name="pattern"/> <optional> <ref name="pattern"/> </optional> </optional> |
or:
(pattern, pattern?)? |
Although the rewriting approach isn't easy, the math behind RELAX NG can help, as high school algebra helps factorize mathematical expressions. As an exercise, let's decompose the chain of factorizations and simplifications to rewrite pattern?, pattern? as (pattern, pattern?)?.
The first step relies on the fact that pattern? is equivalent to empty|pattern:
pattern?, pattern? |
which is equivalent to:
(empty|pattern), (empty|pattern) |
which can be factorized as:
(empty,empty)|(empty,pattern)|(pattern,empty)|(pattern,pattern) |
which can be simplified to:
empty|pattern|(pattern,pattern) |
which is equivalent to:
empty|(pattern,(empty|pattern)) |
which is equivalent to:
(pattern, pattern?)? |
I could argue that the unambiguous forms are clearer, more logical, and easier to read than the ambiguous, but I think that this conclusion is a subjective one. These different methods for creating unambiguous schemas are highly dependent on the perspective used to analyze the content of instance documents. There isn't a good or a bad practice for this: working with a schema language such as RELAX NG that supports all of these forms saves a lot of time. The language doesn't force you to use its perspective on how to solve this problem.
In a RELAX NG context, I defined regular expression ambiguity as an ambiguity that resides within the definition of an element. Ambiguous regular hedge grammars, on the other hand, are ambiguities distributed over element definitions that play the mathematical role of "hedges" in a RELAX NG schema. A example of an ambiguous regular hedge grammar is:
<choice> <ref name="pattern1"/> <ref name="pattern2"/> </choice> ... <define name="pattern1"> <element name="foo"> <empty/> </element> </define> <define name="pattern2"> <element name="foo"> <empty/> </element> </define> |
or, in the compact syntax:
pattern1|pattern2 ... pattern1=element foo{empty} pattern2=element foo{empty} |
This example is ambiguous. When an empty element, foo, is found, you can't tell whether it's been validated through pattern1 or pattern2. It's an ambiguous hedge grammar (not an ambiguous regular expression) because the ambiguity is spread over two hedges; i.e., two definitions of the element foo.
Just as in ambiguous regular expressions, ambiguous regular hedge grammars can be rewritten in unambiguous forms. The disambiguation must be done at the level of the grammar itself and may require extensive changes to the structure of the schema.
The exercise of disambiguating regular hedge grammars can get significantly more complicated when compositions of named patterns and grammars are involved. Maintaining nonambiguous patterns while combining definitions by choice means that you need to exclude all the instance nodesets valid per the original definition from the pattern given as a choice. This isn't always possible without modifying the included schema. Consider, for instance, this pattern:
<define name="namedPattern"> <ref name="first"/> </define> |
or this:
namedPattern=first |
To add an optional second pattern, it may seem natural to combine it by choice as:
<define name="namedPattern" combine="choice"> <ref name="first"/> <optional> <ref name="second"/> </optional> </define> |
or:
namedPattern|=first,second? |
The result of the combination is equivalent to:
<define name="namedPattern"> <choice> <ref name="first"/> <group> <ref name="first"/> <optional> <ref name="second"/> </optional> </group> </choice> </define> |
or:
namedPattern=first|(first,second?) |
which gives an ambiguous pattern. Of course, outside the context of a pattern combination, this example would be trivial to rewrite as:
<define name="namedPattern"> <ref name="first"/> <optional> <ref name="second"/> </optional> </define> |
or:
namedPattern=first,second? |
but in this case, you won't get to disambiguation directly by pattern combination. You need to look at the problem from a different angle. Consider that you must leave out all features used in the alternative unless those features are already allowed in the original. In other words, you need to remove the case in which the first pattern isn't followed by the second one from the target of "the first pattern followed by an optional second pattern." The alternative will thus be between the first pattern alone and the first pattern followed by a second one:
<define name="namedPattern" combine="choice"> <choice> <ref name="first"/> <group> <ref name="first"/> <ref name="second"/> </group> </choice> </define> |
or:
namedPattern=first|(first,second) |
With this target in mind, you can rewrite the combination as:
<define name="namedPattern" combine="choice"> <ref name="first"/> <ref name="second"/> </define> |
or:
namedPattern|=first,second |
To avoid ambiguous hedge grammars, be careful when combining named patterns by choice. Without knowing how pattern1 and pattern2 are defined, it's impossible to say whether the following are ambiguous:
<choice> <ref name="pattern1"/> <ref name="pattern2"/> </choice> |
and:
pattern1|pattern2 |
This difficulty requires that if you wish to create unambiguous schemas, you must study the contents of schemas you want to include very carefully.
Another source of ambiguity occurs when name classes overlap when used as different alternatives of a choice. An example of such overlap is:
<choice> <element name="foo"> <empty/> </element> <element> <anyName/> <empty/> </element> </choice> |
or:
element foo{empty} | element * {empty} |
This example is ambiguous because the name class anyName includes the name class matching the name foo. An element foo is valid in both branches of the choice pattern.
The except name class can prevent name class ambiguity, because it lets you remove the overlap from one of the alternatives. This pattern can easily be rewritten as an ambiguous choice pattern:
<choice> <element name="foo"> <empty/> </element> <element> <anyName> <except> <name>foo</name> </except> </anyName> <empty/> </element> </choice> |
or:
element foo{empty} | element * - foo {empty} |
or, more simply:
<element> <anyName> <except> <name>foo</name> </except> </anyName> <empty/> </element> |
or:
element * {empty} |
Note that name class overlap isn't enough by itself to make an ambiguous pattern. For instance:
<choice> <element name="foo"> <attribute name="bar"> <empty/> </attribute> </element> <element> <anyName/> <empty/> </element> </choice> |
or:
element foo{attribute bar{empty}} | element * {empty}} |
This example isn't ambiguous because the content model of the elements with the two name classes don't overlap. This makes the bar attribute optional:
<choice> <element name="foo"> <optional> <attribute name="bar"> <empty/> </attribute> </optional> </element> <element> <anyName/> <empty/> </element> </choice> |
or:
element foo{attribute bar{empty}?} | element * {empty}} |
This code is enough to make our pattern ambiguous. However, this pattern is strictly equivalent to the preceding example, which means that you know how to rewrite it in an unambiguous way.
Finally, note that name class ambiguity may be considered as an extension of regular hedge grammar ambiguity. If, after simplification, to create a regular hedge grammar ambiguity, you have:
element foo{empty} | element foo{empty} |
the ambiguity comes from the fact that the name classes for both alternatives are the single value foo and thus, overlap.
Datatype ambiguity is the most difficult ambiguity to handle with RELAX NG. The difficulty doesn't come from RELAX NG itself, but rather from the fact that datatype libraries aren't built-in and are more opaque and less flexible than other patterns or name classes.
A basic example of ambiguous datatypes is:
<element name="foo"> <choice> <data type="boolean"/> <data type="integer"/> </choice> </element> |
or:
element foo{xsd:boolean|xsd:integer} |
Because the lexical space of the two possible datatypes do overlap (0 and 1 are valid as both W3C XML Schema booleans and integers), there is no way to determine what the datatype is for a foo element with a value of 0 or 1. Fortunately, the except operator makes it possible to remove the lexical space of one datatype from the lexical space of another datatype:
<element name="foo"> <choice> <data type="boolean"> <except> <data type="integer"/> </except> </data> <data type="integer"/> </choice> </element> |
or:
element foo{ (xsd:boolean - xsd:integer) |xsd:integer } |
As has been the case for name classes, the except pattern is a powerful tool for disambiguating datatype ambiguities. It's a pity that this pattern can't also be used to disambiguate regular expression or hedge grammar ambiguities.
This text is released under the Free Software Foundation GFDL.