Some Ideas to Make Disambiguation Easier

To close this chapter, I'll present some ideas that should ease the challenge of disambiguating schemas.

Generalizing the Except Pattern

Among the different forms of ambiguity, those involving name classes have been the easiest to remove. Why is this? Name classes aren't inherently simpler than regular expressions or datatypes. All these tools define sets of things that can happen in XML documents, and in many ways they are deeply similar. The reason that name classes and datatypes have been easier to disambiguate is that they have a first-class except operator. If patterns in general had the same level of support, you could disambiguate them more easily.

It is possible to apply the except pattern to datatypes and write:

element foo { (xsd:boolean - xsd:integer) | xsd:integer }

A value that is only an integer will obviously match only the right alternative. A value that is exclusively boolean (true or false) matches the left alternative. A value that is both a boolean and an integer (0 or 1) satisfies the first condition of the left alternative (xsd:boolean) but is rejected by its except clause, because it is also an integer. It therefore matches only the right alternative.
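The behavior of this pattern can be sketched in a few lines of Python. This is only an illustration, not a RELAX NG implementation: the helper names are hypothetical, and the lexical checks are simplified (whitespace handling and facets are ignored).

```python
import re

# Simplified lexical checks (assumption: no whitespace normalization, no facets).
def is_boolean(value: str) -> bool:
    return value in {"true", "false", "0", "1"}

def is_integer(value: str) -> bool:
    return re.fullmatch(r"[+-]?[0-9]+", value) is not None

def matching_alternatives(value: str) -> list[str]:
    """List the alternatives of (xsd:boolean - xsd:integer) | xsd:integer
    that accept the given value."""
    matches = []
    # Left alternative: a boolean that is not also an integer.
    if is_boolean(value) and not is_integer(value):
        matches.append("left")
    # Right alternative: any integer.
    if is_integer(value):
        matches.append("right")
    return matches
```

With the except clause in place, no value is accepted by both alternatives: "true" goes to the left one, while "1" and "42" go to the right one.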

Unfortunately, this rule can't be generalized beyond the scope of data patterns. (Note that the examples given next with the except (-) operator aren't valid RELAX NG.)

If this rule could be generalized, and applied to an ambiguous regular expression such as:

 two|(one?,two+,three*)

you could write:

 two|((one?,two+,three*)-two)

Of course, this same set of results can be created with the existing RELAX NG patterns, but a generalized except would make that flexibility much more accessible.
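As one possible illustration of such a rewrite, the content model with the lone two removed could be spelled out as (one, two+, three*) | (two, two+, three*) | (two, three+). The Python sketch below checks this candidate by brute force. It assumes a hypothetical encoding of one, two, and three as the letters a, b, and c, and uses Python regular expressions as a stand-in for content models, so it is a sanity check over short sequences rather than a proof.

```python
import re
from itertools import product

# Hypothetical encoding: one -> "a", two -> "b", three -> "c".
ORIGINAL = re.compile(r"a?b+c*")            # (one?, two+, three*)
REWRITTEN = re.compile(r"ab+c*|bb+c*|bc+")  # candidate rewrite without the lone "two"

def sequences(max_len: int):
    """Yield every sequence over the three element names up to max_len."""
    for length in range(max_len + 1):
        for combo in product("abc", repeat=length):
            yield "".join(combo)

def rewrite_is_equivalent(max_len: int = 6) -> bool:
    """Check that REWRITTEN accepts exactly what ORIGINAL accepts, minus 'b'."""
    for seq in sequences(max_len):
        expected = ORIGINAL.fullmatch(seq) is not None and seq != "b"
        actual = REWRITTEN.fullmatch(seq) is not None
        if expected != actual:
            return False
    return True
```

Calling rewrite_is_equivalent() compares the two expressions on every sequence of up to six elements. The three-branch rewrite is noticeably harder to read than the single except it replaces, which is the accessibility point made above.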

Making Disambiguation Rules Explicit

My second proposal is far less disruptive. It rests on the observation that these schemas remain ambiguous only because nothing has been done to rule the ambiguity out. Other computer languages offer plenty of examples of ambiguities that have been partially or fully resolved by rule: XSLT templates, the order of evaluation of statements in programming languages, or, as we've seen in the section about W3C XML Schema, unions of datatypes.

Nothing prevents the creation of a specification that defines a priority among alternatives, to be used by applications interested in instance annotation whenever they encounter an ambiguity.

This specification wouldn't need to apply to RELAX NG processors interested only in validation and would not compromise their optimizations. It could apply only to RELAX NG processors performing instance annotation. It would also guarantee a consistent and interoperable type of annotation for schemas that are currently considered to be ambiguous.

The rule could be as simple as "use the first alternative in document order" or could also take into account additional factors, such as giving a lesser precedence to included grammars, as XSLT does with stylesheet imports.
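A priority rule of this kind might look like the following Python sketch. Everything here is an assumption made for illustration: the Alternative class, the import_depth field (0 for the main grammar, larger values for included grammars), and the simplified datatype checks are hypothetical and come from no existing specification.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Alternative:
    name: str
    accepts: Callable[[str], bool]
    import_depth: int = 0  # hypothetical: 0 = main grammar, higher = included grammar

def annotate(value: str, alternatives: list[Alternative]) -> Optional[str]:
    """Pick one alternative: lowest import depth first, then document order."""
    candidates = [
        (alt.import_depth, index, alt.name)
        for index, alt in enumerate(alternatives)
        if alt.accepts(value)
    ]
    if not candidates:
        return None
    # Tuples compare by depth, then by position; positions are unique.
    return min(candidates)[2]

# Example choice: xsd:boolean | xsd:integer (simplified lexical checks).
CHOICE = [
    Alternative("xsd:boolean", lambda v: v in {"true", "false", "0", "1"}),
    Alternative("xsd:integer", lambda v: re.fullmatch(r"[+-]?[0-9]+", v) is not None),
]
```

Under this sketch, annotate("1", CHOICE) selects xsd:boolean because it comes first in document order. If the boolean alternative came from an included grammar (import_depth=1), the same value would be annotated as xsd:integer instead.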

Accepting Ambiguity

Jeni Tennison proposed a third approach on the xml-dev mailing list: instead of trying to fight against ambiguity, why not accept it? Why couldn't we acknowledge that something can have several datatypes (or models), and be of datatype "A" and datatype "B" at the same time? Why couldn't a value be an integer and a boolean simultaneously?

This idea would have a serious impact on specifications such as XPath 2.0, which assign a single datatype to each simple-type element and attribute, but it would be much more compatible with the principle that markup is only the projection of a structure over a document. It often happens that a piece of text can have several meanings. By extension, acknowledging that elements and attributes may belong to multiple datatypes at the same time seems like something obvious, yet clever, to do.
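In code, accepting ambiguity could amount to annotating a value with a set of datatypes rather than a single one. The following Python sketch is purely illustrative: the DATATYPES table and the datatypes_of function are hypothetical names, and the lexical checks are simplified.

```python
import re

# Hypothetical registry of simplified datatype checks.
DATATYPES = {
    "xsd:boolean": lambda v: v in {"true", "false", "0", "1"},
    "xsd:integer": lambda v: re.fullmatch(r"[+-]?[0-9]+", v) is not None,
}

def datatypes_of(value: str) -> frozenset[str]:
    """Return every datatype the value belongs to, instead of choosing one."""
    return frozenset(name for name, accepts in DATATYPES.items() if accepts(value))
```

Here the value "1" is annotated as both xsd:boolean and xsd:integer, and it is left to the consuming application to decide which reading it needs.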
