Some ideas to make disambiguation easier

RELAX NG by Eric van der Vlist will be published by O'Reilly & Associates (ISBN: 0596004214)

Chapter 16: Determinism and Datatype Assignment

To close this chapter, I'd like to present some ideas that would ease the challenge of disambiguating schemas.

Generalizing the except pattern

Among the different forms of ambiguity, name classes have been the easiest to disambiguate. Why is this? Name classes are not inherently simpler than regular expressions or datatypes. All of these tools are about defining sets of things that can happen in an XML document, and in many ways they are deeply similar. The reason name classes have been easier to disambiguate is that they have a first-class except operator. If we had the same level of support for patterns and datatypes, we could disambiguate them more easily.

If we could apply the except pattern to datatypes, it would be possible to disambiguate our example:

 element foo{xsd:boolean|xsd:integer}

by writing:

 element foo{ (xsd:boolean - xsd:integer) |xsd:integer}

A value which is only an integer will obviously match only the right alternative. A value which is exclusively boolean (true or false) will match the left alternative. A value which is both a boolean and an integer (0 or 1) satisfies the first condition of the left alternative (xsd:boolean) but is rejected by its except clause, so it too matches only the right alternative.
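The effect of the except clause can be checked with a small Python sketch. It is only an illustration: the two lexical checks are simplified stand-ins for xsd:boolean and xsd:integer (no whitespace handling), and the function names are hypothetical rather than part of any RELAX NG library.

```python
import re

# Simplified lexical checks, assumed for illustration only.
def is_boolean(value):
    return value in {"true", "false", "1", "0"}

def is_integer(value):
    return re.fullmatch(r"[+-]?[0-9]+", value) is not None

def matching_alternatives(value, use_except):
    """List the alternatives of the choice that accept `value`.

    use_except=False models  xsd:boolean | xsd:integer
    use_except=True  models  (xsd:boolean - xsd:integer) | xsd:integer
    """
    matches = []
    if is_boolean(value) and not (use_except and is_integer(value)):
        matches.append("left")
    if is_integer(value):
        matches.append("right")
    return matches

for sample in ["true", "0", "42"]:
    print(sample,
          matching_alternatives(sample, use_except=False),
          matching_alternatives(sample, use_except=True))
```

Without the except clause, "0" is accepted by both alternatives. With it, every sample is accepted by exactly one.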

Unfortunately, this can't be generalized beyond the scope of data patterns (note that the examples given below with the except ( -) operator are not valid RELAX NG).

If this could be generalized, and applied to an ambiguous regular expression such as:

 two|(one?,two+,three*)

we would be able to write:

 two|((one?,two+,three*)-two)

Of course, this same set of results can be created with the existing RELAX NG patterns, but a generalized except would make that flexibility much more accessible.
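One possible rewrite using only existing operators (my own reading, not a form given in the text) is two|(one,two+,three*)|(two,((two+,three*)|three+)). The Python sketch below checks that claim by brute force. It assumes a hypothetical one-letter encoding (a for one, b for two, c for three) and uses ordinary regular expressions in place of RELAX NG patterns.

```python
import re
from itertools import product

# Hypothetical encoding: a = one, b = two, c = three.
ORIGINAL = re.compile(r"a?b+c*")        # (one?,two+,three*)

# Alternatives of the rewritten, unambiguous choice.
ALTERNATIVES = [
    re.compile(r"b"),                   # two
    re.compile(r"ab+c*"),               # (one,two+,three*)
    re.compile(r"b(?:b+c*|c+)"),        # (two,((two+,three*)|three+))
]

def count_matches(sequence):
    """Number of rewritten alternatives that accept the whole sequence."""
    return sum(1 for p in ALTERNATIVES if p.fullmatch(sequence))

def rewrite_is_equivalent_and_unambiguous(max_length=6):
    """Compare both forms on every sequence up to max_length items."""
    for n in range(max_length + 1):
        for letters in product("abc", repeat=n):
            s = "".join(letters)
            accepted_before = s == "b" or ORIGINAL.fullmatch(s) is not None
            expected = 1 if accepted_before else 0
            if count_matches(s) != expected:
                return False
    return True

print(rewrite_is_equivalent_and_unambiguous())
```

The check only covers sequences up to a fixed length, so it is evidence rather than a proof, but it shows how much less readable the manual rewrite is than the hypothetical except form.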

Making disambiguation rules explicit

My second proposal is far less disruptive. The idea is simply that these constructs remain ambiguous only because nothing has been specified to resolve them. Other computer languages offer plenty of examples of ambiguities that have been partially or fully resolved by explicit rules: XSLT templates, the order of evaluation of statements in programming languages or, as we've seen in the section about W3C XML Schema, unions of datatypes.

There is nothing preventing the creation of a specification defining a priority for the alternatives to be used by applications interested in instance annotation at large when they encounter ambiguities.

This specification wouldn't need to apply to RELAX NG processors interested only in validation and would not compromise their optimizations. It would apply only to RELAX NG processors performing instance annotation. It would also guarantee consistent and interoperable annotation for schemas which are currently considered ambiguous.

The rule could be as simple as "use the first alternative in document order" or it could also take into account additional factors such as giving a lesser precedence to included grammars, as XSLT does with stylesheet imports.
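As a rough sketch of the simplest such rule, the Python below annotates a value with the first alternative that accepts it, in document order. The lexical checks are simplified stand-ins for xsd:boolean and xsd:integer, and annotate_first_match is a hypothetical name, not an existing API.

```python
import re

# Simplified lexical checks, assumed for illustration only.
def is_boolean(value):
    return value in {"true", "false", "1", "0"}

def is_integer(value):
    return re.fullmatch(r"[+-]?[0-9]+", value) is not None

# Alternatives in document order, as in element foo{xsd:boolean|xsd:integer}.
ALTERNATIVES = [("xsd:boolean", is_boolean), ("xsd:integer", is_integer)]

def annotate_first_match(value, alternatives=ALTERNATIVES):
    """Return the name of the first alternative accepting value, or None."""
    for name, accepts in alternatives:
        if accepts(value):
            return name
    return None

for sample in ["1", "42", "abc"]:
    print(sample, annotate_first_match(sample))
```

Under this rule "1" is annotated as xsd:boolean simply because that alternative comes first. Reversing the order of the alternatives would annotate it as xsd:integer, which is why the rule has to be written down to be interoperable.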

Accepting ambiguity

Jeni Tennison proposed a third approach on the xml-dev mailing list: instead of trying to fight against ambiguity, why not accept it? Why couldn't we acknowledge that something can have several datatypes (or models), and be of datatype "A" and datatype "B" at the same time? Why couldn't a value be an integer and a boolean simultaneously?

This idea would have a serious impact on specifications such as XPath 2.0, which assign a single datatype to each simple type element and attribute, but it would be much more compatible with the principle that markup is only the projection of a structure over a document. It often happens that a piece of text has several meanings. By extension, acknowledging that elements and attributes may belong to multiple datatypes at the same time seems like something obvious, yet clever, to do.
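In code, accepting ambiguity could mean annotating a value with the set of all datatypes it belongs to instead of a single one. The Python sketch below illustrates this under the same simplifying assumptions as before: the lexical checks only approximate xsd:boolean and xsd:integer, and annotate_all is a hypothetical name.

```python
import re

# Simplified lexical checks, assumed for illustration only.
def is_boolean(value):
    return value in {"true", "false", "1", "0"}

def is_integer(value):
    return re.fullmatch(r"[+-]?[0-9]+", value) is not None

DATATYPES = {"xsd:boolean": is_boolean, "xsd:integer": is_integer}

def annotate_all(value):
    """Return the set of all datatype names whose check accepts value."""
    return {name for name, accepts in DATATYPES.items() if accepts(value)}

for sample in ["true", "1", "42"]:
    print(sample, sorted(annotate_all(sample)))
```

Here "1" carries both datatypes at once, and a consumer is free to pick whichever interpretation it needs.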

All text is copyright Eric van der Vlist, Dyomedea. During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation GFDL.