RELAX NG by Eric van der Vlist will be published by O'Reilly & Associates (ISBN: 0596004214)


Extensible schemas

Sometimes building an extensible schema is a matter of capturing existing practice in RELAX NG; at other times schema development comes before practice, and the schema developer has the opportunity to make a lot of choices. We often have to do our best to write an extensible schema for an existing XML vocabulary, one whose contents are already specified (as if we were asked to cook the best "blanquette de veau"). This type of recipe contrasts with the type where we have the freedom to change the format itself and decide when to use elements or attributes, whether order matters, and many other variables (as if we were asked to cook the best food containing veal without being told which meal was considered best).

In the case of a fixed result, the only way we can manage extensibility lies in how named patterns are defined, in much the same way that programmers' decisions about how to define classes in object-oriented environments have a lot of impact on the extensibility of their code. In this section, we will examine the major approaches to use when defining named patterns and start elements with extensibility in mind.

Let's have a look back at our first schema, the "Russian doll" schema:

 <?xml version="1.0" encoding="utf-8"  ?>
 <element xmlns="http://relaxng.org/ns/structure/1.0" name="library">
  <oneOrMore>
   <element name="book">
    <attribute name="id"/>
    <attribute name="available"/>
    <element name="isbn">
     <text/>
    </element>
    <element name="title">
     <attribute name="xml:lang"/>
     <text/>
    </element>
    <zeroOrMore>
     <element name="author">
      <attribute name="id"/>
      <element name="name">
       <text/>
      </element>
      <element name="born">
       <text/>
      </element>
      <optional>
       <element name="died">
        <text/>
       </element>
      </optional>
     </element>
    </zeroOrMore>
    <zeroOrMore>
     <element name="character">
      <attribute name="id"/>
      <element name="name">
       <text/>
      </element>
      <element name="born">
       <text/>
      </element>
      <element name="qualification">
       <text/>
      </element>
     </element>
    </zeroOrMore>
   </element>
  </oneOrMore>
 </element>

or, in the compact syntax:

 element library {
  element book {
   attribute id {text},
   attribute available {text},
   element isbn {text},
   element title {attribute xml:lang {text}, text},
   element author {
    attribute id {text},
    element name {text},
    element born {text},
    element died {text}?}*,
   element character {
    attribute id {text},
    element name {text},
    element born {text},
    element qualification {text}}*
  } +
 }

What happens if we want to derive a schema that has a new id attribute on the library element? That's simple: we have to take our schema, copy it, and edit it as a new one. There is no option for extensibility at all, since we cannot include a schema which doesn't have a grammar element as its root.

The first thing to consider when we want a RELAX NG schema to be extensible is that we always want the root element to be a grammar element. In this case, the change, producing russian-doll.rng, is minor:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <start>
     <element name="library">
       <oneOrMore>
         <element name="book">
           <attribute name="id"/>
           <attribute name="available"/>
           <element name="isbn">
             <text/>
           </element>
           <element name="title">
             <attribute name="xml:lang"/>
             <text/>
           </element>
           <zeroOrMore>
             <element name="author">
               <attribute name="id"/>
               <element name="name">
                 <text/>
               </element>
               <element name="born">
                 <text/>
               </element>
               <optional>
                 <element name="died">
                   <text/>
                 </element>
               </optional>
             </element>
           </zeroOrMore>
           <zeroOrMore>
             <element name="character">
               <attribute name="id"/>
               <element name="name">
                 <text/>
               </element>
               <element name="born">
                 <text/>
               </element>
               <element name="qualification">
                 <text/>
               </element>
             </element>
           </zeroOrMore>
         </element>
       </oneOrMore>
     </element>
   </start>
 </grammar>

In the compact syntax, grammar is implicit, but you still need to have a start pattern if you want to be able to redefine anything. The result of adding this, russian-doll.rnc, looks like:

 start =
    element library
    {
       element book
       {
          attribute id { text },
          attribute available { text },
          element isbn { text },
          element title { attribute xml:lang { text }, text },
          element author
          {
             attribute id { text },
             element name { text },
             element born { text },
             element died { text }?
          }*,
          element character
          {
             attribute id { text },
             element name { text },
             element born { text },
             element qualification { text }
          }*
       }+
    }

Once these minor changes have been made, the schema can at least be included into another schema and modified there.

Although the previous schemas can be redefined, this redefinition is ineffective: the granularity is so coarse that we can't redefine just the library element. The best we can do is the following, which isn't much of an improvement:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="russian-doll.rng">
     <start>
       <element name="library">
         <attribute name="id"/>
         <oneOrMore>
           <element name="book">
             <attribute name="id"/>
             <attribute name="available"/>
             <element name="isbn">
               <text/>
             </element>
             <element name="title">
               <attribute name="xml:lang"/>
               <text/>
             </element>
             <zeroOrMore>
               <element name="author">
                 <attribute name="id"/>
                 <element name="name">
                   <text/>
                 </element>
                 <element name="born">
                   <text/>
                 </element>
                 <optional>
                   <element name="died">
                     <text/>
                   </element>
                 </optional>
               </element>
             </zeroOrMore>
             <zeroOrMore>
               <element name="character">
                 <attribute name="id"/>
                 <element name="name">
                   <text/>
                 </element>
                 <element name="born">
                   <text/>
                 </element>
                 <element name="qualification">
                   <text/>
                 </element>
               </element>
             </zeroOrMore>
           </element>
         </oneOrMore>
       </element>
     </start>
   </include>
 </grammar>

or:

 include "russian-doll.rnc"
 {
 start =
       element library
       {
          attribute id { text },
          element book
          {
             attribute id { text },
             attribute available { text },
             element isbn { text },
             element title { attribute xml:lang { text }, text },
             element author
             {
                attribute id { text },
                element name { text },
                element born { text },
                element died { text }?
             }*,
             element character
             {
                attribute id { text },
                element name { text },
                element born { text },
                element qualification { text }
             }*
          }+
       }
 }

In other words, we still need to redefine the whole schema. We've made no gains in modularity since any changes in the original schema would not be propagated into our resulting schema. To fix this, we need to create finer-grained definitions. Creating finer granularity involves defining a named pattern for each element (as with the schema style imposed by DTDs). That approach leads to a schema similar to the flat schema seen in Chapter 5: Flattening our first schema, called flat.rng:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <start>
     <ref name="library-element"/>
   </start>
   <define name="library-element">
     <element name="library">
       <oneOrMore>
         <ref name="book-element"/>
       </oneOrMore>
     </element>
   </define>
   <define name="author-element">
     <element name="author">
       <attribute name="id"/>
       <ref name="name-element"/>
       <ref name="born-element"/>
       <optional>
         <ref name="died-element"/>
       </optional>
     </element>
   </define>
   <define name="book-element">
     <element name="book">
       <attribute name="id"/>
       <attribute name="available"/>
       <ref name="isbn-element"/>
       <ref name="title-element"/>
       <zeroOrMore>
         <ref name="author-element"/>
       </zeroOrMore>
       <zeroOrMore>
         <ref name="character-element"/>
       </zeroOrMore>
     </element>
   </define>
   <define name="born-element">
     <element name="born">
       <text/>
     </element>
   </define>
   <define name="character-element">
     <element name="character">
       <attribute name="id"/>
       <ref name="name-element"/>
       <ref name="born-element"/>
       <ref name="qualification-element"/>
     </element>
   </define>
   <define name="died-element">
     <element name="died">
       <text/>
     </element>
   </define>
   <define name="isbn-element">
     <element name="isbn">
       <text/>
     </element>
   </define>
   <define name="name-element">
     <element name="name">
       <text/>
     </element>
   </define>
   <define name="qualification-element">
     <element name="qualification">
       <text/>
     </element>
   </define>
   <define name="title-element">
     <element name="title">
       <attribute name="xml:lang"/>
       <text/>
     </element>
   </define>
 </grammar>

or, in the compact syntax, flat.rnc:

 start = library-element
        
 library-element = element library { book-element+ }
 author-element =
    element author
    {
       attribute id { text },
       name-element,
       born-element,
       died-element?
    }
        
 book-element =
    element book
    {
       attribute id { text },
       attribute available { text },
       isbn-element,
       title-element,
       author-element*,
       character-element*
    }
        
 born-element = element born { text }
 character-element =
    element character
    {
       attribute id { text },
       name-element,
       born-element,
       qualification-element
    }
        
 died-element = element died { text }
        
 isbn-element = element isbn { text }
        
 name-element = element name { text }
        
 qualification-element = element qualification { text }
        
 title-element = element title { attribute xml:lang { text }, text }

These new schemas are more verbose, but they're also much more extensible. To add our id attribute, we would only need to redefine the library element:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="flat.rng">
     <define name="library-element">
       <element name="library">
         <attribute name="id"/>
         <oneOrMore>
           <ref name="book-element"/>
         </oneOrMore>
       </element>
     </define>
   </include>
 </grammar>

or:

 include "flat.rnc"
 {
    library-element = element library { attribute id { text }, book-element+ }
 }

All changes made to the flat schemas - except to the library element - would now propagate through to the derived schemas.

Although the previous result is much more extensible, we still have to redefine the complete content of the library element to add our id attribute. We may have reduced the problem of redefinition we had with our Russian doll model, but we haven't eliminated it. If we change our main vocabulary and add a new attribute or element to the library element in "flat.rng", the modification will not be automatically taken into account in our schema. We'll need to edit it.

The modification isn't automatically transferred because the extensibility of a named pattern doesn't cross element boundaries. Since we have the boundary of the library element included within our library-element named pattern, the content of this element isn't extensible, as shown in Figure 1.

To avoid this difficulty, we could have split our named patterns according to the content of the elements rather than by the elements themselves. We would then have been able to add new content within the library element, as shown in Figure 2.

Generalizing this approach for all the definitions of all the elements would lead to a schema that looks like flat-content.rng:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <start>
     <element name="library">
       <ref name="library-content"/>
     </element>
   </start>
   <define name="library-content">
     <oneOrMore>
       <element name="book">
         <ref name="book-content"/>
       </element>
     </oneOrMore>
   </define>
   <define name="book-content">
     <attribute name="id"/>
     <attribute name="available"/>
     <element name="isbn">
       <ref name="isbn-content"/>
     </element>
     <element name="title">
       <ref name="title-content"/>
     </element>
     <zeroOrMore>
       <element name="author">
         <ref name="author-content"/>
       </element>
     </zeroOrMore>
     <zeroOrMore>
       <element name="character">
         <ref name="character-content"/>
       </element>
     </zeroOrMore>
   </define>
   <define name="author-content">
     <attribute name="id"/>
     <element name="name">
       <ref name="name-content"/>
     </element>
     <element name="born">
       <ref name="born-content"/>
     </element>
     <optional>
       <element name="died">
         <ref name="died-content"/>
       </element>
     </optional>
   </define>
   <define name="born-content">
       <text/>
   </define>
   <define name="character-content">
     <attribute name="id"/>
     <element name="name">
       <ref name="name-content"/>
     </element>
     <element name="born">
       <ref name="born-content"/>
     </element>
     <element name="qualification">
       <ref name="qualification-content"/>
     </element>
   </define>
   <define name="died-content">
     <text/>
   </define>
   <define name="isbn-content">
     <text/>
   </define>
   <define name="name-content">
     <text/>
   </define>
   <define name="qualification-content">
     <text/>
   </define>
   <define name="title-content">
     <attribute name="xml:lang"/>
     <text/>
   </define>
 </grammar>

or, in the compact syntax, flat-content.rnc:

 start = element library { library-content }

 library-content = element book { book-content }+

 book-content =
    attribute id { text },
    attribute available { text },
    element isbn { isbn-content },
    element title { title-content },
    element author { author-content }*,
    element character { character-content }*

 author-content =
    attribute id { text },
    element name { name-content },
    element born { born-content },
    element died { died-content }?

 born-content = text

 character-content =
    attribute id { text },
    element name { name-content },
    element born { born-content },
    element qualification { qualification-content }

 died-content = text

 isbn-content = text

 name-content = text

 qualification-content = text

 title-content = attribute xml:lang { text }, text

We can now take full advantage of the named pattern and, instead of redefining it, we can combine it neatly with the id attribute:

 <?xml version="1.0" encoding="utf-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="flat-content.rng"/>
   <define name="library-content" combine="interleave">
     <attribute name="id"/>
   </define>
 </grammar>

or:

 include "flat-content.rnc"

 library-content &= attribute id { text }

Because of the nature of the content, the extension could be done using a combination by interleave. This method of combination is frequently useful when attributes or elements need to be added, but it only works when the relative order isn't significant for the schema. Otherwise, we would still have needed to redefine the pattern or to combine it by choice.
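
As a minimal sketch of combination by choice, a derived schema could let the content of born be either plain text or a structured child. The year element is a hypothetical addition used only for illustration; it is not part of the library vocabulary:

 include "flat-content.rnc"

 # Hypothetical extension: born-content becomes "text or a year element".
 # The included definition and this one are combined as alternatives.
 born-content |= element year { text }

Since born-content is referenced from both author-content and character-content, the alternative would apply to the born element of authors and characters alike.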

When we are free to define the vocabulary, there are three principal guidelines for designing extensible formats. The first one is independent of any schema language. The second is specific to RELAX NG and maximizes the usage of combination through interleave. The third is a way to minimize the impact of interleave on schemas which need to be converted into W3C XML Schema or DTD schemas.

Attributes are generally difficult to extend. When choosing among elements and attributes, people often base their choice on the relative ease of their being processed, styled, or transformed. Instead, they should probably focus on their extensibility.

Independently of any XML schema language, when you have an attribute in an instance document, you are pretty much stuck with it. Unless you replace it with an element, there is no way to extend it. You can't add any child elements or attributes to it since it's designed to be a leaf node and to remain a leaf node. Furthermore, you can't extend the parent element to include a second instance of an attribute with the same name. (Attributes with duplicate names are forbidden by XML 1.0.) You are thus making an impact not only on the extensibility of the attribute, but also on the extensibility of the parent element.

To understand the reasons behind these limitations, it's worth looking back at the original use cases for attributes. Attributes were originally designed to hold metadata, information about the contents of the document. Elements themselves are a kind of metadata, labelling the content found in the document, and attributes are a mechanism for refining that metadata. (Data about metadata is still metadata.) Because of this, the editors of XML 1.0 decided that the lack of extensibility in XML attributes was not an issue.

Although most XML tools provide equal access to element and attribute content and do not require that attributes contain exclusively metadata, the syntactic restrictions created by considering attributes to be metadata remain. Therefore, it's wise to use attributes for what they were designed for: metadata. My advice is to use attributes only when there is a good reason to do so: when the information is clearly metadata and we have good reason to believe that it will not have to be extended.

In our example library, identifiers are good candidates for being attributes, but even available probably should have been specified as an element. Although at first glance available may be considered metadata (available does not directly affect the description of the content of a book), other users looking at the book element may want this information item to be capable of storing more details. They may even want to give it more structure to extend it to indicate if the book is available as a new or as a used item, for example.
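
To illustrate, here is a hedged fragment in the compact syntax showing what available could look like as an element. The pattern names available-element and available-content and the condition attribute are assumptions made for this sketch, not part of the schemas shown earlier:

 # Hypothetical variant: available modeled as an element with its own content pattern
 available-element = element available { available-content }
 available-content = text

 # A later extension (normally placed in a derived schema) can add detail
 # by combination, without redefining the original pattern
 available-content &= attribute condition { "new" | "used" }?

Had available remained an attribute, there would be no content pattern to combine with and no place to attach the extra information.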

There are times when these rules about metadata and attributes must be relaxed. We saw in the previous chapter, Chapter 11: Namespaces, that it wasn't a good idea to add foreign elements into a text-only element. Doing so would transform its content model from text to mixed content. It's always risky to extend a text-only element by adding elements, while additional attributes will usually pass unnoticed by existing applications. In this case, the lack of further extensibility may be compensated for by the short term gain in backward compatibility between the vocabularies before and after the extension.
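
The trade-off can be sketched against flat-content.rnc. The subtitle element and the type attribute below are hypothetical names chosen for the example:

 include "flat-content.rnc"

 # Risky: title would change from a text-only element to mixed content
 # title-content &= element subtitle { text }?

 # Usually harmless for existing applications: an additional optional attribute
 title-content &= attribute type { text }?

Only the second combination is active here; the first is left as a comment to show the kind of extension the paragraph above warns against.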

XML users often confuse the usage of elements and attributes. A bad habit many of us have taken up during our few years of XML experience is the assumption that schemas should always enforce a fixed order among child elements; in other words, that the relative order between sub-elements always matters.

Relative order is much less natural than we usually think, at least at the schema level. To draw a parallel with another technology, it's considered poor practice to pay attention to the physical order of columns and rows in a table of a relational database. Furthermore, UML, the dominant modeling language, does not attach any order to the attributes of classes and does not attach any order to relations between classes (unless explicitly specified). UML attributes are often used to represent not only XML attributes but also elements.

The main reasons people expect that order is required derive from limitations in DTDs and, more recently, in W3C XML Schema. Still, there are strong reasons to believe that when there is no special reason, relative order between sub-elements is something that should be left to the choice of those creating document instances, and we shouldn't bother users and applications with enforcing an unnecessary constraint at the schema level.

In RELAX NG, defining content models where the relative order of child elements is not significant is almost as simple as defining content models where it is significant. It's just a matter of adding interleave elements. When the relative order is not significant, the definition is more extensible since these content models can easily be extended through pattern combinations using interleave.

Using content models where the relative order of child elements isn't significant makes it easier to add new elements and attributes if necessary. We demonstrated this in our example about the addition of the id attribute in the library element in the first section of this chapter.
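
As a sketch of what this looks like for another pattern of flat-content.rnc, author-content could be rewritten with interleave and then extended from a derived schema. The nationality element is a hypothetical addition used only to show the combination:

 # author-content rewritten so that the order of its children is not significant
 author-content =
    attribute id { text }
  & element name { name-content }
  & element born { born-content }
  & element died { died-content }?

 # A derived schema could then add an optional element that may appear
 # anywhere among the existing children
 author-content &= element nationality { text }?

The two definitions are shown together for brevity; in practice the second line would follow an include of the base schema.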

Note that, together with the "element or attribute" question, the issue of order significance is among the most controversial for XML experts. Technical constraints may, in some cases, justify enforcing element order in documents. These constraints come into play most notably during stream processing of huge documents: requiring information to appear in a specific order may allow us to skip processing long content which would otherwise need to be buffered if this information came after the content. Other arguments for enforcing the order of elements, which I find far from obvious, include the assertion that there is "disorder" carried by documents where element order is not enforced, that it's much easier to read documents when you know where to find each element, and finally the concern that if the order isn't enforced, human users will be disoriented, confused, and find themselves in an insoluble quandary when it comes to choosing an order.

While the interleave pattern works just fine most of the time, you'll need to keep in mind the restriction about the interleave pattern already mentioned in Chapter 6: More Patterns. There can be only one text pattern in each interleave pattern. This restriction affects mixed content models found mainly in document oriented applications and may sometimes require schemas to specify the order when mixing textual content and elements.
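
A short sketch of the restriction in the compact syntax follows. The names para-content and em are hypothetical and stand for a typical document-oriented content model:

 # Not allowed: text occurs in two branches of the same interleave
 # para-content = text & element em { text }* & text

 # Allowed: a single text pattern; mixed { p } is shorthand for text & p
 para-content = mixed { element em { text }* }

The text pattern inside the em element does not count against the restriction, because it belongs to the content of a child element rather than to the interleave itself.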

Generalizing content models in which the relative order of child elements isn't significant may lead you to difficulties when you need to work with other schema languages, notably DTD and W3C XML Schema. This can be a problem if you are using RELAX NG as your main schema language and want to maintain the possibility of converting your RELAX NG schemas to DTDs or W3C XML Schemas for the same vocabulary.

A way to avoid these potential issues surrounding the relative order of elements is to add elements which act as containers. These containers can make it easier to specify that elements include either a text node, several elements which are not repeated, or repeated elements with the same name.

Among the elements of our library, the book element is the only one which would be problematic for other schema languages if we decided to switch its content model to interleave. The book-content pattern would become:

  <define name="book-content">
    <interleave>
      <attribute name="id"/>
      <attribute name="available"/>
      <element name="isbn">
        <ref name="isbn-content"/>
      </element>
      <element name="title">
        <ref name="title-content"/>
      </element>
      <zeroOrMore>
        <element name="author">
          <ref name="author-content"/>
        </element>
      </zeroOrMore>
      <zeroOrMore>
        <element name="character">
          <ref name="character-content"/>
        </element>
      </zeroOrMore>
    </interleave>
  </define>

or, in the compact syntax:

 book-content =
    attribute id { text }
  & attribute available { text }
  & element isbn { isbn-content }
  & element title { title-content }
  & element author { author-content }*
  & element character { character-content }*

This would allow instance documents where author and character elements are mixed up with the other elements, such as that shown in Figure 3:
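
The figure is not reproduced here. As a stand-in, the following is a minimal sketch of such an instance, assuming only book-content has been switched to interleave; all identifiers and text values are placeholders:

 <?xml version="1.0" encoding="utf-8"?>
 <library>
   <book id="b1" available="true">
     <character id="c1">
       <name>First character</name>
       <born>1950</born>
       <qualification>placeholder</qualification>
     </character>
     <title xml:lang="en">Placeholder title</title>
     <author id="a1">
       <name>First author</name>
       <born>1920</born>
     </author>
     <!-- a second character, separated from the first by title and author -->
     <character id="c2">
       <name>Second character</name>
       <born>1960</born>
       <qualification>placeholder</qualification>
     </character>
     <isbn>0000000000</isbn>
   </book>
 </library>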

W3C XML Schema cannot support this. In order to define a schema which could more easily be translated into a W3C XML Schema, we can add containers to isolate the author and character elements from the elements which cannot be repeated. The content of the book-content pattern would thus become:

  <define name="book-content">
    <interleave>
      <attribute name="id"/>
      <attribute name="available"/>
      <element name="isbn">
        <ref name="isbn-content"/>
      </element>
      <element name="title">
        <ref name="title-content"/>
      </element>
      <element name="authors">
        <zeroOrMore>
          <element name="author">
            <ref name="author-content"/>
          </element>
        </zeroOrMore>
      </element>
      <element name="characters">
        <zeroOrMore>
          <element name="character">
            <ref name="character-content"/>
          </element>
        </zeroOrMore>
      </element>
    </interleave>
  </define>

or:

 book-content =
    attribute id { text }
  & attribute available { text }
  & element isbn { isbn-content }
  & element title { title-content }
  & element authors { element author { author-content }* }
  & element characters { element character { character-content }* }

and it would validate elements such as those shown in Figure 4:
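
The figure is not reproduced here. A minimal sketch of a book element that this pattern would accept might look like the following; identifiers and text values are placeholders:

 <book id="b1" available="true">
   <characters>
     <character id="c1">
       <name>First character</name>
       <born>1950</born>
       <qualification>placeholder</qualification>
     </character>
     <character id="c2">
       <name>Second character</name>
       <born>1960</born>
       <qualification>placeholder</qualification>
     </character>
   </characters>
   <title xml:lang="en">Placeholder title</title>
   <!-- the containers may appear in any position, but their children stay together -->
   <authors>
     <author id="a1">
       <name>First author</name>
       <born>1920</born>
     </author>
   </authors>
   <isbn>0000000000</isbn>
 </book>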

The relative order between the isbn, title, authors and characters elements is still not significant, but the author and character elements are now grouped together under containers and cannot be interleaved with the other elements. That's enough to make this schema much friendlier to schema languages with less expressive power than RELAX NG.

Note that even if these containers are not necessary for RELAX NG, many XML experts consider them good practice. The containers facilitate access to the author and character elements. The downside is that an additional level of hierarchy is added and XPath expressions which identify the contained elements become more verbose: instead of writing "/library/book/character" to access the character elements, we will have to write "/library/book/characters/character". This can get long.


All text is copyright Eric van der Vlist, Dyomedea. During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation GFDL.