by Eric van der Vlist is published by O'Reilly & Associates (ISBN: 0596004214)


Merging Grammars

In the preceding sections, you have seen how an external grammar can be used as a single pattern. This is useful in cases in which you want to include a content model described by an external schema at a single point, not unlike when you mount a Unix filesystem. The description contained in the external grammar is mounted at the point where you make your reference.

The main drawback to this approach is that you can't individually reuse the definitions contained in the external schema. To do so, you need a new pattern, with a different meaning, which will let you control how two grammars are merged into a single one.

Merging Without Redefinition

In the simplest case, you will want to reuse patterns defined in common libraries of patterns without modifying them. Let's say we have defined a grammar with some common patterns, common.rng, which can be reused in many different schemas, such as:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
 
   <define name="element-name">
     <element name="name">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
   
   <define name="element-born">
     <element name="born">
       <data type="date"/>
     </element>
   </define>
   
   <define name="attribute-id">
     <attribute name="id">
       <data type="ID"/>
     </attribute>
   </define>
   
   <define name="content-person">
     <ref name="attribute-id"/>
     <ref name="element-name"/>
     <optional>
       <ref name="element-born"/>
     </optional>
   </define>
   
 </grammar>

or common.rnc, in the compact syntax:

 element-name = element name { token }
 element-born = element born { xsd:date }
 attribute-id = attribute id { xsd:ID }
 content-person = attribute-id, element-name, element-born?

These schemas are obviously not meant to be used as standalone schemas: they have no start patterns and would be invalid. However, they contain definitions that can be used to write the schema of our library. To employ these definitions, use include patterns and provide a supporting framework. In the XML syntax, this looks like:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
 
   <include href="common.rng"/>

   <start>
     <element name="library">
       <oneOrMore>
         <element name="book">
           <ref name="attribute-id"/>
           <attribute name="available">
             <data type="boolean"/>
           </attribute>
           <element name="isbn">
             <data type="token" datatypeLibrary=""/>
           </element>
           <element name="title">
             <attribute name="xml:lang">
               <data type="language"/>
             </attribute>
             <data type="token" datatypeLibrary=""/>
           </element>
           <oneOrMore>
             <element name="author">
               <ref name="content-person"/>
               <optional>
                 <ref name="element-died"/>
               </optional>
             </element>
           </oneOrMore>
           <zeroOrMore>
             <element name="character">
               <ref name="content-person"/>
               <ref name="element-qualification"/>
             </element>
           </zeroOrMore>
         </element>
       </oneOrMore>
     </element>
   </start>
      
   <define name="element-died">
     <element name="died">
       <data type="date"/>
     </element>
   </define>
   
   <define name="element-qualification">
     <element name="qualification">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
   
 </grammar>

The include pattern is translated to an include keyword in the compact syntax:

 include "common.rnc"
 start =
   element library {
     element book {
       attribute-id,
       attribute available { xsd:boolean },
       element isbn { token },
       element title {
         attribute xml:lang { xsd:language },
         token
       },
       element author {
         content-person,
         element-died?
       }+,
       element character {
         content-person,
         element-qualification
       }*
     }+
   }
 element-died = element died { xsd:date }
 element-qualification = element qualification { token }

Note that the name of the include pattern is slightly misleading. The include pattern here doesn't include the external grammar directly. (You have seen that this was the job of the externalRef pattern.) Instead, it includes the content of the external grammar, performing a merge of both grammars. This is exactly what you need; it allows you to make references to the named patterns defined in the common.rng grammar.

The result of this inclusion is thus equivalent to the following monolithic schema:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
 <!-- Content of the included grammar -->
   <define name="element-name">
     <element name="name">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
   <define name="element-born">
     <element name="born">
       <data type="date"/>
     </element>
   </define>
   <define name="attribute-id">
     <attribute name="id">
       <data type="ID"/>
     </attribute>
   </define>
   <define name="content-person">
     <ref name="attribute-id"/>
     <ref name="element-name"/>
     <optional>
       <ref name="element-born"/>
     </optional>
   </define>
 <!-- End of the included grammar -->
   <start>
     <element name="library">
       <oneOrMore>
         <element name="book">
           <ref name="attribute-id"/>
           <attribute name="available">
             <data type="boolean"/>
           </attribute>
           <element name="isbn">
             <data type="token" datatypeLibrary=""/>
           </element>
           <element name="title">
             <attribute name="xml:lang">
               <data type="language"/>
             </attribute>
             <data type="token" datatypeLibrary=""/>
           </element>
           <oneOrMore>
             <element name="author">
               <ref name="content-person"/>
               <optional>
                 <ref name="element-died"/>
               </optional>
             </element>
           </oneOrMore>
           <zeroOrMore>
             <element name="character">
               <ref name="content-person"/>
               <ref name="element-qualification"/>
             </element>
           </zeroOrMore>
         </element>
       </oneOrMore>
     </element>
   </start>
   <define name="element-died">
     <element name="died">
       <data type="date"/>
     </element>
   </define>
   <define name="element-qualification">
     <element name="qualification">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
 </grammar>

or, in the compact syntax:

 element-name = element name { token }
 element-born = element born { xsd:date }
 attribute-id = attribute id { xsd:ID }
 content-person = attribute-id, element-name, element-born?
      
 start =
   element library {
     element book {
       attribute-id,
       attribute available { xsd:boolean },
       element isbn { token },
       element title {
         attribute xml:lang { xsd:language },
         token
       },
       element author {
         content-person,
         element-died?
       }+,
       element character {
         content-person,
         element-qualification
       }*
     }+
   }
 element-died = element died { xsd:date }
 element-qualification = element qualification { token }
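
The merge can also be pictured outside RELAX NG. The following Python sketch is only a toy model, not a RELAX NG processor: it assumes a grammar can be represented as a dictionary mapping definition names to compact-syntax text, and the function name merge_grammars is made up for the illustration. It shows the including grammar ending up with the union of both sets of definitions, and it rejects a name defined on both sides, since that situation calls for a redefinition or a combination.

```python
# Toy model: a grammar is a dict mapping definition names to compact-syntax text.
# This only illustrates the merge; it does not parse or validate RELAX NG.

def merge_grammars(including, included):
    """Return the union of two grammars, refusing duplicate names."""
    clashes = sorted(set(including) & set(included))
    if clashes:
        # The same name on both sides needs a redefinition or a combine method.
        raise ValueError("duplicate definitions: " + ", ".join(clashes))
    merged = dict(included)
    merged.update(including)
    return merged

# Definitions from common.rnc
common = {
    "element-name": "element name { token }",
    "element-born": "element born { xsd:date }",
    "attribute-id": "attribute id { xsd:ID }",
    "content-person": "attribute-id, element-name, element-born?",
}

# Definitions from the including schema (start body abbreviated for this sketch)
library = {
    "start": "element library { (abbreviated) }",
    "element-died": "element died { xsd:date }",
    "element-qualification": "element qualification { token }",
}

merged = merge_grammars(library, common)
for name in sorted(merged):
    print(name)
```

After the merge, references such as content-person resolve because the definition now sits in the same dictionary as the start pattern that uses it.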

Merging and Replacing Definitions

In the previous example, we were lucky. The definitions of the common patterns included matched exactly what we needed. In the real world, this isn't always the case. It is quite handy to be able to replace definitions found in the grammar that we're including when they might conflict with other aspects of our schema design.

Let's say that we have already written this very flat version of our schema, called library.rng:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0"
          datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
      
   <start>
     <ref name="element-library"/>
   </start>
   
   <define name="element-library">
     <element name="library">
       <zeroOrMore>
         <ref name="element-book"/>
       </zeroOrMore>
     </element>
   </define>
   
   <define name="element-book">
     <element name="book">
       <ref name="attribute-id"/>
       <ref name="attribute-available"/>
       <ref name="element-isbn"/>
       <ref name="element-title"/>
       <oneOrMore>
         <ref name="element-author"/>
       </oneOrMore>
       <zeroOrMore>
         <ref name="element-character"/>
       </zeroOrMore>
     </element>
   </define>
   
   <define name="element-author">
     <element name="author">
       <ref name="content-person"/>
       <optional>
         <ref name="element-died"/>
       </optional>
     </element>
   </define>
   
   <define name="element-character">
     <element name="character">
       <ref name="content-person"/>
       <ref name="element-qualification"/>
     </element>
   </define>

   <define name="element-isbn">
     <element name="isbn">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>

   <define name="element-title">
     <element name="title">
       <ref name="attribute-xml-lang"/>
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>

   <define name="attribute-xml-lang">
     <attribute name="xml:lang">
       <data type="language"/>
     </attribute>
   </define>

   <define name="attribute-available">
     <attribute name="available">
       <data type="boolean"/>
     </attribute>
   </define>
      
   <define name="element-name">
     <element name="name">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
      
   <define name="element-born">
     <element name="born">
       <data type="date"/>
     </element>
   </define>
      
   <define name="element-died">
     <element name="died">
       <data type="date"/>
     </element>
   </define>
      
   <define name="attribute-id">
     <attribute name="id">
       <data type="ID"/>
     </attribute>
   </define>
      
   <define name="content-person">
     <ref name="attribute-id"/>
     <ref name="element-name"/>
     <optional>
       <ref name="element-born"/>
     </optional>
   </define>

   <define name="element-qualification">
     <element name="qualification">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
   
 </grammar>

or, in the compact syntax, library.rnc:

 start = element-library
 element-library = element library { element-book* }
 element-book =
   element book {
     attribute-id,
     attribute-available,
     element-isbn,
     element-title,
     element-author+,
     element-character*
   }
 element-author = element author { content-person, element-died? }
 element-character =
   element character { content-person, element-qualification }
 element-isbn = element isbn { token }
 element-title = element title { attribute-xml-lang, token }
 attribute-xml-lang = attribute xml:lang { xsd:language }
 attribute-available = attribute available { xsd:boolean }
 element-name = element name { token }
 element-born = element born { xsd:date }
 element-died = element died { xsd:date }
 attribute-id = attribute id { xsd:ID }
 content-person = attribute-id, element-name, element-born?
 element-qualification = element qualification { token }

This might be a good schema to use in production to validate incoming documents from a variety of sources, so you wouldn't want to modify it. However, you might have a new application that works not at the level of a library but only at the level of a book. This application needs to validate instance documents with book root elements. Of course, you wouldn't want to copy and paste the definitions of the existing schema into another one, because that would mean maintaining two different versions with similar content.

This is a case in which you would want to redefine the start pattern of our schema. To do so, use an include pattern, embedding the definitions that must be substituted for the ones from the included grammar in the include pattern itself:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="library.rng">
     <start>
       <ref name="element-book"/>
     </start>
   </include>
 </grammar>

or:

 include "library.rnc" {
   start = element-book
 }

Note how the new definitions are embedded directly in the include pattern; the content of the include pattern is where all the redefinitions must be written. This short schema includes all the definitions from library.rng and redefines the start pattern. It validates instance documents with a book root element. Since we are performing an inclusion instead of a copy, we will inherit any modifications made to library.rng.

We have been able to redefine the start pattern, but each named pattern can also be redefined using the same syntax. Let's say for instance that I am not happy with the definition of the element-name pattern and want to check that the name is shorter than 80 characters. If I don't want to (or can't) modify the original schema, I can include it and redefine this pattern:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
          datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
   <include href="library.rng">
     <define name="element-name">
       <element name="name">
         <data type="token">
           <param name="maxLength">80</param>
         </data>
       </element>
     </define>
   </include>
 </grammar>

or:

 include "library.rnc" {
   element-name = element name { xsd:token{maxLength = "80"} }
 }

Here again, the grammar of library.rnc is merged with the grammar of the new schema (which happens to be empty), but before the merge, the definitions embedded in the include pattern are substituted for the original definitions.
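
The order of operations (substitute first, merge second) can be sketched in Python. This is a toy model under stated assumptions: grammars are dictionaries of compact-syntax text, the function name include and its parameters are invented for the illustration, and the library dictionary is an abbreviated subset of library.rnc. The check that a redefinition must target a name that exists in the included grammar reflects a rule of RELAX NG.

```python
def include(included, overrides=None, local=None):
    """Toy model of include: substitute redefinitions, then merge."""
    overrides = overrides or {}
    local = local or {}
    missing = sorted(set(overrides) - set(included))
    if missing:
        # A redefinition must replace a definition of the included grammar.
        raise ValueError("nothing to redefine: " + ", ".join(missing))
    result = dict(included)
    result.update(overrides)      # step 1: substitution in the included grammar
    clashes = sorted(set(result) & set(local))
    if clashes:
        raise ValueError("duplicate definitions: " + ", ".join(clashes))
    result.update(local)          # step 2: merge with the including grammar
    return result

# Abbreviated subset of library.rnc (other definitions omitted in this sketch)
library = {
    "start": "element-library",
    "element-library": "element library { element-book* }",
    "element-name": "element name { token }",
}

book_only = include(library, overrides={"start": "element-book"})
short_names = include(
    library,
    overrides={"element-name": 'element name { xsd:token { maxLength = "80" } }'},
)

print(book_only["start"])           # element-book
print(short_names["element-name"])
print(library["element-name"])      # the included grammar itself is untouched
```

Because the included dictionary is copied rather than modified, the original schema stays as it was, which mirrors the point of including library.rng instead of editing it.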

The new definition can be as different from the original one as I want. While it might not always be good practice, I can, for instance, redefine attribute-available and replace the attribute by an element:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
   <include href="library.rng">
     <define name="attribute-available">
       <element name="available">
         <data type="boolean"/>
       </element>
     </define>
   </include>
 </grammar>

or:

 include "library.rnc" {
   attribute-available = element available { xsd:boolean }
 }

This seems rather confusing (the named pattern is called attribute-available, and it's now describing an element), but the schema is perfectly valid and describes instance documents in which the available attribute is replaced by an available element. The same approach can also remove this attribute:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="library.rng">
     <define name="attribute-available">
       <empty/>
     </define>
   </include>
 </grammar>

or:

 include "library.rnc" {
   attribute-available = empty
 }

Note how this uses a new pattern named empty. This pattern matches nothing other than whitespace-only text, and using it here has the same effect as if every reference to the named pattern had been removed from the schema.

The include patterns have the effect of merging the content of their grammar, after replacement of the redefined patterns, with the content of the current grammar. This means that these redefinitions can make references to any definition from either the including or the included grammars. If you want, for instance, to add zero or more email addresses to the author element while retaining a flat structure, write:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
          datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
 
   <include href="library.rng">
   
     <define name="element-author">
       <element name="author">
         <ref name="content-person"/>
         <optional>
           <ref name="element-died"/>
         </optional>
         <zeroOrMore>
           <ref name="element-email"/>
         </zeroOrMore>
       </element>
     </define>
     
   </include>
   
   <define name="element-email">
     <element name="email">
       <data type="anyURI">
         <param name="pattern">mailto:.*</param>
       </data>
     </element>
   </define>
 </grammar>

or:

 include "library.rnc" {
   element-author =
     element author { content-person, element-died?, element-email* }
 }
 element-email =
   element email {
     xsd:anyURI { pattern = "mailto:.*" }
   }

Here the redefinition of the element-author pattern is making three references to three named patterns. content-person and element-died are defined in library.rng—i.e., the grammar that is included. The third, element-email, is defined in the top-level grammar—i.e., the including grammar.
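
To see why the merge matters for references, here is a small Python sketch. It is a simplification: each definition is reduced to the set of names it references, only a subset of library.rng is modeled, and the helper name unresolved is hypothetical. It reports which references have no matching definition before and after the definitions of the including grammar are merged in.

```python
def unresolved(grammar):
    """Return the referenced names that have no definition in the grammar."""
    referenced = set()
    for refs in grammar.values():
        referenced.update(refs)
    return sorted(referenced - set(grammar))

# Subset of library.rng: each name maps to the names its definition references.
included = {
    "element-author": {"content-person", "element-died"},
    "content-person": {"attribute-id", "element-name", "element-born"},
    "element-died": set(),
    "attribute-id": set(),
    "element-name": set(),
    "element-born": set(),
}

# The redefinition embedded in the include pattern adds a third reference.
redefined = dict(included)
redefined["element-author"] = {"content-person", "element-died", "element-email"}

# The including grammar contributes the definition of element-email.
merged = dict(redefined)
merged["element-email"] = set()

print(unresolved(redefined))  # ['element-email']
print(unresolved(merged))     # []
```

Taken alone, the redefined grammar has a dangling reference; it is only once both grammars are merged that every reference finds its definition.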

Combining Definitions

When I replaced the definitions in the previous examples, the original definition was completely replaced by the new one. This can make the maintenance of these schemas more complicated than it should be. In the last example, if the included schema (library.rng) were updated and the definition of element-author changed to add a new element for a telephone number, this addition would be lost if I didn't add it explicitly in the including schema. As far as the element-author pattern is concerned, this redefinition is no better than a copy and paste. A mechanism more similar to inheritance would help with this.

To keep the definition from the included grammar, combine a new definition with the existing one instead of replacing it. Unlike redefinition, the combination of start and named patterns doesn't take place in the include pattern itself but rather is done at the level of the including grammar. It isn't even necessary to include a grammar to combine definitions, but the main interest of combining definitions is to combine new definitions with existing ones from included grammars.

There are two options for combining definitions: choice and interleave.

Combining by choice

When definitions are combined by choice, the result is similar to using a choice pattern between the content of the definitions. A use case for this combination would be to define a schema accepting either a library or a book element from the schema used in the previous section. In the XML syntax, combining by choice is done through a combine attribute:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <include href="library.rng"/>
  <start combine="choice">
    <ref name="element-book"/>
  </start>
</grammar>

In the compact syntax, combining by choice uses the |= operator (instead of =) in the definition:

include "library.rnc"
start |= element-book

Note that in both cases, the combination is done outside the inclusion. Its effect is to add a choice between the contents of the two start patterns. The definition becomes equivalent to:

<start>
  <choice>
    <ref name="element-library"/>
    <ref name="element-book"/>
  </choice>
</start>

or:

start = element-library | element-book

The logic behind this combination is to allow the content model corresponding to the original pattern while also allowing different content to appear. This is different from the logic behind pattern redefinitions, in which the original pattern is replaced by a new one.

Named patterns can also be combined. If you want to accept either an available attribute or element, you can write:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" 
          datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  
  <include href="library.rng"/>
   
  <define name="attribute-available" combine="choice">
    <element name="available">
      <data type="boolean"/>
    </element>
  </define>
   
</grammar>

or:

include "library.rnc"
attribute-available |= element available { xsd:boolean }

Another interesting and common case involves making this attribute optional, by combining this pattern by choice with an empty pattern:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
 
   <include href="library.rng"/>
   
   <define name="attribute-available" combine="choice">
     <empty/>
   </define>
   
 </grammar>

or:

include "library.rnc"
attribute-available |= empty

Adding a choice between a defined component and nothingness may seem like a roundabout way to make the component optional, but it works with a minimum need to modify included schemas.
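
The three combinations by choice shown in this section can be modeled in a few lines of Python. This is a deliberately small sketch: each pattern is represented by the finite set of content sequences it accepts, attributes are written with a leading @ as a private convention of the example, and datatypes are ignored. Under those assumptions, combining by choice is a set union, and a choice with empty is the same thing as making the pattern optional.

```python
EMPTY = {()}  # the empty pattern accepts exactly one thing: no content at all

def choice(a, b):
    """Combining by choice accepts whatever either definition accepts."""
    return a | b

def optional(a):
    """optional(p) is defined as a choice between p and empty."""
    return choice(a, EMPTY)

# start |= element-book
start = choice({("library",)}, {("book",)})

# attribute-available |= element available { xsd:boolean }
attribute_or_element = choice({("@available",)}, {("available",)})

# attribute-available |= empty
maybe_available = choice({("@available",)}, EMPTY)

print(sorted(start))                                   # [('book',), ('library',)]
print(maybe_available == optional({("@available",)}))  # True
```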

Combining by interleave

You have seen how an "old" pattern can be replaced by a new one using pattern redefinition and also how to specify a choice between an old definition and a new one using a combination by choice. The last option is to combine by interleave. The logic here is to allow pieces to be added to the original content model and to let these pieces be interleaved—i.e., added anywhere before, after, and between the subpatterns of the original pattern.

Earlier, I added an email element to the content of the author element using a redefinition. You can also use a combination by interleave to add this email pattern to the content-person pattern:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" 
   datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  
  <include href="library.rng"/>
  
  <define name="content-person" combine="interleave">
    <zeroOrMore>
      <ref name="element-email"/>
    </zeroOrMore>
  </define>
  
  <define name="element-email">
    <element name="email">
      <data type="anyURI">
        <param name="pattern">mailto:.*</param>
      </data>
    </element>
  </define>

</grammar>

or, in the compact syntax:

include "library.rnc"
content-person &= element-email *
element-email =
  element email {
    xsd:anyURI { pattern = "mailto:.*" }
  }

The effect of this combination by interleave is that the content-person pattern is now equivalent to an interleave pattern embedding both the original and the new definition:

<define name="content-person">
  <interleave>
    <group>
      <ref name="attribute-id"/>
      <ref name="element-name"/>
      <optional>
        <ref name="element-born"/>
      </optional>
    </group>
    <zeroOrMore>
      <ref name="element-email"/>
    </zeroOrMore>
  </interleave>
</define>

or:

content-person =
  (attribute-id, element-name, element-born?) & element-email *

This definition allows any number of email elements before the name element, between the name element and the born element, and after the born element.
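
A short Python sketch makes this behavior concrete. It rests on a few assumptions: only child elements are modeled (the id attribute is left out, since attributes are unordered anyway), element content and datatypes are ignored, and both function names are invented for the example. Because the original group never contains an email element, a sequence matches the interleaved pattern exactly when removing every email element leaves a sequence that matches the original group.

```python
def matches_person_children(children):
    """Match child element names against the group: name, born?"""
    return children in (["name"], ["name", "born"])

def matches_extended(children):
    """Match child element names against: (name, born?) & email*"""
    # The email elements may sit anywhere, so set them aside and
    # check what remains against the original, ordered group.
    rest = [child for child in children if child != "email"]
    return matches_person_children(rest)

print(matches_extended(["email", "name", "email", "born", "email"]))  # True
print(matches_extended(["name"]))                                     # True
print(matches_extended(["born", "name"]))                             # False
print(matches_extended(["email"]))                                    # False
```

The third call fails because interleave relaxes only the position of the added email elements; name and born must still appear in their original order.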

The logic here is to allow extension by adding new content anywhere in the original definition. This is neat and safe if the applications that read the documents are coded to ignore what they don't know. In our example, if I design an application to read the original content model, this application will be just fine with the new content model if it ignores the email elements that have been added.

You've seen how a combination by choice can make a pattern optional. Combination by interleave can't reverse the process, but it can make a pattern forbidden. To avoid ending up with a schema that won't validate any instance document, do this only with a pattern whose references are all optional, such as the element-died pattern:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <include href="library.rng"/>
  <define name="element-died" combine="interleave">
    <notAllowed/>
  </define>
</grammar>

or:

include "library.rnc"
element-died &= notAllowed

Here, we interleave a new pattern, notAllowed, with the content of the named pattern element-died. The effect of this operation is that this pattern will no longer match any content. This is OK because the reference to element-died in the definition of the author element is optional. As a result, a document can be valid per the resulting schema only if it contains no died element.
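
The way notAllowed propagates can be sketched in Python. This is a toy simplifier, loosely inspired by the kind of rewriting a RELAX NG processor performs but not a faithful implementation of it: patterns are nested tuples, the function name simplify is made up, and references are replaced by hand. The second case, in which element-qualification is forbidden, is hypothetical and is there only to show what happens when the reference is not optional.

```python
NOT_ALLOWED = ("notAllowed",)
EMPTY = ("empty",)

def simplify(p):
    """Propagate notAllowed and drop empty in a tuple-based pattern tree."""
    kind = p[0]
    if kind in ("group", "interleave"):
        a, b = simplify(p[1]), simplify(p[2])
        if NOT_ALLOWED in (a, b):
            return NOT_ALLOWED        # one impossible part spoils the whole
        if a == EMPTY:
            return b
        if b == EMPTY:
            return a
        return (kind, a, b)
    if kind == "choice":
        a, b = simplify(p[1]), simplify(p[2])
        if a == NOT_ALLOWED:
            return b                  # an impossible alternative just disappears
        if b == NOT_ALLOWED:
            return a
        return ("choice", a, b)
    if kind == "optional":
        return simplify(("choice", p[1], EMPTY))
    return p                          # leaves: element, ref, empty, notAllowed

# element-died &= notAllowed
element_died = simplify(("interleave", ("element", "died"), NOT_ALLOWED))

# author content: content-person, element-died?
author = simplify(("group", ("ref", "content-person"), ("optional", element_died)))

# Hypothetical: forbidding element-qualification, which character requires
character = simplify(("group", ("ref", "content-person"), NOT_ALLOWED))

print(element_died)  # ('notAllowed',)
print(author)        # ('ref', 'content-person')
print(character)     # ('notAllowed',)
```

The optional reference absorbs the forbidden pattern and the author content model survives, while the required reference turns the whole content model into something no document can match.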

What about combining start patterns by interleave? This may seem weird or even illegal because you've seen start patterns in a context in which they define the root element of XML documents. A well-formed XML document can have only one root element, but schemas can permit a variety of different root elements in their models.

Another case in which combining by interleave is handy, and very widely used, is adding attributes to a named pattern. In this case, the unordered nature of interleave doesn't make any difference because attributes are always unordered.

Why Can't Definitions Be Combined by Group?

You have seen how to combine definitions by interleave and choice, and because group is the third compositor, you might be tempted to combine definitions by group. Unfortunately, definitions of named patterns are declarations. Since the relative order of these declarations isn't considered significant, combining definitions by group wouldn't give reliable results and has thus been forbidden. This issue doesn't arise with the choice and interleave compositors, because the relative order of their child elements isn't significant for a schema.
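
The difference can be checked with a small Python model. As a simplifying assumption, each pattern is represented by the finite set of element sequences it accepts, and the function names are chosen for the example. Swapping the two operands, which is what a change in the order of the declarations would amount to, leaves choice and interleave unchanged but alters the result of group.

```python
def shuffles(x, y):
    """All interleavings of tuples x and y that keep each one's internal order."""
    if not x:
        return {y}
    if not y:
        return {x}
    return ({(x[0],) + s for s in shuffles(x[1:], y)} |
            {(y[0],) + s for s in shuffles(x, y[1:])})

def choice(a, b):
    return a | b

def interleave(a, b):
    return {s for x in a for y in b for s in shuffles(x, y)}

def group(a, b):
    return {x + y for x in a for y in b}

first = {("name",)}    # what one definition accepts
second = {("email",)}  # what another definition of the same name accepts

print(choice(first, second) == choice(second, first))          # True
print(interleave(first, second) == interleave(second, first))  # True
print(group(first, second) == group(second, first))            # False
```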


This text is released under the Free Software Foundation GFDL.