Merging grammars

RELAX NG by Eric van der Vlist will be published by O'Reilly & Associates (ISBN: 0596004214)

Merging grammars

In the preceding sections we have seen how we could use an external grammar as a single pattern. This is useful in cases like those we've seen where we want to include a content model described by an external schema at a single point, not unlike when you mount a UNIX file system. The description contained in the external grammar is "mounted" at the point where you make your reference.

The main drawback to this approach is that you cannot individually reuse the definitions contained in the external schema. To do so, we need to introduce a new pattern, with a different meaning, which will let us control how two grammars are merged into a single one.

Merging without redefinition

In the simplest case, we will want to reuse patterns defined in common libraries of patterns without modifying them. Let's say we have defined a grammar with some common patterns, common.rng, which can be reused in many different schemas, such as:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
 
   <define name="element-name">
     <element name="name">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
   
   <define name="element-born">
     <element name="born">
       <data type="date"/>
     </element>
   </define>
   
   <define name="attribute-id">
     <attribute name="id">
       <data type="ID"/>
     </attribute>
   </define>
   
   <define name="content-person">
     <ref name="attribute-id"/>
     <ref name="element-name"/>
     <optional>
       <ref name="element-born"/>
     </optional>
   </define>
   
 </grammar>

Or common.rnc, in the compact syntax:

 element-name = element name { token }
 element-born = element born { xsd:date }
 attribute-id = attribute id { xsd:ID }
 content-person = attribute-id, element-name, element-born?

These schemas are obviously not meant to be used as standalone schemas: they have no start pattern, so they would be incorrect on their own. However, they contain definitions which can be used to write the schema of our library. To use these definitions, we need to use include patterns and provide a supporting framework. In the XML syntax, this looks like:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
 
   <include href="common.rng"/>

   <start>
     <element name="library">
       <oneOrMore>
         <element name="book">
           <ref name="attribute-id"/>
           <attribute name="available">
             <data type="boolean"/>
           </attribute>
           <element name="isbn">
             <data type="token" datatypeLibrary=""/>
           </element>
           <element name="title">
             <attribute name="xml:lang">
               <data type="language"/>
             </attribute>
             <data type="token" datatypeLibrary=""/>
           </element>
           <oneOrMore>
             <element name="author">
               <ref name="content-person"/>
               <optional>
                 <ref name="element-died"/>
               </optional>
             </element>
           </oneOrMore>
           <zeroOrMore>
             <element name="character">
               <ref name="content-person"/>
               <ref name="element-qualification"/>
             </element>
           </zeroOrMore>
         </element>
       </oneOrMore>
     </element>
   </start>
      
   <define name="element-died">
     <element name="died">
       <data type="date"/>
     </element>
   </define>
   
   <define name="element-qualification">
     <element name="qualification">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
   
 </grammar>

The include pattern is translated to an include keyword in the compact syntax:

 include "common.rnc"
 start =
   element library {
     element book {
       attribute-id,
       attribute available { xsd:boolean },
       element isbn { token },
       element title {
         attribute xml:lang { xsd:language },
         token
       },
       element author {
         content-person,
         element-died?
       }+,
       element character {
         content-person,
         element-qualification
       }*
     }+
   }
 element-died = element died { xsd:date }
 element-qualification = element qualification { token }

Note that the name of the include pattern is slightly misleading. The include pattern here doesn't include the external grammar directly. (We have seen that this was the job of the externalRef pattern.) Instead, it includes the content of the external grammar, performing a merge of both grammars. This is exactly what we need here, since it allows us to make references to the named patterns defined in the "common.rng" grammar.

The result of this inclusion is thus equivalent to the following monolithic schema:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
 <!-- Content of the included grammar -->
   <define name="element-name">
     <element name="name">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
   <define name="element-born">
     <element name="born">
       <data type="date"/>
     </element>
   </define>
   <define name="attribute-id">
     <attribute name="id">
       <data type="ID"/>
     </attribute>
   </define>
   <define name="content-person">
     <ref name="attribute-id"/>
     <ref name="element-name"/>
     <optional>
       <ref name="element-born"/>
     </optional>
   </define>
 <!-- End of the included grammar -->
   <start>
     <element name="library">
       <oneOrMore>
         <element name="book">
           <ref name="attribute-id"/>
           <attribute name="available">
             <data type="boolean"/>
           </attribute>
           <element name="isbn">
             <data type="token" datatypeLibrary=""/>
           </element>
           <element name="title">
             <attribute name="xml:lang">
               <data type="language"/>
             </attribute>
             <data type="token" datatypeLibrary=""/>
           </element>
           <oneOrMore>
             <element name="author">
               <ref name="content-person"/>
               <optional>
                 <ref name="element-died"/>
               </optional>
             </element>
           </oneOrMore>
           <zeroOrMore>
             <element name="character">
               <ref name="content-person"/>
               <ref name="element-qualification"/>
             </element>
           </zeroOrMore>
         </element>
       </oneOrMore>
     </element>
   </start>
   <define name="element-died">
     <element name="died">
       <data type="date"/>
     </element>
   </define>
   <define name="element-qualification">
     <element name="qualification">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
 </grammar>

or, in the compact syntax:

 element-name = element name { token }
 element-born = element born { xsd:date }
 attribute-id = attribute id { xsd:ID }
 content-person = attribute-id, element-name, element-born?
      
 start =
   element library {
     element book {
       attribute-id,
       attribute available { xsd:boolean },
       element isbn { token },
       element title {
         attribute xml:lang { xsd:language },
         token
       },
       element author {
         content-person,
         element-died?
       }+,
       element character {
         content-person,
         element-qualification
       }*
     }+
   }
 element-died = element died { xsd:date }
 element-qualification = element qualification { token }
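
As a quick sanity check of the merged schema, here is a sketch of an instance document that it should accept. All names, dates, and identifiers are invented for illustration and do not come from the original library example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance: every value below is made up for illustration -->
<library>
  <book id="b1" available="true">
    <isbn>0000000000</isbn>
    <title xml:lang="en">An Example Title</title>
    <!-- author: content-person (id, name, optional born), then optional died -->
    <author id="a1">
      <name>Jane Example</name>
      <born>1950-01-01</born>
      <died>2000-01-01</died>
    </author>
    <!-- character: content-person, then a mandatory qualification -->
    <character id="c1">
      <name>Sam Sample</name>
      <qualification>curious</qualification>
    </character>
  </book>
</library>
```

The id attributes and the name and born elements are validated by definitions that come from common.rng, while died and qualification are validated by definitions written in the including schema.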

Merging and replacing definitions

In the previous example, we were lucky. The definitions of the common patterns which we included matched exactly what we needed. In the real world, this isn't always the case. It is quite handy to be able to replace definitions found in the grammar that we're including when they might conflict with other aspects of our schema design.

Let's say that we have already written this very flat version of our schema, called library.rng:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
      
   <start>
     <ref name="element-library"/>
   </start>
   
   <define name="element-library">
     <element name="library">
       <zeroOrMore>
         <ref name="element-book"/>
       </zeroOrMore>
     </element>
   </define>
   
   <define name="element-book">
     <element name="book">
       <ref name="attribute-id"/>
       <ref name="attribute-available"/>
       <ref name="element-isbn"/>
       <ref name="element-title"/>
       <oneOrMore>
         <ref name="element-author"/>
       </oneOrMore>
       <zeroOrMore>
         <ref name="element-character"/>
       </zeroOrMore>
     </element>
   </define>
   
   <define name="element-author">
     <element name="author">
       <ref name="content-person"/>
       <optional>
         <ref name="element-died"/>
       </optional>
     </element>
   </define>
   
   <define name="element-character">
     <element name="character">
       <ref name="content-person"/>
       <ref name="element-qualification"/>
     </element>
   </define>

   <define name="element-isbn">
     <element name="isbn">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>

   <define name="element-title">
     <element name="title">
       <ref name="attribute-xml-lang"/>
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>

   <define name="attribute-xml-lang">
     <attribute name="xml:lang">
       <data type="language"/>
     </attribute>
   </define>

   <define name="attribute-available">
     <attribute name="available">
       <data type="boolean"/>
     </attribute>
   </define>
      
   <define name="element-name">
     <element name="name">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
      
   <define name="element-born">
     <element name="born">
       <data type="date"/>
     </element>
   </define>
      
   <define name="element-died">
     <element name="died">
       <data type="date"/>
     </element>
   </define>
      
   <define name="attribute-id">
     <attribute name="id">
       <data type="ID"/>
     </attribute>
   </define>
      
   <define name="content-person">
     <ref name="attribute-id"/>
     <ref name="element-name"/>
     <optional>
       <ref name="element-born"/>
     </optional>
   </define>

   <define name="element-qualification">
     <element name="qualification">
       <data type="token" datatypeLibrary=""/>
     </element>
   </define>
   
 </grammar>

or, in the compact syntax, library.rnc:

 start = element-library
 element-library = element library { element-book* }
 element-book =
   element book {
     attribute-id,
     attribute-available,
     element-isbn,
     element-title,
     element-author+,
     element-character*
   }
 element-author = element author { content-person, element-died? }
 element-character =
   element character { content-person, element-qualification }
 element-isbn = element isbn { token }
 element-title = element title { attribute-xml-lang, token }
 attribute-xml-lang = attribute xml:lang { xsd:language }
 attribute-available = attribute available { xsd:boolean }
 element-name = element name { token }
 element-born = element born { xsd:date }
 element-died = element died { xsd:date }
 attribute-id = attribute id { xsd:ID }
 content-person = attribute-id, element-name, element-born?
 element-qualification = element qualification { token }

This might be a good schema used in production to validate incoming documents from a variety of patterns, so we wouldn't want to modify it. However, we might have a new application that doesn't work at the level of a library but only at the level of a book. This application would need to validate instance documents with book root elements. Of course we wouldn't want to copy and paste the definition of our existing schema into another one since that would mean maintaining two different versions with similar content.

This is a case where we would want to redefine the start pattern of our schema. To do so, we use an include pattern and embed in it the definitions that must be substituted for the ones from the included grammar:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="library.rng">
     <start>
       <ref name="element-book"/>
     </start>
   </include>
 </grammar>

Or:

 include "library.rnc" {
   start = element-book
 }

Note how the new definitions are embedded directly in the include pattern: the content of the include pattern is where all of the redefinitions must be written. This short schema includes all the definitions from "library.rng" and redefines the start pattern. It validates instance documents with a book root element. Since we are performing an inclusion instead of a copy, we will inherit any modifications made to "library.rng".
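
To make the effect concrete, here is a sketch of a document that should be valid against this redefined schema but not against the original "library.rng", because its root element is book rather than library. The values are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance with book as the root element -->
<book id="b1" available="false">
  <isbn>0000000000</isbn>
  <title xml:lang="en">An Example Title</title>
  <author id="a1">
    <name>Jane Example</name>
  </author>
</book>
```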

We have been able to redefine the start pattern, but each named pattern can also be redefined using the same syntax. Let's say for instance that I am not happy with the definition of the element-name pattern and want to check that the name is shorter than 80 characters. If I don't want to (or can't) modify the original schema, I can include it and redefine this pattern:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
   datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
   <include href="library.rng">
     <define name="element-name">
       <element name="name">
         <data type="token">
           <param name="maxLength">80</param>
         </data>
       </element>
     </define>
   </include>
 </grammar>

Or:

 include "library.rnc" {
   element-name = element name { xsd:token{maxLength = "80"} }
 }

Here again, the grammar of "library.rnc" is merged with the grammar of the new schema (which happens to be empty), but before the merge, the definitions embedded in the include pattern are substituted for the original definitions.

The new definition can be as different from the original one as I want. While it might not always be good practice, I could for instance redefine attribute-available and replace the attribute by an element:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
   <include href="library.rng">
     <define name="attribute-available">
       <element name="available">
         <data type="boolean"/>
       </element>
     </define>
   </include>
 </grammar>

Or:

 include "library.rnc" {
   attribute-available = element available { xsd:boolean }
 }

That would be rather confusing (the named pattern is called attribute-available and it's now describing an element) but the schema is perfectly valid and describes instance documents where the available attribute is replaced by an available element. The same approach could also be used to remove this attribute:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="library.rng">
     <define name="attribute-available">
       <empty/>
     </define>
   </include>
 </grammar>

Or:

 include "library.rnc" {
   attribute-available = empty
 }

Note how this uses a new pattern named empty. This pattern matches only empty content (text nodes made of whitespace are still accepted), so the effect is the same as if the references to the named pattern had been removed from the schema.
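
One way to picture this is to write out what element-book amounts to once attribute-available matches nothing. The fragment below is only a sketch of the equivalent definition, not something that has to be written in the schema:

```xml
<!-- Sketch: effective content of element-book when attribute-available is empty -->
<define name="element-book">
  <element name="book">
    <ref name="attribute-id"/>
    <!-- the reference to attribute-available contributes nothing here -->
    <ref name="element-isbn"/>
    <ref name="element-title"/>
    <oneOrMore>
      <ref name="element-author"/>
    </oneOrMore>
    <zeroOrMore>
      <ref name="element-character"/>
    </zeroOrMore>
  </element>
</define>
```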

include patterns have the effect of merging the content of their grammar, after replacement of the redefined patterns, with the content of the current grammar. This means that these redefinitions can make references to any definition from either the including or the included grammars. If we wanted, for instance, to add zero or more email addresses to the author element while retaining a flat structure, we could write:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
          datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
 
   <include href="library.rng">
   
     <define name="element-author">
       <element name="author">
         <ref name="content-person"/>
         <optional>
           <ref name="element-died"/>
         </optional>
         <zeroOrMore>
           <ref name="element-email"/>
         </zeroOrMore>
       </element>
     </define>
     
   </include>
   
   <define name="element-email">
     <element name="email">
       <data type="anyURI">
         <param name="pattern">mailto:.*</param>
       </data>
     </element>
   </define>
 </grammar>

Or:

 include "library.rnc" {
   element-author =
     element author { content-person, element-died?, element-email* }
 }
 element-email =
   element email {
     xsd:anyURI { pattern = "mailto:.*" }
   }

Here, in the redefinition of the element-author pattern, we are making three references to three named patterns: content-person and element-died are defined in "library.rng", i.e. the included grammar, while the third one, element-email, is defined in the top-level grammar, i.e. the including grammar.
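
As an illustration, an author element such as the following should be accepted by the redefined pattern. The addresses and other values are hypothetical:

```xml
<!-- Hypothetical author fragment: emails come last, after the optional died -->
<author id="a1">
  <name>Jane Example</name>
  <born>1950-01-01</born>
  <died>2000-01-01</died>
  <email>mailto:jane@example.org</email>
  <email>mailto:jane.example@example.com</email>
</author>
```

Because the redefinition uses an ordered group, the email elements are only allowed at the end of the author element.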

Combining definitions

When we've replaced the definitions in our previous examples, the original definition was completely replaced by the new one. This can make the maintenance of these schemas more complicated than it should be. In the last example, if the included schema (library.rng) was updated and the definition of element-author changed to add a new element to include a telephone number, this addition would be lost if we did not add it explicitly in the including schema. As far as the element-author pattern is concerned, this redefinition is no better than a copy and paste. A mechanism more similar to inheritance would help with this.

If we want to keep the definition from the included grammar, we can combine a new definition with the existing one instead of replacing it. Unlike redefinition, combination of start and named patterns does not take place in the include pattern itself, but rather is done at the level of the including grammar. It isn't even necessary to include a grammar to combine definitions, but the main interest of combining definitions is to combine new definitions with existing ones from included grammars.

There are two options for combining definitions: choice and interleave.

Combining by choice

When definitions are combined by choice, the result is similar to using a choice pattern between the content of the definitions.

A use case for this would be to define a schema accepting either a library or a book element from the schema used in the previous section. In the XML syntax, combining by choice is done through a combine attribute:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="library.rng"/>
   <start combine="choice">
     <ref name="element-book"/>
   </start>
 </grammar>

In the compact syntax, combining by choice is done by using the |= operator (instead of =) in the definition:

 include "library.rnc"
 start |= element-book

Note that in both cases, the combination is done outside of the inclusion. Its effect is to add a choice between the contents of the two start patterns. The definition becomes equivalent to:

   <start>
     <choice>
       <ref name="element-library"/>
       <ref name="element-book"/>
     </choice>
   </start>

Or:

 start = element-library | element-book

The logic behind this combination is to allow the content model corresponding to the original pattern while also allowing different content to appear. This is different from the logic behind pattern redefinitions where the original pattern was replaced by a new one.

Named patterns can also be combined. If we wanted to accept either an available attribute or element, we could write:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
   datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
   
   <include href="library.rng"/>
   
   <define name="attribute-available" combine="choice">
     <element name="available">
       <data type="boolean"/>
     </element>
   </define>
   
 </grammar>

Or:

 include "library.rnc"
 attribute-available |= element available { xsd:boolean }
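
Following the same reasoning as for the start pattern, the combined definition can be sketched as the fragment below, assuming it sits in a grammar that declares the W3C XML Schema datatype library as in the examples above:

```xml
<!-- Sketch: attribute-available after combination by choice -->
<define name="attribute-available">
  <choice>
    <attribute name="available">
      <data type="boolean"/>
    </attribute>
    <element name="available">
      <data type="boolean"/>
    </element>
  </choice>
</define>
```

Since element-book refers to attribute-available before element-isbn in an ordered group, a document that uses the element form has to place the available element before the isbn element.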

Another interesting and common case involves making this attribute optional. This can be achieved by combining this pattern by choice with an empty pattern:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
 
   <include href="library.rng"/>
   
   <define name="attribute-available" combine="choice">
     <empty/>
   </define>
   
 </grammar>

Or:

 include "library.rnc"
 attribute-available |= empty

Adding a choice between a defined component and nothingness may seem like a roundabout way to make the component optional, but it works with minimum need to modify included schemas.
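
Written out, the combined definition is equivalent to the fragment below, which has the same meaning as wrapping the attribute in an optional pattern. This is a sketch, again assuming the enclosing grammar declares the datatype library used in the earlier examples:

```xml
<!-- Sketch: attribute-available after combination by choice with empty -->
<define name="attribute-available">
  <choice>
    <attribute name="available">
      <data type="boolean"/>
    </attribute>
    <empty/>
  </choice>
</define>
```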

Combining by interleave

We have seen how an "old" pattern could be replaced by a new one using pattern redefinition and also how we could give the choice between an "old" definition and a new one using a combination by choice. The last option is to combine by interleave. The logic here is to allow pieces to be added to the original content model and to let these pieces be interleaved, i.e. to appear anywhere before, after, and between the subpatterns of the original pattern.

Earlier, we added an email element to the content of the author element using a redefinition. We can also use a combination by interleave to add this email pattern to the content-person pattern:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0" 
   datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
   
   <include href="library.rng"/>
   
   <define name="content-person" combine="interleave">
     <zeroOrMore>
       <ref name="element-email"/>
     </zeroOrMore>
   </define>
   
   <define name="element-email">
     <element name="email">
       <data type="anyURI">
         <param name="pattern">mailto:.*</param>
       </data>
     </element>
   </define>
   
 </grammar>

Or, in the compact syntax:

 include "library.rnc"
 content-person &= element-email *
 element-email =
   element email {
     xsd:anyURI { pattern = "mailto:.*" }
   }

The effect of this combination by interleave is that the content-person pattern is now equivalent to an interleave pattern embedding both the original and the new definition, i.e.:

    <define name="content-person">
      <interleave>
        <group>
          <ref name="attribute-id"/>
          <ref name="element-name"/>
          <optional>
            <ref name="element-born"/>
          </optional>
        </group>
        <zeroOrMore>
          <ref name="element-email"/>
        </zeroOrMore>
      </interleave>
    </define>

Or:

  content-person =
    (attribute-id, element-name, element-born?) & element-email *

This definition allows any number of email elements before the name element, between the name element and the born element and after the born element.
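
For instance, an author element like the following sketch should now be accepted. The values are hypothetical:

```xml
<!-- Hypothetical author fragment: emails interleaved with name and born -->
<author id="a1">
  <email>mailto:jane@example.org</email>
  <name>Jane Example</name>
  <email>mailto:jane.example@example.com</email>
  <born>1950-01-01</born>
  <died>2000-01-01</died>
</author>
```

Two consequences are worth noting. The email elements belong to content-person, so in an author element they must still appear before the optional died element, which follows content-person in the definition of element-author. And since element-character also refers to content-person, character elements accept email elements as well.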

The logic here is to allow extension by adding new content anywhere in the original definition. This is neat and safe if the applications which read the documents are coded to ignore what they don't know. In our example, if I have designed an application to read the original content model, this application will be just fine with the new content model if it ignores the email elements which have been added.

We have seen how a combination by choice can be used to make a pattern optional. Combination by interleave cannot reverse the process, but it can make a pattern forbidden. If we don't want to end up with a schema that won't validate any instance document, we must apply this only to a pattern whose references are all optional, such as the element-died pattern:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar xmlns="http://relaxng.org/ns/structure/1.0">
   <include href="library.rng"/>
   <define name="element-died" combine="interleave">
     <notAllowed/>
   </define>
 </grammar>

Or:

 include "library.rnc"
 element-died &= notAllowed

Here, we are interleaving a new pattern, notAllowed, with the content of the named pattern element-died. The effect of this operation is that this pattern no longer matches anything. This is OK since the reference to element-died in the definition of the author element is optional. As a result, a document can be valid per the resulting schema only if it contains no died element.
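
Put differently, once element-died can never match, the optional reference to it can only be satisfied by the absence of the element, and the definition of the author element behaves like this sketch:

```xml
<!-- Sketch: effective definition of element-author when element-died is notAllowed -->
<define name="element-author">
  <element name="author">
    <ref name="content-person"/>
    <!-- the optional reference to element-died can only match nothing -->
  </element>
</define>
```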

What about combining start patterns by interleave? This may seem weird or even illegal, since we have seen start patterns only in contexts where they are used to define the root element of XML documents. A well-formed XML document can have only one root element, but schemas can permit a variety of different root elements in their models.

Another use case where combining by interleave is handy and very widely used is to add attributes to a named pattern. In this case, the fact that interleave is unordered doesn't make any difference since attributes are always unordered.
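
A minimal sketch of this use case, assuming we want to allow a hypothetical optional role attribute (not part of the original library schema) on every element that uses content-person:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <include href="library.rng"/>
  <!-- "role" is a made-up attribute name used only for illustration -->
  <define name="content-person" combine="interleave">
    <optional>
      <attribute name="role"/>
    </optional>
  </define>
</grammar>
```

With this combination, both author and character elements accept an optional role attribute whose value is any text, and the rest of their content model is unchanged.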

Why can't we combine definitions by group?

We have seen how to combine definitions by interleave and choice and, since group is the third compositor, we might be tempted to combine definitions by group. Unfortunately, definitions of named patterns are declarations. Since the relative order of these declarations is not considered significant, combining definitions by group wouldn't give reliable results and has thus been forbidden. This issue doesn't arise with the choice and interleave compositors because the relative order of their child elements is not significant for a schema.

All text is copyright Eric van der Vlist, Dyomedea. During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation GFDL.
