Text and empty patterns, whitespace, and mixed content

RELAX NG by Eric van der Vlist will be published by O'Reilly & Associates (ISBN: 0596004214)

Chapter 6: More Complex Patterns

So far, we have only used text patterns within group patterns. It's important to remember, however, that the text pattern doesn't simply mean "a text node" but rather "zero or more text nodes". This deserves some exploration.

The reason text patterns accept zero text nodes is linked to the policy RELAX NG adopts regarding whitespace. Whitespace processing rules are one of the fuzzier areas of XML, and RELAX NG has attempted to find the "least surprising" policy that supports the most common usages. We will see more of whitespace processing when we cover datatypes, but for now, let's say that RELAX NG makes no distinction between an empty string, no string at all, a string containing only whitespace before or after an element node, and, to a lesser extent, the single text child of an element when it contains only whitespace.

For instance, in the following snippet:

 <foo at1="" at2=" ">
  <bar/>
  <bar></bar>
  <bar>
   <baz/>
   <baz/>
  </bar>
  <bar>
  </bar>
 </foo>

RELAX NG treats the values of at1 and at2, the content of the first and second "bar" elements, the text between the third "bar" start tag and the first "baz" element, the text between the two "baz" elements, and even the text within the last "bar" element as not significant. RELAX NG's rules state that such content matches either "text" or "empty" patterns. This has two visible consequences for the patterns we've seen so far:

Since text patterns match any text node, they must match strings that are empty or contain only whitespace; and since there is no difference between an empty string and no string, text patterns also match "zero strings", i.e. they are always optional.
Since empty patterns match "zero strings", and since there is no difference between no string and an empty or whitespace-only string, empty patterns also match strings that are empty or contain only whitespace.

In other words, the snippet shown above is valid against content models in which all the occurrences mentioned are described as either text or empty patterns. If we add the rule (already used a lot but not yet explained) that you don't need to write explicit "empty" patterns between elements, these two schemas both validate this instance document:

 <element xmlns="http://relaxng.org/ns/structure/1.0" name="foo">
  <attribute name="at1"><text/></attribute>
  <attribute name="at2"><text/></attribute>
  <oneOrMore>
   <element name="bar">
    <choice>
     <text/>
     <oneOrMore>
      <element name="baz"><text/></element>
     </oneOrMore>
    </choice>
   </element>
  </oneOrMore>
 </element>

 <element xmlns="http://relaxng.org/ns/structure/1.0" name="foo">
  <attribute name="at1"><empty/></attribute>
  <attribute name="at2"><empty/></attribute>
  <oneOrMore>
   <element name="bar">
    <choice>
     <empty/>
     <oneOrMore>
      <element name="baz"><empty/></element>
     </oneOrMore>
    </choice>
   </element>
  </oneOrMore>
 </element>
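
The equivalence holds only for empty or whitespace-only content. As a contrast, here is a small instance (a sketch that is not part of the original example) which the first schema, built on text patterns, should accept, and which the second schema, built on empty patterns, should reject because at1 and the "bar" element both contain non-whitespace characters:

 <foo at1="x" at2=" ">
  <bar>some text</bar>
 </foo>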

Having seen why text patterns have to be optional, we need to see why it's also useful for them to match multiple text nodes. When a text pattern is used within a group or choice pattern, it makes no difference, since text nodes are merged when they are contiguous or separated only by infoset items that RELAX NG doesn't check, such as comments or processing instructions (PIs). Within a group or a choice, there is thus no difference between a pattern that would match exactly one text node and one that would match one or more. The only place where it can make a difference is within interleave compositors, and that is why this feature was introduced. Document-oriented applications, including XHTML and DocBook, provide numerous examples of elements that accept text and embedded elements in any order (called mixed content), and in this case it would make no sense to limit the number of text nodes.
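
The difference can be sketched with two small patterns. The element names "note" and "em" are hypothetical and used only for illustration. In the first pattern the text pattern sits in a group, so text is accepted only before the "em" elements; in the second it sits in an interleave, so the same single text pattern accepts text before, between, and after them:

 <element xmlns="http://relaxng.org/ns/structure/1.0" name="note">
  <group>
   <text/>
   <zeroOrMore>
    <element name="em"><text/></element>
   </zeroOrMore>
  </group>
 </element>

 <element xmlns="http://relaxng.org/ns/structure/1.0" name="note">
  <interleave>
   <text/>
   <zeroOrMore>
    <element name="em"><text/></element>
   </zeroOrMore>
  </interleave>
 </element>

An instance such as <note>See <em>this</em> and <em>that</em> too.</note> should be rejected by the first pattern, because non-whitespace text follows an "em" element, and accepted by the second.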

To introduce a mixed content model, let's say we want to extend the title element to include zero or more links, using "a" elements with "href" attributes, such as:

   <title xml:lang="en">Being a
    <a href="http://dmoz.org/Recreation/Pets/Dogs/">Dog</a>
    Is a Full-Time
    <a href="http://dmoz.org/Business/Employment/Job_Search/">Job</a>
   </title>

The content of the new title element can be described as an interleave pattern that allows zero or more "a" elements and zero or more text nodes. Because the text pattern matches zero or more text nodes, we don't need to specify its cardinality. We can just write:

    <element name="title">
     <interleave>
      <attribute name="xml:lang"/>
      <zeroOrMore>
       <element name="a">
        <attribute name="href"/>
        <text/>
       </element>
      </zeroOrMore>
      <text/>
     </interleave>
    </element>

or, using the compact syntax:

   element title {
    attribute xml:lang {text}&
    element a {attribute href {text}, text}*&
    text
   }

As this is quite verbose for such a common task, RELAX NG provides a specific compositor named mixed, which has the same meaning as "interleave including a text pattern". The schemas above are strictly equivalent to:

    <element name="title">
     <mixed>
      <attribute name="xml:lang"/>
      <zeroOrMore>
       <element name="a">
        <attribute name="href"/>
        <text/>
       </element>
      </zeroOrMore>
     </mixed>
    </element>

In the compact syntax, the mixed compositor is written as a mixed pattern:

 element title {
  mixed {
   attribute xml:lang {text}&
   element a {attribute href {text}, text}*
  }
 }
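
Assuming the simplification rules of the RELAX NG specification (several children of a mixed pattern are first wrapped in a group, and the result is then interleaved with a text pattern), the XML form of the mixed title pattern can be read as the following expanded form. This is an illustrative sketch rather than part of the original schema:

 <element xmlns="http://relaxng.org/ns/structure/1.0" name="title">
  <interleave>
   <group>
    <attribute name="xml:lang"/>
    <zeroOrMore>
     <element name="a">
      <attribute name="href"/>
      <text/>
     </element>
    </zeroOrMore>
   </group>
   <text/>
  </interleave>
 </element>

Since the position of an attribute pattern relative to its sibling patterns does not affect matching, grouping the attribute with the "a" elements should accept the same documents as interleaving it with them.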

All text is copyright Eric van der Vlist, Dyomedea. During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation GFDL.
