Text and Empty Patterns, Whitespace, and Mixed Content


So far, we have used text patterns only within group patterns. It's important to remember, however, that a text pattern doesn't match simply one text node but rather zero or more text nodes. This statement deserves some exploration.

The reason why text patterns accept zero text nodes is linked to the policy adopted by RELAX NG regarding whitespace. Whitespace processing rules are one of the fuzzier areas in XML. RELAX NG has attempted to find the "least surprising" policy that supports the most common usages. You'll see more about whitespace processing when we study datatypes, but for now, let's say that RELAX NG makes no distinction among empty strings; no string at all; strings containing only whitespace before or after an element node; and, to a lesser extent, a whitespace-only text node that is the single child of an element.

For instance, in the following snippet:

<foo at1="" at2=" ">
 <bar/>
 <bar></bar>
 <bar>
  <baz/>
  <baz/>
 </bar>
 <bar>
 </bar>
</foo>

RELAX NG treats as insignificant the values of at1 and at2, the content of the first and second bar elements, the text between the third bar start tag and the first baz element, the text between the two baz elements, and even the text within the last bar element. RELAX NG's rules state that the content should match either text or empty patterns. Here are two visible consequences for the patterns we've seen so far:

Because text patterns match any text node, they must match strings that are either empty or that contain only whitespace. Since there is no difference between empty strings and no string, text patterns match zero strings; i.e., they are always optional.
Because empty patterns match zero strings and because there is no difference between no string and empty strings or strings containing only whitespace, empty patterns also match strings either empty or containing only whitespace.

In other words, the snippet shown here matches content models in which all the occurrences mentioned are described as either text or empty patterns. If you add the rule (already used a lot but not yet explained) that you don't need to explicitly express empty patterns between elements, the following two schemas both validate this instance document:

<element xmlns="http://relaxng.org/ns/structure/1.0" name="foo">
 <attribute name="at1"><text/></attribute>
 <attribute name="at2"><text/></attribute>
 <oneOrMore>
  <element name="bar">
   <choice>
    <text/>
    <oneOrMore>
     <element name="baz"><text/></element>
    </oneOrMore>
   </choice>
  </element>
 </oneOrMore>
</element>

or:

<element xmlns="http://relaxng.org/ns/structure/1.0" name="foo">
 <attribute name="at1"><empty/></attribute>
 <attribute name="at2"><empty/></attribute>
 <oneOrMore>
  <element name="bar">
   <choice>
    <empty/>
    <oneOrMore>
     <element name="baz"><empty/></element>
    </oneOrMore>
   </choice>
  </element>
 </oneOrMore>
</element>
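
The two schemas are not interchangeable for every document, though. As a minimal sketch (the instance below is a hypothetical variation on the snippet above, not part of the original example), giving a bar element some non-whitespace text separates them: the first schema should still accept it, because a text pattern matches any string, while the second should reject it, because an empty pattern matches only absent, empty, or whitespace-only content.

```xml
<!-- Hypothetical instance: expected to be valid against the first schema
     (bar declared with text) and invalid against the second (bar declared
     with empty). -->
<foo at1="" at2=" ">
 <!-- Non-whitespace text: matched by a text pattern, not by an empty one -->
 <bar>some text</bar>
</foo>
```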

Now that you have seen why text patterns have to be optional, you need to see why it's also useful for them to match multiple text nodes. When a text pattern is used within a group or choice pattern, it makes no difference, because text nodes are merged when they are contiguous or separated only by infoset items that RELAX NG does not check, such as comments or processing instructions (PIs). Within a group or a choice, there is therefore no difference between a pattern that matches exactly one text node and a pattern that matches one or more. The only place it can make a difference is within interleave compositors, and that's why this behavior has been introduced. Document-oriented applications, including XHTML, TEI, and DocBook, provide numerous examples of elements that accept text and embedded elements in any order (called mixed content), and in this case it makes no sense to limit the number of text nodes.
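
To illustrate the merging of text nodes, here is a small sketch of a bar element like the one in the first schema above, whose content may be a text pattern. The comment and the processing instruction (its target name is made up for this illustration) are not checked by RELAX NG, so the three pieces of text around them should be treated as one string.

```xml
<!-- Hypothetical instance: RELAX NG ignores the comment and the PI,
     so bar is validated as if it held the single string "Hello world!" -->
<bar>Hello <!-- an ignored comment -->world<?hypothetical-pi ignored?>!</bar>
```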

To introduce a mixed content model, let's extend the title element to include zero or more links using some a elements with href attributes:

<title xml:lang="en">Being a
<a href="http://dmoz.org/Recreation/Pets/Dogs/">Dog</a>
    Is a Full-Time
 <a href="http://dmoz.org/Business/Employment/Job_Search/">Job</a>
</title>

The content of the new title element can be described as an interleave pattern that allows zero or more a elements and zero or more text nodes. The text pattern matches zero or more text nodes, which will allow us to avoid specifying its cardinality. You can just write:

 <element name="title">
  <interleave>
   <attribute name="xml:lang"/>
   <zeroOrMore>
    <element name="a">
     <attribute name="href"/>
     <text/>
    </element>
   </zeroOrMore>
   <text/>
  </interleave>
 </element>

or, using the compact syntax:

 element title {
  attribute xml:lang {text}&
  element a {attribute href {text}, text}*&
  text
 }

Because this definition is quite verbose for a common task, RELAX NG has introduced a specific mixed compositor, which has the same meaning as "interleave including a text pattern." These schemas are strictly equivalent to:

 <element name="title">
  <mixed>
   <attribute name="xml:lang"/>
   <zeroOrMore>
    <element name="a">
     <attribute name="href"/>
     <text/>
    </element>
   </zeroOrMore>
  </mixed>
 </element>

In the compact syntax, the mixed compositor is written with the mixed keyword:

 element title {
  mixed {
   attribute xml:lang {text}&
   element a {attribute href {text}, text}*
  }
 }
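
As a quick check of what these definitions allow, the following instance is a sketch of a title in which text and a elements alternate several times. The example.org URLs and the wording are placeholders invented for this illustration. Under the definitions of title shown above, whether written with interleave or with mixed, it should validate, because the text pattern accepts any number of text nodes between the links.

```xml
<!-- Hypothetical instance: text nodes and a elements freely interleaved -->
<title xml:lang="en">
 <a href="http://example.org/dogs">Dogs</a> and
 <a href="http://example.org/cats">Cats</a>: a
 <a href="http://example.org/comparison">Comparison</a>
</title>
```

A title containing only text, or only a elements, should be accepted as well, since both the text pattern and the zeroOrMore pattern can match nothing.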
