Documents and Infosets


RELAX NG is an XML-based technology. RELAX NG schemas are commonly stored in XML documents (called schema documents) and used to validate other XML documents (called instance documents). Although RELAX NG works with XML documents, RELAX NG processors don't process the actual text of the document, an approach known as lexical processing. Instead, they operate at a slightly higher level of abstraction, called an infoset.

An infoset is a logical view of the XML document, rather than the document as stored in a text file. Most XML processors read (or generate) XML syntax but work internally on a representation that omits a lot of details. To take a brief example, from a lexical perspective, which looks at the actual contents of an XML document, <book id='b0836217462' available="true"/> is an empty-element tag containing two attributes named id and available. The value of id is delimited with single quotes, while the value of available is delimited with double quotes. Yet, from an infoset perspective, this isn't an empty-element tag with particular syntax; the kind of quotation marks doesn't matter. It's a book element with an attribute named id and a value of b0836217462, as well as an attribute named available with a value of true. Elements, attributes, and text are often referred to as nodes in this perspective, like nodes in an object tree.
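The difference between the lexical and infoset views can be seen with any ordinary XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the describe helper and the VARIANTS list are names made up for this illustration, and ElementTree's data model is only an approximation of an infoset, not the W3C definition.

```python
import xml.etree.ElementTree as ET

# Three lexical spellings of the same book element: mixed quotes,
# double quotes only, and a start-tag/end-tag pair instead of an
# empty-element tag.
VARIANTS = [
    "<book id='b0836217462' available=\"true\"/>",
    '<book id="b0836217462" available="true"/>',
    '<book id="b0836217462" available="true"></book>',
]


def describe(xml_text):
    """Parse one document and return its element name, attributes, and text."""
    elem = ET.fromstring(xml_text)
    return elem.tag, dict(elem.attrib), elem.text


views = [describe(variant) for variant in VARIANTS]
for view in views:
    print(view)
```

All three variants produce the same tuple: once the document is parsed, nothing records which quotation marks or which tag syntax the text file used.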

There are a variety of different models for XML documents—specifications such as the Simple API for XML (SAX), the Document Object Model (DOM), and XPath all have slightly different takes on what an infoset is. As a first step toward coordinating these perspectives, the W3C created a Recommendation: the XML Information Set (Infoset), which is available at http://www.w3.org/TR/xml-infoset/. The XML Infoset defines an abstract model of XML documents that uses a hierarchical structure described in terms generic and neutral enough to be acceptable for use with a diverse range of specifications.

	Tip
	The XML Information Set describes elements as "element information items," attributes as "attribute information items," and so on. For convenience, this book uses the RELAX NG convention—inspired by XPath—that refers to element nodes, attribute nodes, and so on, rather than information items.

Schema languages work at the level of the XML Infoset, and their main goal is to define constraints on a subset of the XML Infoset. Because they work at the XML Infoset level, they can't be used to express constraints on things that don't belong to the XML Infoset. Thus, details such as the order of attributes, their quotation style, or the number of spaces between them can't be constrained by schemas. In addition, RELAX NG, like most schema languages, won't let you define constraints on XML comments, processing instructions, or entity references. Schema languages focus on a core set of features: elements, attributes, and textual content.
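To make this concrete, here is a rough sketch, again using Python's xml.etree.ElementTree, that reduces a document to the core features a schema can constrain. The schema_view function and the two sample documents are hypothetical, written only for this illustration; it also relies on ElementTree discarding comments and processing instructions by default, which is a property of that parser and not of RELAX NG.

```python
import xml.etree.ElementTree as ET


def schema_view(elem):
    """Reduce an element to its name, attributes, text, and child elements."""
    return (
        elem.tag,
        dict(elem.attrib),          # a mapping, so attribute order is irrelevant
        (elem.text or "").strip(),  # textual content
        [schema_view(child) for child in elem],
    )


# DOC_A carries a comment, a processing instruction, extra spaces between
# attributes, and mixed quotation styles. DOC_B has none of these and lists
# the attributes in the opposite order.
DOC_A = (
    "<library>"
    "<!-- checked on Tuesday -->"
    "<book id='b0836217462'   available=\"true\"/>"
    "<?render mode='compact'?>"
    "</library>"
)
DOC_B = '<library><book available="true" id="b0836217462"/></library>'

view_a = schema_view(ET.fromstring(DOC_A))
view_b = schema_view(ET.fromstring(DOC_B))
print(view_a == view_b)  # prints True
```

Because both documents reduce to the same elements, attributes, and text, a schema working at this level has no way to accept one and reject the other.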

	Tip
	Some schema languages, notably W3C XML Schema (WXS) and Document Type Definitions (DTDs), also let you augment the infoset of a given instance document with additional information. Both WXS and DTDs let you specify default values for attributes. WXS also provides the ability to add type information (the Post-Schema Validation Infoset, or PSVI), while DTDs provide the opportunity to include entity definitions and ID information. While RELAX NG does use the infoset as a base, it doesn't perform these kinds of infoset augmentation.
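Infoset augmentation by a DTD can be observed with a short experiment. The sketch below assumes that the Expat-based parser behind Python's xml.etree.ElementTree applies attribute defaults declared in the internal DTD subset; the sample documents are invented for the illustration. It shows DTD behavior only; RELAX NG validation would leave the attributes as written.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares a default value for the
# "available" attribute of the book element.
WITH_DTD = (
    "<!DOCTYPE book ["
    '<!ATTLIST book available CDATA "true">'
    "]>"
    "<book id='b0836217462'/>"
)

# The same instance without any DTD.
WITHOUT_DTD = "<book id='b0836217462'/>"

augmented = dict(ET.fromstring(WITH_DTD).attrib)
plain = dict(ET.fromstring(WITHOUT_DTD).attrib)

print(augmented)
print(plain)
```

The book element is written identically in both documents, yet the parsed result differs: with the DTD present, the application sees an available attribute that never appeared in the instance.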