Documents and Infosets

RELAX NG by Eric van der Vlist will be published by O'Reilly & Associates (ISBN: 0596004214)


RELAX NG is an XML-based technology. RELAX NG schemas are commonly stored in XML documents (called schema documents) and used to validate other XML documents (called instance documents). Although RELAX NG works with XML documents, RELAX NG processors do not process the actual text of those documents, an approach known as lexical processing. Instead, they operate at a slightly higher level of abstraction, called an infoset.

An infoset is a logical view of the XML document rather than the document as stored in a text file. Most XML processors read (or generate) XML syntax, but work internally on a representation that omits many lexical details. To take a brief example, from a lexical perspective, looking at the actual contents of the XML document, <book id='b0836217462' available="true"/> is an empty-element tag containing two attributes named id and available. The value of id is delimited with single quotes, while the value of available is delimited with double quotes. From an infoset perspective, however, this isn't an empty-element tag with a particular syntax, and the kind of quotation marks doesn't matter. It's a book element with two attributes: one named id with a value of b0836217462, and one named available with a value of true. Elements, attributes, and text are often referred to as nodes in this perspective, like nodes in an object tree.
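The difference between the lexical and the logical view can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's xml.etree.ElementTree as a rough stand-in for an infoset, and the helper name logical_view and the second serialization doc_b are assumptions introduced for the example, not anything defined by RELAX NG or the XML Infoset.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only lexically: quote style, attribute
# order, spacing between attributes, and empty-element syntax.
doc_a = """<book id='b0836217462' available="true"/>"""
doc_b = """<book   available='true'
      id="b0836217462"></book>"""


def logical_view(xml_text):
    """Return (name, attributes, text) as a rough stand-in for the infoset."""
    element = ET.fromstring(xml_text)
    # A dict compares equal regardless of the order in which keys were added,
    # which mirrors the fact that attribute order is not significant.
    return (element.tag, dict(element.attrib), element.text or "")


view_a = logical_view(doc_a)
view_b = logical_view(doc_b)
same_logical_view = view_a == view_b

print(view_a)
print(same_logical_view)
```

Both strings reduce to the same book element with the same two attributes, so nothing that works on this view can tell the two serializations apart.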

There are a variety of different models for XML documents: specifications like the Simple API for XML (SAX), the Document Object Model (DOM), and XPath all have slightly different takes on what an infoset is. As a first step toward coordinating these perspectives, the W3C created a Recommendation, the XML Information Set (Infoset), available at http://www.w3.org/TR/xml-infoset/. The XML Infoset defines an abstract, hierarchical model of XML documents, described in terms generic and neutral enough to be acceptable for use with a diverse range of specifications.

	Note
	The XML Information Set describes elements as "element information items", attributes as "attribute information items", and so on. For convenience, this book will use the XPath convention of referring to element nodes, attribute nodes, and so on, rather than information items.

Schema languages work at the level of the XML Infoset, and their main goal is to define constraints on a subset of the XML Infoset. Because they work at the Infoset level, they can't be used to express constraints on things that do not belong to the XML Infoset. Thus things like the order of the attributes, their quotation style, or the number of spaces between them cannot be constrained by schemas. In addition, most schema languages won't let you define constraints on XML comments, processing instructions, or entity references. Schema languages focus on a core set of features: elements, attributes, and textual content.
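As a rough illustration of that core set of features, the following sketch reduces two documents to elements, attributes, and text only. It relies on the default behavior of Python's xml.etree.ElementTree, which discards comments and processing instructions and resolves character and predefined entity references while parsing. The helper core_view, the title element, and its content are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET


def core_view(element):
    """Reduce an element to (name, sorted attributes, text, children, tail)."""
    return (
        element.tag,
        sorted(element.attrib.items()),
        element.text or "",
        [core_view(child) for child in element],
        element.tail or "",
    )


# Same logical content, decorated with comments, a processing instruction,
# extra spaces between attributes, mixed quote styles, and an entity reference.
decorated = (
    '<?xml version="1.0"?>'
    "<!-- a comment before the root element -->"
    "<book available=\"true\"   id='b0836217462'>"
    "<?render hint?>"
    "<title>Cats &amp; Dogs</title>"
    "<!-- a comment inside the element -->"
    "</book>"
)

# The bare version uses a character reference instead of &amp;.
bare = '<book id="b0836217462" available="true"><title>Cats &#38; Dogs</title></book>'

decorated_view = core_view(ET.fromstring(decorated))
bare_view = core_view(ET.fromstring(bare))

print(decorated_view == bare_view)
```

A schema that sees only this reduced view has nothing to attach a constraint to when it comes to the comments, the processing instruction, or the choice between &amp; and &#38;.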

	Note
	Some schema languages, notably W3C XML Schema (WXS) and Document Type Definitions (DTDs), also let you 'augment' the infoset of a given instance document with additional information. Both WXS and DTDs let you specify default values for attributes. WXS also provides the ability to add type information (the Post-Schema Validation Infoset, or PSVI), while DTDs provide the opportunity to include entity definitions and ID information. While RELAX NG does use the Infoset as a base, it doesn't perform these kinds of infoset augmentation.
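Attribute defaulting by a DTD is probably the easiest kind of augmentation to observe. The sketch below assumes that the expat parser underneath Python's xml.etree.ElementTree applies defaults declared in the internal DTD subset, which is its default behavior as far as I know. The ATTLIST declaration is a hypothetical one written for this example.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares a default value for the "available"
# attribute; the instance itself only specifies "id".
with_dtd = """<!DOCTYPE book [
  <!ATTLIST book available CDATA "true">
]>
<book id="b0836217462"/>"""

# The same instance without any DTD.
without_dtd = """<book id="b0836217462"/>"""

augmented = dict(ET.fromstring(with_dtd).attrib)
unaugmented = dict(ET.fromstring(without_dtd).attrib)

print(augmented)
print(unaugmented)
```

The element in the first document gains an available attribute that was never written in the instance, while the second one does not. A RELAX NG processor validating the bare instance would see only the id attribute, because RELAX NG itself adds nothing to the infoset.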

All text is copyright Eric van der Vlist, Dyomedea. During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation GFDL.
