RELAX NG by Eric van der Vlist will be published by O'Reilly & Associates (ISBN: 0596004214)
You are welcome to use our annotation system to give your feedback.
Atoms that exactly match a character are the simplest atoms that can be used in a pattern facet. The other atoms that can be used in pattern facets are special characters, a wildcard that matches any character, or predefined and user-defined character classes.
Table 6-1 lists the atoms that match a single character, exactly like the characters we have already seen, but that correspond to characters that must be escaped or (for the first three entries in the list) are provided just for convenience.
Table 6-1. Special characters
\n | New line (can also be written as "&#x0A;", since we are in an XML document) |
\r | Carriage return (can also be written as "&#x0D;") |
\t | Tabulation (can also be written as "&#x09;") |
\\ | Character "\" |
\| | Character "|" |
\. | Character "." |
\- | Character "-" |
\^ | Character "^" |
\? | Character "?" |
\* | Character "*" |
\+ | Character "+" |
\{ | Character "{" |
\} | Character "}" |
\( | Character "(" |
\) | Character ")" |
\[ | Character "[" |
\] | Character "]" |
The character "." has a special meaning: it's a wildcard atom that matches any valid XML character except newlines and carriage returns. As with any atom, "." may be followed by an optional quantifier, and ".*" is a common construct to match zero or more occurrences of any character. To illustrate the usage of ".*" (and the fact that the pattern facet is a Swiss army knife), a pattern facet may be used to define the integers that are multiples of 10:
<define name="multipleOfTen"> <data type="integer"> <param name="pattern">.*0</param> </data> </define> |
Or:
multipleOfTen = xsd:integer {pattern = ".*0"} |
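The effect of this facet can be tried outside a RELAX NG processor. The following is only a minimal sketch in Python, not part of the schema: it assumes that `re.fullmatch` is an acceptable stand-in for a pattern facet (which must match the whole lexical value), and the helper name `ends_with_zero` is hypothetical. Python's "." is slightly more permissive than the W3C XML Schema wildcard (it accepts carriage returns), which does not matter for these inputs.

```python
import re

# Stand-in for the pattern facet ".*0".
PATTERN = re.compile(r".*0")


def ends_with_zero(value: str) -> bool:
    # A pattern facet applies to the whole value, hence fullmatch
    # rather than search or match.
    return PATTERN.fullmatch(value) is not None


for candidate in ["10", "-250", "0", "15", "105"]:
    print(candidate, ends_with_zero(candidate))
```

The sketch only tests the lexical form; checking that the value is an integer at all remains the job of the base datatype.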
W3C XML Schema has adopted the "classical" Perl and Unicode character classes (but not the POSIX-style character classes also available in Perl), and user-defined classes are also available.
W3C XML Schema supports the classical Perl character classes plus a couple of additions to match XML-specific productions. Each of these classes is designated by a single letter; the classes designated by the upper- and lowercase versions of the same letter are complementary:
\s | Spaces. Matches XML whitespace (space #x20, tab #x09, line feed #x0A, and carriage return #x0D). |
\S | Characters that are not spaces. |
\d | Digits ("0" to "9" but also digits in other alphabets). |
\D | Characters that are not digits. |
\w | Extended "word" characters (any Unicode character not defined as "punctuation", "separator," or "other"). This conforms to the Perl definition, assuming UTF-8 support has been switched on. |
\W | Nonword characters. |
\i | XML 1.0 initial name characters (i.e., all the "letters" and also "_" and ":"). This is a W3C XML Schema extension of Perl regular expressions. |
\I | Characters that may not be used as an XML initial name character. |
\c | XML 1.0 name characters (initial name characters, digits, ".", ":", "-", and the characters defined by Unicode as "combining" or "extender"). This is a W3C XML Schema extension of Perl regular expressions. |
\C | Characters that may not be used in an XML 1.0 name. |
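The definition of "word" characters given above can be written down directly in terms of Unicode general categories. The following Python sketch is an illustration only, with a hypothetical helper name, and assumes that the standard `unicodedata` module reflects the same categories as the schema processor. It does not use Python's own word class, which differs (for instance, it accepts the underscore).

```python
import unicodedata


def is_xsd_word_char(ch: str) -> bool:
    # A "word" character is any character whose general category is
    # not punctuation (P*), separator (Z*), or other (C*).
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")


for ch in ["a", "5", "+", "_", "-", " "]:
    print(repr(ch), unicodedata.category(ch), is_xsd_word_char(ch))
```

Under this definition, "+" (a mathematical symbol) counts as a word character, while "_" and "-" (both punctuation) do not.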
These character classes may be used with an optional quantifier like any other atom. The last pattern facet that we saw:
multipleOfTen = xsd:integer {pattern = ".*0"} |
constrains the lexical space to be a string of characters ending with a zero. Knowing that the base type is xsd:integer, this is good enough for our purposes, but if the base type had been xsd:decimal (or xsd:string), we could be more restrictive and write:
multipleOfTen = xsd:integer {pattern = "-?\d*0"} |
This checks that the characters before the trailing zero are digits with an optional leading - (we will see later on in Section 6.5.2.2 how to specify an optional leading - or +).
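The difference between the two patterns shows up on values that end with a zero without being made of digits. The sketch below compares them in Python, again assuming that `re.fullmatch` approximates the whole-value matching of a pattern facet; the names `LOOSE`, `STRICT`, and `check` are hypothetical.

```python
import re

LOOSE = re.compile(r".*0")      # any characters, then a trailing zero
STRICT = re.compile(r"-?\d*0")  # optional "-", digits, then a trailing zero


def check(value: str) -> tuple[bool, bool]:
    # Returns (accepted by the loose pattern, accepted by the strict one).
    return (
        LOOSE.fullmatch(value) is not None,
        STRICT.fullmatch(value) is not None,
    )


for candidate in ["-120", "0", "1.0", "abc0", "+10", "125"]:
    print(candidate, check(candidate))
```

Values such as "1.0" or "abc0" pass the loose pattern and are rejected only by the stricter one.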
Patterns support character classes matching both Unicode categories and blocks. Categories and blocks are two complementary classification systems: categories classify the characters by their usage independently of their localization (letters, uppercase, digit, punctuation, etc.), while blocks classify characters by their localization independently of their usage (Latin, Arabic, Hebrew, Tibetan, and even Gothic or musical symbols).
The syntax \p{Name} is similar for blocks and categories; the prefix Is is added to the name of blocks to make the distinction (for instance, \p{Lu} is a category and \p{IsBasicLatin} is a block). The syntax \P{Name} is also available to select the characters that do not match a block or category. A list of Unicode blocks and categories is given in the specification. Table 6-2 shows the Unicode character classes and Table 6-3 shows the Unicode character blocks.
Table 6-2. Unicode character classes
Unicode Character Class | Includes |
---|---|
C | Other characters (non-letters, non-symbols, non-numbers, non-separators) |
Cc | Control characters |
Cf | Format characters |
Cn | Unassigned code points |
Co | Private use characters |
L | Letters |
Ll | Lowercase letters |
Lm | Modifier letters |
Lo | Other letters |
Lt | Titlecase letters |
Lu | Uppercase letters |
M | All Marks |
Mc | Spacing combining marks |
Me | Enclosing marks |
Mn | Non-spacing marks |
N | Numbers |
Nd | Decimal digits |
Nl | Number letters |
No | Other numbers |
P | Punctuation |
Pc | Connector punctuation |
Pd | Dashes |
Pe | Closing punctuation |
Pf | Final quotes (may behave like Ps or Pe) |
Pi | Initial quotes (may behave like Ps or Pe) |
Po | Other forms of punctuation |
Ps | Opening punctuation |
S | Symbols |
Sc | Currency symbols |
Sk | Modifier symbols |
Sm | Mathematical symbols |
So | Other symbols |
Z | Separators |
Zl | Line breaks |
Zp | Paragraph breaks |
Zs | Spaces |
Table 6-3. Unicode character blocks
AlphabeticPresentationForms | Arabic | ArabicPresentationForms-A |
ArabicPresentationForms-B | Armenian | Arrows |
BasicLatin | Bengali | BlockElements |
Bopomofo | BopomofoExtended | BoxDrawing |
BraillePatterns | ByzantineMusicalSymbols | Cherokee |
CJKCompatibility | CJKCompatibilityForms | CJKCompatibilityIdeographs |
CJKCompatibilityIdeographsSupplement | CJKRadicalsSupplement | CJKSymbolsandPunctuation |
CJKUnifiedIdeographs | CJKUnifiedIdeographsExtensionA | CJKUnifiedIdeographsExtensionB |
CombiningDiacriticalMarks | CombiningHalfMarks | CombiningMarksforSymbols |
ControlPictures | CurrencySymbols | Cyrillic |
Deseret | Devanagari | Dingbats |
EnclosedAlphanumerics | EnclosedCJKLettersandMonths | Ethiopic |
GeneralPunctuation | GeometricShapes | Georgian |
Gothic | Greek | GreekExtended |
Gujarati | Gurmukhi | HalfwidthandFullwidthForms |
HangulCompatibilityJamo | HangulJamo | HangulSyllables |
Hebrew | HighPrivateUseSurrogates | HighSurrogates |
Hiragana | IdeographicDescriptionCharacters | IPAExtensions |
Kanbun | KangxiRadicals | Kannada |
Katakana | Khmer | Lao |
Latin-1Supplement | LatinExtended-A | LatinExtendedAdditional |
LatinExtended-B | LetterlikeSymbols | LowSurrogates |
Malayalam | MathematicalAlphanumericSymbols | MathematicalOperators |
MiscellaneousSymbols | MiscellaneousTechnical | Mongolian |
MusicalSymbols | Myanmar | NumberForms |
Ogham | OldItalic | OpticalCharacterRecognition |
Oriya | PrivateUse | PrivateUse |
PrivateUse | Runic | Sinhala |
SmallFormVariants | SpacingModifierLetters | Specials |
Specials | SuperscriptsandSubscripts | Syriac |
Tags | Tamil | Telugu |
Thaana | Thai | Tibetan |
UnifiedCanadianAboriginalSyllabics | YiRadicals | YiSyllables |
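To make the distinction between the two classifications concrete, here is a small Python sketch that tests a character against a category (as \p{Lu} would) and against a block (as \p{IsGreek} would). It is an illustration under assumptions: Python's standard `re` module has no \p{...} support, so the test is done with the `unicodedata` module, and the `BLOCKS` dictionary is a hand-written, hypothetical subset of the table above with ranges taken from the Unicode standard.

```python
import unicodedata

# Hypothetical subset of the block table (inclusive code point ranges).
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "Greek": (0x0370, 0x03FF),
}


def in_category(ch: str, name: str) -> bool:
    # A one-letter name such as "L" covers all its subcategories
    # ("Ll", "Lu", ...), so a prefix test is enough.
    return unicodedata.category(ch).startswith(name)


def in_block(ch: str, name: str) -> bool:
    low, high = BLOCKS[name]
    return low <= ord(ch) <= high


for ch in ["E", "\u00c9", "\u03b1", "7"]:
    blocks = [name for name in BLOCKS if in_block(ch, name)]
    print(ch, unicodedata.category(ch), blocks)
```

The uppercase "É" and the plain "E" share the category Lu but live in different blocks, which is exactly the kind of independence between usage and localization described above.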
We will see later that W3C XML Schema has introduced an extension to regular expressions, the difference between character classes, which can be used to express intersections. This extension can be used to define the intersection between a block and a category in a single pattern facet.
Note | |
---|---|
Although Unicode blocks seem to be a great way to restrict text to a set of characters that you know you'll be able to print, display, read, or store in a database, they were not designed for this purpose, and one must be careful when using them this way. John Cowan, who has taught courses on Unicode and enjoys obscure alphabets, wrote about this topic:
The five Latin blocks mentioned by John are BasicLatin, Latin-1Supplement, LatinExtended-A, LatinExtendedAdditional and LatinExtended-B. |
User-defined character classes are lists of characters between square brackets that accept - signs to define ranges and a leading ^ to negate the whole list. For instance:
[azertyuiop] |
to define the list of letters on the first row of a French keyboard,
[a-z] |
to specify all the characters between "a" and "z",
[^a-z] |
for all the characters that are not between "a" and "z," but also
[-^\\] |
to define the characters "-," "^," and "\," or
[-+] |
to specify the sign of a number ("-" or "+").
These examples demonstrate that the contents of these square brackets follow a specific syntax and semantics. As in the main syntax of regular expressions, we have a list of atoms, but instead of matching each atom against a character of the instance string, we define a logical space: a set of characters, any one of which may match. Bracket expressions thus sit somewhere between simple atoms and the more formal character classes.
We also see two special characters whose meaning depends on their location. The character -, which is a range delimiter when it is between a and z, is a normal character when it is just after the opening bracket or just before the closing bracket ([+-] and [-+] are, therefore, both legal). Conversely, ^, which is a negator when it appears at the beginning of a class, loses this special meaning and becomes a normal character later in the class definition.
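The bracket expressions listed above can be tried one character at a time. The Python sketch below is an approximation only: Python's `re` module happens to share this part of the syntax, but it is not the engine used by a RELAX NG processor, and the helper `class_matches` is hypothetical.

```python
import re


def class_matches(char_class: str, ch: str) -> bool:
    # True if the single character ch belongs to the bracket expression.
    return re.fullmatch(char_class, ch) is not None


examples = [
    (r"[azertyuiop]", "z"),  # listed explicitly
    (r"[a-z]", "q"),         # inside the range
    (r"[^a-z]", "Q"),        # outside the negated range
    (r"[-^\\]", "^"),        # "^" is an ordinary character here
    (r"[-+]", "-"),          # leading "-" is an ordinary character
    (r"[+-]", "-"),          # so is a trailing "-"
]
for char_class, ch in examples:
    print(char_class, repr(ch), class_matches(char_class, ch))
```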
Note | |
---|---|
Even though this is specified as valid by the W3C XML Schema recommendation, it is not supported by all the regular expression engines used by RELAX NG processors. The current version of the Jing parser (as I write these lines) supports neither [+-] nor [-+], and it is wiser to escape the character - and write either [+\-] or [\-+]. Another frequent source of confusion is the escape format #xXX (such as #x2D). Because this format is used in the W3C XML Schema recommendation to describe characters by their Unicode value, some people have thought that it could be used in regular expressions, but it is not meant to be used that way. If you want to define characters by their Unicode values, you should use numeric character references instead (such as &#x2D; if you are using the XML syntax) or, in the compact syntax, the escape sequence \x{2D}. Note that in both cases, the reference will be replaced by the corresponding character at parse time and that the regular expression engine will see the actual character instead of the escape sequence. |
Also, some characters may or must be escaped: "\\" is used to match the character "\". In fact, in a class definition, all the escape sequences that we have seen as atoms can be used. Even though some of the special characters lose their special meaning inside square brackets, they can always be escaped. So, the following:
[\-^\\] |
can also be written as:
[\-\^\\] |
or as:
[\^\\\-] |
since the location of the characters doesn't matter any longer when they are escaped.
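That the three spellings describe the same set can be checked mechanically. The following sketch does so in Python, under the assumption that Python's handling of escaped punctuation inside brackets matches the behaviour described here; `VARIANTS` and `accepted` are hypothetical names.

```python
import re

# The three spellings of the same class.
VARIANTS = [r"[\-^\\]", r"[\-\^\\]", r"[\^\\\-]"]


def accepted(char_class: str, alphabet: str = "-^\\ab+") -> set[str]:
    # Collect the characters of a small test alphabet that the class accepts.
    return {ch for ch in alphabet if re.fullmatch(char_class, ch)}


for variant in VARIANTS:
    print(variant, sorted(accepted(variant)))
```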
Within square brackets, the character "\" also keeps its meaning of a reference to a Perl or Unicode class. The following:
[\d\p{Lu}] |
is a set of decimal digits (Perl class \d) and uppercase letters (Unicode category "Lu").
Mathematicians have found that three basic operations are needed to manipulate sets and that these operations can be chosen from a larger set of operations. In our square brackets, we've already seen two of these operations: union (the square bracket is an implicit union of its atoms) and complement (a leading ^ realizes the complement of the set defined in the square bracket). W3C XML Schema extends the syntax of Perl regular expressions to introduce a third operation: the difference between sets. The syntax follows:
[set1-[set2]] |
It means all the characters in set1 that do not belong to set2, where set1 and set2 can use all the syntactic tricks that we have seen up to now.
This operator can be used to perform intersections of character classes (the intersection between two sets A and B is the difference between A and the complement of B), and we can now define a class for the BasicLatin Letters as:
[\p{IsBasicLatin}-[^\p{L}]] |
Or, using the \P construct, which is also a complement, we can define the class as:
[\p{IsBasicLatin}-[\P{L}]] |
The corresponding definition would be:
<define name="BasicLatinLetters"> <data type="token"> <param name="pattern">[\p{IsBasicLatin}-[\P{L}]]*</param> </data> </define> |
Or:
BasicLatinLetters = xsd:token {pattern = "[\p{IsBasicLatin}-[\P{L}]]*"} |
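The identity used here (the intersection of A and B is A minus the complement of B) can be verified on actual sets of characters. Python's standard `re` module supports neither \p{...} nor class subtraction, so the sketch below emulates them with plain sets and the `unicodedata` module; all names are hypothetical, and the BasicLatin range (U+0000 to U+007F) is taken from the Unicode standard.

```python
import unicodedata

# \p{IsBasicLatin}: every character of the BasicLatin block.
BASIC_LATIN = {chr(cp) for cp in range(0x0000, 0x0080)}


def is_letter(ch: str) -> bool:
    # \p{L}: any category starting with "L".
    return unicodedata.category(ch).startswith("L")


# \P{L}, restricted to the block, since only those characters can be removed.
NOT_LETTERS = {ch for ch in BASIC_LATIN if not is_letter(ch)}

# [\p{IsBasicLatin}-[\P{L}]]: set difference.
basic_latin_letters = BASIC_LATIN - NOT_LETTERS

# The same set written directly as an intersection.
intersection = {ch for ch in BASIC_LATIN if is_letter(ch)}


def matches(value: str) -> bool:
    # Emulates the trailing "*": zero or more characters from the class.
    return all(ch in basic_latin_letters for ch in value)


print(basic_latin_letters == intersection)
print("".join(sorted(basic_latin_letters)))
```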
We already used an "or" in our first example pattern facet where we wrote "1|5|15" to say that we wanted to allow either "1", "5", or "15".
These "or"s are especially interesting when used in conjunction with groups. Groups are complete regular expressions, which are, themselves, considered atoms and can be used with an optional quantifier to form more complete (and complex) regular expressions. Groups are enclosed by parentheses ("(" and ")"). To define a comma-separated list of "1", "5", or "15", ignoring whitespace between values and commas, the following pattern facet could be used:
<define name="myListOfBytes"> <data type="token"> <param name="pattern">(1|5|15)( *, *(1|5|15))*</param> </data> </define> |
Or:
myListOfBytes = xsd:token {pattern = "(1|5|15)( *, *(1|5|15))*"} |
Note how we have relied on the whitespace processing of the base datatype (xsd:token collapses whitespace). We do not need to worry about leading and trailing whitespace, which is trimmed, and because runs of whitespace are collapsed, we only need to allow spaces, with " *", before and after each comma.
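The interplay between whitespace processing and the pattern can be sketched in two steps: normalize the value first, then match the whole result. The Python code below is an illustration under assumptions: `collapse` is a hypothetical approximation of the whitespace processing of xsd:token, and `re.fullmatch` stands in for the pattern facet.

```python
import re

LIST_PATTERN = re.compile(r"(1|5|15)( *, *(1|5|15))*")


def collapse(value: str) -> str:
    # Approximation of xsd:token processing: runs of space, tab, line feed,
    # and carriage return become one space; leading and trailing spaces go.
    return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")


def is_valid_list(value: str) -> bool:
    # The pattern is applied to the normalized value, as a whole.
    return LIST_PATTERN.fullmatch(collapse(value)) is not None


for candidate in ["1,5,15", "  1 ,\n 5 ,15  ", "15", "1,,5", "1 5", "1, 7"]:
    print(repr(candidate), is_valid_list(candidate))
```

Because the alternation is tried as a whole group, "15" is accepted as a single item and not as "1" followed by a stray "5".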
All text is copyright Eric van der Vlist, Dyomedea. During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation GFDL.