Regular Expressions
The Regular Expression engine provided by ParserObjects has fewer features than an ordinary Regular Expression engine, such as the one provided by the System.Text.RegularExpressions.Regex class or the popular PCRE library. However, it should be useful for many common parsing needs.
An atom is a piece of input which can be matched. A quantifier modifies the atom to repeat a certain number of times.
Basic Atoms
Literal Characters
Any literal character such as 'a' or '5' will be matched literally. The exceptions to this rule are the special characters, which need to be escaped with '\'. The list of special characters includes:
\ ( ) $ | [ ] . ? + * { }
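As a small illustration of escaping, the sketch below matches a parenthesized sum literally. The two `using` directives and the `Success` property are not shown on this page; they are assumptions about the ParserObjects API and may need adjusting for your version.

```csharp
using System;
using ParserObjects;                 // assumed: namespace providing the Parse(string) extension
using static ParserObjects.Parsers;  // assumed: static class exposing Regex()

// '(', '+' and ')' are special characters, so each one is escaped with '\'.
// The verbatim string (@"...") stops C# from interpreting the backslashes itself.
var escapedParser = Regex(@"\(a\+b\)");
var escapedResult = escapedParser.Parse("(a+b)");

// Success is assumed to report whether the pattern matched.
Console.WriteLine(escapedResult.Success);
Console.WriteLine(escapedResult.Value); // the matched text
```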
Non-printing characters
Character atoms may be defined using normal C# character escape codes in strings, for cases where you need to match something which you cannot type with your keyboard:
var parser = Regex("[\x01-\x1F]+");
Wildcards
The character '.' will match any single input character. It will fail at end of input.
Character Classes
The [ ] brackets define a character class. A character class matches any single character which fits in that class.
[abc] literal characters. Matches 'a', 'b' or 'c'
[a-cA-C] character ranges. Matches 'a', 'b', 'c', 'A', 'B' or 'C'
[^abc] inverted classes and [^a-c] inverted ranges. These match any single character which is not in the class or range.
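The three kinds of class can be combined in one pattern. The following is a minimal sketch; the `using` directives and the `Success` property are assumptions about the ParserObjects API rather than something this page states.

```csharp
using System;
using ParserObjects;                 // assumed: namespace providing the Parse(string) extension
using static ParserObjects.Parsers;  // assumed: static class exposing Regex()

// One character from a-c, one character NOT in a-c, one character from A-C.
var classParser = Regex("[a-c][^a-c][A-C]");

// 'b' is in a-c, 'x' is outside a-c, 'A' is in A-C.
var classHit = classParser.Parse("bxA");

// The second 'b' is inside a-c, so the inverted class rejects it.
var classMiss = classParser.Parse("bbA");

Console.WriteLine(classHit.Success);
Console.WriteLine(classMiss.Success);
```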
Built-in Character Classes
Some common character classes are already pre-defined with special names:
\s matches any whitespace (space, tab, newline, etc.)
\S matches any non-whitespace (letters, digits, symbols, etc.)
\w matches any word character (letters and digits)
\W matches any non-word character (symbols, whitespace, etc.)
\d matches any digit. Same as [0-9]
\D matches any non-digit. Same as [^0-9]
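A short sketch of the built-in classes in use, matching a number, a single whitespace character and a word. The `using` directives and the `Success` property are assumed, not taken from this page.

```csharp
using System;
using ParserObjects;                 // assumed: namespace providing the Parse(string) extension
using static ParserObjects.Parsers;  // assumed: static class exposing Regex()

// One or more digits, one whitespace character, one or more word characters.
// A verbatim string keeps the backslashes intact for the regex engine.
var builtinParser = Regex(@"\d+\s\w+");
var builtinResult = builtinParser.Parse("42 apples");

// \D accepts anything except a digit, so a lone digit is rejected.
var nonDigitResult = Regex(@"\D").Parse("7");

Console.WriteLine(builtinResult.Success);
Console.WriteLine(nonDigitResult.Success);
```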
Quantifiers
Quantifiers go after an atom to tell the engine how many times to match that atom.
Basic Quantifiers
Quantifiers include:
'?' match the atom zero or one times. In other words, the atom is optional.
'*' match the atom zero or more times.
'+' match the atom one or more times.
For example, the pattern "ab?a" will match "aa" or "aba" because the 'b' is optional.
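The same example, plus '*' and '+', can be sketched in code as follows. The `using` directives and the `Success` property are assumptions about the ParserObjects API.

```csharp
using System;
using ParserObjects;                 // assumed: namespace providing the Parse(string) extension
using static ParserObjects.Parsers;  // assumed: static class exposing Regex()

// '?' makes the 'b' optional, so both inputs satisfy the pattern.
var optionalParser = Regex("ab?a");
var withoutB = optionalParser.Parse("aa");
var withB = optionalParser.Parse("aba");

// '*' accepts zero copies of 'b'; '+' requires at least one.
var starResult = Regex("ab*").Parse("a");
var plusResult = Regex("ab+").Parse("a");

Console.WriteLine(withoutB.Success);
Console.WriteLine(withB.Success);
Console.WriteLine(starResult.Success);
Console.WriteLine(plusResult.Success);
```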
Range Quantifiers
The { } braces define a range. The range may have one or two numbers, separated by a comma, to define how many copies are expected:
{3} exactly three
{3,} at least three
{,5} at most five
{3,5} between three and five, inclusive
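A minimal sketch of range quantifiers, using hypothetical inputs. The `using` directives and the `Success` property are assumed rather than documented on this page.

```csharp
using System;
using ParserObjects;                 // assumed: namespace providing the Parse(string) extension
using static ParserObjects.Parsers;  // assumed: static class exposing Regex()

// {2,3} accepts two or three copies. Matching is greedy, so three of the
// four available 'a' characters are consumed and the fourth is left over.
var boundedResult = Regex("a{2,3}").Parse("aaaa");

// {3} demands exactly three copies, and the input only has two.
var exactResult = Regex("a{3}").Parse("aa");

Console.WriteLine(boundedResult.Success);
Console.WriteLine(boundedResult.Value);
Console.WriteLine(exactResult.Success);
```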
Groups
Capturing Groups
The ( ) parentheses define a capturing group. A capturing group matches whatever is inside the parentheses and captures its value in the output.
You can access the captured groups in the RegexMatch result object from the RegexMatch() parser:
var parser = RegexMatch("a(..)d");
var result = parser.Parse("abcd");
var captured = result.Value[1][0]; // "bc"
Notice that result.Value[0][0] is always the total overall match. Groups are numbered, starting with 1, from left to right according to the position of the opening parenthesis. Groups may be nested:
var parser = RegexMatch("a(.(..).)f");
var result = parser.Parse("abcdef");
var capture0 = result.Value[0][0]; // "abcdef"
var capture1 = result.Value[1][0]; // "bcde"
var capture2 = result.Value[2][0]; // "cd"
If a group is quantified, every instance of a match will be captured:
var parser = RegexMatch("a(..)*z");
var result = parser.Parse("abcdefgz");
var capture1a = result.Value[1][0]; // "bc"
var capture1b = result.Value[1][1]; // "de"
var capture1c = result.Value[1][2]; // "fg"
Back-references
The syntax \1, or any other digit from 1 to 9, defines a back-reference. A back-reference matches a value previously captured by the numbered group.
var parser = RegexMatch(@"a(..)d\1");
var result = parser.Parse("abcdbc");
var captured = result.Value[1][0]; // "bc"
Non-Capturing Cloisters
A non-capturing cloister, or non-capturing group, is defined with (?: ). These operate the same as capturing groups except they do not capture the results and cannot be used with back-references. These are useful when you want to apply a quantifier to an entire group but do not need to capture the value.
var parser = RegexMatch("a(?:..)?d");
var result = parser.Parse("abcd");
var numberOfGroups = result.Value.Count; // Just 1, for the overall match
Note: It is a modest performance improvement to use a non-capturing group if you do not need to capture the value.
Zero-Width Lookahead
The regex engine can look ahead and match additional characters without including them as part of the match. The positive lookahead (?= ) succeeds if the pattern matches, but does not include the matched value in the overall match. The negative lookahead (?! ) succeeds if the pattern does not match, and does not include any values in the match.
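// Positive lookahead: 'a' must be followed by 'b'
var positive = Regex("a(?=b)");
var positiveResult = positive.Parse("ab");
var positiveValue = positiveResult.Value; // "a"
// Negative lookahead: 'a' must not be followed by 'b'
var negative = Regex("a(?!b)");
var negativeResult = negative.Parse("ac");
var negativeValue = negativeResult.Value; // "a"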
Alternations
The '|' character can be used to make alternations, which allow you to check multiple patterns at the current location and return a successful match as soon as the first one succeeds.
var parser = Regex("a|b|c");
var result = parser.Parse("b");
var captured = result.Value; // "b"
Anchors
End anchor
The '$' character is the end anchor. It only matches true at the end of the input sequence and consumes no input. This is useful to tell the engine that you want the match to include all input items, or fail.
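The effect of the end anchor can be sketched by running the same pattern with and without it. The `using` directives and the `Success` property are assumptions about the ParserObjects API.

```csharp
using System;
using ParserObjects;                 // assumed: namespace providing the Parse(string) extension
using static ParserObjects.Parsers;  // assumed: static class exposing Regex()

// Without the anchor, the pattern matches the prefix and leaves 'c' unread.
var prefixResult = Regex("ab").Parse("abc");

// With the anchor, leftover input after "ab" makes the match fail.
var anchoredMiss = Regex("ab$").Parse("abc");

// When "ab" is the whole input, the anchored pattern succeeds.
var anchoredHit = Regex("ab$").Parse("ab");

Console.WriteLine(prefixResult.Success);
Console.WriteLine(anchoredMiss.Success);
Console.WriteLine(anchoredHit.Success);
```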
Parsers
There are two parsers for regex. Regex() returns a string result, though it does include the RegexMatch object in the result data:
var parser = Regex("abc");
var result = parser.Parse("abc");
var matchObject = result.Data<RegexMatch>();
var overallMatchString = result.Value;
The RegexMatch() parser returns the RegexMatch object as the result.
var parser = RegexMatch("abc");
var result = parser.Parse("abc");
var matchObject = result.Value;
var overallMatchString = result.Value[0][0];
Which of the two parsers to use depends on what you want to do with the result.
Parser Recursing
The Regex parser can call out to other existing parsers by name in order to make complicated matches. The (?{name}) syntax allows you to specify the name of the parser to invoke. You can pass in a list of parsers that you want to be available to the regex engine, and the engine can invoke them by name (you MUST name your parser, or the regex engine won’t be able to find it).
var a = MatchChar('a').Named("a");
var b = MatchChar('b').Named("b");
var regex = Regex("(?{a})(?{b})c", a, b);
// Success!
var result = regex.Parse("abc");
Behaviors
The Regex parser always attempts to match the pattern starting at the current position of the input sequence. It will not skip over input characters looking for a match which occurs later. Because of this, the Regex parser does not support the beginning of input anchor (^) in patterns. It also does not support the \b word-boundary because it cannot look backwards in the input sequence to determine if the previous character was a word or non-word character.
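The anchoring to the current position can be seen in a small sketch like the one below. The `using` directives and the `Success` property are assumed, not stated on this page.

```csharp
using System;
using ParserObjects;                 // assumed: namespace providing the Parse(string) extension
using static ParserObjects.Parsers;  // assumed: static class exposing Regex()

// The input starts with 'a'. The engine does not skip ahead to find the 'b'
// at the second position, so this attempt fails.
var skippedResult = Regex("b").Parse("ab");

// The same pattern succeeds when 'b' is at the current position.
var immediateResult = Regex("b").Parse("ba");

Console.WriteLine(skippedResult.Success);
Console.WriteLine(immediateResult.Success);
```

A general-purpose engine that searches for a match would find the 'b' in "ab"; here the pattern behaves as if it were always anchored at the start.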
The Regex parser is greedy. It will attempt to match as much input as it can and return the first match which satisfies the entire pattern.
The regex parser does backtrack. This means quantified atoms will attempt to match as much as possible and, if subsequent matches fail, the quantified atoms will “return” some matched inputs and try again. This can lead to pathological behavior where multiple quantified atoms are forced to try large numbers of permutations before a successful match can be found. This page will not attempt to explain the variety of cases where a regex may be driven to pathological behavior.
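A benign case of backtracking, sketched under the assumption that the `using` directives and the `Success` property below match your version of ParserObjects:

```csharp
using System;
using ParserObjects;                 // assumed: namespace providing the Parse(string) extension
using static ParserObjects.Parsers;  // assumed: static class exposing Regex()

// Step 1: "a*" greedily consumes all three 'a' characters.
// Step 2: the literal 'a' that follows finds 'b' instead and fails.
// Step 3: "a*" returns one 'a', after which the literal 'a' and 'b' both match.
var backtrackResult = Regex("a*ab").Parse("aaab");

Console.WriteLine(backtrackResult.Success);
Console.WriteLine(backtrackResult.Value);
```

Only one character has to be returned here, so the cost is negligible. The cost grows when several quantified atoms compete for the same input.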