Programming Parsers
ParserObjects contains parsers for some common constructs from modern programming languages. Notice that several languages “borrow” or “inherit” similar rules from related languages. In these cases, parsers will not typically be reproduced for each language. A C++-style identifier is identical to a C-style identifier, for example, so the methods will not be duplicated. These parsers are provided for ease and convenience only, and do not guarantee completeness or correctness across all possible forms or versions.
C Parsers
The C-Style parsers are used to parse some common constructs from C and programming languages derived from it.
using static ParserObjects.Parsers.C;
Comments
A C-style comment starts with /* and ends with */ and can span multiple lines. This is also known as a “multi-line comment” in C, C++, C# and other similar languages.
var parser = Comment();
Hexadecimal Numbers
A C-style hexadecimal string has an optional negative sign (-), starts with the prefix 0x and is followed by 1 to 8 hexadecimal digits. The HexadecimalString parser returns the matched string, while the HexadecimalInteger parser parses the string and returns an int.
var parser = HexadecimalString();
var parser = HexadecimalInteger();
Values can be in the range -0x80000000 to 0x7FFFFFFF inclusive. Without the negative sign, the range is 0x80000000 to 0x7FFFFFFF. The hex values are parsed as an unsigned integer, then cast to an int, then the negative sign is applied, if any.
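The conversion steps described above can be sketched with plain .NET calls. This is a minimal illustration, not the library's implementation: the helper name HexToInt is hypothetical, and it assumes the cast from unsigned to signed is unchecked, so values of 0x80000000 and above wrap around to negative ints.

```csharp
using System;
using System.Globalization;

// Hypothetical helper mirroring the described steps:
// parse as unsigned, cast to int, then apply the negative sign.
static int HexToInt(string text)
{
    bool negative = text.Length > 0 && text[0] == '-';
    string digits = text.Substring(negative ? 3 : 2); // skip "-0x" or "0x"
    uint raw = uint.Parse(digits, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture);
    int value = unchecked((int)raw);                  // assumption: wraps instead of throwing
    return negative ? unchecked(-value) : value;
}

Console.WriteLine(HexToInt("0x10"));        // 16
Console.WriteLine(HexToInt("0x7FFFFFFF"));  // 2147483647
Console.WriteLine(HexToInt("-0x80000000")); // -2147483648
```

Under these assumptions, both ends of the documented range, -0x80000000 and 0x7FFFFFFF, map to int.MinValue and int.MaxValue.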
Integer Literals
A C-style integer literal may take several different forms: a hexadecimal literal, an octal literal or a decimal literal. These values may be in the range [int.MinValue, int.MaxValue]. If there are additional digits beyond these limits, the parser will return successfully and the remaining digits will be left in the input sequence to be consumed by the next parse operation.
// Between int.MinValue and int.MaxValue, inclusive
var parser = IntegerString();
var parser = Integer();
// Between long.MinValue and long.MaxValue, inclusive
var parser = LongIntegerString();
var parser = LongInteger();
The Integer and LongInteger parsers recognize the following formats:
- Hexadecimal literals, like the HexadecimalInteger() parser described above. Values must start with 0x, contain up to 8 or 16 digits, and may be in the range -0x80000000 to 0x7FFFFFFF or -0x8000000000000000 to 0x7FFFFFFFFFFFFFFF.
- Decimal literals. Values may be in the range -2147483648 to 2147483647 or -9223372036854775808 to 9223372036854775807.
- Octal literals starting with 0
Floating-Point Number Literals
A C-style floating-point number has a whole and a fractional part separated by a decimal point (.). The DoubleString parser returns the literal string, while the Double parser returns the parsed double.
var parser = DoubleString();
var parser = Double();
Identifiers
A C-style identifier may start with an underscore (_) or a letter, and may be followed by zero or more underscores, letters or digits.
var parser = Identifier();
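For comparison, the identifier rule can be written as a regular expression using only the .NET base library. This sketch assumes ASCII letters only, which the text above does not specify, and it is not how the Identifier parser is implemented.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: "letter" means an ASCII letter. First character is a letter or
// underscore; the rest are letters, digits or underscores.
var identifier = new Regex(@"^[A-Za-z_][A-Za-z0-9_]*\z");

Console.WriteLine(identifier.IsMatch("_count1")); // True
Console.WriteLine(identifier.IsMatch("x"));       // True
Console.WriteLine(identifier.IsMatch("1count"));  // False
```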
String
A C-style string uses double-quotes and backslash-escapes with a few predefined escape sequences, hex codes, octal codes, and unicode code points. The String parser parses the literal string and returns the whole thing as-written, including quotes and escapes. The StrippedString parser removes the quotes and replaces the escape sequences with the characters they represent.
var parser = String();
var parser = StrippedString();
C++ Parsers
using static ParserObjects.Parsers.Cpp;
Comments
A C++-style comment, also known as a “single line comment”, starts with the prefix “//” and continues to the end of the current line.
var parser = Comment();
JavaScript Parsers
using static ParserObjects.Parsers.JS;
Numbers
A JavaScript number has a complicated set of rules and may be an integer, a floating-point value or use scientific notation. The NumberString parser returns the literal parsed string, while the Number parser returns the parsed double value.
var parser = NumberString();
var parser = Number();
Strings
JavaScript-style strings may be single- or double-quoted. They use backslash-escapes, including hex escapes and unicode code points. The String parser returns the whole literal string, including quotes and escapes. The StrippedString parser returns the value of the string, without the quotes and with the backslash escapes converted into the characters they represent.
var parser = String();
var parser = StrippedString();
SQL Parsers
using static ParserObjects.Parsers.Sql;
Comments
An SQL comment starts with the prefix “--” and continues to the end of the line.
var parser = Comment();
Identifiers
There are many different syntaxes for SQL identifiers depending on vendor and version. The ParserObjects SQL identifier parser attempts to parse several common variations.
var parser = Identifier();
This will parse:
- Undelimited identifiers which may start with a letter, '_', '@' or '#' and be followed by any number of these characters, digits, or '$'.
- T-SQL style [] delimited identifiers which may contain most non-bracket characters including whitespace,
- Oracle style '' delimited identifiers which may contain most non-quote characters including whitespace, and
- Oracle style "" delimited identifiers which may contain most non-quote characters including whitespace, and will use "" for embedded quotes
This parser may not be standards-compliant with any particular database vendor, but it should serve as a good approximation for many common use-cases.
Guid Parsers
using static ParserObjects.Parsers.Guids;
Guids come in four formats, based on the C# Guid.ToString(format) output:
Guid N
‘N’ Guids are 32 hexadecimal digits: 12345678ABCD1234ABCD1234567890AB
var p = GuidN();
Guid D
‘D’ Guids are 32 hexadecimal digits separated by dashes: 12345678-ABCD-1234-ABCD-1234567890AB
var p = GuidD();
Guid B and P
‘B’ Guids are the same as ‘D’ Guids with curly brackets: {12345678-ABCD-1234-ABCD-1234567890AB}.
‘P’ Guids are the same as ‘D’ Guids with parentheses: (12345678-ABCD-1234-ABCD-1234567890AB).
var pb = GuidB();
var pp = GuidP();
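Since the four formats follow Guid.ToString(format), the following snippet shows what .NET itself produces for each specifier. It uses only the base class library. Note that .NET emits lowercase hex digits; whether the parsers above accept both cases is not stated here.

```csharp
using System;

var guid = Guid.Parse("12345678-ABCD-1234-ABCD-1234567890AB");

Console.WriteLine(guid.ToString("N")); // 12345678abcd1234abcd1234567890ab
Console.WriteLine(guid.ToString("D")); // 12345678-abcd-1234-abcd-1234567890ab
Console.WriteLine(guid.ToString("B")); // {12345678-abcd-1234-abcd-1234567890ab}
Console.WriteLine(guid.ToString("P")); // (12345678-abcd-1234-abcd-1234567890ab)
```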
Formatted Date/Time
You can parse formatted Date and Time strings by specifying a format using mostly the same formatting rules as Microsoft’s Custom Date/Time Format Strings. These parsers take a format string and return a Parser.
The DateAndTime parser takes a format string and returns a DateTimeOffset which may include both date and time information.
The Date parser takes a format string and returns a DateTime value without Time information.
The Time parser takes a format string and returns a TimeSpan value which represents time of day but does not have date information.
var dateAndTime = DateAndTime("YYYY-MM-dd HH:mm:ss.fff");
var date = Date("MM/dd/YY");
var time = Time("HH:mm:ss.fff");
Format specifiers may appear any number of times and in any order, and you may omit any you don’t need. The following specifiers are available:
- YYYY: the 4-digit year
- MM: the 2-digit month number (with leading zero, if the value is less than 10)
- MMM: the 3-character month abbreviation for the current locale: "Jan", "Feb", "Mar", etc. These values are not case-sensitive.
- MMMM: the full month name for the current locale: "January", "February", "March", etc. These values are not case-sensitive.
- dd: the 2-digit day of month
- HH and hh: both are equivalent, the 2-digit hour with leading zero.
- H and h: both are equivalent, the hour without leading zero (1 or 2 digits).
- mm: the 2-digit minute with leading zero
- m: the minute without leading zero (1 or 2 digits)
- ss: the 2-digit second with leading zero
- s: the second without leading zero (1 or 2 digits)
- f: a digit of fractions of a second. "f" is 10ths of a second, "ff" is hundredths of a second, and "fff" is milliseconds. The precision only goes down to milliseconds; additional digits are ignored.
- Any other character is taken as a literal and that character is skipped during parsing.
Notice that ambiguities can arise when we parse values without separators and without leading zeros. For example:
var parser = Time("Hms");
var result = parser.Parse("11111");
This result is ambiguous because the source might intend the value to be 11:11:01 or 01:11:11 or 11:01:11. Because of the greedy nature of the .List() parser used in the implementation of the Time parser, it will be parsed as 11:11:01. Likewise, the input value 1111 will throw an error here because H will greedily match "11", m will greedily match "11" and there will be no input left to match s.
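The greedy behavior can be reproduced with a small standalone simulation. This is only a sketch of the matching strategy described above, not the library's code: the helper TryGreedyHms is hypothetical, and it reports failure by returning false rather than throwing.

```csharp
using System;

// Hypothetical helper: each of the three fields (H, m, s) greedily takes up to
// two digits, in order, with no backtracking.
static bool TryGreedyHms(string input, out TimeSpan time)
{
    time = default;
    var fields = new int[3];
    int pos = 0;
    for (int i = 0; i < fields.Length; i++)
    {
        int start = pos;
        while (pos < input.Length && pos - start < 2 && input[pos] >= '0' && input[pos] <= '9')
            pos++;
        if (pos == start)
            return false; // no digits left for this field
        fields[i] = int.Parse(input.Substring(start, pos - start));
    }
    time = new TimeSpan(fields[0], fields[1], fields[2]);
    return true;
}

Console.WriteLine(TryGreedyHms("11111", out var a) ? a.ToString() : "no match"); // 11:11:01
Console.WriteLine(TryGreedyHms("1111", out var b) ? b.ToString() : "no match");  // no match
```

Adding separators to the format, or using the fixed-width specifiers (for example "HHmmss"), removes the ambiguity.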
Note: The 2-digit specifiers with leading zeros (HH, mm, etc) are more efficient to parse than the specifiers which may omit leading zeros (H, m, etc). Where possible, prefer the variants with leading zeros.