Regular Expression¶
CookCC follows closely to the Flex patterns, so this page is contains a number of text copied/pasted from the referenced article. There are some minor differences though since CookCC allows both double quoted strings as well as single quoted strings (since version 0.3.1).
Some examples can be found in test cases for NFA and CCL.
Format | Description |
---|---|
x |
matches the character x |
. |
matches anything except \n character |
[xyz] |
a character class which includes x , y , and
z |
[abj-oZ] |
a “character class” with a range in it; matches an
a , a b , any letter from j through o ,
or a Z |
[^A-Z] |
a “negated character class”, i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter. |
[^A-Z\n] |
any character EXCEPT an uppercase letter or a newline |
r* |
zero or more r’s, where a is any regular expression |
r+ |
one or more r’s |
r? |
zero or one r’s (that is, “an optional r”) |
r{2,5} |
anywhere from two to five r’s |
r{2,} |
two or more r’s |
r{4} |
exactly 4 r ’s |
{name} |
Replace with the <shortcut> pattern of the same
name. |
"[xyz]" |
the literal string: [xyz] |
'[xyz]' |
the literal string: [xyz] , since version 0.3.1. |
\x |
Escape character, see the section below. |
\0 |
the character with 0 (ASCII code 0) |
(r) |
match an r |
rs |
the concatenation of the regular expression r
followed by the regular expression s |
r|s |
either an r or an s . |
r/s |
r followed by s , but only obtains the r
part. At present, CookCC can only handle cases where
either r or s is fixed in length. |
^r |
an r at beginning of line, or the start of scan.
It will increase the DFA size slightly and slightly
decrease the lexer performance. |
r$ |
an r at the end of the line (\n ), but not
including the \n character. Note that this
pattern does not work well on files in dos format
because it does not consider \r character. |
<<EOF>> |
an end of file character. This character is artificial and will not be actually seen in the input. If this character is the entire pattern, return value from yyText () is null. |
Escape Characters¶
Code | Description |
---|---|
\b |
backspace |
\e |
escape \033 |
\f |
form feed |
\n |
line feed |
\r |
carriage return |
\t |
tab |
\s |
space ' ' |
\\ |
back slash () |
\" |
double quote (“) |
\' |
single quote (‘) |
\123 |
the character with octal value 123 (maximum 3 digits) |
\x2a |
the character with hexadecimal value 2a (maximum 2 digits) |
\u002a |
the unicode character with hexadecimal value 2a (exactly 4 digits) |
Escape sequences with \
proceeding any character that is not listed
above would treat treat that character as the literal character, rather
than special characters.
POSIX Character Classes¶
Class | Equivalent |
---|---|
[:alnum:] |
[a-zA-Z0-9] |
[:alpha:] |
[a-zA-Z] |
[:blank:] |
[ \t] |
[:cntrl:] |
[\x00-\x1f\x7f] |
[:digit:] |
[0-9] |
[:graph:] |
[[:alnum:][:punct:]] |
[:lower:] |
[a-z] |
[:print:] |
[[:graph:] ] |
[:punct:] |
[!"#$%&'()*+,-./:;<=>?@[\\\]^_`{|}~] |
[:space:] |
[ \t\n\x0b\f\r] |
[:upper:] |
[A-Z] |
[:xdigit:] |
[a-fA-F0-9] |
[:word:] |
[a-zA-Z0-9_] . Technically, it is not part of POSIX.
Added in 0.4. |
For example, the following character classes are all equivalent:
[[:alnum:]] |
[[:alpha:][:digit:]] |
[[:alpha:]0-9] |
[a-zA-Z0-9] |
---|---|---|---|
Character Class Manipulations¶
A character class can be manipulated in several ways:
Operation | Description | Examples |
---|---|---|
{+} |
Merging two character classes. | [A-Z]{+}[a-z] is the same as
[A-Za-z] |
{-} |
Remove all characters in the second character class from the first character class | [A-Za-z]{-}[a-z] is the same as
[A-Z] . [a-z]{-}[aeiou]
represents all lower case consonant
letters. |
^ |
Not operation | [[:^alpha:]] is the same as
[^[:alpha:]] and
[\0-\xff]{-}[a-zA-Z] .
[^ \t\r\n] represents all
non-space characters. |
Tips¶
- Unless one is generating a full DFA table, it is usually a good idea
not to generate new character classes. For example,
(a|b)
can be better written as[ab]
which would reduce the DFA table size forecs
andcompressed
format quite a bit. This approach is particularly necessary when there are lots of DFA states.- In 0.4, a special handling is added to handle
(a|b)
like expression so that it is treated as[ab]
.
- In 0.4, a special handling is added to handle
- Long strings of keyword matches such as
"while"
,"continue"
can increase the DFA table size quite a bit. Consider using hash tables or perfect hash instead if the tools are available in the target language. - On the other hand, specifying long patterns without causing backups can increase the lexer performances at the expense of more DFA states.