Lexer¶
The CookCC lexer has the following features.
Lexer Table Format¶
CookCC supports DFA tables for 8-bit and 16-bit characters. The 16-bit character tables are intended for Unicode support. The following table formats are currently supported.
Format | Description |
---|---|
full | A full table. Very memory intensive. |
ecs | A much smaller table using equivalence classes. |
compressed | An even smaller table in most cases, at some performance cost. |
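To make the difference between the full and ecs formats concrete, here is a minimal sketch of the equivalence class idea. It is not CookCC's generated code; the class `EcsSketch` and its methods are hypothetical names chosen for illustration. Characters whose transitions are identical in every state are merged into one class, so the table needs one column per class instead of one per character.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of equivalence classes; not CookCC's actual tables.
final class EcsSketch {
    /**
     * Given a full table full[state][ch], returns ecs[ch] = class id, where two
     * characters share a class when their columns are identical in every state.
     */
    static int[] buildClasses(int[][] full) {
        int charCount = full[0].length;
        int[] ecs = new int[charCount];
        Map<String, Integer> seen = new HashMap<>();
        for (int ch = 0; ch < charCount; ch++) {
            // Collect the column of next states for this character.
            int[] column = new int[full.length];
            for (int s = 0; s < full.length; s++) {
                column[s] = full[s][ch];
            }
            String key = Arrays.toString(column);
            Integer id = seen.get(key);
            if (id == null) {
                id = seen.size(); // first time this column is seen: new class
                seen.put(key, id);
            }
            ecs[ch] = id;
        }
        return ecs;
    }

    /** Builds the smaller table indexed by [state][class]. */
    static int[][] compress(int[][] full, int[] ecs) {
        int classCount = Arrays.stream(ecs).max().getAsInt() + 1;
        int[][] small = new int[full.length][classCount];
        for (int s = 0; s < full.length; s++) {
            for (int ch = 0; ch < ecs.length; ch++) {
                small[s][ecs[ch]] = full[s][ch];
            }
        }
        return small;
    }
}
```

A lookup then becomes `small[state][ecs[ch]]` instead of `full[state][ch]`, which costs one extra array access per character in exchange for the smaller table.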
Line Mode¶
Added in 0.4+.
This mode is mostly for interactive scanning, where `\n` immediately triggers a match of the current longest pattern. It behaves much like matching `<<EOF>>`: the `\n` must be consumed in the current line.
Multi-line patterns will not work in this mode.
In this mode, the lexer does not block to read a character from the next line before it has fully processed the patterns on the current line. Thus, it is well suited for interactive processing.
There is a slight performance hit due to one extra comparison per character, but usually it is not an issue in interactive mode.
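The behavior can be pictured with a small hand-written scanner. This is only a sketch under simplifying assumptions (tokens are whitespace-separated words), not the scanner CookCC generates; `LineModeSketch` and `scanLine` are hypothetical names. The point is that the scan stops as soon as `\n` is seen, so nothing from the next line is requested from the input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of line-mode behavior; not CookCC's generated scanner.
final class LineModeSketch {
    /**
     * Reads characters up to and including '\n' (or end of input) and returns
     * the whitespace-separated words on that line. Returns null when the input
     * is already exhausted.
     */
    static List<String> scanLine(Reader in) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean sawInput = false;
        try {
            int c;
            while ((c = in.read()) != -1) {
                sawInput = true;
                if (c == '\n') {
                    break; // the extra per-character check: never read into the next line
                }
                if (Character.isWhitespace(c)) {
                    flush(current, words);
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current, words);
        return sawInput ? words : null;
    }

    // Moves the pending word, if any, into the result list.
    private static void flush(StringBuilder current, List<String> words) {
        if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Called repeatedly on a reader wrapping an interactive stream, each call returns after the user's newline, without waiting for further input.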
Trail Context¶
At present, CookCC handles trail contexts only when either the head (e.g., `abc/xyz` or `abc/x*z`) or the tail (e.g., `a.*b/xyz`) has a fixed length.
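For readers unfamiliar with the notation: a pattern `head/tail` matches `head` only when it is followed by `tail`, and the tail is not consumed as part of the token. Lookahead in `java.util.regex` behaves in a roughly comparable way, so the fixed-head case can be sketched as follows. This uses Java's regex engine rather than CookCC, and `TrailContextSketch` is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration with java.util.regex; CookCC compiles trail contexts into its DFA instead.
final class TrailContextSketch {
    // Roughly analogous to the lexer pattern abc/xyz:
    // match "abc" only if "xyz" follows, without consuming "xyz".
    private static final Pattern HEAD_BEFORE_TAIL = Pattern.compile("abc(?=xyz)");

    /** Returns the matched head at the start of input, or null when the tail is absent. */
    static String matchHead(String input) {
        Matcher m = HEAD_BEFORE_TAIL.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }
}
```

Here `matchHead("abcxyz")` yields `"abc"`, leaving `xyz` available for the next token, while `matchHead("abcdef")` yields `null`.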
CookCC Warnings¶
CookCC generates warnings in the following cases.
- patterns that cause backup,
- patterns that were never reached,
- states that have incomplete patterns, or
- having multi-line patterns in line mode.
Backup¶
This situation happens when a pattern goes on to match a relatively long string without passing through intermediate accepting states.
You can take a look at a simple example that causes such a problem.
Backups can cause slight performance degradation, depending on the target language. For Java, the difference is not so noticeable.
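As a rough illustration of what backing up means, the sketch below simulates a longest-match scan over a set of literal keywords. It is a simplification that uses keyword prefixes in place of real DFA states, and `BackupSketch` is a hypothetical name. With the keywords `foo` and `foobar`, the input `foobaz` is scanned past `foo` in the hope of reaching `foobar`, fails at `z`, and the scanner has to return to the last accepting position.

```java
// Hypothetical simulation of backup in a longest-match scanner.
final class BackupSketch {
    /**
     * Returns {acceptedLength, backedUp}: the length of the longest keyword
     * matched at the start of input, and the number of characters that advanced
     * the scan beyond that token and must be rescanned.
     */
    static int[] longestMatch(String input, String... keywords) {
        int lastAccept = 0; // length of the longest keyword matched so far
        int pos = 0;        // characters consumed while some keyword was still viable
        while (pos < input.length()) {
            String prefix = input.substring(0, pos + 1);
            boolean viable = false;
            for (String k : keywords) {
                if (k.startsWith(prefix)) {
                    viable = true;
                    if (k.length() == prefix.length()) {
                        lastAccept = prefix.length(); // an accepting position
                    }
                }
            }
            if (!viable) {
                break; // dead end: fall back to lastAccept
            }
            pos++;
        }
        return new int[] { lastAccept, pos - lastAccept };
    }
}
```

For `longestMatch("foobaz", "foo", "foobar")` the accepted length is 3 and two characters (`b` and `a`) are handed back. A scanner with no such patterns never needs to remember the last accepting position.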
Incomplete States¶
This situation happens when the patterns specified for a state cover only part of the character set. By default, CookCC internally adds states that simply dump the characters not matched by the user patterns to the standard output.
One way to avoid this warning is to add the pattern `.|\n` as the last pattern for the state. This is in fact what CookCC does internally. However, it then runs into the potential problem of having patterns that can never be matched.
CookCC also requires the user to specify `<<EOF>>` conditions for all states, just in case of an unexpected end of file. For example, you are probably not expecting an EOF when parsing a block comment. If not specified, the default action is to exit from the lexer with a value of 0.
Here are some examples that cause such a problem.
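Independently of those examples, the two defaults described in this section can be pictured with a hand-written scanner. This is a sketch, not CookCC output: `DefaultRuleSketch` is a hypothetical name, the single user rule is assumed to be `[0-9]+`, and unmatched characters are collected in a `StringBuilder` instead of being written to standard output so that the behavior is easy to inspect.

```java
import java.util.List;

// Hypothetical sketch of the implicit catch-all rule and the default EOF action.
final class DefaultRuleSketch {
    /**
     * Scans input with one user rule ([0-9]+). Characters not covered by that
     * rule fall through to an implicit .|\n rule that echoes them.
     */
    static int scan(String input, List<String> numbers, StringBuilder echoed) {
        int i = 0;
        while (i < input.length()) {
            int start = i;
            while (i < input.length() && Character.isDigit(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                numbers.add(input.substring(start, i)); // user rule matched
            } else {
                echoed.append(input.charAt(i));         // implicit .|\n rule
                i++;
            }
        }
        return 0; // default <<EOF>> action: leave the lexer with a value of 0
    }
}
```

Scanning `"a12+b7\n"` collects the numbers `12` and `7` and echoes `a+b` followed by the newline.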
TODO List¶
The following features are yet to be implemented. They are difficult to implement and I do not have any experience using them, so they are quite low on the priority list.
Feature | Description |
---|---|
yyMore | Keep the current matched text so that the next match is appended to it. |
REJECT | Reject a token and go to the next available accept case. |
Variable trail context | Both the head and tail are variable length. |
Marked sub-expression | Perl-like matching that automatically extracts sub-expressions as well. |
Some of them can be worked around by using Java’s `Pattern` class to perform a secondary match.
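For marked sub-expressions, such a secondary match could look like the sketch below. The token shape (`key=value`) and the names `SecondaryMatchSketch` and `split` are assumptions made for illustration; the idea is that the lexer matches the whole token, and the action code then re-matches the token text with `java.util.regex` to pull out the parts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical action helper: re-match an already scanned token to extract its parts.
final class SecondaryMatchSketch {
    // Assumed token shape: key=value, e.g. from a lexer rule such as [a-z]+=[0-9]+
    private static final Pattern KEY_VALUE = Pattern.compile("([a-z]+)=([0-9]+)");

    /** Splits the token text into its two sub-expressions. */
    static String[] split(String tokenText) {
        Matcher m = KEY_VALUE.matcher(tokenText);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a key=value token: " + tokenText);
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

Compiling the `Pattern` once as a static field keeps the cost of the secondary match down to the matching itself.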