Lexer

CookCC Lexer has the following features.

Lexer Table Format

CookCC supports DFA tables for 8-bit and 16-bit characters. The 16-bit character tables are intended for Unicode support. The following table formats are currently supported.

Format      Description
full        A full table. Very memory intensive.
ecs         A much smaller table using equivalence classes.
compressed  An even smaller table in most cases, at some performance cost.
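
To make the difference between the full and ecs formats concrete, here is a minimal Java sketch. It is not the code CookCC generates; the class EcsTableSketch and all of its members are hypothetical, and the DFA is a toy one that recognizes [0-9]+ over 8-bit characters. The full table stores one column per character, while the ecs table first maps each character to an equivalence class and stores one column per class.

```java
import java.util.Arrays;

public class EcsTableSketch {
    static final int CHARS = 256; // 8-bit input alphabet
    static final int STATES = 2;  // state 0 = start, state 1 = inside a digit run
    static final int DEAD = -1;   // marker for "no transition"

    // Full format: STATES x 256 entries, one column per character.
    static int[][] buildFull() {
        int[][] table = new int[STATES][CHARS];
        for (int[] row : table) {
            Arrays.fill(row, DEAD);
        }
        for (char c = '0'; c <= '9'; c++) {
            table[0][c] = 1;
            table[1][c] = 1;
        }
        return table;
    }

    // ecs format, part 1: map every character to an equivalence class.
    // Class 1 = digits, class 0 = everything else.
    static int[] buildEcs() {
        int[] ecs = new int[CHARS];
        for (char c = '0'; c <= '9'; c++) {
            ecs[c] = 1;
        }
        return ecs;
    }

    // ecs format, part 2: STATES x 2 entries, one column per class.
    static final int[][] ECS_TABLE = {
        { DEAD, 1 }, // state 0
        { DEAD, 1 }, // state 1
    };

    static int nextFull(int[][] full, int state, char c) {
        return full[state][c];
    }

    // One extra array lookup per character buys a much smaller table.
    static int nextEcs(int[] ecs, int state, char c) {
        return ECS_TABLE[state][ecs[c]];
    }
}
```

In this toy case the full table holds 512 entries, while the ecs variant holds 256 class entries plus 4 transitions. The saving grows with the number of states, because the class map is shared by all of them.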

Line Mode

Available since version 0.4.

This mode is mainly for interactive scanning: a \n immediately triggers a match of the longest pattern found so far. The effect is very similar to matching <<EOF>>, except that the \n itself must be consumed as part of the current line.

Multi-line patterns will not work in this mode.

In this mode, the lexer does not block to read characters from the next line before it has fully processed the patterns on the current line. It is therefore well suited to interactive processing.

There is a slight performance hit due to one extra comparison per character, but usually it is not an issue in interactive mode.
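
The extra comparison can be pictured with the following minimal Java sketch. It is not CookCC's input code; the class LineModeSketch and the method readChunk are hypothetical names, and reading to end of input when line mode is off is a deliberate exaggeration of read-ahead. The only point is that in line mode the reader stops as soon as it has seen \n, so it never waits for the next line.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

public class LineModeSketch {
    // Reads the characters that one scanning pass may look at.
    // With lineMode on, reading stops right after '\n', so nothing on the
    // next line is requested from the (possibly blocking) input.
    static String readChunk(Reader in, boolean lineMode) {
        StringBuilder chunk = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                chunk.append((char) c);
                // The one extra comparison per character.
                if (lineMode && c == '\n') {
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunk.toString();
    }
}
```

With a console as input, the non-line-mode loop would sit in in.read() until the user typed more text, which is exactly what an interactive prompt wants to avoid.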

Trail Context

At present, CookCC handles only trail contexts in which either the head has a fixed length (e.g., abc/xyz or abc/x*z) or the tail has a fixed length (e.g., a.*b/xyz).
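
For readers unfamiliar with the notation, head/tail means "match head only when it is followed by tail, and do not consume tail". The Java sketch below illustrates that meaning with a regular-expression lookahead from java.util.regex. It is only an analogy for the semantics, not how CookCC implements trail contexts, and the class TrailContextSketch and method matchHead are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailContextSketch {
    // Returns the text consumed by head/tail at the start of input,
    // or null when head is not followed by tail.
    static String matchHead(String head, String tail, String input) {
        // (?=...) checks that tail follows without consuming it.
        Pattern p = Pattern.compile("(?:" + head + ")(?=" + tail + ")");
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }
}
```

For example, matchHead("abc", "xyz", "abcxyz") consumes only "abc" and leaves "xyz" for the next token, while matchHead("abc", "xyz", "abcdef") finds no match.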

CookCC Warnings

CookCC generates warnings in the following cases.

Backup

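This situation happens when a pattern proceeds to match a relatively long string without passing through any intermediate accepting states. If the long match then fails, the lexer has to back up to the last accepting state and rescan the characters it read past that point.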

You can take a look at a simple example that causes such a problem.

Backups can cause slight performance degradation, depending on the target language. For Java, the difference is not very noticeable.
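
The Java sketch below shows what a backup looks like on a toy lexer that only knows literal patterns. It is an illustration under simplifying assumptions, not CookCC's DFA code; the class BackupSketch and the method scan are hypothetical. With the patterns a and abcd and the input abcx, the scanner reads b and c hoping to complete abcd, fails at x, and must give both characters back.

```java
import java.util.List;

public class BackupSketch {
    // Longest-match scan over literal patterns, starting at index 0.
    // Returns { length of the accepted match, characters that were read
    // past the last accepting point and must be given back }.
    static int[] scan(List<String> patterns, String input) {
        int lastAccept = 0; // length of the longest complete pattern seen
        int read = 0;       // characters consumed while some pattern was still viable
        while (read < input.length()) {
            String prefix = input.substring(0, read + 1);
            boolean viable = false;
            for (String p : patterns) {
                if (p.startsWith(prefix)) {
                    viable = true;
                    break;
                }
            }
            if (!viable) {
                break; // no pattern can continue with this character
            }
            read++;
            if (patterns.contains(prefix)) {
                lastAccept = read; // an accepting state
            }
        }
        return new int[] { lastAccept, read - lastAccept };
    }
}
```

Here scan(Arrays.asList("a", "abcd"), "abcx") returns {1, 2}: the match is a, and two characters are backed up. Adding the intermediate patterns ab and abc makes every state on the way accepting, so the same input yields {3, 0}. Whether such extra patterns are acceptable depends on the grammar.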

Incomplete States

This situation happens when the patterns specified for a state cover only part of the character set. By default, CookCC internally adds states that simply dump the characters not matched by the user patterns to the standard output.

One way to avoid this warning is to add the pattern .|\n as the last pattern for the state. This is in fact what CookCC does internally. However, doing so can lead to the other potential problem of having patterns that can never be matched.

CookCC also requires the user to specify <<EOF>> conditions for all states, in case of an unexpected end of file. For example, you are probably not expecting an EOF while parsing a block comment. If no <<EOF>> condition is specified, the default action is to exit from the lexer with a value of 0.

Here are some examples that cause such a problem.

Some Patterns Can Never Be Matched

By default, patterns specified earlier take precedence over patterns specified later. Thus, for some patterns, every string they could match is always matched by another pattern first.

Here are some examples that cause such a problem.
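
As a hedged illustration of the rule, the Java sketch below picks a winner among several patterns the way a typical lexer does: the longest match wins, and on a tie the earlier pattern wins. It uses java.util.regex rather than CookCC itself, the class PrecedenceSketch and method winner are hypothetical, and lookingAt is only a stand-in for true longest-match semantics that happens to agree for simple patterns like these.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrecedenceSketch {
    // Returns the index of the pattern that wins at the start of input,
    // or -1 if no pattern matches. Longest match wins; ties go to the
    // pattern that was specified earlier.
    static int winner(String[] patterns, String input) {
        int best = -1;
        int bestLen = -1;
        for (int i = 0; i < patterns.length; i++) {
            Matcher m = Pattern.compile(patterns[i]).matcher(input);
            // Strictly greater: a later pattern never displaces an
            // earlier one that matches the same length.
            if (m.lookingAt() && m.end() > bestLen) {
                best = i;
                bestLen = m.end();
            }
        }
        return best;
    }
}
```

With the order { "[a-z]+", "if" }, the input if is claimed by pattern 0, and no input can ever reach the if pattern. Swapping the order to { "if", "[a-z]+" } lets if win on the exact keyword while iffy still goes to the identifier pattern.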

Multi-Line Patterns in Line Mode

When line mode is used in the lexer, multi-line patterns simply cannot be matched.

When this warning is given, other warnings may not be accurate until this warning is fixed.

Here are some examples that cause such a problem.

TODO List

The following features are yet to be implemented. They are difficult to implement and I do not have any experience using them, so they are quite low on the priority list.

Feature                 Description
yyMore                  Keep the currently matched string so that the next match is appended to it.
REJECT                  Reject a token and go to the next available accept case.
Variable trail context  Both the head and the tail are of variable length.
Marked sub-expression   Perl-like matching that automatically extracts sub-expressions as well.

Some of them can be worked around by utilizing Java’s Pattern class to perform the secondary match.
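
A minimal sketch of such a secondary match, assuming a hypothetical token of the form key=value that the lexer has already recognized as a whole. The class SecondaryMatchSketch, the method split, and the token shape are illustrative assumptions; only java.util.regex.Pattern and Matcher are real APIs. The capture groups stand in for the missing marked sub-expression feature.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SecondaryMatchSketch {
    // Hypothetical token shape: an identifier, '=', and a decimal number.
    private static final Pattern KEY_VALUE =
        Pattern.compile("([A-Za-z_]+)=([0-9]+)");

    // Called from a lexer action with the text that was already matched.
    // Returns { key, value }, or null if the text has a different shape.
    static String[] split(String tokenText) {
        Matcher m = KEY_VALUE.matcher(tokenText);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

The second match costs an extra pass over the token text, so it is best kept to the few token types that actually need their parts extracted.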