Parser

Introduction

The generated parser is LALR(1).

The grammar is specified as a series of terminals and non-terminals. For example,

<rhs>VARIABLE '=' expr ';'</rhs>

Extended Grammar

Extended grammar are mostly for convenience. New operators ?, *, +, |, ( and ) are added. They are described below. Some of the extended grammar requires a small runtime library `cookcc-rt.

The tests for the extended grammar have some examples.

Optional Operator

Optional operator ? is used to indicate a symbol is optional. For example, in the following grammar

<grammar rule="G">
    <rhs>A B?</rhs>
</grammar>

It is equivalent to

<grammar rule="G">
    <rhs>A @1</rhs>
</grammar>
<grammar rule="@1">
    <rhs>B</rhs>
    <action>$$ = $1;</action>
    <rhs></rhs>
    <action>$$ = null;</action>
</grammar>

With (), it is possible to enclose multiple symbols with ? operator. For example, in the following

<grammar rule="G">
    <rhs>A (B C)?</rhs>
</grammar>

It is equivalent to

<grammar rule="G">
    <rhs>A C</rhs>
</grammar>
<grammar rule="@1">
    <rhs>B C</rhs>
    <action>$$ = new ASTNode (); $$.add ($1); $$.add ($2);</action>
    <rhs></rhs>
    <action>$$ = null;</action>
</grammar>

The collection used here is org.yuanheng.cookcc.ASTNode which is part of cookcc-rt runtime.

List Operator

List operator + basically repeats a symbol or a set of symbols for one or more iterations. It is very common for grammars to have repeated terminal / non-terminals, so it should be helpful to avoid some tedious works. For example, in the following grammar

<grammar rule="G">
    <rhs>A B+</rhs>
</grammar>

It is equivalent to

<grammar rule="G">
    <rhs>A @1</rhs>
</grammar>
<grammar rule="@1">
    <rhs>B</rhs>
    <action>$$ = new ASTListNode (); $$.add ($1);</action>
    <rhs>@1 B</rhs>
    <action>$$.add ($2);</action>
</grammar>

The collection used here is org.yuanheng.cookcc.ASTListNode which is part of cookcc-rt runtime.

It is possible to repeat a set of symbols, but its use should be rare. For example, in the following grammar

<grammar rule="G">
    <rhs>A (B C)+</rhs>
</grammar>

It is equivalent to

<grammar rule="G">
    <rhs>A @1</rhs>
</grammar>
<grammar rule="@1">
    <rhs>B C</rhs>
    <action>$$ = new ASTListNode (); $$.add ($1); $$.add ($2);</action>
    <rhs>@1 B C</rhs>
    <action>$$.add ($2); $$.add ($3);</action>
</grammar>

It should be noted a single ASTListNode is used to contain all the repeated symbols.

To avoid extra ASTListNode being created, the following two rules are equivalent.

<rhs>A B+</rhs>
<rhs>A (B)+</rhs>

Optional List Operator

Optional list operator * basically repeats a symbol or a set of symbols for zero or more iterations. It is very similar to + operator except that the number of repeats can be zero.

Grouping Operator

( and ) are used enclose symbols. The behavior are the following.

  • It has to enclose at least a symbol. Thus simply () is not allowed.
  • It is usually used in conjunction with other extended grammar operators. See notes in other extended grammar operators.
  • You can use ( and ) simply for grouping one or more symbols without using other extended grammar operators. For example: A (B) C. An object org.yuanheng.cookcc.ASTNode is used to collect the symbol values.
  • Nesting such as ((A)) is allowed, but its use probably does not make practical sense.

Or Operator

Or operator | is used to make several possible choices. For example, in the following grammar

<grammar rule="G">
    <rhs>A (B | C)</rhs>
</grammar>

It is equivalent to

<grammar rule="G">
    <rhs>A @1</rhs>
</grammar>
<grammar rule="@1">
    <rhs>B</rhs>
    <action>$$ = $1;</action>
    <rhs>C</rhs>
    <action>$$ = $1;</action>
</grammar>

It is not possible to specify empty rule with | operator. Instead, you should combine with ? operator for this need. For example, in the following grammar, A could be followed by B, C, or D.

<grammar rule="G">
    <rhs>A (B | C) ? D</rhs>
</grammar>

cookcc-rt Runtime

cookcc-rt is a tiny Java runtime library for cookcc. It is only required by certain extended grammars mentioned above.

Parser Table Format

Currently, the following table formats are supported.

Format Description
ecs Good when there are not a lot symbols and states.
compressed A smaller table in most cases at some performance cost.

Default Reduce

the command line option -defaultreduce is specified, DFA states that contain a reduceable item would convert all 0 (i.e. error) entries to reduce. This approach can make the compressed table more compact, at the expense of slightly more difficult error recovery.

Analysis Output

When the command line option -analysis is specified, a file named cookcc_parser_analysis.txt is generated in the current directory that contains the detail of the parser. It can be useful in analyzing the grammar.

Error Recovery

<parser> has an option recovery which would turn on/off the error recovery code depending whether or not the value is true or false. This value is by default true.

Turning off error recovery can be useful since in many cases we do not really care much about the corrupted data, and error recovery can be slow.

The exact behavior of error recovery depends on the specific implementation of output language.

Error Recovery in Java

When the recovery option is set to false, the parser simply returns with a value of 1 to indicate that an error has occurred.

The option parseerror controls whether or not the code generator should generatet he yyParseError function. Set this option to false if you want to the parser to use your own function.

Otherwise, the behavior of the parser is the following.

When a token not belonging to one of the lookahead (i.e. cannot either reduce or shift) is encountered. yyParseError function is called. If this function returns true, the parser stops and returns a value 1. If the function returns false (by default), an error token is pushed onto the lookahead stack and an internal error flag _yyInError is set. Then the parsing is resumed.

If the error token can be shifted, then a grammar dealing with error recovery is found. Otherwise, the parser would start discarding a state on the stack until a grammar that can handle the “error” token is reached.

With the error token shifted on to the stack. The state should be immediately reduceable if the grammar does not require any tokens after the error token. Otherwise, it means the grammar is looking for a specific terminal. Then the input is continuously consumed until the desired token is found, or the end of file is reached.

Additionally, yyPeekLookahead () is provided to check the cause of the error (only accurate if the user didn’t specify any terminals after error in the grammar). yyPopLookahead () is provided to remove the possible offending token. However, this function should be called only once.

Here are some test cases that demonstrate these behaviors.