XML

CookCC can generate both lexers and parsers. The input XML (validated against a DTD) thus contains a lexer section and a parser section. Only one of the two sections is required. The file extension is *.xcc.

There are plenty of examples shown in the test cases.

Overview

A sample XML looks like:

<?xml version = "1.0" encoding="UTF-8"?>
<!DOCTYPE cookcc PUBLIC "-//CookCC//1.0" "https://raw.githubusercontent.com/coconut2015/cookcc/master/src/resources/cookcc.dtd">
<cookcc unicode="false">

    <tokens>VARIABLE INTEGER WHILE IF PRINT</tokens>
    <tokens type="nonassoc">IFX</tokens>
    <tokens type="nonassoc">ELSE</tokens>
    <tokens type="left"><![CDATA[GE LE EQ NE '>' '<']]></tokens>
    <tokens type="left">'+' '-'</tokens>
    <tokens type="left">'*' '/'</tokens>
    <tokens type="nonassoc">UMINUS</tokens>

    <lexer>
        <!-- lexer section -->
    </lexer>
    <parser start="program">
        <!-- parser section -->
    </parser>

    <code name="default"><![CDATA[
        /* code section, can appear anywhere directly under the <cookcc> tag. */
    ]]></code>
</cookcc>

Some XML editors, such as the one in IntelliJ IDEA, can check XML syntax on the fly and suggest tags and attributes automatically. CookCC also validates the input against the DTD before parsing.

XML Tag Explanations

<cookcc>

This is the document root of the XML file. It has a single attribute, unicode, which indicates whether or not the lexer parses Unicode input. By default, unicode is false.

<code>

The <code> tag can hold an arbitrary piece of code. There can be multiple such tags, and they can appear anywhere directly under the <cookcc> tag. The name of each piece of code must be unique. If the name is not specified, it is assumed to be "default". The exact usage of the code depends on the template being used. For Java output, take a look at the Java output template to get a rough idea.
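
As a rough sketch of how named and unnamed <code> tags can sit side by side: the name "fileheader" below is only a placeholder assumed for illustration, since the names that are actually used are defined by the chosen template.

<cookcc>
    <!-- placeholder name; check the template for the names it recognizes -->
    <code name="fileheader"><![CDATA[
        /* e.g. a license comment or import statements */
    ]]></code>

    <lexer>
        <!-- lexer section -->
    </lexer>

    <!-- no name attribute, so this piece of code is named "default" -->
    <code><![CDATA[
        /* helper fields and methods */
    ]]></code>
</cookcc>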

<tokens>

A list of token names. The names can be separated by spaces, tabs, or newlines.

The type attribute is used by the parser to determine the associativity of the tokens. There are three types: "left", "right", and "nonassoc". If the type is not specified, it is assumed to be "nonassoc".

Tokens inside the same <tokens> tag have the same precedence level. Tokens specified in later <tokens> tags have higher precedence levels.
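
The two rules above can be illustrated with a small sketch. The '^' token is an assumption added for this example only, and the grouping described in the comments presumes a grammar with the usual ambiguous expression rules.

<!-- lowest level: '+' and '-' share one precedence level and associate to the left -->
<tokens type="left">'+' '-'</tokens>
<!-- declared later, so '*' and '/' bind tighter: 1 + 2 * 3 groups as 1 + (2 * 3) -->
<tokens type="left">'*' '/'</tokens>
<!-- highest level; "right" makes a ^ b ^ c group as a ^ (b ^ c) -->
<tokens type="right">'^'</tokens>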

<lexer>

A sample XML for the <lexer> section looks like:

<lexer table="ecs">
    <shortcut name="nonws">[^ \t\r\n]</shortcut>
    <shortcut name="word">{nonws}+</shortcut>
    <rule>
        <pattern>{word}</pattern>
        <pattern>{word}[ \t\r\n]*</pattern>
        <action>
            ++wordCount;
        </action>
    </rule>
    <state name="INITIAL,TEST">
        <rule>
            <pattern>.|\n</pattern>
            <action>
                // ignore
            </action>
        </rule>
        <rule state="ANOTHER_STATE">
            <pattern><![CDATA[<<EOF>>]]></pattern>
            <action>
                return 0;  /* exit lexer loop */
            </action>
        </rule>
    </state>
</lexer>

Options for the lexer are specified as attributes of the <lexer> tag.

Attribute    Description
table        The DFA table format. One of "ecs", "full", or "compressed". Command line options can override the choice made here.
bol          Instructs the lexer to keep track of BOL (beginning of line) information even when no pattern uses it.
warnbackup   Generates warnings about backup lexer states if set to true. Default is false.
yywrap       Indicates that the yyWrap () function should be called when EOF is encountered. Default is false.
linemode     Instructs the lexer to match patterns one line at a time. This mode is primarily useful for interactive use, where inputs are delimited by the '\n' character. Multi-line patterns generate warnings since they cannot be matched in this mode.
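
For instance, a <lexer> tag that combines several of these options might look like the following sketch. The particular combination is arbitrary and chosen only for illustration.

<!-- compressed DFA tables, backup warnings on, yyWrap () called at EOF -->
<lexer table="compressed" warnbackup="true" yywrap="true">
    <!-- shortcuts, rules and states -->
</lexer>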

<shortcut>

This tag is used to name frequently used sub-patterns. In the above example, when the pattern {word} is seen, it is replaced with ({nonws}+), which is in turn replaced with (([^ \t\r\n])+). So the actual pattern is equivalent to [^ \t\r\n]+.

<shortcut> tags can only be specified as immediate children of <lexer>.
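
Another sketch along the same lines, building an identifier pattern out of two smaller shortcuts. The shortcut names and the identCount counter are assumptions made for this example.

<lexer>
    <shortcut name="digit">[0-9]</shortcut>
    <shortcut name="letter">[a-zA-Z_]</shortcut>
    <!-- refers to the two shortcuts defined before it, as {word} does with {nonws} above -->
    <shortcut name="ident">{letter}({letter}|{digit})*</shortcut>
    <rule>
        <pattern>{ident}</pattern>
        <action>
            ++identCount;
        </action>
    </rule>
</lexer>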

<state>

<state> tags are used to indicate the state conditions. The tag has a single attribute, name, which specifies a comma-separated list of state names. All rules specified under this tag are automatically added to each of the listed states. If the name attribute is not specified, it is assumed to be INITIAL, which is required as the initial state at the start of the lexer.
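
A minimal sketch of both cases, with and without the name attribute. The COMMENT state and the two counters are hypothetical names introduced for this example.

<lexer>
    <!-- no name attribute: the rule belongs to INITIAL -->
    <state>
        <rule>
            <pattern>[0-9]+</pattern>
            <action>
                ++numberCount;
            </action>
        </rule>
    </state>
    <!-- the rule belongs to both INITIAL and COMMENT -->
    <state name="INITIAL,COMMENT">
        <rule>
            <pattern>\n</pattern>
            <action>
                ++lineCount;
            </action>
        </rule>
    </state>
</lexer>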

<rule>

Rule tags are used to specify patterns and their associated action code. A rule can have multiple <pattern> children, but exactly one <action> child.

Attribute    Description
state        A comma-separated list of the names of the states that this rule is in. If the rule is also under a <state> tag, the rule is added to the states listed in the attribute as well as those of the <state> tag.
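
The interaction between the state attribute and an enclosing <state> tag can be sketched as follows. The STRING state and the lineCount counter are hypothetical.

<state name="STRING">
    <!-- in STRING because of the parent tag, and in INITIAL because of the attribute -->
    <rule state="INITIAL">
        <pattern>\n</pattern>
        <action>
            ++lineCount;
        </action>
    </rule>
</state>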

<pattern>

Attribute    Description
bol          Specifies that the pattern only matches at BOL (beginning of line).
nocase       Specifies that the pattern is matched case-insensitively.

Although multiple patterns may be under the same rule and share the action code, in the generated code the action code is replicated for each pattern. This avoids the problem that some patterns may work at BOL while others may not. To avoid replicating the action code, combine the patterns in a single <pattern> tag, separated by |.
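
The following sketch contrasts the two ways of sharing an action and shows both attributes in use. It assumes that bol and nocase take the values "true" and "false"; the keywordCount counter is hypothetical.

<!-- two <pattern> tags: the action code is generated once per pattern -->
<rule>
    <pattern nocase="true">begin</pattern>
    <pattern nocase="true">end</pattern>
    <action>
        ++keywordCount;
    </action>
</rule>

<!-- one <pattern> tag with |: the action code is generated once -->
<rule>
    <pattern nocase="true">begin|end</pattern>
    <action>
        ++keywordCount;
    </action>
</rule>

<!-- matches only at the beginning of a line -->
<rule>
    <pattern bol="true">#.*</pattern>
    <action>
        // ignore comment line
    </action>
</rule>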

<action>

It contains the code to be executed when the pattern is matched.

<parser>

A sample XML for <parser> section looks like:

<parser start="program">
    <type format="((Node){0})">stmt expr stmt_list</type>
    <type format="((String){0})">VARIABLE</type>
    <type format="((Integer){0})">INTEGER</type>
    <grammar rule="program">
        <rhs>function</rhs>
        <action>return 0;</action>
    </grammar>
    <grammar rule="function">
        <rhs>function stmt</rhs>
        <action>interpret ($2);</action>
        <rhs></rhs>
    </grammar>
    <grammar rule="stmt">
        <rhs>';'</rhs>
        <action>$$ = new SemiColonNode ();</action>

        <rhs>expr ';'</rhs>
        <action>$$ = $1;</action>

        <rhs>PRINT expr ';'</rhs>
        <action>$$ = new PrintNode ($2);</action>

        <rhs>VARIABLE '=' expr ';'</rhs>
        <action>$$ = new AssignNode ($1, $3);</action>

        <rhs>WHILE '(' expr ')' stmt</rhs>
        <action>$$ = new WhileNode ($3, $5);</action>

        <rhs precedence="IFX">IF '(' expr ')' stmt</rhs>
        <action>$$ = new IfNode ($3, $5, null);</action>

        <rhs>IF '(' expr ')' stmt ELSE stmt</rhs>
        <action>$$ = new IfNode ($3, $5, $7);</action>

        <rhs>'{' stmt_list '}'</rhs>
        <action>$$ = $2;</action>
    </grammar>
</parser>

Options for the parser are specified as attributes of the <parser> tag.

Attribute    Description
start        Specifies the start non-terminal. If this attribute is not specified, the LHS of the first grammar is used.
recovery     Whether the parser should generate error recovery routines. Default is true. Set this attribute to false for a speedy exit from the parser in case of error.
parseerror   Whether the parser should generate the parse error function. Set this attribute to false if you are going to supply your own. Default is true.

See the parser recovery page for more information on error recovery.
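
A sketch of a <parser> tag with all three attributes set. It assumes that parseerror="false" is the setting to use when you supply your own error function.

<!-- explicit start symbol, no error recovery routines, no generated error function -->
<parser start="program" recovery="false" parseerror="false">
    <!-- type and grammar tags -->
</parser>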

<type>

This tag specifies the code used to cast or retrieve members of an argument; {0} in the format stands for the argument. In the example above, $1 is automatically converted to ((Node)$1) if $1 is a stmt, expr, or stmt_list. $1 itself is internally translated to the appropriate variable/function call.

(Note: in the Java code generator, the format does not apply to $$.)
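
Besides plain casts, a format can also retrieve a member of the argument. In the sketch below, the Token class, its text field, the IDENT token and the NameNode class are all hypothetical names invented for illustration.

<!-- every $n that refers to an IDENT is replaced with ((Token)$n).text -->
<type format="((Token){0}).text">IDENT</type>
<type format="((Node){0})">expr</type>
<grammar rule="expr">
    <rhs>IDENT</rhs>
    <!-- $1 is used here as ((Token)$1).text -->
    <action>$$ = new NameNode ($1);</action>
</grammar>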

<grammar>

The value of the rule attribute is a non-terminal. All the productions in the <rhs> tags are for this particular non-terminal.

<rhs>

This tag represents a production for the non-terminal of the parent <grammar> tag. Its action code should follow immediately. If it does not, no action is performed for this particular production.

The <rhs> tag has the following attribute:

Attribute    Description
precedence   Sets the precedence of the production to that of a particular terminal.
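
A common use of this attribute is unary minus, which should bind tighter than binary '-'. The sketch below relies on the UMINUS token declared in the overview; the node classes are hypothetical.

<grammar rule="expr">
    <rhs>INTEGER</rhs>
    <action>$$ = new IntegerNode ($1);</action>

    <rhs>expr '-' expr</rhs>
    <action>$$ = new SubtractNode ($1, $3);</action>

    <!-- unary minus: use the precedence of UMINUS instead of that of '-' -->
    <rhs precedence="UMINUS">'-' expr</rhs>
    <action>$$ = new NegateNode ($2);</action>
</grammar>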

<action>

This tag specifies the code to be executed for the production in the immediately preceding <rhs> tag.

$$ represents the value of the LHS non-terminal. $1, $2, etc. represent the values of the symbols at the corresponding positions of the production specified in the <rhs> tag, counting from 1.
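
As an illustration of the numbering, here is a sketch of a stmt_list grammar that could accompany the sample above. The StmtListNode class is hypothetical.

<grammar rule="stmt_list">
    <!-- one symbol on the right-hand side: $1 is the stmt -->
    <rhs>stmt</rhs>
    <action>$$ = $1;</action>

    <!-- two symbols: $1 is the stmt_list parsed so far, $2 is the new stmt -->
    <rhs>stmt_list stmt</rhs>
    <action>$$ = new StmtListNode ($1, $2);</action>
</grammar>

Literal tokens count as positions too, which is why the WHILE production in the sample above refers to its expr as $3 and its stmt as $5.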