Java

One of the cool features of CookCC is the ability to specify the lexer/parser directly in Java code using annotations, without resorting to an obscure, proprietary input file.

Why Java Annotation

See the CookCC presentation slides for more detailed comparisons and a quick tutorial.

Java Annotation vs Lex/Yacc

The main benefit of using Java annotations is that you can take full advantage of modern IDEs without having to deal with proprietary text files:

  • syntax highlighting
  • context sensitive hints
  • code usage analysis
  • refactoring
  • auto-completion
  • instant error checking
  • etc

This takes a lot of the pain out of writing lexers and parsers.

This approach can be extended to other languages, such as Python or C#, even C / C++.

Java Annotation vs JavaDoc

Although it is possible to use a JavaDoc doclet to extract annotations, the annotation capabilities of Java 1.5 are simply a lot easier to use and deal with. For this reason, I settled on Java 1.5. This, however, does not mean the generated code has to run on JVM 1.5+. One can always target the output class files at earlier versions of the JVM.

Annotation Processing API Changes

It should be noted that the Annotation Processing Tool (APT) was deprecated in Java 7. The newer, compiler-based annotation processing API has been available since Java 6.

CookCC 0.3.3 only supports the older APT. CookCC 0.4+ supports the newer API (I kept the old APT code, but it is not used for any of the features introduced in 0.4+). If you cannot use JDK 6 or later, you will have to use CookCC 0.3.3.

Overview

The following example is a simple calculator script interpreter adapted from A Compact Guide to Lex & Yacc.

Setup

First, add the CookCC jar file to your project path. This jar file is only required for building the parser and setting up the Ant task. It is not required at runtime.

First Step: Annotate a Class

@CookCCOption is used to mark a class that uses the generated lexer/parser. In the following example, we mark the Calculator class.

import org.yuanheng.cookcc.*;

@CookCCOption (lexerTable = "compressed", parserTable = "compressed")
public class Calculator extends Parser
{
    // code
}

The generated class is actually the parent class of Calculator, in this case the Parser class (which needs to be in the same package as Calculator). Since this class has not been generated yet, create an empty Parser.java like the following so that you can work on Calculator without errors in the editor (assuming you are using a decent Java IDE such as Eclipse or IntelliJ IDEA).

/* Copyright (c) 2008 by Heng Yuan */
import org.yuanheng.cookcc.CookCCByte;

/**
 * @author Heng Yuan
 * @version $Id$
 */
public class Parser extends CookCCByte
{
}

Notice that we have the file header (copyright notice) and the class header. CookCC Java input keeps these in the generated class. It also keeps the scope of the class, public in this case. As a good practice, though, the generated class should be in the same package as Calculator and have package scope.

CookCCByte is a class that contains all the functions that could possibly be generated (not all of them will be available, depending on the options and the patterns/rules), so that code in Calculator can use them in advance. Since we are not dealing with Unicode, we extend the CookCCByte class. For lexers that deal with Unicode, extend the CookCCChar class.

Note that the CookCC annotations and CookCCByte are only required at compile time; they are not required at runtime. The generated class no longer extends CookCCByte.

Second Step: Mark a Token Enum

If you are only going to use a lexer, you can skip this section.

Within the Calculator class, you can specify the token Enum class. The Enum does not have to be nested inside Calculator, since @CookCCOption can be used to specify an Enum class defined elsewhere. However, I personally find that such a nested declaration makes things more visual.

Use @CookCCToken to mark an Enum declaration. All names defined in it are treated as terminals by CookCC.

To specify the precedence and the type, mark a token with @TokenGroup and set its type to LEFT, RIGHT or NONASSOC. If the type is not specified, it is assumed to be NONASSOC. Unmarked tokens inherit the precedence and type of the preceding token.

You can use static import to avoid typing TokenType.LEFT and only need to type LEFT. I dislike static import, and I copy/paste code anyways. So it doesn’t really matter.

@CookCCToken
static enum Token
{
    @TokenGroup
    VARIABLE, INTEGER, WHILE, IF, PRINT, ASSIGN, SEMICOLON,
    @TokenGroup
    IFX,
    @TokenGroup
    ELSE,

    @TokenGroup (type = TokenType.LEFT)
    GE, LE, EQ, NE, LT, GT,
    @TokenGroup (type = TokenType.LEFT)
    ADD, SUB,
    @TokenGroup (type = TokenType.LEFT)
    MUL, DIV,
    @TokenGroup (type = TokenType.LEFT)
    UMINUS
}
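The inheritance rule can be illustrated with a small model in plain Java. This is only a sketch and not CookCC code: the names PrecedenceDemo and Decl are hypothetical, and it assumes that each @TokenGroup opens a new precedence level numbered in declaration order (whether a larger number binds tighter, as in yacc, is also an assumption). It uses a subset of the tokens above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the @TokenGroup inheritance rule; not CookCC code.
final class PrecedenceDemo
{
    // One declared token: its name, and the group type if it carries
    // its own @TokenGroup marker (null if it is unmarked).
    static final class Decl
    {
        final String name;
        final String groupType;

        Decl (String name, String groupType)
        {
            this.name = name;
            this.groupType = groupType;
        }
    }

    // Returns token name -> "level:type". A marked token opens a new level;
    // an unmarked token inherits the level and type of the token before it.
    static Map<String, String> assign (List<Decl> decls)
    {
        Map<String, String> result = new LinkedHashMap<String, String> ();
        int level = 0;
        String type = "NONASSOC";
        for (Decl d : decls)
        {
            if (d.groupType != null)
            {
                ++level;
                type = d.groupType;
            }
            result.put (d.name, level + ":" + type);
        }
        return result;
    }

    // A subset of the Token enum shown above, in declaration order.
    static List<Decl> sample ()
    {
        List<Decl> list = new ArrayList<Decl> ();
        list.add (new Decl ("IFX", "NONASSOC"));
        list.add (new Decl ("ELSE", "NONASSOC"));
        list.add (new Decl ("ADD", "LEFT"));
        list.add (new Decl ("SUB", null));
        list.add (new Decl ("MUL", "LEFT"));
        list.add (new Decl ("DIV", null));
        list.add (new Decl ("UMINUS", "LEFT"));
        return list;
    }

    public static void main (String[] args)
    {
        for (Map.Entry<String, String> e : assign (sample ()).entrySet ())
            System.out.println (e.getKey () + " -> " + e.getValue ());
    }
}
```

In this model SUB ends up in the same group as ADD, and DIV in the same group as MUL, which is why only the first token of each group needs a marker.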

One unfortunate drawback of using Enum tokens is that terminals such as '=' and '<' have to be given names. On the other hand, it is good to have them defined, since that makes them easier to use in abstract syntax trees (ASTs).

(Currently, CookCC does not have a tree generator yet. Hopefully it can be added in the near future.)

Lexer Section

Specifying Shortcuts

@Shortcut is used to specify a single frequently used pattern that can be reused in actual lexical patterns. Multiple shortcuts can be defined using the @Shortcuts annotation. These annotations can be placed on any function. The order of the shortcuts is not important, and a shortcut may contain references to other shortcuts. Just be careful not to create cyclic references.

@Shortcuts ( shortcuts = {
    @Shortcut (name="nonws", pattern="[^ \\t\\n]"),
    @Shortcut (name="ws", pattern="[ \\t]")
})
@Lex (pattern="{nonws}+", state="INITIAL")
void matchWord ()
{
    m_cc += yyLength ();
    ++m_wc;
}
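To see why cyclic references are a problem, consider how {name} references might be expanded. The following plain Java sketch is not CookCC's implementation: ShortcutDemo, its methods, the extra shortcut named word, and the choice to wrap each expansion in parentheses (as lex-style tools commonly do) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of {name} expansion; not CookCC's implementation.
final class ShortcutDemo
{
    // Matches {identifier}; repetition counts such as {2,3} are not matched.
    private static final Pattern REF =
        Pattern.compile ("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    // Replaces every {name} in pattern with its recursively expanded body.
    // inProgress tracks the shortcuts currently being expanded, so that a
    // shortcut which (directly or indirectly) refers to itself is detected.
    static String expand (String pattern, Map<String, String> shortcuts,
                          Set<String> inProgress)
    {
        StringBuilder out = new StringBuilder ();
        Matcher m = REF.matcher (pattern);
        int last = 0;
        while (m.find ())
        {
            String name = m.group (1);
            String body = shortcuts.get (name);
            if (body == null)
                throw new IllegalArgumentException ("unknown shortcut: " + name);
            if (!inProgress.add (name))
                throw new IllegalStateException ("cyclic shortcut: " + name);
            out.append (pattern, last, m.start ());
            out.append ('(').append (expand (body, shortcuts, inProgress)).append (')');
            inProgress.remove (name);
            last = m.end ();
        }
        out.append (pattern, last, pattern.length ());
        return out.toString ();
    }

    // The two shortcuts from the example above, plus a hypothetical third
    // one ("word") that refers to another shortcut.
    static Map<String, String> sample ()
    {
        Map<String, String> s = new HashMap<String, String> ();
        s.put ("nonws", "[^ \\t\\n]");
        s.put ("ws", "[ \\t]");
        s.put ("word", "{nonws}+");
        return s;
    }

    public static void main (String[] args)
    {
        System.out.println (expand ("{ws}*{word}", sample (), new HashSet<String> ()));
    }
}
```

Because definitions are looked up by name at expansion time, their declaration order does not matter in this model, while a pair such as a = "{b}" and b = "{a}" never terminates unless it is detected and rejected.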

Specifying Lexical Patterns

@Lex is used to specify a single lexical pattern. @Lexs is used to specify multiple lexical patterns that share a common action.

There are three types of functions that can be marked with these two annotations, each with a different meaning.

None of the functions can be private, since they are called from the generated class. They can be protected, public or in the package scope (if the generated class is in the same package as this class).

Case 1: Function returns void

This is the simplest case. The lexer calls this function and then moves on to matching the next potential pattern. For example:

@Lex (pattern = "[ \\t\\r\\n]+")
protected void ignoreWhiteSpace ()
{
}

Note that it is necessary to use double backslashes here for escape sequences, because Java itself also interprets escape sequences in string literals. This is perhaps one of the main drawbacks of using Java annotations to specify the lexer. Fortunately, it is usually fairly easy to get the lexer working correctly. IntelliJ IDEA also has a nice feature: pasting text containing escape sequences, such as [ \t\r\n]+, inside a pair of double quotes automatically adds the extra backslashes.
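The effect of the doubled backslashes can be checked with plain Java, independent of CookCC. In this small sketch (EscapeDemo and its method names are made up for illustration), the doubled form keeps the backslash and the letter as two separate characters, which is the text a pattern reader receives, while the single form is converted by the Java compiler into control characters before any annotation processing happens.

```java
// Plain Java illustration of string-literal escaping; it does not use CookCC.
final class EscapeDemo
{
    // What you type in the annotation: each backslash is doubled, so the
    // string contains a backslash followed by a letter.
    static String doubled ()
    {
        return "[ \\t\\r\\n]+";
    }

    // With single backslashes, the Java compiler turns \t, \r and \n into
    // the tab, carriage return and line feed characters.
    static String single ()
    {
        return "[ \t\r\n]+";
    }

    public static void main (String[] args)
    {
        System.out.println (doubled ());            // prints [ \t\r\n]+
        System.out.println (doubled ().length ());  // prints 10
        System.out.println (single ().length ());   // prints 7
    }
}
```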

Case 2: Function returns a non-int value

In this case, @Lex needs to contain the terminal token that would be returned. The return value from the function is going to be the value associated with this terminal.

The terminal has to be specified as a String due to a technical limitation.

@Lex (pattern="[0-9]+", token="INTEGER")
protected Integer parseInt ()
{
    return Integer.parseInt (yyText ());
}

@Lexs (patterns = {
    @Lex (pattern = "while", token = "WHILE"),
    @Lex (pattern = "if", token = "IF"),
    @Lex (pattern = "else", token = "ELSE"),
    @Lex (pattern = "print", token = "PRINT")
})
protected Object parseKeyword ()
{
    return null;
}

Case 3: Function returns an int value

In this case, the lexer would return the value. For example:

@Lex (pattern="[(){}.]")
protected int parseSymbol ()
{
    return yyText ().charAt (0);
}

Be extra careful if the return value is used as a terminal in the parser. A value that is not one of the valid terminals can cause the parser to terminate early.

Note that when <<EOF>> is encountered, it is necessary to return a value or the lexer would get into an infinite loop. There are a number of ways of doing so:

@Lex (pattern = "<<EOF>>", token = "$")
protected void parseEOF ()
{
}

Or you can simply do

@Lex (pattern = "<<EOF>>")
protected int parseEOF ()
{
    return 0;
}

This works because the $ terminal corresponds to 0.

Parser Section

Specifying Parser Rules

@Rule specifies a single grammar rule. @Rules can be used to specify multiple rules that share the same function action.

Functions marked with @Rule also fall into three cases.

Case 1: Function returns void

In this case, the value associated with the non-terminal of the LHS is null.

@Rule (lhs = "function", rhs = "function stmt", args = "2")
protected void parseFunction (Node node)
{
    interpret (node);
}

Case 2: Function returns a non-int value

In this case, the return value is automatically associated with the non-terminal on the LHS.

@Rule (lhs = "stmt", rhs = "SEMICOLON")
protected Node parseStmt ()
{
    return new SemiColonNode ();
}

Case 3: Function returns an int value

This function is used by the grammar start non-terminal to signal the exit of the parser with the particular value. It can be used by error processing functions as well.

@Rule (lhs = "program", rhs = "function")
protected int parseProgram ()
{
    return 0;
}

Passing Arguments

The args attribute of the @Rule annotation is a list of indexes (separated by commas or spaces) of the RHS symbols whose values the method expects as arguments. Indexing starts from 1 for the first symbol of the production's RHS.

For example:

@Rule (lhs = "stmt", rhs = "VARIABLE ASSIGN expr SEMICOLON", args = "1 3")
protected Node parseAssign (String var, Node expr)
{
    return new AssignNode (var, expr);
}

This will assign the value of symbol VARIABLE to the method parameter String var and the value of symbol expr to the method parameter Node expr.

Note that the indexes need not be in any specific order. This would be equivalent (indexes and method parameters swapped):

@Rule (lhs = "stmt", rhs = "VARIABLE ASSIGN expr SEMICOLON", args = "3 1")
protected Node parseAssign (Node expr, String var)
{
    return new AssignNode (var, expr);
}

As you can see, one does not have to mess with $$, $1 etc, and does not have to deal with type information specified elsewhere. This approach is much more intuitive.
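The index-to-parameter mapping can be modeled in a few lines of plain Java. This is a hypothetical sketch, not the code CookCC generates: ArgsDemo and select are invented names, and the list of values stands in for whatever the parser keeps for the symbols on the RHS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of how args selects RHS values; not CookCC's generated code.
final class ArgsDemo
{
    // rhsValues holds the value of each RHS symbol in order. args lists
    // 1-based indexes separated by commas and/or spaces. The result is the
    // argument list for the action method, in the order the indexes appear.
    static List<Object> select (String args, List<Object> rhsValues)
    {
        List<Object> picked = new ArrayList<Object> ();
        for (String part : args.trim ().split ("[,\\s]+"))
        {
            if (part.isEmpty ())
                continue;       // an empty args string selects nothing
            int index = Integer.parseInt (part);
            picked.add (rhsValues.get (index - 1));
        }
        return picked;
    }

    public static void main (String[] a)
    {
        // Stand-in values for: VARIABLE ASSIGN expr SEMICOLON
        List<Object> values = Arrays.<Object>asList ("x", null, 42, null);
        System.out.println (select ("1 3", values));   // prints [x, 42]
        System.out.println (select ("3, 1", values));  // prints [42, x]
    }
}
```

The order of the indexes determines the order of the method parameters, which is why the two parseAssign variants above are equivalent.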

Examples

There are a number of examples in the test cases.