Java¶
One of the cool features of CookCC is being able to specify the lexer/parser directly in Java code using annotations, without resorting to an obscure / proprietary input file.
Why Java Annotation¶
You can look at the CookCC presentation slides for more detailed comparisons and a quick tutorial.
Java Annotation vs Lex/Yacc¶
The main benefit of using Java annotations is that you can take full advantage of modern IDEs without having to deal with proprietary text files:
- syntax highlighting
- context sensitive hints
- code usage analysis
- refactoring
- auto-completion
- instant error checking
- etc
So it takes a lot of the pain out of writing lexers / parsers.
This approach can be extended to other languages, such as Python or C#, or even C / C++.
Java Annotation vs JavaDoc¶
Although it is possible to use a JavaDoc doclet to extract annotations, the annotation capabilities of Java 1.5 are simply much easier to use and deal with. For this reason, I settled on Java 1.5. This, however, does not mean the generated code has to run under JVM 1.5+. One can always target the output class files for earlier versions of the JVM.
Annotation Processing API Changes¶
It should be noted that the Annotation Processing Tool (APT) was deprecated in Java 7. The newer Java compiler based processing API has been available since Java 6.
CookCC 0.3.3 only supports the older APT. CookCC 0.4+ supports the newer API (I kept the old API code, but it is not used for any of the newer features introduced in 0.4+). If you cannot use a Java 6 or later JDK, then you will have to use CookCC 0.3.3.
Overview¶
The following example is a simple calculator script interpreter adapted from A Compact Guide to Lex & Yacc.
Setup¶
First, add the CookCC jar file to your project path. This jar file is only required for building the parser and setting up the ant task; it is not required at runtime.
First Step: Annotate a Class¶
`@CookCCOption` is used to mark a class that uses the generated lexer/parser. In the following example, we mark the `Calculator` class.
import org.yuanheng.cookcc.*;
@CookCCOption (lexerTable = "compressed", parserTable = "compressed")
public class Calculator extends Parser
{
// code
}
The generated class is actually the parent class of `Calculator`, in this case the `Parser` class (which needs to be in the same package as `Calculator`). Since we have not actually generated this class yet, to work on the `Calculator` class without errors in the editor (assuming that you are using a decent Java IDE such as Eclipse or IntelliJ IDEA), create an empty `Parser.java` like the following code.
/* Copyright (c) 2008 by Heng Yuan */
import org.yuanheng.cookcc.CookCCByte;
/**
* @author Heng Yuan
* @version $Id$
*/
public class Parser extends CookCCByte
{
}
Notice that we have the file header (copyright notice) and the class header. CookCC Java input will keep these in the generated class. It also keeps the scope of the class, `public` in this case. As a good practice, though, the generated class should be in the same package as `Calculator`, and in package scope.
`CookCCByte` is a class that contains all the possible generated functions (not all of them will be available, depending on the options and the patterns/rules), so that code in `Calculator` can use them in advance. Since we are not dealing with Unicode, we extend the `CookCCByte` class. For lexers that deal with Unicode, extend the `CookCCChar` class.

Note that all the CookCC annotations and `CookCCByte` are only required at compile time; they are not required at runtime. The generated class no longer extends `CookCCByte`.
Second Step: Mark a Token Enum¶
If you are only going to use a lexer, you can skip this section.
Within the `Calculator` class, you can specify the token enum class. It is not required to have this enum as a nested class of `Calculator`, since `@CookCCOption` can be used to specify an enum class defined elsewhere. However, I personally find that such a nested declaration makes it more visual.
Use `@CookCCToken` to mark an enum declaration. All the names defined here are treated as terminals by CookCC.

To specify the precedence and the type, mark a token with `@TokenGroup` and set its type to LEFT, RIGHT, or NONASSOC. If the type is not specified, it is assumed to be NONASSOC. All unmarked tokens inherit the precedence and type of the previous token.
You can use static import to avoid typing `TokenType.LEFT` and only need to type `LEFT`. I dislike static import, and I copy/paste code anyway, so it doesn't really matter to me.
@CookCCToken
static enum Token
{
@TokenGroup
VARIABLE, INTEGER, WHILE, IF, PRINT, ASSIGN, SEMICOLON,
@TokenGroup
IFX,
@TokenGroup
ELSE,
@TokenGroup (type = TokenType.LEFT)
GE, LE, EQ, NE, LT, GT,
@TokenGroup (type = TokenType.LEFT)
ADD, SUB,
@TokenGroup (type = TokenType.LEFT)
MUL, DIV,
@TokenGroup (type = TokenType.LEFT)
UMINUS
}
One unfortunate drawback of using enum tokens is that it is necessary to give names to terminals such as `'='`, `'<'`, etc. On the other hand, it is good to have them defined, since it is easier to use them in abstract syntax trees (ASTs).
(Currently, CookCC does not have a tree generator yet. Hopefully it can be added in the near future.)
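Since the named terminals feed naturally into hand-written AST classes, here is a minimal sketch of what such nodes might look like. The class names `IntNode` and `AddNode` are illustrative assumptions, not part of CookCC; the parser examples later in this document return similar `Node` objects from rule actions.

```java
// Hypothetical AST sketch; CookCC does not generate these classes.
abstract class Node
{
	abstract int eval ();
}

// A literal produced when the lexer matches the INTEGER terminal.
class IntNode extends Node
{
	final int m_value;
	IntNode (int value) { m_value = value; }
	int eval () { return m_value; }
}

// A sum built when the parser reduces a rule involving the ADD terminal.
class AddNode extends Node
{
	final Node m_left;
	final Node m_right;
	AddNode (Node left, Node right) { m_left = left; m_right = right; }
	int eval () { return m_left.eval () + m_right.eval (); }
}
```

With named terminals such as `INTEGER` and `ADD`, the mapping from grammar rules to these node classes stays readable; for example, `new AddNode (new IntNode (1), new IntNode (2)).eval ()` yields 3.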
Lexer Section¶
Specifying Shortcuts¶
`@Shortcut` is used to specify a single frequently used pattern that can be re-used in actual lexical patterns. Multiple shortcuts can be defined using the `@Shortcuts` annotation. Just specify it on any function. The order of shortcuts is not important, and a shortcut may reference other shortcuts. Just be careful not to create cyclic references.
@Shortcuts ( shortcuts = {
@Shortcut (name="nonws", pattern="[^ \\t\\n]"),
@Shortcut (name="ws", pattern="[ \\t]")
})
@Lex (pattern="{nonws}+", state="INITIAL")
void matchWord ()
{
m_cc += yyLength ();
++m_wc;
}
Specifying Lexical Patterns¶
`@Lex` is used to specify a single lexical pattern. `@Lexs` is used to specify multiple lexical patterns that share a common action.

There are three types of functions that can be marked with these two annotations. Each has a different meaning.

None of the functions can be `private`, since they are called from the generated class. They can be `protected`, `public`, or in package scope (if the generated class is in the same package as this class).
Case 1: Function returns void¶
This is the simplest case. The lexer calls this function and then moves on to matching the next potential pattern. For example:
@Lex (pattern = "[ \\t\\r\\n]+")
protected void ignoreWhiteSpace ()
{
}
Note that it is necessary to use double backslashes here for escape sequences, because Java itself also interprets escape sequences. This is perhaps one of the main drawbacks of using Java annotations to specify the lexer. Fortunately, the lexer is usually fairly easy to get working correctly. IntelliJ IDEA also has a nice feature: pasting code with an escape sequence such as `[ \t\r\n]+` inside a pair of double quotes automatically adds the extra backslashes.
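To see the doubling in action outside of CookCC, this standalone snippet (using `java.util.regex` purely for illustration) shows that the Java string literal `"[ \\t\\r\\n]+"` denotes the regular expression `[ \t\r\n]+`:

```java
import java.util.regex.Pattern;

public class EscapeDemo
{
	public static void main (String[] args)
	{
		// The doubled backslashes in the Java literal become the
		// single-backslash regex [ \t\r\n]+ at runtime.
		Pattern ws = Pattern.compile ("[ \\t\\r\\n]+");
		System.out.println (ws.matcher (" \t\r\n").matches ()); // prints "true"
	}
}
```

The same rule applies to patterns written in `@Lex` and `@Shortcut` annotations: the pattern string the lexer generator sees is the string after Java's own escape processing.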
Case 2: Function returns a non-int value¶
In this case, `@Lex` needs to contain the terminal token that will be returned. The return value of the function becomes the value associated with this terminal.

We have to specify the terminal as a String due to a technical limitation.
@Lex (pattern="[0-9]+", token="INTEGER")
protected Integer parseInt ()
{
return Integer.parseInt (yyText ());
}
@Lexs (patterns = {
@Lex (pattern = "while", token = "WHILE"),
@Lex (pattern = "if", token = "IF"),
@Lex (pattern = "else", token = "ELSE"),
@Lex (pattern = "print", token = "PRINT")
})
protected Object parseKeyword ()
{
return null;
}
Case 3: Function returns an int value¶
In this case, the lexer returns the value directly. For example:
@Lex (pattern="[(){}.]")
protected int parseSymbol ()
{
return yyText ().charAt (0);
}
Be extra careful if the return value is used as a terminal in the parser. Values outside the set of valid terminals can result in early termination of the parser.
Note that when `<<EOF>>` is encountered, it is necessary to return a value, or the lexer will get into an infinite loop. There are a number of ways of doing so:
@Lex (pattern = "<<EOF>>", token = "$")
protected void parseEOF ()
{
}
Or you can simply do
@Lex (pattern = "<<EOF>>")
protected int parseEOF ()
{
return 0;
}
This is because the `$` terminal corresponds to 0.
Parser Section¶
Specifying Parser Rules¶
`@Rule` specifies a single grammar rule. `@Rules` can be used to specify multiple rules that share the same action function.

There are also three cases of functions marked with `@Rule`:
Case 1: Function returns void¶
In this case, the value associated with the non-terminal of the LHS is null.
@Rule (lhs = "function", rhs = "function stmt", args = "2")
protected void parseFunction (Node node)
{
interpret (node);
}
Case 2: Function returns a non-int value¶
In this case, the return value is automatically associated with the non-terminal on the LHS.
@Rule (lhs = "stmt", rhs = "SEMICOLON")
protected Node parseStmt ()
{
return new SemiColonNode ();
}
Case 3: Function returns an int value¶
This function is used by the grammar start non-terminal to signal the exit of the parser with a particular value. It can be used by error-processing functions as well.
@Rule (lhs = "program", rhs = "function")
protected int parseProgram ()
{
return 0;
}
Passing Arguments¶
`args` of the `@Rule` annotation is a list of indexes (separated by commas or spaces) of the symbols which the method expects as arguments. Indexes start from 1 for the first symbol of the RHS production.
For example:
@Rule (lhs = "stmt", rhs = "VARIABLE ASSIGN expr SEMICOLON", args = "1 3")
protected Node parseAssign (String var, Node expr)
{
return new AssignNode (var, expr);
}
This will assign the value of the symbol `VARIABLE` to the method parameter `String var`, and the value of the symbol `expr` to the method parameter `Node expr`.
Note that the indexes need not be in any specific order. This would be equivalent (indexes and method parameters swapped):
@Rule (lhs = "stmt", rhs = "VARIABLE ASSIGN expr SEMICOLON", args = "3 1")
protected Node parseAssign (Node expr, String var)
{
return new AssignNode (var, expr);
}
As you can see, one does not have to mess with `$$`, `$1`, etc., and does not have to deal with type information specified elsewhere. This approach is much more intuitive.
Examples¶
There are a number of examples in the test cases.