XML
CookCC can generate both lexers and parsers. The input XML (DTD) thus contains a lexer section and a parser section. Only one of the two sections is required. The file extension is *.xcc.
There are plenty of examples in the test cases.
Overview
A sample XML file looks like:
<?xml version = "1.0" encoding="UTF-8"?>
<!DOCTYPE cookcc PUBLIC "-//CookCC//1.0" "https://raw.githubusercontent.com/coconut2015/cookcc/master/src/resources/cookcc.dtd">
<cookcc unicode="false">
    <tokens>VARIABLE INTEGER WHILE IF PRINT</tokens>
    <tokens type="nonassoc">IFX</tokens>
    <tokens type="nonassoc">ELSE</tokens>
    <tokens type="left"><![CDATA[GE LE EQ NE '>' '<']]></tokens>
    <tokens type="left">'+' '-'</tokens>
    <tokens type="left">'*' '/'</tokens>
    <tokens type="nonassoc">UMINUS</tokens>
    <lexer>
        <!-- lexer section -->
    </lexer>
    <parser start="program">
        <!-- parser section -->
    </parser>
    <code name="default"><![CDATA[
        /* code section, can appear anywhere directly under the <cookcc> tag. */
    ]]></code>
</cookcc>
Some XML editors, such as the one in IntelliJ IDEA, can perform on-the-fly XML syntax checks and suggest tags and attributes automatically. CookCC also checks the validity of the input using the DTD before parsing.
XML Tag Explanations
<cookcc>
This is the document root of the XML file. It has a single attribute, unicode, which indicates whether or not the lexer parses Unicode. By default, unicode is false.
<code>
The <code> tag can hold arbitrary pieces of code. There can be multiple such tags, and they can appear anywhere directly under the <cookcc> tag. The name of each code block needs to be unique. If the name is not specified, it is assumed to be "default". The exact usage of the code depends on the template being used. For Java output, take a look at the Java output template to get a rough idea.
<tokens>
A list of token names, separated by spaces, tabs or new lines.
The type attribute is used by the parser to determine the associativity of the tokens. There are three types: "left", "right", and "nonassoc". If the type is not specified, it is assumed to be "nonassoc".
Tokens inside the same <tokens> tag have the same precedence level. Tokens specified in later <tokens> tags have higher precedence levels.
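As a rough illustration of how declaration order and the type attribute combine, the fragment below declares four precedence levels, lowest first. The '=' and '^' tokens are hypothetical additions for this sketch and do not appear in the sample above.

```xml
<!-- lowest precedence: right-associative, so a = b = c groups as a = (b = c) -->
<tokens type="right">'='</tokens>
<!-- same level, left-associative: a - b + c groups as (a - b) + c -->
<tokens type="left">'+' '-'</tokens>
<!-- declared later, so these bind tighter than '+' and '-' -->
<tokens type="left">'*' '/'</tokens>
<!-- highest precedence: hypothetical right-associative exponent operator -->
<tokens type="right">'^'</tokens>
```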
<lexer>
A sample XML for the <lexer> section looks like:
<lexer table="ecs">
    <shortcut name="nonws">[^ \t\r\n]</shortcut>
    <shortcut name="word">{nonws}+</shortcut>
    <rule>
        <pattern>{word}</pattern>
        <pattern>{word}[ \t\r\n]*</pattern>
        <action>
            ++wordCount;
        </action>
    </rule>
    <state name="INITIAL,TEST">
        <rule>
            <pattern>.|\n</pattern>
            <action>
                // ignore
            </action>
        </rule>
        <rule state="ANOTHER_STATE">
            <pattern><![CDATA[<<EOF>>]]></pattern>
            <action>
                return 0; /* exit lexer loop */
            </action>
        </rule>
    </state>
</lexer>
Options for the lexer are specified as attributes of the <lexer> tag.
Attribute | Description |
---|---|
table | The DFA table format. Can be one of "ecs", "full", or "compressed". Command line options can override the choice here. |
bol | Instruct the lexer to keep track of BOL (beginning of line) information even when no patterns use that information. |
warnbackup | Generate warnings about backup lexer states if set to true. Default is false. |
yywrap | Indicates that the yyWrap () function should be called when EOF is encountered. Default is false. |
linemode | Instruct the lexer to match patterns one line at a time. This mode is primarily useful for interactive modes where inputs are delimited by the '\n' character. Multi-line patterns will generate warnings since they cannot be matched in this mode. |
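A minimal sketch of several options set together is shown below. It assumes that the boolean attributes accept the literal values "true" and "false", which the table implies but does not spell out. The pattern and action are placeholders.

```xml
<!-- assumption: boolean attributes are written as "true" / "false" -->
<lexer table="compressed" warnbackup="true" linemode="true">
    <rule>
        <!-- a single-line pattern, so linemode should not warn about it -->
        <pattern>[a-z]+</pattern>
        <action>
            // handle one word of interactive input
        </action>
    </rule>
</lexer>
```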
<shortcut>
This tag is used to name frequently used sub-patterns. In the above example, when the pattern {word} is seen, it is replaced with ({nonws}+), in which {nonws} is in turn replaced with ([^ \t\r\n]). So the actual pattern is equivalent to [^ \t\r\n]+.
<shortcut> tags can only be specified as immediate children of <lexer>.
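The same substitution can be sketched with a second pair of shortcuts. The names digit, integer and integerCount are hypothetical, and the expansion in the comment assumes the parenthesized replacement described above.

```xml
<lexer>
    <!-- hypothetical shortcuts, not taken from the CookCC test cases -->
    <shortcut name="digit">[0-9]</shortcut>
    <shortcut name="integer">{digit}+</shortcut>
    <rule>
        <!-- {integer} expands to ({digit}+), and then to (([0-9])+) -->
        <pattern>{integer}</pattern>
        <action>
            ++integerCount;
        </action>
    </rule>
</lexer>
```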
<state>
<state> tags are used to indicate state conditions. The tag has only one attribute, name, which specifies a comma separated list of state names. All rules specified under this tag are automatically added to each of the listed states. If the name attribute is not specified, it is assumed to be INITIAL, which is required as the initial state at the start of the lexer.
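The fragment below sketches both forms of the tag. The state names COMMENT and STRING are hypothetical, and the actions are placeholders.

```xml
<lexer>
    <!-- name omitted: these rules are added to INITIAL -->
    <state>
        <rule>
            <pattern>[a-z]+</pattern>
            <action>
                // handle a word in the INITIAL state
            </action>
        </rule>
    </state>
    <!-- comma separated list: the rule is added to both COMMENT and STRING -->
    <state name="COMMENT,STRING">
        <rule>
            <pattern>.|\n</pattern>
            <action>
                // ignore
            </action>
        </rule>
    </state>
</lexer>
```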
<rule>
Rule tags are used to specify patterns and their associated action code. A rule can have multiple <pattern> children, but one and only one <action> child.
Attribute | Description |
---|---|
state | A comma separated list of state names that this rule is in. If the current rule is already under a <state> tag, then the rule is added to the states of that tag as well as the states listed here. |
<pattern>
Attribute | Description |
---|---|
bol | Specify that the pattern only matches at BOL (beginning of line). |
nocase | Specify that the pattern is matched case-insensitively. |
Although multiple patterns may be under the same rule and share the action code, in the generated code the action code is replicated for each pattern. This avoids the problem that some patterns may work at BOL while others may not. To avoid action code replication, try putting the patterns inside a single <pattern> tag with | in between.
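The two rules below sketch the difference. The patterns and the numberCount variable are hypothetical. Both rules match the same input, but only the second avoids duplicating the action code.

```xml
<!-- two <pattern> tags: the action code is generated once per pattern -->
<rule>
    <pattern>[0-9]+</pattern>
    <pattern>0x[0-9a-fA-F]+</pattern>
    <action>
        ++numberCount;
    </action>
</rule>

<!-- one <pattern> tag with | in between: the action code is generated once -->
<rule>
    <pattern>[0-9]+|0x[0-9a-fA-F]+</pattern>
    <action>
        ++numberCount;
    </action>
</rule>
```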
<action>
Contains the code to be executed when the pattern is matched.
<parser>
A sample XML for the <parser> section looks like:
<parser start="program">
    <type format="((Node){0})">stmt expr stmt_list</type>
    <type format="((String){0})">VARIABLE</type>
    <type format="((Integer){0})">INTEGER</type>
    <grammar rule="program">
        <rhs>function</rhs>
        <action>return 0;</action>
    </grammar>
    <grammar rule="function">
        <rhs>function stmt</rhs>
        <action>interpret ($2);</action>
        <rhs></rhs>
    </grammar>
    <grammar rule="stmt">
        <rhs>';'</rhs>
        <action>$$ = new SemiColonNode ();</action>
        <rhs>expr ';'</rhs>
        <action>$$ = $1;</action>
        <rhs>PRINT expr ';'</rhs>
        <action>$$ = new PrintNode ($2);</action>
        <rhs>VARIABLE '=' expr ';'</rhs>
        <action>$$ = new AssignNode ($1, $3);</action>
        <rhs>WHILE '(' expr ')' stmt</rhs>
        <action>$$ = new WhileNode ($3, $5);</action>
        <rhs precedence="IFX">IF '(' expr ')' stmt</rhs>
        <action>$$ = new IfNode ($3, $5, null);</action>
        <rhs>IF '(' expr ')' stmt ELSE stmt</rhs>
        <action>$$ = new IfNode ($3, $5, $7);</action>
        <rhs>'{' stmt_list '}'</rhs>
        <action>$$ = $2;</action>
    </grammar>
</parser>
Options for the parser are specified as attributes of the <parser> tag.
Attribute | Description |
---|---|
start | Specify the start non-terminal. If this attribute is not specified, the LHS of the first grammar is used. |
recovery | Whether the parser should generate error recovery routines. This attribute defaults to true. Set it to false for a speedy exit from the parser in case of error. |
parseerror | Whether the parser should generate the error function. Set this attribute to false if the user is going to supply one. This attribute defaults to true. |
See the parser recovery page for more information on error recovery.
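A minimal sketch of a parser that opts out of both defaults follows. It assumes the attributes take the literal values "true" and "false", and that parseerror="false" means the error function is supplied by hand. The grammar body is a placeholder.

```xml
<!-- assumption: recovery and parseerror are written as "true" / "false" -->
<parser start="program" recovery="false" parseerror="false">
    <grammar rule="program">
        <rhs>stmt_list</rhs>
        <action>return 0;</action>
    </grammar>
    <!-- remaining grammar rules omitted -->
</parser>
```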
<type>
This tag specifies the code used to cast or retrieve members of an argument, which is written as {0} in the format. In the example above, $1 is automatically converted to ((Node)$1) if $1 is a stmt, expr, or stmt_list. $1 itself is internally translated to the appropriate variable/function call. (Note that in the Java code generator, the format does not apply to $$.)
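The effect of the format can be traced on one production from the sample. The comment describes the substitution as explained above; it is an illustration rather than the literal generated code.

```xml
<type format="((Node){0})">stmt expr stmt_list</type>
<type format="((String){0})">VARIABLE</type>

<grammar rule="stmt">
    <rhs>VARIABLE '=' expr ';'</rhs>
    <!-- $1 is a VARIABLE, so it is used as ((String)$1);
         $3 is an expr, so it is used as ((Node)$3) -->
    <action>$$ = new AssignNode ($1, $3);</action>
</grammar>
```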
<grammar>
The attribute value of rule is a non-terminal. All the productions in the <rhs> children are for this particular non-terminal.
<rhs>
This tag represents a production for the non-terminal of the parent grammar tag. Its action code should immediately follow it. If no <action> follows, no action is performed for this particular production.
The attributes for the rhs tag are:
Attribute | Description |
---|---|
precedence | Set the precedence of the production to that of a particular terminal. |
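The fragment below sketches both points, using the UMINUS token declared in the overview. The node classes and the opt_semi rule are hypothetical names for this sketch.

```xml
<grammar rule="expr">
    <rhs>INTEGER</rhs>
    <action>$$ = new IntegerNode ($1);</action>
    <rhs>expr '-' expr</rhs>
    <action>$$ = new SubtractNode ($1, $3);</action>
    <!-- unary minus takes the precedence of UMINUS instead of that of '-' -->
    <rhs precedence="UMINUS">'-' expr</rhs>
    <action>$$ = new NegateNode ($2);</action>
</grammar>
<grammar rule="opt_semi">
    <!-- neither production is followed by <action>, so no code runs for them -->
    <rhs>';'</rhs>
    <rhs></rhs>
</grammar>
```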
<action>
This tag is used to specify the code that should be called for the production in the <rhs> immediately above.
$$ represents the value for the LHS non-terminal. $1, $2, etc. represent the object values of the symbols at the corresponding positions of the production specified in the <rhs> tag, starting from 1.
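The numbering can be read off one production from the sample above. Every symbol counts toward the position, including literal tokens such as '(' and ')'.

```xml
<grammar rule="stmt">
    <!-- positions: WHILE is $1, '(' is $2, expr is $3, ')' is $4, stmt is $5 -->
    <rhs>WHILE '(' expr ')' stmt</rhs>
    <!-- $$ becomes the value of the stmt on the left-hand side -->
    <action>$$ = new WhileNode ($3, $5);</action>
</grammar>
```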