<lexer>

A sample XML for the <lexer> section looks like:

<lexer table="ecs">
    <shortcut name="nonws">[^ \t\r\n]</shortcut>
    <shortcut name="word">{nonws}+</shortcut>
    <rule>
        <pattern>{word}</pattern>
        <pattern>{word}[ \t\r\n]*</pattern>
        <action>
            ++wordCount;
        </action>
    </rule>
    <state name="INITIAL,TEST">
        <rule>
            <pattern>.|\n</pattern>
            <action>
                // ignore
            </action>
        </rule>
        <rule state="ANOTHER_STATE">
            <pattern><![CDATA[<<EOF>>]]></pattern>
            <action>
                return 0;  /* exit lexer loop */
            </action>
        </rule>
    </state>
</lexer>

Options for the lexer are specified as attributes of the <lexer> tag.

Attribute Description
table The DFA table format. Can be one of "ecs", "full", "compressed" options. Command line options can override the choice here.
bol Instruct the lexer to keep track of the BOL (beginning of line) information even when there are no patterns use that information.
warnbackup Generate warning of backup lexer states if set to true. Default is false.
yywrap Indicates that yyWrap () function should be called when EOF is encountered. Default is false.
linemode Instruct the lexer to match patterns one line at a time. This mode is primarily useful for interactive modes where inputs are delimited by '\n' character. Multi-line patterns will generate warnings since they cannot be matched in this mode.

<shortcut>

This tag is used to specify frequently used subset of patterns. In the above example, when the pattern {word} is seen, it is replaced with ({nonws}), which is in turn replaced with ([^ \t\r\n]). So the actual pattern is [^ \t\r\n]+.

<shortcut> tags can only be specified as immediate children of <lexer>.

<state>

<state> tags are used to indicate the state conditions. It has only one attribute name to specify a comma separated list of state names. All rules specified under this tag are automatically added to this particular state. If the name attribute is not specified, it is assumed to be {{{INITIAL}}, which is required as the initial state at the start of the lexer.

<rule>

Rule tags are used to specify patterns and their associated action codes. It can have multiple <pattern> children, but one and only one <action> child.

Attribute Description
state A comma separated list of state names that this rule is in. If the current rule is already under a <state> tag, then the rule is added to all of them.

<pattern>

Attribute Description
bol Specify that the pattern only works at BOL (beginning of line).
nocase Specify that the pattern does case insensitive match.

Although multiple patterns may be under the same rule and share the action code, in actual generated code, the action code is replicated for each pattern. This is to avoid the problem that some patterns may work at BOL while some other patterns may not. To avoid action code replication, try put them inside a single <pattern> tag with | in between.

<action>

It contains the code to be executed when the pattern is matched.