Carburetta manual for version 0.8.20
Introduction
Carburetta is a parser generator for C and C++. By specifying a grammar for a language, Carburetta can generate a parser for that language, saving the effort of doing so manually. In addition, because the size of the grammar for a language is much smaller than the code for manually parsing such a language, the resulting parser is much easier to maintain.
Why another parser generator?
The problem with parser generators is that they can get in the way of getting things done; consequently, many compiler projects end up hand-crafting their parser to work around the limitations.
Carburetta aims to solve this by offering the following benefits:
- Quick to start: a single file can contain the scanner, the parser, and the C main() function. A faster start means a shorter time to iterative development.
- Fused scanner and parser: by fusing the scanner and parser into a single scan() function that does both, we gain efficiency and reuse the datatypes (an identifier is not the same as an integer, and with a fused scanner and parser, each can be unique in the data it holds). This makes for a simpler codebase.
- Avoids repeating yourself: patterns and productions can easily share code.
- Meticulously managed lifetime: explicit lifetime management from construction to destruction for user-defined types, in both C and C++.
- Ready for abuse: the generated code handles the complexity, so your code won't have to:
  - Return out of the parsing function from within action code, with defined behavior
  - Throw exceptions and use RAII patterns with defined behavior, such as retrying the failing constructor code upon reentry
  - Recover from out-of-memory situations
- Flexible:
  - Whether you want a fused scanner and parser, or only a parser with your own lexing stage, or even a hybrid of the two, it can be done.
  - You can work with the default return codes, redefine the return codes to your liking, or handle whatever needs to be done within the scan() or parse() function and not return at all.
  - Maintain the source location of each symbol, by line and column, by range, or some hybrid in between. Whether you want colored highlighting of source code error areas, just the line and column of each symbol, or more sophisticated IDE integration, it can all be done, as no prior pattern is imposed on you. Carburetta does not mandate what source location management should look like, but gives you the tools to implement what you actually need with relative ease.
Existing compiler projects like clang and gcc have implemented their own hand-crafted parsers. Hand-crafting a parser guarantees the required feature set and performance, and avoids the complexity of working around the limitations of a generated parser, but it represents a tremendous amount of work, and code, to maintain. While compilers have moved toward hand-written parsers, parser generators have moved in more general (less efficient) directions such as ambiguous grammar parsing.
Carburetta is a step in the opposite direction. The parsing algorithm is LALR, straightforward and predictable. The feature set is designed to maximize flexibility while reducing complexity; a combination intended to make developers far more productive than the confines of their programming language alone would allow. Carburetta simplifies creating productivity boosters, from little jigs, tools, code scanners, and source code generators, to entire domain-specific languages, scripting language interpreters, and compilers.
Commandline options
Carburetta is a commandline tool that takes a single input file and generates a single C/C++ output file and an optional header file. The input file typically has a .cbrt extension and is discussed in detail later.
Carburetta is invoked as follows:
carburetta <inputfile.cbrt> <flags>
where <inputfile.cbrt> is the path to the Carburetta input file, and <flags> is one or more of:
-c, --c [<c_filename>]
: Generate a C file and output it to the specified filename. If no filename is provided, the output will be directed to the standard output (stdout).
-h, --h [<h_filename>] : Generate a C header file and output it to the specified filename. If no filename is provided, a C filename must be present, which will be used to derive a filename for the header file.
-8, --x-utf8 : Generate a parser that reads input as UTF-8. This is the default. See the section on Unicode for more information.
-r, --x-raw : Generate a parser that reads input as raw bytes (Latin-1).
-H, --help : Print a help message displaying information about the available flags and usage.
-v, --version : Print version information.
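For example, to generate both an implementation file and a header file from an input file (the filenames here are illustrative):

```
carburetta calc.cbrt --c calc.c --h calc.h
```

This writes the generated parser to calc.c and its public declarations to calc.h.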
Carburetta will then generate the requested C/C++ implementation and, optionally, header file(s) from the input file.
Structure of a Carburetta .cbrt input file
Carburetta takes a grammar input file and uses it to generate the specified parser. A grammar input file consists of several sections that describe the grammar, optionally the scanner and the C code to be executed. These are used to generate the C/C++ output file to be used as input by a C/C++ compiler, and optionally a header file to help integrate the parser with the rest of your code.
There are several distinct sections:
- Prologue: the start of a .cbrt input file is copied directly to the top of the output and typically contains the include files and declarations needed by the rest of the parser code. Any section delimiter ends the prologue.
- Scanner: the scanner section (started by the %scanner% section delimiter) describes the patterns and corresponding actions. If no scanner section is present, no scanner will be generated.
- Grammar: the grammar section (started by the %grammar% section delimiter) describes the productions and corresponding actions that make up the language. A parser is always generated; if no grammar section is present, no input will match the language.
- Epilogue: following the scanner and/or grammar sections, an epilogue section (started by the %% section delimiter and ending at the end of the input file) contains the C code that drives the parser and interfaces it with the rest of your code.
- Intermissions: from time to time you may want small helper functions to reduce repetition or better express functionality used in your action code. Carburetta allows you to append code to the prologue section from within the scanner and/or grammar sections by enclosing the code between %% section delimiters. (This differs from the epilogue section in that a second %% section delimiter returns you to the scanner or grammar context you were in.)
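Laid out in a single file, these sections might look as follows (a minimal sketch; the comments stand in for real content):

```
/* prologue: copied verbatim to the top of the output */
#include <stdio.h>

%scanner%
/* patterns and their actions */

%grammar%
/* productions and their actions */

%%
/* epilogue: driver code, e.g. main() */
```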
Throughout any section there may be Carburetta directives. Carburetta directives are modeled after C preprocessor directives, with the # of a preprocessor directive replaced by a %. Lines with a Carburetta directive start with whitespace, followed by a single % and an identifier indicating the directive. Carburetta directives are consumed and do not appear in the output.
Prologue
The prologue is regular C/C++ code, intended to ensure all declarations needed for the parser are available before the generated parser code appears in the output. A file consisting of code and no section delimiters (that is to say, a file that is all prologue and nothing else) will be emitted to output with no parser or scanner.
Epilogue
The last C/C++ code (starting from the %%
section delimiter to the end of the file) is the epilogue. It differs from the prologue and intermission sections in that its C/C++ code appears after the generated parser code in the output file. It can therefore rely on the parser's symbol declarations being available.
The epilogue code is typically used to “drive” the scanner/parser from the interface the caller wants to the interface that Carburetta has made available.
Intermissions
Intermissions are short sections of code that may appear in the middle of the scanner and grammar sections. They are started by a %%
section delimiter and end by another %%
section delimiter. After the second section delimiter, the original section continues where it left off.
The code in intermission sections is appended to the prologue. As the prologue precedes the code generated by Carburetta, you cannot access the parser from intermission code. The intent of intermission sections is to give you a way of adding code that is associated with a particular (set of) patterns or productions (for instance, a small helper function) near the code that needs it, to improve the cohesion of your code.
Header
When you configure Carburetta to generate a header file, there may be declarations external to the scanner or parser that are nonetheless used by it. To avoid having to manually include those dependencies prior to including the header, you can use the %header% section. The %header% section is a prologue-like section that is only included in the generated header (the C file ignores it).
Suppose, for instance, that we pass in a parameter CompilerContext
which perhaps houses symbol tables and the like. This CompilerContext
may be in a separate include file. We could ensure this file is included in the header as follows:
%param CompilerContext &cc
%header%
#include "compiler_context.h"
%%
// .. other prologue code.
Note that types assigned to symbols using %type or %class are not exposed in the header; these are housed as members of the struct <prefix>sym_data structure, and Carburetta goes out of its way to ensure the declaration of this <prefix>sym_data is limited to the C/C++ implementation file. Also note that code included in the %header% section is not included in the generated C/C++ implementation file (you may want to include the header file in the implementation file using the prologue section).
Finally, any code in the %header% section is inserted after the #ifndef include guard, but before any extern "C". (See the section on %externc and %no_externc for more information on if and when extern "C" is emitted.)
Grammar section
The grammar section is where the grammar of the language to be parsed is specified. The grammar is defined by productions, and each production may have an associated action. An action is the code to be executed when the production matches the parsed input.
%grammar%
%nt list
%token COMMA IDENT
list: IDENT;
list: list COMMA IDENT;
In the above example, a grammar section is started using the %grammar% section delimiter. There are two productions for the non-terminal list, one of which is recursive. All identifiers are symbols. Symbols are split into terminals and non-terminals, and all symbols must be declared explicitly. COMMA and IDENT are declared as terminals using the %token directive. list is declared as a non-terminal using the %nt directive. The difference between a terminal and a non-terminal is that the latter may consist of other symbols in the form of productions.
Each production is said to reduce to a non-terminal. Here both productions list: IDENT; and list: list COMMA IDENT; reduce to the non-terminal list.
We will examine the %grammar
section in far greater detail later.
Scanner section
The scanner section is where the lexical structure of the language may be specified. Put differently, it describes what sequence of characters form valid terminals in the language. It does so by describing patterns.
The scanner section is defined by patterns. Each pattern consists of the terminal symbol that is defined by it, a regular expression that describes the pattern to be matched and an action that is executed when the pattern matches the input. Multiple patterns can match a single terminal.
%scanner%
%token COMMA IDENT
COMMA: \, ;
IDENT: [a-zA-Z_][a-zA-Z_0-9]* ;
A Simple Calculator
Let’s examine an example that shows the various sections in a single .cbrt Carburetta input file.
#include <stdio.h>
#include <stdlib.h>
%scanner%
%prefix calc_
: [\ \n]+; /* skip spaces and newlines */
PLUS: \+; MINUS: \-; ASTERISK: \*; SLASH: /;
PAR_OPEN: \(; PAR_CLOSE: \);
INTEGER: [0-9]+ { $$ = atoi($text); }
%token PLUS MINUS ASTERISK SLASH PAR_OPEN
%token PAR_CLOSE INTEGER
%nt grammar expr term factor value
%grammar%
%type grammar expr term factor value \
INTEGER: int
%constructor $$ = 0;
%params int *final_result
grammar: expr {
printf("Outcome: %d\n", $0);
*final_result = $0;
}
expr: term { $$ = $0; }
expr: expr PLUS term { $$ = $0 + $2; }
expr: expr MINUS term { $$ = $0 - $2; }
term: factor { $$ = $0; }
term: term ASTERISK factor { $$ = $0 * $2; }
term: term SLASH factor { $$ = $0 / $2; }
factor: value { $$ = $0; }
factor: MINUS factor { $$ = -$1; }
factor: PAR_OPEN expr PAR_CLOSE { $$ = $1; }
value: INTEGER { $$ = $0; }
%%
void report_error(const char *msg) {
fprintf(stderr, "%s", msg);
}
%%
value: error {
$$ = 0;
report_error("Syntax error: value expected;\n");
}
%%
int main(int argc, char **argv) {
int value, r;
static char s[300];
fprintf(stderr, "Enter an expression (-1 to exit):\n");
struct calc_stack stack;
calc_stack_init(&stack);
do {
if (!fgets(s, sizeof(s), stdin)) {
perror("Error reading");
r = EXIT_FAILURE;
break;
}
calc_set_input(&stack, s, strlen(s), 1);
do {
r = calc_scan(&stack, &value);
} while (r == _CALC_SYNTAX_ERROR);
if (r) {
fprintf(stderr, "Unexpected error\n");
r = EXIT_FAILURE;
break;
}
} while (value != -1);
calc_stack_cleanup(&stack);
return r;
}
We’ll examine each part of the example in the following sections.
The Prologue
The prologue consists of the code from the start of the file up to, but not including, the %scanner%
directive.
#include <stdio.h>
#include <stdlib.h>
%scanner%
The text is copied directly to the output file. Here we include stdio.h
for printf, fgets and fprintf, and we include stdlib.h
for atoi. Carburetta directives are valid anywhere in the input file, including in the code. Here there are none; however, were any directives present, their entire line would be consumed by Carburetta and you would not find them in the output.
The Scanner
The scanner consists of a few patterns we need to parse simple infix formulas.
%scanner%
%prefix calc_
: [\ \n]+; /* skip spaces and newlines */
PLUS: \+; MINUS: \-; ASTERISK: \*; SLASH: /;
PAR_OPEN: \(; PAR_CLOSE: \);
INTEGER: [0-9]+ { $$ = atoi($text); }
%token PLUS MINUS ASTERISK SLASH PAR_OPEN
%token PAR_CLOSE INTEGER
%nt grammar expr term factor value
%grammar%
The %scanner%
section delimiter starts the scanner section.
%prefix is a directive that tells Carburetta how to prefix all identifiers with external linkage (which might otherwise conflict with identifiers in other source code at link time), as well as identifiers that you may want to use from other source code (such as #define identifiers for the symbol constants).
A pattern consists of symbol: regex { action }, or symbol: regex ; if it has no action. The symbol is optional. The first pattern, : [\ \n]+;, is one where the optional symbol has been left out. Because this pattern has neither a symbol nor an action, any match of the regular expression results in the matched input being silently ignored. Scanning continues after the matched input.
Regular expressions in Carburetta have a few differences from the usual regular expressions as their goal is to compile into a simple scanning state transition table. Additionally, to help layout complex regular expressions, whitespace (spaces that only have an aesthetic effect, no functional effect) is permitted. To encode the space character as part of the pattern to be matched, in Carburetta, it must be escaped.
The [\ \n]+ regular expression is a character set [..] matching a space (the escaped space \ ) or a newline (\n). The character set must occur one or more times (denoted by +) for the pattern to match.
Note that C and C++ style comments can appear anywhere in a Carburetta file. If they appear in the prologue, intermission, or epilogue sections, or if they appear in any of the pattern or production actions, then they will be reproduced in the output file. The comment /* skip spaces and newlines */
is therefore accepted as a comment, but as it is not part of any action or code section, it is not output.
The grammar for Carburetta itself is not dependent on whitespace or newlines; the lines:
PLUS: \+; MINUS: \-; ASTERISK: \*; SLASH: /;
PAR_OPEN: \(; PAR_CLOSE: \);
therefore define the patterns for the terminal symbols PLUS, MINUS, ASTERISK, SLASH, PAR_OPEN and PAR_CLOSE as the single characters +, -, *, /, (, and ). Note that, of these characters, only the / is considered a plain character in Carburetta's regular expression syntax; all the other characters have to be escaped.
The pattern for the INTEGER
terminal demonstrates an action:
INTEGER: [0-9]+ { $$ = atoi($text); }
The code { $$ = atoi($text); }
is executed whenever the regular expression [0-9]+
(“one or more digits”) matches. In the snippet of code, $$
and $text
are special identifiers. Special identifiers are replaced during compilation by Carburetta with expressions (or sometimes statements) that provide access to aspects of the running parser and/or the current pattern or production. In this action snippet, the $$ special identifier is replaced with an expression that represents the typed symbol value of an INTEGER. What the type of the INTEGER symbol's value is has not yet been declared. The $text special identifier accesses the currently matched text as a null-terminated const char * pointer. We pass it to the C library atoi (ASCII to int) function to get its integer value. Although the regular expression guarantees that $text contains one or more digits and only digits, production code should check for overflow (a number too large to fit in an int).
%token PLUS MINUS ASTERISK SLASH PAR_OPEN
%token PAR_CLOSE INTEGER
%nt grammar expr term factor value
The %token directive declares terminal ("token") symbols. These correspond to the patterns we've seen earlier, though it is legal for tokens to exist that do not correspond to any patterns, and it is also legal for many patterns to match the same terminal symbol (this can greatly simplify value evaluation during parsing).
Note that these declarations appear after the patterns that already used them. Carburetta does not require any particular order to these declarations. Carburetta does, however, require that all symbols used in patterns are declared using %token.
The %nt
directive declares the non-terminals; these will appear in the productions of the grammar section.
Finally, the %grammar%
section delimiter ends the current (scanner) section and starts the grammar section.
The Grammar
Whereas the scanner section consists of patterns, the grammar section consists of productions.
%grammar%
%type grammar expr term factor value INTEGER: int
%constructor $$ = 0;
%params int *final_result
grammar: expr {
printf("Outcome: %d\n", $0);
*final_result = $0;
}
expr: term { $$ = $0; }
expr: expr PLUS term { $$ = $0 + $2; }
expr: expr MINUS term { $$ = $0 - $2; }
term: factor { $$ = $0; }
term: term ASTERISK factor { $$ = $0 * $2; }
term: term SLASH factor { $$ = $0 / $2; }
factor: value { $$ = $0; }
factor: MINUS factor { $$ = -$1; }
factor: PAR_OPEN expr PAR_CLOSE { $$ = $1; }
value: INTEGER { $$ = $0; }
%%
void report_error(const char *msg) {
fprintf(stderr, "%s", msg);
}
%%
value: error {
$$ = 0;
report_error("Syntax error: value expected;\n");
}
%%
The %grammar%
section delimiter starts the grammar section. As mentioned in the scanner section, symbols (both terminals and non-terminals) can have a type associated with them. This type is how symbols carry data.
%type grammar expr term factor value INTEGER: int
%constructor $$ = 0;
The type of the non-terminals grammar
, expr
, term
, factor
and value
, as well as the terminal INTEGER
is defined to be int
using the %type
directive. For INTEGER, we have seen how its pattern action assigned an integer to it using $$ = atoi($text); this is only possible because here we assign the type int to the INTEGER symbol.
Carburetta is not aware of C or C++ as a language; the type assigned in the %type directive is merely copied and pasted into the appropriate places, both to carry that type as data for a symbol and to make it accessible using $$ for the destination symbol (in a pattern or production) and $0, $1, ... $n for the symbols of a production. If the type contains an error, it may lead to a lot of compilation errors in the Carburetta output.
Most initialization is shared by all values of a type. The line %constructor $$ = 0; indicates that the last-mentioned type (in this case the int from the %type directive) is always to be initialized by assigning 0 to it. This %constructor executes before any of the production or pattern actions; consequently, even if a production or pattern has no action, we can still rely on its symbol data being properly initialized according to its type.
%params int *final_result
Carburetta will generate a parse function and, if a %scanner%
section defines patterns, a scan function. With the %prefix
set to calc_
earlier, these will be named calc_parse
and calc_scan
respectively. Code uses the generated parser by calling one of these functions. However, the code in the various actions may need access to more than the standard arguments passed in.
For this reason, the argument lists for both the calc_parse
and calc_scan
functions are expanded using the %params
directive. Here we indicate an argument int *final_result
is to be appended.
Because all action code runs from within the generated parse and scan functions, all action code can rely on access to the final_result
parameter. This example will use the extra argument to return additional data.
grammar: expr {
printf("Outcome: %d\n", $0);
*final_result = $0;
}
The first production in the grammar section also determines the non-terminal (in this case grammar) that represents the entire input to be matched. This production indicates a grammar is an expr (expression). Upon reduction of the grammar production, its action executes.
The printf("Outcome: %d\n", $0);
prints the value of the first symbol (counted from 0) expr
. The expr
symbol was assigned int
as type, the $0
special identifier will therefore have int
as type.
Note that the syntax for Carburetta itself does not rely on line endings to terminate an action, nor does it have any deeper understanding of C/C++; rather, it counts braces to determine where the action ends. Here the action spans multiple lines. For code associated with directives (e.g. the %constructor $$ = 0; earlier is such a directive), the end of the line does determine the end of the code, because a directive always ends at the end of the line it appears on (modeling the behavior of the C/C++ preprocessor in this regard). If code is better expressed across multiple lines for a directive, line continuations may be used: where a line ends in a backslash \ immediately followed by a newline, that line is continued on the next line, and both the backslash and the newline vanish from the evaluation of the line, again modeling C/C++ preprocessor behavior.
The code *final_result = $0;
assigns the value of the expr
symbol at index 0 to the value pointed to by final_result
. The final_result
variable was defined using the %params directive earlier as an argument to the calc_scan and calc_parse functions that Carburetta will generate. This is how the example communicates with its driver code.
expr: term { $$ = $0; }
expr: expr PLUS term { $$ = $0 + $2; }
expr: expr MINUS term { $$ = $0 - $2; }
term: factor { $$ = $0; }
term: term ASTERISK factor { $$ = $0 * $2; }
term: term SLASH factor { $$ = $0 / $2; }
factor: value { $$ = $0; }
factor: MINUS factor { $$ = -$1; }
factor: PAR_OPEN expr PAR_CLOSE { $$ = $1; }
value: INTEGER { $$ = $0; }
These productions define the main grammar in terms of the operators and their operands. Careful structuring of the productions imposes the appropriate operator precedence (so 1 + 2 * 3 evaluates to 7 and not 9).
%%
void report_error(const char *msg) {
fprintf(stderr, "%s", msg);
}
%%
Sometimes it is more convenient to have helper functions close to the grammar where they are used. For instance, productions with optional symbols are not supported, so you may end up with many near-identical productions to iterate through all possible combinations of optional symbols. Handling such a pattern may best be done with a single common function; such a function is very specific to, and tightly coupled with, the area of the grammar being implemented. For this reason, it can be useful to have brief sections of code, intermissions, where the common functionality can be written close to the area that uses it.
The example here is a bit contrived. Starting at the first %% section delimiter and ending at the second %% section delimiter is an intermission section. The code in this section is appended to the prologue, prior to any generated code. Code in an intermission section cannot, therefore, rely on the implementation of the parser, unless Carburetta is instructed to generate a header file and this header file is included in the prologue. In this example, no such header is included.
The report_error()
function prints the message as a plain string to stderr
and returns.
%%
value: error {
$$ = 0;
report_error("Syntax error: value expected;\n");
}
%%
Here a production reduces to the non-terminal value
, but it uses a symbol that we have not declared earlier: error
. The error
symbol is implicitly declared and available. The error
symbol is part of Carburetta’s error recovery mechanisms. If an error occurs (a symbol cannot be matched in the current state of the parser), then the parser, after returning an error to the caller, will examine whether it can backtrack from its current state to an earlier state where it can accept the error symbol. If such a state exists, and the current input terminal symbol can be accepted after the error symbol, then an error symbol is fabricated out of thin air and accepted, and parsing continues. (If, while backtracking, no earlier state can accept an error symbol, then the error is returned to the caller, and parsing ends.)
Suppose we give the calculator the input 1 +; the parser will then see the end of the input where a value is expected. Examining its state, it discovers it can also accept an error token, which can be thought of as 1 + error; the error reduces to a value via the production above, and its action is executed.
The action assigns 0 to the non-terminal value and invokes the report_error we defined in the intermission section earlier; parsing then continues.
At the end, we have a %%
section delimiter, which indicates there is another code section (either an intermission section or the epilogue, depending on whether a further %% section delimiter follows before the end of the file).
The Epilogue
The prologue, scanner and grammar sections have all been defined; Carburetta can now generate a working scanner and parser. What remains is to call it. Here, we do this in the epilogue:
%%
int main(int argc, char **argv) {
int value, r;
static char s[300];
fprintf(stderr, "Enter an expression (-1 to exit):\n");
struct calc_stack stack;
calc_stack_init(&stack);
do {
if (!fgets(s, sizeof(s), stdin)) {
perror("Error reading");
r = EXIT_FAILURE;
break;
}
calc_set_input(&stack, s, strlen(s), 1);
do {
r = calc_scan(&stack, &value);
} while (r == _CALC_SYNTAX_ERROR);
if (r) {
fprintf(stderr, "Unexpected error\n");
r = EXIT_FAILURE;
break;
}
} while (value != -1);
calc_stack_cleanup(&stack);
return r;
}
The difference between an intermission code section and the epilogue code section is that the epilogue code section is never terminated. After the %%
section delimiter that opens it (which may appear in both %scanner%
and %grammar%
sections,) there is no other %%
section delimiter before the end of the input file, making this code section the epilogue.
An epilogue code section can rely on all declarations and definitions of the Carburetta generated scanner and parser to have preceded it in the output code file, irrespective of any headers included before. For very small, single file, parsers, it is therefore a good place to put main()
.
All Carburetta generated declarations use calc_
as a prefix to ensure they do not collide with any of your other code. This calc_
prefix was declared using the %prefix
directive earlier. The main()
function here relies on the following declarations specifically:
- struct calc_stack : the stack structure maintains the state of both scanner and parser. It must be initialized using its calc_stack_init() function.
- calc_stack_init() : the initializer ("constructor") function for the struct calc_stack. This is guaranteed to never fail and ensures that other functions, including its counterpart calc_stack_cleanup(), can safely be called.
- calc_stack_cleanup() : the cleanup ("destructor") function for the struct calc_stack. Like the constructor, the destructor cannot fail. It frees the resources used by the scanner and parser. calc_stack_cleanup() will invoke type destructor code specified using the %destructor directive. For this reason, code specified in the %destructor directive is the only code that must be able to execute outside the parse and scan functions; consequently, such code cannot rely on the effects of directives that affect those functions (like the %params directive earlier).
- calc_set_input() : tells the scanner in what buffer it can find its input. This is a pointer to a buffer and the size of the input at that pointer. Carburetta does not copy this buffer but uses it directly; consequently, it is imperative that the buffer remains valid from the moment calc_set_input() is called until the next time calc_set_input() is called, or calc_stack_cleanup() is called.
- calc_scan() : the scan function generated based on both the %scanner and %grammar sections. Note that the scan function houses both the scanner and parser; the two are fused in the same function. This ensures that extra arguments are available and is more efficient than keeping scanner and parser separate. The cost (and trade-off) is that it makes the scan function much larger and does not decouple the scanner from the parser from a design perspective. Because the function is generated, we care less about these concerns, though they may make debugging more difficult.
- _CALC_SYNTAX_ERROR : one of the possible return codes from the code generated by default. While there are directives that redefine what should happen in error or exceptional conditions, the default behavior is to return an error code to the caller. The two return codes relevant for this example are _CALC_SYNTAX_ERROR, returned when a syntax error occurs, and the value 0, which is returned when the parser is finished with the input (either due to an earlier error that could not be recovered from, or because all input was processed to completion). Note that the leading underscore _ avoids collision with any of the symbol definitions, and CALC_, as a prefix, is derived from the same %prefix directive used for all the lowercase identifiers.
The main function first initializes a struct calc_stack
using the calc_stack_init()
function (and before the exit, this calc_stack_init()
is paired with calc_stack_cleanup()
to always release resources). It then enters a loop where a line of input is requested from the user (using fgets on stdin), and this line of input is passed into the scanner using the calc_scan
function that was generated by Carburetta.
With calc_set_input() we pass a pointer to the statically allocated buffer s. The buffer s is 300 bytes, but we pass in only the length of the null-terminated string we read into it beforehand (the amount to pass in is the size of the input, not the size of the buffer).
Input is final if it marks the last bit of input. Input is not final if we are processing in a stream-like fashion (where we read a buffer, process the buffer, read again into the buffer, process again, and so on). In such a situation, suppose we have an identifier at the edge of one buffered input: we would have to see the start of the next buffered input to know whether the identifier straddles both buffers. When reading in a stream-like fashion, we would repeatedly call calc_set_input() whenever _CALC_FEED_ME was returned by the calc_scan() function, or whenever our %on_feed_me defined code executes, if that directive is defined.
In this (simple) example, each formula we would like to calculate ends on its input line, so we pass a 1 to calc_set_input()
in its is_final_input
flag.
calc_scan(&stack, &value)
takes 2 arguments. The first is a pointer to the struct calc_stack
that holds the parser state. The second is an argument that we requested be there using the %params
directive.
Carburetta’s scanner and parser now have all they need, but we specified an additional argument using the %params int *final_result
directive. We satisfy this argument by passing in a pointer to the value
local variable. If the grammar completes parsing, the value of the calculated formula will be stored in the value variable.
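Putting these pieces together, the driver loop described above can be sketched as follows. This is a paraphrase, not the example's literal code: the calc_* names follow the text above, while calc_stack_reset() (to ready the stack for the next formula) and the exact signatures are assumptions that should be checked against the generated header.

```
#include <stdio.h>
#include <string.h>
#include "calc.h"   /* the Carburetta-generated declarations */

int main(void) {
  char s[300];
  struct calc_stack stack;
  calc_stack_init(&stack);
  for (;;) {
    int value = 0, r;
    if (!fgets(s, sizeof(s), stdin)) break;
    /* pass the length of the input read, not the size of the buffer;
     * the final 1 marks this line as the last input for this formula */
    calc_set_input(&stack, s, strlen(s), 1);
    do {
      r = calc_scan(&stack, &value); /* &value satisfies %params int *final_result */
      if (r == _CALC_SYNTAX_ERROR) fprintf(stderr, "Syntax error\n");
    } while (r != 0);                /* 0: done with this input */
    if (value == -1) break;          /* a result of -1 ends the program */
    calc_stack_reset(&stack);        /* assumed: prepare for the next formula */
  }
  calc_stack_cleanup(&stack);        /* always release resources */
  return 0;
}
```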
As mentioned earlier, _CALC_SYNTAX_ERROR
is not the only return value; there are a few others. If we get a syntax error, we keep parsing, hoping to recover on an error
production. If such a production cannot be executed (and recovery fails) then, after returning _CALC_SYNTAX_ERROR
, the scan function will return 0 to indicate failed completion. If such a production can be found, then after returning _CALC_SYNTAX_ERROR
, the parsing resumes, recovered, to successful completion, at which point 0 is returned. The difference between failure and success therefore lies in the execution of an error production and finishing the grammar to its first, topmost, grammar
production.
If the value evaluates to -1
then 0 is returned from main() and the calc example is done.
The Grammar Section
The grammar section describes the structure of the language in productions, and the actions those productions execute when matched. The grammar section is started by the %grammar%
section delimiter on a single line.
The productions form a set of rewriting rules, where a symbol is defined to be equivalent to a sentence of other symbols, both terminals and non-terminals. A symbol is deemed a terminal if it comes from the scanner (or user) input. Such a terminal symbol may not be defined in terms of productions. A symbol is deemed a non-terminal if it is defined by zero or more productions. (A non-terminal that has no productions that define it will never match.)
The First Production
The destination non-terminal of the first production is used as the non-terminal that defines the entire grammar. For instance:
A: B
A: C
The first production above is A: B
, the destination non-terminal of that production is A
. The entire grammar to be matched is therefore equal to matching A
. Valid matches for the entire grammar are therefore not just the symbol B
(from the first production A: B
) but also the symbol C
(from the second production A: C
.) The goal is to match the destination symbol from the first production, not the first production itself.
Declaring symbols
Terminals from the scanner don’t necessarily need to be used at all in the grammar; similarly, non-terminals don’t have to be defined by any production. If we let terminals and non-terminals be defined purely by their use in patterns and productions, then there is a risk of difficult-to-find bugs: we fallible humans make typos. If we let those typos go undetected, and instead let them define arbitrary new symbols that may never match the symbol without the typo, then our grammars will compile just fine, even though they may never match, for hard-to-detect reasons.
It is for this reason that all terminals and non-terminals need to be explicitly declared, so Carburetta can detect this class of mistakes.
Case-insensitivity
Carburetta treats both terminal and non-terminal symbol identifiers as case insensitive. Lower-case a-z
and upper-case A-Z
are indistinguishable from its perspective. Additionally, the _
(underscore) and -
(dash) are equivalent.
%token
Terminals (also known as tokens) must be declared using the %token
directive, we have already seen an example of such a directive in the calculator example:
%token PLUS MINUS ASTERISK SLASH PAR_OPEN
%token PAR_CLOSE INTEGER
These terminal symbols are all uppercase with underscore word separators, however, this is convention only. Terminal symbols are written uppercase by convention, to help readability of the grammars.
It is legal for terminal symbols to be declared, but not used anywhere in the grammar. Such unused terminal symbols are declared in the range of terminals using a #define in the output, just like other terminal symbols.
%nt
Non-terminals must be declared using the %nt
directive, here is the example from the calculator earlier:
%nt grammar expr term factor value
These non-terminal symbols are all lowercase (and usually with dashes -
if word separation is needed,) however, this is convention only. Non-terminal symbols are written lowercase and with dashes to help readability of the grammars.
Much like terminals, it is legal for non-terminals to be declared, but not used anywhere in the grammar. Such unused non-terminal symbols are declared in the range of non-terminals using a #define in the output, just like other non-terminal symbols.
Implicit symbols
In addition to the terminals and non-terminals that are specified using %token
and %nt
respectively, Carburetta needs two other symbols to function, which it implicitly declares.
%error_token
The error token we have seen in the calculator example earlier. By default, it is declared as error
and transformed to a C/C++ identifier using the same rules as for other symbols. Consequently, in the case of the calculator, the corresponding C/C++ identifier is declared as CALC_ERROR
. It is rare to have to use this in non-generated code directly, unless directly accessing the innards of the error recovery mechanisms.
Should the default name error
conflict with a desired symbol in the target language, it can be re-defined to another identifier using the %error_token
directive. For instance, in the calculator:
%error_token gremlin
value: gremlin {
$$ = 0;
report_error("Syntax error: value expected;\n");
}
Here we have defined the error token to be gremlin
using %error_token gremlin
, and consequently used it instead of error
. Note that multiple uses of %error_token
result in only the last use becoming the error symbol; the symbols from earlier invocations of %error_token
become regular terminal symbols. Future versions of Carburetta may instead issue an error.
%end_token
There has to be some way for the parser to recognize the end of the input. This is done by a special input-end
symbol; like the error
symbol, it is implicitly declared, and obeys the same transformation rules to a C/C++ identifier as other symbol names. In the case of the calculator example, the corresponding C/C++ identifier is declared as CALC_INPUT_END
. Even though it is in the range of the terminal symbols, it should not be directly used in the grammar (as it can never be shifted.) Future versions of Carburetta may issue an error when this occurs.
When using the parse function directly, the driver code must indicate the end of the input stream to be parsed by passing in the input-end
symbol. When using the scan function, this is done implicitly by setting the scan function’s is_final
argument to nonzero.
Should the symbol name input-end
conflict with a desired symbol name in the target language, it can be re-assigned using the %end_token
directive:
%end_token EOF
Here the default input-end
symbol name is redefined to EOF
. Multiple invocations of %end_token
result in only the last specified symbol name being used as the end-of-input symbol; the others become regular terminals. Future versions of Carburetta may issue an error instead.
The Type of a Symbol
Symbols can carry data and that data must have a C/C++ type; we will explore here how we can assign and use types on symbols. There are two categories of data:
- data specific to the symbol: for instance, an identifier might carry a string type, a number-literal might carry a float type, but the identifier has no need for a float, and the number-literal has no need for a string.
- data common to all symbols: for instance, we might want to store the line and column of each symbol in the input, so that more helpful error messages may be generated.
The data specific type is the symbol type, the data type common to all symbols is the common type. We will first examine the symbol type, followed by the common type.
%type
To assign a type to any symbol (terminal or non-terminal) use the %type
directive:
%type grammar expr term factor value INTEGER: int
Above the type int
is assigned to the non-terminals grammar
, expr
, term
, factor
, value
, and to the terminal INTEGER
.
The type specified may be more elaborate than a simple int
:
%type sym: struct { \
char *sym_; \
struct sym *lookup_; \
}
Here an entire anonymous (nameless) structure is used as the type for the symbol sym
, declared over multiple lines using line continuations (a backslash immediately followed by a newline at the end of each line continues the directive on the next line.)
Carburetta merely copies and pastes these declarations without any deeper understanding of C/C++. From its perspective, it just wants to append an identifier to declare a local variable or structure member variable. There are however cases where “simply appending an identifier” is insufficient. To specify exactly where Carburetta should insert an identifier in the type to make a valid local or member variable declaration, the $
special identifier may be used:
%type value: int $[5]
Here the symbol value
is assigned the array type int[5]. The $
special identifier indicates where Carburetta should write the name of variables and members to complete its declarations in its own generated code. Using the $
special identifier, types like arrays and function pointers can also be declared.
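Following the same rule, function pointer types can be declared too; for instance (a hypothetical callback symbol, purely for illustration):

```
%type callback: void (*$)(int)
```

Each variable or member Carburetta declares for callback then takes the form void (*some_name)(int), with the name written where the $ appears.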
Note: for C++, the %class
directive is typically preferred over %type
as it fills in the constructor, destructor and move-semantics that are typically desired for C++ code. It is discussed in its own section later.
%token_type
When the tokens are passed into the parser directly and Carburetta’s scanner features are not used, it is common (for convenience sake) to have one single type for all tokens. Such a type could be specified using %type [all the terminals]: the_type
but producing this list is tedious and prone to mistakes if the list of terminal symbols changes while the language is in development. For convenience, a shorthand is available, %token_type
.
%token_type
defines the type for all terminal symbols that do not have a type assigned yet by %type
, it is therefore a nice catch-all for those cases where all terminals share the same type.
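For instance, a sketch of such a catch-all token type for a hand-driven parser (the struct members shown are hypothetical):

```
%token_type struct { \
  char *text_; \
  size_t len_; \
}
```

Terminals that were already given a specific type with %type keep that type; all remaining terminals share this one.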
%constructor
While the actions of each individual pattern or production may be unique, the initialization of the data to a known good state is commonly shared. To avoid repetition amongst rules, a snippet of initialization code may be specified using the %constructor
directive. The code specified with %constructor
will be executed whenever a new instance of the data is about to be used.
The %constructor
directive applies a snippet of code to the most recently declared type. A type is declared using %type
, %token_type
, or %common_type
, the %constructor
directive applies itself to the most recent such type declared.
%type grammar expr term factor value INTEGER: int
%constructor $$ = 0;
All int
data used by the symbols grammar, expr, term, factor, value and INTEGER will be initialized to 0 prior to their use. If you return
from the scan() or parse() function to the caller during the execution of a %constructor
, it is deemed to have succeeded. Re-entry of the scan() or parse() function will continue after the %constructor code.
%raii_constructor
Like %constructor
the directive %raii_constructor
executes whenever a new instance of the most recently declared type is about to be used. The difference between the two lies in what happens when the parse or scan function is exited while the constructor code executes, via a return or a thrown C++ exception.
In the case of %constructor
, the assumption is that constructor code never fails, but only initializes the data to values such that the destructor code (which we’ll see later) can safely be called. As such, when you return from the parse or scan function (which can be done by using a regular return
), the assumption is that the %constructor
code has fully initialized the data, and upon re-entry to the parse or scan function, execution will continue after the %constructor
code. Similarly, if the %constructor
code returns to the caller, and the caller decides to call your_prefix_stack_cleanup()
(e.g. calc_stack_cleanup()
in the earlier example) then the corresponding %destructor
for the type is called.
When writing C code that expects some part of the caller to do initialization in collaboration with the %constructor
code, this is the desired behavior. However, when writing C++ code according to the Resource Acquisition Is Initialization (RAII) idiom, this is not the desired behavior, at all.
When using RAII in C++, constructors are not guaranteed to succeed but may fail. When they fail, they throw an exception. In the case of Carburetta, because its code is written for both C and C++ compilers (i.e. it is written in C but can be compiled as C++) it transparently passes exceptions to the caller, and cannot distinguish between an intentional return
and an exception throwing constructor. For this usage, %raii_constructor
is available.
Using %raii_constructor
, when control exits to the caller (via either return
or an exception), the constructor is assumed to have failed. Re-entry to the parse() and scan() function will continue execution at the start of the %raii_constructor
code that failed (instead of continuing execution after the constructor code in the case of %constructor
.) Should the caller invoke the stack cleanup function your_prefix_stack_cleanup()
(calc_stack_cleanup()
in the earlier example) then the corresponding %destructor
is not called.
// some user-defined class Symbol
%type sym: Symbol
// placement new, Symbol constructor may throw.
%raii_constructor new (&$$) Symbol();
// only called if %raii_constructor completes without exceptions
%destructor $$.~Symbol();
%token_action
After construction, in the case of new terminals in a parse() function, there needs to be an opportunity to build up the terminal in a manner specific to the type being used as a terminal, rather than the type being the result of a production action. For this purpose, %token_action
can be used.
The code specified in %token_action
only executes for terminals, for non-terminals it is skipped (as the production action should fulfill the same role.)
Note that %token_action
executes only in the parse() function (where no scanner exists) as the scan() function (where a scanner already exists) has better construction options like common actions.
%token_action for %common_type
If the %token_action
is assigned to the %common_type
, it executes after the common constructor. In this situation, there is no symbol specific data, and both $$
and $
special identifiers refer to the common data for the terminal.
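As a sketch of this situation, the snippet below stamps position information onto every incoming terminal's common data (the g_line and g_col counters are hypothetical variables maintained by the surrounding driver code):

```
%common_type struct { int line_, col_; }
%constructor $$.line_ = 0; $$.col_ = 0;
%token_action $$.line_ = g_line; $$.col_ = g_col;
```

Because the %token_action here is attached to the common type, it runs for every terminal passed into parse(), right after the common constructor.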
%token_action for %token_type or %type
If the %token_action
is assigned to a symbol specific type, it executes after the common constructor, after the common %token_action
action and after the symbol specific type constructor. The $$
special identifier will reference the symbol specific data type that %token_action is assigned to, the $
special identifier will reference the %common_type
data.
%move
The %move
directive is used to specify how a type can be moved from one location $0
to another $$
. If no %move
directive is specified, a datatype is moved by means of memcpy
, and the %destructor
is still called on the source.
Example of how %move
might be used to move a C string around:
%type IDENT: char*
%constructor $$ = NULL;
%move $$ = $0; $0 = NULL;
%destructor if ($$) free($$);
Note that, without the use of %move
, the char *
would be freed twice, once at the source ($0
in the %move
directive,) and once, much later, at the destination. Previous versions (before 0.8.12) did not have %move
but copied the data over with memcpy semantics. The %destructor
was not called on the source.
There are three specific cases where %move
is used:
- A token has been scanned and typed data has been associated with it. This is prior to it hitting the parser. The %move directive specifies how the data is moved from the scanner to the parser. In the generated code, this location is “slot 0” on the stack.
- A production is being reduced; the $$ that the production reduces to cannot yet be on the stack, because the symbols of the production are still on the stack. The %move directive specifies how the data is moved from the $$ slot to the stack, after those symbols have been deconstructed (popped). In the generated code, this location is “slot 1” on the stack.
- The input is sufficiently complex that internal buffers need to be resized to fit the larger stack. This is done by allocating a new buffer (new_buf_ in the generated code) and moving the data from the old buffer to the new buffer element by element. The %move directive specifies how the data is moved from the old buffer to the new buffer. Note that if no %move directive is specified for any type, Carburetta may opt to use realloc() instead of allocating a new buffer and moving the data.
%destructor
To specify cleanup code for datatypes, the %destructor
directive may be used. Use of %destructor
is, however, contingent on a few limitations:
- Destructor code may never fail or return to the caller, be it via exception or return statement. A destructor must always succeed.
- Destructor code cannot rely on %locals or %params.
The reason for both is that the %destructor code snippets are also executed from the your_prefix_stack_cleanup() function (e.g. calc_stack_cleanup()), where the %locals and %params directives have no effect, and that cleanup is expected to complete in one call. Consequently, the cleanup function doesn’t have any continuation logic. Returning from the cleanup function directly from %destructor code therefore causes memory, and possibly other, leaks.
%class (C++)
The %class
directive is an alternative to using %type
when the target language is C++. Using it automatically fills in constructors, destructors and move operators for the type to follow C++ conventions. So for instance, the following:
%class sym: Symbol
is roughly equivalent to:
%type sym: Symbol
%raii_constructor new (&$$) Symbol();
%move $$ = std::move($0);
%destructor $$.~Symbol();
That is to say, the %class
directive causes the default constructor to be used as %raii_constructor
and C++ move semantics to be used as %move
. The default destructor is used for %destructor
. Note that the %raii_constructor
is used instead of %constructor
because, if the constructor throws (which it is permitted to do) you want to consider the construction to have failed (thus re-entry into <prefix>scan()
or <prefix>parse()
will continue at retrying the constructor.)
The reason the above code is “roughly” equivalent to calling placement new is that (through the magic of templates) arrays are still permitted. So, for instance, the following:
%class sym: Symbol[10]
is a valid directive that associates an array of 10 Symbol objects to each sym on the stack. The array will be initialized using the default constructor, moved element by element using std::move, and destroyed using the default destructor (element by element.)
Finally, note that use of %class
is invalid for C code, the generated code will no longer compile on a C compiler, and instead make the compiler issue the error use of %class directive requires compilation as C++
.
The Common Type of All Symbols
Symbol types are used for assigning data specific to each individual symbol. There is often a need to assign a small amount of data to all symbols (for error tracking, or recording sections of matched code, etc.) The type for this data is the common type of all symbols; it is specified using the %common_type
directive.
%common_type
To specify the C data type common to all symbols, we can use the %common_type
directive. For instance, to specify a line and column number to go with every terminal and non-terminal:
%common_type struct { int line_, col_; }
%common_class (C++)
The %common_class
directive is an alternative to using %common_type
when the target language is C++. It is the common data type way of specifying a C++ class and having the %raii_constructor
, %move
and %destructor
filled in automatically. The %common_class
directive is therefore to %common_type
as the %class
directive is to %type
. So, to follow the example from before:
%common_class Symbol
is roughly equivalent to:
%common_type Symbol
%raii_constructor new (&$$) Symbol();
%move $$ = std::move($0);
%destructor $$.~Symbol();
See the discussion at %class
above for the reasons why the above is “roughly” equivalent.
Actions
When a production is matched, some defined action is to take place, consisting of a C/C++ code snippet:
grammar: expr {
printf("Outcome: %d\n", $0);
*final_result = $0;
}
Here the grammar: expr
production is matched, and multiple lines of code will execute whenever that occurs. A special identifier $0
is used to access the value of the first symbol counted from left starting at 0. The action finishes when the number of curly open braces matches the number of curly closing braces.
$$
The special identifier $$
indicates the destination symbol data. It is of the type previously declared for the non-terminal using %type
. If no such type was declared, then accessing destination symbol data is invalid and using $$
will result in an error.
$0, $1, ..
Symbols in the production may be accessed using $n
where n is the zero-based index of the symbol counted from left on the production. Both terminals and non-terminals may be accessed in this fashion, provided they have a declared symbol data type using %type
or %token_type
.
The type of the symbol matches the declared type, and is an lvalue. As an lvalue, you can take pointers. In the earlier example, the expr
symbol is assigned the type int
:
%type grammar expr term factor value INTEGER: int
Consequently $0
is of type int. Being an lvalue, &$0
(the address of the int) is of type int *
and is legal. Note that being a modifiable int lvalue implies you may change the value of $0; &$0 is not a pointer to const.
Why are $0, $1, .., $n not const?
Why is it useful to have the inputs to a production action not be const, but modifiable lvalues? Consider the following list-building example:
%%
struct list_item {
struct list_item *next_;
};
%%
%type list list-item: struct list_item *
%constructor $$ = NULL;
/* some user-defined list cleanup function */
%destructor if ($$) list_free($$);
list: list-item {
$$ = $0;
$0 = NULL;
}
list: list COMMA list-item {
$$ = $2;
$2->next_ = $0->next_;
$0->next_ = $2;
$0 = NULL;
$2 = NULL;
}
Suppose in the above the symbol list-item
resolves to a struct list_item *
with the next_
member pointing to itself, and list
resolves to a tail-cyclic list (that is, the symbol value always points to the last member of the list, so we can always insert at head and tail with only a next_
pointer.)
In this example, it is very convenient to be able to use both the list-item and list symbol values to form a new list without any of the allocation. This only works if we set the symbol data values that we use to NULL, so the destructor knows cleanup is not necessary.
$
A single $
special identifier accesses the destination common data. This is only valid if a type for common data was declared using %common_type
. It should not be confused with its $$
destination symbol specific data counterpart.
${0}, ${1}, .., ${n}
The special identifiers ${0}, ${1}, .., ${n} access the common data of the symbols on the production. Calling them special identifiers is a misnomer as the text between the curly braces is actually C/C++ code that must evaluate to the index of the symbol you’d like to access. This expression is evaluated during runtime execution of the parser, the goal is to offer dynamic runtime programmatic access to the symbols’ common data. (Note that offering this dynamic runtime access for symbol type specific data ($0
, $1
) is not possible as the datatype for each of those depends on the symbol referenced.)
$len
The special identifier $len
evaluates to an integer expression containing the number of symbols in the production. While the action for each production has a fixed number of symbols (as each production has a fixed number of symbols), use of $len
can simplify code in combination with the common data references (${0}
, ${1}
). For instance, the following is legal:
%common_type const char *
all-together: one two three {
int n;
for (n = 0; n < $len; ++n) {
printf("%s ", ${n});
}
}
one: { $ = "one"; }
two: { $ = "two"; }
three: { $ = "three"; }
The above code will print "one two three "
whenever the production for all-together
is matched. Each of the one
, two
, and three
non-terminals will assign the corresponding string to its common data. This common data is then accessed dynamically by looping a variable n
from 0 to $len
exclusively. Note that, in the above, $len
and ${n}
do not resolve when Carburetta compiles the code, but only when the parser is executing at runtime (the n
in ${n}
could be any expression, and is simply copy and pasted by Carburetta to the appropriate place, without any deeper understanding.)
Common actions
Common actions are actions shared across multiple, but not necessarily all, productions. A common action is specified as a production with the special identifier $
as its destination non-terminal and no symbols:
$: {
printf("Before the regular action\n");
}
regular: sym {
printf("After the common action\n");
}
$:{}
Every common action defined applies to the productions that follow it, until another common action is defined. To define no common action, use an empty common action $:{}
which will be detected as “no common action” in the code.
Because common actions are shared amongst productions, there is no way for the code to access the symbol type specific data (as the symbols may vary amongst the productions.) There is however access to the common data, as an example:
%common_type struct { int line_, col_; }
$: {
if ($len > 0) {
$.line_ = ${0}.line_;
$.col_ = ${0}.col_;
}
else {
$.line_ = 0;
$.col_ = 0;
}
}
The above example illustrates how the common action can be used to automatically pass along the location of a non-terminal; here the location (as specified as a line_ and col_ pair) of any non-terminal whose productions follow the common action is defined to be the location of the first symbol (terminal or non-terminal.) If there are no symbols on the production, we set line_
and col_
to 0 (which is an invalid location, as line and column numbers start at 1.)
$discard
When executing a common action, it may be desirable to have it prevent the main production action from executing. For this purpose, the special identifier $discard
can be used. $discard
will evaluate to a C/C++ statement, and its presence ensures that the subsequent production action will not execute.
Note however that %constructor
, %raii_constructor
and %destructor
code continues to execute irrespective of $discard
.
A typical use-case for $discard
is to distinguish between the mere parsing of a section of input, and the actual evaluation or execution of that input. For example, imagine parsing an if statement for an interpreter. We can execute it without needing to resort to an abstract-syntax-tree or other form of pseudocode, if we can first evaluate the condition of the if statement before executing the true branch of the if statement. Depending on the result of the condition, we may then either execute the true branch (by parsing it again, now without executing $discard
in a common action), or skip over it (by continuing parsing after the branch we parsed before.)
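As a sketch of that technique, a common action could consult a flag that indicates whether the interpreter is currently executing (the executing variable is hypothetical, e.g. declared via %locals or %params):

```
$: {
  if (!executing) $discard;
}
```

While the flag is clear, the productions that follow this common action are still parsed, but their own actions do not execute; as noted above, %constructor, %raii_constructor and %destructor code still runs.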
Resolving Ambiguity using %prefer %over
Many grammars may have ambiguities in them that have to be resolved, here is an example:
stmt: expr SEMICOLON;
stmt: if PAR_OPEN expr PAR_CLOSE stmt;
stmt: if PAR_OPEN expr PAR_CLOSE stmt ELSE stmt;
This is the “dangling else ambiguity” that exists in many languages. When compiling the above, Carburetta will give an error:
Error, reduce/shift conflict found:
example.cbrt(2): stmt: IF PAR_OPEN expr PAR_CLOSE stmt * (reduce)
example.cbrt(3): stmt: IF PAR_OPEN expr PAR_CLOSE stmt * ELSE stmt (shift)
(Use %prefer %over directives to force resolution of conflicts)
This happens because, in a statement like if (a) if (b) x; else y;, the grammar does not distinguish between this interpretation, where the else binds to the inner if:

if (a) { if (b) x; else y; }

versus this interpretation, where the else binds to the outer if:

if (a) { if (b) x; } else y;
Grammatically speaking, they are both valid interpretations. The error message indicates this; the *
in the error message indicates where the parsing is encountering a problem. Here, it cannot decide between:
example.cbrt(2): stmt: IF PAR_OPEN expr PAR_CLOSE stmt * (reduce)
reducing the if production on line 2 (this is the second interpretation above,) and:
example.cbrt(3): stmt: IF PAR_OPEN expr PAR_CLOSE stmt * ELSE stmt (shift)
shifting the else to make an if-else (this is the first interpretation above.)
To successfully let Carburetta compile the grammar, it has to be told what to do here, using %prefer
and %over
directives:
stmt: expr SEMICOLON;
stmt: if PAR_OPEN expr PAR_CLOSE stmt;
stmt: if PAR_OPEN expr PAR_CLOSE stmt ELSE stmt;
%prefer stmt: IF PAR_OPEN expr PAR_CLOSE stmt * ELSE stmt
%over stmt: IF PAR_OPEN expr PAR_CLOSE stmt *
Now it will prefer the shifting of the ELSE symbol over the reduction of the inner if. The net effect is that, because an ELSE will always prefer to shift, the ELSE will always bind to the nearest IF. This is the designed behavior of most languages with this ambiguity.
Note that the %prefer
%over
conflict resolution directives are syntactically intentionally close to the error message they resolve, however, it is important to first understand why the conflict occurs before spraying conflict resolution directives over the code. Usually ambiguities can be difficult to understand (in the case above, the ambiguity only occurs with nested if-else combinations, not a single if-else statement.)
Error Recovery
When a parsing error occurs (the next input terminal cannot match) the generated Carburetta parser will examine its stack to determine if there is some prior state where an error non-terminal can be transitioned. Put differently, if we remove zero or more symbols (both terminals and non-terminals) we have already processed, can we then arrive at a situation where we can accept an error
terminal as if it were input?
This is true if there is an error production somewhere alongside the valid productions. An extract from the earlier example:
expr: term { $$ = $0; }
term: factor { $$ = $0; }
factor: value { $$ = $0; }
value: INTEGER { $$ = $0; }
value: error {
$$ = 0;
report_error("Syntax error: value expected;\n");
}
Here, whenever a value
is expected (which also means whenever a factor
, term
, or expr
is expected,) and we instead find something that doesn’t match any of these (i.e. we find a parsing error), then the value: error
error production means the error
terminal can be accepted whenever we are able to accept an expr
, term
, factor
or value
. In such a situation, the error is recoverable and the parser enters error recovery mode.
Irrespective of whether the parser will try to recover the error, it first returns _<PREFIX>SYNTAX_ERROR
(for instance _CALC_SYNTAX_ERROR
in the example earlier) before continuing processing upon re-entry. Note that, after returning _<PREFIX>SYNTAX_ERROR
it will not return the error again until at least 3 terminals have shifted. A quick succession of parsing errors will therefore not result in many errors issued. Note that this “error muting” behavior has no effect on error recovery and attempts to select error productions in error recovery mode.
If an error occurs and the parser does not enter error recovery mode, then, after returning _<PREFIX>SYNTAX_ERROR, the parser attempts to match the next input terminal (so only input is discarded and nothing else.) Further processing may be pointless, as this simpler behavior might result in a lot of errors. To distinguish between a _<PREFIX>SYNTAX_ERROR where no recovery mode is entered and one where recovery is possible, the caller may use the generated <prefix>can_recover() function. If <prefix>can_recover() returns zero (no recovery is possible), the caller may decide to give up.
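As a sketch, a caller loop acting on these return values might look like this (the calc_ naming follows the earlier example; the exact signature of the scan function, the result variable, and report_error() are illustrative assumptions, not prescribed by the manual):

```
int r;
while (0 != (r = calc_scan(&stack, &result))) {
  if (r == _CALC_SYNTAX_ERROR) {
    report_error("Syntax error\n");
    if (!calc_can_recover(&stack)) {
      /* no error production is reachable from any stack state; give up */
      break;
    }
    continue; /* error recovery mode: keep feeding input */
  }
  /* handle other return codes (e.g. requests for more input) here */
}
```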
Error recovery mode
In error recovery mode, the parser discards input terminals until a terminal is found that can be accepted after an error terminal is accepted. In the case of the earlier example, this might be the closing parenthesis of an equation from the production factor: PAR_OPEN expr PAR_CLOSE
or the end of input if no parentheses are in the input. Note that, if both are possible (suppose we have nested parentheses in the equation (1 + (2 * error
and the next input terminal is )
) then the smallest edit (the least amount of symbols discarded/popped from the stack) will be preferred. Therefore, in the earlier example, if (1 + (2 * error
and the next input terminal is )
, then the interpretation for recovery will be (1 + (2 * error)
and not (1 + error)
.
Once the parser discovers an input terminal that is acceptable after hypothetically shifting an error token at some point on its stack, it will synchronize by actually popping the stack to the position where the error token can be shifted, actually shifting the error token, and then continuing processing as normal (shifting or reducing based on the terminal as usual.) When the error production is reduced (and its action executed) the parser is therefore no longer in error recovery mode.
If the parser fails to recover before the end of input is reached, 0 is returned from the scan or parse function as if the parsing is complete - thus you will break out of the function in the end.
The Scanner Section
The scanner section describes how text is lexically split up into tokens. These tokens are then passed into the parser as the terminals of the grammar. The scanner section is started by the %scanner%
section delimiter on a single line.
The scanner section consists of a set of patterns, each of which contains an optional terminal identifier to indicate what token is being matched by the pattern, a regular expression describing the actual pattern, and an optional action indicating the code that should execute on a match.
%token IF ELSE IDENT
IF: if;
ELSE: else;
IDENT: [a-zA-Z_][a-zA-Z0-9_]* {
printf("An identifier: \"%s\"\n", $text);
}
: /\*([^\*]|[\r\n]|(\*+([^\*/]|[\r\n])))*\*/;
In the scanner section above, there are four patterns; the first two are keyword patterns. The IF terminal is matched by the regular expression if
, which matches the plain keyword if
. Similarly for the ELSE terminal, it is matched by the else
keyword. Neither of these patterns has an action associated with it, and so if they match, their corresponding terminal will silently be passed to the parser.
The third pattern is for the IDENT terminal. It is matched by the regular expression [a-zA-Z_][a-zA-Z0-9_]*
(any character in the range a to z, or A to Z, or an underscore, followed by zero or more letters in the range a to z, or A to Z, or 0 to 9, or _). Note that this pattern also matches the “if” and “else” keywords. The order of the patterns is very important as it determines which pattern gets the match when multiple regular expressions can match. The first pattern that matches is the pattern whose action is executed and whose terminal is passed onto the parser.
The fourth and final pattern has no terminal associated with it, nor an action. If it matches, nothing is passed to the parser and no action code executes. The regular expression /\*([^\*]|[\r\n]|(\*+([^\*/]|[\r\n])))*\*/ matches C-style comments. By forming the pattern this way, a matched C-style comment is ignored and scanning continues.
Regular expressions
A regular expression consists of a set of elements and operators working on an element. At the topmost level, the bar |
may be used to separate sequences of elements into alternatives. For instance, foo|bar
will match either the string "foo"
or the string "bar"
.
Each element in a sequence may be followed by one or more operators:
- * : signifies zero or more of the preceding element
- + : signifies one or more of the preceding element
- ? : signifies zero or one of the preceding element
An element can be one of the following:
- Regular printable characters equal to or lower than ASCII value 127 represent themselves, except the following:
  - . : the dot is already used as an “any” character (see below) - to specify a dot character, use the escape \.
  - ^ : the caret is a reserved character - to specify a caret, use the escape \^
  - $ : the dollar sign is a reserved character - to specify a dollar, use the escape \$
  - ( : used to group elements - to specify an opening parenthesis, use the escape \(
  - ) : used to group elements - to specify a closing parenthesis, use the escape \)
  - [ : used to specify a set of characters - to specify an opening square brace, use the escape \[
  - ] : used to specify a set of characters - to specify a closing square brace, use the escape \]
- Any whitespace is ignored - to specify a whitespace character, use an appropriate escape sequence. In particular, the following characters are ignored:
  - space, use \ (a backslash followed by a space) instead
  - \x09 (tab), use \t instead
  - \x0B (vertical tab), use \v instead
  - \x0C (formfeed), use \f instead
  - \x0D (carriage return), use \r instead
  - \x0A (newline), use \n instead
- The . (dot) character, which matches any character but \x0A (newline).
- An escaped character: any character in the set {}[](),.^$*|?+:;-\'" represents itself when preceded with a backslash. Note this includes a few characters that would also be recognized as regular characters.
- One of the following escape sequences:
  - \a - 0x07, the bell
  - \b - 0x08, the backspace
  - \f - 0x0C, the formfeed page break
  - \n - 0x0A, the newline
  - \r - 0x0D, the carriage return
  - \t - 0x09, the tab
  - \v - 0x0B, the vertical tab
  - \xh - a single digit hexadecimal escape; the escape is single digit if it is not followed by a hexadecimal character. Represents the character 0xh (range 0 to 15.)
  - \xhh - a double digit hexadecimal escape, representing the character 0xhh (range 0 to 255.)
  - \n - a single digit octal escape; the escape is single digit if it is not followed by an octal (0-7) character. Represents the octal numeric value 0n (range 0 to 7.)
  - \nn - a double digit octal escape, representing the octal numeric value 0nn (range 0 to 63.)
  - \nnn - a triple digit octal escape, representing the octal numeric value 0nnn (range 0 to 255; exceeding this range is an error.)
  - \uhhhh - a four digit hexadecimal escape, representing the character 0xhhhh (range 0 to 65535.) Ranges outside 0..255 are only meaningful when using Unicode.
  - \u{h+} - a hexadecimal escape of one or more digits, representing the corresponding codepoint (range 0 to 0x10FFFF.) Ranges outside 0..255 are only meaningful when using Unicode.
- One of the anchor sequences:
  - ^ - matches when the character that follows it is at the start of the line.
  - $ - matches when the character that precedes it is at the end of a line (either the end of input, or just before a newline character.)
  - \A - matches only when the character that follows it is the first in the input (having an offset of 0.)
  - \Z - matches only when the character that precedes it was the last character in the input.
- A character set, consisting of ranges and individual characters between square braces. For instance, [a-zA-Z_] specifies the set of characters that includes the lower and uppercase alphabet, and the underscore _ character. (To specify the dash itself, use a backslash: [a-zA-Z_\-] is the same set, now including the dash.) Note that character sets can also contain unicode character set categories. See the Unicode Section below for more.
- A unicode character set category, either regular or inverted:
  - \p{category} - a character set category, representing all the codepoints specified in the category. For instance, \p{L} represents all the letters in the unicode standard. See the Unicode Section below for a list of supported categories.
  - \P{category} - an inverted character set category, representing all the codepoints not specified in the category. As with \p, see the Unicode Section below for a list of supported categories.
- An inverted character set, consisting of a character set with the ^ character as its first member. This negates the specified set. For instance, [^a-zA-Z_] matches any character but a to z, A to Z, or underscore.
- A regular expression between parentheses acts as a single element. For instance, the regular expression foo?bar matches "fobar" and "foobar", whereas (foo)?bar matches "bar" and "foobar".
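As an illustration of how these elements combine, here is a hypothetical scanner pattern for a signed decimal number (the NUMBER token name is an assumption, not taken from the examples above):

```
%token NUMBER
NUMBER: \-?[0-9]+(\.[0-9]+)?([eE](\+|\-)?[0-9]+)? {
  printf("A number: %s\n", $text);
}
```

The dash, dot, and plus are escaped per the element rules above; the ? operator makes the sign, fractional part, and exponent optional; and parentheses group each optional sequence into a single element.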
Unicode Support
The generated scanner supports unicode by default (specifying the --x-utf8
commandline flag is deprecated.) There is also a raw (Latin-1) mode by specifying the --x-raw
flag on the commandline.
The \p{category}
and \P{category}
regular and inverted unicode character set categories support the following categories:
Group: \p{C} / \p{Other}
- \p{Cc} / \p{Control}
- \p{Cf} / \p{Format}
- \p{Co} / \p{Private_Use}
- \p{Cs} / \p{Surrogate}
Group: \p{L} / \p{Letter}
- \p{Ll} / \p{Lowercase_Letter}
- \p{Lm} / \p{Modifier_Letter}
- \p{Lo} / \p{Other_Letter}
- \p{Lt} / \p{Titlecase_Letter}
- \p{Lu} / \p{Uppercase_Letter}
Group: \p{M} / \p{Mark}
- \p{Mc} / \p{Spacing_Mark}
- \p{Me} / \p{Enclosing_Mark}
- \p{Mn} / \p{Nonspacing_Mark}
Group: \p{N} / \p{Number}
- \p{Nd} / \p{Decimal_Number}
- \p{Nl} / \p{Letter_Number}
- \p{No} / \p{Other_Number}
Group: \p{P} / \p{Punctuation}
- \p{Pc} / \p{Connector_Punctuation}
- \p{Pd} / \p{Dash_Punctuation}
- \p{Pe} / \p{Close_Punctuation}
- \p{Pf} / \p{Final_Punctuation}
- \p{Pi} / \p{Initial_Punctuation}
- \p{Po} / \p{Other_Punctuation}
- \p{Ps} / \p{Open_Punctuation}
Group: \p{S} / \p{Symbol}
- \p{Sc} / \p{Currency_Symbol}
- \p{Sk} / \p{Modifier_Symbol}
- \p{Sm} / \p{Math_Symbol}
- \p{So} / \p{Other_Symbol}
Group: \p{Z} / \p{Separator}
- \p{Zl} / \p{Line_Separator}
- \p{Zp} / \p{Paragraph_Separator}
- \p{Zs} / \p{Space_Separator}
In the above, each Group contains all the categories in its list. Categories can also be specified individually, e.g. \p{Ll}
or \p{Letter_Number}
. Consequently, \p{L}
is equivalent to (\p{Ll}|\p{Lm}|\p{Lo}|\p{Lt}|\p{Lu})
. For readability, each category also has a long form, so \p{L}
is equivalent to \p{Letter}
.
The categories are case sensitive, and must not contain spaces.
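For instance, a hypothetical identifier pattern that accepts unicode letters (the IDENT token name and the exact category choice are illustrative, not prescribed by the manual):

```
%token IDENT
IDENT: (\p{Letter}|_)(\p{Letter}|\p{Decimal_Number}|_)* {
  printf("Unicode identifier: %s\n", $text);
}
```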
Finally, note that the categories seek to follow the unicode standard as closely as possible, whereas other regular expression engines may deviate. For instance, the category \p{Nonspacing_Mark}
is written as \p{Non_Spacing_Mark}
in some other engines.
Please see unicode annex #44, section 5.7.1 “General category values” for the source; reports of discrepancies would definitely be appreciated.
Future direction
UTF-8 is now the default mode of operation. Decoding rules will move from a compile-time flag to a runtime switchable feature, so whereas now Carburetta can generate two different kinds of scanners (one UTF-8, the other raw, enabled through the --x-raw
flag,) the intent for the future is to switch between encodings at runtime. Consequently, you would be able to write some code that detects the encoding of the input file, and then switch the scanner to the appropriate mode. While this is not yet implemented, it will impact the future behavior of the raw mode.
Input patterns support UTF-8 encoded codepoints beyond 127. Alongside UTF-8 and RAW (Latin-1), it is intended that the scanner will also support UTF-16 as a runtime switchable option. Such switching will, however, not be automatic (i.e. the scanner will not automatically detect the encoding of the input file.)
Currently the scanner, when in RAW mode, will recognize 0xA
(LF), 0xD
(CR) and 0x85
(NEL) as end of line conditions (e.g. for $
matching in patterns, and line count maintenance.) In UTF-8 mode, only 0xA
(LF) is recognized. Due to the difficulties of fitting a multi-codepoint line ending lookahead into a state engine that mixes both UTF-8 parsing in one state engine and pattern matching in another, the new behavior is likely to become the 0xA
-only match (so RAW support will lose the 0xD
and 0x85
matches as the RAW support switches over to the new multi-state-engine scanner logic.)
Care needs to be taken to monitor future releases for changes in this area.
Notable limitations
Note that the above implies there is currently no support for features like:
- Numbered quantification ({n} for matching something n times is not supported.)
- C-style string literal constants (using string literals like "a string with whitespace" as part of the regular expression.)
- Input files are not deemed UTF-8 encoded, so all unicode characters in patterns must be escaped using the \uhhhh or \u{h+} escape sequences. Input files are raw (Latin-1 or ISO-8859-1) encoded, and patterns are ASCII (7-bit.) This means that any other UTF-8 in string literals in actions or other areas will be passed through unaffected.
Such features may however appear in the future.
Actions
The role of actions in patterns is to populate the symbol specific data as well as the common data. To do this, the action has access to the text matched by the regular expression, and to the line, column and offset at the start and end of the match.
$$
In the action, $$ points to the symbol type specific data for the terminal being formed. Its type is determined by the terminal used as part of the pattern. A pattern that has no terminal cannot resolve $$ in its action.
$
In the action, $ points to the common data for the terminal being formed. Its type is determined by the %common_type directive. If a pattern has no terminal, or if %common_type is defined nowhere in the input file, then $ cannot resolve in an action.
$text
In the action, $text
points to a null ('\0'
) terminated string that holds the matched string. The type is char *
and - although this is something of an anti-pattern - it may be modified during processing. Note however that the same text string is passed on to all the code associated with this pattern’s match; consequently, it is good behavior to restore the buffer to its original state prior to completing the action code.
The $text buffer is used internally for on-going partial matches and lookaheads; consequently, after the action completes, $text may no longer be accessed. If the contents of the $text buffer are needed later, a copy must be made, and the action is the only reliable place to do it.
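A sketch of a pattern action that makes such a copy (the IDENT token, the char * token type, and the use of strdup() are illustrative assumptions; strdup() requires a POSIX or C23 environment):

```
%token IDENT
%token_type char *
IDENT: [a-zA-Z_][a-zA-Z0-9_]* {
  /* $text is only valid during the action; keep a private copy */
  $$ = strdup($text);
}
```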
$len
For efficiency, or if the pattern might match null ('\0'
) terminators, the length of the match in bytes is available in $len
. The type of $len
is size_t
.
$discard
The $discard special identifier has a similar behavior in the scanner section as it does in the grammar section: it permits cancellation of any subsequent actions from any of the earlier actions. For any given match, this is the sequence of possible actions:
1) %constructor code for the %common_type is executed; this will always execute.
2) %constructor code for the symbol specific type is executed; this will always execute.
3) The common action code $: {..} for the pattern is executed; use of $discard is legal here. This will only execute if no $discard was performed earlier.
4) The action code for the pattern is executed; use of $discard is legal but irrelevant here. This will only execute if no $discard was performed earlier.
Irrespective of the actions executed, the terminal will still be passed on to the parser.
Source locations $line and $endline, $column and $endcolumn, $offset and $endoffset
The source code location of matched input is maintained in the background and available to pattern actions. To access the 1-based line number of the first character of the match, use $line
. To access the 1-based column number of the first character of the match, use $column
. Finally, to access the byte offset of the first character of the match, use $offset
.
If the range of source code locations of the matched text is needed, $line
, $column
and $offset
have a counterpart in $endline
, $endcolumn
and $endoffset
respectively. These give the source code location of the first character after the match (that is, one character beyond the last character.) So, for example, if the match ends with a single newline, $endline will be one greater than $line, and $endcolumn will be 1.
The type of $line, $column, $endline, and $endcolumn is int. The type of $offset and $endoffset is size_t.
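For example, a hypothetical pattern action reporting the full source range of a match (the IDENT token is carried over from the earlier examples):

```
IDENT: [a-zA-Z_][a-zA-Z0-9_]* {
  printf("ident at line %d column %d (offset %zu), "
         "ending at line %d column %d\n",
         $line, $column, $offset, $endline, $endcolumn);
}
```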
$chg_term(terminal)
To change the terminal matched for another terminal, use the $chg_term(terminal)
special identifier. This is dangerous and should be used very carefully. The terminal argument represents the terminal, previously declared with %token, using the same nomenclature (e.g. the search for the matching terminal is case-insensitive and the use of dashes (-) is permitted.)
Use of $$ will continue to point to the data of the old terminal matched. You can however rely on this data being in the same memory address as the data for the newly set terminal - if both terminals have the same datatype, things will work flawlessly.
If the datatypes differ, extra care needs to be taken, as the remainder of the scanner (prior to passing control to the parser) will continue to execute as if the old terminal had been matched. E.g. if the <prefix>stack_cleanup() function is invoked before the parser assumes control, then the cleanup code for the old terminal will be executed; this only happens if the caller returns before control is passed to the parser.
As mentioned, this is a dangerous feature. Its use-case is as a last resort to resolve ambiguities in languages where some dependency on the symbol table influences the interpretation of tokens. For instance, in the C language, the typedef-name is ambiguous with a regular identifier. In such a scenario, a combination of $chg_term()
and <prefix>stack_accepts()
can help resolve the ambiguity.
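A sketch of that use-case (the IDENT and TYPEDEF_NAME tokens and the is_typedef_name() symbol table helper are hypothetical; per the caveat above, both terminals should share the same datatype):

```
%token IDENT TYPEDEF_NAME
IDENT: [a-zA-Z_][a-zA-Z0-9_]* {
  /* is_typedef_name() is a hypothetical symbol table lookup */
  if (is_typedef_name($text)) {
    $chg_term(TYPEDEF_NAME);
  }
}
```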
Common Actions
As with productions in the grammar section, the patterns in the scanner section can have common actions defined. These execute before the main pattern action itself.
$: {
printf("Before the regular action\n");
}
IDENT: [a-zA-Z_][a-zA-Z0-9]* {
printf("After the common action\n");
}
$:{}
Specifying a common action in the scanner section replaces any existing common action and applies it to all following patterns. Specifying an empty common action $:{}
clears the common action, any following patterns will not have any common action until another one is specified.
Modes
The generated scanner may have modes. A mode is a state that the scanner is explicitly put in, where only some subset of patterns can be matched.
%mode
To specify the existence of a mode, use the %mode
directive:
%mode START BODY
The above directive creates a START
and a BODY
mode. The identifier for these is case insensitive, following the same rules as for symbol identifiers (e.g. -
and _
may be used interchangeably.) Each mode will generate a corresponding mode constant M_<prefix>MODE
in the generated code, where MODE
is the upper-cased identifier specified in the %mode
directive.
In addition to the modes explicitly created, a default mode (named default
) is always created. Patterns that do not explicitly specify a mode will match the default
mode.
By default, the mode is set to the DEFAULT mode. To change the mode, use the <prefix>set_mode()
function with one of the M_<prefix>MODE
constants, for instance:
calc_set_mode(&stack, M_CALC_START);
The above code sets the mode to the START mode. Note that this mode will survive calls to <prefix>stack_reset(); when processing multiple input files, you will have to manually reset the initial mode to the desired value.
$set_mode(mode)
Using <prefix>set_mode()
with one of the M_<prefix>MODE
constants can be inconvenient because it creates a coupling to the <prefix>
value set with %prefix
(or its default value) and remembering the M_<prefix>...
generated code encodings (which you would otherwise rarely encounter.)
To help simplify things, the pattern action code may use the $set_mode(mode)
special identifier. This will look up the mode
identifier (which can be represented in the same manner as in the %mode
directive, that is, case-insensitive and optionally using dashes (-
)), and translate it in the generated code to a <prefix>set_mode(stack, M_<PREFIX>MODE)
invocation with the appropriate prefix and constant encoding.
Note that, while $set_mode()
is specific to the scanner, it is available from any action. When using it from parser code, you should be aware that the parser operates on a lookahead. That is to say, the symbol that follows the production that is being reduced has already been parsed. In pattern action code, there is no such lookahead: $set_mode()
will immediately impact the next pattern to be matched.
$is_mode(mode)
To check if the current mode is mode
, use the $is_mode(mode)
special identifier. Analogous to $set_mode()
, this will look up the mode
identifier (which can be represented in the same manner as in the %mode
directive, that is, case-insensitive and optionally using dashes (-
)), and translate it in the generated code to an inline comparison of the current mode to the appropriate M_<PREFIX>MODE
constant.
The value is 1 if the mode is the same, or 0 otherwise.
The $is_mode(mode)
special identifier is available from any action.
Specifying a pattern’s mode
To specify a mode for a pattern, prefix the pattern with a mode designator:
%token IDENT DIGITS
%mode FIRST SECOND
<FIRST> IDENT: [a-zA-Z_][a-zA-Z0-9]* {
printf("Only matches if the mode is FIRST\n");
}
<FIRST, SECOND> DIGITS: [0-9]+ {
printf("Only matches if the mode is FIRST or SECOND, "
"but not DEFAULT\n");
}
<FIRST, SECOND, DEFAULT> IDENT: [a-zA-Z_][a-zA-Z0-9]* {
printf("Can match in all modes, including DEFAULT. "
"However, if the mode is FIRST, the first "
"rule takes precedence because both patterns "
"will match and it is declared earlier in the "
"carburetta input file.\n");
}
Grouped mode specification
In addition to the single pattern mode designator, patterns may be grouped under a mode designator:
%token IDENT DIGITS
%mode GROUP
<GROUP, DEFAULT> {
IDENT: [a-zA-Z_][a-zA-Z0-9]* {
printf("Matches when the mode is GROUP or DEFAULT\n");
}
DIGITS: [0-9]+ {
printf("Matches when the mode is GROUP or DEFAULT\n");
}
}
Note, however, that within a grouped mode designator <MODE>{...} no further mode designators may be specified; mode designators cannot be nested.
The Generated Code
We’ll now examine the code that is generated so it can be integrated into the larger codebase. Potentially, two files are generated: the C file, containing the implementation of the parser and, if a scanner section exists, the scanner; and the H file, which is optional and describes the functions and structures exported by the C file, as well as all symbol constants.
%prefix, <prefix> and <PREFIX>
All symbol constants, all struct
declarations, all global tables as well as all functions with external linkage, have a prefix prepended to them. By default, this prefix is carburetta_
for lowercase names and CARBURETTA_
for upper-case names. This can however be changed using the %prefix
directive. From the earlier example:
%prefix calc_
Here %prefix is used to set the prefix applied to all identifiers to calc_
, and this replaces the default prefix of carburetta_
. In this manual, the placeholders <PREFIX>
or <prefix>
are used to indicate upper or lowercase prefix strings in identifiers.
%externc and %no_externc
By default, all header files generated by carburetta are wrapped in an extern "C" { ... }
block. This is to allow the generated code (when you compile it as C or C++) to be used from both C++ and C code. However, if you are compiling the generated code as C++, and do not include the generated header file in the prologue section, then the C++ compiler will create decorated names for the functions, while the code using the header file will expect undecorated (C-style) linkage names. This will cause a linker error in a very common scenario for first time users.
For this reason, when Carburetta sees you using %class
or %common_class
, it will not generate extern "C" { ... }
for the header, as it knows you’ll be compiling it as C++.
You may want to influence this default behavior, for instance, if you’re not using %class
or %common_class
and you’re still compiling it as C++, you want a way to disable the extern "C" { ... }
block. There are two directives to help out:
%externc
, this will force Carburetta to generate theextern "C" { ... }
block.%no_externc
, this will force Carburetta to not generate theextern "C" { ... }
block.
%externc
and %no_externc
override the default behavior, irrespective of the presence or absence of %class
and %common_class
. You should only use these directives once in a Carburetta file.
Transformation to C/C++ identifiers
Sometimes, when only using the parser, and not using the scanner, the symbol identifiers must be used directly from your driver code. The identifier declaration that is generated in the output code derives directly from the identifier in the grammar.
The rule for generating a C/C++ identifier is as follows:
- All letters are converted to uppercase.
- All dashes are converted to underscores.
- The prefix from the %prefix directive is converted to uppercase and prepended; if no such directive exists, the default prefix CARBURETTA_ is prepended.
In the calculator example, where the %prefix calc_
was used, the terminals PLUS
, and PAR_OPEN
become CALC_PLUS
and CALC_PAR_OPEN
. Similarly, the non-terminals grammar
and expr
become CALC_GRAMMAR
and CALC_EXPR
from the perspective of the code. Were we to have declared a symbol two-sides
then it would become CALC_TWO_SIDES
(with the dash converted to an underscore.)
Note that, unless a header is generated and included in the prologue, these symbol declarations are only available in the epilogue code section, after all scanner and parser code is generated, and not in the prologue and intermission code sections.
C/C++ File
The C/C++ output file is intended to compile as both valid C and C++, depending on whether the code snippets are C++ specific or can be compiled as C.
The C/C++ file consists of 3 sections:
- The prologue, containing the code from the prologue section, as well as code from any intermission sections.
- The generated code, containing the code for the parser and, if a scanner section is present, the scanner.
- The epilogue, containing code from the epilogue section, if one is present.
The prologue and epilogue are full copies of the user’s code in the .cbrt input file; in the remainder of this section, we will focus on the generated code. Aside from the primary programming interface (the parse and scan functions,) the discussion will include data structures and declarations that are very likely to change from version to version. The reason for this is to help give readers an understanding of how the parser works, both to aid in debugging and to give flexibility in customized use. Being likely to change, the decision to use some of the internals of the parser directly should be weighed against the maintenance cost of working against a volatile baseline.
Include files in the generated C code
These includes may change from version to version.
The generated code relies on the standard C library providing the following features: realloc()
, free()
, NULL
, size_t
, memcpy()
, size_t
, and SIZE_MAX
. To access these, the following standard C library files are included: stdlib.h
, string.h
, stddef.h
and stdint.h
.
There currently is no way to redirect Carburetta to use non-standard C library implementations.
Known C++ limitations
The generated code is intended to compile as both valid C and C++. However, there are some limitations in the generated code that may cause problems when compiling as C++.
Symbol data objects are part of an array. If a C++ class is used as a symbol’s data type with %type, but no associated %move directive is given, then objects may be moved around in memory using memcpy(), or realloc() may be used to resize the array. When that happens, operator=() is not called for the individual elements. Therefore, an object cannot rely on its own pointer staying consistent. Care must be taken to include %move for objects that are not trivially copyable.
To help address this, for C++, you can use the %class
directive to specify the type. Doing so will cause %raii_constructor
, %move
and %destructor
to be automatically added to the symbol data type. This will ensure that the object is properly constructed, moved and destroyed when the array is resized, according to C++ programmer expectations. (Please see the sections on %class
and %move
for details.)
Data structures and types
The C file generates several data structures, an understanding of which is either useful to understand the internal theory of operation, or required to interface in the case of struct <prefix>stack
.
struct <prefix>sym_data
This structure is likely to change from version to version.
The <prefix>sym_data
structure describes a single state and its corresponding data. In an LR parser, states have a many-to-one relationship to symbols; consequently, we can always derive the top symbol from the state it has brought the parser to, and therefore the symbol data associated with it.
The <prefix>sym_data structure has the following fields:
- state_ : an int representing the row of the state in <prefix>parse_table.
- common_ : optional; if the %common_type directive is present, this is where the common data for a symbol is held.
- union {} : optional; present if any %type or %token_type directive is used. An anonymous union holding the data for the state. Being a union, only 1 member is valid, depending on the symbol the state corresponds to. Because this union may hold C++ classes, the liveness of those classes depends on state_ and is actively maintained by invoking %constructor or %raii_constructor, and %destructor snippets as appropriate.
Because this structure contains user declared types that need only be declared in the prologue, the structure is not available from the header, as the prologue is not available there.
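To make this concrete, here is a hypothetical rendition of what the generated structure could look like for the earlier calculator example (assuming %common_type int and a single int symbol type; the field names follow the description above, and the actual layout is version dependent):

```
struct calc_sym_data {
  int state_;   /* row of the state in calc_parse_table */
  int common_;  /* common data; present because %common_type was declared */
  union {
    int uv0_;   /* one member per distinct %type / %token_type */
  } v_;
};
```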
<prefix>scan_table_rex[]
This data structure is likely to change from version to version - not generated for utf8 scanners
A static constant global table housing the deterministic finite automaton (DFA) for the input scanner. Each row represents a state of the DFA, each column indexes the cell containing the next state for any input character from that state. The cell may have a value of 0 if the symbol at that state is invalid, or a positive number representing the next state. Given that an input character has 8 bits, this table has 256 columns.
<prefix>scan_table_grouped_rex_[]
This data structure is likely to change from version to version - only generated for utf8 scanners
A static constant global table housing the deterministic finite automaton (DFA) for the UTF-8 based input scanner. Because unicode has many codepoints, the next codepoint is not used directly in the DFA, instead, the codepoint is grouped into a range of codepoints, and the DFA is indexed by this group instead. This has the desirable effect of compressing the DFA, as many codepoints are in the same group. Each row represents a state of the DFA, each column indexes the cell containing the next state for any input character from that state.
The cell may have a value of 0 if the codepoint group at that state is invalid, or a positive number representing the next state. For this reason, row 0 is unused. Note also that column 0 is always invalid as the codepoint group 0 is reserved for “all codepoints not used in the scanner.”
<prefix>num_scan_table_grouped_columns_
This data structure is likely to change from version to version - only generated for utf8 scanners
The width of the <prefix>scan_table_grouped_rex_[]
table, in columns.
<prefix>utf8_decoder_
This data structure is likely to change from version to version - only generated for utf8 scanners
A decision table to scan a sequence of bytes and determine the codepoint group they represent. The table is indexed by the byte value, and the value in the table is the next state if positive, alternatively if negative the one’s complement represents the codepoint group used to index the <prefix>scan_table_grouped_rex_[]
table. If the one’s complement is 0 (that is to say, the value is -1) then the sequence is not a valid UTF-8 codepoint. Given that an input byte has 8 bits, this table has 256 columns.
<prefix>scan_actions_rex[]
This data structure is likely to change from version to version.
Indicates the action for each DFA state in the <prefix>scan_table
; if the number is zero, there is no action, if the number is positive, it corresponds to a (1-based) pattern index, counted from the top of the .cbrt input file, in order of appearance.
This structure is used by the <prefix>lex()
function to discover and record the current longest valid match.
<prefix>minimum_sym
The meaning and availability of this constant is likely to change from version to version.
Contains the minimum symbol value. This value is used to 0-base the column index into <prefix>parse_table
.
<prefix>num_columns
The meaning and availability of this constant is likely to change from version to version.
Contains the number of values on a row of the <prefix>parse_table
, used to index the table by row.
<prefix>num_productions
This value is no longer available
<prefix>parse_table[]
This data structure is likely to change from version to version.
Each row of this table is <prefix>num_columns
wide and corresponds to an LR state in the parser, the column index is the next symbol (minus <prefix>minimum_sym
) at that state. The total number of columns is <prefix>num_columns
, this value is likely higher than you might expect it to be due to implicit symbols created for errors, end of input and for the synthetic S production (a production added to represent the grammar in its entirety.) The initial state is 0.
Cells in the table correspond to the action to take at a particular state when it encounters a particular symbol. For any action value greater than 0, the action is to shift the terminal and go to the state equal to the action value. For any action value less than 0, the action is to reduce a production. The index of the production is equal to production# = -action - 1
, the first production (number 0) is the synthetic S production. The synthetic S production represents the entire grammar (the synthetic S production is S': N
where N
is the non-terminal that the first user defined production reduces to,) consequently, if -action - 1 == 0
then parsing has successfully completed. Finally, if the action equals 0, the input symbol is an error (the parser, upon encountering it, will proceed to error recovery.)
<prefix>production_lengths[]
This data structure is likely to change from version to version.
For each production, the array contains the number of symbols on the rule side of the production (excluding the nonterminal reduced to.) The array is indexed by the number of the production in the same order as found in the source, starting at 1. Index 0 is reserved for the synthetic S production that is generated by Carburetta itself.
<prefix>production_syms[]
This data structure is likely to change from version to version.
For each production, the array contains the ordinal value of the nonterminal symbol reduced to. It is indexed the same way as <prefix>production_lengths
. The intent of this array is for it to be used to analyze the contents of the stack at runtime.
struct <prefix>stack
The <prefix>stack
structure manages all state of the parser during its operation; in object oriented terminology, it would be the “parser object.”
The lifetime of a struct <prefix>stack
is loosely as follows:
- Allocated either as a local variable, or using malloc.
- <prefix>stack_init(): the parser stack is initialized.
- The parsing loop executes:
  - <prefix>stack_set_input(): the parser stack is given a reference to an input buffer.
  - <prefix>stack_scan(): the parser stack scans and parses the input.
- <prefix>stack_reset(): if the parsing stack is to be re-used, it is reset.
- <prefix>stack_cleanup(): the parser is cleaned up, all memory released, all latent user data deconstructed.
Here is how we might wrap the generated struct calc_stack
from the simple calculator example earlier in a C++ class:
#include <stdexcept>
#include <new>

class Parser {
private:
  struct calc_stack stack_;
  int result_ = 0;
public:
  Parser() {
    // cannot fail, no error handling needed
    calc_stack_init(&stack_);
  }
  ~Parser() {
    calc_stack_cleanup(&stack_);
  }
  void reset() {
    switch (calc_stack_reset(&stack_)) {
      case _CALC_OVERFLOW: throw std::runtime_error("overflow");
      case _CALC_NO_MEMORY: throw std::bad_alloc();
    }
  }
  int scan() {
    int r = calc_scan(&stack_, &result_);
    // translate fatal error codes into exceptions
    switch (r) {
      case _CALC_OVERFLOW: throw std::runtime_error("overflow");
      case _CALC_NO_MEMORY: throw std::bad_alloc();
      case _CALC_INTERNAL_ERROR: throw std::logic_error("internal error");
    }
    return r;
  }
  // .. and so on ..
};
<prefix>destroy_at(), <prefix>construct_at(), <prefix>move_at() (C++ only)
template<typename T> void <prefix>destroy_at(T *p);
template<typename T> void <prefix>construct_at(T *p);
template<typename T> void <prefix>move_at(T *dst, T *src);
These are function templates with automatic type deduction to construct (call the constructor of), destroy (call the destructor of) and move (assign using std::move
semantics). They are used by the default implementations for %raii_constructor
, %move
and %destructor
whenever the %class
directive is used to specify a C++ type. Because they are C++ specific, they are only generated when the parser is compiled as C++, which occurs upon first encountering a %class
directive. The reason for having them, instead of calling placement new, std::move
assignment and the destructor directly, is to facilitate array types, for which these functions also specialize.
The function templates are inspired by the C++17 standard library template< class T > void destroy_at( T* p );
and C++20 standard library template< class T, class... Args > void construct_at( T* p, Args&&... args );
functions. The reason for not using them directly is that they are not available in earlier versions of C++ and will therefore not compile “out of the box” unless the compiler supports C++17 or C++20. This would impose requirements on the flags passed to some compilers where C++17 and up might not be enabled, whereas now the code can be compiled as-is without requiring changes to the compiler flags.
Function return codes
While in the case of <prefix>scan()
and <prefix>parse()
the return value behavior may be completely customized, there do exist a set of default return codes and these return codes are also used internally. For both working with the default behavior as well as debugging your code, each return code is described here. Future versions may add or remove these return codes.
Notice that all return codes start with a leading _
underscore, this is to avoid collisions with terminal and non-terminal symbols. For instance, if your language has a %token FINISH
then the corresponding C/C++ identifier would be <PREFIX>FINISH
which conflicts with the FINISH return code had it not been prefixed with an underscore.
_<PREFIX>FINISH
The _<PREFIX>FINISH
code is returned by both <prefix>scan()
and <prefix>parse()
to indicate that the input has been fully processed to its finish. It does not mean it did so successfully, as _<PREFIX>FINISH
is also returned if, for instance, an error occurred, and the parser later failed to recover from that error.
_<PREFIX>FINISH
is defined to equal 0, which may be useful when used directly as a part of logic condition evaluation.
_<PREFIX>MATCH
The _<PREFIX>MATCH
code is returned by the internal <prefix>lex()
function to indicate it found a lexical match. It is not returned by <prefix>scan()
or <prefix>parse()
functions, and so caller code does not typically check for it.
_<PREFIX>OVERFLOW
Returned when internal logic detected an overflow. These overflows may occur during allocation size computations. Given the very high field size tolerances, an overflow typically indicates some form of memory corruption or internal error. It should therefore be considered fatal.
_<PREFIX>NO_MEMORY
Returned when a call to realloc()
failed to return memory. All internal data structures remain valid, if there is some magical way that memory can be made available (e.g. free some form of temporary cache used by the program,) then execution could retry. Typically, however, this is impossible and not worth it, and the error should be considered fatal.
_<PREFIX>FEED_ME
Returned by <prefix>scan()
, <prefix>parse()
and the internal <prefix>lex()
functions to indicate more input is needed (whether that be a string buffer in the case of <prefix>scan()
and <prefix>lex()
or the next terminal symbol in the case of <prefix>parse()
.)
_<PREFIX>FEED_ME
can only be returned if the input currently being processed is not final. For <prefix>scan()
and <prefix>lex()
this means the last call to <prefix>set_input()
should not have its fourth argument, is_final_input, set. For <prefix>parse()
it means it should not have received the <PREFIX>INPUT_END
symbol as its current input.
_<PREFIX>END_OF_INPUT
Returned by the internal <prefix>lex()
function to indicate the end of the input has been reached and no more lexical matches can be made. This is internally processed by the <prefix>scan()
function. It is not returned by <prefix>scan()
or <prefix>parse()
and so caller code does not typically check for it.
_<PREFIX>SYNTAX_ERROR
Returned by <prefix>scan()
and <prefix>parse()
when the current input symbol does not match the grammar of the language. Upon re-entry, the parser will be in error recovery mode, if such a recovery is possible, or will have discarded the input symbol otherwise.
When _<PREFIX>SYNTAX_ERROR
is returned from <prefix>scan()
, the current symbol is the faulty one, and the caller may use the location functions (<prefix>line()
, <prefix>column()
, <prefix>offset()
, <prefix>endline()
, <prefix>endcolumn()
and <prefix>endoffset()
) and access the token text itself (<prefix>text()
, <prefix>len()
) to issue an error message to the end user when this error code is returned.
Note that _<PREFIX>SYNTAX_ERROR
is an error discovered by the parser. The next error code represents errors discovered by the scanner.
If _<PREFIX>SYNTAX_ERROR
is returned, the parser will enter error recovery mode. Only after the parser has recovered, and three more symbols have been processed without error, will another _<PREFIX>SYNTAX_ERROR
be returned. This behavior does not have any effect on error productions; the error recovery process continues to trigger in the background irrespective of whether _<PREFIX>SYNTAX_ERROR
was returned.
_<PREFIX>LEXICAL_ERROR
Returned by <prefix>scan()
and the internal function <prefix>lex()
when the current input character mismatches any of the patterns in the input language. This currently occurs on a character-by-character basis; consequently, a lot of such errors may be returned consecutively if the input is corrupt or not resembling the language. Callers may wish to group such errors into longer strings of errors, Carburetta does not currently provide such a feature.
Upon reentry, the faulty character is discarded, and input processing is continued anew.
When _<PREFIX>LEXICAL_ERROR
is returned from <prefix>scan()
, the caller may use the location functions (<prefix>line()
, <prefix>column()
, <prefix>offset()
, <prefix>endline()
, <prefix>endcolumn()
and <prefix>endoffset()
) and access the character text itself (<prefix>text()
, <prefix>len()
) to issue an error message to the end user when this error code is returned. Given that the contents of the invalid character can be anything, care must be taken to ensure such a message is printable.
_<PREFIX>INTERNAL_ERROR
Indicates a problem / inconsistency in the parser logic itself. Either there is a bug in Carburetta itself or some form of memory corruption has taken place. May be returned by <prefix>scan()
and <prefix>parse()
.
Functions
This section describes the functions generated for running the parser.
<prefix>stack_init()
Initializes a struct <prefix>stack
for use.
Syntax
Arguments
stack
Pointer to the <prefix>stack
structure to be initialized. All values will be overwritten.
Description
Initializes the struct <prefix>stack
structure so it may be used by the other functions. Note that this function is designed to never fail, and can consequently be easily paired to a corresponding <prefix>stack_cleanup()
.
<prefix>stack_cleanup()
Frees all memory associated with a <prefix>stack
structure and executes all user defined %destructor
logic for any latent user data cleanup.
Syntax
Arguments
stack
Pointer to the <prefix>stack
structure to be cleaned up. After execution, the structure is invalid, its contents undefined.
Description
Use this for cleanup of the stack structure. The generated code is designed such that, if, at any moment, the program needs to abort, the <prefix>stack_cleanup()
function may be safely called to clean up all resources. That said, <prefix>stack_cleanup()
may not be called from within user defined snippet code unless that snippet code then immediately exits the scan()
and/or parse()
function in which it appears.
<prefix>stack_reset()
Resets the stack to a known-good initial state. Memory allocated by Carburetta generated code is retained for efficiency; any latent user-data in the <prefix>stack
structure is deconstructed using user-defined %destructor
logic.
Syntax
Arguments
stack
Pointer to the <prefix>stack
structure to be reset. After execution, the structure is valid for use as if <prefix>stack_cleanup()
and <prefix>stack_init()
were called back to back, except that any input buffer set before the reset using <prefix>set_input()
will continue to be valid.
Return value
Unlike <prefix>stack_init
which will always succeed, <prefix>stack_reset
can fail and may return the following errors: * 0 upon success. * _<PREFIX>OVERFLOW
, an overflow occurred during execution (may want to check if the stack
structure was properly initialized with <prefix>stack_init()
.) * _<PREFIX>NO_MEMORY
, a memory allocation failure, not enough memory or some form of fatal corruption.
Description
Depending on how the language implemented using Carburetta is used, it may be desirable to rapidly process many small bits of input. In that case, for efficiency's sake, it can be desirable to re-use the memory allocated by the scanner and parser, rather than freeing it and allocating it again. For this purpose <prefix>stack_reset()
exists.
Note that <prefix>stack_reset()
internally initializes the stack structure more comprehensively than <prefix>stack_init()
, and, unlike <prefix>stack_init()
, may therefore fail. Callers should check the return value.
<prefix>stack_can_recover()
Checks whether the <prefix>stack
structure is in error recovery mode.
Syntax
Arguments
stack
Pointer to the <prefix>stack
structure that is to be checked.
Return value
Non-zero if the <prefix>stack
structure is in error recovery mode, zero otherwise.
Description
If <prefix>scan()
or <prefix>parse()
returns _<PREFIX>SYNTAX_ERROR
, then the caller may want to know whether there is a chance for recovery. There is a chance for recovery if there exists an error production that can be reached by transitioning an error symbol from any of the states currently on the stack. When there is a chance for recovery, the stack enters recovery mode, and will discard input symbols until a symbol matches after an error symbol. In this case, continuing to process input is a good idea as more errors may be discovered for the user.
If the stack does not enter recovery mode, the parser will continue to discard input symbols until a symbol matches the current input. This is very unlikely to lead to recovery, and very likely to generate a lot more errors that are not useful to the end user. In this case, aborting processing and returning the error to the user is a good idea, as it helps the user pinpoint the original problem.
The <prefix>stack_can_recover()
function helps the caller make this determination; if it returns non-zero, there is sense in continuing to process after receiving _<PREFIX>SYNTAX_ERROR
, if it returns zero, it is probably best to abort.
<prefix>set_mode()
Sets the current scanning mode, determining which patterns may match.
<prefix>mode()
Returns the current scanning mode.
Syntax
Arguments
stack
Pointer to the <prefix>stack
structure that is to be checked.
mode
One of the generated M_<PREFIX>XXX
constants (where XXX
is the uppercased name of the mode as specified in the %mode
directive, or DEFAULT
for the default mode.)
Description
Changes the mode of the scanner. The mode will determine which patterns may match based on the mode specifications for the patterns.
The mode of a scanner can only be changed at the start of a new pattern to be matched; consequently, a call to <prefix>set_mode()
takes effect when:
- a pattern has been matched and scanning has not yet started on the next match.
- a call to <prefix>stack_reset()
is performed (which implies scanning will start anew with the new mode.)
This means calling <prefix>set_mode()
from a pattern action will have the desired effect: the next match will obey the new mode. However, calling <prefix>set_mode()
from a situation where the scanning was temporarily interrupted (e.g. _<PREFIX>FEED_ME
is returned) will lead to unpredictable results. Calling <prefix>set_mode()
immediately before or after <prefix>stack_reset()
will have the desired effect of starting the scanner on a new input in a particular mode.
Note that calling <prefix>set_mode()
from a production’s action is valid and leads to predictable results; however, keep in mind that a production will only reduce (and have its action executed) when a lookahead symbol that is not part of the production triggers this reduction. The mode will be set for the pattern one additional match beyond the production:
%grammar
%mode ALT
%token ONE TWO THREE
%nt grammar mode-switch
grammar: ONE mode-switch TWO THREE;
mode-switch: { carburetta_set_mode(M_CARBURETTA_ALT); }
The action for the mode-switch
production is executed when the token TWO
has already been scanned - it is using the lookahead of TWO
that the parser determines the action for mode-switch
should execute. Consequently, the mode change in this action carburetta_set_mode(M_CARBURETTA_ALT)
sets the scanner to the ALT
mode specified in the %mode
directive, but this will only take effect on the matching of the THREE
token that follows it.
<prefix>stack_accepts()
Checks whether the symbol (defined by its ordinal value) can currently be accepted as input.
Syntax
Arguments
stack
Pointer to the <prefix>stack
structure against which sym is to be checked.
sym
The ordinal value (one of the defined constants based on prior %nt
and %token
directives) of a symbol to be checked as viable input in the current state.
Return value
Non-zero if the <prefix>stack
structure can currently accept the symbol as input, zero if the symbol is currently not acceptable input.
Description
Some languages have ambiguities for which symbol table lookups are required to resolve them. The <prefix>stack_accepts()
function allows an implementation for such a language to determine if the parser can currently accept a symbol and a symbol table lookup might make sense.
For instance, in the language C, when an identifier is about to be processed, it is important to distinguish an existing identifier representing a typedef
from a not yet existing identifier. Take the declaration const uintptr_t x;
as an example, const
is a keyword that, by itself, would suffice to declare the type specifier of a variable declaration. Whether uintptr_t
is part of the type specifier (and const
acts as a modifier on the uintptr_t
) or whether a variable with the name uintptr_t
is to be declared, depends on whether the uintptr_t
exists as a typename in the symbol tables. Such a compiler would need to 1) know whether a typedef-name is accepted, 2) check whether the current identifier is a type in the symbol tables. The <prefix>stack_accepts()
function is intended to help with step 1) without having to deep-dive in the internals of how the LR parser works.
<prefix>set_input()
Sets the buffer where text input is sourced from.
Syntax
Arguments
stack
Pointer to the <prefix>stack
structure for which the input buffer is to be set.
s
Pointer to the start of a buffer, len bytes long, where the next sequence of input is for the parser. The pointer will be copied, not the value pointed to. The buffer pointed to must therefore remain valid beyond this call, until the end of input is reached or until the next call to <prefix>set_input()
.
len
The length, in bytes, of the buffer at s.
Description
To pass input to the scanner in a flexible manner, the caller must supply a buffer where the input can be found. To avoid the inefficiencies of a copy, a pointer to the buffer is maintained inside the stack, the buffer is not ‘deep-copied’ into an internal buffer until the moment where it is actually processed as part of a potential match.
The following would therefore be trouble:
In the above code, the lifetime of the const char *
returned by std::string("uh-oh").c_str()
expires at the end of the statement, the code therefore instructs Carburetta to maintain a live reference to a freed buffer; which will crash.
The above would work, provided the variable s
is not modified and maintained until either the next invocation to <prefix>set_input()
or until end of input.
<prefix>set_location()
Set the source code location of the next terminal matched by the scanner.
Syntax
Arguments
stack
Pointer to the <prefix>stack
structure for which the source code location of the next terminal symbol is to be set.
line
The line number of the next terminal symbol. The line number is initially 1-based.
col
The column number of the next terminal symbol. Column numbers are 1-based and reset to 1 on every newline.
offset
Byte offset of the next terminal symbol. This is typically 0-based from the start of input.
Description
Sets the start location of the next terminal symbol (the next match of the scanner.) The end location will be moved up accordingly.
After calling <prefix>set_location()
, the return values of <prefix>line()
, <prefix>column()
, <prefix>offset()
, <prefix>endline()
, <prefix>endcolumn()
, and <prefix>endoffset()
are undefined. Only when the <prefix>scan()
(or the internal function <prefix>lex()
) return a match do these functions return the location set. Subsequent matches will progress the location as usual.
<prefix>text()
Returns the pointer to the null terminated string of the current match or error.
Syntax
Arguments
stack
Pointer to the <prefix>stack
for which the currently matched text is to be returned.
Return value
Returns a const pointer to a null terminated string containing the text currently matched. The pointer remains valid until reentry into the <prefix>scan()
or <prefix>lex()
functions, or, if it is called from a snippet within the <prefix>scan()
function, it is valid until execution leaves the snippet.
Description
<prefix>text()
matches the behavior of the $text
special identifier, but is accessible for code outside the pattern actions.
In the following situations, <prefix>text()
may be called:
- After
_<PREFIX>MATCH
is returned by the internal<prefix>lex()
function. - After
_<PREFIX>SYNTAX_ERROR
is returned by<prefix>scan()
. - After
_<PREFIX>LEXICAL_ERROR
is returned by<prefix>scan()
or<prefix>lex()
. (Note that the match here is a single character.) - If, during a scanner pattern action, execution is returned to the caller.
While it is also possible to call <prefix>text()
from within production actions, doing so may lead to unexpected results as a reduction of a production (leading to the execution of its actions) occurs based on the lookahead (the first symbol beyond the production) and not any symbol in the production itself. It is strongly recommended not to use it in these cases as this lookahead need not have any valid text but may instead be an end-of-input terminal.
<prefix>len()
Returns the length of the string returned by <prefix>text()
.
Syntax
Arguments
stack
Pointer to the <prefix>stack
for which the length of currently matched text is to be returned.
Return value
Returns the length of the null-terminated string returned by <prefix>text()
; it excludes the null-terminator, so <prefix>text(stack)[<prefix>len(stack)]
will return '\0'
.
Description
Returns the length of the string returned by <prefix>text()
. Although this string is null-terminated, the match may include null characters, and the scanner already has this information available for its own purposes. Therefore, for consistency and efficiency reasons, <prefix>len()
is available.
It is subject to the same constraints as <prefix>text()
, please see its description for when use of <prefix>len()
is valid.
<prefix>line()
Returns the line number of the first character of the current match.
Syntax
Arguments
stack
Pointer to the <prefix>stack
for which the line number of the current match is to be returned.
Return value
The 1-based line number of the start of the current match.
Description
Returns the line number for the start of the current match. If the current match starts with a newline '\n'
character, then the line number returned excludes the increment caused by that newline.
A line number for any source starts at line 1, or at any number set by <prefix>set_location()
.
It is subject to the same constraints as <prefix>text()
, please see its description for when use of <prefix>line()
is valid.
<prefix>column()
Returns the column number for the start of the current match.
Syntax
Arguments
stack
Pointer to the <prefix>stack
for which the column number of the first character of the current match is to be returned.
Return value
The 1-based column number of the start of the current match.
Description
Returns the column number of the first character of the current match. If the current match starts with a newline '\n'
character, then the column returned equals the length, plus one, of the line that is terminated by that newline character.
The first column of any line is column 1.
It is subject to the same constraints as <prefix>text()
, please see its description for when use of <prefix>column()
is valid.
<prefix>offset()
Returns the byte offset of the start of the current match.
Syntax
Arguments
stack
Pointer to the <prefix>stack
for which the byte offset of the first character of the current match is to be returned.
Return value
The byte offset value of the start of the current match.
Description
Returns the byte offset of the start of the current match. The byte offset initially starts at 0, but may be changed by <prefix>set_location()
to become any value.
It is subject to the same constraints as <prefix>text()
, please see its description for when use of <prefix>offset()
is valid.
The intended goal of maintaining the byte offset is to be able to calculate the offset of the start of the line that a particular symbol appears on: line-start-offset: symbol-offset - symbol-column + 1
. This is useful when designing a “pretty” error message printer that reproduces the entire line that the erroneous symbol is on, and underlines it (or highlights it,) saving the effort of having to search backward through the input, or retaining all lines of the input.
<prefix>endline()
Returns the line number of the first character after the current match.
Syntax
Arguments
stack
Pointer to the <prefix>stack
for which the line number of the first character after the current match is to be returned.
Return value
The 1-based line number of the first character after the current match.
Description
The counterpart to <prefix>line()
this function returns the line number after the match. For example, if the current match consists of a single newline, this function returns <prefix>line() + 1
.
The line number initially starts at 1 but may be set to any value by <prefix>set_location()
.
It is subject to the same constraints as <prefix>text()
, please see its description for when use of <prefix>endline()
is valid.
<prefix>endcolumn()
Returns the column number of the first character after the current match.
Syntax
Arguments
stack
Pointer to the <prefix>stack
for which the column number of the first character after the current match is to be returned.
Return value
The 1-based column number of the first character after the current match.
Description
The counterpart to <prefix>column()
this function returns the column number of the first character after the match. For example, if the current match consists of a single newline, this function returns 1
(because the newline will have reset the column value to 1.)
It is subject to the same constraints as <prefix>text()
, please see its description for when use of <prefix>endcolumn()
is valid. Please see the discussion at <prefix>offset()
for how this function might be used together with <prefix>endoffset()
to reproduce the entire line the current match ends at.
<prefix>endoffset()
Returns the byte offset of the first character after the current match.
Syntax
Arguments
stack
Pointer to the <prefix>stack
for which the byte offset of the first character after the current match is to be retrieved.
Return value
The byte offset value of the first character after the current match.
Description
The use of the byte offset after the current match may be for seeking in files to the position where the matched symbol occurs. Please see the description at <prefix>offset()
for how this value may be combined with <prefix>endcolumn()
to find the entire line the matched symbol ends at.
This function is subject to the same constraints as <prefix>text()
, please see its description for when use of <prefix>endoffset()
is valid.
<prefix>token_common_data()
Returns the common data pointer for the current scanner pattern match.
Arguments
stack
Pointer to the <prefix>stack
for which the common data pointer of the current match is to be returned.
Return value
Returns the common data pointer for the current scanner pattern match. Returns NULL
if there is no current token match (e.g. the match does not exist or has already been passed to the parser.) Typically, the common data pointer is NULL
unless we’re in the specific scenario where the parser encounters a syntax error (_<PREFIX>SYNTAX_ERROR
). At that point, <prefix>token_common_data()
will return the common data pointer of the LR lookahead token that caused the syntax error (this is typically the token you wish to report as the cause of the syntax error.)
Note that the type of the value is void *
- this may be inconvenient but ensures that the %common_type
is not unnecessarily exposed in the header.
Description
The <prefix>token_common_data()
function is designed to return the common data pointer associated with the current scanner pattern match. This function becomes useful in scenarios where the parser detects a _<PREFIX>SYNTAX_ERROR
on the next token. In these situations, the next token has already been scanned and its common data constructed, but it has not yet been transferred into the parser stack. To report the error, the common data might be valuable as it can commonly contain the source location of the symbol or token in question. It is, however, difficult to access without knowledge about the scanner’s implementation details.
This function offers an efficient and straightforward method to access the common data of the token where the error occurred. This function returns a pointer to the common data if there is a current token and if a %common_type
(or %common_class
) definition exists. The “current token” refers to the token in its state prior to being shifted onto the LR stack. The function should primarily be relied upon to exist after a _<PREFIX>SYNTAX_ERROR
is returned. Note that even in this case, the function might still return NULL
if there is no common data associated with the token. (This would occur if the syntax error is caused by the end of input, for instance.)
The return type of this function is a void pointer (void*). While this might be slightly inconvenient, it ensures that the common datatype isn’t exposed in any generated header, thereby avoiding potential dependencies that must be met before including the header.
Callers should always check if NULL
is returned by the function. If it is not, they can confidently cast the returned pointer to the appropriate type for %common_type
or %common_class
.
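As a sketch of how the returned pointer is typically consumed (assuming a hypothetical %common_type of struct source_span that records source locations, and a %prefix of example; neither comes from the manual), error reporting code would check for NULL and then cast:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical %common_type: tracks where a symbol came from. */
struct source_span {
  int line;
  int col;
};

/* 'common' is the void * as it would be returned by
   example_token_common_data() after a _EXAMPLE_SYNTAX_ERROR; it may be
   NULL, e.g. when the error is caused by the end of input. */
static int format_syntax_error(void *common, char *buf, size_t bufsize) {
  if (common) {
    /* The pointer, when non-NULL, always points at the %common_type. */
    struct source_span *span = (struct source_span *)common;
    return snprintf(buf, bufsize, "%d:%d: syntax error", span->line, span->col);
  }
  return snprintf(buf, bufsize, "syntax error at end of input");
}
```

In real code the pointer would come from <prefix>token_common_data(); the struct and prefix here are illustrative only.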
<prefix>lex()
This function is likely to change from version to version.
Performs lexical scanning on the current input.
Syntax
int <prefix>lex(struct <prefix>stack *stack);
Arguments
stack
Pointer to the <prefix>stack
structure that is to continue scanning.
Return value
<prefix>lex()
may return one of the following return codes:
- _<PREFIX>MATCH, a match was successfully found and stored as the currently matched symbol in stack.
- _<PREFIX>FEED_ME, more input is needed before a match can be found, caller code should call <prefix>set_input().
- _<PREFIX>END_OF_INPUT, the end of input has been reached, no more matches to be made.
- _<PREFIX>LEXICAL_ERROR, could not match the input to the patterns, the mismatching character is set in the match buffer.
- _<PREFIX>OVERFLOW, an internal overflow occurred, this likely indicates a bug or corruption and should be treated as fatal.
- _<PREFIX>NO_MEMORY, an allocation call to realloc() failed, can be retried if more memory is made available.
Internally, if the result is a match (_<PREFIX>MATCH
), the best_match_action_
field contains the ordinal of the pattern that matched, where each pattern is numbered in order of appearance in the source file starting at 1.
Description
Internal function that performs the lexical scanning of the input. When given input in the <prefix>stack
structure using <prefix>set_input()
, the function will attempt to match that input to the patterns specified by the user.
If no match can be made on the input, then _<PREFIX>LEXICAL_ERROR is returned and the offending character is set as the match. Note that this can take more than one character to discover; for instance, a string that is missing its closing quote might take all the way to the end of the file before the opening '"' is returned as the character causing the lexical error.
If, while evaluating input, the end of input has been reached and a longer match might still be possible, then behavior depends on the is_final_input
flag to <prefix>set_input()
. If is_final_input
was set, then the current best match is returned with _<PREFIX>MATCH
, if one is available, or the aforementioned _<PREFIX>LEXICAL_ERROR
if no current best match is available. If is_final_input
was not set, more input will be requested by returning _<PREFIX>FEED_ME
.
If there is no further input to evaluate beyond the last returned symbol or error, then the is_final_input
flag controls whether we return _<PREFIX>END_OF_INPUT
, or _<PREFIX>FEED_ME
- the latter indicates that more input is required (which may be as little as setting the is_final_input flag without supplying any further bytes.)
Note that no user defined pattern snippet code is executed in the generated <prefix>lex()
function, its role is to match the patterns and return the pattern that matched inside the struct <prefix>stack
structure. You should probably not use this function directly, as it is subject to substantial change from version to version, but understanding it will aid in debugging the generated code.
<prefix>scan()
Performs lexical scanning and parsing on the current input.
Syntax
int <prefix>scan(struct <prefix>stack *stack <… %params …>);
Arguments
stack
Pointer to the <prefix>stack
structure where parsing is to continue.
<… %params …>
Optional arguments defined by the %params
directive.
Return value
The default behavior is for <prefix>scan()
to return one of the following return codes:
- _<PREFIX>FINISH, the scanner and parser are done processing all input; successful or not, this will eventually be returned.
- _<PREFIX>FEED_ME, more input is needed before a match can be found, caller code should call <prefix>set_input().
- _<PREFIX>LEXICAL_ERROR, could not match the input to the patterns, the mismatching character is set in the match buffer.
- _<PREFIX>SYNTAX_ERROR, could not match the currently matched symbol to the grammar.
- _<PREFIX>INTERNAL_ERROR, an internal error occurred, this is either a bug in Carburetta or memory corruption.
- _<PREFIX>OVERFLOW, an internal overflow occurred, this likely indicates a bug or corruption and should be treated as fatal.
- _<PREFIX>NO_MEMORY, an allocation call to realloc() failed, can be retried if more memory is made available.
Description
Main function for scanning and parsing in fused mode. Input in the <prefix>stack
structure, previously set using <prefix>set_input()
is processed, pattern actions are executed, and parsing is performed with its associated production actions.
The default behavior is that the <prefix>scan()
function returns when:
- An error occurs during scanning, _<PREFIX>LEXICAL_ERROR is returned, the current match is the erroneous character, re-entry resumes scanning after the erroneous character.
- An error occurs during parsing, _<PREFIX>SYNTAX_ERROR is returned, the current match is the erroneous symbol, re-entry resumes scanning after the erroneous symbol.
- More input is needed, _<PREFIX>FEED_ME is returned, the caller is expected to call <prefix>set_input() to either pass in more input or set is_final_input to non-zero before re-entry (otherwise _<PREFIX>FEED_ME will be returned again.)
- All input has been processed, _<PREFIX>FINISH is returned, re-entry will reset the parser to start again.
- Any user action code snippet issues a return, the return code is subject to interpretation by the caller; upon re-entry, code resumes as if the action code snippet had completed execution normally. (The exception to this rule is %raii_constructor action code snippets, which are executed again upon re-entry.)
- A memory allocation failure occurs, _<PREFIX>NO_MEMORY is returned, re-entry will retry the failed allocation, which makes sense only if the allocation might now succeed.
- An overflow occurs, _<PREFIX>OVERFLOW is returned, re-entry will once more overflow (the error is unrecoverable.)
The above behavior can be fully customized. While the return type of <prefix>scan() is fixed as int, the value returned can be defined to whatever the calling code needs. For instance, if the target language would like to implement yield-like semantics, where part of the input is processed and some data is returned to the caller, only to later re-enter the context and continue processing, then that is possible. To customize the behavior, override the conditions that lead to exit from the function. Please see the following directives for details on how to override the return behavior of both the <prefix>scan() and <prefix>parse() functions: %on_finish, %on_syntax_error, %on_lexical_error, %on_alloc_error, %on_internal_error, %on_next_token, and %on_feed_me.
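The default return codes suggest a driver loop of the following shape. This is a sketch, not generated code: the example_* stubs and the ordinal values of the _EXAMPLE_* constants are stand-ins (assuming %prefix example) so the sketch is self-contained; a real driver would include Carburetta's generated header instead and report errors more usefully.

```c
#include <stdio.h>

/* Illustrative stand-ins for the generated constants and functions;
   the ordinal values below are made up for the sketch. */
#define _EXAMPLE_FINISH 0
#define _EXAMPLE_FEED_ME 1
#define _EXAMPLE_SYNTAX_ERROR 3
#define _EXAMPLE_LEXICAL_ERROR 4

struct example_stack { int calls_; };
static void example_stack_init(struct example_stack *stack) { stack->calls_ = 0; }
static void example_set_input(struct example_stack *stack, const char *input,
                              size_t len, int is_final) {
  (void)stack; (void)input; (void)len; (void)is_final;
}
static int example_scan(struct example_stack *stack) {
  /* Stub: pretend two buffers of input are consumed, then finish. */
  return (stack->calls_++ < 2) ? _EXAMPLE_FEED_ME : _EXAMPLE_FINISH;
}

/* The driver loop itself: re-enter example_scan() until _EXAMPLE_FINISH. */
static int run(FILE *fp) {
  struct example_stack stack;
  char buf[256];
  example_stack_init(&stack);
  for (;;) {
    int r = example_scan(&stack);
    switch (r) {
      case _EXAMPLE_FINISH:
        return 0; /* all input processed */
      case _EXAMPLE_FEED_ME: {
        /* Pass the next chunk; feof(fp) doubles as is_final_input. */
        size_t len = fread(buf, 1, sizeof(buf), fp);
        example_set_input(&stack, buf, len, feof(fp));
        break;
      }
      case _EXAMPLE_LEXICAL_ERROR:
      case _EXAMPLE_SYNTAX_ERROR:
        /* Report; re-entry resumes after the offending input. */
        fprintf(stderr, "input error (code %d)\n", r);
        break;
      default:
        return r; /* fatal: overflow, no memory, internal error */
    }
  }
}
```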
<prefix>parse()
Performs parsing for the next given symbol.
Syntax
int <prefix>parse(struct <prefix>stack *stack, int sym <… %params …>);
Arguments
stack
Pointer to the <prefix>stack
for which the next symbol is to be parsed.
sym
Ordinal value of the next symbol to be parsed by the <prefix>parse()
function.
<… %params …>
Optional arguments defined by the %params
directive.
Return value
The default behavior is for <prefix>parse()
to return one of the following return codes:
- _<PREFIX>FINISH, the parser has completed the parsing of all input. This may or may not indicate success, depending on previous errors returned.
- _<PREFIX>FEED_ME, the symbol sym has been processed, the parser is ready for the next symbol of input.
- _<PREFIX>SYNTAX_ERROR, could not match the symbol sym to the grammar from the parser’s current state.
- _<PREFIX>INTERNAL_ERROR, an internal error occurred, this is either a bug in Carburetta or memory corruption.
- _<PREFIX>OVERFLOW, an internal overflow occurred, this likely indicates a bug or corruption and should be treated as fatal.
- _<PREFIX>NO_MEMORY, an allocation call to realloc() failed, can be retried if more memory is made available.
Description
Main function for parsing without scanning. The symbol to be parsed next is set in sym
and should be one of the terminals for which Carburetta has generated constants (e.g. "<PREFIX><YOUR_TOKEN_IDENTIFIER>"
where <YOUR_TOKEN_IDENTIFIER>
is the uppercased version of a terminal declared using the %token
directive.)
There are two use-cases in mind for <prefix>parse()
:
- You write your own scanner because, for whatever reason, the scanner capabilities provided by Carburetta are a difficult fit (e.g. you’re parsing Python-style indentation or hitting one of the limitations.)
- You want to use both the
<prefix>parse()
and<prefix>scan()
functions; the parse() function to get into the right context, the scan() function to process actual user input. This is a hybrid mode between scanner and parser.
A little more on that second option. It is difficult to know the state of the scanner from the perspective of the parser, as substantial amounts of lookahead may have been required; consequently, it is difficult to go from a <prefix>stack operated with <prefix>scan() to one operated with <prefix>parse(). The inverse, however, is very possible and in some cases useful.
If your grammar defines special ‘key’ terminals that do not otherwise appear (and match no pattern) then these ‘key’ terminals may be used to let the grammar enter a state where a particular context from the grammar is immediately available. Imagine, for instance, the expression subset of a programming language being immediately available to enable a REPL (Read, Evaluate, Print, Loop) without first having to build out a function.
To do this, first invoke <prefix>parse() with your ‘key’ terminals; special productions at the top level of your grammar then short-circuit to the desired subset based on those ‘key’ terminals. Following this, the regular text may be passed in directly. For this scenario, be careful that the initialization that takes place for patterns also takes place for the ‘key’ terminals. To ensure this happens, consider using the %token_action
directive, which defines actions for any given type that execute after the %constructor
action, but only in the <prefix>parse()
function, not in the <prefix>scan()
function.
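As a sketch of this technique (all names here are hypothetical, not taken from the manual), a grammar might reserve a KEY_EXPR terminal that no pattern ever matches:

```
%prefix example
%token KEY_EXPR   /* matches no pattern; only ever fed via example_parse() */
%nt document expr

document: expr;           /* normal entry: a full document */
document: KEY_EXPR expr;  /* REPL entry: short-circuit to the expression subset */
```

From C, a REPL would first call example_parse(&stack, EXAMPLE_KEY_EXPR, …) to steer the grammar into the expression context, then pass the typed line through example_scan().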
The default behavior for <prefix>parse()
is to return when:
- An error occurs during parsing, _<PREFIX>SYNTAX_ERROR is returned, re-entry should pass in a new symbol in sym, or continue to pass in _<PREFIX>INPUT_END if the end of input was reached.
- The symbol has been processed and another symbol is needed, _<PREFIX>FEED_ME is returned. Re-entry takes a new symbol in sym.
- The current symbol is _<PREFIX>INPUT_END and either it is a valid state to end the grammar, or a _<PREFIX>SYNTAX_ERROR has already been returned before to indicate this is an invalid way to end the grammar; in this case _<PREFIX>FINISH is returned.
- Any user action code snippet issues a return to directly exit the function with an arbitrary user-defined return code. On re-entry, the same symbol in sym should be passed in; execution continues as if the user action code snippet completed normally. (An exception to this behavior is the %raii_constructor action code snippet, which will instead be retried upon re-entry.)
- A memory allocation failure occurs, _<PREFIX>NO_MEMORY is returned, re-entry will retry the failed allocation, which makes sense only if there is some reason to believe allocation might now succeed.
- An overflow occurs, _<PREFIX>OVERFLOW is returned, re-entry will once more overflow (the error is unrecoverable.)
The above behavior can be fully customized. While the return type of <prefix>parse() is fixed as int, the value returned can be defined to whatever the calling code needs. For instance, if the target language would like to implement yield-like semantics, where part of the input is processed and some data is returned to the caller, only to later re-enter the context and continue processing, then that is possible. To customize the behavior, override the conditions that lead to exit from the function. Please see the following directives for details on how to override the return behavior of both the <prefix>scan() and <prefix>parse() functions: %on_finish, %on_syntax_error, %on_lexical_error, %on_alloc_error, %on_internal_error, %on_next_token, and %on_feed_me.
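Under the default return codes, driving <prefix>parse() from a hand-written lexer can be sketched as follows. The example_* stubs and the _EXAMPLE_* ordinal values are stand-ins (assuming %prefix example) so the sketch is self-contained; real code would use the generated functions and constants, and EXAMPLE_INPUT_END is a hypothetical name for the end-of-input ordinal.

```c
#include <stddef.h>

/* Illustrative stand-ins; the ordinal values below are made up. */
#define _EXAMPLE_FINISH 0
#define _EXAMPLE_FEED_ME 1
#define _EXAMPLE_SYNTAX_ERROR 3
#define EXAMPLE_INPUT_END 0  /* hypothetical end-of-input symbol ordinal */

struct example_stack { int dummy_; };
static void example_stack_init(struct example_stack *s) { s->dummy_ = 0; }
static int example_parse(struct example_stack *s, int sym) {
  (void)s;
  /* Stub: accept every symbol, finish at end of input. */
  return (sym == EXAMPLE_INPUT_END) ? _EXAMPLE_FINISH : _EXAMPLE_FEED_ME;
}

/* Drive example_parse() from a hand-written lexer's output; here an
   array of terminal ordinals stands in for that lexer. */
static int parse_all(const int *syms, size_t n) {
  struct example_stack stack;
  size_t i = 0;
  example_stack_init(&stack);
  for (;;) {
    int sym = (i < n) ? syms[i] : EXAMPLE_INPUT_END;
    int r = example_parse(&stack, sym);
    if (r == _EXAMPLE_FEED_ME) { ++i; continue; }      /* feed the next symbol */
    if (r == _EXAMPLE_SYNTAX_ERROR) { ++i; continue; } /* resume past the bad symbol */
    return r;  /* _EXAMPLE_FINISH, or a fatal code */
  }
}
```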
Customizing <prefix>parse() and <prefix>scan()
Carburetta aims to be flexible in how it is used and in the code it generates. Consequently, there are ways to customize the <prefix>scan() and <prefix>parse() algorithms that can fundamentally change the way parsing operates, from control inversion to the function prototypes.
%locals
The %locals
directive defines variables for local declaration. The text passed to it is copied verbatim into the source and so should terminate with a semicolon:
%locals int x;
The above code declares the local variable x. The purpose for %locals
is to prevent repetitive declarations; for instance, if a group of productions all need to capture and check a return variable, then they all need to declare such a variable; doing this in %locals
prevents repetition of such a declaration and makes the code more expressive.
Care must be taken to only use the local variables within their limitations, in particular:
- %locals are currently only accessible from actions associated with productions; you cannot access them from patterns, as they are not in scope there.
- %locals have a short lifetime of a single production’s execution. If, and only if, none of the production’s action code snippets returns, then the locals remain in scope during the execution of all of the production’s action code snippets; only then can the values be relied upon to be pertinent.
- %locals cannot be used to share state between productions, as they go out of scope between productions and their stack memory may be used for other purposes by the C compiler.
- %locals are local to the function; they are not shared between <prefix>scan() and <prefix>parse(), consequently any static declarations will not share the same memory. For instance, in %locals static char buf[50];, the buf variable declared in the <prefix>scan() function will be distinct from the buf variable declared in the <prefix>parse() function, and there is no way for the production’s action code snippet to discover which function it is executing in.
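A hypothetical sketch of the repeated-declaration case (emit_a() and emit_b() are user functions, not part of Carburetta): with %locals, each production's action can check a shared error variable without declaring it itself.

```
%locals int err;

stmt: call-a { err = emit_a(); if (err) return err; }
stmt: call-b { err = emit_b(); if (err) return err; }
```

Per the lifetime rules above, if either snippet returns, the value of err cannot be relied upon after re-entry.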
%params
The %params
directive defines additional parameters for both <prefix>scan()
and <prefix>parse()
. If it is present, a comma will be prepended to glue the additional parameters to the preceding stack parameter.
%prefix example
%params context_ptr_t context, int value
Translates into the following declarations for <prefix>scan()
and <prefix>parse()
:
int example_scan(struct example_stack *stack, context_ptr_t context, int value);
int example_parse(struct example_stack *stack, int sym, context_ptr_t context, int value);
Important: note that these declarations make their way into any header file generated. Consequently, in the example above, the declaration for context_ptr_t
must be accessible from the header. Declaring such types in the prologue makes the generated C/C++ file work, but the header does not get a copy of the declarations in the prologue, and may be included from a different context. A solution is to ensure that any dependencies are in a separate file, and that any header file generated is only included after these dependencies are included.
%on_finish
When the parser has reached the end of input, and either the grammar has completed or no error recovery can be made, the specified action code snippet is executed.
If the directive is not specified, the default behavior is to return _<PREFIX>FINISH
(a constant which equals 0.)
%on_syntax_error
When the parser cannot match the next input symbol to the current state and grammar of the parser, the specified action code snippet is executed, but only if no such error has recently occurred (within the last 3 symbols shifted.)
If the directive is not specified, the default behavior is to return _<PREFIX>SYNTAX_ERROR
.
%on_lexical_error
When the scanner cannot find a matching pattern for the current input, the specified action code snippet is executed.
If the directive is not specified, the default behavior is to return _<PREFIX>LEXICAL_ERROR
.
%on_alloc_error
When the scanner or parser fail to allocate memory, the specified action code snippet is executed.
If the directive is not specified, the default behavior is to return _<PREFIX>NO_MEMORY
.
%on_internal_error
When the parser discovers an internal inconsistency in its implementation, or an unexpected overflow occurs, the specified action code snippet is executed.
If the directive is not specified, the default behavior is to return _<PREFIX>INTERNAL_ERROR
in case of an inconsistency, and _<PREFIX>OVERFLOW
in case of an overflow.
%on_next_token
When the parser (not the scanner) needs a new input terminal symbol, the specified action code snippet is executed.
If the directive is not specified, the default behavior is to return _<PREFIX>FEED_ME
.
A valid use-case for %on_next_token is to feed the parser a new symbol by storing its ordinal value in sym without returning to the caller:
%on_next_token sym = next_token();
Where next_token() is a user-defined function that retrieves the next token and returns it.
%on_feed_me
When the scanner (not the parser) needs new input bytes, the specified action code snippet is executed.
If the directive is not specified, the default behavior is to return _<PREFIX>FEED_ME
.
A valid use-case for %on_feed_me
is to feed the scanner with new input bytes by calling <prefix>set_input() without returning to the caller:
%params FILE *fp
%locals static char s[300];
%prefix example
%on_feed_me \
size_t len = fread(s, 1, sizeof(s), fp); \
if (ferror(fp)) { \
perror("failed to read"); \
/* -1 would have to be recognized by
** caller. */ \
return -1; \
} \
example_set_input(stack, s, len, feof(fp));
Each time more input is needed, the above code reads the next chunk from the file pointer fp, until error or end of file, and sets the next stretch of input accordingly.