qzdoom/tools/re2c/re2c.1

598 lines
23 KiB
Groff

./"
./" $Id: re2c.1.in 663 2007-04-01 11:22:15Z helly $
./"
.TH RE2C 1 "22 April 2005" "Version 0.12.3"
.ds re \fBre2c\fP
.ds le \fBlex\fP
.ds rx regular expression
.ds lx \fIl\fP-expression
.SH NAME
re2c \- convert regular expressions to C/C++
.SH SYNOPSIS
\*(re [\fB-bdefghisuvVw1\fP] [\fB-o output\fP] file\fP
.SH DESCRIPTION
\*(re is a preprocessor that generates C-based recognizers from regular
expressions.
The input to \*(re consists of C/C++ source interleaved with
comments of the form \fC/*!re2c\fP ... \fC*/\fP which contain
scanner specifications.
In the output these comments are replaced with code that, when
executed, will find the next input token and then execute
some user-supplied token-specific code.
For example, given the following code
.in +3
.nf
char *scan(char *p)
{
/*!re2c
re2c:define:YYCTYPE = "unsigned char";
re2c:define:YYCURSOR = p;
re2c:yyfill:enable = 0;
re2c:yych:conversion = 1;
re2c:indent:top = 1;
[0-9]+ {return p;}
[\000-\377] {return (char*)0;}
*/
}
.fi
.in -3
\*(re -is will generate
.in +3
.nf
/* Generated by re2c on Sat Apr 16 11:40:58 1994 */
char *scan(char *p)
{
{
unsigned char yych;
yych = (unsigned char)*p;
if(yych <= '/') goto yy4;
if(yych >= ':') goto yy4;
++p;
yych = (unsigned char)*p;
goto yy7;
yy3:
{return p;}
yy4:
++p;
yych = (unsigned char)*p;
{return char*)0;}
yy6:
++p;
yych = (unsigned char)*p;
yy7:
if(yych <= '/') goto yy3;
if(yych <= '9') goto yy6;
goto yy3;
}
}
.fi
.in -3
You can place one \fC/*!max:re2c */\fP comment that will output a "#define
\fCYYMAXFILL\fP <n>" line that holds the maximum number of characters
required to parse the input. That is the maximum value \fCYYFILL\fP(n)
will receive. If -1 is in effect then YYMAXFILL can only be triggered once
after the last \fC/*!re2c */\fP.
You can also use \fC/*!ignore:re2c */\fP blocks that allows to document the
scanner code and will not be part of the output.
.SH OPTIONS
\*(re provides the following options:
.TP
\fB-?\fP
\fB-h\fP
Invoke a short help.
.TP
\fB-b\fP
Implies \fB-s\fP. Use bit vectors as well in the attempt to coax better
code out of the compiler. Most useful for specifications with more than a
few keywords (e.g. for most programming languages).
.TP
\fB-d\fP
Creates a parser that dumps information about the current position and in
which state the parser is while parsing the input. This is useful to debug
parser issues and states. If you use this switch you need to define a macro
\fIYYDEBUG\fP that is called like a function with two parameters:
\fIvoid YYDEBUG(int state, char current)\fP. The first parameter receives the
state or -1 and the second parameter receives the input at the current cursor.
.TP
\fB-e\fP
Cross-compile from an ASCII platform to an EBCDIC one.
.TP
\fB-f\fP
Generate a scanner with support for storable state.
For details see below at \fBSCANNER WITH STORABLE STATES\fP.
.TP
\fB-g\fP
Generate a scanner that utilizes GCC's computed goto feature. That is \*(re
generates jump tables whenever a decision is of a certain complexity (e.g. a
lot of if conditions are otherwise necessary). This is only useable with GCC
and produces output that cannot be compiled with any other compiler. Note that
this implies -b and that the complexity threshold can be configured using the
inplace configuration "cgoto:threshold".
.TP
\fB-i\fP
Do not output #line information. This is usefull when you want use a CMS tool
with the \*(re output which you might want if you do not require your users to
have \*(re themselves when building from your source.
\fB-o output\fP
Specify the output file.
.TP
\fB-s\fP
Generate nested \fCif\fPs for some \fCswitch\fPes. Many compilers need this
assist to generate better code.
.TP
\fB-u\fP
Generate a parser that supports Unicode chars (UTF-32). This means the
generated code can deal with any valid Unicode character up to 0x10FFFF. When
UTF-8 or UTF-16 needs to be supported you need to convert the incoming stream
to UTF-32 upon input yourself.
.TP
\fB-v\fP
Show version information.
.TP
\fB-V\fP
Show the version as a number XXYYZZ.
.TP
\fB-w\fP
Create a parser that supports wide chars (UCS-2). This implies \fB-s\fP and
cannot be used together with \fB-e\fP switch.
.TP
\fB-1\fP
Force single pass generation, this cannot be combined with -f and disables
YYMAXFILL generation prior to last \*(re block.
.TP
\fb--no-generation-date\fP
Suppress date output in the generated output so that it only shows the re2c
version.
.SH "INTERFACE CODE"
Unlike other scanner generators, \*(re does not generate complete scanners:
the user must supply some interface code.
In particular, the user must define the following macros or use the
corresponding inplace configurations:
.TP
\fCYYCTYPE\fP
Type used to hold an input symbol.
Usually \fCchar\fP or \fCunsigned char\fP.
.TP
\fCYYCURSOR\fP
\*(lx of type \fC*YYCTYPE\fP that points to the current input symbol.
The generated code advances \fCYYCURSOR\fP as symbols are matched.
On entry, \fCYYCURSOR\fP is assumed to point to the first character of the
current token. On exit, \fCYYCURSOR\fP will point to the first character of
the following token.
.TP
\fCYYLIMIT\fP
Expression of type \fC*YYCTYPE\fP that marks the end of the buffer
(\fCYYLIMIT[-1]\fP is the last character in the buffer).
The generated code repeatedly compares \fCYYCURSOR\fP to \fCYYLIMIT\fP
to determine when the buffer needs (re)filling.
.TP
\fCYYMARKER\fP
\*(lx of type \fC*YYCTYPE\fP.
The generated code saves backtracking information in \fCYYMARKER\fP. Some easy
scanners might not use this.
.TP
\fCYYCTXMARKER\fP
\*(lx of type \fC*YYCTYPE\fP.
The generated code saves trailing context backtracking information in \fCYYCTXMARKER\fP.
The user only needs to define this macro if a scanner specification uses trailing
context in one or more of its regular expressions.
.TP
\fCYYFILL\fP(\fIn\fP\fC\fP)
The generated code "calls" \fCYYFILL\fP(n) when the buffer needs
(re)filling: at least \fIn\fP additional characters should
be provided. \fCYYFILL\fP(n) should adjust \fCYYCURSOR\fP, \fCYYLIMIT\fP,
\fCYYMARKER\fP and \fCYYCTXMARKER\fP as needed. Note that for typical
programming languages \fIn\fP will be the length of the longest keyword plus one.
The user can place a comment of the form \fC/*!max:re2c */\fP once to insert
a \fCYYMAXFILL\fP(n) definition that is set to the maximum length value. If -1
switch is used then \fCYYMAXFILL\fP can be triggered only once after the
last \fC/*!re2c */\fP
block.
.TP
\fCYYGETSTATE\fP()
The user only needs to define this macro if the \fB-f\fP flag was specified.
In that case, the generated code "calls" \fCYYGETSTATE\fP() at the very beginning
of the scanner in order to obtain the saved state. \fCYYGETSTATE\fP() must return a signed
integer. The value must be either -1, indicating that the scanner is entered for the
first time, or a value previously saved by \fCYYSETSTATE\fP(s). In the second case, the
scanner will resume operations right after where the last \fCYYFILL\fP(n) was called.
.TP
\fCYYSETSTATE(\fP\fIs\fP\fC)\fP
The user only needs to define this macro if the \fB-f\fP flag was specified.
In that case, the generated code "calls" \fCYYSETSTATE\fP just before calling
\fCYYFILL\fP(n). The parameter to \fCYYSETSTATE\fP is a signed integer that uniquely
identifies the specific instance of \fCYYFILL\fP(n) that is about to be called.
Should the user wish to save the state of the scanner and have \fCYYFILL\fP(n) return
to the caller, all he has to do is store that unique identifer in a variable.
Later, when the scannered is called again, it will call \fCYYGETSTATE()\fP and
resume execution right where it left off. The generated code will contain
both \fCYYSETSTATE\fP(s) and \fCYYGETSTATE\fP even if \fCYYFILL\fP(n) is being
disabled.
.TP
\fCYYDEBUG(\fP\fIstate\fP,\fIcurrent\fC)\fP
This is only needed if the \fB-d\fP flag was specified. It allows to easily debug
the generated parser by calling a user defined function for every state. The function
should have the following signature: \fIvoid YYDEBUG(int state, char current)\fP.
The first parameter receives the state or -1 and the second parameter receives the
input at the current cursor.
.TP
\fCYYMAXFILL
This will be automatically defined by \fC/*!max:re2c */\fP blocks as explained above.
.SH "SCANNER WITH STORABLE STATES"
When the \fB-f\fP flag is specified, \*(re generates a scanner that
can store its current state, return to the caller, and later resume
operations exactly where it left off.
The default operation of \*(re is a "pull" model, where the scanner asks
for extra input whenever it needs it. However, this mode of operation
assumes that the scanner is the "owner" the parsing loop, and that may
not always be convenient.
Typically, if there is a preprocessor ahead of the scanner in the stream,
or for that matter any other procedural source of data, the scanner cannot
"ask" for more data unless both scanner and source live in a separate threads.
The \fB-f\fP flag is useful for just this situation : it lets users design
scanners that work in a "push" model, i.e. where data is fed to the scanner
chunk by chunk. When the scanner runs out of data to consume, it just stores
its state, and return to the caller. When more input data is fed to the scanner,
it resumes operations exactly where it left off.
When using the -f option \*(re does not accept stdin because it has to do the
full generation process twice which means it has to read the input twice. That
means \*(re would fail in case it cannot open the input twice or reading the
input for the first time influences the second read attempt.
Changes needed compared to the "pull" model.
1. User has to supply macros YYSETSTATE() and YYGETSTATE(state)
2. The \fB-f\fP option inhibits declaration of \fIyych\fP and
\fIyyaccept\fP. So the user has to declare these. Also the user has
to save and restore these. In the example \fIexamples/push.re\fP these
are declared as fields of the (C++) class of which the scanner is a
method, so they do not need to be saved/restored explicitly. For C
they could e.g. be made macros that select fields from a structure
passed in as parameter. Alternatively, they could be declared as local
variables, saved with YYFILL(n) when it decides to return and restored
at entry to the function. Also, it could be more efficient to save the
state from YYFILL(n) because YYSETSTATE(state) is called
unconditionally. YYFILL(n) however does not get \fIstate\fP as
parameter, so we would have to store state in a local variable by
YYSETSTATE(state).
3. Modify YYFILL(n) to return (from the function calling it) if more
input is needed.
4. Modify caller to recognise "more input is needed" and respond
appropriately.
5. The generated code will contain a switch block that is used to restores
the last state by jumping behind the corrspoding YYFILL(n) call. This code is
automatically generated in the epilog of the first "\fC/*!re2c */\fP" block.
It is possible to trigger generation of the YYGETSTATE() block earlier by
placing a "\fC/*!getstate:re2c */\fP" comment. This is especially useful when
the scanner code should be wrapped inside a loop.
Please see examples/push.re for push-model scanner. The generated code can be
tweaked using inplace configurations "\fBstate:abort\fP" and "\fBstate:nextlabel\fP".
.SH "SCANNER SPECIFICATIONS"
Each scanner specification consists of a set of \fIrules\fP, \fInamed
definitions\fP and \fIconfigurations\fP.
.LP
\fIRules\fP consist of a regular expression along with a block of C/C++ code that
is to be executed when the associated \fIregular expression\fP is matched.
.P
.RS
\fIregular expression\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
.RE
.LP
Named definitions are of the form:
.P
.RS
\fIname\fP \fC=\fP \fIregular expression\fP\fC;\fP
.RE
.LP
Configurations look like named definitions whose names start
with "\fBre2c:\fP":
.P
.RS
\fCre2c:\fP\fIname\fP \fC=\fP \fIvalue\fP\fC;\fP
.RE
.RS
\fCre2c:\fP\fIname\fP \fC=\fP \fB"\fP\fIvalue\fP\fB"\fP\fC;\fP
.RE
.SH "SUMMARY OF RE2C REGULAR EXPRESSIONS"
.TP
\fC"foo"\fP
the literal string \fCfoo\fP.
ANSI-C escape sequences can be used.
.TP
\fC'foo'\fP
the literal string \fCfoo\fP (characters [a-zA-Z] treated case-insensitive).
ANSI-C escape sequences can be used.
.TP
\fC[xyz]\fP
a "character class"; in this case,
the \*(rx matches either an '\fCx\fP', a '\fCy\fP', or a '\fCz\fP'.
.TP
\fC[abj-oZ]\fP
a "character class" with a range in it;
matches an '\fCa\fP', a '\fCb\fP', any letter from '\fCj\fP' through '\fCo\fP',
or a '\fCZ\fP'.
.TP
\fC[^\fIclass\fP\fC]\fP
an inverted "character class".
.TP
\fIr\fP\fC\e\fP\fIs\fP
match any \fIr\fP which isn't an \fIs\fP. \fIr\fP and \fIs\fP must be regular expressions
which can be expressed as character classes.
.TP
\fIr\fP\fC*\fP
zero or more \fIr\fP's, where \fIr\fP is any regular expression
.TP
\fC\fIr\fP\fC+\fP
one or more \fIr\fP's
.TP
\fC\fIr\fP\fC?\fP
zero or one \fIr\fP's (that is, "an optional \fIr\fP")
.TP
name
the expansion of the "named definition" (see above)
.TP
\fC(\fP\fIr\fP\fC)\fP
an \fIr\fP; parentheses are used to override precedence
(see below)
.TP
\fIrs\fP
an \fIr\fP followed by an \fIs\fP ("concatenation")
.TP
\fIr\fP\fC|\fP\fIs\fP
either an \fIr\fP or an \fIs\fP
.TP
\fIr\fP\fC/\fP\fIs\fP
an \fIr\fP but only if it is followed by an \fIs\fP. The \fIs\fP is not part of
the matched text. This type of \*(rx is called "trailing context". A trailing
context can only be the end of a rule and not part of a named definition.
.TP
\fIr\fP\fC{\fP\fIn\fP\fC}\fP
matches \fIr\fP exactly \fIn\fP times.
.TP
\fIr\fP\fC{\fP\fIn\fP\fC,}\fP
matches \fIr\fP at least \fIn\fP times.
.TP
\fIr\fP\fC{\fP\fIn\fP\fC,\fP\fIm\fP\fC}\fP
matches \fIr\fP at least \fIn\fP but not more than \fIm\fP times.
.TP
\fC.\fP
match any character except newline (\\n).
.TP
\fIdef\fP
matches named definition as specified by \fIdef\fP.
.LP
Character classes and string literals may contain octoal or hexadecimal
character definitions and the following set of escape sequences (\fB\\n\fP,
\fB\\t\fP, \fB\\v\fP, \fB\\b\fP, \fB\\r\fP, \fB\\f\fP, \fB\\a\fP, \fB\\\\\fP).
An octal character is defined by a backslash followed by its three octal digits
and a hexadecimal character is defined by backslash, a lower cased '\fBx\fP'
and its two hexadecimal digits or a backslash, an upper cased \fBX\fP and its
four hexadecimal digits.
.LP
\*(re further more supports the c/c++ unicode notation. That is a backslash followed
by either a lowercased \fBu\fP and its four hexadecimal digits or an uppercased
\fBU\fP and its eight hexadecimal digits. However only in \fB-u\fP mode the
generated code can deal with any valid Unicode character up to 0x10FFFF.
.LP
Since characters greater \fB\\X00FF\fP are not allowed in non unicode mode, the
only portable "\fBany\fP" rules are \fB(.|"\\n")\fP and \fB[^]\fP.
.LP
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence.
.SH "INPLACE CONFIGURATION"
.LP
It is possible to configure code generation inside \*(re blocks. The following
lists the available configurations:
.TP
\fIre2c:indent:top\fP \fB=\fP 0 \fB;\fP
Specifies the minimum number of indendation to use. Requires a numeric value
greater than or equal zero.
.TP
\fIre2c:indent:string\fP \fB=\fP "\\t" \fB;\fP
Specifies the string to use for indendation. Requires a string that should
contain only whitespace unless you need this for external tools. The easiest
way to specify spaces is to enclude them in single or double quotes. If you do
not want any indendation at all you can simply set this to \fB""\fP.
.TP
\fIre2c:yybm:hex\fP \fB=\fP 0 \fB;\fP
If set to zero then a decimal table is being used else a hexadecimal table
will be generated.
.TP
\fIre2c:yyfill:enable\fP \fB=\fP 1 \fB;\fP
Set this to zero to suppress generation of YYFILL(n). When using this be sure
to verify that the generated scanner does not read behind input. Allowing
this behavior might introduce sever security issues to you programs.
.TP
\fIre2c:yyfill:parameter\fP \fB=\fP 1 \fB;\fP
Allows to suppress parameter passing to \fBYYFILL\fP calls. If set to zero
then no parameter is passed to \fBYYFILL\fP. If set to a non zero value then
\fBYYFILL\fP usage will be followed by the number of requested characters in
braces.
.TP
\fIre2c:startlabel\fP \fB=\fP 0 \fB;\fP
If set to a non zero integer then the start label of the next scanner blocks
will be generated even if not used by the scanner itself. Otherwise the normal
\fByy0\fP like start label is only being generated if needed. If set to a text
value then a label with that text will be generated regardless of whether the
normal start label is being used or not. This setting is being reset to \fB0\fP
after a start label has been generated.
.TP
\fIre2c:labelprefix\fP \fB=\fP yy \fB;\fP
Allows to change the prefix of numbered labels. The default is \fByy\fP and
can be set any string that is a valid label.
.TP
\fIre2c:state:abort\fP \fB=\fP 0 \fB;\fP
When not zero and switch -f is active then the \fCYYGETSTATE\fP block will
contain a default case that aborts and a -1 case is used for initialization.
.TP
\fIre2c:state:nextlabel\fP \fB=\fP 0 \fB;\fP
Used when -f is active to control whether the \fCYYGETSTATE\fP block is
followed by a \fCyyNext:\fP label line. Instead of using \fCyyNext\fP you can
usually also use configuration \fIstartlabel\fP to force a specific start label
or default to \fCyy0\fP as start label. Instead of using a dedicated label it
is often better to separate the YYGETSTATE code from the actual scanner code by
placing a "\fC/*!getstate:re2c */\fP" comment.
.TP
\fIre2c:cgoto:threshold\fP \fB=\fP 9 \fB;\fP
When -g is active this value specifies the complexity threshold that triggers
generation of jump tables rather than using nested if's and decision bitfields.
The threshold is compared against a calculated estimation of if-s needed where
every used bitmap divides the threshold by 2.
.TP
\fIre2c:yych:conversion\fP \fB=\fP 0 \fB;\fP
When the input uses signed characters and \fB-s\fP or \fB-b\fP switches are
in effect re2c allows to automatically convert to the unsigned character type
that is then necessary for its internal single character. When this setting
is zero or an empty string the conversion is disabled. Using a non zero number
the conversion is taken from \fBYYCTYPE\fP. If that is given by an inplace
configuration that value is being used. Otherwise it will be \fB(YYCTYPE)\fP
and changes to that configuration are no longer possible. When this setting is
a string the braces must be specified. Now assuming your input is a \fBchar*\fP
buffer and you are using above mentioned switches you can set \fBYYCTYPE\fP to
\fBunsigned char\fP and this setting to either \fB1\fP or \fB"(unsigned char)"\fP.
.TP
\fIre2c:define:YYCTXMARKER\fP \fB=\fP YYCTXMARKER \fB;\fP
Allows to overwrite the define YYCTXMARKER and thus avoiding it by setting the
value to the actual code needed.
.TP
\fIre2c:define:YYCTYPE\fP \fB=\fP YYCTYPE \fB;\fP
Allows to overwrite the define YYCTYPE and thus avoiding it by setting the
value to the actual code needed.
.TP
\fIre2c:define:YYCURSOR\fP \fB=\fP YYCURSOR \fB;\fP
Allows to overwrite the define YYCURSOR and thus avoiding it by setting the
value to the actual code needed.
.TP
\fIre2c:define:YYDEBUG\fP \fB=\fP YYDEBUG \fB;\fP
Allows to overwrite the define YYDEBUG and thus avoiding it by setting the
value to the actual code needed.
.TP
\fIre2c:define:YYFILL\fP \fB=\fP YYFILL \fB;\fP
Allows to overwrite the define YYFILL and thus avoiding it by setting the
value to the actual code needed.
.TP
\fIre2c:define:YYGETSTATE\fP \fB=\fP YYGETSTATE \fB;\fP
Allows to overwrite the define YYGETSTATE and thus avoiding it by setting the
value to the actual code needed.
.TP
\fIre2c:define:YYLIMIT\fP \fB=\fP YYLIMIT \fB;\fP
Allows to overwrite the define YYLIMIT and thus avoiding it by setting the
value to the actual code needed.
.TP
\fIre2c:define:YYMARKER\fP \fB=\fP YYMARKER \fB;\fP
Allows to overwrite the define YYMARKER and thus avoiding it by setting the
value to the actual code needed.
.TP
\fIre2c:define:YYSETSTATE\fP \fB=\fP YYSETSTATE \fB;\fP
Allows to overwrite the define YYSETSTATE and thus avoiding it by setting the
value to the actual code needed.
.TP
\fIre2c:label:yyFillLabel\fP \fB=\fP yyFillLabel \fB;\fP
Allows to overwrite the name of the label yyFillLabel.
.TP
\fIre2c:label:yyNext\fP \fB=\fP yyNext \fB;\fP
Allows to overwrite the name of the label yyNext.
.TP
\fIre2c:variable:yyaccept\fP \fB=\fP yyaccept \fB;\fP
Allows to overwrite the name of the variable yyaccept.
.TP
\fIre2c:variable:yybm\fP \fB=\fP yybm \fB;\fP
Allows to overwrite the name of the variable yybm.
.TP
\fIre2c:variable:yych\fP \fB=\fP yych \fB;\fP
Allows to overwrite the name of the variable yych.
.TP
\fIre2c:variable:yytarget\fP \fB=\fP yytarget \fB;\fP
Allows to overwrite the name of the variable yytarget.
.SH "UNDERSTANDING RE2C"
.LP
The subdirectory lessons of the \*(re distribution contains a few step by step
lessons to get you started with \*(re. All examples in the lessons subdirectory
can be compiled and actually work.
.SH FEATURES
.LP
\*(re does not provide a default action:
the generated code assumes that the input
will consist of a sequence of tokens.
Typically this can be dealt with by adding a rule such as the one for
unexpected characters in the example above.
.LP
The user must arrange for a sentinel token to appear at the end of input
(and provide a rule for matching it):
\*(re does not provide an \fC<<EOF>>\fP expression.
If the source is from a null-byte terminated string, a
rule matching a null character will suffice. If the source is from a
file then you could pad the input with a newline (or some other character that
cannot appear within another token); upon recognizing such a character check
to see if it is the sentinel and act accordingly. And you can also use YYFILL(n)
to end the scanner in case not enough characters are available which is nothing
else then e detection of end of data/file.
.LP
\*(re does not provide start conditions: use a separate scanner
specification for each start condition (as illustrated in the above example).
.SH BUGS
.LP
Difference only works for character sets.
.LP
The \*(re internal algorithms need documentation.
.SH "SEE ALSO"
.LP
flex(1), lex(1).
.P
More information on \*(re can be found here:
.PD 0
.P
.B http://re2c.org/
.PD 1
.SH AUTHORS
.PD 0
.P
Peter Bumbulis <peter@csg.uwaterloo.ca>
.P
Brian Young <bayoung@acm.org>
.P
Dan Nuffer <nuffer@users.sourceforge.net>
.P
Marcus Boerger <helly@users.sourceforge.net>
.P
Hartmut Kaiser <hkaiser@users.sourceforge.net>
.P
Emmanuel Mogenet <mgix@mgix.com> added storable state
.P
.PD 1
.SH VERSION INFORMATION
This manpage describes \*(re, version 0.12.3.
.fi