secondo/Tools/pd/PD3

/*
//[|]	[$\mid $]
//[\]	[$\setminus $]

3 Lexical Analysis

3.1 Introduction

The file ~PDLex.l~ contains a specification of a lexical analyser. A description of the structure of ~lex~ specifications can be found in [ASU86, Section 3.5]. More detailed information is given in [SUN88, Section 10]. From such a specification, the UNIX tool ~lex~ creates a file ~lex.yy.c~ which can then be compiled to obtain a lexical analyser. This analyser is given as a function

----	int yylex()
----

On each call, this function matches a piece of the input string and returns a ~token~, which is an integer constant.

A lex specification has the following structure:

----	declarations
	%%
	translation rules
	%%
	auxiliary procedures
----

The ~declarations~ section provides definitions of tokens to be returned (within brackets of the form \%\{ ... \}\%. The remainder of this section contains definitions of regular symbols (nonterminals of a regular grammar). For example, the definitions

----	digit		[0-9]
	num		({digit}{digit}|{digit})
----

introduce two nonterminals called ~digit~ and ~num~ respectively, by giving regular expressions on the right hand side for them. Terminal symbols can be described by character classes such as [0-9] (for digits), [[\]n] (matches just the end of line symbol), [ab?!] (contains just these four characters), etc. Nonterminal symbols used in regular expressions have to be enclosed by braces. Hence here a number is defined to consist of either one or two digits. The characters ``[|]'', ``+'', and ``[*]'' have the usual meaning for the composition of regular expressions.

The ~translation rules~ section contains a list of rules. Each rule describes some action to be taken whenever a piece of the input string has been matched. For example, the rule

----	^{head1}	{yylval = atom(yytext, yyleng); return(HEAD1);}
----

states the action to be taken when a ~head1~ regular symbol has been recognized which is defined in the ~declarations~ section by:

----	head1		{num}" "
----

So the input string matching ~head1~ consists of one or two digits followed by a blank. A translation rule consists of a regular symbol or regular expression on the left hand side, and some C code in braces on the right hand side. The C code says which token is to be returned (if any, one can also just skip a part of the input string and not return a token). The rule above says that a ~HEAD1~ token is to be returned from lexical analysis on recognizing a ~head1~ regular symbol which must occur at the beginning of a line (this is specified by the leading
caret character).

In addition to returning a token, one often wants to give further information. For the communication with a ~yacc~ generated parser, a global integer variable ~yylval~ is predefined. A value assigned to this variable can be accessed in parsing (see Section 5). The input string matched by a regular expression is given by ~lex~ variables

----	char *yytext;
	int yyleng;
----

where ~yytext~ points to the first character and ~yyleng~ gives the number of characters matched. The lexical analyser specified here usually returns the text that has been matched by creating a corresponding atom in the ~NestedText~ data structure and returning its node index. This happens also in the rule shown above.

The last section ~auxiliary procedures~ contains C code that is copied directly into the program ~lex.yy.c~. Here one can define support variables or functions for the action parts of translation rules.

Some specific comments about the definitions given below:

  * The regular symbols ~open~, ~open2~, and ~close~ consume preceding and following empty lines (consisting of space and end-of-line characters). Similarly, ~epar~ (paragraph end symbol) consumes subsequent empty lines. This is to avoid superfluous space in ``verbatim'' sections of the target Latex document.

  * A ~close~ followed by an ~open~ commentary bracket is omitted completely. This means that adjacent documentation sections are merged into one. This is needed in particular for documents obtained by concatenating several files. Be aware of this in testing lexical analysis! This is the only case when parts of the input string are ``swallowed'' without returning tokens.

Note that the characters

----	~, *, ", [, ], :
----

are returned directly to the parser (each is its own token) because they are used there directly in grammar rules. These characters are matched by the rule with a ``.'' on the left hand side. The dot matches everything unmatched otherwise.


3.2 The Specification

(File ~PDLex.l~)

*/