Chapter 11: Parsing The Input Source

The Input Source

As in ANS Forth, the input source is the source of the character stream that is being parsed. During a Forth session, the input source is usually changed quite frequently. In strongForth, the following input sources exist: the user input device, character strings (as evaluated by EVALUATE), and blocks.

Since blocks are described separately in chapter 17, they shall not be considered yet. To get access to the current input source, you can simply use the word SOURCE. It returns a string within the DATA memory area that contains the characters from the input source:

: SOURCE ( -- CDATA -> CHARACTER UNSIGNED )
  SOURCE-ID
  IF SOURCE-ADDR SOURCE-COUNT
  ELSE TIB #TIB @
  THEN ;

SOURCE-ID is a signed number, which is -1 if the current input source is a string, and 0 if it is the user input device. Let's first assume the input source is a string. In this case, SOURCE returns the address and the length of the string. Both have previously been stored by EVALUATE into the two values SOURCE-ADDR and SOURCE-COUNT. If, on the other hand, SOURCE-ID is 0, SOURCE returns the address of the terminal input buffer plus the number of characters that have been entered. The terminal input buffer is a simple character array in the DATA memory area. Its size determines the maximum number of characters you may type in one line. TIB is the constant address of the terminal input buffer, while #TIB is a variable that contains the number of characters in the terminal input buffer. Here are the definitions of all those values, variables and constants:

+0 VALUE SOURCE-ID
TIB VALUE SOURCE-ADDR
0 VALUE SOURCE-COUNT
DATA-SPACE HERE CAST CDATA -> CHARACTER CONSTANT TIB
80 CHARS ALLOT
0 VARIABLE #TIB

Just like ANS Forth, strongForth uses an index variable to track the current position within the input source during parsing:

0 VARIABLE >IN

>IN starts with 0 and is incremented for each character that is being parsed. But what if >IN exceeds the string length of the input source? If the input source is a string or a block, parsing simply stops. If the current input source is the user input device, it is possible to refill the input source by requesting another line from the user. This is what the ANS Forth word REFILL does. It returns TRUE if the refilling succeeded, otherwise it returns FALSE:

: REFILL ( -- FLAG )
  SOURCE-ID 0< INVERT DUP
  IF TIB 80 ACCEPT #TIB ! 0 >IN !
  THEN ;
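To make the interplay of SOURCE-ID, SOURCE, >IN and REFILL easier to follow, here is a minimal Python model of it. It is only a sketch and not strongForth code: the class InputSource, the read_line callback (standing in for ACCEPT) and the attribute names are assumptions made for illustration.

```python
# Toy model of the input-source state described above (not strongForth code).
class InputSource:
    TIB_SIZE = 80  # capacity of the terminal input buffer, as in "80 CHARS ALLOT"

    def __init__(self, read_line):
        self.read_line = read_line   # hypothetical callback standing in for ACCEPT
        self.source_id = 0           # 0 = user input device, -1 = string
        self.tib = ""                # contents of the terminal input buffer
        self.source_text = ""        # stands in for SOURCE-ADDR / SOURCE-COUNT
        self.to_in = 0               # stands in for >IN

    def source(self):
        """Like SOURCE: return the text of the current input source."""
        return self.source_text if self.source_id else self.tib

    def refill(self):
        """Like REFILL: succeeds only for the user input device."""
        ok = not (self.source_id < 0)
        if ok:
            self.tib = self.read_line()[: self.TIB_SIZE]
            self.to_in = 0
        return ok


lines = iter(["1 2 +", "3 ."])
src = InputSource(lambda: next(lines))
print(src.refill(), repr(src.source()))   # True '1 2 +'
src.source_id, src.source_text = -1, "DUP *"
print(src.refill(), repr(src.source()))   # False 'DUP *'
```

As in the Forth definition, refilling a string input source does nothing and reports failure, while refilling the user input device replaces the buffer contents and resets the parsing position.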

The ANS Forth words SAVE-INPUT and RESTORE-INPUT are provided by strongForth as well, but with a different stack effect. In ANS Forth, SAVE-INPUT returns a variable number of items, depending on the requirements of the current input source. RESTORE-INPUT consumes these items and returns a flag that indicates whether the operation succeeded or failed:

SAVE-INPUT ( -- xn ... x1 n ) \ ANS Forth
RESTORE-INPUT ( xn ... x1 n -- flag )

Stack diagrams like these are not allowed in strongForth, because the type system requires that stack diagrams are deterministic. The data types of all input and output parameters must be known at compile time. As a solution, strongForth packs the complete sequence xn ... x1 n into one double-cell item of data type INPUT-SOURCE. Because INPUT-SOURCE is a data type that is only used by SAVE-INPUT and RESTORE-INPUT, these two words always have to be used in pairs. StrongForth's type system effectively prevents other words from messing around with the input source specification. The only way to get around this restriction is to use type casts. You've already seen this technique being used for pictured numeric output, where the unique data type NUMBER-DOUBLE enforces the correct syntax of <# ... # ... #S ... #>. Defining dedicated data types is a common programming technique in strongForth, whenever syntactic rules must be obeyed.

Now, let's get back to SAVE-INPUT, RESTORE-INPUT and INPUT-SOURCE. As long as we only consider strings and the user input device as input sources, it is sufficient for SAVE-INPUT just to save the contents of >IN, and for RESTORE-INPUT to restore >IN:

DT DOUBLE PROCREATES INPUT-SOURCE

: SAVE-INPUT ( -- INPUT-SOURCE )
  >IN @ CAST INPUT-SOURCE ;

: RESTORE-INPUT ( INPUT-SOURCE -- FLAG )
  CAST UNSIGNED >IN ! FALSE ; 

In chapter 17, where blocks are being considered, SAVE-INPUT and RESTORE-INPUT are extended in order to save and restore the current block number as well. SAVE-INPUT will then merge two values into one item of data type INPUT-SOURCE.

You might have noticed that RESTORE-INPUT does not verify that the current input source is identical to the one during execution of SAVE-INPUT. RESTORE-INPUT always returns FALSE to indicate that the restore was successful. The reason is that SAVE-INPUT does not save any information about the current input source in the item of data type INPUT-SOURCE it returns. A safer implementation of SAVE-INPUT and RESTORE-INPUT would additionally have to save SOURCE-ID, SOURCE-ADDR and SOURCE-COUNT. However, one double-cell item like INPUT-SOURCE does not provide enough space for all this data. But you are encouraged to look for a more sophisticated implementation. One possibility would be to let SAVE-INPUT additionally return SOURCE-ID, SOURCE-ADDR and SOURCE-COUNT, for example like this:

: SAVE-INPUT ( -- CDATA -> CHARACTER UNSIGNED SIGNED INPUT-SOURCE )
  SOURCE-ADDR SOURCE-COUNT SOURCE-ID >IN @ CAST INPUT-SOURCE ;

RESTORE-INPUT would then consume all these data and verify that the current input source is still the same.
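The verification step can be pictured with a small Python model. This is only a sketch of the idea, not a strongForth implementation: the SavedInput record, the state dictionary and the function names are hypothetical, and the returned flag follows the ANS convention that FALSE means success.

```python
from collections import namedtuple

# Hypothetical record standing in for the items the extended SAVE-INPUT returns.
SavedInput = namedtuple("SavedInput", "addr count source_id to_in")


def save_input(state):
    """Capture the whole input source specification, not just >IN."""
    return SavedInput(state["addr"], state["count"],
                      state["source_id"], state["to_in"])


def restore_input(state, saved):
    """Return False on success and True on failure, mirroring the ANS flag."""
    same = (state["addr"], state["count"], state["source_id"]) == saved[:3]
    if same:
        state["to_in"] = saved.to_in   # only >IN is actually restored
    return not same


state = {"addr": 1000, "count": 17, "source_id": -1, "to_in": 5}
saved = save_input(state)
state["to_in"] = 9
print(restore_input(state, saved), state["to_in"])   # False 5
state["source_id"] = 0                               # input source has changed
print(restore_input(state, saved))                   # True
```

The design choice is the same as in the text: the saved items identify the input source, and restoring is refused if the current input source no longer matches.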

Parsing

The input source is where the characters come from. But before strongForth can start interpreting or compiling, the long chain of characters has to be cut into suitable pieces, for example into words delimited by spaces. The actual cutting is performed by a word called ENCLOSE. ENCLOSE is not an ANS Forth word, but it provides most of the low-level semantics for the ANS Forth word PARSE:

ENCLOSE ( CHARACTER CDATA -> 1ST UNSIGNED 4 TH -- 2ND 4 TH 4 TH 4 TH )

So, what does ENCLOSE do with all these input and output parameters? The most important input parameter is CDATA -> 1ST, which is the starting address of the character string to be parsed. This address is returned unchanged as the first output parameter 2ND. CHARACTER is the delimiter, i. e. the character which indicates the position where to cut the string.

All other parameters are indexes. UNSIGNED contains the position within the string where ENCLOSE shall start parsing. It is returned unchanged as the second output parameter. The last input parameter, 4 TH, is the length of the character string. The last two output parameters deliver the index positions of the character after the last non-delimiter character in sequence, and the first character not included in parsing. What does this mean? Let's first view an example in which the character string contains at least one delimiter between the starting position and the end of the string:

PAD 20 ACCEPT .
This is a string.
17  OK
CHAR i PAD 8 17 ENCLOSE . . . PAD = .
14 13 8 TRUE  OK

At the beginning of this example, PAD is filled with the string This is a string., which is 17 characters long. Next, ENCLOSE searches for the character i within the string, starting at position 8. Therefore, the two characters i in the first two words are skipped. The first delimiter character i is found at position 13, and the first character not included in parsing is the character n at position 14.

Index       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Character   T  h  i  s     i  s     a     s  t  r  i  n  g  .

With these index values, it is easy to calculate the length of the ENCLOSEd string: 13 - 8 = 5. If parsing the string shall be continued, the next ENCLOSE starts at 14, at the first character that was not parsed by the previous ENCLOSE.

To see what happens if the string does not contain any delimiter, let's parse the string from the first example once again:

PAD 17 TYPE
This is a string. OK
CHAR x PAD 8 17 ENCLOSE . . . PAD = .
17 17 8 TRUE  OK

Because the string does not contain an x, ENCLOSE parses until the end of the string. The last non-delimiter character in sequence is the period at position 16, so the next index position after this character is 17. 17 is also the first position not included in parsing. You can repeat the calculation from the first example: 17 - 8 = 9, which is indeed the length of "a string." (including the period).
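The behaviour shown in the two examples can be summarized in a few lines of Python. This is a model inferred from the examples above, not the actual implementation of ENCLOSE. The function name enclose and its tuple result are assumptions, and the results are listed in stack order, which is the reverse of the order in which . prints them.

```python
def enclose(delim, text, start, length):
    """Model of ENCLOSE: return (start, end, next).

    end  is the index just after the last non-delimiter character,
    next is the first index not included in parsing.
    """
    end = start
    while end < length and text[end] != delim:
        end += 1
    # Skip the delimiter itself if one was found; otherwise stay at the end.
    nxt = end + 1 if end < length else length
    return start, end, nxt


text = "This is a string."
print(enclose("i", text, 8, 17))   # (8, 13, 14)
print(enclose("x", text, 8, 17))   # (8, 17, 17)
```

In both cases the length of the enclosed string is end - start, and the next call would start at next.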

Using ENCLOSE, parsing a string becomes pretty simple. Here's the definition of PARSE:

: PARSE ( CHARACTER -- CDATA -> CHARACTER UNSIGNED )
  SOURCE >IN @ SWAP ENCLOSE >IN ! OVER - ROT ROT + SWAP ;

PARSE expects the delimiter CHARACTER on the stack and forwards it together with the input source and the value of >IN to ENCLOSE. >IN contains the current parsing position within the input source. After ENCLOSE returns, PARSE updates >IN with the index of the first character that was not parsed by ENCLOSE, so parsing can continue seamlessly when PARSE is executed next time. From the remaining output parameters of ENCLOSE, PARSE calculates the starting address and the length of the parsed string.
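The address and length arithmetic performed by OVER - ROT ROT + SWAP may be easier to see in a Python model. The sketch below assumes a flat memory string indexed by address, and it returns the new value of >IN instead of storing it in a variable. The names parse, memory and to_in are illustrative only.

```python
def parse(delim, memory, addr, length, to_in):
    """Model of PARSE: return (string_addr, string_len, new_to_in).

    addr/length describe the input source inside memory,
    to_in is the current parsing offset (the contents of >IN).
    """
    end = to_in
    while end < length and memory[addr + end] != delim:
        end += 1
    nxt = end + 1 if end < length else length   # what gets stored into >IN
    return addr + to_in, end - to_in, nxt       # start address, length


memory = "##one,two"          # the input source starts at address 2
a, n, to_in = parse(",", memory, 2, 7, 0)
print(a, n, to_in, memory[a:a + n])   # 2 3 4 one
a, n, to_in = parse(",", memory, 2, 7, to_in)
print(a, n, to_in, memory[a:a + n])   # 6 3 7 two
```

The second call continues exactly where the first one stopped, which is the reason for updating >IN.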

A simple application of PARSE is \. The backslash is strongForth's only means to provide comments within the source code, because parentheses are already used up for stack diagrams. In ANS Forth, parentheses enclose small comments within a source line, while comments that extend to the end of the line start with the backslash:

... ( ANS Forth comment ) ...
... \ another ANS Forth comment
...

In this example, ... indicates executable source code. To compensate for the missing parenthesis, strongForth allows using another backslash to terminate the comment, just like the right parenthesis in ANS Forth:

... \ strongForth comment \ ...
... \ another strongForth comment
...

The definition of \ is quite simple. It just parses until the next occurrence of a backslash and then discards the parsed string. Because \ works in both interpretation and compilation state, it has to be an immediate word:

: \ ( -- )
  [CHAR] \ PARSE DROP DROP ; IMMEDIATE

In order to parse the input source for words delimited by spaces, strongForth provides a variant of PARSE, which is called PARSE-WORD. It is suggested by the ANS Forth specification, although it is not a part of the standard. PARSE-WORD has no input parameter, because it always assumes space characters as delimiters. Unlike PARSE, it also skips leading spaces and checks the length of the parsed word, which must not be greater than 31, the maximum length of a word in the dictionary. The definition of PARSE-WORD looks quite similar to the definition of PARSE:

: PARSE-WORD ( -- CDATA -> CHARACTER UNSIGNED )
  SOURCE >IN @ SWAP ENCLOSE-WORD >IN ! OVER - ROT ROT + SWAP
  DUP 31 > IF -19 THROW THEN ;

PARSE-WORD uses ENCLOSE-WORD, which is in turn similar to ENCLOSE:

ENCLOSE-WORD ( CDATA -> CHARACTER UNSIGNED 3RD -- 1ST 3RD 3RD 3RD )

Since ENCLOSE-WORD always assumes space characters as delimiters, it does not expect the delimiter character as input parameter. Unlike with ENCLOSE, the value of the second output parameter is not always identical to the value of the input parameter UNSIGNED. This is because ENCLOSE-WORD skips leading spaces. The second output parameter is actually the index position of the first non-space character after the position indicated by UNSIGNED. However, for both ENCLOSE and ENCLOSE-WORD the second output parameter contains the starting position of the enclosed string.
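A Python model makes the difference to ENCLOSE visible. It is a sketch based on the description above. The function name enclose_word is an assumption, and so is the treatment of the last output parameter, which is modelled here in the same way as for ENCLOSE.

```python
def enclose_word(text, start, length):
    """Model of ENCLOSE-WORD: return (word_start, word_end, next)."""
    while start < length and text[start] == " ":   # skip leading spaces
        start += 1
    end = start
    while end < length and text[end] != " ":       # collect the word
        end += 1
    nxt = end + 1 if end < length else length
    return start, end, nxt


text = "  DUP  SWAP"
print(enclose_word(text, 0, 11))    # (2, 5, 6)
print(enclose_word(text, 6, 11))    # (7, 11, 11)
print(enclose_word(text, 11, 11))   # (11, 11, 11)
```

The last call returns an empty word (start equals end), which is how the end of the input source shows up to a caller such as PARSE-WORD.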

A very simple application of PARSE-WORD is CHAR. CHAR parses the input source for a word delimited by spaces and returns the first character of the word. Any additional characters are discarded. If the parsed word is empty, which happens at the end of the input source, CHAR simply returns a space character. Here's the definition of CHAR:

: CHAR ( -- CHARACTER )
  PARSE-WORD IF @ ELSE DROP BL THEN ;

Parsing Numbers

StrongForth does not yet include the ANS Forth Floating-Point word set. But it is able to clearly distinguish between unsigned and signed, single-precision and double-precision numbers when parsing the input source. Generally, the input format of a number consists of a sequence of digits, which is optionally preceded by a sign character (either + or -) and is optionally succeeded by a decimal point:

<integer number> := [<sign>]<digits>[.]
<sign>           := { + | - }
<digits>         := <digit><digit>*
<digit>          := { 0 ... 9 } | { A ... Z }

The range of digits is limited by the current number-conversion radix BASE. For example, if BASE is 16, valid digits are 0 to 9 and A to F. G to Z are invalid digits for a hexadecimal number. If BASE is 8, only digits 0 to 7 are allowed. The optional sign and the optional decimal point determine the data type of the number:

Sign  Decimal Point  Example       Data Type
no    no             7144          UNSIGNED
yes   no             +812          SIGNED
no    yes            3000511.      UNSIGNED-DOUBLE
yes   yes            -713306492.   SIGNED-DOUBLE

The strongForth word NUMBER converts a sequence of characters from a string into a number:

NUMBER ( CDATA -> CHARACTER UNSIGNED -- INTEGER-DOUBLE DATA-TYPE )

The result of the conversion is delivered as an item of data type INTEGER-DOUBLE, because this data type is capable of holding all kinds of single-precision and double-precision numbers. But remember that stack diagrams in strongForth are always static. Therefore, it is not possible for NUMBER to return different data types depending on the contents of the string. Only in Forth systems without strong static typing, something like

NUMBER ( c-addr u1 -- n | u2 | d | ud )

is actually allowed. Of course, strongForth provides an acceptable solution for this problem. NUMBER returns the recommended data type of the conversion result explicitly as an item of data type DATA-TYPE. Here are some examples:

DECIMAL
 OK
PARSE-WORD 7144 NUMBER . .
UNSIGNED 7144  OK
PARSE-WORD +812 NUMBER . .
SIGNED 812  OK
PARSE-WORD 3000511. NUMBER . .
UNSIGNED-DOUBLE 3000511  OK
PARSE-WORD -713306492. NUMBER . .
SIGNED-DOUBLE 3581660804  OK
HEX
 OK
PARSE-WORD 6DF4 NUMBER . .
UNSIGNED 6DF4  OK
PARSE-WORD 28H5 NUMBER . .

PARSE-WORD 28H5 NUMBER ? undefined word
INTEGER-DOUBLE DATA-TYPE

The last example fails, because NUMBER does not recognize the letter H as a hexadecimal digit.

Before continuing with the actual definition of NUMBER, we need to have a look at a few words that support NUMBER in parsing digits, sequences of digits and sign characters. The first one is DIGIT?, which is defined as follows:

: DIGIT? ( CHARACTER -- UNSIGNED FLAG )
  [CHAR] 0 - CAST UNSIGNED
  DUP [ CHAR A CHAR 0 - CAST UNSIGNED ] LITERAL <
  IF DUP 9 > IF DROP BASE @ THEN
  ELSE [ CHAR A CHAR 0 10 + - CAST UNSIGNED ] LITERAL -
  THEN DUP BASE @ < ;

DIGIT? expects a character as input parameter, which might be a valid digit or not, given the current number-conversion radix BASE. If the character is a valid digit, DIGIT? returns its numerical value UNSIGNED and a TRUE flag. If it is not, UNSIGNED is undefined and FLAG is FALSE:

CHAR 6 DIGIT? . .
TRUE 6  OK
CHAR C DIGIT? . .
FALSE 12  OK
CHAR C HEX DIGIT? DECIMAL . .
TRUE 12  OK

DIGIT? divides the ASCII character set into 5 segments:

invalid | 0 to 9 | invalid | A to Z | invalid

By subtracting [CHAR] 0 and casting the result to data type UNSIGNED, characters 0 to 9 are being converted to their respective numeric values. Simultaneously, the first and the last invalid segments are merged. Next, the invalid segment between characters 9 and A is excluded. By replacing the numerical value with the contents of BASE, characters belonging to this segment will be recognized as being invalid digits at the end. Finally, after adjusting the numerical value of all characters with ASCII values above or equal to character A, the result is compared to the number-conversion radix.
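The segment arithmetic can be replayed step by step in Python. The sketch below mirrors the structure of the Forth definition. The cell width of 16 bits (CELL_MASK) is an assumption, but any fixed width produces the same wrap-around effect for characters below 0. The function name digit and its result order (value, flag) are chosen for illustration.

```python
CELL_MASK = 0xFFFF  # assumption: 16-bit cells; only the wrap-around matters


def digit(ch, base):
    """Model of DIGIT?: return (value, is_valid)."""
    value = (ord(ch) - ord("0")) & CELL_MASK          # [CHAR] 0 - CAST UNSIGNED
    if value < ord("A") - ord("0"):                   # characters '0' .. '@'
        if value > 9:
            value = base                              # ':' .. '@' forced invalid
    else:                                             # 'A' and above, or below '0'
        value = (value - (ord("A") - (ord("0") + 10))) & CELL_MASK
    return value, value < base


print(digit("6", 10))   # (6, True)
print(digit("C", 10))   # (12, False)
print(digit("C", 16))   # (12, True)
print(digit(":", 16))   # (16, False)
print(digit("/", 10))   # (65528, False)
```

Characters below 0 wrap around to very large unsigned values, which is why they end up in the same invalid segment as the characters above Z.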

DIGIT? converts only one character to its numerical value. >NUMBER is an ANS Forth word that converts a whole sequence of digits into its numerical value. It expects on the data stack a double-precision number INTEGER-DOUBLE and the character string CDATA -> CHARACTER UNSIGNED containing the digits, and returns these parameters in the same order. The converted digits are accumulated into INTEGER-DOUBLE by multiplying INTEGER-DOUBLE by the contents of BASE and then adding the numerical value of the digit.

: >NUMBER ( INTEGER-DOUBLE CDATA -> CHARACTER UNSIGNED -- 1ST 2ND 4 TH )
  ROT LOCALS| D |
  BEGIN DUP
  WHILE OVER @ DIGIT?
  WHILE D BASE @ * SWAP + TO D 1 /STRING
  REPEAT DROP
  THEN D ROT ROT ;

The implementation of >NUMBER is straightforward. In order to reduce the number of stack movements, the double-precision number is kept in a local variable. The unusual structure of the loop, which is built with the words BEGIN ... WHILE ... WHILE ... REPEAT ... THEN, is fully compliant with the ANS Forth standard. If the condition of the first WHILE is not met, a branch to the code after THEN is executed, while the second WHILE exits the loop with a branch to the code following REPEAT. DROP removes the (unused) first output parameter of DIGIT?. Note that >NUMBER uses mixed-mode operations, which are overloaded versions of * and +.
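The loop with its two exits can be modelled in Python as follows. This is only a sketch: the digit lookup is simplified to a table search instead of calling a model of DIGIT?, overflow of the double-precision accumulator is not modelled, and the names to_number and DIGITS are assumptions.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_number(acc, text, base):
    """Model of >NUMBER: return (accumulated value, unconverted rest)."""
    pos = 0
    while pos < len(text):                 # first WHILE: any characters left?
        value = DIGITS.find(text[pos])
        if not 0 <= value < base:          # second WHILE: is it a valid digit?
            break
        acc = acc * base + value           # D BASE @ * SWAP + TO D
        pos += 1                           # 1 /STRING
    return acc, text[pos:]


print(to_number(0, "3000511.", 10))   # (3000511, '.')
print(to_number(0, "28H5", 16))       # (40, 'H5')
```

The unconverted rest of the string is what allows the caller to check for a trailing decimal point or to detect an invalid character.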

Finally, NUMBER takes advantage of a word that recognizes sign characters. >SIGN converts an item of data type CHARACTER into a signed number. >SIGN simply returns +1 if CHARACTER is +, -1 if CHARACTER is -, and 0 for all other characters. The strongForth version of CASE ... OF ... ENDOF ... ENDCASE works exactly as in ANS Forth.

: >SIGN ( CHARACTER -- SIGNED )
  CASE [CHAR] + OF +1 ENDOF
       [CHAR] - OF -1 ENDOF
  +0 SWAP ENDCASE ;

Now it's time to study the definition of NUMBER itself. It looks quite complex, but it is not too difficult to understand. First, a double-precision number of data type INTEGER-DOUBLE is created and pushed down to the third stack position. The resulting parameter configuration matches the input parameter list of >NUMBER. Next, the local S of data type SIGNED is defined. S keeps the information about the optional sign character, which is evaluated in the following IF clause.

: NUMBER ( CDATA -> CHARACTER UNSIGNED -- INTEGER-DOUBLE DATA-TYPE )
  NULL INTEGER-DOUBLE ROT ROT +0 LOCALS| S | DUP
  IF OVER @ >SIGN DUP TO S IF 1 /STRING THEN
  THEN DUP
  IF OVER @ DIGIT? IF DROP >NUMBER ELSE DROP -13 THROW THEN
  ELSE -13 THROW
  THEN DUP
  IF " ." COMPARE IF -13 THROW THEN
     S IF [DT] SIGNED-DOUBLE ELSE [DT] UNSIGNED-DOUBLE THEN
  ELSE DROP DROP S IF [DT] SIGNED ELSE [DT] UNSIGNED THEN
  THEN S 0< IF SWAP NEGATE SWAP THEN ;

Of course, a number should contain at least one digit. If the string is empty, or contains just a sign character, or the first character is not a valid numerical digit, NUMBER throws an exception. Otherwise, it uses >NUMBER to parse the sequence of digits and convert it into a number.

If the string does not contain any characters after the last digit, NUMBER has successfully parsed a single-precision number of data type SIGNED or UNSIGNED, depending on the presence of a leading sign character. Otherwise, the only character that may follow the last digit is a decimal point, which indicates a double-precision number of data type SIGNED-DOUBLE or UNSIGNED-DOUBLE. Any other character or a sequence of more than one character is invalid.

At the end of the definition of NUMBER, S is queried again to find out whether the number is negative. If it is, the numerical value needs to be negated. We're done.
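The decision logic of NUMBER can be condensed into a short Python model. It is a sketch under stated assumptions: exceptions are modelled with ValueError instead of -13 THROW, the result is an ordinary Python integer rather than a raw double-cell value (so negative numbers are not shown in their unsigned two's complement form), and the names number and DIGITS are illustrative.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def number(text, base=10):
    """Model of NUMBER: return (value, data type name) or raise ValueError."""
    sign = {"+": 1, "-": -1}.get(text[:1], 0)       # like >SIGN
    rest = text[1:] if sign else text
    pos, value = 0, 0
    while pos < len(rest) and 0 <= DIGITS.find(rest[pos]) < base:
        value = value * base + DIGITS.find(rest[pos])
        pos += 1
    if pos == 0:
        raise ValueError("no digits")               # plays the role of -13 THROW
    tail = rest[pos:]
    if tail not in ("", "."):
        raise ValueError("invalid trailing characters")
    kind = ("SIGNED" if sign else "UNSIGNED") + ("-DOUBLE" if tail else "")
    return (-value if sign < 0 else value), kind


print(number("7144"))          # (7144, 'UNSIGNED')
print(number("+812"))          # (812, 'SIGNED')
print(number("3000511."))      # (3000511, 'UNSIGNED-DOUBLE')
print(number("-713306492."))   # (-713306492, 'SIGNED-DOUBLE')
print(number("6DF4", 16))      # (28148, 'UNSIGNED')
```

A call such as number("28H5", 16) raises ValueError, which corresponds to the failing example in the session shown earlier.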

It is certainly worth considering shortening the definition of NUMBER by doing some more factoring. For example, each of the three first-level IF clauses is a candidate for being factored out into a separate word. However, these words would all have very specialized semantics which are unlikely to be usable for other purposes. After all, it's just a matter of programming style. The definition of NUMBER doesn't look like good Forth style, but it works correctly and it doesn't contain spaghetti code or any dirty tricks.

There are actually two unsigned numbers that are being used much more often than any other number: 0 and 1. Since all numbers are compiled as literals with the tokens LIT or DLIT followed by one or two cells containing the numeric value, it is desirable to have special definitions for 0 and 1, which compile as only one cell of virtual code. Therefore, the following two definitions are included in the strongForth dictionary:

0 CONSTANT 0
1 CONSTANT 1

Note that the compiler recognizes these two constants only if they are contained in the input source exactly as 0 or 1. 01, for example, will be compiled as the numerical literal 1 with two cells of virtual code, and not as the constant 1 with only one cell of virtual code. Of course, the runtime semantics is exactly the same.

The Interpreter

Now we're getting into the very heart of strongForth. Parsing the input source, looking up words in the dictionary and interpreting or compiling these words are actually the core functionalities of each Forth system. StrongForth's type system doesn't change this statement, but by considering data types, it adds useful features like type checking and operator overloading.

All of these tasks are performed by a single word: INTERPRET. Since Forth doesn't make a big difference between interpreting and compiling, INTERPRET serves as both the interpreter and the compiler. Whether a word is to be interpreted or to be compiled depends on whether the system is in interpretation or compilation state. Immediate words are always interpreted, as long as they are not forced to be compiled by [COMPILE] or POSTPONE.

So here's the definition of INTERPRET:

: INTERPRET ( -- )
  BEGIN PARSE-WORD DUP
  WHILE OVER OVER FIND-LOCAL DUP
     IF ROT DROP ROT DROP ABS LOCAL,
     ELSE DROP DROP OVER OVER 0 4 FIND DUP
        IF ROT DROP ROT DROP 0< STATE @ AND
           IF COMPILE,
           ELSE FALSE DT>DT (EXECUTE)
           THEN
        ELSE DROP DROP NUMBER DUP >DT DOUBLE? STATE @
           IF
              IF LITERAL,
              ELSE D>S LITERAL,
              THEN
           ELSE
              IF ( DOUBLE -- )CAST
              ELSE D>S ( SINGLE -- )CAST
              THEN
           THEN
        THEN
     THEN
  REPEAT DROP DROP ;

INTERPRET is a loop that parses the input source as long as there are words to parse. For example, if the input source is the user input device, INTERPRET processes one complete line of text.

By passing the character string returned by PARSE-WORD to FIND-LOCAL, INTERPRET first checks whether the word is a local. Locals only exist in compilation state. If FIND-LOCAL finds a match in the local dictionary, the local is directly compiled by LOCAL,. FIND-LOCAL, LOCAL, and the local dictionary will be described in chapter 12.

If FIND-LOCAL does not succeed, FIND gets a chance to search the complete dictionary. With 4 as the matching criteria code, FIND performs an additional input parameter match on the interpreter or compiler data type heap, depending on whether the word is to be interpreted or compiled. Let's assume FIND really finds a suitable word. If the system is in compilation state and it is not an immediate word, the word is being compiled by COMPILE,. Otherwise, i. e., either the system is in interpretation state or the word is immediate, the word is to be interpreted. In strongForth, interpreting a word means applying its stack effect to the interpreter data type heap and then executing the word. The interpreter data type heap is updated by FALSE DT>DT. (EXECUTE) calls the inner interpreter in order to execute the word. It expects the word's execution token on the data stack, which is delivered by DT>DT:

(EXECUTE) ( TOKEN -- )

Note that (EXECUTE) is a low-level word that should be used with care. Since (EXECUTE) does not consider the stack effects of the words it executes, it can easily corrupt strongForth's type system. Its usage in INTERPRET is correct, because DT>DT applies a word's stack effect to the interpreter data type heap before the word is executed.

Finally, let's see what happens if neither FIND-LOCAL nor FIND can find the word whose name was parsed in the input source. Then the final hope is NUMBER. NUMBER does not return a flag to indicate whether it was successful or not. If it cannot recognize a word as being a valid number, it directly throws an exception.

Once NUMBER has accepted the word, it returns its data type, which is one of the following: UNSIGNED, SIGNED, UNSIGNED-DOUBLE or SIGNED-DOUBLE.

Depending on the contents of STATE, the number is either interpreted or compiled. In both cases, >DT pushes the data type to the corresponding data type heap. LITERAL, compiles the number as virtual machine code, whereas ( SINGLE -- )CAST and ( DOUBLE -- )CAST interpret it. If the data type of the number is either SIGNED or UNSIGNED, the value delivered by NUMBER needs to be converted from a double-precision number to a single-precision number. Note that LITERAL, is overloaded to provide versions for single-precision and double-precision literals.
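The three-stage lookup performed by INTERPRET (locals first, then the dictionary, then NUMBER) can be outlined in Python. This is a deliberately simplified sketch: it ignores data types and overloading, which are the essential additions of strongForth, and it only records which action would be taken for each word. The names interpret and simple_number and the dictionary layout are assumptions.

```python
def simple_number(word):
    # Stand-in for NUMBER: raises ValueError if the word is not an integer.
    return int(word)


def interpret(words, dictionary, local_names=(), compiling=False):
    """Return the list of actions taken, one per parsed word."""
    actions = []
    for word in words:
        if word in local_names:                     # FIND-LOCAL succeeded
            actions.append(("compile local", word))
        elif word in dictionary:                    # FIND succeeded
            immediate = dictionary[word]
            if compiling and not immediate:
                actions.append(("compile", word))
            else:
                actions.append(("execute", word))
        else:                                       # last hope: NUMBER
            value = simple_number(word)
            actions.append(("compile literal" if compiling else "push", value))
    return actions


dictionary = {"DUP": False, "IF": True}   # name -> immediate flag
print(interpret(["5", "DUP"], dictionary))
# [('push', 5), ('execute', 'DUP')]
print(interpret(["X", "DUP", "IF", "5"], dictionary,
                local_names={"X"}, compiling=True))
# [('compile local', 'X'), ('compile', 'DUP'), ('execute', 'IF'), ('compile literal', 5)]
```

As in INTERPRET, an unknown word that is not a number is not reported by a flag. The exception raised by the number conversion simply propagates to the caller.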

)CAST is actually a very interesting immediate word. It compiles nothing! At least, it has no execution semantics. All it does is apply a given stack diagram to the compiler or interpreter data type heap, just as if a word with this stack diagram had been compiled. During the compilation of INTERPRET, it simply removes a data type (SINGLE or DOUBLE) from the compiler data type heap. But why does removing a data type from the compiler data type heap actually interpret a number? In strongForth, interpreting a number requires two things to be done:

  1. Push the data type of the number to the interpreter data type heap.
  2. Push the numerical value of the number to the data stack.

Both tasks have already been completed before )CAST is executed. >DT has taken care of the interpreter data type heap, and the number, either single-precision or double-precision, is already on the data stack. So, all that has to be done is a little clean-up during the compilation of INTERPRET. )CAST simply removes data types INTEGER-DOUBLE or SINGLE from INTERPRET's compiler data type heap. The number is now invisible for INTERPRET. However, an important precondition for this to work correctly is that INTERPRET does not have anything left on the data stack below the number. Otherwise, next time INTERPRET tried to access these data, it would find the number instead. Needless to say, )CAST should be used very carefully, because it might easily corrupt strongForth's type system.

: )CAST ( MEMORY-SPACE FLAG STACK-DIAGRAM -- )
  <DIAGRAM DUP DUP OFFSET >R 2 CELLS ALLOT
  HERE CAST DATA -> STACK-DIAGRAM R@ - 1- DUP DUP 1+ R@ MOVE !
  FAR-HERE -> DATA-TYPE R> - 1- CAST DEFINITION STATE @ DT>DT
  DROP -2 CELLS ALLOT DIAGRAM> ; IMMEDIATE

)CAST marks the end of a stack diagram. The length of the stack diagram is calculated by OFFSET, and temporarily stored on the return stack. The application of the stack diagram on the interpreter or compiler data type heap is performed by DT>DT, which expects a definition on the data stack. Converting the stack diagram into those parts of a definition that are needed by DT>DT just requires inserting the attribute field and the token field. The stack diagram is already in the local name space. By allocating two additional cells, moving the stack diagram two cells towards higher addresses and finally storing the item of data type STACK-DIAGRAM that was delivered by <DIAGRAM into the gap, a kind of mutilated definition is created. Now, DT>DT can be applied. -2 CELLS ALLOT DIAGRAM> cleans up the local name space.

Note that )CAST and CAST are different in many ways. It's not only the syntax that distinguishes these two words. Because CAST always converts the data type of only one item, it is safe in the sense that it cannot corrupt strongForth's data type system. Whenever a single-cell item is converted into a double-cell item or vice versa, it compiles appropriate conversion words in order to keep the data stack and the data type heap aligned. In contrast to this behaviour, )CAST is potentially unsafe. It can change the contents of the data type heap arbitrarily, without regard to the data stack. Its use should be limited to rare occasions.

Evaluating and Postponing

StrongForth implements the ANS Forth word EVALUATE as an application of INTERPRET. The ANS Forth glossary on EVALUATE says:

    ( i*x c-addr u -- j*x )

    Save the current input source specification. Store minus-one (-1) in SOURCE-ID if it is present. Make the string described by c-addr and u both the input source and input buffer, set >IN to zero, and interpret. When the parse area is empty, restore the prior input source specification. Other stack effects are due to the words EVALUATEd.

And that's exactly what EVALUATE does in strongForth:

: EVALUATE ( CDATA -> CHARACTER UNSIGNED -- )
  SOURCE-ADDR SOURCE-COUNT SOURCE-ID >IN @ LOCALS| I S C A |
  0 >IN ! -1 TO SOURCE-ID TO SOURCE-COUNT TO SOURCE-ADDR
  INTERPRET
  I >IN ! S TO SOURCE-ID C TO SOURCE-COUNT A TO SOURCE-ADDR ;

The input source specification, consisting of SOURCE-ADDR, SOURCE-COUNT, SOURCE-ID and >IN, is stored in locals. These values cannot be kept on the data stack, because the data stack is usually affected by the evaluated words. While evaluating, SOURCE-ADDR and SOURCE-COUNT contain the string that is being evaluated. In order to allow for EVALUATE to be used recursively, i. e. using EVALUATE in turn within an evaluated string, SOURCE-ADDR and SOURCE-COUNT need to be saved as part of the input source specification.
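The save and restore pattern, including nested evaluation, can be tried out with a small Python model. It is a sketch and not strongForth code: the Session class, the macros table (words whose only action is to evaluate another string) and the seen list are hypothetical devices to make the nesting observable.

```python
class Session:
    def __init__(self):
        self.text, self.source_id, self.to_in = "", 0, 0
        self.seen = []     # every ordinary word that gets interpreted
        self.macros = {}   # hypothetical: word -> string to EVALUATE

    def next_word(self):
        """Rough stand-in for PARSE-WORD: returns '' at the end of the source."""
        rest = self.text[self.to_in:]
        stripped = rest.lstrip(" ")
        skipped = len(rest) - len(stripped)
        word = stripped.split(" ", 1)[0]
        self.to_in = min(len(self.text), self.to_in + skipped + len(word) + 1)
        return word

    def interpret(self):
        while True:
            word = self.next_word()
            if not word:
                break
            if word in self.macros:
                self.evaluate(self.macros[word])   # nested EVALUATE
            else:
                self.seen.append(word)

    def evaluate(self, string):
        saved = (self.text, self.source_id, self.to_in)       # like LOCALS| I S C A |
        self.text, self.source_id, self.to_in = string, -1, 0
        self.interpret()
        self.text, self.source_id, self.to_in = saved         # restore everything


s = Session()
s.macros = {"SQUARE": "DUP *"}
s.text = "3 SQUARE ."
s.interpret()
print(s.seen, s.source_id)   # ['3', 'DUP', '*', '.'] 0
```

Because the saved specification lives in a local variable of each evaluate call, the outer input source continues with the word following SQUARE once the inner string has been consumed.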

This version of EVALUATE works with strings that are located in the DATA memory area, because that's where SOURCE expects the input sources to be. But in many cases, it is necessary to evaluate strings located in the CONST memory area. For example, strings compiled into words are always kept in the constant data space. The phrase

... " ..." EVALUATE ...

can only be compiled if an overloaded version of EVALUATE for strings in the CONST memory area exists.

: EVALUATE ( CCONST -> CHARACTER UNSIGNED -- )
  SPACE@ SWAP LOCAL-SPACE HERE CAST CDATA -> CHARACTER
  OVER CHARS ALLOT ALIGN LOCALS| ADDR COUNT | SPACE!
  ADDR COUNT MOVE ADDR COUNT EVALUATE
  SPACE@ LOCAL-SPACE ADDR CAST ADDRESS HERE - ALLOT SPACE! ;

You might be surprised about the complexity of this definition. But since SOURCE does not support input sources in the CONST memory area, the string has to be copied to the DATA memory area before it can be evaluated. EVALUATE allocates space for a copy of the string in the local data space, copies the string from the CONST memory area to this location, evaluates it and then deallocates the allocated memory. Again, locals are used to clean up the data stack before actually evaluating the string.

Using the local data space for storing a copy of the evaluated string has the advantage that EVALUATE can be used recursively, because strings on different levels of evaluation do not interfere with each other. But this technique also has a drawback. The evaluated string must not itself change the local data space. Anything like

" ... LOCAL-SPACE ... ALLOT ..." EVALUATE \ Don't do that!

is strictly forbidden, unless the sum of allocated and deallocated cells is zero:

" ... LOCAL-SPACE ... 8 ALLOT ... -8 ALLOT ... " EVALUATE \ Okay

One of the most prominent applications of EVALUATE is the ANS Forth word POSTPONE. According to the ANS Forth specification, it compiles the compilation semantics of a word to the current definition. What does this mean? The compilation semantics of an immediate word is to execute the word immediately. POSTPONE compiles this compilation semantics by adding the token of the immediate word into the current definition using COMPILE,. The immediate word is then executed at runtime instead of at compile time. Its execution is POSTPONEd.

The compilation semantics of a non-immediate word is to compile the token of the word into the current definition. Therefore, POSTPONE compiles code that compiles the word at runtime. It actually compiles the name of the word as a string and then compiles the token of EVALUATE. Here's the definition of POSTPONE:

: POSTPONE ( -- )
  ?COMPILE PARSE-WORD OVER OVER 0 6 FIND +1 =
  IF COMPILE, DROP DROP
  ELSE DROP [COMPILE] SLITERAL " EVALUATE" EVALUATE
  THEN ; IMMEDIATE

But why does POSTPONE use EVALUATE instead of just calculating the execution token of the word and then compiling something like

LIT <token> CONST,

into the current definition? This is what an ANS Forth system would do. In strongForth, the integrity of the type system requires that a word is compiled within the context of the current definition. When POSTPONE is executed, it only knows the context of the word that compiles the postponed word, but not the context of the postponed word itself. Selecting the correct token is not possible if the word is overloaded. Therefore, it is not sufficient to postpone the creation of virtual code for the word. FINDing the word in the dictionary and updating the data type heap has to be postponed as well. EVALUATE does exactly this. For POSTPONE, it does not even matter whether a postponed non-immediate word is actually found in the dictionary or not, because the context is invalid anyway. If FIND either finds a non-immediate word or just fails, it simply compiles the name of the word as a string, plus EVALUATE to have the word compiled at runtime.

Would you have guessed that POSTPONE itself is a typical application of POSTPONE? If not, have a look at the following alternative definition of POSTPONE:

: POSTPONE ( -- )
  ?COMPILE PARSE-WORD OVER OVER 0 6 FIND +1 =
  IF COMPILE, DROP DROP
  ELSE DROP POSTPONE SLITERAL POSTPONE EVALUATE
  THEN ; IMMEDIATE

This produces exactly the same virtual code as the first definition of POSTPONE. But let's have a look at another example:

: SLITERAL ( CDATA -> CHARACTER UNSIGNED -- )
  ?COMPILE SPACE@ CONST-SPACE ROT ROT
  POSTPONE SLIT ", ALIGN SPACE! ; IMMEDIATE
 OK
SEE SLITERAL
: SLITERAL ( CDATA -> CHARACTER UNSIGNED -- )
?COMPILE SPACE@ CONST-SPACE ROT ROT " SLIT" EVALUATE ", ALIGN SPACE! ; IMMEDIATE
  OK

During compilation of SLITERAL, the context SLIT will be compiled into is not yet known. However, when SLITERAL is finally being executed, the context is known, and EVALUATE is able to compile SLIT into this context.


Dr. Stephan Becher - October 29th, 2005