Conditionals and loops are control structures that change the linear flow of execution. Normally, the inner interpreter executes virtual code sequentially, one token after the other. For example, HEX is a word with purely linear execution:
: HEX 16 BASE ! ; OK SEE HEX : HEX ( -- ) 16 BASE ! ; OK
During execution of HEX, the inner interpreter first pushes the literal 16 onto the data stack, then the address of the system variable BASE, and finally it executes !. Simple and straightforward. Now let's have a look at another word:
: SOURCE ( -- CDATA -> CHARACTER UNSIGNED ) SOURCE-ID IF SOURCE-ADDR SOURCE-COUNT ELSE TIB #TIB @ THEN ; OK SEE SOURCE : SOURCE ( -- CDATA -> CHARACTER UNSIGNED ) SOURCE-ID 0BRANCH 10 SOURCE-ADDR SOURCE-COUNT BRANCH 8 TIB #TIB @ ; OK
SOURCE is not executed linearly. The runtime code compiled by IF decides whether to continue with the IF branch (SOURCE-ADDR SOURCE-COUNT) or with the ELSE branch (TIB #TIB @). At the end of the IF branch, the interpreter jumps to the code after the ELSE branch.
IF, ELSE and THEN are immediate words that compile virtual code to make the inner interpreter deviate from the linear flow of execution. The two words that perform these deviations are BRANCH and 0BRANCH:
BRANCH ( -- ) 0BRANCH ( SINGLE -- )
After executing BRANCH, the inner interpreter does not continue with the token that immediately succeeds the one of BRANCH. Instead, it fetches the next token from a position that is calculated by adding a constant offset to the current position within the virtual code. This constant offset is stored in the memory cell directly after the token of BRANCH. In the example of SOURCE, BRANCH is succeeded by the literal 8, which means that 8 address units are skipped. On a typical 16-bit machine, 8 address units are 4 cells. 4 cells are 4 tokens, and the 4 tokens being skipped are 8, TIB, #TIB, and @. Execution continues with the token compiled by ;.
0BRANCH has an input parameter of data type SINGLE. If the value of the parameter is zero, 0BRANCH executes in the same way as BRANCH. In our example, 10 address units or 5 tokens are skipped. The 5 tokens are 10, SOURCE-ADDR, SOURCE-COUNT, BRANCH, and 8. Execution continues with TIB, the first word of the ELSE branch. If, on the other hand, SINGLE is not equal to zero, 0BRANCH just skips its constant offset parameter, and execution continues with SOURCE-ADDR, the first word of the IF branch.BRANCH and 0BRANCH are no special strongForth words. Though not specified by ANS Forth, these two low-level words are used by many Forth systems for implementing conditionals and loops. What's really special about strongForth is the handling of data type information when the linear flow of execution is disturbed by conditional or unconditional branches. You'll learn more about this in the next section.
Let's consider the generalized virtual machine code of a simple conditional with an IF and an ELSE branch:
Label | Virtual Machine Code | Compiler Data Type Heap |
---|---|---|
calculate the condition |
||
L1: | 0BRANCH | ( x1 ... xn condition ) |
L2: | offset to L6 | |
L3: | IF branch |
( x1 ... xn ) |
L4: | BRANCH | ( y1 ... ym ) |
L5: | offset to L7 | |
L6: | ELSE branch |
( x1 ... xn ) |
L7: | continue linear flow of execution |
( y1 ... ym ) |
If the condition is false, 0BRANCH skips the IF branch by directly jumping to the ELSE branch at label L6. At the end of the IF branch, BRANCH skips the ELSE branch by jumping to label L7. Here's where both branches join.
Now, what about the data type information? Assume that at label L1, the following data types are on the compiler data type heap:
( x1 ... xn condition )
During execution of the immediate word IF in compilation state, 0BRANCH is compiled into the virtual machine code. Because 0BRANCH consumes the condition, the compiler data type heap at label L3 looks as follows:
( x1 ... xn )
Both the IF branch and the ELSE branch start with these data types. Since the IF branch might change the compiler data type heap, the contents of the compiler data type heap at label L3 need to be saved in some way. It will be restored at the beginning of the ELSE branch, which is marked by label L6. After the IF branch has been compiled, the compiler data type heap might look totally different:
( y1 ... ym )
These are the contents of the compiler data type heap at label L4. Because of the BRANCH instruction, execution will continue at label L7. Again, the current contents of the compiler data type heap needs to be saved.
Before compiling the ELSE branch, the contents of the compiler data type heap are replaced by its contents at the beginning of the IF branch. At the end of the ELSE branch, both branches join. This means, that the stack effect of the ELSE branch has to be exactly the same as the stack effect of the IF branch. This is ensured by comparing the contents of the compiler data type heap at label L7 with what has been saved at label L4. An exception is thrown if this condition is not met. At L7, execution continues with ( y1 ... ym ) on the compiler data type heap
StrongForth provides three words, that take care of saving and restoring the contents of the compiler data type heap:
FREEZE ( -- CONTROL-FLOW ) REFREEZE ( CONTROL-FLOW CONST -- 1ST ) THAW ( CONTROL-FLOW -- CONST )
An item of data type CONTROL-FLOW contains information about
The second information is needed for calculating the branch offsets. Now, let's have a closer look at the definition of FREEZE:
: FREEZE ( -- CONTROL-FLOW ) ?COMPILE DTP@ IF SPACE@ LOCAL-SPACE HERE DEPTH 2* , DTP@ DEPTH - HERE CAST DATA -> DATA-TYPE DEPTH DUP 2* CELLS ALLOT MOVE SWAP SPACE! ELSE NULL ADDRESS THEN CONST-HERE MERGE CAST CONTROL-FLOW ;
FREEZE saves the current contents of the compiler data type heap in the local name space, provided the compiler data type heap is not locked. The output parameter of data type CONTROL-FLOW, which is a double-cell item, is constructed by merging the local name space pointer and the constant data space pointer. It therefore contains a pointer to the frozen contents of the compiler data type heap, and a pointer to the branch instruction where this snapshot was taken. If the compiler data type heap is locked, a null pointer is merged with the constant data space pointer, because no memory is allocated in the local name space.
The data structure, which FREEZE saves in the local data space, consists of one cell to indicate its size, plus the contents of the compiler data type heap:
2n |
basic data type 1 |
basic data type 2 |
... |
basic data type n |
Note that 2n is the size of the frozen compiler data type heap in cells, which is two times the number of basic data types, because items of data type DATA-TYPE occupy two cells each. ALLOT allocates the necessary space in the local data space, while a simple MOVE copies the data types.
REFREEZE is a word that re-packs an item of data type CONTROL-FLOW by updating the pointer to the virtual machine code without affecting the pointer to the data types. This special function is not required for compiling simple conditional clauses, but it eases the handling of items of data type CONTROL-FLOW in the implementation of more complex control structures like CASE ... OF ... ENDOF ... ENDCASE.
: REFREEZE ( CONTROL-FLOW -- 1ST ) SPLIT DROP CONST-HERE MERGE CAST CONTROL-FLOW ;
In the previous section, you've seen that the frozen contents of the compiler data type heap are used for two purposes:
These two operations are alternatively performed by THAW. THAW expects an item of data type CONTROL-FLOW on the data stack and returns the pointer to the virtual machine code that was included in it by FREEZE. If the compiler data type heap is locked when THAW is executed, there's obviously nothing to compare the frozen data types against. In this case, THAW unlocks the compiler data type heap and copies the frozen data types back to it. Otherwise, if the compiler data type heap is not locked, its contents is compared with the frozen data types. An exception is thrown if the data types do not exactly match.
Finally, what does THAW do if the compiler data type heap was locked when FREEZE created the item of data type CONTROL-FLOW? In this case, there are not data types to either restore to compare. THAW simply locks the compiler data type heap in order to restore its status to the one when FREEZE was executed.
Examples on how to use FREEZE and THAW will be shown in the next section.
Now, after you've learned the basics of branching and about keeping track of the data type information when deviating from the linear flow of execution, it's time to study the implementations of IF, ELSE, THEN and AHEAD. Actually, there's not much left:
DT CONTROL-FLOW PROCREATES ORIGIN : IF ( -- ORIGIN ) ?COMPILE POSTPONE 0BRANCH FREEZE CAST ORIGIN CONST-CELL-ALLOT ; IMMEDIATE : THEN ( ORIGIN -- ) ?COMPILE THAW CONST-HERE OVER - SWAP -> INTEGER ! ; IMMEDIATE : AHEAD ( -- ORIGIN ) ?COMPILE POSTPONE BRANCH FREEZE CAST ORIGIN DTP| CONST-CELL-ALLOT ; IMMEDIATE : ELSE ( ORIGIN -- 1ST ) POSTPONE AHEAD SWAP POSTPONE THEN ; IMMEDIATE
All of these are immediate words that may only be executed during compilation. That's pretty obvious, because they all compile virtual machine code. IF compiles the token of 0BRANCH and then freezes the contents of the compiler data type heap. The branch offset depends on the size of the IF branch. Since the IF branch is yet to be compiled, the offset is unknown at this point, and IF just allocates one memory cell in the constant data space. ORIGIN is a subtype of CONTROL-FLOW. It is created at the origin of a branch. You'll later see that CONTROL-FLOW has a second subtype DESTINATION, which is created at the destination of a branch.
If the IF branch is terminated by THEN, i. e., if there's no ELSE branch, the contents of the compiler data type heap at the beginning and at the end of the IF branch should exactly match.
Label | Virtual Machine Code | Compiler Data Type Heap |
---|---|---|
calculate the condition |
||
L1: | 0BRANCH | ( x1 ... xn condition ) |
L2: | offset to L7 | |
L3: | IF branch |
( x1 ... xn ) |
L7: | continue linear flow of execution |
( x1 ... xn ) |
Because the compiler data type heap is not locked when THEN is executed, THAW just compares the contents of the compiler data type heap with the frozen data types. THAW returns the pointer to the branch offset in the constant data space (L2). Remember that the cell for the the offset was allocated by IF. Since THEN marks the end of the IF branch (L7), it can now calculate the branch offset by subtracting the pointer to the branch offset from the current constant data space pointer, and store the result in this cell. That's all.
AHEAD is an ANS Forth word that is very similar to IF. The difference to IF is that AHEAD compiles BRANCH instead of 0BRANCH, which means that the branch is always skipped, independently of any condition. Furthermore, it locks the compiler data type heap, because execution will not continue after a BRANCH instruction.
This doesn't make sense? Well, certainly not as long as you think about compiling code that looks like this:
... AHEAD ... THEN ...
But if you look back at the virtual code compiled by
... IF ... ELSE ... THEN ...you'll see that the ELSE branch is actually nothing else than an AHEAD branch, because it is preceeded by a BRANCH instruction. And indeed, AHEAD is included in the definition of ELSE. Actually, ELSE has two tasks:
ELSE begins with the second task. It executes AHEAD to freeze the contents of the compiler data type heap at the end of the IF branch, and compiles a BRANCH instruction. Then, it performs the first task by executing THEN. THEN calculates the branch offset at label L2. Since the compiler data type heap was just locked by AHEAD, THEN restores the contents the compiler data type heap to the status at label L3, instead of comparing them with the frozen data types. This means, the ELSE branch starts with the same compiler data type heap as the IF branch. The item of data type ORIGIN, which ELSE returns, contains the contents of the compiler data type heap at the end of the IF branch.
The final THEN at the end of the ELSE branch will do exactly the same like the THEN at the end of an IF ... THEN conditional clause, with the exception that it calculates the offset of a BRANCH instruction instead of the offset of a 0BRANCH instruction. It compares the contents of the compiler data type heap with the data types that were frozen at the end of the IF branch.
The two branch instructions BRANCH and 0BRANCH are also used for implementing loop structures like BEGIN ... UNTIL, BEGIN ... AGAIN and BEGIN ... WHILE ... REPEAT. The main difference between conditionals and loops is that conditionals contain forward branches, while loops contain backward branches. Before going into details, let's have a look at the virtual code of a typical loop:
: #S ( NUMBER-DOUBLE -- 1ST ) BEGIN # DUP 0= UNTIL ; OK SEE #S : #S ( NUMBER-DOUBLE -- 1ST ) # DUP 0= 0BRANCH -8 ; OK
You can see that BEGIN does not compile any virtual code at all. UNTIL compiles a conditional backward branch. Here's the general structure of a BEGIN ... UNTIL loop:
Label | Virtual Machine Code | Compiler Data Type Heap |
---|---|---|
linear flow of execution |
||
L1: | loop body |
( x1 ... xn ) |
L2: | calculate the condition |
|
L3: | 0BRANCH | ( x1 ... xn condition ) |
L4: | offset to L1 | |
L5: | continue linear flow of execution |
( x1 ... xn ) |
Since you already know how conditionals are implemented in strongForth, it should be pretty obvious what BEGIN and UNTIL do. BEGIN freezes the contents of the compiler data type heap and remembers the location in the virtual machine code where the loop begins. UNTIL compiles a conditional branch to this location and verifies that the contents of the compiler data type heap at the end of the loop matches its contents at the beginning of the loop. Let's see if this is correct:
DT CONTROL-FLOW PROCREATES DESTINATION : BEGIN ( -- DESTINATION ) ?COMPILE FREEZE CAST DESTINATION ; IMMEDIATE : UNTIL ( DESTINATION -- ) ?COMPILE POSTPONE 0BRANCH THAW CONST-HERE - CONST, ; IMMEDIATE
Indeed. BEGIN just freezes the contents of the compiler data type heap. DESTINATION is a subtype of CONTROL-FLOW.
After UNTIL has compiled 0BRANCH, the condition is removed from the compiler data type heap. Because execution might continue at the beginning of the loop, THAW compares the contents of the compiler data type heap at the end and at the beginning of the loop's body. Note that the compiler data type heap is not locked. Finally, UNTIL compiles the (negative) branch offset, which is the distance from the current location (L4) to the one that was saved by BEGIN (L1).
The definition of AGAIN is quite similar to the definition of UNTIL. Instead of a conditional backward branch, AGAIN compiles an unconditional backward branch. Furthermore, AGAIN locks the compiler data type heap, because the virtual machine code succeeding the unconditional branch will never be executed. Here's the definition of AGAIN, and a diagram of the general structure of the virtual code that constitutes such an infinite loop:
: AGAIN ( DESTINATION -- ) ?COMPILE POSTPONE BRANCH THAW DTP| CONST-HERE - CONST, ; IMMEDIATE
Label | Virtual Machine Code | Compiler Data Type Heap |
---|---|---|
linear flow of execution |
||
L1: | loop body |
( x1 ... xn ) | L3: | BRANCH | ( x1 ... xn ) |
L4: | offset to L1 |
Loops that are constructed with BEGIN ... WHILE ... REPEAT are actually a combination of an infinite loop (BEGIN ... AGAIN) and a conditional clause (IF ... THEN):
Label | Virtual Machine Code | Compiler Data Type Heap |
---|---|---|
linear flow of execution |
||
L1: | loop body (part 1) |
( x1 ... xn ) |
L2: | calculate the condition |
|
L3: | 0BRANCH | ( y1 ... ym condition ) |
L4: | offset to L8 | |
L5: | loop body (part 2) |
( y1 ... ym ) | L6: | BRANCH | ( x1 ... xn ) |
L7: | offset to L1 | |
L8: | continue linear flow of execution |
( y1 ... ym ) |
The semantics of WHILE is almost identical to the semantics of IF. It just handles the additional item of data type DESTINATION, which was created by BEGIN. SWAP takes care that this item stays on top of the data stack. At the end of the loop, REPEAT executes first AGAIN and then THEN. AGAIN compiles an unconditional backward branch using DESTINATION. Now, only ORIGIN is left on the stack. ORIGIN was created by the IF within WHILE and is consumed by the THEN within REPEAT. These are the definitions of WHILE and REPEAT:
: WHILE ( DESTINATION -- ORIGIN 1ST ) POSTPONE IF SWAP ; IMMEDIATE : REPEAT ( ORIGIN DESTINATION -- ) POSTPONE AGAIN POSTPONE THEN ; IMMEDIATE
Note that the body of the loop is split into two parts. The first part starts with ( x1 ... xn ) on the compiler data type heap and ends with ( y1 ... ym condition ). The second part starts with ( y1 ... ym ), because the condition is consumed by 0BRANCH, and ends with ( x1 ... xn ). After the end of the loop, execution continues with ( y1 ... ym ).
Although strongForth additionally keeps track of the data type information, the definitions of IF, ELSE, THEN, BEGIN, UNTIL, AGAIN, WHILE and REPEAT are not more complex that the respective definitions in an ANS Forth implementation. FREEZE and THAW do most of the job.
DO loops cannot be implemented by compiling BRANCH and 0BRANCH. The reason is, that DO loops need to create and discard loop control parameters on entry and on exit of the loop, respectively. Let's first view a basic example:
: COUNTER 10 0 DO I . LOOP ; OK COUNTER 0 1 2 3 4 5 6 7 8 9 OK SEE COUNTER : COUNTER ( -- ) 10 0 (DO) 12 (R@) 0 . (LOOP) -8 ; OK
You can easily see that DO compiles (DO), I compiles (R@) and LOOP compiles (LOOP). Each one of these three low-level words has an additional offset parameter included in the virtual code.
(R@) is a word you already know in connection with locals. It fetches one item at the specified offset (in this case, 0) from the return stack and pushes it onto the data stack. This is not surprising, because the loop index I is nothing else but a local. I is created by DO and destroyed by LOOP, which means that it is only visible at compile time between these two words. Outside of the loop, I does not exist:: INVISIBLE 10 0 DO I . LOOP I . ; : INVISIBLE 10 0 DO I . LOOP I ? undefined word
Now it's pretty obvious what (DO) and (LOOP) are doing. (DO) pops both the loop limit and the loop index from the data stack and pushes them onto the return stack, because locals are always stored on the return stack. But only the loop index can easily be accessed using I. The loop limit is actually hidden behind the loop index. It is only used by (LOOP).
That is all (DO) does. So why does it have an additional offset parameter (12)? This is the offset in address units from the location where the offset is stored to the location of the first token after the loop. (DO) does not use this parameter for any purpose, but (?DO) and LEAVE do. You'll learn more about DO's offset parameter later in this section.
Naturally, the loop index needs to have a data type that is countable in some way. Out of the subtypes of data type SINGLE, only INTEGER and ADDRESS meet this requirement. Items of data types LOGICAL, TOKEN and MEMORY-SPACE are not countable. As a result, we need two overloaded versions of (DO):
(DO) ( ADDRESS 1ST -- ) (DO) ( INTEGER 1ST -- )
Just like ANS Forth, strongForth does not allow double-cell items as loop index.
An interesting side effect of the loop index I being a local variable is the fact that it's value may be changed by TO. Here's an example:
: HICCUP ( -- ) 20 0 DO I 7 = IF 15 TO I THEN I . LOOP ; OK HICCUP 0 1 2 3 4 5 6 15 16 17 18 19 OK
However, please note that this feature is not compliant to ANS Forth at all. Applying TO to a loop index generates non-standard code that might not be portable.
Another consequence of the loop index and the loop limit being locals is the fact that you don't need to take care about them when you exit a loop:
: EXIT-LOOP ( -- ) 10 0 DO I . I 6 > IF EXIT THEN LOOP ; OK EXIT-LOOP 0 1 2 3 4 5 6 7 OK
As explained in chapter 10, EXIT automatically removes all locals from the return stack before returning to the calling definition. Compiling the ANS Forth word UNLOOP immediately before EXIT in order to get rid of the loop control parameters is not necessary. Actually, UNLOOP is not even defined in strongForth. But if you want to have it for compatibility reasons, you may certainly define it as a dummy word:
: UNLOOP ;
Now, what about (LOOP)? (LOOP) does not consume any items from the data stack. Does this mean that there's only one version? Let's see:
WORDS (LOOP) (LOOP) ( CADDRESS -- ) (LOOP) ( ADDRESS -> DOUBLE -- ) (LOOP) ( ADDRESS -> SINGLE -- ) (LOOP) ( ADDRESS -- ) (LOOP) ( INTEGER -- ) OK
Bad guess. But where does the input parameter for (LOOP) come from? It obviously has the data type of the loop index. Since (LOOP) does not really consume an item from the data stack, the parameter might be called a dummy parameter. It's only purpose is for LOOP to select different overloaded versions of (LOOP) for different data types of the loop index. The versions for loop indexes of data type INTEGER and ADDRESS simply increment the loop index by 1, whereas the versions for data types ADDRESS -> SINGLE, ADDRESS -> DOUBLE and CADDRESS increment the loop index by the number of address units per single-cell, per double-cell or per character, respectively. You might consider the strongForth word 1+ being hidden inside (LOOP). The same 5 overloaded versions, and in the same dictionary order, exist for 1+ as well:
WORDS 1+ 1+ ( INTEGER-DOUBLE -- 1ST ) 1+ ( CFAR-ADDRESS -- 1ST ) 1+ ( FAR-ADDRESS -> DOUBLE -- 1ST ) 1+ ( FAR-ADDRESS -> SINGLE -- 1ST ) 1+ ( FAR-ADDRESS -- 1ST ) 1+ ( CADDRESS -- 1ST ) 1+ ( ADDRESS -> DOUBLE -- 1ST ) 1+ ( ADDRESS -> SINGLE -- 1ST ) 1+ ( ADDRESS -- 1ST ) 1+ ( INTEGER -- 1ST ) OK
Like (DO), (LOOP) needs a branch offset in the virtual machine code. It's the address difference between the location where the offset is stored and the beginning of the loop, i. e., the location immediately following the branch offset of the (DO) instruction. (LOOP) increments the loop index according to the rules for the corresponding data type and then compares its value to the loop limit. If loop index and loop limit are equal, execution continues with the next instruction, otherwise a branch to the beginning of the loop body is executed.
To summarize, here's a diagram showing the structure of the virtual code that constitutes a DO loop:
Label | Virtual Machine Code | Compiler Data Type Heap |
---|---|---|
linear flow of execution |
||
L1: | (DO) | ( x1 ... xn limit index ) |
L2: | offset to L6 | |
L3: | loop body |
( x1 ... xn ) |
L4: | (LOOP) | ( x1 ... xn ) |
L5: | offset to L3 | |
L6: | continue linear flow of execution |
( x1 ... xn ) |
Compiling a DO loop is similar to compiling a conditional clause or a BEGIN loop. We have to compile the low-level tokens (DO) and (LOOP), calculate and compile branch offsets, and freeze and thaw the contents of the compiler data type heap in order to guarantee data type consistency inside and outside of the loop. In addition to that, DO and LOOP have to create and destroy a local for the loop index, respectively. And finally, they have to take care about nested loops. If a DO loop is nested inside another DO loop, the loop index has to be temporarily renamed from I to J. You'll see in a minute how this is actually being done.
The definition of DO does not look spectacular:
DT ORIGIN PROCREATES LOOP-ORIGIN : DO ( -- LOOP-ORIGIN ) ?COMPILE POSTPONE (DO) NEST-DO ; IMMEDIATE
It just compiles the token of (DO). Everything else is done by a word called NEST-DO:
: NEST-DO ( -- LOOP-ORIGIN ) FREEZE CAST LOOP-ORIGIN CONST-CELL-ALLOT 2 #LOCALS +! " I" TRANSIENT OVER OVER FIND-LOCAL 0> IF 1 NESTING ELSE DROP THEN CREATE-LOCAL SPACE@ LOCAL-SPACE OVER , SPACE! ;
Now things are getting more interesting. NEST-DO starts freezing the contents of the compiler data type heap. Since (DO) has already been compiled, the loop limit and the loop index are no longer present. The output parameter of FREEZE is being casted to data type LOOP-ORIGIN, which is a subtype of data type ORIGIN. Using a dedicated data type effectively prevents syntax violations caused by invalid mixing of DO and LOOP with other words creating loops or conditionals:
: WRONG ( FLAG -- ) IF 817 . LOOP ; : WRONG ( FLAG -- ) IF 817 . LOOP ? undefined word
On the other hand, LOOP-ORIGIN being a subtype of ORIGIN does not prevent compiling stuff like this:
: POSSIBLE ( SIGNED 1ST -- ) DO I . THEN ; OK +7 -2 POSSIBLE -2 OK
THEN resolves the forward reference created by DO, and you'll see later in this section that LOOP itself actually compiles THEN for exactly this purpose. In this case, the loop body will always be executed once, and it is rather doubtful whether this construction makes any sense from a semantical point of view. However, it is syntactically allowed, and executing POSSIBLE does not lead to a system crash. The loop index exists beyond THEN, and, being just another local, will be cleanly wiped away at the end of the definition.
Let's get back to NEST-DO. CONST-CELL-ALLOT allocates one cell of virtual machine code for the branch offset, which will later be calculated by LOOP. Since both the loop index and the loop limit are being pushed to the return stack at runtime, two cells for locals are required. But before the local I for the loop index is actually being created, NEST-DO checks whether a loop index with this name already exists. If it does, it might be the loop index of an outer DO loop.
1 NESTING
renames it to J. Next, the local I for the current loop is created. But what if a local with name I has been defined which is not the loop index of an outer loop? Consider this case:
: TRY-IT ( -- ) 10 LOCALS| I | 4 0 DO I . LOOP I . ; OK TRY-IT 0 1 2 3 10 OK
No problem. An ordinary local that has been assigned the name I is not renamed by NEST-DO. Because its index value is negative, it can easily be distinguished from a loop index, whose index value is positive. The ordinary local I won't be visible inside the loop anyway, because it is hidden by the loop index.
What does NESTING do exactly? It consumes a pointer to the data type of the local it is supposed to rename from I to J, and an additional numerical parameter:
NESTING ( DATA -> DATA-TYPE INTEGER -- )
Given the memory layout of locals in the local name space, NESTING locates the name field. If the name is exactly one character long, and only then, it adds the numerical parameter to the value of the character. In the case of NEST-DO, 1 added to I results in J. Later, within LOOP, NESTING is called with a numerical parameter of -1 in order to reverse this renaming. Note that NESTING does not throw an exception if the name of the local has more than one character.
Again, back to NEST-DO. After creating the loop index as a local, NEST-DO's final action is to append the control flow, which was returned by FREEZE, to the local name space. You will later see that this additional information is required by LEAVE. It can be accessed from the loop index by traversing the output parameter.
Next, we'll have a closer look at the definition of LOOP:
: LOOP ( LOOP-ORIGIN -- ) ?LOOP @>DT POSTPONE (LOOP) NEST-LOOP ; IMMEDIATE
Just like with DO, most of the work is done in the internal words. The first one is ?LOOP:
: ?LOOP ( -- DATA -> DATA-TYPE ) ?COMPILE " I" TRANSIENT FIND-LOCAL 0> INVERT IF -26 THROW THEN ;
?LOOP verifies the existence of the loop index I, and returns a pointer to its data type. If not in compilation state, or if the loop index does not exist, ?LOOP throws appropriate exceptions.
After executing ?LOOP, LOOP compiles (LOOP). You already know that (LOOP) expects a dummy parameter with the data type of the loop index on the stack. @>DT is used to add the data type of the dummy parameter to the compiler data type heap. The rest of LOOP's job is being done by NEST-LOOP:
: NEST-LOOP ( LOOP-ORIGIN -- ) DUP THAW 1 CELLS + CONST-HERE - CONST, +2 FORGET-LOCAL " J" TRANSIENT FIND-LOCAL IF -1 NESTING ELSE DROP THEN POSTPONE THEN ;
The main task of NEST-LOOP is resolving two branch offsets. The first one is from (DO) to the end of the loop, and the second one is from (LOOP) to the beginning of the loop. THAW checks the congruence of the compiler data type heap at the beginning and at the end of the DO loop and returns the address of the virtual machine code at the point where LOOP-ORIGIN was created. This is L2 in the above diagram. The offset to be compiled is the difference between L5, the current virtual machine code location, and L3, the beginning of the loop body. The second branch offset is now resolved.
Next, NEST-LOOP discards the loop control parameters. If the DO loop is nested inside another DO loop, as indicated by the existence of a loop index with name J, the outer DO loop's loop index needs to be renamed back from J to I. The last action of NEST-LOOP is resolving the first branch offset by executing THEN.
Finally, here's an example of a nested loop, that uses local I within the outer loop and both locals I and J within the inner loop:
: MULTITABLE ( -- ) 11 1 DO 11 1 DO I J * +4 .R LOOP CR LOOP ; OK MULTITABLE 1 2 3 4 5 6 7 8 9 10 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20 24 28 32 36 40 5 10 15 20 25 30 35 40 45 50 6 12 18 24 30 36 42 48 54 60 7 14 21 28 35 42 49 56 63 70 8 16 24 32 40 48 56 64 72 80 9 18 27 36 45 54 63 72 81 90 10 20 30 40 50 60 70 80 90 100 OK
ANS Forth specifies two additional words, that can be used for constructing DO loops. If ?DO is used instead of DO, the loop index is compared against the loop limit before the loop body is executed the first time. With ?DO, it is possible to implement loops that are executed zero times, whereas loops starting with DO are always executed at least once.
At the other end of the loop, LOOP may be replaced by +LOOP. +LOOP expects an additional numerical parameter that is added to the loop index after each execution of the loop body. This parameter can even have a negative value. LOOP, on the other hand, always increments the loop index by one. But remember that the actual value of one depends on the data type of the loop index.
The definition of ?DO differs from the definition of DO only by the fact that it compiles (?DO) instead of (DO):
: ?DO ( -- LOOP-ORIGIN ) ?COMPILE POSTPONE (?DO) NEST-DO ; IMMEDIATE
In addition to the semantics of (DO), (?DO) compares the loop index with the loop limit. If both are equal, it takes a branch to the first instruction after the loop, which is labelled L6 in the above virtual machine code diagram. Note that (?DO) is executed only once before entering or skipping the loop. All succeeding tests of the loop index are performed by (LOOP) or (+LOOP). Just like (DO), (?DO) is defined as two overloaded versions for loop indexes of data types INTEGER and ADDRESS:
(?DO) ( ADDRESS 1ST -- ) (?DO) ( INTEGER 1ST -- )
Now let's see how +LOOP works. Again, there's only a minor difference to LOOP:
: +LOOP ( LOOP-ORIGIN -- ) ?LOOP @>DT POSTPONE (+LOOP) NEST-LOOP ; IMMEDIATE
+LOOP compiles (+LOOP) instead of (LOOP). There are 5 overloaded versions of (+LOOP), which look quite similar to those of (LOOP):
WORDS (+LOOP) (+LOOP) ( INTEGER CADDRESS -- ) (+LOOP) ( INTEGER ADDRESS -> DOUBLE -- ) (+LOOP) ( INTEGER ADDRESS -> SINGLE -- ) (+LOOP) ( INTEGER ADDRESS -- ) (+LOOP) ( INTEGER INTEGER -- ) OK
The first parameter is always the increment for the loop index. This is a real parameter. The second parameter is the dummy parameter whose sole purpose is to make the compiler select the overloaded version of (+LOOP) which matches the data type of the loop index. Instead of incrementing the loop index by one, (+LOOP) increments the loop index by the value of the numerical parameter of data type INTEGER. But again, if the data type of the loop index is ADDRESS or a subtype of ADDRESS, the addition executes according to the rules for address arithmetic. For example, if the loop index is an address of a single-cell item (ADDRESS -> SINGLE), it is incremented by the value of the numerical parameter, multiplied by the size of a cell in address units. Similar address arithmetics apply if the loop index is an address of a double-cell item or a character address. If you're in doubt, have a look back at the section on address arithmetic in chapter 3 and imagine the word + being embedded in (+LOOP).
The loop exit condition of (LOOP) is pretty simple. (LOOP) exits the loop if the loop index equals the loop limit. For (+LOOP), the exit condition is more complicated, because it is possible that this condition is never met, like in the following example:
: ENDLESS-LOOP? ( -- ) 18 0 DO I . 4 +LOOP ; OK ENDLESS-LOOP? 0 4 8 12 16 OK
The loop exits correctly, although the loop index never equals the loop limit. Actually, (+LOOP) exits the loop when the loop index crosses the border between the loop limit minus one and the loop limit. In the example above, this is between 17 and 18. Furthermore, the rule applies in either direction, because the loop increment might be negative:
: COUNTBACK ( -- ) -6 +6 DO I . -2 +LOOP ; OK COUNTBACK 6 4 2 0 -2 -4 -6 OK
In this example, the border is between -6 and -7. The loop exits when (+LOOP) tries to subtract 2 from the loop index with value -6.
LEAVE is a word that does not fit well into the structure of DO loops, because it does not constitute a syntactical unit together with DO, ?DO, LOOP and +LOOP. LEAVE is typically nested inside other syntactical structures, as in the following example:
: TEST ( -- ) 10 0 DO I . I 7 > IF [ .S ] LEAVE THEN LOOP ; COLON-DIAGRAM LOOP-ORIGIN ORIGIN OK
.S has been inserted here in order to visualize the contents of the interpreter data type heap immediately before LEAVE is executed. It demonstrates that the loop control flow parameter (LOOP-ORIGIN) is not directly available to LEAVE, because LEAVE is not directly nested inside the DO ... LOOP structure. Any assumtions about where to find the loop control flow parameter on the stack are void, because LEAVE may even be nested deeper inside conditional clauses or loops. The only way by which LEAVE can get access to the loop control flow parameter is by locating the loop index. And that's the reason why NEST-DO stores the loop control-flow parameter together with the loop index in the local name space.
: LEAVE ( -- ) ?LOOP POSTPONE (LEAVE) BEGIN DUP 1+ SWAP @ DT-PREFIX ATTRIBUTE? INVERT UNTIL CAST DATA -> LOOP-ORIGIN @ THAW DTP| CONST-HERE - CONST, ; IMMEDIATE
After locating the loop index and compiling the token of (LEAVE), LEAVE traverses the output parameter of the loop index, because NEST-DO stored the loop control flow parameter directly behind it. THAW makes sure that the current contents of the compiler data type heap matches its contents at the beginning (and at the end) of the loop. Next, the compiler data type heap is locked, because LEAVE actually compiles an unconditional branch. Finally, the branch offset from the current virtual machine code location to the beginning of the loop is calculated and compiled.
To the beginning of the loop? Yes. Since the end of the loop has not yet been compiled, there's no other choice. LOOP or +LOOP will not be able to resolve the branch offsets of one or more occurences of LEAVE. That leads to the question what (LEAVE) really does. It's stack diagram does not yield any clue:
(LEAVE) ( -- )
But you know that (LEAVE) is an unconditional branch, that it's parameter is the offest to the beginning of the loop and that the branch destination is the first instruction after the loop. Now it should be pretty obvious. The parameter of (DO) is the branch offset to the end of the loop. (LEAVE) uses it's own parameter for calculating the location of the (DO) instruction, and then it uses (DO)'s parameter for branching to the end of the loop. This looks awkward, but it works fine. Of course, it would have been more efficient to compile BRANCH instead of (LEAVE), but this solution would require keeping a record of all LEAVE instructions inside a DO loop, so that LOOP or +LOOP could resolve the branch offsets. The actually implemented solution is much simpler.
Conditionals built with CASE, OF, ENDOF and ENDCASE are more complicated that simple IF ... THEN ... ELSE conditionals. As usual in this chapter, let's start with an example before digging into the details of the implementation:
: .NOUN ( UNSIGNED -- ) CASE 0 OF ." ZERO" ENDOF 1 OF ." ONE" ENDOF 2 OF ." TWO" ENDOF ." MANY" ENDCASE ; OK 1 .NOUN ONE OK 3 .NOUN MANY OK SEE .NOUN : .NOUN ( UNSIGNED -- ) 0 -BRANCH 18 DROP " ZERO" TYPE BRANCH 56 1 -BRANCH 16 DROP " ONE" TYPE (ENDOF) -20 2 -BRANCH 16 DROP " TWO" TYPE (ENDOF) -42 " MANY" TYPE DROP ; OK
CASE, OF, ENDOF and ENDCASE are immediate words. By comparing the definition of .NOUN with the output of SEE, you can easily identify what is compiled by each of these words. CASE does not compile anything, just like BEGIN. OF compiles the special branch instruction -BRANCH including an offset parameter, plus the token of DROP. -BRANCH expects two single-cell items of the same data type on the stack. It returns only the first one of them, which is the case selector:
-BRANCH ( SINGLE 1ST -- 1ST )
If both items have the same value, -BRANCH continues with the next instruction, entering the OF ... ENDOF branch. DROP discards the case selector. If the values of the two items are not equal, -BRANCH performs a branch to the location after the next ENDOF. The branch offset is calculated by ENDOF.
ENDOF is pretty interesting. Its first occurrence inside the CASE ... ENDCASE structure compiles a BRANCH instruction to the location following ENDCASE, whereas the other occurences of ENDOF compile the token of a word called (ENDOF) with a negative offset. (ENDOF) has no parameters:
(ENDOF) ( -- )
Why not compiling another BRANCH instead of (ENDOF)? Because this would mean that ENDCASE had to resolve more than one branch offset. The offset of (ENDOF) points back to the branch offset at the end of the first OF ... ENDOF branch. (ENDOF)'s offset is a backward reference, which can immediately be resolved by ENDOF. In order to branch to the first instruction after the CASE ... ENDCASE structure, (ENDOF) first uses its own offset for calculating the address of the BRANCH instruction at the end of the first OF ... ENDOF branch, and then uses the other offset for branching to the end of the structure. This technique is similar to (LEAVE).
Finally, ENDCASE just compiles DROP to get rid of the case selector. The below diagram visualizes the virtual code that constitutes a complete CASE ... ENDCASE structure.
Label | Virtual Machine Code | Compiler Data Type Heap |
---|---|---|
calculate case selector |
||
L1: | calculate case 1 |
( x1 ... xn selector ) |
L2: | -BRANCH | ( x1 ... xn selector s1 ) |
L3: | offset to L8 | |
L4: | DROP | ( x1 ... xn selector ) |
L5: | body 1 |
( x1 ... xn ) |
L6: | BRANCH | ( y1 ... ym ) |
L7: | offset to L18 | |
L8: | calculate case 2 |
( x1 ... xn selector ) |
L9: | -BRANCH | ( x1 ... xn selector s2 ) |
L10: | offset to L15 | |
L11: | DROP | ( x1 ... xn selector ) |
L12: | body 2 |
( x1 ... xn ) |
L13: | (ENDOF) | ( y1 ... ym ) |
L14: | offset to L7 | |
L15: | other cases |
|
L16: | default body |
( x1 ... xn selector ) |
L17: | DROP | ( y1 ... yn selector ) |
L18: | continue linear flow of execution |
( y1 ... ym ) |
Now, let's see how this virtual code is created by the four compiling words CASE, OF, ENDOF and ENDCASE. Data type consistency is handled by items of two new data types, which are subtypes of data type ORIGIN:
DT ORIGIN PROCREATES OF-ORIGIN DT ORIGIN PROCREATES ENDOF-ORIGIN
Items of data type OF-ORIGIN represent the contents of the compiler data type heap at the beginning of the CASE ... ENDCASE structure and at the beginning of each case, as marked by OF. In the diagram above, this is denoted by
( x1 ... xn selector )
The contents of the compiler data type heap at the end of each case, i. e., at the location where ENDOF is executed, is represented by items of data type ENDOF-ORIGIN:
( y1 ... ym )
Now, here's the definition of CASE:
: CASE ( -- ENDOF-ORIGIN OF-ORIGIN ) ?COMPILE NULL ENDOF-ORIGIN FREEZE CAST OF-ORIGIN ; IMMEDIATE
CASE does not compile any virtual code. It just freezes the contents of the compiler data type heap in an item of data type OF-ORIGIN. At the location of each OF, the compiler data type heap shall have exactly these contents. CASE also provides an item of data type ENDOF-ORIGIN, but since ( y1 ... yn ) is unknown before the first occurrence of ENDOF, its value is still null.
The next word within the syntax of a CASE ... ENDCASE structure is OF:
: OF ( ENDOF-ORIGIN OF-ORIGIN -- 2ND 1ST ) ?COMPILE POSTPONE -BRANCH DUP THAW DROP REFREEZE SWAP CONST-CELL-ALLOT POSTPONE DROP ; IMMEDIATE
OF expects the two parameters created by CASE on the data stack. However, it doesn't touch the item of data type ENDOF-ORIGIN. After compiling the token of -BRANCH, OF checks whether the contents of the compiler data type heap matches these that were frozen by CASE. At this point, the case selector is still present. The address value returned by THAW is not required, because CASE does not mark any forward or backward reference, and the branch compiled by the preceding OF has already been resolved by teh preceding ENDOF. For the next ENDOF to resolve the forward reference, REFREEZE merges the current virtual code location into the item of data type OF-ORIGIN. The branch offset of -BRANCH is allocated by CONST-CELL-ALLOT. Finally, OF compiles DROP to get rid of the case selector.
In order to avoid possible syntax violations, the order of the two items of data types OF-ORIGIN and ENDOF-ORIGIN on the data stack is reversed between OF and ENDOF. This ensures that OF and ENDOF have to be used alternating. Two OFs or two ENDOFs in a row are rejected by the compiler.
: ENDOF ( OF-ORIGIN ENDOF-ORIGIN -- 2ND 1ST ) ?COMPILE DUP 0= IF DROP POSTPONE AHEAD CAST ENDOF-ORIGIN ELSE POSTPONE (ENDOF) DUP THAW CONST-HERE - CONST, THEN DTP| SWAP DUP POSTPONE THEN ; IMMEDIATE
The definition of ENDOF contains a conditional clause that distinguishes the first from all succeeding ENDOFs within a CASE ... ENDCASE structure. The first ENDOF compiles a BRANCH instruction with a forward reference to be resolved by ENDCASE, and freezes the compiler data type heap as denoted by ( y1 ... yn ). All succeeding ENDOFs compile (ENDOF) and immediately resolve the offset to point the BRANCH instruction of the first ENDOF. Additionally, they compare the current contents of the compiler data type heap with those that were frozen by the first ENDOF. Because BRANCH as well as (ENDOF) discontinue the linear flow of execution, the compiler data type heap has to be locked. THEN resolves the forward reference that was created by the preceding OF.
: ENDCASE ( ENDOF-ORIGIN OF-ORIGIN -- ) ?COMPILE DROP POSTPONE DROP DUP 0= IF DROP ELSE POSTPONE THEN THEN ; IMMEDIATE
ENDCASE finalizes the CASE ... ENDCASE structure. It removes the two control flow items that were created by CASE, and resolves the forward reference of the first ENDOF. The branches of all other ENDOFs are automatically being resolved, because they reference the offset of the branch instruction compiled by the first ENDOF. In order to get rid of the case selector, ENDCASE compiles a final DROP.
A special case needs to be considered if the CASE ... ENDCASE structure does not contain any OFs and ENDOFs at all. Semantically, this makes no sense, but syntactically it is allowed:
: NONSENSE ( CHARACTER -- ) CASE ." Default" ENDCASE ; OK CHAR A NONSENSE Default OK SEE NONSENSE : NONSENSE ( CHARACTER -- ) " Default" TYPE DROP ; OK
This case can be recognized by ENDCASE, because the item of data type ENDOF-ORIGIN is still null. Instead of resolving a forward reference, ENDCASE just cleans up the data stack.
Dr. Stephan Becher - November 14th, 2005