--- gforth/Attic/gforth.ds 1994/11/14 19:01:16 1.2
+++ gforth/Attic/gforth.ds 1994/11/23 16:54:39 1.3
@@ -746,14 +746,14 @@ entry represents a backward branch targe
 building any control structure possible (except control structures
 that need storage, like calls, coroutines, and backtracking).
 
-if
-ahead
-then
-begin
-until
-again
-cs-pick
-cs-roll
+doc-if
+doc-ahead
+doc-then
+doc-begin
+doc-until
+doc-again
+doc-cs-pick
+doc-cs-roll
 
 On many systems control-flow stack items take one word, in gforth they
 currently take three (this may change in the future). Therefore it is a
@@ -763,41 +763,102 @@ words.
 
 Some standard control structure words are built from these words:
 
-else
-while
-repeat
+doc-else
+doc-while
+doc-repeat
 
 Counted loop words constitute a separate group of words:
 
-?do
-do
-for
-loop
-s+loop
-+loop
-next
-leave
-?leave
-unloop
-undo
+doc-?do
+doc-do
+doc-for
+doc-loop
+doc-s+loop
+doc-+loop
+doc-next
+doc-leave
+doc-?leave
+doc-unloop
+doc-undo
 
 The standard does not allow using @code{cs-pick} and @code{cs-roll} on
 @i{do-sys}. Our system allows it, but it's your job to ensure that for
 every @code{?DO} etc. there is exactly one @code{UNLOOP} on any path
-through the program (@code{LOOP} etc. compile an @code{UNLOOP}). Also,
-you have to ensure that all @code{LEAVE}s are resolved (by using one of
-the loop-ending words or @code{UNDO}).
+through the definition (@code{LOOP} etc. compile an @code{UNLOOP} on the
+fall-through path). Also, you have to ensure that all @code{LEAVE}s are
+resolved (by using one of the loop-ending words or @code{UNDO}).
 
 Another group of control structure words are
 
-case
-endcase
-of
-endof
+doc-case
+doc-endcase
+doc-of
+doc-endof
 
 @i{case-sys} and @i{of-sys} cannot be processed using @code{cs-pick} and
 @code{cs-roll}.
 
+@subsubsection Programming Style
+
+In order to ensure readability we recommend that you do not create
+arbitrary control structures directly, but define new control structure
+words for the control structure you want and use these words in your
+program.
+
+E.g., instead of writing
+
+@example
+begin
+  ...
+if [ 1 cs-roll ]
+  ...
+again then
+@end example
+
+we recommend defining control structure words, e.g.,
+
+@example
+: while ( dest -- orig dest )
+ POSTPONE if
+ 1 cs-roll ; immediate
+
+: repeat ( orig dest -- )
+ POSTPONE again
+ POSTPONE then ; immediate
+@end example
+
+and then using these to create the control structure:
+
+@example
+begin
+  ...
+while
+  ...
+repeat
+@end example
+
+That's much easier to read, isn't it? Of course, @code{WHILE} and
+@code{REPEAT} are predefined, so in this example it would not be
+necessary to define them.
+
+@subsection Calls and returns
+
+A definition can be called simply by writing the name of the
+definition. When the end of the definition is reached, it returns. An
+earlier return can be forced using
+
+doc-exit
+
+Don't forget to clean up the return stack and @code{UNLOOP} any
+outstanding @code{?DO}...@code{LOOP}s before @code{EXIT}ing. The
+primitive compiled by @code{EXIT} is
+
+doc-;s
+
+@subsection Exception Handling
+
+doc-catch
+doc-throw
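+
+A typical use might look like this (@code{risky} and @code{report} are
+made-up words for this sketch, not part of gforth):
+
+@example
+: risky ( -- )
+ ... -1 throw ... ; \ THROW with a non-zero code aborts to CATCH
+
+: report ( -- )
+ ['] risky catch ( 0 | throw-code )
+ if ." risky failed!" then ;
+@end example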
+
 @node Locals
 @section Locals
 
@@ -923,10 +984,10 @@ says. If @code{UNREACHABLE} is used wher
 lie to the compiler), buggy code will be produced.
 
 Another problem with this rule is that at @code{BEGIN}, the compiler
-does not know which locals will be visible on the incoming back-edge
-. All problems discussed in the following are due to this ignorance of
-the compiler (we discuss the problems using @code{BEGIN} loops as
-examples; the discussion also applies to @code{?DO} and other
+does not know which locals will be visible on the incoming
+back-edge. All problems discussed in the following are due to this
+ignorance of the compiler (we discuss the problems using @code{BEGIN}
+loops as examples; the discussion also applies to @code{?DO} and other
 loops). Perhaps the most insidious example is:
 @example
 AHEAD
@@ -1299,6 +1360,355 @@ programs harder to read, and easier to m
 merit of this syntax is that it is easy to implement using the ANS
 Forth locals wordset.
 
+@node Internals
+@chapter Internals
+
+Reading this section is not necessary for programming with gforth. It
+should be helpful for finding your way in the gforth sources.
+
+@section Portability
+
+One of the main goals of the effort is availability across a wide range
+of personal machines. fig-Forth, and, to a lesser extent, F83, achieved
+this goal by manually coding the engine in assembly language for
+several then-popular processors. This approach is very labor-intensive
+and the results are short-lived due to progress in computer
+architecture.
+
+Others have avoided this problem by coding in C, e.g., Mitch Bradley
+(cforth), Mikael Patel (TILE) and Dirk Zoller (pfe). This approach is
+particularly popular for UNIX-based Forths due to the large variety of
+architectures of UNIX machines. Unfortunately, an implementation in C
+does not mix well with the goals of efficiency and with using
+traditional techniques: Indirect or direct threading cannot be
+expressed in C, and switch threading, the fastest technique available
+in C, is significantly slower. Another problem with C is that it's very
+cumbersome to express double integer arithmetic.
+
+Fortunately, there is a portable language that does not have these
+limitations: GNU C, the version of C processed by the GNU C compiler
+(@pxref{C Extensions, , Extensions to the C Language Family, gcc.info,
+GNU C Manual}). Its labels as values feature (@pxref{Labels as Values, ,
+Labels as Values, gcc.info, GNU C Manual}) makes direct and indirect
+threading possible; its @code{long long} type (@pxref{Long Long, ,
+Double-Word Integers, gcc.info, GNU C Manual}) corresponds to Forth's
+double numbers. GNU C is available for free on all important (and many
+unimportant) UNIX machines, VMS, 80386s running MS-DOS, the Amiga, and
+the Atari ST, so a Forth written in GNU C can run on all these
+machines@footnote{Due to Apple's look-and-feel lawsuit it is not
+available on the Mac (@pxref{Boycott, , Protect Your Freedom--Fight
+``Look And Feel'', gcc.info, GNU C Manual}).}.
+
+Writing in a portable language has the reputation of producing code
+that is slower than assembly. For our Forth engine we repeatedly looked
+at the code produced by the compiler and eliminated most
+compiler-induced inefficiencies by appropriate changes in the source
+code.
+
+However, register allocation cannot be portably influenced by the
+programmer, leading to some inefficiencies on register-starved
+machines. We use explicit register declarations (@pxref{Explicit Reg
+Vars, , Variables in Specified Registers, gcc.info, GNU C Manual}) to
+improve the speed on some machines. They are turned on by using the
+@code{gcc} switch @code{-DFORCE_REG}. Unfortunately, this feature not
+only depends on the machine, but also on the compiler version: On some
+machines some compiler versions produce incorrect code when certain
+explicit register declarations are used. So by default
+@code{-DFORCE_REG} is not used.
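+
+Such a declaration might look like the following sketch (the variable
+and register names are only examples; the actual declarations in the
+gforth sources depend on the machine):
+
+@example
+/* keep the data stack pointer in a fixed machine register; */
+/* the register name is machine-specific (here: a MIPS register) */
+register Cell *sp asm("$16");
+@end example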
+
+@section Threading
+
+GNU C's labels as values extension (available since @code{gcc-2.0},
+@pxref{Labels as Values, , Labels as Values, gcc.info, GNU C Manual})
+makes it possible to take the address of @var{label} by writing
+@code{&&@var{label}}. This address can then be used in a statement like
+@code{goto *@var{address}}. I.e., @code{goto *&&x} is the same as
+@code{goto x}.
+
+With this feature an indirect threaded NEXT looks like:
+@example
+cfa = *ip++;
+ca = *cfa;
+goto *ca;
+@end example
+For those unfamiliar with the names: @code{ip} is the Forth instruction
+pointer; the @code{cfa} (code-field address) corresponds to ANS Forth's
+execution token and points to the code field of the next word to be
+executed; the @code{ca} (code address) fetched from there points to
+some executable code, e.g., a primitive or the colon definition handler
+@code{docol}.
+
+Direct threading is even simpler:
+@example
+ca = *ip++;
+goto *ca;
+@end example
+
+Of course we have packaged the whole thing neatly in macros called
+@code{NEXT} and @code{NEXT1} (the part of NEXT after fetching the cfa).
+
+@subsection Scheduling
+
+There is a little complication: Pipelined and superscalar processors,
+i.e., RISC and some modern CISC machines, can process independent
+instructions while waiting for the results of an instruction. The
+compiler usually reorders (schedules) the instructions in a way that
+achieves good usage of these delay slots. However, on our first tries
+the compiler did not do well on scheduling primitives. E.g., for
+@code{+} implemented as
+@example
+n=sp[0]+sp[1];
+sp++;
+sp[0]=n;
+NEXT;
+@end example
+the NEXT comes strictly after the other code, i.e., there is nearly no
+scheduling. After a little thought the problem becomes clear: The
+compiler cannot know that @code{sp} and @code{ip} point to different
+addresses (and the version of @code{gcc} we used would not know it even
+if it could), so it could not move the load of the cfa above the store
+to the TOS. Indeed the pointers could be the same, if code on or very
+near the top of stack were executed. In the interest of speed we chose
+to forbid this probably unused ``feature'' and helped the compiler in
+scheduling: NEXT is divided into the loading part (@code{NEXT_P1}) and
+the goto part (@code{NEXT_P2}). @code{+} now looks like:
+@example
+n=sp[0]+sp[1];
+sp++;
+NEXT_P1;
+sp[0]=n;
+NEXT_P2;
+@end example
+This can be scheduled optimally by the compiler
+(@pxref{TOS Optimization}).
+
+This division can be turned off with the switch @code{-DCISC_NEXT}.
+This switch is on by default on machines that do not profit from
+scheduling (e.g., the 80386), in order to preserve registers.
+
+@subsection Direct or Indirect Threaded?
+
+Both! After packaging the nasty details in macro definitions we
+realized that we could switch between direct and indirect threading by
+simply setting a compilation flag (@code{-DDIRECT_THREADED}) and
+defining a few machine-specific macros for the direct-threading case.
+On the Forth level we also offer access words that hide the differences
+between the threading methods (@pxref{Threading Words}).
+
+Indirect threading is implemented completely
+machine-independently. Direct threading needs routines for creating
+jumps to the executable code (e.g., to @code{docol} or @code{dodoes}).
+These routines are inherently machine-dependent, but they do not amount
+to many source lines. I.e., even porting direct threading to a new
+machine is a small effort.
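+
+For the indirect threaded case, the macro packaging mentioned above
+might look roughly like this (a sketch only, assuming the variables
+shown above; the real macros in the gforth sources differ in details):
+
+@example
+/* loading part: fetch the cfa and the code address */
+#define NEXT_P1 do {cfa = *ip++; ca = *cfa;} while (0)
+/* goto part: dispatch to the executable code */
+#define NEXT_P2 do {goto *ca;} while (0)
+#define NEXT    do {NEXT_P1; NEXT_P2;} while (0)
+@end example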
+
+@subsection DOES>
+
+One of the most complex parts of a Forth engine is @code{dodoes}, i.e.,
+the chunk of code executed by every word defined by a
+@code{CREATE}...@code{DOES>} pair. The main problem here is: How to
+find the Forth code to be executed, i.e., the code after the
+@code{DOES>} (the DOES-code)? There are two solutions:
+
+In fig-Forth the code field points directly to @code{dodoes} and the
+DOES-code address is stored in the cell after the code address (i.e.,
+at @code{cfa cell+}). It may seem that this solution is illegal in
+Forth-79 and all later standards, because in fig-Forth this address
+lies in the body (which is illegal in these standards). However, by
+making the code field larger for all words this solution becomes legal
+again. We use this approach for the indirect threaded version. Leaving
+a cell unused in most words is a bit wasteful, but on the machines we
+are targeting this is hardly a problem. The other reason for having a
+code field size of two cells is to avoid having different image files
+for direct and indirect threaded systems (@pxref{image-format}).
+
+The other approach is that the code field points or jumps to the cell
+after @code{DOES>}. In this variant there is a jump to @code{dodoes} at
+this address. @code{dodoes} can then get the DOES-code address by
+computing the code address, i.e., the address of the jump to
+@code{dodoes}, and adding the length of that jump field. A variant of
+this is to have a call to @code{dodoes} after the @code{DOES>}; then
+the return address (which can be found in the return register on RISCs)
+is the DOES-code address. Since the two cells available in the code
+field are usually used up by the jump to the code address in direct
+threading, we use this approach for direct threading. We did not want
+to add another cell to the code field.
+
+@section Primitives
+
+@subsection Automatic Generation
+
+Since the primitives are implemented in a portable language, there is
+no longer any need to minimize the number of primitives. On the
+contrary, having many primitives is an advantage: speed. In order to
+reduce the number of errors in primitives and to make programming them
+easier, we provide a tool, the primitive generator (@file{prims2x.fs}),
+that automatically generates most (and sometimes all) of the C code for
+a primitive from the stack effect notation. The source for a primitive
+has the following form:
+
+@format
+@var{Forth-name} @var{stack-effect} @var{category} [@var{pronounc.}]
+[@code{""}@var{glossary entry}@code{""}]
+@var{C code}
+[@code{:}
+@var{Forth code}]
+@end format
+
+The items in brackets are optional. The category and glossary fields
+are there for generating the documentation; the Forth code is there for
+manual implementations on machines without GNU C.
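+
+A primitive that also provides a glossary entry and a Forth-code
+fallback might be written as follows (@code{my-max} is made up for
+illustration and is not a primitive from the gforth sources):
+
+@example
+my-max n1 n2 -- n gforth
+""n is the maximum of n1 and n2""
+n = n1>n2 ? n1 : n2;
+:
+ 2dup < if swap then drop ;
+@end example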
+
+For comparison, the source for the real primitive @code{+} is just:
+@example
++ n1 n2 -- n core plus
+n = n1+n2;
+@end example
+
+This looks like a specification, but in fact @code{n = n1+n2} is C
+code. Our primitive generation tool extracts a lot of information from
+the stack effect notations@footnote{We use a one-stack notation, even
+though we have separate data and floating-point stacks; the separate
+notation can be generated easily from the unified notation.}: The
+number of items popped from and pushed on the stack, their type, and by
+what name they are referred to in the C code. It then generates a C
+code prelude and postlude for each primitive. The final C code for
+@code{+} looks like this:
+
+@example
+I_plus: /* + ( n1 n2 -- n ) */ /* label, stack effect */
+/* */ /* documentation */
+{
+DEF_CA /* definition of variable ca (indirect threading) */
+Cell n1; /* definitions of variables */
+Cell n2;
+Cell n;
+n1 = (Cell) sp[1]; /* input */
+n2 = (Cell) TOS;
+sp += 1; /* stack adjustment */
+NAME("+") /* debugging output (with -DDEBUG) */
+{
+n = n1+n2; /* C code taken from the source */
+}
+NEXT_P1; /* NEXT part 1 */
+TOS = (Cell)n; /* output */
+NEXT_P2; /* NEXT part 2 */
+}
+@end example
+
+This looks long and inefficient, but the GNU C compiler optimizes quite
+well and produces optimal code for @code{+} on, e.g., the R3000 and the
+HP RISC machines: Defining the @code{n}s does not produce any code, and
+using them as intermediate storage also adds no cost.
+
+There are also other optimizations that are not illustrated by this
+example: Assignments between simple variables are usually for free
+(copy propagation). If one of the stack items is not used by the
+primitive (e.g., in @code{drop}), the compiler eliminates the load from
+the stack (dead code elimination). On the other hand, there are some
+things that the compiler does not do; therefore they are performed by
+@file{prims2x.fs}: The compiler does not optimize away code that stores
+a stack item to the place where it just came from (e.g., @code{over}).
+
+While programming a primitive is usually easy, there are a few cases
+where the programmer has to take the actions of the generator into
+account, most notably @code{?dup}, but also words that do not (always)
+fall through to NEXT.
+
+@subsection TOS Optimization
+
+An important optimization for stack machine emulators, e.g., Forth
+engines, is keeping one or more of the top stack items in registers.
+If a word has the stack effect @var{in1}...@var{inx} @code{--}
+@var{out1}...@var{outy}, keeping the top @var{n} items in registers
+@itemize
+@item
+is better than keeping @var{n-1} items, if @var{x>=n} and @var{y>=n},
+due to fewer loads from and stores to the stack.
+@item
+is slower than keeping @var{n-1} items, if @var{x<>y} and @var{x<n} and
+@var{y<n}, due to additional moves between registers.
+@end itemize
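+
+For example, @code{+} has the stack effect @code{n1 n2 -- n}, i.e.,
+@var{x}=2 and @var{y}=1, so keeping one item in a register is a win;
+this is just what the generated code shown above does with @code{TOS}.
+The following sketch (using the same variables as above) shows the
+effect:
+
+@example
+/* + with the top of stack cached in the register TOS: */
+n1 = sp[1];   /* only one load from the stack in memory */
+n = n1+TOS;   /* the second operand is already in a register */
+sp += 1;
+TOS = n;      /* the result stays in the register; no store needed */
+@end example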