--- gforth/doc/gforth.ds	2003/02/22 18:24:29	1.108
+++ gforth/doc/gforth.ds	2003/02/23 21:16:59	1.109
@@ -511,6 +511,14 @@ The optional Search-Order word set
 * search-idef::                 Implementation Defined Options                 
 * search-ambcond::              Ambiguous Conditions              
 
+Emacs and Gforth
+
+* Installing gforth.el::        Making Emacs aware of Forth.
+* Emacs Tags::                  Viewing the source of a word in Emacs.
+* Hilighting::                  Making Forth code look prettier.
+* Auto-Indentation::            Customizing auto-indentation.
+* Blocks Files::                Reading and writing blocks files.
+
 Image Files
 
 * Image Licensing Issues::      Distribution terms for images.
@@ -538,6 +546,7 @@ Threading
 
 * Scheduling::                  
 * Direct or Indirect Threaded?::  
+* Dynamic Superinstructions::   
 * DOES>::                       
 
 Primitives
@@ -1032,7 +1041,7 @@ For related information about the creati
 @cindex flags on the command line
 
 Gforth is made up of two parts; an executable ``engine'' (named
-@file{gforth} or @file{gforth-fast}) and an image file. To start it, you
+@command{gforth} or @command{gforth-fast}) and an image file. To start it, you
 will usually just say @code{gforth} -- this automatically loads the
 default image file @file{gforth.fi}. In many other cases the default
 Gforth image will be invoked like this:
@@ -1043,10 +1052,15 @@ gforth [file | -e forth-code] ...
 This interprets the contents of the files and the Forth code in the order they
 are given.
 
-In addition to the @file{gforth} engine, there is also an engine called
-@file{gforth-fast}, which is faster, but gives less informative error
-messages (@pxref{Error messages}) and may catch fewer stack underflows.
-You should use it for debugged, performance-critical programs.
+In addition to the @command{gforth} engine, there is also an engine
+called @command{gforth-fast}, which is faster, but gives less
+informative error messages (@pxref{Error messages}) and may catch some
+stack underflows later or not at all.  You should use it for debugged,
+performance-critical programs.
+
+Moreover, there is an engine called @command{gforth-itc}, which is
+useful in some backwards-compatibility situations (@pxref{Direct or
+Indirect Threaded?}).
 
 In general, the command line looks like this:
 
@@ -1165,6 +1179,16 @@ or the segmentation violation SIGSEGV) b
 signal. This option is useful when the engine and/or the image might be
 severely broken (such that it causes another signal before recovering
 from the first); this option avoids endless loops in such cases.
+
+@item --no-dynamic
+@item --dynamic
+Disable or enable dynamic superinstructions with replication
+(@pxref{Dynamic Superinstructions}).
+
+@item --no-super
+Disable dynamic superinstructions and use only dynamic replication
+(@pxref{Dynamic Superinstructions}).
+
 @end table
 
 @cindex loading files at startup
@@ -7297,6 +7321,9 @@ doc-name>int
 doc-name?int
 doc-name>comp
 doc-name>string
+doc-id.
+doc-.name
+doc-.id
 
 @c ----------------------------------------------------------
 @node Compiling words, The Text Interpreter, Tokens for Words, Words
@@ -9066,6 +9093,7 @@ doc-ekey>char
 doc->number
 doc->float
 doc-accept
+doc-edit-line
 doc-pad
 @c anton: these belong in the input stream section
 doc-parse
@@ -13882,7 +13910,7 @@ later and does not work for words contai
 @end menu
 
 @c ----------------------------------
-@node Installing gforth.el, Emacs Tags, , Emacs and Gforth
+@node Installing gforth.el, Emacs Tags, Emacs and Gforth, Emacs and Gforth
 @section Installing gforth.el
 @cindex @file{.emacs}
 @cindex @file{gforth.el}, installation
@@ -14013,7 +14041,7 @@ Example:
 @end example
 
 @c ----------------------------------
-@node Blocks Files,, Auto-Indentation, Emacs and Gforth
+@node Blocks Files,  , Auto-Indentation, Emacs and Gforth
 @section Blocks Files
 @cindex blocks files, use with Emacs
 @code{forth-mode} Autodetects blocks files by checking whether the
@@ -14470,10 +14498,14 @@ doc-arg
 Reading this chapter is not necessary for programming with Gforth. It
 may be helpful for finding your way in the Gforth sources.
 
-The ideas in this section have also been published in Bernd Paysan,
-@cite{ANS fig/GNU/??? Forth} (in German), Forth-Tagung '93 and M. Anton
-Ertl, @cite{@uref{http://www.complang.tuwien.ac.at/papers/ertl93.ps.Z, A
-Portable Forth Engine}}, EuroForth '93.
+The ideas in this section have also been published in the following
+papers: Bernd Paysan, @cite{ANS fig/GNU/??? Forth} (in German),
+Forth-Tagung '93; M. Anton Ertl,
+@cite{@uref{http://www.complang.tuwien.ac.at/papers/ertl93.ps.Z, A
+Portable Forth Engine}}, EuroForth '93; M. Anton Ertl,
+@cite{@uref{http://www.complang.tuwien.ac.at/papers/ertl02.ps.gz,
+Threaded code variations and optimizations (extended version)}},
+Forth-Tagung '02.
 
 @menu
 * Portability::                 
@@ -14513,13 +14545,7 @@ GNU C Manual}). Its labels as values fea
 Labels as Values, gcc.info, GNU C Manual}) makes direct and indirect
 threading possible, its @code{long long} type (@pxref{Long Long, ,
 Double-Word Integers, gcc.info, GNU C Manual}) corresponds to Forth's
-double numbers@footnote{Unfortunately, long longs are not implemented
-properly on all machines (e.g., on alpha-osf1, long longs are only 64
-bits, the same size as longs (and pointers), but they should be twice as
-long according to @pxref{Long Long, , Double-Word Integers, gcc.info, GNU
-C Manual}). So, we had to implement doubles in C after all. Still, on
-most machines we can use long longs and achieve better performance than
-with the emulation package.}. GNU C is available for free on all
+double numbers on many systems.  GNU C is freely available on all
 important (and many unimportant) UNIX machines, VMS, 80386s running
 MS-DOS, the Amiga, and the Atari ST, so a Forth written in GNU C can run
 on all these machines.
@@ -14588,6 +14614,7 @@ Of course we have packaged the whole thi
 @menu
 * Scheduling::                  
 * Direct or Indirect Threaded?::  
+* Dynamic Superinstructions::   
 * DOES>::                       
 @end menu
 
@@ -14632,37 +14659,172 @@ There are various schemes that distribut
 NEXT between these parts in several ways; in general, different schemes
 perform best on different processors.  We use a scheme for most
 architectures that performs well for most processors of this
-architecture; in the furture we may switch to benchmarking and chosing
+architecture; in the future we may switch to benchmarking and choosing
 the scheme on installation time.
 
 
-@node Direct or Indirect Threaded?, DOES>, Scheduling, Threading
+@node Direct or Indirect Threaded?, Dynamic Superinstructions, Scheduling, Threading
 @subsection Direct or Indirect Threaded?
 @cindex threading, direct or indirect?
 
-@cindex -DDIRECT_THREADED
-Both! After packaging the nasty details in macro definitions we
-realized that we could switch between direct and indirect threading by
-simply setting a compilation flag (@code{-DDIRECT_THREADED}) and
-defining a few machine-specific macros for the direct-threading case.
-On the Forth level we also offer access words that hide the
-differences between the threading methods (@pxref{Threading Words}).
-
-Indirect threading is implemented completely machine-independently.
-Direct threading needs routines for creating jumps to the executable
-code (e.g. to @code{docol} or @code{dodoes}). These routines are inherently
-machine-dependent, but they do not amount to many source lines. Therefore,
-even porting direct threading to a new machine requires little effort.
-
-@cindex --enable-indirect-threaded, configuration flag
-@cindex --enable-direct-threaded, configuration flag
-The default threading method is machine-dependent. You can enforce a
-specific threading method when building Gforth with the configuration
-flag @code{--enable-direct-threaded} or
-@code{--enable-indirect-threaded}. Note that direct threading is not
-supported on all machines.
+Threaded Forth code consists of references to primitives (simple machine
+code routines like @code{+}) and to non-primitives (e.g., colon
+definitions, variables, constants); for a specific class of
+non-primitives (e.g., variables) there is one code routine (e.g.,
+@code{dovar}), but each variable needs a separate reference to its data.
+
+Traditionally Forth has been implemented as indirect threaded code,
+because this allows using only one cell to reference a non-primitive
+(basically you point to the data, and find the code address there).
+
+@cindex primitive-centric threaded code
+However, threaded code in Gforth (since 0.6.0) uses two cells for
+non-primitives, one for the code address, and one for the data address;
+the data pointer is an immediate argument for the virtual machine
+instruction represented by the code address.  We call this
+@emph{primitive-centric} threaded code, because all code addresses point
+to simple primitives.  E.g., for a variable, the code address is for
+@code{lit} (also used for integer literals like @code{99}).
+
+Primitive-centric threaded code allows us to use (faster) direct
+threading as dispatch method, completely portably (direct threaded code
+in Gforth before 0.6.0 required architecture-specific code).  It also
+eliminates the performance problems related to I-cache consistency that
+386 implementations have with direct threaded code, and allows
+additional optimizations.
+
+@cindex hybrid direct/indirect threaded code
+There is a catch, however: the @var{xt} parameter of @code{execute} can
+occupy only one cell, so how do we pass non-primitives with their code
+@emph{and} data addresses to them?  Our answer is to use indirect
+threaded dispatch for @code{execute} and other words that use a
+single-cell xt.  So, normal threaded code in colon definitions uses
+direct threading, and @code{execute} and similar words, which dispatch
+to xts on the data stack, use indirect threaded code.  We call this
+@emph{hybrid direct/indirect} threaded code.
+
+@cindex engines, gforth vs. gforth-fast vs. gforth-itc
+@cindex gforth engine
+@cindex gforth-fast engine
+The engines @command{gforth} and @command{gforth-fast} use hybrid
+direct/indirect threaded code.  This means that with these engines you
+cannot use @code{,} to compile an xt.  Instead, you have to use
+@code{compile,}.
+
+@cindex gforth-itc engine
+If you want to compile xts with @code{,}, use @command{gforth-itc}.  This
+engine uses plain old indirect threaded code.  It still compiles in a
+primitive-centric style, so you cannot use @code{compile,} instead of
+@code{,} (e.g., for producing tables of xts with @code{] word1 word2
+... [}).  If you want to do that, you have to use @command{gforth-itc}
+and execute @code{' , is compile,}.  Your program can check if it is
+running on a hybrid direct/indirect threaded engine or a pure indirect
+threaded engine with @code{threading-method} (@pxref{Threading Words}).
+
+
+@node Dynamic Superinstructions, DOES>, Direct or Indirect Threaded?, Threading
+@subsection Dynamic Superinstructions
+@cindex Dynamic superinstructions with replication
+@cindex Superinstructions
+@cindex Replication
+
+The engines @command{gforth} and @command{gforth-fast} use another
+optimization: dynamic superinstructions with replication.  As an
+example, consider the following colon definition:
+
+@example
+: squared ( n1 -- n2 )
+  dup * ;
+@end example
+
+Gforth compiles this into the threaded code sequence
+
+@example
+dup
+*
+;s
+@end example
+
+In normal direct threaded code there is a code address occupying one
+cell for each of these primitives.  Each code address points to a
+machine code routine, and the interpreter jumps to this machine code in
+order to execute the primitive.  The routines for these three
+primitives are (in @command{gforth-fast} on the 386):
+
+@example
+Code dup  
+( $804B950 )  add     esi , # -4  \ $83 $C6 $FC 
+( $804B953 )  add     ebx , # 4  \ $83 $C3 $4 
+( $804B956 )  mov     dword ptr 4 [esi] , ecx  \ $89 $4E $4 
+( $804B959 )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC 
+end-code
+Code *  
+( $804ACC4 )  mov     eax , dword ptr 4 [esi]  \ $8B $46 $4 
+( $804ACC7 )  add     esi , # 4  \ $83 $C6 $4 
+( $804ACCA )  add     ebx , # 4  \ $83 $C3 $4 
+( $804ACCD )  imul    ecx , eax  \ $F $AF $C8 
+( $804ACD0 )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC 
+end-code
+Code ;s  
+( $804A693 )  mov     eax , dword ptr [edi]  \ $8B $7 
+( $804A695 )  add     edi , # 4  \ $83 $C7 $4 
+( $804A698 )  lea     ebx , dword ptr 4 [eax]  \ $8D $58 $4 
+( $804A69B )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC 
+end-code
+@end example
+
+With dynamic superinstructions and replication the compiler does not
+just lay down the threaded code, but also copies the machine code
+fragments, usually without the jump at the end.
+
+@example
+( $4057D27D )  add     esi , # -4  \ $83 $C6 $FC 
+( $4057D280 )  add     ebx , # 4  \ $83 $C3 $4 
+( $4057D283 )  mov     dword ptr 4 [esi] , ecx  \ $89 $4E $4 
+( $4057D286 )  mov     eax , dword ptr 4 [esi]  \ $8B $46 $4 
+( $4057D289 )  add     esi , # 4  \ $83 $C6 $4 
+( $4057D28C )  add     ebx , # 4  \ $83 $C3 $4 
+( $4057D28F )  imul    ecx , eax  \ $F $AF $C8 
+( $4057D292 )  mov     eax , dword ptr [edi]  \ $8B $7 
+( $4057D294 )  add     edi , # 4  \ $83 $C7 $4 
+( $4057D297 )  lea     ebx , dword ptr 4 [eax]  \ $8D $58 $4 
+( $4057D29A )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC 
+@end example
+
+The jump is appended only where a threaded-code control-flow change
+happens (e.g., in @code{;s}).  This optimization eliminates many of
+these jumps and makes the rest much more predictable.  The speedup
+depends on the processor and the application; on the Athlon and Pentium
+III this optimization typically produces a speedup by a factor of 2.
+
+The code addresses in the direct-threaded code are set to point to the
+appropriate points in the copied machine code, in this example like
+this:
 
-@node DOES>,  , Direct or Indirect Threaded?, Threading
+@example
+primitive  code address
+   dup       $4057D27D
+   *         $4057D286
+   ;s        $4057D292
+@end example
+
+Thus there can be threaded-code jumps to any place in this piece of
+code.  This also simplifies decompilation quite a bit.
+
+@cindex --no-dynamic command-line option
+@cindex --no-super command-line option
+You can disable this optimization with @option{--no-dynamic}.  You can
+use the copying without eliminating the jumps (i.e., dynamic
+replication, but without superinstructions) with @option{--no-super};
+this gives the branch prediction benefit alone, and its effect on
+performance depends on the CPU.
+
+@cindex --dynamic command-line option
+On some machines this optimization is disabled by default, because it is
+unsafe on these machines.  However, if you feel adventurous, you can
+enable it with @option{--dynamic}.
+
+@node DOES>,  , Dynamic Superinstructions, Threading
 @subsection DOES>
 @cindex @code{DOES>} implementation
 
@@ -14670,36 +14832,22 @@ supported on all machines.
 @cindex @code{DOES>}-code
 One of the most complex parts of a Forth engine is @code{dodoes}, i.e.,
 the chunk of code executed by every word defined by a
-@code{CREATE}...@code{DOES>} pair. The main problem here is: How to find
-the Forth code to be executed, i.e. the code after the
-@code{DOES>} (the @code{DOES>}-code)? There are two solutions:
+@code{CREATE}...@code{DOES>} pair; actually with primitive-centric code,
+this is only needed if the xt of the word is @code{execute}d. The main
+problem here is: How to find the Forth code to be executed, i.e. the
+code after the @code{DOES>} (the @code{DOES>}-code)? There are two
+solutions:
 
 In fig-Forth the code field points directly to the @code{dodoes} and the
-@code{DOES>}-code address is stored in the cell after the code address (i.e. at
-@code{@i{CFA} cell+}). It may seem that this solution is illegal in
-the Forth-79 and all later standards, because in fig-Forth this address
-lies in the body (which is illegal in these standards). However, by
-making the code field larger for all words this solution becomes legal
-again. We use this approach for the indirect threaded version and for
-direct threading on some machines. Leaving a cell unused in most words
-is a bit wasteful, but on the machines we are targeting this is hardly a
-problem. The other reason for having a code field size of two cells is
-to avoid having different image files for direct and indirect threaded
-systems (direct threaded systems require two-cell code fields on many
-machines).
-
-@cindex @code{DOES>}-handler
-The other approach is that the code field points or jumps to the cell
-after @code{DOES>}. In this variant there is a jump to @code{dodoes} at
-this address (the @code{DOES>}-handler). @code{dodoes} can then get the
-@code{DOES>}-code address by computing the code address, i.e., the address of
-the jump to @code{dodoes}, and add the length of that jump field. A variant of
-this is to have a call to @code{dodoes} after the @code{DOES>}; then the
-return address (which can be found in the return register on RISCs) is
-the @code{DOES>}-code address. Since the two cells available in the code field
-are used up by the jump to the code address in direct threading on many
-architectures, we use this approach for direct threading on these
-architectures. We did not want to add another cell to the code field.
+@code{DOES>}-code address is stored in the cell after the code address
+(i.e. at @code{@i{CFA} cell+}). It may seem that this solution is
+illegal in the Forth-79 and all later standards, because in fig-Forth
+this address lies in the body (which is illegal in these
+standards). However, by making the code field larger for all words this
+solution becomes legal again.  We use this approach.  Leaving a cell
+unused in most words is a bit wasteful, but on the machines we are
+targeting this is hardly a problem.
+
 
 @node Primitives, Performance, Threading, Engine
 @section Primitives
@@ -14717,14 +14865,16 @@ architectures. We did not want to add an
 @cindex primitives, automatic generation
 
 @cindex @file{prims2x.fs}
+
 Since the primitives are implemented in a portable language, there is no
 longer any need to minimize the number of primitives. On the contrary,
 having many primitives has an advantage: speed. In order to reduce the
 number of errors in primitives and to make programming them easier, we
-provide a tool, the primitive generator (@file{prims2x.fs}), that
-automatically generates most (and sometimes all) of the C code for a
-primitive from the stack effect notation.  The source for a primitive
-has the following form:
+provide a tool, the primitive generator (@file{prims2x.fs} aka Vmgen,
+@pxref{Top, Vmgen, Introduction, vmgen, Vmgen}), that automatically
+generates most (and sometimes all) of the C code for a primitive from
+the stack effect notation.  The source for a primitive has the following
+form:
 
 @cindex primitive source format
 @format
@@ -14795,6 +14945,8 @@ where the programmer has to take the act
 account, most notably @code{?dup}, but also words that do not (always)
 fall through to @code{NEXT}.
 
+For more information
+
 @node TOS Optimization, Produced code, Automatic Generation, Primitives
 @subsection TOS Optimization
 @cindex TOS optimization for primitives