Then the new semester began and there was little time for further work. Until February 1998 almost nothing happened except that I wrote the assembler. Within the next two months the system grew and became a fully ANS (draft) compatible FORTH.
I now want FLK to be a fast standard system. It is meant to be an experiment in both meta compilation and code generation, but it should be a fully functional standalone FORTH too.
Since most of my work revolves around neural networks, fast floating point support is necessary. Together with vector and matrix operations and some visualization tools, FLK could become a good system for experiments in that field.
A different field that I'm interested in is symbolic computation. Sooner or later FLK will contain a computer algebra tool. It therefore has to be fast in non-floating point calculations too.
Start-up and user interface
To start FLK execute the program flk. Any command line options are interpreted as file names to be INCLUDED.
When FLK is up and running you can enter words to execute. To include a file use S" filename.ext" INCLUDED or INCLUDE. The latter lets you input a filename using a different history list and completer.
A history list is accessible from the word ACCEPT only. If you press the up- or down-arrow key you can cycle through your previous inputs. Once you are sure that this is the text you want, press the return key; your text is appended to the history list and the word returns.
A completer is a special word to save you typing. Two completer words are implemented: One for the normal FORTH command line and one for filenames. If you want to learn to implement one yourself look in the files flkinput.fs and flktools.fs for the existing completers.
To activate the completer when ACCEPT is running press the Tabulator or Set-Tabulator key. In the FORTH command line the completer searches for the beginning of the word the cursor is in or behind. All words in the current search order with this beginning are collected and the longest common string of their names is generated. This string replaces the beginning in the input line. If there is no word with this beginning, one alert is produced; if there is more than one word with this beginning, two alerts are produced.
The filename completer takes the whole input line and behaves similarly to tcsh's completer. It tries to expand the path level by level using the same longest common string method as the command line completer. Two alerts are produced if more than one file in the directory matches, one if none matches.
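The longest common string step described above can be sketched in standard Forth for the simplest case of two candidate names. This is only an illustration, not the code from flkinput.fs or flktools.fs; the word name common-prefix is hypothetical and the comparison is case sensitive.

```forth
\ Length of the common beginning of two strings.
\ Hypothetical helper, not part of FLK.
: common-prefix ( c-addr1 u1 c-addr2 u2 -- n )
  ROT MIN >R                 \ c-addr1 c-addr2   R: shorter length
  R@ 0 ?DO
    OVER I + C@  OVER I + C@  <> IF
      2DROP I UNLOOP R> DROP EXIT   \ first mismatch: its index is the length
    THEN
  LOOP
  2DROP R> ;                 \ no mismatch: the shorter string is the prefix

: demo ( -- ) S" DUP" S" DUMP" common-prefix . ;
demo
```

A completer would apply such a comparison across all matching names and keep the smallest result, then replace the typed beginning with that many characters of any match.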
Upon startup all copies of flkkern (flk is one of them) search for a system image to load in four places:
COMPILE,
. With version 1.2 the so-called level 2 compiler words were introduced. They try to fold more than one word into less machine code than compiling the words separately would produce. Since they have access to the last few literals (including CONSTANTs and CREATEd words) it is possible to embed these literals in the code instead of loading a register and then working with that register.
The return stack is addressed by esp, the data stack by ebp.
Since indexed access using ebp requires an offset value anyway, this is the first opportunity to save time. Instead of increasing and decreasing ebp itself, the offset is increased and decreased. At each access to the stack one add operation less is necessary. Before calling another word or returning from this word the accumulated offset has to be added to ebp.
Control-flow words like IF or DO have to save the offset to ebp, and words like THEN or LOOP restore the value by adding the difference to ebp.
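The offset bookkeeping described above can be modelled in a few lines of standard Forth. This is a sketch of the idea only: it emits no machine code, ignores register caching, and assumes 4-byte cells and a data stack that grows toward lower addresses. All word names here are hypothetical and do not come from the FLK sources.

```forth
\ Compile-time model of the deferred ebp adjustment (names are hypothetical).
VARIABLE pending   0 pending !       \ offset not yet added to ebp

: slot-pushed ( -- )  -4 pending +! ;   \ compiled code pushed one cell
: slot-popped ( -- )   4 pending +! ;   \ compiled code popped one cell

\ Displacement to use in [ebp+disp] for the n-th stack item (0 = top).
: item-disp ( n -- disp )  4 * pending @ + ;

\ Before a call or return: n is the operand of a single "add ebp, n".
: flush-ebp ( -- n )  pending @  0 pending ! ;
```

In this model two pushes followed by one pop leave a pending offset of -4, so one add before the next call replaces three separate adjustments of ebp.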
The next possible optimization is to keep the top few items of the data stack in the CPU registers to reduce fetch and store operations. Since every word has a different number of accepted and produced items, a defined state has to be reached at the beginning of each word. In this state eax caches the top of stack item and no other registers (except ebp and esp) have a defined meaning.
Each primitive first resets the register allocator and then requests the stack items and free registers (in that order) it needs, performs its operation and eventually marks the requested register free or puts free registers onto the stack.
One important point to mention is that each saved image contains a relocation table. This table contains the addresses of cells whose contents have to be corrected relative to the memory address of the first byte of the image. The contents of these cells are absolute addresses. Words are provided for the handling of relocation issues.
Adding your own primitives
This section describes the creation of primitives using the word COUNT as an example. The only way to compile a primitive is to put it into the file flkprim.fs. If you want to write a compiling word without interpretation semantics, it is better to program an immediate word and throw an exception if interpreting.
COUNT can be written as the colon definition:
: COUNT ( caddr -- caddr+1 len ) DUP CHAR+ SWAP C@ ;
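As a quick check of what this definition does, the sketch below applies the same body to a counted string built by hand. The names MY-COUNT and GREETING are hypothetical and chosen only to avoid redefining the standard word; everything else is standard Forth.

```forth
\ Same body as the colon definition of COUNT, under a hypothetical name.
: MY-COUNT ( c-addr1 -- c-addr2 u )  DUP CHAR+ SWAP C@ ;

\ A counted string: length byte followed by the characters.
CREATE GREETING  3 C,  CHAR F C,  CHAR L C,  CHAR K C,

GREETING MY-COUNT TYPE
```

The length byte is fetched with C@ and the address is advanced past it, which is exactly the pair of values TYPE expects.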
As a primitive it is written as:
p: COUNT ( caddr -- caddr+1 len )
  regalloc-reset
  req-any req-free
  free0 free0 xor,
  0 [tos0] free0l mov,
  tos0 inc,
  0 free>tos ;
The first line, p: COUNT ( caddr -- caddr+1 len ), defines the primitive and documents its stack effect. Only one space is allowed before the name of the primitive. Tabs are allowed after the name only if a space immediately follows the name.
The first thing to do is to reset the register allocator using regalloc-reset. Now we request one item from the stack and one free register by req-any req-free. Then the actual code generation starts.
free0 free0 xor, 0 [tos0] free0l mov, tos0 inc,
The byte at caddr is fetched into the cleared free0 meta register. Which real register is hidden behind free0 does not matter; neither the user nor the programmer needs to know it. The last line puts the free0 register on top of the stack.
Other control words for the register allocator can be found in flkprim.fs in the definitions of the other primitives.
Adding your own level 2 compilers
This section contains the description of a level 2 compiler (found in flkopt.fs).
Each level 2 optimizer consumes zero items and produces no items either. To declare an optimizer edit flkopt.fs for optimizers that work in host and target or flktopt.fs for those that only run in the target.
The first thing to do is to declare the sequence to optimize away: opt( ''# '' + '' @ )opt: does this. This optimizer is declared for the sequence number, addition, fetch. Whenever this sequence is found, the following code is executed instead of the individual optimizers of these words.
The rest of the word is very similar to a primitive declaration. There are three exceptions: You have to delete the optimized words at the end of the word and you have to get or set the actual value of the number parameter. How to do this is shown in the code snippets below.
opt( ''# '' + '' @ )opt:
  ( Get the actual value and a flag telling if it is an address. )
  0 opt-getlit                  \ x rel?
  ( Normal code generation. )
  regalloc-reset
  req-any                       \ tos0=offs
  ?+relocate [tos0] tos0 mov,
  ( All items used up. )
  0 3 opt-remove
;opt

opt( ''# ''# '' + )opt:
  ( get left parameter to + )
  1 opt-getlit                  \ x1 rel1
  ( get right parameter to + )
  0 opt-getlit                  \ x1 rel1 x0 rel0
  ( If one is an address, result is an address too. )
  ROT OR -ROT                   \ rel tos1 tos0
  ( Perform the actual calculation. )
  + SWAP                        \ x rel
  ( Store it back into the cache. )
  0 opt-setlit
  ( Delete the words optimized away from the cache. )
  1 2 opt-remove
;opt
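The effect of the second optimizer, folding two literals and an addition into one literal, corresponds to what can be done by hand in standard Forth with [ ... ] LITERAL. The sketch below only illustrates the resulting behaviour; the word names are hypothetical and it does not use FLK's optimizer interface.

```forth
\ Without folding: two literals and + are compiled one after another.
: five-a ( -- n )  2 3 + ;

\ Folded by hand: the sum is computed while compiling and one literal remains.
: five-b ( -- n )  [ 2 3 + ] LITERAL ;

five-a . five-b .
```

Both words leave the same value on the stack; with a level 2 optimizer for number, number, addition the first form can compile to the same code as the second without the programmer writing the bracket form.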
But seriously, some of the mistakes made by users (and programmers) are not reported at the moment. Some of them never will.
Among these unreported errors are data stack over- and underflows, return stack over- and underflows and floating point stack overflows. Some of them can produce unexpected or wrong results, some of them cause segmentation faults.
For a more detailed list of ambiguous conditions see here.
Benchmarks
To summarize this section: 63 % of all statistics are faked. 17 % of all
people know that.
Seriously: I used the benchmarks of Anton Ertl's Benchmark suite to compare the speed of FLK with that of gforth. You can indirectly compare several other systems with FLK at Anton Ertl's performance web page.
The following sections describe the benchmark programs, list the times of gforth and FLK, and explain which optimizers have been implemented to achieve the speed-up.
The test system was a 133 MHz Pentium without MMX running Linux kernel 2.0.30 and KDE. All times can differ a bit due to limited timer resolution and CPU load. CPU usage was between 97 and 99 % in all tests.
The initial state had no optimizers except the one that combines OVER or 2DUP, a relational operator, and IF or WHILE, so that fewer registers are allocated and no intermediate flag is generated on the stack. All other changes are incremental. These tests were performed with version 1.2 but apply to later versions too.
I C@
C! DO +LOOP DUP
Optimization                                     Time of FLK in sec.   Speed factor
                                                                       (gforth: 14.37 sec)
initial test                                     3.94                  3.6
DUP and +LOOP combined (1 register less used)    3.9                   3.6
LOOP (short jumps when possible instead of       3.7                   3.9
  always near jumps)
I (esp access using SIB addressing instead of    2.4                   6
  exchanges and ebp access using MOD/RM
  addressing)
As you can see in the second row, saving a register when enough of them are available gains very little. Removing unnecessary near jumps gains a bit more due to the saved space in the branch predictor of the Pentium. The last change removes at least two AGIs (address generation interlocks) per I in the innermost loop. That gains at least four cycles per loop.
Bubble sort
Another classical benchmark: sorting 6000 random numbers. Implementation: two nested loops. The most frequently used words are: I 2@ > SWAP 2!
Optimization                                     Time of FLK in sec.   Speed factor
                                                                       (gforth: 14.54 sec)
initial test                                     4.29                  3.4
all opt. above                                   2.72                  5.4
Fibonacci
This little word has two recursive calls and measures mostly call/return
performance.
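For readers unfamiliar with this kind of test, a doubly recursive Fibonacci word in standard Forth looks roughly like the sketch below. It is not necessarily the exact source used in the benchmark suite; the base case returning 1 for arguments below 2 is an assumption.

```forth
\ Doubly recursive Fibonacci: nearly all time goes into calls and returns.
: fib ( n1 -- n2 )
  DUP 2 < IF
    DROP 1                       \ assumed base case: fib(0) = fib(1) = 1
  ELSE
    DUP 1- RECURSE               \ fib(n-1)
    SWAP 2 - RECURSE             \ fib(n-2)
    +
  THEN ;

10 fib .
```

Each invocation does almost no arithmetic, so the measured time is dominated by the cost of entering and leaving a word.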
Optimization                                     Time of FLK in sec.   Speed factor
                                                                       (gforth: 17.13 sec)
initial test                                     2.35                  7.3
all opt. above                                   2.2                   7.8
+ changed to SWAP +                              2.16                  7.9
The change of + to SWAP + produces code that looks better before an EXIT. The four hundredths of a second saved can be blamed on the timer tolerance.