Forth Reduced Instruction Set Computers

John R. Hayes
Martin E. Fraeman
Johns Hopkins University / Applied Physics Laboratory

1. Introduction

This note describes three 32 bit Forth microprocessor chips we have designed over the past couple of years. We call our chips FRISCs (Forth Reduced Instruction Set Computers). We briefly describe FRISC 1 and 2, which have nearly identical architectures, and then describe our latest design, FRISC 3, in more detail. All three chips are full 32 bit designs, with 32 bit address and 32 bit data busses. They are word addressed; individual bytes cannot be addressed.

2. FRISC 1 and 2

FRISC 1 and 2 have two instruction formats: a subroutine call and a user-defined microcode instruction. The most significant bit (msb) of the instruction determines its type. A zero indicates that the remaining 31 bits are the address of a subroutine to call. The call executes in one cycle. A one in the msb indicates that the remaining 31 bits are a microcode word that directly controls the resources of the chip's data path. The microcode word can represent most Forth primitives (e.g. dup, over, +, <, 0=, etc.) and the data path can execute most primitives in a single cycle. Primitives that must access memory take two cycles to execute. These include branch, ?branch, @, !, and (literal).

Both FRISC 1 and 2 have two on-chip stack caches. The stack cache gives the programmer the illusion of having an arbitrarily large stack of on-chip registers. A stack caching algorithm guarantees that the top four stack elements are always present in the cache. Accessing a stack is equivalent to accessing a register and thus provides single cycle execution of the primitives.

FRISC 1 was the first implementation of the architecture just described. Full custom design techniques were used. The chip was built using MOSIS' 4 micron CMOS Silicon on Sapphire (SOS) process. When the chips were received from MOSIS we discovered that a design rule violation had disastrously affected yield.
However, enough partially functional chips were found to verify the correctness of the design. One chip worked well enough for us to run a Forth system on it. Unfortunately, we were not able to fix our mistake because MOSIS discontinued their SOS process. So we re-implemented the design in a scalable bulk CMOS process and had chips built at 3 microns. These chips function perfectly but at a disappointing 1 MHz clock rate (we had predicted 3 MHz).

3. FRISC 3

Early in 1987 we acquired a new design tool, the Genesil silicon compiler from Silicon Compiler Systems. This more sophisticated tool would allow us to tackle more complex architectures. In May we started work on an improved architecture.

The FRISC 3 architecture inherits many features from FRISC 1 and 2, including the single cycle call and microcode instruction and the two on-chip stack caches. A number of features have been added, including a new load/store instruction format, a single cycle branch, a return bit (similar to the Novix NC4016), multiply and divide steps, and an improved stack caching algorithm. The new load/store instruction has addressing modes that capture many Forth programming idioms. For example, if foo is a variable in the low 64 kwords of address space, foo @ can be represented with a single instruction. Some forms of this load/store format allow 16 bit literals to be pushed on the stack in one cycle. The following paragraphs describe FRISC 3 in some detail.

3.1 FRISC 3 Data Path

In addition to the stack caches there are four global utility registers in the data path. Two of these registers are dedicated to the stack caching algorithm, but the other two may be used as a system designer sees fit. For instance, they could be used to implement an additional stack or a frame pointer for a traditional language such as C.

The ALU provides the expected logic and arithmetic functions.
A single bit left shifter on the input side of the ALU and a single bit right shifter on the output are available for multiplication and division steps. A single condition code flag (FL) is provided. The flag can be loaded with one of sixteen ALU conditions or with the shift out bit from one of the shifters. Subsequently, the flag can control a conditional branch, be fed into the ALU's carry input for multiprecision arithmetic, or be read onto a bus, yielding a 32 bit 0 or -1 truth value.

Several other elements in the data path need mention. The first is a register that always returns the value zero when read (Zero). The second is a program counter (PC). Finally, there is a processor status word (PSW) that contains the state of the interrupt system and the stack caches.

3.2 FRISC 3 Instruction Set Architecture

The FRISC 3 instruction set consists of eight instruction types. There are three control flow instructions, four load/store instructions, and a microcode instruction. All FRISC 3 instructions are 32 bits wide. Each of these three instruction categories is reflected in the following three instruction formats:

+--------+----------------------------------------------+
| Type:3 |                  Address:29                  |
+--------+----------+------+------+---------+-----------+
| Type:3 | Return:1 | R1:4 | R2:4 | Stack:4 | Offset:16 |
+--------+----------+------+------+---------+-----------+
| Type:3 | Return:1 | R1:4 | R2:4 | Stack:4 |  ALU:16   |
+--------+----------+------+------+---------+-----------+

The three most significant bits (msbs) of the instruction determine its type and the interpretation of the remaining 29 bits.

The control flow instructions are call, branch, and conditional branch. The destination is an absolute address embedded in the instruction. The conditional branch will be taken if the flag is 0.

The upper sixteen bits of the load/store and micro instructions have the same format.
In both formats, the R1 field selects a source register, R2 selects a destination register, and Stack selects any combination of pushing or popping the parameter and return stacks. The Return field can cause the top of the return stack to be loaded into the program counter, where it provides the address of the next instruction. With a micro instruction, the operation performed on R1 is selected by the ALU field and the second operand is always TOS. With load/store instructions, the operation is always addition and the second operand comes from the Offset field.

The four load/store instructions are load, store, load address low (lal), and load address high (lah). A register transfer level notation summarizes their operation:

load:                     *(R1 + Offset) -> R2
store:                    *(R1 + Offset) <- R2
load address low (lal):   R1 + Offset -> R2
load address high (lah):  R1 + Offset*2^16 -> R2

The offset is a sixteen bit unsigned number. The * denotes a memory reference at the computed address so, for a load instruction, R1 + Offset is the address of the data to be loaded into R2. A single addressing mode, register indirect plus offset, is provided. Degenerate cases of this addressing mode yield other useful modes. Setting the offset to zero produces a register indirect mode. Setting R1 to the zero register allows absolute addressing in the bottom 64 kwords of address space.

The load address instructions are degenerate loads in that an address is computed but no data is fetched. Instead, the address is saved in R2. The lah instruction is similar to lal except that the offset is shifted left sixteen bits before being added to R1. The primary use for these two instructions is the construction of literals. Sixteen bit literals can be produced by a single lal instruction. Any 32 bit literal can be constructed by an lah followed by an lal.

The micro instruction is the workhorse of the processor since it is used to implement most of Forth's primitive operations.
All micro instructions consist of an operation performed on R1 and TOS with the result stored in R2. The ALU field selects the operation performed. This field has two formats, one for doing arithmetic or logic operations and one for doing shift, multiply, or divide steps:

       +-------+--------+-----------+-------+--------+-----------+
arith: | Sel:1 | Bsrc:1 | ALUcond:4 | Cin:2 | Flag:1 |  ALUop:7  |
       +-------+--------+-----------+-------+--------+-----------+
shift: | Sel:1 | Bsrc:1 | ALUcond:4 | Cin:2 | Flag:1 | Shiftop:7 |
       +-------+--------+-----------+-------+--------+-----------+

The following table shows how the FRISC 3 instruction set implements a number of Forth primitives. The last entry is the innermost loop of the (infamous) sieve after it was run through a metacompiler with a peephole optimizer. This illustrates how multiple Forth primitives can be packed into one FRISC 3 instruction.

+-----------------------+-----------------------------------------+
| dup                   | tos + 0 -> tos pushp                    |
+-----------------------+-----------------------------------------+
| over                  | sos + 0 -> tos pushp                    |
+-----------------------+-----------------------------------------+
| >r                    | tos + 0 -> tor popp pushr               |
+-----------------------+-----------------------------------------+
| r>                    | tor + 0 -> tos popr pushp               |
+-----------------------+-----------------------------------------+
| 1+                    | tos + 1 -> tos                          |
+-----------------------+-----------------------------------------+
| 0=                    | tos nopb Z ->fl-> tos                   |
+-----------------------+-----------------------------------------+
| +                     | tos bplusa czero -> tos popp            |
+-----------------------+-----------------------------------------+
| <                     | tos bminusa cone NxorV ->fl-> tos popp  |
+-----------------------+-----------------------------------------+
| exit                  | return popr                             |
+-----------------------+-----------------------------------------+
| @                     | *(tos + 0) -> tos                       |
+-----------------------+-----------------------------------------+
| !                     | *(tos + 0) <- tos popp                  |
|                       | popp                                    |
+-----------------------+-----------------------------------------+
| begin                 |                                         |
| dup size < while      | zero + 8190 -> tos pushp                |
|                       | sos bminusa cone LT ->fl popp           |
|                       | ?br forward                             |
| 0 over flags + !      | *(tos + a[flags]) <- zero               |
| over +                | sos bplusa czero -> tos                 |
| repeat                | br back                                 |
+-----------------------+-----------------------------------------+
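
As an illustration of section 3.2, the field layouts and the register transfer descriptions of the load/store instructions can be modeled in a few lines of Python. This is a minimal sketch and not part of any FRISC 3 tool. It assumes that the fields are packed from the msb downward in the order drawn in the format diagrams, that address arithmetic wraps at 32 bits, and that memory can be modeled as a dictionary of words. All function names and the `low16` key are hypothetical. The note does not give numeric codes for the Type field or register numbers, so none are assumed beyond using 0 as a stand-in for the contents of the zero register.

```python
MASK32 = 0xFFFFFFFF


def decode_fields(word):
    """Split a 32 bit instruction word into the fields drawn in section 3.2.

    Assumption: fields are packed from the msb down in the order shown.
    """
    word &= MASK32
    return {
        "type": (word >> 29) & 0x7,      # Type:3, the three msbs
        "address": word & 0x1FFFFFFF,    # Address:29 (control flow format)
        "return": (word >> 28) & 0x1,    # Return:1
        "r1": (word >> 24) & 0xF,        # R1:4, source register
        "r2": (word >> 20) & 0xF,        # R2:4, destination register
        "stack": (word >> 16) & 0xF,     # Stack:4, push/pop control
        "low16": word & 0xFFFF,          # Offset:16 or ALU:16
    }


def decode_alu(alu):
    """Split the 16 bit ALU field; arith and shift formats share this layout."""
    alu &= 0xFFFF
    return {
        "sel": (alu >> 15) & 0x1,        # Sel:1
        "bsrc": (alu >> 14) & 0x1,       # Bsrc:1
        "alucond": (alu >> 10) & 0xF,    # ALUcond:4
        "cin": (alu >> 8) & 0x3,         # Cin:2
        "flag": (alu >> 7) & 0x1,        # Flag:1
        "op": alu & 0x7F,                # ALUop:7 or Shiftop:7
    }


def lal(r1, offset):
    """load address low: R1 + Offset -> R2 (offset is 16 bit unsigned)."""
    return (r1 + (offset & 0xFFFF)) & MASK32


def lah(r1, offset):
    """load address high: R1 + Offset*2^16 -> R2."""
    return (r1 + ((offset & 0xFFFF) << 16)) & MASK32


def load(memory, r1, offset):
    """load: *(R1 + Offset) -> R2; returns the word that would land in R2."""
    return memory[lal(r1, offset)]


def store(memory, r1, offset, r2):
    """store: *(R1 + Offset) <- R2."""
    memory[lal(r1, offset)] = r2 & MASK32


def build_literal(value):
    """Build a 32 bit literal with an lah followed by an lal.

    The value 0 passed to lah stands for the contents of the zero register.
    """
    value &= MASK32
    high = lah(0, value >> 16)           # zero + hi*2^16 -> tos
    return lal(high, value & 0xFFFF)     # tos + lo -> tos
```

For example, `build_literal(0x12345678)` corresponds to an lah with offset 0x1234 followed by an lal with offset 0x5678, and `lal(0, 8190)` mirrors the `zero + 8190 -> tos` step in the sieve loop. Because the offset is unsigned, the lal step can never borrow from or carry into the upper sixteen bits set by the lah, which is why the two halves can be chosen independently.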