S" 6.1.2165 is the primary word for generating
strings. In more complex applications, it suffers from several
deficiencies:
S" string can only contain printable characters,
S" string cannot contain the '"'
character,
S" string cannot be used with wide characters as
discussed in the Forth 200x internationalisation and
XCHAR proposals.
This proposal introduces the word S\"
with very similar operations. S\" behaves like
S", but uses the '\' character as an escape
character for the entry of characters that cannot be used with
S".
This technique is widespread in languages other than Forth.
It has benefit in areas such as
The basis of the current approach is to use the terminology of
primitive characters and extended characters. A primitive character
(called a pchar here) is a fixed-width unit handled by EMIT
and friends. It corresponds to the current ANS definition of a
character. An extended character (called an xchar here) consists
of one or more primitive characters and represents the encoding
for a "display unit". A string is represented by caddr/len
in terms of primitive characters.
The consequences of this are:
The XCHARs proposal can be used to handle
extended characters on the stack. XEMIT and friends allow
us to handle some additional odd-ball requirements such as 9-bit
control characters, e.g. for the MDB bus used by vending machines.
Characters handled by C@,
C! and friends are called "primitive characters" or pchars.
Characters that may be wider than a pchar are called "extended
characters" or xchars.
These are compatible with the XCHARs proposal.
This proposal does not require systems to handle xchars, but does
not disenfranchise those that do.
S\" is used like S" but treats the
'\' character specially. One or more characters after the
'\' indicate what is substituted.
The following list is what is currently available in the Forth
systems surveyed.
\a | BEL (alert, ASCII 7) |
\b | BS (backspace, ASCII 8) |
\e | ESC (not in C99, ASCII 27) |
\f | FF (form feed, ASCII 12) |
\l | LF (ASCII 10) |
\m | CR/LF pair (ASCII 13, 10) - for HTML etc. |
\n | newline - CRLF for Windows/DOS, LF for Unices |
\q | double-quote (ASCII 34) |
\r | CR (ASCII 13) |
\t | HT (tab, ASCII 9) |
\v | VT (ASCII 11) |
\z | NUL (ASCII 0) |
\" | " |
\[0-7]+ | Octal numerical character value, finishes at the first non-octal character |
\x[0-9a-f]+ | Hex numerical character value, finishes at the first non-hex character |
\\ | backslash itself |
\ | before any other character represents that character |
The following three of these cause parsing and readability problems. As far as I know, requiring characters to come in 8 bit units will not upset any systems. Systems with characters less than 7 bits are non-compliant, and I know of no 7 bit CPUs. All current systems use character units of 8 bits or more.
\[0-7]+ | Octal numerical character value, finishes at the first non-octal character |
\x[0-9a-f]+ | Hex numerical character value, finishes at the first non-hex character |
Why do we need two representations, both of variable length? This proposal selects the hexadecimal representation, requiring two hex digits. A consequence of this is that xchars must be represented as a sequence of pchars. Although initially seen as a problem by some people, it avoids at least the following problems:
\ | before any other character represents that character |
This is an unnecessary general case, and so is not mandated. By making it an ambiguous condition, we do not disenfranchise existing implementations, and leave the way open for future extensions.
6.2.xxxx S\" s-slash-quote CORE EXT
X:EscapedString
" (double-quote),
using the translation rules below. Append the run-time
semantics given below to the current definition.
\' character
it is processed by parsing and substituting one or more characters
as follows:
\a | BEL (alert, ASCII 7) |
\b | BS (backspace, ASCII 8) |
\e | ESC (not in C99, ASCII 27) |
\f | FF (form feed, ASCII 12) |
\l | LF (ASCII 10) |
\m | CR/LF pair (ASCII 13, 10) |
\n | newline - implementation dependent newline, e.g. CR/LF, LF, or LF/CR. |
\q | double-quote (ASCII 34) |
\r | CR (ASCII 13) |
\t | HT (tab, ASCII 9) |
\v | VT (ASCII 11) |
\z | NUL (ASCII 0) |
\" | " |
\xAB | A and B are Hexadecimal numerical characters. The resulting character is the conversion of these two characters. |
\\ | backslash itself |
\ | before any other character constitutes an ambiguous condition. |
C" , 11.6.1.2165 S",
A.6.1.2165 S"
\x is not followed by two hexadecimal characters
Taken from the VFX Forth source tree and modified to remove most implementation dependencies. Assumes the use of the # and $ numeric prefixes to indicate decimal and hexadecimal respectively.
Another implementation (with some deviations) can be found in the gforth source tree.
decimal
: PLACE \ c-addr1 u c-addr2 --
\ *G Copy the string described by c-addr1 u to a counted string at
\ ** the memory address described by c-addr2.
2dup 2>r \ write count last
1 chars + swap move
2r> c! \ to avoid in-place problems
;
: $, \ caddr len --
\ *G Lay the string into the dictionary at *\fo{HERE}, reserve
\ ** space for it and *\fo{ALIGN} the dictionary.
dup >r
here place
r> 1 chars + allot
align
;
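: c+! \ c c-addr --
\ Hedged addition, not part of the original listing: C+! is used by
\ ADDCHAR and APPEND below but is not a standard word and is not
\ defined here. This is a minimal sketch, assuming it adds c to the
\ character stored at c-addr. Omit it if the host system provides C+!.
tuck c@ + swap c!
;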
: addchar \ char string --
\ *G Add the character to the end of the counted string.
tuck count + c!
1 swap c+!
;
: append \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
>r
tuck r@ count + swap cmove \ add source to end
r> c+! \ add length to count
;
: extract2H \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number from the start of the
\ ** string, returning the remaining string
\ ** and the converted number. BASE is preserved.
base @ >r hex
0 0 2over drop 2 >number 2drop drop \ convert at most two digits
>r 2 chars /string r>
r> base !
;
create EscapeTable \ -- addr
\ *G Table of translations for \a..\z.
7 c, \ \a
8 c, \ \b
char c c, \ \c
char d c, \ \d
#27 c, \ \e
#12 c, \ \f
char g c, \ \g
char h c, \ \h
char i c, \ \i
char j c, \ \j
char k c, \ \k
#10 c, \ \l
char m c, \ \m
#10 c, \ \n (Unices only)
char o c, \ \o
char p c, \ \p
char " c, \ \q
#13 c, \ \r
char s c, \ \s
9 c, \ \t
char u c, \ \u
#11 c, \ \v
char w c, \ \w
char x c, \ \x
char y c, \ \y
0 c, \ \z
create CRLF$ \ -- addr ; CR/LF as counted string
2 c, #13 c, #10 c,
internal
: addEscape \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
over 0= \ zero length check
if drop exit endif
>r \ -- caddr len ; R: -- dest
over c@ [char] x = if \ hex number?
1 chars /string extract2H r> addchar exit
endif
over c@ [char] m = if \ CR/LF pair?
1 chars /string #13 r@ addchar #10 r> addchar exit
endif
over c@ [char] n = if \ newline?
1 chars /string crlf$ count r> append exit
endif
over c@ [char] a [char] z 1+ within if
over c@ [char] a - EscapeTable + c@ r> addchar
else
over c@ r> addchar
endif
1 chars /string
;
external
: parse\" \ caddr len dest -- caddr' len'
\ *G Parses a string up to an unescaped '"', translating '\'
\ ** escapes to characters much as C does. The
\ ** translated string is a counted string at *\i{dest}
\ ** The supported escapes (case sensitive) are:
\ *D \a BEL (alert)
\ *D \b BS (backspace)
\ *D \e ESC (not in C99)
\ *D \f FF (form feed)
\ *D \l LF (ASCII 10)
\ *D \m CR/LF pair - for HTML etc.
\ *D \n newline - CRLF for Windows/DOS, LF for Unices
\ *D \q double-quote
\ *D \r CR (ASCII 13)
\ *D \t HT (tab)
\ *D \v VT
\ *D \z NUL (ASCII 0)
\ *D \" "
\ *D \xAB Two char Hex numerical character value
\ *D \\ backslash itself
\ *D \ before any other character represents that character
dup >r 0 swap c! \ zero destination
begin \ -- caddr len ; R: -- dest
dup
while
over c@ [char] " <> \ check for terminator
while
over c@ [char] \ = if \ deal with escapes
1 /string r@ addEscape
else \ normal character
over c@ r@ addchar 1 /string
endif
repeat then
dup \ step over terminating "
if 1 /string endif
r> drop
;
: readEscaped \ "string" -- caddr
\ *G Parses an escaped string from the input stream according to
\ ** the rules of *\fo{parse\"} above, returning the address
\ ** of the translated counted string in *\fo{PAD}.
source >in @ /string tuck \ -- len caddr len
pad parse\" nip
- >in +!
pad
;
: S\" \ "string" -- caddr u
\ *G As *\fo{S"}, but translates escaped characters using
\ ** *\fo{parse\"} above.
readEscaped count state @ if
compile (s") $,
then
; IMMEDIATE
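\ Usage sketch, added for illustration and not part of the source
\ listing above. It assumes S\" behaves as specified in this proposal;
\ the word name .DEMO is hypothetical. Inside the string, \t gives a
\ tab, \q a double-quote, and \x41 the character with hex code 41.
: .demo \ --
s\" Tab:\there, quote:\q, hex 41: \x41" type
;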
Note that you can be both a system implementor and a programmer, so you can submit both kinds of ballots.