RfD: Xchar wordset
23 August 2009, Bernd Paysan

Change history:

20090823	A few more edits to get fourth version
20090717	Edits made to bring proposal in shape
20090325	Edits made at Forth200x review meeting
20081203	Editing after feedback of third version
20081123	Clean up and check for consistency with Gforth implementation
20070714        Second version
20050926        First version

Add the following data type to section 3.1.2:

3.1.2.3  Primitive Character

A primitive character (pchar) includes all character values, at least
{0..255}, that can be stored into memory with C! and retrieved from
memory with C@.

Add the following chapter:

18. The optional Extended Characters wordset

//to be merged with the i18n wordset//

18.1. Introduction

Forth defines graphic and control characters from the ASCII character
set.  ASCII is only appropriate for the English language.  Most
western languages nevertheless fit reasonably well into the Forth
framework, since a single byte is sufficient to encode the few special
characters each of them needs (although no single encoding covers all
of them; Latin-1 is widely used).  Other languages require different
character sets, several of them variable-width.  The most prominent
variable-width encoding is UTF-8, which has gained considerable
popularity recently.  Since ANS Forth specifies ASCII encoding
for characters, only ASCII-compatible encodings may be used.
Fortunately, being ASCII compatible has so many benefits that most
encodings actually are ASCII compatible.  The characters beyond the
ASCII encoding are called "extended characters".

Fixed-width 16-bit representations (UCS-2, the predecessor of UTF-16)
have also been proposed, but those did not catch on, especially since
they are not sufficiently ASCII compatible.  Furthermore, the 16-bit
representation became variable-width (UTF-16) once code points beyond
16 bits (UCS-4) were introduced.

The xchar wordset does not solve problems that come from using
multiple different encodings and switching or converting between them.
A system implementing the xchar proposal should support at least UTF-8
as widespread and common encoding.

All words dealing with strings should handle xchars when the xchar
wordset is present.  This includes dictionary definitions, so the
dictionary entries should not use the eighth (most significant) bit of
a byte for other purposes.  White space parsing does not have to treat
code points greater than hex 20 as white space.

18.2. Additional terms and notation

Code point:  A member of an extended character set.

18.3. Additional usage requirements

18.3.1. Data types

Append table 18.1. to table 3.1.

Table 18.1: Data types

Symbol	    Data type			Size on stack
-----------------------------------------------------
pchar	    primitive character		       1 cell
xchar	    extended character	     	       1 cell
xc-addr	    character-aligned address	       1 cell
xc-addr u   character-aligned string	      2 cells

xchar is the code point of an extended character; on the stack it is a
	subset of u.  Extended characters are stored encoded as one or
	more pchars in memory.

xc-addr  is the address of an xchar in memory.  Alignment requirements
	are the same as for c-addr.

xc-addr u  is a string of xchars in memory, starting at xc-addr, u
	pchars long.
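
As an illustration of the difference between the length u of a string
in pchars and the number of xchars it holds, the following sketch
counts xchars by stepping through the string with +X/STRING.  The name
xchar-count is hypothetical and not part of this proposal; the code
assumes that the string is well-formed in the current encoding.

```forth
\ Sketch only: xchar-count is a hypothetical name.
\ Assumes the Xchar wordset (including +X/STRING) is present and the
\ string contains only complete, valid xchars.
: xchar-count ( xc-addr u -- n )
    0 >r                        \ running count on the return stack
    BEGIN  dup  WHILE           \ while pchars remain
        +x/string               \ step over one xchar
        r> 1+ >r                \ count it
    REPEAT  2drop r> ;
```

Under UTF-8, a string holding the code points hex 41, E9 and 20AC
occupies 1+2+3 = 6 pchars, and xchar-count returns 3 for it.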

18.3.2. Environmental queries

Append table 18.2 to table 3.5.

Table 18.2: Environmental Query Strings

String		Value data type	Constant?	Meaning
XCHAR-ENCODING	c-addr u	no		Returns a printable
						ASCII string that
						represents the
						encoding, using the
						preferred MIME name
						(if any) or the name
						listed at
						http://www.iana.org/assignments/character-sets
						such as "ISO-LATIN-1"
						or "UTF-8"; for ASCII,
						the alias "ASCII" is
						preferred.
MAX-XCHAR	u		no		Maximal value for xchar
XCHAR-MAXMEM	u		no		Maximal memory consumed
						by an xchar in address
						units

Append table 18.3 to table 3.6

Table 18.3: Forth 200x Extensions

String	    Value data type	Constant?	Meaning
X:xchar		--		--		The Xchar extension is present
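
A program could use the query strings of table 18.2 roughly as in the
following sketch.  The name .xchar-info is hypothetical; the code only
relies on ENVIRONMENT? returning false for strings that a system does
not recognise.

```forth
\ Sketch only: .xchar-info is a hypothetical name.
: .xchar-info ( -- )
    s" XCHAR-ENCODING" environment? IF
        ." encoding: " type cr
    ELSE
        ." XCHAR-ENCODING is not known to this system" cr
    THEN
    s" MAX-XCHAR" environment? IF
        ." largest xchar: " u. cr
    THEN
    s" XCHAR-MAXMEM" environment? IF
        ." address units per xchar, at most: " u. cr
    THEN ;
```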

18.3.4. Common encodings

Input and files are often encoded in ISO-Latin-1 or UTF-8.  The
encoding depends on settings of the computer system such as the LANG
environment variable on Unix.  You can use the system consistently
only when you don't change the encoding, or only use the ASCII subset.
In environments with more than one encoding (e.g. different localized
environments with different encodings), the base system is typically
ASCII only and is then extended with encoding-specific code.

18.3.5. Ambiguous conditions

An ambiguous condition exists if the data in memory does not encode a
valid xchar, or if the xchar value is outside the range of allowed
code points of the current character set used.

Append table 18.4 to table 9.2:

Code    Reserved for
-------------------
-77	Malformed xchar
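
A program that wants to treat malformed data as a recoverable
condition could wrap X-SIZE with CATCH, as in the sketch below.  The
name x-size? is hypothetical.  The sketch assumes that the system
reports malformed data through THROW code -77, as the reference
implementation does; since a malformed xchar is an ambiguous
condition, other systems may behave differently.

```forth
\ Sketch only: x-size? is a hypothetical name.
\ Assumes X-SIZE signals malformed data with -77 THROW.
: x-size? ( xc-addr u1 -- u2 true | 0 false )
    ['] x-size catch
    dup -77 = IF
        drop 2drop 0 false      \ malformed: discard the restored cells
    ELSE
        throw true              \ rethrow other errors; 0 THROW is a no-op
    THEN ;
```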

18.6 Glossary

18.6.1 Xchar words

X-SIZE ( xc-addr u1 -- u2 ) XCHAR
u2 is the number of pchars used to encode the first xchar stored in
the string xc-addr u1.  To calculate the size of the xchar, only the
bytes inside the buffer may be accessed.  An ambiguous condition
exists if the xchar is incomplete or malformed.

XC-SIZE ( xchar -- u ) XCHAR
u is the number of pchars used to encode xchar in memory.

XC@+ ( xc-addr1 -- xc-addr2 xchar ) XCHAR
Fetches the xchar at xc-addr1.  xc-addr2 points to the first memory
location after the retrieved xchar.

XC!+ ( xchar xc-addr1 -- xc-addr2 ) XCHAR
Stores the xchar at xc-addr1.  xc-addr2 points to the first memory
location after the stored xchar.

XC!+? ( xchar xc-addr1 u1 -- xc-addr2 u2 flag ) XCHAR
Stores the xchar into the string buffer specified by xc-addr1 u1.
xc-addr2 u2 is the remaining string buffer.  If the xchar did fit into
the buffer, flag is true, otherwise flag is false, and xc-addr2 u2
equal xc-addr1 u1.  XC!+? is safe for buffer overflows.

XC, ( xchar -- ) XCHAR
Append the encoding of xchar to the dictionary.

XCHAR+ ( xc-addr1 -- xc-addr2 ) XCHAR
Adds the size of the xchar stored at xc-addr1 to this address, giving
xc-addr2.

XKEY ( -- xchar ) XCHAR
Reads an xchar from the terminal.  This will discard all input events
up to the completion of the xchar.

XKEY? ( -- flag ) XCHAR
Queries whether it's possible to do XKEY without blocking.  Subsequent
KEY?, KEY, EKEY?, and EKEY may be affected by XKEY?.

XEMIT ( xchar -- ) XCHAR
Prints an xchar on the terminal.
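
The following sketch shows how XC!+? and XC@+ might be used together:
one word fills a buffer with copies of an xchar until no further copy
fits, the other walks a string and prints each code point.  The names
xbuf, fill-xchars and .xchars are hypothetical; the sketch assumes
well-formed data in the current encoding.

```forth
\ Sketch only: xbuf, fill-xchars and .xchars are hypothetical names.

create xbuf 8 allot              \ deliberately small buffer

: fill-xchars ( xchar xc-addr u -- n )
    \ store copies of xchar until one no longer fits; n = copies stored
    0 >r
    BEGIN  2 pick rot rot xc!+?  WHILE  r> 1+ >r  REPEAT
    2drop drop r> ;

: .xchars ( xc-addr u -- )
    \ print the code point of every xchar in the string, in hex
    base @ >r hex
    over + swap                  ( end addr )
    BEGIN  2dup u>  WHILE        \ while addr is below end
        xc@+ u.                  \ fetch one xchar and advance
    REPEAT  2drop
    r> base ! ;
```

Under UTF-8, the code point hex 20AC takes three pchars, so
fill-xchars stores two copies in the 8-pchar buffer and leaves the
last two pchars unused instead of overrunning the buffer.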

18.6.2 Xchar extension words

-TRAILING-GARBAGE ( xc-addr u1 -- xc-addr u2 ) XCHAR EXT
Examine the last xchar in the string xc-addr u1.  If the encoding is
correct and it represents a full xchar, u2 equals u1; otherwise, u2
represents the string without the last (garbled) xchar.
-TRAILING-GARBAGE does not change this garbled xchar.

X-WIDTH ( xc-addr u -- n ) XCHAR EXT
n is the number of monospace ASCII characters that take the same space
to display as the xchar string xc-addr u, assuming a monospaced
display font, i.e. xchar width is always an integer multiple of the
width of an ASCII character.

HOLDS ( c-addr u -- ) XCHAR EXT (or go to CORE EXT)
Adds a string to the picture numeric output buffer.

XHOLD ( xchar -- ) XCHAR EXT
Adds an xchar to the picture numeric output buffer.

XC-WIDTH ( xchar -- n ) XCHAR EXT
n is the number of monospace ASCII characters that take the same space
to display as the xchar; i.e. xchar width is always an integer
multiple of the width of an ASCII char.

+X/STRING ( xc-addr1 u1 -- xc-addr2 u2 ) XCHAR EXT
Step forward by one xchar in the buffer defined by xc-addr1 u1.
xc-addr2 u2 is the remaining buffer after stepping over the first
xchar in the buffer.

X\STRING- ( xc-addr1 u1 -- xc-addr1 u2 ) XCHAR EXT
Search for the penultimate xchar in the string xc-addr1 u1.  The
string xc-addr1 u2 contains all xchars of xc-addr1 u1 except the last.
Unlike XCHAR-, X\STRING- can be implemented in encodings where xchar
boundaries can only be detected reliably when scanning in the forward
direction.

XCHAR- ( xc-addr1 -- xc-addr2 ) XCHAR EXT
Goes backward from xc-addr1 until it finds an xchar so that the size of
this xchar added to xc-addr2 gives xc-addr1.  There is an ambiguous
condition when the encoding doesn't permit reliable backward stepping
through the text.

EKEY>XCHAR ( u -- xchar t / u f ) XCHAR EXT
If the keyboard event u corresponds to an xchar, return the xchar and
true.  Otherwise, return u and false.

CHAR ( "<spaces>name" -- xchar ) XCHAR EXT
Skip leading space delimiters.  Parse name delimited by a space.  Put
the value of its first xchar onto the stack.

[CHAR]	 XCHAR EXT
Interpretation: Interpretation semantics for this word are undefined.
        Compilation:    ( "<spaces>name" -- )
Skip leading space delimiters.  Parse name delimited by a space.  Append
the run-time semantics given below to the current definition.
        Run-time:       ( -- xchar )
Place xchar, the value of the first xchar of name, on the stack.

Number prefixes:
'<xchar>' will be interpreted as a literal that puts xchar on the stack.

Rationale:

There's an alternative proposal to name these extended versions XCHAR
and [XCHAR] instead.  However, since XCHAR and [XCHAR] would completely
take over the role of CHAR and [CHAR], and the extended CHAR and [CHAR]
are fully backward compatible, there is no point in providing new words
for this purpose.  The character number prefix is also supposed to work
with xchars.

PARSE ( xchar "text<xc>" -- addr u ) XCHAR EXT
Parse a text in the input stream with the xchar as delimiter.

Terminal string IO words like TYPE and ACCEPT are also extended to work
with xchars in the string.  Unicode input and display pose a number
of challenges, such as input method editors for different languages,
left-to-right and right-to-left writing, and fonts that contain only a
subset of Unicode glyphs, so systems should document their
capabilities.  File IO and in-memory string handling should work
transparently with xchars.
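
Two of the extension words can be illustrated with a short sketch:
XHOLD places an extended character into pictured numeric output, and
X-WIDTH allows padding to a display column even when the string
contains wide or zero-width xchars.  The names euro-sign, .euro and
xtype-padded are hypothetical; the sketch assumes UTF-8 and a terminal
that can display the characters involved.

```forth
\ Sketch only: euro-sign, .euro and xtype-padded are hypothetical names.

hex 20AC constant euro-sign decimal   \ U+20AC EURO SIGN

: .euro ( u -- )
    \ print an amount given in cents with a leading euro sign;
    \ pictured output is built right to left, so XHOLD comes last
    0 <# # # [char] . hold #s euro-sign xhold #> type ;

: xtype-padded ( xc-addr u width -- )
    \ type the string, then pad with spaces up to width display columns
    >r 2dup type  x-width  r> swap - 0 max spaces ;
```

With these definitions, 1234 .euro prints the euro sign followed by
12.34.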

18.7. Reference implementation

-------------------------xchar.fs----------------------------
\ xchar reference implementation: UTF-8 (and ISO-LATIN-1)

\ environmental dependency: pchars are stored as bytes
\ environmental dependency: lower case words accepted

\ doesn't implement set-source-encoding and set-file-encoding

-77 Constant mal-xchar

base @ hex

80 Value maxascii

: xc-size ( xchar -- n )
    dup      maxascii u< IF  drop 1  EXIT  THEN \ special case ASCII
    800  2 >r
    BEGIN  2dup u>=  WHILE  5 lshift r> 1+ >r  dup 0= UNTIL  THEN
    2drop r> ;

: xc@+ ( xc-addr -- xc-addr' u )
    count  dup maxascii u< IF  EXIT  THEN  \ special case ASCII
    7F and  40 >r
    BEGIN   dup r@ and  WHILE  r@ xor
            6 lshift r> 5 lshift >r >r count
            3F and r> or
    REPEAT  r> drop ;

: xc!+ ( xchar xc-addr -- xc-addr' )
    over maxascii u< IF  tuck c! char+  EXIT  THEN \ special case ASCII
    >r 0 swap  3F
    BEGIN  2dup u>  WHILE
            2/ >r  dup 3F and 80 or swap 6 rshift r>
    REPEAT  7F xor 2* or  r>
    BEGIN   over 80 u< 0= WHILE  tuck c! char+  REPEAT  nip ;

: xc, ( xchar -- )  dup xc-size here swap allot xc!+ drop ;

: xc!+? ( xchar xc-addr u -- xc-addr' u' flag )
    >r over xc-size r@ over u< IF ( xchar xc-addr1 len r: u1 )
        \ not enough space
        drop nip r> false
    ELSE
        >r xc!+ r> r> swap - true
    THEN ;

\ scan to next/previous xchar

: xchar+ ( xc-addr -- xc-addr' )  xc@+ drop ;
: xchar- ( xc-addr -- xc-addr' )
    BEGIN  1 chars - dup c@ C0 and maxascii <>  UNTIL ;

: +x/string ( xc-addr1 u1 -- xc-addr2 u2 )
    over dup xchar+ swap - /string ;
: x\string- ( xc-addr u -- xc-addr u' )
    over + xchar- over - ;

\ skip trailing garbage

: x-size ( xc-addr u1 -- u2 )
    0=        IF drop 0 exit  THEN
    \ length of UTF-8 char starting at xc-addr (accesses only xc-addr)
    c@
    dup 80 u< IF drop 1 exit THEN
    dup c0 u< IF mal-xchar throw  THEN
    dup e0 u< IF drop 2 exit THEN
    dup f0 u< IF drop 3 exit THEN
    dup f8 u< IF drop 4 exit THEN
    dup fc u< IF drop 5 exit THEN
    dup fe u< IF drop 6 exit THEN
    mal-xchar throw ;

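: -trailing-garbage ( xc-addr u1 -- xc-addr u2 )
    2dup + dup xchar- ( addr u1 end1 end2 )
    \ the last xchar is ok if end2 plus its size lands exactly on end1
    2dup tuck - over swap x-size + ( addr u1 end1 end2 end3 )
    >r over r> = IF \ last xchar ok
        2drop
    ELSE
        nip nip over -
    THEN ;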

\ utf key and emit

: xkey ( -- xchar )
    key dup maxascii u< IF  EXIT  THEN  \ special case ASCII
    7F and  40 >r
    BEGIN  dup r@ and  WHILE  r@ xor
            6 lshift r> 5 lshift >r >r key
            3F and r> or
    REPEAT  r> drop ;

: xemit ( xchar -- )
    dup maxascii u< IF  emit  EXIT  THEN \ special case ASCII
    0 swap  3F
    BEGIN  2dup u>  WHILE
            2/ >r  dup 3F and 80 or swap 6 rshift r>
    REPEAT  7F xor 2* or
    BEGIN   dup 80 u< 0= WHILE emit  REPEAT  drop ;

: holds ( addr u -- )
    BEGIN  dup  WHILE  1- 2dup + c@ hold  REPEAT  2drop ;

Create xholdbuf 8 allot

: xhold ( xchar -- )  xholdbuf tuck xc!+ over - holds ;

\ utf size

\ uses wcwidth ( xchar -- n )

: wc, ( n low high -- )  1+ , , , ;

Create wc-table \ derived from wcwidth source code, for UCS32
0 0300 0357 wc,
0 035D 036F wc,
0 0483 0486 wc,
0 0488 0489 wc,
0 0591 05A1 wc,
0 05A3 05B9 wc,
0 05BB 05BD wc,
0 05BF 05BF wc,
0 05C1 05C2 wc,
0 05C4 05C4 wc,
0 0600 0603 wc,
0 0610 0615 wc,
0 064B 0658 wc,
0 0670 0670 wc,
0 06D6 06E4 wc,
0 06E7 06E8 wc,
0 06EA 06ED wc,
0 070F 070F wc,
0 0711 0711 wc,
0 0730 074A wc,
0 07A6 07B0 wc,
0 0901 0902 wc,
0 093C 093C wc,
0 0941 0948 wc,
0 094D 094D wc,
0 0951 0954 wc,
0 0962 0963 wc,
0 0981 0981 wc,
0 09BC 09BC wc,
0 09C1 09C4 wc,
0 09CD 09CD wc,
0 09E2 09E3 wc,
0 0A01 0A02 wc,
0 0A3C 0A3C wc,
0 0A41 0A42 wc,
0 0A47 0A48 wc,
0 0A4B 0A4D wc,
0 0A70 0A71 wc,
0 0A81 0A82 wc,
0 0ABC 0ABC wc,
0 0AC1 0AC5 wc,
0 0AC7 0AC8 wc,
0 0ACD 0ACD wc,
0 0AE2 0AE3 wc,
0 0B01 0B01 wc,
0 0B3C 0B3C wc,
0 0B3F 0B3F wc,
0 0B41 0B43 wc,
0 0B4D 0B4D wc,
0 0B56 0B56 wc,
0 0B82 0B82 wc,
0 0BC0 0BC0 wc,
0 0BCD 0BCD wc,
0 0C3E 0C40 wc,
0 0C46 0C48 wc,
0 0C4A 0C4D wc,
0 0C55 0C56 wc,
0 0CBC 0CBC wc,
0 0CBF 0CBF wc,
0 0CC6 0CC6 wc,
0 0CCC 0CCD wc,
0 0D41 0D43 wc,
0 0D4D 0D4D wc,
0 0DCA 0DCA wc,
0 0DD2 0DD4 wc,
0 0DD6 0DD6 wc,
0 0E31 0E31 wc,
0 0E34 0E3A wc,
0 0E47 0E4E wc,
0 0EB1 0EB1 wc,
0 0EB4 0EB9 wc,
0 0EBB 0EBC wc,
0 0EC8 0ECD wc,
0 0F18 0F19 wc,
0 0F35 0F35 wc,
0 0F37 0F37 wc,
0 0F39 0F39 wc,
0 0F71 0F7E wc,
0 0F80 0F84 wc,
0 0F86 0F87 wc,
0 0F90 0F97 wc,
0 0F99 0FBC wc,
0 0FC6 0FC6 wc,
0 102D 1030 wc,
0 1032 1032 wc,
0 1036 1037 wc,
0 1039 1039 wc,
0 1058 1059 wc,
1 0000 10FF wc,
2 1100 115f wc,
0 1160 11FF wc,
0 1712 1714 wc,
0 1732 1734 wc,
0 1752 1753 wc,
0 1772 1773 wc,
0 17B4 17B5 wc,
0 17B7 17BD wc,
0 17C6 17C6 wc,
0 17C9 17D3 wc,
0 17DD 17DD wc,
0 180B 180D wc,
0 18A9 18A9 wc,
0 1920 1922 wc,
0 1927 1928 wc,
0 1932 1932 wc,
0 1939 193B wc,
0 200B 200F wc,
0 202A 202E wc,
0 2060 2063 wc,
0 206A 206F wc,
0 20D0 20EA wc,
2 2329 232A wc,
0 302A 302F wc,
2 2E80 303E wc,
0 3099 309A wc,
2 3040 A4CF wc,
2 AC00 D7A3 wc,
2 F900 FAFF wc,
0 FB1E FB1E wc,
0 FE00 FE0F wc,
0 FE20 FE23 wc,
2 FE30 FE6F wc,
0 FEFF FEFF wc,
2 FF00 FF60 wc,
2 FFE0 FFE6 wc,
0 FFF9 FFFB wc,
0 1D167 1D169 wc,
0 1D173 1D182 wc,
0 1D185 1D18B wc,
0 1D1AA 1D1AD wc,
2 20000 2FFFD wc,
2 30000 3FFFD wc,
0 E0001 E0001 wc,
0 E0020 E007F wc,
0 E0100 E01EF wc,
here wc-table - Constant #wc-table

\ inefficient table walk:

: xc-width ( xchar -- n )
    wc-table #wc-table over + swap ?DO
        dup I 2@ within IF  drop I 2 cells + @  UNLOOP EXIT  THEN
    3 cells +LOOP  drop 1 ;

: x-width ( xc-addr u -- n )
    0 rot rot over + swap ?DO
        I xc@+ swap >r xc-width +
    r> I - +LOOP ;

: char  ( "name" -- xchar )  bl word count drop xc@+ nip ;
: [char] ( "name" -- rt:xchar )  char postpone Literal ; immediate

\ switching encoding is only recommended at startup
\ only two encodings are supported: UTF-8 and ISO-LATIN-1

80 Constant utf-8
100 Constant iso-latin-1

: set-internal-encoding  to maxascii ;
: get-internal-encoding  maxascii ;

base !
-------------------------xchar.fs----------------------------

18.8. Experience

Built into Gforth (development version) and recent versions of
bigFORTH.  The MINOS port to VFX Forth comes with an xchar
implementation as well.  There's also lxf-ntf by Peter Falth.

Open issues are file reading and writing (conversion on the fly or
leave as it is?).  Text files in an alien encoding (different from the
internal encoding) might be tagged with their encoding, which is then
translated to the internal encoding on the fly.  E.g.

utf-8 xset-encoding
s" foobar.txt" r/w open-file throw value fd
iso-latin-1 fd set-file-encoding throw

Subsequent READ-FILE, READ-LINE, WRITE-FILE, and WRITE-LINE would
convert between utf-8 and iso-latin-1 on the fly, probably replacing
unencodeable xchars with '?' on write.

Informal Appendix:

Many Forth systems today are case insensitive, to accept lower case
standard words.  It is sufficient to be case insensitive for the ASCII
subset to make this work; this saves a large code mapping table for
comparison of other symbols.  Case is mostly an issue of European
scripts (Latin, Greek, and Cyrillic), but similar issues exist
between traditional and simplified Chinese (being dealt with by
Unihan), and between different Latin code pages in UCS, e.g. full
width vs. normal half width Latin letters.

Furthermore, UCS allows composition of glyphs, and also has direct
encoding for composed glyphs, which look the same, but have different
encodings.  This is not a problem for a Forth system to solve.

Some encodings (not UTF-8) might give surprises when you use
a case insensitive ASCII-compare that's 8-bit safe, but not aware of
the current encoding.