Unicode Extension

The ISO core standard allows a processor to extend the set of character codes beyond what the standard itself defines. Jekejeke Prolog supports the Unicode character set as a processor extension. The particular Unicode version that is supported depends on the underlying Java virtual machine. The following definitions replace the corresponding definitions in the previous sections. We indicate a replaced non-terminal by appending a single quote ('). The Unicode character types are written in all upper case.

As for fillers, we touched neither the line comments, the block comments nor the end of line character. But we extended the class of layout characters slightly. Including the FORMAT Unicode character type covers the byte order mark (BOM) (‘\xFEFF\’). Further including the CONTROL Unicode character type covers the ASCII control characters. We have excluded the non-joiner (‘\x200C\’) and the joiner (‘\x200D\’) hints from layout, so that they can later be used in the lower letter class.

layout'         --> SPACE_SEPARATOR |
                    FORMAT except "\x200C\", "\x200D\".

<BOM>:-      	% The name ':-'.
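
The layout test described above can be sketched outside of Prolog. The following Python fragment is only an illustration and not the Jekejeke Prolog implementation: the function name `is_layout` is hypothetical, the two-letter categories of Python's `unicodedata` module stand in for the Java character types (Zs for SPACE_SEPARATOR, Cf for FORMAT, Cc for CONTROL), and the Unicode version of the Python runtime may differ from that of the Java virtual machine.

```python
import unicodedata

# Zero-width non-joiner and joiner: FORMAT characters that are kept
# out of layout so that they can appear inside names.
JOINER_HINTS = {"\u200C", "\u200D"}


def is_layout(ch: str) -> bool:
    """Approximate the layout' class as described in the prose (assumption)."""
    if ch in JOINER_HINTS:
        return False
    # Zs = SPACE_SEPARATOR, Cf = FORMAT, Cc = CONTROL
    return unicodedata.category(ch) in {"Zs", "Cf", "Cc"}


if __name__ == "__main__":
    for ch in [" ", "\uFEFF", "\t", "\u200C", "a"]:
        print("U+%04X" % ord(ch), is_layout(ch))
```

With this test the byte order mark in front of `:-` counts as layout, which is why the example above still reads as the name ':-'.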

The definition of strings has changed slightly. We introduced a new character class invalid. This character class includes the character types UNASSIGNED, PRIVATE_USE and SURROGATE. We also consider the replacement character (‘\xFFFD\’) an invalid Unicode character. This character usually indicates an invalid byte sequence that could not be converted to a Unicode character while reading from a stream.

invalid         --> UNASSIGNED |
                    PRIVATE_USE |

'\xFFFD\'       % The name '\xFFFD\', an invalid character.
'\xD800\'       % The name '\xD800\', a high surrogate.
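
As a rough illustration of the invalid class, the check can be written with Python's `unicodedata` module. This is a sketch under assumptions: `is_invalid` and `has_invalid` are hypothetical names, the categories Cn, Co and Cs stand in for UNASSIGNED, PRIVATE_USE and SURROGATE, and which code points count as unassigned depends on the Unicode version of the runtime.

```python
import unicodedata

REPLACEMENT_CHARACTER = "\uFFFD"


def is_invalid(ch: str) -> bool:
    """Approximate the invalid class described above (assumption)."""
    # Cn = UNASSIGNED, Co = PRIVATE_USE, Cs = SURROGATE
    return (ch == REPLACEMENT_CHARACTER
            or unicodedata.category(ch) in {"Cn", "Co", "Cs"})


def has_invalid(text: str) -> bool:
    """Validation step: does the text contain an unescaped invalid character?"""
    return any(is_invalid(ch) for ch in text)


if __name__ == "__main__":
    print(has_invalid("abc"))         # plain ASCII letters
    print(has_invalid("ab\uFFFD"))    # contains the replacement character
    print(is_invalid("\uD800"))       # a lone surrogate
```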

When reading a term, the string definition applies in its original form to the tokenization phase. During parsing, strings undergo an additional validation step in which strings containing invalid characters are rejected. If needed, invalid characters can nevertheless be included in a string by using the backslash (\) to escape the code. Escaping can also be used to include lone surrogates in strings.

With respect to the delimiters, we added all Unicode punctuation character types that correspond to parentheses or quotes. As a result these characters do not glue with other characters. Furthermore, important Prolog punctuation characters such as “,”, “;”, “!” and “|” are detected individually. The delimiter class also contains our invalid character class. A delimiter that is an invalid character is accepted during tokenization but rejected during parsing.

delimiter'      --> START_PUNCTUATION |
"," | ";" | "!" | "|" |

«»           	% The name '«' followed by the name '»'.
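
The delimiter test can be sketched as follows. The Python fragment is an assumption-laden illustration, not the actual rule: `is_delimiter` is a hypothetical name, and we assume that the punctuation types corresponding to parentheses or quotes are the categories Ps (START_PUNCTUATION), Pe (END_PUNCTUATION), Pi (INITIAL_QUOTE_PUNCTUATION) and Pf (FINAL_QUOTE_PUNCTUATION).

```python
import unicodedata

# Punctuation characters that the text says are detected individually.
SOLO_CHARACTERS = {",", ";", "!", "|"}

# Assumed bracket and quote types: Ps, Pe, Pi, Pf.
BRACKET_OR_QUOTE = {"Ps", "Pe", "Pi", "Pf"}

# Invalid class: Cn = UNASSIGNED, Co = PRIVATE_USE, Cs = SURROGATE.
INVALID_TYPES = {"Cn", "Co", "Cs"}


def is_delimiter(ch: str) -> bool:
    """Approximate the delimiter' class described above (assumption)."""
    if ch in SOLO_CHARACTERS or ch == "\uFFFD":
        return True
    return unicodedata.category(ch) in BRACKET_OR_QUOTE | INVALID_TYPES


if __name__ == "__main__":
    # Each guillemet is a delimiter on its own, so the two do not glue.
    print([ch for ch in "\u00AB\u00BB" if is_delimiter(ch)])
```

Since delimiters do not glue, the two guillemets in the example above are read as two separate names.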

We use the Unicode connector punctuation character type to detect the underscore. The underscore marks the start of a variable. We also added the Unicode uppercase letter and titlecase letter character types to the corresponding character class. This leaves the detection of ASCII variables fully intact, but it broadens what is detected as a Unicode variable. For example, dashed low lines and certain digraphs now also indicate variables.

We filled the remaining class of lower letters with all remaining Unicode letter character types, all Unicode mark character types and all non-decimal digit number character types. Among the mark character types we find for example the combining dieresis (UML) (‘\x308\’). Among the non-decimal digit number types we find Roman numerals and fractions. We did not implement any composing or decomposing conversion of character sequences into other character sequences. As a result, character sequences that could be converted into each other are not recognized as identical.

upperscore'    --> UPPERCASE_LETTER |
"\x200C\" | "\x200D\".

﹍A             	% The variable ﹍A (Starts with dashed low line).
Džep % The variable Džep (Starts with digraph dzhe).
Džep % The variable Džep, different from first variable.
a<UML> % The name 'ä'.
ä % The name 'ä', different from first name.
Ⅶ % The name 'Ⅶ' (Roman seven).
⅓ % The name '⅓' (Fraction 1/3).
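
The following Python sketch mirrors the two classes and the missing normalization. It is an illustration under assumptions: the names `starts_variable` and `is_lower` are hypothetical, and the category sets are our reading of the prose (Lu, Lt and Pc for variable starts; the remaining letter, mark and non-decimal number categories plus the two joiner hints for lower letters).

```python
import unicodedata

# Lu = UPPERCASE_LETTER, Lt = TITLECASE_LETTER, Pc = CONNECTOR_PUNCTUATION
VARIABLE_START_TYPES = {"Lu", "Lt", "Pc"}

# Remaining letters, all marks, and the non-decimal digit number types.
LOWER_TYPES = {"Ll", "Lm", "Lo", "Mn", "Mc", "Me", "Nl", "No"}

JOINER_HINTS = {"\u200C", "\u200D"}


def starts_variable(ch: str) -> bool:
    """Would a token starting with ch be read as a variable? (assumption)"""
    return unicodedata.category(ch) in VARIABLE_START_TYPES


def is_lower(ch: str) -> bool:
    """Approximate the lower letter class described above (assumption)."""
    return ch in JOINER_HINTS or unicodedata.category(ch) in LOWER_TYPES


if __name__ == "__main__":
    print(starts_variable("\uFE4D"))   # dashed low line
    print(starts_variable("\u01C5"))   # titlecase digraph
    decomposed = "a\u0308"             # 'a' followed by combining dieresis
    precomposed = "\u00E4"             # single code point a-umlaut
    # Without normalization the two spellings are different names.
    print(decomposed == precomposed)
    print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The last two lines show the consequence of not composing or decomposing: the two spellings only compare equal after an explicit normalization, which the reader does not perform.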

We added the non-decimal digit numbers to the lower letter class since we cannot offer a sensible number conversion for them. Therefore, although they resemble numbers, they are only detected as names. When preceded by a decimal digit they need to be quoted to form a single name. On the other hand, the Unicode decimal digit number character type is well supported. Digits from various scripts can be used and are converted into numbers. To avoid the conversion, one has to put the digits in quotes.

digit'	       --> DECIMAL_DIGIT_NUMBER.

'2⅓'		% The name '2⅓'.
2⅓ % The number 2 and the name ⅓.
`٠` % The name '٠' (Arabic zero).
٠ % The number 0.
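
The difference between decimal digits and other number characters can be illustrated with Python's `unicodedata` module. This is a sketch with hypothetical names (`is_digit`, `digits_to_int`), assuming that category Nd corresponds to DECIMAL_DIGIT_NUMBER; it says nothing about how Jekejeke Prolog performs the conversion internally.

```python
import unicodedata


def is_digit(ch: str) -> bool:
    """Nd = DECIMAL_DIGIT_NUMBER; fractions and Roman numerals are not Nd."""
    return unicodedata.category(ch) == "Nd"


def digits_to_int(text: str) -> int:
    """Convert a run of decimal digits from any script into an integer."""
    value = 0
    for ch in text:
        # unicodedata.decimal raises ValueError for non-decimal characters.
        value = value * 10 + unicodedata.decimal(ch)
    return value


if __name__ == "__main__":
    print(digits_to_int("\u0660"))         # Arabic-Indic zero
    print(digits_to_int("\u0663\u0664"))   # Arabic-Indic three, four
    print(is_digit("\u2153"))              # vulgar fraction one third
```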

We have now covered almost all Prolog text character classes. What remains is the graphic character class. We assign to this character class all the Unicode character types that we did not yet assign. The dollar sign ($) is still covered by the Unicode currency symbol character type. Similarly, it can be verified that all other ASCII graphic characters are still covered. This character class is also broadened by using Unicode character types. For example the euro currency symbol (€) is now also included as a graphic character.

Since graphic characters only glue with themselves and do not glue with delimiters, alphabetical characters or decimal digits, they can still be written directly adjacent to numbers and non-graphic names. Math symbols are now also part of the graphic character class, so care has to be taken when using these symbols adjacent to each other or in front of a period (.). When in doubt, it is best to put spaces between these symbols. Useful math symbols are for example the right arrow (→) (‘\x2192\’) or the bottom (⊥) (‘\x22A5\’).

graphic'        --> DASH_PUNCTUATION |
                    OTHER_PUNCTUATION except ",", ";", "!", "'", "\"" |
                    MATH_SYMBOL except "|" |
                    MODIFIER_SYMBOL except "`" |
                    OTHER_SYMBOL except "\xFFFD\".

\=<>.:?-+*/#@&^~$    % The name '\=<>.:?-+*/#@&^~$'.
2€tax          % The number 2, the name '€' and the name 'tax'.
⊥→⊥.          % The name '⊥→⊥.'.
⊥ → ⊥        % The names '⊥', '→' and '⊥'.
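
The gluing behaviour in these examples can be imitated with a small Python sketch. It is a strong simplification and not the Jekejeke Prolog tokenizer: `is_graphic`, `char_class` and `split_runs` are hypothetical names, the currency symbol type (Sc) is included because the prose says it covers the dollar sign, digits and letters are kept in separate runs, and layout as well as delimiters are simply dropped.

```python
import unicodedata
from itertools import groupby

# Characters excluded from the graphic class by the rule above.
NOT_GRAPHIC = {",", ";", "!", "'", '"', "|", "`", "\uFFFD"}

# Pd, Po, Sm, Sk, So from the rule; Sc (currency) taken from the prose.
GRAPHIC_TYPES = {"Pd", "Po", "Sm", "Sk", "So", "Sc"}


def is_graphic(ch: str) -> bool:
    """Approximate the graphic' class (assumption)."""
    return ch not in NOT_GRAPHIC and unicodedata.category(ch) in GRAPHIC_TYPES


def char_class(ch: str) -> str:
    """Coarse character class used only for this illustration."""
    if is_graphic(ch):
        return "graphic"
    cat = unicodedata.category(ch)
    if cat == "Nd":
        return "digit"
    if cat[0] in "LM" or cat in {"Nl", "No", "Pc"}:
        return "alpha"
    return "other"


def split_runs(text: str) -> list:
    """Split text into runs of the same class, dropping everything else."""
    return ["".join(group)
            for key, group in groupby(text, key=char_class)
            if key != "other"]


if __name__ == "__main__":
    print(split_runs("2\u20ACtax"))
    print(split_runs("\u22A5\u2192\u22A5."))
    print(split_runs("\u22A5 \u2192 \u22A5"))
```

The second call returns a single run because the period is itself a graphic character, which is the reason for the advice to separate math symbols by spaces.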

When writing out Prolog terms, spaces are automatically put around operators. Names are also quoted when necessary. Additionally, the character codes inside quoted names are automatically escaped. The current rule is that all characters belonging to the layout character class except for the space (" ") itself are escaped. Escaping produces a control character, an octal code or a hex code. Hex coding is used for character codes greater than or equal to 512.
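
The escaping rule can be sketched in Python as follows. The fragment is a guess at the shape of the rule, not the actual writer: `escape_char` and `escape_name` are hypothetical names, the set of named control escapes is assumed to be the usual ISO ones, the layout test is the approximation from the prose (Zs, Cf and Cc without the joiner hints), and the escaping of quotes and backslashes is left out.

```python
import unicodedata

# Assumed set of named control escapes (not confirmed by the text).
NAMED_ESCAPES = {"\a": "\\a", "\b": "\\b", "\f": "\\f", "\n": "\\n",
                 "\r": "\\r", "\t": "\\t", "\v": "\\v"}

JOINER_HINTS = {"\u200C", "\u200D"}


def is_layout(ch: str) -> bool:
    """Approximate layout class: Zs, Cf, Cc without the joiner hints."""
    return (ch not in JOINER_HINTS
            and unicodedata.category(ch) in {"Zs", "Cf", "Cc"})


def escape_char(ch: str) -> str:
    """Escape one character of a quoted name according to the rule above."""
    if ch == " " or not is_layout(ch):
        return ch                      # written as is
    if ch in NAMED_ESCAPES:
        return NAMED_ESCAPES[ch]       # control escape such as \t
    code = ord(ch)
    if code < 512:
        return "\\%o\\" % code         # octal code, closed by a backslash
    return "\\x%X\\" % code            # hex code for codes >= 512


def escape_name(text: str) -> str:
    return "".join(escape_char(ch) for ch in text)


if __name__ == "__main__":
    print(escape_name("a b"))
    print(escape_name("\uFEFFabc"))
    print(escape_name("tab\there"))
```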