Unicode Extension

Author: J. Burse, XLOG Technologies GmbH, Switzerland
Date: 21. March 2011
Version: Preview Jekejeke Prolog 0.8.8

The ISO core standard allows a processor to extend the set of character codes beyond those it defines. Jekejeke Prolog supports the 16-bit Unicode character set as its processor extension. This character set need not be the same as the character set used to encode resources. We currently support Version 4.0 of Unicode. The following definitions replace the corresponding definitions in the previous sections. We indicate a replaced non-terminal by appending a single quote ('). The Unicode character types are written in all upper case.

As far as fillers are concerned, we changed neither the line comments, nor the block comments, nor the end of line. We only extended the class of spaces. Including the FORMAT Unicode character type covers the byte order mark (BOM) ('\xFEFF\'). Including the SURROGATE, UNASSIGNED and PRIVATE_USE Unicode character types means that Unicode characters outside the 16-bit range, characters assigned only after Version 4.0, and private-use characters that Version 4.0 leaves undefined are treated as spaces. If such characters are needed, they have to be put in quotes.

The definition of strings also remained unchanged. For words, on the other hand, we took the following approach. To the delimiters we added all Unicode punctuation character types that correspond to parentheses or quotes. As a result these characters do not glue with other characters. The remaining punctuation characters were ambiguous between the delimiters and the graphic characters, so at this stage we did not use further Unicode punctuation character types for either delimiters or graphic characters.

space'     --> UNASSIGNED |
               SURROGATE |
               PRIVATE_USE |
               SPACE_SEPARATOR |
               LINE_SEPARATOR |
               PARAGRAPH_SEPARATOR |
               CONTROL |
               FORMAT.
delimiter' --> START_PUNCTUATION |
               END_PUNCTUATION |
               INITIAL_QUOTE_PUNCTUATION |
               FINAL_QUOTE_PUNCTUATION |
               "," | ";" | "!" | "|".

Examples:

<BOM>:-        % The atom ':-'.
ȷabc           % The atom 'abc' (dotless j was added after Version 4.0 and is treated as a space).
'ȷ'            % The atom 'ȷ'.
«»             % The atom '«' and the atom '»'.
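
The space' and delimiter' rules above can be approximated outside the Prolog system. The following is a minimal sketch in Python, not the Jekejeke Prolog implementation. The function names is_space and is_delimiter are hypothetical. The two-letter codes are the standard Unicode general category abbreviations for the character types named in the rules. Note that Python ships a newer Unicode database than Version 4.0, so characters assigned after Version 4.0 (such as the dotless j) are not reported as UNASSIGNED here.

```python
import unicodedata

# General category codes for the character types listed in space'.
SPACE_CATEGORIES = {
    "Cn",  # UNASSIGNED
    "Cs",  # SURROGATE
    "Co",  # PRIVATE_USE
    "Zs",  # SPACE_SEPARATOR
    "Zl",  # LINE_SEPARATOR
    "Zp",  # PARAGRAPH_SEPARATOR
    "Cc",  # CONTROL
    "Cf",  # FORMAT
}

# General category codes for the character types listed in delimiter'.
DELIMITER_CATEGORIES = {
    "Ps",  # START_PUNCTUATION
    "Pe",  # END_PUNCTUATION
    "Pi",  # INITIAL_QUOTE_PUNCTUATION
    "Pf",  # FINAL_QUOTE_PUNCTUATION
}

# The individual characters listed in delimiter'.
SOLO_DELIMITERS = set(",;!|")


def is_space(ch):
    """Return True if ch would fall into space' (sketch)."""
    # Assumption: in a 16-bit character set a code point above U+FFFF
    # only shows up as surrogates, so it is counted as a space here.
    if ord(ch) > 0xFFFF:
        return True
    return unicodedata.category(ch) in SPACE_CATEGORIES


def is_delimiter(ch):
    """Return True if ch would fall into delimiter' (sketch)."""
    return ch in SOLO_DELIMITERS or unicodedata.category(ch) in DELIMITER_CATEGORIES
```

With these definitions, is_space("\ufeff") holds for the byte order mark, and is_delimiter holds for both "«" and "»", which is why the two guillemets in the last example do not glue together.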

There is one exception concerning the Unicode punctuation character types: we use the Unicode connector punctuation character type to detect the underscore. The underscore marks the start of a variable. We also added the Unicode uppercase and titlecase letter character types to the corresponding character class. This leaves the detection of ASCII variables fully intact, but it broadens what is detected as a Unicode variable. For example, dashed low lines and certain digraphs now also start variables.

We filled the remaining class of lower letters with all remaining Unicode letter character types, all Unicode mark character types and all non-decimal-digit number character types. Among the mark character types we find, for example, the combining dieresis (UML) ('\x308\'). Among the non-decimal-digit number character types we find Roman numerals and fractions. We did not implement any composing or decomposing conversion of character sequences into other character sequences. As a result, character sequences that are convertible into each other are not recognized as identical.
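
The consequence of not converting character sequences can be seen with a plain code-point comparison. The following Python sketch is only an illustration. The helper names same_name and same_name_after_nfc are hypothetical, and the NFC normalization step is shown purely for contrast, since the processor described here performs no such conversion.

```python
import unicodedata

DECOMPOSED = "a\u0308"   # 'a' followed by the combining dieresis
PRECOMPOSED = "\u00e4"   # the single character 'ä'


def same_name(a, b):
    """Compare two names code point by code point, without any conversion."""
    return a == b


def same_name_after_nfc(a, b):
    """Compare two names after composing them (NFC); shown for contrast only."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here same_name(DECOMPOSED, PRECOMPOSED) is false although both strings are displayed the same way, whereas same_name_after_nfc(DECOMPOSED, PRECOMPOSED) is true.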

upperscore' --> UPPERCASE_LETTER |
                TITLECASE_LETTER |
                CONNECTOR_PUNCTUATION.
lower'      --> LOWERCASE_LETTER |
                MODIFIER_LETTER |
                OTHER_LETTER |
                NON_SPACING_MARK |
                ENCLOSING_MARK |
                COMBINING_SPACING_MARK |
                LETTER_NUMBER |
                OTHER_NUMBER.

Examples:

﹍A             	% The variable ﹍A (Starts with dashed low line).
Džep % The variable Džep (Starts with digraph dzhe).
Džep % The variable Džep, different from first variable.
a<UML> % The atom 'ä'.
ä % The atom 'ä', different from the previous atom.
Ⅶ % The atom 'Ⅶ' (Roman seven).
⅓ % The atom '⅓' (Fraction 1/3).
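
The upperscore' and lower' rules can be checked against the examples above with a short Python sketch. The function names starts_variable and is_lower are hypothetical, and the category codes are the standard Unicode abbreviations for the character types in the rules. This is an approximation using Python's own Unicode database, not the Version 4.0 tables used by the processor.

```python
import unicodedata

# General category codes for the character types listed in upperscore'.
UPPERSCORE_CATEGORIES = {
    "Lu",  # UPPERCASE_LETTER
    "Lt",  # TITLECASE_LETTER
    "Pc",  # CONNECTOR_PUNCTUATION
}

# General category codes for the character types listed in lower'.
LOWER_CATEGORIES = {
    "Ll",  # LOWERCASE_LETTER
    "Lm",  # MODIFIER_LETTER
    "Lo",  # OTHER_LETTER
    "Mn",  # NON_SPACING_MARK
    "Me",  # ENCLOSING_MARK
    "Mc",  # COMBINING_SPACING_MARK
    "Nl",  # LETTER_NUMBER
    "No",  # OTHER_NUMBER
}


def starts_variable(word):
    """Return True if the first character of a non-empty word is in upperscore'."""
    return unicodedata.category(word[0]) in UPPERSCORE_CATEGORIES


def is_lower(ch):
    """Return True if ch would fall into lower' (sketch)."""
    return unicodedata.category(ch) in LOWER_CATEGORIES
```

For instance, starts_variable("\ufe4dA") holds because the dashed low line is connector punctuation, and is_lower("\u2166") holds because the Roman numeral seven is a letter number.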

We added the non-decimal-digit numbers to the lower letter class because we cannot offer a sensible number conversion for them. Although they resemble numbers, they are therefore only detected as atoms. When preceded by a decimal digit they need to be quoted. The Unicode decimal digit number character type, on the other hand, is well supported. Digits from various scripts can be used, and they are converted into numbers. To avoid the conversion, the digits have to be put in quotes.

digit' --> DECIMAL_DIGIT_NUMBER.

Examples:

'2⅓'		% The atom '2⅓'.
2⅓ % Illegal number word.
'٠' % The atom '٠' (Arabic zero).
٠ % The number 0.
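
The conversion of decimal digits from various scripts can be sketched in Python as follows. The function name to_number is hypothetical, and the sketch only handles unsigned integers. It uses unicodedata.decimal to obtain the value of each digit and rejects any character that is not of type DECIMAL_DIGIT_NUMBER, which mirrors why a fraction or Roman numeral cannot appear inside a number.

```python
import unicodedata


def to_number(word):
    """Convert a word made only of Unicode decimal digits into an integer.

    Raises ValueError for an empty word or for any character whose
    general category is not Nd (DECIMAL_DIGIT_NUMBER).
    """
    if not word:
        raise ValueError("empty word")
    value = 0
    for ch in word:
        if unicodedata.category(ch) != "Nd":
            raise ValueError("not a decimal digit: %r" % ch)
        value = value * 10 + unicodedata.decimal(ch)
    return value
```

In this sketch to_number("\u0660") yields 0 for the Arabic zero, while to_number("2\u2153") raises ValueError because the fraction one third is of type OTHER_NUMBER.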

We have now covered almost all Prolog text character classes. What remains is the graphic character class. We assign to this character class all the Unicode character types that we have not yet assigned. The dollar sign ($) is still covered, namely by the Unicode currency symbol character type. Similarly, it can be verified that all other ASCII graphic characters are still covered. This character class is also broadened by using Unicode character types. For example, the euro currency symbol (€) is now also a graphic character.

Since graphic characters don't glue with delimiters, alphabetical characters or decimal digits, they can still be written directly adjacent to numbers and non-graphic atoms. But graphic characters still glue with each other when forming graphic atoms. Since math symbols are now also part of the graphic character class, care has to be taken when using these symbols next to each other or in front of a period (.). When in doubt, it is best to put spaces between these symbols. Useful math symbols are, for example, the right arrow (→) ('\x2192\') or the bottom (⊥) ('\x22A5\').

graphic' --> DASH_PUNCTUATION |
             OTHER_PUNCTUATION except ",", ";", "!", "'", "\"" |
             MATH_SYMBOL except "|" |
             CURRENCY_SYMBOL |
             MODIFIER_SYMBOL except "`" |
             OTHER_SYMBOL.

Examples:

\=<>.:?-+*/#@&^~$    % The atom '\=<>.:?-+*/#@&^~$'.
2€tax          % The number 2, the atom '€' and the atom 'tax'.
⊥→⊥.          % The atom '⊥→⊥.'.
⊥ → ⊥        % The atoms '⊥', '→' and '⊥'.
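
The gluing behaviour in these examples can be reproduced with a small Python sketch. The function names is_graphic, kind and split_words are hypothetical. The sketch is deliberately simplified: it only separates graphic characters from everything else and drops white space, so it is not a Prolog tokenizer and does not handle comments, quotes or the end period.

```python
import unicodedata
from itertools import groupby


def is_graphic(ch):
    """Return True if ch would fall into graphic' (sketch)."""
    cat = unicodedata.category(ch)
    if cat in ("Pd", "Sc", "So"):   # DASH_PUNCTUATION, CURRENCY_SYMBOL, OTHER_SYMBOL
        return True
    if cat == "Po":                 # OTHER_PUNCTUATION with its exceptions
        return ch not in ",;!'\""
    if cat == "Sm":                 # MATH_SYMBOL except "|"
        return ch != "|"
    if cat == "Sk":                 # MODIFIER_SYMBOL except "`"
        return ch != "`"
    return False


def kind(ch):
    """Coarse class used for grouping: graphic, space or other."""
    if is_graphic(ch):
        return "graphic"
    if ch.isspace():
        return "space"
    return "other"


def split_words(text):
    """Group adjacent characters of the same coarse class and drop spaces."""
    return ["".join(group) for k, group in groupby(text, key=kind) if k != "space"]
```

Applied to the examples, split_words("2€tax") gives the three words "2", "€" and "tax", split_words("⊥→⊥.") gives the single word "⊥→⊥.", and split_words("⊥ → ⊥") gives three separate words.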

When Prolog terms are written out, spaces are automatically put around operators. Atoms are also quoted when necessary. Additionally, the character codes inside quoted atoms are automatically escaped. The current rule is that all characters belonging to the space character class, except for the space (" ") itself, are escaped. Escaping produces a control character escape, an octal code or a hex code. Hex coding is used for character codes greater than or equal to 512.
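
The escaping rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual output routine. The function names escape_char and escape_text are hypothetical. The set of named control escapes and the exact octal form (\NNN\) are assumptions based on the usual ISO Prolog escape syntax, and the hex form follows the '\xFEFF\' notation used earlier in this section. Escaping of the quote and the backslash themselves is left out.

```python
import unicodedata

# General category codes for the character types listed in space'.
SPACE_CATEGORIES = {"Cn", "Cs", "Co", "Zs", "Zl", "Zp", "Cc", "Cf"}

# Assumption: control characters with a named escape in ISO Prolog syntax.
CONTROL_ESCAPES = {
    "\a": "\\a", "\b": "\\b", "\f": "\\f", "\n": "\\n",
    "\r": "\\r", "\t": "\\t", "\v": "\\v",
}


def escape_char(ch):
    """Escape one character of a quoted atom according to the rule above."""
    # Only members of the space class are escaped; the space itself is kept.
    if ch == " " or unicodedata.category(ch) not in SPACE_CATEGORIES:
        return ch
    if ch in CONTROL_ESCAPES:
        return CONTROL_ESCAPES[ch]
    code = ord(ch)
    if code >= 512:
        return "\\x%X\\" % code   # hex code, e.g. \xFEFF\
    return "\\%o\\" % code        # octal code, e.g. \240\


def escape_text(text):
    """Escape every character of the text of a quoted atom."""
    return "".join(escape_char(ch) for ch in text)
```

Under these assumptions the byte order mark is written as \xFEFF\, the no-break space (code 160) as \240\, and a line feed as \n, while ordinary letters and the space pass through unchanged.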