Tcl Source Code: Check-in [a927119ed1]

Overview
Comment: edits
SHA3-256: a927119ed1665224864ad3fcffada915e95425e821fa00cb0f6d72c17fbc162a
User & Date: dgp 2020-02-12 16:47:31
Context
2020-02-12 17:10  WIP check-in: dc02be5f4d user: dgp tags: dgp-review
2020-02-12 16:47  edits check-in: a927119ed1 user: dgp tags: dgp-review
2020-02-10 20:07  WIP check-in: 53366332b5 user: dgp tags: dgp-review
Changes
Changes to doc/dev/value-history.md.
Old version (lines 575-589, 593-642, 646-660):
	U+00010000 - U+001FFFFF	11110bbb (10bbbbbb)3
	U+00200000 - U+03FFFFFF	111110bb (10bbbbbb)4
	U+04000000 - U+7FFFFFFF	1111110b (10bbbbbb)5
```
</pre>

FSS-UTF encodes each codepoint in the ASCII range as a single-byte,
the same byte already in use for that character.  Longer byte sequences
encode higher codepoints that were less frequently used.  A footnote
in Appendix F makes the observation that encoded sequences of three bytes
are sufficient to cover all 16-bit codepoints of UCS-2 (Unicode). The
longer encodings of four through six bytes prepared the way to cover
codepoints in UCS-4 up to 31 bits.

(Side observation: it would be
................................................................................
easily covering the remainder of the 32-bit range of all of UCS-4.  It 
seems the value of keeping byte **0xFE** out of encoded streams outweighed
the failure to encode the upper half of UCS-4, which had no foreseeable use.
The bytes **0xFE** and **0xFF** together make up a Byte Order Mark that
is used to determine endianness of UCS-2 data. When neither byte appears
in UTF-8 or FSS-UTF the confidence of that interpretation is boldened.)

In Tcl 8.1, "UTF-8" encoding is accomplished by the routine

>	**int** **Tcl_UniCharToUtf** ( **int** *ch*, **char** *_str_);

The routine is (nearly) a direct translation of the table of encoding rules
of FSF-UTF.  However it is clear that a decision was made that Tcl would
support "Unicode" which meant UCS-2 which meant 16-bit codepoints as
"Uncode characters".  Unicode characters require a maximum of
three encoded bytes, and Tcl 8.1 source code uses **TCL\_UTF\_MAX**
to represent that limit. It is useful in sizing buffers where encoded
string are to be written.  The public header for Tcl 8.1 includes

>	\#define **TCL\_UTF\_MAX**	3

and

>	typedef **unsigned short Tcl_UniChar**;

However, the body of **Tcl_UniCharToUtf** includes code to
encode up to six-byte sequences representing up to 31-bit codepoints,
protected by a conditional compiler directive

>	\#if **TCL\_UTF\_MAX** > 3

With the default setting, Tcl is able to encode a total of 65,536
distinct values of the *ch* codepoint argument. These are all the
codepoints of UCS-2, the complete capacity of Unicode 1.1. When
**Tcl_UniCharToUtf** is passed a *ch* argument outisde that supported
range, the codepoint **U+FFFD** is encoded in its place. Unicode 1.1
assigns this codepoint the name **REPLACEMENT CHARACTER**. Unicode
prescribes the use of **U+FFFD** to replace an incoming character
whose value is unrepresentable.  In a configuration of Tcl 8.1 with
**TCL\_UTF\_MAX** set to 6, the **REPLACEMENT CHARACTER** would
replace only the *ch* values in the upper half of UCS-4.  Note that
Tcl 8.1 has made a design choice here to handle errors or unsupported
operations via replacement, and not via raising an error.

It is clear that the immediate Unicode support in Tcl was intentionally
limited to UCS-2, while conventions and migration supports were put in place
so that a future (binary incompatible) version of Tcl would expand its support
to UCS-4 with FSS-UTF encoding sequences of up to six bytes, at some point
where the considerable cost in memory and processing burden traded off
against the benefits of support for whatever future codepoint assignments
................................................................................
assigns characters to only 40,635 rows in that table. If we generate all
possible sequences of those 40,635 assigned codepoints, and then encode
each sequence via the FSS-UTF rules, we generate a set of byte sequences
far smaller and more constrained than the general set of all byte sequences.
When we claim that Tcl 8.1 strings are kept in the UTF-8 encoding, we
imply that Tcl 8.1 strings are constrained to a much smaller set of byte
sequences than were permitted for Tcl 8.0 strings.  This raises questions
about both compatibility and decoding of non-conformant byte sequences.






decoding and strictness

New version (lines 575-589, 593-642, 646-661):
	U+00010000 - U+001FFFFF	11110bbb (10bbbbbb)3
	U+00200000 - U+03FFFFFF	111110bb (10bbbbbb)4
	U+04000000 - U+7FFFFFFF	1111110b (10bbbbbb)5
```
</pre>

FSS-UTF encodes each codepoint in the ASCII range as a single-byte,
the same byte already used by ASCII for that character.  Longer byte sequences
encode higher codepoints that were less frequently used.  A footnote
in Appendix F makes the observation that encoded sequences of three bytes
are sufficient to cover all 16-bit codepoints of UCS-2 (Unicode). The
longer encodings of four through six bytes prepared the way to cover
codepoints in UCS-4 up to 31 bits.

(Side observation: it would be
................................................................................
easily covering the remainder of the 32-bit range of all of UCS-4.  It 
seems the value of keeping byte **0xFE** out of encoded streams outweighed
the failure to encode the upper half of UCS-4, which had no foreseeable use.
The bytes **0xFE** and **0xFF** together make up a Byte Order Mark that
is used to determine endianness of UCS-2 data. When neither byte appears
in UTF-8 or FSS-UTF the confidence of that interpretation is boldened.)

In Tcl 8.1, encoding of Unicode codepoints is accomplished by the routine

>	**int** **Tcl_UniCharToUtf** ( **int** *ch*, **char** *_str_);

The routine is (nearly) a direct translation of the table of encoding rules
of FSS-UTF.  However it is clear that a decision was made that Tcl would
support "Unicode" which meant UCS-2 which meant 16-bit codepoints as
"Unicode characters".  Unicode characters require a maximum of
three encoded bytes, and Tcl 8.1 source code uses **TCL\_UTF\_MAX**
to represent that limit. It is useful in sizing buffers where encoded
strings are to be written.  The public header for Tcl 8.1 includes

>	\#define **TCL\_UTF\_MAX**	3

and

>	typedef **unsigned short Tcl_UniChar**;

However, the body of **Tcl_UniCharToUtf** includes code to
encode up to six-byte sequences as prescribed by FSS-UTF representing
up to 31-bit codepoints, protected by a conditional compiler directive

>	\#if **TCL\_UTF\_MAX** > 3 .

With the default setting, Tcl is able to encode a total of 65,536
distinct values of the *ch* codepoint argument. These are all the
codepoints of UCS-2, the complete capacity of Unicode 1.1. When
**Tcl_UniCharToUtf** is passed a *ch* argument outside that supported
range, the codepoint **U+FFFD** is encoded in its place. Unicode 1.1
assigns this codepoint the name **REPLACEMENT CHARACTER**. Unicode
prescribes the use of **U+FFFD** to replace an incoming character
whose value is unrepresentable.  In a configuration of Tcl 8.1 with
**TCL\_UTF\_MAX** set to 6, the **REPLACEMENT CHARACTER** would
replace only the *ch* values in the upper half of UCS-4.  Note that
Tcl 8.1 has made a design choice here to handle unsupported inputs
via replacement, and not via raising an error.
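
The encoding and replacement behavior described above can be sketched in C.
This is a minimal illustration written directly from the FSS-UTF table, not
the Tcl 8.1 source. The name **sketch\_fss\_utf\_encode** and the *max\_bytes*
parameter (standing in for the compile-time **TCL\_UTF\_MAX**) are hypothetical,
and the sketch does not try to reproduce any place where the real routine
departs from the table.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of Tcl.  Encodes codepoint cp into buf
 * following the FSS-UTF table and returns the number of bytes written.
 * max_bytes plays the role of TCL_UTF_MAX: 3 limits the encoder to the
 * 16-bit range, 6 allows codepoints of up to 31 bits.  The caller must
 * supply a buffer of at least max_bytes bytes.
 */
static int
sketch_fss_utf_encode(long cp, unsigned char *buf, int max_bytes)
{
    /* Largest codepoint representable by a sequence of 1..6 bytes. */
    static const long limit[6] = {
        0x7FL, 0x7FFL, 0xFFFFL, 0x1FFFFFL, 0x3FFFFFFL, 0x7FFFFFFFL
    };
    /* Prefix bits of the first byte for a sequence of 1..6 bytes. */
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    int n, i;

    assert(buf != NULL);
    if (max_bytes < 3) max_bytes = 3;   /* U+FFFD itself needs 3 bytes */
    if (max_bytes > 6) max_bytes = 6;

    /* Unsupported input is replaced, not reported as an error. */
    if (cp < 0 || cp > limit[max_bytes - 1]) {
        cp = 0xFFFD;                    /* REPLACEMENT CHARACTER */
    }

    /* Find the shortest sequence length that can hold cp. */
    for (n = 1; cp > limit[n - 1]; n++) {
        /* empty */
    }

    /* Fill continuation bytes (10bbbbbb) from the last one backward. */
    for (i = n - 1; i > 0; i--) {
        buf[i] = (unsigned char) (0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    buf[0] = (unsigned char) (lead[n - 1] | cp);
    return n;
}
```

With *max\_bytes* set to 3, a call with codepoint 0x10000 writes the three
bytes of **U+FFFD**. With *max\_bytes* set to 6, the same call writes a
four-byte sequence, and only negative values (the upper half of a 32-bit
range) are replaced.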

It is clear that the immediate Unicode support in Tcl was intentionally
limited to UCS-2, while conventions and migration supports were put in place
so that a future (binary incompatible) version of Tcl would expand its support
to UCS-4 with FSS-UTF encoding sequences of up to six bytes, at some point
where the considerable cost in memory and processing burden traded off
against the benefits of support for whatever future codepoint assignments
................................................................................
assigns characters to only 40,635 rows in that table. If we generate all
possible sequences of those 40,635 assigned codepoints, and then encode
each sequence via the FSS-UTF rules, we generate a set of byte sequences
far smaller and more constrained than the general set of all byte sequences.
When we claim that Tcl 8.1 strings are kept in the UTF-8 encoding, we
imply that Tcl 8.1 strings are constrained to a much smaller set of byte
sequences than were permitted for Tcl 8.0 strings.  This raises questions
about both compatibility and what a decoder should do with a non-conformant
byte sequence.





decoding and strictness