Tk Source Code

View Ticket
Login
Ticket UUID: a1795648260f658aae9c1209c4f35a5bcb5b684d
Title: Tk 8.6: prevent issues when encountering non-BMP Unicode characters
Type: Bug Version: core-8-6-branch
Submitter: chrstphrchvz Created on: 2019-07-30 20:08:22
Subsystem: 99. Other Assigned To: jan.nijtmans
Priority: 5 Medium Severity: Minor
Status: Closed Last Modified: 2022-10-28 22:35:03
Resolution: Fixed Closed By: jan.nijtmans
    Closed on: 2022-10-28 22:35:03
Description:

It is currently possible for users to trigger strange behavior in Tk 8.6 widgets by pasting or typing a non-BMP character into them. The exact behavior observed depends on the platform and the specific widget, but I have not found examples where these characters are handled well. Non-BMP characters may be initially be displayed as the correct glyph, but depending on the cursor's position the surrounding text may move around or disappear, and extra rows may appear. Non-BMP characters may require moving the cursor more than once to skip over them (usually twice on Windows, and 4 times on Aqua and X11). On Aqua, large portions of text widgets may start blinking as the cursor does.

If desirable, I can provide more details or screen recordings.

While full support for non-BMP Unicode is planned for 8.7 (TIP #389)/9.0 (TIP #497), I think that in the meantime it would be desirable for Tk 8.6 (or whenever TCL_UTF_MAX == 3) to address any issues that occur when encountering non-BMP characters, and not require programs to work around them (e.g. by sanitizing input). One way of dealing with these characters might be to substitute them with replacement characters (U+FFFD); this idea has probably been mentioned before ([6c0d7aec]), and might correspond better to what currently happens when attempting to insert non-BMP characters from Tcl.

User Comments: jan.nijtmans added on 2022-10-28 22:35:03:

Since the glyph_indexing_2 branch is merged to trunk now ..... closing


marc_culler (claiming to be Marc Culler) added on 2020-05-23 19:09:56:

I have written a prototype implementation of glyph-based indexing. It is in a branch off of 8.7 called glyph_indexing.

It adds only a small amount of code to tkEntry.c and 400 lines of reusable code in tkMacOSXFont.c (half of which are comments) that implement a TextManager object. (Well, this is C, so it is similar to an object.) The TextManager leverages the NSMutableString class to handle the work of finding grapheme clusters in a unicode string. The interface to the TextManager is specified by C declarations in tkInt.h. There is a conditional compilation switch USE_GLYPH_INDEXING which a platform can use to say whether it provides an implementation of TextManager.


marc_culler (claiming to be Marc Culler) added on 2020-05-19 17:23:51:
I agree that the Canvas Text indices should also be glyph indices.

Does "corruption" mean segmentation faults or invalid encodings?  Certainly we
don't want segmentation faults.  But we currently support creating invalid
encodings.  We allow our users to insert text between a high surrogate and a low
surrogate and that produces a byte sequence with an invalid encoding.  Using
glyph indexes consistently would prevent that kind of corruption, but fixing
just TkUtfPrev can not.

It seems to me that if we can ensure that the new functions cannot cause
segmentation faults, then using glyph indexing as much as possible could only
improve the situation with respect to invalid encodings.  I think it should be
possible to guard against segfaults, because those functions are always being
called from a context where it is known exactly how long the char array is.

jan.nijtmans added on 2020-05-19 16:38:29:

So, how about the functions TextInsert(canvas, itemPtr, index ...) and TextDeleteChars(canvas, itemPtr, first, last). Should "index", "first", "last" be Glyph indices in stead of Character indices too? I think they should, but it is becoming a big change converting the whole of Tk from Character indices to Glyph indices. I simply don't know if it's wise to do that in a 8.6.11 patch release, that's my concern. Changing Tcl_UtfPrev() to TkUtfPrev() is not dangerous, it just prevents the pointer to end in the middle of a surrogate pair. TkNumUtfChars() and TkUtfCharAtIndex() could lead to corruption if somewhere the index is miscalculated.

Better would be - I think - to implement this change in Tk 8.7, and not yet backport it to Tk 8.6. The next 8.6. releases are not far away any more (they should have been there in March already ...). So it can be tested somewhat more in the wild first....

How about that? I'm just being careful ....


marc_culler (claiming to be Marc Culler) added on 2020-05-19 15:16:25:
People conflate sequences of bytes/UniChars/ints with sequences of glyphs but
they are not the same thing. A text-based widget like entry or text deals with
glyphs, not with code points. When you are inserting, deleting, displaying
or placing cursors in one of those widgets you are working with glyphs.  When
you "get" the text from one of these widgets you need to get it encoded as a
sequence of code points determined by some encoding. And sequences of glyphs
can be encoded as sequences of code points.  But, even with unicode, the
correspondence from encoded sequences to glyphs is not given by a function
from code points to glyphs.  (Unicode has "variant selectors" which can follow
a unicode code point to select one of 16 variants of a glyph.)

So I want to advocate for the point of view that text-based widgets need to use
glyph based tools, not code point based tools.  And I think that TkUtfAtIndex
should (and almost is now, except for variant selectors) be a glyph-based tool.
It is by far the most important one, but going from one glyph to the previous
one seems very important too.  I think it is essential for the text-based
widgets to have glyph-based tools available if we want to make editing text
behave rationally.  It is really independent from any aspect of how Tcl handles
sequences of code points.

jan.nijtmans added on 2020-05-19 14:43:59:

The pasting (on macOS) of Emoji on the test widget now works as expected: You can jump over it as expected with the cursor keys. This part is merged to core-8-6-branch now.

Regarding the new functions TkUtfAtIndex() and TkNumUtfChars(), it should be carefully reviewed when to use those functions and when the Tcl* variants. The Tk* variants use different indexes than Tcl, because in Tcl Emoji still count as 2 characters. Therefore I'm not 100% sure yet (say: 75% or so), so I didn't merge it yet. But I like the idea: Yes, something like this is how it should be. If there is a way to make the indexing in entry widgets the same as in Tcl, that would be prefered. But I don't know if that's possible.


marc_culler (claiming to be Marc Culler) added on 2020-05-14 21:02:50:
Please ignore my segfault report.  That was caused by a print statement that
I had added while looking at the issue of making the entry work correctly with
non-BMP characters.  When I saw the commit I wanted to see how it affected
this, and neglected to remove my edits.  With my changes gone, the delete key
deletes one surrogate, as before, but does not segfault.

Hopefully computing string lengths and correctly computing the byte index of a
unicode character will soon be working!  It looks like you are on track to make
that happen.

marc_culler (claiming to be Marc Culler) added on 2020-05-14 20:18:02:

With commit 95cf4257d6 to the bug-a179564826 branch there is a segfault on macOS if you enter an emoji character in an entry widget and then press the delete key. Before this commit the delete action would remove one surrogate and leave the other surrogate in place, printed out as some sort of icon character.

Here is the traceback:

    

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   Tcl                           	0x000000010e2fdd99 Tcl_UtfToUniChar + 4
1   Tk                            	0x000000010e01f271 TkUtfToUniChar + 49 (tkUtil.c:1223)
2   Tk                            	0x000000010e02f713 DeleteChars + 227 (tkEntry.c:2163)
3   Tk                            	0x000000010e0297e6 EntryWidgetObjCmd + 1350 (tkEntry.c:684)
4   Tcl                           	0x000000010e237810 0x10e221000 + 92176
5   Tcl                           	0x000000010e234837 TclNRRunCallbacks + 79
6   Tcl                           	0x000000010e235a4e 0x10e221000 + 84558
7   Tcl                           	0x000000010e235297 Tcl_EvalEx + 26
8   Tk                            	0x000000010dfd3686 Tk_BindEvent + 5686 (tkBind.c:2592)
9   Tk                            	0x000000010dfe2f43 TkBindEventProc + 771 (tkCmds.c:319)
10  Tk                            	0x000000010dff04c3 Tk_HandleEvent + 755
11  Tk                            	0x000000010dff12c9 WindowEventProc + 217 (tkEvent.c:1724)
12  Tcl                           	0x000000010e2de33c Tcl_ServiceEvent + 139
13  Tcl                           	0x000000010e2de5d4 Tcl_DoOneEvent + 319
14  Tk                            	0x000000010dff19e9 Tk_MainLoop + 41 (tkEvent.c:2106)
15  Tk                            	0x000000010e0098e8 Tk_MainEx + 2376 (tkMain.c:379)
16  Wish                          	0x000000010dfc415d 0x10dfc1000 + 12637
17  libdyld.dylib                 	0x00007fff63c677fd start + 1


chrstphrchvz added on 2020-04-08 14:55:31:

I don't think this can be considered "Fixed" given that testing still reveals the issue (at least on Mac where moving cursor backwards ends up in the middle of a character rather than skipping over it completely; tested Tk @ [5513fc0b74], Tcl @ cf27ed353f).

Also, though I have not paid close attention to what has happened on Tcl-side for this issue, I imagine it may be useful to mention/document the internal behavior as intentionally compliant with or at least resembling CESU-8 despite "Utf" in the Tcl function names, just to have an externally-recognized name for the approach it's using.


jan.nijtmans added on 2020-04-06 11:44:41:

Since we did the best we could for 8.6, closing this, for now.

I could happen this ticket has to be re-opened, with the resolution changed to "Won't fix". That depends on what happens with [31aa44375de2c87e]


chrstphrchvz added on 2019-11-11 01:01:04:

See also [a179564826]

Sorry, wrong ticket. I meant [00a27923ee]


chrstphrchvz added on 2019-11-11 00:58:36:

See also [a179564826] which is an earlier example of this issue on Aqua—it was already marked as "Fixed", but I think the more recent change [43e8977179] is a direct fix/workaround for it.


jan.nijtmans added on 2019-11-01 11:07:26:
A little bit more progress, increasing XMaxTransChars from 4 to 7.

jan.nijtmans added on 2019-10-24 21:30:56:
@marc_culler:  Thanks!    Looks good!    This can always be improved further, at least fixing the crash is already a good improvement for now.

marc_culler (claiming to be Marc Culler) added on 2019-10-24 15:52:26:

In 43e89771, which is in the bug-39de9677aa branch, I added a filter which replaces any non-BMP character in a UTF-8 string with U+FFFD before pasting the string into a Tk widget.

By the way, the strange behaviors where text would move or disappear or the cursor would behave incorrectly all seem to be caused by the fact that Tcl cannot count the number of UTF-8 characters in a string which contains 4-byte UTF-8 codes.


jan.nijtmans added on 2019-09-17 08:27:35:
On Windows/UNIX this should work now, with characters below /U3000.

On MacOS (Aqua) more work is needed, therefore not closing this ticket yet.

chrstphrchvz added on 2019-08-07 09:26:06:

Alright, I have tried f6eb4196ee, and did not observe any difference/improvement on Windows, nor on Mac with Aqua or X11.


chrstphrchvz added on 2019-08-07 08:25:19:

See: bug-a179564826 branch

I'd like to give this a try, though so far I've gotten away with Tcl 8.6.9 for testing Tk core-8-6-branch so far on Aqua, and haven't tried Tcl core-8-6-branch. Is this a small enough change that can be applied as a patch to the 8.6.9 release, or are there plenty more changes in 8.6.10 needed?


jan.nijtmans added on 2019-08-01 08:03:51:
The biggest problem is that in Tcl8.6, 4-byte UTF-8 sequences are handled as invalid, which result in being handled as 4 separate bytes. Tcl 8.7 improved that by handling it as 2 UTF-8 sequences, one for the High surrogate and one more for the Low surrogate.

This part of Tcl 8.7 could be backported to 8.6: It wouldn't change anything, except that 4-byte incoming UTF-8 sequence are handled better. Tcl 8.6 will NEVER produce those 4-byte UTF-8 sequences (Tcl 8.7 does!) doing that could give unpleasent surprises to applicationes which don't expect that.

See: [https://core.tcl-lang.org/tcl/info/f6eb4196ee4d6fa8|bug-a179564826] branch