Tcl Source Code

Ticket UUID: 9f9c828fde97a4c8a7a528dec3482ecce0838419
Title: trim issue with chinese characters
Type: Bug
Version: 8.6
Submitter: anonymous
Created on: 2024-11-05 15:51:41
Subsystem: 44. UTF-8 Strings
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Important
Status: Closed
Last Modified: 2024-11-07 16:14:59
Resolution: Invalid
Closed By: jan.nijtmans
Closed on: 2024-11-07 16:14:59
Description:
When running the following code, the string of Chinese characters gets corrupted.

set comment "测试包装"
puts $comment
set comment [string trim $comment]
puts $comment

This results in the following output:
before trim
测试包装

after trim
测试包�
User Comments:

jan.nijtmans added on 2024-11-07 16:14:59:

Found this

Excerpt:

4) a simple solution is to pass the Apache process's LANG environment variable through to the CGI script using Apache's mod_env PassEnv command in the server or virtual host configuration: PassEnv LANG; on Debian/Ubuntu make sure that in /etc/apache2/envvars you have uncommented the line ". /etc/default/locale" so that Apache runs with the system default locale and not the C (Posix) locale (which is also ASCII encoding)

Make sure a line similar to LANG="en_US.UTF-8" is present in /etc/default/locale

So, I think this ticket can be closed: It's not a bug in Tcl
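
As a sketch of an alternative that does not depend on the web server's locale, a script can also be read with an explicit encoding via `source -encoding`. The example below is self-contained and assumes Tcl 8.6; the temporary file merely stands in for a UTF-8 encoded script such as test_chinese.tcl, and the variable names are illustrative only.

```tcl
# Create a temporary script file containing UTF-8 bytes (a hypothetical
# stand-in for a UTF-8 encoded CGI script).
set chan [file tempfile path]
fconfigure $chan -encoding utf-8
puts $chan "set comment \[string trim \"\u6d4b\u8bd5\u5305\u88c5\"\]"
close $chan

# Read the file as iso8859-1, the encoding reported under Apache in this
# ticket: every UTF-8 byte becomes its own character, and the trailing
# byte 0x85 is removed by the trim.
source -encoding iso8859-1 $path
set asLatin1 [string length $comment]

# Read the same file as utf-8: the four intended characters survive.
source -encoding utf-8 $path
set asUtf8 [string length $comment]

file delete $path
puts "iso8859-1: $asLatin1 characters, utf-8: $asUtf8 characters"
```

In a real CGI setup, a small wrapper script could call `source -encoding utf-8` on the actual script and set `fconfigure stdout -encoding utf-8` before producing output; whether that is preferable to fixing LANG depends on the deployment.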


oehhar added on 2024-11-06 14:10:31:

Yes, that explains the behaviour.

Unicode 133 (U+0085) is "Next Line", which is treated as whitespace and stripped off by the trim command.

That is why it is missing...
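
The effect can be reproduced independently of the system encoding. The following is a minimal sketch, assuming Tcl 8.6: it converts the string to UTF-8 bytes, which is how the characters appear when a UTF-8 file is read as iso8859-1, and then trims the result. The variable names are illustrative only.

```tcl
# The four Chinese characters, written with escapes so that the script
# itself does not depend on the source file encoding.
set comment "\u6d4b\u8bd5\u5305\u88c5"

# One character (0-255) per UTF-8 byte, as seen under iso8859-1.
set bytes [encoding convertto utf-8 $comment]
set before [lmap c [split $bytes ""] {scan $c %c}]

# Code point 133 (U+0085, Next Line) counts as whitespace for Tcl 8.6.
set nelIsSpace [string is space [format %c 133]]

# Trimming therefore removes the final byte of the last character.
set after [lmap c [split [string trim $bytes] ""] {scan $c %c}]

puts $before
puts $nelIsSpace
puts $after
```

The first and last lists should match the two byte lists in the CGI output reported in this ticket: 12 values ending in 133 before the trim, 11 values after it.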

Harald


anonymous added on 2024-11-06 14:03:41:
Using the following code:

puts "Encoding: [encoding system]"
set comment "测试包装"

puts [lmap c [split $comment ""] {scan $c %c}]
puts $comment
set comment [string trim $comment]
puts [lmap c [split $comment ""] {scan $c %c}]
puts $comment


Running on the command line, the output is:
./test_chinese.tcl
Encoding: utf-8
27979 35797 21253 35013
测试包装
27979 35797 21253 35013
测试包装


Running as a CGI script from a web page, the output is:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> <meta charset="UTF-8">
</head>
<body>
Encoding: iso8859-1
230 181 139 232 175 149 229 140 133 232 163 133
测试包装
230 181 139 232 175 149 229 140 133 232 163
测试包�
...
</body>
</html>


########

It's related to encoding handling: the file is saved as UTF-8, but when it runs through Apache the system encoding is iso8859-1. I will check the Apache documentation and update afterwards.

oehhar added on 2024-11-06 11:23:43:

To examine your string without console influence, you may use:

% lmap c [split $comment ""] {scan $c %c}
27979 35797 21253 35013

This prints the decimal Unicode code points.


oehhar added on 2024-11-06 11:19:43:

Yes, the console is always an issue. So, let's make the example use only characters from the 8859-1 code page:

set comment "\u6d4b\u8bd5\u5305\u88c5"
set comment2 [string trim $comment]
if {$comment eq $comment2} {puts identical}

If this does not output "identical", the trim function is at fault; otherwise, it is something else.

And please provide:

info patchlevel

Thanks for all, Harald


anonymous added on 2024-11-06 10:00:15:
Hi, sorry for the missing info:

Red Hat Enterprise Linux release 8.5 (Ootpa)

echo 'puts $tcl_version;exit 0' | tclsh
8.6

It looks like it is an issue with Tcl using iso8859-1 (default encoding for Tcl 8.6)


set comment "测试包装"
set comment [encoding convertto utf-8 $comment]


puts [binary encode hex $comment]

puts $comment
set comment [string trim $comment]
puts [binary encode hex $comment]

puts $comment


Without [encoding convertto utf-8 $comment], the string representation will not be UTF-8 on the command line.

But when running under Apache as CGI, it uses UTF-8 by default, causing this issue.

oehhar added on 2024-11-05 17:00:31:

That is weird!

You mention version 8.6. For me, on Tcl 8.6.14, self-compiled on Windows32 with MS-VC6, this does not happen, i.e., after the trim, the data is the same. And your data is in the Unicode Basic Multilingual Plane, so there is no surrogate issue.

% set comment "测试包装"
测试包装
% string length $comment
4
% scan $comment %c%c%c%c c1 c2 c3 c4
4
% set c1
27979
% set c2
35797
% set c3
21253
% set c4
35013
% set comment2 [string trim $comment]
测试包装
% set comment2
测试包装

Could you give more background, e.g. the exact Tcl version, platform, compiler, ...?

Thank you, Harald

