Ticket UUID: 9f9c828fde97a4c8a7a528dec3482ecce0838419
Title: trim issue with chinese characters
Type: Bug
Version: 8.6
Submitter: anonymous
Created on: 2024-11-05 15:51:41
Subsystem: 44. UTF-8 Strings
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Important
Status: Closed
Last Modified: 2024-11-07 16:14:59
Resolution: Invalid
Closed By: jan.nijtmans
Closed on: 2024-11-07 16:14:59
Description:
When running the following code, the string of Chinese characters gets corrupted:

    set comment "测试包装"
    puts $comment
    set comment [string trim $comment]
    puts $comment

This results in the following:

    Before trim: 测试包装
    After trim:  测试包�
User Comments:

jan.nijtmans added on 2024-11-07 16:14:59:
Found this excerpt:

    4) A simple solution is to pass the Apache process's LANG environment variable through to the CGI script using Apache's mod_env PassEnv command in the server or virtual host configuration: PassEnv LANG. On Debian/Ubuntu, make sure that in /etc/apache2/envvars you have uncommented the line ". /etc/default/locale" so that Apache runs with the system default locale and not the C (POSIX) locale (which is also ASCII encoding). Make sure a line similar to LANG="en_US.UTF-8" is present in /etc/default/locale.

So, I think this ticket can be closed: it's not a bug in Tcl.

oehhar added on 2024-11-06 14:10:31:
Yes, that explains the behaviour. Unicode 133 is "Next Line", which is stripped off by the trim command. That is why it is missing...
Harald

anonymous added on 2024-11-06 14:03:41:
Using the following code:

    puts "Encoding: [encoding system]"
    set comment "测试包装"
    puts [lmap c [split $comment ""] {scan $c %c}]
    puts $comment
    set comment [string trim $comment]
    puts [lmap c [split $comment ""] {scan $c %c}]
    puts $comment

Running on the command line, the output is this:

    ./test_chinese.tcl
    Encoding: utf-8
    27979 35797 21253 35013
    测试包装
    27979 35797 21253 35013
    测试包装

Running from a CGI webpage:

    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
    <meta charset="UTF-8">
    </head>
    <body>
    Encoding: iso8859-1
    230 181 139 232 175 149 229 140 133 232 163 133
    测试包装
    230 181 139 232 175 149 229 140 133 232 163
    测试包�
    ...
    </body>
    </html>

It's related to encoding handling: the file is saved as UTF-8 but is run through Apache. I will check the Apache documentation and update afterwards.

oehhar added on 2024-11-06 11:23:43:
To examine your string without console influence, you may use:

    % lmap c [split $comment ""] {scan $c %c}
    27979 35797 21253 35013

This prints the decimal Unicode code points.

oehhar added on 2024-11-06 11:19:43:
Yes, the console is always an issue.
So, let's make the example use only the 8859-1 code page:

    set comment "\u6d4b\u8bd5\u5305\u88c5"
    set comment2 [string trim $comment]
    if {$comment eq $comment2} {puts identical}

If this does not output "identical", it is the trim function; otherwise, it is something else. And please provide the output of:

    info patchlevel

Thanks for all,
Harald

anonymous added on 2024-11-06 10:00:15:
Hi, sorry for the missing info:

    Red Hat Enterprise Linux release 8.5 (Ootpa)
    echo 'puts $tcl_version;exit 0' | tclsh
    8.6

It looks like it is an issue with Tcl using iso8859-1 (default encoding for Tcl 8.6):

    set comment "测试包装"
    set comment [encoding convertto utf-8 $comment]
    puts [binary encode hex $comment]
    puts $comment
    set comment [string trim $comment]
    puts [binary encode hex $comment]
    puts $comment

Without [encoding convertto utf-8 $comment], the string representation will not be UTF-8 on the command line. But when running under Apache as CGI, it is using UTF-8 as the default, causing this issue.

oehhar added on 2024-11-05 17:00:31:
That is weird! You mention version 8.6. For me, on Tcl 8.6.14, self-compiled on Windows32 with MS-VC6, this does not happen; i.e., after the trim, the data is the same. And your data is on the Unicode base plane, so there is no surrogate issue.

    % set comment "测试包装"
    测试包装
    % string length $comment
    4
    % scan $comment %c%c%c%c c1 c2 c3 c4
    4
    % set c1
    27979
    % set c2
    35797
    % set c3
    21253
    % set c4
    35013
    % set comment2 [string trim $comment]
    测试包装
    % set comment2
    测试包装

Could you give more background, e.g. exact Tcl version, platform, compiler, ...?
Thank you,
Harald
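The mechanism identified in the comments (UTF-8 bytes read as iso8859-1, then a trailing byte 133 trimmed as whitespace) can be reproduced outside Tcl. The following is a minimal sketch in Python, not the code from the ticket. It assumes that Python's `str.strip()` treats U+0085 (Next Line) as whitespace, which parallels what is described above for `string trim`. All variable names are illustrative, and the result is printed via `ascii()` so the output does not depend on the console encoding.

```python
# Illustration only: mimics the CGI case where a UTF-8 script is
# interpreted as iso8859-1 (latin-1) before the string is trimmed.

# Same four characters as in the ticket, written as escapes.
text = "\u6d4b\u8bd5\u5305\u88c5"

# The bytes as stored in the UTF-8 encoded script file.
utf8_bytes = text.encode("utf-8")

# Reading those bytes as latin-1 turns each byte into one character.
misread = utf8_bytes.decode("latin-1")
codes_before = [ord(ch) for ch in misread]

# The last byte is 133 (0x85), i.e. U+0085 "Next Line", which counts
# as whitespace and is therefore removed by the trim.
trimmed = misread.strip()
codes_after = [ord(ch) for ch in trimmed]

# Writing the trimmed string back out as latin-1 yields a truncated
# UTF-8 sequence, which a UTF-8 viewer shows as a replacement character.
written = trimmed.encode("latin-1")
shown = written.decode("utf-8", errors="replace")

print(codes_before)
print(codes_after)
print(ascii(shown))
```

The two code lists correspond to the two rows of numbers in the CGI output above: twelve values before the trim and eleven after it, with the final 133 gone.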
Attachments:
- test_chinese.tcl added by anonymous on 2024-11-06 10:01:08.