Tcl Source Code

Check-in [4d6aa33b2f]

Overview
Comment: Rebase to trunk
SHA3-256: 4d6aa33b2f03318f45cc19c7e0adf0e4c055ae65eaf7d907abf42829f4d389f6
User & Date: jan.nijtmans 2024-06-13 15:12:41
References
2024-06-14
15:09  Import selections of [4d6aa33b2f] (branch: encoding-for-review) and alternate wording. check-in: f5243d7263 user: oehhar tags: encoding-for-review-alt
Context
2024-06-13
15:12  Rebase to trunk Leaf check-in: 4d6aa33b2f user: jan.nijtmans tags: encoding-for-review
12:00  Fix [1d26e580cf]: safe interp can't source files with BOM check-in: 162129dfbf user: jan.nijtmans tags: trunk, main
2024-06-11
10:10  Merge 9.0 check-in: 6da7c67781 user: jan.nijtmans tags: encoding-for-review
Changes

Changes to doc/encoding.n.

[Side-by-side diff, left column: old version, lines 1-202]
'\"
'\" Copyright (c) 1998 Scriptics Corporation.

'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
'\"
.TH encoding n "8.1" Tcl "Tcl Built-In Commands"
.so man.macros
.BS
.SH NAME
encoding \- Manipulate encodings
.SH SYNOPSIS
\fBencoding \fIoption\fR ?\fIarg arg ...\fR?
.BE
.SH INTRODUCTION
.PP
Strings in Tcl are logically a sequence of Unicode characters.





These strings are represented in memory as a sequence of bytes that
may be in one of several encodings: modified UTF\-8 (which uses 1 to 4


bytes per character), or a custom encoding start as 8 bit binary data.




.PP


Different operating system interfaces or applications may generate

strings in other encodings such as Shift\-JIS.  The \fBencoding\fR
command helps to bridge the gap between Unicode and these other

formats.





.SH DESCRIPTION
.PP
Performs one of several encoding related operations, depending on
\fIoption\fR.  The legal \fIoption\fRs are:
.\" METHOD: convertfrom
.TP
\fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR
.VS "TCL8.7 TIP607, TIP656"
.TP
\fBencoding convertfrom\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding data\fR
.VE "TCL8.7 TIP607, TIP656"
.
Converts \fIdata\fR, which should be in binary string encoded as per
\fIencoding\fR, to a Tcl string. If \fIencoding\fR is not specified, the current
system encoding is used.
.PP
.VS "TCL8.7 TIP607, TIP656"
The \fB-profile\fR option determines the command behavior in the presence
of conversion errors. See the \fBPROFILES\fR section below for details. Any premature
termination of processing due to errors is reported through an exception if
the \fB-failindex\fR option is not specified.
.PP
If the \fB-failindex\fR is specified, instead of an exception being raised
on premature termination, the result of the conversion up to the point of the
error is returned as the result of the command. In addition, the index
of the source byte triggering the error is stored in \fBvar\fR. If no
errors are encountered, the entire result of the conversion is returned and
the value \fB-1\fR is stored in \fBvar\fR.
.VE "TCL8.7 TIP607, TIP656"
.\" METHOD: convertto
.TP
\fBencoding convertto\fR ?\fIencoding\fR? \fIdata\fR
.TP
\fBencoding convertto\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding data\fR
.
Convert \fIstring\fR to the specified \fIencoding\fR. The result is a Tcl binary
string that contains the sequence of bytes representing the converted string in
the specified encoding. If \fIencoding\fR is not specified, the current system
encoding is used.
.PP
.VS "TCL8.7 TIP607, TIP656"
The \fB-profile\fR and \fB-failindex\fR options have the same effect as
described for the \fBencoding convertfrom\fR command.

.VE "TCL8.7 TIP607, TIP656"
.\" METHOD: dirs
.TP
\fBencoding dirs\fR ?\fIdirectoryList\fR?
.
Tcl can load encoding data files from the file system that describe
additional encodings for it to work with. This command sets the search
path for \fB*.enc\fR encoding data files to the list of directories
\fIdirectoryList\fR. If \fIdirectoryList\fR is omitted then the
command returns the current list of directories that make up the
search path. It is an error for \fIdirectoryList\fR to not be a valid
list. If, when a search for an encoding data file is happening, an
element in \fIdirectoryList\fR does not refer to a readable,
searchable directory, that element is ignored.
.\" METHOD: names
.TP
\fBencoding names\fR
.
Returns a list containing the names of all of the encodings that are
currently available.
The encodings
.QW utf-8
and
.QW iso8859-1
are guaranteed to be present in the list.
.\" METHOD: profiles
.TP
\fBencoding profiles\fR
.VS "TCL8.7 TIP656"
Returns a list of the names of encoding profiles. See \fBPROFILES\fR below.

.VE "TCL8.7 TIP656"
.\" METHOD: system
.TP
\fBencoding system\fR ?\fIencoding\fR?
.
Set the system encoding to \fIencoding\fR. If \fIencoding\fR is
omitted then the command returns the current system encoding.  The
system encoding is used whenever Tcl passes strings to system calls.
.\" Do not put .VS on whole section as that messes up the bullet list alignment
.SH PROFILES
.PP
.VS "TCL8.7 TIP656"
Operations involving encoding transforms may encounter several types of
errors such as invalid sequences in the source data, characters that
cannot be encoded in the target encoding and so on.
A \fIprofile\fR prescribes the strategy for dealing with such errors
in one of two ways:
.VE "TCL8.7 TIP656"
.
.IP \(bu
.VS "TCL8.7 TIP656"
Terminating further processing of the source data. The profile does not
determine how this premature termination is conveyed to the caller. By default,
this is signalled by raising an exception. If the \fB-failindex\fR option
is specified, errors are reported through that mechanism.
.VE "TCL8.7 TIP656"
.IP \(bu
.VS "TCL8.7 TIP656"
Continue further processing of the source data using a fallback strategy such
as replacing or discarding the offending bytes in a profile-defined manner.
.VE "TCL8.7 TIP656"
.PP
The following profiles are currently implemented with \fBstrict\fR being
the default if the \fB-profile\fR is not specified.
.VS "TCL8.7 TIP656"
.TP
\fBstrict\fR
.
The \fBstrict\fR profile always stops processing when an conversion error is

encountered. The error is signalled via an exception or the \fB-failindex\fR
option mechanism. The \fBstrict\fR profile implements a Unicode standard
conformant behavior.

.TP
\fBtcl8\fR
.
The \fBtcl8\fR profile always follows the first strategy above and corresponds
to the behavior of encoding transforms in Tcl 8.6. When converting from an
external encoding \fBother than utf-8\fR to Tcl strings with the \fBencoding
convertfrom\fR command, invalid bytes are mapped to their numerically equivalent
code points. For example, the byte 0x80 which is invalid in ASCII would be

mapped to code point U+0080. When converting from \fButf-8\fR, invalid bytes
that are defined in CP1252 are mapped to their Unicode equivalents while those
that are not fall back to the numerical equivalents. For example, byte 0x80 is
defined by CP1252 and is therefore mapped to its Unicode equivalent U+20AC while
byte 0x81 which is not defined by CP1252 is mapped to U+0081. As an additional
special case, the sequence 0xC0 0x80 is mapped to U+0000.

When converting from Tcl strings to an external encoding format using

\fBencoding convertto\fR, characters that cannot be represented in the
target encoding are replaced by an encoding-dependent character, usually

the question mark \fB?\fR.
.TP
\fBreplace\fR
.
Like the \fBtcl8\fR profile, the \fBreplace\fR profile always continues
processing on conversion errors but follows a Unicode standard conformant
method for substitution of invalid source data.

When converting an encoded byte sequence to a Tcl string using
\fBencoding convertfrom\fR, invalid bytes
are replaced by the U+FFFD REPLACEMENT CHARACTER code point.

When encoding a Tcl string with \fBencoding convertto\fR,

code points that cannot be represented in the
target encoding are transformed to an encoding-specific fallback character,
U+FFFD REPLACEMENT CHARACTER for UTF targets and generally `?` for other
encodings.
.VE "TCL8.7 TIP656"
.SH EXAMPLES
.PP
These examples use the utility proc below that prints the Unicode code points
comprising a Tcl string.
.PP
.CS
proc codepoints s {join [lmap c [split $s {}] {
    string cat U+ [format %.6X [scan $c %c]]}]
}
.CE
.PP
Example 1: convert a byte sequence in Japanese euc-jp encoding to a TCL string:
.PP
.CS
% codepoints [\fBencoding convertfrom\fR euc-jp "\exA4\exCF"]
U+00306F
.CE
.PP
The result is the unicode codepoint
.QW "\eu306F" ,
which is the Hiragana letter HA.
.VS "TCL8.7 TIP607, TIP656"
.PP
Example 2: Error handling based on profiles:
.PP
The letter \fBA\fR is Unicode character U+0041 and the byte "\ex80" is invalid


[Side-by-side diff, right column: new version, lines 1-185]
'\"
'\" Copyright (c) 1998 Scriptics Corporation.
'\" Copyright (c) 2023 Nathan Coulter
'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
'\"
.TH encoding n "8.1" Tcl "Tcl Built-In Commands"
.so man.macros
.BS
.SH NAME
encoding \- Work with encodings
.SH SYNOPSIS
\fBencoding \fIoperation\fR ?\fIarg arg ...\fR?
.BE
.SH INTRODUCTION
.PP
In Tcl every string is composed of Unicode values.  Text may be encoded into an
encoding such as cp1252, iso8859-1, Shift\-JIS, utf-8, utf-16, etc. Not every
Unicode value is encodable in every encoding, and some encodings can encode
values that are not available in Unicode.
.PP
Even though Unicode is for encoding the written texts of human languages, any
sequence of bytes can be encoded as the first 255 Unicode values. In particular,
iso8859-1 is an encoding (a superset of classic ASCII) for a subset of Unicode
in which each byte is a Unicode value of 255
or less; any sequence of bytes can be considered to be a Unicode string
encoded in iso8859-1.  To work with binary data in Tcl, decode it from
iso8859-1 when reading it in, and encode it into iso8859-1 when writing it out,
ensuring that each character in the string has a value of 255 or less.
Decoding such a string does nothing, and encoding encoding such a string also
does nothing.
.PP
For example, the following is true:
.CS

set text {In Tcl binary data is treated as Unicode text and it just works.}
set encoded [\fBencoding convertto\fR iso8859-1 $text]

expr {$text eq $encoded}; #-> 1
.CE
The following is also true:
.CS
set decoded [\fBencoding convertfrom\fR iso8859-1 $text]
expr {$text eq $decoded}; #-> 1
.CE
.SH DESCRIPTION
.PP
Performs one of the following encoding \fIoperations\fR:

.\" METHOD: convertfrom
.TP
\fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR
.VS "TCL8.7 TIP607, TIP656"
.TP
\fBencoding convertfrom\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding data\fR
.VE "TCL8.7 TIP607, TIP656"
.

Decodes \fIdata\fR encoded in \fIencoding\fR. If \fIencoding\fR is not
specified the current system encoding is used.
.PP
.VS "TCL8.7 TIP607, TIP656"
\fB\-profile\fR determines how invalid data for the encoding are handled.  See
the \fBPROFILES\fR section below for details.  Returns an error if decoding

fails.  However, if \fB\-failindex\fR given, returns the result of the


conversion up to the point of termination, and stores in \fBvar\fR the index of

the character that could not be converted. If no errors are encountered the
entire result of the conversion is returned and the value \fB-1\fR is stored in
\fBvar\fR.
.VE "TCL8.7 TIP607, TIP656"
.\" METHOD: convertto
.TP
\fBencoding convertto\fR ?\fIencoding\fR? \fIdata\fR
.TP
\fBencoding convertto\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding data\fR
.
Converts \fIstring\fR to \fIencoding\fR.  If \fIencoding\fR is not given, the


current system encoding is used.

.VS "TCL8.7 TIP607, TIP656"

See \fBencoding convertfrom\fR for the meaning of \fB\-profile\fR and
\fB\-failindex\fR.
.VE "TCL8.7 TIP607, TIP656"
.\" METHOD: dirs
.TP
\fBencoding dirs\fR ?\fIdirectoryList\fR?
.


Sets the search path for \fB*.enc\fR encoding data files to the list of
directories given by \fIdirectoryList\fR.  If \fIdirectoryList\fR is not given,
returns the current list of directories that make up the search path.  It is
not an error for an item in \fIdirectoryList\fR to not refer to a readable,


searchable directory.
.\" METHOD: names
.TP
\fBencoding names\fR
.
Returns a list of the names of available encodings.

The encodings
.QW utf-8
and
.QW iso8859-1
are guaranteed to be present in the list.
.\" METHOD: profiles
.TP
\fBencoding profiles\fR
.VS "TCL8.7 TIP656"
Returns a list of names of available encoding profiles. See \fBPROFILES\fR
below.
.VE "TCL8.7 TIP656"
.\" METHOD: system
.TP
\fBencoding system\fR ?\fIencoding\fR?
.
Sets the system encoding to \fIencoding\fR. If \fIencoding\fR is not given,
returns the current system encoding.  The system encoding is used to pass
strings to system calls.
.\" Do not put .VS on whole section as that messes up the bullet list alignment
.SH PROFILES
.PP
.VS "TCL8.7 TIP656"



Each \fIprofile\fR is a distinct strategy for dealing with invalid data for an
encoding.














.PP
The following profiles are currently implemented.

.VE "TCL8.7 TIP656"
.TP
\fBstrict\fR
.VS "TCL8.7 TIP656"

The default profile.  The operation fails when invalid data for the encoding
are encountered.


.VE "TCL8.7 TIP656"
.TP
\fBtcl8\fR
.VS "TCL8.7 TIP656"

Provides for behaviour identical to that of Tcl 8.6: When
decoding, for encodings \fBother than utf-8\fR, each invalid byte is interpreted

as the Unicode value given by that one byte. For example, the byte 0x80, which
is invalid in the ASCII encoding would be mapped to the Unicode value U+0080.
For \fButf-8\fR, each invalid byte that is a valid CP1252 character is
interpreted as the Unicode value for that character, while each byte that is
not is treated as the Unicode value given by that one byte. For example, byte
0x80 is defined by CP1252 and is therefore mapped to its Unicode equivalent
U+20AC while byte 0x81 which is not defined by CP1252 is mapped to U+0081. As
an additional special case, the sequence 0xC0 0x80 is mapped to U+0000.
.RS

.PP
When encoding, each character that cannot be represented in the encoding is
replaced by an encoding-dependent character, usually the question mark \fB?\fR.
.RE
.VE "TCL8.7 TIP656"
.TP
\fBreplace\fR




.VS "TCL8.7 TIP 656"

When decoding, invalid bytes are replaced by U+FFFD, the Unicode REPLACEMENT
CHARACTER.
.RS

.PP
When encoding, Unicode values that cannot be represented in the target encoding
are transformed to an encoding-specific fallback character, U+FFFD REPLACEMENT
CHARACTER for UTF targets, and generally `?` for other encodings.
.RE
.VE "TCL8.7 TIP656"
.SH EXAMPLES
.PP
These examples use the utility proc below that prints the Unicode value for
each character in a string.
.PP
.CS
proc codepoints s {join [lmap c [split $s {}] {
    string cat U+ [format %.6X [scan $c %c]]}]
}
.CE
.PP
Example 1: Convert from euc-jp:
.PP
.CS
% codepoints [\fBencoding convertfrom\fR euc-jp \exA4\exCF]
U+00306F
.CE
.PP
The result is the Unicode value
.QW "\eu306F" ,
which is the Hiragana letter HA.
.VS "TCL8.7 TIP607, TIP656"
.PP
Example 2: Error handling based on profiles:
.PP
The letter \fBA\fR is Unicode character U+0041 and the byte "\ex80" is invalid
[Side-by-side diff, left column: old version, lines 210-243]
.CS
% codepoints [\fBencoding convertfrom\fR -profile strict ascii A\ex80]
unexpected byte sequence starting at index 1: '\ex80'
.CE
.PP
Example 3: Get partial data and the error location:
.PP
.CS
% codepoints [\fBencoding convertfrom\fR -profile strict -failindex idx ascii AB\ex80]
U+000041 U+000042
% set idx
2
.CE
.PP
Example 4: Encode a character that is not representable in ISO8859-1:
.PP
.CS
% \fBencoding convertto\fR iso8859-1 A\eu0141
A?
% \fBencoding convertto\fR -profile strict iso8859-1 A\eu0141
unexpected character at index 1: 'U+000141'
% \fBencoding convertto\fR -profile strict -failindex idx iso8859-1 A\eu0141
A
% set idx
1
.CE
.VE "TCL8.7 TIP607, TIP656"
.PP
.SH "SEE ALSO"
Tcl_GetEncoding(3), fconfigure(n)
.SH KEYWORDS
encoding, unicode
.\" Local Variables:
.\" mode: nroff
.\" End:







[Side-by-side diff, right column: new version, lines 193-226]
.CS
% codepoints [\fBencoding convertfrom\fR -profile strict ascii A\ex80]
unexpected byte sequence starting at index 1: '\ex80'
.CE
.PP
Example 3: Get partial data and the error location:
.PP
.CS
% codepoints [\fBencoding convertfrom\fR -failindex idx ascii AB\ex80]
U+000041 U+000042
% set idx
2
.CE
.PP
Example 4: Encode a character that is not representable in ISO8859-1:
.PP
.CS
% \fBencoding convertto\fR iso8859-1 A\eu0141
A?
% \fBencoding convertto\fR -profile strict iso8859-1 A\eu0141
unexpected character at index 1: 'U+000141'
% \fBencoding convertto\fR -failindex idx iso8859-1 A\eu0141
A
% set idx
1
.CE
.VE "TCL8.7 TIP607, TIP656"
.PP
.SH "SEE ALSO"
Tcl_GetEncoding(3), fconfigure(n)
.SH KEYWORDS
encoding, unicode
.\" Local Variables:
.\" mode: nroff
.\" End: