Overview
Comment: | Make the documentation of [encoding] more concise and readable. |
---|---|
SHA3-256: | 92ee031f32ea98ba257a6354c8c1ae3b |
User & Date: | pooryorick 2023-03-27 12:16:36.891 |
References
2024-06-11
09:30 | Backout [92ee031f32ea98ba]: "Make the documentation of [encoding] more concise and readable". Will b... check-in: 4178d18b52 user: jan.nijtmans tags: trunk, main | |
2023-03-28
17:52 | Merge trunk [92ee031f32]: Make the documentation of [encoding] more concise and readable. check-in: e2c12eab9b user: pooryorick tags: unchained | |
Context
2024-06-04
11:03 | Let's review the encoding.n changes in 8.7/trunk (which were never backported to 8.6) check-in: f033e2bffd user: jan.nijtmans tags: backout-encoding-doc | |
2023-03-28
17:52 | Merge trunk [92ee031f32]: Make the documentation of [encoding] more concise and readable. check-in: e2c12eab9b user: pooryorick tags: unchained | |
2023-03-27
14:44 | Merge 8.7 check-in: 245407f036 user: jan.nijtmans tags: trunk, main | |
12:16 | Make the documentation of [encoding] more concise and readable. check-in: 92ee031f32 user: pooryorick tags: trunk, main | |
12:14 | Make the documentation of [encoding] more concise and readable. check-in: 71bfc6e708 user: pooryorick tags: core-8-branch | |
11:43 | Merge 8.7 check-in: 4ebee22ffd user: jan.nijtmans tags: trunk, main | |
Changes
Changes to doc/encoding.n.
'\"
'\" Copyright (c) 1998 Scriptics Corporation.
'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
'\"
.TH encoding n "8.1" Tcl "Tcl Built-In Commands"
.so man.macros
.BS
.SH NAME
encoding \- Work with encodings
.SH SYNOPSIS
\fBencoding \fIoperation\fR ?\fIarg arg ...\fR?
.BE
.SH INTRODUCTION
.PP
In Tcl every string is composed of Unicode values. Text may be encoded into an
encoding such as cp1252, iso8859-1, Shift\-JIS, utf-8, utf-16, etc. Not every
Unicode value is encodable in every encoding, and some encodings can encode
values that are not available in Unicode.
.PP
Even though Unicode is for encoding the written texts of human languages, any
sequence of bytes can be encoded as the first 256 Unicode values.
iso8859-1 is an encoding for a subset of Unicode in which each byte is a Unicode
value of 255 or less. Thus, any sequence of bytes can be considered to be a
Unicode string encoded in iso8859-1. To work with binary data in Tcl, decode it
from iso8859-1 when reading it in, and encode it into iso8859-1 when writing it
out, ensuring that each character in the string has a value of 255 or less.
Decoding such a string does nothing, and encoding such a string also does
nothing.
.PP
For example, the following is true:
.CS
set text {In Tcl binary data is treated as Unicode text and it just works.}
set encoded [encoding convertto iso8859-1 $text]
expr {$text eq $encoded}; #-> 1
.CE
The following is also true:
.CS
set decoded [encoding convertfrom iso8859-1 $text]
expr {$text eq $decoded}; #-> 1
.CE
.SH DESCRIPTION
.PP
Performs one of the following encoding \fIoperations\fR:
.TP
\fBencoding convertfrom\fR ?\fIencoding\fR? \fIdata\fR
.TP
\fBencoding convertfrom\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding\fR \fIdata\fR
.
Decodes \fIdata\fR encoded in \fIencoding\fR. If \fIencoding\fR is not
specified, the current system encoding is used.
.VS "TCL8.7 TIP607, TIP656"
\fB-profile\fR determines how invalid data for the encoding are handled. See
the \fBPROFILES\fR section below for details. Returns an error if decoding
fails. However, if \fB-failindex\fR is given, returns the result of the
conversion up to the point of termination, and stores in \fBvar\fR the index of
the character that could not be converted. If no errors are encountered, the
entire result of the conversion is returned and the value \fB-1\fR is stored in
\fBvar\fR.
.VE "TCL8.7 TIP607, TIP656"
.TP
\fBencoding convertto\fR ?\fIencoding\fR? \fIdata\fR
.TP
\fBencoding convertto\fR ?\fB-profile \fIprofile\fR? ?\fB-failindex var\fR? \fIencoding\fR \fIdata\fR
.
Converts \fIdata\fR to \fIencoding\fR. If \fIencoding\fR is not given, the
current system encoding is used.
.VS "TCL8.7 TIP607, TIP656"
See \fBencoding convertfrom\fR for the meaning of \fB-profile\fR and
\fB-failindex\fR.
.VE "TCL8.7 TIP607, TIP656"
.TP
\fBencoding dirs\fR ?\fIdirectoryList\fR?
.
Sets the search path for \fB*.enc\fR encoding data files to the list of
directories given by \fIdirectoryList\fR. If \fIdirectoryList\fR is not given,
returns the current list of directories that make up the search path. It is not
an error for an item in \fIdirectoryList\fR to not refer to a readable,
searchable directory.
.TP
\fBencoding names\fR
.
Returns a list of the names of available encodings.
The encodings
.QW utf-8
and
.QW iso8859-1
are guaranteed to be present in the list.
.VS "TCL8.7 TIP656"
.TP
\fBencoding profiles\fR
Returns a list of names of available encoding profiles. See \fBPROFILES\fR
below.
.VE "TCL8.7 TIP656"
.TP
\fBencoding system\fR ?\fIencoding\fR?
.
Sets the system encoding to \fIencoding\fR. If \fIencoding\fR is not given,
returns the current system encoding. The system encoding is used to pass
strings to system calls.
.\" Do not put .VS on whole section as that messes up the bullet list alignment
.SH PROFILES
.PP
.VS "TCL8.7 TIP656"
Each \fIprofile\fR is a distinct strategy for dealing with invalid data for an
encoding.
.PP
The following profiles are currently implemented.
.VS "TCL8.7 TIP656"
.TP
\fBtcl8\fR
.
The default profile. Provides for behaviour identical to that of Tcl 8.6: When
decoding, for encodings \fBother than utf-8\fR, each invalid byte is interpreted
as the Unicode value given by that one byte. For example, the byte 0x80, which
is invalid in the ASCII encoding, would be mapped to the Unicode value U+0080.
For \fButf-8\fR, each invalid byte that is a valid CP1252 character is
interpreted as the Unicode value for that character, while each byte that is
not is treated as the Unicode value given by that one byte.
For example, byte 0x80 is defined by CP1252 and is therefore mapped to its
Unicode equivalent U+20AC, while byte 0x81, which is not defined by CP1252, is
mapped to U+0081. As an additional special case, the sequence 0xC0 0x80 is
mapped to U+0000.
When encoding, each character that cannot be represented in the encoding is
replaced by an encoding-dependent character, usually the question mark \fB?\fR.
.TP
\fBstrict\fR
.
The operation fails when invalid data for the encoding are encountered.
.TP
\fBreplace\fR
.
When decoding, invalid bytes are replaced by U+FFFD, the Unicode REPLACEMENT
CHARACTER.
When encoding, Unicode values that cannot be represented in the target encoding
are transformed to an encoding-specific fallback character: U+FFFD REPLACEMENT
CHARACTER for UTF targets, and generally \fB?\fR for other encodings.
.VE "TCL8.7 TIP656"
.SH EXAMPLES
.PP
These examples use the utility proc below that prints the Unicode value for
each character in a string.
.PP
.CS
proc codepoints s {join [lmap c [split $s {}] {
    string cat U+ [format %.6X [scan $c %c]]}]
}
.CE
.PP
Example 1: Convert from euc-jp:
.PP
.CS
% codepoints [\fBencoding convertfrom\fR euc-jp \exA4\exCF]
U+00306F
.CE
.PP
The result is the Unicode value
.QW "\eu306F" ,
which is the Hiragana letter HA.
.VS "TCL8.7 TIP607, TIP656"
.PP
Example 2: Error handling based on profiles:
.PP
The letter \fBA\fR is Unicode character U+0041 and the byte "\ex80" is invalid
︙ | ︙ |
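The profile descriptions in the PROFILES section of doc/encoding.n can be tried out with a short Tcl script. What follows is a minimal sketch and not part of the check-in. It assumes an interpreter in which the encoding profiles subcommand and the -profile and -failindex options described in the diff are available (Tcl 8.7 or later); on an older interpreter the guarded block is skipped. The variable names bytes, points, prefix and idx are illustrative only, and no output is shown because the results depend on the interpreter version.

```tcl
# Two bytes: 0x41 ("A") followed by 0x80, which is not valid ASCII.
set bytes [binary format H* 4180]

# "encoding profiles" only exists where profiles are implemented
# (assumption: Tcl 8.7 or later). On older interpreters the call
# fails, catch returns non-zero, and the block is skipped.
if {![catch {encoding profiles} profiles]} {
    foreach profile $profiles {
        if {[catch {encoding convertfrom -profile $profile ascii $bytes} result]} {
            # Under a failing profile, result holds the error message.
            puts "$profile: error: $result"
        } else {
            # Show every decoded character as a U+XXXX code point.
            set points [lmap c [split $result {}] {format U+%04X [scan $c %c]}]
            puts "$profile: $points"
        }
    }

    # With -failindex, a conversion that would otherwise fail returns
    # the part converted so far and stores the failing index in idx.
    set prefix [encoding convertfrom -profile strict -failindex idx ascii $bytes]
    puts "converted prefix: $prefix, failindex: $idx"
}
```

Iterating over the list returned by encoding profiles, rather than hard-coding profile names, keeps the sketch valid if the set of profiles differs between interpreter versions.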