Tcl Library Source Code

Ticket UUID: 1f900bdf6bb6369987598e8c5f36a1e0f62b9798
Title: improve ncgi's decode procedure
Type: Patch
Version: 1.18
Submitter: anonymous
Created on: 2017-09-01 23:43:46
Subsystem: ncgi
Assigned To: aku
Priority: 7 High
Severity: Minor
Status: Closed
Last Modified: 2019-06-24 18:28:28
Resolution: Accepted
Closed By: aku
Closed on: 2019-06-24 18:28:28
Description:
In practice, URL-encoded (application/x-www-form-urlencoded) POST parameters can use encodings other than UTF-8 (think of legacy Tcl applications that use one of the ISO-8859-x charsets). In this case, URL parameters can contain percent-encoded 8-bit values (of the form %[A-F8-9][A-F0-9]) that are not valid UTF-8 sequences on their own.

For example, %DC can be used as a percent encoding for the German umlaut Ü (if a Tcl application is based on ISO-8859-1). Currently, the decode procedure does not decode a lone %DC, because its single-byte rule only accepts escapes whose first hex digit is in [0-7], the range of one-byte UTF-8 sequences.

This commit improves the handling of one-byte percent-encoded non-ASCII characters of the form %[A-F8-9][A-F0-9]. It allows ncgi to be used in application contexts that do not use UTF-8 as the base encoding.
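
To make the %DC case concrete, here is a minimal sketch using only core Tcl commands. The helper name `decodeByteAs` is hypothetical and is not part of ncgi; it only shows how the same character is one byte in ISO-8859-1 and two bytes in UTF-8.

```tcl
# Hypothetical helper (not part of ncgi): turn the two hex digits of a single
# %XX escape into a byte and interpret that byte in the given charset.
proc decodeByteAs {hex charset} {
    return [encoding convertfrom $charset [binary format H2 $hex]]
}

# %DC read as ISO-8859-1 is U+00DC (Ü).
set latin1 [decodeByteAs DC iso8859-1]

# The same character in UTF-8 occupies two bytes, i.e. %C3%9C on the wire.
binary scan [encoding convertto utf-8 \u00DC] H* utf8hex

puts [string equal $latin1 \u00DC]   ;# 1
puts $utf8hex                        ;# c39c
```

A client that submits ISO-8859-1 form data therefore sends a lone %DC, which neither the multi-byte UTF-8 rules nor the old `[0-7]` single-byte rule would match.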

A pull request was created on GitHub. See https://github.com/tcltk/tcllib/pull/13 for more details. Thanks!
User Comments:
aku added on 2019-06-24 18:28:28:
Accepted and applied. Needed an update of test ncgi-3.10.

Integrated, see commit [2adb057376].

aku added on 2019-06-13 05:39:27:
Pulled in the relevant commits from the closed PR.


use -nocase switch in regsub commands to make regex shorter 

---- modules/ncgi/ncgi.tcl
@@ -271,11 +271,11 @@ proc ::ncgi::decode {str} {

    set str [string map [list + { } "\\" "\\\\" \[ \\\[ \] \\\]] $str]

    # prepare to process all %-escapes
-   regsub -all -- {%([Ee][A-Fa-f0-9])%([89ABab][A-Fa-f0-9])%([89ABab][A-Fa-f0-9])} \
+   regsub -all -nocase -- {%([E][A-F0-9])%([89AB][A-F0-9])%([89AB][A-F0-9])} \
	$str {[encoding convertfrom utf-8 [DecodeHex \1\2\3]]} str
-   regsub -all -- {%([CDcd][A-Fa-f0-9])%([89ABab][A-Fa-f0-9])}                     \
+   regsub -all -nocase -- {%([CD][A-F0-9])%([89AB][A-F0-9])}                     \
	$str {[encoding convertfrom utf-8 [DecodeHex \1\2]]} str
-   regsub -all -- {%([0-7][A-Fa-f0-9])} $str {\\u00\1} str
+   regsub -all -nocase -- {%([0-7][A-F0-9])} $str {\\u00\1} str

    # process \u unicode mapped chars
    return [subst -novar $str]
----
improved handling of one byte encodings

In practice, URL-encoded (application/x-www-form-urlencoded) POST parameters
can use encodings other than UTF-8 (think of legacy Tcl applications that use
one of the ISO-8859-x charsets). In this case, URL parameters can contain
percent-encoded 8-bit values (matched as `%[A-F0-9][A-F0-9]`) that are not
valid UTF-8 sequences on their own.

For example, `%DC` can be used as a percent encoding for the German umlaut `Ü`
(if a Tcl application is based on ISO-8859-1). Currently, the `decode`
procedure does not decode a lone `%DC`, because its single-byte rule only
accepts escapes whose first hex digit is in `[0-7]`, the range of one-byte
UTF-8 sequences.

This commit improves the handling of one-byte percent-encoded non-ASCII
characters. It allows ncgi to be used in application contexts that do not use
UTF-8 as the base encoding.

---- modules/ncgi/ncgi.tcl
@@ -275,7 +275,7 @@ proc ::ncgi::decode {str} {

	$str {[encoding convertfrom utf-8 [DecodeHex \1\2\3]]} str
    regsub -all -nocase -- {%([CD][A-F0-9])%([89AB][A-F0-9])}                     \
	$str {[encoding convertfrom utf-8 [DecodeHex \1\2]]} str
-   regsub -all -nocase -- {%([0-7][A-F0-9])} $str {\\u00\1} str
+   regsub -all -nocase -- {%([A-F0-9][A-F0-9])} $str {\\u00\1} str

    # process \u unicode mapped chars
    return [subst -novar $str]
----
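
The effect of the widened catch-all rule can be tried in isolation. The following is a simplified sketch, not the actual `::ncgi::decode`: the proc name `sketchDecode` is hypothetical, and the two multi-byte UTF-8 rules (with their `DecodeHex` helper) that run first in the real procedure are left out, so UTF-8 input such as `%C3%9C` is not handled correctly here.

```tcl
# Simplified stand-in for ::ncgi::decode, for illustration only.
proc sketchDecode {str} {
    # Same protective mapping as in the diff: '+' becomes a space, and
    # backslashes and brackets are escaped so that subst leaves them alone.
    set str [string map [list + { } "\\" "\\\\" \[ \\\[ \] \\\]] $str]
    # Catch-all rule from the patch: every remaining %XX becomes \u00XX.
    regsub -all -nocase -- {%([A-F0-9][A-F0-9])} $str {\\u00\1} str
    # Let subst expand the \u00XX sequences (no variable substitution).
    return [subst -novariables $str]
}

# ISO-8859-1 form data: %FC, %DF and %F6 are single bytes >= 0x80.
set greeting [sketchDecode "Gr%FC%DFe+aus+K%F6ln"]
puts [string equal $greeting "Gr\u00FC\u00DFe aus K\u00F6ln"]   ;# 1

# Escaped ASCII brackets still decode to plain characters.
puts [sketchDecode "a%5Bb%5D"]   ;# a[b]
```

Because `\u00XX` maps byte XX to code point U+00XX, a lone high byte is in effect read as ISO-8859-1. For the other ISO-8859-x charsets mentioned in the commit message, the caller would presumably still need to convert the result.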