Tcl Source Code

View Ticket
Login
Ticket UUID: a31caff05780aabdd20a8468673340435862dbf9
Title: encoding command does not allow -strict to be used with -failindex
Type: Bug Version: 8.7
Submitter: apnadkarni Created on: 2023-01-14 13:22:25
Subsystem: 16. Commands A-H Assigned To: jan.nijtmans
Priority: 5 Medium Severity: Important
Status: Closed Last Modified: 2023-01-27 21:37:37
Resolution: Fixed Closed By: jan.nijtmans
    Closed on: 2023-01-27 21:37:37
Description:
% encoding convertto -failindex i -strict utf-8 a
wrong # args: should be "encoding convertto ?-nocomplain? ?-strict? ?-failindex var? ?encoding? data"
User Comments: jan.nijtmans added on 2023-01-27 21:37:37:

Since -strict can now be combined with -failindex, the discussion lead to a useful implementation, this ticket can now be closed.

There are a lot of comments regarding -nocomplain, they don't belong in this ticket, so are irrelevant here. Also discussion which TIP preceded which other TIP doesn't make sense here.

Only one remark below, I want to point out:

> Why not choose either usint "--" or making "encoding" mandatory, and proceed that way?
That would be a useful discussion, but - please - in another ticket and - if possible - together with a patch ;-)

Now closing!


pooryorick added on 2023-01-23 12:55:35:

Thank you, Harald, for pointing that out. I didn't realize that TIP 601 preceded TIP 346, although I also wasn't sure based on prior discussions. The creation date of TIP 346 is 02-Feb-2009, and there is no creation date, no vote date, and no implementation date on TIP 601. Did the implementation of TIP 601 also precede the implementation of TIP 346? Why did TIP 346 not reference TIP 601 and make its relationship to that TIP clear? Why did TIP 346 advertise itself as "Error on Failed String Encodings" when TIP 601 already proposed to error on failed string encodings? Why does TIP 346 propose to "raise an explicit error when non-mappable characters are met", when TIP 601 already specified behaviour for invalid characters/bytes? Why does TIP 346 specify optional behaviour for "\xC0\x80" when TIP 601 already specified behaviour for that character sequence?

No need to respond to all these questions. If TIP 601 really preceded TIP 346, that is yet another argument in favor of withdrawing TIP 346. The two TIPs together are incoherent.


oehhar added on 2023-01-23 10:10:20:

Dear Nathan, thank you for the contribution. May I add something probably unrelated: The text of TIP346 is more recent that TIP 601 and TIP607, despite of the lower number. This may change the way or agumenting "what came first". That is also why TIP346 invalidated big parts (like the rationale) of TIP607 and 601 - in came later.

Thanks again, Harald


pooryorick added on 2023-01-23 09:58:27:

"... the constraint was already there, as side-effect of TIP #601."

But that was not the intent of TIP 601. It was not discussed in the TIP, and was only an inadvertent side effect. Tcl shouldn't be a hack job.

TIP 601 itself is a side effect of the mistake made in TIP 346: specifying an option ("-strict") that not only determines whether an error will be thrown, but also determines what constitutes an error. Because the control of control of two different functionalities was lumped together, TIP 601 came along to provide an option to separate them, Now we have the unfortunate situation that if one wants errors to be raised on encoding problems, but doesn't want the additional behaviour specified in TIP 346, there's no combination of options for that. Instead, what Tcl has is two options, "-strict" and "-nocomplain", that work together in a confusing way to provide functionality no one wants (or understands).

There was never a TIP proposed to specify that encoding names should not start with a "-", and therefore was never a vote on that change. Constraining the set of values in situation-specific ways is the type of booby trap that bites programmers years down the road after a language has evolved in unexpected ways. This is one example of how recent TIPs have not been properly vetted.

As Harald said, TIP 346 combined disparate functionality into one option. This is bad. TIPs that have so far only affected unreleased version of Tcl should not be sacred cows. Better sometimes to go back to the drawing board before the damage takes root.


oehhar added on 2023-01-23 07:53:32:

Dear Jan,

thank you for your message. Please accept my personal response as follows:

My reasoning is, that -strict is mixing two points, reporting mode and alternate encoding switch.

By doing that, the alternate encodings may never be used together with the other reporting methods. This is unfortune and broken by design.

The caused damage is not important, as long as -strict does not grow. Following TIP 346, it only changes the detection of bytes xC0x80 in encoding utf-8. There is no other effect. Thus -strict and -nocomplain are synonymes expect for the upper bytes in only one encoding. That is acceptable, to have a different option name for one encoding, and another for other encodings.

Unfortunately, the TIP tells, that this may change and send its destruction to other encodings. If we agree, that strict will not grow and only infect one codepoint, I am with you and the damage is limitted. Of cause, it is lots of code in encoding and fconfigure which is basically redundant, but that is the result of a partly unfortune TIP and must probably be accepted. The TCT has decided.

Thank you and take care, Harald


oehhar added on 2023-01-23 07:52:45:

Dear Jan,

thank you for your message. Please accept my personal response as follows:

My reasoning is, that -strict is mixing two points, reporting mode and alternate encoding switch.

By doing that, the alternate encodings may never be used together with the other reporting methods. This is unfortune and broken by design.

The caused damage is not important, as long as -strict does not grow. Following TIP 346, it only changes the detection of bytes xC0x80 in encoding utf-8. There is no other effect. Thus -strict and -nocomplain are synonymes expect for the upper bytes in only one encoding. That is acceptable, to have a different option name for one encoding, and another for other encodings.

Unfortunately, the TIP tells, that this may change and send its destruction to other encodings. If we agree, that strict will not grow and only infect one codepoint, I am with you and the damage is limitted. Of cause, it is lots of code in encoding and fconfigure which is basically redundant, but that is the result of a partly unfortune TIP and must probably be accepted. The TCT has decided.

Thank you and take care, Harald


jan.nijtmans added on 2023-01-22 23:09:40:

I'm not in favor to bring TIP #346 back to the drawing table: It basically is a good idea. Since this TIP mentions: "This -strict option cannot be combined with the already existing -nocomplain and -failvar options", accepting this ticket means that the sentence should be changed to "This -strict option cannot be combined with the already existing -nocomplain option".

Doing that, everything is consistent again. Left to do: what should be done when used in combination with -nocomplain? There's a reasoning behind it, I'll come back on that later.

One more remark on "It requires a new constraint on encoding names that wasn't there before.". That's not true at all, since the constraint was already there, as side-effect of TIP #601.


pooryorick added on 2023-01-20 14:02:14:

The proposed fix, "Allow "-strict" immediately before or after "-failindex var"", has some disadvantages:

It requires a new constraint on encoding names that wasn't there before.
It requires prohibiting "-nocomplain" in when "-failindex" is given. Otherwise, the argument parser breaks. This seems stem from a desire to make minimal changes to the existing argument parser, which is already not so complex.
By very-specifically ordering the allowed options, it closes the door on future evolution of [encoding] options, which I believe will be needed.

Reworking the parser to accept options in any order, as I have done in [6e15578c88cf5750], is not so hard. Why not choose either usint "--" or making "encoding" mandatory, and proceed that way?


oehhar added on 2023-01-20 13:38:22:

Jan, thank you for the implementation. I looked over it.

It would be really great, to:

  • control the reporting mode only by -nocomplain and -failindex, e.g. to decide, if an error is thrown (no option), error index returned (option -failindex), or if any error causes the use of the replacement character "?" or verbatim input byte inclusion (option -nocomplain).
  • the option -strict only changes the encoding table (e.g. which input bytes are valid and which are not)

This would require to change TIP 346 to not implement the 2nd bullet of chapter Specification.

When I see the implementation, at least from the comments, there is still a mixure that "-strict" changes the reporting mode.

generic/tclEncoding.c: 2392: * If in input mode, and -strict or -failindex is specified: This is an error.

tests/cmdAH.test 241 } -returnCodes 1 -result {wrong # args: should be "encoding convertfrom ?-nocomplain|?-strict? ?-failindex var?? ?encoding? data"}

I think, the error result should be: wrong # args: should be "encoding convertfrom ?-nocomplain|-failindex var? ?-strict? ?encoding? data"

So, -strict should be totally independent from reporting mode, which is -nocomplain or -failindex.

Thank you and take care, Harald

Personal opinion: I am also in favor to widthdraw #346 as error. We already did this with TIP 454 after positive vote, as there were compatibility concerns. TIP454 may get back into life, if Tk9.0 will be approchaed ;-).


pooryorick added on 2023-01-20 13:04:03:
TIP 346 can be paraphrased as, "raise an error when there is an encoding
problem, but also redefine what constitutes an encoding problem".  That's just
not very workable.  Would be better to withdraw it and go back to a clean
drawing board.

pooryorick added on 2023-01-20 12:28:06:

TIP 601 is basically a "fix" for TIP 346: It proposes the functionality that TIP 346 should have proposed, but a new name had to be chosen because TIP 601 already occupied "-strict".


jan.nijtmans added on 2023-01-20 12:01:26:

TIP #346 mentions: "This -strict option cannot be combined with the already existing -nocomplain and -failvar options".

Now it's becoming clear that it's necessary to allow -strict and -failvar to be used together (see title of this issue!). That doesn't make TIP #346 completely bogus, it's something that can be corrected. That's what this ticket is about.

Proposed fix here

A consequence of TIP #601 (and also of TIP #346) is that the proposed syntax:

    encoding convertfrom ?-nocomplain? ?encoding? data
    encoding convertto ?-nocomplain? ?encoding? data

makes the ?encoding? argument ambiguous: You cannot have encodings like "-n" or "-noc" any more, it would be interpreted as "-nocomplain" and ?encoding? will get the default value (the system encoding). Is that a problem, does it make TIP #601 bogus? I don't think so. No existing encoding start with "-".


oehhar added on 2023-01-20 07:58:37:

Dear Nathan,

about the ambiguity, you are right, and this is not beautiful. Possible solutions (mandatory encoding, end of options option "--", no encoding with same name as an option) may be discussed further.

Other commands (regexp, switch) also have this ambiguity and choose the "end of options option "--"). I personally think, this would be the best solution.

Thank you for your constant work, Harald


pooryorick added on 2023-01-20 00:47:13:

Another point: Jan argues that "-failindex" implies "-strict", but TIP 346 purports to raise an error when there is an encoding problem. "-failindex" does not raise an error, so it implies "not -strict"! "-failindex" implies both X and the opposite of X because TIP 346 was improperly formulated and doesn't make sense. That TIP should should be withdrawn. It's causing far too much trouble.


pooryorick added on 2023-01-20 00:38:32:

In the second case, is the first occurrence of "-strict" an option or an encoding name? If it is known, for example that an encoding name never starts with "-", the ambiguity can be resolved, but that's not the case, and therefore its unresolvable.

In [6e15578c88cf5750], on the tcl-encodingdefaultstrict branch, I've modified [encoding ...] to make the "encoding" argument mandatory so it can handle any number of options in any order. On this branch, "strict" doesn't mean anything more than "raise an error on encoding failures". I think it should mean exactly that, and nothing more.


oehhar added on 2023-01-19 12:22:28:

Well, as long as there are no encoding names equal to an option name or its abreviation, it can be sorted out (data is "-strict"):

  • encoding convertfrom -nocomplain -strict
  • encoding convertfrom -nocomplain -strict -strict
  • encoding convertfrom -nocomplain utf8 -strict
  • encoding convertfrom -failindex abc -strict
  • encoding convertfrom -failindex abc utf8 -strict
  • encoding convertfrom -failindex abc -strict -strict
  • encoding convertfrom -failindex abc -strict utf8 -strict

So, case 2 and 3 are ambiguous on the positioning, requiring to check, that the encoding name is "-strict". This is not beautiful but doable.

Am I missing something? Harald


pooryorick added on 2023-01-19 11:55:57:

Yes, "-nocomplain" was implemented as an option, but when it is provided, no other option can be provided. Allowing multiple options would make the ?encoding? argument ambiguous.


oehhar added on 2023-01-19 09:37:00:

Dear Nathan, please allow me to add my personal opioion:

Well, I have implemented the "-nocomplain var" in the encoding command. I don't see any issues to add additional options.

My main concern was the byte compiler, as those commands are "partly" byte-compiled. But I tried it using special test cases for this and it worked.

Take care, Harald


pooryorick added on 2023-01-18 22:17:54:

Normally one wouldn't make an unnecessary argument a syntax error. It would just be a no-op. The reason that Jan is pushing to have one option imply another is that there is only room for [encoding convert...] to take one option before it becomes ambiguous. I think trying to keep the option count at only one is a lost cause, and [encoding convert] should make the "encoding" option mandatory. That would free up syntactic space and allow for more than one option.


pooryorick added on 2023-01-18 13:12:24:

Jan wrote, "'-strict' is just a flag, every encoding can do with it whatever it likes (so-to-say)."

In that case, what does "-strict" mean?


jan.nijtmans added on 2023-01-18 12:40:13:

> apnadkarni added on 2023-01-18 10:50:14: > > For "encoding convertfrom", using -failindex without -strict makes no sense. > Not the case. As I wrote on the mailing list,

Indeed, sorry. I meant "encoding convertto" here. But .... even then my statement was not correct in general. "-strict" is just a flag, every encoding can do with it whatever it likes (so-to-say).


pooryorick added on 2023-01-18 12:31:16:

There is no need to provide an option to configure the encoder/decoder to handle surrogate characters in some special way. The Unicode specification is clear enough on their use, and where they may and may not occur. There has never been an example provided in any TIP or message of when "-strict" in its current incarnation ("disallow surrogate characters in utf encoding forms") might be needed, and there never will be. Surrogate characters are already disallowed in those encoding forms by the Unicode specification. The encoder/decoder can just follow the specification on this count.

"-strict" is a useless option, and "-nocomplain" was introduced to provide the functionality that should have been provided by "-strict". "-nocomplain" is the wrong name for the option that controls raising an error on an encoding error, because there are too many ways not to complain. There should be one option that specifically says "raise an error", and other options that configures the encoder not to complain in different useful ways.


jan.nijtmans added on 2023-01-18 12:29:39:
@pouryorick wrote:
> "-strict" should not be a Unicode or utf-specific setting
I was specifally refering to utf-8, immediately before the "-should" you refered to. Nothing is mixed up if you read the full sentence and consider the context of the conversation.

@Harald. Thank you for agreeing on the approach.

pooryorick added on 2023-01-18 12:08:32:

Jan wrote, "...should only check for invalid byte sequences, nothing more. Adding "-strict", then it should check for surrogates too."

"-strict" should not be a Unicode or utf-specific setting. It should be general option for handling encodings in general. Tcl is getting things all mixed up.


oehhar added on 2023-01-18 11:13:55:

Jan, thank you for your text:

I see a way out here: without "-strict" the "encoding convertto -failindex pos utf-8" should only check for invalid byte sequences, nothing more. Adding "-strict", then it should check for surrogates too. Then the examples in the TIP #607 will work again as described in the TIP, and -strict will work as expected as well. 

I totally agree, thank you!

Harald


apnadkarni added on 2023-01-18 10:50:14:
> For "encoding convertfrom", using -failindex without -strict makes no sense.

Not the case. As I wrote on the mailing list,

% encoding convertfrom -failindex i utf-32 \x41\x00\x00\x00\x7f\xff\xff\x7f
A
% set i
4

As expected. Thus cannot have -failindex imply -strict. In particular, when you want the default semantics (allow invalid bytes, surrogates etc.) but not "out of range" code points, and want the failing index, not an exception.

Whether this is useful or not is a different question but the way semantics are currently defined, I don't think -failindex can imply -strict.

/Ashok

jan.nijtmans added on 2023-01-18 10:22:14:

Harald wrote: > I have no way out here. But now defining, that (the IMHO toxic "-strict") is implied in "-failindex" makes everything worse and not better

I see a way out here: without "-strict" the "encoding convertto -failindex pos utf-8" should only check for invalid byte sequences, nothing more. Adding "-strict", then it should check for surrogates too. Then the examples in the TIP #607 will work again as described in the TIP, and -strict will work as expected as well.

It's just a matter of implementation .....

For "encoding convertfrom", using -failindex without -strict makes no sense.


oehhar added on 2023-01-18 10:20:42:

I personally would prefer to remove -strict and keep -.nocomplain. "-strict" is only for UTF-8 (switches to alternate version of the encoding), "-nocomplain" is for all encodings.

Thank you, Harald


pooryorick added on 2023-01-18 08:37:37:

This sure is confusing. On the trunk-encodingdefaultstrict branch, I removed at the C all flags and code segments related to -nocomplain, and the entire test suite passes there with only minor adjustments regarding no -nocomplain. I would suggest that unless someone can add a test to that branch showing a need for -nocomplain, that it is really nothing more than the nonstrict case.

My definition of "strict" on that branch is, "raises an error when there is an encoding error", and I believe that is what the actual function of strict has evolved to be. Prove me wrong with a failing test added to trunk-encodingdefaultstrict.


oehhar added on 2023-01-16 12:25:39:

Dear Jan, about the examples:

"TIP 346" came after TIP 601 and 607, making the examples of TIP607 invalid.

TIP601 already solved the point, that "enocding convertfrom utf-8" does add invalides bytes as verbatim.

Then, TIP 346 came and changed this again and mixed it all up. TIP346 defined a new (so far unknown) encoding "utf8 without -strict" which is in the result in big parts identical to "encoding convertfrom -nocomplain utf8".

The difference is:

  • the new encoding "utf8 without -strict" is only with utf-8.
  • "-nocomplain" covers all encodings.

So, in consequence, the following are equivalent, expect the codepoint "\xC0\x80":

  • encoding converfrom -nocomplain -strict utf8 $data
  • encoding convertfrom utf8 $data

This is not the case for other encodings. where -strict just has no effect:

  • encoding converfrom -nocomplain utf16 $data
  • encoding convertfrom utf16 $data

The first will include any invalid byte verbatim. The 2nd will throw an error on invalid bytes.

I have no way out here. But now defining, that (the IMHO toxic "-strict") is implied in "-failindex" makes everything worse and not better.

Sorry, this is my personal opinion, not important.

Take care, Harald


oehhar added on 2023-01-16 11:09:28:

Dear Jan, thanks for the comment.

For me, the "-strict" option is an encoding variant. So, it is together with the encoding itself.

The only "-strict" relevant TIP on my radar is TIP 346 The TIP says, that:

  • for utf-8 encoding, the byte sequence "\xC0\x80" is accepted without "-strict" and not accepted with it.
  • invalid byte sequence handling in UTF-8. Replace any not encodable and incomplete byte by the the byte itself

So, the TIP (without -strict) defines a UTF-8 encoding variant, where not encodable bytes are replaced by the data itself. IMHO, this is just a broken by design utf-8 variant, but ok, the TIP is as it is.

The 2nd bullet of the TIP sould be handled by the error reporting mode, as with all other multi-byte encodings.

Unfortunately, utf-8 is different. Here we have the situation, that we have two different encodings, one with an interesting replacement method of invalid bytes, one without.

I hope, this interesting special handling will not go to other multi-byte encodings like shift-jis, big5, UTF-16....

Thank you again, Harald


jan.nijtmans added on 2023-01-16 10:16:36:

I'm OK with changing the implementation, making -strict independant from -nocomplain/-failindex. It's just that -failindex doesn't make much sense without -strict, so the example in TIP #607 (below) won't work without -strict.

> Example for 'Incomplete multi-byte sequence'
>  encoding convertfrom -failindex Pos utf-8 A\xC4
> The byte 'xC4' announces a multi byte sequence. Then, nothing follows.
>
> The command will return 'A' with value 1 in variable 'Pos'.

Of course, we can post-edit the TIP, adding "-strict" to this example. Otherwise:

$ tclsh8.7
% encoding convertfrom -failindex Pos utf-8 A\xC4
AÄ
% set Pos
-1

This is different from what the TIP specifies.


oehhar added on 2023-01-15 10:20:03:

Sorry, Jan, no. "Encoding error reporting mode" (-nocomplain, -failindex) are different to "-strict". "-strict" is an encoding variant switch. There are two encodings with the same -encoding name, one with "-strict", one without "-strict". This has nothing to do with reporting mode.

Sorry, Harald


jan.nijtmans added on 2023-01-14 22:23:32:

Proposed fix [7cd1856ca4a693e5|here]

The -failindex option always was intended to imply -strict (otherwise invalid byte-sequences would never be reported), but the documentation and the error-message was not so clear about it. Therefore - also - updated the documentation.

Hope it's clearer now.