Tcl Source Code

View Ticket
Login
Ticket UUID: 1665628
Title: destructive scan of a ByteArray
Type: RFE Version: None
Submitter: dvrsn Created on: 2007-02-21 21:44:38
Subsystem: 12. ByteArray Object Assigned To: dkf
Priority: 5 Medium Severity:
Status: Open Last Modified: 2009-02-21 22:56:39
Resolution: None Closed By:
    Closed on:
Description:
There is no way to efficiently modify a ByteArray, either by appending data to the end or by removing data from the front (as might be useful for a binary stream buffer).  

Consider one task to read from a binary socket and buffer the data and another to decode counted strings from the front of the binary string:

  variable buffer
  proc read_handler {sock} {
    variable buffer
    append buffer [read $sock]
  }
  
  proc process_buffer {} {
    variable buffer
    if {[string length $buffer] > 0} {
      binary scan $buffer "I" dlen
      binary scan $buffer "x4 a$dlen" data
      binary scan $buffer "x4 x$dlen a*" buffer
    }
  }

The first problem is that "append" only operates on string reps, which requires the entire ByteArray to be scanned and converted to the utf-8 (or unicode) encoding.  (append also appears to unset the length of the string, meaning that a subsequent "string length" must recompute it).

The second problem is that scanning a few bytes from the front of the ByteArray and the modifying the original means that the entire ByteArray needs to be copied.  (maintaining a separate cursor is efficient but clumsy).

"append" should probably do the sensible thing and append ByteArrays efficiently (as well as setting the length correctly), but since this is specifically dealing with binary strings it may make sense to implement these operations entirely as subcommands of "binary".

I suggest
  binary append varName ?arg arg ...?
to append the ByteArray values of all args to the named binary value; and 
  binary dscan varName formatString ?varName varName ...?
dscan = "destructive scan", with functionality identical to "binary scan" except that it takes a varName instead of a string, and it changes the ByteArray object to have the start offset at the end of the format arguments   This would need the ByteArray object to store an offset indicating where the start of the data is.  The length computation of a ByteArray object would need to be changed to use the offset but this should be contained within tclBinary.c.  The destructive operation could either just change the offset or actually modify the underlying bytes.  If it works by changing the offset then there would need to be some point at which it would actually modify the bytes because the memory between the start of the bytes and the offset is wasted; this point could be some hardcoded maximum or a heuristic measure based on the overall size of the data.  (i.e., if the wasted space is larger than the used space, copy the live data to the beginning of the allocated space, limiting the wasted space to 100% overhead and using a non-overlapping move operation).

Another method for modifying a binary object could be
  binary set varName formatSting ?arg arg ...?
which would be something like "binary format" working on an existing bytearray.
User Comments: dkf added on 2009-02-21 22:56:39:
FRQ=Feature ReQuest

I prefer to discuss TIP designs on the tcl-core mailing list. But in this case... the majority of the TIP discusses features at the Tcl level. There's only one sentence which is likely to be a problem, and that's the one which talks about allowing Tcl_GetByteArrayFromObj to return TCL_ERROR. That implies a total change of type signature for that function.

But it is far better to discuss this on tcl-core, and in any case there is no hurry as it won't be in 8.6. It's also off-topic for *this*  FRQ anyway, which is about efficient update APIs.

[email protected] added on 2009-02-21 22:11:31:
Well, you were so quick to answer that one that I did TIP 346 in between ;-)
Now may I ask why you, as the TIP editor, tidied it up it without a remark, if it was so clearly unacceptable ?

(Also, looking back at it, I see the sentence about [encoding system] is inaccurate. Indeed T_GBAFO converts the UTF8 strep to ISO-8859-1 regardless of the surrounding encoding system. Sorry about that.)

Last question: FRQ== Fédération de Rugby du Québec ?

dkf added on 2009-02-21 16:40:44:

data_type - 360894

Right now, Tcl_GetByteArrayFromObj is not an API that can fail. (To be exact, it can't report the nature of the failure since it does not take a Tcl_Interp argument.) Going through the API to change this would be substantially disruptive, and equally so for third-party code. And Tcl_Panic is not appropriate either; that shouldn't be used for cases that are not C-level programming errors (we do it for memory management only because we don't know a good way to recover in general...) Given all this, I'm unwilling to countenance what you're proposing; the adding of new failure modes is tricky.

The original problem seems solved enough that this can be a FRQ.

ferrieux added on 2009-02-05 22:39:35:
Not in a hurry, but I'd happily TIP it for 8.7. Rotten branches must be cut frankly sometimes.
Would you veto it ?

dkf added on 2009-02-05 22:21:02:
You want (and have time) to go through and audit all the code to see what new error cases that introduces?

ferrieux added on 2009-02-05 22:07:16:
Thanks Donal.
Still, you didn't answer:

Why don't we raise an error in the conversion to byte array in that case ?
After all, conversions *to* internal reps are allowed to fail (eg
ill-formed lists).These situations occur when the system encoding doesn't allow proper
mapping of some of the input. Why hide that data loss ?

dkf added on 2009-02-05 21:41:29:
I've done that already (see 2568434). Still keeping this issue open as a master.

Currently converted things:
  Tcl_AppendObjToObj
  Tcl_GetCharLength
  Tcl_GetRange
  Tcl_GetUniChar
  TEBC:INST_CONCAT1
  TEBC:INST_STR_CMP
  TEBC:INST_STR_LEN
  TEBC:INST_STR_INDEX
  TEBC:INST_STR_MATCH
This means, for example, that [append] now does the Right Thing whether or not it is compiled.

ferrieux added on 2009-02-05 21:04:23:
Why don't we raise an error in the conversion to byte array in that case ? After all, conversions *to* internal reps are allowed to fail (eg ill-formed lists).
These situations occur when the system encoding doesn't allow proper mapping of some of the input. Why hide that data loss ?

Regarding the isCanonical (here isLossless) flag: that was my suggestion for all similar situations in 2564363...
But I admit I'd rather avoid it by simply forbidding loss of sync.

In the meantime, it seems I need to add the constraint to the CONCAT1 code, right ?

dkf added on 2009-02-05 18:00:18:
If you convert a string with characters in the \u0100-\uffff range to a byte array, the high byte of that character is dropped. If you modify the byte array then (e.g. by Tcl_SetByteArrayLength) the string rep will be thrown away. At that point, you'll have lost information. By requiring that there be no string rep in the first place, that means that there cannot be any information to lose and the operation is safe. (Otherwise, the fallback approach of working with strings is used, which produces correct results but more slowly.) Theoretically, we could be more sophisticated and record whether the string rep is generated from the bytearray or vice versa and use that as the safety constraint - we do the equivalent with lists - but the change is more intrusive and complex, and most practical bytearrays won't have string reps anyway.

ferrieux added on 2009-02-05 16:52:56:
This is very reminiscent of a similar optimization I made on INST_CONCAT1 for byte arrays last year. However, I don't see a trace of the "no string rep" constraint. So, either this constraint is needed and I must fix the CONCAT1 code, or it is not and you can remove it from your code. Can you elaborate on "the string-based formal semantics of the routine is circumventable" ?

dkf added on 2009-02-05 16:34:34:
I've enhanced Tcl_AppendObjToObj so that it does byte array concatenation safely and (reasonably) efficiently. It only does this optimization when both objects are bytearrays *and* the object being updated has no string rep (not a big problem; bytearrays don't usually have one unless people do things like printing them out for debugging purposes) - without that safety constraint, the string-based formal semantics of the routine is circumventable. This makes [append] of bytearrays to a variable containing a bytearray much faster according to my (informal) timing.

Other basic operations still need work, so leaving this issue open.

dkf added on 2007-02-24 20:51:35:
Logged In: YES 
user_id=79902
Originator: NO

I think it is reasonable to expect that [string index], [string range] and concatenation (but not [concat]) would work reasonably efficiently on bytearray objs. Moreover, making this happen can be done at this stage of the release cycle (unlike adding new subcommands to [binary], which would require TIPping and would formally constitute a new feature instead of an "optimization").

dvrsn added on 2007-02-23 09:10:58:

File Deleted - 217289:

dvrsn added on 2007-02-23 09:09:53:

File Added - 217319: tclBinary.patch

Logged In: YES 
user_id=24996
Originator: YES

Improved patch, handles shared objects correctly.

Also changes 'dscan' to modify the ByteArray only if all arguments were processed, so that for example 

 binary dscan foo "a a1000" char data

doesn't modify foo if it has less than 1001 bytes.
File Added: tclBinary.patch

dvrsn added on 2007-02-23 05:11:26:

File Added - 217289: tclBinary.patch

Logged In: YES 
user_id=24996
Originator: YES

Attached is a patch (against 8.4.14) implementing the "set" and "dscan" subcommands.  I'm not sure is the variable handling is correct (particularly, should the reference count be incremented when creating a new object?)

Rather than a separate "append" subcommand, the pattern would be

binary set buffer "@[string length $buffer] a*" $dataToAppend

Note too, that "set" will shrink the existing ByteArray if requested (meaning it will shrink if the format does not end with "@[string length $buffer]").  For an in-place/destructive modification, this is probably the right thing to do.



File Added: tclBinary.patch

dvrsn added on 2007-02-23 01:45:12:
Logged In: YES 
user_id=24996
Originator: YES

I will try the suggestion in my application, but the general problem with 

binary scan [read $sock 4] I dlen
binary scan [read $sock $dlen] a* data

is that when $sock is non-blocking you are not guaranteed to actually read $dlen bytes; while if it is blocking then it can incur an unacceptably long pause in event handling.  Beyond that, if the "data" itself contains additional counted strings then this technique is of no help since you are no longer reading from a stream.

I wonder if the problem has not been noticed more because it only shows up (as do so many others) with scale -- I did not notice it until my buffer sizes were several megabytes.

dgp added on 2007-02-22 22:07:12:
Logged In: YES 
user_id=80530
Originator: NO


While the observations appear to
be correct, and we should consider
addressing what we can...

...the way to avoid these problems
is to not use a "buffer" in the
first place, I think.  Tcl is not
C and C coding patterns often don't
map well onto it.

Instead of one huge [read] into a
"buffer" from which you struggle
to process things due to some
limitations and inefficiencies
of the base commands, why not
only [read] as many bytes as you
need at a time to do what you need 
to do?

binary scan [read $sock 4] I dlen
binary scan [read $sock $dlen] a* data

I think the availability of this
simple workaround is why we haven't noticed
these issues much before.

Attachments: