# TIP 475: Add [string insert] Command and C API
Author: Andy Goth ([email protected])
State: Rejected
Type: Project
Vote: Done
Created: 22-Sep-2017
Post-History:
Keywords: Tcl,string,insert
Tcl-Version: 8.7
-----
# Abstract
This TIP proposes a [`string insert`] subcommand for inserting a substring at a
given index. This new [`string insert`] command is to be the string analogue of
[`linsert`], much like how [`string replace`] resembles [`lreplace`].
This TIP additionally proposes a `Tcl_ReplaceObj`() public function to provide
C-level access to the underlying functionality used to implement [`string
insert`] and [`string replace`].
# Rationale
## [`string insert`] Command
Substring insertion is a basic string operation not directly available in
current Tcl. Substring insertion can be synthesized from existing string
commands, but the numerous legal forms of indexing yield significant difficulty.
A novice user cannot be expected to know, much less implement, all possible
index formats. Thus it is reasonable to provide a standard substring insertion
command.
The current design of [`string replace`] expressly (albeit inexplicably)
prevents its use for performing string insertion. TIP 323 originally proposed
to extend [`string replace`] to allow string insertion, but this aspect of TIP
323 was [withdrawn]
(https://core.tcl-lang.org/tips/fdiff?v1=6e0ba0ee9838accc&v2=34809f1432fc528f&sbs=1)
for the sake of compatibility.
> *Clarification: TIP 475 proposes no changes to the semantics of [`string
> replace`].*
To mirror the behavior of [`linsert`], [`string insert`] at `end` should append
to the string. This is in conflict with [`string replace`], were it to be
extended to permit replacing an empty substring.
[`lreplace`] of an empty range at `end` inserts elements immediately before the
final element, whereas appending requires the start index to be `end+1`. The
hypothetical extended [`string replace`] would mirror [`lreplace`] and therefore
would not have the end-relative indexing semantics needed to implement [`string
insert`].
In conclusion, it is most straightforward to simply provide a [`string insert`]
command with the same semantics as [`linsert`].
## `Tcl_ReplaceObj`() Function
The proposed [`string insert`] cannot be implemented in terms of [`string
replace`], even if [`string replace`] were extended to permit replacing an empty
substring. However, this fact only argues against implementing [`string
insert`] as a Tcl script which calls [`string replace`]. It is indeed possible
to implement both [`string insert`] and [`string replace`] in terms of a common
C function supporting the superset of required capabilities.
Not only is it possible, but it is also preferable to take this approach.
Minimizing the number of code paths increases the potential benefit of any given
optimization, improves instruction cache utilization, and reduces both the
possibility for bugs and the number of interactions to be audited.
Furthermore, this new function should be publicly exported using the stubs
interface so extension code can call it without having to enter the interpreter.
The desirability of providing C APIs is made clear by the existence of the the
[FlightAware [`array`] enumeration C API project]
(https://core.tcl-lang.org/tcl/timeline?t=amg-array-enum-c-api).
# Current Behavior
Currently available methods for string insertion are awkward and do not handle
end-relative and [TIP 176-style](176.md) indexing without extraordinary effort.
To demonstrate the surprising degree of complexity, the following is a pure Tcl
script reference implementation intended to handle all possible index formats
and corner cases:
<a name="ref"></a>
>
# Pure Tcl implementation of [string insert] command.
proc ::tcl::string::insert {string index insertString} {
# Convert end-relative and TIP 176 indexes to simple integers.
if {[regexp -expanded {
^(end(?![\t\n\v\f\r ]) # "end" is never followed by whitespace
|[\t\n\v\f\r ]*[+-]?\d+) # m, with optional leading whitespace
(?:([+-]) # op, omitted when index is "end"
([+-]?\d+))? # n, omitted when index is "end"
[\t\n\v\f\r ]*$ # optional whitespace (unless "end")
} $index _ m op n]} {
# Convert first index to an integer.
switch $m {
end {set index [string length $string]}
default {scan $m %d index}
}
>
# Add or subtract second index, if provided.
switch $op {
+ {set index [expr {$index + $n}]}
- {set index [expr {$index - $n}]}
}
} elseif {![string is integer -strict $index]} {
# Reject invalid indexes.
return -code error "bad index \"$index\": must be\
integer?\[+-\]integer? or end?\[+-\]integer?"
}
>
# Concatenate the pre-insert, insertion, and post-insert strings.
string cat [string range $string 0 [expr {$index - 1}]] $insertString\
[string range $string $index end]
}
>
# Bind [string insert] to [::tcl::string::insert].
namespace ensemble configure string -map [dict replace\
[namespace ensemble configure string -map]\
insert ::tcl::string::insert]
More sample implementations can be found on the [Additional String Functions]
(http://wiki.tcl.tk/44#pagetoc706ab8bb) page of the Tcler's Wiki, but at time of
writing, they do not handle end-relative indexing nor can be used to append to a
string. Since they are implemented in terms of [`string replace`] and do not
perform any index arithmetic of their own, they actually do support TIP 176
indexes.
# Compatibility Considerations
The existence of a command named [`string insert`] breaks any existing code that
assumes [`string in`] is an unambiguous abbreviation for [`string index`]. Two
options exist:
1. Special-case [`string in`] to mean [`string index`].
2. Take no special action, in which case [`string in`] becomes an error.
This TIP proposes option #2 because abbreviations are not guaranteed to be
stable in the long term. This TIP targets Tcl 8.7, the first alpha version of
which was recently released. Thus there are three reasons why compatibility is
deemphasized in this situation.
# Specification
Add a new [`string insert`] command:
> **string insert** *string index insertString*
> Returns a copy of *string* with *insertString* inserted at the *index*'th
> character. *index* may be specified as described in the [**STRING
> INDICES**](https://www.tcl.tk/man/tcl/TclCmd/string.htm#M54) section.
> If *index* is start-relative, the first character inserted in the returned
> string will be at the specified index. If *index* is end-relative, the last
> character inserted in the returned string will be at the specified index.
> If *index* is at or before the start of *string* (e.g., *index* is **0**),
> *insertString* is prepended to *string*. If *index* is at or after the end of
> *string* (e.g., *index* is **end**), *insertString* is appended to *string*.
Add a new `Tcl_ReplaceObj`() stubs-exported function:
> Tcl\_Obj \*
> **Tcl_ReplaceObj**(*interp, objPtr, startIndex, removeCount, insertObjPtr*)
>
Tcl\_Interp \**interp* (in) | If non-NULL, interpreter in which error result messages are stored.
Tcl\_Obj \**objPtr* (in/out) | Points to a value to manipulate.
int *startIndex* (in) | The index of the first character to insert, replace, or remove.
int *removeCount* (in) | The number of characters to replace or remove.
Tcl\_Obj \**insertObjPtr* (in/out) | The value to insert or replace. If NULL, interpreted as empty string.
> **Tcl\_ReplaceObj** inserts, replaces, or removes a substring in *objPtr* and
> returns a pointer to the new or updated string object. If *objPtr* or
> *insertObjPtr* are unshared, one of them will be modified in place, and its
> address will be returned. If this optimization is unavailable, a new string
> object will be allocated to store the result, its address will be returned,
> and neither *objPtr* nor *insertObjPtr* will be modified. Should a memory
> allocation error occur, **Tcl\_ReplaceObj** returns NULL and (if *interp* is
> non-NULL) places error information in the result value of *interp*.
> The choice of operation performed by **Tcl\_ReplaceObj** is determined by
> *removeCount* and the length of *insertObjPtr*. To insert, *removeCount* is
> zero, and *insertObjPtr* is the insertion substring. To replace,
> *removeCount* is the number of characters to replace, and *insertObjPtr* is
> the replacement substring. To remove, *removeCount* is the number of
> characters to remove, and *insertObjPtr* is NULL or empty string.
> *startIndex* is the index of the first character in *objPtr* to insert,
> replace, or remove. Negative *startIndex* is taken to be zero, and large
> *startIndex* is automatically limited to be no greater than the length of
> *objPtr*. In a similar manner, negative *removeCount* is taken to be zero,
> and large *removeCount* is automatically limited such that the sum of
> *startIndex* and *removeCount* is no greater than the length of *objPtr*.
> Consequently, if *startIndex* is at the end of *objPtr*, *insertObjPtr* will
> be appended to *objPtr*, and *removeCount* will be ignored.
Implement both the bytecoded and non-bytecoded [`string insert`] and [`string
replace`] commands in terms of `Tcl_ReplaceObj`().
# Reference Implementation
A pure Tcl reference implementation is given [above](#ref).
The [`amg-string-insert`]
(https://core.tcl-lang.org/tcl/timeline?t=amg-string-insert) branch in the Tcl Fossil
repository provides an optimized C implementation, complete with documentation,
bytecode compilation, and a full set of test cases exercising all code paths.
This C implementation creates a new `strinsert` bytecode, also known as
`STR_INSERT` or `INST_STR_INSERT` in varying contexts. The decision was made to
keep `strinsert` separate from the existing `strreplace` opcode due to
differences in end-relative indexing.
Implementing [`string insert`] in terms of `strreplace` would require modifying
the interface to `strreplace` and would therefore break compatibility with
[tclquadcode] (https://core.tcl-lang.org/tclquadcode/) and potentially other projects
that depend on the \[[`tcl::unsupported::assemble`](http://wiki.tcl.tk/28070)]
and \[[`tcl::unsupported::disassemble`](http://wiki.tcl.tk/21445)] commands.
Despite the use of the word "separate" above, the `strinsert` and `strreplace`
opcodes are both implemented in terms of the common `Tcl_ReplaceObj`() function,
as are the non-bytecoded [`string insert`] and [`string replace`] commands.
This function was written by distilling common code found across the baseline
implementations of the various subroutines that now call it.
# Open Issues
The `Tcl_ReplaceObj`() function header comment includes [this TODO note]
(https://core.tcl-lang.org/tcl/artifact?name=6934c9ea66d17949&ln=3312-3315):
> Memory allocation failure is only checked when concatenating shared, non-pure
> byte array, non-pure Unicode character array strings. Need to commit to a
> consistent memory allocation failure handling policy.
This comment is describing the [case]
(https://core.tcl-lang.org/tcl/artifact?name=6934c9ea66d17949&ln=3575-3592) in which
`Tcl_ReplaceObj`() calls `TclStringCatObjv`() to concatenate its two arguments.
`TclStringCatObjv`() returns TCL_ERROR on memory allocation failure, and
`Tcl_ReplaceObj`() dutifully translates that to a NULL return, with detailed
error information in the interpreter result.
In every other circumstance, `Tcl_ReplaceObj`() does *not* check for memory
allocation errors. Many, if not all, of the underlying functions it uses call
`Tcl_Panic`() on allocation error, so `Tcl_ReplaceObj`() has no opportunity to
handle errors itself. Therefore, when `Tcl_ReplaceObj`()'s documentation claims
it notifies its caller of memory allocation errors, it is making a promise it
can only rarely keep.
# Copyright
This document is placed in public domain.