80.md at [6851915258]

Login

File tip/80.md artifact 80f062973b part of check-in 6851915258


# TIP 80: Additional Options for 'lsearch'
	Author:         Tom Wilkason <[email protected]>
	Author:         Tom Wilkason <[email protected]>
	State:          Final
	Type:           Project
	Vote:           Done
	Created:        02-Jan-2002
	Post-History:   
	Discussions-To: news:comp.lang.tcl
	Tcl-Version:    8.4
-----

# Abstract

This TIP proposes additional options for the _lsearch_ command to
return and work with all matching items in the return rather than the
first matching item. Additional options are also added.

# Rationale

The _lsearch_ function works well for finding the first item in a
list that matches a pattern.  However it is often useful to find all
of the items in the list that match a pattern.  This TIP proposes
adding options to return the entire list of matches.  With this
capability, additional options are proposed to return the data rather
than the indices \(since you often want to work the the data anyway\),
and to add an option to return the logical exclusion of the matching
items \(i.e. those that don't match the search pattern\).

# Specification

I propose the following options be added to _lsearch_:

Option: _-start index_

 > Initiates the list search starting at _index_, which can be
   any valid list index \(such as 0 , end , end-1 ...\) 

Option: _-all_

 > Returns a list of all indices that match the search condition
   \(rather than the first one\).  The indices are returned low to high
   order.  For a no match condition, a \{\} \(empty list\) is returned.
   If the the _-all_ or _-inline_ switches are not specified, a
   -1 is returned for a no match condition just as is done now.

Option: _-inline_

 > Returns a single item \(or a list of items for _-all_\) of the data
   that matches the search condition rather than the index \(or
   indices\).  An empty result or empty list \(_-all_\) is returned for
   a no match condition.  The data is returned in original list order.
   This option is useful when you want to iterate over the returned
   data anyway.  e.g.

	    foreach item [lsearch -all -inline -glob $someList *stuff] {
	       # deal with item
	    }

Option: _-not_

 > Negates the sense of the search condition \(i.e. what doesn't
   match\).  When used with the _-inline_ or _-all_ options, the
   return set will be the items that do match.  If all items match
   then a \{\} is returned.  Without the _-all_ option, the first item
   in the list that does not match will be returned.

These can be combined as needed and yield some powerful capabilities
when iterating over sub-lists \(esp. with the new _lset_ command\).

# Reference Implementation

Changes to the _Tcl\_LsearchObjCmd_ command in _generic/tclCmdIL.c_
are needed along with documentation and test code.  The changes to the
8.4 head version of _tclCmdIL.c_ are available below.

	/*
	 *----------------------------------------------------------------------
	 *
	 * Tcl_LsearchObjCmd --
	 *
	 *      This procedure is invoked to process the "lsearch" Tcl command.
	 *      See the user documentation for details on what it does.
	 *
	 * Results:
	 *      A standard Tcl result.
	 *
	 * Side effects:
	 *      See the user documentation.
	 *
	 *----------------------------------------------------------------------
	 */
	
	int
	Tcl_LsearchObjCmd(clientData, interp, objc, objv)
	    ClientData clientData;      /* Not used. */
	    Tcl_Interp *interp;         /* Current interpreter. */
	    int objc;                   /* Number of arguments. */
	    Tcl_Obj *CONST objv[];      /* Argument values. */
	{
	    char *bytes, *patternBytes;
	    int i, match, mode, index, result, listc, length, elemLen;
	    int useStart=-1, offset, allData=0, returnInline=0;
	    int dataType, isIncreasing, lower, upper, patInt, objInt, notMatch=0;
	    double patDouble, objDouble;
	    Tcl_Obj *patObj, **listv, *listPtr, *startPtr = NULL;
	    static CONST char *options[] = {
	        "-all", "-ascii", "-decreasing", "-dictionary",
	        "-exact", "-glob", "-increasing", "-inline",
	        "-integer", "-not", "-real", "-regexp",
	        "-sorted", "-start", NULL
	    };
	    enum options {
	        LSEARCH_ALL, LSEARCH_ASCII, LSEARCH_DECREASING, LSEARCH_DICTIONARY,
	        LSEARCH_EXACT, LSEARCH_GLOB, LSEARCH_INCREASING, LSEARCH_INLINE,
	        LSEARCH_INTEGER, LSEARCH_NOT, LSEARCH_REAL, LSEARCH_REGEXP,
	        LSEARCH_SORTED, LSEARCH_START
	    };
	
	    enum datatypes {
	        ASCII, DICTIONARY, INTEGER, REAL
	    };
	
	    enum modes {
	        EXACT, GLOB, REGEXP, SORTED
	    };
	
	    mode = GLOB;
	    dataType = ASCII;
	    isIncreasing = 1;
	    /* Note: This counts options as possible list|patterns */
	    if (objc < 3) {
	        Tcl_WrongNumArgs(interp, 1, objv, "?options? list pattern");
	        return TCL_ERROR;
	    }
	    for (i = 1; i < objc-2; i++) {
	        if (Tcl_GetIndexFromObj(interp, objv[i], options, "option", 0, &index)
	                != TCL_OK) {
	            return TCL_ERROR;
	        }
	        switch ((enum options) index) {
	            case LSEARCH_ASCII:         /* -ascii */
	                dataType = ASCII;
	                break;
	            case LSEARCH_NOT:           /* -not */
	                notMatch = 1;
	                break;
	            case LSEARCH_ALL:           /* -all */
	                allData = 1;
	                listPtr = Tcl_NewListObj(0, (Tcl_Obj **) NULL);
	                break;
	            case LSEARCH_INLINE:        /* -inline */
	                returnInline = 1;
	                break;
	            case LSEARCH_START:         /* -start index */
	                useStart = ++i;         /* Use next arg as offset index */
	                if (objc-i < 2) {
	                    Tcl_SetResult(interp,
	                            "missing argument to -start option", TCL_STATIC);
	                }
	                break;
	            case LSEARCH_DECREASING:    /* -decreasing */
	                isIncreasing = 0;
	                break;
	            case LSEARCH_DICTIONARY:    /* -dictionary */
	                dataType = DICTIONARY;
	                break;
	            case LSEARCH_EXACT:         /* -exact */
	                mode = EXACT;
	                break;
	            case LSEARCH_INCREASING:    /* -increasing */
	                isIncreasing = 1;
	                break;
	            case LSEARCH_INTEGER:       /* -integer */
	                dataType = INTEGER;
	                break;
	            case LSEARCH_GLOB:          /* -glob */
	                mode = GLOB;
	                break;
	            case LSEARCH_REAL:          /* -real */
	                dataType = REAL;
	                break;
	            case LSEARCH_REGEXP:        /* -regexp */
	                mode = REGEXP;
	                break;
	            case LSEARCH_SORTED:        /* -sorted */
	                mode = SORTED;
	                break;
	        }
	    }
	     
	
	    /*
	     * -start option processing:
	     * Ensure we get a unique copy of command line arg for start index
	     */
	    if (useStart > 0) {
	        startPtr = Tcl_DuplicateObj(objv[useStart]);
	    }
	
	    /*
	     * Make sure the list argument is a list object and get its length and
	     * a pointer to its array of element pointers.
	     */
	    result = Tcl_ListObjGetInline(interp, objv[objc - 2], &listc, &listv);
	    if (result != TCL_OK) {
	        return result;
	    }
	    /*
	     * Retrieve user specified start offset.
	     */
	    if (useStart > 0) {
	        result = TclGetIntForIndex(interp, startPtr, /*end*/ listc-1, &offset);
	        Tcl_DecrRefCount(startPtr); /* free unneeded obj */
	
	        if (result != TCL_OK) {
	           return result;
	        } else if (offset < 0) {
	           offset = 0;
	        }
	    } else {
	       offset = 0;
	    }
	
	    /*
	     * Process the pattern
	     */
	    patObj = objv[objc - 1];
	    patternBytes = NULL;
	    if ((enum modes) mode == EXACT || (enum modes) mode == SORTED) {
	        switch ((enum datatypes) dataType) {
	            case ASCII:
	            case DICTIONARY:
	                patternBytes = Tcl_GetStringFromObj(patObj, &length);
	                break;
	            case INTEGER:
	                result = Tcl_GetIntFromObj(interp, patObj, &patInt);
	                if (result != TCL_OK) {
	                    return result;
	                }
	                break;
	            case REAL:
	                result = Tcl_GetDoubleFromObj(interp, patObj, &patDouble);
	                if (result != TCL_OK) {
	                    return result;
	                }
	                break;
	        }
	    } else {
	        patternBytes = Tcl_GetStringFromObj(patObj, &length);
	    }
	
	    /*
	     * Set default index value to -1, indicating failure; if we find the
	     * item in the course of our search, index will be set to the correct
	     * value.
	     */
	    index = -1;
	    match = 0;
	    if ((enum modes) mode == SORTED && allData == FALSE) {
	        /*
	         * If the data is sorted, we can do a more intelligent search.
	         * Note that there is no point in being smart when -all was
	         * specified; in that case, we have to look at all items anyway.
	         */
	        lower = offset-1 /*-1*/;
	        upper = listc;
	        while (lower + 1 != upper) {
	            i = (lower + upper)/2;
	            switch ((enum datatypes) dataType) {
	                case ASCII: {
	                    bytes = Tcl_GetString(listv[i]);
	                    match = strcmp(patternBytes, bytes);
	                    break;
	                }
	                case DICTIONARY: {
	                    bytes = Tcl_GetString(listv[i]);
	                    match = DictionaryCompare(patternBytes, bytes);
	                    break;
	                }
	                case INTEGER: {
	                    result = Tcl_GetIntFromObj(interp, listv[i], &objInt);
	                    if (result != TCL_OK) {
	                        return result;
	                    }
	                    if (patInt == objInt) {
	                        match = 0;
	                    } else if (patInt < objInt) {
	                        match = -1;
	                    } else {
	                        match = 1;
	                    }
	                    break;
	                }
	                case REAL: {
	                    result = Tcl_GetDoubleFromObj(interp, listv[i],
	                            &objDouble);
	                    if (result != TCL_OK) {
	                        return result;
	                    }
	                    if (patDouble == objDouble) {
	                        match = 0;
	                    } else if (patDouble < objDouble) {
	                        match = -1;
	                    } else {
	                        match = 1;
	                    }
	                    break;
	                }
	            }
	            if (match == 0) {
	                /*
	                 * Normally, binary search is written to stop when it
	                 * finds a match.  If there are duplicates of an element in
	                 * the list, our first match might not be the first occurance.
	                 * Consider:  0 0 0 1 1 1 2 2 2
	                 * To maintain consistancy with standard lsearch semantics,
	                 * we must find the leftmost occurance of the pattern in the
	                 * list.  Thus we don't just stop searching here.  This
	                 * variation means that a search always makes log n
	                 * comparisons (normal binary search might "get lucky" with
	                 * an early comparison).
	                 */
	                index = i;
	                upper = i;
	            } else if (match > 0) {
	                if (isIncreasing) {
	                    lower = i;
	                } else {
	                    upper = i;
	                }
	            } else {
	                if (isIncreasing) {
	                    upper = i;
	                } else {
	                    lower = i;
	                }
	            }
	        }
	    } else {
	        for (i = offset; i < listc; i++) {
	            match = 0;
	            switch ((enum modes) mode) {
	                case SORTED:
	                case EXACT: {
	                    switch ((enum datatypes) dataType) {
	                        case ASCII: {
	                            bytes = Tcl_GetStringFromObj(listv[i], &elemLen);
	                            if (length == elemLen) {
	                                match = (memcmp(bytes, patternBytes,
	                                        (size_t) length) == 0);
	                            }
	                            break;
	                        }
	                        case DICTIONARY: {
	                            bytes = Tcl_GetString(listv[i]);
	                            match =
	                                (DictionaryCompare(bytes, patternBytes) == 0);
	                            break;
	                        }
	                        case INTEGER: {
	                            result = Tcl_GetIntFromObj(interp, listv[i],
	                                    &objInt);
	                            if (result != TCL_OK) {
	                                return result;
	                            }
	                            match = (objInt == patInt);
	                            break;
	                        }
	                        case REAL: {
	                            result = Tcl_GetDoubleFromObj(interp, listv[i],
	                                    &objDouble);
	                            if (result != TCL_OK) {
	                                return result;
	                            }
	                            match = (objDouble == patDouble);
	                            break;
	                        }
	                    }
	                    break;
	                }
	                case GLOB: {
	                    match = Tcl_StringMatch(Tcl_GetString(listv[i]),
	                            patternBytes);
	                    break;
	                }
	                case REGEXP: {
	                    match = Tcl_RegExpMatchObj(interp, listv[i], patObj);
	                    if (match < 0) {
	                        return TCL_ERROR;
	                    }
	                    break;
	                }
	            }
	            /* Invert match condition for -not */
	            if (notMatch) {
	                match = (match != 0 ? 0 : 1);
	            }
	
	            /* Process the possible match for this element */
	            if (match != 0) {
	                if (allData) {
	                    if (returnInline) {
	                        /* Append data */
	                        Tcl_ListObjAppendElement(interp, listPtr,listv[i]);
	                    } else {
	                        /* Append index */
	                        Tcl_ListObjAppendElement(interp, listPtr,Tcl_NewIntObj(i));
	                    }
	                } else {
	                    index = i;
	                    break;
	                }
	            }
	        }
	    }
	    /*
	     * Return either a list (-all) or a single element
	     */
	    if (allData) {
	        Tcl_SetObjResult(interp,listPtr);
	    } else {
	        if (returnInline) {
	            if (index < 0) { /* Return a null */
	                Tcl_SetObjResult(interp,Tcl_NewObj());
	            } else {         /* Return one datum */
	                Tcl_SetObjResult(interp,listv[index]);
	            }
	        } else {
	            Tcl_SetIntObj(Tcl_GetObjResult(interp), index);
	        }
	    }
	    return TCL_OK;
	}

# Notes

The changes to _lsearch_ are entirely backward compatible and do no
change the behaviour or performance of the command for existing
options.  Moreover, these changes should not impact any of the other
list changes in [[22]](22.md), [[33]](33.md) or [[45]](45.md).

# Copyright

This document has been placed in the public domain.

# Appendix

The benchmarks below denote the expected speed increase of using the
new options vs. tcl only implementations.  Your mileage may vary.

	##
	# performs a lsearch -all -inline -glob search
	#
	proc lsearch_dataGLOB {listData pattern} {
	    set result [list]
	    foreach item $listData {
	        if {[string match $pattern $item]} {
	            lappend result $item
	        }
	    }
	    return $result
	}
	##
	# performs a lsearch -all -inline -regexp search
	#
	proc lsearch_dataRE {listData pattern} {
	    set result [list]
	    foreach item $listData {
	        if {[regexp $pattern $item]} {
	            lappend result $item
	        }
	    }
	    return $result
	}
	##
	# performs a lsearch -all -glob search
	#
	proc lsearch_allGLOB {listData pattern} {
	    set result [list]
	    set count 0
	    foreach item $listData {
	        if {[string match $pattern $item]} {
	            lappend result $count
	        }
	        incr count
	    }
	    return $result
	}
	
	# Build a 2K list of data
	catch {unset LIST}
	time {lappend LIST someStuff} 1000
	time {lappend LIST otherStuff} 1000
	
	# Case with all data matching in a 2K list 2.8x speedup
	puts "#C implementation [time {lsearch -glob -all -inline $LIST *Stuff} 100]"
	#=> C implementation 3766 microseconds per iteration
	puts "#tcl implementation [time {lsearch_dataGLOB $LIST *Stuff} 100]"
	#=> tcl implementation 10815 microseconds per iteration
	
	# Case with all data matching but returning indicies 3X speed up
	puts "#C implementation [time {lsearch -glob -all $LIST *Stuff} 100]"
	#=> C implementation 4305 microseconds per iteration
	puts "#tcl implementation [time {lsearch_allGLOB $LIST *Stuff} 100]"
	#=> tcl implementation 13277 microseconds per iteration
	
	# Case with no matching data 8X speed up
	puts "#C implementation [time {lsearch -glob -all -inline $LIST none*} 100]"
	#=> C implementation 646 microseconds per iteration
	puts "#tcl implementation [time {lsearch_dataGLOB $LIST none*} 100]"
	#=> tcl implementation 5354 microseconds per iteration
	
	
	# Repeat with RE, note more time spent in RE engine 2X speedup
	puts "#C implementation [time {lsearch -regexp -all -inline $LIST Stuff} 100]"
	#=> C implementation 35260 microseconds per iteration
	puts "#tcl implementation [time {lsearch_dataRE $LIST Stuff} 100]"
	#=> tcl implementation 62292 microseconds per iteration
	
	# Case with no matching data 2X speedup
	puts "#C implementation [time {lsearch -regexp -all -inline $LIST none*} 100]"
	#=> C implementation 14815 microseconds per iteration
	puts "#tcl implementation [time {lsearch_dataRE $LIST none} 100]"
	#=> tcl implementation 30553 microseconds per iteration