[comment {-*- tcl -*- doctools manpage}]
[comment {$Id: $}]
[manpage_begin list n 1.2.2]
[copyright {2003 by Kevin B. Kenny. All rights reserved}]
[moddesc {Tcl Data Structures}]
[titledesc {Procedures for manipulating lists}]
[require Tcl 8.0]
[require struct [opt 1.2.1]]
[description]

[para]

The [cmd ::struct::list] namespace contains several useful commands for
processing Tcl lists. Generally speaking, they implement algorithms more
complex or specialized than the ones provided by Tcl itself.

[section COMMANDS]
[list_begin definitions]

[call [cmd ::struct::list::longestCommonSubsequence] \
     [arg {sequence1 sequence2 ?maxOccurs?}]]
Returns the longest common subsequence of [arg sequence1] and
[arg sequence2]. If the [arg maxOccurs] parameter is provided, the
common subsequence is restricted to elements that occur no more than
[arg maxOccurs] times in [arg sequence2].
[nl]
The return value is a list of two lists of equal length. 
The first sublist is of indices into [arg sequence1], and the
second sublist is of indices into [arg sequence2].  Each corresponding
pair of indices corresponds to equal elements in the sequences;
the sequence returned is the longest possible.

[call [cmd ::struct::list::longestCommonSubsequence2] \
     [arg {sequence1 sequence2 ?maxOccurs?}]]
Returns an approximation to the longest common sequence of [arg sequence1] and
[arg sequence2]. If the [arg maxOccurs] parameter is omitted, the
subsequence computed is exactly the longest common subsequence; otherwise,
the longest common subsequence is approximated by first determining the
longest common sequence of only those elements that occur no more than
[arg maxOccurs] times in [arg sequence2], and then using that result to
align the two lists, determining the longest common subsequences of the
sublists between the two elements.
[nl]
As with [cmd ::struct::list::longestCommonSubsequence],
The return value is a list of two lists of equal length. 
The first sublist is of indices into [arg sequence1], and the
second sublist is of indices into [arg sequence2].  Each corresponding
pair of indices corresponds to equal elements in the sequences.
The sequence approximates the longest common subsequence.

[list_end]

[section {LONGEST COMMON SUBSEQUENCE AND FILE COMPARISON}]

[para]

The [cmd ::struct::list::longestCommonSubsequence] command forms the
core of a flexible system for doing differential comparisons of files,
similar to the capability offered by the Unix command [syscmd diff].

While this procedure is quite rapid for many tasks of file
comparison, its performance degrades severely if [arg sequence2]
contains many equal elements (as, for instance, when using this
procedure to compare two files, a quarter of whose lines are blank.
This drawback is intrinsic to the algorithm used (see the Reference
for details).

[para]
One approach to dealing with the performance problem that is
sometimes effective in practice is arbitrarily to exclude elements
that appear more than a certain number of times.  This number is
provided as the [arg maxOccurs] parameter.  If frequent lines are
excluded in this manner, they will not appear in the common subsequence
that is computed; the result will be the longest common subsequence
of infrequent elements. 
The procedure [cmd ::struct::list::longestCommonSubsequence2]
implements this heuristic.  It
functions as a wrapper around [cmd ::struct::list::longestCommonSubsequence]; 
it computes the longest
common subsequence of infrequent elements, and then subdivides the
subsequences that lie between the matches to approximate the true
longest common subsequence.

[section REFERENCES]
J. W. Hunt and M. D. McIlroy, "An algorithm for differential 
file comparison," Comp. Sci. Tech. Rep. #41, Bell Telephone 
Laboratories (1976). Available on the Web at the second
author's personal site: [uri http://www.cs.dartmouth.edu/~doug/]

[keywords list diff differential comparison common subsequence]
[manpage_end]