Index: doc/20190216callframe/callframe.md
==================================================================
--- doc/20190216callframe/callframe.md
+++ doc/20190216callframe/callframe.md
@@ -1,10 +1,9 @@
# Optimizing data motion into and out of callframes in Tcl
Kevin B Kenny
-16 February 2019
## Introduction
As of the current version ([0718166269]), the handling of variables in
the callframe reflects some fairly fundamental oversights in the effect of
@@ -22,11 +21,11 @@
author, to help with understanding what optimizations are safe and
profitable, and how to compute them. It will, however, hopefully aid
those who come after in understanding how the decisions were
made.
-## Back to basics - the code generation strategy
+### Back to basics - the code generation strategy
One general strategy has been overwhelmingly successful in other areas
of quadcode manipulation. Rather than attempting to add operations
that handle the awkward cases when they are detected, it has been much
easier to generate code for the worst case always, and then optimize
@@ -49,11 +48,11 @@
Nevertheless, all of these tests become redundant if a second access
to the variable is dominated by the first. In that case, partial
redundancy elimination will eliminate all five of these steps and
simply use the pure value in the arithmetic operation. The partial
redundancy elimination is based largely on the value-based algorithms
-developed in [SIMP96][]
+developed in \[SIMP96\].
We therefore assume here that we will take the brutally simple
approach of generating code that:
+ whenever a local variable (whether potentially aliased or not) is
@@ -96,13 +95,312 @@
to the same memory). His technique also does not contemplate downward
code motion, where a STORE operation that does not affect a loop's
operation might be moved to after the loop exit. To address these
deficiencies, it falls to us to develop a better theory of Tcl
variable accesses.
+
+### Trace semantics
+
+As soon as we get into discussion of non-local variables, the question
+of traces comes up. In this particular writeup, a very limited concept
+of traces is admitted:
+
+ * We assume that any traces present are 'well behaved' in the sense
+ that they do not invalidate the compilation (for instance, by
+ redefining or changing the resolution of compiled commands,
+ creating new aliases among local variables or changing their
+ values).
+
+ * We assume that skipping traces is permissible to a limited
+ extent. In particular, repeated writes to the same variable may
+ fire only a trace for the last write, while repeated reads may
+ fire only a single read trace. We do, however, respect the
+ ordering among reads or writes; a procedure that reads a variable,
+ changes its value, and puts it back will fire a read trace and a
+ write trace in the correct sequence. We also promise that any read
+ trace will fire at a time when all prior writes have been traced.
+
+These constraints are weak enough that many loop-invariant reads
+and writes can be optimized away. For instance, a procedure that
+repeatedly reads the same global value in a loop can fire a trace only
+for the first read, and one that accumulates a result through multiple
+writes of the same value can defer the write until after the
+loop. They nevertheless are strong enough that the common cases (the
+read trace on `::env`, and the write traces on variables that appear
+in Tk user interfaces) will work - and, in fact, may avoid needless
+work relative to interpreted code.
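
These rules can be made concrete with a small executable model (the
names and encoding here are hypothetical, purely for illustration, and
not part of quadcode). It contrasts a naively traced accumulation loop
with one compiled under the relaxed semantics:

```python
# Hypothetical model of the relaxed trace semantics: repeated reads of
# an unchanged variable may fire a single read trace, and repeated
# writes may fire a trace only for the last write.

def interpreted(g, n):
    """Naive semantics: every callframe access fires its trace."""
    fired, total = [], 0
    for _ in range(n):
        fired.append("read g")        # read trace on every iteration
        total += g
        fired.append("write total")   # write trace on every iteration
    return total, fired

def optimized(g, n):
    """Semantics permitted above: one read trace before the loop, one
    write trace after it; the loop body is pure value arithmetic."""
    fired, total = ["read g"], 0
    for _ in range(n):
        total += g                    # no callframe traffic, no traces
    fired.append("write total")
    return total, fired
```

Both versions compute the same result, but the optimized one fires two
traces instead of 2n, and still honours the ordering guarantee that
the read trace precedes the write trace.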
+
+### Types of optimization to be considered
+
+The particular optimizations considered in this document all consist
+of eliminating code motion to and from the callframe. There are a
+number of these that we will attempt to cover:
+
+ * Dead LOAD - An instruction moves a value from the callframe, but
+ the result is not used anywhere in the program. This case should
+ already be covered by the dead code analysis in quadcode.
+
+ * LOAD-LOAD - An instruction moves a value from the callframe, and
+ an earlier instruction has already put the Tcl value into an
+ SSA value. It can be proven that the value in the callframe has
+ not changed, so the second move can be eliminated.
+
+ * STORE-LOAD - An instruction moves a value from the callframe, and
+ an earlier instruction has moved an SSA value into the callframe.
+ It can be proven that the value in the callframe has not changed.
+ The move from the callframe can be eliminated.
+
+ * Loop-invariant LOAD - An instruction inside a loop moves a
+ value from the callframe, and it can be proven that nothing inside
+ the loop alters the value. The move from the callframe can be
+ hoisted to before the start of the loop. (Note that in this
+ case, partial redundancy elimination actually works by creating
+ a second move outside the loop, and reducing the optimization to
+ the LOAD-LOAD case.)
+
+ * Dead STORE - An instruction moves a value to the callframe, but
+ the variable in the callframe is not aliased and is not referred
+ to again before procedure exit.
+
+ * LOAD-STORE - An instruction moves a value to the callframe, but
+ the value came from an earlier move from the same place in the
+ callframe, and it can be proven that the value has not changed.
+ The move to the callframe can be eliminated.
+
+ * STORE-STORE - An instruction moves a value to the callframe, and
+ there is a _later_ instruction on each code path that moves a
+ value to the same place in the callframe, and nothing in between
+ the two instructions might read the value. It is safe to eliminate
+ the first store.
+
+ * Loop-invariant STORE - An instruction within a loop moves a value
+ to the callframe, and nothing inside the loop loads the value.
+ (This condition will commonly be introduced by STORE-LOAD
+ elimination.) In this case, the STORE may be pushed to after the
+ loop exit(s). It is expected that it will be simplest to introduce
+ speculative move instructions at each loop exit, which will then
+ cause STORE-STORE optimization to eliminate the move inside the
+ loop. (It is possible that some of the speculative instructions
+ can then be optimized away by other means.)
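
As a concrete illustration, here is a minimal straight-line sketch of
two of the patterns above, STORE-LOAD forwarding and dead-STORE
removal. The tuple encoding of instructions is invented for this
example (it is not the quadcode IR), and aliasing is ignored:

```python
def optimize_block(instrs, live_out):
    """instrs: straight-line list of ('store', var, ssaval) and
    ('load', dest, var); live_out: variables still needed at block
    exit.  Returns (optimized instructions, dest -> forwarded value)."""
    # Forward pass: STORE-LOAD - a load of a variable whose callframe
    # slot holds a known SSA value is replaced by a copy.
    known, copies, kept = {}, {}, []
    for op in instrs:
        if op[0] == 'store':
            known[op[1]] = op[2]
            kept.append(op)
        elif op[2] in known:                 # ('load', dest, var)
            copies[op[1]] = known[op[2]]     # forward, drop the load
        else:
            kept.append(op)
    # Backward pass: dead STORE / STORE-STORE - a store whose value is
    # never read before the next store (or block exit) is dropped.
    needed, result = set(live_out), []
    for op in reversed(kept):
        if op[0] == 'load':
            needed.add(op[2])
            result.append(op)
        elif op[1] in needed:
            needed.discard(op[1])            # this store supplies the reads
            result.append(op)
        # else: dead store, dropped
    result.reverse()
    return result, copies
```

On `store x, v1; load t, x; store x, v2` with `x` live at exit, the
load is forwarded to `v1` and the first store becomes dead.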
+
+From among these, it is useful to separate those requiring
+_anterograde_ analysis (proceeding from a procedure's entry to its
+exits) from those requiring _retrograde_ analysis (proceeding from the
+procedure exits to the entry). LOAD-LOAD, STORE-LOAD, loop-invariant
+LOAD, and LOAD-STORE are all anterograde. They depend on the
+availability of SSA counterparts to callframe values at a particular
+point in the program and on whether the use of a value in the
+callframe can be anticipated at a given point in the program.
+By contrast, dead-LOAD, dead-STORE, STORE-STORE, and loop-invariant
+STORE are retrograde. They depend on whether it is true that all paths
+forward to the program exit will contain a move to the callframe or a
+use of a callframe value, or whether some path contains a reference to
+a callframe value.
+
+The pass structure of the two types of analysis is different. While
+the 'iterate to convergence' structure would allow visiting the blocks
+of a program in any order, performance demands that anterograde
+analysis visit dominators before the blocks that they dominate, while
+retrograde analysis visits postdominators before the blocks that they
+postdominate. With that ordering, a given iteration to convergence
+will require only a number of passes proportional to the loop nesting
+depth.
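
The visit order can be sketched as follows, assuming a CFG given as a
successor map (block names hypothetical). For reducible flow graphs,
reverse postorder visits every block after its dominators, and the
reversed order serves the retrograde, postdominator-first passes:

```python
def reverse_postorder(succ, entry):
    """Depth-first search producing reverse postorder over `succ`,
    a mapping from block name to list of successor block names."""
    seen, post = set(), []
    def dfs(b):
        seen.add(b)
        for s in succ.get(b, ()):
            if s not in seen:
                dfs(s)
        post.append(b)          # appended after all successors
    dfs(entry)
    return post[::-1]
```

Iterating the blocks in this order makes a forward analysis reach
convergence in a number of passes proportional to the loop nesting
depth, as noted above.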
+
+## TASK 1: Availability analysis
+
+The first part of anterograde analysis will begin with developing the
+set of callframe values that are known to be available as SSA values.
+This involves looking at the instructions of a block in order, and
+accumulating the mapping of name and callframe instance to SSA value.
+
+Initially, the available set on exit from each block is empty.
+
+We iterate through a basic block, applying the following
+rules. Initially, the set of available values is the set from the
+block's dominator, or Ø if the block is the entry block or unreachable.
+
++ `entry` - All values listed on the instruction acquire `Nothing` as
+ their values.
+
++ `moveFromCallFrame` - The result of the instruction becomes the
+ SSA value corresponding to the given variable name in the given
+ callframe.
+
++ `moveToCallFrame` - The name-value mapping in the result callframe
+ is set up as follows:
+
+ + The available values in the source callframe become available
+ values in the destination callframe.
+
+ + For each name specified in the instruction from left to right, the
+ available value for the name becomes the corresponding value in
+ the instruction, and all other names that potentially alias the
+ given name lose their values.
+
++ `invoke[Expanded]` - The available values in the source callframe
+  become available in the destination callframe. Then any names that
+  the invoked command might change, and any names that might alias
+  them, lose their available values.
+
++ `directSet` etc. - Any name that might be an alias of the changing
+ variable loses its available value. (If the changing variable name
+ cannot be determined, all names might be aliases, because in this
+ case we cannot determine that the name is fully qualified.)
+
++ `variable`, `upvar`, `nsupvar` - Any existing value for the given
+ name and any possible alias thereof is no longer available. Note
+ that this is actually a program error, but we need to retain correct
+ behaviour in the case where the error is caught.
+
++ φ - If several callframes meet at a φ, the set of names that have
+ available values will be the intersection of the available sets of
+ the input callframes.
+
+ For each name in that intersection:
+
+ + If all the incoming values are the same, that value becomes the value of
+ the name in the result callframe.
+
+  + If the values are different, then we find the available value for
+    the name on exit from each predecessor block, and search for a φ
+    instruction that combines them.
+
+ + If such a φ is found, its result is the new value for the given
+ name.
+
+ + If no such φ is found, we insert such a φ speculatively into the
+ header of the block, and give its result as the value for the
+ given name.
+
+At the end of the block, after applying these rules, we have the
+block's new available set. The calculation of this set may in turn
+change the available values on φ instructions at the start of loops,
+so the calculation must be applied to convergence over all blocks in
+the program.
+
+Those who are versed in formal data flow analysis are welcome to
+separate the descriptions of the operations given above into separate
+sets `AVAIL_GEN` and `AVAIL_KILL` so that the problem can be given as
+the forward analysis:
+
+![\texttt{AVAIL\_OUT}[B] = \texttt{AVAIL\_GEN}[B]
+~ \cup ~ \left(
+\overline{\texttt{AVAIL\_KILL}[B]}
+~ \cap \bigcap_{P\in \texttt{pred}[B]} \texttt{AVAIL\_OUT}[P]
+\right)
+](./avail.svg)
+
+This separation will be made in the code if it turns out that it
+enables combining of passes with other callframe optimizations.
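
The equation can be exercised directly on a toy CFG. The sketch below
(names hypothetical) tracks only the set of available variable names,
eliding the name-to-SSA-value mapping; note that the textbook solution
of this equation starts non-entry blocks at the full universe, so that
the iteration converges downward to the greatest fixpoint, whereas the
dominator-seeded algorithm described earlier starts from empty sets:

```python
def avail_out(blocks, preds, gen, kill):
    """Solve AVAIL_OUT[B] = GEN[B] | (~KILL[B] & meet over preds) to
    convergence.  The meet over predecessors is set intersection;
    `blocks` should be in a dominators-first order such as reverse
    postorder so that convergence is fast."""
    universe = set().union(*gen.values(), *kill.values())
    out = {b: set(universe) for b in blocks}   # optimistic start
    changed = True
    while changed:
        changed = False
        for b in blocks:
            ins = [out[p] for p in preds.get(b, [])]
            avail_in = set.intersection(*ins) if ins else set()
            new = gen[b] | ((universe - kill[b]) & avail_in)
            if new != out[b]:
                out[b], changed = new, True
    return out
```

On a loop whose body kills `y`, a value `x` generated before the loop
remains available inside it, which the pessimistic empty-set start
would fail to discover across the back edge.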
+
+
+### Availability analysis: implications
+
+Calculating the available values already exposes opportunities for
+optimization. In particular, LOAD-LOAD and STORE-LOAD redundancies can
+be eliminated by checking each `moveFromCallFrame` operation in turn
+for whether it is accessing an available value. If it is, it is
+eliminated and its result is replaced with the available value.
+
+### Availability analysis: equivalence sets
+
+LOAD-STORE pairs can also be detected, by examining 'moveToCallFrame'
+instructions and eliminating them if the variable being written is
+available, and known to be equal to the Tcl value in the callframe.
+
+Therefore, in the availability analysis, as well as accumulating the
+sets of available Tcl values and the names in the callframe that
+correspond to them, we must track the sets of names in a given
+ callframe that correspond to each SSA value. That way,
+when we examine a `moveToCallFrame` instruction, we can detect
+that the given value is already in the callframe under the given name.
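
One way to keep this reverse bookkeeping is a map from each SSA value
to the set of names currently holding it, as in this hypothetical
single-block sketch (aliasing again ignored):

```python
def filter_stores(instrs):
    """instrs: ('load', ssaval, name) / ('store', name, ssaval).
    Drops each store that would rewrite a callframe slot with the
    value it is already known to hold (the LOAD-STORE pattern)."""
    names_of, kept = {}, []          # SSA value -> names holding it
    for op in instrs:
        if op[0] == 'load':          # the slot is known as this value
            names_of.setdefault(op[1], set()).add(op[2])
            kept.append(op)
        else:                        # ('store', name, ssaval)
            if op[1] in names_of.get(op[2], ()):
                continue             # slot already holds this value
            for names in names_of.values():
                names.discard(op[1])    # the slot is overwritten
            names_of.setdefault(op[2], set()).add(op[1])
            kept.append(op)
    return kept
```

A store of a freshly loaded value back to its own name is eliminated,
while a store of the same value to a different name is kept.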
+
+## TASK 2: Live Tcl value analysis
+
+The first part of retrograde analysis needs to compute the sets of Tcl
+variables whose values are possibly needed at the entry to each block
+in the program. This analysis involves examining the instructions of
+the block in _reverse_ order (that is, visiting postdominators before
+the blocks that they postdominate) and accumulating the sets of needed
+values.
+
+Initially, the required set at each block is empty.
+
+When beginning to examine a block, if the block is an exit, then the
+set of required variables is the set of local variables that may be
+ aliased (by virtue of appearing in `[upvar]`, `[namespace
+ upvar]`/`[global]`, or `[variable]`). This assertion mirrors the
+requirement that a compiled procedure must have all the nonlocal
+variables in sync when it leaves. The set is marked as being relative
+to the input callframe of the `return` instruction.
+
+If a block is not an exit block, then the set of required variables is
+the union of the sets of variables required at the entry to its
+successors. If a successor contains a φ operation that produces the
+callframe to which the variable references refer, then the callframe
+reference of each variable is updated to the appropriate input to the
+φ.
+
+Then, back to front, the sets get updated by examining individual
+instructions:
+
++ `moveFromCallFrame` - The given variable and all of its possible
+  aliases are added to the set of variables whose values are required.
+
++ `moveToCallFrame` - The variables required will be relative to the
+ input callframe. Starting with the set of variables required in the
+ output callframe, we examine the variables given in the instruction
+ from left to right. For each of those, the given variable is removed
+ from the set of variables whose values are required. All possible
+ aliases of the given variable are added to the set.
+
++ `invoke[Expanded]` - The variables required will be relative to the
+ input callframe. We start with the set required relative to the
+ output callframe. All variables that may be read or written, and
+ all possible aliases, are added to the set. (Since we do not in
+ general examine whether a procedure must alter a given variable,
+ but only whether it may alter it, we require in general that the
+ previous value of the variable must already be in sync.)
+
++ `directGet` etc. - The variables required will be relative to the
+ input callframe. We start with the set required relative to the
+ output callframe. Any variable that might alias the variable being
+ retrieved or set in the instruction is added to the set.
+
++ `variable`, `upvar`, `nsupvar` - The variables required will be
+ relative to the input callframe. We start with the set required
+ relative to the output callframe. The target variable and any
+ possible aliases are added to the set.
+
++ φ - Analysis stops at a φ instruction that references a callframe.
+ The translation to the correct input callframe will happen when
+ analyzing the predecessor block, as noted above.
+
+When we arrive at a loop's entry, we may produce a greater set of
+values that must be in sync at the bottom of the loop, so just as with
+the availability analysis, the liveness analysis must be iterated to
+convergence.
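
The iteration can be sketched as follows, with each block reduced to a
list of hypothetical `('load', name)` / `('store', name)` instructions,
no aliasing, and `live_exit` standing in for the aliased variables
that must be in sync when the procedure leaves:

```python
def live_sets(blocks, succ, exits, live_exit):
    """Required-variable set at entry to each block, iterated to
    convergence.  blocks: name -> instruction list; succ: name ->
    successor names; exits: names of exit blocks.  The set flowing
    into a block is the union over its successors."""
    live = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            after = set(live_exit) if b in exits else set()
            for s in succ.get(b, ()):
                after |= live[s]             # union over successors
            new = set(after)
            for kind, name in reversed(blocks[b]):
                if kind == 'store':
                    new.discard(name)        # store satisfies later reads
                else:
                    new.add(name)            # load needs the current value
            if new != live[b]:
                live[b], changed = new, True
    return live
```

In a loop that reads `x`, the requirement for `x` propagates around
the back edge, so the additional passes to convergence matter.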
+
+### Liveness analysis: implications
+
+When liveness analysis is complete, it informs removal of certain
+`moveToCallFrame` instructions. Any `moveToCallFrame` instruction that
+designates a variable that is not live in the given callframe may be
+safely removed. Either it is a dead STORE (that is, neither it nor
+any potential alias will ever be accessed again), or it participates
+in a STORE-STORE pair.
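
Given liveness immediately after each instruction, the removal itself
is a backward scan over the block. This hypothetical sketch (same
invented instruction encoding as above, no aliasing) drops each store
whose variable is not live just after it:

```python
def remove_dead_stores(block, live_after):
    """Drop each ('store', name) whose variable is not live just after
    the store; a ('load', name) keeps its variable alive."""
    live, kept = set(live_after), []
    for op in reversed(block):
        kind, name = op
        if kind == 'store':
            if name not in live:
                continue          # dead STORE, or the earlier half of
                                  # a STORE-STORE pair: remove it
            live.discard(name)
        else:
            live.add(name)
        kept.append(op)
    kept.reverse()
    return kept
```

A store overwritten before any read is removed, as is a final store to
an unaliased local that is never read again.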
+
+## TASK 3: Anticipability analysis and loop-invariant loads
+
+
+
+## TASK 4: Loop-invariant stores
+
+
## References
-[SIMP96]: .
\[SIMP96\].
Loren Taylor Simpson. 1996. Value-Driven Redundancy
Elimination. Ph.D. Dissertation. Rice University, Houston, TX,
USA. AAI9631092.