Check-in [643eefa7e6]
Bounty program for improvements to Tcl and certain Tcl packages.
Tcl 2019 Conference, Houston/TX, US, Nov 4-8
Send your abstracts to [email protected]
or submit via the online form by Sep 9.

Many hyperlinks are disabled.

Overview
Comment: Write up first couple of tasks family | ancestors | descendants | both | files | file ages | folders 643eefa7e6840b36661d424a5ffe47b3b57ac4ebb6f5865585ef20ac2c1b12b0 kbk 2019-02-18 04:25:47
Context
 2019-02-25 04:41 More about code motion check-in: 2f51c8d73b user: kbk tags: kbk-refactor-callframe 2019-02-18 04:25 Write up first couple of tasks check-in: 643eefa7e6 user: kbk tags: kbk-refactor-callframe 2019-02-16 21:52 Fix some Markdown typos check-in: 9a45fec3f8 user: kbk tags: kbk-refactor-callframe
Changes

Changes to doc/20190216callframe/callframe.md.

     1      1   # Optimizing data motion into and out of callframes in Tcl
2      2
3      3   Kevin B Kenny
4      4
5         -16 February 2019
6      5
7      6   ## Introduction
8      7
9      8   As of the current version ([0718166269]), the handling of variables in
10      9   the callframe reflects some fairly fundamental oversights in the effect of
11     10   Tcl operations on variables. Essentially, what it does is to introduce
12     11   moveToCallFrame and moveToCallFrame operations around a partial
................................................................................
20     19
21     20   This discussion is written at least partly for the benefit of its
22     21   author, to help with understanding what optimizations are safe and
23     22   profitable, and how to compute them. It will, however, hopefully aid
24     23   those who come after in understanding how the the decisions that were
26     25
27         -## Back to basics - the code generation strategy
26  +### Back to basics - the code generation strategy
28     27
29     28   One general strategy has been overwhelmingly successful in other areas
30     29   of quadcode manipulation. Rather than attempting to add operations
31     30   that handle the awkward cases when they are detected, it has been much
32     31   easier to generate code for the worst case always, and then optimize
33     32   it away. The code for variable accesses, for instance, is truly
34     33   horrific when initially generated:
................................................................................
47     46       that is, extract the machine-native representation.
48     47
49     48   Nevertheless, all of these tests become redundant if a second access
50     49   to the variable is dominated by the first. In that case, partial
51     50   redundancy elimination will eliminate all five of these steps and
52     51   simply use the pure value in the arithmetic operation. The partial
53     52   redundancy elimination is based largely on the value-based algorithms
54         -developed in [SIMP96][]
53  +developed in $SIMP96$.
55     54
56     55   We therefore assume here that we will take the brutally simple
57     56   approach of generating code that:
58     57
59     58     + whenever a local variable (whether potentially aliased or not) is
60     59       loaded, generate a moveFromCallFrame instruction transferring a
61     60   	Tcl value into an SSA value.
................................................................................
94     93   amenable to his technique owing to anti-dependencies (a STORE
95     94   operation may not be hoisted above a LOAD operation that might refer
96     95   to the same memory). His technique also does not contemplate downward
97     96   code motion, where a STORE operation that does not affect a loop's
98     97   operation might be moved to after the loop exit. To address these
99     98   deficiencies, it falls to us to develop a better theory of Tcl
100     99   variable accesses.
100  +
101  +### Trace semantics
102  +
103  +As soon as we get into discussion of non-local variables, the question
104  +of traces comes up. In this particular writeup, a very limited concept
105  +of traces is admitted:
106  +
107  +  * We assume that any traces present are 'well behaved' in the sense
108  +    that they do not invalidate the compilation (for instance, by
109  +    redefining or changing the resolution of compiled commands,
110  +	creating new aliases among local variables or changing their
111  +    values).
112  +
113  +  * We assume that skipping traces is permissible to a limited
114  +    extent. In particular, repeated writes to the same variable may
115  +    fire only a trace for the last write, while repeated reads may
116  +    fire only a single read trace. We do, however, respect the
117  +    ordering among reads or writes; a procedure that reads a variable,
118  +    changes its value, and puts it back will fire a read trace and a
119  +    write trace in the correct sequence. We also promise that any read
120  +    trace will fire at a time when all prior writes have been traced.
121  +
122  +These constraints are weak enough that many loop-invariant reads
123  +and writes can be optimized away. For instance, a procedure that
124  +repeatedly reads the same global value in a loop can fire a trace only
125  +for the first read, and one that accumulates a result through multiple
126  +writes of the same value can defer the write until after the
127  +loop. They nevertheless are strong enough that the common cases (the
128  +read trace on ::env, and the write traces on variables that appear
129  +in Tk user interfaces) will work - and, in fact, may avoid needless
130  +work relative to interpreted code.
131  +
132  +### Types of optimization to be considered
133  +
134  +The particular optimizations considered in this document all consist
135  +of eliminating code motion to and from the callframe. There are a
136  +number of these that we will attempt to cover:
137  +
138  +  * Dead LOAD - An instruction moves a value from the callframe, but
139  +    the result is not used anywhere in the program. This case should
140  +    already be covered by the dead code analysis in quadcode.
141  +
142  +  * LOAD-LOAD - An instruction moves a value from the callframe, and
143  +    an earlier instruction has already put the Tcl value into an
144  +    SSA value. It can be proven that the value in the callframe has
145  +    not changed, so the second move can be eliminated.
146  +
147  +  * STORE-LOAD - An instruction moves a value from the callframe, and
148  +    an earlier instruction has moved an SSA value into the callframe.
149  +	It can be proven that the value in the callframe has not changed.
150  +	The move from the callframe can be eliminated.
151  +
152  +  * Loop-invariant LOAD - An instruction inside a loop moves a
153  +    value from the callframe, and it can be proven that nothing inside
154  +	the loop alters the value. The move from the callframe can be
155  +	hoisted to before the start of the loop. (Note that in this
156  +	case, partial redundancy elimination actually works by creating
157  +	a second move outside the loop, and reducing the optimization to
159  +
160  +  * Dead STORE - An instruction moves a value to the callframe, but
161  +    the variable in the callframe is not aliased and is not referred
162  +	to again before procedure exit.
163  +
164  +  * LOAD-STORE - An instruction moves a value to the callframe, but
165  +    the value came from an earlier move from the same place in the
166  +    callframe, and it can be proven that the value has not changed.
167  +	The move to the callframe can be eliminated.
168  +
169  +  * STORE-STORE - An instruction moves a value to the callframe, and
170  +    there is a _later_ instruction on each code path that moves a
171  +    value to the same place in the callframe, and nothing in between
172  +    the two instructions might read the value. It is safe to eliminate
173  +    the first store.
174  +
175  +  * Loop invariant STORE - An instruction withing a loop moves a value
176  +    to the callframe, and nothing inside the loop loads the value.
177  +    (This condition will commonly be introduced by STORE-LOAD
178  +    elimination.) In this case, the STORE may be pushed to after the
179  +    loop exit(s). It is expected that it will be simplest to introduce
180  +	speculative move instructions at each loop exit, which will then
181  +	cause STORE-STORE optimization to eliminate the move inside the
182  +    loop. (It is possible that some of the speculative instructions
183  +    can then be optimized away by other means.)
184  +
185  +From among these, it is useful to separate those requiring
186  +_anterograde_ analysis (proceeding from a procedure's entry to its
187  +exits) from those requiring _retrograde_ analysis (proceeding from the
189  +LOAD, and LOAD-STORE are all anterograde. They depend on the
190  +availability of SSA counterparts to callframe values at a particular
191  +point in the program and on whether the use of a value in the
192  +callframe can be anticipated at a given point in the program.
194  +STORE are retrograde. They depend on whether it is true that all paths
195  +forward to the program exit will contain a move to the callframe or a
196  +use of a callframe value, or whether some path contains a reference to
197  +a callframe value.
198  +
199  +The pass structure of the two types of analysis is different. While
200  +the 'iterate to convergence' structure would allow visiting the blocks
201  +of a program in any order, performance demands that anterograde
202  +analysis visit dominators before the blocks that they dominate, while
203  +retrograde analysis visits postdominators before the blocks that they
204  +postdominate. With that ordering, a given iteration to convergence
205  +will require only a number of passes proportional to the loop nesting
206  +depth.
207  +
208  +## TASK 1: Availability analysis
209  +
210  +The first part of anterograde analysis will begin with developing the
211  +set of callframe values that are known to be available as SSA values.
212  +This involves looking at the instructions of a block in order, and
213  +accumulating the mapping of name and callframe instance to SSA value.
214  +
215  +Initially, the available set on exit from each block is empty.
216  +
217  +We iterate through a basic block, applying the following
218  +rules. Initially, the set of available values is the set from the
219  +block's dominator, or Ø if the block is the entry block or unreachable.
220  +
221  ++ entry - All values listed on the instruction acquire Nothing as
222  +  their values.
223  +
224  ++ moveFromCallFrame - The result of the instruction becomes the
225  +  SSA value corresponding to the given variable name in the given
226  +  callframe.
227  +
228  ++ moveToCallFrame - The name-value mapping in the result callframe
229  +  is set up as follows:
230  +
231  +  + The available values in the source callframe become available
232  +    values in the destination callframe.
233  +
234  +  + For each name specified in the instruction from left to right, the
235  +    available value for the name becomes the corresponding value in
236  +    the instruction, and all other names that potentially alias the
237  +	given name lose their values.
238  +
239  ++ invoke[Expanded] - The available values in the source callframe
240  +  become available in the destination callframe.  Then, any names that the
241  +  invoked command might change and any names that might alias them,
242  +  lose their available values.
243  +
244  ++ directSet etc. - Any name that might be an alias of the changing
245  +  variable loses its available value. (If the changing variable name
246  +  cannot be determined, all names might be aliases, because in this
247  +  case we cannot determine that the name is fully qualified.)
248  +
249  ++ variable, upvar, nsupvar - Any existing value for the given
250  +  name and any possible alias thereof is no longer available. Note
251  +  that this is actually a program error, but we need to retain correct
252  +  behaviour in the case where the error is caught.
253  +
254  ++ φ - If several callframes meet at a φ, the set of names that have
255  +  available values will be the intersection of the available sets of
256  +  the input callframes.
257  +
258  +  For each name in that intersection:
259  +
260  +  + If all the incoming values are the same, that value becomes the value of
261  +    the name in the result callframe.
262  +
263  +  + If values are different, then we find the available values for each
264  +    name on exit from the predecessor blocks, and construct a φ
265  +    instruction that combines them. We search for a φ that combines
266  +    them.
267  +
268  +    + If such a φ is found, its result is the new value for the given
269  +      name.
270  +
271  +    + If no such φ is found, we insert such a φ speculatively into the
272  +      header of the block, and give its result as the value for the
273  +      given name.
274  +
275  +At the end of the block, after applying these rules, we have the
276  +block's new available set. The calculation of this set may in turn
277  +change the available values on φ instructions at the start of loops,
278  +so the calculation must be applied to convergence over all blocks in
279  +the program.
280  +
281  +Those who are versed in formal data flow analysis are welcome to
282  +separate the descriptions of the operations given above into separate
283  +sets AVAIL_GEN and AVAIL_KILL so that the problem can be given as
284  +the forward analysis:
285  +
286  +![\texttt{AVAIL\_OUT}[B] = \texttt{AVAIL\_GEN}[B]
287  +~ \cup ~ \left(
288  +\overline{\texttt{AVAIL\_KILL}[B]}
289  +~ \cap \bigcap_{P\in \texttt{pred}[B]} \texttt{AVAIL\_OUT}[P]
290  +\right)
291  +](./avail.svg)
292  +
293  +This separation will be made in the code if it turns out that it
294  +enables combining of passes with other callframe optimizations.
295  +
296  +
297  +### Availability analysis: implications
298  +
299  +Calculating the available values already exposes opportunities for
301  +be eliminated by checking each moveFromCallFrame operation in turn
302  +for whether it is accessing an available value. If it is, it is
303  +eliminated and its result is replaced with the available value.
304  +
305  +### Availability analysis: equivalence sets
306  +
307  +LOAD-STORE pairs can also be detected, by examining 'moveToCallFrame'
308  +instructions and eliminating them if the variable being written is
309  +available, and known to be equal to the Tcl value in the callframe.
310  +
311  +Therefore, in the availability analysis, as well as accumulating the
312  +sets of available Tcl values and the names in the callframe that
313  +correspond to them, we must track the sets of names in a given
314  +callframe that correspond correspond to each SSA value. That way,
315  +when we examine a moveToCallFrame instruction, we can detect
316  +that the given value is already in the callframe under the given name.
317  +
318  +## TASK 2: Live Tcl value analysis
319  +
320  +The first part of retrograde analysis needs to compute the sets of Tcl
321  +variables whose values are possibly needed at the entry to each block
322  +in the program. This analysis involves examining the instructions of
323  +the block in _reverse_ order (that is, visiting postdominators before
324  +the blocks that they postdominate)  and accumulating the sets of needed
325  +values.
326  +
327  +Initially, the required set at each block is empty.
328  +
329  +When beginning to examine a block, if the block is an exit, then the
330  +set of required variables is the set of local variables that may be
331  +aliased (by virtue of appearing in [upvar], [namespace
332  +upvar]/[global], or [variable]. This assertion mirrors the
333  +requirement that a compiled procedure must have all the nonlocal
334  +variables in sync when it leaves. The set is marked as being relative
335  +to the input callframe of the return instruction.
336  +
337  +If a block is not an exit block, then the set of required variables is
338  +the union of the sets of variables required at the entry to its
339  +successors. If a successor contains a φ operation that produces the
340  +callframe to which the variable references refer, then the callframe
341  +reference of each variable is updated to the appropriate input to the
342  +φ.
343  +
344  +Then, back to front, the sets get updated by examining individual
345  +instructions:
346  +
347  ++ moveFromCallFrame - The given variable, and all possible aliases
348  +  are added to the set of variables whose values are required.
349  +
350  ++ moveToCallFrame - The variables required will be relative to the
351  +  input callframe. Starting with the set of variables required in the
352  +  output callframe, we examine the variables given in the instruction
353  +  from left to right. For each of those, the given variable is removed
354  +  from the set of variables whose values are required. All possible
355  +  aliases of the given variable are added to the set.
356  +
357  ++ invoke[expanded] - The variables required will be relative to the
358  +  input callframe. We start with the set required relative to the
359  +  output callframe. All variables that may be read or written, and
360  +  all possible aliases, are added to the set. (Since we do not in
361  +  general examine whether a procedure must alter a given variable,
362  +  but only whether it may alter it, we require in general that the
363  +  previous value of the variable must already be in sync.)
364  +
365  ++ directGet etc. - The variables required will be relative to the
366  +  input callframe. We start with the set required relative to the
367  +  output callframe. Any variable that might alias the variable being
368  +  retrieved or set in the instruction is added to the set.
369  +
370  ++ variable, upvar, nsupvar - The variables required will be
371  +  relative to the input callframe. We start with the set required
372  +  relative to the output callframe. The target variable and any
373  +  possible aliases are added to the set.
374  +
375  ++ φ - Analysis stops at a φ instruction that references a callframe.
376  +  The translation to the correct input callframe will happen when
377  +  analyzing the predecessor block, as noted above.
378  +
379  +When we arrive at a loop's entry, we may produce a greater set of
380  +values that must be in sync at the bottom of the loop, so just as with
381  +the availability analysis, the liveness analysis must be iterated to
382  +convergence.
383  +
384  +### Liveness analysis: implications
385  +
386  +When liveness analysis is complete, it informs removal of certain
387  +moveToCallFrame instructions. Any moveToCallFrame instruction that
388  +designates a variable that is not live in the given callframe may be
389  +safely removed. Either it is a dead STORE (that is, neither it nor
390  +any potential alias will ever be accessed again), or it participates
391  +in a STORE-STORE pair.
392  +
393  +## TASK 3: Anticipability analysis and loop-invariant loads
394  +
395  +
396  +
397  +## TASK 4: Loop-invariant stores
398  +
399  +
101    400
102    401   ## References
103    402
104         -[SIMP96]: <https://www.clear.rice.edu/comp512/Lectures/Papers/SimpsonThesis.pdf>.
105    403   $SIMP96$.
106    404   Loren Taylor Simpson. 1996. Value-Driven Redundancy
107    405   Elimination. Ph.D. Dissertation. Rice University, Houston, TX,
108    406   USA. AAI9631092. [<https://www.clear.rice.edu/comp512/Lectures/Papers/SimpsonThesis.pdf>]