Check-in [643eefa7e6]
Bounty program for improvements to Tcl and certain Tcl packages.
Tcl 2019 Conference, Houston/TX, US, Nov 4-8
Send your abstracts to [email protected]
or submit via the online form by Sep 9.

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Write up first couple of tasks
Timelines: family | ancestors | descendants | both | kbk-refactor-callframe
Files: files | file ages | folders
SHA3-256: 643eefa7e6840b36661d424a5ffe47b3b57ac4ebb6f5865585ef20ac2c1b12b0
User & Date: kbk 2019-02-18 04:25:47
Context
2019-02-25
04:41
More about code motion check-in: 2f51c8d73b user: kbk tags: kbk-refactor-callframe
2019-02-18
04:25
Write up first couple of tasks check-in: 643eefa7e6 user: kbk tags: kbk-refactor-callframe
2019-02-16
21:52
Fix some Markdown typos check-in: 9a45fec3f8 user: kbk tags: kbk-refactor-callframe
Changes

Changes to doc/20190216callframe/callframe.md.

     1      1   # Optimizing data motion into and out of callframes in Tcl
     2      2   
     3      3   Kevin B Kenny
     4      4   
     5         -16 February 2019
     6      5   
     7      6   ## Introduction
     8      7   
     9      8   As of the current version ([0718166269]), the handling of variables in
    10      9   the callframe reflects some fairly fundamental oversights in the effect of
    11     10   Tcl operations on variables. Essentially, what it does is to introduce
    12     11   `moveToCallFrame` and `moveToCallFrame` operations around a partial
................................................................................
    20     19   
    21     20   This discussion is written at least partly for the benefit of its
    22     21   author, to help with understanding what optimizations are safe and
    23     22   profitable, and how to compute them. It will, however, hopefully aid
    24     23   those who come after in understanding how the the decisions that were
    25     24   made.
    26     25   
    27         -## Back to basics - the code generation strategy
           26  +### Back to basics - the code generation strategy
    28     27   
    29     28   One general strategy has been overwhelmingly successful in other areas
    30     29   of quadcode manipulation. Rather than attempting to add operations
    31     30   that handle the awkward cases when they are detected, it has been much
    32     31   easier to generate code for the worst case always, and then optimize
    33     32   it away. The code for variable accesses, for instance, is truly
    34     33   horrific when initially generated:
................................................................................
    47     46       that is, extract the machine-native representation.
    48     47   
    49     48   Nevertheless, all of these tests become redundant if a second access
    50     49   to the variable is dominated by the first. In that case, partial
    51     50   redundancy elimination will eliminate all five of these steps and
    52     51   simply use the pure value in the arithmetic operation. The partial
    53     52   redundancy elimination is based largely on the value-based algorithms
    54         -developed in [SIMP96][]
           53  +developed in \[SIMP96\].
    55     54   
    56     55   We therefore assume here that we will take the brutally simple
    57     56   approach of generating code that:
    58     57   
    59     58     + whenever a local variable (whether potentially aliased or not) is
    60     59       loaded, generate a `moveFromCallFrame` instruction transferring a
    61     60   	Tcl value into an SSA value.
................................................................................
    94     93   amenable to his technique owing to anti-dependencies (a STORE
    95     94   operation may not be hoisted above a LOAD operation that might refer
    96     95   to the same memory). His technique also does not contemplate downward
    97     96   code motion, where a STORE operation that does not affect a loop's
    98     97   operation might be moved to after the loop exit. To address these
    99     98   deficiencies, it falls to us to develop a better theory of Tcl
   100     99   variable accesses.
          100  +
          101  +### Trace semantics
          102  +
          103  +As soon as we get into discussion of non-local variables, the question
          104  +of traces comes up. In this particular writeup, a very limited concept
          105  +of traces is admitted:
          106  +
          107  +  * We assume that any traces present are 'well behaved' in the sense
          108  +    that they do not invalidate the compilation (for instance, by
          109  +    redefining or changing the resolution of compiled commands,
          110  +	creating new aliases among local variables or changing their
          111  +    values).
          112  +	
          113  +  * We assume that skipping traces is permissible to a limited
          114  +    extent. In particular, repeated writes to the same variable may
          115  +    fire only a trace for the last write, while repeated reads may
          116  +    fire only a single read trace. We do, however, respect the
          117  +    ordering among reads or writes; a procedure that reads a variable,
          118  +    changes its value, and puts it back will fire a read trace and a
          119  +    write trace in the correct sequence. We also promise that any read
          120  +    trace will fire at a time when all prior writes have been traced.
          121  +	
          122  +These constraints are weak enough that many loop-invariant reads
          123  +and writes can be optimized away. For instance, a procedure that
          124  +repeatedly reads the same global value in a loop can fire a trace only
          125  +for the first read, and one that accumulates a result through multiple
          126  +writes of the same value can defer the write until after the
          127  +loop. They nevertheless are strong enough that the common cases (the
          128  +read trace on `::env`, and the write traces on variables that appear
          129  +in Tk user interfaces) will work - and, in fact, may avoid needless
          130  +work relative to interpreted code.
          131  +
          132  +### Types of optimization to be considered
          133  +
          134  +The particular optimizations considered in this document all consist
          135  +of eliminating code motion to and from the callframe. There are a
          136  +number of these that we will attempt to cover:
          137  +
          138  +  * Dead LOAD - An instruction moves a value from the callframe, but
          139  +    the result is not used anywhere in the program. This case should
          140  +    already be covered by the dead code analysis in quadcode.
          141  +	
          142  +  * LOAD-LOAD - An instruction moves a value from the callframe, and
          143  +    an earlier instruction has already put the Tcl value into an
          144  +    SSA value. It can be proven that the value in the callframe has
          145  +    not changed, so the second move can be eliminated.
          146  +	
          147  +  * STORE-LOAD - An instruction moves a value from the callframe, and
          148  +    an earlier instruction has moved an SSA value into the callframe.
          149  +	It can be proven that the value in the callframe has not changed.
          150  +	The move from the callframe can be eliminated.
          151  +	
          152  +  * Loop-invariant LOAD - An instruction inside a loop moves a
          153  +    value from the callframe, and it can be proven that nothing inside
          154  +	the loop alters the value. The move from the callframe can be
          155  +	hoisted to before the start of the loop. (Note that in this
          156  +	case, partial redundancy elimination actually works by creating
          157  +	a second move outside the loop, and reducing the optimization to
          158  +	the LOAD-LOAD case.)
          159  +	
          160  +  * Dead STORE - An instruction moves a value to the callframe, but
          161  +    the variable in the callframe is not aliased and is not referred
          162  +	to again before procedure exit.
          163  +	
          164  +  * LOAD-STORE - An instruction moves a value to the callframe, but
          165  +    the value came from an earlier move from the same place in the
          166  +    callframe, and it can be proven that the value has not changed.
          167  +	The move to the callframe can be eliminated.
          168  +	
          169  +  * STORE-STORE - An instruction moves a value to the callframe, and
          170  +    there is a _later_ instruction on each code path that moves a
          171  +    value to the same place in the callframe, and nothing in between
          172  +    the two instructions might read the value. It is safe to eliminate
          173  +    the first store.
          174  +	
          175  +  * Loop invariant STORE - An instruction withing a loop moves a value
          176  +    to the callframe, and nothing inside the loop loads the value.
          177  +    (This condition will commonly be introduced by STORE-LOAD
          178  +    elimination.) In this case, the STORE may be pushed to after the
          179  +    loop exit(s). It is expected that it will be simplest to introduce
          180  +	speculative move instructions at each loop exit, which will then
          181  +	cause STORE-STORE optimization to eliminate the move inside the
          182  +    loop. (It is possible that some of the speculative instructions
          183  +    can then be optimized away by other means.)
          184  +
          185  +From among these, it is useful to separate those requiring
          186  +_anterograde_ analysis (proceeding from a procedure's entry to its
          187  +exits) from those requiring _retrograde_ analysis (proceeding from the
          188  +procedure exits to the entry). LOAD-LOAD, STORE-LOAD, loop-invariant
          189  +LOAD, and LOAD-STORE are all anterograde. They depend on the
          190  +availability of SSA counterparts to callframe values at a particular
          191  +point in the program and on whether the use of a value in the
          192  +callframe can be anticipated at a given point in the program.
          193  +By contrast, dead-LOAD, dead-STORE, STORE-STORE, and loop-invariant
          194  +STORE are retrograde. They depend on whether it is true that all paths
          195  +forward to the program exit will contain a move to the callframe or a
          196  +use of a callframe value, or whether some path contains a reference to
          197  +a callframe value.
          198  +
          199  +The pass structure of the two types of analysis is different. While
          200  +the 'iterate to convergence' structure would allow visiting the blocks
          201  +of a program in any order, performance demands that anterograde
          202  +analysis visit dominators before the blocks that they dominate, while
          203  +retrograde analysis visits postdominators before the blocks that they
          204  +postdominate. With that ordering, a given iteration to convergence
          205  +will require only a number of passes proportional to the loop nesting
          206  +depth.
          207  +
          208  +## TASK 1: Availability analysis
          209  +
          210  +The first part of anterograde analysis will begin with developing the
          211  +set of callframe values that are known to be available as SSA values.
          212  +This involves looking at the instructions of a block in order, and
          213  +accumulating the mapping of name and callframe instance to SSA value.
          214  +
          215  +Initially, the available set on exit from each block is empty.
          216  +
          217  +We iterate through a basic block, applying the following
          218  +rules. Initially, the set of available values is the set from the
          219  +block's dominator, or Ø if the block is the entry block or unreachable.
          220  +
          221  ++ `entry` - All values listed on the instruction acquire `Nothing` as
          222  +  their values.
          223  +
          224  ++ `moveFromCallFrame` - The result of the instruction becomes the
          225  +  SSA value corresponding to the given variable name in the given
          226  +  callframe.
          227  +
          228  ++ `moveToCallFrame` - The name-value mapping in the result callframe
          229  +  is set up as follows:
          230  +  
          231  +  + The available values in the source callframe become available
          232  +    values in the destination callframe.
          233  +	
          234  +  + For each name specified in the instruction from left to right, the
          235  +    available value for the name becomes the corresponding value in
          236  +    the instruction, and all other names that potentially alias the
          237  +	given name lose their values.
          238  +	
          239  ++ `invoke[Expanded]` - The available values in the source callframe
          240  +  become available in the destination callframe.  Then, any names that the
          241  +  invoked command might change and any names that might alias them,
          242  +  lose their available values.
          243  +
          244  ++ `directSet` etc. - Any name that might be an alias of the changing
          245  +  variable loses its available value. (If the changing variable name
          246  +  cannot be determined, all names might be aliases, because in this
          247  +  case we cannot determine that the name is fully qualified.)
          248  +  
          249  ++ `variable`, `upvar`, `nsupvar` - Any existing value for the given
          250  +  name and any possible alias thereof is no longer available. Note
          251  +  that this is actually a program error, but we need to retain correct
          252  +  behaviour in the case where the error is caught.
          253  +  
          254  ++ φ - If several callframes meet at a φ, the set of names that have
          255  +  available values will be the intersection of the available sets of
          256  +  the input callframes. 
          257  +  
          258  +  For each name in that intersection:
          259  +  
          260  +  + If all the incoming values are the same, that value becomes the value of
          261  +    the name in the result callframe. 
          262  +  
          263  +  + If values are different, then we find the available values for each
          264  +    name on exit from the predecessor blocks, and construct a φ
          265  +    instruction that combines them. We search for a φ that combines
          266  +    them.
          267  +  
          268  +    + If such a φ is found, its result is the new value for the given
          269  +      name. 
          270  +  
          271  +    + If no such φ is found, we insert such a φ speculatively into the
          272  +      header of the block, and give its result as the value for the
          273  +      given name.
          274  +  
          275  +At the end of the block, after applying these rules, we have the
          276  +block's new available set. The calculation of this set may in turn
          277  +change the available values on φ instructions at the start of loops,
          278  +so the calculation must be applied to convergence over all blocks in
          279  +the program.
          280  +
          281  +Those who are versed in formal data flow analysis are welcome to
          282  +separate the descriptions of the operations given above into separate
          283  +sets `AVAIL_GEN` and `AVAIL_KILL` so that the problem can be given as
          284  +the forward analysis:
          285  +
          286  +![\texttt{AVAIL\_OUT}[B] = \texttt{AVAIL\_GEN}[B] 
          287  +~ \cup ~ \left( 
          288  +\overline{\texttt{AVAIL\_KILL}[B]} 
          289  +~ \cap \bigcap_{P\in \texttt{pred}[B]} \texttt{AVAIL\_OUT}[P] 
          290  +\right) 
          291  +](./avail.svg)
          292  +
          293  +This separation will be made in the code if it turns out that it
          294  +enables combining of passes with other callframe optimizations.
          295  +
          296  +
          297  +### Availability analysis: implications
          298  +
          299  +Calculating the available values already exposes opportunities for
          300  +optimization. In particular, LOAD-LOAD and STORE-LOAD redundancies can
          301  +be eliminated by checking each `moveFromCallFrame` operation in turn
          302  +for whether it is accessing an available value. If it is, it is
          303  +eliminated and its result is replaced with the available value.
          304  +
          305  +### Availability analysis: equivalence sets
          306  +
          307  +LOAD-STORE pairs can also be detected, by examining 'moveToCallFrame'
          308  +instructions and eliminating them if the variable being written is
          309  +available, and known to be equal to the Tcl value in the callframe.
          310  +
          311  +Therefore, in the availability analysis, as well as accumulating the
          312  +sets of available Tcl values and the names in the callframe that
          313  +correspond to them, we must track the sets of names in a given
          314  +callframe that correspond correspond to each SSA value. That way, 
          315  +when we examine a `moveToCallFrame` instruction, we can detect
          316  +that the given value is already in the callframe under the given name.
          317  +
          318  +## TASK 2: Live Tcl value analysis
          319  +
          320  +The first part of retrograde analysis needs to compute the sets of Tcl
          321  +variables whose values are possibly needed at the entry to each block
          322  +in the program. This analysis involves examining the instructions of
          323  +the block in _reverse_ order (that is, visiting postdominators before
          324  +the blocks that they postdominate)  and accumulating the sets of needed
          325  +values.
          326  +
          327  +Initially, the required set at each block is empty.
          328  +
          329  +When beginning to examine a block, if the block is an exit, then the
          330  +set of required variables is the set of local variables that may be
          331  +aliased (by virtue of appearing in `[upvar]`, `[namespace
          332  +upvar]/[global]`, or `[variable]`. This assertion mirrors the
          333  +requirement that a compiled procedure must have all the nonlocal
          334  +variables in sync when it leaves. The set is marked as being relative
          335  +to the input callframe of the `return` instruction.
          336  +
          337  +If a block is not an exit block, then the set of required variables is
          338  +the union of the sets of variables required at the entry to its
          339  +successors. If a successor contains a φ operation that produces the
          340  +callframe to which the variable references refer, then the callframe
          341  +reference of each variable is updated to the appropriate input to the 
          342  +φ.
          343  +
          344  +Then, back to front, the sets get updated by examining individual
          345  +instructions:
          346  +
          347  ++ `moveFromCallFrame` - The given variable, and all possible aliases
          348  +  are added to the set of variables whose values are required.
          349  +  
          350  ++ `moveToCallFrame` - The variables required will be relative to the 
          351  +  input callframe. Starting with the set of variables required in the
          352  +  output callframe, we examine the variables given in the instruction
          353  +  from left to right. For each of those, the given variable is removed 
          354  +  from the set of variables whose values are required. All possible
          355  +  aliases of the given variable are added to the set.
          356  +  
          357  ++ `invoke[expanded]` - The variables required will be relative to the
          358  +  input callframe. We start with the set required relative to the
          359  +  output callframe. All variables that may be read or written, and
          360  +  all possible aliases, are added to the set. (Since we do not in
          361  +  general examine whether a procedure must alter a given variable,
          362  +  but only whether it may alter it, we require in general that the
          363  +  previous value of the variable must already be in sync.)
          364  +  
          365  ++ `directGet` etc. - The variables required will be relative to the
          366  +  input callframe. We start with the set required relative to the
          367  +  output callframe. Any variable that might alias the variable being
          368  +  retrieved or set in the instruction is added to the set.
          369  +  
          370  ++ `variable`, `upvar`, `nsupvar` - The variables required will be
          371  +  relative to the input callframe. We start with the set required
          372  +  relative to the output callframe. The target variable and any
          373  +  possible aliases are added to the set.
          374  +
          375  ++ φ - Analysis stops at a φ instruction that references a callframe.
          376  +  The translation to the correct input callframe will happen when
          377  +  analyzing the predecessor block, as noted above.
          378  +  
          379  +When we arrive at a loop's entry, we may produce a greater set of
          380  +values that must be in sync at the bottom of the loop, so just as with
          381  +the availability analysis, the liveness analysis must be iterated to
          382  +convergence.
          383  +
          384  +### Liveness analysis: implications
          385  +
          386  +When liveness analysis is complete, it informs removal of certain
          387  +`moveToCallFrame` instructions. Any `moveToCallFrame` instruction that
          388  +designates a variable that is not live in the given callframe may be
          389  +safely removed. Either it is a dead STORE (that is, neither it nor
          390  +any potential alias will ever be accessed again), or it participates
          391  +in a STORE-STORE pair. 
          392  +
          393  +## TASK 3: Anticipability analysis and loop-invariant loads
          394  +
          395  +
          396  +
          397  +## TASK 4: Loop-invariant stores
          398  +
          399  +
   101    400   
   102    401   ## References
   103    402   
   104         -[SIMP96]: <https://www.clear.rice.edu/comp512/Lectures/Papers/SimpsonThesis.pdf>.
   105    403   \[SIMP96\]. 
   106    404   Loren Taylor Simpson. 1996. Value-Driven Redundancy
   107    405   Elimination. Ph.D. Dissertation. Rice University, Houston, TX,
   108    406   USA. AAI9631092. [<https://www.clear.rice.edu/comp512/Lectures/Papers/SimpsonThesis.pdf>]