Many hyperlinks are disabled.

Use anonymous login
to enable hyperlinks.

Overview

Comment: | Write up first couple of tasks |
---|---|

Timelines: | family | ancestors | descendants | both | kbk-refactor-callframe |

Files: | files | file ages | folders |

SHA3-256: |
643eefa7e6840b36661d424a5ffe47b3 |

User & Date: | kbk 2019-02-18 04:25:47 |

Context

2019-02-25
| ||

04:41 | More about code motion check-in: 2f51c8d73b user: kbk tags: kbk-refactor-callframe | |

2019-02-18
| ||

04:25 | Write up first couple of tasks check-in: 643eefa7e6 user: kbk tags: kbk-refactor-callframe | |

2019-02-16
| ||

21:52 | Fix some Markdown typos check-in: 9a45fec3f8 user: kbk tags: kbk-refactor-callframe | |

Changes

Changes to doc/20190216callframe/callframe.md.

1 1 # Optimizing data motion into and out of callframes in Tcl 2 2 3 3 Kevin B Kenny 4 4 5 -16 February 2019 6 5 7 6 ## Introduction 8 7 9 8 As of the current version ([0718166269]), the handling of variables in 10 9 the callframe reflects some fairly fundamental oversights in the effect of 11 10 Tcl operations on variables. Essentially, what it does is to introduce 12 11 `moveToCallFrame` and `moveToCallFrame` operations around a partial ................................................................................ 20 19 21 20 This discussion is written at least partly for the benefit of its 22 21 author, to help with understanding what optimizations are safe and 23 22 profitable, and how to compute them. It will, however, hopefully aid 24 23 those who come after in understanding how the the decisions that were 25 24 made. 26 25 27 -## Back to basics - the code generation strategy 26 +### Back to basics - the code generation strategy 28 27 29 28 One general strategy has been overwhelmingly successful in other areas 30 29 of quadcode manipulation. Rather than attempting to add operations 31 30 that handle the awkward cases when they are detected, it has been much 32 31 easier to generate code for the worst case always, and then optimize 33 32 it away. The code for variable accesses, for instance, is truly 34 33 horrific when initially generated: ................................................................................ 47 46 that is, extract the machine-native representation. 48 47 49 48 Nevertheless, all of these tests become redundant if a second access 50 49 to the variable is dominated by the first. In that case, partial 51 50 redundancy elimination will eliminate all five of these steps and 52 51 simply use the pure value in the arithmetic operation. The partial 53 52 redundancy elimination is based largely on the value-based algorithms 54 -developed in [SIMP96][] 53 +developed in \[SIMP96\]. 55 54 56 55 We therefore assume here that we will take the brutally simple 57 56 approach of generating code that: 58 57 59 58 + whenever a local variable (whether potentially aliased or not) is 60 59 loaded, generate a `moveFromCallFrame` instruction transferring a 61 60 Tcl value into an SSA value. ................................................................................ 94 93 amenable to his technique owing to anti-dependencies (a STORE 95 94 operation may not be hoisted above a LOAD operation that might refer 96 95 to the same memory). His technique also does not contemplate downward 97 96 code motion, where a STORE operation that does not affect a loop's 98 97 operation might be moved to after the loop exit. To address these 99 98 deficiencies, it falls to us to develop a better theory of Tcl 100 99 variable accesses. 100 + 101 +### Trace semantics 102 + 103 +As soon as we get into discussion of non-local variables, the question 104 +of traces comes up. In this particular writeup, a very limited concept 105 +of traces is admitted: 106 + 107 + * We assume that any traces present are 'well behaved' in the sense 108 + that they do not invalidate the compilation (for instance, by 109 + redefining or changing the resolution of compiled commands, 110 + creating new aliases among local variables or changing their 111 + values). 112 + 113 + * We assume that skipping traces is permissible to a limited 114 + extent. In particular, repeated writes to the same variable may 115 + fire only a trace for the last write, while repeated reads may 116 + fire only a single read trace. We do, however, respect the 117 + ordering among reads or writes; a procedure that reads a variable, 118 + changes its value, and puts it back will fire a read trace and a 119 + write trace in the correct sequence. We also promise that any read 120 + trace will fire at a time when all prior writes have been traced. 121 + 122 +These constraints are weak enough that many loop-invariant reads 123 +and writes can be optimized away. For instance, a procedure that 124 +repeatedly reads the same global value in a loop can fire a trace only 125 +for the first read, and one that accumulates a result through multiple 126 +writes of the same value can defer the write until after the 127 +loop. They nevertheless are strong enough that the common cases (the 128 +read trace on `::env`, and the write traces on variables that appear 129 +in Tk user interfaces) will work - and, in fact, may avoid needless 130 +work relative to interpreted code. 131 + 132 +### Types of optimization to be considered 133 + 134 +The particular optimizations considered in this document all consist 135 +of eliminating code motion to and from the callframe. There are a 136 +number of these that we will attempt to cover: 137 + 138 + * Dead LOAD - An instruction moves a value from the callframe, but 139 + the result is not used anywhere in the program. This case should 140 + already be covered by the dead code analysis in quadcode. 141 + 142 + * LOAD-LOAD - An instruction moves a value from the callframe, and 143 + an earlier instruction has already put the Tcl value into an 144 + SSA value. It can be proven that the value in the callframe has 145 + not changed, so the second move can be eliminated. 146 + 147 + * STORE-LOAD - An instruction moves a value from the callframe, and 148 + an earlier instruction has moved an SSA value into the callframe. 149 + It can be proven that the value in the callframe has not changed. 150 + The move from the callframe can be eliminated. 151 + 152 + * Loop-invariant LOAD - An instruction inside a loop moves a 153 + value from the callframe, and it can be proven that nothing inside 154 + the loop alters the value. The move from the callframe can be 155 + hoisted to before the start of the loop. (Note that in this 156 + case, partial redundancy elimination actually works by creating 157 + a second move outside the loop, and reducing the optimization to 158 + the LOAD-LOAD case.) 159 + 160 + * Dead STORE - An instruction moves a value to the callframe, but 161 + the variable in the callframe is not aliased and is not referred 162 + to again before procedure exit. 163 + 164 + * LOAD-STORE - An instruction moves a value to the callframe, but 165 + the value came from an earlier move from the same place in the 166 + callframe, and it can be proven that the value has not changed. 167 + The move to the callframe can be eliminated. 168 + 169 + * STORE-STORE - An instruction moves a value to the callframe, and 170 + there is a _later_ instruction on each code path that moves a 171 + value to the same place in the callframe, and nothing in between 172 + the two instructions might read the value. It is safe to eliminate 173 + the first store. 174 + 175 + * Loop invariant STORE - An instruction withing a loop moves a value 176 + to the callframe, and nothing inside the loop loads the value. 177 + (This condition will commonly be introduced by STORE-LOAD 178 + elimination.) In this case, the STORE may be pushed to after the 179 + loop exit(s). It is expected that it will be simplest to introduce 180 + speculative move instructions at each loop exit, which will then 181 + cause STORE-STORE optimization to eliminate the move inside the 182 + loop. (It is possible that some of the speculative instructions 183 + can then be optimized away by other means.) 184 + 185 +From among these, it is useful to separate those requiring 186 +_anterograde_ analysis (proceeding from a procedure's entry to its 187 +exits) from those requiring _retrograde_ analysis (proceeding from the 188 +procedure exits to the entry). LOAD-LOAD, STORE-LOAD, loop-invariant 189 +LOAD, and LOAD-STORE are all anterograde. They depend on the 190 +availability of SSA counterparts to callframe values at a particular 191 +point in the program and on whether the use of a value in the 192 +callframe can be anticipated at a given point in the program. 193 +By contrast, dead-LOAD, dead-STORE, STORE-STORE, and loop-invariant 194 +STORE are retrograde. They depend on whether it is true that all paths 195 +forward to the program exit will contain a move to the callframe or a 196 +use of a callframe value, or whether some path contains a reference to 197 +a callframe value. 198 + 199 +The pass structure of the two types of analysis is different. While 200 +the 'iterate to convergence' structure would allow visiting the blocks 201 +of a program in any order, performance demands that anterograde 202 +analysis visit dominators before the blocks that they dominate, while 203 +retrograde analysis visits postdominators before the blocks that they 204 +postdominate. With that ordering, a given iteration to convergence 205 +will require only a number of passes proportional to the loop nesting 206 +depth. 207 + 208 +## TASK 1: Availability analysis 209 + 210 +The first part of anterograde analysis will begin with developing the 211 +set of callframe values that are known to be available as SSA values. 212 +This involves looking at the instructions of a block in order, and 213 +accumulating the mapping of name and callframe instance to SSA value. 214 + 215 +Initially, the available set on exit from each block is empty. 216 + 217 +We iterate through a basic block, applying the following 218 +rules. Initially, the set of available values is the set from the 219 +block's dominator, or Ø if the block is the entry block or unreachable. 220 + 221 ++ `entry` - All values listed on the instruction acquire `Nothing` as 222 + their values. 223 + 224 ++ `moveFromCallFrame` - The result of the instruction becomes the 225 + SSA value corresponding to the given variable name in the given 226 + callframe. 227 + 228 ++ `moveToCallFrame` - The name-value mapping in the result callframe 229 + is set up as follows: 230 + 231 + + The available values in the source callframe become available 232 + values in the destination callframe. 233 + 234 + + For each name specified in the instruction from left to right, the 235 + available value for the name becomes the corresponding value in 236 + the instruction, and all other names that potentially alias the 237 + given name lose their values. 238 + 239 ++ `invoke[Expanded]` - The available values in the source callframe 240 + become available in the destination callframe. Then, any names that the 241 + invoked command might change and any names that might alias them, 242 + lose their available values. 243 + 244 ++ `directSet` etc. - Any name that might be an alias of the changing 245 + variable loses its available value. (If the changing variable name 246 + cannot be determined, all names might be aliases, because in this 247 + case we cannot determine that the name is fully qualified.) 248 + 249 ++ `variable`, `upvar`, `nsupvar` - Any existing value for the given 250 + name and any possible alias thereof is no longer available. Note 251 + that this is actually a program error, but we need to retain correct 252 + behaviour in the case where the error is caught. 253 + 254 ++ φ - If several callframes meet at a φ, the set of names that have 255 + available values will be the intersection of the available sets of 256 + the input callframes. 257 + 258 + For each name in that intersection: 259 + 260 + + If all the incoming values are the same, that value becomes the value of 261 + the name in the result callframe. 262 + 263 + + If values are different, then we find the available values for each 264 + name on exit from the predecessor blocks, and construct a φ 265 + instruction that combines them. We search for a φ that combines 266 + them. 267 + 268 + + If such a φ is found, its result is the new value for the given 269 + name. 270 + 271 + + If no such φ is found, we insert such a φ speculatively into the 272 + header of the block, and give its result as the value for the 273 + given name. 274 + 275 +At the end of the block, after applying these rules, we have the 276 +block's new available set. The calculation of this set may in turn 277 +change the available values on φ instructions at the start of loops, 278 +so the calculation must be applied to convergence over all blocks in 279 +the program. 280 + 281 +Those who are versed in formal data flow analysis are welcome to 282 +separate the descriptions of the operations given above into separate 283 +sets `AVAIL_GEN` and `AVAIL_KILL` so that the problem can be given as 284 +the forward analysis: 285 + 286 +![\texttt{AVAIL\_OUT}[B] = \texttt{AVAIL\_GEN}[B] 287 +~ \cup ~ \left( 288 +\overline{\texttt{AVAIL\_KILL}[B]} 289 +~ \cap \bigcap_{P\in \texttt{pred}[B]} \texttt{AVAIL\_OUT}[P] 290 +\right) 291 +](./avail.svg) 292 + 293 +This separation will be made in the code if it turns out that it 294 +enables combining of passes with other callframe optimizations. 295 + 296 + 297 +### Availability analysis: implications 298 + 299 +Calculating the available values already exposes opportunities for 300 +optimization. In particular, LOAD-LOAD and STORE-LOAD redundancies can 301 +be eliminated by checking each `moveFromCallFrame` operation in turn 302 +for whether it is accessing an available value. If it is, it is 303 +eliminated and its result is replaced with the available value. 304 + 305 +### Availability analysis: equivalence sets 306 + 307 +LOAD-STORE pairs can also be detected, by examining 'moveToCallFrame' 308 +instructions and eliminating them if the variable being written is 309 +available, and known to be equal to the Tcl value in the callframe. 310 + 311 +Therefore, in the availability analysis, as well as accumulating the 312 +sets of available Tcl values and the names in the callframe that 313 +correspond to them, we must track the sets of names in a given 314 +callframe that correspond correspond to each SSA value. That way, 315 +when we examine a `moveToCallFrame` instruction, we can detect 316 +that the given value is already in the callframe under the given name. 317 + 318 +## TASK 2: Live Tcl value analysis 319 + 320 +The first part of retrograde analysis needs to compute the sets of Tcl 321 +variables whose values are possibly needed at the entry to each block 322 +in the program. This analysis involves examining the instructions of 323 +the block in _reverse_ order (that is, visiting postdominators before 324 +the blocks that they postdominate) and accumulating the sets of needed 325 +values. 326 + 327 +Initially, the required set at each block is empty. 328 + 329 +When beginning to examine a block, if the block is an exit, then the 330 +set of required variables is the set of local variables that may be 331 +aliased (by virtue of appearing in `[upvar]`, `[namespace 332 +upvar]/[global]`, or `[variable]`. This assertion mirrors the 333 +requirement that a compiled procedure must have all the nonlocal 334 +variables in sync when it leaves. The set is marked as being relative 335 +to the input callframe of the `return` instruction. 336 + 337 +If a block is not an exit block, then the set of required variables is 338 +the union of the sets of variables required at the entry to its 339 +successors. If a successor contains a φ operation that produces the 340 +callframe to which the variable references refer, then the callframe 341 +reference of each variable is updated to the appropriate input to the 342 +φ. 343 + 344 +Then, back to front, the sets get updated by examining individual 345 +instructions: 346 + 347 ++ `moveFromCallFrame` - The given variable, and all possible aliases 348 + are added to the set of variables whose values are required. 349 + 350 ++ `moveToCallFrame` - The variables required will be relative to the 351 + input callframe. Starting with the set of variables required in the 352 + output callframe, we examine the variables given in the instruction 353 + from left to right. For each of those, the given variable is removed 354 + from the set of variables whose values are required. All possible 355 + aliases of the given variable are added to the set. 356 + 357 ++ `invoke[expanded]` - The variables required will be relative to the 358 + input callframe. We start with the set required relative to the 359 + output callframe. All variables that may be read or written, and 360 + all possible aliases, are added to the set. (Since we do not in 361 + general examine whether a procedure must alter a given variable, 362 + but only whether it may alter it, we require in general that the 363 + previous value of the variable must already be in sync.) 364 + 365 ++ `directGet` etc. - The variables required will be relative to the 366 + input callframe. We start with the set required relative to the 367 + output callframe. Any variable that might alias the variable being 368 + retrieved or set in the instruction is added to the set. 369 + 370 ++ `variable`, `upvar`, `nsupvar` - The variables required will be 371 + relative to the input callframe. We start with the set required 372 + relative to the output callframe. The target variable and any 373 + possible aliases are added to the set. 374 + 375 ++ φ - Analysis stops at a φ instruction that references a callframe. 376 + The translation to the correct input callframe will happen when 377 + analyzing the predecessor block, as noted above. 378 + 379 +When we arrive at a loop's entry, we may produce a greater set of 380 +values that must be in sync at the bottom of the loop, so just as with 381 +the availability analysis, the liveness analysis must be iterated to 382 +convergence. 383 + 384 +### Liveness analysis: implications 385 + 386 +When liveness analysis is complete, it informs removal of certain 387 +`moveToCallFrame` instructions. Any `moveToCallFrame` instruction that 388 +designates a variable that is not live in the given callframe may be 389 +safely removed. Either it is a dead STORE (that is, neither it nor 390 +any potential alias will ever be accessed again), or it participates 391 +in a STORE-STORE pair. 392 + 393 +## TASK 3: Anticipability analysis and loop-invariant loads 394 + 395 + 396 + 397 +## TASK 4: Loop-invariant stores 398 + 399 + 101 400 102 401 ## References 103 402 104 -[SIMP96]: <https://www.clear.rice.edu/comp512/Lectures/Papers/SimpsonThesis.pdf>. 105 403 \[SIMP96\]. 106 404 Loren Taylor Simpson. 1996. Value-Driven Redundancy 107 405 Elimination. Ph.D. Dissertation. Rice University, Houston, TX, 108 406 USA. AAI9631092. [<https://www.clear.rice.edu/comp512/Lectures/Papers/SimpsonThesis.pdf>]