Check-in [643eefa7e6]


Overview

Comment: Write up first couple of tasks
SHA3-256: 643eefa7e6840b36661d424a5ffe47b3b57ac4ebb6f5865585ef20ac2c1b12b0
User & Date: kbk 2019-02-18 04:25:47.425
Tags: kbk-refactor-callframe
Changes
Changes to doc/20190216callframe/callframe.md.
# Optimizing data motion into and out of callframes in Tcl

Kevin B Kenny



## Introduction

As of the current version ([0718166269]), the handling of variables in
the callframe reflects some fairly fundamental oversights in the effect of
Tcl operations on variables. Essentially, what it does is to introduce
`moveFromCallFrame` and `moveToCallFrame` operations around a partial
This discussion is written at least partly for the benefit of its
author, to help with understanding what optimizations are safe and
profitable, and how to compute them. It will, however, hopefully aid
those who come after in understanding the decisions that were made.

### Back to basics - the code generation strategy

One general strategy has been overwhelmingly successful in other areas
of quadcode manipulation. Rather than attempting to add operations
that handle the awkward cases when they are detected, it has been much
easier to generate code for the worst case always, and then optimize
it away. The code for variable accesses, for instance, is truly
horrific when initially generated:
    that is, extract the machine-native representation.

Nevertheless, all of these tests become redundant if a second access
to the variable is dominated by the first. In that case, partial
redundancy elimination will eliminate all five of these steps and
simply use the pure value in the arithmetic operation. The partial
redundancy elimination is based largely on the value-based algorithms
developed in \[SIMP96\].

We therefore assume here that we will take the brutally simple
approach of generating code that:

  + whenever a local variable (whether potentially aliased or not) is
    loaded, generate a `moveFromCallFrame` instruction transferring a
    Tcl value into an SSA value.
amenable to his technique owing to anti-dependencies (a STORE
operation may not be hoisted above a LOAD operation that might refer
to the same memory). His technique also does not contemplate downward
code motion, where a STORE operation that does not affect a loop's
operation might be moved to after the loop exit. To address these
deficiencies, it falls to us to develop a better theory of Tcl
variable accesses.

### Trace semantics

As soon as we get into discussion of non-local variables, the question
of traces comes up. In this particular writeup, a very limited concept
of traces is admitted:

  * We assume that any traces present are 'well behaved' in the sense
    that they do not invalidate the compilation (for instance, by
    redefining or changing the resolution of compiled commands,
    creating new aliases among local variables, or changing their
    values).
	
  * We assume that skipping traces is permissible to a limited
    extent. In particular, repeated writes to the same variable may
    fire only a trace for the last write, while repeated reads may
    fire only a single read trace. We do, however, respect the
    ordering among reads or writes; a procedure that reads a variable,
    changes its value, and puts it back will fire a read trace and a
    write trace in the correct sequence. We also promise that any read
    trace will fire at a time when all prior writes have been traced.
	
These constraints are weak enough that many loop-invariant reads
and writes can be optimized away. For instance, a procedure that
repeatedly reads the same global value in a loop can fire a trace only
for the first read, and one that accumulates a result through multiple
writes of the same value can defer the write until after the
loop. They nevertheless are strong enough that the common cases (the
read trace on `::env`, and the write traces on variables that appear
in Tk user interfaces) will work - and, in fact, may avoid needless
work relative to interpreted code.

### Types of optimization to be considered

The particular optimizations considered in this document all consist
of eliminating code motion to and from the callframe. There are a
number of these that we will attempt to cover:

  * Dead LOAD - An instruction moves a value from the callframe, but
    the result is not used anywhere in the program. This case should
    already be covered by the dead code analysis in quadcode.
	
  * LOAD-LOAD - An instruction moves a value from the callframe, and
    an earlier instruction has already put the Tcl value into an
    SSA value. It can be proven that the value in the callframe has
    not changed, so the second move can be eliminated.
	
  * STORE-LOAD - An instruction moves a value from the callframe, and
    an earlier instruction has moved an SSA value into the callframe.
	It can be proven that the value in the callframe has not changed.
	The move from the callframe can be eliminated.
	
  * Loop-invariant LOAD - An instruction inside a loop moves a
    value from the callframe, and it can be proven that nothing inside
	the loop alters the value. The move from the callframe can be
	hoisted to before the start of the loop. (Note that in this
	case, partial redundancy elimination actually works by creating
	a second move outside the loop, and reducing the optimization to
	the LOAD-LOAD case.)
	
  * Dead STORE - An instruction moves a value to the callframe, but
    the variable in the callframe is not aliased and is not referred
	to again before procedure exit.
	
  * LOAD-STORE - An instruction moves a value to the callframe, but
    the value came from an earlier move from the same place in the
    callframe, and it can be proven that the value has not changed.
	The move to the callframe can be eliminated.
	
  * STORE-STORE - An instruction moves a value to the callframe, and
    there is a _later_ instruction on each code path that moves a
    value to the same place in the callframe, and nothing in between
    the two instructions might read the value. It is safe to eliminate
    the first store.
	
  * Loop-invariant STORE - An instruction within a loop moves a value
    to the callframe, and nothing inside the loop loads the value.
    (This condition will commonly be introduced by STORE-LOAD
    elimination.) In this case, the STORE may be pushed to after the
    loop exit(s). It is expected that it will be simplest to introduce
    speculative move instructions at each loop exit, which will then
    cause STORE-STORE optimization to eliminate the move inside the
    loop. (It is possible that some of the speculative instructions
    can then be optimized away by other means.)

From among these, it is useful to separate those requiring
_anterograde_ analysis (proceeding from a procedure's entry to its
exits) from those requiring _retrograde_ analysis (proceeding from the
procedure exits to the entry). LOAD-LOAD, STORE-LOAD, loop-invariant
LOAD, and LOAD-STORE are all anterograde. They depend on the
availability of SSA counterparts to callframe values at a particular
point in the program and on whether the use of a value in the
callframe can be anticipated at a given point in the program.
By contrast, dead-LOAD, dead-STORE, STORE-STORE, and loop-invariant
STORE are retrograde. They depend on whether it is true that all paths
forward to the program exit will contain a move to the callframe or a
use of a callframe value, or whether some path contains a reference to
a callframe value.

The pass structure of the two types of analysis is different. While
the 'iterate to convergence' structure would allow visiting the blocks
of a program in any order, performance demands that anterograde
analysis visit dominators before the blocks that they dominate, while
retrograde analysis visits postdominators before the blocks that they
postdominate. With that ordering, a given iteration to convergence
will require only a number of passes proportional to the loop nesting
depth.
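
Concretely, the pass discipline described above amounts to a
round-robin fixed-point solver that visits blocks in a
dominator-friendly order. The sketch below is illustrative only: the
dict-of-successors CFG, the set-valued transfer functions, and an
optimistic initial guess are all assumptions, not the quadcode
implementation.

```python
def reverse_postorder(succs, entry):
    """Reverse postorder: dominators tend to precede the blocks they dominate."""
    order, seen = [], set()
    def dfs(b):
        seen.add(b)
        for s in succs.get(b, []):
            if s not in seen:
                dfs(s)
        order.append(b)
    dfs(entry)
    return list(reversed(order))

def solve_forward(succs, preds, entry, transfer, universe):
    """Iterate a forward (anterograde) analysis to a fixed point."""
    state = {b: set(universe) for b in succs}   # optimistic initial guess
    order = reverse_postorder(succs, entry)
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for b in order:
            if b == entry or not preds[b]:
                inset = set()                   # nothing available on entry
            else:
                inset = set.intersection(*(state[p] for p in preds[b]))
            out = transfer(b, inset)
            if out != state[b]:
                state[b] = out
                changed = True
    return state, passes
```

On a toy CFG with a single loop, the solver settles in two passes (one
to propagate, one to confirm), consistent with the claim that the pass
count tracks loop nesting depth.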

## TASK 1: Availability analysis

The first part of anterograde analysis will begin with developing the
set of callframe values that are known to be available as SSA values.
This involves looking at the instructions of a block in order, and
accumulating the mapping of name and callframe instance to SSA value.

Initially, the available set on exit from each block is empty.

We iterate through a basic block, applying the following rules. On
entry to the block, the set of available values is the set from the
block's dominator, or Ø if the block is the entry block or unreachable.

+ `entry` - All values listed on the instruction acquire `Nothing` as
  their values.

+ `moveFromCallFrame` - The result of the instruction becomes the
  SSA value corresponding to the given variable name in the given
  callframe.

+ `moveToCallFrame` - The name-value mapping in the result callframe
  is set up as follows:
  
  + The available values in the source callframe become available
    values in the destination callframe.
	
  + For each name specified in the instruction from left to right, the
    available value for the name becomes the corresponding value in
    the instruction, and all other names that potentially alias the
	given name lose their values.
	
+ `invoke[Expanded]` - The available values in the source callframe
  become available in the destination callframe. Then any names that
  the invoked command might change, and any names that might alias
  them, lose their available values.

+ `directSet` etc. - Any name that might be an alias of the changing
  variable loses its available value. (If the changing variable name
  cannot be determined, all names might be aliases, because in this
  case we cannot determine that the name is fully qualified.)
  
+ `variable`, `upvar`, `nsupvar` - Any existing value for the given
  name and any possible alias thereof is no longer available. Note
  that this is actually a program error, but we need to retain correct
  behaviour in the case where the error is caught.
  
+ φ - If several callframes meet at a φ, the set of names that have
  available values will be the intersection of the available sets of
  the input callframes. 
  
  For each name in that intersection:
  
  + If all the incoming values are the same, that value becomes the value of
    the name in the result callframe. 
  
  + If the incoming values differ, then we find the available value for
    the name on exit from each predecessor block, and search for a φ
    instruction that combines them.
  
    + If such a φ is found, its result is the new value for the given
      name. 
  
    + If no such φ is found, we insert such a φ speculatively into the
      header of the block, and give its result as the value for the
      given name.
  
At the end of the block, after applying these rules, we have the
block's new available set. The calculation of this set may in turn
change the available values on φ instructions at the start of loops,
so the calculation must be applied to convergence over all blocks in
the program.

Those who are versed in formal data flow analysis are welcome to
separate the descriptions of the operations given above into separate
sets `AVAIL_GEN` and `AVAIL_KILL` so that the problem can be given as
the forward analysis:

![\texttt{AVAIL\_OUT}[B] = \texttt{AVAIL\_GEN}[B] 
~ \cup ~ \left( 
\overline{\texttt{AVAIL\_KILL}[B]} 
~ \cap \bigcap_{P\in \texttt{pred}[B]} \texttt{AVAIL\_OUT}[P] 
\right) 
](./avail.svg)

This separation will be made in the code if it turns out that it
enables combining of passes with other callframe optimizations.
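
Written out in code, the equation above is essentially a one-liner per
block. In this sketch the `gen`, `kill`, and `avail` dictionaries are
hypothetical names standing in for sets extracted from each block's
instructions; they are not the actual quadcode data structures.

```python
def avail_out(block, gen, kill, preds, avail, universe):
    """AVAIL_OUT[B] = AVAIL_GEN[B] ∪ (¬AVAIL_KILL[B] ∩ ⋂ AVAIL_OUT[P], P ∈ pred[B])."""
    if preds[block]:
        meet = set.intersection(*(avail[p] for p in preds[block]))
    else:
        meet = set()        # entry block: no values available on entry
    # complement is taken relative to the universe of variable names
    return gen[block] | ((universe - kill[block]) & meet)
```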


### Availability analysis: implications

Calculating the available values already exposes opportunities for
optimization. In particular, LOAD-LOAD and STORE-LOAD redundancies can
be eliminated by checking each `moveFromCallFrame` operation in turn
for whether it is accessing an available value. If it is, it is
eliminated and its result is replaced with the available value.

### Availability analysis: equivalence sets

LOAD-STORE pairs can also be detected, by examining `moveToCallFrame`
instructions and eliminating them if the value being written is
available and known to be equal to the Tcl value in the callframe.

Therefore, in the availability analysis, as well as accumulating the
sets of available Tcl values and the names in the callframe that
correspond to them, we must track the sets of names in a given
callframe that correspond to each SSA value. That way,
when we examine a `moveToCallFrame` instruction, we can detect
that the given value is already in the callframe under the given name.
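
The bookkeeping this requires is a pair of mutually inverse maps. The
following is a minimal sketch under assumed instruction forms (the
class and method names are inventions for illustration, not quadcode
APIs):

```python
class AvailState:
    """Tracks which callframe names hold which SSA values, both directions."""

    def __init__(self):
        self.value_of = {}      # variable name -> SSA value
        self.names_of = {}      # SSA value -> set of variable names

    def load(self, name, ssa_value):
        """moveFromCallFrame: the callframe slot and the SSA value now agree."""
        self.value_of[name] = ssa_value
        self.names_of.setdefault(ssa_value, set()).add(name)

    def store_is_redundant(self, name, ssa_value):
        """True when a moveToCallFrame would write back an unchanged value."""
        return name in self.names_of.get(ssa_value, set())

    def store(self, name, ssa_value, aliases=()):
        """moveToCallFrame: name takes the new value; possible aliases lose theirs."""
        for n in (name, *aliases):
            old = self.value_of.pop(n, None)
            if old is not None:
                self.names_of[old].discard(n)
        self.value_of[name] = ssa_value
        self.names_of.setdefault(ssa_value, set()).add(name)
```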

## TASK 2: Live Tcl value analysis

The first part of retrograde analysis needs to compute the sets of Tcl
variables whose values are possibly needed at the entry to each block
in the program. This analysis involves examining the instructions of
the block in _reverse_ order (that is, visiting postdominators before
the blocks that they postdominate)  and accumulating the sets of needed
values.

Initially, the required set at each block is empty.

When beginning to examine a block, if the block is an exit, then the
set of required variables is the set of local variables that may be
aliased (by virtue of appearing in `[upvar]`, `[namespace
upvar]/[global]`, or `[variable]`). This assertion mirrors the
requirement that a compiled procedure must have all the nonlocal
variables in sync when it leaves. The set is marked as being relative
to the input callframe of the `return` instruction.

If a block is not an exit block, then the set of required variables is
the union of the sets of variables required at the entry to its
successors. If a successor contains a φ operation that produces the
callframe to which the variable references refer, then the callframe
reference of each variable is updated to the appropriate input to the 
φ.

Then, back to front, the sets get updated by examining individual
instructions:

+ `moveFromCallFrame` - The given variable and all possible aliases
  are added to the set of variables whose values are required.
  
+ `moveToCallFrame` - The variables required will be relative to the 
  input callframe. Starting with the set of variables required in the
  output callframe, we examine the variables given in the instruction
  from left to right. For each of those, the given variable is removed 
  from the set of variables whose values are required. All possible
  aliases of the given variable are added to the set.
  
+ `invoke[expanded]` - The variables required will be relative to the
  input callframe. We start with the set required relative to the
  output callframe. All variables that may be read or written, and
  all possible aliases, are added to the set. (Since we do not in
  general examine whether a procedure must alter a given variable,
  but only whether it may alter it, we require in general that the
  previous value of the variable must already be in sync.)
  
+ `directGet` etc. - The variables required will be relative to the
  input callframe. We start with the set required relative to the
  output callframe. Any variable that might alias the variable being
  retrieved or set in the instruction is added to the set.
  
+ `variable`, `upvar`, `nsupvar` - The variables required will be
  relative to the input callframe. We start with the set required
  relative to the output callframe. The target variable and any
  possible aliases are added to the set.

+ φ - Analysis stops at a φ instruction that references a callframe.
  The translation to the correct input callframe will happen when
  analyzing the predecessor block, as noted above.
  
When we arrive at a loop's entry, we may produce a greater set of
values that must be in sync at the bottom of the loop, so just as with
the availability analysis, the liveness analysis must be iterated to
convergence.
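
The per-block, back-to-front propagation for the two `move*` rules can
be sketched as follows. The instruction tuples and the alias map are
assumptions for illustration; only the LOAD and STORE rules from the
list above are shown.

```python
def required_at_entry(instrs, required_after, aliases):
    """Propagate the required-variable set backward through one block."""
    req = set(required_after)
    for op, var in reversed(instrs):
        if op == 'moveFromCallFrame':
            # a LOAD needs the slot, and any possible alias, to be in sync
            req |= {var} | aliases.get(var, set())
        elif op == 'moveToCallFrame':
            # this STORE satisfies the need for var itself...
            req.discard(var)
            # ...but any possible alias must be in sync before it runs
            req |= aliases.get(var, set())
    return req
```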

### Liveness analysis: implications

When liveness analysis is complete, it informs removal of certain
`moveToCallFrame` instructions. Any `moveToCallFrame` instruction that
designates a variable that is not live in the given callframe may be
safely removed. Either it is a dead STORE (that is, neither it nor
any potential alias will ever be accessed again), or it participates
in a STORE-STORE pair. 

## TASK 3: Anticipability analysis and loop-invariant loads



## TASK 4: Loop-invariant stores



## References


\[SIMP96\]
Loren Taylor Simpson. 1996. _Value-Driven Redundancy
Elimination_. Ph.D. Dissertation. Rice University, Houston, TX,
USA. AAI9631092. <https://www.clear.rice.edu/comp512/Lectures/Papers/SimpsonThesis.pdf>