skweel - optimize loop-level parallelism and locality
skweel [ options ] infile outfile
The skweel program analyzes and transforms the code to optimize loop-level parallelism and locality.
The transformations skweel performs include a general class of loop transformations known as "unimodular transformations". These transformations include loop interchange, loop skewing and loop reversal. Skweel can also tile loop nests for cache locality.
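As an illustration of what "unimodular" means here, the following minimal sketch models interchange, reversal and skewing of a doubly nested loop as 2x2 integer matrices with determinant +1 or -1, and checks legality by requiring every transformed dependence vector to remain lexicographically positive. The names (INTERCHANGE, is_legal, and so on) and the sample dependence (1, -1) are assumptions made for this example; this is not skweel's implementation.

```python
# 2x2 unimodular matrices acting on the iteration vector (i, j).
INTERCHANGE = ((0, 1), (1, 0))   # (i, j) -> (j, i)
REVERSAL = ((1, 0), (0, -1))     # (i, j) -> (i, -j)
SKEW = ((1, 0), (1, 1))          # (i, j) -> (i, i + j)


def det2(t):
    """Determinant of a 2x2 integer matrix."""
    return t[0][0] * t[1][1] - t[0][1] * t[1][0]


def apply(t, v):
    """Multiply the 2x2 matrix t by the column vector v."""
    return (t[0][0] * v[0] + t[0][1] * v[1],
            t[1][0] * v[0] + t[1][1] * v[1])


def compose(a, b):
    """Matrix product a*b: apply b first, then a."""
    return tuple(
        tuple(sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2))
        for r in range(2)
    )


def is_legal(t, dependences):
    """Legal if every transformed dependence vector stays lexicographically
    positive, i.e. the source iteration still runs before the sink.
    Python compares tuples lexicographically, so '> (0, 0)' is that test."""
    return all(apply(t, d) > (0, 0) for d in dependences)


# Hypothetical dependence distance, e.g. from a[i][j] = a[i-1][j+1].
DEPS = [(1, -1)]
```

With this dependence, is_legal(INTERCHANGE, DEPS) is false, while is_legal(compose(INTERCHANGE, SKEW), DEPS) is true: skewing first turns the distance into (1, 0), after which the loops can be interchanged.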
This pass expects two file specifications on the command line, first for the input file, then for the output file. Skweel is run when scc is invoked with the -parallel or -multi flag. In both cases, skweel is called with the options -P -T -i.
-A n Print out the SUIF code before and/or after transformation. If the integer n is set to 1, the SUIF code is printed before transformation. If it is set to 2, then the code is printed after transformation. The code is printed both before and after transformation if n is set to 3.
-N Normalize the loops only; do not transform to improve code.
-P n Parallelism level. Look for parallelism and annotate the parallel loops in the code. The default is 1.
-Q n Process up to n functions before quitting. The default is 100,000.
-S Do not perform scalar privatization.
-T Do not tile the code.
-V Turn on verbose mode. This prints information about the function being processed, the dependences found, etc.
-W Print statistics about the input program, including loop nest shape and loop nest depth.
-2 When tiling a loop nest of depth n, coalesce the tile to generate 2n-1 loops rather than 2n.
-c n Set the cache size to the integer n bytes. This value is used to calculate block sizes when tiling. The default is 65536 (64K).
-f g Set the fraction of the cache to fill to the floating-point value g. This value is used to calculate block sizes when tiling. The default is 0.6.
-i Interchange loops for register locality.
-l n Set the cache line size to the integer n bytes. This value is used to calculate block sizes when tiling.
-p Print a summary of the for loops in the program. The summary includes information about which loops can and cannot be transformed.
-y n Do not parallelize loops with fewer than n iterations. The default is set to 0, meaning that all parallel loops found will be marked with "doall" annotations.
-z n Do not parallelize loops with less work than the integer n. The amount of work in a loop is estimated by a function of the loop bounds and the number of instructions in the loop. The work estimation function used by skweel is from the useful library. The default is set to 0, meaning that all parallel loops found will be marked with "doall" annotations.
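The -c and -f options above feed a block-size calculation that this page does not spell out. As a rough illustration only, the sketch below assumes a square tile of 8-byte elements sized to fit within the chosen fraction of the cache, and then walks a depth-2 iteration space with the 2n = 4 loops that tiling produces. The function names, the element size and the square-tile rule are assumptions for this example, not the formula skweel uses.

```python
import math


def tile_edge(cache_bytes=65536, fill_fraction=0.6, element_bytes=8):
    """Hypothetical block-size rule: edge of the largest square tile of
    element_bytes-sized values that fits in fill_fraction of the cache.
    The defaults for cache_bytes and fill_fraction mirror -c and -f."""
    usable_elements = int(cache_bytes * fill_fraction) // element_bytes
    return max(1, math.isqrt(usable_elements))


def tiled_iterations(n, m, b):
    """Visit an n x m iteration space in b x b tiles.
    A depth-2 nest becomes 4 loops: two tile loops and two element loops."""
    for ii in range(0, n, b):                        # tile loop over i
        for jj in range(0, m, b):                    # tile loop over j
            for i in range(ii, min(ii + b, n)):      # element loop over i
                for j in range(jj, min(jj + b, m)):  # element loop over j
                    yield (i, j)
```

Under these assumptions the default settings give a tile edge of 70 elements. The tiled traversal visits exactly the same iterations as the original nest, only in a different order, which is why tiling is restricted to nests where that reordering is legal.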
These two annotations are placed on TREE_FOR nodes by the scalar expansion pass of oynk and moo, respectively. The var_syms var are modified in the loop nest. This information is used to determine which loops are parallel.
This annotation is placed on TREE_FOR nodes by the scalar expansion pass of moo. The var_syms var are read within the loop nest.
This annotation is placed on TREE_FOR nodes by the scalar expansion pass of moo and oynk. Private copies of the var_syms var can be made for each iteration of the loop nest. This information is used to parallelize loops.
This annotation is placed on TREE_FOR nodes by the scalar expansion pass of oynk. Finalization is needed in order for the var_syms var to be privatized. Currently, skweel will not privatize any variables that need finalization.
This annotation is placed on TREE_FOR nodes by the scalar expansion pass of oynk. The induction variable for this loop is read after exiting the loop. If a loop has a live induction variable, skweel will not consider transforming the loop.
This annotation is placed on the proc_sym for FORTRAN intrinsic functions that are known to have no side-effects. Skweel will not parallelize loops that contain calls to functions that do not have "pure function" annotations.
reduction type var
This annotation is placed on TREE_FOR nodes and instructions by reduction. A reduction of the commutative operation type is calculated over the var_sym var. Currently supported reduction types include sum, product, max and min.
C pragma doall
These annotations are generated by the front-end from pragmas in the source code. They allow users to explicitly parallelize loops that skweel wouldn't otherwise parallelize. The annotations are placed on mrk instructions. When skweel sees one of these annotations, it puts a "doall" annotation on the closest TREE_FOR following the mrk instruction. The TREE_FOR must be in the same list as the mrk instruction. Subsequent passes (e.g. pgen) will then treat the loop as parallel.
These annotations are placed on TREE_FOR nodes to mark the boundaries of fully permutable loop nests. If a loop nest is fully permutable then any permutation of the loops within the nest is legal, and the loop nest can be tiled.
This annotation is placed on TREE_FOR nodes that skweel determines can be legally run in parallel. The "doall" annotation is read by pgen.
This annotation is placed on TREE_FOR nodes, and is used in conjunction with the "doall" annotation. The var_syms var must be made private in order for the code to run correctly in parallel. The "privatized" annotation is read by pgen.
reduced type var
This annotation is placed on TREE_FOR nodes, and is used in conjunction with the "doall" annotation. A reduction of the commutative operation type must be calculated over the var_sym var in order for the code to run correctly in parallel. The "reduced" annotation is read by pgen.
This annotation is placed on TREE_FOR nodes. The loop was not marked with a "doall" annotation because the number of iterations in the loop was deemed too small (using the parameter specified with the -y flag).
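To show why a loop carrying a reduction can still be run in parallel, the sketch below gives each hypothetical worker a private accumulator and combines the partial results afterwards with the same commutative operation. It covers the four reduction types named above (sum, product, max, min). The function name, the worker count and the way iterations are dealt out are assumptions for this example, and the workers are simulated sequentially; this is not the code pgen generates.

```python
from functools import reduce
import operator

# Reduction types named in the text, each with its operation and identity.
REDUCTIONS = {
    "sum": (operator.add, 0),
    "product": (operator.mul, 1),
    "max": (max, float("-inf")),
    "min": (min, float("inf")),
}


def chunked_reduce(kind, values, workers=4):
    """Reduce values as if the loop iterations were split across workers."""
    op, identity = REDUCTIONS[kind]
    partials = []
    for w in range(workers):
        private_acc = identity            # private copy for worker w
        for x in values[w::workers]:      # iterations assigned to worker w
            private_acc = op(private_acc, x)
        partials.append(private_acc)
    # Combine step: merge the private copies into the final result.
    return reduce(op, partials, identity)
```

Because the operation is commutative and associative, the combined result does not depend on how the iterations are divided, so chunked_reduce("sum", list(range(1, 11))) gives 55 for any worker count.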
fixfortran(1), oynk(1), moo(1), pgen(1), reduction(1), scc(1)
Michael E. Wolf. "Improving Locality and Parallelism in Nested Loops", Ph.D. thesis, Stanford University, Computer Systems Laboratory, August, 1992.
M. E. Wolf and M. S. Lam. "A Loop Transformation Theory and An Algorithm to Maximize Parallelism", IEEE Transactions on Parallel and Distributed Systems, October, 1991.
M. E. Wolf and M. S. Lam. "A Data Locality Optimizing Algorithm", Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, June, 1991.
The original parallelism and locality optimizer for the old SUIF system was written by Michael Wolf. Jennifer Anderson translated it to new SUIF, and added some features.