CS243: Program Analysis and Optimization Winter 2012
The JoeQ Compiler System
The joeq compiler system is a compiler analysis framework, compiler, and full virtual machine written in the Java programming language. It is open source, distributed under the LGPL - for full information, check out the joeq homepage.
A description of the Joeq system appears in the article:
J. Whaley, Joeq: A Virtual Machine and Compiler Infrastructure
In ACM SIGPLAN 2003 Workshop on Interpreters, Virtual Machines and Emulators, June 2003.
The joeq system is extremely large. We will restrict ourselves to a small, standalone subset of the complete system.
We use a slightly modified joeq that adds Java generics.
Instead of using joeq.Util.Templates.ListIterator.BasicBlock or
joeq.Util.Templates.ListIterator.Operand, you should use
java.util.ListIterator<BasicBlock> or java.util.ListIterator<Operand>. Since
only the portion of joeq that is used for our project has been modified
accordingly, there can be minor descrepancies between the interface of joeq we
provide and the JavaDoc in the sourceforge website (e.g.,
QuadIterator.nextQuad() is obsolete. You should use QuadIterator.next()).
Contents
- Installation
- The resources you need to run JoeQ.
- Java Basics
- If you've never actually used the Java VM before, this will get you started, as well as pointing you to official resources.
- Program Representation
- A quick overview of how the Java VM stores code as bytecodes, and how bytecode is represented as Quads in joeq.
- Code Organization
- How joeq stores and operates on programs, from class files down to individual instructions.
- Details of the Quad representation
- The raw material your analyses will work upon.
- Details of the Visitor classes
- The interfaces you must implement, or the classes you must extend, to do useful work.
- The Analysis Interface
- The Main.Helper class and the primary operations it supports.
- Writing and executing a compiler pass
- The basics of creating a pass and running it over arbitrary code.
- Examples
- Determining all possible thrown exceptions.
Manipulating variables in quads.
Manipulating control flow graphs.
Installation Setting up the JoeQ framework
Java Basics Quick overview
The Java programming language is a general-purpose object-oriented concurrent language. Its syntax is similar to C and C++, but it omits many of the features that make C and C++ complex, confusing, and unsafe.
Java programs run on a standardized platform called the Java virtual machine. The Java virtual machine is an abstract computing machine. Like a real computing machine, it has an instruction set and manipulates various memory areas at run time.
In this class, we will be using the Java 2 Platform, Standard Edition v1.5.0 (commonly called Java 5). This is installed on the "elaine" machines in Sweet Hall, and also is available for download at http://java.sun.com/j2se/downloads.html.
This page assumes that you've got some basic familiarity with the Java language, but haven't had to use it on the elaines. If you've never dealt with Java before, read The Java Tutorial and familiarize yourself with the language. (The first part, Your first cup of Java, covers the basics of installing and running Java on your platform of choice.)
Using the Java Development Kit
There are three main programs that you will use in the JDK:
javac- convert Java source to Java bytecodejava- run Java bytecodejavap- disassemble Java bytecode
All of these run from the command line. We will delay discussion of javap until our discussion of the Java bytecode system, and concentrate more on the more frequently used javac and java.
Create the following file on an elaine machine, and name it Hello.java:
public class Hello {
public static void main(String[] args) {
System.out.println("Hello, world!");
}
}
Now execute the following commands from that directory. You should get the same results as listed here.
elaine6:~/examples> javac Hello.java elaine6:~/examples> java Hello Hello, world! elaine6:~/examples>
The javac command reads Hello.java and produces Hello.class, which is then executed by the java command. Note that you do not specify the .class extension when running code.
If your shell doesn't find these programs, you'll have to extend your
PATH environment variable. The command
setenv PATH ${PATH}:/usr/pubsw/bin
should fix the problem.
The CLASSPATH
One of Java's more unusual features is that there is no static linking
step. The name of any class provides sufficient information to allow the java
command to locate the relevant class file. Package names correspond to
directories, and the name of the .class file must correspond to the name of
the class given. The VM, however, must be informed where to begin searching.
This is the purpose of the CLASSPATH variable. This is a list of directories,
much like the PATH variable under Unix or Win32. Each directory listed is a
possible "root" for the class search. Archives of Java code (which usually
have the extension .jar) may also be listed as if they were directories.
With the 1.5 JDK, CLASSPATH must include the current directory, named by a
single period. (If CLASSPATH is completely undefined, that's OK too.)
Program Representation An Intermediate Representation of code as bytecodes and quads
A Java source program is first converted to a class file containing bytecode, which is then translated into an IR representation in the compiler called Quads. This page familiarizes you with the program representations by way of examples.
Java Bytecode
The Java bytecode is a rather high-level representation of a Java program. While some information, like local variable names, is dropped, high-level information such as class layouts and object hierarchies is retained. Java bytecode is stack-oriented: operands are pushed on the operand stack and arithmetic operations are applied to the top variables on the stack. The stack architecture was chosen because their programs are compact.
One can examine the bytecode of a class by invoking the bytecode
disassembler using the command javap -c <classname>. You do
not need to know the details of Java bytecode for this class. We include a
brief discussion here so that you can understand the process by which Java
source code is translated to our own internal compiler representation. If you
are interested in finding out more, here are the overviews of the class
file format, and the compilation
from Java source to bytecode.
The Quad Intermediate Representation (IR)
The representation that you will be using for the first two assignments is also a rather high-level IR. Like Java bytecode, it retains source program information such as field accesses and virtual method invocations. This supports the implementation of high-level optimizations such as minimizing the cost of virtual function invocations.
Instead of a stack architecture, however, we will use as our model a machine with an unbounded number of pseudo registers. Pseudo registers hold local variables of a method, as well as temporary variables generated by the compiler to store intermediate results. All data must first be loaded into pseudo registers before they can be operated on. This architecture is more conducive to program optimization than the stack architecture.
All the stack operations in the class files are translated into a series of
simple instructions, each accepting up to three input operands and writing to
one result variable. Hence, the IR is called Quads. Instructions are
organized in a control flow graph, where nodes are the basic blocks and edges
are the possible flow of control. Furthermore, the compiler also puts in the
verification checks imposed by the Java semantics. For example, references are
checked for NULL values before they can be used. These checks are
inserted into the Quad representation directly.
Examples
Below we will show a few simple examples to illustrate how the same program
can be represented at the source, bytecode, and the quad representation. You
are encouraged to write new Java examples and use the same steps to find out
how they are represented at the byte code and, more importantly, as quads. The
sources to all the examples can be found in
/usr/class/cs243/examples.
Example 1
This example illustrates how basic expressions are represented as quads.
Source Program
class ExprTest {
int test (int a) {
int b, c, d, e, f;
c = a + 10;
f = a + c;
if (f > 2) {
f = f - c;
}
return (f);
}
}
Bytecode
We first use javac to compile the Java source to a class file, then run the disassembler over the class file.
elaine6:~/examples> javac ExprTest.java
elaine6:~/examples> javap -c ExprTest
Compiled from ExprTest.java
class ExprTest extends java.lang.Object {
ExprTest();
int test(int);
}
Method ExprTest()
0 aload_0
1 invokespecial #1 <Method java.lang.Object()>
4 return
Method int test(int)
0 iload_1
1 bipush 10
3 iadd
4 istore_3
5 iload_1
6 iload_3
7 iadd
8 istore 6
10 iload 6
12 iconst_2
13 if_icmple 22
16 iload 6
18 iload_3
19 isub
20 istore 6
22 iload 6
24 ireturn
elaine6:~/examples>
javap first prints out the names of the methods defined for
each class, then the definition of the individual methods. By default, all
classes extend java.lang.Object; an appropriate constructor is
automatically generated by the compiler if one does not exist.
For each method, javap prints out its signature--for example,
test accepts an integer and returns an integer. A frame is
created for each invocation. Location 0 holds the this pointer; the parameter
and local variables a,b,c,d,e,f are numbered 1 to 6, respectively.
Instructions are labeled by their position in the array of bytecodes
representing the procedure.
Instructions such as load are prefixed by the result type: a,b,c,d,f,i,j,s,
and z represent reference, byte, character, double, float, integer, long,
short, boolean, respectively. An instruction's parameter is either represented
as a suffix or an extra operand. iload_1 and iload_6 load the 1st and 6th
variables from the frame onto the stack, respectively. The difference is just
an optimization in encoding; the former, which is more common, is encoded in
one byte and the latter is encoded in two.
iconst refers to pushing an integer constant on the stack. if_icmple_22 is
a conditional branch based on an integer comparison between two operands on
the stack. Namely, if the top of stack is less than or equal to the second
operand on the stack then go to instruction 22.
You can print out a textual representation of the quad IR by using the PrintQuads.java program (also included as part of the example programs in the joeq distribution). Run it using the following commands:
elaine6:~/examples> javac examples.PrintQuads elaine6:~/examples> java examples.PrintQuads examples.ExprTest Class: ExprTest Method: <init>()V Control flow graph for ExprTest.<init> ()V: BB0 (ENTRY) (in: <none>, out: BB2) BB2 (in: BB0 (ENTRY), out: BB1 (EXIT)) 2 NULL_CHECK T-1 <g>, R0 ExprTest 1 INVOKESPECIAL_V% java.lang.Object.<init> ()V, (R0 ExprTest) 3 RETURN_V BB1 (EXIT) (in: BB2, out: <none>) Exception handlers: [] Register factory: Local: (I=1,F=1,L=1,D=1,A=1) Stack: (I=1,F=1,L=1,D=1,A=1) Method: test(I)I Control flow graph for ExprTest.test (I)I: BB0 (ENTRY) (in: <none>, out: BB2) BB2 (in: BB0 (ENTRY), out: BB3, BB4) 1 ADD_I T0 int, R1 int, IConst: 10 2 MOVE_I R3 int, T0 int 3 ADD_I T0 int, R1 int, R3 int 4 MOVE_I R6 int, T0 int 5 IFCMP_I R6 int, IConst: 2, LE, BB4 BB3 (in: BB2, out: BB4) 6 SUB_I T0 int, R6 int, R3 int 7 MOVE_I R6 int, T0 int BB4 (in: BB2, BB3, out: BB1 (EXIT)) 8 RETURN_I R6 int BB1 (EXIT) (in: BB4, out: <none>) Exception handlers: [] Register factory: Local: (I=7,F=7,L=7,D=7,A=7) Stack: (I=2,F=2,L=2,D=2,A=2) elaine6:~/examples>
This command invokes a program that loads in classes, then invokes the
compiler pass joeq.Compiler.Quad.PrintCFG on each method in the
class given.
Here we see that BB0 and BB1 are the entry and
exit blocks, respectively. There is a conditional flow of control from
BB2 around BB3 arriving at BB4. The
first operand of each quad is the destination variable.
The this pointer is allocated to R0. The parameters and local variables
a,b,c,d,e,f are allocated to pseudo registers R1 to R6, respectively.
Intermediate results are stored into temporary registers. For example, the
result of R1 + 10 is stored into T0, before it is
stored into R3.
The IFCMP_I instruction is similar to the
if_icmpl instruction, except that the comparison operation is one
of the parameters and the target is basic block BB4. The type of
the operations is attached to the operation as a suffix. The initialization
routine includes an INVOKESPECIAL_V% operation. INVOKESPECIAL
invokes an instance method which requires special handling, such as an
instance initialization method, a private method, or a superclass method. The
suffix _V indicates that the function invoked returns void, and the % symbol
indicates that the invoked function may need to be loaded dynamically.
java.lang.Object.<init> ()V says to invoke the initialization function in
java.lang.Object, its superclass. The signature of the class is that it takes
no explicit argument and returns a void. It passes to it the this pointer in
R0 which is an instance of the class ExprTest.
Example 2
Here is another example to illustrate how fields and arrays are handled.
Source Program
class ArrayTest {
int A[];
ArrayTest() {
A = new int[10];
}
int access (int i) {
return (A[i]);
}
}
Quads
Control flow graph for ArrayTest.access (I)I: BB0 (ENTRY) (in: <none>, out: BB2) BB2 (in: BB0 (ENTRY), out: BB1 (EXIT)) 1 NULL_CHECK T-1 <g>, R0 ArrayTest 2 GETFIELD_A T0 int[], R0 ArrayTest, .A, T-1 3 NULL_CHECK T-1 <g>, T0 int[] 4 BOUNDS_CHECK T0 int[], R1 int, T-1 5 ALOAD_I T0 int, T0 int[], R1 int, T-1 6 RETURN_I T0 int BB1 (EXIT) (in: BB2, out: <none>)
The first NULL_CHECK checks if the this pointer is not null. T-1, read T
minus one, is a fake location referenced by the subsequent operation
(GETFIELD_A) that uses the checked pointer. This fake dependence between the
definition and the use of T-1 prevents the instruction scheduler from
inverting the order of NULL_CHECK and GETFIELD_A. The GETFIELD_A operation
stores the A field of the instance, which is a reference to an array, into the
temporary variable T0. The NULL and BOUNDS checks are then performed. The
ALOAD_I instruction loads into a register an indexed array location of type
int. The NEWARRAY is a special instruction that creates a new array of a given
size.
Code Organization How JoeQ stores code as an IR
The joeq system allows access to program code at a wide variety of levels. This section will cover the important methods of each level of focus. Where appropriate, links to the relevant joeq javadocs will be available.
Java Elements: The joeq.Class package
The Class package contains classes that represent the various components of
Java class files. Many of the String-like return values in this package return
objects of type Utf8, explicitly representing the Unicode format the JVM uses;
these may be turned back into Strings by calling toString upon them.
Types: jq_Type and jq_Primitive
jq_Type is the basic type representing types. It has a
subclass, jq_Primitive, representing the primitive types. The
jq_Primitive class has the following static fields for referring
to each type:
| jq_Primitive.BOOLEAN |
| jq_Primitive.BYTE |
| jq_Primitive.CHAR |
| jq_Primitive.DOUBLE |
| jq_Primitive.FLOAT |
| jq_Primitive.INT |
| jq_Primitive.LONG |
| jq_Primitive.SHORT |
| jq_Primitive.VOID |
Neither jq_Type nor jq_Primitive have any methods of interest to us.
Arrays: jq_Array
Arrays are represented with the jq_Array class. The methods of note here are:
int getDimensionality(): Returns how many dimensions the array has:int[][]is dimensionality 2.jq_Type getElementType(): Returns the element type.int[][]has element typeint[].jq_Type getInnermostElementType(): Returns the result of the maximum number of nestedgetElementType()calls.int[][]has innermost element typeint.
Classes: jq_Class
The jq_Class class represents a class file as a whole. You
acquire the jq_Class for a type by calling
joeq.Main.Helper.load; see the section on
joeq.Main.Helper for details. Important methods in
jq_Class are:
String getName():Returns the name of the class.jq_Class getSuperclass():Returns the parent class, or null if the class is java.lang.Object.jq_Class[] getDeclaredInterfaces():Returns an array of all interfaces this class claims to implement. This does not include superinterfaces of those interfaces, nor does it contain interfaces declared by superclasses.jq_InstanceField[] getDeclaredInstanceFields(): Returns all instance fields that this class declares. It does not mention fields declared by superclasses.jq_StaticField[] getDeclaredStaticFields(): Returns all static fields that this class declares. It does not mention fields declared by superclasses.jq_InstanceMethod[] getVirtualMethods(): Returns all instance methods that this class declares. It does not mention methods declared by superclasses.jq_StaticMethod[] getDeclaredStaticMethods(): Returns all static methods that this class declares. It does not mention methods declared by superclasses.
Methods: jq_Method
Utf8 getName(): Returns the (simple) name of this method.jq_Type[] getParamTypes(): The arguments to the routine.jq_Type getReturnType(): The return type of the routine.jq_Class getDeclaringClass(): Returns the class that declared this method.
Fields: jq_Field
Utf8 getName(): Name of the field. Doesn't include the class designator.jq_Type getType(): The type of this field's value.jq_Class getDeclaringClass(): Returns the class that declared this field.
Code: The jeoq.Compiler Package
Most of the code for representing the actual program code in joeq is
contained in the joeq.Compiler package and its subpackages.
joeq.Compiler.Quad.ControlFlowGraph
BasicBlock entry(): The first BasicBlock in the CFG. (This is always empty.)BasicBlock exit(): The last BasicBlock in the CFG. (This is also always empty.)jq_Method getMethod(): The method that corresponds to this CFG.
Iterating over CFGs: joeq.Compiler.Quad.QuadIterator
The QuadIterator loops through a ControlFlowGraph
a Quad at a time. It is constructed taking a ControlFlowGraph as
its constructor argument. The following methods then become available:
Quad next(): Returns the next quad in a reverse-post-order listing of Quads.Quad previous(): Returns the previous quad in a reverse-post-order listing of Quads.boolean hasNext(): Returns true if next is legal.boolean hasPrevious(): Returns true if previous is legal.java.util.Iterator successors(): Returns an iterator of the possible successor quads of the most recently returned quad. If a possible successor is the method exit, it includes the null value in the iteration.java.util.Iterator predecessors(): Returns an iterator of the possible predecessor quads of the most recently returned quad. If a possible predecessor is the method entrance, it includes the null value in the iteration.
joeq.Compiler.Quad.BasicBlock
The BasicBlock class represents a joeq basic block. These have
one entrance, and, assuming no exceptions are thrown, one exit. (If exceptions
are thrown, the "basic block" may have multiple exit points.)
There are no methods of interest to us defined in BasicBlock.
We will usually be focussing either on ControlFlowGraphs or
Quads. If you find this strange, consider what happens if you think about all the instructions in a basic block as residing in it's own basic block -
that's what the Quad representation gives you, so you can implement all your algorithms directly on Quads.
joeq.Compiler.Quad.Quad
At the bottom of the code hierarchy is the quad, a construct very similar to three-address assembly code. Relevant methods on quads themselves are:
Operator getOperator()retrieve the Operator for this Quad. You are unlikely to ever check this directly, as the use of a QuadVisitor will usually do this for you.List getThrownExceptions(): Returns a list of jq_Classes - one for each exception that can be thrown.toString(): A textual mnemonic for this quad.
The operators and operands of Quads are numerous enough that
they they warrant their own section: the Quads
Overview.
joeq.Compiler.BytecodeAnalysis
Important This package deals mostly
with the JVM's code representation -- we don't have to deal with it in this
class. Note, however, that it has a ControlFlowGraph class of its
own, so do not import joeq.Compiler.Quad.* and
joeq.Compiler.BytecodeAnalysis.* in the same code.
Quads Details of the Quad representation of bytecodes
The program representation that we will be using is this class is a control flow graph, where basic blocks consist of simple operations known as Quads. Quads consist of an Operator and up to four Operands.
Operator types
All operators are inherited from the Operator class. Operators have the following methods:
getThrownExceptions(): gets the exception handlers for that operation.getDefinedRegisters(q): gets all the registers defined by quad q.getUsedRegisters(q): gets all the registers used by quad q.getReg1(q),getReg2(q),getReg3(q): gets the named register from quad q.INSTANCE: All operators are implemented using the Singleton pattern. This field holds the object associated with each operator.
All the operations are hierarchically ordered. For example, there are
different operations to write to a field in a class, depending on the type of
the field. For example, the PUTFIELD_{I,F,L,D,A,B,C,S,Z} Operators represent
array load instructions for the types integer, floating-point, long, double,
reference, byte, character, short, and boolean.
PUTFIELD_{I,F,L,D,A,B,C,S,Z}_DYNLINK are similar to
PUTFIELD_{I,F,L,D,A,B,C,S,Z}, except that the invoked function may need to be
loaded dynamically. We use the % symbol to stand for DYNLINK when printing the
quads. All the various PUTFIELD variants inherit from the Putfield class,
which inherits from Operator. Associated with each class is a set of methods,
allowing the compiler writer to refer to the operands of a quad by more
meaningful names than getOp1, getOp2 etc. For example, the first operand for
Binary operations can be retrieved by invoking getDest on the quad containing
the operator.
| Operator | Description | Subclasses | Methods |
|---|---|---|---|
|
Move |
Move between pseudo registers |
|
getMoveOp |
|
Binary |
Binary operation, with two sources and one destination |
|
getDest |
|
Unary |
Unary operation, with one source and one destination |
|
getDest |
|
Goto |
unconditional branch |
|
getTarget |
|
IntIfCmp |
conditional branch |
|
getSrc1 |
|
TableSwitch |
table switch instruction |
|
setTarget |
|
LookupSwitch |
lookup switch instruction |
|
setMatch |
|
Getstatic |
load from a static field |
|
getDest |
|
Putstatic |
store to a static field |
|
getSrc |
|
Getfield |
load from an object field |
|
getDest |
|
Putfield |
store to an object field |
|
getBase |
|
ALoad |
array load |
|
getDest |
|
AStore |
array store |
|
getValue |
|
ALength |
array length |
|
getDest |
|
BoundsCheck |
array bounds check |
|
getRef |
|
NullCheck |
null pointer check |
|
getDest |
|
ZeroCheck |
null pointer check |
|
getDest |
|
StoreCheck |
array object type store check |
|
getRef |
|
New |
allocate a new object |
|
getDest |
|
NewArray |
allocate a new array |
|
getDest |
|
CheckCast |
run time type check |
|
getDest |
|
InstanceOf |
run time type check |
|
getDest |
|
Invoke |
call another method |
|
setParam |
|
Jsr |
call an internal subroutine |
|
getDest |
|
Ret |
return from an internal subroutine |
|
getTarget |
|
Return |
return from the current method |
|
getSrc |
|
Monitor |
access atomic lock for an object |
|
getSrc |
Operand types
The following provides a short description of all the different kinds of operands.
| Operand | Description |
|---|---|
| RegisterOperand | specifies an abstract location |
| IConstOperand | integer constant (32 bit) |
| FConstOperand | floating point constant operand (32 bit) |
| LConstOperand | long integer constant operand (64 bit) |
| DConstOperand | double-precision floating point constant operand (64 bit) |
| AConstOperand | object reference constant |
| MethodOperand | specifies the target method of a method call instruction |
| ParamListOperand | specifies the parameter list for a method call instruction |
| FieldOperand | specifies an object field associated with a get/put field instruction |
| TypeOperand | specifies a type for use in type check instructions |
| ConditionOperand | Specifies a condition code associated with a conditional branch instruction |
| TargetOperand | specifies the target of a branch instruction |
| BasicBlockTableOperand | specifies a table of targets, for use in switch instructions |
Basic blocks and the control flow graph
Quads are organized into BasicBlocks. BasicBlocks are single-entry, meaning that control flow can only enter at the start of a basic block. Barring exceptions and method calls, they are also single-exit, meaning that control flow can only leave at the end of a basic block. Important This is an important difference between the basic blocks that were covered in lecture. Exceptions can cause basic blocks to exit early.
Each BasicBlock contains a list of Quads. They also include a set of predecessor and successor basic blocks. Basic blocks are numbered with unique integer identifiers. There are two special basic blocks, the entry basic block and the exit basic block. These basic blocks always have identifiers 0 and 1, respectively. These blocks mark the method entry point and exit point. They never contain instructions. Obviously, the entry basic block has no predecessors, and the exit basic block has no successors.
Basic blocks also have exception handlers. When an exception occurs within a basic block, control flow jumps to the handler for that basic block that matches the type of the exception that occurred.
BasicBlocks are contained in a ControlFlowGraph. The ControlFlowGraph contains references to the start and end nodes and a list of all exception handlers.
Visitors Details of the Visitor class to traverse the IR
Central to the joeq analysis system is the concept of the Visitor. Visitors
are described fully in the Design Patterns book by Gamma et al., but the
concept is fairly straightforward. Each Visitor class has a number of methods
of the form void visitX (X argument), which are called by the analysis
machinery upon encountering objects of that type. Each pass is defined by one
or more Visitor classes. The granularity of the analysis is determined by what
kind of Visitor the class uses.
The Visitor classes are all actually interfaces. However, since some of the
interfaces specify many, many methods, each includes a static nested class
called EmptyVisitor that implements each method with a no-op. This allows one
to override only those methods that your pass requires.
joeq.Class.jq_TypeVisitor
The jq_TypeVisitor visits arrays, primitives, and classes. For our
purposes, the visitClass(jq_Class c) method is the only one we'll ever really
need to override.
joeq.Class.jq_MethodVisitor
This Visitor can distinguish between static and virtual methods, or between
various sorts of initializers. All visited methods will invoke their most
specific method, then the generic visitMethod. For our purposes, we'll almost
never use anything other than visitMethod.
joeq.Compiler.Quad.ControlFlowGraphVisitor
The only method in this interface is void visitCFG(ControlFlowGraph cfg).
It's easy to implement.
joeq.Compiler.Quad.BasicBlockVisitor
The only method in this interface is void visitBasicBlock(BasicBlock bb).
It's also easy to implement.
joeq.Compiler.Quad.QuadVisitor
The QuadVisitor is the bulkiest of the Visitor classes. There is one method
for each of the major Operator subclasses (the left column in the chart in the
Quads section), and a generic visitQuad. When this visits a Quad,
it first calls the appropriate specific visit routine, and then calls
visitQuad.
You will often use QuadVisitors in conjunction with QuadIterators or with
the BasicBlock.visitQuads function. This will allow you to adjust evaluation
order to react to control flow or intermediate analysis results.
joeq.Main.Helper Analysis Interface How to use Visitor classes you write
The joeq system has a lot of state and has many very complex internal dependencies. The joeq.Main.Helper class hides all of these complexities and requirements behind a very simple interface. In fact, there are only two method names you need to know.
jq_Type load(String name): This routine takes a class name specified by the argument and searches the classpath for a class of that name. If such a class exists, it loads it into a jq_Class object and returns it; otherwise, it throws a NoClassDefFoundError. For our purposes, it will always work to downcast the returned jq_Type to a jq_Class.void runpass (target, pass): This is actually a family of functions, but you do not ordinarily need to worry about that. The target is a chunk of code: a jq_Class, a jq_Method, a ControlFlowGraph, a BasicBlock, or a Quad. The pass is a relevant visitor: a jq_TypeVisitor, a jq_MethodVisitor, a ControlFlowGraphVisitor, a BasicBlockVisitor, or a QuadVisitor. You may run a pass on a target if the target is at least as fine-grained as the pass; so, for instance, a BasicBlock can receive a BasicBlockVisitor or a QuadVisitor, but a jq_Class can receive any kind of pass.
Writing and executing a compiler pass
Writing a compiler pass in joeq is easy, once you learn a few basics. You should be familiar with the basic operation of the visitors and the joeq code hierarchy before dealing with these samples.
Here's a simple compiler pass that counts all the quads it visits:
import joeq.Compiler.Quad.*;
public class QuadCounter extends QuadVisitor.EmptyVisitor {
public int count = 0;
public void visitQuad(Quad q) { count++; }
}
There's not a lot to say about this class. The QuadVisitor interface specifies a vast number of methods, and we only care about one of them (the most generic, which is always invoked). Thus, we extend the associated EmptyVisitor instead of implementing the core interface. The operation of the pass is fairly straightforward -- any time a quad accepts this pass, we bump the counter.
This is all well and good, but how do we run our pass on anything? As a
QuadVisitor, we can run it on a code chunk of any size -- a class, a method, a
basic block, or even a single quad (though that last isn't likely to be too
interesting). To run this pass, we simply use one of joeq.Main.Helper's many
runPass methods. We add a main method that reads in class names from the
command line and runs itself on those passes.
import joeq.Compiler.Quad.*;
import joeq.Main.Helper;
import joeq.Class.jq_Class;
public class QuadCounter extends QuadVisitor.EmptyVisitor {
public int count = 0;
public void visitQuad(Quad q) { count++; }
public static void main(String[] args) {
jq_Class[] c = new jq_Class[args.length];
for (int i = 0; i < args.length; i++) {
c[i] = (jq_Class)Helper.load(args[i]);
}
QuadCounter qc = new QuadCounter();
for (int i = 0; i < args.length; i++) {
qc.count = 0;
Helper.runPass(c[i], qc);
System.out.println(c[i].getName()+" has "+
qc.count+" Quads.");
}
}
}
What if you only wanted to count the number of "move" instructions? The "visitQuad" method matches all quads. The visitor interface also has other kinds of visit methods that match specific quads; that is to say, they are only called on quads that match a specific type. Here is an example of a visitor that counts the number of instructions that load from memory and the number of instructions that store into memory.
import joeq.Compiler.Quad.*;
public class LoadStoreCounter extends QuadVisitor.EmptyVisitor {
public int loadCount=0, storeCount=0;
public void visitLoad(Quad q) { loadCount++; }
public void visitStore(Quad q) { storeCount++; }
}
Examples
Example 1: Finding Thrown Exceptions
Let's write a pass that determines the set of uncaught exceptions that can be thrown by a method.
We will use a flow-insensitive technique. The problem is complicated by the fact that exceptions that are thrown can be caught within the method itself, and we don't want to include those exceptions in our result. Exception handlers are attached to the basic blocks that they protect, so we need to visit the basic blocks to learn about the exception handlers.
We will use one visitor object that is both a BasicBlockVisitor and a
QuadVisitor. The visitBasicBlock routine will cache the current basic block
and call visitQuads() to visit the quads.
import joeq.Compiler.Quad.*;
import joeq.Class.*;
import java.util.*;
import joeq.Main.Helper;
public class FindThrownExceptions extends QuadVisitor.EmptyVisitor implements BasicBlockVisitor {
private Set thrownExceptions = new HashSet();
private BasicBlock currentBasicBlock;
public void visitBasicBlock(BasicBlock bb) {
currentBasicBlock = bb;
Helper.runPass(bb, (QuadVisitor)this);
}
public void visitExceptionThrower(Quad q) {
for (jq_Class c : q.getThrownExceptions()) {
// need to make sure this exception type is not caught by a handler
if (currentBasicBlock.getExceptionHandlers().mustCatch(c) == null) {
thrownExceptions.add(c);
}
}
}
public String toString() {
return "Possible thrown exceptions:"+thrownExceptions;
}
public static void main(String[] args) {
jq_Class[] c = new jq_Class[args.length];
for (int i = 0; i < args.length; i++) {
c[i] = (jq_Class)Helper.load(args[i]);
}
FindThrownExceptions fte = new FindThrownExceptions();
for (int i = 0; i < args.length; i++) {
Helper.runPass(c[i], (BasicBlockVisitor)fte);
}
System.out.println(fte);
}
}
Now let's take a closer look. Our class, FindThrownExceptions, both
implements BasicBlockVisitor and extends QuadVisitor.EmptyVisitor. This means
that in order to call Helper.runPass, we need to cast it to indicate which
kind of Visitor we mean it to be treated as.
There are two fields. The first is thrownExceptions, which contains the
exceptions seen thus far. It is initialized to an empty set. The second is
currentBasicBlock, which keeps track of the current basic block as we iterate
through the quads. We need to keep track of the current basic block because
quads do not contain a reference to their enclosing basic block, and we need
the enclosing basic block to find the exception handlers.
The visitBasicBlock method simply sets the currentBasicBlock field and then
runs the Quad visitor on itself. That includes visitExceptionThrower, which
matches quads that can throw exceptions. We start by iterating over the types
of exceptions that the quad can throw, obtained via getThrownExceptions(). For
each of those exception types, if that type is not necessarily caught by an
exception handler, we add it to the set thrownExceptions.
Lastly, the main routine loads the classes from the command line, find all
uncaught exceptions they throw (the thrownExceptions set only grows in size),
and prints them out.
Example 2: "Using" Variables
In quad format, variables are stored in pseudo-registers. There are an infinite number of pseudo-registers. Pseudo-registers have an associated type: int, float, long, double, or object reference.
We will compute for each pseudo-register the set of quads that define that register ('def' set) and the set of quads that use that register ('use' set). This can be used, for example, to calculate 'gen' and 'kill' sets in the reaching definition dataflow problem.
This, again, is a flow-insensitive problem. We don't need to know about the basic blocks, so we simply make a quad visitor.
import joeq.Class.*;
import joeq.Compiler.Quad.*;
import joeq.Compiler.Quad.Operand.RegisterOperand;
import joeq.Compiler.Quad.RegisterFactory.Register;
import joeq.Main.Helper;
import java.util.*;
public class FindDefsAndUses extends QuadVisitor.EmptyVisitor {
private Map> registersToDefs = new HashMap>(); // Register -> List
private Map> registersToUses = new HashMap>(); // Register -> List
public void visitQuad(Quad q) {
for (RegisterOperand def : q.getDefinedRegisters())
addToMultiMap(registersToDefs, def.getRegister(), q);
for (RegisterOperand use : q.getUsedRegisters())
addToMultiMap(registersToUses, use.getRegister(), q);
}
private void addToMultiMap(Map> m, Register key, Quad value) {
List s = m.get(key);
if (s == null) m.put(key, s = new LinkedList());
s.add(value);
}
public String toString() {
return "Defs: "+registersToDefs+
"\nUses: "+registersToUses;
}
public static void main(String[] args) {
jq_Class[] c = new jq_Class[args.length];
for (int i = 0; i < args.length; i++) {
c[i] = (jq_Class)Helper.load(args[i]);
}
FindDefsAndUses fdu = new FindDefsAndUses();
for (int i = 0; i < args.length; i++) {
Helper.runPass(c[i], fdu);
}
System.out.println(fdu.toString());
}
}
The important point is that quads include methods
getDefinedRegisters() and getUsedRegisters() that
return a list of registers that are defined/used by that quad. The visitQuad
method simply iterates over those lists and adds the quads to a multi-map
keyed on the register.
Example 3: Manipulating control flow graphs
Now let's look at an example of control flow graph manipulation. Let's write a pass that will merge two basic blocks which are immediately adjacent where the first basic block has no other successors and the second basic block has no other predecessors. Obviously, we need to ignore the entry and exit basic blocks. Additionally, let's maintain exception semantics by requiring that the exception handler set is the same for both basic blocks.
Code goes here.
This one is a little more involved. When visiting a basic block, we look at the predecessors of the current basic block. By merging basic blocks with their predecessors rather than their successors, we can avoid having to re-compute the traversal order if we traverse forward through the graph.
We check that the basic block is not the exit node and that there is only one predecessor. Likewise, we check that the predecessor has only one successor and that the predecessor is not the entry node. Finally, we check the equality of the exception handlers. If all of these tests pass, the basic blocks are mergeable.
We check the last quad in the previous basic block; if it is a branch we remove it. We add all of the quads in the current basic block to the previous basic block, merge the flags of the two basic blocks, remove the successor edge between them, and add all successors of the second basic block to the first basic block. And the blocks are merged!
When should I use iterators versus visitors for IR elements?
These two techniques are interchangeable, but in some cases it is more convenient to use one rather than the other.
Visitors are good because they separate the traversal order from the operations performed on each element. Because they are objects, they can be passed around and be used to store results. They also allow advanced selection of objects based on their type, a la QuadVisitor.
However, visitors require a class definition and are therefore cumbersome for simple traversals. If a traversal is very simple and doesn't warrant a new class definition, it may be better to use an iterator.