Homework 6a: Interprocedural Analysis with BDDBDDB

CS243: Advanced Compiling Techniques, Winter 2009

Released: March 4, 2009

Due: 11:00 AM, Wednesday, March 11, 2009

Overview

In this homework, you will interpret the results of and extend some interprocedural program analyses using the BDDBDDB framework.

You may work in groups of one or two for this homework.

Getting started

  1. Download hw6a.tar.gz and extract its contents. Alternatively, SSH to a myth or other CS Unix machine and get the file at:

    /usr/class/cs243/bddbddb/hw6a.tar.gz

For this project we recommend using the CS myth machines.

  1. Change into the hw6a/ directory and type make. This should compile AddressBook.java, which is the program that we will be analyzing throughout this homework. Take a few minutes to familiarize yourself with the contents of AddressBook.java. If you read carefully, you will notice that it contains a few bugs, but that is what we are going to try to detect by writing some simple analyses in Datalog!

  2. Execute: python run_bddbddb.py AddressBook pacs.dtl

    This will run the context-sensitive pointer analysis (written in Datalog as pacs.dtl) on the Java program AddressBook, which you've just compiled. There may be some warning messages printed to the terminal. This is normal, so don't worry about it. As long as it prints Done at the end, then it probably ran successfully.

  3. If the analysis ran successfully, there will now be a results/ sub-directory containing various .bdd, .map, .tuples and other files. (Don't worry about what these files mean for now.)

  4. Execute: python print_bddbddb_results.py AddressBook results/hP.tuples

    This will print out all of the hP points-to relations from the current run in a (reasonably) human-readable text format. You'll soon find out what hP means, so don't worry about it for now. (Actually, it doesn't print out all of the relations; it only prints the ones relevant to AddressBook. There are many more relations generated by Java Standard Library code that this tiny program imports.)

If the last step successfully outputs lines of text to your terminal, then you are all set and ready to begin this homework! If you've gotten stuck or can't get any output, please post your questions to the newsgroup.

Introduction to program analysis using BDDBDDB

BDDBDDB is an implementation of the Datalog logic programming language that allows programmers to specify program analyses in a compact declarative manner.

Let's briefly run through how a Java program proceeds through an analysis using BDDBDDB when you execute this script:

python run_bddbddb.py <main class name> <Datalog filename>
  1. The Joeq front-end (a program called joeq.Main.GenRelations) extracts components (i.e., variables, types) from the .class bytecode file supplied on the command line and encodes them into a BDD (binary decision diagram) representation, stored in the results/*.bdd files.

  2. The solver executes the specified Datalog file on the BDD representation and outputs tuples files in results/*.tuples. If you view those files, they are simply tuples of numbers. The results/*.map files map between these numbers and human-readable source-level names.

Then when you execute this script:

python print_bddbddb_results.py <main class name> <tuples filename>

it will print out a set of tuples to the command-line.

We encourage you to take your own Java programs and experiment with running analyses on them. The simplest way to get started is to compile your programs in the same directory as our scripts. Let us know if you have problems setting up the infrastructure. Note that BDDBDDB is sensitive to Java memory limits and if you are getting OutOfMemory errors with larger test programs you should edit run_bddbddb.py to increase these limits (the -mx command line argument).

Basic Datalog syntax

Let's use pacs.dtl (context-sensitive pointer analysis) as an example. We will step through parts of the file in turn:

First, the header includes various implementation-specific declarations, such as what directory to output the results of the analysis (e.g., results/) and what file to use to refer to the domains of interest (e.g., results/fielddomains.pa). We've annotated that file with plenty of comments, so hopefully those will be useful:

.basedir "results"
.include "fielddomains.pa"

# Here are the abbreviations for domains used in the relations:
# - V: variables
# - H: heap objects, named by invocation sites of object creation methods
# - T: types
# - F: fields
# - I: invocation sites
# - Z: numbers for parameters
# - N: names
# - M: methods
# - C: classes

The next section declares all of the relations:

###### Relations

# All input relations are extracted by Joeq front-end

# variable 'v' points to heap object 'h'
vP0 (v:V0, h:H0) input
# ha.field may point to hb
hP0 (ha:H0, field:F0, hb:H1) input

# statement: dest = source (includes passing of arguments and return values)
A (dest:V0, source:V1) input
# statement: base.field = src
S (base:V0, field:F0, src:V1) input
# statement: dest = base.field
L (base:V0, field:F0, dest:V1) input

# actual parameters for invocation sites:
#   'actualparam' is passed as parameter number 'num' at 'invoke'
actual (invoke:I0, num:Z0, actualparam:V1) input
# formal parameters for methods:
#   the formal parameter number 'num' of 'method' is represented as 'formalparam'
formal (method:M0, num:Z0, formalparam:V0) input
# Method 'method' contains an invocation site 'invoke' with virtual
# method name 'name'
mI (method:M0, invoke:I0, name:N0) input
# Statically-bound method invocation edges:
#   invocation site 'invoke' calls method 'target'
IE0 (invoke:I0, target:M0) input

# variable 'var' has type 'type'
vT (var:V0, type:T0) input
# heap object 'heap' has type 'type'
hT (heap:H0, type:T1) input
# type T1 is assignable to type T0 (based on sub-classing relationship)
aT (type:T0, type:T1) input
# It is type-safe to assign heap object 'h' to variable 'v'
vPfilter (v:V0, h:H0)

# virtual method dispatch info: 
#   method 'method' is the target of dispatching method name 'name' on type 'type'
cha (type:T1, name:N0, method:M0) input
# 'v' is a local variable in method 'm'
mV(m : M0, v : V0) input

# 'v' is the return value of method 'method'
Mret (method:M0, v:V1) input
# 'v' is return value of invocation site 'invoke'
Iret (invoke:I0, v:V0) input

IEnum (invoke:I0, target:M0, ccaller:VC1, ccallee:VC0) input

#Mthr (method:M0, v:V1) input
#Ithr (invoke:I0, v:V0) input
#IE1 (invoke:I0, target:M0)

# Context-sensitive output relations (cannot print as outputtuples
# because there are astronomical numbers of contexts)
IEcs (ccaller:VC0, invoke:I0, ccallee:VC1, target:M0) output
cA (vcdest:VC0, dest:V0, vcsrc:VC1, source:V1) output
cvP (vc:VC0, v:V0, h:H0) output


# These relations are output as tuples that can be 
# printed by print_bddbddb_results.py

# field 'ha.field' of heap object 'ha' may point to heap object 'hb'
hP (ha:H0, field:F0, hb:H1) outputtuples
# variable 'v' may point to heap object 'h'
vP (v:V0, h:H0) outputtuples
# invocation site 'invoke' may call method 'target'
IE (invoke:I0, target:M0) outputtuples


### Insert your new relations below here:

Each relation contains one or more terms, each of which may take on values from a particular domain. input (extensional database) relations are constructed from the BDD representation that the Joeq front-end builds from the source program. output and outputtuples (intensional database) relations are inferred by running the Datalog program. Any relations marked as outputtuples will have their tuples dumped to results/*.tuples files, which can later be printed in human-readable format.

Each relation is like a 'declaration' in an imperative programming language. For example, the hP relation which you will be mainly dealing with in this assignment is shown below:

# field 'ha.field' of heap object 'ha' may point to heap object 'hb'
hP (ha:H0, field:F0, hb:H1) outputtuples

The name of the relation is listed, along with its terms: ha, field, and hb. Each term can take on values drawn from a domain. Here, the H stands for the domain of heap-allocated objects, and the F stands for the domain of field names (ignore the numbers like H0 and H1 for now). Because outputtuples is specified, when the analysis is run, the tuples will be written to a file called results/hP.tuples. Recall that to print out the hP relations, the command we used was:

python print_bddbddb_results.py AddressBook results/hP.tuples

Now you know where that tuples file comes from.

The meaning of each relation is documented briefly in the comments above each line. If necessary, please refer to Chapter 3 of John Whaley's Ph.D. thesis (the creator of BDDBDDB) for more detailed definitions.

The final section consists of the rules which tell Datalog how to infer new relations from the input relations:

###### Rules

cvP(_,v,h) :- vP0(v,h).
cA(_,v1,_,v2) :- A(v1,v2).
IEcs(vc2,i,vc1,m) :- IE0(i,m), IEnum(i,m,vc2,vc1).
vPfilter(v,h) :- vT(v,tv), aT(tv,th), hT(h,th).
cA(vc1,v1,vc2,v2) :- formal(m,z,v1), IEcs(vc2,i,vc1,m), actual(i,z,v2).
cA(vc2,v2,vc1,v1) :- Mret(m,v1), IEcs(vc2,i,vc1,m), Iret(i,v2).
cvP(vc1,v1,h) :- cA(vc1,v1,vc2,v2), cvP(vc2,v2,h), vPfilter(v1,h).
hP(h1,f,h2) :- S(v1,f,v2), cvP(vc1,v1,h1), cvP(vc1,v2,h2).
cvP(vc1,v2,h2) :- L(v1,f,v2), cvP(vc1,v1,h1), hP(h1,f,h2), vPfilter(v2,h2). split
vP(v, h) :- cvP(_, v, h).
IE(i, m) :- IEcs (_,i,_,m).

### Insert your new rules below here:

The relation on the lefthand side of each rule is true if all relations on the righthand side are true (the two sides are separated by the :- symbol). Relations on the righthand side are separated by commas, and each line must end with a period. The special _ symbol represents don't care, so any term can be substituted in its place. For example, this rule:

cvP(_,v,h) :- vP0(v,h).

means that in all possible contexts, cvP is true for a given v and h if vP0(v,h) is true.

References

Part A: Interpreting the results of the pointer analysis

Let's now study the output from running: python print_bddbddb_results.py AddressBook results/hP.tuples

The hP relation specifies that a particular field in a particular heap-allocated object may point to another heap-allocated object. Let's look at one relation in particular, which you should find in your output:

AddressBook.java:63 Concrete: MaleRecord @ 6 || Record/name || AddressBook.java:37 Concrete: String @ 1

The || string separates each line into components. There are three components in this particular relation:

  1. AddressBook.java:63 Concrete: MaleRecord @ 6
  2. Record/name
  3. AddressBook.java:37 Concrete: String @ 1

The first represents a heap-allocated object, in this case a MaleRecord object allocated in line 63 of AddressBook.java (look at the corresponding place in the source code to convince yourself of what line performed the allocation).

The second represents the name field of the Record class.

The third represents a heap-allocated object, in this case a String object allocated at line 37 ("Joe").

What this line says is: The field name of the MaleRecord object allocated on line 63 of AddressBook.java can point to the String "Joe" allocated on line 37. Go ahead and inspect the code manually to make sure that this relation makes sense to you.

For some points-to relations the first component is the value global(null). This is used for static fields.

Problem 1

Copy the output of running python print_bddbddb_results.py AddressBook results/hP.tuples into a text file named submit/problem1.txt. There should be 26 total lines, each representing one hP relation. For every line or group of lines, write out what the points-to relation intuitively means in terms of what someone inspecting the source code can understand. For example, for the line:

AddressBook.java:63 Concrete: MaleRecord @ 6 || Record/name || AddressBook.java:37 Concrete: String @ 1

an appropriate answer would be:

AddressBook.java:63 Concrete: MaleRecord @ 6 || Record/name || AddressBook.java:37 Concrete: String @ 1
  - Line 63 ('Record joeRecord = new MaleRecord(joe, joeEmail);') sets
    joeRecord.name to point to the String "Joe" allocated on line 37.

To avoid tedious repetition, you can write one explanation for an entire group of lines. Don't stress too much about the details for this problem; we just want to make sure that you build up some intuition about what the results of a pointer analysis means.

Problem 2

There is a bug in AddressBook.java:main() where it forgets to add some Record object into the address book. Which Record object is it, and more importantly, by looking at the output of the pointer analysis that you created in Problem 1, what relations in particular could you consult in order to corroborate your answer? In other words, if the program had added that record into the address book, how would the hP relations look different? (You can try to fix the bug yourself, re-compile, and re-run the analysis to see for sure.) Write your answer in a file called submit/problem2.txt.

(Note: Apologies if this problem seems like a contrived example, which it is, but the principle is that for real programs of considerable size, it's infeasible to answer questions like this one by inspecting the source code. However, the pointer analysis we use can scale up to hundreds of thousands of lines of code.)

Part B: Building a simple alias analysis

The goal of a pointer analysis is to answer questions like: For an object X, what objects can the fields of X possibly point to? The goal of an alias analysis is to answer questions like: For objects X and Y with fields fX and fY, respectively, can X.fX and Y.fY possibly point to the same object Z?

Problem 3

By doing Problems 1 and 2, you will likely notice that the alias analysis results can be extracted from the pointer analysis results. In this problem, you will write a simple alias analysis in Datalog atop of the existing pointer analysis.

First, copy pacs.dtl into a new file called alias.dtl

Next, add a new relation in alias.dtl called hALIAS with the following declaration:

### Insert your new relations below here:

# Can h1.field1 and h2.field2 point to the same object?
hALIAS (h1:H0, field1:F0, h2:H1, field2:F1) outputtuples

Add a new rule (or multiple rules if you'd like) such that the hALIAS relation returns all of the pairs h1.field1 and h2.field2 which can point to the same object.

Then you can test your new analysis by running it:

python run_bddbddb.py AddressBook alias.dtl

and printing out the hALIAS relations to your terminal:

python print_bddbddb_results.py AddressBook results/hALIAS.tuples

When you are satisfied that your analysis has met the specs, copy your alias.dtl file into the submit/ sub-directory. Also, copy the results of printing out the hALIAS relations into a file called submit/problem3.txt. At the top of that file, write a paragraph describing what parts of this output you found to be useful and what parts you found to be not useful (i.e., noisy output). Don't stress about details on this part; we just want you to build up some intuition about interpreting the results of your own analysis.

To sanity-check your results, your analysis should be able to report that joeRecord.name and joeAltRecord.name are aliased.

Problem 4

Now you will refine the alias analysis you created in Problem 3 so that it produces less output (actually, so that it produces more relevant output for one specific debugging task). You want to be able to use your refined analysis to detect the following bug in AddressBook.java:

66.    Record maryRecord = new FemaleRecord(mary, maryEmail);
67.    Record janeRecord = new FemaleRecord(jane, maryEmail);

janeRecord should have been constructed with janeEmail rather than maryEmail (this might be due to a copy-and-paste error, for instance). Due to this bug, maryRecord.emailAddr and janeRecord.emailAddr are aliased, when in fact they should not be. Copy alias.dtl into alias_refined.dtl and modify the rules in alias_refined.dtl such that aliasing relationships are only reported when:

For instance, we don't care that maryRecord.emailAddr aliases maryRecord.emailAddr (because they belong to the same heap object, and in fact, they are the same field), nor that the static String field joe refers to the same string object as joeRecord.name (because the base objects have different types).

The expected hALIAS result from running alias_refined.dtl should be:

AddressBook.java:63 Concrete: MaleRecord @ 6 || Record/name || AddressBook.java:69 Concrete: MaleRecord @ 34 || Record/name
AddressBook.java:69 Concrete: MaleRecord @ 34 || Record/name || AddressBook.java:63 Concrete: MaleRecord @ 6 || Record/name
AddressBook.java:66 Concrete: FemaleRecord @ 20 || Record/emailAddr || AddressBook.java:67 Concrete: FemaleRecord @ 27 || Record/emailAddr
AddressBook.java:67 Concrete: FemaleRecord @ 27 || Record/emailAddr || AddressBook.java:66 Concrete: FemaleRecord @ 20 || Record/emailAddr

which says that joeRecord.name and joeAltRecord.name are aliased and that maryRecord.emailAddr and janeRecord.emailAddr are aliased. There should be no other output.

Hints: != is the inequality operator in Datalog. hT is the relation that maps a heap object to its type.

Remember to copy alias_refined.dtl into submit/ when you are done.

Part C: Field-sensitive versus field-insensitive pointer analysis

The pointer analysis we've been studying thus far is field sensitive, which means that it makes a distinction between different fields within an object. Field sensitivity is easy to achieve in Java because it is type-safe and pointer arithmetic is not allowed. In contrast, pointer analyses for C are usually field insensitive, due to the difficulties of distinguishing between field accesses in a sound manner (since pointer arithmetic can be used to access arbitrary fields in a type-unsafe way).

Problem 5

Your task is to modify the rules in pacs.dtl to simulate a field insensitive analysis.

First copy pacs.dtl to a new file named pacs_field_insensitive.dtl.

Then add a printsize directive to the declaration of the hP relation:

hP (ha:H0, field:F0, hb:H1) outputtuples printsize

Now run pacs_field_insensitive.dtl, and when it finishes, it should print out how many total tuples are in the hP relation:

Saving results: SIZE OF hP: 105.

There is a lot of other output, but look for the Saving results: line. There should be 105 tuples in hP. (Actually, depending on your machine type, there may be 110 tuples, so don't worry about the exact number, as long as it's in this ballpark.) Note that this includes tuples from code in the Java Standard Library, which are not printed out by print_bddbddb_results.py.

Now modify the declaration of hP to remove the field term:

hP (ha:H0, hb:H1) outputtuples printsize

Finally, modify the rules involving hP to make it field insensitive. That is, hP is true whenever any field of ha may point to hb. (Hint: Remember that the _ term means don't care.)

Copy pacs_field_insensitive.dtl into submit/ when you are done.

Copy the output of running python print_bddbddb_results.py AddressBook results/hP.tuples on the results of your field-insensitive analysis into a text file named submit/problem5.txt. In that same file, record the size of the hP relation. Did it increase or decrease from its original value of 105? Why? How does your set of tuples differ from the originals in submit/problem1.txt and why are they different?

Part D: Context-sensitive versus context-insensitive pointer analysis

As you've learned in class, a context-insensitive analysis can be much faster to compute than a context-sensitive analysis, but it is often less precise. In this problem, you will compare the results from context-sensitive and context-insensitive analyses.

Problem 6

pa.dtl contains a context-insensitive pointer analysis. Run it using:

python run_bddbddb.py AddressBook pa.dtl

and print out the resulting hP results using:

python print_bddbddb_results.py AddressBook results/hP.tuples

Copy the results into submit/problem6.txt and, in that same file, write a comparison of these results with the context-sensitive pointer analysis results in submit/problem1.txt. In particular, point out where the context-insensitive analysis leads to less precise results, and how the precision is lost (by inspecting what occurs in AddressBook.java). Consult Chapter 12 of the Dragon book for hints.

Submission Logistics

To submit your solutions, package up the submit directory:

myth1:~/hw6a$ tar -czf hw6a_submission.tar.gz submit

and email the archive to the staff list. Your submit directory should contain the following files:

problem1.txt
problem2.txt
alias.dtl
problem3.txt
alias_refined.dtl
pacs_field_insensitive.dtl
problem5.txt
problem6.txt

If you're working with a partner, indicate their name and SUNETID along with the email. Only one submission is required per pair. If you want to resubmit a new solution (before the due date), simply email us the new submission. We'll take the latest version.