Static Analysis

Control Flow Graphs (CFG)

Construction

Nodes (Basic Blocks)

INPUT: a sequence of three-address instructions of P
OUTPUT: a list of basic blocks of P
METHOD:

determine the leaders in P
- the first instruction in P is a leader
- any target instruction of a conditional or unconditional jump is a leader
- any instruction that immediately follows a conditional or unconditional jump is a leader
build BBs for P
- a BB consists of a leader and all its subsequent instructions until the next leader
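
The leader rules above can be sketched in Python. This is only an illustration under assumptions: the `Instr` encoding (jump targets given as instruction indices) and the names `find_leaders` and `build_basic_blocks` are hypothetical and not part of the notes.

```python
from typing import List, NamedTuple, Optional


class Instr(NamedTuple):
    """Hypothetical three-address instruction; jump targets are instruction indices."""
    text: str
    kind: str = "plain"            # "plain", "if" (conditional) or "goto" (unconditional)
    target: Optional[int] = None   # index of the jump target, if any


def find_leaders(program: List[Instr]) -> List[int]:
    leaders = set()
    if program:
        leaders.add(0)                      # the first instruction is a leader
    for i, ins in enumerate(program):
        if ins.kind in ("if", "goto"):
            leaders.add(ins.target)         # the target of a jump is a leader
            if i + 1 < len(program):
                leaders.add(i + 1)          # the instruction after a jump is a leader
    return sorted(leaders)


def build_basic_blocks(program: List[Instr]) -> List[List[int]]:
    # A block runs from one leader up to (but not including) the next leader.
    leaders = find_leaders(program)
    bounds = leaders + [len(program)]
    return [list(range(bounds[j], bounds[j + 1])) for j in range(len(leaders))]


PROGRAM = [
    Instr("i = 0"),
    Instr("if i >= 10 goto 5", "if", 5),
    Instr("s = s + i"),
    Instr("i = i + 1"),
    Instr("goto 1", "goto", 1),
    Instr("return s"),
]
BLOCKS = build_basic_blocks(PROGRAM)
print(BLOCKS)  # [[0], [1], [2, 3, 4], [5]]
```

Blocks are returned as lists of instruction indices so that the original instruction order stays visible.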

Edges

There is an edge from block A to block B if and only if one of the following holds:
- there is a conditional or unconditional jump from the end of A to the beginning of B
- B immediately follows A in the original order of instructions and A does not end in an unconditional jump
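
A minimal sketch of the two edge rules, assuming each block is summarized by a hypothetical tuple `(first_index, last_index, kind, target)` in which `kind` and `target` describe the block's last instruction. The encoding and the name `build_edges` are illustrative assumptions.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical block summary: (first_index, last_index, kind, target), where
# kind is "plain", "if" or "goto" and target is the index of the jump target.
Block = Tuple[int, int, str, Optional[int]]


def build_edges(blocks: List[Block]) -> Set[Tuple[int, int]]:
    # Map the index of each leader to the number of the block it starts.
    block_of_leader = {first: b for b, (first, _, _, _) in enumerate(blocks)}
    edges = set()
    for b, (_, _, kind, target) in enumerate(blocks):
        if kind in ("if", "goto"):
            # Jump from the end of this block to the beginning of the target block.
            edges.add((b, block_of_leader[target]))
        if kind != "goto" and b + 1 < len(blocks):
            # Fall through to the next block unless the block ends in an unconditional jump.
            edges.add((b, b + 1))
    return edges


BLOCKS = [
    (0, 0, "plain", None),
    (1, 1, "if", 5),
    (2, 4, "goto", 1),
    (5, 5, "plain", None),
]
EDGES = build_edges(BLOCKS)
print(sorted(EDGES))  # [(0, 1), (1, 2), (1, 3), (2, 1)]
```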

Data Flow Analysis

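Let the value domain of a data-flow analysis be $V$. For a CFG with $k$ nodes we can define a k-tuple $(OUT[n_{1}], OUT[n_{2}], \dots, OUT[n_{k}])$. It is an element of the product set $V^{k} = V_{1} \times V_{2} \times \dots \times V_{k}$ and represents the values of all $k$ nodes after one iteration.

Each iteration can be viewed as mapping an element of $V^{k}$ to a new element of $V^{k}$ by applying the transfer functions and the control-flow handling. We write this mapping as a function $F : V^{k} \to V^{k}$.

The algorithm keeps iterating and stops when the k-tuples of two consecutive iterations are equal.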

$$
\begin{aligned}
X_{0} &= (\bot, \bot, \dots, \bot) \\
X_{1} &= (v_{1}^{1}, v_{2}^{1}, \dots, v_{k}^{1}) = F(X_{0}) \\
X_{2} &= (v_{1}^{2}, v_{2}^{2}, \dots, v_{k}^{2}) = F(X_{1}) \\
&\;\;\vdots \\
X_{i} &= (v_{1}^{i}, v_{2}^{i}, \dots, v_{k}^{i}) = F(X_{i - 1}) \\
X_{i + 1} &= (v_{1}^{i + 1}, v_{2}^{i + 1}, \dots, v_{k}^{i + 1}) = F(X_{i})
\end{aligned}
$$

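Data-flow analysis can therefore be seen as an iterative algorithm that repeatedly applies transfer functions and meet/join operations to the values of a lattice.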
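
The iteration $X_{i+1} = F(X_{i})$ can be sketched as a small loop. The function `F` below is a made-up example over three nodes whose values are sets of strings, with the empty set standing in for $\bot$. It is an assumption for illustration only, not an analysis taken from these notes.

```python
from typing import Callable, FrozenSet, Tuple

Value = FrozenSet[str]
State = Tuple[Value, ...]          # the k-tuple (OUT[n1], ..., OUT[nk])


def iterate_to_fixed_point(F: Callable[[State], State], x0: State) -> Tuple[State, int]:
    """Apply F repeatedly until two consecutive k-tuples are equal."""
    x, steps = x0, 0
    while True:
        nxt = F(x)
        if nxt == x:               # X_{i+1} == X_i: fixed point reached
            return x, steps
        x, steps = nxt, steps + 1


def F(x: State) -> State:
    # Hypothetical rules for a three-node chain n1 -> n2 -> n3.
    return (
        frozenset({"a"}),          # OUT[n1] always produces {"a"}
        x[0] | {"b"},              # OUT[n2] = OUT[n1] plus "b"
        x[1] | x[2],               # OUT[n3] accumulates OUT[n2]
    )


BOTTOM: State = (frozenset(), frozenset(), frozenset())
RESULT, STEPS = iterate_to_fixed_point(F, BOTTOM)
print(RESULT, STEPS)
```

In this example the tuple stops changing after three applications of `F`. Termination in general relies on `F` being monotone over a lattice of finite height.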

Data Flow Analysis - Applications

Each node is associated with a Transfer Function according to the semantics of statements.
Each program point is associated with a data-flow value that represents an abstraction of the set of all possible program states that can be observed at that point.

Forward Analysis: OUT[stm] = f(IN[stm])
Backward Analysis: IN[stm] = f(OUT[stm])

Data-flow analysis finds a solution to a set of safe-approximation-directed constraints on IN[stm] and OUT[stm], for all statements stm.

Control Flow's Constraints

Control flow within a Basic Block: $IN [s_{i + 1}] = OUT [s_{i}],$ for all $i = 1, 2, \dots, n - 1$
Control flow among Basic Blocks:
- $IN[B] = IN[s_{1}]$ and $OUT[B] = OUT[s_{n}]$
- for Forward Analysis:
  - $OUT[B] = f_{B}(IN[B]), \quad f_{B} = f_{s_{n}} \circ \dots \circ f_{s_{2}} \circ f_{s_{1}}$
  - $IN[B] = \bigwedge_{P \text{ a predecessor of } B} OUT[P]$
- for Backward Analysis:
  - $IN[B] = f_{B}(OUT[B]), \quad f_{B} = f_{s_{1}} \circ \dots \circ f_{s_{n - 1}} \circ f_{s_{n}}$
  - $OUT[B] = \bigwedge_{S \text{ a successor of } B} IN[S]$
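
The forward constraints can be sketched as a round-robin solver. Everything concrete here is an assumption for illustration: data-flow values are sets of definition labels, the meet is set union, and statements use a hypothetical gen/kill transfer function. Only the shape of the constraints (the composition $f_{B}$ and the meet over predecessors) comes from the notes.

```python
from functools import reduce
from typing import Callable, Dict, FrozenSet, Iterable, List

Value = FrozenSet[str]
Transfer = Callable[[Value], Value]


def block_transfer(stmts: List[Transfer]) -> Transfer:
    """f_B = f_sn o ... o f_s1: apply the statement functions in program order."""
    def f_B(value: Value) -> Value:
        for f_s in stmts:
            value = f_s(value)
        return value
    return f_B


def solve_forward(blocks: Dict[str, List[Transfer]], preds: Dict[str, List[str]],
                  meet: Callable[[Value, Value], Value],
                  init: Value, entry_in: Value) -> Dict[str, Value]:
    f = {b: block_transfer(stmts) for b, stmts in blocks.items()}
    out = {b: init for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            incoming = [out[p] for p in preds[b]]
            # IN[B] is the meet over all predecessors; blocks without
            # predecessors receive the boundary value entry_in.
            in_b = reduce(meet, incoming) if incoming else entry_in
            new_out = f[b](in_b)               # OUT[B] = f_B(IN[B])
            if new_out != out[b]:
                out[b], changed = new_out, True
    return out


def gen_kill(gen: Iterable[str], kill: Iterable[str]) -> Transfer:
    """Hypothetical statement transfer function: (value - kill) | gen."""
    gen, kill = frozenset(gen), frozenset(kill)
    return lambda value: (value - kill) | gen


BLOCKS = {
    "B1": [gen_kill({"d1"}, {"d2"})],
    "B2": [gen_kill({"d2"}, {"d1"}), gen_kill({"d3"}, set())],
    "B3": [],
}
PREDS = {"B1": [], "B2": ["B1"], "B3": ["B1", "B2"]}
OUT = solve_forward(BLOCKS, PREDS, meet=lambda a, b: a | b,
                    init=frozenset(), entry_in=frozenset())
print({b: sorted(v) for b, v in OUT.items()})
```

A backward analysis would follow the same pattern with successors in place of predecessors and the statement functions applied in reverse order.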

In most cases an optimization application needs conservative approximations. If the analysis delivers incorrect information, the optimization may be unsound and may change the original semantics of the program.

A canonical choice of pointer targets is to introduce a target $\&id$ for every variable named $id$ and a target $malloc_{i}$ for every allocation site, where $i$ is a unique index.

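Points-to analysis operates on the syntax tree, because it takes place before or together with control-flow analysis. Its result is a function $pt(pointer)$ that returns the set of possible pointer targets. To decide whether two pointers $p$ and $q$ may be aliases, a safe approach is to check whether the intersection $pt(p) \cap pt(q)$ is non-empty.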
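
As a small illustration of the alias check, the sketch below treats the points-to function as a plain dictionary from pointer names to sets of targets. The dictionary `PT`, its contents, and the name `may_alias` are hypothetical.

```python
from typing import Dict, Set


def may_alias(pt: Dict[str, Set[str]], p: str, q: str) -> bool:
    """p and q may be aliases if their points-to sets share a target."""
    return bool(pt[p] & pt[q])


# Hypothetical analysis result for three pointers.
PT = {
    "p": {"&x", "malloc_1"},
    "q": {"malloc_1"},
    "r": {"&y"},
}
print(may_alias(PT, "p", "q"))  # True
print(may_alias(PT, "p", "r"))  # False
```

A `False` answer is only trustworthy if the points-to sets are conservative over-approximations.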

Andersen’s Algorithm

For each variable named $id$, the set $[[id]]$ denotes all of its possible pointer targets.

The analysis assumes that the program has been normalized, so that pointer manipulation is limited to the following kinds of statements:

1. $id = malloc$ generates the constraint $\{malloc_{i}\} \subseteq [[id]]$
2. $id_{1} = \&id_{2}$ generates the constraint $\{\&id_{2}\} \subseteq [[id_{1}]]$
3. $id_{1} = id_{2}$ generates the constraint $[[id_{2}]] \subseteq [[id_{1}]]$
4. $id_{1} = *id_{2}$ generates the constraint $\&id \in [[id_{2}]] \Rightarrow [[id]] \subseteq [[id_{1}]]$
5. $*id_{1} = id_{2}$ generates the constraint $\&id \in [[id_{1}]] \Rightarrow [[id_{2}]] \subseteq [[id]]$
6. $id = null$ generates the constraint $\emptyset \subseteq [[id]]$, which can safely be ignored

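We end up with five kinds of constraints (the null assignment contributes nothing). Solving these constraints gives the result of the algorithm.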
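
The constraints can be solved by naive iteration until nothing changes. The sketch below assumes a hypothetical tuple encoding of the normalized statements (`"alloc"`, `"addr"`, `"copy"`, `"load"`, `"store"`) and represents targets as the strings `"&id"` and `"malloc_i"`. It is an illustration, not an efficient solver.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Stmt = Tuple[str, str, object]   # hypothetical encoding: (kind, left operand, right operand)


def andersen(stmts: List[Stmt]) -> Dict[str, Set[str]]:
    pts: Dict[str, Set[str]] = defaultdict(set)   # pts[id] plays the role of [[id]]
    changed = True

    def include(dst: str, new: Set[str]) -> None:
        """Enforce new ⊆ [[dst]] and record whether anything was added."""
        nonlocal changed
        if not new <= pts[dst]:
            pts[dst] |= new
            changed = True

    while changed:
        changed = False
        for kind, a, b in stmts:
            if kind == "alloc":                   # a = malloc (site b)
                include(a, {f"malloc_{b}"})
            elif kind == "addr":                  # a = &b
                include(a, {f"&{b}"})
            elif kind == "copy":                  # a = b
                include(a, pts[b])
            elif kind == "load":                  # a = *b
                for t in list(pts[b]):
                    if t.startswith("&"):
                        include(a, pts[t[1:]])
            elif kind == "store":                 # *a = b
                for t in list(pts[a]):
                    if t.startswith("&"):
                        include(t[1:], pts[b])
    return dict(pts)


PROGRAM = [
    ("alloc", "p", 1),    # p = malloc
    ("copy", "x", "y"),   # x = y
    ("copy", "x", "z"),   # x = z
    ("store", "p", "z"),  # *p = z
    ("copy", "p", "q"),   # p = q
    ("addr", "q", "y"),   # q = &y
    ("load", "x", "p"),   # x = *p
    ("addr", "p", "z"),   # p = &z
]
PTS = andersen(PROGRAM)
print(sorted(PTS["p"]), sorted(PTS["q"]))
```

Following the constraints as stated above, the conditional rules only fire for targets of the form `&id`, so cells reached through `malloc_i` are not tracked further in this sketch.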

Steensgaard’s Algorithm

This is a coarser, less precise algorithm, obtained by viewing assignments as being bidirectional.

TODO

Solving these constraints gives the result of the algorithm. The resulting points-to function is defined as: $pt(p) = \{\&id \mid *p \sim id\} \cup \{malloc_{i} \mid *p \sim malloc_{i}\}$
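
Since the constraint rules are not spelled out here, the following is only a rough sketch of how a unification-based solver could look. It assumes the commonly used rules ($*id \sim malloc_{i}$ for allocation, $*id_{1} \sim id_{2}$ for address-of and store, $id_{1} \sim id_{2}$ for copy, $id_{1} \sim *id_{2}$ for load) together with the closure rule $\alpha_{1} \sim \alpha_{2} \Rightarrow *\alpha_{1} \sim *\alpha_{2}$. The class `Steensgaard` and the statement encoding are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set


class Steensgaard:
    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}    # union-find forest over terms
        self.pointee: Dict[Hashable, Hashable] = {}   # class representative -> term for its dereference

    def find(self, t: Hashable) -> Hashable:
        self.parent.setdefault(t, t)
        while self.parent[t] != t:
            self.parent[t] = self.parent[self.parent[t]]   # path halving
            t = self.parent[t]
        return t

    def deref(self, t: Hashable) -> Hashable:
        """Representative of *t, creating a fresh placeholder term on demand."""
        r = self.find(t)
        if r not in self.pointee:
            self.pointee[r] = ("*", r)
        return self.find(self.pointee[r])

    def unify(self, a: Hashable, b: Hashable) -> None:
        a, b = self.find(a), self.find(b)
        if a == b:
            return
        self.parent[a] = b
        pa = self.pointee.pop(a, None)
        if pa is not None:
            if b in self.pointee:
                self.unify(pa, self.pointee[b])   # closure rule: also merge the dereferences
            else:
                self.pointee[b] = pa

    def assign(self, kind: str, a: str, b: object) -> None:
        if kind == "alloc":                       # a = malloc (site b)
            self.unify(self.deref(a), f"malloc_{b}")
        elif kind in ("addr", "store"):           # a = &b   or   *a = b
            self.unify(self.deref(a), b)
        elif kind == "copy":                      # a = b
            self.unify(a, b)
        elif kind == "load":                      # a = *b
            self.unify(a, self.deref(b))

    def pt(self, p: str, variables: Iterable[str], allocs: Iterable[str] = ()) -> Set[str]:
        cell = self.deref(p)
        targets = {f"&{v}" for v in variables if self.find(v) == cell}
        return targets | {m for m in allocs if self.find(m) == cell}


S = Steensgaard()
for stmt in [("addr", "p", "x"), ("addr", "q", "y"), ("copy", "p", "q")]:
    S.assign(*stmt)
VARS = ["p", "q", "x", "y"]
print(sorted(S.pt("p", VARS)), sorted(S.pt("q", VARS)))
```

In this example the copy `p = q` merges the classes of `p` and `q`, so `q` also ends up pointing to `x`. This illustrates the precision lost by treating assignments as bidirectional.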

Interprocedural Points-To Analysis

Function pointers may also involve indirect references, so control-flow analysis and points-to analysis have to be performed at the same time. For example: (***x)(1,2,3);

We can simplify the program so that every function call has the form id1 = (id2)(a1, ..., an); and, similarly, all return expressions are assumed to be just variables.

So far we have treated the heap as an amorphous structure and have focused almost entirely on stack-based variables. Shape analysis allows a more detailed analysis of the heap.

A shape graph is a directed graph in which every node is a pointer target. Shape graphs are ordered by inclusion of their sets of edges. Thus, $\bot$ is the graph without edges and $\top$ is the completely connected graph.
Pointer targets represent the memory cells that may be created during execution, and an edge means that a reference may exist between the two cells.
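
The ordering on shape graphs can be sketched by representing a graph as a set of edges over a fixed set of pointer targets. The target names and helper functions below are hypothetical. Counting self-loops as part of the completely connected graph is an assumption, and taking the join to be the union of edge sets follows from the inclusion order.

```python
from itertools import product
from typing import FrozenSet, Iterable, Tuple

Edge = Tuple[str, str]
ShapeGraph = FrozenSet[Edge]      # nodes are fixed; a graph is its set of edges


def bottom() -> ShapeGraph:
    return frozenset()            # the graph without edges


def top(targets: Iterable[str]) -> ShapeGraph:
    # Completely connected graph; self-loops are included by assumption.
    return frozenset(product(list(targets), repeat=2))


def leq(g1: ShapeGraph, g2: ShapeGraph) -> bool:
    return g1 <= g2               # ordered by inclusion of edge sets


def join(g1: ShapeGraph, g2: ShapeGraph) -> ShapeGraph:
    return g1 | g2                # least upper bound under edge inclusion


TARGETS = ["&x", "malloc_1"]
G1: ShapeGraph = frozenset({("&x", "malloc_1")})
G2: ShapeGraph = frozenset({("malloc_1", "malloc_1")})
print(leq(bottom(), G1), leq(G1, top(TARGETS)), sorted(join(G1, G2)))
```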