Some notes on the anti-dependence problem in C2's memory subgraph. There is no outline; this follows wherever the thoughts lead.
Anti-dependences
Consider the following Java code:
static class Foo {
    public int a = 2;
    public int b = 4;
}

Foo test(int x) {
    Foo f1 = new Foo();
    Foo f2 = new Foo();
    f1.a = 42;   // LINE9
    f2.a = f1.a; // LINE10
    f1.a = 24;   // LINE11
    return f1;
}
The corresponding part of the IR is shown below; here we only care about LINE9-LINE11:
C2 performs a small optimization here, turning LINE10 into f2.a = 42. The three assignments correspond to 188#StoreI, 192#StoreI, and 195#StoreI, which execute one after another, matching source order. This raises a question: when C2 builds the IR, how does it know that the in(Memory) of 192#StoreI should be 188#StoreI? The answer is the alias type.
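As a rough mental model (the MemChain and Store names below are invented for illustration; this is not HotSpot code, which tracks per-slice memory state via MergeMem rather than scanning backwards): when C2 emits a store, it wires the store's memory input to the most recent prior store on the same alias slice. That is how 192#StoreI ends up with 188#StoreI as its in(Memory):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of how a store's in(Memory) edge is chosen by alias index.
public class MemChain {
    static class Store {
        final int aliasIdx; // which memory slice this store writes
        final Store memIn;  // memory input: previous store on the same slice
        Store(int aliasIdx, Store memIn) { this.aliasIdx = aliasIdx; this.memIn = memIn; }
    }

    private final List<Store> stores = new ArrayList<>();

    // Emit a store: its memory input is the latest prior store on the same slice.
    Store emitStore(int aliasIdx) {
        Store prev = null;
        for (int i = stores.size() - 1; i >= 0; i--) {
            if (stores.get(i).aliasIdx == aliasIdx) { prev = stores.get(i); break; }
        }
        Store s = new Store(aliasIdx, prev);
        stores.add(s);
        return s;
    }

    public static void main(String[] args) {
        MemChain g = new MemChain();
        int fooA = 6;                   // stand-in alias index for Reduced$Foo+12
        Store s188 = g.emitStore(fooA); // f1.a = 42
        Store s192 = g.emitStore(fooA); // f2.a = 42
        Store s195 = g.emitStore(fooA); // f1.a = 24
        // Each store on the Foo#a slice chains to the previous one:
        System.out.println(s192.memIn == s188); // true
        System.out.println(s195.memIn == s192); // true
    }
}
```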
In short, an alias type describes a memory slice, and it can do so at different granularities. In the example below, f.a and f2.a access memory slices with the same alias type (Reduced$Foo+12), and likewise f.b and f2.b share the same alias type (Reduced$Foo+16).
int test() {
    // @ <2> +any in BotPTR *+bot
    // @ <3> +0 in rawptr:BotPTR
    // @ <4> +0 in java/lang/Object *
    // @ <5> RO +8 in java/lang/Object+8 * [narrowklass]
    // @ <6> +12 in Reduced$Foo+12 *
    // @ <7> +16 in Reduced$Foo+16 *
    // @ <8> RO +172 in klass java/lang/Object: 0x00007fff54006cb8+172 *
    Foo f = new Foo();
    Foo f2 = new Foo();
    f.a = 12;
    f.b = 23;
    f2.a = 6;
    f2.b = 7;
    return f.a + f2.b;
}
Back to the example: since all three assignments access the Foo#a field, they are related. We could even change the assignments to:
f1.a = 42; f2.a = 25;
and the generated StoreI nodes would still be chained together.
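The reason stores that share an alias type must keep their program order is that, at this granularity, C2 cannot in general rule out the two receivers being the same object. A hypothetical variant (the method and its parameters are invented for illustration) shows how the observable result depends on store order when the receivers may alias:

```java
// If f1 and f2 may refer to the same object, the three stores to the
// Foo#a slice cannot be reordered: the result depends on their order.
public class AliasOrder {
    static class Foo { int a; }

    static int test(Foo f1, Foo f2) {
        f1.a = 42;
        f2.a = f1.a; // reads 42 even when f1 == f2
        f1.a = 24;
        return f2.a; // 42 if f1 != f2, but 24 if f1 == f2
    }

    public static void main(String[] args) {
        Foo x = new Foo();
        Foo y = new Foo();
        System.out.println(test(x, y)); // 42: distinct objects
        System.out.println(test(x, x)); // 24: same object, last store wins
    }
}
```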
In the cases above, the right-hand side of each assignment was a constant, but often the rhs is itself a value read from memory. For that we need a LoadX node: it takes a memory address as input and produces the value stored at that address, behaving like a dereference. Consider the following Java code:
static int i = 2, j = 3, k = 9;

int test(int x) {
    j = i;    // 30#StoreI
    i = k;    // 35#StoreI
    return i; // at this point, i==9, j==2
}
35#StoreI corresponds to i = k, and 30#StoreI corresponds to j = i. More precisely, 27#LoadI reads the value of i, and 33#LoadI reads the value of k. Execution proceeds as follows:
- Execute 27#LoadI, reading the value of i, i.e. tmp = i. At this point tmp == 2.
- Execute 35#StoreI, which takes the value of k (read by 33#LoadI) and stores it into i. At this point i == 9.
- Execute 30#StoreI, which stores the value read by 27#LoadI into j, i.e. j = tmp. At this point j == 2.
- Finally, i == 9 and j == 2.
The IR above has a serious problem: 27#LoadI (the read of i) and 35#StoreI (the write of i) access memory slices with the same alias type, yet there is no dependence between them. Imagine what happens if 35#StoreI executes first and 27#LoadI afterwards:
- Execute 35#StoreI, which takes the value of k and stores it into i. At this point i == 9.
- Execute 27#LoadI, reading the value of i, i.e. tmp = i. At this point tmp == 9.
- Execute 30#StoreI, which stores the value read by 27#LoadI into j, i.e. j = tmp. At this point j == 9.
- Finally, i == 9 and j == 9.
Since 35#StoreI and 27#LoadI have no dependence between them, different execution orders can yield different results.
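The two schedules above can be simulated directly with plain Java variables (a sketch, not IR; tmp stands in for the value produced by 27#LoadI):

```java
// Simulate the two schedules for: j = i; i = k;  with i=2, j=3, k=9.
public class AntiDepDemo {
    // Legal schedule: 27#LoadI runs before 35#StoreI.
    static int[] correctOrder() {
        int i = 2, j = 3, k = 9;
        int tmp = i; // 27#LoadI: tmp == 2
        i = k;       // 35#StoreI: i == 9
        j = tmp;     // 30#StoreI: j == 2
        return new int[]{i, j};
    }

    // Illegal schedule: the store to i runs before the load of i.
    static int[] wrongOrder() {
        int i = 2, j = 3, k = 9;
        i = k;       // 35#StoreI first: i == 9
        int tmp = i; // 27#LoadI now sees 9, not 2
        j = tmp;     // 30#StoreI: j == 9 -- wrong result
        return new int[]{i, j};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(correctOrder())); // [9, 2]
        System.out.println(java.util.Arrays.toString(wrongOrder()));   // [9, 9]
    }
}
```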
To solve this, we need to insert anti-dependences (PhaseCFG::insert_anti_dependences). This step happens during GCM; the IR after GCM is shown in the figure below.
Note that 8#StoreI has gained an extra input from 12#LoadI. Executing 8#StoreI now requires the values of k and i to have been read first, which guarantees that 12#LoadI executes before 8#StoreI. This edge flowing from 12#LoadI into 8#StoreI is called an anti-dependence; note that the anti-dependence makes the store depend on the load, so the load is the primary party. Besides inserting an anti-dependence edge, C2 has a second way to handle the problem: it can also force the load's LCA block upward, so that every block where the load may be placed (from its early block to its LCA block) lies above the store's block. This likewise guarantees the store cannot execute before the load.
C2's anti-dependence insertion is fairly simple; irrelevant details elided:
Block* PhaseCFG::insert_anti_dependences(Block* LCA, Node* load, bool verify) {
...
// The block containing the load
Block* early = get_block_for_node(load);
...
// Start from the load's in(Memory) as the initial memory node
Node* initial_mem = load->in(MemNode::Memory);
worklist_store.push(initial_mem);
worklist_visited.push(initial_mem);
worklist_mem.push(NULL);
while (worklist_store.size() > 0) {
// Examine a nearby store to see if it might interfere with our load.
Node* mem = worklist_mem.pop();
Node* store = worklist_store.pop();
uint op = store->Opcode();
// MergeMems do not directly have anti-deps.
// Treat them as internal nodes in a forward tree of memory states,
// the leaves of which are each a 'possible-def'.
if (store == initial_mem // root (exclusive) of tree we are searching
|| op == Op_MergeMem // internal node of tree we are searching
) {
mem = store; // It's not a possibly interfering store.
if (store == initial_mem)
initial_mem = NULL; // only process initial memory once
// Push all outputs of this memory node onto the worklist
for (DUIterator_Fast imax, i = mem->fast_outs(imax); i < imax; i++) {
store = mem->fast_out(i);
if (store->is_MergeMem()) {
// Be sure we don't get into combinatorial problems.
// (Allow phis to be repeated; they can merge two relevant states.)
uint j = worklist_visited.size();
for (; j > 0; j--) {
if (worklist_visited.at(j-1) == store) break;
}
if (j > 0) continue; // already on work list; do not repeat
worklist_visited.push(store);
}
worklist_mem.push(mem);
worklist_store.push(store);
}
continue;
}
if (op == Op_MachProj || op == Op_Catch) continue;
if (store->needs_anti_dependence_check()) continue; // not really a store
// Compute the alias index. Loads and stores with different alias
// indices do not need anti-dependence edges. Wide MemBar's are
// anti-dependent on everything (except immutable memories).
// If the load and store access different memory slices, no anti-dependence is possible
const TypePtr* adr_type = store->adr_type();
if (!C->can_alias(adr_type, load_alias_idx)) continue;
// Most slow-path runtime calls do NOT modify Java memory, but
// they can block and so write Raw memory.
...
// The block containing the store
Block* store_block = get_block_for_node(store);
assert(store_block != NULL, "unused killing projections skipped above");
if (store->is_Phi()) {
// Loop-phis need to raise load before input. (Other phis are treated
// as store below.)
... // case 1
} else if (store_block != early) {
// case 2: if the load and store are not in the same basic block, the problem
// cannot be solved by inserting a dependence edge between them; instead the
// LCA block must be raised above the store's block.
// 'store' is between the current LCA and earliest possible block.
// Label its block, and decide later on how to raise the LCA
// to include the effect on LCA of this store.
// If this store's block gets chosen as the raised LCA, we
// will find him on the non_early_stores list and stick him
// with a precedence edge.
// (But, don't bother if LCA is already raised all the way.)
if (LCA != early && !unrelated_load_in_store_null_block(store, load)) {
store_block->set_raise_LCA_mark(load_index);
must_raise_LCA = true;
non_early_stores.push(store);
}
} else {
// case 3: insert a dependence edge between the load and the store
// Found a possibly-interfering store in the load's 'early' block.
// This means 'load' cannot sink at all in the dominator tree.
// Add an anti-dep edge, and squeeze 'load' into the highest block.
assert(store != load->find_exact_control(load->in(0)), "dependence cycle found");
if (verify) {
assert(store->find_edge(load) != -1 || unrelated_load_in_store_null_block(store, load),
"missing precedence edge");
} else {
store->add_prec(load);
}
LCA = early;
// This turns off the process of gathering non_early_stores.
}
}
// (Worklist is now empty; all nearby stores have been visited.)
// If the load and store were in the same block, the dependence edge was just added and we are done
if (LCA == early) return LCA;
// What follows handles case 2
// We get here only if there are no possibly-interfering stores
// in the load's 'early' block. Move LCA up above all predecessors
// which contain stores we have noted.
//
// The raised LCA block can be a home to such interfering stores,
// but its predecessors must not contain any such stores.
//
// The raised LCA will be a lower bound for placing the load,
// preventing the load from sinking past any block containing
// a store that may invalidate the memory state required by 'load'.
if (must_raise_LCA)
LCA = raise_LCA_above_marks(LCA, load->_idx, early, this);
if (LCA == early) return LCA;
// Insert anti-dependence edges from 'load' to each store
// in the non-early LCA block.
// Mine the non_early_stores list for such stores.
if (LCA->raise_LCA_mark() == load_index) {
while (non_early_stores.size() > 0) {
Node* store = non_early_stores.pop();
Block* store_block = get_block_for_node(store);
if (store_block == LCA) {
// add anti_dependence from store to load in its own block
assert(store != load->find_exact_control(load->in(0)), "dependence cycle found");
if (verify) {
assert(store->find_edge(load) != -1, "missing precedence edge");
} else {
store->add_prec(load);
}
} else {
assert(store_block->raise_LCA_mark() == load_index, "block was marked");
// Any other stores we found must be either inside the new LCA
// or else outside the original LCA. In the latter case, they
// did not interfere with any use of 'load'.
assert(LCA->dominates(store_block)
|| !LCA_orig->dominates(store_block), "no stray stores");
}
}
}
// Return the highest block containing stores; any stores
// within that block have been given anti-dependence edges.
return LCA;
}
Aliasing
The -XX:AliasLevel flag supports levels 0 through 3:
product(intx, AliasLevel, 3, \
"0 for no aliasing, 1 for oop/field/static/array split, " \
"2 for class split, 3 for unique instances") \
range(0, 3) \
constraint(AliasLevelConstraintFunc,AfterErgo) \
Although the default level is 3, C2 in practice operates at level 2; only during escape analysis does split_unique_type refine alias types down to individual instance types.
AliasLevel describes the granularity of memory slices. At level 1, every Type::InstPtr and Type::AryPtr is treated as TypeInstPtr::BOTTOM, every Type::KlassPtr as TypeKlassPtr::BOTTOM, every Type::RawPtr as TypeRawPtr::BOTTOM, and every Type::AnyPtr as TypePtr::BOTTOM. In other words, level 1 can only distinguish whether a memory write touches class data, instance data, raw memory, or none of these: very coarse-grained. Level 2, the most common, can distinguish which field of which class a write touches; for example, foo1.field and foo2.field have the same alias type, Foo+field1_OFFSET. After escape analysis, level 3 can precisely distinguish writes to different instances: for foo1.field and foo2.field it knows the two writes touch different memory, so their alias types differ.
The reason the normal level is 2 is that when building the IR, for x.f = a and x1.f = b the only information available is the type of x and the offset of f. Only the connection graph built by escape analysis can tell whether x and x1 point to the same object, so reaching level 3 requires EA.
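To see why per-instance precision requires escape analysis, consider a hypothetical pair of methods (invented for illustration): in the first, the receivers arrive as parameters and may alias, so level-2 information (type plus offset) cannot fold the final load; in the second, both objects are allocated locally and never escape, so EA can prove them distinct:

```java
// At AliasLevel 2, stores to x.f and x1.f share one alias type (Foo+OFFSET).
// After escape analysis, distinct non-escaping allocations get their own
// instance slices (level 3), so the compiler can reason about them separately.
public class EaDemo {
    static class Foo { int f; }

    // Receivers come from outside: x and x1 may alias, so the load at the
    // end cannot be folded to 'a'.
    static int mayAlias(Foo x, Foo x1, int a, int b) {
        x.f = a;
        x1.f = b;   // may overwrite x.f if x == x1
        return x.f; // a if x != x1, b if x == x1
    }

    // Both objects are allocated here and never escape: EA proves x != x1,
    // so x.f and x1.f get distinct slices and the return folds to 'a'.
    static int neverAlias(int a, int b) {
        Foo x = new Foo();
        Foo x1 = new Foo();
        x.f = a;
        x1.f = b;
        return x.f; // provably 'a'
    }

    public static void main(String[] args) {
        Foo s = new Foo();
        System.out.println(mayAlias(s, s, 1, 2)); // 2: aliasing happened
        System.out.println(neverAlias(1, 2));     // 1
    }
}
```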