C2 - Anti-dependences

kelthuzadx 2021-08-21 09:14

Some notes on the anti-dependence problem in C2's memory subgraph. There is no outline; I am writing wherever my thoughts lead.

Anti-dependences

Consider the following Java code:

static class Foo {
    public int a = 2;
    public int b = 4;
}

Foo test(int x) {
    Foo f1 = new Foo();
    Foo f2 = new Foo();
    f1.a = 42;    // LINE9
    f2.a = f1.a;  // LINE10
    f1.a = 24;    // LINE11
    return f1;
}

The corresponding IR fragment is shown below; here we only care about LINE9-LINE11:

[Figure: IR for LINE9-LINE11 — 188#StoreI, 192#StoreI and 195#StoreI chained through their memory inputs]

C2 does a little optimization here and turns LINE10 into f2.a = 42. The three assignments correspond to 188#StoreI, 192#StoreI and 195#StoreI respectively; they execute one after another, matching the source order. This raises a question: when constructing the IR, how does C2 know that the in(Memory) of 192#StoreI should be 188#StoreI? The answer is alias types.
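To make that concrete, here is a minimal sketch of how parsing chains stores on one alias slice. GraphKit::memory(), GraphKit::set_memory(), Compile::get_alias_index() and StoreNode::make() are real C2 APIs, but the wiring below is a simplified paraphrase under those assumptions, not the literal implementation:

// Sketch (simplified): the parser keeps one "last memory state" per alias
// index, so each new store is wired to the previous def of the same slice.
int alias_idx = C->get_alias_index(adr_type);        // e.g. Reduced$Foo+12
Node* mem = memory(alias_idx);                       // last def of the slice (188#StoreI)
Node* st  = StoreNode::make(_gvn, control(), mem,    // the new store takes 188#StoreI
                            adr, adr_type, val,      // as its in(MemNode::Memory)
                            T_INT, MemNode::unordered);
set_memory(_gvn.transform(st), alias_idx);           // 192#StoreI is now the slice's def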

Simply put, an alias type describes a memory slice, and it can do so at different granularities. In the example below, f.a and f2.a access memory slices with the same alias type (Reduced$Foo+12), and likewise f.b and f2.b share the same alias type (Reduced$Foo+16).

int test() {
    // @ <2>     +any in BotPTR *+bot
    // @ <3>     +0   in rawptr:BotPTR
    // @ <4>     +0   in java/lang/Object *
    // @ <5>  RO +8   in java/lang/Object+8 * [narrowklass]
    // @ <6>     +12  in Reduced$Foo+12 *
    // @ <7>     +16  in Reduced$Foo+16 *
    // @ <8>  RO +172 in klass java/lang/Object: 0x00007fff54006cb8+172 *
    Foo f = new Foo();
    Foo f2 = new Foo();
    f.a = 12;
    f.b = 23;
    f2.a = 6;
    f2.b = 7;
    return f.a + f2.b;
}

Back to the example: since all three assignments access the field Foo#a, they are related. In fact, we can even change the assignments to:
f1.a = 42; f2.a = 25;
and the generated StoreI nodes will still be chained together:

[Figure: IR showing the StoreI nodes for f1.a = 42 and f2.a = 25 linked on the same memory chain]

In the cases above, the rhs of each assignment is a constant, but in many cases the rhs can itself be a value read from memory. Then we need a LoadX node: its input is a memory address and its output is the value stored at that address, just like a dereference. Consider the following Java code:

static int i = 2, j = 3, k = 9;

int test(int x) {
    j = i;     // 30#StoreI
    i = k;     // 35#StoreI
    return i;  // at this point, i==9, j==2
}

35#StoreI corresponds to i = k, and 30#StoreI corresponds to j = i; more precisely, 27#LoadI reads the value of i and 33#LoadI reads the value of k. The execution flow is:

  1. Execute 27#LoadI to read the value of i, effectively tmp = i. At this point tmp == 2.
  2. Execute 35#StoreI: read the value of k and store it into i's memory. At this point i == 9.
  3. Execute 30#StoreI: store the value of i read by 27#LoadI into j, effectively j = tmp. At this point j == 2.
  4. Final state: i == 9, j == 2.

[Figure: IR for test — 27#LoadI, 33#LoadI, 30#StoreI and 35#StoreI]

The IR above has a rather serious problem: 27#LoadI (which reads i) and 35#StoreI (which writes i) access memory slices with the same alias type, yet there is no dependence between them. Imagine what happens if 35#StoreI executes first and 27#LoadI afterwards:

  1. Execute 35#StoreI: read the value of k and store it into i's memory. At this point i == 9.
  2. Execute 27#LoadI to read the value of i, effectively tmp = i. At this point tmp == 9.
  3. Execute 30#StoreI: store the value read by 27#LoadI into j, effectively j = tmp. At this point j == 9.
  4. Final state: i == 9, j == 9.

Since 35#StoreI and 27#LoadI have no dependence between them, different execution orders can produce different results.

To solve this we need to insert anti-dependences (PhaseCFG::insert_anti_dependences). This step happens during GCM (global code motion); the IR after GCM is shown below:

[Figure: IR after GCM — 8#storeI now has an anti-dependence input from 12#loadI]

As you can see, 8#storeI has gained an extra input from 12#loadI. When 8#storeI executes, the values of k and i are now forced to have been read already, which guarantees that 12#loadI executes before 8#storeI. This edge flowing from 12#loadI into 8#storeI is called an anti-dependence. Note the direction: an anti-dependence makes the store depend on the load (a write-after-read constraint), and C2 computes these edges one load at a time. Besides inserting anti-dependence edges, C2 has a second way of handling the problem: it may force the load's LCA block upward, so that the entire range of blocks where the load may be placed (from its early block down to the LCA) lies above the store's block. This likewise guarantees that the store cannot execute before the load.
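Mechanically, the inserted edge is just a precedence edge: an extra input appended past the store's required inputs that carries no value and only constrains scheduling. Node::add_prec is the real C2 API; the comments are my reading of its effect:

// Sketch: what inserting an anti-dependence edge boils down to.
// add_prec() appends 'load' to 'store' at an input index >= store->req().
// No data flows along this edge; local scheduling simply refuses to place
// 'store' until 'load' has been scheduled, enforcing load-before-store.
store->add_prec(load);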

C2's anti-dependence insertion is fairly simple; irrelevant details are elided below:

Block* PhaseCFG::insert_anti_dependences(Block* LCA, Node* load, bool verify) {

  ...
  // Block containing the load
  Block* early = get_block_for_node(load);
  ...
  // Use the load's in(Memory) as the initial memory node
  Node* initial_mem = load->in(MemNode::Memory);
  worklist_store.push(initial_mem);
  worklist_visited.push(initial_mem);
  worklist_mem.push(NULL);
  while (worklist_store.size() > 0) {
    // Examine a nearby store to see if it might interfere with our load.
    Node* mem   = worklist_mem.pop();
    Node* store = worklist_store.pop();
    uint op = store->Opcode();

    // MergeMems do not directly have anti-deps.
    // Treat them as internal nodes in a forward tree of memory states,
    // the leaves of which are each a 'possible-def'.
    if (store == initial_mem    // root (exclusive) of tree we are searching
        || op == Op_MergeMem    // internal node of tree we are searching
        ) {
      mem = store;   // It's not a possibly interfering store.
      if (store == initial_mem)
        initial_mem = NULL;  // only process initial memory once
      // Push all of this memory node's outputs onto the worklist
      for (DUIterator_Fast imax, i = mem->fast_outs(imax); i < imax; i++) {
        store = mem->fast_out(i);
        if (store->is_MergeMem()) {
          // Be sure we don't get into combinatorial problems.
          // (Allow phis to be repeated; they can merge two relevant states.)
          uint j = worklist_visited.size();
          for (; j > 0; j--) {
            if (worklist_visited.at(j-1) == store)  break;
          }
          if (j > 0)  continue; // already on work list; do not repeat
          worklist_visited.push(store);
        }
        worklist_mem.push(mem);
        worklist_store.push(store);
      }
      continue;
    }

    if (op == Op_MachProj || op == Op_Catch)   continue;
    if (store->needs_anti_dependence_check())  continue;  // not really a store

    // Compute the alias index.  Loads and stores with different alias
    // indices do not need anti-dependence edges.  Wide MemBar's are
    // anti-dependent on everything (except immutable memories).
    // If the load and the store access different memory slices, there can be no anti-dependence
    const TypePtr* adr_type = store->adr_type();
    if (!C->can_alias(adr_type, load_alias_idx))  continue;

    // Most slow-path runtime calls do NOT modify Java memory, but
    // they can block and so write Raw memory.
    ...
    // Block containing the store
    Block* store_block = get_block_for_node(store);
    assert(store_block != NULL, "unused killing projections skipped above");

    if (store->is_Phi()) {
      // Loop-phis need to raise load before input. (Other phis are treated
      // as store below.)
      ... // case 1
    } else if (store_block != early) {
      // Case 2: the load and the store are not in the same block, so the
      // anti-dependence cannot be solved by inserting an edge between them;
      // instead the LCA block must be raised above the store's block.
      // 'store' is between the current LCA and earliest possible block.
      // Label its block, and decide later on how to raise the LCA
      // to include the effect on LCA of this store.
      // If this store's block gets chosen as the raised LCA, we
      // will find him on the non_early_stores list and stick him
      // with a precedence edge.
      // (But, don't bother if LCA is already raised all the way.)
      if (LCA != early && !unrelated_load_in_store_null_block(store, load)) {
        store_block->set_raise_LCA_mark(load_index);
        must_raise_LCA = true;
        non_early_stores.push(store);
      }
    } else {
      // Case 3: insert a precedence edge between the load and the store
      // Found a possibly-interfering store in the load's 'early' block.
      // This means 'load' cannot sink at all in the dominator tree.
      // Add an anti-dep edge, and squeeze 'load' into the highest block.
      assert(store != load->find_exact_control(load->in(0)), "dependence cycle found");
      if (verify) {
        assert(store->find_edge(load) != -1 || unrelated_load_in_store_null_block(store, load),
               "missing precedence edge");
      } else {
        store->add_prec(load);
      }
      LCA = early;
      // This turns off the process of gathering non_early_stores.
    }
  }
  // (Worklist is now empty; all nearby stores have been visited.)

  // If the load and the store were in the same block, the precedence edge was just inserted above and we are done
  if (LCA == early)  return LCA;
  // What follows handles case 2
  // We get here only if there are no possibly-interfering stores
  // in the load's 'early' block.  Move LCA up above all predecessors
  // which contain stores we have noted.
  //
  // The raised LCA block can be a home to such interfering stores,
  // but its predecessors must not contain any such stores.
  //
  // The raised LCA will be a lower bound for placing the load,
  // preventing the load from sinking past any block containing
  // a store that may invalidate the memory state required by 'load'.
  if (must_raise_LCA)
    LCA = raise_LCA_above_marks(LCA, load->_idx, early, this);
  if (LCA == early)  return LCA;

  // Insert anti-dependence edges from 'load' to each store
  // in the non-early LCA block.
  // Mine the non_early_stores list for such stores.
  if (LCA->raise_LCA_mark() == load_index) {
    while (non_early_stores.size() > 0) {
      Node* store = non_early_stores.pop();
      Block* store_block = get_block_for_node(store);
      if (store_block == LCA) {
        // add anti_dependence from store to load in its own block
        assert(store != load->find_exact_control(load->in(0)), "dependence cycle found");
        if (verify) {
          assert(store->find_edge(load) != -1, "missing precedence edge");
        } else {
          store->add_prec(load);
        }
      } else {
        assert(store_block->raise_LCA_mark() == load_index, "block was marked");
        // Any other stores we found must be either inside the new LCA
        // or else outside the original LCA.  In the latter case, they
        // did not interfere with any use of 'load'.
        assert(LCA->dominates(store_block)
               || !LCA_orig->dominates(store_block), "no stray stores");
      }
    }
  }

  // Return the highest block containing stores; any stores
  // within that block have been given anti-dependence edges.
  return LCA;
}

Aliasing

-XX:AliasLevel currently has four levels (0-3):

  product(intx, AliasLevel,     3,                                          \
          "0 for no aliasing, 1 for oop/field/static/array split, "         \
          "2 for class split, 3 for unique instances")                      \
          range(0, 3)                                                       \
          constraint(AliasLevelConstraintFunc,AfterErgo)                    \

Although the default level is 3, in ordinary situations C2 effectively works at level 2; only during escape analysis does it use split_unique_type to refine alias types down to individual instance types.

AliasLevel describes the granularity of memory slices. At level 1, every Type::InstPtr and Type::AryPtr is treated as TypeInstPtr::BOTTOM, every Type::KlassPtr as TypeKlassPtr::BOTTOM, every Type::RawPtr as TypeRawPtr::BOTTOM, and every Type::AnyPtr as TypePtr::BOTTOM. In other words, level 1 can only tell whether a memory write touches class data, instance data, raw memory, or none of those: very coarse-grained. The most common level, 2, can tell which field of which class a write touches; for example, foo1.field and foo2.field get the same alias type, namely Foo+field_OFFSET. After escape analysis, level 3 can precisely distinguish writes to different instances: for foo1.field and foo2.field it knows they write different memory, so their alias types differ as well.

The reason the normal level is 2 is that, when constructing the IR, the only information available for x.f = a and x1.f = b is x's type and f's offset. Only the connection graph built by escape analysis (EA) can tell whether x and x1 point to the same object, which is why reaching level 3 requires EA.
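As a rough sketch of what level-2 flattening means in code: unless escape analysis has stamped a pointer type with a concrete instance id, its instance identity is dropped, so both foo1.field and foo2.field flatten to the same klass+offset slice. The snippet below is a simplified paraphrase of the idea behind Compile::flatten_alias_type, not the literal HotSpot code; is_known_instance() and cast_to_instance_id() are real Type APIs:

// Sketch (simplified): why foo1.field and foo2.field share one alias index
// at level 2. With no known instance id, the instance identity is erased,
// so both pointer types flatten to the same "Foo + offset" slice.
const TypeOopPtr* toop = tj->isa_oopptr();
if (toop != NULL && !toop->is_known_instance()) {
  // Forget which instance this is; keep only klass + offset.
  tj = toop->cast_to_instance_id(TypeOopPtr::InstanceBot);
}
// After EA, split_unique_type() re-types the memory graph of a non-escaping
// object with a unique instance id, so the same lookup then yields a
// per-instance slice and a distinct alias index.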
