Atomic fence concurrency between goroutines and C threads: what are the semantics?

Problem description

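I'd like to know whether it is possible to explicitly coordinate atomic-operation concurrency between goroutines and C threads.

The use case here involves an audio processing library in C, which creates an OS thread and periodically calls a user-supplied callback to retrieve audio data. This must happen in almost real time, so I don't want to incur the overhead of cgo calls, stack swaps, and Go-land concurrency. A ring buffer can solve this problem in general, where one thread writes to the buffer, another reads, and synchronization is performed with memory fences.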
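For illustration, here is a minimal sketch of such a single-producer/single-consumer ring buffer, written in pure Go rather than across the Go/C boundary. The names `ring`, `newRing`, `push`, `pop`, and `transfer` are hypothetical. The sketch assumes that `sync/atomic` orders the data write before the index store and the index load before the data read, which is exactly the property this question is asking about.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// ring is a hypothetical single-producer/single-consumer byte queue.
// head and tail count bytes ever read/written; they wrap modulo 2^32,
// which is harmless because the capacity is a power of two.
type ring struct {
	buf  []byte
	mask uint32
	head uint32 // advanced only by the consumer
	tail uint32 // advanced only by the producer
}

func newRing(size uint32) *ring {
	if size == 0 || size&(size-1) != 0 {
		panic("ring size must be a power of two")
	}
	return &ring{buf: make([]byte, size), mask: size - 1}
}

// push writes one byte, then publishes it by advancing tail.
func (r *ring) push(b byte) bool {
	tail := atomic.LoadUint32(&r.tail)
	if tail-atomic.LoadUint32(&r.head) == uint32(len(r.buf)) {
		return false // full
	}
	r.buf[tail&r.mask] = b
	atomic.StoreUint32(&r.tail, tail+1)
	return true
}

// pop reads one byte, then frees its slot by advancing head.
func (r *ring) pop() (byte, bool) {
	head := atomic.LoadUint32(&r.head)
	if head == atomic.LoadUint32(&r.tail) {
		return 0, false // empty
	}
	b := r.buf[head&r.mask]
	atomic.StoreUint32(&r.head, head+1)
	return b, true
}

// transfer sends n bytes through a small ring and reports whether the
// consumer saw them all, in order.
func transfer(n int) bool {
	r := newRing(8)
	go func() {
		for i := 0; i < n; {
			if r.push(byte(i)) {
				i++
			} else {
				runtime.Gosched() // ring full: let the consumer run
			}
		}
	}()
	for i := 0; i < n; {
		b, ok := r.pop()
		if !ok {
			runtime.Gosched() // ring empty: let the producer run
			continue
		}
		if b != byte(i) {
			return false
		}
		i++
	}
	return true
}

func main() {
	fmt.Println("in order:", transfer(100000))
}
```

In the real use case the consumer would be the C callback thread, so only one side of each index would go through `sync/atomic`, and the other through the C compiler's atomics.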

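However, it currently seems that the memory semantics of atomic operations in Go are completely undefined in the documentation, and therefore completely useless for this purpose, and probably for many others... (https://golang.org/pkg/sync/atomic/ unhelpfully just says "atomic"; see https://github.com/golang/go/issues/5045).

But - it has to work in some way, even if that's not documented. How?

Note that I am not asking for solutions to the problem I describe. I'm not asking whether a ring buffer is the right choice, or whether I should "share memory by communicating" or anything else. I am asking about the memory-ordering semantics of atomic operations as currently implemented in Go (for instance, the latest release version, specifically 1.16.5).

In particular, here is an example program that sets up a situation similar to what happens in my actual use case: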

package main

/*

   #include <pthread.h>
   #include <stdlib.h>

   typedef struct {
      int fence_0;
      char *data;
   } shared_data;

   shared_data *make_shared_data() {
      shared_data *sd = calloc(1, sizeof(shared_data));
      sd->data = calloc(1024,1);
      sd->data[0] = 17;
      return sd;
   }

   void *get_shared_data_ptr(shared_data *sd) {
      return sd->data;
   }

   int read_data_in_pthread(shared_data *sd) {
      int l;
      __atomic_load(&sd->fence_0, &l, __ATOMIC_ACQUIRE);
      if (l < 2) return 0;
      return sd->data[0] + sd->data[1023]; 
   }

*/
import "C"
import (
   "fmt"
   "runtime"
   "reflect"
   "unsafe"
   "sync/atomic"
)

func main() {

   // Prevent thread/cache switching (to avoid asking a third, unimportant question and allow the below "naughty")
   runtime.LockOSThread()

   // Allocate a C-owned structure.
   csd := C.make_shared_data()

   // This is just an expedient for the sake of this example, I'm aware it's naughty/bad, etc.
   ptr := (*byte)(C.get_shared_data_ptr(csd))
   arrptr := &reflect.SliceHeader{Data: uintptr(unsafe.Pointer(ptr)), Len: 1024, Cap: 1024}
   arr := *(*[]byte)(unsafe.Pointer(arrptr))

   fmt.Printf("%d\n", arr[0])
   done := make(chan bool)

   // Repeatedly execute a reader function in a cgo thread which will output zero if first fence is not 2
   // and output the sum of the first and last data points if it is.
   go func() {
      var s uint8
      for s == 0 {
         s = uint8(C.read_data_in_pthread(csd))
      }
      fmt.Printf("finished: %d\n", s)
      done <- true
   }()

   go func() {
      atomic.StoreInt32((*int32)(&csd.fence_0), 1)
      for i := 0; i < 1024; i++ {
         arr[i] = 255
      }
      atomic.StoreInt32((*int32)(&csd.fence_0), 2)
   }()

   <-done
}

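The questions are: (a) can the output of this program ever be 17? (b) If not, must the output be 254, or could it be 255?

If Go atomic stores use a memory model similar to gcc's __ATOMIC_SEQ_CST, then the memory fences are sequentially consistent and we will always see 254. That seems like a sensible default. But is it necessarily true?

If not, my program will be non-portable and buggy. So I would really like to know.

(Yes, I am aware that the test case above is completely non-portable and only runs on GNU/Linux... the actual library is in fact portable.)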

Tags: multithreading, go, concurrency, cgo, atomic

Solution


There's a sort of impedance mismatch, as it were, between the Go memory model and the (multiple) memory models available in C and C++ (see cppreference.com on C memory order options, and note that C++ has a more nuanced view than C11 did, beginning in C++20). This can, at least in theory, make for some big headaches for implementors: calls in and out of C code, via cgo, might need to do heavy-duty CPU sync if, e.g., the Go system uses some sort of total or partial store order model and the C system uses a relaxed memory model.

In practice, each implementation will strive to use the same kinds of synchronizations for atomic-load-32 and atomic-store-32, for instance. But:

> The use case here involves an audio processing library in C, which creates an OS thread, and periodically calls a user-supplied callback to retrieve audio data. This must happen in almost real-time, so I don't want to incur the overhead of cgo calls, stack swaps, and Go-land concurrency. A ring buffer can solve this problem in general, where one thread writes to the buffer, another reads, and synchronization is performed with memory fences.
>
> [snip]
>
> But - it has to work in some way, even if that's not documented. How?

You're going to have to look at each implementation, one at a time, because the "how" could, at least potentially, be different each time. So find out what your systems use on their PowerPC implementations, find out what your systems use on their ARM implementations, and so on. You'll want to have your low-level Go routines be implementation-specific, chosen to work with your low-level C routines.
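As a baseline to compare each platform against, it may help to have a Go-only analogue of the question's publish pattern, with both the writer and the reader going through `sync/atomic`. The sketch below is only that: the names `sharedData`, `readData`, and `run` are hypothetical, and it says nothing about what a C thread using `__atomic_load` would observe. Note also that the Go memory model only documents `sync/atomic` operations as sequentially consistent from the Go 1.19 revision onward; for earlier releases such as 1.16.5 this ordering is observed implementation behavior rather than a documented guarantee.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// sharedData mirrors the question's C struct, but lives entirely in Go.
type sharedData struct {
	fence int32
	data  [1024]byte
}

// readData mirrors read_data_in_pthread: it returns 0 until the fence
// reaches 2, then the sum of the first and last data bytes.
func readData(sd *sharedData) int {
	if atomic.LoadInt32(&sd.fence) < 2 {
		return 0
	}
	return int(sd.data[0]) + int(sd.data[1023])
}

// run starts a writer goroutine and polls until the reader sees the fence.
func run() uint8 {
	sd := &sharedData{}
	sd.data[0] = 17
	go func() {
		atomic.StoreInt32(&sd.fence, 1)
		for i := range sd.data {
			sd.data[i] = 255
		}
		atomic.StoreInt32(&sd.fence, 2) // publish after all data writes
	}()
	var s uint8
	for s == 0 {
		s = uint8(readData(sd)) // 255+255 = 510, truncated to 254
		runtime.Gosched()
	}
	return s
}

func main() {
	fmt.Printf("finished: %d\n", run())
}
```

If the mixed Go/C version ever produces a different value on some platform where this Go-only version produces 254, that difference points at the boundary between the two memory models rather than at the ring-buffer logic.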

