c - What explains the OpenMP performance difference between an array of pointers and a pointer to an array?
Problem description
I wrote two C programs that perform a tall-and-skinny matrix multiplication with OpenMP. The algorithm is memory-bound on my machine. In one of the codes I use an array of pointers (AoP) to store the matrices. In the other code I just use one plain array per matrix, with the rows stored one after another; I will call this the pointer-to-array (PtA) version from now on. I observe that PtA always outperforms the AoP version. In particular, when going from 6 to 12 cores, AoP's performance drops slightly while PtA's roughly doubles. I cannot really explain this behavior; I can only assume that the cores somehow interfere with each other during the computation. Can someone explain this behavior?
Pointer-to-array version:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sched.h>

int main(int argc, char *argv[])
{
    // parallel region to verify that pinning works correctly
    #pragma omp parallel
    {
        printf("OpenMP thread %d / %d runs on core %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    // define dimensions
    int dim_n = atoi(argv[1]);
    int dim_nb = 2;
    printf("n = %d, nb = %d\n", dim_n, dim_nb);
    // allocate the matrices M, V and W as single contiguous blocks;
    // row i of an n x nb matrix starts at index i*dim_nb
    double *M = malloc((dim_nb*dim_nb) * sizeof(double));
    // initialize matrix M
    for (int i = 0; i < dim_nb; i++)
    {
        for (int j = 0; j < dim_nb; j++)
        {
            M[i*dim_nb+j] = ((i+1)-1.0)*dim_nb + (j+1) - 1.0;
        }
    }
    double *V = malloc((dim_n*dim_nb) * sizeof(double));
    double *W = malloc((dim_n*dim_nb) * sizeof(double));
    // initialize the matrix V in a parallel region
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < dim_n; i++)
    {
        for (int j = 0; j < dim_nb; j++)
        {
            V[i*dim_nb+j] = j + 1;
        }
    }
    int max_iter = 100;
    double time = omp_get_wtime();
    // compute the matrix-matrix product W = V*M max_iter times
    for (int iter = 0; iter < max_iter; iter++)
    {
        // calculate matrix-matrix product in parallel
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < dim_n; i++)          // i < #rows of V
        {
            for (int j = 0; j < dim_nb; j++)     // j < #columns of M
            {
                // reset W_ij every time it is recomputed
                W[i*dim_nb+j] = 0;
                for (int k = 0; k < dim_nb; k++) // k < #columns of V = #rows of M
                {
                    W[i*dim_nb+j] += V[i*dim_nb+k] * M[k*dim_nb+j];
                }
            }
        }
    }
    time = omp_get_wtime() - time;
    printf("time: %f s\n", time);
    free(M); free(V); free(W);
    return 0;
}
Array-of-pointers version:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sched.h>

int main(int argc, char *argv[])
{
    // parallel region to verify that pinning works correctly
    #pragma omp parallel
    {
        printf("OpenMP thread %d / %d runs on core %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    // define dimensions
    int dim_n = atoi(argv[1]);
    int dim_nb = 2;
    printf("n = %d, nb = %d\n", dim_n, dim_nb);
    // allocate M, V and W as arrays of row pointers;
    // each element of **M points to a separately malloc'ed row
    double **M = malloc(dim_nb * sizeof(double *));
    for (int i = 0; i < dim_nb; i++)
    {
        M[i] = malloc(dim_nb * sizeof(double));
    }
    // initialize matrix M
    for (int i = 0; i < dim_nb; i++)
    {
        for (int j = 0; j < dim_nb; j++)
        {
            M[i][j] = ((i+1)-1.0)*dim_nb + (j+1) - 1.0;
        }
    }
    double **V = malloc(dim_n * sizeof(double *));
    for (int i = 0; i < dim_n; i++)
    {
        V[i] = malloc(dim_nb * sizeof(double));
    }
    double **W = malloc(dim_n * sizeof(double *));
    for (int i = 0; i < dim_n; i++)
    {
        W[i] = malloc(dim_nb * sizeof(double));
    }
    // initialize the matrix V in a parallel region
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < dim_n; i++)
    {
        for (int j = 0; j < dim_nb; j++)
        {
            V[i][j] = j + 1;
        }
    }
    int max_iter = 100;
    double time = omp_get_wtime();
    // compute the matrix-matrix product W = V*M max_iter times
    for (int iter = 0; iter < max_iter; iter++)
    {
        // calculate matrix-matrix product in parallel
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < dim_n; i++)          // i < #rows of V
        {
            for (int j = 0; j < dim_nb; j++)     // j < #columns of M
            {
                // reset W_ij every time it is recomputed
                W[i][j] = 0;
                for (int k = 0; k < dim_nb; k++) // k < #columns of V = #rows of M
                {
                    W[i][j] += V[i][k] * M[k][j];
                }
            }
        }
    }
    time = omp_get_wtime() - time;
    printf("time: %f s\n", time);
    for (int i = 0; i < dim_nb; i++) free(M[i]);
    for (int i = 0; i < dim_n; i++) { free(V[i]); free(W[i]); }
    free(M); free(V); free(W);
    return 0;
}
Solution
This is straightforward to explain: the pointer version needs an extra memory access per row, since it must first load the row pointer and then dereference it to reach the data. Because the rows come from separate malloc calls, they can end up scattered across memory, so spatial locality is poor and cache misses are far more likely. In the flat-array version, all of the data sits in one contiguous block: the kernel performs fewer memory accesses, and the caches (and hardware prefetcher) are much more effective. For a memory-bound kernel like this one, that difference dominates, and it grows worse as more cores compete for the memory system.