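c - OpenMP: poor and confusing performance
Problem description
The following code is from Tim Mattson's video series on OpenMP. The only change I made was to raise the number of threads to 24, since I have a 24-core machine. It performs nowhere near as well as it should, and I am baffled as to why (see the results below). Am I missing something here? I should mention that I am a theoretical computer scientist with experience in algorithms, but I am somewhat rusty when it comes to hardware.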
#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;

int main ()
{
    int i;
    double x, pi, sum = 0.0;   /* x is declared here, outside the parallel region */
    double start_time, run_time;

    step = 1.0/(double) num_steps;

    /* Repeat the computation with 1, 2, ..., 24 threads and time each run. */
    for (i=1;i<=24;i++){
        sum = 0.0;
        omp_set_num_threads(i);
        start_time = omp_get_wtime();
#pragma omp parallel
        {
#pragma omp single
            printf(" num_threads = %d",omp_get_num_threads());
#pragma omp for reduction(+:sum)
            for (i=1;i<= num_steps; i++){
                x = (i-0.5)*step;
                sum = sum + 4.0/(1.0+x*x);
            }
        }
        pi = step * sum;
        run_time = omp_get_wtime() - start_time;
        printf("\n pi is %f in %f seconds and %d threads\n",pi,run_time,i);
    }
}
I expected 24 cores to give a 20-24x speedup, but the run is barely twice as fast. Why?! Here is the output:
num_threads = 1
pi is 3.141593 in 1.531695 seconds and 1 threads
num_threads = 2
pi is 3.141594 in 1.405237 seconds and 2 threads
num_threads = 3
pi is 3.141593 in 1.313049 seconds and 3 threads
num_threads = 4
pi is 3.141592 in 1.069563 seconds and 4 threads
num_threads = 5
pi is 3.141587 in 1.058272 seconds and 5 threads
num_threads = 6
pi is 3.141590 in 1.016013 seconds and 6 threads
num_threads = 7
pi is 3.141579 in 1.023723 seconds and 7 threads
num_threads = 8
pi is 3.141582 in 0.760994 seconds and 8 threads
num_threads = 9
pi is 3.141585 in 0.791577 seconds and 9 threads
num_threads = 10
pi is 3.141593 in 0.868043 seconds and 10 threads
num_threads = 11
pi is 3.141592 in 0.797610 seconds and 11 threads
num_threads = 12
pi is 3.141592 in 0.802422 seconds and 12 threads
num_threads = 13
pi is 3.141590 in 0.941856 seconds and 13 threads
num_threads = 14
pi is 3.141591 in 0.928252 seconds and 14 threads
num_threads = 15
pi is 3.141592 in 0.867834 seconds and 15 threads
num_threads = 16
pi is 3.141593 in 0.830614 seconds and 16 threads
num_threads = 17
pi is 3.141592 in 0.856769 seconds and 17 threads
num_threads = 18
pi is 3.141591 in 0.907325 seconds and 18 threads
num_threads = 19
pi is 3.141592 in 0.880962 seconds and 19 threads
num_threads = 20
pi is 3.141592 in 0.855475 seconds and 20 threads
num_threads = 21
pi is 3.141592 in 0.825202 seconds and 21 threads
num_threads = 22
pi is 3.141592 in 0.759689 seconds and 22 threads
num_threads = 23
pi is 3.141592 in 0.751121 seconds and 23 threads
num_threads = 24
pi is 3.141592 in 0.745476 seconds and 24 threads
So, what am I missing?
Solution
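You have a variable x that is shared among all threads, because it is declared in main, outside the parallel region.
The compiler will usually keep the computed value of x in a register, so the result still comes out close to correct, but every iteration also writes that value to memory. Since all threads write to the same location, the cache line holding x is repeatedly flushed and reloaded, which causes stalls. Strictly speaking this is also a data race, which may explain the small deviations in pi at some thread counts in the output above (for example 3.141579 with 7 threads).
The fix is to declare x in the body of the loop that uses it ( double x = (i-0.5)*step; ) rather than in main.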
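The fix described above can be sketched as follows. This is a minimal illustration, not the code from the video series: the function name compute_pi, its parameters, and the num_threads clause (used here in place of omp_set_num_threads) are assumptions made for the example. The only change that matters is that x is now declared inside the loop body, so each thread works on its own copy.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi as the integral of 4/(1+x^2) over [0,1].
 * compute_pi and its parameters are hypothetical names for this sketch. */
double compute_pi(long num_steps, int nthreads)
{
    assert(num_steps > 0 && nthreads > 0);

    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    /* The loop index is private automatically; the per-thread partial
     * sums are combined by the reduction clause. */
    #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
    for (long i = 1; i <= num_steps; i++) {
        /* x is declared inside the loop body, so every thread has its own
         * copy and no two threads write to the same memory location. */
        double x = (i - 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    return step * sum;
}
```

To repeat the measurement from the question, call compute_pi(100000000, n) between two omp_get_wtime() calls for each thread count n, and compile with OpenMP enabled (for example gcc -fopenmp). Keeping the declaration of x in main and adding private(x) to the pragma should have the same effect.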