c++ - Parallel vector multiplication using several threads takes longer than sequential
Problem description
I have two functions, which do the multiplication of two vectors of integers (filled with all ones for now). I expect the function vector_multiplication_concurrent, which uses threads, to be faster than the function vector_multiplication. However, it is actually a bit slower. I suspect that this is because only one thread works on the result variable at a time, so the threads do not actually do the job in parallel. Is that correct? How should I change the code to get the parallel function to be faster?
The code:
#include <iostream>
#include <chrono>
#include <vector>
#include <thread>
#include <mutex>
void vector_multiplication(std::vector<int> const & v1,
std::vector<int> const & v2,
int & result) {
for (int ind = 0; ind < v1.size(); ++ind) {
result += v1[ind] * v2[ind];
}
}
static std::mutex mtx;
void vector_multiplication_concurrent(std::vector<int> const & v1,
std::vector<int> const & v2,
int start_ind, int end_ind,
int & result) {
std::lock_guard<std::mutex> lck(mtx);
for (int ind = start_ind; ind <= end_ind; ++ind) {
result += v1[ind] * v2[ind];
}
}
int main(){
std::vector<int> v1 (10000000, 1);
std::vector<int> v2 (10000000, 1);
int result = 0;
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
vector_multiplication(v1, v2, result);
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << "Duration: " << duration << '\n';
std::cout << "Product: " << result << '\n';
int result_concurrent = 0;
int threads_num = 4;
std::vector<std::thread> threads;
std::chrono::high_resolution_clock::time_point t3 = std::chrono::high_resolution_clock::now();
for (int th = 0; th < threads_num; ++th) {
threads.push_back(std::thread(vector_multiplication_concurrent,
std::ref(v1),
std::ref(v2),
th * (v1.size() / threads_num),
th * (v1.size() / threads_num) + v1.size() / threads_num - 1,
std::ref(result_concurrent)));
}
for (auto & th : threads) {
th.join();
}
std::chrono::high_resolution_clock::time_point t4 = std::chrono::high_resolution_clock::now();
auto duration_concurrent = std::chrono::duration_cast<std::chrono::microseconds>(t4 - t3).count();
std::cout << "Duration concurrent: " << duration_concurrent << '\n';
std::cout << "Product concurrent: " << result_concurrent << '\n';
return 0;
}
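Solution
As mentioned in the comments, you lock the mutex for the entire duration of the function, so the code is effectively sequential. You only need a mutex when multiple threads access the same memory and at least one of them writes to it.
When summing the products of the vector elements, the only point at which multiple threads write to the same memory is when they add to the final result, so you can change the function to: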
static std::mutex mtx;
void vector_multiplication_concurrent(std::vector<int> const & v1,
std::vector<int> const & v2,
int start_ind, int end_ind,
int & result) {
// fully parallel part
// v1 and v2 are shared, but you are only reading
int temp = 0;
for (int ind = start_ind; ind <= end_ind; ++ind) {
temp += v1[ind] * v2[ind];
}
// only this requires you to synchronize access
// result is shared and you are writing to it
std::lock_guard<std::mutex> lck(mtx);
result += temp;
}
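As a variation on the function above (not part of the original answer), the mutex can be dropped altogether if each thread writes its partial sum into its own slot and the slots are added up after joining. The following is only a minimal sketch: the names partial_dot and parallel_dot are hypothetical, the ranges are half-open, the last thread picks up any remainder when the size is not divisible by the thread count, and v2 is assumed to be at least as long as v1.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums the products over the half-open range [begin, end)
// into a local variable and writes it once to this thread's own output slot.
void partial_dot(std::vector<int> const& v1,
                 std::vector<int> const& v2,
                 std::size_t begin, std::size_t end,
                 long long& out) {
    long long temp = 0;  // private to this thread, no synchronization needed
    for (std::size_t i = begin; i < end; ++i) {
        temp += static_cast<long long>(v1[i]) * v2[i];
    }
    out = temp;  // no other thread touches this slot
}

// Hypothetical driver: splits the work into threads_num chunks.
// Assumes v2.size() >= v1.size().
long long parallel_dot(std::vector<int> const& v1,
                       std::vector<int> const& v2,
                       unsigned threads_num) {
    if (threads_num == 0) threads_num = 1;
    std::vector<long long> partial(threads_num, 0);  // one slot per thread
    std::vector<std::thread> threads;
    std::size_t const n = v1.size();
    std::size_t const chunk = n / threads_num;
    for (unsigned t = 0; t < threads_num; ++t) {
        std::size_t const begin = t * chunk;
        // The last thread also handles the leftover elements.
        std::size_t const end = (t + 1 == threads_num) ? n : begin + chunk;
        threads.emplace_back(partial_dot, std::cref(v1), std::cref(v2),
                             begin, end, std::ref(partial[t]));
    }
    for (auto& th : threads) {
        th.join();
    }
    // All threads have finished, so reading the slots here is safe.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling parallel_dot(v1, v2, 4) would take the place of the thread-creation and join loops in main. Whether this beats the sequential loop still depends on the vector size and the cost of starting threads, so it is worth measuring with optimizations enabled.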
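PS: I strongly suggest you use iterators instead of indices. Also note that your loop is basically a rewrite of std::inner_product. Using it instead of a plain loop would make your code more expressive.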
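To illustrate the std::inner_product suggestion, here is a short sketch. The function names dot and dot_range are hypothetical; dot_range shows how the per-thread loop over an inclusive index range could be expressed with iterators, assuming the indices are valid for both vectors.

```cpp
#include <numeric>
#include <vector>

// Hypothetical wrapper: dot product of two whole vectors.
// Assumes v2 has at least as many elements as v1.
int dot(std::vector<int> const& v1, std::vector<int> const& v2) {
    return std::inner_product(v1.begin(), v1.end(), v2.begin(), 0);
}

// Hypothetical per-chunk version: the same inclusive [start_ind, end_ind]
// range as in the question, expressed with iterators.
int dot_range(std::vector<int> const& v1, std::vector<int> const& v2,
              int start_ind, int end_ind) {
    return std::inner_product(v1.begin() + start_ind,
                              v1.begin() + end_ind + 1,  // one past the last element
                              v2.begin() + start_ind,
                              0);
}
```

The last argument is the initial value of the accumulator, and its type determines the type of the result, so passing 0LL instead of 0 would accumulate in long long.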