Besides locks and the cost of creating and destroying threads, what slows down multithreaded code?

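Question

I wrote a program that uses std::thread::hardware_concurrency to find out how many threads my computer supports. I then divide the array size by N to get N blocks, and create N threads, each of which computes the sum of one block. Here is the code: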

#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Longest time (in microseconds) any worker spent summing its block.
// Atomic because every worker thread updates it concurrently.
std::atomic<int64_t> thread_cost_time{0};

template <typename Iterator, typename T> struct accumulate_block {
  void operator()(Iterator first, Iterator last, T &result) {
    using namespace std::chrono;
    auto start = high_resolution_clock::now();
    result = std::accumulate(first, last, result);
    auto stop = high_resolution_clock::now();
    const int64_t thread_time =
        duration_cast<microseconds>(stop - start).count();
    // Atomic "store the maximum": retry until our value is no longer larger.
    int64_t seen = thread_cost_time.load();
    while (thread_time > seen &&
           !thread_cost_time.compare_exchange_weak(seen, thread_time)) {
    }
  }
};

template <typename Iterator, typename T>
T parallel_accumulate(Iterator first, Iterator last, T &init, uint64_t num) {

  uint64_t length = std::distance(first, last);
  const uint64_t min_per_thread = 25;

  // on my PC this sets hardware_threads to 12
  const uint64_t hardware_threads = std::thread::hardware_concurrency();
  const uint64_t max_threads = (length + min_per_thread - 1) / (min_per_thread);

  // const uint64_t  num_threads = std::min(hardware_threads != 0 ?
  // hardware_threads : 2,
  //                                         max_threads);

  const uint64_t num_threads = num;

  const uint64_t block_size = length / num_threads;

  std::vector<T> results(num_threads);
  std::vector<std::thread> threads(num_threads - 1);
  Iterator block_start = first;
  for (uint64_t i = 0; i < num_threads - 1; i++) {
    Iterator block_end = block_start;
    std::advance(block_end, block_size);

    // calculate the sum of block
    threads[i] = std::thread{accumulate_block<Iterator, T>(), block_start,
                             block_end, std::ref(results[i])};
    block_start = block_end;
  }
  accumulate_block<Iterator, T>()(block_start, last, results[num_threads - 1]);
  std::for_each(threads.begin(), threads.end(),
                std::mem_fn(&std::thread::join));

  return std::accumulate(results.begin(), results.end(), init);
}




int main(int argc, char *argv[]) {

  //    constexpr const uint64_t sz = 1000000000;
  for (int number = 2; number < 32; number++) {
    int64_t parr = 0;
    int64_t single = 0;
    int64_t thread_trivial = 0;
    std::cout
        << "--------------------------------------------------------------"
        << std::endl;
    std::cout << "---------------------thread: " << number
              << "-----------------------" << std::endl;
    int iter_times = 10;
    for (int iter = 0; iter < iter_times; iter++) {
      thread_cost_time = 0;
      constexpr const uint64_t sz = 100000000 ;
      std::vector<uint64_t> arr;
      for (uint32_t i = 0; i < sz; i++) {
        arr.emplace_back(i);
      }

      using namespace std::chrono;

      auto start = std::chrono::high_resolution_clock::now();
      uint64_t init = 0;
      parallel_accumulate<decltype(arr.begin()), uint64_t>(
                       arr.begin(), arr.end(), std::ref(init), number);
      auto stop = std::chrono::high_resolution_clock::now();

      parr += std::chrono::duration_cast<microseconds>(stop - start).count();

      thread_trivial +=
          std::chrono::duration_cast<microseconds>(stop - start).count() -
          thread_cost_time;
      uint64_t init_ = 0;
      uint64_t arr_sz = arr.size();
      // uint64_t  block_sz = arr.size() / 2;
      start = std::chrono::high_resolution_clock::now();
      std::accumulate(arr.begin(), arr.end(), init_);
      // std::cout << init_ << std::endl;
      stop = std::chrono::high_resolution_clock::now();
      single += std::chrono::duration_cast<microseconds>(stop - start).count();

    }
    std::cout << "parallel " << parr / iter_times<< std::endl;
    std::cout << "single thread " << single / iter_times<< std::endl;
    std::cout << "parallel is "
              << static_cast<double>(single) / static_cast<double>(parr)
              << "x faster" << std::endl;
    std::cout << "thread create and destroy time " << thread_trivial / iter_times
              << std::endl;
  }
}

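I recorded the running time of both the multithreaded and the single-threaded version.

The best I get is 6.57 times faster than a single thread, even though std::thread::hardware_concurrency tells me that 12 threads can run concurrently.

There is no lock contention in this program. I also recorded the time spent creating and destroying threads; even after subtracting it, I still cannot reach a 12x speedup.

I thought thread scheduling might slow the multithreaded version down, but I have 12 hardware threads, so it should not top out at 6.57x. I also thought multithreading might lower the cache hit rate, but I am not sure. So how can I get 12 times faster than a single thread?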

Here are the statistics from my program (times in microseconds):

Threads  Parallel (µs)  Single-threaded (µs)  Speedup
2        324868         633777                1.95
3        218584         633777                2.87
4        167169         633777                3.77
5        136542         633777                4.64
6        113207         633777                5.48
7        147324         633777                4.27
8        136768         633777                4.67

You can run my code to get the data for 2 through 31 threads.

Tags: c++, multithreading, c++11

Answer


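Apparently, at least on my Intel Core i7, std::thread::hardware_concurrency() returns the number of hardware threads available. On hardware with simultaneous multithreading, usually 2 hardware threads share time on a single hardware core, and the core switches transparently between them. That means you can expect only about half the speedup factor that the result of std::thread::hardware_concurrency() suggests.

In practice, each hardware thread stalls from time to time for various reasons, for example while waiting for data to arrive from memory, which gives the other hardware thread extra processing time. Typically simultaneous multithreading (or Hyper-Threading, as Intel calls it) gives you an extra 15% of performance, so you may expect a speedup factor of up to (12/2)*(115/100) = 6.9.

Overheads, including the ones you mention and, in my experience, the increased working-set size, can reduce the speedup factor further.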
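The estimate above can be written as a small helper. This is only a sketch of the arithmetic: the function names are made up for illustration, and the defaults (2 hardware threads per core, a 15% gain from simultaneous multithreading) are the rule-of-thumb figures from this answer, not values queried from the hardware.

```cpp
#include <cassert>
#include <thread>

// Rough upper bound on the speedup for a machine with SMT.
// threads_per_core and smt_gain are assumptions (2 and 0.15 mirror the
// figures used in this answer); they are not read from the hardware.
double expected_speedup(unsigned hardware_threads,
                        unsigned threads_per_core = 2,
                        double smt_gain = 0.15) {
  assert(threads_per_core > 0);
  const double physical_cores =
      static_cast<double>(hardware_threads) / threads_per_core;
  return physical_cores * (1.0 + smt_gain);
}

// Same estimate for the current machine. hardware_concurrency() may
// return 0 when the value is unknown, so fall back to 2 in that case.
double expected_speedup_here() {
  const unsigned n = std::thread::hardware_concurrency();
  return expected_speedup(n != 0 ? n : 2);
}
```

For the 12 hardware threads in the question, expected_speedup(12) gives about 6.9, which is close to the measured 6.57.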

