No speedup with many slow cores?

Problem description

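My C++ program computes many features of 3D points. This takes a while, so I do the computation in different threads. At the end, all features (from all threads) have to be stored in a single file. On my local machine the multithreaded implementation was very successful (4 threads -> run time reduced by 73%). On my server (40 slow 2 GHz cores, 80 threads), however, it is even slower than with 4 threads on my local machine. Runtimes: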

My code is attached below, followed by what I have already tried:

      ...
      std::vector<double*> points;
      for(unsigned int j = 0; j < xyz.size(); j++) {
        points.push_back(new double[3]{xyz[j][0], xyz[j][1], xyz[j][2]});
      } 
      ofstream featuresOut;
      featuresOut.open(...);
      ...
      KDtree t(&points[0], points.size()); // Build tree
      float batchSize = ((float)points.size())/jobs;
      unsigned int first = job * batchSize;
      unsigned int last = ((job+1) * batchSize) - 1;

      // Generate features
    #ifdef _OPENMP
      omp_set_num_threads(OPENMP_NUM_THREADS);
    #pragma omp parallel for schedule(dynamic)
    #endif
      for(unsigned int r = first; r <= last; r++) {
        if (r % 100000 == 0) {
          cout << "Calculating features for point nr. " << r << endl;
        }
    #ifdef _OPENMP
        int thread_num = omp_get_thread_num();
    #else
        int thread_num = 0;
    #endif
        double features[FEATURE_VECTOR_SIZE];
        if (!(ignoreClass0 && type.valid() && type[r]==0)) {
          double reference[3] = {xyz[r][0], xyz[r][1], xyz[r][2]};
          vector<Point> neighbors = t.kNearestNeighbors(reference, kMax, thread_num); // an ordered set of at most kMax neighbors (could be fewer for small scans)
          //std::vector<double> features = generateNeighborhoodFeatures(t, reference, kMin, kMax, kDelta, cylRadius);
          unsigned int kOpt = determineKOpt(t, reference, neighbors, kMin, kMax, kDelta); 
          generateNeighborhoodFeatures(features, t, reference, neighbors, kOpt, cylRadius, false, thread_num);
    #pragma omp critical
          {
            featuresOut << xyz[r][0] << "," << xyz[r][1] << "," << xyz[r][2] << ",";
            featuresOut << kOpt;

            for(unsigned int j = 0; j < FEATURE_VECTOR_SIZE; j++) {
              if (isfinite(features[j])) {
                featuresOut << "," << features[j];
              }
              else {
                cout << "Attention! Feature " << j << " (+5) was not finite. Replaced it with very big number" << endl;
                featuresOut << "," << DBL_MAX;
              }
            }
            featuresOut << ",";
            if (type.valid()) {
              featuresOut << type[r];
            } else {
              featuresOut << 0;
            }
            featuresOut << endl;
          }
        }
      }

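Writing to disk only at the end (aggregating the results of the threads) does not solve the problem (see @J.Svejda's answer). Likewise, keeping a separate KDtree per thread does not yield a speedup.

Thanks for your help.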

Tags: c++, openmp

Solution


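I believe this is because of your critical section. Writing to disk usually takes far longer than computing on the CPU. I don't know how complex your work on the KDTree is, but a write to disk can take several milliseconds, whereas CPU instructions take on the order of nanoseconds. Although the data may not be flushed until you send endl to featuresOut, the code does that once per point, and this could explain your poor scaling: the critical section takes too long, and the threads have to wait for each other.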
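
As a rough illustration of this argument (a simplified model, not a measurement of the code in the question): suppose a fraction s of the time spent on each point is inside the critical section, which only one thread can execute at a time. With N threads, the achievable speedup S is then bounded both by the thread count and by the serialized part:

```latex
S(N) \le \min\!\left(N, \frac{1}{s}\right)
```

With an assumed s = 0.05, 4 threads can still come close to a speedup of 4, while 80 threads cannot exceed 20 no matter how many cores are available. The value of s is hypothetical; the real fraction would have to be measured.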

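Maybe increase the amount of work per thread, for example by having one thread handle 5% of the points. It then writes the aggregated data for many points to the file at once. See whether that improves the speedup.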
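
The suggestion above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code from the question: computeFeature is a hypothetical stand-in for the k-nearest-neighbor feature computation, and writeFeaturesBuffered is an illustrative name. Each thread formats its rows into a private std::ostringstream and enters the critical section once, to append the whole buffer, instead of once per point.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the expensive per-point feature computation.
static double computeFeature(std::size_t i) {
  return static_cast<double>(i) * 0.5;
}

// Writes one CSV row ("index,feature") per point to `out`.
// Rows are collected in a thread-local buffer, so the critical section is
// entered once per thread rather than once per point.
void writeFeaturesBuffered(std::ostream& out, std::size_t numPoints) {
#pragma omp parallel
  {
    std::ostringstream local;  // private to each thread

    // Signed loop index keeps the loop valid for older OpenMP versions too.
#pragma omp for schedule(dynamic, 1024) nowait
    for (long long i = 0; i < static_cast<long long>(numPoints); ++i) {
      // '\n' instead of std::endl: no flush after every row.
      local << i << ',' << computeFeature(static_cast<std::size_t>(i)) << '\n';
    }

    // One short critical section per thread: append the finished buffer.
#pragma omp critical
    out << local.str();
  }
}
```

Without OpenMP the pragmas are ignored and the function runs serially with the same result. With OpenMP, rows arrive grouped by thread, so their order is not deterministic; the index in the first column can be used to sort them afterwards. Since the question reports that writing only at the end did not help on the server, this removes one possible bottleneck but does not guarantee a speedup.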

