c++ - What causes poor parallel performance?

Problem description

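I am trying to implement a parallel algorithm that computes the Levenshtein distance between every pair of sequences in a list and stores the results in a matrix (a 2D vector). In other words, I am given a 2D vector of numbers (thousands of number sequences, each up to 30 numbers long), and I need to compute the Levenshtein distance between every pair of integer vectors. I implemented a serial algorithm that works, but when I tried to convert it to a parallel one, it became much slower (the more threads, the slower it gets). The parallel version is implemented with C++11 threads (I also tried OpenMP, with the same result).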

This is the function that distributes the work:

vector<vector<int>> getGraphParallel(vector<vector<int>>& records){
    int V = records.size();
    auto threadCount = std::thread::hardware_concurrency();

    if(threadCount == 0){
        threadCount = 1;
    }
    vector<future<vector<vector<int>>>> futures;
    int rowCount = V / threadCount;
    vector<vector<int>>::const_iterator first = records.begin();
    vector<vector<int>>::const_iterator last = records.begin() + V;

    for(int i = 0; i < threadCount; i++){
        int start = i * rowCount;
        if(i == threadCount - 1){
            rowCount += V % threadCount;
        }
        futures.push_back(std::async(getRows, std::ref(records), start, rowCount, V));
    }

    vector<vector<int>> graph;
    for(int i = 0; i < futures.size(); i++){
        auto result = futures[i].get();
        for(const auto &row : result){
            graph.push_back(row);
        }
    }
    for(int i = 0; i < V; i++)
    {
        for(int j = i + 1; j < V; j++){
            graph[j][i] = graph[i][j];
        }
    }

    return graph;
}

This is the function that computes the rows of the final matrix:

vector<vector<int>> getRows(vector<vector<int>>& records, int from, int count, int size){
    vector<vector<int>> result(count, vector<int>(size, 0));
    for(int i = 0; i < count; i++){
        for(int j = i + from + 1; j < size; j++){
            result[i][j] = levenshteinDistance(records[i + from], records[j]);
        }
    }
    return result;
}

And finally, this is the function that computes the Levenshtein distance:

int levenshteinDistance(const vector<int>& first, const vector<int>& second){
    const int sizeFirst = first.size();
    const int sizeSecond = second.size();

    if(sizeFirst == 0) return sizeSecond;
    if(sizeSecond == 0) return sizeFirst;

    vector<vector<int>> distances(sizeFirst + 1, vector<int>(sizeSecond + 1, 0));

    for(int i = 0; i <= sizeFirst; i++){
        distances[i][0] = i;
    }

    for(int j = 0; j <= sizeSecond; j++){
        distances[0][j] = j;
    }

    for (int j = 1; j <= sizeSecond; j++)
        for (int i = 1; i <= sizeFirst; i++)
            if (first[i - 1] == second[j - 1])
                distances[i][j] = distances[i - 1][j - 1];
            else
                distances[i][j] = min(min(
                        distances[i - 1][j] + 1,
                        distances[i][j - 1] + 1),
                        distances[i - 1][j - 1] + 1
                );

    return distances[sizeFirst][sizeSecond];
}

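One thing that occurred to me is that this slowdown might be caused by false sharing, but I cannot check it with perf because I am running Ubuntu in Oracle VirtualBox, where cache-miss counters are not available. If I am right and the slowdown is caused by false sharing, what should I do to fix it? If not, what is the cause of the slowdown?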

Tags: c++, parallel-processing

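One problem I can see is that you call std::async without specifying how it should run. It can run asynchronously or deferred. In the deferred case, everything runs on a single thread and is simply evaluated lazily. The default behavior is implementation-defined. In your case, it makes sense that it would run slower with more deferred evaluations. You could try std::async(std::launch::async, ...).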
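As a rough illustration of that suggestion, here is a minimal sketch of passing std::launch::async explicitly. The functions sumRange and parallelSum are hypothetical stand-ins, not code from the question: sumRange plays the role of getRows, and the chunking mirrors how getGraphParallel splits the rows. Whether this removes the slowdown in the original program is not something the sketch can show.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the per-chunk work (getRows in the question):
// sums `count` elements of `data`, starting at index `from`.
long long sumRange(const std::vector<int>& data, std::size_t from, std::size_t count) {
    assert(from + count <= data.size());
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(from);
    return std::accumulate(first, first + static_cast<std::ptrdiff_t>(count), 0LL);
}

// Splits `data` into `chunkCount` slices and processes each slice in its own task.
long long parallelSum(const std::vector<int>& data, unsigned chunkCount) {
    if (chunkCount == 0) {
        chunkCount = 1;
    }
    const std::size_t base = data.size() / chunkCount;
    std::vector<std::future<long long>> futures;

    for (unsigned i = 0; i < chunkCount; ++i) {
        const std::size_t from = i * base;
        // The last chunk also takes the remainder.
        const std::size_t count = (i == chunkCount - 1) ? data.size() - from : base;
        // With std::launch::async the task may not be deferred: it runs as if
        // on a new thread instead of lazily inside the later get() call.
        futures.push_back(
            std::async(std::launch::async, sumRange, std::cref(data), from, count));
    }

    long long total = 0;
    for (auto& f : futures) {
        total += f.get();  // blocks until that chunk has finished
    }
    return total;
}
```

Applied to the question's code, the corresponding change would be std::async(std::launch::async, getRows, std::ref(records), start, rowCount, V). On some toolchains the program also has to be linked against the platform thread library (for example with -pthread on GCC).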

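Also make sure your VM is configured to use more than one core. Ideally, when doing this kind of optimization, you should try to eliminate as many variables as possible. If you can, run the program natively, without the VM. A profiling tool is your best option; it will tell you exactly where the time is being spent.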
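A quick sanity check along those lines, before reaching for a real profiler, is to print how many hardware threads the program can see and to time the serial and parallel versions the same way. The helpers visibleThreads and elapsedMs below are hypothetical names introduced for this sketch. Wall-clock timing is only a coarse substitute for a profiler, since it does not show where the time goes.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <utility>

// Number of hardware threads visible to the process. hardware_concurrency()
// may return 0 when the value is unknown, so fall back to 1 as the question does.
unsigned visibleThreads() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsedMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // steady_clock never goes backwards
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For example, elapsedMs([&] { getGraphParallel(records); }) could be compared against the same call wrapped around the serial version. If visibleThreads() reports 1 inside the VM, no speedup from extra threads should be expected there.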
