Strange memory consumption of C fread / C++ read functions, based on Linux sysinfo data

Question

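OK, my program shows some strange (in my opinion) behavior. It has now been reduced to just reading 3 arrays from rather large (approximately 24 GB and 48 GB) binary files. The structure of these files is simple: they contain a small header followed by 3 arrays of types int, int and float, all 3 of size N, where N is very large: 2147483648 for the 24 GB file and 4294967296 for the 48 GB one.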

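To keep track of memory consumption, I use a simple function based on Linux sysinfo to detect how much free memory I have at each stage of my program (for example, after I allocate the arrays that store the data and while I read the file). Here is the code of the function: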

#include <sys/sysinfo.h>
size_t get_free_memory_in_MB()
{
    struct sysinfo info;
    sysinfo(&info);
    return info.freeram / (1024 * 1024);
}

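Now straight to the problem: the strange part is that after reading each of the 3 arrays from the file, using either the standard C fread function or the C++ read function (it doesn't matter which), and then checking how much free memory is left, I see that the amount of free memory has decreased a lot (by approximately edges_count * sizeof(int) in the next example).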

fread(src_ids, sizeof(int), edges_count, graph_file);
cout << "1 test: " << get_free_memory_in_MB() << " MB" << endl;

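So basically, after reading the whole file, my memory consumption according to sysinfo is almost 2 times larger than expected. To illustrate the problem better, I provide the code of the whole function together with its output; please read it, it is very small and shows the problem much more clearly.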

bool load_from_edges_list_bin_file(string _file_name)
{
    bool directed = true;
    int vertices_count = 1;
    long long int edges_count = 0;

    // open the file
    FILE *graph_file = fopen(_file_name.c_str(), "r");
    if(graph_file == NULL)
        return false;

    // just reading a simple header here
    fread(reinterpret_cast<char*>(&directed), sizeof(bool), 1, graph_file);
    fread(reinterpret_cast<char*>(&vertices_count), sizeof(int), 1, graph_file);
    fread(reinterpret_cast<char*>(&edges_count), sizeof(long long), 1, graph_file);

    cout << "edges count: " << edges_count << endl;
    cout << "Before graph alloc free memory: " << get_free_memory_in_MB() << " MB" << endl;

    // allocate the arrays to store the result
    int *src_ids = new int[edges_count];
    int *dst_ids = new int[edges_count];
    _TEdgeWeight *weights = new _TEdgeWeight[edges_count];

    cout << "After graph alloc free memory: " << get_free_memory_in_MB() << " MB" << endl;

    memset(src_ids, 0, edges_count * sizeof(int));
    memset(dst_ids, 0, edges_count * sizeof(int));
    memset(weights, 0, edges_count * sizeof(_TEdgeWeight));

    cout << "After memset: " << get_free_memory_in_MB() << " MB" << endl;

    // add edges from file
    fread(src_ids, sizeof(int), edges_count, graph_file);
    cout << "1 test: " << get_free_memory_in_MB() << " MB" << endl;

    fread(dst_ids, sizeof(int), edges_count, graph_file);
    cout << "2 test: " << get_free_memory_in_MB() << " MB" << endl;

    fread(weights, sizeof(_TEdgeWeight), edges_count, graph_file);
    cout << "3 test: " << get_free_memory_in_MB() << " MB" << endl;

    cout << "After actual load: " << get_free_memory_in_MB() << " MB" << endl;

    delete []src_ids;
    delete []dst_ids;
    delete []weights;

    cout << "After we removed the graph load: " << get_free_memory_in_MB() << " MB" << endl;

    fclose(graph_file);

    cout << "After we closed the file: " << get_free_memory_in_MB() << " MB" << endl;

    return true;
}

So, nothing complicated. Straight to the output (with some comments from me after the //). First, for the 24 GB file:

Loading graph...
edges count: 2147483648
Before graph alloc free memory: 91480 MB 
After graph alloc free memory: 91480 MB // allocated memory here, but nothing changed, why?
After memset: 66857 MB // ok, we put some data into the memory (memset) and consumed exactly 24 GB, seems correct
1 test: 57658 MB // first read and we have lost 9 GB...
2 test: 48409 MB // -9 GB again...
3 test: 39161 MB // and once more...
After actual load: 39161 MB // we lost in total 27 GB during the reads. How???
After we removed the graph load: 63783 MB // removed the arrays from memory and freed the memory we have allocated
// 24 GB freed, but 27 are still consumed somewhere
After we closed the file: 63788 MB // closing the file doesn't help
Complete!
After we quit the function: 63788 MB // quitting the function doesn't help either.

Similarly for the 48 GB file:

edges count: 4294967296
Before graph alloc free memory: 91485 MB
After graph alloc free memory: 91485 MB
After memset: 42236 MB
1 test: 23784 MB
2 test: 5280 MB
3 test: 490 MB
After actual load: 490 MB
After we removed the graph load: 49737 MB
After we closed the file: 49741 MB
Complete!
After we quit the function: 49741 MB

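So, what is happening inside my program?

1) Why is so much memory lost during the reads (both with fread in C and with file streams in C++)?

2) Why doesn't closing the file free the consumed memory?

3) Maybe sysinfo is showing me incorrect information?

4) Could this problem be related to memory fragmentation?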

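By the way, I'm launching my program on a supercomputer node to which I have exclusive access (so other people can't influence it), and there are no side applications that could influence my program.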

Thank you for reading this!

Tags: c++, linux, file, memory

Answer


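This is almost certainly the disk (page) cache. When you read a file, the operating system stores some or all of its contents in memory, which reduces the amount of free memory. It does this to optimize future reads.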

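However, this does not mean that the memory is either used by your process or otherwise unavailable. If and when the memory is needed, the OS will free it and make it available.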

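You should be able to confirm this by tracking the value of the bufferram field in the sysinfo structure (https://www.systutorials.com/docs/linux/man/2-sysinfo/) or by looking at the output of the free -m command before and after running your program.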
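As a rough sketch of that check, the helper from the question could be extended to print bufferram next to freeram. The names to_mib and print_sysinfo_snapshot are made up for this illustration and are not part of the original program. The sketch also scales by mem_unit, because the sysinfo man page says the memory fields are given as multiples of mem_unit bytes.

```cpp
#include <sys/sysinfo.h>
#include <cstdint>
#include <iostream>

// Convert a sysinfo counter to MiB. The counters are expressed in
// multiples of mem_unit bytes, so scale before dividing.
std::uint64_t to_mib(std::uint64_t amount, std::uint64_t mem_unit)
{
    return amount * mem_unit / (1024ULL * 1024ULL);
}

// Hypothetical helper: print free and buffer memory as reported by
// sysinfo(2). Returns false if the call fails.
bool print_sysinfo_snapshot(const char *label)
{
    struct sysinfo info;
    if (sysinfo(&info) != 0)
        return false;

    std::cout << label
              << ": freeram=" << to_mib(info.freeram, info.mem_unit) << " MB"
              << ", bufferram=" << to_mib(info.bufferram, info.mem_unit) << " MB"
              << std::endl;
    return true;
}
```

Calling print_sysinfo_snapshot("1 test") after each fread, in place of the existing cout lines, would show both counters at every stage of the load.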

For more details on this, see the following answer: https://superuser.com/questions/980820/what-is-the-difference-between-memfree-and-memavailable-in-proc-meminfo
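The fields discussed in that answer can also be read from inside the program. The following is a minimal sketch, assuming a kernel that exposes MemAvailable in /proc/meminfo (older kernels may not). The function name meminfo_kb is hypothetical and chosen only for this example.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Return the value in kB for a key such as "MemAvailable" from text in
// /proc/meminfo format ("Key:   12345 kB"). Returns -1 if the key is
// absent or its value cannot be parsed.
long long meminfo_kb(std::istream &in, const std::string &key)
{
    const std::string prefix = key + ":";
    std::string line;
    while (std::getline(in, line)) {
        // Match only at the start of the line, so "Cached" does not
        // accidentally match "SwapCached".
        if (line.compare(0, prefix.size(), prefix) == 0) {
            std::istringstream fields(line.substr(prefix.size()));
            long long value = 0;
            if (!(fields >> value))
                return -1;
            return value;
        }
    }
    return -1;
}

// Convenience wrapper that reads the live /proc/meminfo.
long long meminfo_kb(const std::string &key)
{
    std::ifstream in("/proc/meminfo");
    return meminfo_kb(in, key);
}
```

Printing meminfo_kb("MemFree"), meminfo_kb("Cached") and meminfo_kb("MemAvailable") before and after the reads should show whether the lost memory has moved into the cache while remaining available.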

