首页 > 解决方案 > 在 Rcpp 中快速高效地创建字符 DataFrame

问题描述

我编写了一个将字符值读入std::vector<std::string>向量的解析器,它在亚秒内解析 100 万条记录。然后我想将向量转换为 a Rcpp::DataFrame,这需要超过 80 秒...

有没有办法Rcpp::DataFrame“有效地”从大字符向量中创建一个?

使用数值时,我会尝试std::memcpy()使用std::vectorto Rcpp::NumericVector(有关更多信息,请参阅此int64 示例或此data.table 示例),但这似乎不适用于字符向量,因为它们的大小不同。

更多信息

该函数的基本思想是解析数独字符串数据(每个数独字符串正好81个字符长),每行有两个数独(数据保存为.csv文件,数据可以在这里找到)。

$ head sudoku.csv
quizzes,solutions
004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378
040100050107003960520008000000000017000906800803050620090060543600080700250097100,346179258187523964529648371965832417472916835813754629798261543631485792254397186
600120384008459072000006005000264030070080006940003000310000050089700000502000190,695127384138459672724836915851264739273981546946573821317692458489715263562348197
497200000100400005000016098620300040300900000001072600002005870000600004530097061,497258316186439725253716498629381547375964182841572639962145873718623954534897261
005910308009403060027500100030000201000820007006007004000080000640150700890000420,465912378189473562327568149738645291954821637216397854573284916642159783891736425
100005007380900000600000480820001075040760020069002001005039004000020100000046352,194685237382974516657213489823491675541768923769352841215839764436527198978146352
009065430007000800600108020003090002501403960804000100030509007056080000070240090,289765431317924856645138729763891542521473968894652173432519687956387214178246395
000000657702400100350006000500020009210300500047109008008760090900502030030018206,894231657762495183351876942583624719219387564647159328128763495976542831435918276
503070190000006750047190600400038000950200300000010072000804001300001860086720005,563472198219386754847195623472638519951247386638519472795864231324951867186723945 

在 cpp 读取函数内部,我fread()将文件填充一个缓冲区( )并将buffer数据解析为所述std::vector<std::string>向量(在本例中)ab

请注意,包括我迄今为止所做的实验在内的完整代码可以在这个 gist中找到。

const int BUFFERSIZE = 1e8;
const int n_lines = count_lines(filename); // 1 million in this case

FILE* infile;
infile = fopen(filename.c_str(), "r");
unsigned char * buffer;
buffer = (unsigned char*) malloc(BUFFERSIZE);
int64_t this_buffer_size;
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);

// removing of header not shown here...
// BUFFERSIZE is also checked so that no overflow occurs... not shown here..

int line = 0;
while ((this_buffer_size = fread(buffer, 1, BUFFERSIZE, infile)) > 0) {
  int i = 1;
  while (i < buffer) {
    // buffer from i to i + 81 would look like this:
    // 004300209005009001070060043006002087190007400050083000600000105003508690042910300
    // whereas for b it looks from i to i + 81 like this:
    // 864371259325849761971265843436192587198657432257483916689734125713528694542916378
    a[line] = std::string(buffer + i, buffer + i + 81);
    i += 81 + 1; // skip to the next value, +1 for the , or a newline
    b[line] = std::string(buffer + i, buffer + i + 81);
    i += 81 + 1; // skip to the next value, +1 for the , or a newline
    line++;
  }
  // check next buffer, not shown here...
}

// NEXT: parse the data to an R structure

这需要 250 毫秒以下的 100 万行数据集。

然后我想Rcpp::DataFrame从两个向量中创建一个ab,这就是问题所在。转换为 R 对象大约需要 80 秒。

考虑到数据知识(每行 2 项,每 81 个字符长,100 万行,......),有没有更快的选择?

我并不一定要先填充s,如果可能的话,我也可以直接在结构std::vector中收集数据。Rcpp

到目前为止我尝试过的

教科书解决方案

Rcpp::DataFrame df = Rcpp::DataFrame::create(
  Rcpp::Named("unsolved") = a,
  Rcpp::Named("solved") = b,
  Rcpp::Named("stringsAsFactors") = false
);

先列出来


Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

df["unsolved"] = a;
df["solved"] = b;

df.attr("class") = Rcpp::CharacterVector::create("data.frame");

字符矩阵

不能真正与其他方法相比,但感觉更原生......

// before the main loop
std::vector<std::string> vec;
// vec holds both data entries, the first (unsolved) at 0 -> n_lines and solved values at n_lines -> n_lines * 2
vec.resize(2 * n_lines);


// inside the loop
vec[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
vec[l + n_lines] = std::string(buffer + i, buffer + i + 81);
i += 82;
l++;


// to CharacterMatrix
Rcpp::CharacterMatrix res(n_lines, 2, vec.begin());

Github 上的完整代码和时序

标签: rrcpp

解决方案


感谢您提供可用数据的快照(顺便说一句:没有必要对单个文件进行 tar 处理,您可以只xz编辑 csv 文件。无论如何。)

我在我的 Ubuntu 20.04 机器上得到了不同的结果,这些结果更接近我的预期:

  • data.table::fread()正如我们预期的那样具有竞争力(我正在逃避data.tablegit因为在最近的版本中有一个回归)
  • vroom并且stringfish,一旦我们强制物化来比较苹果和苹果而不是苹果的图像,它们就差不多了
  • Rcpp也在球场上,但有点多变

我将它限制在 10 次运行,如果你运行更多,可变性可能会下降,但缓存也会影响它。

简而言之:没有明确的赢家,当然也没有强制替换(已经知道要调整的)参考实现之一的授权。

edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
  expr     min      lq    mean  median      uq     max neval cld
 fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939    10   a
 vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921    10   a
 sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975    10   a
 rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847    10   a
 rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851    10   a
edd@rob:~/git/stackoverflow/65043010(master)$ 

顶级脚本的代码

suppressMessages({
    library(data.table)
    library(Rcpp)
    library(vroom)
    library(stringfish)
    library(microbenchmark)
})

vroomread <- function(csvfile) {
    a <- vroom(csvfile, col_types = "cc", progress = FALSE)
    vroom:::vroom_materialize(a, TRUE)
}
sfread <- function(csvfile) {
    a <- sf_readLines(csvfile)
    dt <- data.table::data.table(uns = sf_substr(a, 1, 81),
                                 sol = sf_substr(a, 83, 163))
}

sourceCpp("rcppfuncs.cpp")


csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
               vroom=vroomread(csvfile),
               sfish=sfread(csvfile),
               rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
               rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
               times=10)

Rcpp 脚本的代码

#include <Rcpp.h>
#include <fstream>

//[[Rcpp::export]]

Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  std::vector<std::string> a, b;
  a.reserve(n_lines);
  b.reserve(n_lines);

  while (std::getline(file, line)) {
    a.push_back(line.substr(0, 80));
    b.push_back(line.substr(82, 162));
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  Rcpp::CharacterVector a(n_lines), b(n_lines);

  int l = 0;
  while (std::getline(file, line)) {
    a(l) = line.substr(0, 80);
    b(l) = line.substr(82, 162);
    l++;
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

推荐阅读