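r - Fast and efficient creation of a character DataFrame in Rcpp
Problem description
I wrote a parser that reads character values into std::vector<std::string> vectors; it parses 1 million records in under a second. I then want to convert the vectors into an Rcpp::DataFrame, which takes more than 80 seconds...
Is there a way to create an Rcpp::DataFrame "efficiently" from large character vectors?
With numeric values, I would try to std::memcpy() the std::vector to an Rcpp::NumericVector (see this int64 example or this data.table example for more information), but this does not seem to work for character vectors, because their elements vary in size.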
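The distinction drawn above can be sketched in plain C++, without Rcpp. A std::vector<double> is one contiguous block of trivially copyable values, so a single std::memcpy() can move it into a preallocated destination. A std::vector<std::string> holds objects that each manage their own character storage, so its elements have to be copied one at a time. R character vectors are similar in this respect: each element refers to a separately stored string. The helper names bulk_copy and elementwise_copy are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: copy doubles into a preallocated destination buffer
// (think of the memory behind a numeric vector). One memcpy is enough
// because double is trivially copyable and both buffers are contiguous.
void bulk_copy(const std::vector<double>& src, double* dst, std::size_t dst_len) {
    assert(dst_len >= src.size());
    if (!src.empty()) {
        std::memcpy(dst, src.data(), src.size() * sizeof(double));
    }
}

// Hypothetical helper: strings cannot be moved with one memcpy, because each
// std::string owns its own character storage. Every element is copied
// individually instead.
std::vector<std::string> elementwise_copy(const std::vector<std::string>& src) {
    std::vector<std::string> dst;
    dst.reserve(src.size());
    for (const std::string& s : src) {
        dst.emplace_back(s.data(), s.size());
    }
    return dst;
}
```

The per-element loop is where the cost scales with the number of strings rather than with a single block copy.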
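More information
The basic idea of the function is to parse sudoku string data (each sudoku string is exactly 81 characters long), with two sudokus per line (the data is stored as a .csv file; the data can be found here).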
$ head sudoku.csv
quizzes,solutions
004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378
040100050107003960520008000000000017000906800803050620090060543600080700250097100,346179258187523964529648371965832417472916835813754629798261543631485792254397186
600120384008459072000006005000264030070080006940003000310000050089700000502000190,695127384138459672724836915851264739273981546946573821317692458489715263562348197
497200000100400005000016098620300040300900000001072600002005870000600004530097061,497258316186439725253716498629381547375964182841572639962145873718623954534897261
005910308009403060027500100030000201000820007006007004000080000640150700890000420,465912378189473562327568149738645291954821637216397854573284916642159783891736425
100005007380900000600000480820001075040760020069002001005039004000020100000046352,194685237382974516657213489823491675541768923769352841215839764436527198978146352
009065430007000800600108020003090002501403960804000100030509007056080000070240090,289765431317924856645138729763891542521473968894652173432519687956387214178246395
000000657702400100350006000500020009210300500047109008008760090900502030030018206,894231657762495183351876942583624719219387564647159328128763495976542831435918276
503070190000006750047190600400038000950200300000010072000804001300001860086720005,563472198219386754847195623472638519951247386638519472795864231324951867186723945
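Inside the cpp reading function, I fread() the file into a buffer (buffer) and parse the data into the std::vector<std::string> vectors mentioned above (a and b in this case).
Note that the complete code, including the experiments I have done so far, can be found in this gist.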
const int BUFFERSIZE = 1e8;
const int n_lines = count_lines(filename); // 1 million in this case
FILE* infile;
infile = fopen(filename.c_str(), "r");
unsigned char * buffer;
buffer = (unsigned char*) malloc(BUFFERSIZE);
int64_t this_buffer_size;
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);
// removing of header not shown here...
// BUFFERSIZE is also checked so that no overflow occurs... not shown here..
int line = 0;
while ((this_buffer_size = fread(buffer, 1, BUFFERSIZE, infile)) > 0) {
int i = 0; // start at the first byte of the freshly filled buffer
while (i < this_buffer_size) { // compare against the number of bytes read, not the pointer
// buffer from i to i + 81 would look like this:
// 004300209005009001070060043006002087190007400050083000600000105003508690042910300
// whereas for b it looks from i to i + 81 like this:
// 864371259325849761971265843436192587198657432257483916689734125713528694542916378
a[line] = std::string(buffer + i, buffer + i + 81);
i += 81 + 1; // skip to the next value, +1 for the , or a newline
b[line] = std::string(buffer + i, buffer + i + 81);
i += 81 + 1; // skip to the next value, +1 for the , or a newline
line++;
}
// check next buffer, not shown here...
}
// NEXT: parse the data to an R structure
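This takes less than 250 ms for the 1-million-row dataset.
I then want to create an Rcpp::DataFrame from the two vectors a and b, and this is where the problem lies. The conversion to an R object takes about 80 seconds.
Given what is known about the data (2 items per line, each 81 characters long, 1 million lines, ...), is there a faster option?
I do not necessarily have to fill std::vectors first; if possible, I could also collect the data directly in an Rcpp structure.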
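Because every field is exactly 81 characters wide, the position of each value follows from arithmetic alone: one record occupies 81 + 1 + 81 + 1 = 164 bytes, so record r starts at byte r * 164 and its second field at r * 164 + 82. The following is a minimal sketch in plain C++, assuming the header has already been removed, that lines end in a single '\n' (not "\r\n"), and that the buffer holds only whole records. The names split_fixed_width, kField and kStride are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t kField = 81;                 // characters per sudoku string
constexpr std::size_t kStride = 2 * (kField + 1);  // 81 + ',' + 81 + '\n' = 164 bytes

// Hypothetical helper: split a header-free buffer of fixed-width records
// into the two columns, using offsets instead of searching for delimiters.
std::pair<std::vector<std::string>, std::vector<std::string>>
split_fixed_width(const std::string& buf) {
    // Assumption: only whole records, each terminated by a single '\n'.
    assert(buf.size() % kStride == 0);
    const std::size_t n = buf.size() / kStride;  // record count without scanning

    std::vector<std::string> a, b;
    a.reserve(n);
    b.reserve(n);
    for (std::size_t r = 0; r < n; ++r) {
        const std::size_t off = r * kStride;
        a.emplace_back(buf, off, kField);               // unsolved puzzle
        b.emplace_back(buf, off + kField + 1, kField);  // solution, after the ','
    }
    return {std::move(a), std::move(b)};
}
```

This only covers the parsing side; it does not by itself reduce the cost of converting the strings to R objects.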
What I have tried so far
Textbook solution
Rcpp::DataFrame df = Rcpp::DataFrame::create(
Rcpp::Named("unsolved") = a,
Rcpp::Named("solved") = b,
Rcpp::Named("stringsAsFactors") = false
);
List first
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.frame");
Character matrix
Not really comparable to the other approaches, but it feels more native...
// before the main loop
std::vector<std::string> vec;
// vec holds both data entries, the first (unsolved) at 0 -> n_lines and solved values at n_lines -> n_lines * 2
vec.resize(2 * n_lines);
// inside the loop
vec[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
vec[l + n_lines] = std::string(buffer + i, buffer + i + 81);
i += 82;
l++;
// to CharacterMatrix
Rcpp::CharacterMatrix res(n_lines, 2, vec.begin());
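Full code and timings on Github
Solution
Thanks for providing a snapshot of usable data (BTW: there is no need to tar a single file; you could have just xz-ed the csv file. Anyway.)
I get different results on my Ubuntu 20.04 machine, and they are closer to what I would expect:
- data.table::fread() is competitive, as we expected (I am running data.table from git because there was a regression in the most recent release)
- vroom and stringfish are about the same once we force materialization, so that we compare apples to apples rather than to images of apples
- Rcpp is in the ballpark too, but a little more variable
I capped it at 10 runs. Variability will probably come down if you run more, but caching influences it too.
In short: there is no clear winner, and certainly no mandate to replace one of the reference implementations (which are already known to be tuned).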
edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
expr min lq mean median uq max neval cld
fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939 10 a
vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921 10 a
sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975 10 a
rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847 10 a
rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851 10 a
edd@rob:~/git/stackoverflow/65043010(master)$
Code for the top-level script
suppressMessages({
library(data.table)
library(Rcpp)
library(vroom)
library(stringfish)
library(microbenchmark)
})
vroomread <- function(csvfile) {
a <- vroom(csvfile, col_types = "cc", progress = FALSE)
vroom:::vroom_materialize(a, TRUE)
}
sfread <- function(csvfile) {
a <- sf_readLines(csvfile)
dt <- data.table::data.table(uns = sf_substr(a, 1, 81),
sol = sf_substr(a, 83, 163))
}
sourceCpp("rcppfuncs.cpp")
csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
vroom=vroomread(csvfile),
sfish=sfread(csvfile),
rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
times=10)
Code for the Rcpp script
#include <Rcpp.h>
#include <fstream>
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
std::vector<std::string> a, b;
a.reserve(n_lines);
b.reserve(n_lines);
while (std::getline(file, line)) {
// substr(pos, count): each field is 81 characters; the ',' sits at index 81
a.push_back(line.substr(0, 81));
b.push_back(line.substr(82, 81));
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
Rcpp::CharacterVector a(n_lines), b(n_lines);
int l = 0;
while (std::getline(file, line)) {
// substr(pos, count): each field is 81 characters; the ',' sits at index 81
a(l) = line.substr(0, 81);
b(l) = line.substr(82, 81);
l++;
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
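One detail of the line splitting is easy to get wrong: the second argument of std::string::substr is a character count, not an end position. For an "unsolved,solved" line with 81-character fields, the first field is substr(0, 81) and the second is substr(82, 81). A small sketch in plain C++ follows; split_line is a hypothetical helper, and it assumes the line carries no trailing '\r'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: split one "unsolved,solved" line into its two
// 81-character fields. substr takes (position, count), not (start, end).
std::pair<std::string, std::string> split_line(const std::string& line) {
    const std::size_t width = 81;
    assert(line.size() >= 2 * width + 1);  // two fields plus the ','
    return {line.substr(0, width),           // characters 0..80
            line.substr(width + 1, width)};  // characters 82..162
}
```

A count larger than the remaining length is clamped to the end of the string, so an oversized second argument does not fail; it just hides the intent.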