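r - "setDT" error when reading a CSV file from an S3 bucket in R
Problem description
The code below reads a file from S3 into a Spark data frame and then anonymizes the data in that file: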
library(data.table)
library(digest)
ct_test <- spark_read_csv(
  sc,
  name = "test_data",
  memory = FALSE,
  path = "s3://XXXXXXX/sunny/Sample_data.csv",
  header = TRUE,
  delimiter = ",",
  stringsAsFactors = FALSE
)
cols_to_mask <- c("Email", "Phone")
anonymize <- function(x, algo = "crc32") {
  sapply(x, function(y) if (y == "" | is.na(y)) "" else digest(y, algo = algo))
}
setDT(ct_test)
ct_test[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
print(ct_test)
However, the code fails with the following error:
Error in setDT(ct_test) :
All elements in argument 'x' to 'setDT' must be of same length, but the profile of input lengths (length:frequency) is: [1:1, 2:1]
The first entry with fewer than 2 entries is 1
> ct_test[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
Error in `:=`((cols_to_mask), lapply(.SD, anonymize)) :
Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
>
Any help resolving this would be greatly appreciated.
Below is the output of str(ct_test):
$ ops:List of 2
..$ x : 'ident' chr "cx_data"
..$ vars: chr [1:5] "ID" "Name" "Email" "Phone" ...
..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
- attr(*, "class")= chr [1:4] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
Input dataset
ID,Name,Email,Phone,Survey
10,Ravi,test@gmail.com,874589,Survey 1
20,John,abc@gmail.com,878756,Survey 2
30,Smith,tt@yahoo.com,565656,Survey 3
40,Kevin,,,Survey 3
As suggested, I changed the code as follows:
cx_data <- spark_read_csv(
  sc,
  name = "cx_data",
  memory = FALSE,
  path = "s3://xxxx/sunny/Sample_data.csv",
  delimiter = ",",
  stringsAsFactors = FALSE
  # infer_schema = FALSE
)
test_data <- fread(cx_data)
But now it fails with the following error:
Error in fread(cx_data) :
input= must be a single character string containing a file name, a system command containing at least one space
Solution
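The str() output in the question shows that the object returned by spark_read_csv has class "tbl_spark", so it is a lazy reference to a table living in Spark rather than an in-memory data frame. That would explain both errors: setDT expects a list or data.frame of equal-length columns, and fread expects a file name or command string. One possible approach, assuming the data fits in the memory of the local R session, is to pull the table into R first (for example with dplyr::collect(ct_test)) and only then convert it with setDT. The sketch below is not run against Spark: local_df is a hypothetical stand-in for the collected result, rebuilt by hand from the sample rows in the question, and the empty-value check is reordered to test is.na first.

```r
library(data.table)
library(digest)

# Hypothetical stand-in for the collected Spark table. With sparklyr this
# would be something like: local_df <- dplyr::collect(ct_test)  (assumption)
local_df <- data.frame(
  ID = c(10L, 20L, 30L, 40L),
  Name = c("Ravi", "John", "Smith", "Kevin"),
  Email = c("test@gmail.com", "abc@gmail.com", "tt@yahoo.com", NA),
  Phone = c("874589", "878756", "565656", NA),
  Survey = c("Survey 1", "Survey 2", "Survey 3", "Survey 3"),
  stringsAsFactors = FALSE
)

cols_to_mask <- c("Email", "Phone")

# Hash each value; missing or empty values become an empty string.
anonymize <- function(x, algo = "crc32") {
  vapply(
    x,
    function(y) if (is.na(y) || y == "") "" else digest(y, algo = algo),
    character(1),
    USE.NAMES = FALSE
  )
}

# local_df is an ordinary data.frame, so setDT can convert it by reference.
setDT(local_df)
local_df[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
print(local_df)
```

Collecting moves every row to the R session, so this only suits data small enough to hold locally. For larger tables it may be preferable to keep the masking inside Spark and collect only the result.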