Writing a large JSON file to CSV with sparklyr

Problem Description

I am trying to convert a large JSON file (6GB) to CSV so it is easier to load into R. I came across this solution (from https://community.rstudio.com/t/how-to-read-large-json-file-in-r/13486/33):

library(sparklyr)
library(dplyr)
library(jsonlite)

Sys.setenv(SPARK_HOME="/usr/lib/spark")
# Configure cluster (c3.4xlarge 30G 16core 320disk)
conf <- spark_config()
conf$'sparklyr.shell.executor-memory' <- "7g"
conf$'sparklyr.shell.driver-memory' <- "7g"
conf$spark.executor.cores <- 20
conf$spark.executor.memory <- "7G"
conf$spark.yarn.am.cores  <- 20
conf$spark.yarn.am.memory <- "7G"
conf$spark.executor.instances <- 20
conf$spark.dynamicAllocation.enabled <- "false"
conf$maximizeResourceAllocation <- "true"
conf$spark.default.parallelism <- 32

sc <- spark_connect(master = "local", config = conf, version = '2.2.0')
sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                              memory = FALSE, overwrite = TRUE)
sdf_schema_viewer(sample_tbl)

I have never used Spark before, and I am trying to understand where the data I loaded actually lives relative to RStudio, and how I can write that data out to CSV?

Tags: r, json, csv, apache-spark

Solution


I'm not sure about sparklyr, but if you are trying to read a large JSON file and write it out as a CSV file, below is sample code that does the same using SparkR.

This code will only run in a Spark environment, not in plain RStudio.

# SparkR ships with a Spark distribution; start a session first.
library(SparkR)
sparkR.session()

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"

# Create a SparkDataFrame from the file(s) pointed to by path
people <- read.json(path)

# Write the data frame to CSV (Spark writes a directory of part files)
write.df(people, "people.csv", "csv")
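To answer the sparklyr side of the question directly: the table created by `spark_read_json()` stays inside Spark as a remote table; only a reference to it lives in your RStudio session. You can write it to CSV from sparklyr itself with `spark_write_csv()`, or pull a small subset into R with `collect()`. A minimal sketch, assuming the `sc` connection and `sample_tbl` from the question's code (the output path `"output_csv"` is a hypothetical name; note that Spark writes a directory of part files, not a single CSV):

```r
library(sparklyr)
library(dplyr)

# sample_tbl is a remote Spark table, not an R data frame --
# the 6GB of data lives in Spark, not in RStudio's memory.

# Write the Spark table out as CSV. "output_csv" is a hypothetical
# directory name; Spark creates it and fills it with part-*.csv files.
spark_write_csv(sample_tbl, path = "output_csv",
                header = TRUE, mode = "overwrite")

# To bring a small subset into R as a regular data frame instead:
small_df <- sample_tbl %>%
  head(1000) %>%
  collect()
```

One caveat: CSV is a flat format, so if the JSON schema shown by `sdf_schema_viewer()` contains nested structures, you would need to select or flatten them into atomic columns before the CSV write will succeed.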
