How to loop multiple .Rdata objects from AWS/S3 into a list in R

Problem Description

So I have several large .Rdata objects of PUMS data stored in one S3 bucket. There are 10 files (10 years of data) I need to load into R for analysis, but I am having trouble loading or looping over multiple files from S3 into R.

Here is how to load one object at a time:

s3load("dataUS19.Rdata", bucket = "my bucket")

Loading them one at a time like that creates RAM issues, so I built a data frame of the bucket contents and then tried this loop:

awsDF <- get_bucket_df("my bucket")  # get bucket contents
data <- list()  # create list

data <- awsDF$Key[grep("dataUS", awsDF$Key)]  # keep only the .Rdata keys that start with dataUS
for (match in data) {
  s3load(object = match, bucket = "my bucket")
}

The issue is that the loop does load multiple objects, but it does not store them in a list. They load as separate data frames/objects in the global environment, which creates RAM issues (I am only able to load about 6 of the files).

I am not a programmer and was trained in Stata, so any help getting multiple .Rdata objects from S3 into a list would be greatly appreciated.

Tags: r, amazon-web-services, loops, amazon-s3, rdata

Solution

Consider loading with environments. Like base R's load, aws.s3's s3load supports an envir argument, so each file can be loaded into its own environment and that environment converted to a list.

rdata <- awsDF$Key[grep("dataUS", awsDF$Key)]

data_list <- lapply(rdata, function(file) {
    temp_env <- new.env()
    s3load(file, bucket = "my bucket", envir = temp_env)
    as.list.environment(temp_env)
})
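For readers without an S3 bucket handy, the same environment trick can be tried locally with base R's load, which behaves like s3load in this respect; the file names and data below are made up purely for illustration:

```r
# Simulate two .Rdata files, each holding one data frame named `df`
tmp_files <- file.path(tempdir(), c("dataUS18.Rdata", "dataUS19.Rdata"))
df <- data.frame(year = 2018, income = c(100, 200))
save(df, file = tmp_files[1])
df <- data.frame(year = 2019, income = c(150, 250))
save(df, file = tmp_files[2])

# Load each file into its own environment, then convert to a list --
# the same pattern as the s3load() version above
data_list <- lapply(tmp_files, function(file) {
    temp_env <- new.env()
    load(file, envir = temp_env)
    as.list.environment(temp_env)
})

length(data_list)      # 2: one element per file
names(data_list[[1]])  # "df": the object saved in the file
```

Because each file is loaded into a fresh environment, nothing leaks into the global environment, and each list element holds only that file's objects.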

If each .Rdata file contains only one object, extract the object itself:

as.list.environment(temp_env)[[1]]
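Once the per-year data frames are in a list, they can be stacked into one table with do.call(rbind, ...), assuming the files share the same columns. The dfs list below is a hypothetical stand-in for the objects extracted from each file:

```r
# Stand-in for the data frames pulled out of each .Rdata file;
# in the real workflow these would come from data_list
dfs <- list(
    data.frame(year = 2018, income = c(100, 200)),
    data.frame(year = 2019, income = c(150, 250))
)

# Stack the per-year data frames into one table
pums_all <- do.call(rbind, dfs)
nrow(pums_all)  # 4: two rows per year
```

Stacking frees you to drop the individual yearly objects afterwards, which also helps with the RAM pressure described in the question.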
