Simple operations on a large data frame in R

Problem description

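I have a relatively large data frame. It has about 40 million rows and 12 columns; a sample is shown below. Specifically, it holds 3-hour average ozone data for U.S. counties. Each row represents one county on one day (from 19800101 to 20161231, 3108 counties in total). Note that the data file is 7.05 GB.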

 index       date state.fips county.fp   X07.30   X10.30   X13.30   X16.30   X19.30   X21.30   X01.30   X04.30
1 01001 1980-01-01         01       001 29.98488 29.47778 29.12294 29.98976 31.69830 31.56405 30.48744 29.62118
2 01001 1980-01-02         01       001 29.03014 28.75464 28.58736 30.26555 32.39263 32.43746 31.70940 31.14960
3 01001 1980-01-03         01       001 30.69475 30.19832 29.68841 30.28920 31.61882 31.43047 31.01369 30.58366
4 01001 1980-01-04         01       001 30.20852 29.69874 29.47550 30.55639 32.62610 34.47959 35.54881 35.78104
5 01001 1980-01-05         01       001 35.80190 35.69129 35.89026 38.51287 39.82833 39.49016 38.73464 38.09185
6 01001 1980-01-06         01       001 37.32787 36.55899 35.96070 36.62670 37.03226 36.71239 35.86387 35.05945

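The problem is that the times in the column names are in UTC, and I need to convert them to U.S. local time. The contiguous U.S. has four time zones: Eastern, Central, Mountain, and Pacific. Yes, I only cover the contiguous U.S. How should I start this operation?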

Also note that the raw data file is large (7.05 GB), so we may run into out-of-memory errors. I am using a laptop with 16 GB of RAM.

Below is the code I have so far. However, I don't know how to add dplyr::case_when to adjust for the time zone.

library(tidyr)      # pivot_longer(), pivot_wider()
library(lubridate)  # ymd(), with_tz()

names(ozone) <- gsub("^X","", names(ozone)) # get rid of X in columns names

ozone <- pivot_longer(ozone, cols = c('01.30','04.30','07.30',
                                     '10.30','13.30','16.30','19.30','21.30'),
                                   names_to = 'time', values_to = 'ozone_val')

ozone$date <- ymd(ozone$date) # convert to date format

ozone$date = as.POSIXct(paste(ozone$date, ozone$time),
                             format = "%Y-%m-%d %H.%M",
                             tz = 'UTC')
ozone$date <- with_tz(ozone$date, "America/New_York") # how to apply case_when here

ozone$time <- substr(ozone$date, 12, 19)

ozone$year.day <- substr(ozone$date, 1, 10)

ozone <- subset(ozone, select = -date)

ozone_1 <- pivot_wider(ozone, id_cols = c('index','state.fips','county.fp','year.day'),
                     names_from = 'time', values_from = 'ozone_val')

Any ideas?

Tags: r, merge, bigdata

Solution


This should get you started, although a more complete reproducible example and more detail about the exact output you want would help. If you do not run out of memory, the general framework below should work: reshape to long format, build a POSIXct column, and then either create a time-zone column from the state (for example with dplyr::case_when()) or subset by zone after making the column POSIXct.

Also, I am happy to explain anything that is unclear!

library(data.table)

setDT(data)

names(data) <- gsub("^X", "", names(data))

dt <- melt(data, id.vars = c("index", "date", "state.fips", "county.fp"),
           variable.name = "time", value.name = "ozone_val")

dt[, date := as.POSIXct(paste(as.character(date), time), 
                        format = "%Y-%m-%d %H.%M",
                        tz = "America/New_York")]
print(dt, nrows = 10)

   index                date state.fips county.fp  time ozone_val
 1:  1001 1980-01-01 07:30:00          1         1 07.30  29.98488
 2:  1001 1980-01-02 07:30:00          1         1 07.30  29.03014
 3:  1001 1980-01-03 07:30:00          1         1 07.30  30.69475
 4:  1001 1980-01-04 07:30:00          1         1 07.30  30.20852
 5:  1001 1980-01-05 07:30:00          1         1 07.30  35.80190
---                                                               
44:  1001 1980-01-02 04:30:00          1         1 04.30  31.14960
45:  1001 1980-01-03 04:30:00          1         1 04.30  30.58366
46:  1001 1980-01-04 04:30:00          1         1 04.30  35.78104
47:  1001 1980-01-05 04:30:00          1         1 04.30  38.09185
48:  1001 1980-01-06 04:30:00          1         1 04.30  35.05945
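
Note that the snippet above labels the clock times as America/New_York rather than converting them from UTC. One possible way to do the per-state conversion is sketched below, still with data.table. It swaps case_when for a lookup-table join; `tz_lookup`, its two entries, and the toy rows are illustrative assumptions, and it assumes one zone per state, which does not hold for states that span two zones.

```r
library(data.table)

# Toy long-format rows. The Alabama values come from the sample above;
# the California rows reuse them as placeholders, not real measurements.
dt <- data.table(
  state.fips = c("01", "01", "06", "06"),
  date       = "1980-01-01",
  time       = c("01.30", "13.30", "01.30", "13.30"),
  ozone_val  = c(30.48744, 29.12294, 30.48744, 29.12294)
)

# Hypothetical lookup: one Olson zone per state FIPS code (assumption;
# a county-level table is needed for states that span two zones).
tz_lookup <- data.table(
  state.fips = c("01", "06"),
  tz         = c("America/Chicago", "America/Los_Angeles")
)

# Attach the zone to every row, then parse the timestamps as UTC.
dt <- tz_lookup[dt, on = "state.fips"]
dt[, utc := as.POSIXct(paste(date, time), format = "%Y-%m-%d %H.%M", tz = "UTC")]

# A POSIXct column carries a single tzone attribute, so format the local
# wall-clock time as text, one zone at a time.
dt[, local := format(utc, format = "%Y-%m-%d %H:%M", tz = .BY$tz), by = tz]

# Split into local day and local time, then reshape back to wide.
dt[, c("local.day", "local.time") := tstrsplit(local, " ", fixed = TRUE)]
wide <- dcast(dt, state.fips + local.day ~ local.time, value.var = "ozone_val")
print(wide)
```

Because the same UTC slot falls on a different clock time in each zone, and shifts by an hour under daylight saving time, the wide table ends up with one column per distinct local time and NA where a zone has no reading at that time. Early-morning UTC slots also move to the previous local day.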

Data:

data <- read.table(header = T, text = "index       date state.fips county.fp   X07.30   X10.30   X13.30   X16.30   X19.30   X21.30   X01.30   X04.30
1 01001 1980-01-01         01       001 29.98488 29.47778 29.12294 29.98976 31.69830 31.56405 30.48744 29.62118
2 01001 1980-01-02         01       001 29.03014 28.75464 28.58736 30.26555 32.39263 32.43746 31.70940 31.14960
3 01001 1980-01-03         01       001 30.69475 30.19832 29.68841 30.28920 31.61882 31.43047 31.01369 30.58366
4 01001 1980-01-04         01       001 30.20852 29.69874 29.47550 30.55639 32.62610 34.47959 35.54881 35.78104
5 01001 1980-01-05         01       001 35.80190 35.69129 35.89026 38.51287 39.82833 39.49016 38.73464 38.09185
6 01001 1980-01-06         01       001 37.32787 36.55899 35.96070 36.62670 37.03226 36.71239 35.86387 35.05945")
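
On the memory concern: the long table has eight times as many rows as the wide one, so it may not fit in 16 GB alongside the original. A minimal sketch of one workaround is to reshape one state at a time and append each piece to a file on disk. The file names, the CSV format of the input, and the two-row stand-in file are assumptions made so the example runs on its own.

```r
library(data.table)

# Tiny stand-in for the real 7.05 GB file (its path and format are not known here).
infile <- tempfile(fileext = ".csv")
writeLines(c(
  "index,date,state.fips,county.fp,X07.30,X10.30",
  "01001,1980-01-01,01,001,29.98488,29.47778",
  "01001,1980-01-02,01,001,29.03014,28.75464"
), infile)

# Read the FIPS codes as character so their leading zeros survive.
ozone <- fread(infile,
               colClasses = list(character = c("index", "state.fips", "county.fp")))

outfile <- tempfile(fileext = ".csv")  # hypothetical output path

# Reshape one state at a time so only one long chunk is in memory at once.
for (s in unique(ozone$state.fips)) {
  long <- melt(ozone[state.fips == s],
               id.vars = c("index", "date", "state.fips", "county.fp"),
               variable.name = "time", value.name = "ozone_val")
  long[, time := sub("^X", "", time)]  # "X07.30" -> "07.30"
  # Write the header only for the first chunk, then append.
  fwrite(long, outfile, append = TRUE, col.names = !file.exists(outfile))
}
```

Any per-row work, such as the time-zone conversion, can go inside the loop before the fwrite() call, so the full long table never has to be held in memory.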
