In R, convert "Oct 29 - Nov 1" to "20201029" and "20201101"

Problem description

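I am working with a messy table scraped from a website, and I need to clean up the scraped content to make the date column more useful. Our data looks like this: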

mydata <- structure(list(Dates = c("Sep\r\n            \r\n            10 - 13", 
"Oct\r\n            \r\n            8 - 11", "Oct 29 - Nov 1", 
"Nov\r\n            \r\n            19 - 22", "Jan\r\n            \r\n            21 - 24", 
"Mar\r\n            \r\n            4 - 7", "Apr 29 - May 2"), 
    points = c("500", "500", "500", "500", "500", "550", "500"
    )), row.names = c(1L, 5L, 8L, 11L, 16L, 23L, 32L), class = "data.frame")


> mydata
                                        Dates points
1  Sep\r\n            \r\n            10 - 13    500
5   Oct\r\n            \r\n            8 - 11    500
8                              Oct 29 - Nov 1    500
11 Nov\r\n            \r\n            19 - 22    500
16 Jan\r\n            \r\n            21 - 24    500
23   Mar\r\n            \r\n            4 - 7    550
32                             Apr 29 - May 2    500

Each value in Dates is really a date range that should be split into a StartDate and an EndDate. What we are trying to create is:

> newdata
     StartDate   EndDate  points
1     20200910  20200913     500
1     20201008  20201011     500
1     20201029  20201101     500
1     20201119  20201122     500
1     20210121  20210124     500
1     20210304  20210307     500
1     20210429  20210502     500

We can assume that all dates from September through December fall in 2020, and all dates from January through August fall in 2021.

Edit 1

This may not be the cleanest code, but I managed to split the Dates column into 2 columns:

  library(dplyr)  # provides %>% and mutate()
  cleaning_dates_df <- do.call('rbind', strsplit(mydata$Dates, '-')) %>% as.data.frame()
  colnames(cleaning_dates_df) <- c('start', 'end')
  cleaning_dates_df$start <- as.character(cleaning_dates_df$start)
  cleaning_dates_df$end <- as.character(cleaning_dates_df$end)
  cleaning_dates_df <- cleaning_dates_df %>%
    dplyr::mutate(end = ifelse(nchar(end) > 4, end, paste0(trimws(sub("\r\n.*", "", start)), end))) %>%
    dplyr::mutate(start = ifelse(nchar(start) < 8, start, paste0(trimws(sub("\r\n.*", "", start)), sub(".*\\s", "", start)))) %>%
    dplyr::mutate(end = trimws(end)) %>% dplyr::mutate(start = trimws(start))

  head(cleaning_dates_df, 8)

...but the result still needs to be converted to YYYYMMDD.

Tags: r, date, date-range

Solution


I wouldn't call it pretty, but you can use a regular expression to extract all the parts first:

rgx <- "^([a-z]+)(\\r|\\n|\\s)+(\\d+)\\s\\-\\s([a-z]+)*\\s*(\\d+)$"
td <- strcapture(rgx, tolower(mydata$Dates), 
                 proto=list(mth1="",x="",day1="",mth2="",day2=""))
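To make the capture groups concrete, here is a small self-contained check on two sample strings. It assumes the same pattern as above; the `samples` vector and the `parts` name are made up for illustration.

```r
# Same pattern as in the answer; `samples` is illustrative input only.
rgx <- "^([a-z]+)(\\r|\\n|\\s)+(\\d+)\\s\\-\\s([a-z]+)*\\s*(\\d+)$"
samples <- c("Sep\r\n            \r\n            10 - 13", "Oct 29 - Nov 1")

# One column per capture group; `x` only absorbs the whitespace after the first month.
parts <- strcapture(rgx, tolower(samples),
                    proto = list(mth1 = "", x = "", day1 = "", mth2 = "", day2 = ""))

# mth2 comes back as an empty string when the range names only one month.
parts[c("mth1", "day1", "mth2", "day2")]
```

The empty `mth2` for single-month ranges is what the next step relies on.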

When only one month is mentioned, copy the first month into the second:

td$mth2[td$mth2 == ''] <- td$mth1[td$mth2 == '']

Convert the months to numbers, then decide whether each one belongs to 2020 or 2021:

td[c("mth1","mth2")] <- lapply(td[c("mth1","mth2")],
                               function(x) match(x, tolower(month.abb)))
td[c("yr1","yr2")]   <- lapply(td[c("mth1","mth2")],
                               function(x) ifelse(x >= 9, 2020, 2021) )

Construct the dates from the separate parts:

mydata$startdate <- as.Date(paste(td$yr1, td$mth1, td$day1, sep="/"))
mydata$enddate   <- as.Date(paste(td$yr2, td$mth2, td$day2, sep="/"))

Done:

mydata

#                                        Dates points  startdate    enddate
#1  Sep\r\n            \r\n            10 - 13    500 2020-09-10 2020-09-13
#5   Oct\r\n            \r\n            8 - 11    500 2020-10-08 2020-10-11
#8                              Oct 29 - Nov 1    500 2020-10-29 2020-11-01
#11 Nov\r\n            \r\n            19 - 22    500 2020-11-19 2020-11-22
#16 Jan\r\n            \r\n            21 - 24    500 2021-01-21 2021-01-24
#23   Mar\r\n            \r\n            4 - 7    550 2021-03-04 2021-03-07
#32                             Apr 29 - May 2    500 2021-04-29 2021-05-02
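The columns above are Date objects, whereas the question asked for YYYYMMDD values. One possible way to get there, as a sketch, is `format()` with the `"%Y%m%d"` pattern. The dates below are retyped by hand from the first three rows of the output so that the snippet runs on its own.

```r
# Dates copied from the first three rows of the result above.
startdate <- as.Date(c("2020-09-10", "2020-10-08", "2020-10-29"))
enddate   <- as.Date(c("2020-09-13", "2020-10-11", "2020-11-01"))

# "%Y%m%d" yields a four-digit year, two-digit month and two-digit day with no separators.
newdata <- data.frame(
  StartDate = format(startdate, "%Y%m%d"),
  EndDate   = format(enddate, "%Y%m%d"),
  points    = c("500", "500", "500"),
  stringsAsFactors = FALSE
)
newdata
```

On the full data the same call would be `format(mydata$startdate, "%Y%m%d")`; wrapping it in `as.integer()` is an option if numeric values are preferred over character strings.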
