首页 > 解决方案 > 如何操作 data.frame 以达到预期的结果?以某种方式滞后?

问题描述

我试图以难以描述的方式操作数据表。我的直觉告诉我这是某种滞后,但我不确定 - 所以我会举例说明。

我可以在 excel 中轻松完成此操作,但我的数据集太大,excel 无法有效处理。

初始数据:

Column1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C")
Column2 <- c(201801, 201802, 201803, 201804, 201803, 201804, 201805, 201803, 201804)
Column3 <- c("Active", "Active", "Active", "Closed", "Active", "Active", "CO", "Active", "BK")
Column4 <- c(100, 97, 95, 0, 50, 45, 45, 100, 90)
(dat <- dplyr::tibble(Column1, Column2, Column3, Column4))

# A tibble: 9 x 4
  Column1 Column2 Column3 Column4
  <chr>     <dbl> <chr>     <dbl>
1 A        201801 Active      100
2 A        201802 Active       97
3 A        201803 Active       95
4 A        201804 Closed        0
5 B        201803 Active       50
6 B        201804 Active       45
7 B        201805 CO           45
8 C        201803 Active      100
9 C        201804 BK           90

期望的输出:

Col1 <- c("A", "A", "A", "B", "B", "C")
Col2 <- c(201010, 201802, 201803, 201003, 201804, 201803)
Col3 <- c(201802, 201803, 201804, 201804, 201805, 201804)
Col4 <- c("Active", "Active", "Active", "Active", "Active", "Active")
Col5 <- c("Active", "Active", "Closed", "Active", "CO", "BK")
Col6 <- c(100, 97, 95, 50, 45, 100)
Col7 <- c(97, 95, 0, 45, 45, 90)
(dat_desired <- dplyr::tibble(Col1, Col2, Col3, Col4, Col5, Col6, Col7))

# A tibble: 6 x 7
  Col1    Col2   Col3 Col4   Col5    Col6  Col7
  <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl> <dbl>
1 A     201001 201802 Active Active   100    97
2 A     201802 201803 Active Active    97    95
3 A     201803 201804 Active Closed    95     0
4 B     201003 201804 Active Active    50    45
5 B     201804 201805 Active CO        45    45
6 C     201803 201804 Active BK       100    90

顺便说一句,根据下面的一些建议,我尝试了以下方法(但产生了以下错误):

library(zoo)

R <- read_excel("H:/R Programs/R_Data.xlsx")

Column1 = as.vector(R[,1])
Column2 = as.vector(R[,2])
Column3 = as.vector(R[,3])
Column4 = as.vector(R[,4])
Column5 = as.vector(R[,5])

(dat <- dplyr::tibble(Column1, Column2, Column3, Column4, Column5))

# A tibble: 415,533 x 5
Column1$Loan_Key Column2$File_Run_Date Column3$Status Column4$Days Column5$Bal
<dbl>                 <dbl> <chr>                 <dbl>       <dbl>
1         11111111              20180201 ACTIVE                  -19      24472.
2         11111111              20180301 ACTIVE                  -19      24264.
3         11111111              20180401 ACTIVE                  -19      23991.
4         11111111              20180501 BK                      -49      23350.
5         11111111              20180601 BK                      -19      23488.
6         11111111              20180701 BK                      -19      23169.
7         11111111              20180801 BK                      -19      23008.
8         11111111              20180901 BK                      -19      22693.
9         11111111              20181001 BK                      -19      22378.
10        11111111              20181101 BK                      -19      22192.

# ... with 415,523 more rows

(data.frame(rollapply(data = dat, 2, c)) %>% filter(X1 == X2) %>% 
    select(-X2) %>% setNames(paste0("Col", 1:9)))

Error in ncol(xj) : object 'xj' not found

标签: rdataframe

解决方案


这不是基本上只是一个自加入(即一个left_joinwith daton dat"Column1"吗?

library(tidyverse)
dat %>%
    left_join(dat, by = c("Column1")) %>%
    filter(Column2.y - Column2.x == 1) %>%
    select(
        Col1 = Column1,
        Col2 = Column2.x, Col3 = Column2.y,
        Col4 = Column3.x, Col5 = Column3.y,
        Col6 = Column4.x, Col7 = Column4.y)
## A tibble: 6 x 7
#  Col1    Col2   Col3 Col4   Col5    Col6  Col7
#  <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl> <dbl>
#1 A     201801 201802 Active Active   100    97
#2 A     201802 201803 Active Active    97    95
#3 A     201803 201804 Active Closed    95     0
#4 B     201803 201804 Active Active    50    45
#5 B     201804 201805 Active CO        45    45
#6 C     201803 201804 Active BK       100    90

推荐阅读