Add a date column on each loop iteration

Problem description

I want to prepare a data frame for analysis from data scraped from a website with rvest:

x <- list()
for (i in 18:19){
  for (j in 1:12) {
    x[[paste0("20",i,".",j)]]<-paste0("https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=20",i,"&filter_month=",j,"&List=Listele")
  }
}

This just creates the links for two years so that rvest can read the HTML. I want to bind the results into a single data frame:

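library(rvest)  # read_html(), html_table()
library(dplyr)  # bind_rows()

DF <- data.frame()

for (i in x){
  html_monthly <- read_html(i)
  # the fourth table on the page holds the monthly figures
  temp_df <- html_table(html_monthly, fill = TRUE)[[4]]
  # drop rows 1, 2, 28 and 29 (the two header rows and the last two rows)
  temp_df <- temp_df[-c(1,2,28,29),]
  DF <- bind_rows(DF,temp_df)
}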

This is what I get for one month:


  X1         X2         X3       X4     X5
1  A    292.290     57.920   158,36  12,48
2  B  2.725.497    540.511 1.920,41 100,50
3  C 25.260.026  8.000.259 4.641,49 567,45
4  D  2.582.916        527   667,90   0,19
5  E 24.041.009 12.196.630 3.483,63 477,84
6  F    973.180     24.216   719,08   5,66
7  G  5.368.531  2.203.468 1.444,43 153,74

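I want to add a date column on each loop iteration, based on the link. For example, each month has 25 rows and the first month is 2018-1, so the first 25 rows of the main data frame DF should have the date 2018-1, and so on for the following months.

I tried adding a counter to the loop that binds a column taken from names(x)[counter] to each temp_df. It works for 6 months, but after that I get an error.

Any suggestions?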

Tags: r, date

Solution


Try this (untested):

for (i in seq_along(x)){
  html_monthly <- read_html(x[[i]])
  temp_df <- html_table(html_monthly,fill=T)[[4]]
  temp_df <- temp_df[-c(1,2,28,29),]
  temp_df$date <- names(x)[i]
  DF <- bind_rows(DF,temp_df)
}

However, I would suggest a different implementation.

eg <- expand.grid(i=2018:2019, j=1:3)
eg
#      i j
# 1 2018 1
# 2 2019 1
# 3 2018 2
# 4 2019 2
# 5 2018 3
# 6 2019 3

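The expand.grid function gives us every combination of the variables provided. I shortened it to j=1:3 just for demonstration; extend it to the range you want. From here, the URLs can be generated more succinctly in a single command. (I use sprintf here, but this can be done just as easily with paste/paste0 on the same columns/vectors.) I name them in a format that will be easier to work with later via as.Date.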

urls <- sprintf("https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=%d&filter_month=%d&List=Listele", eg$i, eg$j)
names(urls) <- sprintf("%d-%02d-01", eg$i, eg$j)
urls
#                                                                                          2018-01-01 
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2018&filter_month=1&List=Listele" 
#                                                                                          2019-01-01 
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2019&filter_month=1&List=Listele" 
#                                                                                          2018-02-01 
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2018&filter_month=2&List=Listele" 
#                                                                                          2019-02-01 
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2019&filter_month=2&List=Listele" 
#                                                                                          2018-03-01 
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2018&filter_month=3&List=Listele" 
#                                                                                          2019-03-01 
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2019&filter_month=3&List=Listele" 
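
As a minimal sketch of the paste0 route mentioned above, the following builds the same URLs both ways and compares them. The variable names base_url, urls_sprintf, urls_paste0 and url_names are introduced here for illustration only and are not part of the original answer.

```r
# Same grid as above, shortened to three months for illustration.
eg <- expand.grid(i = 2018:2019, j = 1:3)

base_url <- "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/"

# sprintf version, following the answer's format string.
urls_sprintf <- sprintf("%s?filter_year=%d&filter_month=%d&List=Listele",
                        base_url, eg$i, eg$j)

# paste0 version: vectorised over the same two columns.
urls_paste0 <- paste0(base_url, "?filter_year=", eg$i,
                      "&filter_month=", eg$j, "&List=Listele")

# ISO-style names (zero-padded month, day fixed to 01) so as.Date() can parse them.
url_names <- paste0(eg$i, "-", formatC(eg$j, width = 2, flag = "0"), "-01")
names(urls_sprintf) <- url_names
names(urls_paste0) <- url_names

# Check that both approaches produce the same named vector.
identical(urls_sprintf, urls_paste0)
```

Which one to use is mostly a matter of taste; sprintf keeps the URL template in one readable string, while paste0 avoids format specifiers.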

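From here, I find it is generally best to scrape the data into a list first and make any modifications in a later step. (Adding the column inside this lapply is certainly fine too; it is mostly a matter of preference.)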

library(rvest)
lst_of_frames <- lapply(urls, function(url) html_table(read_html(url), fill = TRUE)[[4]])
str(lst_of_frames[1:2])
# List of 2
#  $ 2018-01-01:'data.frame':   29 obs. of  5 variables:
#   ..$ X1: chr [1:29] "Isyeri Grubu" "Isyeri Grubu" "ARABA KIRALAMA" "ARAÇ KIRALAMA-SATIS/SERVIS/YEDEK PARÇA" ...
#   ..$ X2: chr [1:29] "Islem Adedi" "Islem Adedi(Kredi Karti)" "292.290" "2.725.497" ...
#   ..$ X3: chr [1:29] "Islem Adedi" "Islem Adedi (Banka Karti)" "57.920" "540.511" ...
#   ..$ X4: chr [1:29] "Islem Tutari (Milyon TL)" "Islem Tutari \n                (Kredi Karti)" "158,36" "1.920,41" ...
#   ..$ X5: chr [1:29] "Islem Tutari (Milyon TL)" "Islem Tutari \n                    (Banka Karti)" "12,48" "100,50" ...
#  $ 2019-01-01:'data.frame':   29 obs. of  5 variables:
#   ..$ X1: chr [1:29] "Isyeri Grubu" "Isyeri Grubu" "ARABA KIRALAMA" "ARAÇ KIRALAMA-SATIS/SERVIS/YEDEK PARÇA" ...
#   ..$ X2: chr [1:29] "Islem Adedi" "Islem Adedi(Kredi Karti)" "256.372" "2.967.019" ...
#   ..$ X3: chr [1:29] "Islem Adedi" "Islem Adedi (Banka Karti)" "49.296" "642.136" ...
#   ..$ X4: chr [1:29] "Islem Tutari (Milyon TL)" "Islem Tutari \n                (Kredi Karti)" "195,13" "2.185,84" ...
#   ..$ X5: chr [1:29] "Islem Tutari (Milyon TL)" "Islem Tutari \n                    (Banka Karti)" "14,77" "127,16" ...
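
For comparison, here is a rough sketch of the other option, adding the date column in the same pass as the scrape. To keep it runnable offline, fetch_table is a hypothetical stand-in for html_table(read_html(url), fill = TRUE)[[4]], and urls_demo and its contents are made up for illustration.

```r
# Hypothetical stand-in for html_table(read_html(url), fill = TRUE)[[4]];
# it returns a tiny made-up frame so the sketch runs without network access.
fetch_table <- function(url) {
  data.frame(X1 = c("A", "B"), X2 = c("1.000", "2.000"),
             stringsAsFactors = FALSE)
}

# Made-up URLs named with ISO dates, mirroring the shape of names(urls).
urls_demo <- c("2018-01-01" = "https://example.com/?m=1",
               "2018-02-01" = "https://example.com/?m=2")

# Iterate over the names so each frame can be tagged with its own date.
frames_demo <- lapply(names(urls_demo), function(nm) {
  df <- fetch_table(urls_demo[[nm]])
  df$date <- as.Date(nm)
  df
})
names(frames_demo) <- names(urls_demo)

str(frames_demo)
```

Keeping the scrape separate, as the answer does, has the advantage that a failed modification step does not force the pages to be downloaded again.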

Now for what I believe was your original question: how to create this date column for each scraped frame.

lst2_of_frames <- Map(function(nm, x) transform(x, date = as.Date(nm)), names(lst_of_frames), lst_of_frames)
results <- do.call(rbind, lst2_of_frames)

head(results, n=3)
#                          X1                       X2                        X3                                           X4
# 2018-01-01.1   Isyeri Grubu              Islem Adedi               Islem Adedi                     Islem Tutari (Milyon TL)
# 2018-01-01.2   Isyeri Grubu Islem Adedi(Kredi Karti) Islem Adedi (Banka Karti) Islem Tutari \n                (Kredi Karti)
# 2018-01-01.3 ARABA KIRALAMA                  292.290                    57.920                                       158,36
#                                                            X5       date
# 2018-01-01.1                         Islem Tutari (Milyon TL) 2018-01-01
# 2018-01-01.2 Islem Tutari \n                    (Banka Karti) 2018-01-01
# 2018-01-01.3                                            12,48 2018-01-01

tail(results, n=3)
#                   X1          X2          X3        X4        X5       date
# 2019-03-01.27  YEMEK  45.639.403  39.193.997  2.629,09  1.219,64 2019-03-01
# 2019-03-01.28  DIGER   5.249.866  20.716.466  1.585,72    838,78 2019-03-01
# 2019-03-01.29 TOPLAM 350.282.800 185.000.312 68.957,28 11.658,47 2019-03-01
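
The combined frame above still contains the header rows and the TOPLAM row, and the figures are text with "." as the thousands separator and "," as the decimal mark. One possible cleanup, sketched on a toy frame, is shown below. The helper parse_tr_number and the frame month_df are hypothetical names; the cell values are borrowed from the output above purely for illustration, and which rows to drop is an assumption about what the analysis needs.

```r
# Toy frame shaped like part of one month of `results` (illustrative values only).
month_df <- data.frame(
  X1 = c("Isyeri Grubu", "Isyeri Grubu", "ARABA KIRALAMA", "TOPLAM"),
  X2 = c("Islem Adedi", "Islem Adedi(Kredi Karti)", "292.290", "350.282.800"),
  X4 = c("Islem Tutari (Milyon TL)", "Islem Tutari (Kredi Karti)",
         "158,36", "68.957,28"),
  stringsAsFactors = FALSE
)

# Parse "1.920,41"-style text: remove the thousands dots first,
# then turn the decimal comma into a dot before converting.
parse_tr_number <- function(x) {
  as.numeric(gsub(",", ".", gsub(".", "", x, fixed = TRUE), fixed = TRUE))
}

# Drop the two header rows, then the TOPLAM (total) row.
clean_df <- month_df[-(1:2), ]
clean_df <- clean_df[clean_df$X1 != "TOPLAM", ]

# Convert the text columns to numeric.
clean_df$X2 <- parse_tr_number(clean_df$X2)
clean_df$X4 <- parse_tr_number(clean_df$X4)

str(clean_df)
```

Applied to the real data, the same helper would be run over columns X2 to X5 after the unwanted rows are removed from each monthly frame.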
