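r - Add a date column in each loop iteration
Problem description
I want to prepare a data frame for analysis from data scraped from a website with rvest: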
x <- list()
for (i in 18:19){
for (j in 1:12) {
x[[paste0("20",i,".",j)]]<-paste0("https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=20",i,"&filter_month=",j,"&List=Listele")
}
}
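This just creates two years of links whose HTML I read with rvest. I want to bind the results into a single data frame:
library(rvest)   # read_html(), html_table()
library(dplyr)   # bind_rows()
DF <- data.frame()
for (i in x){
html_monthly <- read_html(i)
temp_df <- html_table(html_monthly, fill = TRUE)[[4]]
# drop the two header rows and the last two rows
temp_df <- temp_df[-c(1,2,28,29),]
DF <- bind_rows(DF,temp_df)
}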
This is what I get for one month:
X1 X2 X3 X4
1 A 292.290 57.920 158,36 12,48
2 B 2.725.497 540.511 1.920,41 100,50
3 C 25.260.026 8.000.259 4.641,49 567,45
4 D 2.582.916 527 667,90 0,19
5 E 24.041.009 12.196.630 3.483,63 477,84
6 F 973.180 24.216 719,08 5,66
7 G 5.368.531 2.203.468 1.444,43 153,74
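I want to add a date column in each loop iteration, based on the link. For example, each month has 25 rows and its date is 2018-1, so the first 25 rows of the main data frame DF would be 2018-1, and so on.
I tried adding a counter to the loop that binds a column to each temp_df from names(x)[counter]. It works for 6 months, but after that I get an error.
Any suggestions?
Solution
Try this (untested):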
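DF <- data.frame()  # start empty, as in the question
for (i in seq_along(x)){
html_monthly <- read_html(x[[i]])
temp_df <- html_table(html_monthly, fill = TRUE)[[4]]
temp_df <- temp_df[-c(1,2,28,29),]
# label every row of this month's table with the name of its link
temp_df$date <- names(x)[i]
DF <- bind_rows(DF,temp_df)
}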
However, I suggest a different implementation.
eg <- expand.grid(i=2018:2019, j=1:3)
eg
# i j
# 1 2018 1
# 2 2019 1
# 3 2018 2
# 4 2019 2
# 5 2018 3
# 6 2019 3
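The expand.grid function gives us every combination of the variables provided. I abbreviated to j=1:3 just for demonstration; expand it to what you want. From here, the URLs can be generated a little more succinctly in one command. (While I use sprintf here, it can just as easily be done with paste/paste0 using the same columns/vectors.) I name them in a format that is easier to use with as.Date later.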
urls <- sprintf("https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=%d&filter_month=%d&List=Listele", eg$i, eg$j)
names(urls) <- sprintf("%d-%02d-01", eg$i, eg$j)
urls
# 2018-01-01
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2018&filter_month=1&List=Listele"
# 2019-01-01
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2019&filter_month=1&List=Listele"
# 2018-02-01
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2018&filter_month=2&List=Listele"
# 2019-02-01
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2019&filter_month=2&List=Listele"
# 2018-03-01
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2018&filter_month=3&List=Listele"
# 2019-03-01
# "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/?filter_year=2019&filter_month=3&List=Listele"
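As a small sketch of the paste0 alternative mentioned above, the following builds the same URLs both ways and checks that they match. The name base_url is only an illustrative helper, and the order() step is an optional addition (not part of the code above) for anyone who prefers the links in chronological order.

```r
# Same grid as above, abbreviated to three months for demonstration.
eg <- expand.grid(i = 2018:2019, j = 1:3)

# Optional: sort by year, then month, so the URLs come out chronologically.
eg <- eg[order(eg$i, eg$j), ]

# base_url is a hypothetical helper name, used only to shorten the lines below.
base_url <- "https://bkm.com.tr/secilen-aya-ait-sektorel-gelisim/"

# sprintf version: placeholders are filled element-wise from the vectors.
urls_sprintf <- sprintf("%s?filter_year=%d&filter_month=%d&List=Listele",
                        base_url, eg$i, eg$j)

# paste0 version: the same vectors are concatenated element-wise.
urls_paste0 <- paste0(base_url, "?filter_year=", eg$i,
                      "&filter_month=", eg$j, "&List=Listele")

# Both approaches should produce the same character vector.
identical(urls_sprintf, urls_paste0)
```

Either form is vectorized over the rows of eg, so no explicit loop is needed to build the links.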
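From here, I find it is generally best to scrape the data into a list first and make any modifications in a later step. (Adding the column inside this lapply would certainly be fine too; it is mostly a matter of preference.)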
library(rvest)
lst_of_frames <- lapply(urls, function(url) html_table(read_html(url), fill = TRUE)[[4]])
str(lst_of_frames[1:2])
# List of 2
# $ 2018-01-01:'data.frame': 29 obs. of 5 variables:
# ..$ X1: chr [1:29] "Isyeri Grubu" "Isyeri Grubu" "ARABA KIRALAMA" "ARAÇ KIRALAMA-SATIS/SERVIS/YEDEK PARÇA" ...
# ..$ X2: chr [1:29] "Islem Adedi" "Islem Adedi(Kredi Karti)" "292.290" "2.725.497" ...
# ..$ X3: chr [1:29] "Islem Adedi" "Islem Adedi (Banka Karti)" "57.920" "540.511" ...
# ..$ X4: chr [1:29] "Islem Tutari (Milyon TL)" "Islem Tutari \n (Kredi Karti)" "158,36" "1.920,41" ...
# ..$ X5: chr [1:29] "Islem Tutari (Milyon TL)" "Islem Tutari \n (Banka Karti)" "12,48" "100,50" ...
# $ 2019-01-01:'data.frame': 29 obs. of 5 variables:
# ..$ X1: chr [1:29] "Isyeri Grubu" "Isyeri Grubu" "ARABA KIRALAMA" "ARAÇ KIRALAMA-SATIS/SERVIS/YEDEK PARÇA" ...
# ..$ X2: chr [1:29] "Islem Adedi" "Islem Adedi(Kredi Karti)" "256.372" "2.967.019" ...
# ..$ X3: chr [1:29] "Islem Adedi" "Islem Adedi (Banka Karti)" "49.296" "642.136" ...
# ..$ X4: chr [1:29] "Islem Tutari (Milyon TL)" "Islem Tutari \n (Kredi Karti)" "195,13" "2.185,84" ...
# ..$ X5: chr [1:29] "Islem Tutari (Milyon TL)" "Islem Tutari \n (Banka Karti)" "14,77" "127,16" ...
Now for what I think was your original question: how to create this date column for each scraped frame.
lst2_of_frames <- Map(function(nm, x) transform(x, date = as.Date(nm)), names(lst_of_frames), lst_of_frames)
results <- do.call(rbind, lst2_of_frames)
head(results, n=3)
# X1 X2 X3 X4
# 2018-01-01.1 Isyeri Grubu Islem Adedi Islem Adedi Islem Tutari (Milyon TL)
# 2018-01-01.2 Isyeri Grubu Islem Adedi(Kredi Karti) Islem Adedi (Banka Karti) Islem Tutari \n (Kredi Karti)
# 2018-01-01.3 ARABA KIRALAMA 292.290 57.920 158,36
# X5 date
# 2018-01-01.1 Islem Tutari (Milyon TL) 2018-01-01
# 2018-01-01.2 Islem Tutari \n (Banka Karti) 2018-01-01
# 2018-01-01.3 12,48 2018-01-01
tail(results, n=3)
# X1 X2 X3 X4 X5 date
# 2019-03-01.27 YEMEK 45.639.403 39.193.997 2.629,09 1.219,64 2019-03-01
# 2019-03-01.28 DIGER 5.249.866 20.716.466 1.585,72 838,78 2019-03-01
# 2019-03-01.29 TOPLAM 350.282.800 185.000.312 68.957,28 11.658,47 2019-03-01
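The output above still contains the header rows and the total row that the question removed with temp_df[-c(1,2,28,29),]. A minimal offline sketch of combining that row removal with the date column follows. It uses made-up toy frames rather than the scraped tables, and make_toy, toy_frames, cleaned and results_toy are illustrative names only; the row indices would need to match the real 29-row tables.

```r
# Toy stand-ins for two scraped tables (made-up values, not real data):
# two header rows, two data rows, then a trailing total row.
make_toy <- function(v) {
  data.frame(
    X1 = c("Header", "Header", "A", "B", "TOTAL"),
    X2 = c("Count", "Count", v, "2", "3"),
    stringsAsFactors = FALSE
  )
}
toy_frames <- list("2018-01-01" = make_toy("1"),
                   "2018-02-01" = make_toy("5"))

# For each frame: drop the unwanted rows (here rows 1, 2 and 5), then add
# a date column taken from the frame's name in the list.
cleaned <- Map(function(nm, x) {
  x <- x[-c(1, 2, 5), ]
  transform(x, date = as.Date(nm))
}, names(toy_frames), toy_frames)

# Bind everything into one data frame and reset the row names.
results_toy <- do.call(rbind, cleaned)
rownames(results_toy) <- NULL
results_toy
```

Dropping the rows before binding keeps the header text out of the combined frame, so every remaining row is a data row carrying its own month.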