How to loop through a large number of IDs when scraping data from a website in R with rvest and stringr?

Problem Description

I have a list of Well IDs (wdid), and I want to pull each well's diversion records from the website given below. Based on code that worked for a similar query on another website, I came up with the following. However, after the code runs "successfully", all I get back is "0" or "list()". I am also not an advanced user of R or web scraping. Can someone help me figure out what I am missing? The code is as follows:

#Install Packages

install.packages(c("httr", "rvest", "jsonlite", "lubridate"))
install.packages('stringr')

library(rvest)
library(stringr)
library(jsonlite)
library(lubridate)
options(stringsAsFactors = FALSE)

#Load External Data

Wells<-read.csv("~/file path/Wells.csv",header=TRUE)  #This is the list of the Well IDs whose diversion record I want to pull
n<-nrow(Wells)   #Number of Well IDs to loop over

#Specify the URL
URL.1 <- "http://dnrweb.state.co.us/DWR/DwrApiService/api/v2/structures/divrec/divrecyear/?format=csvforced&wdid="

#Run Loop
for(i in 1:n){
  tryCatch({    #This is to keep the loop from breaking if there is an error
         wdid<-Wells[i,1]   #Take the i-th Well ID from the list

  url<-paste(URL.1,"0",wdid,sep="")   #Build the request URL for this well (a "0" is prepended to the ID)

  site<-read_html(url)   #Fetch the response and parse it as HTML

  d<-html_nodes(site,"#tabs-4 table")%>%  #This finds the spot in the html I need
    html_table()                          #Because what I need is a table, this keeps in table form

  if(length(d)!=0) dd<-data.frame(d,wdid) else dd<-0   #The length test makes sure the record is not empty

  if(i==1) output<-dd else output<-rbind(output,dd)   #Start the output on the first pass, then append
  
  Sys.sleep(0.5)   #Pause between requests
  },error=function(e){cat("ERROR :",conditionMessage(e), wdid, "\n")}) #Displays error that would have broken loop
}

temp<-output[output$wdid>0,]   #Drop the placeholder rows left where no record was found
output.2<-merge(Wells,temp,by.x="dataValue",by.y="wdid",all=TRUE)   #Join the records back onto the full well list
write.csv(output.2,file="~/file path/records.csv")   #Save the combined records


Here is a sample of my Well IDs:

wdid
2005002
2005003
2005004
2005007
2005009
2005011
2005014
2005018
2005020
2005021
2005022
2005023
2005027
2005028
2005031
2005032
2005033
2005034
2005035
2005036
2005037
2005038
2005041
2005043
2005046
2005049
2005050
2005052
2005053
2005054
2005055
2005057
2005058
2005059
2005060
2005063
2005066
2005067
2005070
2005073
2005074
2005075
2005076
2005080
2005084
2005087
2005092
2005094
2005095
2005096


The output it returns looks like this:

output
       [,1]
output    0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0
dd        0

Tags: r, web-scraping

Solution
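
The code runs without errors but comes back empty because the request itself asks for CSV: the format=csvforced parameter in the URL tells the DWR API to return plain comma-separated text, not an HTML page. read_html() will still parse that text, but the result contains no #tabs-4 element and no <table> markup (that selector presumably came from the other, browser-oriented site), so html_nodes() matches nothing, html_table() returns an empty list(), and every iteration falls through to dd<-0. That is exactly the column of zeros shown above.

A quick way to confirm this is to look at the raw response. A minimal check, reusing URL.1 from the question and the first sample ID with the leading "0" the loop prepends:

#If this prints comma-separated text rather than HTML tags,
#there is no table for rvest to find
head(readLines(paste0(URL.1,"02005002")))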

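Since the endpoint already returns CSV, the most direct fix is to drop rvest for this site and hand each URL straight to read.csv(). Below is a minimal sketch of the loop rewritten that way, keeping the names and file paths from the question. Two assumptions are worth checking against a single response in your browser first: that the response is plain CSV with one header row (if the API prepends metadata lines, add a suitable skip= value to read.csv()), and that the IDs should be zero-padded to 8 characters, which str_pad() from stringr does here in place of the hard-coded "0".

library(stringr)

Wells<-read.csv("~/file path/Wells.csv",header=TRUE)
URL.1 <- "http://dnrweb.state.co.us/DWR/DwrApiService/api/v2/structures/divrec/divrecyear/?format=csvforced&wdid="

records<-list()   #Collect one data frame per well, then bind them all at the end

for(i in 1:nrow(Wells)){
  tryCatch({
    wdid<-str_pad(Wells[i,1],width=8,pad="0")   #Restore leading zeros stripped when the IDs were read as numbers
    d<-read.csv(paste0(URL.1,wdid))             #The endpoint returns CSV, so read it as CSV

    if(nrow(d)>0){
      d$wdid<-Wells[i,1]                        #Tag each record with the well it came from
      records[[length(records)+1]]<-d
    }
  },error=function(e){cat("ERROR :",conditionMessage(e), Wells[i,1], "\n")})

  Sys.sleep(0.5)   #Stay polite to the server
}

output<-do.call(rbind,records)   #One rbind at the end, instead of growing output inside the loop
output.2<-merge(Wells,output,by.x="dataValue",by.y="wdid",all=TRUE)   #Keyed the same way as the original merge
write.csv(output.2,file="~/file path/records.csv",row.names=FALSE)

Skipping empty responses inside the loop removes the need for the wdid>0 filter afterwards, and binding once at the end avoids the quadratic cost of repeated rbind() calls. If you would rather work with structured data, the same API also accepts format=json, which pairs naturally with jsonlite::fromJSON() (already loaded above); inspect one JSON response first to see how the records are nested.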
