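r - Multiple pages in rvest
Question
I am scraping the web with rvest in R. I am trying to collect data from a search results page that spans 12 pages. I wrote code that iterates over the pages to collect the data from each one, but it just collects the first page over and over. Here is a sample of my code.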
# New method for Pagination
url_base <- "https://www.nhs.uk/service-search/Hospital/LocationSearch/7/ConsultantResults?SortBy=1&Distance=400&ResultsPerPage=10&Name=e.g.%20Singh%20or%20John%20Smith&Specialty=230&Location.Id=0&Location.Name=e.g.%20postcode%20or%20town&Location.Longitude=0&Location.Latitude=0&CurrentPage=1&OnlyViewConsultantsWithOutcomeData=False"
map_df(1:12, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(consultant_name = html_text(html_nodes(pg, ".consultants-list h2 a")))
}) -> names
dplyr::glimpse(names)
Edited version of the code:
# New method for Pagination
url_base <- "https://www.nhs.uk/service-search/Hospital/LocationSearch/7/ConsultantResults?ResultsPerPage=100&defaultConsultantName=e.g.+Singh+or+John+Smith&DefaultLocationText=e.g.+postcode+or+town&DefaultSearchDistance=25&Name=e.g.+Singh+or+John+Smith&Specialty=230&Location.Name=e.g.+postcode+or+town&Location.Id=0&CurrentPage=%d"
map_df(1:12, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(consultant_name = html_text(html_nodes(pg, ".consultants-list h2 a")),
             gmc_no = gsub("GMC membership number: ", "", html_text(html_nodes(pg, ".consultants-list .name-number p"))),
             Speciality = html_text(html_nodes(pg, ".consultants-list .specialties ul li")),
             location = html_text(html_nodes(pg, ".consultants-list .consultant-services ul li")),
             stringsAsFactors = FALSE)
}) -> names
dplyr::glimpse(names)
The code above runs for 8 iterations and collects 800 rows (100 rows per page), but then it fails with an error:
.........Error in data.frame(consultant_name = html_text(html_nodes(pg, ".consultants-list h2 a")),  :
  arguments imply differing number of rows: 100, 101
Called from: data.frame(consultant_name = html_text(html_nodes(pg, ".consultants-list h2 a")), gmc_no = gsub("GMC membership number: ", "", html_text(html_nodes(pg, ".consultants-list .name-number p"))), Speciality = html_text(html_nodes(pg, ".consultants-list .specialties ul li")), location = html_text(html_nodes(pg, ".consultants-list .consultant-services ul li")), stringsAsFactors = FALSE)
Browse[1]>
I tried changing the number of loop iterations, but had no luck.
Please help me resolve this!
Answer
Here is what I came up with after looking at the URL pattern.
library(tidyverse)
library(rvest)

base_url <- "https://www.nhs.uk/service-search/Hospital/LocationSearch/7/ConsultantResults?Specialty="

# change the code to pull other specialities
specialty_code = 230 # ie. Anaesthesia services = 230

# show 100 per page
tgt_url <- str_c(base_url, specialty_code, "&ResultsPerPage=100&CurrentPage=")
pg <- read_html(tgt_url)

# count the total results and set the page count
res_cnt <- pg %>%
  html_nodes('.fcresultsinfo li:nth-child(1)') %>%
  html_text() %>%
  str_remove('.* of ') %>%
  as.numeric()
pg_cnt = ceiling(res_cnt / 100)

res_all <- NULL
for (i in 1:pg_cnt) {
  pg <- read_html(str_c(tgt_url, i))
  res_pg <- tibble(
    consultant_name = pg %>% html_nodes(".consultants-list h2 a") %>% html_text(),
    gmc_no = pg %>% html_nodes(".consultants-list .name-number p") %>% html_text() %>%
      str_remove("GMC membership number: "),
    speciality = pg %>% html_nodes(".consultants-list .specialties ul") %>%
      html_text() %>% str_replace_all(', \r\n\\s+', ', ') %>% str_trim(),
    location = pg %>% html_nodes(".consultants-list .consultant-services ul") %>%
      html_text() %>% str_replace_all(', \r\n\\s+', ', ') %>% str_trim(),
    src_link = pg %>% html_nodes(".consultants-list h2 a") %>% html_attr('href')
  )
  res_all <- res_all %>% bind_rows(res_pg)
}
This is what I get:
> nrow(res_all)
## [1] 1141
> res_all %>% select(1:4) %>% tail()
## # A tibble: 6 x 4
##   consultant_name      gmc_no  speciality           location
##   <chr>                <chr>   <chr>                <chr>
## 1 Mark Yeates          4716345 Anaesthesia services The Great Western Hospital
## 2 Steven Yentis        2939700 Anaesthesia services Chelsea and Westminster Hospital
## 3 Louise Young         6139457 Anaesthesia services Southampton General Hospital
## 4 Andreas Zafiropoulos 6075484 Anaesthesia services Shrewsbury and Telford Hospital NHS Trust
## 5 Suhail Zaidi         4239598 Anaesthesia services Luton and Dunstable Hospital
## 6 Cezary Zugaj         4751331 Anaesthesia services Oxford University Hospitals NHS Foundation Trust
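A note on why the `data.frame()` call in the question most likely failed with "100, 101": the `ul li` selectors return one node per list item, so a consultant who lists two specialties or two locations contributes two nodes, and the columns stop lining up. Selecting the parent `ul`, as the code above does, gives one node per consultant. The sketch below reproduces this on a tiny hand-written page. The markup and the names in it are made up for illustration and are only assumed to resemble the real NHS page.

```r
library(rvest)

# Hypothetical markup: two consultants, the second one lists two specialties.
html <- '
<div class="consultants-list">
  <div class="consultant">
    <h2><a href="/c/1">A Example</a></h2>
    <div class="specialties"><ul><li>Anaesthesia services</li></ul></div>
  </div>
  <div class="consultant">
    <h2><a href="/c/2">B Example</a></h2>
    <div class="specialties"><ul><li>Anaesthesia services</li><li>Pain management</li></ul></div>
  </div>
</div>'
pg <- read_html(html)

# One node per consultant name.
names_n <- length(html_nodes(pg, ".consultants-list h2 a"))

# One node per list item: more nodes than consultants, so data.frame() would fail.
li_n <- length(html_nodes(pg, ".consultants-list .specialties ul li"))

# One node per list: lines up with the names again.
uls <- html_nodes(pg, ".consultants-list .specialties ul")
ul_n <- length(uls)

# Collapse each consultant's items into a single string.
specialty <- vapply(
  uls,
  function(ul) paste(html_text(html_nodes(ul, "li")), collapse = ", "),
  character(1)
)

print(c(names = names_n, li = li_n, ul = ul_n))
print(specialty)
```

Here the name selector and the `ul` selector both return 2 nodes while the `ul li` selector returns 3, which is the same kind of mismatch as in the error message. Collapsing the items per `ul` is an alternative to the `str_replace_all()` cleanup used above; either way the result is one row per consultant.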