r - RSelenium text extraction not working in a loop
Problem description
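I am trying to extract text from the case database on this government website, https://www.te.gob.mx/buscador/, using RSelenium.
I have managed to get RSelenium to extract the text I am interested in and store it in a data frame manually, but I would like it to do this through a for loop.
The script first selects a region on the left side of the page, then clicks the first link under "Resumen", which opens a subpage.
I am extracting some text from each "Resumen" subpage and storing it in a data frame.
This is what my code looks like: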
setwd("C:/Users/ohenr/Dropbox/10-19 Research Projects/16 R")
getwd()
pacman::p_load(rvest, tidyverse, stringr, RSelenium, data.table) #loads all the packages in one command
url <- "https://www.te.gob.mx/buscador"
# Set up the remote driver
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L,
                      browserName = "firefox")
# Input this into the terminal to start the firefox image in docker
# docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0
# Open the remote Driver (open firefox in R Selenium)
remDr$open()
# Navigate to the search site
remDr$navigate(url)
# Click the regions on the left side of the webpage
region_lists <- remDr$findElements(using = "css selector", ".salas-tree")
region_lists[[1]]$clickElement()
# List the resumen elements on the first page
res <- remDr$findElements("css selector", "#resumenResultados")
# Number of resumen links on the first page
res_n <- length(res)
# Build a data frame with that same number of rows
resumen.df <- data.frame(expediente = character(res_n),
                         entidad = character(res_n),
                         turno = character(res_n),
                         res_text = character(res_n),
                         stringsAsFactors = FALSE)
for (j in 1:res_n) {
  res[[j]]$clickElement() # click on the jth resumen
  # Extract the h4 elements from the resumen subpage
  elements <- remDr$findElements(using = "css selector", "h4")
  expediente <- unlist(elements[[1]]$getElementText())
  entidad <- unlist(elements[[8]]$getElementText())
  turno <- unlist(elements[[5]]$getElementText())
  res_text <- remDr$findElement("css selector", "#swal2-content > div > div > p")
  res_text <- unlist(res_text$getElementText())
  resumen.df$expediente[j] <- expediente
  resumen.df$entidad[j] <- entidad
  resumen.df$turno[j] <- turno
  resumen.df$res_text[j] <- res_text
  # Click the OK button to close the resumen subpage
  button <- remDr$findElement("css selector", "body > div.swal2-container.swal2-center.swal2-fade.swal2-shown > div > div.swal2-actions > button.swal2-confirm.swal2-styled")
  button$clickElement()
}
However, as soon as I run the loop, I get this error:
Error in elements[[1]] : subscript out of bounds
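I think the problem has to do with how things are indexed inside the loop, because I can fill the data frame one row at a time by hand. Any ideas on how to iterate over this process correctly?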
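Solution
@SlowLearning's suggestion in the comments ultimately solved this, but I had to add Sys.sleep(2) in more places to get it to work. The script was running faster than the website could load.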
# Read the number of result pages from the last pagination link
n <- remDr$findElement(using = "css selector", "#resultadosgsa_paginate > span > a:nth-child(7)")
n <- as.numeric(unlist(n$getElementText()))
n
# global.df is not initialised in the original snippet; starting from NULL is
# an assumption that works because rbindlist() skips NULL items
global.df <- NULL
for (i in seq_len(n)) {
  # Click through each page in the region, collecting the text
  res <- remDr$findElements("css selector", "#resumenResultados")
  res_n <- length(res)
  resumen.df <- data.frame(expediente = character(res_n),
                           entidad = character(res_n),
                           turno = character(res_n),
                           res_text = character(res_n),
                           stringsAsFactors = FALSE)
  for (j in seq_len(res_n)) {
    Sys.sleep(2)
    res[[j]]$clickElement()
    Sys.sleep(2) # give the subpage time to render before querying it
    ex_location <- remDr$findElement("css selector", "#swal2-content > div > div > h4:nth-child(1)")
    expediente <- unlist(ex_location$getElementText())
    en_location <- remDr$findElement("css selector", "#swal2-content > div > div > h4:nth-child(8)")
    entidad <- unlist(en_location$getElementText())
    tu_location <- remDr$findElement("css selector", "#swal2-content > div > div > h4:nth-child(5)")
    turno <- unlist(tu_location$getElementText())
    te_location <- remDr$findElement("css selector", "#swal2-content > div > div > p")
    res_text <- unlist(te_location$getElementText())
    resumen.df$expediente[j] <- expediente
    resumen.df$entidad[j] <- entidad
    resumen.df$turno[j] <- turno
    resumen.df$res_text[j] <- res_text
    Sys.sleep(2)
    # Close the subpage and wait before opening the next one
    button <- remDr$findElement("css selector", "body > div.swal2-container.swal2-center.swal2-fade.swal2-shown > div > div.swal2-actions > button.swal2-confirm.swal2-styled")
    button$clickElement()
  }
  # Append this page's rows to the running result
  global.list <- list(global.df, resumen.df)
  global.df <- rbindlist(global.list)
  Sys.sleep(2)
  next.page.button <- remDr$findElement("css selector", "#resultadosgsa_next")
  next.page.button$clickElement()
  Sys.sleep(2)
}
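The fixed Sys.sleep(2) pauses work, but they wait the full two seconds even when the page is ready sooner, and they can still be too short on a slow connection. One alternative is to poll until the elements actually exist. The following is only a minimal sketch: wait_for, probe, timeout and interval are hypothetical names chosen for illustration, not part of RSelenium, and the fake probe stands in for a real browser session so the snippet runs on its own.

```r
# Poll `probe` until it returns a non-empty result or `timeout` seconds pass.
# `wait_for` and its arguments are hypothetical helpers, not RSelenium API.
wait_for <- function(probe, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat an error from the probe as "not ready yet"
    result <- tryCatch(probe(), error = function(e) NULL)
    if (length(result) > 0) return(result)
    if (Sys.time() >= deadline) {
      stop("timed out after ", timeout, " seconds waiting for the page")
    }
    Sys.sleep(interval)
  }
}

# Stand-in for a page that only "loads" on the third poll
calls <- 0
fake_probe <- function() {
  calls <<- calls + 1
  if (calls < 3) list() else list("h4-element")
}

found <- wait_for(fake_probe, timeout = 5, interval = 0.1)
```

With a live session, the probe might be function() remDr$findElements("css selector", "#swal2-content > div > div > h4"). findElements returns an empty list when nothing matches yet, which is why elements[[1]] failed with "subscript out of bounds" in the question; waiting until the list is non-empty avoids indexing into it too early.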