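java - Extracting text and web links with Selenium WebDriver
Problem description
I am learning Selenium and want to extract the text and links of the events listed on Sympla. However, after I click the "more events" button I cannot extract the newly loaded events: the code always extracts the same initial events from the page.
Here is the complete code, easy to copy.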
public static void main(String[] args) throws InterruptedException {
WebDriverManager.firefoxdriver().setup();
WebDriver driver = new FirefoxDriver();
driver.manage().window().maximize();
driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
// If a captcha appears, close the page and exit.
boolean captcha = driver.getPageSource().contains("Não sou um robô");
if (captcha == true) {
System.out.println("O Captcha apareceu, acabou a brincadeira!");
driver.close();
driver.quit();
}
// load more button
WebElement CarregarMais = driver.findElement(By
.xpath("//button[@id='more-events']"));
// Number of events counter
List<WebElement> eventos = (List<WebElement>) driver.findElements(By
.cssSelector("div.event-name.event-card"));
System.out.println("Number of links: " + eventos.size());
// Number of links counter
List<WebElement> eventos_link = (List<WebElement>) driver
.findElements(By.cssSelector("a.sympla-card.w-inline-block"));
// iterating over the button more events
for (int j = 0; j < eventos.size(); j++) {
CarregarMais.click();
@SuppressWarnings("deprecation")
WebDriverWait wait = new WebDriverWait(driver, 10);
WebElement element = wait.until(ExpectedConditions
.elementToBeClickable(By
.xpath("//button[@id='more-events']")));
// Iterating over event links
for (int i = 0; i < eventos_link.size(); i++) {
System.out.println(i + " " + eventos.get(i).getText() + " - "
+ eventos_link.get(i).getAttribute("href"));
Thread.sleep(500);
}
}
}
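Solution
This happens because you never read the links again. Each click on the button produces a new page, so you need to read them again.
You also need to remember the last link you fetched.
So, after waiting for the button to become clickable again, you need to re-read eventos and eventos_link. You could keep track of your position with a global variable, for example lastFetchedLinkIndex.
This would be my approach (adapting your code):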
WebDriverManager.firefoxdriver().setup();
WebDriver driver = new FirefoxDriver();
driver.manage().window().maximize();
driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
// If a captcha appears, close the page and exit.
boolean captcha = driver.getPageSource().contains("Não sou um robô");
if (captcha) {
System.out.println("O Captcha apareceu, acabou a brincadeira!");
driver.close();
driver.quit();
// The driver is gone at this point, so stop instead of running the code below.
return;
}
// load more button
WebElement CarregarMais = driver.findElement(By
.xpath("//button[@id='more-events']"));
// Number of events counter
List<WebElement> eventos = (List<WebElement>) driver.findElements(By
.cssSelector("div.event-name.event-card"));
System.out.println("Number of links: " + eventos.size());
// Number of links counter
List<WebElement> eventos_link = (List<WebElement>) driver
.findElements(By.cssSelector("a.sympla-card.w-inline-block"));
int lastEventScraped = 0;
// iterating over the button more events
for (int j = 0; j < eventos.size(); j++) {
CarregarMais.click();
@SuppressWarnings("deprecation")
WebDriverWait wait = new WebDriverWait(driver, 10);
WebElement element = wait.until(ExpectedConditions
.elementToBeClickable(By
.xpath("//button[@id='more-events']")));
eventos = (List<WebElement>) driver.findElements(By
.cssSelector("div.event-name.event-card"));
eventos_link = (List<WebElement>) driver
.findElements(By.cssSelector("a.sympla-card.w-inline-block"));
// Iterating over event links
for (int i = lastEventScraped; i < eventos_link.size(); i++, lastEventScraped++) {
System.out.println(i + " " + eventos.get(i).getText() + " - "
+ eventos_link.get(i).getAttribute("href"));
Thread.sleep(500);
}
}
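The index bookkeeping used above can be looked at separately from Selenium. The following is only a minimal sketch, not the answerer's code: the class IncrementalScrapeSketch, the method newEntries, and the sample event names and example.com URLs are hypothetical, and plain strings stand in for the text and href values that would be re-read from the page after each click.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalScrapeSketch {

    /**
     * Formats only the entries that were not printed before.
     * names and hrefs stand in for the freshly re-read element lists;
     * alreadyScraped is the number of entries handled in earlier rounds.
     */
    static List<String> newEntries(List<String> names, List<String> hrefs, int alreadyScraped) {
        List<String> out = new ArrayList<>();
        // Guard against the two lists having different lengths.
        int limit = Math.min(names.size(), hrefs.size());
        for (int i = alreadyScraped; i < limit; i++) {
            out.add(i + " " + names.get(i) + " - " + hrefs.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data representing the events visible on first load.
        List<String> names = new ArrayList<>(Arrays.asList("Event A", "Event B"));
        List<String> hrefs = new ArrayList<>(
                Arrays.asList("https://example.com/a", "https://example.com/b"));
        int lastEventScraped = 0;

        List<String> batch = newEntries(names, hrefs, lastEventScraped);
        batch.forEach(System.out::println);
        lastEventScraped += batch.size();

        // Simulates a click on the "more events" button followed by a re-read:
        // the lists now contain the old entries plus one new entry.
        names.add("Event C");
        hrefs.add("https://example.com/c");

        batch = newEntries(names, hrefs, lastEventScraped);
        batch.forEach(System.out::println);
        lastEventScraped += batch.size();
    }
}
```

In the second round only "Event C" is formatted, because the counter already covers the first two entries. Bounding the loop by the shorter of the two lists is a design choice of this sketch; the code above indexes both lists with the size of eventos_link alone.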