python - Without using WebDriverWait my code returns: element click intercepted / with WebDriverWait it returns 'NoneType' object is not iterable
Problem description
Code Proposal:
Collect the links to all of the day's games listed on the page (https://int.soccerway.com/matches/2021/07/28/), with the freedom to change the date to whatever I want, such as 2021/08/01 and so on, so that in the future I can loop and collect the lists from several different days in a single run.
Even though it's a very slow model, without using Headless, this model clicks all the buttons, expands the data and imports all 465 listed match links:
for btn in driver.find_elements_by_xpath("//tr[contains(@class,'group-head  clickable')]"):
    btn.click()
Full Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(r"C:\Users\Computador\Desktop\Python\chromedriver.exe", options=options)
url = "https://int.soccerway.com/matches/2021/07/28/"
driver.get(url)
driver.find_element_by_xpath("//div[@class='language-picker-trigger']").click()
driver.find_element_by_xpath("//a[@href='https://int.soccerway.com']").click()
time.sleep(10)
for btn in driver.find_elements_by_xpath("//tr[contains(@class,'group-head  clickable')]"):
    btn.click()
time.sleep(10)
jogos = driver.find_elements_by_xpath("//td[contains(@class,'score-time')]//a")
for jogo in jogos:
    resultado = jogo.get_attribute("href")
    print(resultado)
driver.quit()
But when I add options.add_argument("headless") so that the browser doesn't open on my screen, the model returns the following error:
Message: element click intercepted
To get around this problem, I looked into the options, found WebDriverWait in this answer (https://stackoverflow.com/a/62904494/11462274), and tried to use it like this:
for btn in WebDriverWait(driver, 1).until(EC.element_to_be_clickable((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    btn.click()
Full Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
options.add_argument("headless")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(r"C:\Users\Computador\Desktop\Python\chromedriver.exe", options=options)
url = "https://int.soccerway.com/matches/2021/07/28/"
driver.get(url)
driver.find_element_by_xpath("//div[@class='language-picker-trigger']").click()
driver.find_element_by_xpath("//a[@href='https://int.soccerway.com']").click()
time.sleep(10)
for btn in WebDriverWait(driver, 1).until(EC.element_to_be_clickable((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    btn.click()
time.sleep(10)
jogos = driver.find_elements_by_xpath("//td[contains(@class,'score-time')]//a")
for jogo in jogos:
    resultado = jogo.get_attribute("href")
    print(resultado)
driver.quit()
But it fails, because EC.element_to_be_clickable resolves to a single WebElement rather than a list (to wait for a list of elements you would use EC.presence_of_all_elements_located instead), so the result is not iterable and it errors out with:
'NoneType' object is not iterable
Why do I need this option?
1 - I'm going to automate this in an online terminal, so there won't be any browser opening on screen, and I need it to be fast so I don't use up too much of my time limit on the terminal.
2 - I need an option that lets me use any date instead of 2021/07/28 in:
url = "https://int.soccerway.com/matches/2021/07/28/"
where in the future I'll add the parameter:
today = date.today().strftime("%Y/%m/%d")
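As a side note, building the matches URL for an arbitrary day is just string formatting; a minimal sketch (the helper name matches_url and the use of timedelta are my own illustration, not part of the original code):

```python
from datetime import date, timedelta

def matches_url(day: date) -> str:
    # soccerway's match lists live under /matches/YYYY/MM/DD/
    return "https://int.soccerway.com/matches/{}/".format(day.strftime("%Y/%m/%d"))

print(matches_url(date(2021, 7, 28)))
# → https://int.soccerway.com/matches/2021/07/28/

# Looping over several consecutive days then becomes trivial:
for offset in range(3):
    print(matches_url(date(2021, 7, 28) + timedelta(days=offset)))
```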
In this answer (https://stackoverflow.com/a/68535595/11462274), someone suggested a very fast and interesting option (named at the end of the answer as "Quicker Version") that needs no WebDriver, but I was only able to make it work on the first page of the site; when I try to use other dates of the year, it keeps returning only the links to the current day's games.
Expected Result (there are 465 links, but I didn't include the entire result because of the character limit):
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-sheriff-tiraspol/alashkert-fc/3517568/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-neftchi/olympiakos-cfp/3517569/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/scs-cfr-1907-cluj-sa/newcastle-fc/3517571/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-midtjylland/celtic-fc/3517576/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-razgrad-2000/mura/3517574/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/galatasaray-sk/psv-nv/3517577/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/bsc-young-boys-bern/k-slovan-bratislava/3517566/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-crvena-zvezda-beograd/fc-kairat-almaty/3517570/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/ac-sparta-praha/sk-rapid-wien/3517575/
https://int.soccerway.com/matches/2021/07/28/world/olympics/saudi-arabia-u23/brazil--under-23/3497390/
https://int.soccerway.com/matches/2021/07/28/world/olympics/germany-u23/cote-divoire-u23/3497391/
https://int.soccerway.com/matches/2021/07/28/world/olympics/romania-u23/new-zealand-under-23/3497361/
https://int.soccerway.com/matches/2021/07/28/world/olympics/korea-republic-u23/honduras-u23/3497362/
https://int.soccerway.com/matches/2021/07/28/world/olympics/australia-under-23/egypt-under-23/3497383/
https://int.soccerway.com/matches/2021/07/28/world/olympics/spain-under-23/argentina-under-23/3497384/
https://int.soccerway.com/matches/2021/07/28/world/olympics/france-u23/japan-u23/3497331/
https://int.soccerway.com/matches/2021/07/28/world/olympics/south-africa-u23/mexico-u23/3497332/
https://int.soccerway.com/matches/2021/07/28/africa/cecafa-senior-challenge-cup/uganda-under-23/eritrea-under-23/3567664/
Note 1: There are multiple variants of score-time, such as score-time status and score-time score, which is why I used contains in "//td[contains(@class,'score-time')]//a".
Update
If possible, in addition to help with the current problem, I'm interested in an improved, faster option compared to the method I currently use. (I'm still learning, so my methods are pretty archaic.)
Solution
You don't need Selenium
Selenium should never be the primary means of scraping data from the web. It is slow and usually takes many more lines of code than the alternatives. Whenever possible, use requests combined with the lxml parser. In this particular use case you are using selenium only to navigate between different URLs, which can easily be hard-coded, removing the need for it in the first place.
import requests
from lxml import html
import csv
import re
from datetime import datetime
import json


class GameCrawler(object):

    def __init__(self):
        self.input_date = input('Specify a date e.g. 2021/07/28: ')
        self.date_object = datetime.strptime(self.input_date, "%Y/%m/%d")
        self.output_file = '{}.csv'.format(re.sub('/', '-', self.input_date))
        self.ROOT_URL = 'https://int.soccerway.com'
        self.json_request_url = '{}/a/block_competition_matches_summary'.format(self.ROOT_URL)
        self.entry_point = '{}/matches/{}'.format(self.ROOT_URL, self.input_date)
        self.session = requests.Session()
        self.HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        self.all_game_urls = []
        self.league_urls = self.get_league_urls()

    def save_to_csv(self):
        with open(self.output_file, 'a+') as f:
            writer = csv.writer(f)
            for row in self.all_game_urls:
                writer.writerow([row])
        return

    def request_other_pages(self, page_params):
        params = {
            'block_id': 'page_competition_1_block_competition_matches_summary_11',
            'callback_params': json.dumps({
                "page": page_params['page_count'] + 2,
                "block_service_id": "competition_summary_block_competitionmatchessummary",
                "round_id": int(page_params['round_id']),
                "outgroup": "",
                "view": 1,
                "competition_id": int(page_params['competition_id'])
            }),
            'action': 'changePage',
            'params': json.dumps({"page": page_params['page_count']}),
        }
        response = self.session.get(self.json_request_url, headers=self.HEADERS, params=params)
        if response.status_code != 200:
            return
        json_data = json.loads(response.text)["commands"][0]["parameters"]["content"]
        return html.fromstring(json_data)

    def get_page_params(self, tree, response):
        res = re.search(r'r(\d+)?/$', response.url)
        if res:
            page_params = {
                'round_id': res.group(1),
                'competition_id': tree.xpath('//*[@data-competition]/@data-competition')[0],
                'page_count': len(tree.xpath('//*[@class="page-dropdown"]/option'))
            }
            return page_params if page_params['page_count'] != 0 else {}
        return {}

    def match_day_check(self, game):
        timestamp = game.xpath('./@data-timestamp')[0]
        match_date = datetime.fromtimestamp(int(timestamp))
        return True if self.date_object.day == match_date.day else False

    def scrape_page(self, tree):
        for game in tree.xpath('//*[@data-timestamp]'):
            game_url = game.xpath('./td[@class="score-time "]/a/@href')
            if game_url and self.match_day_check(game):
                self.all_game_urls.append('{}{}'.format(self.ROOT_URL, game_url[0]))
        return

    def get_league_urls(self):
        page = self.session.get(self.entry_point, headers=self.HEADERS)
        tree = html.fromstring(page.content)
        league_urls = ['{}{}'.format(self.ROOT_URL, league_url) for league_url in tree.xpath('//th[@class="competition-link"]/a/@href')]
        return league_urls

    def main(self):
        for index, league_url in enumerate(self.league_urls):
            response = self.session.get(league_url, headers=self.HEADERS)
            tree = html.fromstring(response.content)
            self.scrape_page(tree)
            page_params = self.get_page_params(tree, response)
            if page_params.get('page_count', 0) != 0:
                while True:
                    page_params['page_count'] = page_params['page_count'] - 1
                    if page_params['page_count'] == 0:
                        break
                    tree = self.request_other_pages(page_params)
                    if tree is None:
                        continue
                    self.scrape_page(tree)
            print('Retrieved links for {} out of {} competitions'.format(index + 1, len(self.league_urls)))
        self.save_to_csv()
        return


if __name__ == '__main__':
    GameCrawler().main()
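Two small pieces of the class above do the heavy lifting and are worth seeing in isolation: the round-id regex from get_page_params() and the timestamp day check from match_day_check(). The URL, ids and timestamp below are made-up examples for illustration only:

```python
import re
from datetime import datetime

# Round-id extraction: soccerway competition URLs end in .../r<digits>/
url = "https://int.soccerway.com/national/england/premier-league/20212022/regular-season/r63396/"  # hypothetical URL
match = re.search(r'r(\d+)/$', url)
round_id = match.group(1) if match else None
print(round_id)  # → 63396

# Day check: every match row carries a unix @data-timestamp attribute,
# which fromtimestamp() converts to a local datetime for comparison
wanted = datetime.strptime("2021/07/28", "%Y/%m/%d")
match_date = datetime.fromtimestamp(1627473600)  # hypothetical timestamp
print(wanted.day == match_date.day)
```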
So when is Selenium worth using?
Nowadays it's common for websites to serve dynamic content, so if the data you want to retrieve is not loaded statically:
- inspect your browser's Network tab to see whether there is a request specific to the data you're interested in, and
- try to replicate that request with requests.
If points 1 and 2 are not feasible because of how the web page is designed, then your best bet is to use selenium to get what you need by simulating user interaction. For parsing the HTML you can still opt for lxml, or stick with selenium, which offers that functionality as well.
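Replicating an XHR such as the changePage call above mostly comes down to assembling the right query string. A minimal sketch with the standard library showing how the nested JSON ends up URL-encoded (the round_id and competition_id values are placeholders, and this only builds the URL without sending a request):

```python
import json
from urllib.parse import urlencode

params = {
    'block_id': 'page_competition_1_block_competition_matches_summary_11',
    # callback_params is itself a JSON string nested inside the query string
    'callback_params': json.dumps({"page": 3, "round_id": 63396, "competition_id": 8}),
    'action': 'changePage',
    'params': json.dumps({"page": 1}),
}
query = urlencode(params)
full_url = 'https://int.soccerway.com/a/block_competition_matches_summary?' + query
print(full_url)
```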
First edit:
- fixed the issue raised by the OP
- included the limitations of the provided code
- code refactoring
- added a date check to make sure only matches played on the specified date are saved
- added the ability to save the search results
Second edit:
- added functionality to walk through all pages of each listed competition, via get_page_params() and request_other_pages()
- more code refactoring