python - Without using WebDriverWait my code returns: element click intercepted / with WebDriverWait it returns 'NoneType' object is not iterable
Problem description
Code Proposal:
Collect the links to all of the day's games listed on the page (https://int.soccerway.com/matches/2021/07/28/), with the freedom to change the date to whatever I want, such as 2021/08/01 and so on, so that in the future I can loop and collect the lists from several different days in a single run.
Even though it's a very slow model, without using Headless, this model clicks all the buttons, expands the data and imports all 465 listed match links:
for btn in driver.find_elements_by_xpath("//tr[contains(@class,'group-head  clickable')]"):
    btn.click()
Full Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(r"C:\Users\Computador\Desktop\Python\chromedriver.exe", options=options)
url = "https://int.soccerway.com/matches/2021/07/28/"
driver.get(url)
driver.find_element_by_xpath("//div[@class='language-picker-trigger']").click()
driver.find_element_by_xpath("//a[@href='https://int.soccerway.com']").click()
time.sleep(10)
for btn in driver.find_elements_by_xpath("//tr[contains(@class,'group-head  clickable')]"):
    btn.click()
time.sleep(10)
jogos = driver.find_elements_by_xpath("//td[contains(@class,'score-time')]//a")
for jogo in jogos:
    resultado = jogo.get_attribute("href")
    print(resultado)
driver.quit()
But when I add options.add_argument("headless") so that the browser doesn't open on my screen, the model returns the following error:
Message: element click intercepted
To get around this problem, I looked into the options, found WebDriverWait in this answer (https://stackoverflow.com/a/62904494/11462274), and tried to use it like this:
for btn in WebDriverWait(driver, 1).until(EC.element_to_be_clickable((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    btn.click()
Full Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
options.add_argument("headless")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
driver = webdriver.Chrome(r"C:\Users\Computador\Desktop\Python\chromedriver.exe", options=options)
url = "https://int.soccerway.com/matches/2021/07/28/"
driver.get(url)
driver.find_element_by_xpath("//div[@class='language-picker-trigger']").click()
driver.find_element_by_xpath("//a[@href='https://int.soccerway.com']").click()
time.sleep(10)
for btn in WebDriverWait(driver, 1).until(EC.element_to_be_clickable((By.XPATH, "//tr[contains(@class,'group-head  clickable')]"))):
    btn.click()
time.sleep(10)
jogos = driver.find_elements_by_xpath("//td[contains(@class,'score-time')]//a")
for jogo in jogos:
    resultado = jogo.get_attribute("href")
    print(resultado)
driver.quit()
But it fails, because EC.element_to_be_clickable resolves to a single WebElement rather than a list (to wait for a list of elements you would use EC.presence_of_all_elements_located instead), so the result is not iterable and it errors out with:
'NoneType' object is not iterable
Why do I need this option?
1 - I'm going to automate this in an online terminal, so there won't be any browser opening on screen, and I need it to be fast so I don't use up too much of my time limit on the terminal.
2 - I need an option that lets me use any date instead of 2021/07/28 in:
url = "https://int.soccerway.com/matches/2021/07/28/"
where in the future I'll add the parameter:
today = date.today().strftime("%Y/%m/%d")
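As a side note, building the matches URL for an arbitrary day is just string formatting; a minimal sketch (the helper name matches_url and the use of timedelta are my own illustration, not part of the original code):

```python
from datetime import date, timedelta

def matches_url(day: date) -> str:
    # soccerway's match lists live under /matches/YYYY/MM/DD/
    return "https://int.soccerway.com/matches/{}/".format(day.strftime("%Y/%m/%d"))

print(matches_url(date(2021, 7, 28)))
# → https://int.soccerway.com/matches/2021/07/28/

# Looping over several consecutive days then becomes trivial:
for offset in range(3):
    print(matches_url(date(2021, 7, 28) + timedelta(days=offset)))
```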
In this answer (https://stackoverflow.com/a/68535595/11462274), someone suggested a very fast and interesting option (named at the end of the answer as "Quicker Version") that needs no WebDriver, but I was only able to make it work on the first page of the site; when I try to use other dates of the year, it keeps returning only the links to the current day's games.
Expected Result (there are 465 links, but I didn't include the entire result because of the character limit):
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-sheriff-tiraspol/alashkert-fc/3517568/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-neftchi/olympiakos-cfp/3517569/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/scs-cfr-1907-cluj-sa/newcastle-fc/3517571/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fc-midtjylland/celtic-fc/3517576/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-razgrad-2000/mura/3517574/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/galatasaray-sk/psv-nv/3517577/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/bsc-young-boys-bern/k-slovan-bratislava/3517566/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/fk-crvena-zvezda-beograd/fc-kairat-almaty/3517570/
https://int.soccerway.com/matches/2021/07/28/europe/uefa-champions-league/ac-sparta-praha/sk-rapid-wien/3517575/
https://int.soccerway.com/matches/2021/07/28/world/olympics/saudi-arabia-u23/brazil--under-23/3497390/
https://int.soccerway.com/matches/2021/07/28/world/olympics/germany-u23/cote-divoire-u23/3497391/
https://int.soccerway.com/matches/2021/07/28/world/olympics/romania-u23/new-zealand-under-23/3497361/
https://int.soccerway.com/matches/2021/07/28/world/olympics/korea-republic-u23/honduras-u23/3497362/
https://int.soccerway.com/matches/2021/07/28/world/olympics/australia-under-23/egypt-under-23/3497383/
https://int.soccerway.com/matches/2021/07/28/world/olympics/spain-under-23/argentina-under-23/3497384/
https://int.soccerway.com/matches/2021/07/28/world/olympics/france-u23/japan-u23/3497331/
https://int.soccerway.com/matches/2021/07/28/world/olympics/south-africa-u23/mexico-u23/3497332/
https://int.soccerway.com/matches/2021/07/28/africa/cecafa-senior-challenge-cup/uganda-under-23/eritrea-under-23/3567664/
Note 1: There are multiple variants of score-time, such as score-time status and score-time score, which is why I used contains in "//td[contains(@class,'score-time')]//a".
Update
If possible, in addition to help with the current problem, I'm interested in an improved, faster option compared to the method I currently use. (I'm still learning, so my methods are pretty archaic.)
Solution
You don't need Selenium
Selenium should never be the primary means of scraping data from the web. It is slow and usually takes many more lines of code than the alternatives. Whenever possible, use requests combined with the lxml parser. In this particular use case you are using selenium only to navigate between different URLs, which can easily be hard-coded, removing the need for it in the first place.
import requests
from lxml import html
import csv
import re
from datetime import datetime
import json


class GameCrawler(object):

    def __init__(self):
        self.input_date = input('Specify a date e.g. 2021/07/28: ')
        self.date_object = datetime.strptime(self.input_date, "%Y/%m/%d")
        self.output_file = '{}.csv'.format(re.sub('/', '-', self.input_date))
        self.ROOT_URL = 'https://int.soccerway.com'
        self.json_request_url = '{}/a/block_competition_matches_summary'.format(self.ROOT_URL)
        self.entry_point = '{}/matches/{}'.format(self.ROOT_URL, self.input_date)
        self.session = requests.Session()
        self.HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        self.all_game_urls = []
        self.league_urls = self.get_league_urls()

    def save_to_csv(self):
        with open(self.output_file, 'a+') as f:
            writer = csv.writer(f)
            for row in self.all_game_urls:
                writer.writerow([row])
        return

    def request_other_pages(self, page_params):
        params = {
            'block_id': 'page_competition_1_block_competition_matches_summary_11',
            'callback_params': json.dumps({
                "page": page_params['page_count'] + 2,
                "block_service_id": "competition_summary_block_competitionmatchessummary",
                "round_id": int(page_params['round_id']),
                "outgroup": "",
                "view": 1,
                "competition_id": int(page_params['competition_id'])
            }),
            'action': 'changePage',
            'params': json.dumps({"page": page_params['page_count']}),
        }
        response = self.session.get(self.json_request_url, headers=self.HEADERS, params=params)
        if response.status_code != 200:
            return
        json_data = json.loads(response.text)["commands"][0]["parameters"]["content"]
        return html.fromstring(json_data)

    def get_page_params(self, tree, response):
        res = re.search(r'r(\d+)?/$', response.url)
        if res:
            page_params = {
                'round_id': res.group(1),
                'competition_id': tree.xpath('//*[@data-competition]/@data-competition')[0],
                'page_count': len(tree.xpath('//*[@class="page-dropdown"]/option'))
            }
            return page_params if page_params['page_count'] != 0 else {}
        return {}

    def match_day_check(self, game):
        timestamp = game.xpath('./@data-timestamp')[0]
        match_date = datetime.fromtimestamp(int(timestamp))
        return True if self.date_object.day == match_date.day else False

    def scrape_page(self, tree):
        for game in tree.xpath('//*[@data-timestamp]'):
            game_url = game.xpath('./td[@class="score-time "]/a/@href')
            if game_url and self.match_day_check(game):
                self.all_game_urls.append('{}{}'.format(self.ROOT_URL, game_url[0]))
        return

    def get_league_urls(self):
        page = self.session.get(self.entry_point, headers=self.HEADERS)
        tree = html.fromstring(page.content)
        league_urls = ['{}{}'.format(self.ROOT_URL, league_url) for league_url in tree.xpath('//th[@class="competition-link"]/a/@href')]
        return league_urls

    def main(self):
        for index, league_url in enumerate(self.league_urls):
            response = self.session.get(league_url, headers=self.HEADERS)
            tree = html.fromstring(response.content)
            self.scrape_page(tree)
            page_params = self.get_page_params(tree, response)
            if page_params.get('page_count', 0) != 0:
                while True:
                    page_params['page_count'] = page_params['page_count'] - 1
                    if page_params['page_count'] == 0:
                        break
                    tree = self.request_other_pages(page_params)
                    if tree is None:
                        continue
                    self.scrape_page(tree)
            print('Retrieved links for {} out of {} competitions'.format(index + 1, len(self.league_urls)))
        self.save_to_csv()
        return


if __name__ == '__main__':
    GameCrawler().main()
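Two small pieces of the class above do the heavy lifting and are worth seeing in isolation: the round-id regex from get_page_params() and the timestamp day check from match_day_check(). The URL, ids and timestamp below are made-up examples for illustration only:

```python
import re
from datetime import datetime

# Round-id extraction: soccerway competition URLs end in .../r<digits>/
url = "https://int.soccerway.com/national/england/premier-league/20212022/regular-season/r63396/"  # hypothetical URL
match = re.search(r'r(\d+)/$', url)
round_id = match.group(1) if match else None
print(round_id)  # → 63396

# Day check: every match row carries a unix @data-timestamp attribute,
# which fromtimestamp() converts to a local datetime for comparison
wanted = datetime.strptime("2021/07/28", "%Y/%m/%d")
match_date = datetime.fromtimestamp(1627473600)  # hypothetical timestamp
print(wanted.day == match_date.day)
```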
So when is Selenium worth using?
Nowadays it's common for websites to serve dynamic content, so if the data you want to retrieve is not loaded statically:
- inspect your browser's Network tab to see whether there is a request specific to the data you're interested in, and
- try to replicate that request with requests.
If points 1 and 2 are not feasible because of how the web page is designed, then your best bet is to use selenium to get what you need by simulating user interaction. For parsing the HTML you can still opt for lxml, or stick with selenium, which offers that functionality as well.
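Replicating an XHR such as the changePage call above mostly comes down to assembling the right query string. A minimal sketch with the standard library showing how the nested JSON ends up URL-encoded (the round_id and competition_id values are placeholders, and this only builds the URL without sending a request):

```python
import json
from urllib.parse import urlencode

params = {
    'block_id': 'page_competition_1_block_competition_matches_summary_11',
    # callback_params is itself a JSON string nested inside the query string
    'callback_params': json.dumps({"page": 3, "round_id": 63396, "competition_id": 8}),
    'action': 'changePage',
    'params': json.dumps({"page": 1}),
}
query = urlencode(params)
full_url = 'https://int.soccerway.com/a/block_competition_matches_summary?' + query
print(full_url)
```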
First edit:
- fixed the issue raised by the OP
- included the limitations of the provided code
- code refactoring
- added a date check to make sure only matches played on the specified date are saved
- added the ability to save the search results
Second edit:
- added functionality to walk through all pages of each listed competition, via get_page_params() and request_other_pages()
- more code refactoring