Creating a loop for web scraping with Selenium in Python

Problem description

I want to create a loop so that I can scrape the individual time data for every horse in all eight races on the At The Races website.

Below is an example for the first of the eight races (17:15):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

url = 'http://www.attheraces.com/racecard/Wolverhampton/6-October-2018/1715'

driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(2)

driver.find_element_by_xpath('//*[@id="racecard-tabs 1061960"]/div[1]/div/div[1]/ul/li[2]/a').click()

WebDriverWait(driver, 5).until(expected_conditions.presence_of_element_located((By.XPATH, '//*[@id="tab-racecard-sectional-times"]/div/div[1]/div[1]/div[2]/div/button')))

The next race (17:45) has the following URL:

url = 'http://www.attheraces.com/racecard/Wolverhampton/6-October-2018/1745'
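In other words, the race-card URLs differ only in the trailing off-time. A minimal sketch of how the URLs could be built, assuming the off-times are known up front (the solution below reads them from the page instead):

# Hypothetical sketch: the times listed here are just the examples from the question.
base = 'http://www.attheraces.com/racecard/Wolverhampton/6-October-2018'
for off_time in ['1715', '1745', '1815']:
    print(base + '/' + off_time)  # .../1715, .../1745, .../1815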

and the id in the following line keeps changing along with the URL:

driver.find_element_by_xpath('//*[@id="racecard-tabs 1061961"]/div[1]/div/div[1]/ul/li[2]/a').click()

So for 17:15 the racecard-tabs id becomes 1061960, for 17:45 it becomes 1061961, for 18:15 it becomes 1061963, and so on.
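Because only the numeric suffix of that id changes from race to race, one way to sidestep it is to match just the stable prefix. This is only a sketch under the assumption that the id always begins with "racecard-tabs"; the selector is not taken from the original page:

# Hypothetical: click the tab by matching only the stable part of the id,
# so the per-race number (1061960, 1061961, ...) never needs to be hard-coded.
driver.find_element_by_xpath(
    '//*[starts-with(@id, "racecard-tabs")]/div[1]/div/div[1]/ul/li[2]/a'
).click()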

Any help or advice would be much appreciated.

Tags: python, html, loops, selenium, web-scraping

Solution


This will work. You can even change the date, and the rest is done for you automatically!

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

def races(main_url):
    # Open the day's racecard index and collect the off-time of every race.
    driver = webdriver.Chrome()
    driver.get(main_url)
    driver.implicitly_wait(2)

    # The first five characters of each "time-location" element are the
    # off-time (e.g. 17:15); stripping the colon gives the URL fragment "1715".
    races = driver.find_elements_by_class_name('time-location')
    races = [race.text[:5] for race in races]
    races = [race.replace(':', '') for race in races]

    driver.close()

    return races

def scrape(url):
    driver = webdriver.Chrome()
    driver.get(url)
    driver.implicitly_wait(2)

    # Open the sectional-times tab via its class name instead of the id-based
    # XPath, so the numeric id that changes per race no longer matters.
    driver.find_elements_by_class_name('racecard-ajax-link')[1].click()

    # Wait for the sectional-times panel to load before reading it.
    WebDriverWait(driver, 5).until(expected_conditions.presence_of_element_located((By.XPATH, '//*[@id="tab-racecard-sectional-times"]/div/div[1]/div[1]/div[2]/div/button')))

    # One "card-item" per horse: print its name and its list of sectional times.
    for horse in driver.find_elements_by_class_name('card-item'):
        horseName = horse.find_element_by_class_name('form-link').text
        times = horse.find_elements_by_class_name('sectionals-time')
        times = [time.text for time in times]
        print('{}: {}'.format(horseName, times))
    print()

    driver.close()

def main():
    date = '6-October-2018'
    main_url = 'http://www.attheraces.com/racecard/Wolverhampton/' + date
    # Build the URL for each race from its off-time and scrape it in turn.
    for race in races(main_url):
        url = main_url + '/' + race
        print(url)
        scrape(url)

if __name__ == '__main__':
    main()
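A side note: Selenium 4 removed the find_element_by_* / find_elements_by_* helpers used above, so on a current install the same scrape would be written with By locators. A sketch of the equivalent single-race scrape, keeping the answer's selectors unchanged:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('http://www.attheraces.com/racecard/Wolverhampton/6-October-2018/1715')
driver.implicitly_wait(2)

# Same selectors as the answer, expressed with Selenium 4's By-based API.
driver.find_elements(By.CLASS_NAME, 'racecard-ajax-link')[1].click()
WebDriverWait(driver, 5).until(expected_conditions.presence_of_element_located(
    (By.XPATH, '//*[@id="tab-racecard-sectional-times"]/div/div[1]/div[1]/div[2]/div/button')))

for horse in driver.find_elements(By.CLASS_NAME, 'card-item'):
    horse_name = horse.find_element(By.CLASS_NAME, 'form-link').text
    times = [t.text for t in horse.find_elements(By.CLASS_NAME, 'sectionals-time')]
    print('{}: {}'.format(horse_name, times))

driver.quit()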
