Scraping data starting from a certain date

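Problem description

I only want to scrape data from a table after a certain date. The code below gets the first date in the data (URL attached), but how would I create a for loop that extracts data only from the rows dated 11 Oct 2020 and earlier?

I want to create a for loop that extracts all the data before a certain date from the table with class 'table table-hover small horsePerformance'.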

http://www.harness.org.au/racing/horse-search/?horseId=813476


import requests
from bs4 import BeautifulSoup as bs

# horseurl and headers are defined elsewhere in the script
with requests.Session() as s:
    try:
        webpage_response = s.get(horseurl, headers=headers)
    except requests.exceptions.ConnectionError:
        # stop if the site cannot be reached
        raise SystemExit("Connection refused")

    soup = bs(webpage_response.content, "html.parser")
    horseresult6 = soup.find('table', class_='table table-hover small horsePerformance')
    # first and second dates in the table
    daysbetween = horseresult6.find('td', class_='date').get_text().strip()
    daysbetween24 = horseresult6.find('td', class_='date').find_next('td', class_='date').get_text().strip()

But I think it should look something like this:

for tr in horseresult6.find_all('tr')[1:]: 
     daysbetween = tr.find('td', class_='date').get_text().strip()
     if xdate > daysbetween:
         do something
     else:
         continue

When I try this, it doesn't seem to work.

Tags: python, beautifulsoup

Solution


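You can compare the dates with comparison operators such as >= and <=, as long as you parse the date strings first (for example with time.strptime) instead of comparing the raw text.

Here's how: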

import time

import requests
from bs4 import BeautifulSoup

horse_url = "http://www.harness.org.au/racing/horse-search/?horseId=813476"

with requests.Session() as s:
    try:
        webpage_response = s.get(horse_url)
    except requests.exceptions.ConnectionError:
        # webpage_response is never assigned if the request fails, so stop here
        raise SystemExit("Connection refused")

    table = BeautifulSoup(
        webpage_response.content,
        "html.parser",
    ).find('table', class_='table table-hover small horsePerformance')

    target_date = "11 Oct 2020"

    for row in table.find_all("tr")[1:]:  # skipping the header
        date = row.find("td", class_="date").find("a").getText()  # table date
        # parse both strings so they compare as dates, not as text
        if time.strptime(date, "%d %b %Y") >= time.strptime(target_date, "%d %b %Y"):
            # do your parsing here, this is just an example
            print(f'{date} - {row.find("td", class_="stake").getText(strip=True)}')

Output:

05 Apr 2021 - $4,484
29 Mar 2021 - $595
23 Mar 2021 - $4,484
12 Mar 2021 - $220
08 Mar 2021 - $181
02 Mar 2021 - $263
19 Feb 2021 - $180
12 Feb 2021 - $1,200
26 Jan 2021 - $4,484
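
As a side note on why the loop in the question probably misbehaved: it compares the date strings directly, and strings are ordered character by character rather than by calendar date. The following is a minimal sketch of the difference, using one date from the output above and the cutoff from the question. It assumes an English locale, since the month abbreviation matched by %b is locale-dependent, and the variable names are purely illustrative.

```python
from datetime import datetime

DATE_FORMAT = "%d %b %Y"  # e.g. "11 Oct 2020"

table_date = "05 Apr 2021"  # a date string as it appears in the table
cutoff = "11 Oct 2020"      # the cutoff date from the question

# Raw strings are compared character by character: "0" sorts before "1",
# so the string comparison claims April 2021 is NOT later than October 2020.
string_says_later = table_date > cutoff

# Parsed dates are compared in calendar order.
parsed_says_later = (
    datetime.strptime(table_date, DATE_FORMAT)
    > datetime.strptime(cutoff, DATE_FORMAT)
)

print(string_says_later, parsed_says_later)
```

The two comparisons disagree, which is why the date text has to be parsed before it is compared.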

Going back in time:

# continues from the script above: `time` is imported and `table` is already parsed
target_date = "26 Jan 2021"

for row in table.find_all("tr")[1:]:  # skipping the header
    date = row.find("td", class_="date").find("a").getText()  # table date
    if time.strptime(date, "%d %b %Y") <= time.strptime(target_date, "%d %b %Y"):  # comparing the dates
        # do your parsing here, this is just an example
        print(f'{date} - {row.find("td", class_="stake").getText(strip=True)}')

Output:

26 Jan 2021 - $4,484
14 Sep 2020 - $100
11 Sep 2020 - $616
04 Sep 2020 - $180
21 Aug 2020 - $180
17 Aug 2020 - $595
28 Jul 2020 - $4,291
21 Jul 2020 - $3,523
13 Jul 2020 - $300
30 Jun 2020 - $1,173
15 Jun 2020 - $100
30 May 2020 - $3,523
22 May 2020 - $500
12 May 2020 - $963
05 May 2020 - $3,523
02 May 2020 - $1,986
24 Apr 2020 - $144
09 Apr 2020 - $144
30 Mar 2020 - $1,225
10 Mar 2020 - $100
09 Dec 2019 - $595
02 Dec 2019 - $4,484
26 Nov 2019 - $4,484
19 Nov 2019 - $100
02 Nov 2019 - $4,484
27 Oct 2019 - $2,562
13 Oct 2019 - $700
31 May 2019 - $1,000
21 May 2019 - $4,484
07 May 2019 - $1,225
27 Apr 2019 - $595
21 Apr 2019 - $0
14 Apr 2019 - $0
07 Apr 2019 - $0
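
For the cutoff actually asked about (11 Oct 2020 and earlier), the same idea could be wrapped in a small helper. This is only a sketch under some assumptions: parse_date and rows_on_or_before are hypothetical names, and sample_rows is a hand-copied handful of (date, stake) pairs from the outputs above standing in for the scraped table rows, so the example runs without a network connection. It also assumes an English locale for %b.

```python
from datetime import date, datetime

DATE_FORMAT = "%d %b %Y"  # e.g. "11 Oct 2020"


def parse_date(text: str) -> date:
    """Turn a table date such as '11 Oct 2020' into a datetime.date."""
    return datetime.strptime(text.strip(), DATE_FORMAT).date()


def rows_on_or_before(rows, cutoff_text):
    """Keep the (date_text, stake_text) pairs dated on or before the cutoff."""
    cutoff = parse_date(cutoff_text)  # parse the cutoff once, not per row
    return [(d, s) for d, s in rows if parse_date(d) <= cutoff]


# Hand-copied sample of rows, newest first as in the table.
sample_rows = [
    ("12 Feb 2021", "$1,200"),
    ("26 Jan 2021", "$4,484"),
    ("14 Sep 2020", "$100"),
    ("11 Sep 2020", "$616"),
]

selected = rows_on_or_before(sample_rows, "11 Oct 2020")
for date_text, stake_text in selected:
    print(f"{date_text} - {stake_text}")
```

In the real scraper, the pairs would come from each row's td cells with class "date" and "stake", as in the loops above. Parsing the cutoff once outside the loop avoids repeating the same work for every row.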
