How to Skip URLs Without Elements When Web Scraping in Python

Problem Description

I'm trying to scrape 14 sites, but some of them don't have any data table. I want to skip those URLs and process the rest, but I can't work out how. How should I approach this?

Here is the code:

import re

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

#define date range (days of September)
dates_sep = np.arange(26,31,1)

#define hour range
h = np.arange(0,24,1)
h_list = list(h)
hours_L0 = [str(item).zfill(2) for item in h_list]

for d_sep in dates_sep: 
    for h_L0 in hours_L0:
        
        urls_1 = "https://www.hko.gov.hk/en/wxinfo/rainfall/rf_record.shtml?form=rfrecorde&Selday=" + str(d_sep) + "&Selmonth=09&Selhour=" + str(h_L0)
        html_content = requests.get(urls_1).text
        soup = BeautifulSoup(html_content,"lxml")

The problem starts here:

        rainfall = soup.find_all("table",
                                 title="Table of the rainfall recorded in various regions")
#But some URLs don't have the elements mentioned above.

        table1 = rainfall[0]
        body = table1.find_all("tr")

        head = body[0]
        body_rows = body[1:]

        headings = []
        for item in head.find_all("td", align="center"):
           item = (item.text).rstrip("\n")
           headings.append(item)

        all_rows = []
        for row_num in range(len(body_rows)):
            row = []
            for row_item in body_rows[row_num].find_all("td"):
                aa = re.sub("(\xa0)|(\n)|,","",row_item.text)
                row.append(aa)
            all_rows.append(row)

        df = pd.DataFrame(data=all_rows)
        df.head()

Example URLs:

https://www.hko.gov.hk/sc/wxinfo/rainfall/rf_record.shtml?form=rfrecorde&Selday=10&Selmonth=10&Selhour=00

https://www.hko.gov.hk/sc/wxinfo/rainfall/rf_record.shtml?form=rfrecorde&Selday=26&Selmonth=09&Selhour=00

Tags: python, web-scraping, skip

Solution
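`find_all` returns an empty list when no matching element exists, so you can test the result and `continue` to the next URL before indexing into it. A minimal sketch of this check, run against two hypothetical inline HTML pages (the snippets and variable names are invented for illustration; in the real loop the HTML comes from the HKO URLs above):

```python
from bs4 import BeautifulSoup

# Hypothetical page that contains the rainfall table
page_with_table = """<html><body>
<table title="Table of the rainfall recorded in various regions">
<tr><td align="center">Region</td><td align="center">Rainfall (mm)</td></tr>
<tr><td>Central</td><td>5</td></tr>
</table></body></html>"""

# Hypothetical page with no data table at all
page_without_table = "<html><body><p>No data</p></body></html>"

results = []
for html in (page_with_table, page_without_table):
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all(
        "table", title="Table of the rainfall recorded in various regions")
    if not tables:   # find_all returned an empty list: nothing to parse here
        continue     # skip this URL and move on to the next one
    results.append(tables[0])

# Only the page that actually contains the table is processed
```

An equivalent alternative is to wrap `rainfall[0]` in `try: ... except IndexError: continue`, which also skips pages where the table is missing; checking the list first just makes the intent more explicit.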
