How to get strings from an array of tags in bs4?

Problem description

After running:

soup.select('tr:nth-child(1)')

I have:

[<tr>
 <th bgcolor="#5ac05a" colspan="2">Date</th>
 <th bgcolor="#a3c35a">T<br/>(C)</th>
 <th bgcolor="#c0a35a">Td<br/>(C)</th>
 <th bgcolor="#a3c35a">Tmax<br/>(C)</th>
 <th bgcolor="#a3c35a">Tmin<br/>(C)</th>
...
 </tr>]

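How can I get a list of strings (Date, T, Td) without selecting each element manually, e.g. with soup.select('tr:nth-child(1) > th:nth-child(5)')[0].text? That approach is slow, and the number of th elements differs from page to page.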

Tags: web-scraping, beautifulsoup

Solution


To load the table into a pandas DataFrame, you can use the following example:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.ogimet.com/cgi-bin/gsynres?ind=28698&lang=en&decoded=yes&ndays=31&ano=2021&mes=1&day=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

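# Column names: the text of every <th> in the first row.
header = [
    th.get_text(strip=True) for th in soup.thead.select("tr")[0].select("th")
]

all_data = []
for row in soup.thead.select("tr")[1:]:
    # Text of every cell except the last three, which are handled below.
    tds = [td.get_text(strip=True) for td in row.select("td")[:-3]]
    # The "Date" header spans two columns (colspan="2"): merge date and time.
    tds.insert(0, tds.pop(0) + " " + tds.pop(0))

    # Last three cells: take the quoted text from the image's onmouseover attribute.
    for td in row.select("td")[-3:]:
        img = td.select_one("img[onmouseover]")
        if img:
            tds.append(re.search(r"'([^']+)'", img["onmouseover"]).group(1))
        else:
            tds.append("-")

    all_data.append(tds)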

df = pd.DataFrame(all_data, columns=header)
print(df)
df.to_csv("data.csv", index=False)

Prints:

                 Date   T(C)  Td(C) Tmax(C) Tmin(C)  ddd ffkmh Gustkmh   P0hPa P seahPa   PTnd Prec(mm) Nt Nh InsoD-1 Viskm Snow(cm)                                                 WW                                                 W1                                                 W2
0    01/01/2021 06:00  -30.6  -33.7   -----   -31.1  NNW   7.2    ----  1027.8   1045.5   +1.5     ----  0  -     ---  20.0     ----                 Diamond dust (with or without fog)                       Snow, or rain and snow mixed  Cloud covering more than 1/2 of the sky during...
1    01/01/2021 03:00  -30.7  -33.7   -----   -30.7  NNW   7.2    ----  1026.2   1044.0   +1.0   Tr/12h  8  8     3.7  10.0       23                 Diamond dust (with or without fog)                       Snow, or rain and snow mixed  Cloud covering more than 1/2 of the sky throug...
2    01/01/2021 00:00  -30.1  -33.1   -----   -----  NNW   7.2    ----  1025.3   1043.0   +0.6     ----  8  0     ---  10.0     ----                 Diamond dust (with or without fog)                       Snow, or rain and snow mixed  Cloud covering more than 1/2 of the sky during...
3    12/31/2020 21:00  -30.5  -33.5   -----   -----  NNW   3.6    ----  1024.7   1042.4   +0.6     ----  0  -     ---  10.0     ----                 Diamond dust (with or without fog)                       Snow, or rain and snow mixed  Cloud covering 1/2 or less of the sky througho...

...and so on

The script also saves data.csv (the original answer showed a screenshot of the file opened in LibreOffice).
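For the narrower question of getting only the header strings, a minimal sketch follows. It assumes the first tr of the document is the header row and uses a small inline HTML snippet copied from the question (wrapped in a table tag for illustration) instead of the live page. The variable names header and names are illustrative only.

```python
from bs4 import BeautifulSoup

# Inline sample based on the row shown in the question (assumption: the
# real page has the same structure, possibly with more <th> cells).
html = """
<table>
<tr>
 <th bgcolor="#5ac05a" colspan="2">Date</th>
 <th bgcolor="#a3c35a">T<br/>(C)</th>
 <th bgcolor="#c0a35a">Td<br/>(C)</th>
 <th bgcolor="#a3c35a">Tmax<br/>(C)</th>
 <th bgcolor="#a3c35a">Tmin<br/>(C)</th>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the first row once, then loop over however many <th> it contains.
first_row = soup.select_one("tr")

# Full header text, with the unit joined on (the <br/> contributes no text).
header = [th.get_text(strip=True) for th in first_row.select("th")]

# Only the first text fragment of each cell, i.e. the name without the unit.
names = [next(th.stripped_strings) for th in first_row.select("th")]

print(header)
print(names)
```

Selecting the row once and iterating over its th children avoids running a separate nth-child query per column, and it works regardless of how many columns a given page has.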
