How do I extract only the text inside an anchor tag with BeautifulSoup?

Question

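I'm currently developing my first web scraping application, and I'm using BeautifulSoup for it. Even though the website I'm scraping doesn't use class names for its HTML elements, everything has been working well so far.

With the help of Stack Overflow and the documentation I thought I was doing things correctly, but now I'm getting an error. What I'm trying to do is get the text inside the a tags in a table on the website (www.footballbettingtips.org).

I can get the complete a tag, for example: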

<a href="/tips/2021-07-06/849719.html" title="Betting tips - Keflavik W. - Thor Akureyri W.">Keflavik W. - Thor Akureyri W. </a>

But I only want the text: Keflavik W. - Thor Akureyri W.

Here is my code:

source = requests.get(URL, headers=headers).text
soup = BeautifulSoup(source, 'lxml')

# The info I want: matches, time, predictions & odds
table = soup.find("table",{"class":"results"})
# print(table.prettify())

# these are all the rows with info about the matches
rows = table.findChildren('tr')
numb_rows = len(rows)
# this is the number of rows with match info + today's competition names
# print(numb_rows)
all_games = []

for row in rows:
    a_tag = row.a.get_text()
    print(a_tag)

    for strong_tag in row.find_all('strong'):      
        prediction = strong_tag.text
        all_games.append(prediction)

This part gives me an error:

for row in rows:
    a_tag = row.a.get_text()
    print(a_tag)

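Error:

Traceback (most recent call last):
  File "c:/Users/Jente/Desktop/Webscraping/webscraping.py", line 28, in <module>
    a_tag = row.a.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

I don't know how to solve this because, according to the documentation, this should return just the text. I've tried many things, such as dropping the for loop, using the getText() method, and various other ways of avoiding the error.

I hope someone can see where and how I'm going wrong and can help me out!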

Tags: python, web-scraping, beautifulsoup

Answer
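
The traceback in the question says that `row.a` evaluated to None for at least one row, so calling get_text() on it fails. A likely cause, assumed here rather than confirmed against the live page, is that some rows (for example league header rows) contain no a tag. The sketch below uses a small made-up HTML snippet in place of the real site and simply skips such rows:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one header row without a link,
# followed by two match rows that each contain an <a> tag.
html = """
<table class="results">
  <tr><th colspan="2">Icelandic Urvalsdeild W.</th></tr>
  <tr><td>18:00</td><td><a href="/tips/1.html" title="Betting tips"> Keflavik W. - Thor Akureyri W. </a></td></tr>
  <tr><td>18:00</td><td><a href="/tips/2.html" title="Betting tips">Stjarnan W. - Tindastoll W.</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "results"})

matches = []
for row in table.find_all("tr"):
    link = row.find("a")  # returns None when the row has no <a> tag
    if link is None:
        continue  # skip such rows instead of raising AttributeError
    matches.append(link.get_text(strip=True))

print(matches)
```

On the real page the same `is None` check would go inside the `for row in rows:` loop from the question.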


To get the table into a pandas DataFrame, you can use the following example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}

url = "https://www.footballbettingtips.org/tips/2021-07-06.html"

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

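data = []
# Select only rows that contain no <th> cell, i.e. skip header rows
for tr in soup.select("tr:not(:has(th))"):
    tds = [td.get_text(strip=True) for td in tr.select("td")]
    # Keep only rows with exactly 7 cells (the match rows in this layout)
    if len(tds) != 7:
        continue
    # Prepend the text of the nearest preceding <h4>, used here as the league name
    data.append([tr.find_previous("h4").get_text(strip=True), *tds[:6]])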

df = pd.DataFrame(
    data, columns=["League", "Time", "Match", "1", "X", "2", "Tip"]
)
print(df)
df.to_csv("data.csv", index=False)

Prints:

                          League   Time                                              Match      1      X      2  Tip
0           AFC Champions League  10:00                        Cerezo Osaka - Guangzhou FC   1.06   9.60  34.50  4:0
1           AFC Champions League  14:00                               Port FC - Kitchee SC   1.96   3.43   3.60  0:2
2            Aus. NPL Queensland  09:30     Sunshine Coast Wanderers - Brisbane Roar Youth   7.20   5.95   1.28  1:2
3            Aus. NPL Queensland  09:30                  Peninsula Power - Redlands United   1.05  10.50  23.50  3:0
4           Brazilian Paranaense  18:20                          Operario PR - Londrina PR   1.99   2.76   4.63  1:2
5              Brazilian Serie A  22:30                       Santos - Atletico Paranaense   2.29   3.05   3.38  1:1
6              Brazilian Serie B  22:00                                 Ponte Preta - Avai   2.63   2.81   2.83  0:2
7              Brazilian Serie B  22:00                             Cruzeiro - Coritiba PR   2.30   2.93   3.18  2:0
8       Cambodian Premier League  11:00                     Svay Rieng FC - Cambodia Tiger   1.44   4.72   5.50  2:0
9               Champions League  16:00                 HJK Helsinki - Buducnost Podgorica   1.39   4.98   7.65  3:1
10              Champions League  16:00                Flora Tallinn - Hibernians FC Paola   1.64   3.90   5.50  0:0
11              Champions League  16:00                     Ferencvárosi TC - KF Prishtina   1.15   8.00  18.00  2:1
12              Champions League  17:00                        Zalgiris Vilnius - Linfield   1.90   3.65   4.00  2:0
13              Champions League  17:00                        CFR Cluj - Borac Banja Luka   1.18   6.75  17.50  1:1
14              Champions League  17:30                 CS Fola Esch - Lincoln Red Imps FC   1.39   4.92   8.00  4:0
15              Champions League  18:00                          Skendija Tetovo - NŠ Mura   2.79   3.25   2.60  2:1
16               Club Friendlies  09:00                            Kortrijk - Valenciennes   1.87   3.78   3.45  0:1
17             Euro Championship  19:00                                      Italy - Spain   2.45   3.17   3.07  0:0
18      Europa Conference League  15:45                          Mosta FC - Spartak Trnava   5.75   4.38   1.48  1:2
19      Europa Conference League  15:50                    Mons Calpe SC - FC Santa Coloma   2.37   3.27   2.85  2:1
20      Europa Conference League  18:30                             FK Podgorica - FK Laci   2.02   3.42   3.42  2:0
21      Finnish Kakkonen Group B  15:30             Tampereen Ilves II - FC Honka Akatemia   2.16   3.98   2.66  1:1
22                      Gold Cup  20:30                Trinidad and Tobago - French Guiana   1.24   5.50  11.00  4:1
23                      Gold Cup  23:00                                    Haiti - Bermuda   1.35   4.90   7.25  2:0
24      Icelandic Úrvalsdeild W.  18:00                     Keflavik W. - Thor Akureyri W.   2.49   3.57   2.46  1:2
25      Icelandic Úrvalsdeild W.  18:00        Fylkir Reykjavik W. - IBV Vestmannaeyjar W.   1.96   3.55   3.35  2:0
26      Icelandic Úrvalsdeild W.  18:00                        Stjarnan W. - Tindastoll W.   1.52   4.33   4.98  3:0
27      Icelandic Úrvalsdeild W.  20:00                    Selfoss W. - Valur Reykjavik W.   4.83   4.45   1.52  2:3
28      Icelandic Úrvalsdeild W.  20:00              Throttur Reykjavik W. - Breidablik W.   8.75   7.15   1.20  1:3
29            Iranian Pro League  14:30             Machine Sazi Tabriz - Mes Rafsanjan FC   5.65   3.05   1.74  0:1

...

The script also saves data.csv (screenshot from LibreOffice):

[image: data.csv opened in LibreOffice]
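
To see how the row filter in the pandas example above behaves without hitting the network, here is a minimal sketch on a made-up HTML fragment. The fragment's layout (an h4 heading before each table, a th header row in each table) is an assumption inferred from the selectors used in the example, not a copy of the real markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each league is an <h4> followed by a table whose
# first row is a <th> header row.
html = """
<h4>Champions League</h4>
<table>
  <tr><th>Time</th><th>Match</th><th>Tip</th></tr>
  <tr><td>16:00</td><td><a href="#">HJK Helsinki - Buducnost Podgorica</a></td><td>3:1</td></tr>
</table>
<h4>Gold Cup</h4>
<table>
  <tr><th>Time</th><th>Match</th><th>Tip</th></tr>
  <tr><td>23:00</td><td><a href="#">Haiti - Bermuda</a></td><td>2:0</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# "tr:not(:has(th))" matches every <tr> that has no <th> descendant,
# so the two header rows are excluded.
for tr in soup.select("tr:not(:has(th))"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    # find_previous walks backwards through the document to the closest <h4>
    league = tr.find_previous("h4").get_text(strip=True)
    rows.append([league, *cells])

for row in rows:
    print(row)
```

Because td.get_text() returns the text of everything inside the cell, the match name comes out without the surrounding a tag, which is the text the question asks for.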

