BeautifulSoup: trying to get text from a wrapped div, but it returns empty or None

Problem description

Here is a picture of the HTML I am trying to parse (sorry): [image: HTML of a sports page with statistics]

I am using this line:

home_stats = soup.select_one('div', class_='statText:nth-child(1)').text

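I expected this to return the first child with the statText class, so the result would be 53%.

But that is not what happens. I get "Loading..." and none of the data I am trying to use and display.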
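A note on the selector: as far as I can tell, select_one() expects one CSS selector string, while class_ is a keyword that belongs to find() and find_all(). In the line above only 'div' would act as the selector, so the call would return the first div in the document, which may be where the "Loading..." text comes from. The sketch below shows the class and the :nth-child() pseudo-class written inside the selector string. The HTML snippet is a hypothetical stand-in for the page, not the real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about how one statistics row is laid out.
html = """
<div id="detail">Loading...</div>
<div class="statTextGroup">
  <div class="statText">53%</div>
  <div class="statText">Ball Possession</div>
  <div class="statText">47%</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A bare 'div' selector matches the first div in the document.
first_div_text = soup.select_one("div").text

# The class and the pseudo-class both go inside the selector string.
home_stats = soup.select_one(".statText:nth-child(1)").text
away_stats = soup.select_one(".statText:nth-child(3)").text

# The equivalent lookup with find(), where class_ is a valid keyword.
first_stat = soup.find("div", class_="statText").text

print(first_div_text, home_stats, away_stats, first_stat)
```

If the statistics are filled in by JavaScript, the correct selector alone will not be enough, because the values are not present in the HTML that was downloaded.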

My complete code so far:

soup = BeautifulSoup(source, 'lxml')

home_team = soup.find('div', class_='tname-home').a.text
away_team = soup.find('div', class_='tname-away').a.text
home_score = soup.select_one('.current-result .scoreboard:nth-child(1)').text
away_score = soup.select_one('.current-result .scoreboard:nth-child(2)').text
print("The home team is " + home_team, "and they scored " + home_score)
print()
print("The away team is " + away_team, "and they scored " + away_score)

home_stats = soup.select_one('div', class_='statText:nth-child(1)').text
print(home_stats)

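This currently does print the home team, the away team, and the goals each of them scored. But I can't seem to get any of the statistics from the site.

The output I am aiming for is: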

[home_team] had 53% ball possession and [away_team] had 47% ball possession

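I would also like to strip the "%" sign while parsing (though that is not essential). I plan to use these numbers for more statistics later, so the % sign would get in the way.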
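Stripping the sign could look like the minimal sketch below. The helper name percent_to_int is made up for illustration, and the team names and percentages are placeholder values standing in for the scraped strings.

```python
def percent_to_int(text):
    """Turn a string such as '53%' into the integer 53."""
    return int(text.strip().rstrip("%"))


# Placeholder values standing in for the scraped strings.
home_team, away_team = "Burnley", "Southampton"
home_possession = percent_to_int("53%")
away_possession = percent_to_int(" 47% ")

# The numbers stay integers; the '%' is only added back for display.
summary = (
    f"{home_team} had {home_possession}% ball possession "
    f"and {away_team} had {away_possession}% ball possession"
)
print(summary)
```

Keeping the values as integers means they can be compared or summed later without further parsing.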

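Apologies for the newbie question - this is the very beginning of my Python journey. I have searched the internet and StackOverflow but could not find this situation - it is also possible that I don't know what I should be looking for.

Thanks for your help! May your answer be the one I mark as "correct" ;)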

Tags: python, beautifulsoup

Solution


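Assuming the page opened in the code below is the site you are trying to scrape, here is the complete code to scrape all of the statistics. It loads the page in a real browser through Selenium before parsing, presumably because the statistics are filled in by JavaScript, which would explain the "Loading..." placeholder in a plain download of the HTML.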

from bs4 import BeautifulSoup
from selenium import webdriver 
import pandas as pd 

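driver = webdriver.Chrome() #Assumes Selenium can find chromedriver (on PATH, or located automatically by Selenium 4.6+); recent Selenium versions no longer accept the path as a positional argument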

driver.get('https://www.scoreboard.com/en/match/SO3Fg7NR/#match-statistics;0')

pg = driver.page_source #Gets the source code of the page
driver.quit() #Ends the browser session and the driver process, not just the current window

soup = BeautifulSoup(pg,'html.parser') #Creates a soup object

statrows = soup.find_all('div',class_ = "statTextGroup") #Finds all the div tags with class statTextGroup -- these div tags contain the stats

#Scrapes the team names
teams = soup.find_all('a',class_ = "participant-imglink")

teamslst = []
for x in teams:
    team = x.text.strip()
    if team != "":
        teamslst.append(team)

stats_dict = {}

count = 0
for x in statrows:
   txt = x.text 
   final_txt = ""
   stat = ""
   alphabet = False
   percentage = False
   
   #Extracts the numbers from the text
   for c in txt:
       if c in '0123456789':
           final_txt+=c
       else:
           if alphabet == False:
               final_txt+= "-"
               alphabet = True
           if c != "%":
               stat += c
           else:
               percentage = True 
   values = final_txt.split('-')

   #Appends the values to the dictionary
   for x in values:
       if stat in stats_dict.keys():
           if percentage == True:
               stats_dict[stat].append(x + "%")
           else:
               stats_dict[stat].append(int(x))
               
       else:
           if percentage == True:
               stats_dict[stat] = [x + "%"]
           else:
               stats_dict[stat] = [int(x)]
               
   count += 1 
   if count == 15:
       break

index = [teamslst[0],teamslst[1]]

#Creates a pandas DataFrame out of the dictionary
df = pd.DataFrame(stats_dict,index = index).T 
print(df)

Output:

                  Burnley Southampton
Ball Possession       53%         47%
Goal Attempts          10           5
Shots on Goal           2           1
Shots off Goal          4           2
Blocked Shots           4           2
Free Kicks             11          10
Corner Kicks            8           2
Offsides                2           1
Goalkeeper Saves        0           2
Fouls                   8          10
Yellow Cards            1           0
Total Passes          522         480
Tackles                15          12
Attacks               142         105
Dangerous Attacks      44          29
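The character-by-character loop in the code above could also be expressed with a regular expression. The sketch below assumes, based on the output shown, that the text of one statTextGroup row looks like "53%Ball Possession47%" or "10Goal Attempts5". The names ROW_PATTERN and parse_stat_row are made up for illustration.

```python
import re

# Home value, stat name, away value; the '%' after each number is optional.
ROW_PATTERN = re.compile(r"^\s*(\d+)(%?)\s*(.+?)\s*(\d+)(%?)\s*$")


def parse_stat_row(text):
    """Split one row of text into (name, home, away, is_percentage)."""
    match = ROW_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unexpected row format: {text!r}")
    home, home_pct, name, away, away_pct = match.groups()
    # Values are returned as integers; the flag records whether a '%' was present.
    return name, int(home), int(away), bool(home_pct or away_pct)


possession = parse_stat_row("53%Ball Possession47%")
attempts = parse_stat_row("10Goal Attempts5")
passes = parse_stat_row("522Total Passes480")
print(possession, attempts, passes)
```

Returning the percentages as integers plus a flag, rather than strings such as "53%", keeps the values usable for later calculations.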

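Hope this helps!

PS: I actually wrote this code for another question, but I never posted it because an answer had already been posted. I had no idea it would come in handy now! Anyway, I hope my answer gives you what you need.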

