python - BeautifulSoup 4 - Scraping an element (h2) outside of a div
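Question
I am trying to scrape some football fixture data from the following site:
https://liveonsat.com/uk-england-all-football.php
Looking at the site's source, I found that most of the information (team names, start time and channels) is contained in an outer div (div class="blockfix"). I can scrape this data successfully with the code below: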
import requests
import time
import csv
import sys
from bs4 import BeautifulSoup
import tkinter as tk
from tkinter import messagebox
from tkinter import *
from PIL import ImageTk, Image

def makesoup(url):
    page=requests.get(url)
    return BeautifulSoup(page.text,"lxml")

def matchscrape(g_data):
    for item in g_data:
        try:
            match = item.find_all("div", {"class": "fix"})[0].text
            print(match)
        except:
            pass
        try:
            starttime = item.find_all("div", {"class": "fLeft_time_live"})[0].text
            print(starttime)
        except:
            pass
        try:
            channel = item.find_all("td", {"class": "chan_col"})
            for i in channel:
                print(i.get_text().strip())
        except:
            pass

def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("div", {"class": "blockfix"}))

root = tk.Tk()
root.resizable(False, False)
root.geometry("600x600")
root.wm_title("liveonsat scraper")
Label = tk.Label(root, text = 'liveonsat scraper', font = ('Comic Sans MS',18))
button = tk.Button(root, text="Scrape Matches", command=start)
button3 = tk.Button(root, text = "Quit Program", command=quit)
Label.pack()
button.pack()
button3.pack()
status_label = tk.Label(text="")
status_label.pack()
root.mainloop()
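For example, I get the following output:
The problem I am having is that one element (the match date) sits outside the div (div class="blockfix"). I am not sure how to retrieve this data. I tried changing the following code:
def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("div", {"class": "blockfix"}))
to
def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("td", {"height": "50"}))
because that element contains the h2 tag holding the match date (h2 class="time_head"), but when I try this I get completely different output, which is incorrect (see the code below):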
def matchscrape(g_data):
    for item in g_data:
        try:
            match = item.find_all("div", {"class": "fix"})[0].text
            print(match)
        except:
            pass
        try:
            matchdate = item.find_all("h2", {"class": "time_head"})[0].text
            print(matchdate)
        except:
            pass
        try:
            starttime = item.find_all("div", {"class": "fLeft_time_live"})[0].text
            print(starttime)
        except:
            pass
        try:
            channel = item.find_all("td", {"class": "chan_col"})
            for i in channel:
                print(i.get_text().strip())
        except:
            pass

def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("td", {"height": "50"}))
Incorrect output: (only one match name, time and date are printed, together with about 100 channel names)
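One plausible reading of this result, sketched on invented markup rather than the real page: if a single td with height="50" wraps every fixture, the loop runs once, each `find_all(...)[0]` keeps only the first fixture, and the channel loop still walks every `chan_col` cell in the page. The HTML, team names and channel names below are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup: one outer <td height="50"> wrapping two fixtures.
# This is NOT the real page, only a minimal stand-in for the suspected layout.
html = """
<table><tr><td height="50">
  <h2 class="time_head">Friday, 10th July</h2>
  <div class="blockfix"><div class="fix">Huddersfield v Luton Town</div>
    <table><tr><td class="chan_col">Channel A</td></tr></table></div>
  <div class="blockfix"><div class="fix">Team C v Team D</div>
    <table><tr><td class="chan_col">Channel B</td></tr></table></div>
</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Only the outer cell carries height="50", so there is a single container.
containers = soup.find_all("td", {"height": "50"})

# Indexing [0] keeps just the first fixture inside each container...
first_fixtures = [
    c.find_all("div", {"class": "fix"})[0].get_text(strip=True)
    for c in containers
]

# ...while the channel search still collects every channel cell.
all_channels = [
    cell.get_text(strip=True)
    for c in containers
    for cell in c.find_all("td", {"class": "chan_col"})
]

print(len(containers), first_fixtures, all_channels)
```

If that is what happens on the live page, iterating over the per-fixture blockfix divs and looking outward for the date keeps the fields aligned.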
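To clarify further: the end result I am trying to achieve is to have each match, the time of each match, the channels showing each match and the date of the match scraped and output (printed).
Thanks to anyone who can offer guidance or help with this problem. If further clarification or anything else is needed, I will be more than happy to provide it.
Update: below is the HTML for one match, as requested in the comments, as an example. The element I am having trouble displaying is h2 class="time_head".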
<div style="clear:right"> <div class=floatAndClearL><h2 class = sport_head >Football</h2></div> <!-- sport_head -->
<div class=floatAndClearL><h2 class = time_head>Friday, 10th July</h2></div> <!-- time_head --> <div><span class = comp_head>English Championship - Week 43</span></div>
<div class = blockfix > <!-- block 1-->
<div class=fix> <!-- around fixture and notes 2-->
<div class=fix_text> <!-- around fixture text 3-->
<div class = imgCenter><span><img src="../img/team/england.gif"></span></div>
<div class = fLeft style="width:270px;text-align:center;background-color:#ffd379;color:#800000;font-size:10pt;font-family:Tahoma, Geneva, sans-serif">Huddersfield v Luton Town</div>
<div class = imgCenter><img src="../img/team/england.gif"></div>
</div> <!-- around fixture text 3 ENDS-->
<div class=notes></div>
</div> <!-- around fixture and notes 2 ENDS-->
<div class = fLeft> <!-- around all of channel types 2--> <div> <!-- around channel type group 3-->
<div class=fLeft_icon_live_l> <!-- around icon 4-->
<img src="../img/icon/live3.png"/>
</div>
<div class=fLeft_time_live> <!-- around icon 4-->
ST: 18:00
</div> <!-- around icon 4 ENDS--> <div class = fLeft_live> <!-- around all tables of a channel type 4--> <table border="0" cellspacing="0" cellpadding="0"><tr><td class=chan_col> <a href="https://connect.bein.net/" target="_blank" class = chan_live_iptvcable> beIN Connect MENA </a></td><td width = 0></td>
</tr></table> <table border="0" cellspacing="0" cellpadding="0"><tr><td class=chan_col> <a href="https://tr.beinsports.com/kullanici/giris?ReturnUrl=" target="_blank" class = chan_live_iptvcable> beIN Connect TURKEY </a></td><td width = 0></td>
</tr></table>
Answer
Here is how you can achieve it:
import re

import requests
import unidecode
from bs4 import BeautifulSoup

# Get the page source, sending browser-like headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
response = requests.get('https://liveonsat.com/uk-england-all-football.php', headers=headers)
# Name the parser explicitly (lxml, as in the question) so bs4 does not have to guess
soup = BeautifulSoup(response.content, 'lxml')

# Process results: each fixture sits in its own div.blockfix
for match in soup.find_all('div', class_='blockfix'):
    # Competitors: using a regex, look for a div whose text holds two names separated by ' v '
    competitors = match.find('div', string=re.compile(r'(.*) v (.*)')).text
    # Match date: the nearest preceding h2 tag with time_head as its class attribute
    match_date = match.find_previous('h2', class_='time_head').text
    # Match time
    fLeft_time_live = match.find('div', class_='fLeft_time_live').text.strip()
    # Container holding the channel links
    channels = match.find('div', class_='fLeft_live')
    print("Competitors ", competitors)
    print("Match date", match_date)
    print("Match time", fLeft_time_live)
    # Grab TV transmission times
    for channel in channels.find_all('a'):
        # If the show time is available, it is held in the link's "onmouseover" attribute;
        # otherwise we just display the channel name
        tooltip = channel.get('onmouseover')
        if tooltip is None:
            print(" ", channel.text.strip().replace('📺', ''), "- no time displayed - ")
            continue
        show_date = BeautifulSoup(tooltip, 'html.parser').text
        show_date = unidecode.unidecode(show_date)
        # Some regex logic to extract the show date
        pattern = r"CAPTION, '(.*)'\)"
        show_date = re.search(pattern, show_date).group(1)
        print(" ", show_date)
    print()
Output
Competitors Huddersfield v Luton Town
Match date Friday, 10th July
Match time ST: 19:00
beIN Connect MENA - no time displayed -
beIN Connect TURKEY - no time displayed -
beIN Sports MENA 12 HD - 2020-07-10 17:00:00
beIN Sports MENA 2 HD - 2020-07-10 17:00:00
beIN Sports Turkey 4 HD - 2020-07-10 17:00:00
Eleven Sports 2 Portugal HD - 2020-07-10 17:00:00
....
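The channel lines that carry a timestamp come from the `CAPTION, '...'` regex in the code above. As a rough, standalone sketch of that step, the snippet below applies the same pattern to a made-up tooltip string. The `showTip(...)` wrapper and the helper name `extract_caption` are hypothetical; the real attribute value on the site may be shaped differently.

```python
import re

# Same pattern as in the answer: capture the quoted text that follows "CAPTION, ".
CAPTION_PATTERN = re.compile(r"CAPTION, '(.*)'\)")

def extract_caption(text):
    """Return the captured caption, or None when the pattern is absent.

    Hypothetical helper, not part of the original answer.
    """
    found = CAPTION_PATTERN.search(text)
    return found.group(1) if found else None

# Made-up tooltip text, assumed only for illustration.
tooltip = "return showTip('details', CAPTION, 'beIN Sports MENA 12 HD - 2020-07-10 17:00:00')"

print(extract_caption(tooltip))
print(extract_caption("no caption in this string"))
```

Returning None instead of calling `.group(1)` directly avoids an AttributeError when a tooltip does not match the expected shape.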
Edit: corrected the match date extraction...
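The date lookup relies on `find_previous`, which walks backwards through the document from each blockfix div and returns the nearest earlier h2 with class time_head. A minimal sketch on inline HTML, so it runs without hitting the site; the markup is a simplified assumption modelled on the sample in the question, and the second and third fixtures are invented.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: date headings are siblings of the
# fixture blocks, not ancestors. Only the first fixture comes from the question.
html = """
<div><h2 class="time_head">Friday, 10th July</h2></div>
<div class="blockfix"><div class="fix">Huddersfield v Luton Town</div></div>
<div class="blockfix"><div class="fix">Team C v Team D</div></div>
<div><h2 class="time_head">Saturday, 11th July</h2></div>
<div class="blockfix"><div class="fix">Team E v Team F</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for block in soup.find_all("div", class_="blockfix"):
    # Nearest preceding date heading, even though it lies outside the block.
    heading = block.find_previous("h2", class_="time_head")
    fixture = block.find("div", class_="fix")
    rows.append((heading.get_text(strip=True), fixture.get_text(strip=True)))

for match_date, fixture_name in rows:
    print(match_date, "|", fixture_name)
```

Collecting tuples rather than printing inside the loop makes it straightforward to write the rows to CSV or show them in the tkinter window later.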