BeautifulSoup 4 - scraping an element (h2) outside a div

Problem description

I am trying to scrape some football match data from the following site:

https://liveonsat.com/uk-england-all-football.php

Looking at the site's source code, I found that most of the information (team names, start times and channels) is contained in an outer div (div class="blockfix"). I can successfully scrape that data with the code below:

import requests
import time
import csv
import sys
from bs4 import BeautifulSoup
import tkinter as tk
from tkinter import messagebox
from tkinter import *
from PIL import ImageTk, Image


def makesoup(url):
    page=requests.get(url)
    return BeautifulSoup(page.text,"lxml")
   
    
    
def matchscrape(g_data):
    
    for item in g_data:
        try:
            match = item.find_all("div", {"class": "fix"})[0].text
            print(match)
        except:
            pass
        try:
            starttime = item.find_all("div", {"class": "fLeft_time_live"})[0].text
            print(starttime)
        except:
            pass
        try:
            channel = item.find_all("td", {"class": "chan_col"})
            for i in channel:
                    print(i.get_text().strip())
        except:
            pass
            
            
            
def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("div", {"class": "blockfix"}))

    
        
        
root = tk.Tk()
root.resizable(False, False)
root.geometry("600x600")
root.wm_title("liveonsat scraper")
Label = tk.Label(root, text = 'liveonsat scraper', font = ('Comic Sans MS',18))
button = tk.Button(root, text="Scrape Matches", command=start)
button3 = tk.Button(root,  text = "Quit Program", command=quit)
Label.pack()
button.pack()
button3.pack()
status_label = tk.Label(text="")
status_label.pack()
root.mainloop()

This gives me, for example, the following output:

[screenshot of output]

The problem I have is that one element (the match date) sits outside that div (div class="blockfix"). I am not sure how to retrieve this data. I tried changing the following code:

def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("div", {"class": "blockfix"}))

to:

def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("td", {"height": "50"})) 

since this element contains the h2 tag with the match date (h2 class="time_head"). However, when I try this, I get completely different output, which is incorrect (see the code below):

def matchscrape(g_data):
    
    for item in g_data:
        try:
            match = item.find_all("div", {"class": "fix"})[0].text
            print(match)
        except:
            pass
        try:
            matchdate = item.find_all("h2", {"class": "time_head"})[0].text
            print(matchdate)
        except:
            pass
        try:
            starttime = item.find_all("div", {"class": "fLeft_time_live"})[0].text
            print(starttime)
        except:
            pass
        try:
            channel = item.find_all("td", {"class": "chan_col"})
            for i in channel:
                    print(i.get_text().strip())
        except:
            pass
            
            
            
def start():
    soup=makesoup(url = "https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data = soup.findAll("td", {"height": "50"}))

Incorrect output (only one match name, time and date are printed, together with around 100 channel names):

[screenshot of incorrect output]

To clarify further: the end result I am trying to achieve is that every match, the time of each match, the channels showing each match, and the date each match is shown are scraped and printed.

Thanks to anyone who can offer guidance or help with this problem. I will gladly provide further clarification or anything else that is needed.

Update: below is the HTML code of one match, as requested in the comments. The element I am having trouble retrieving is h2 class="time_head":

<div style="clear:right">    <div class=floatAndClearL><h2 class = sport_head >Football</h2></div>  <!-- sport_head -->
    <div class=floatAndClearL><h2 class = time_head>Friday, 10th  July</h2></div> <!-- time_head -->         <div><span class = comp_head>English Championship - Week 43</span></div>
       <div class = blockfix >                <!-- block 1-->
    <div class=fix>                 <!-- around fixture and notes 2-->
          <div class=fix_text>               <!-- around fixture text 3-->
              <div class = imgCenter><span><img src="../img/team/england.gif"></span></div>
              <div class = fLeft style="width:270px;text-align:center;background-color:#ffd379;color:#800000;font-size:10pt;font-family:Tahoma, Geneva, sans-serif">Huddersfield v Luton Town</div>
              <div class = imgCenter><img src="../img/team/england.gif"></div>
    </div>                  <!-- around fixture text 3 ENDS-->
        <div class=notes></div>
     </div>                  <!-- around fixture and notes 2 ENDS-->
            <div class = fLeft>                <!-- around all of channel types 2-->     <div>             <!-- around channel type group 3-->
       <div class=fLeft_icon_live_l>       <!-- around icon 4-->
         <img src="../img/icon/live3.png"/>
       </div>
       <div class=fLeft_time_live>       <!-- around icon 4-->
         ST: 18:00
       </div>           <!-- around icon 4 ENDS-->        <div class = fLeft_live>       <!-- around all tables of a channel type 4-->       <table border="0" cellspacing="0" cellpadding="0"><tr><td class=chan_col>  <a href="https://connect.bein.net/" target="_blank"  class = chan_live_iptvcable>              beIN Connect MENA </a></td><td width = 0></td>
                    </tr></table>       <table border="0" cellspacing="0" cellpadding="0"><tr><td class=chan_col>  <a href="https://tr.beinsports.com/kullanici/giris?ReturnUrl=" target="_blank"  class = chan_live_iptvcable>              beIN Connect TURKEY </a></td><td width = 0></td>
                    </tr></table>

Tags: python, web-scraping, beautifulsoup

Solution


Here is how you can achieve it:

import requests
import re
import unidecode
from bs4 import BeautifulSoup

# Get page source
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

response = requests.get('https://liveonsat.com/uk-england-all-football.php', headers=headers)
soup = BeautifulSoup(response.content, 'lxml')

# process results

for match in soup.find_all('div',class_='blockfix'):
    # Competitors list. Using a regex, we look for a div containing two competitor names separated by ' v '
    competitors = match.find('div', text = re.compile(r'(.*) v (.*)')).text
    # Find the match date by searching for the previous h2 tag with time_head as its class
    match_date  = match.find_previous('h2',class_='time_head').text
    # Match kick-off time
    fLeft_time_live = match.find('div',class_='fLeft_time_live').text.strip()
    channels = match.find('div',class_='fLeft_live')
    print("Competitors ", competitors)
    print("Match date", match_date)
    print("Match time", fLeft_time_live)
    
    #Grab tv transmission times
    for channel in channels.find_all('a'):
        # if the show time is available, it will be contained in a "mouseover" tag
        # we try to find this tag, otherwise we just display the channel name
        try:
            show_date = BeautifulSoup(channel.get('onmouseover'), 'lxml').text
        except:
            print("  " ,channel.text.strip().replace('📺',''), "- no time displayed - ",)
            continue
        show_date = unidecode.unidecode(show_date)
        # Some regex logic to extract the show date
        pattern = r"CAPTION, '(.*)'\)"
        show_date = re.search(pattern, show_date).group(1)
        print("  ", show_date)
        
    print()

Output

Competitors  Huddersfield v Luton Town
Match date Friday, 10th  July
Match time ST: 19:00
   beIN Connect MENA  - no time displayed - 
   beIN Connect TURKEY  - no time displayed - 
   beIN Sports MENA 12 HD  - 2020-07-10 17:00:00
   beIN Sports MENA 2 HD  - 2020-07-10 17:00:00
   beIN Sports Turkey 4 HD  - 2020-07-10 17:00:00
   Eleven Sports 2 Portugal HD  - 2020-07-10 17:00:00
   ....
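
The CAPTION regex used in the answer can be tried in isolation. The overlib-style string below is a hypothetical example of what the decoded onmouseover attribute might look like; the exact markup on the live page may differ:

```python
import re

# Hypothetical onmouseover payload shaped like the ones the answer's regex
# targets; the real attribute value on the live page may differ.
raw = "return overlib('beIN Sports MENA 12 HD', CAPTION, '2020-07-10 17:00:00')"

# Capture whatever sits between "CAPTION, '" and the closing "')"
pattern = r"CAPTION, '(.*)'\)"
show_date = re.search(pattern, raw).group(1)
print(show_date)  # 2020-07-10 17:00:00
```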

编辑:更正了匹配日期提取...
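
To see why find_previous() solves the original question, here is a minimal sketch run against a stripped-down version of the HTML sample posted in the question (html.parser is used so the snippet has no lxml dependency):

```python
from bs4 import BeautifulSoup

# Stripped-down version of the HTML from the question: the h2 with the
# date sits before (outside) the blockfix div.
html = """
<div class="floatAndClearL"><h2 class="time_head">Friday, 10th July</h2></div>
<div class="blockfix">
  <div class="fix_text">
    <div class="fLeft">Huddersfield v Luton Town</div>
  </div>
  <div class="fLeft_time_live">ST: 18:00</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
match = soup.find("div", class_="blockfix")
# find_previous() walks backwards through the parsed document, so it reaches
# the h2 even though it is not a descendant of the blockfix div
print(match.find_previous("h2", class_="time_head").text)  # Friday, 10th July
```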

