Iteration fails when using BeautifulSoup

Problem description

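I am using BeautifulSoup to try to extract data from a web page. For some reason, it fails to iterate over the items found in seasons greater than 1. There seems to be no reason for this behavior, as the nodes look exactly the same to me.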

import requests
from bs4 import BeautifulSoup


def scrape_show(show):
    source = requests.get(show.url).text
    soup = BeautifulSoup(source, 'lxml')

    # All seasons and episodes
    area = soup.find('div', class_='play_video-area-aside play_video-area-aside--related-videos play_video-area-aside--related-videos--titlepage')
    # Iterate over direct child tags only; iterating the tag itself also yields text nodes
    for article in area.find_all(True, recursive=False):
        # Default to '' so children without an id do not raise a TypeError
        if "season" in article.get('id', ''):
            season = article.h2.a.find('span', class_='play_accordion__section-title-inner').text
            print(season + " -- " + article.get('id'))

            # All content for the given season
            ul = article.find('ul')
            if ul is None:
                print("null!")  # This should not happen

Sample output:

Season 1 -- section-season1-xxxx
Season 2 -- section-season2-xxxx
null!

https://www.svtplay.se/andra-aket (the URL used in the example)


Tags: python-3.x, web-scraping

Solution


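Not every season has its data in the HTML; only season 1 does. The information for all seasons is, however, embedded in the page as JSON. You can parse that data with the re and json modules: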

import json
import re
from pprint import pprint

import requests

url = 'https://www.svtplay.se/andra-aket?tab=season-1-18927182'

# The page assigns its data to root['__svtplay_apollo'] as a JSON object literal
html = requests.get(url).text
data = json.loads(re.findall(r"root\['__svtplay_apollo'\] = (\{.*?\});", html)[0])

# pprint(data)  # <-- uncomment this to see all the data

# Print every episode record and the URL record that belongs to it
for k in data:
    if k.startswith('Episode:') or (k.startswith('$Episode:') and k.endswith('urls')):
        print(k)
        pprint(data[k])
        print('-' * 80)

This prints the data for episodes 1 and 2 together with their URLs:

Episode:1383301-001
{'__typename': 'Episode',
 'accessibilities': {'json': ['AudioDescribed', 'SignInterpreted'],
                     'type': 'json'},
 'duration': 1700,
 'id': '1383301-001',
 'image': {'generated': False,
           'id': 'Image:18926434',
           'type': 'id',
           'typename': 'Image'},
 'live': None,
 'longDescription': 'Madde och Petter flyttar tillsammans med sin 13-åriga '
                    'dotter Ida till Björkfjället, en liten skidort i svenska '
                    'fjällen. Madde är uppvuxen där men för '
                    'Stockholms-hipstern Petter är det ett chockartat '
                    'miljöombyte. Maddes mamma Ingegerd har gått i pension och '
                    'lämnat över ansvaret för familjens lilla hotell till '
                    'Madde. Hon och Petter ska nu driva "Gammelgården" med '
                    'Maddes bror Tommy, vilket visar sig vara en inte helt '
                    'lätt uppgift. I rollerna: Sanna Sundqvist, Jakob '
                    'Setterberg, William Spetz, Bert-Åke Varg, Mattias '
                    'Fransson och Lena T Hansson. Del 1 av 8.',
 'name': 'Avsnitt 1',
 'nameRaw': '',
 'positionInSeason': 'Säsong 1 — Avsnitt 1',
 'restrictions': {'generated': True,
                  'id': '$Episode:1383301-001.restrictions',
                  'type': 'id',
                  'typename': 'Restrictions'},
 'slug': 'avsnitt-1',
 'svtId': 'jBD1gw8',
 'urls': {'generated': True,
          'id': '$Episode:1383301-001.urls',
          'type': 'id',
          'typename': 'Urls'},
 'validFrom': '2019-07-25T02:00:00+02:00',
 'validFromFormatted': 'Tor 25 jul 02:00',
 'validTo': '2020-01-21T23:59:00+01:00',
 'variants': [{'generated': False,
               'id': 'Variant:1383301-001A',
               'type': 'id',
               'typename': 'Variant'},
              {'generated': False,
               'id': 'Variant:1383301-001S',
               'type': 'id',
               'typename': 'Variant'},
              {'generated': False,
               'id': 'Variant:1383301-001T',
               'type': 'id',
               'typename': 'Variant'}],
 'videoSvtId': '8PbQdAj'}
--------------------------------------------------------------------------------
$Episode:1383301-001.urls
{'__typename': 'Urls',
 'svtplay': '/video/19970142/andra-aket/andra-aket-sasong-1-avsnitt-1'}
--------------------------------------------------------------------------------

... and so on.
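
The output above suggests that each Episode record points to its URL record through the id stored under 'urls'. The following is a minimal sketch of how the two could be joined into one mapping, assuming the page still embeds its data in this shape. The names MARKER, extract_apollo_state, episode_urls and sample_html are hypothetical and not part of the original answer, and sample_html is a hand-made stand-in rather than real page source. The sketch also swaps the non-greedy regex capture for json.JSONDecoder().raw_decode, which parses exactly one JSON value from a given position and so does not depend on where the first "});" happens to appear.

```python
import json
import re

# Marker taken from the regex in the answer; the trailing \s* skips any
# whitespace so that parsing starts right at the opening brace.
MARKER = re.compile(r"root\['__svtplay_apollo'\]\s*=\s*")


def extract_apollo_state(html):
    """Return the JSON object assigned to root['__svtplay_apollo'], or None."""
    match = MARKER.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so no closing pattern is needed.
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


def episode_urls(data):
    """Map each episode id to its 'svtplay' path via the 'urls' reference."""
    result = {}
    for key, record in data.items():
        if not key.startswith('Episode:'):
            continue
        ref = record.get('urls') or {}
        target = data.get(ref.get('id'), {})
        if 'svtplay' in target:
            result[record['id']] = target['svtplay']
    return result


# Hand-made sample shaped like the printed output (an assumption, not page source).
sample_html = """<script>root['__svtplay_apollo'] = {"Episode:1383301-001": {"id": "1383301-001", "name": "Avsnitt 1", "urls": {"type": "id", "id": "$Episode:1383301-001.urls"}}, "$Episode:1383301-001.urls": {"svtplay": "/video/19970142/andra-aket/andra-aket-sasong-1-avsnitt-1"}};</script>"""

if __name__ == "__main__":
    state = extract_apollo_state(sample_html)
    for episode_id, path in episode_urls(state).items():
        print(episode_id, path)
```

To try this against the live page, pass requests.get(url).text to extract_apollo_state instead of sample_html; whether the site still uses this marker and structure is not something the sketch can guarantee.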
