Scraping links from a website with Python/Beautiful Soup for a Kodi plugin

Problem description

The website I'm trying to scrape media links from (for a Kodi plugin) doesn't use many classes or other markup, but each link follows a fairly distinctive layout.

I've built the basic Kodi plugin from another working add-on, but I'm stuck getting Python/BeautifulSoup to scrape the links. The other add-ons rely on headers with classes and the like, but the site I'm trying to scrape barely uses them.

I've tried various forums with no luck; most Kodi add-on forums are old and not very active. The guides I've seen seem to jump from step 1 to step 1000 far too quickly, and the examples they give aren't relevant. I've looked through about 30 different add-ons that I thought should help, but I couldn't work it out.

The media links, episode titles, descriptions and images I'm trying to scrape are listed at www.thisiscriminal.com/episodes

The complete plugin I have so far is in a GitHub repository

I can see they are all clearly listed in the page source (see code below).

I basically just need to be able to parse the site, find the following bits for each episode, populate them as links on a Kodi plugin page, and then list the next one below. Any help would be greatly appreciated. I've already spent about three days trying to do this, and I'm feeling both very glad and very annoyed about dropping out of my IT degree back in 2002.

Website code I need to pull from

(episode image)
<img width="300" height="300" ... src="https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png" ... />

(episode title)
<h3><a href="https://thisiscriminal.com/episode-115-cecilia-5-24-19/">Cecilia</a></h3>

(episode number)
<h4>Episode #115</h4>

(episode link)
<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3"

(episode description)
</header>When Cecilia....</article>

Code

import requests
import re
from bs4 import BeautifulSoup

def get_soup(url):
    """
    @param: url of site to be scraped
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    print "type: ", type(soup)
    return soup

get_soup("https://thisiscriminal.com/episodes")

def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []

    for content in soup.find_all('a'):

        try:
            link = content.find('<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/')
            link = link.get('href')
            print "\n\nLink: ", link

            title = content.find('<h4>Episode ')
            title = title.get_text()

            desc = content.find('div', {'class': 'summary'})
            desc = desc.get_text()


            thumbnail = content.find('img')
            thumbnail = thumbnail.get('src')
        except AttributeError:
            continue


        item = {
                'url': link,
                'title': title,
                'desc': desc,
                'thumbnail': thumbnail
        }

        # need to check that item is not null here
        subjects.append(item)

    return subjects

2019-06-09 00:05:35.719 T:1916360240 ERROR: Control 55 in window 10502 has been asked to focus, but it can't
2019-06-09 00:05:41.312 T:1165988576 ERROR: EXCEPTION Thrown (PythonToCppException) : -->Python callback/script returned the following error<--
 - NOTE: IGNORING THIS CAN LEAD TO MEMORY LEAKS!
Error Type: UnicodeDecodeError
Error Contents: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/osmc/.kodi/addons/plugin.audio.abcradionational/addon.py", line 44, in <module>
    desc = soup.get_text().replace('\xa0', ' ').replace('\n', ' ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-->End of Python script error report<--
2019-06-09 00:05:41.636 T:1130349280 ERROR:

Tags: python, web-scraping, plugins, beautifulsoup, kodi

Solution


The good news is that the page loads its content from a wp-json source, which you can hit with a simple XHR request. Other answers seem to cover well how to find it.

You can then parse whatever info you want out of that json. The text description comes back as html inside the json, so you can pass it to bs4 and parse it as required. Example below. You can explore the json object related to Cecilia here, or paste the following into a json viewer:

{'title': 'Cecilia', 'excerpt': {'short': 'When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another...', 'long': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your...", 'full': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your first purchase..."}, 'content': '<p data-pm-context="[]">When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don&#8217;t.”&lt;/p>\n<p data-pm-context="[]">Sponsors:</p>\n<p><strong>Article</strong> Visit <a href="http://article.com/criminal">article.com/criminal </a>to get $50 off your first purchase of $100 or more.</p>\n<p><a href="https://www.therealreal.com/"><strong>The Real Real</strong></a> Shop in-store, online, or download the app, and get 20% off select items with the promo code REAL.</p>\n<p><strong>Simplisafe</strong> Protect your home today and get free shipping at <a href="http://SimpliSafe.com/CRIMINAL">SimpliSafe.com/CRIMINAL</a></p>\n<p><strong>Squarespace</strong> Try <a href="http://Squarespace.com/criminal">Squarespace.com/criminal </a>for a free trial and when you’re ready to launch, use the offer code INVISIBLE to save 10% off your first purchase of a website or domain.</p>\n<p><strong>Sun Basket</strong> Go to <a href="http://sunbasket.com/criminal">sunbasket.com/criminal </a>to get up to $80 off today!</p>\n', 
'image': {'thumb': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-150x150.png', 'medium': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-300x300.png', 'large': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-1024x1024.png', 'full': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png'}, 'episodeNumber': '115', 'audioSource': 'https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3', 'musicCredits':"FALSE", 'id': 3129, 'slug': 'episode-115-cecilia-5-24-19', 'date': '2019-05-24 19:43:44', 'permalink': 'https://thisiscriminal.com/episode-115-cecilia-5-24-19/', 'next':"None", 'prev': {'slug': 'episode-114-philip-and-becky', 'title': 'Episode 114: Philip and Becky (5.10.2019)'}}
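For instance, the plain-text description can be pulled out of that `content` html with bs4. The snippet below hard-codes a shortened stand-in for the html above, just so it runs on its own:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the 'content' html field from the json above
content = ('<p data-pm-context="[]">When Cecilia Gentili was growing up in Argentina, '
           'she felt so different from everyone around her that she thought she might '
           'be from another planet.</p>\n<p data-pm-context="[]">Sponsors:</p>')

soup = BeautifulSoup(content, 'html.parser')
desc = soup.select_one('p').get_text()  # first <p> holds the episode description
print(desc)
```

The sponsor paragraphs come after the description, which is why grabbing only the first `<p>` is enough here.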

The request is a queryString url, so you can change the number of items to return, and in the response you will see the total number of pages listed, so you know how many requests it takes to return everything.

If you look here

posts=1000&page=1

you can see the two parameters you can change accordingly.
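A sketch of how those two parameters could be used to page through everything. The totals here are made up for illustration; in practice read the real page/post count out of the first response before looping:

```python
import math
from urllib.parse import urlencode

BASE = 'https://thisiscriminal.com/wp-json/criminal/v1/episodes'

def episode_page_urls(total_posts, posts_per_page):
    """Build one queryString url per page needed to cover total_posts."""
    pages = math.ceil(total_posts / posts_per_page)
    return ['{}?{}'.format(BASE, urlencode({'posts': posts_per_page, 'page': p}))
            for p in range(1, pages + 1)]

# hypothetical totals: 115 episodes fetched 50 at a time -> 3 requests
urls = episode_page_urls(total_posts=115, posts_per_page=50)
```

Each of those urls can then be fed to `requests.get(...).json()` exactly like the single request below.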

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000&page=1').json()

for post in r['posts']:
    title = post['title']
    soup = bs(post['content'], 'html.parser')  # description comes back as html
    desc = soup.select_one('p').text  # or soup.get_text().replace('\xa0', ' ').replace('\n', ' ')
    img = post['image']['full']
    episode_link = post['audioSource']  # sure this is what you wanted?
    episode_number = post['episodeNumber']
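To feed these fields into the add-on, the loop above could fill the same `item` dicts that the question's `get_playable_podcast` builds. A sketch, where the sample post is a trimmed stand-in for one entry of the real json response:

```python
from bs4 import BeautifulSoup as bs

def posts_to_items(posts):
    """Turn wp-json posts into the item dicts the question's code builds."""
    items = []
    for post in posts:
        soup = bs(post['content'], 'html.parser')
        first_p = soup.select_one('p')
        items.append({
            'url': post['audioSource'],
            'title': 'Episode #{}: {}'.format(post['episodeNumber'], post['title']),
            'desc': first_p.get_text() if first_p else '',
            'thumbnail': post['image']['full'],
        })
    return items

# trimmed stand-in for one post from the json response
sample = [{'title': 'Cecilia', 'episodeNumber': '115',
           'content': '<p>When Cecilia Gentili was growing up in Argentina...</p>',
           'audioSource': 'https://dts.podtrac.com/.../Episode_115_Cecilia_Part_1.mp3',
           'image': {'full': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png'}}]

items = posts_to_items(sample)
```

The resulting list matches the `subjects` structure the plugin already consumes, so the rest of the add-on shouldn't need changes.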
