首页 > 解决方案 > 如何使用 Python 访问 RSS 提要中的图像和图像 url?

问题描述

我目前使用 feedparser 在 Python 中有这段代码:

import feedparser

RSS_FEEDS = {'cnn': 'http://rss.cnn.com/rss/edition.rss'}    

def get_news_test(publication="cnn"):
    feed = feedparser.parse(RSS_FEEDS[publication])
    articles_cnn = feed['entries']

    for article in articles_cnn:
        print(article)


get_news_test()

上面的代码返回所有当前的文章。这是它返回的其中一篇文章的示例:


{'title': "China's internet shutdowns tactics are spreading worldwide", 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://rss.cnn.com/rss/edition.rss', 'value': "China's internet shutdowns tactics are spreading worldwide"}, 'summary': 'When Hong Kong police fired tear gas at peaceful pro-democracy protesters in 2014, the news moved swiftly through social media. Photos and videos of mostly student demonstrators being gassed helped fuel the outrage that ultimately drove hundreds of thousands of people into the streets.', 'summary_detail': {'type': 'text/html', 'language': None, 'base': 'http://rss.cnn.com/rss/edition.rss', 'value': 'When Hong Kong police fired tear gas at peaceful pro-democracy protesters in 2014, the news moved swiftly through social media. Photos and videos of mostly student demonstrators being gassed helped fuel the outrage that ultimately drove hundreds of thousands of people into the streets.'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.cnn.com/2019/01/17/africa/internet-shutdown-zimbabwe-censorship-intl/index.html'}], 'link': 'https://www.cnn.com/2019/01/17/africa/internet-shutdown-zimbabwe-censorship-intl/index.html', 'id': 'https://www.cnn.com/2019/01/17/africa/internet-shutdown-zimbabwe-censorship-intl/index.html', 'guidislink': False, 'published': 'Fri, 18 Jan 2019 07:40:48 GMT', 'published_parsed': time.struct_time(tm_year=2019, tm_mon=1, tm_mday=18, tm_hour=7, tm_min=40, tm_sec=48, tm_wday=4, tm_yday=18, tm_isdst=0), 'media_content': [{'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-super-169.jpg', 'height': '619', 'width': '1100'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-large-11.jpg', 'height': '300', 'width': '300'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-vertical-large-gallery.jpg', 'height': '552', 'width': '414'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-video-synd-2.jpg', 'height': '480', 'width': '640'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-live-video.jpg', 'height': '324', 'width': '576'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-t1-main.jpg', 'height': '250', 'width': '250'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-vertical-gallery.jpg', 'height': '360', 'width': '270'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-story-body.jpg', 'height': '169', 'width': '300'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-t1-main.jpg', 'height': '250', 'width': '250'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-assign.jpg', 'height': '186', 'width': '248'}, {'medium': 'image', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/190116165508-zimbabwe-protest-0115-01-hp-video.jpg', 'height': '144', 'width': '256'}]}

现在我知道我可以通过调用返回其中的某些部分,例如标题:

print(article.title)

但是,我对如何从提要中获取图像数据感到困惑。

标签: pythonrssfeedparser

解决方案


每个文章条目都有一个资产列表media_content。每个资产节点都包含媒体类型(我只看到过'image')、大小、url 等。

要简单地列出每个资产的媒体类型和 url,您可以使用以下内容:

import feedparser

feed = feedparser.parse("http://rss.cnn.com/rss/edition.rss")

for article in feed["entries"]:
    for media in article.media_content:
        print(f"medium: {media['medium']}")
        print(f"   url: {media['url']}")

输出:

medium: image
   url: https://cdn.cnn.com/cnnnext/dam/assets/190107112254-01-game-of-thrones-spain-castle-of-zafra-t1-main.jpg
medium: image
   url: https://cdn.cnn.com/cnnnext/dam/assets/190107112254-01-game-of-thrones-spain-castle-of-zafra-assign.jpg
medium: image
   url: https://cdn.cnn.com/cnnnext/dam/assets/190107112254-01-game-of-thrones-spain-castle-of-zafra-hp-video.jpg
...

如果你想请求和保存类型的资产'image',你可以使用requests

import feedparser
import os
import requests

feed = feedparser.parse("http://rss.cnn.com/rss/edition.rss")

for article in feed["entries"]:
    for media in article.media_content:
        if media["medium"] == "image":
            img_data = requests.get(media["url"]).content
            with open(os.path.basename(media["url"]), "wb") as handler:
                handler.write(img_data)

推荐阅读