Beautiful Soup query with filtering in Python

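Question

I am trying to save the content of each article in its own text file. The problem I am having is coming up with a Beautiful Soup approach that returns only articles of type News and ignores the other article types.

Website in question: https://www.nature.com/nature/articles

Info

Current code

I have reached the point where I can query the <span> elements for articles of type News, but I am struggling with the next steps needed to return other information specific to those articles.

How can I take this further? For articles of type News, I would also like to return each article's title and body, while ignoring articles that are not of type News.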

# Send HTTP requests
import requests

from bs4 import BeautifulSoup


class WebScraper:

    @staticmethod
    def get_the_source():
        # Obtain the URL
        url = 'https://www.nature.com/nature/articles'

        # Get the webpage
        r = requests.get(url)

        # Check response object's status code
        if r:
            the_source = open("source.html", "wb")

            soup = BeautifulSoup(r.content, 'html.parser')

            type_news = soup.find_all("span", string='News')

            for i in type_news:
                print(i.text)

            the_source.write(r.content)

            the_source.close()

            print('\nContent saved.')
        else:
            print(f'The URL returned {r.status_code}!')


WebScraper.get_the_source()

Example HTML of a News-type article

The page source contains 19 other articles, some of the same type and some of different types.

   <article class="u-full-height c-card c-card--flush" itemscope itemtype="http://schema.org/ScholarlyArticle">
        
            
                
                    <div class="c-card__image">
                        <picture>
                            <source
                                type="image/webp"
                                srcset="
                                    //media.springernature.com/w165h90/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 160w,
                                    //media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 290w"
                                sizes="
                                    (max-width: 640px) 160px,
                                    (max-width: 1200px) 290px,
                                    290px">
                            <img src="//media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg"
                                 alt=""
                                 itemprop="image">
                        </picture>
                    </div>
                
            
        
        <div class="c-card__body u-display-flex u-flex-direction-column">
            <h3 class="c-card__title" itemprop="name headline">
                <a href="/articles/d41586-021-00485-2"
                   class="c-card__link u-link-inherit"
                   itemprop="url"
                   data-track="click"
                   data-track-action="view article"
                   
                   data-track-label="link">Mars arrivals and Etna eruption — February's best science images</a>
            </h3>
            
                
                    <div class="c-card__summary u-mb-16 u-hide-sm-max"
                         itemprop="description">
                        <p>The month’s sharpest science shots, selected by <i>Nature's</i> photo team.</p>
                    </div>
                
            
            <div class="u-mt-auto">
                
                    <ul data-test="author-list" class="c-author-list c-author-list--compact u-mb-4">
                        <li itemprop="creator" itemscope="" itemtype="http://schema.org/Person"><span itemprop="name">Emma Stoye</span></li>
                    </ul>
                
                <div class="c-card__section c-meta">
                    <span class="c-meta__item c-meta__item--block-at-xl" data-test="article.type">
                        <span class="c-meta__type">News</span>
                    </span>
                    
                    
                        <time class="c-meta__item c-meta__item--block-at-xl" datetime="2021-03-05" itemprop="datePublished">05 Mar 2021</time>
                    
                </div>
            </div>
        </div>
    </article>


    
</div>
                    </li>
                
                    <li class="app-article-list-row__item">
                        <div class="u-full-height" data-native-ad-placement="false">

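Tags: python, web-scraping, beautifulsoup

Answer


The simplest way is to add news as a parameter in the query string, so the site does the filtering for you; you also get more results per request.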

https://www.nature.com/nature/articles?type=news

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nature.com/nature/articles?type=news')
soup = bs(r.content, 'lxml')
news_articles = soup.select('.app-article-list-row__item')

for n in news_articles:
    print(n.select_one('.c-card__link').text)

The parameters for page 2 of the news listing:

https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&type=news&page=2

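If you watch the browser's network tab while manually filtering the page or selecting different page numbers, you can see how the query string is constructed and adjust your requests accordingly, e.g.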

https://www.nature.com/nature/articles?type=news&year=2021
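The query-string approach can be sketched with the standard library as follows. This is only a minimal illustration: the helper name `build_listing_url` is hypothetical, and it assumes the site keeps accepting the `type`, `year` and `page` parameters seen in the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.nature.com/nature/articles"


def build_listing_url(article_type="news", page=None, year=None):
    """Build a listing URL from the query parameters seen in the URLs above.

    Hypothetical helper: the parameter names are copied from the example
    URLs, and the site may change or ignore them.
    """
    params = {"type": article_type}
    if year is not None:
        params["year"] = year
    if page is not None:
        params["page"] = page
    # urlencode converts the values to strings and joins them with "&".
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URLs for the first three pages of the news listing.
    for page_number in range(1, 4):
        print(build_listing_url(page=page_number))
```

Each generated URL can then be passed to `requests.get` in the same way as in the snippet above.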

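Otherwise, you can use CSS selectors for more complex inclusion and exclusion, based on whether the article node has a specific child node whose text contains "News" (inclusion), while excluding labels where News appears together with another word or symbol (per the category list):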

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nature.com/nature/articles')
soup = bs(r.content, 'lxml')
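# Keep list items whose type label contains "News" but none of the excluded words.
# :-soup-contains() needs soupsieve 2.1+; older versions spell it :contains().
news_articles = soup.select(
    '.app-article-list-row__item:has(.c-meta__type:-soup-contains("News"):not('
    ':-soup-contains("&"), '
    ':-soup-contains("in"), '
    ':-soup-contains("Career"), '
    ':-soup-contains("Feature")))'  # exclusion list
)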

for n in news_articles:
    print(n.select_one('.c-card__link').text)

If you do want News &, News in, and so on, you can remove those categories from the :not() list.
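As an alternative to the selector-based exclusion, the filtering can also be done in Python by comparing the type label exactly. The sketch below is only an illustration: `extract_cards` and `SAMPLE_HTML` are hypothetical names, the sample markup is a trimmed-down version of the card HTML in the question (the second card is invented for contrast), and the class names are assumed to match the live page. On the listing page the closest thing to a body is the summary paragraph; the full article text would presumably require a further request to each `href`.

```python
from bs4 import BeautifulSoup

# Trimmed-down, partly invented markup modelled on the card HTML in the question.
SAMPLE_HTML = """
<ul>
  <li class="app-article-list-row__item">
    <article>
      <a class="c-card__link" href="/articles/d41586-021-00485-2">Mars arrivals and Etna eruption</a>
      <div class="c-card__summary"><p>The month's sharpest science shots.</p></div>
      <span class="c-meta__type">News</span>
    </article>
  </li>
  <li class="app-article-list-row__item">
    <article>
      <a class="c-card__link" href="/articles/hypothetical-id">A hypothetical commentary</a>
      <div class="c-card__summary"><p>Not a plain News item.</p></div>
      <span class="c-meta__type">News &amp; Views</span>
    </article>
  </li>
</ul>
"""


def extract_cards(html, wanted_type="News"):
    """Return title, href and summary for cards whose type label equals wanted_type."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.select("article"):
        type_tag = card.select_one(".c-meta__type")
        # Exact comparison, so "News & Views" or "News Feature" are skipped.
        if type_tag is None or type_tag.get_text(strip=True) != wanted_type:
            continue
        link = card.select_one(".c-card__link")
        if link is None:
            continue
        summary = card.select_one(".c-card__summary")
        results.append({
            "title": link.get_text(strip=True),
            "href": link.get("href"),
            "summary": summary.get_text(" ", strip=True) if summary else "",
        })
    return results


if __name__ == "__main__":
    for card in extract_cards(SAMPLE_HTML):
        print(card["title"], card["href"])
```

With a live page, `r.content` from `requests.get` would be passed in place of `SAMPLE_HTML`, and each returned dictionary could be written to its own text file.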

