Extracting academic publication information from IDEAS

Problem description

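I want to extract the list of publications from a particular IDEAS page. I would like to retrieve the title, authors, and year of each paper, but I'm a bit stuck. Inspecting the page, all the information is inside div class="tab-pane fade show active" [...]; within it, each h3 holds a publication year, and inside each li class="list-group-item downfree" [...] we can find the authors of each paper (as shown in the figure). What I'd like to end up with is a data frame with three columns: title, authors, and year.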

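I am able to retrieve the title of each paper, but I get confused when I also try to add the year and the authors. So far I have written the following short piece of code: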

from requests import get
url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

containers = soup.findAll("div", {'class': 'tab-pane fade show active'})

title_list = []
year_list = []

for container in containers:

    year = container.findAll('h3')
    year_list.append(int(year[0].text))

    title_containers = container.findAll("li", {'class': 'list-group-item downfree'})
    title = title_containers[0].a.text
    title_list.append(title)  

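What I get is two lists, each with only one element. This is because the initial containers list has length 1. I have no idea how to retrieve the author names; I tried several approaches without success. I think I have to split the title text using "by" as a separator.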

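I hope someone can help me or point me to other discussions that deal with a similar situation. Thank you in advance. Apologies for my (possibly) silly question; I am still a beginner at web scraping with BeautifulSoup.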

Tags: html, python-3.x, web-scraping, beautifulsoup

Solution


You can get the information you need like this:

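from requests import get
import pprint
from bs4 import BeautifulSoup

url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Everything of interest sits inside the element with id="content".
container = soup.select_one("#content")
title_list = []
author_list = []
# Each h3 heading holds a publication year.
year_list = [int(h.text) for h in container.find_all('h3')]
# For each div.panel-body, collect its titles and its author strings as lists.
for panel in container.select("div.panel-body"):
    title_list.append([x.text for x in panel.find_all('a')])
    # The author names are the text node that immediately follows each <i> tag.
    author_list.append([x.next_sibling.strip() for x in panel.find_all('i')])
# Pair each year with its titles and authors (assumes one panel per h3, in the same order).
result = list(zip(year_list, title_list, author_list))

pp = pprint.PrettyPrinter(indent=4, width=250)
pp.pprint(result)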

Output:

[   (   2020,
        ['The Role Of Public Procurement As Innovation Lever: Evidence From Italian Manufacturing Firms', 'A voyage in the role of territory: are territories capable of instilling their peculiarities in local production systems'],
        ['Francesco Crespi & Serenella Caravella', 'Cristina Vaquero-Piñeiro']),
    (   2019,
        [   'Probability Forecasts and Prediction Markets',
            'R&D Financing And Growth',
            'Mission-Oriented Innovation Policies: A Theoretical And Empirical Assessment For The Us Economy',
            'Public Investment Fiscal Multipliers: An Empirical Assessment For European Countries',
            'Consumption Smoothing Channels Within And Between Households',
            'A critical analysis of the secular stagnation theory',
            'Further evidence of the relationship between social transfers and income inequality in OECD countries',
            'Capital accumulation and corporate portfolio choice between liquidity holdings and financialisation'],
        [   'Julia Mortera & A. Philip Dawid',
            'Luca Spinesi & Mario Tirelli',
            'Matteo Deleidi & Mariana Mazzucato',
            'Enrico Sergio Levrero & Matteo Deleidi & Francesca Iafrate',
            'Simone Tedeschi & Luigi Ventura & Pierfederico Asdrubal',
            'Stefano Di Bucchianico',
            "Giorgio D'Agostino & Luca Pieroni & Margherita Scarlato",
            'Giovanni Scarano']),
    (   2018, ...

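I obtained the years with a list comprehension. I obtained the titles and authors by appending to title_list and author_list, for each div.panel-body element, a list of the required elements, again built with list comprehensions; the authors come from the next_sibling of each i element. Then I zipped the three lists and converted the result to a list. Finally, I pretty-printed the result.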
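
The result above is grouped by year, whereas the question asks for one row per paper with title, authors, and year. A minimal sketch of that flattening step follows. The helper name flatten and the sample data are hypothetical, used only for illustration; the sketch assumes that within each year the titles and the author strings line up position by position.

```python
def flatten(result):
    """Turn [(year, [titles], [authors]), ...] into one dict per paper."""
    rows = []
    for year, titles, authors in result:
        # zip pairs each title with the author string at the same position;
        # it silently stops at the shorter list if the counts differ.
        for title, author in zip(titles, authors):
            rows.append({"title": title, "authors": author, "year": year})
    return rows


# Hypothetical sample in the same shape as `result` (not real page data).
sample = [
    (2020, ["Paper A", "Paper B"], ["Author One & Author Two", "Author Three"]),
    (2019, ["Paper C"], ["Author Four"]),
]

rows = flatten(sample)
for row in rows:
    print(row)
```

If pandas is installed, passing rows to pandas.DataFrame should give the three-column data frame described in the question, since a list of dicts becomes one row per dict.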

