Why can't my code scrape this webpage?

Problem

I'm trying to scrape https://journals.sagepub.com/toc/CPS/current in Python.

My main goal is to scrape the titles of all the papers listed there. After inspecting the page structure, I came up with the following code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://journals.sagepub.com/toc/CPS/current"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(req).read()
page_soup = BeautifulSoup(webpage, "html.parser")
nameList = page_soup.findAll("h3", {"class": "heading-title"})
List = []
for name in nameList:
    List.append(name.get_text())
print(List)  # always comes back empty

However, for some reason the list always ends up empty. I've used this same approach on other pages and it worked fine, so I'm not sure what's missing here.

Any ideas?
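One way to narrow this down (a diagnostic sketch, not part of the original question; it reuses the same url and request from the code above) is to look at what the server actually sent back before parsing it:

from urllib.request import Request, urlopen

url = "https://journals.sagepub.com/toc/CPS/current"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(req).read()

print(len(webpage))                 # size of the raw response body
print(b"heading-title" in webpage)  # is the expected class name in the HTML at all?

If the class name is missing from the raw bytes, the parsing step is not the problem; the server is simply returning different content to this client.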

Tags: python, web-scraping

Solution


It seems urllib has trouble getting a proper response from this server. Try the requests module, which is more robust:

import requests
from bs4 import BeautifulSoup

url = "https://journals.sagepub.com/toc/CPS/current"
req = requests.get(url)
page_soup = BeautifulSoup(req.content, "html.parser")

# same query as before; find_all is the current name for findAll
nameList = page_soup.find_all("h3", {"class": "heading-title"})
List = []
for name in nameList:
    List.append(name.get_text())
print(List)

This prints:

[
    "When Does the Public Get It Right? The Information Environment and the Accuracy of Economic Sentiment",
    "Does Affirmative Action Work? Evaluating India’s Quota System",
    "Legacies of Resistance: Mobilization Against Organized Crime in Mexico",
    "Political Institutions and Coups in Dictatorships",
    "Generous to Workers ≠ Generous to All: Implications of European Unemployment Benefit Systems for the Social Protection of Immigrants",
    "Drinking Alone: Local Socio-Cultural Degradation and Radical Right Support—The Case of British Pub Closures",
]
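If the request ever starts coming back empty again, a slightly more defensive variant (a sketch, not from the original answer; the User-Agent header, the raise_for_status() call, and the CSS-selector spelling are additions) makes failures visible instead of silently producing an empty list:

import requests
from bs4 import BeautifulSoup

url = "https://journals.sagepub.com/toc/CPS/current"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
response.raise_for_status()  # raise an error on 4xx/5xx instead of parsing an error page

page_soup = BeautifulSoup(response.content, "html.parser")
# same query as before, written as a CSS selector
titles = [tag.get_text(strip=True) for tag in page_soup.select("h3.heading-title")]
print(titles)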
