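python - How can I use Python to extract all URLs from a web page that has a "Show more" button?
Problem description
I am new to web scraping. I previously used the code below to extract URLs from a website with multiple pages and save them to a txt file. I want to apply it to a new website, but this one has only a single page with a "Show more" button.
This is the web page: http://sdg.iisd.org/news/
This is my code: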
    import requests
    from bs4 import BeautifulSoup
    import time
    import pandas as pd

    links = []
    for i in range(1):  # 221 on the site with many pages
        url = 'http://sdg.iisd.org/news/'  # + str(i) <-- for webpage with many pages
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            print('Page: ' + str(i))
            soup = BeautifulSoup(response.text, 'lxml')
            div = soup.findAll('article')
            for article in div:
                a = article.find('a')
                link = a['href']
                links.append('https://sdg.iisd.org/news' + link)
    print(len(links))
    with open('urls.txt', 'w') as file:
        for link in links:
            file.write(link + '\n')
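Some people suggested using Selenium, but I could not find an example similar to my use case. Do you know how I can use or change my code to get all the links on the page?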
Solution
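If you log your browser's network traffic, you can see that pressing the Show more button makes an XHR HTTP POST request to http://sdg.iisd.org/wp-admin/admin-ajax.php, and that the response is HTML. You can also copy the POST payload from your browser's developer tools. Use the pageNumber and ppp key-value pairs in the data payload dictionary to get different articles: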
    def main():
        import requests
        from bs4 import BeautifulSoup as Soup
        from operator import itemgetter

        url = "http://sdg.iisd.org/wp-admin/admin-ajax.php"

        # POST payload copied from the browser's developer tools.
        data = {
            "template": "load_more",
            "post_type": "news",
            "sdgs": "",
            "issues": "",
            "globalpartnership": "",
            "actors": "",
            "actions": "",
            "regions": "",
            "behaviour": "exact",
            "sort_by": "DESC",
            "pageNumber": "1",
            "ppp": "12",
            "action": "more_post_ajax",
            "author": ""
        }

        response = requests.post(url, data=data)
        response.raise_for_status()

        # The response body is an HTML fragment containing <article> elements.
        soup = Soup(response.content, "html.parser")
        article_urls = list(map(itemgetter("href"), soup.select("article > a")))
        print(article_urls)
        return 0


    if __name__ == "__main__":
        import sys
        sys.exit(main())
Output:
['http://sdg.iisd.org/news/wef-event-explores-ways-to-fix-international-trade-system/', 'http://sdg.iisd.org/news/wto-members-resume-negotiations-on-fisheries-subsidies/', 'http://sdg.iisd.org/news/informal-ministerial-highlights-role-of-trade-in-promoting-covid-19-recovery/', 'http://sdg.iisd.org/news/wto-imf-project-uneven-covid-19-recovery-across-and-within-countries/', 'http://sdg.iisd.org/news/53-wto-members-commit-to-ease-restrictions-on-humanitarian-food-aid/', 'http://sdg.iisd.org/news/development-goals-can-work-even-amid-crisis-but-we-need-to-measure-better/', 'http://sdg.iisd.org/news/unctad-partners-launch-tool-to-identify-exchange-traded-funds-with-sdg-alignment/', 'http://sdg.iisd.org/news/tool-helps-measure-quality-of-stakeholder-engagement-in-sdgs/', 'http://sdg.iisd.org/news/unctad-reveals-economic-slowdown-before-covid-19-provides-key-data-on-rcep-agreement/', 'http://sdg.iisd.org/news/unep-report-identifies-top-actions-to-minimize-adverse-impacts-of-pesticides-fertilizers/', 'http://sdg.iisd.org/news/regions-to-hold-sustainable-development-forums-ahead-of-2021-hlpf/', 'http://sdg.iisd.org/news/ndc-partnership-reflects-on-milestone-year-for-climate-ambition/']
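The answer above fetches a single batch of 12 links. To collect every link and write them to urls.txt as in the question, the same request can be repeated with an increasing pageNumber. The following is only a sketch under some assumptions: the function names (fetch_article_urls, collect_all_urls, save_urls) and the first_page, max_pages and delay parameters are hypothetical, page numbering is assumed to start at 1 as in the payload above, and the endpoint is assumed to return no new articles once the last page has been passed. None of this has been confirmed against the live site.

```python
def fetch_article_urls(page_number, per_page=12):
    """Request one "Show more" batch and return the article URLs it contains."""
    # Third-party imports are kept local, as in the answer's main().
    import requests
    from bs4 import BeautifulSoup

    url = "http://sdg.iisd.org/wp-admin/admin-ajax.php"
    # Same payload as in the answer; only pageNumber and ppp vary.
    data = {
        "template": "load_more",
        "post_type": "news",
        "sdgs": "",
        "issues": "",
        "globalpartnership": "",
        "actors": "",
        "actions": "",
        "regions": "",
        "behaviour": "exact",
        "sort_by": "DESC",
        "pageNumber": str(page_number),
        "ppp": str(per_page),
        "action": "more_post_ajax",
        "author": "",
    }
    response = requests.post(url, data=data)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    return [a["href"] for a in soup.select("article > a[href]")]


def collect_all_urls(fetch_page=fetch_article_urls, first_page=1,
                     max_pages=500, delay=1.0):
    """Fetch successive pages until one adds no new URLs.

    Assumption: a page past the end yields nothing new. max_pages is a
    hypothetical safety cap so the loop always terminates.
    """
    import time

    urls = []
    seen = set()
    for page_number in range(first_page, first_page + max_pages):
        added = 0
        for link in fetch_page(page_number):
            if link not in seen:
                seen.add(link)
                urls.append(link)
                added += 1
        if added == 0:
            break
        time.sleep(delay)  # pause between requests to be polite to the server
    return urls


def save_urls(urls, path="urls.txt"):
    """Write one URL per line, as the question's code does."""
    with open(path, "w") as file:
        for link in urls:
            file.write(link + "\n")
```

Calling save_urls(collect_all_urls()) would then write the collected links to urls.txt. If the first batch of articles turns out to be missing, passing first_page=0 may be worth trying, since the payload above does not show where the numbering begins.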