web-scraping - Load-more pagination in an HTML web page - Webscraping
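Problem description
This is the URL I want to scrape data from: https://en.prothomalo.com/search?q=road%20accident. It has no pagination that changes the URL on each click. Instead there is only a "Load More" button, and clicking it does not change the URL or anything in the script. How can I scrape the whole page automatically with BeautifulSoup in Python, without clicking the button manually? I have seen similar questions on Stack Overflow, but those were about JSON, and my page appears to be in HTML.
Inspecting the "Load More" button shows this line of code: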
<span class="load-more-content more-m_content_1XWY0 more-m_en-content_2lUOO">Load More</span>
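Solution
The subsequent pages are loaded by JavaScript from an external URL, in JSON format. You can use the requests library to simulate those requests. For example: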
import json
import requests

url = "https://en.prothomalo.com/api/v1/advanced-search"
params = {
    "fields": "headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards",
    "offset": "0",
    "limit": "6",
    "q": "road accident",
}

for offset in range(0, 100, 6):  # <-- increase offset here
    params["offset"] = offset
    data = requests.get(url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for i in data["items"]:
        print(i["headline"])
Prints:
Four people killed in Jashore road accident
Three killed in Mymensingh road accident
2 killed in Fatullah road accident
7 killed in three road accidents in Chattogram, Rangamati
ASI killed in Chattogram road accident
Mother-son killed in Sylhet road accident
RAB member, another killed in Gazipur road accident
3 killed in Chattogram road accident
Road accident kills two workers in Noakhali
Couple killed in Meherpur road accident
4 killed, 7 injured in Rangpur road accident
3 Bangladeshis killed in Oman road accident
Implement transport act to halt road accident deaths
One killed, 3 injured in Panchagarh road accident
Two musicians killed in Chattogram road accident
One killed in Panchagarh road accident
Road accident kills one in Narail
5 Bangladeshi workers killed in Oman road accident
Two killed in Sylhet road accident
... and so on.
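The loop above stops at a fixed offset of 100. A minimal sketch of one way to keep going until the results run out is shown below. It assumes, without confirmation from the site, that the endpoint returns an empty "items" list once the offset passes the last result. The names iter_items, fetch_from_api, and max_items are hypothetical helpers introduced for illustration, and the shortened "fields" value is an assumed subset of the list used in the answer.

```python
from typing import Callable, Iterator


def iter_items(fetch_page: Callable[[int, int], dict], limit: int = 6,
               max_items: int = 1000) -> Iterator[dict]:
    """Yield items page by page until a page comes back empty.

    fetch_page(offset, limit) is a hypothetical callable that returns the
    decoded JSON for one page. No new page is requested once the offset
    reaches max_items, which acts as a safety cap.
    """
    offset = 0
    while offset < max_items:
        data = fetch_page(offset, limit)
        items = data.get("items", [])
        if not items:  # assumption: an empty page means no more results
            break
        yield from items
        offset += len(items)


def fetch_from_api(offset: int, limit: int) -> dict:
    """Fetch one page from the search endpoint used in the answer."""
    import requests  # imported here so iter_items can be exercised offline

    url = "https://en.prothomalo.com/api/v1/advanced-search"
    params = {
        "fields": "headline,url",  # assumed subset of the fields in the answer
        "offset": offset,
        "limit": limit,
        "q": "road accident",
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


# Usage (performs real network requests):
# for item in iter_items(fetch_from_api):
#     print(item["headline"])
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP call, so the stopping rule can be checked with a fake fetcher before hitting the real site.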