Load-more pagination in an HTML web page - Webscraping

Question

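This is the URL I want to scrape data from: https://en.prothomalo.com/search?q=road%20accident. It has no pagination that changes the URL on each click. Instead there is only a Load More button, and clicking it does not change the URL or anything in the script. How can I scrape the whole page automatically in Python with BeautifulSoup, without clicking the button manually? I have seen similar questions on Stack Overflow, but those were about JSON, and my page appears to be HTML.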

Inspecting the Load More button shows this line of code:

<span class="load-more-content more-m_content_1XWY0 more-m_en-content_2lUOO">Load More</span>

Tags: web-scraping, beautifulsoup, pagination

Solution


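The next page is loaded by JavaScript from an external URL, in JSON format, so there is no additional HTML to parse with BeautifulSoup. You can simulate that request with the requests library. For example: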

import json
import requests


url = "https://en.prothomalo.com/api/v1/advanced-search"

params = {
    "fields": "headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards",
    "offset": "0",
    "limit": "6",
    "q": "road accident",
}

for offset in range(0, 100, 6): # <-- increase offset here
    params["offset"] = offset
    data = requests.get(url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for i in data["items"]:
        print(i["headline"])

Prints:

Four people killed in Jashore road accident
Three killed in  Mymensingh road accident
2 killed in Fatullah road accident
7 killed in three road accidents in Chattogram, Rangamati


ASI killed in Chattogram road accident
Mother-son killed in Sylhet road accident


RAB member, another killed in Gazipur road accident 
3 killed in Chattogram road accident 
Road accident kills two workers in Noakhali
Couple killed in Meherpur road accident
4 killed, 7 injured in Rangpur road accident
3 Bangladeshis killed in Oman road accident


Implement transport act to halt road accident deaths
One killed, 3 injured in Panchagarh road accident 
Two musicians killed in Chattogram road accident
One killed in Panchagarh road accident


Road accident kills one in Narail


5 Bangladeshi workers killed in Oman road accident
Two killed in Sylhet road accident

... and so on.
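
The loop above stops at a fixed offset of 100. If you want every result instead, one option is to keep requesting pages until one comes back empty. The following is only a sketch under assumptions: it assumes the endpoint returns an empty "items" list once the offset passes the last result, and that it accepts a shorter "fields" value than the one shown above. Neither is confirmed by the answer. The names fetch_page, iter_items and max_pages are hypothetical helpers introduced for illustration.

```python
from typing import Callable, Iterator

API_URL = "https://en.prothomalo.com/api/v1/advanced-search"


def fetch_page(offset: int, limit: int, query: str = "road accident") -> dict:
    """Request one page of search results and return the parsed JSON."""
    import requests  # imported lazily so iter_items() has no third-party dependency

    params = {
        "fields": "headline,url",  # assumption: a subset of the original field list is accepted
        "offset": offset,
        "limit": limit,
        "q": query,
    }
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return response.json()


def iter_items(
    fetch: Callable[[int, int], dict],
    limit: int = 6,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield items page by page until a page has no items.

    max_pages is a safety cap in case the API never returns an empty page.
    """
    for page in range(max_pages):
        data = fetch(page * limit, limit)
        items = data.get("items") or []
        if not items:  # assumption: an empty page means there are no more results
            return
        yield from items


# Example usage (performs real HTTP requests):
# for item in iter_items(fetch_page):
#     print(item["headline"])
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP call, so you can try the loop against saved responses before hitting the site. Adding a short time.sleep between requests is also a reasonable courtesy to the server.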
