How do I scrape all the pages?

Problem

I'm trying to scrape the text of a website, but it only crawls 12 articles. I don't know why that happens. What should I do if I want to scrape the other pages as well?

import requests
from bs4 import BeautifulSoup

x = int(input("start page:"))
while x < int(input("end page:")):
    x = x + 1
    url = "https://www.mmtimes.com/national-news.html?page=" + str(x)
    result = requests.get(url)
    bs_obj = BeautifulSoup(result.content, "html.parser")
    content = bs_obj.find("div", {"class": "msp-three-col"})
    read_more = content.findAll("div", {"class": "read-more"})

    for item in read_more:
        atag = item.find('a')
        link = "https://www.mmtimes.com" + atag["href"]
        linkResult = requests.get(link)
        subpage = BeautifulSoup(linkResult.content, "html.parser")
        fnresult = subpage.find("div", {"class": "field-item even"})
        print(fnresult.text)
    print("Total "+str(len(read_more))+" articles")

Tags: python

Solution

Take a look at the code below; I made a few changes, and it produces the desired output. Your original loop has three problems: it calls input("end page:") inside the while condition, so it prompts for the end page again on every iteration; it increments x before building the URL, so the start page is skipped; and it prints len(read_more) for the current page only (about 12 articles per page), not a running total. The fix reads the end page once, uses <= so the last page is included, keeps a cumulative counter, and increments x at the bottom of the loop.

import requests
from bs4 import BeautifulSoup

x = int(input("start page:"))
y = int(input("end page:"))  # read the end page once, outside the loop

article_count = 0
while x <= y:  # <= so the end page itself is scraped too
    url = "https://www.mmtimes.com/national-news.html?page=" + str(x)
    result = requests.get(url)
    bs_obj = BeautifulSoup(result.content, "html.parser")

    # each listing page keeps its article teasers in this column div
    content = bs_obj.find("div", {"class": "msp-three-col"})
    read_more = content.find_all("div", {"class": "read-more"})

    for item in read_more:
        # follow each "read more" link to the full article page
        atag = item.find('a')
        link = "https://www.mmtimes.com" + atag["href"]
        linkResult = requests.get(link)
        subpage = BeautifulSoup(linkResult.content, "html.parser")
        fnresult = subpage.find("div", {"class": "field-item even"})
        print(fnresult.text)

    article_count += len(read_more)  # running total across all pages so far
    print("Total " + str(article_count) + " articles")
    x += 1  # advance to the next listing page
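
If the page range is known up front, a for loop over range() reads more naturally, and a little defensiveness helps when a page is missing or its markup changes. Below is a minimal sketch along those lines; the URL and class names are taken from the answer above, while the requests.Session, the timeout, and the None checks are my additions, not part of the original answer.

import requests
from bs4 import BeautifulSoup

start = int(input("start page:"))
end = int(input("end page:"))

session = requests.Session()  # reuse one connection across requests
article_count = 0

for page in range(start, end + 1):
    url = "https://www.mmtimes.com/national-news.html?page=" + str(page)
    result = session.get(url, timeout=10)
    if result.status_code != 200:
        print("page", page, "returned", result.status_code, "- skipping")
        continue

    bs_obj = BeautifulSoup(result.content, "html.parser")
    content = bs_obj.find("div", {"class": "msp-three-col"})
    if content is None:  # layout changed or the page is empty
        print("no article column on page", page, "- skipping")
        continue

    for item in content.find_all("div", {"class": "read-more"}):
        atag = item.find("a")
        if atag is None or not atag.get("href"):
            continue
        link = "https://www.mmtimes.com" + atag["href"]
        subpage = BeautifulSoup(session.get(link, timeout=10).content, "html.parser")
        fnresult = subpage.find("div", {"class": "field-item even"})
        if fnresult is not None:
            print(fnresult.text)
            article_count += 1

print("Total", article_count, "articles")

Here the total is printed once after the loop finishes; move the print back inside the loop if you prefer a per-page running count like the answer above.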
