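python - How do I crawl all pages?
Problem description
I am trying to scrape the text of a website, but it only crawls 12 articles and I don't know why. What should I do to scrape the other pages as well?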
import requests
from bs4 import BeautifulSoup

x = int(input("start page:"))
while x < int(input("end page:")):
    x = x + 1
    url = "https://www.mmtimes.com/national-news.html?page=" + str(x)
    result = requests.get(url)
    bs_obj = BeautifulSoup(result.content, "html.parser")
    content = bs_obj.find("div", {"class": "msp-three-col"})
    read_more = content.findAll("div", {"class": "read-more"})
    for item in read_more:
        atag = item.find('a')
        link = "https://www.mmtimes.com" + atag["href"]
        linkResult = requests.get(link)
        subpage = BeautifulSoup(linkResult.content, "html.parser")
        fnresult = subpage.find("div", {"class": "field-item even"})
        print(fnresult.text)
    print("Total "+str(len(read_more))+" articles")
Solution
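See the code below; I made a few changes. The end page is now read once before the loop instead of on every iteration, the page counter is incremented at the end of the loop so the start page itself is crawled, and the article count is accumulated across pages rather than taken from the last page only. This produces the desired output.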
import requests
from bs4 import BeautifulSoup

x = int(input("start page:"))
y = int(input("end page:"))  # read the end page once, before the loop
article_count = 0

while x <= y:
    # Fetch and parse the listing page for page number x.
    url = "https://www.mmtimes.com/national-news.html?page=" + str(x)
    result = requests.get(url)
    bs_obj = BeautifulSoup(result.content, "html.parser")
    content = bs_obj.find("div", {"class": "msp-three-col"})
    read_more = content.find_all("div", {"class": "read-more"})

    # Follow each "read more" link and print the article text.
    for item in read_more:
        atag = item.find('a')
        link = "https://www.mmtimes.com" + atag["href"]
        linkResult = requests.get(link)
        subpage = BeautifulSoup(linkResult.content, "html.parser")
        fnresult = subpage.find("div", {"class": "field-item even"})
        print(fnresult.text)

    # Accumulate the count across pages and print the running total.
    article_count += len(read_more)
    print("Total "+str(article_count)+" articles")
    x += 1  # move on to the next page
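
The same pagination and counting logic can also be written with range instead of a hand-managed while counter, which removes the off-by-one risk from the question. The following is only a minimal sketch under some assumptions: page_urls, count_articles and fake_links_on_page are hypothetical names that are not in the code above, and fake_links_on_page is a stand-in for the requests + BeautifulSoup step that simply pretends every listing page has 12 articles, so the example runs without network access.

```python
from typing import Callable, Iterable, Iterator, List
from urllib.parse import urljoin

BASE_URL = "https://www.mmtimes.com"
LISTING_PATH = "/national-news.html"


def page_urls(start: int, end: int) -> Iterator[str]:
    """Yield the listing URL for every page from start to end, inclusive."""
    for page in range(start, end + 1):
        yield f"{BASE_URL}{LISTING_PATH}?page={page}"


def count_articles(urls: Iterable[str],
                   links_on_page: Callable[[str], List[str]]) -> int:
    """Sum the number of article links found on each listing page."""
    total = 0
    for url in urls:
        total += len(links_on_page(url))
    return total


def fake_links_on_page(url: str) -> List[str]:
    # Hypothetical stand-in for fetching and parsing a listing page:
    # it ignores the URL and pretends each page links to 12 articles.
    return [urljoin(BASE_URL, f"/news/article-{i}.html") for i in range(12)]


if __name__ == "__main__":
    total = count_articles(page_urls(1, 3), fake_links_on_page)
    print("Total " + str(total) + " articles")  # Total 36 articles
```

In a real crawl, fake_links_on_page would be replaced by a function that downloads the listing page and collects the href of each "read-more" link, as the loop body above does. Passing that step in as a parameter keeps the page arithmetic easy to check on its own.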