python-3.x - How to scrape multiple pages using requests in Python
Problem description
I recently started dabbling in web scraping and have had some success, but now I'm stuck and can't find an answer or figure it out.
Here is my code for scraping a single page and exporting the information:
import csv
from datetime import datetime

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.example.com/page.aspx?sign=1")
soup = BeautifulSoup(page.content, 'html.parser')

# finds the right heading to grab
box = soup.find('h1').text
heading = box.split()[0]

# finds the right paragraph to grab
reading = soup.find_all('p')[0].text
print(heading, reading)

# open the csv file in append mode, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([heading, reading, datetime.now()])
The problem arises when I try to scrape several pages at once. They are all identical except for the page number, e.g.
- https://www.example.com/page.aspx?sign=1
- https://www.example.com/page.aspx?sign=2
- https://www.example.com/page.aspx?sign=3
- https://www.example.com/page.aspx?sign=4, etc.
Instead of writing the same code 20 times, how can I collect all the data into a tuple or array and export it to CSV? Thanks in advance.
Solution
Just request the pages in a loop until no more are available (the request fails). It should be easy to get working.
import csv
from datetime import datetime

import requests
from bs4 import BeautifulSoup

results = []
page_number = 1

while True:
    response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
    if response.status_code != 200:
        break

    soup = BeautifulSoup(response.content, 'html.parser')

    # finds the right heading to grab
    box = soup.find('h1').text
    heading = box.split()[0]

    # finds the right paragraph to grab
    reading = soup.find_all('p')[0].text

    # append a list
    # results.append([heading, reading, datetime.now()])
    # or a tuple.. your call
    results.append((heading, reading, datetime.now()))
    page_number += 1

with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for result in results:
        writer.writerow(result)
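One caveat with this approach: if the site happens to return a 200 status for every sign value (for example, by serving an empty template page), the loop never terminates. A safeguard is to cap the number of pages. Below is a minimal sketch of that idea with the network fetch factored out into a callback, so the stopping logic can be tested without hitting the site; the names `collect_pages`, `write_rows`, and the `max_pages` cap are illustrative, not part of the original code:

```python
import csv
from datetime import datetime


def collect_pages(fetch_page, max_pages=50):
    """Call fetch_page(n) for n = 1, 2, ... and gather rows.

    fetch_page should return a (heading, reading) tuple, or None
    when the page does not exist. Stops on None or at max_pages,
    so a site that answers 200 for every page can't loop forever.
    """
    rows = []
    for n in range(1, max_pages + 1):
        row = fetch_page(n)
        if row is None:
            break
        # timestamp each row, matching the original CSV layout
        rows.append(row + (datetime.now(),))
    return rows


def write_rows(path, rows):
    # newline='' prevents blank lines between rows on Windows
    with open(path, 'a', newline='') as csv_file:
        csv.writer(csv_file).writerows(rows)
```

In the real script, `fetch_page` would wrap the `requests.get` / BeautifulSoup code from the answer and return `None` when `response.status_code != 200`.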