How to scrape multiple pages with requests in Python

Problem Description

I recently started getting into web scraping and have had some success, but now I'm stuck and can't find the answer or figure it out on my own.
Here is the code I use to scrape and export information from a single page:

import requests
page = requests.get("https://www.example.com/page.aspx?sign=1")

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

#finds the right heading to grab
box = soup.find('h1').text
heading = box.split()[0]

#finds the right paragraph to grab
reading = soup.find_all('p')[0].text

print (heading, reading)

import csv
from datetime import datetime

# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([heading, reading, datetime.now()])

The problem comes when I try to scrape several pages at once. The pages are all identical except for the pagination in the URL, e.g. the sign query parameter going from 1 to 2, 3, and so on.

How can I collect all of the data into a tuple or array and export it to CSV, instead of writing the same code 20 times? Thanks in advance.

Tags: python-3.x, web-scraping, beautifulsoup

Solution


Just loop over the page numbers until there is no page left (the request no longer comes back OK). It should be easy to put together:

import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime

results = []
page_number = 1

while True:
    response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.content, 'html.parser')
    #finds the right heading to grab
    box = soup.find('h1').text
    heading = box.split()[0]
    #finds the right paragraph to grab
    reading = soup.find_all('p')[0].text
    # write a list
    # results.append([heading, reading, datetime.now()])
    # or tuple.. your call
    results.append((heading, reading, datetime.now()))
    page_number = page_number + 1

with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for result in results:
        writer.writerow(result)
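
If the total number of pages is known up front (the question mentions writing the same code 20 times), a bounded for loop avoids an open-ended while loop, and each row can be written to the CSV as soon as it is scraped. The following is only a minimal sketch under that assumption; the 1-to-20 range and the one-second delay are illustrative choices, not something stated in the original post.

import csv
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup

# Assumption: pages are numbered 1..20 via the `sign` query parameter,
# matching the single-page example above.
with open('index.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for page_number in range(1, 21):
        response = requests.get(
            "https://www.example.com/page.aspx",
            params={"sign": page_number},
        )
        if response.status_code != 200:
            break  # stop early if a page is missing

        soup = BeautifulSoup(response.content, 'html.parser')
        heading_tag = soup.find('h1')
        paragraphs = soup.find_all('p')
        if heading_tag is None or not paragraphs:
            continue  # skip pages that lack the expected markup

        heading = heading_tag.text.split()[0]
        reading = paragraphs[0].text

        # write each row immediately so a partial run still produces output
        writer.writerow([heading, reading, datetime.now()])

        time.sleep(1)  # be polite to the server between requests

Writing inside the loop also means the file is opened only once, and passing newline='' to open() avoids the blank rows the csv module can otherwise produce on Windows.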
