How do I scrape multiple pages with this script?

Problem description

I have a list of URLs, and I wrote a loop whose goal is to go through all of these links and scrape some data across several pages.

Maybe it's because I mix Selenium and BeautifulSoup without doing it properly, but my script gives me a CSV file with the wrong output.

If I tell the script to go through 2 pages, the output is a CSV file that contains the data from the first page, but twice. Like this:

Output (screenshot)

As you can see, there are duplicates instead of the reviews from the two pages scrolled through with Selenium.

Here is my script:

import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime
import time
import random
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait     
from selenium.webdriver.common.by import By     
from selenium.webdriver.support import expected_conditions as EC

from selenium.webdriver.common.keys import Keys

PATH = r"driver\chromedriver.exe"

options = webdriver.ChromeOptions() 
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')

driver = webdriver.Chrome(options=options, executable_path=PATH)

driver.get('https://www.tripadvisor.ca/')
driver.maximize_window()
time.sleep(2)

j = 2 #number of pages

for url in linksfinal: 

    driver.get(url) 

    results = requests.get(url)

    comms = []
    notes = []
    dates = []
    
    soup = BeautifulSoup(results.text, "html.parser")

    name = soup.find('h1', class_= '_1mTlpMC3').text.strip()

    commentary = soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8')

    for k in range(j): #iterate over n pages

        for container in commentary:

            comm  = container.find('q', class_ = 'IRsGHoPm').text.strip()
            comms.append(comm)

            comm1 = str(container.find("div", class_="nf9vGX55").find('span'))
            rat = re.findall(r'\d+', str(comm1))
            rat1 = (str(rat))[2]
            notes.append(rat1)

            time.sleep(3) 


        next = driver.find_element_by_xpath('//a[@class="ui_button nav next primary "]')
          
        next.click()

    data = pd.DataFrame({
    'comms' : comms,
    'notes' : notes,
    #'dates' : dates
    })

    data.to_csv(f"{name}.csv", sep=';', index=False)

    time.sleep(3)

I suppose it must have something to do with my indentation, but I can't see where?

Tags: python, python-3.x, selenium, web-scraping

Solution


Well, you do try to iterate over n pages with the loop for k in range(j):, but inside it you are still iterating over commentary, whose members are taken from soup, which is built from results, which comes from the single call results = requests.get(url) made before the loop. In other words, you may well be clicking the next button, but you keep iterating over the data collected at the very start, i.e. from the first page.

UPD: I'm not sure this will work, but I think you should rebuild soup and commentary after next.click(), like this:


driver.get('https://www.tripadvisor.ca/')
driver.maximize_window()
time.sleep(2)

j = 2 #number of pages

for url in linksfinal: 

    driver.get(url) 

    results = requests.get(url)

    comms = []
    notes = []
    dates = []
    
    soup = BeautifulSoup(results.text, "html.parser")

    name = soup.find('h1', class_= '_1mTlpMC3').text.strip()

    commentary = soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8')

    for k in range(j): #iterate over n pages

        for container in commentary:

            comm  = container.find('q', class_ = 'IRsGHoPm').text.strip()
            comms.append(comm)

            comm1 = str(container.find("div", class_="nf9vGX55").find('span'))
            rat = re.findall(r'\d+', str(comm1))
            rat1 = (str(rat))[2]
            notes.append(rat1)

            time.sleep(3) 


        next = driver.find_element_by_xpath('//a[@class="ui_button nav next primary "]')
          
        next.click()

        time.sleep(3)  # let the next page of reviews load

        # re-parse what the browser is showing now; the old soup was built from page 1
        soup = BeautifulSoup(driver.page_source, "html.parser")

        commentary = soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8')

    data = pd.DataFrame({
    'comms' : comms,
    'notes' : notes,
    #'dates' : dates
    })

    data.to_csv(f"{name}.csv", sep=';', index=False)

    time.sleep(3)
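
As a follow-up: a cleaner way to avoid this mismatch entirely is to stop calling requests.get(url) inside the loop and always parse driver.page_source, which reflects whatever page the browser is currently showing after each click. Below is a minimal sketch of that idea; it keeps the obfuscated TripAdvisor class names from the question (these change over time, so adjust them), and linksfinal is a placeholder for your own list of URLs.

import re
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

linksfinal = []  # fill in with your list of TripAdvisor URLs

driver = webdriver.Chrome()  # assumes chromedriver is reachable on PATH

j = 2  # number of review pages to scrape per URL

for url in linksfinal:

    driver.get(url)

    comms = []
    notes = []

    # the name is taken once, from the first page shown by the browser
    soup = BeautifulSoup(driver.page_source, "html.parser")
    name = soup.find('h1', class_='_1mTlpMC3').text.strip()

    for k in range(j):

        # re-parse whatever page the browser is showing right now
        soup = BeautifulSoup(driver.page_source, "html.parser")

        for container in soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8'):

            comms.append(container.find('q', class_='IRsGHoPm').text.strip())

            bubble = str(container.find('div', class_='nf9vGX55').find('span'))
            notes.append(re.findall(r'\d+', bubble)[0])  # e.g. '40' for a 4.0 rating

        if k < j - 1:  # no need to click "next" after the last page
            next_btn = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable(
                    (By.XPATH, '//a[@class="ui_button nav next primary "]')
                )
            )
            next_btn.click()
            time.sleep(2)  # crude wait for the next page of reviews to render

    data = pd.DataFrame({'comms': comms, 'notes': notes})
    data.to_csv(f"{name}.csv", sep=';', index=False)

driver.quit()

The key point is that requests.get(url) always downloads the same first page again, whereas driver.page_source changes as Selenium clicks through the pagination.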
