python - How can I scrape multiple pages with this script?
Problem description
I have a list of URLs, and I wrote a loop whose goal is to visit each of those links and scrape some data across several pages.
It may be because I am mixing Selenium and BeautifulSoup without doing it properly, but my script gives me a CSV file with the wrong output.
If I tell the script to go through 2 pages, the output is a CSV file containing the data from the first page, but twice. Like this:
As you can see, there are duplicates instead of the reviews from the two pages scrolled with Selenium.
Here is my script:
import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime
import time
import random
from selenium import webdriver
import time
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
PATH = "driver\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')
driver = webdriver.Chrome(options=options, executable_path=PATH)
driver.get('https://www.tripadvisor.ca/')
driver.maximize_window()
time.sleep(2)
j = 2 #number of pages
for url in linksfinal:
    driver.get(url)
    results = requests.get(url)
    comms = []
    notes = []
    dates = []
    soup = BeautifulSoup(results.text, "html.parser")
    name = soup.find('h1', class_='_1mTlpMC3').text.strip()
    commentary = soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8')
    for k in range(j):  # iterate over n pages
        for container in commentary:
            comm = container.find('q', class_='IRsGHoPm').text.strip()
            comms.append(comm)
            comm1 = str(container.find("div", class_="nf9vGX55").find('span'))
            rat = re.findall(r'\d+', str(comm1))
            rat1 = (str(rat))[2]
            notes.append(rat1)
        time.sleep(3)
        next = driver.find_element_by_xpath('//a[@class="ui_button nav next primary "]')
        next.click()
    data = pd.DataFrame({
        'comms': comms,
        'notes': notes,
        #'dates': dates
    })
    data.to_csv(f"{name}.csv", sep=';', index=False)
    time.sleep(3)
I guess it has to do with my indentation, but I can't see where?
Solution
Well, you try to iterate over n pages with the loop for k in range(j):, but inside it you are still iterating over the same containers from commentary, which is taken from soup, which is taken from results, which is the one-time requests.get(url). In other words, you may well be clicking the next button, but you are still iterating over the data that was collected at the start, from the first page only.
UPD: I am not sure this will work, but I think you should re-collect soup and commentary right after next.click(), as in the updated script below.
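The failure mode is worth seeing in isolation: the "browser" moves forward, but a list captured once before the loop never changes. A minimal stdlib sketch of the same pattern (all names here are illustrative; fetch_page stands in for re-parsing driver.page_source, click_next for clicking the site's next button):

```python
# Two "pages" of fake review data and a counter playing the browser.
pages = [["review A", "review B"], ["review C", "review D"]]
state = {"page": 0}

def fetch_page():
    # Stand-in for parsing driver.page_source: returns whatever
    # the "browser" is showing right now.
    return pages[state["page"]]

def click_next():
    # Stand-in for clicking the "next" button.
    state["page"] += 1

# Buggy pattern: snapshot taken once, reused on every pass.
state["page"] = 0
snapshot = fetch_page()
buggy = []
for _ in range(2):
    buggy.extend(snapshot)      # always the first page's data
    if state["page"] < len(pages) - 1:
        click_next()            # the browser moves; `snapshot` does not

# Fixed pattern: re-fetch after every click.
state["page"] = 0
fixed = []
for _ in range(2):
    fixed.extend(fetch_page())  # fresh data on each pass
    if state["page"] < len(pages) - 1:
        click_next()

print(buggy)  # ['review A', 'review B', 'review A', 'review B']
print(fixed)  # ['review A', 'review B', 'review C', 'review D']
```

The buggy output is exactly the "first page twice" CSV described in the question.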
driver.get('https://www.tripadvisor.ca/')
driver.maximize_window()
time.sleep(2)
j = 2  # number of pages
for url in linksfinal:
    driver.get(url)
    comms = []
    notes = []
    dates = []
    # parse what the browser is actually showing, rather than a
    # separate copy fetched with requests
    soup = BeautifulSoup(driver.page_source, "html.parser")
    name = soup.find('h1', class_='_1mTlpMC3').text.strip()
    commentary = soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8')
    for k in range(j):  # iterate over n pages
        for container in commentary:
            comm = container.find('q', class_='IRsGHoPm').text.strip()
            comms.append(comm)
            comm1 = str(container.find("div", class_="nf9vGX55").find('span'))
            rat = re.findall(r'\d+', str(comm1))
            rat1 = (str(rat))[2]
            notes.append(rat1)
        time.sleep(3)
        next = driver.find_element_by_xpath('//a[@class="ui_button nav next primary "]')
        next.click()
        # re-collect soup and commentary *after* the click, so the next
        # pass of the loop sees the new page instead of the old snapshot
        soup = BeautifulSoup(driver.page_source, "html.parser")
        commentary = soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8')
    data = pd.DataFrame({
        'comms': comms,
        'notes': notes,
        #'dates': dates
    })
    data.to_csv(f"{name}.csv", sep=';', index=False)
    time.sleep(3)
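A separate fragility worth noting: the rating extraction (str(rat))[2] indexes into the string representation of a list, e.g. str(['45'])[2] happens to be '4', so it breaks silently the moment the markup yields anything else. A more direct version, assuming TripAdvisor's bubble_NN class convention (e.g. bubble_45 for 4.5 bubbles; that convention is an assumption worth checking against the live markup):

```python
import re

def extract_rating(span_html):
    """Return the whole-number part of a bubble rating, or None.

    Assumes a class like 'ui_bubble_rating bubble_45' (45 -> 4.5 bubbles),
    which is an assumption about the markup, not a documented API.
    """
    m = re.search(r'bubble_(\d)(\d)', span_html)
    return m.group(1) if m else None

# The original trick, for comparison:
rat = re.findall(r'\d+', '<span class="ui_bubble_rating bubble_45"></span>')
print(str(rat)[2])  # '4' - works only because str(['45']) == "['45']"

print(extract_rating('<span class="ui_bubble_rating bubble_45"></span>'))  # '4'
print(extract_rating('<span class="other"></span>'))  # None, instead of crashing
```

Matching the class name directly also returns None on unexpected markup instead of appending a stray bracket character to notes.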