Converting multiple strings from selenium and Beautiful Soup into a CSV file

Problem Description

I have this scraper and I want to export the results as a CSV file in Google Colab. The scraped information comes back as string values, but I can't convert it to CSV. I want each scraped attribute ("title", "size", and so on) to fill a column in the CSV file. I've already run the strings through Beautiful Soup to strip the HTML formatting. Please see the code below.

import pandas as pd
import time
import io
from io import StringIO
import csv
#from google.colab import drive
#drive.mount('drive')
#Use new Library (kora.selenium) to run chromedriver 
from kora.selenium import wd
#Import BeautifulSoup to parse HTML formatting
from bs4 import BeautifulSoup
wd.get("https://www.grailed.com/sold/EP8S3v8V_w") #Get webpage

ScrollNumber=round(200/40)+1
for i in range(0,ScrollNumber):
  wd.execute_script("window.scrollTo(0,document.body.scrollHeight)")
  time.sleep(2)

#--------------#
#Each new attribute will have to be found using XPath because Grailed's website is rendered with JavaScript (React), not static HTML
#Only 39 results will show at first because the page is an infinite scroll, so selenium must be told to keep scrolling.
follow_loop = range(2, 200)
for x in follow_loop:
  #Title 
    title = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    title += str(x)
    title += "]/a/div[3]/div[2]/p"
    title = wd.find_elements_by_xpath(title)
    title = str(title)
  #Price 
    price = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    price += str(x)
    price += "]/div/div/p/span"
    price = wd.find_elements_by_xpath(price)
    price = str(price)
  #Size 
    size = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    size += str(x)
    size += "]/a/div[3]/div[1]/p[2]"
    size = wd.find_elements_by_xpath(size)
    size = str(size)
  #Sold 
    sold = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    sold += str(x)
    sold += "]/a/p/span"
    sold = wd.find_elements_by_xpath(sold)
    sold = str(sold)
  #Clean HTML formatting using Beautiful soup
    cleantitle = BeautifulSoup(title, "lxml").text
    cleanprice = BeautifulSoup(price, "lxml").text
    cleansize = BeautifulSoup(size, "lxml").text
    cleansold = BeautifulSoup(sold, "lxml").text
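For reference, the target is one CSV row per listing with one column per attribute. A minimal sketch of that output shape, using placeholder values in place of the scraped strings (the item data below is made up for illustration):

```python
import io
import csv

# Hypothetical rows standing in for cleantitle/cleanprice/cleansize/cleansold
rows = [
    {"title": "Jordan 1 Retro", "price": "$250", "size": "10", "sold": "Sold"},
    {"title": "Yeezy 350", "price": "$300", "size": "9.5", "sold": "Sold"},
]

# DictWriter maps each attribute name to its own column
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "price", "size", "sold"])
writer.writeheader()
writer.writerows(rows)

csv_text = buf.getvalue()
print(csv_text)
```

In Colab, replacing `buf` with `open('listings.csv', 'w', newline='')` writes the same content to a downloadable file.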

Tags: python, pandas, string, selenium, csv

Solution

This was a lot of work, haha

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import csv

driver = webdriver.Chrome()

driver.get("https://www.grailed.com/sold/EP8S3v8V_w")

# Scroll enough times to load ~200 listings (about 40 load per scroll)
scroll_count = round(200 / 40) + 1
for i in range(scroll_count):
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(2)

time.sleep(3)

# The find_elements_by_* helpers were removed in Selenium 4;
# use find_elements with a By locator instead
titles = driver.find_elements(By.CSS_SELECTOR, "p.listing-designer")
prices = driver.find_elements(By.CSS_SELECTOR, "p.sub-title.sold-price")
sizes = driver.find_elements(By.CSS_SELECTOR, "p.listing-size.sub-title")
sold = driver.find_elements(By.CSS_SELECTOR, "div.-overlay")

data = [titles, prices, sizes, sold]

# Extract the visible text from each WebElement
data = [[element.text for element in arr] for arr in data]

# newline='' stops the csv module from adding a blank line after every row
with open('sold_shoes.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # zip(*data) pairs up the i-th title, price, size, and sold status
    for row in zip(*data):
        writer.writerow(row)

If you see a blank line between every row of the file, pass newline='' to open(); the csv module requires that when writing on Windows. Also, this is a naive solution, since it assumes the four lists line up item-for-item; consider instead finding each listing's parent element and building each row from that parent's children. I also used plain Selenium without BeautifulSoup because it's easier for me, but you should learn BS too, since parsing with it is faster than querying elements through Selenium. Happy coding.
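As a sketch of the "parent element" approach suggested above — building each row from one listing card's children, so a missing field can't shift a whole column — here is a Beautiful Soup version run against made-up markup (the class names below are illustrative, not necessarily Grailed's real ones):

```python
from bs4 import BeautifulSoup

# Illustrative markup: one "card" per listing, second card missing its size
html = """
<div class="feed-item">
  <p class="listing-designer">Nike</p>
  <p class="sub-title sold-price">$120</p>
  <p class="listing-size sub-title">10</p>
</div>
<div class="feed-item">
  <p class="listing-designer">Gucci</p>
  <p class="sub-title sold-price">$400</p>
</div>
"""

def field(card, selector):
    """Text of the first child matching selector, or '' if absent."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else ""

soup = BeautifulSoup(html, "html.parser")
rows = [
    [field(card, "p.listing-designer"),
     field(card, "p.sub-title.sold-price"),
     field(card, "p.listing-size.sub-title")]
    for card in soup.select("div.feed-item")
]
print(rows)  # → [['Nike', '$120', '10'], ['Gucci', '$400', '']]
```

Because each row comes from a single card, the Gucci listing's missing size becomes an empty cell instead of misaligning the columns. The same `html` could come from `driver.page_source` after scrolling.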

