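python - Converting multiple strings from Selenium and Beautiful Soup to a CSV file
Problem description
I have this scraper that I want to export to a CSV file in Google Colab. I receive the scraped information as string values, but I cannot convert it to CSV. I want each scraped attribute ("title", "size", etc.) to populate a column in the CSV file. I have already run the strings through Beautiful Soup to strip the HTML formatting. Please see my code below.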
import pandas as pd
import time
import io
from io import StringIO
import csv
#from google.colab import drive
#drive.mount('drive')

# Use the kora.selenium library to run chromedriver
from kora.selenium import wd
# Import BeautifulSoup to parse HTML formatting
from bs4 import BeautifulSoup

wd.get("https://www.grailed.com/sold/EP8S3v8V_w")  # Get webpage

ScrollNumber = round(200/40) + 1
for i in range(0, ScrollNumber):
    wd.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(2)
#--------------#
# Each new attribute will have to be found using XPath because Grailed's website is written in JavaScript (React), not HTML
# Only 39 results will show because the JS page is infinite scroll and Selenium must be told to keep scrolling.
follow_loop = range(2, 200)
for x in follow_loop:
    # Title
    title = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    title += str(x)
    title += "]/a/div[3]/div[2]/p"
    title = wd.find_elements_by_xpath(title)
    title = str(title)
    # Price
    price = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    price += str(x)
    price += "]/div/div/p/span"
    price = wd.find_elements_by_xpath(price)
    price = str(price)
    # Size
    size = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    size += str(x)
    size += "]/a/div[3]/div[1]/p[2]"
    size = wd.find_elements_by_xpath(size)
    size = str(size)
    # Sold
    sold = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    sold += str(x)
    sold += "]/a/p/span"
    sold = wd.find_elements_by_xpath(sold)
    sold = str(sold)
    # Clean HTML formatting using Beautiful Soup
    cleantitle = BeautifulSoup(title, "lxml").text
    cleanprice = BeautifulSoup(price, "lxml").text
    cleansize = BeautifulSoup(size, "lxml").text
    cleansold = BeautifulSoup(sold, "lxml").text
Solution
That was a lot of work, lol.
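from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import csv

driver = webdriver.Chrome()
driver.get("https://www.grailed.com/sold/EP8S3v8V_w")

# Scroll to the bottom repeatedly so the infinite-scroll page loads more listings
scroll_count = round(200 / 40) + 1
for i in range(scroll_count):
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(2)
time.sleep(3)

# find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector,
# which was deprecated in Selenium 4 and later removed
titles = driver.find_elements(By.CSS_SELECTOR, "p.listing-designer")
prices = driver.find_elements(By.CSS_SELECTOR, "p.sub-title.sold-price")
sizes = driver.find_elements(By.CSS_SELECTOR, "p.listing-size.sub-title")
sold = driver.find_elements(By.CSS_SELECTOR, "div.-overlay")

# Replace each list of elements with a list of their visible text
data = [titles, prices, sizes, sold]
data = [[element.text for element in column] for column in data]

# newline='' keeps the csv module from leaving a blank line between rows on Windows
with open('sold_shoes.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # One row per listing; assumes every list has as many entries as titles
    for j in range(len(titles)):
        writer.writerow([column[j] for column in data])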
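The row-building loop above indexes every column by the length of the titles list, so it raises an IndexError as soon as another list comes back shorter. As a minimal sketch (not part of the original answer), the columns can instead be transposed with `itertools.zip_longest`, which pads short columns. The helper name `columns_to_rows` and the sample values are made up for illustration, and the CSV is written to an in-memory buffer so the sketch runs without a browser.

```python
import csv
import io
from itertools import zip_longest


def columns_to_rows(columns, fill=""):
    """Transpose column lists into rows, padding short columns with `fill`."""
    return [list(row) for row in zip_longest(*columns, fillvalue=fill)]


# Made-up sample values standing in for the scraped element text.
titles = ["Nike", "Adidas", "Vans"]
prices = ["$120", "$80"]  # one price is missing

rows = columns_to_rows([titles, prices])
# rows == [["Nike", "$120"], ["Adidas", "$80"], ["Vans", ""]]

# Write to an in-memory buffer instead of a file for the demonstration.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Padding only keeps the file writable. If a value is missing in the middle of a column, every later value in that column moves up one row and no longer lines up with its title, so building each row from a single listing's parent element, as the answer suggests, is the more reliable approach.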
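If you see a blank line between rows in the file, that comes from opening it without newline='' (the csv module documentation recommends passing it), and it mainly shows up on Windows. Also, this is a naive solution because it assumes every list is the same size; consider locating each listing's parent element once and building the row from its child elements. I used Selenium alone, without BeautifulSoup, because that was easier for me, but BeautifulSoup is worth learning too, since parsing HTML with it is generally faster than querying elements through Selenium. Happy coding.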
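The parent-element idea can be sketched as follows. This is an illustration under assumptions, not the answerer's code: the helper names `card_to_row` and `write_rows` are hypothetical, and so is the card selector `div.feed-item` mentioned in the comments (check the page's actual markup). The four field selectors are the ones used in the answer. The locator strategy is passed as the string "css selector", which is the value of Selenium's `By.CSS_SELECTOR`, so the block itself does not need to import Selenium.

```python
import csv

# Column name -> CSS selector, relative to one listing card.
# These selectors are taken from the answer above.
FIELDS = {
    "title": "p.listing-designer",
    "price": "p.sub-title.sold-price",
    "size": "p.listing-size.sub-title",
    "sold": "div.-overlay",
}


def card_to_row(card):
    """Build one CSV row (a dict) from a single listing card element."""
    row = {}
    for column, selector in FIELDS.items():
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        matches = card.find_elements("css selector", selector)
        # A missing attribute becomes an empty cell instead of shifting columns.
        row[column] = matches[0].text if matches else ""
    return row


def write_rows(path, rows):
    """Write the row dicts to a CSV file with a header line."""
    # newline="" avoids blank lines between rows on Windows.
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=list(FIELDS))
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical usage with a live driver (the card selector is an assumption):
#   cards = driver.find_elements("css selector", "div.feed-item")
#   write_rows("sold_shoes.csv", [card_to_row(card) for card in cards])
```

Because each row is read from one card, a listing that lacks a price or size leaves only its own cell empty, and the header line gives every attribute its own named column, which is what the question asked for.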