Storing scraped data from a Selenium script as CSV, JSON, MySQL/SQL, TXT, etc. using Python

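Problem description

I am scraping data from a website and want to store it in a format such as JSON, Excel, SQLite, or plain text, so that the data is organized and easy to read. Please help.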

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.amazon.in/Skybags-Brat-Black-Casual-Backpack/dp/B08Z1HHHTD/ref=sr_1_2?dchild=1&keywords=skybags&qid=1627786382&sr=8-2')

product_title = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "productTitle"))).text

print(product_title)

WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@data-hook='see-all-reviews-link-foot']"))).click()
    
while True:
    for item in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-hook='review']"))):
        # find_element_by_* helpers were removed in Selenium 4; use find_element(By...., ...) instead.
        reviewer = item.find_element(By.CSS_SELECTOR, "span.a-profile-name").text
        review = ' '.join(i.text.strip() for i in item.find_elements(By.XPATH, ".//span[@data-hook='review-body']"))
        print(reviewer, review)

    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//*[@data-hook='pagination-bar']//a[contains(@href,'/product-reviews/') and contains(text(),'Next page')]"))).click()
        WebDriverWait(driver, 10).until(EC.staleness_of(item))
    except Exception:
        # No "Next page" link was found (or the wait timed out): stop paginating.
        break

driver.quit()

Tags: python, json, excel, selenium, web-scraping

Solution


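Store the values product_title, review, and reviewer in a dictionary, then use the json module to convert it to JSON.

You can keep the data in the following structure while scraping and convert the list to JSON at the end.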

import json

# Placeholder strings stand in for the scraped values; add one dict per review.
lst = [{"product_title": "<title>", "reviews": [{"review": "<review>", "reviewer": "<reviewer>"}, {"review": "<review>", "reviewer": "<reviewer>"}]}]
json.dumps(lst)
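
The scraping loop in the question only prints each reviewer and review. Below is a minimal sketch of how those values could be collected into the structure above instead, assuming the pairs have already been extracted as plain strings. The helper `build_record` and the sample data are hypothetical names used for illustration and are not part of the original script.

```python
import json


def build_record(product_title, pairs):
    """Return the list-of-dicts structure for one product.

    `pairs` is an iterable of (reviewer, review) tuples, for example
    collected inside the scraping loop instead of being printed.
    """
    reviews = [{"review": review, "reviewer": reviewer} for reviewer, review in pairs]
    return [{"product_title": product_title, "reviews": reviews}]


# Hypothetical sample data standing in for scraped values.
sample_pairs = [("Alice", "Sturdy bag."), ("Bob", "Zip broke after a month.")]
lst = build_record("Example Backpack", sample_pairs)

# ensure_ascii=False keeps non-ASCII review text readable in the output.
json_text = json.dumps(lst, ensure_ascii=False, indent=2)
print(json_text)
```

In the Selenium script, this would probably mean replacing `print(reviewer, review)` with appending `(reviewer, review)` to a list, then calling the helper once after the loop ends.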

Write the data to a JSON file

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(lst, f, ensure_ascii=False)
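
The question also mentions tabular and database formats. The following is a rough sketch, using only the standard library, of how data shaped like `lst` might be flattened and written to a CSV file and a SQLite database. The file names `reviews.csv` and `reviews.db`, the table name `reviews`, and the sample values are assumptions made for illustration.

```python
import csv
import sqlite3

# Hypothetical sample in the same shape as the `lst` structure from the answer.
lst = [{"product_title": "Example Backpack",
        "reviews": [{"review": "Sturdy bag.", "reviewer": "Alice"},
                    {"review": "Zip broke after a month.", "reviewer": "Bob"}]}]

# Flatten the nested structure into one row per review.
rows = [(product["product_title"], r["reviewer"], r["review"])
        for product in lst for r in product["reviews"]]

# CSV: newline="" lets the csv module control line endings itself.
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["product_title", "reviewer", "review"])
    writer.writerows(rows)

# SQLite: parameterized inserts avoid quoting problems in review text.
conn = sqlite3.connect("reviews.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS reviews "
    "(product_title TEXT, reviewer TEXT, review TEXT)"
)
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()
```

The CSV file opens directly in Excel. Note that running the sketch twice appends the same rows to the SQLite table again, so a real script would likely need a uniqueness rule or a fresh table per run.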
