How do I correctly save scraped data from multiple pages in a Pandas DataFrame?

Problem description

I have written a web scraper that scrapes multiple product pages of the same website (by "same website" I mean ebay.com; by "multiple product pages" I mean e.g. ebay.com/perfumes and ebay.com/cameras). I am trying to save the scraped data as a csv file using a Pandas DataFrame. I can print the data in my terminal, but each page is printed separately, whereas I want all the pages printed together. Also, only the most recent data ends up in the csv file, so the first set of data is not saved. Here is the scraper code, in which I create a DataFrame and save it to a csv file.

import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd


def scrape_products():
    df = pd.DataFrame(columns=['Name', 'Price', 'Condition', 'Category', 'Item No', 'EAN', 'Postage', 'RRP'])
    website_address = [
        'https://www.ebay.co.uk/itm/The-Discworld-series-Carpe-jugulum-by-Terry-Pratchett-Paperback-Amazing-Value/293566021594?hash=item4459e5ffda:g:yssAAOSw3NBfQ7I0',
        'https://www.ebay.co.uk/itm/Edexcel-AS-A-level-history-Germany-and-West-Germany-1918-89-by-Barbara/293497601580?hash=item4455d1fe2c:g:6lYAAOSwbRFeXGqL']
    options = webdriver.ChromeOptions()
    options.add_argument('start-maximized')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)

    browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    for web in website_address:
        browser.get(web)
        time.sleep(2)

        product_price_raw_list = browser.find_element_by_xpath('//*[@id="vi-mskumap-none"]').text
        product_name_raw_lst = browser.find_element_by_xpath('//*[@id="itemTitle"]').text
        product_condition = browser.find_element_by_xpath('//*[@id="vi-itm-cond"]').text
        product_category = browser.find_element_by_xpath(
            '//*[@id="vi-VR-brumb-lnkLst"]/table/tbody/tr/td/ul/li[1]').text
        product_ebay_item_no = browser.find_element_by_xpath('//*[@id="descItemNumber"]').text
        product_ean = browser.find_element_by_xpath('//*[@id="viTabs_0_is"]/div/table[2]/tbody/tr[2]/td[4]').text
        product_postage = browser.find_element_by_xpath('//*[@id="shSummary"]').text
        product_rrp = browser.find_elements_by_css_selector('.actPanel  div div:nth-child(2) span')
        if product_rrp:  # has results
            print(product_rrp[0].text)
        else:
            print('no rpp')
        # return product_price_raw_list, product_name_raw_lst, product_condition, product_category, product_ebay_item_no, product_ean, product_postage, product_rrp

        data_frame = pd.DataFrame([[product_price_raw_list, product_name_raw_lst, product_condition, product_category, product_ebay_item_no, product_ean, product_postage, product_rrp]], columns=['Name', 'Price', 'Condition', 'Category', 'Item No', 'EAN', 'Postage', 'RRP'])
        final_df = df.append(data_frame, ignore_index=True)
        final_df.to_csv('saving_scraped.csv', index=False)
        print(final_df.head())
        print('END.')


if __name__ == "__main__":
    scrape_products()

Here is the output on my terminal:

[WDM] - Driver [/home/user/.wdm/drivers/chromedriver/linux64/74.0.3729.6/chromedriver] found in cache
Was:
£7.99
    Name                                              Price  ...                              Postage                                                RRP
0  £4.99  The Discworld series: Carpe jugulum by Terry P...  ...  Doesn't post to India | See details  [<selenium.webdriver.remote.webelement.WebElem...

[1 rows x 8 columns]
END.
£6.68
    Name                                              Price  ...                              Postage                                                RRP
0  £6.68  Edexcel AS/A-level history. Germany and West G...  ...  Doesn't post to India | See details  [<selenium.webdriver.remote.webelement.WebElem...

[1 rows x 8 columns]
END.

Here is the data saved in my csv file:

Name,Price,Condition,Category,Item No,EAN,Postage,RRP
£6.68,"Edexcel AS/A-level history. Germany and West Germany, 1918-89 by Barbara",Good,"Books, Comics & Magazines",293497601580,9781471876493,Doesn't post to India | See details,"[<selenium.webdriver.remote.webelement.WebElement (session=""35378d132e0de972988548942dd94321"", element=""0.28667503759225554-8"")>, <selenium.webdriver.remote.webelement.WebElement (session=""35378d132e0de972988548942dd94321"", element=""0.28667503759225554-9"")>, <selenium.webdriver.remote.webelement.WebElement (session=""35378d132e0de972988548942dd94321"", element=""0.28667503759225554-10"")>, <selenium.webdriver.remote.webelement.WebElement (session=""35378d132e0de972988548942dd94321"", element=""0.28667503759225554-11"")>, <selenium.webdriver.remote.webelement.WebElement (session=""35378d132e0de972988548942dd94321"", element=""0.28667503759225554-12"")>, <selenium.webdriver.remote.webelement.WebElement (session=""35378d132e0de972988548942dd94321"", element=""0.28667503759225554-13"")>, <selenium.webdriver.remote.webelement.WebElement (session=""35378d132e0de972988548942dd94321"", element=""0.28667503759225554-14"")>, <selenium.webdriver.remote.webelement.WebElement (session=""35378d132e0de972988548942dd94321"", element=""0.28667503759225554-15"")>, <selenium.webdriver.remote.webelement.WebElement (session=""35378d132e0de972988548942dd94321"", element=""0.28667503759225554-16"")>]"

How do I make sure that the data I print on the terminal and save in my csv file looks like this:

Name,Price,Condition,Category,Item No,EAN,Postage,RRP
abc,12,good,movies,jno,987,2343,USA,9dollars
xyz,13,very good,scifi,ojk,7675,990,NZ,19Pounds

I can't see where I am going wrong. Please help. Thanks!

Tags: python, python-3.x, pandas, selenium-webdriver, selenium-chromedriver

Solution


I think this line

        final_df = df.append(data_frame, ignore_index=True)

should be replaced with

        final_df = final_df.append(data_frame, ignore_index=True)

That way each new data_frame is appended to the rows collected in previous iterations of the loop instead of overwriting them. (Note that final_df must exist before the first iteration, so either rename the empty template DataFrame at the top of the function from df to final_df, or add final_df = df before the loop; otherwise the first iteration raises a NameError.)
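As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current pandas the same accumulation pattern is written with pd.concat. A minimal sketch of that pattern, using two placeholder rows and illustrative column names in place of the scraped values:

```python
import pandas as pd

# Start from an empty template; each iteration appends to the accumulated frame.
final_df = pd.DataFrame(columns=['Name', 'Price'])
for row in [['abc', '£4.99'], ['xyz', '£6.68']]:
    data_frame = pd.DataFrame([row], columns=['Name', 'Price'])
    # pd.concat replaces the removed DataFrame.append
    final_df = pd.concat([final_df, data_frame], ignore_index=True)
print(final_df)
```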

You should also move these lines

        final_df.to_csv('saving_scraped.csv', index=False)
        print(final_df.head())
        print('END.')

outside the loop, since you don't need to save final_df on every iteration.

If you want to monitor progress while looping over the different pages, you can replace the print statement inside the loop with

       print(data_frame.head())

so that you print whatever was scraped in that iteration rather than the aggregated results in final_df.
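Putting the pieces together, here is a sketch of the corrected flow: collect one row per page inside the loop, then build and save the DataFrame once after the loop. Two further issues in the original code are worth noting. First, the value list passed to pd.DataFrame is in a different order from the column list (the price ends up under 'Name'); building each row as a dict avoids that. Second, product_rrp is a list of WebElement objects, so its text (e.g. product_rrp[0].text) should be stored rather than the elements themselves. The Selenium calls are elided here, and the two placeholder rows merely stand in for the scraped values:

```python
import pandas as pd

COLUMNS = ['Name', 'Price', 'Condition', 'Category', 'Item No', 'EAN', 'Postage', 'RRP']


def save_rows(rows, path='saving_scraped.csv'):
    """Build a single DataFrame from all scraped pages and write it once."""
    final_df = pd.DataFrame(rows, columns=COLUMNS)
    final_df.to_csv(path, index=False)
    return final_df


# Inside the scraping loop you would build one dict per page, e.g.:
#   rows.append({'Name': product_name_raw_lst,
#                'Price': product_price_raw_list,
#                ...,
#                'RRP': product_rrp[0].text if product_rrp else ''})
# Two placeholder rows stand in for the scraped values here:
rows = [
    {'Name': 'abc', 'Price': '£4.99', 'Condition': 'good', 'Category': 'books',
     'Item No': '111', 'EAN': '987', 'Postage': 'See details', 'RRP': '£7.99'},
    {'Name': 'xyz', 'Price': '£6.68', 'Condition': 'very good', 'Category': 'books',
     'Item No': '222', 'EAN': '7675', 'Postage': 'See details', 'RRP': '£6.68'},
]
final_df = save_rows(rows)
print(final_df[['Name', 'Price']])
```

Because each row is a dict keyed by column name, pandas places every value under the right header regardless of the order in which the fields were scraped.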

