Is there a more efficient way to scrape JB HI-FI? This basically takes a whole day

Problem description

My code is below. Searching one page takes about 10 seconds, and I am essentially searching JB HI-FI from a to z across pages 1 to 200. I then save the data into two lists: the item titles (e.g. TVs) and their corresponding prices.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

name = []
price = []

alpha = ['a', 'c', 'e', 'g', 'i', 'k', 'm', 'o', 'q', 's', 'u', 'w', 'y']


for alphabet in alpha:
    for i in range(1, 200):

        url = 'https://www.jbhifi.com.au/?q=' + alphabet + '&hPP=36&idx=shopify_products&p=' + str(i)
        print(url)


        options = Options()
        options.add_argument('--headless')

        driver = webdriver.Firefox(options=options)
        driver.get(url)

        soup = BeautifulSoup(driver.page_source, 'lxml')

        # count the titles on this page so the price query below stays aligned
        count = 0

        for item in soup.find_all("h4", {'class': 'ais-hit--title product-tile__title'}):
            count += 1
            name.append(item.get_text(strip=True))

        for item in soup.find_all("span", {'class': ['ais-hit--price price', 'sale']}, limit=count):
            price.append(item.get_text(strip=True))

        driver.quit()  # quit() shuts down the browser and driver process; close() only closes the window
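As an aside, appending titles and prices into two separate lists only stays aligned if every product tile yields exactly one title and exactly one price. A sketch of the pairing, run offline against a made-up miniature of the page markup (the HTML below is invented for illustration, not copied from the real site):

```python
from bs4 import BeautifulSoup

# Invented miniature of the product-grid markup, just to show the pairing.
sample = """
<div>
  <h4 class="ais-hit--title product-tile__title">Some TV</h4>
  <span class="ais-hit--price price">$499</span>
  <h4 class="ais-hit--title product-tile__title">Some Soundbar</h4>
  <span class="ais-hit--price price">$199</span>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')
titles = [t.get_text(strip=True) for t in soup.find_all('h4', class_='product-tile__title')]
prices = [p.get_text(strip=True) for p in soup.find_all('span', class_='price')]

# zip keeps each title with its price in one tuple instead of two parallel lists
pairs = list(zip(titles, prices))
print(pairs)  # [('Some TV', '$499'), ('Some Soundbar', '$199')]
```

Pairing per tile like this makes a missing price fail loudly (a short `pairs` list) instead of silently shifting every later row.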

Tags: python, web, beautifulsoup

Solution


I was able to find where the site loads its data from and write it all to a CSV file.

It is capped at 12,000 results. I sorted the index ascending (ASC) and was able to pull everything out, which is better than continuing to search by alphabet, which produces duplicate results.

Here is the code, which you can run online:

import requests
import csv
from tqdm import tqdm

names = []
prices = []

# 12 pages of 1000 hits each covers the 12,000-result cap
for page in tqdm(range(0, 12)):
    data = {"requests": [
        {"indexName": "shopify_products_price_asc", "params": f'hitsPerPage=1000&page={page}&filters=(price > 0 AND product_published = 1 AND availability.displayProduct = 1)&facets=["facets.Price","facets.Category","facets.Brand"]&tagFilters='}]}
    r = requests.post('https://vtvkm5urpx-1.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1);%20Browser%20(lite);%20instantsearch.js%202.10.5;%20JS%20Helper%20(2.28.0)&x-algolia-application-id=VTVKM5URPX&x-algolia-api-key=a0c0108d737ad5ab54a0e2da900bf040', json=data).json()
    for result in r['results']:
        for hit in result['hits']:
            names.append(hit['title'])
            if hit['pricing']['displayWasPrice']:
                price, discount = hit['pricing']['displayPriceInc'], hit['pricing']['saveAmount']
            else:
                price, discount = hit['pricing']['displayPriceInc'], "N/A"
            prices.append((price, discount))


with open('result.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Price', 'Discount'])
    for name, (price, discount) in zip(names, prices):
        writer.writerow([name, price, discount])
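The discount branch can be checked offline with sample hit dicts. The dicts below mirror only the pricing fields the loop reads; the values are invented for illustration:

```python
# Same branch as the scraper's loop, extracted into a function for testing.
def extract_pricing(hit):
    pricing = hit['pricing']
    if pricing['displayWasPrice']:
        # item is on sale: report current price and the amount saved
        return pricing['displayPriceInc'], pricing['saveAmount']
    # no "was" price shown: no discount to report
    return pricing['displayPriceInc'], "N/A"

# Made-up hits shaped like the fields used above (values are invented).
on_sale = {'pricing': {'displayWasPrice': True, 'displayPriceInc': 395.0, 'saveAmount': 100.0}}
full_price = {'pricing': {'displayWasPrice': False, 'displayPriceInc': 49.0}}

print(extract_pricing(on_sale))     # (395.0, 100.0)
print(extract_pricing(full_price))  # (49.0, 'N/A')
```

Factoring the branch out like this also makes it easy to extend later, e.g. if other pricing fields turn out to matter.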
