Scraping a website with Beautiful Soup, but it can't scrape every <li>

  • Problem description

    I'm a complete beginner and I've run into some problems with web scraping. I'm able to scrape the picture, title, and price, and I successfully scraped index [0]. But whenever I try to run a loop, or hard-code an index higher than 0, it says the index is out of range, and it won't scrape any of the other <li> tags. Is there another way to go about this? Also, I added selenium so that the whole page loads. Any help would be greatly appreciated.

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import time
    
    
    PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes aren't treated as escape sequences
    
    driver = webdriver.Chrome(PATH)
    
    driver.get("https://ca.octobersveryown.com/collections/all")
    
    # scroll to the bottom repeatedly so lazy-loaded products are rendered
    for _ in range(23):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(0.2)
    
    html = driver.page_source
    
    soup = BeautifulSoup(html, 'html.parser')
    
    bodies = soup.find(id='content')
    
    # note: this collects the matching <ul> elements themselves, not the product <li> items
    clothing = bodies.find_all('ul', class_='grid--full product-grid-items')
    
    # remove screen-reader-only helper text so get_text() returns clean values
    for span_tag in soup.find_all(class_='visually-hidden'):
        span_tag.replace_with('')
    
    print(clothing[0].find('img')['src'])
    print(clothing[0].find(class_='product-title').get_text())
    print(clothing[0].find(class_='grid-price-money').get_text())
    
    time.sleep(8)
    
    driver.quit()
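    A likely cause of the out-of-range error above: `find_all('ul', class_='grid--full product-grid-items')` returns the list of matching `<ul>` elements, and there is usually only one such `<ul>` on the page, so `clothing[1]` does not exist. The individual products are the `<li>` children of that single `<ul>`, so those are what should be iterated. A minimal sketch with hypothetical markup mirroring the page's structure:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the page's structure (for illustration only)
html = """
<div id="content">
  <ul class="grid--full product-grid-items">
    <li><p class="product-title">Item A</p></li>
    <li><p class="product-title">Item B</p></li>
    <li><p class="product-title">Item C</p></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all('ul', ...) returns the matching <ul> ELEMENTS -- usually just
# one -- so indexing the result with [1] raises IndexError.
uls = soup.find(id='content').find_all('ul', class_='grid--full product-grid-items')
print(len(uls))

# Iterate the <li> children of that single <ul> instead:
items = uls[0].find_all('li')
for li in items:
    print(li.get_text(strip=True))
```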
    

    Tags: python, html, selenium, web-scraping, beautifulsoup

    Solution


    If you only want to use BeautifulSoup, without selenium, you can simulate the Ajax requests that the page is making. For example:

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'
    
    page = 1
    while True:
        soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    
        li = soup.find_all('li', recursive=False)
        if not li:
            break
    
        for l in li:
            print(l.select_one('p a').get_text(strip=True))
            print('https:' + l.img['src'])
            print(l.select_one('.grid-price').get_text(strip=True, separator=' '))
            print('-' * 80)
    
        page += 1
    

    Prints:

    LIGHTWEIGHT RAIN SHELL
    https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-dark-red-1_large.jpg?v=1598583974
    £178.00
    --------------------------------------------------------------------------------
    LIGHTWEIGHT RAIN SHELL
    https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-black-1_large.jpg?v=1598583976
    £178.00
    --------------------------------------------------------------------------------
    ALL COUNTRY HOODIE
    https://cdn.shopify.com/s/files/1/1605/0171/products/all-country-hoodie-white-1_large.jpg?v=1598583978
    £148.00
    --------------------------------------------------------------------------------
    
    ...and so on.
    

    EDIT (saving to CSV):

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'
    
    page = 1
    all_data = []
    while True:
        soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    
        li = soup.find_all('li', recursive=False)
        if not li:
            break
    
        for l in li:
            d = {'name': l.select_one('p a').get_text(strip=True),
                 'link': 'https:' + l.img['src'],
                 'price': l.select_one('.grid-price').get_text(strip=True, separator=' ')}
            all_data.append(d)
            print(d)
            print('-' * 80)
    
        page += 1
    
    df = pd.DataFrame(all_data)
    df.to_csv('data.csv')
    print(df)
    

    Prints:

                                      name  ...            price
    0               LIGHTWEIGHT RAIN SHELL  ...          £178.00
    1               LIGHTWEIGHT RAIN SHELL  ...          £178.00
    2                   ALL COUNTRY HOODIE  ...          £148.00
    3                   ALL COUNTRY HOODIE  ...          £148.00
    4                   ALL COUNTRY HOODIE  ...          £148.00
    ..                                 ...  ...              ...
    271  OVO ESSENTIALS LONGSLEEVE T-SHIRT  ...           £58.00
    272                OVO ESSENTIALS POLO  ...           £68.00
    273             OVO ESSENTIALS T-SHIRT  ...           £48.00
    274                 OVO ESSENTIALS CAP  ...           £38.00
    275           POM POM COTTON TWILL CAP  ...  £32.00 SOLD OUT
    
    [276 rows x 3 columns]
    

    and it saves data.csv (shown in the answer as a LibreOffice screenshot).

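    One note on the saved file: `df.to_csv('data.csv')` as written also writes the DataFrame's integer index as an extra, unnamed first column, which is why the spreadsheet shows a leading column of row numbers. Passing `index=False` drops it. A minimal sketch with made-up rows in the same shape the scraper collects:

```python
import pandas as pd

# Hypothetical rows in the same shape the scraper collects (name, link, price)
rows = [
    {'name': 'LIGHTWEIGHT RAIN SHELL', 'link': 'https://example.com/a.jpg', 'price': '£178.00'},
    {'name': 'ALL COUNTRY HOODIE', 'link': 'https://example.com/b.jpg', 'price': '£148.00'},
]
df = pd.DataFrame(rows)

# index=False keeps the CSV to just the three data columns
df.to_csv('data.csv', index=False)

# Reading it back restores the same columns, with no extra index column
df2 = pd.read_csv('data.csv')
print(df2.columns.tolist())
```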

