Web scraping: adding product names and their corresponding prices to a pandas DataFrame

Problem description

I am practicing web scraping and want to extract product names and prices into a pandas DataFrame.

Here is my code:

for web in website:
  r=re.get(web)
  soup = BeautifulSoup(r.content, 'html.parser')
  for i in soup.find_all("div", {"class":"row productspm"}): 
    for name in soup.find_all("h4"):
      if name.text not in productname:
        productname.append(name.text)
    for price in soup.find_all("p",{'class':"price text-right"}):
      prices.append(price.text)


print(len(productname))

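When I extract the data, I do not get any errors, but the DataFrame contains wrong information.

First, instead of 43 products, 61 product names are extracted. Second, the product prices do not match the prices shown on the website. When a product is on sale, the site uses different HTML, which causes problems in the scraping.

Here is the HTML for a non-sale product on the website: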

<div class="product-layout product-grid col-lg-3 col-md-4 col-sm-6 col-xs-12">
          <div class="product-thumb transition">
      <div class="image"><a href="---"><img src="--" alt="BREATHING BAG 3L N-LATEX PARKER" title="BREATHING BAG 3L N-LATEX PARKER" class="img-responsive center-block"></a>
          <!-- Webiarch Images Start -->
                                   
          <!-- End -->
                              <div class="topbutton">
        <button type="button" data-toggle="tooltip" title="" onclick="wishlist.add('250');" data-original-title="Add to Wish List"><svg width="20px" height="20px"><use xlink:href="#wishlist"></use></svg><span class="hidden-xs"></span></button>
        <button type="button" data-toggle="tooltip" title="" onclick="compare.add('250');" class="wishcom" data-original-title="Compare this Product"><svg width="20px" height="20px"><use xlink:href="#pcom"></use></svg><span class="hidden-xs"></span></button>
         <div class="bquickv" title="" data-toggle="tooltip" data-original-title="quickview"><div class="webi-ownstyle webi-quickview"><a href="#"><svg width="20px" height="20px"><use xlink:href="#pquick"></use></svg></a></div></div>
      </div>
      </div>
      <div class="caption">
        <h4><a href="---">BREATHING BAG 3L N-LATEX PARKER</a></h4>
        <p class="list-des">BREATHING
  BAG 3L N-LATEX PARKER..</p>
                  <div class="rating pull-left">          <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
                    <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
                    <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
                    <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
                    <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
          </div>
                

                
                    <p class="price text-right"> SAR 150</p>
                          <div class="clearfix"></div>
      <div class="button-group">
        <button type="button" onclick="cart.add('250');" class="acart">
          <span>Add to Cart</span>
        </button>
      </div>
      </div>
      
    </div>
        </div>

Here is a product that is on sale:

<div class="product-layout product-grid col-lg-3 col-md-4 col-sm-6 col-xs-12">
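          <div class="product-thumb transition">
      <div class="image"><a href="---"><img src="--" alt="Everbrite In-Office Tooth Whitening Kit (3 Patients)" title="Everbrite In-Office Tooth Whitening Kit (3 Patients)" class="img-responsive center-block"></a>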
          <!-- Webiarch Images Start -->
                                   
          <!-- End -->
                                 <span class="salep">sale</span>
                      <div class="topbutton">
        <button type="button" data-toggle="tooltip" title="" onclick="wishlist.add('189');" data-original-title="Add to Wish List"><svg width="20px" height="20px"><use xlink:href="#wishlist"></use></svg><span class="hidden-xs"></span></button>
        <button type="button" data-toggle="tooltip" title="" onclick="compare.add('189');" class="wishcom" data-original-title="Compare this Product"><svg width="20px" height="20px"><use xlink:href="#pcom"></use></svg><span class="hidden-xs"></span></button>
         <div class="bquickv" title="" data-toggle="tooltip" data-original-title="quickview"><div class="webi-ownstyle webi-quickview"><a href="#"><svg width="20px" height="20px"><use xlink:href="#pquick"></use></svg></a></div></div>
      </div>
      </div>
      <div class="caption">
        <h4><a href="---">Everbrite In-Office Tooth Whitening Kit (3 Patients)</a></h4>
        <p class="list-des">Everbrite In-Office Tooth Whitening Kit (3 Patients)
Used for Dentamerica Whitening System. One hour..</p>
                  <div class="rating pull-left">          <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
                    <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
                    <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
                    <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
                    <span class="fa fa-stack"><i class="fa fa-star-o fa-stack-2x"></i></span>
          </div>
                

                
                   <p class="pricedis price text-right"><span class="price-new"> SAR 275</span> <span class="price-old"> SAR 345</span></p>
                          <div class="clearfix"></div>
      <div class="button-group">
        <button type="button" onclick="cart.add('189');" class="acart">
          <span>Add to Cart</span>
        </button>
      </div>
      </div>
      
    </div>
        </div>

Can someone tell me where I made a mistake and how to correct it? Thank you very much.

This is the list of prices I get:

prices
[' SAR 110', ' SAR 41', ' SAR 1,760', ' SAR 150', ' SAR 3,103', ' SAR 5,770', ' SAR 540', ' SAR 4,900', ' SAR 2,650', ' SAR 603', ' SAR 58', ' SAR 15', ' SAR 15', ' SAR 3,200', ' SAR 35', ' SAR 890', ' SAR 75', ' SAR 10,500', ' SAR 1,560', ' SAR 2,421', ' SAR 4,904', ' SAR 223', ' SAR 5,072', ' SAR 1,600', ' SAR 9,700', ' SAR 354', ' SAR 25,600', ' SAR 1,800', ' SAR 84', ' SAR 256', ' SAR 120', ' SAR 349', ' SAR 2,100', ' SAR 21,500', ' SAR 15', ' SAR 3,450']

Tags: python, pandas, web-scraping

Solution


It is hard to give a precise recommendation based on the information provided, so it would help to improve the question.

What happens?

The mismatch between names and prices comes from the way you loop over the response and append to the lists. Names and prices are collected independently of each other, so nothing ties a price to its product.
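To make this concrete, here is a minimal sketch on trimmed-down markup modelled on the snippets in the question. The extra `<h4>Newsletter</h4>` heading is a hypothetical stand-in for any heading outside the product grid; the real page may differ. It shows two effects that may explain the numbers in the question: a document-wide `find_all("h4")` also picks up headings that are not products, and a multi-word class string such as `"price text-right"` is compared against the exact value of the `class` attribute, so the sale markup (`pricedis price text-right`) is skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modelled on the question's snippets.
html = """
<div class="row productspm">
  <div class="product-layout">
    <h4><a href="#">BREATHING BAG 3L N-LATEX PARKER</a></h4>
    <p class="price text-right"> SAR 150</p>
  </div>
  <div class="product-layout">
    <h4><a href="#">Everbrite In-Office Tooth Whitening Kit (3 Patients)</a></h4>
    <p class="pricedis price text-right"><span class="price-new"> SAR 275</span> <span class="price-old"> SAR 345</span></p>
  </div>
</div>
<h4>Newsletter</h4>
"""
soup = BeautifulSoup(html, "html.parser")

# Document-wide searches, as in the question: two unrelated lists.
names = [h4.text for h4 in soup.find_all("h4")]
prices = [p.text for p in soup.find_all("p", {"class": "price text-right"})]

print(len(names), names)    # includes the non-product heading
print(len(prices), prices)  # the sale product's price is missing
```

With two products on the page, this yields three names and one price, so the lists cannot be zipped into rows reliably.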

How to fix that?

You should grab all the information in one step, like this:

data = []

for item in soup.select('div.row.productspm > div'):
    data.append({
        'name':item.h4.get_text(),
        'price': item.select_one('p.price').get_text('^^', strip=True).split('^^')[0]
    })

Because it is not clear which price you want, this takes the regular price or, for sale items, only the new price:

'price': item.select_one('p.price').get_text('^^', strip=True).split('^^')[0]
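As a small illustration of what that expression does, the sketch below applies it to the two price paragraphs quoted in the question. The helper name `first_price` is hypothetical and only used for this demonstration. `get_text('^^', strip=True)` joins the stripped text fragments with `^^`, so splitting on `^^` and taking index 0 keeps the first price only.

```python
from bs4 import BeautifulSoup

# The two price paragraphs from the question's HTML snippets.
regular_html = '<p class="price text-right"> SAR 150</p>'
sale_html = (
    '<p class="pricedis price text-right">'
    '<span class="price-new"> SAR 275</span> '
    '<span class="price-old"> SAR 345</span></p>'
)

def first_price(html):
    """Hypothetical helper: return the first price text inside p.price."""
    tag = BeautifulSoup(html, "html.parser").select_one("p.price")
    # Join the stripped text fragments with '^^', then keep the first one.
    return tag.get_text("^^", strip=True).split("^^")[0]

print(first_price(regular_html))  # regular price
print(first_price(sale_html))     # new (discounted) price
```

If the old price is needed as well, the full result of `split('^^')` holds both values for sale items.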

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd


page = requests.get("https://alrazimed.me/index.php?route=product/category&path=178_115")
soup = BeautifulSoup(page.content, "html.parser")

data = []

for item in soup.select('div.row.productspm > div'):
    data.append({
        'name':item.h4.get_text(),
        'price': item.select_one('p.price').get_text('^^', strip=True).split('^^')[0]
    })

pd.DataFrame(data)

Output

    name                                                price
0   C-BRIGHT Teeth whitening accelerators               SAR 3,103
1   Everbrite At-Home Tooth Whitening Kit               SAR 120
2   Everbrite In-Office Tooth Whitening Kit (3 Pat...   SAR 275
3   Everbrite In-Office Tooth Whitening Kit (Single)    SAR 135
4   FLOCARE – 0.4% Stannous Fluoride                    SAR 35
5   LITEX 686 LED CURING AND WHITENING SYSTEM           SAR 10,500
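
The question loops over several pages. One way to extend the example is to wrap the parsing in a function and collect the rows of all pages before building the DataFrame. This is only a sketch: `parse_products` is a hypothetical helper, the inline HTML strings stand in for the responses you would get from `requests.get(url).content`, and it assumes every page uses the same `div.row.productspm` markup.

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_products(html):
    """Hypothetical helper: return one dict per product card in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("div.row.productspm > div"):
        name = item.h4
        price = item.select_one("p.price")
        if name is None or price is None:
            continue  # skip child divs that are not product cards
        rows.append({
            "name": name.get_text(strip=True),
            "price": price.get_text("^^", strip=True).split("^^")[0],
        })
    return rows

# Stand-ins for downloaded pages (assumed markup, trimmed from the question).
pages = [
    '<div class="row productspm"><div class="product-layout">'
    '<h4><a href="#">BREATHING BAG 3L N-LATEX PARKER</a></h4>'
    '<p class="price text-right"> SAR 150</p></div></div>',
    '<div class="row productspm"><div class="product-layout">'
    '<h4><a href="#">Everbrite In-Office Tooth Whitening Kit (3 Patients)</a></h4>'
    '<p class="pricedis price text-right"><span class="price-new"> SAR 275</span> '
    '<span class="price-old"> SAR 345</span></p></div></div>',
]

# Name and price are read from the same card, so each row stays consistent.
df = pd.DataFrame([row for html in pages for row in parse_products(html)])
print(df)
```

Because each row is built from a single product card, the number of names and prices always matches, whichever page the card came from.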
