
Problem Description

So I have this code. I successfully extract every product name on the page.

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

page_url = "https://www.tenniswarehouse-europe.com/catpage-WILSONRACS-EN.html"

# download the page and hand it to BeautifulSoup
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# each product sits in its own "product_wrapper cf rac" div
containers = page_soup.findAll("div", {"class":"product_wrapper cf rac"})

for container in containers:
    name = container.div.img["alt"]
    print(name)

I am trying to extract the price from the HTML below. I tried the same approach as above, but I get an "index out of range" error. I also tried the div the price sits in, and even the span, but to no avail.

<div class="product_wrapper cf rac">
   <div class="image_wrap">
      <a href="https://www.tenniswarehouse-europe.com/Wilson_Pro_Staff_RF_97_V130_Racket/descpageRCWILSON-97V13R-EN.html">
      <img class="cell_rac_img" src="https://img.tenniswarehouse-europe.com/cache/56/97V13R-thumb.jpg" srcset="https://img.tenniswarehouse-europe.com/cache/112/97V13R-thumb.jpg 2x" alt="Wilson Pro Staff RF 97 V13.0 Racket" />
      </a>
   </div>
   <div class="text_wrap">
      <a class="name " href="https://www.tenniswarehouse-europe.com/Wilson_Pro_Staff_RF_97_V130_Racket/descpageRCWILSON-97V13R-EN.html">Wilson Pro Staff RF 97 V13.0 Racket</a>
      <div class="pricing">
         <span class="price"><span class="convert_price">264,89 &euro;</span></span>
         <span class="msrp">SRP <span class="convert_price">300,00 &euro;</span></span>
      </div>
      <div class="pricebreaks">
         <span class="pricebreak">Price for 2: <span class="convert_price">242,90 &euro;</span>  each</span>
      </div>
      <div>
         <p>Wilson updates the cosmetic of Federer's RF97 but keeps the perfect spec profile and sublime feel that has come to define this iconic racket.  Headsize: 626cm². String Pattern: 16x19. Standard Length</p>
         <div class="cf">
            <div class="feature_links cf">
               <a class="review ga_event" href="/Reviews/97V13R/97V13Rreview.html" data-trackcategory="Product Info" data-trackaction="TWE Product Review" data-tracklabel="97V13R - Wilson Pro Staff RF 97 V13.0 Racket">TW Reviews</a>
               <a class="feedback ga_event" href="/feedback.html?pcode=97V13R" data-trackcategory="Product Info" data-trackaction="TWE Customer Review" data-tracklabel="97V13R - productName">Customer Reviews</a>
               <a class="video_popup ga_event" href="/productvideo.html?pcode=97V13R" data-trackcategory="Video" data-trackaction="Cat - Product Review" data-tracklabel="Wilson_Pro_Staff_RF_97_V130_Racket">Video</a>
            </div>
         </div>
      </div>
   </div>
</div>
</td>
<td class="cat_border_cell">
   <div class="product_wrapper cf rac">

Tags: python, web-scraping, beautifulsoup

Solution


I think this will work for you:

prices = page_soup.findAll("span", {"class":"convert_price"})

You then have a collection of all the prices on the page, and you can access a single price with prices[0] ... prices[len(prices)-1]. If you want to strip the HTML tags from a price, use prices[0].text.
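For example (a minimal sketch, assuming page_soup was built from HTML that actually contains the prices; see the caveat below):

prices = page_soup.findAll("span", {"class": "convert_price"})

first_price = prices[0].text                # tags stripped, e.g. "264,89 €"
last_price = prices[len(prices) - 1].text   # equivalent to prices[-1].text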

But where does this HTML actually come from? Because the prices are not on the page at the link you use in your code, so in that soup you should not find any prices at all.

The code above works for the HTML code you provided there.
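For instance, parsing your pasted snippet directly (wrapped in a Python string here purely for illustration) does find both prices:

from bs4 import BeautifulSoup

snippet = """
<div class="pricing">
   <span class="price"><span class="convert_price">264,89 &euro;</span></span>
   <span class="msrp">SRP <span class="convert_price">300,00 &euro;</span></span>
</div>
"""

snippet_soup = BeautifulSoup(snippet, "html.parser")
for price in snippet_soup.findAll("span", {"class": "convert_price"}):
    print(price.text)  # the &euro; entity is decoded, so this prints "264,89 €" and "300,00 €"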

Edit: screenshot from the comments below (image not reproduced here).

!SOLUTION!:

One way to solve this is to use a Selenium webdriver in combination with BeautifulSoup. I could not find any other (simpler) way.

First, install Selenium: pip install selenium

Download the driver for your browser here.

What we do is click the selection button that appears when the site opens (the location/VAT prompt), and then soup the page once the prices have been loaded in. Enjoy my code below.

from bs4 import BeautifulSoup
from selenium import webdriver

# use the path of your driver executable (a raw string keeps the backslashes intact)
driver = webdriver.Firefox(executable_path=r"C:\Program Files (x86)\geckodriver.exe")
# for Chrome it's: driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe")

# open your website link
driver.get("https://www.tenniswarehouse-europe.com/catpage-WILSONRACS-EN.html")

# button for submitting the location; clicking it makes the prices load in
# (on Selenium 4 this would be driver.find_element(By.CLASS_NAME, "vat_entry_opt-submit"))
button1 = driver.find_element_by_class_name("vat_entry_opt-submit")
button1.click()

# now that the button is clicked the prices are loaded and we can soup this page
html = driver.page_source
page_soup = BeautifulSoup(html, "html.parser")

# extract all pricing blocks into an array named 'pricing';
# use class "pricing" instead of "price", because the red (discounted) prices
# sit in class "sale" rather than "price"
pricing = page_soup.findAll("div", {"class": "pricing"})

# collect every price into a list named 'price';
# pricing[i].span.text is the first <span> (the current price) of each block
price = []
for block in pricing:
    price.append(block.span.text)

# pick a single price with price[0], price[1], ..., or use the whole list

# driver.close()  # closes your webdriver window
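As a follow-up (my own sketch, not part of the original answer): with the souped page from above you can also reuse the container approach from the question to pair each racket name with its price:

containers = page_soup.findAll("div", {"class": "product_wrapper cf rac"})

for container in containers:
    name = container.div.img["alt"]
    price_tag = container.find("span", {"class": "convert_price"})
    # guard against products that have no price listed
    print(name, price_tag.text if price_tag else "n/a")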
