python - Web scraping problem when extracting product prices
Question
I have the code below. It successfully extracts the name of every product on the page.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
page_url = "https://www.tenniswarehouse-europe.com/catpage-WILSONRACS-EN.html"
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
containers = page_soup.findAll("div", {"class":"product_wrapper cf rac"})
for container in containers:
    name = container.div.img["alt"]
    print(name)
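I am trying to extract the price from the HTML below. I tried the same approach as above, but got an "index out of range" error. I also tried targeting the div that holds the price, and even the span, but to no avail.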
<div class="product_wrapper cf rac">
<div class="image_wrap">
<a href="https://www.tenniswarehouse-europe.com/Wilson_Pro_Staff_RF_97_V130_Racket/descpageRCWILSON-97V13R-EN.html">
<img class="cell_rac_img" src="https://img.tenniswarehouse-europe.com/cache/56/97V13R-thumb.jpg" srcset="https://img.tenniswarehouse-europe.com/cache/112/97V13R-thumb.jpg 2x" alt="Wilson Pro Staff RF 97 V13.0 Racket" />
</a>
</div>
<div class="text_wrap">
<a class="name " href="https://www.tenniswarehouse-europe.com/Wilson_Pro_Staff_RF_97_V130_Racket/descpageRCWILSON-97V13R-EN.html">Wilson Pro Staff RF 97 V13.0 Racket</a>
<div class="pricing">
<span class="price"><span class="convert_price">264,89 €</span></span>
<span class="msrp">SRP <span class="convert_price">300,00 €</span></span>
</div>
<div class="pricebreaks">
<span class="pricebreak">Price for 2: <span class="convert_price">242,90 €</span> each</span>
</div>
<div>
<p>Wilson updates the cosmetic of Federer's RF97 but keeps the perfect spec profile and sublime feel that has come to define this iconic racket. Headsize: 626cm². String Pattern: 16x19. Standard Length</p>
<div class="cf">
<div class="feature_links cf">
<a class="review ga_event" href="/Reviews/97V13R/97V13Rreview.html" data-trackcategory="Product Info" data-trackaction="TWE Product Review" data-tracklabel="97V13R - Wilson Pro Staff RF 97 V13.0 Racket">TW Reviews</a>
<a class="feedback ga_event" href="/feedback.html?pcode=97V13R" data-trackcategory="Product Info" data-trackaction="TWE Customer Review" data-tracklabel="97V13R - productName">Customer Reviews</a>
<a class="video_popup ga_event" href="/productvideo.html?pcode=97V13R" data-trackcategory="Video" data-trackaction="Cat - Product Review" data-tracklabel="Wilson_Pro_Staff_RF_97_V130_Racket">Video</a>
</div>
</div>
</div>
</div>
</div>
</td>
<td class="cat_border_cell">
<div class="product_wrapper cf rac">
Answer
I think this will work for you:
prices = page_soup.findAll("span", {"class":"convert_price"})
You then have a list containing every price on the page, and you can access an individual price with prices[0] ... prices[len(prices)-1]. If you want to strip the HTML tags from a price, use prices[0].text.
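If you need the price as a number rather than as text, the .text value still has to be cleaned up. The following is only a minimal sketch: parse_price is a hypothetical helper name, and it assumes the site's format uses "," as the decimal separator and "." as an optional thousands separator, as in "264,89 €".

```python
from decimal import Decimal


def parse_price(text):
    """Convert price text such as '264,89 €' into a Decimal."""
    # Drop the currency sign and surrounding whitespace.
    number = text.replace("€", "").strip()
    # Assumption: '.' groups thousands and ',' marks the decimals.
    number = number.replace(".", "").replace(",", ".")
    return Decimal(number)


print(parse_price("264,89 €"))
```

Decimal is used instead of float so that the cents are kept exactly.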
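But where exactly does this HTML come from? The prices are not on the page behind the link you use in your code, so you should not find any prices in that soup.
The code above does work for the HTML snippet you posted.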
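To keep each price paired with its product, you could search inside every container instead of over the whole page. This is a rough sketch run against a trimmed copy of the markup from the question. SAMPLE_HTML, names_and_prices, and the second product without a price are made up for illustration. The guard returns None when a container has no price, which is probably what happens with the page fetched by urlopen.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question, plus a hypothetical
# second product that has no pricing block.
SAMPLE_HTML = """
<div class="product_wrapper cf rac">
  <div class="image_wrap"><img alt="Wilson Pro Staff RF 97 V13.0 Racket" /></div>
  <div class="text_wrap">
    <div class="pricing">
      <span class="price"><span class="convert_price">264,89 €</span></span>
      <span class="msrp">SRP <span class="convert_price">300,00 €</span></span>
    </div>
  </div>
</div>
<div class="product_wrapper cf rac">
  <div class="image_wrap"><img alt="Hypothetical Racket Without Price" /></div>
  <div class="text_wrap"></div>
</div>
"""


def names_and_prices(html):
    """Return a (name, price_text_or_None) tuple for every product container."""
    page_soup = BeautifulSoup(html, "html.parser")
    results = []
    for container in page_soup.select("div.product_wrapper"):
        name = container.div.img["alt"]
        # First convert_price inside the pricing block is the current price.
        price_tag = container.select_one("div.pricing span.convert_price")
        price = price_tag.get_text(strip=True) if price_tag is not None else None
        results.append((name, price))
    return results


for name, price in names_and_prices(SAMPLE_HTML):
    print(name, "->", price)
```

Checking for None before reading the text avoids the kind of error described in the question when the price markup is missing.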
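!Solution!:
One way to solve this is to use Selenium WebDriver together with BeautifulSoup. I could not find any other (simpler) way.
First, install Selenium: pip install selenium
Then download the driver that matches your browser (geckodriver for Firefox, chromedriver for Chrome).
What we do is click the "Set Selection" button that appears when the site opens, and then soup the page once the prices have been loaded. Enjoy my code below.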
from bs4 import BeautifulSoup
from selenium import webdriver
# Note: written for Selenium 3. Selenium 4 removed find_element_by_class_name
# (use driver.find_element(By.CLASS_NAME, ...)) and the executable_path argument.
# use the path of your driver.exe (raw string so the backslashes stay literal)
driver = webdriver.Firefox(executable_path=r"C:\Program Files (x86)\geckodriver.exe")
# for Chrome it's: driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe")
# open your website link
driver.get("https://www.tenniswarehouse-europe.com/catpage-WILSONRACS-EN.html")
# button for submitting the location
button1 = driver.find_element_by_class_name("vat_entry_opt-submit")
button1.click()
# now that the button is clicked the prices are loaded in and we can soup this page
html = driver.page_source
page_soup = BeautifulSoup(html, "html.parser")
# extracting all pricing blocks into a list named pricing
# For this example you have to use class "pricing" instead of "price" because the red prices are in class "sale"
pricing = page_soup.findAll("div", {"class": "pricing"})
# a single price: replace x with the index of the price you're looking for
# price = pricing[x].span.text
# or loop and collect every price in one list named 'price'
price = []
for block in pricing:
    price.append(block.span.text)
# driver.close() closes your webdriver window
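On current Selenium releases (4.x) the same idea could look roughly like the sketch below. It is not the original answer's code. The function names prices_from_html and fetch_prices are hypothetical, the 10 second timeout is an arbitrary choice, and it assumes the button still carries the class "vat_entry_opt-submit" and that a Firefox driver can be located on your machine. An explicit wait replaces the immediate click, so the page source is only read once a pricing block is present.

```python
from bs4 import BeautifulSoup


def prices_from_html(html):
    """Return the first price text found inside every div.pricing block."""
    page_soup = BeautifulSoup(html, "html.parser")
    prices = []
    for block in page_soup.find_all("div", class_="pricing"):
        span = block.find("span")
        if span is not None:
            prices.append(span.get_text(strip=True))
    return prices


def fetch_prices(url, timeout=10):
    """Open the page in Firefox, confirm the location dialog, return the prices."""
    # Imported here so prices_from_html can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumes the Firefox driver can be found
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Wait until the location button is clickable, then click it.
        submit = (By.CLASS_NAME, "vat_entry_opt-submit")
        wait.until(EC.element_to_be_clickable(submit)).click()
        # Wait until at least one pricing block exists before reading the page.
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "pricing")))
        return prices_from_html(driver.page_source)
    finally:
        driver.quit()
```

Calling fetch_prices("https://www.tenniswarehouse-europe.com/catpage-WILSONRACS-EN.html") would then return a list of price strings, provided the page still behaves as described above.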