Web scraping from a div tag returns a random product's title when it should return the first one

Problem description

I am trying to scrape data from a website using the following code:

containers = page_soup.findAll("div", {"class": "item-info"})
container = containers[0]

Output:

<div class="item-info">
<!--brand info-->
<div class="item-branding">
<a class="item-brand" href="https://www.newegg.com/Ugg-Australia/BrandStore/ID-59551">
<img alt="Ugg Australia" src="//c1.neweggimages.com/Brandimage_70x28//Brand59551.gif" title="Ugg Australia"/>
</a>
<!--rating info-->
</div>
<!--description info-->
<a class="item-title" href="https://www.newegg.com/ugg-australia-black-boots/p/0F5-003V-00P74" title="View Details">Ugg Australia Bailey Button II Women US 5 Black Winter Boot</a>
<!--promption info-->
<p class="item-promo"></p>
<!--feature-->
<ul class="item-features">
<li><strong>Brand:</strong> Ugg Australia</li><li><strong>Type:</strong> Boots</li><li><strong>Color:</strong> Black</li><li><strong>Occasion:</strong> Specialty</li>
<li><strong>Model #: </strong>1016422/BLK</li>
<li><strong>Return Policy: </strong><a href="https://www.newegg.com/AreaTrend/about" target="_blank" title="View Return Policy(new window)">View Return Policy</a></li>
</ul>
<div class="item-action">
<!--price-->
<ul class="price ">
<li class="price-was">
       $140.00
<span class="price-was-data" style="display: none">140.00</span>

Next, when I try to scrape the title with this code:

for container in containers:
    title_container = container.findAll("a", {"class" : "item-title"})
    title_container[0].text 

I get the title of a seemingly random product on the page instead of the first product's name.
Ideally, I should get:

Ugg Australia Bailey Button II Women US 5 Black Winter Boot

What am I doing wrong?

Tags: python, html, web-scraping, beautifulsoup, tags

Solution


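.findAll returns every matching element in the HTML, so containers holds all of the products on the page, not just the first one. The for loop also rebinds container on each pass, so once the loop finishes that name refers to the last product visited rather than containers[0], which is likely why the title looks random.
To get the first title, index the result list; to get every title, iterate over it, as follows: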

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.newegg.com/Ugg-Australia/BrandStore/ID-59551").text
soup = BeautifulSoup(page, "html.parser")

results = soup.findAll("a", {"class": "item-title"})

# prints the first product on the page.
print(results[0].text.strip())

# prints all the products on the page.
for r in results:
    print(r.text.strip())
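
The same idea can be checked offline, without depending on the live page. The following is a minimal sketch, not the original author's code: SAMPLE_HTML and its three product names are made up for illustration, and find_all is the current spelling of the legacy findAll alias. It collects one title per container and also shows that the loop variable ends up pointing at the last product.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup standing in for the real listing page.
SAMPLE_HTML = """
<div class="item-info"><a class="item-title" href="#"> First product </a></div>
<div class="item-info"><a class="item-title" href="#">Second product</a></div>
<div class="item-info"><a class="item-title" href="#">Third product</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
containers = soup.find_all("div", {"class": "item-info"})

# Collect one title per container; find() returns the first match or None.
titles = []
for container in containers:
    link = container.find("a", {"class": "item-title"})
    if link is not None:
        titles.append(link.get_text(strip=True))

# The first product is simply the first collected title.
first_title = titles[0] if titles else None

# After the loop, `container` is still bound to the last element visited.
last_seen = container.find("a", {"class": "item-title"}).get_text(strip=True)

print(first_title)
print(last_seen)
```

Storing the titles in a list keeps the first product available as titles[0] regardless of where the loop variable ends up, and the None check avoids an IndexError or AttributeError if a container has no title link.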
