Web Scraping - ResultSet object has no attribute 'findAll'

Problem description

I'm having a problem with bs4 when it reads the second value of the array in the for loop. I'll paste the code below.

However, when I use line 19 (the commented-out single-URL version, URL_Array = [IEL_IDS]), I don't get any errors. When I swap it out for the entire array on line 18 (URL_Array = [SmartLiving_IDS, IEL_IDS, TD_IDS]), it errors out while trying to gather the second value. Note that the second value in the array is the same URL as the one used on line 19.

import requests
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

 
SmartLiving_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=Smart%20Living&selectedFacets=Brand%7CSmart%20Living&sortBy="
IEL_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=IEL&selectedFacets=Brand%7CIts%20Exciting%20Lighting&sortBy="
TD_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=two%20dogs&selectedFacets=Brand%7CTwo%20Dogs%20Designs&sortBy="

Headers = "Description, URL, Price \n"

text_file = open("HayneedlePrices.csv", "w")
text_file.write(Headers)
text_file.close()


URL_Array = [SmartLiving_IDS, IEL_IDS, TD_IDS]
#URL_Array = [IEL_IDS]
for URL in URL_Array:
  print("\n" + "Loading New URL:" "\n" + URL + "\n" + "\n")
  
  uClient = uReq(URL)
  page_html = uClient.read()
  uClient.close() 
  soup = soup(page_html, "html.parser")
  
  Containers = soup.findAll("div", {"product-card__container___1U2Sb"})
  for Container in Containers:

    
    Title             = Container.div.img["alt"]    
    Product_URL       = Container.a["href"]
    
    Price_Container   = Container.findAll("div", {"class":"product-card__productInfo___30YSc body no-underline txt-black"})[0].findAll("span", {"style":"font-size:20px"})

    Price_Dollars     = Price_Container[0].get_text()
    Price_Cents       = Price_Container[1].get_text()


    print("\n" + "#####################################################################################################################################################################################################" + "\n")
    # print("   Container: " + "\n" + str(Container))
    # print("\n" + "-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------" + "\n")
    print(" Description: " + str(Title))
    print(" Product URL: " + str(Product_URL))
    print("       Price: " + str(Price_Dollars) + str(Price_Cents))
    print("\n" + "#####################################################################################################################################################################################################" + "\n")
 
    text_file = open("HayneedlePrices.csv", "a")
    text_file.write(str(Title) +  ", " + str(Product_URL) + ", " + str(Price_Dollars) + str(Price_Cents) + "\n")
    text_file.close()

  print("Information gathered and Saved from URL Successfully.")
  print("Looking for Next URL..")
print("No Additional URLs to Gather. Process Completed.")

Tags: python, web-scraping, beautifulsoup

Solution


The problem is that you import BeautifulSoup as soup and then also define a variable with the same name: soup = soup(page_html, "html.parser")!
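
To see why the error only appears on the second URL, here is a minimal sketch (not part of the original answer) of what happens to the name soup across loop iterations. In bs4, calling a BeautifulSoup/Tag object is shorthand for find_all(), so the second call returns a ResultSet, and a ResultSet has no findAll attribute:

from bs4 import BeautifulSoup as soup

page_html = "<div><p>first page</p></div>"

# 1st iteration: `soup` is still the BeautifulSoup class, so this works
# and rebinds the name to a BeautifulSoup *object*.
soup = soup(page_html, "html.parser")
print(type(soup))   # <class 'bs4.BeautifulSoup'>

# 2nd iteration: `soup` is now a BeautifulSoup object. Calling it is
# equivalent to soup.find_all(...), which returns a ResultSet rather than
# a freshly parsed page...
soup = soup(page_html, "html.parser")
print(type(soup))   # <class 'bs4.element.ResultSet'>

# ...so the next line of the original script then fails:
#   soup.findAll("div", {"product-card__container___1U2Sb"})
#   AttributeError: ResultSet object has no attribute 'findAll'

Renaming either the import or the variable removes the collision, which is what the refactor below does by importing BeautifulSoup under its own name.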

I've refactored your code a bit; let me know if it works as expected!

import csv

import requests
from bs4 import BeautifulSoup

smart_living_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=Smart%20Living&selectedFacets=Brand%7CSmart%20Living&sortBy="
IEL_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=IEL&selectedFacets=Brand%7CIts%20Exciting%20Lighting&sortBy="
TD_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=two%20dogs&selectedFacets=Brand%7CTwo%20Dogs%20Designs&sortBy="

site_URLs = [smart_living_IDS, IEL_IDS, TD_IDS]

sess = requests.Session()

prod_data = []

for curr_URL in site_URLs:
    req = sess.get(url=curr_URL)
    soup = BeautifulSoup(req.content, "lxml")

    containers = soup.find_all("div", {"product-card__container___1U2Sb"})
    for curr_container in containers:
        prod_title = curr_container.div.img["alt"]
        prod_URL = curr_container.a["href"]

        # The price is split across two <span>s inside the product-info div:
        # the "main-price-dollars" span and the span that follows it.
        price_container = curr_container.find(
            "div",
            {"class": "product-card__productInfo___30YSc body no-underline txt-black"},
        )

        dollars_elem = price_container.find("span", {"class": "main-price-dollars"})
        cents_elem = dollars_elem.find_next("span")

        # Join the two parts, drop the leading "$", and convert to a float.
        prod_price = dollars_elem.get_text() + cents_elem.get_text()
        prod_price = float(prod_price[1:])

        prod_data.append((prod_title, prod_URL, prod_price))

CSV_headers = ("title", "URL", "price")

with open("../out/hayneedle_prices.csv", "w", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(CSV_headers)
    writer.writerows(prod_data)

I tested it by repeating the current list of URLs 10 times, and it took longer than I expected. There is definitely room for improvement; I may rewrite it over the next few days to use lxml directly, and multiprocessing may also be a good option. Of course, it all depends on how you are going to use it :)
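
The "multiprocessing" idea can be sketched with a thread pool instead, since the work here is network-bound rather than CPU-bound; processes would only start to pay off if the parsing itself became the bottleneck. This is a rough illustration, not the original answer's code: scrape_page is a hypothetical helper that repeats the same extraction as above, and plain requests.get is used instead of the shared session so each worker gets its own connection.

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def scrape_page(url):
    # Fetch one search page and return its (title, URL, price) tuples.
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, "lxml")

    rows = []
    for container in soup.find_all("div", {"product-card__container___1U2Sb"}):
        title = container.div.img["alt"]
        prod_url = container.a["href"]

        price_div = container.find(
            "div",
            {"class": "product-card__productInfo___30YSc body no-underline txt-black"},
        )
        dollars = price_div.find("span", {"class": "main-price-dollars"})
        cents = dollars.find_next("span")
        price = float((dollars.get_text() + cents.get_text())[1:])

        rows.append((title, prod_url, price))
    return rows


# One worker per URL; flatten the per-page lists into a single result list.
with ThreadPoolExecutor(max_workers=len(site_URLs)) as pool:
    prod_data = [row for page in pool.map(scrape_page, site_URLs) for row in page]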

