Using beautifulsoup4: avoiding AttributeError on elements that do not exist

Question

Good day!

I have run into some problems while working on a scraping project. This is the draft I have so far:

from selenium import webdriver 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException 
from selenium.webdriver.common.by import By
import requests 
from bs4 import BeautifulSoup

import time 
import random 
# Driver path (raw string so the backslashes are not read as escape sequences)
PATH = r"C:\Program Files (x86)\chromedriver"

BASE_URL = "https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167"
driver = webdriver.Chrome(PATH)
driver.implicitly_wait(30)
driver.get(BASE_URL)
time.sleep(random.uniform(3.0, 5.0))

btn = driver.find_elements_by_xpath('//*[@id="uc-btn-accept-banner"]')[0]
btn.click()
r = requests.get(BASE_URL)
soup = BeautifulSoup(r.content, "html.parser")

def reader(url):
    ls = list()
    ImmoWebCode = url.find(class_ ="classified__information--immoweb-code").text.strip()     
    Price = url.find("p", class_="classified__price").find("span",class_="sr-only").text.strip()
    Locality = url.find(class_="classified__information--address-row").find("span").text.strip()
    HouseType = url.find(class_="classified__title").text.strip()
    LivingArea = url.find("th",text="Living area").find_next(class_="classified-table__data").next_element.strip()
    RoomsNumber = url.find("th",text="Bedrooms").find_next(class_="classified-table__data").next_element.strip()
    Kitchen = url.find("th",text="Kitchen type").find_next(class_="classified-table__data").next_element.strip()
    TerraceOrientation = url.find("th",text="Terrace orientation").find_next(class_="classified-table__data").next_element.strip()
    TerraceArea = url.find("th",text="Terrace").find_next(class_="classified-table__data").next_element.strip()
    Furnished = url.find("th",text="Furnished").find_next(class_="classified-table__data").next_element.strip()
    ls.append(Furnished)
    OpenFire = url.find("th", text="How many fireplaces?").find_next(class_="classified-table__data").next_element.strip()
    GardenOrientation = url.find("th", text="Garden orientation").find_next(class_="classified-table__data").next_element.strip()
    ls.append(GardenOrientation)
    GardenArea = url.find("th",text="Garden surface").find_next(class_="classified-table__data").next_element.strip()
    PlotSurface = url.find("th",text="Surface of the plot").find_next(class_="classified-table__data").next_element.strip()
    ls.append(PlotSurface)
    FacadeNumber = url.find("th",text="Number of frontages").find_next(class_="classified-table__data").next_element.strip()
    SwimmingPoool = url.find("th",text="Swimming pool").find_next(class_="classified-table__data").next_element.strip()
    StateOfTheBuilding = url.find("th",text="Building condition").find_next(class_="classified-table__data").next_element.strip()
    return ls

print(reader(soup))

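The trouble starts when the code reaches "Locality": I get "Exception has occurred: AttributeError: 'NoneType' object has no attribute 'find'", even though the element is clearly present in the HTML. I am fairly sure it is a syntax issue, but I cannot work out how to fix it.

That brings me to my second question: this code will run on many pages, and some of them may not contain the requested elements. How can I store None as the value when that happens?

Thank you very much!

Source code: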

<div class="classified__header-secondary-info classified__informations"><p class="classified__information--property">
    3 bedrooms
        <span aria-hidden="true">|</span>
        199
<span class="abbreviation"><span aria-hidden="true">
m²                                    </span> <span class="sr-only">
square meters                                    </span></span></p> <div class="classified__information--financial"><!----> <a href="https://www.immoweb.be/en/credit-application?classified=9308167&amp;icid_to=mortgage&amp;icid_cta=classified-header" class="classified__information--mortgage-banner small"><!----> <span class="mortgage-banner__text">Request your mortgage loan</span></a></div> <div class="classified__information--address"><p><span class="classified__information--address-row"><span>
1340
</span> <span aria-hidden="true">—</span> <span>
Ottignies
</span>
|
</span> <button class="button button--text button--size-small classified__information--address-button">
Ask for the exact address
</button></p></div> <div class="classified__information--immoweb-code">
Immoweb code : 9308167
</div></div>

Tags: python, web-scraping, beautifulsoup

Answer


Isn't all of this data in the site's <table> tags? You can just use pandas:

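from io import StringIO

import requests
import pandas as pd

url = 'https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167'
# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}

response = requests.get(url, headers=headers)
# read_html returns one DataFrame per <table>; wrapping the text in StringIO
# avoids the deprecation warning newer pandas raises for literal HTML strings
dfs = pd.read_html(StringIO(response.text))

# Stack the tables and drop incomplete rows, then turn the label column (0)
# into headers and collapse the value column (1) into a single row.
# index is left at its default, which keeps the existing row index.
df = pd.concat(dfs).dropna().reset_index(drop=True)
df = df.pivot(columns=0, values=1).bfill().iloc[[0], :]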

Output:

print(df.to_string())
0                            Address As built plan Available as of               Available date Basement Bathrooms Bedrooms   CO₂ emission Construction year Double glazing Elevator   Energy class External reference Furnished Garden Garden orientation Gas, water & electricity Heating type Investment property Kitchen type Land is facing street            Living area   Living room surface Number of frontages Outdoor parking spaces                Price Primary energy consumption Reference number of the EPC report Shower rooms     Surface of the plot Toilets                Website Yearly theoretical total energy consumption
0  Grand' Route 69 A 1435  - Corbais            No   To be defined  December 31 2022 - 12:00 AM      Yes         1        3  Not specified              2020            Yes      Yes  Not specified        8566 - 4443        No    Yes               East                      Yes          Gas                  No    Installed                    No  199  m² square meters  48  m² square meters                   2                      1  € 410,000  410000 €              Not specified                      Not specified            1  150  m²  square meters       3  http://www.gilmont.be                               Not specified
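
If you would rather keep the BeautifulSoup approach from the question, one possible way to get None for missing elements is to wrap each lookup chain in a small helper. This is only a sketch: the helper name safe is hypothetical (it is not part of bs4), and the inline HTML is a cut-down sample modelled on the snippet in the question, not the real page.

```python
from bs4 import BeautifulSoup


def safe(lookup, default=None):
    """Run a lookup chain; return `default` if any step hit a missing element.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    try:
        return lookup()
    except AttributeError:
        # find() returned None somewhere in the chain, so the next attribute
        # access (.find, .text, .find_next, ...) raised AttributeError.
        return default


# Cut-down sample modelled on the HTML in the question (an assumption,
# not the live page).
html = """
<div class="classified__information--address">
  <span class="classified__information--address-row"><span> 1340 </span></span>
</div>
<div class="classified__information--immoweb-code"> Immoweb code : 9308167 </div>
"""
soup = BeautifulSoup(html, "html.parser")

locality = safe(
    lambda: soup.find(class_="classified__information--address-row")
    .find("span").text.strip()
)
code = safe(
    lambda: soup.find(class_="classified__information--immoweb-code").text.strip()
)
# The sample has no "Swimming pool" row, so this falls back to None.
pool = safe(
    lambda: soup.find("th", string="Swimming pool")
    .find_next(class_="classified-table__data").next_element.strip()
)

print(locality, code, pool)
```

The lambda delays the lookup until it runs inside the try block, so a missing element anywhere in the chain ends up as the default instead of stopping the whole function.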
