python - How to handle elements that are missing from some pages when scraping with BeautifulSoup

Question

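I need to scrape the markup below from a series of product pages and then split it so that the author and the illustrator are shown separately.

The problem is:

Some pages have both an <li> for the author and an <li> for the illustrator, such as page1.

Some pages have only an <li> for the author, such as page2.

Some pages have neither an author nor an illustrator, so there is no <ul> at all, such as page3.

The only way to tell that an <li> belongs to the illustrator is that it contains the text "(Illustreerder)".

How do I assign default values to author and illustrator when they are missing?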

<ul class="product-brands">
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/zinelda-mcdonald-illustreerder.html" title="Zinelda McDonald (Illustreerder)">Zinelda McDonald (Illustreerder)</a>
    </li>
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/jose-reinette-palmer.html" title="Jose  Palmer &amp; Reinette Lombard">Jose  Palmer &amp; Reinette Lombard</a>
    </li>
</ul>

from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
}

# AUTHOR & ILLUSTRATOR
page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'

# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'

# NO AUTHOR and NO ILLUSTRATOR
page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'

# PAGE WITH NO STOCK
page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'


illustrator = '(Illustreerder)'
productlist = []

r = requests.get(page2, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

isbn = soup.find('div', class_='value', itemprop='sku').text.replace(" ", "")
stocks = soup.find('div', class_='stock available')
if stocks is not None:
    stock = stocks.text.strip()
if stocks is None:
    stock = 'n/a'
 
for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip() or 'None'

        if illustrator not in author:
            author = author

for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip()

        if illustrator in author:
            illustrator = author
          
bookdata = [isbn, stock, author, illustrator]
print(bookdata)

Expected output with r = requests.get(page1, headers=headers):

['9781776356515', 'In voorraad', 'Jose  Palmer & Reinette Lombard', 'Zinelda McDonald']

Expected output with r = requests.get(page2, headers=headers):

['9780799383874', 'In voorraad', 'Jaco Jacobs', 'None']

Expected output with r = requests.get(page3, headers=headers):

['9780799383690', 'In voorraad', 'None', 'None']

Tags: python, beautifulsoup, lxml

Answer

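You can do it like this.

First, select the <ul> you need with find(), which returns None when nothing matches:

ul = soup.find('ul', class_='product-brands')

Now check whether the <ul> exists. If it does, the page has an author, an illustrator, or both.
In that case, collect the strings inside the <li> tags of the <ul> and return them as a list. You can use .stripped_strings to get every string inside a tag with the surrounding whitespace removed.

If it does not exist, simply return None.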

if ul:
    return list(ul.stripped_strings)
return None
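
The check above can be tried without any network access. This is only a minimal sketch: the first HTML fragment is copied from the question, the second is an assumed stand-in for a page with no <ul>, and the helper name brand_names is hypothetical. It uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Fragment copied from the question (page with author and illustrator).
HTML_WITH_BRANDS = """
<ul class="product-brands">
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/zinelda-mcdonald-illustreerder.html" title="Zinelda McDonald (Illustreerder)">Zinelda McDonald (Illustreerder)</a>
    </li>
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/jose-reinette-palmer.html" title="Jose  Palmer &amp; Reinette Lombard">Jose  Palmer &amp; Reinette Lombard</a>
    </li>
</ul>
"""

# Assumed stand-in for a page that has no <ul class="product-brands"> at all.
HTML_WITHOUT_BRANDS = '<div class="product-info-main"></div>'


def brand_names(html):
    """Hypothetical helper: names inside the brands list, or None if it is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    ul = soup.find('ul', class_='product-brands')
    if ul is None:
        # find() found no matching tag, so there is nothing to read.
        return None
    # stripped_strings skips whitespace-only strings and trims the rest.
    return list(ul.stripped_strings)


print(brand_names(HTML_WITH_BRANDS))
print(brand_names(HTML_WITHOUT_BRANDS))
```

Testing for None explicitly makes the missing-element case visible at the point where it is handled, instead of relying on a loop over find_all() that silently does nothing.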

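From the items in the returned list, it is easy to work out which name is which, using the rule you mentioned in the question:

The only way to tell that an <li> belongs to the illustrator is that it contains the text "(Illustreerder)".

Here is the code that returns the author and the illustrator if either of them exists, else None.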

import requests
from bs4 import BeautifulSoup

# AUTHOR & ILLUSTRATOR
page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'

# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'

# NO AUTHOR and NO ILLUSTRATOR
page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'

# PAGE WITH NO STOCK
page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'


def test(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    ul = soup.find('ul', class_='product-brands')
    # Setting Default values for author and illustrator
    author, illustrator = None, None
    # Return a list only if ul is not None
    if ul:
        details = list(ul.stripped_strings)
        # Assigning the names to "author" and "illustrator"
        for name in details:
            if name.endswith('(Illustreerder)'):
                illustrator = name
            else:
                author = name
    return (author, illustrator)

    
# Iterate over the pages and call the test() function to get author and illustrator names
for page in [page1, page2, page3, page4]:
    author, illustrator = test(page)
    print(f'Authors: {author}\nIllustrators: {illustrator}\n')

Now the names are separated into author and illustrator, stored in different variables for each page.

Authors: Jose  Palmer & Reinette Lombard
Illustrators: Zinelda McDonald (Illustreerder)

Authors: Jaco Jacobs
Illustrators: None

Authors: None
Illustrators: None

Authors: Jan de Wet
Illustrators: None
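
The output above keeps the "(Illustreerder)" suffix and uses the None object, while the expected output in the question drops the suffix and shows the string 'None'. The following is a possible variation, not part of the original answer. It is a sketch under stated assumptions: the names split_brands, ILLUSTRATOR_MARK and DEFAULT are hypothetical, and the author-only and empty fragments are made up because the real markup of those pages was not shown.

```python
from bs4 import BeautifulSoup

ILLUSTRATOR_MARK = '(Illustreerder)'
DEFAULT = 'None'  # placeholder string, as in the question's expected output


def split_brands(soup):
    """Return (author, illustrator), using DEFAULT for whichever is missing."""
    author, illustrator = DEFAULT, DEFAULT
    ul = soup.find('ul', class_='product-brands')
    if ul is not None:
        for li in ul.find_all('li'):
            name = li.get_text(strip=True)
            if not name:
                continue  # skip empty list items
            if name.endswith(ILLUSTRATOR_MARK):
                # Drop the marker so only the person's name remains.
                illustrator = name[:-len(ILLUSTRATOR_MARK)].strip()
            else:
                author = name
    return author, illustrator


# Fragment from the question: illustrator first, then the authors.
HTML_BOTH = """
<ul class="product-brands">
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/zinelda-mcdonald-illustreerder.html" title="Zinelda McDonald (Illustreerder)">Zinelda McDonald (Illustreerder)</a>
    </li>
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/jose-reinette-palmer.html" title="Jose  Palmer &amp; Reinette Lombard">Jose  Palmer &amp; Reinette Lombard</a>
    </li>
</ul>
"""

# Assumed fragment for an author-only page.
HTML_AUTHOR_ONLY = """
<ul class="product-brands">
    <li class="brand-item"><a href="#" title="Jaco Jacobs">Jaco Jacobs</a></li>
</ul>
"""

# Assumed fragment for a page without the brands list.
HTML_NONE = '<div></div>'

for html in (HTML_BOTH, HTML_AUTHOR_ONLY, HTML_NONE):
    print(split_brands(BeautifulSoup(html, 'html.parser')))
```

Because both values start out as DEFAULT, a page with no <ul> or with only one <li> needs no special branch. Pass the soup built from r.text in place of the sample fragments to use it on the real pages.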
