Getting certain p tags from an HTML document

Problem description

I have this code that parses an HTML page.

from bs4 import BeautifulSoup

with open('Books-_html.txt') as page:
    soup = BeautifulSoup(page, "lxml")

Items = soup.find('div',{'class':'main'})

All_links_and_titles = Items.findAll('p')

print(All_links_and_titles)

The HTML that gets printed looks like this:

[<p>Looking for good philosophy books? This is my list of the best philosophy books of all-time. If you only have time to read one or two books, I recommend looking at the Top Philosophy Books section below.</p>, <p>Further down the page, you'll find more philosophy book recommendations. Many of these books are fantastic as well. I try to carefully curate all of my reading lists and you can be sure that any philosophy book on this page is worth your time. Enjoy!</p>, <p><strong>Manual for Living<br/></strong>by Epictetus<br/><a href="https://jamesclear.com/book/manual-for-living">Print</a> | <a href="https://jamesclear.com/audiobook/manual-for-living">Audiobook</a><br/><a href="https://jamesclear.com/book-summaries/manual-for-living">Read my summary of this book »</a></p>, <p><strong>Meditations</strong><br/>by Marcus Aurelius<br/><a href="https://jamesclear.com/book/meditations">Print</a> | <a href="https://jamesclear.com/ebook/meditations">eBook</a> | <a href="https://jamesclear.com/audiobook/meditations">Audiobook</a></p>, <p><strong>The Republic</strong><br/>by Plato<br/><a href="https://jamesclear.com/book/the-republic">Print</a> | <a href="https://jamesclear.com/ebook/the-republic">eBook</a> | <a href="https://jamesclear.com/audiobook/the-republic">Audiobook</a></p>, <p><strong>The Little Prince</strong><br/>by Antoine de Saint-Exupery<br/><a href="https://jamesclear.com/book/the-little-prince" title="The Little Prince by Antoine de Saint-Exupery">Print</a> | <a href="https://jamesclear.com/audiobook/the-little-prince" title="The Little Prince by Antoine de Saint-Exupery">Audiobook</a></p>, <p><strong>Free Will</strong><br/>by Sam Harris<br/><a href="https://jamesclear.com/book/free-will">Print</a> | <a href="https://jamesclear.com/ebook/free-will">eBook</a> | <a href="https://jamesclear.com/audiobook/free-will">Audiobook</a><br/><a href="https://jamesclear.com/book-summaries/free-will">Read my summary of this book »</a></p>, 
<p><strong>Candide</strong><br/>by Voltaire<br/><a href="https://jamesclear.com/book/candide" title="Candide by Voltaire">Print</a> | <a href="https://jamesclear.com/audiobook/candide" title="Candide audiobook">Audiobook</a></p>, <p>Or, <a href="https://jamesclear.com/best-books" title="Browse all book recommendations.">browse all book recommendations</a>.</p>, <p>]

From this I need to get only the p tags that contain a book title, for example Meditations, The Little Prince, and so on.

 <p><strong>Meditations</strong><br


<p><strong>The Little Prince</strong><br/

The code after print(All_links_and_titles) should look something like this:

for Only_titles in All_links_and_titles:
    Only_titles = All_links_and_titles.find( ????)
    print(Only_titles)

Nothing I have tried so far works. I need help. Thanks in advance.

Tags: python, beautifulsoup

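Solution

Try the CSS selector p strong, which selects every <strong> tag nested inside a <p> tag. In this page only the book entries contain a <strong> tag, so the selector returns just the titles.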

from bs4 import BeautifulSoup

html = """<p>Looking for good philosophy books? [And on..]">browse all book recommendations</a>.</p>, <p>"""
soup = BeautifulSoup(html, "html.parser")

for tag in soup.select("p strong"):
    print(tag.text)

Output:

Manual for Living
Meditations
The Republic
The Little Prince
Free Will
Candide
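
Since the html string above is abbreviated, the following is a minimal self-contained sketch of the same selector. The sample markup is a trimmed-down copy of two entries from the output in the question, and the names sample_html and titles are illustrative assumptions rather than part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample based on the output shown in the question
# (an assumption for illustration, not the full page).
sample_html = """
<div class="main">
<p>Looking for good philosophy books?</p>
<p><strong>Meditations</strong><br/>by Marcus Aurelius<br/>
<a href="https://jamesclear.com/book/meditations">Print</a></p>
<p><strong>The Little Prince</strong><br/>by Antoine de Saint-Exupery<br/>
<a href="https://jamesclear.com/book/the-little-prince">Print</a></p>
<p>Or, <a href="https://jamesclear.com/best-books">browse all book recommendations</a>.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "p strong" matches <strong> elements that have a <p> ancestor,
# so the introductory and closing paragraphs are skipped.
titles = [tag.get_text(strip=True) for tag in soup.select("p strong")]
print(titles)
```

Using get_text(strip=True) instead of .text only trims surrounding whitespace; either works for these entries.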

In your example:

for tag in All_links_and_titles:
    title = tag.select_one("p strong")
    # select_one returns None when nothing matches,
    # so only read `.text` if a <strong> tag was found
    if title:
        print(title.text)
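
If the goal is to keep the whole p tag for each book (as the question asks) and not just the title text, one possible approach is to filter the paragraphs by whether they contain a <strong> tag. This is a sketch under assumptions: the sample markup is cut down from the question's output, and book_paragraphs, books, and links are hypothetical names introduced here.

```python
from bs4 import BeautifulSoup

# Cut-down sample taken from the question's printed output (illustrative only).
sample_html = """
<div class="main">
<p>Looking for good philosophy books?</p>
<p><strong>Meditations</strong><br/>by Marcus Aurelius<br/>
<a href="https://jamesclear.com/book/meditations">Print</a> |
<a href="https://jamesclear.com/ebook/meditations">eBook</a></p>
<p><strong>The Little Prince</strong><br/>by Antoine de Saint-Exupery<br/>
<a href="https://jamesclear.com/book/the-little-prince">Print</a></p>
<p>Or, <a href="https://jamesclear.com/best-books">browse all book recommendations</a>.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
main = soup.find("div", class_="main")

# Keep only the <p> tags that contain a <strong> tag (the book entries).
book_paragraphs = [p for p in main.find_all("p") if p.find("strong") is not None]

# For each book paragraph, collect the title and a mapping of link text to URL.
books = []
for p in book_paragraphs:
    title = p.find("strong").get_text(strip=True)
    links = {a.get_text(strip=True): a["href"] for a in p.find_all("a")}
    books.append((title, links))

for title, links in books:
    print(title, links)
```

The filter relies on the observation that only book entries use <strong> on this page; if other paragraphs used it too, a stricter condition would be needed.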
