python - Get certain p tags from an HTML document
Question
I have this code that parses an HTML page.
from bs4 import BeautifulSoup

with open('Books-_html.txt') as page:
    soup = BeautifulSoup(page, "lxml")

Items = soup.find('div',{'class':'main'})
All_links_and_titles = Items.findAll('p')
print(All_links_and_titles)
The HTML it prints looks like this:
[<p>Looking for good philosophy books? This is my list of the best philosophy books of all-time. If you only have time to read one or two books, I recommend looking at the Top Philosophy Books section below.</p>, <p>Further down the page, you'll find more philosophy book recommendations. Many of these books are fantastic as well. I try to carefully curate all of my reading lists and you can be sure that any philosophy book on this page is worth your time. Enjoy!</p>, <p><strong>Manual for Living<br/></strong>by Epictetus<br/><a href="https://jamesclear.com/book/manual-for-living">Print</a> | <a href="https://jamesclear.com/audiobook/manual-for-living">Audiobook</a><br/><a href="https://jamesclear.com/book-summaries/manual-for-living">Read my summary of this book »</a></p>, <p><strong>Meditations</strong><br/>by Marcus Aurelius<br/><a href="https://jamesclear.com/book/meditations">Print</a> | <a href="https://jamesclear.com/ebook/meditations">eBook</a> | <a href="https://jamesclear.com/audiobook/meditations">Audiobook</a></p>, <p><strong>The Republic</strong><br/>by Plato<br/><a href="https://jamesclear.com/book/the-republic">Print</a> | <a href="https://jamesclear.com/ebook/the-republic">eBook</a> | <a href="https://jamesclear.com/audiobook/the-republic">Audiobook</a></p>, <p><strong>The Little Prince</strong><br/>by Antoine de Saint-Exupery<br/><a href="https://jamesclear.com/book/the-little-prince" title="The Little Prince by Antoine de Saint-Exupery">Print</a> | <a href="https://jamesclear.com/audiobook/the-little-prince" title="The Little Prince by Antoine de Saint-Exupery">Audiobook</a></p>, <p><strong>Free Will</strong><br/>by Sam Harris<br/><a href="https://jamesclear.com/book/free-will">Print</a> | <a href="https://jamesclear.com/ebook/free-will">eBook</a> | <a href="https://jamesclear.com/audiobook/free-will">Audiobook</a><br/><a href="https://jamesclear.com/book-summaries/free-will">Read my summary of this book »</a></p>, 
<p><strong>Candide</strong><br/>by Voltaire<br/><a href="https://jamesclear.com/book/candide" title="Candide by Voltaire">Print</a> | <a href="https://jamesclear.com/audiobook/candide" title="Candide audiobook">Audiobook</a></p>, <p>Or, <a href="https://jamesclear.com/best-books" title="Browse all book recommendations.">browse all book recommendations</a>.</p>, <p>]
From this I need to get the p tags that contain book titles, for example Meditations, The Little Prince, and so on.
<p><strong>Meditations</strong><br
<p><strong>The Little Prince</strong><br/
The code after print(All_links_and_titles) should look something like this:
for Only_titles in All_links_and_titles:
    Only_titles = All_links_and_titles.find( ????)
    print(Only_titles)
Nothing I have tried so far works. I need help. Thanks in advance.
Answer
Try the CSS selector p strong, which selects every <strong> tag nested inside a <p> tag.
from bs4 import BeautifulSoup

html = """<p>Looking for good philosophy books? [And on..]">browse all book recommendations</a>.</p>, <p>"""
soup = BeautifulSoup(html, "html.parser")

for tag in soup.select("p strong"):
    print(tag.text)
Output:
Manual for Living
Meditations
The Republic
The Little Prince
Free Will
Candide
In your example:
for tag in All_links_and_titles:
    title = tag.select_one("p strong")
    # Only read `.text` when a <strong> tag was actually found (select_one returns None otherwise)
    if title:
        print(title.text)
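If you want the matching p tags themselves rather than only the title text, one possible approach is to filter the paragraphs by whether they contain a <strong> tag. The sketch below is self-contained and uses a shortened, hand-trimmed sample of the HTML from the question. The variable names `main`, `book_paragraphs`, and `titles` are illustrative, and it assumes every book entry wraps its title in <strong>.

```python
from bs4 import BeautifulSoup

# Shortened sample modelled on the HTML in the question (an assumption:
# each book entry is a <p> whose title is wrapped in <strong>).
html = """
<div class="main">
<p>Looking for good philosophy books?</p>
<p><strong>Meditations</strong><br/>by Marcus Aurelius<br/>
<a href="https://jamesclear.com/book/meditations">Print</a></p>
<p><strong>The Little Prince</strong><br/>by Antoine de Saint-Exupery<br/>
<a href="https://jamesclear.com/book/the-little-prince">Print</a></p>
<p>Or, <a href="https://jamesclear.com/best-books">browse all book recommendations</a>.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", {"class": "main"})

# Keep only the <p> tags that contain a <strong> tag somewhere inside them.
book_paragraphs = [p for p in main.find_all("p") if p.find("strong")]

# Pull the title text out of each remaining paragraph.
titles = [p.find("strong").get_text(strip=True) for p in book_paragraphs]
print(titles)
```

Here `book_paragraphs` holds the full p tags (links included), so the Print and Audiobook URLs can still be read from each one afterwards.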