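python - Finding and removing content with BeautifulSoup
Problem description
I have blog text (an application in Django) from which I want to remove some content. I tried searching the content with BeautifulSoup. I want to find and remove everything between the wphimage tags. My code is below. It does not work: after running it, the wphimage tag still appears when I display the soup object, and that tag is what I ended up writing into obj.text.
My code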
class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        article = Blogposts.objects.all()
        for obj in article:
            soup = BeautifulSoup(obj.text, 'html.parser')
            for i in soup.find_all('wphimage'):
                obj.text = str(i.replace_with(''))
                obj.save()
Blog post content
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
<wphimage data="{'FileId':6182,'Copyright':'John Smith','Alignment':'left','ZoomDisabled':false,'ImageOnly':false,'AlternativeText':'John Smith','ImageVersion':'conductorportraitlong','tabid':0,'moduleid':0}">
<span style="display:block; float:left;" class="DIV_imageWrapper">
<a data-lightview-title="Adela Frasineanu" data-lightview-caption="" class="lightview" href="//example.com/static/images/image.JPG">
<img src="//example.com/static/images/image.JPG" alt="John Smith">
</a>
<a href="javascript:;">≡ <span>John Smith</span></a>
<a class="A_zoom lightview" href="//example.com/static/images/image.JPG" data-lightview-title="John Smith" data-lightview-caption="">+ </a>
</span>
</wphimage>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
My goal is:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
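Solution
You can use previousSibling to get the text before the wphimage tag and nextSibling to get the text after it (current Beautiful Soup releases spell these previous_sibling and next_sibling). You can try: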
from bs4 import BeautifulSoup
html_doc = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
<wphimage data="{'FileId':6182,'Copyright':'John Smith','Alignment':'left','ZoomDisabled':false,'ImageOnly':false,'AlternativeText':'John Smith','ImageVersion':'conductorportraitlong','tabid':0,'moduleid':0}">
<span style="display:block; float:left;" class="DIV_imageWrapper">
<a data-lightview-title="Adela Frasineanu" data-lightview-caption="" class="lightview" href="//example.com/static/images/image.JPG">
<img src="//example.com/static/images/image.JPG" alt="John Smith">
</a>
<a href="javascript:;">≡ <span>John Smith</span></a>
<a class="A_zoom lightview" href="//example.com/static/images/image.JPG" data-lightview-title="John Smith" data-lightview-caption="">+ </a>
</span>
</wphimage>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."""
soup = BeautifulSoup(html_doc, "lxml")
first_text = soup.find("wphimage").previousSibling
last_text = soup.find("wphimage").nextSibling
print(first_text.strip())
print(last_text.strip())
The output will be:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
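The sibling approach above reads the surrounding text but leaves the stored HTML unchanged. The code in the question fails for a different reason: replace_with() returns the element that was taken out of the tree, so str(i.replace_with('')) is the wphimage markup itself, and that is what gets saved. A minimal sketch of one alternative, assuming the goal is to keep everything except the wphimage elements, is to call decompose() on each tag and then serialize the whole soup. The helper name strip_wphimage and the sample string are made up for illustration and are not part of the original post.

```python
from bs4 import BeautifulSoup


def strip_wphimage(html):
    """Return html with every <wphimage> element and its contents removed.

    strip_wphimage is a hypothetical helper name used only for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("wphimage"):
        # decompose() removes the tag and all of its children from the tree
        tag.decompose()
    # Serialize the modified tree rather than the tag that was removed
    return str(soup)


# Shortened, made-up sample in the same shape as the blog post content
sample = (
    'Before <wphimage data="x"><span><img src="a.jpg" alt="a"></span>'
    "</wphimage> After"
)
cleaned = strip_wphimage(sample)
print(cleaned)
```

Inside the management command, this would amount to setting obj.text = strip_wphimage(obj.text) and calling obj.save() once per post. The sketch uses html.parser, as in the question, because the lxml parser wraps fragments in html and body tags that would otherwise end up in the saved text.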