python - Remove all data attributes from all elements with etree
Problem description
I am trying to clean up some HTML. I have the following function:
def clean_html(self, html):
    # html is expected to be bytes; pad every '<' with a space so adjacent text nodes stay separated
    replaced_html = html.decode('utf-8').replace('<', ' <')
    tree = etree.HTML(replaced_html)
    # Drop these elements together with their content
    etree.strip_elements(tree, 'script', 'style', 'img', 'noscript', 'svg')
    for el in tree.xpath('//*[@style]'):
        el.attrib.pop('style')
    for el in tree.xpath('//*[@class]'):
        el.attrib.pop('class')
    for el in tree.xpath('//*[@id]'):
        el.attrib.pop('id')
    # Remove HTML comments
    etree.strip_tags(tree, etree.Comment)
    return etree.tostring(tree, encoding='unicode', method='html')
I would also like to remove all data attributes, for example:
<li data-direction="ltr" data-listposition="center" data-data-id="dataItem-ifz7cqbs" data-state="menu idle link notMobile">sky</li>
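However, I do not know these attribute names in advance (the above is only an example).
So I want to convert the above to just <li>sky</li>, and this should run on every element on the page.
In my code above I can remove the simple things such as id and class, but I am not sure how to handle the dynamic data-* attributes. Maybe a regular expression?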
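Edit
I should clarify the input. My example above shows a single <li> tag, but the actual input is the entire HTML of a page, so it will look something like this: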
<html>
<ul>
<li data-i="sdfdsf">something</li>
<li data-i="dsfd">something</li>
</ul>
<p data-para="cvcv">content</p>
<div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp35za1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black LinkedIn Icon","uri":"6ea5b4a88f0b4f91945b40499aa0af00.png","width":200,"height":200,"alt":"Black LinkedIn Icon","link":{"type":"ExternalLink","id":"dataItem-ig84dp5v","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.linkedin.com/in/beth-liu-aba2b487?trk=hp-identity-name","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.pinterest.com/agencyb/" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ijxtrrjj","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Pinterest Icon","uri":"8f6f59264a094af0b46e9f6c77dff83e.png","width":200,"height":200,"alt":"Black Pinterest Icon","link":{"type":"ExternalLink","id":"dataItem-ikg674xm","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.pinterest.com/agencyb/","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="http://www.twitter.com/lubecka" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp3554u","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Twitter Icon","uri":"c7d035ba85f6486680c2facedecdcf4d.png","description":"","width":200,"height":200,"alt":"Black Twitter Icon","link":{"type":"ExternalLink","id":"dataItem-ifp3554u1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"http://www.twitter.com/lubecka","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.instagram.com/" target="_blank">
</html>
Solution
Assuming that the names of the "data attributes" always start with "data-", you can remove them like this:
for el in tree.xpath("//*"):
    # Copy the attribute names first so the mapping is not modified while iterating over it
    for attr in list(el.attrib):
        if attr.startswith("data-"):
            el.attrib.pop(attr)
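
For anyone who wants to try this in isolation, here is a minimal self-contained sketch of the same idea. The function name strip_data_attributes and the sample input are made up for illustration and are not part of the question's code; it assumes lxml is installed and that the input is already a str rather than bytes.

```python
from lxml import etree


def strip_data_attributes(html):
    """Return html with every attribute whose name starts with 'data-' removed.

    Hypothetical helper for illustration; html must be a non-empty str.
    """
    tree = etree.HTML(html)
    # "//*" selects element nodes only, so comments are not visited
    for el in tree.xpath("//*"):
        # Snapshot the attribute names before deleting any of them
        for name in list(el.attrib):
            if name.startswith("data-"):
                del el.attrib[name]
    return etree.tostring(tree, encoding="unicode", method="html")


if __name__ == "__main__":
    sample = (
        '<ul><li data-i="sdfdsf" id="keep">something</li></ul>'
        '<p data-para="cvcv">content</p>'
    )
    print(strip_data_attributes(sample))
```

Attributes that do not start with "data-" (such as id in the sample) are left alone, so this loop can sit next to the existing style/class/id removal in clean_html without interfering with it. Note that etree.HTML wraps fragments in <html> and <body>, so the output is a full document rather than the bare fragment.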