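python - Replace an entire block of code with BS4 in Python: swap one part of the HTML for another block
Problem description
I want to replace an entire code structure using bs4. I have a source HTML and a target HTML: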
from bs4 import BeautifulSoup

t_soup = BeautifulSoup(target_html, 'html.parser')
s_soup = BeautifulSoup(source_html, 'html.parser')
The first snippet is in the target:
//Block of code number 0
<div class="td-module-thumb">
//Some html code
</div>
//Block of code number 1
<div class="td-module-thumb">
<a href="this_is_an_url.html" rel="bookmark" class="image-wrap">
<img width="356" height="220" class="entry-thumb" src="../my_image-356x220.jpg" >
</a>
</div>
I want to replace what the target contains, specifically block [1], with what block [1] of the source contains, that is:
//Block of code number 0
<div class="td-module-thumb">
//Some html code
</div>
//Block of code number 1
<div class="td-module-thumb">
<a href="another_url.html" rel="bookmark_2" class="new-class">
<img width="356" height="220" class="exit" src="../other_image_here.jpg" >
</a>
</div>
Both share the same <div class="td-module-thumb">.
My code for the replacement is as follows:
left_column_selector = 'div.td-module-thumb'
left_column = s_soup.select(left_column_selector)[1]
Note:
>>> type(s_soup.select(left_column_selector)[1])
<class 'bs4.element.Tag'>
Here are my different attempts at the last line of code, the one that actually performs the replacement:
# Attempt 1
t_soup.select(left_column_selector)[1].replace_with(str(left_column))
# Attempt 2
t_soup.select(left_column_selector)[1].string.replace_with(left_column)
# Attempt 3
t_soup.select(left_column_selector)[1].string.replace_with(left_column.string)
# Attempt 4
t_soup.select(left_column_selector)[1].replace_with(left_column.string)
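Everything works except the last line of code, so the block in the target is not replaced with the block from the source.
Solution
I would do it wholesale, as it were, by deleting the target block and inserting the source block: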
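selector = 'div.td-module-thumb'
to_graft = s_soup.select(selector)[0]
# Note: this removes every matching div from the target
for div in t_soup.select(selector):
    div.decompose()
# Assumes the target has a <doc> element to insert into
t_soup.select_one('doc').insert(1, to_graft)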
Edit:
Suppose your files look like this:
target = """<root> I am the target
<div class="td-module-thumb">
don't touch me!
</div>
<div class="td-module-thumb">
replace me!
<a href="this_is_an_url.html" rel="bookmark" class="image-wrap">
<img width="356" height="220" class="entry-thumb" src="../my_image-356x220.jpg" >
</a>
</div>
</root>
"""
source = """<root><div class="td-module-thumb">
I'm the irrelevant part of the source
</div>
<div class="td-module-thumb">
move me to target!
<a href="another_url.html" rel="bookmark_2" class="new-class">
<img width="356" height="220" class="exit" src="../other_image_here.jpg" >
</a>
</div>
</root>
"""
Then this should do it:
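from bs4 import BeautifulSoup as bs

t_soup = bs(target, 'lxml')
s_soup = bs(source, 'lxml')
selector = 'div.td-module-thumb'
to_graft = s_soup.select(selector)[1]          # second div in the source
to_remove = t_soup.select(selector)
to_remove[1].decompose()                       # drop the second div in the target
t_soup.select_one('root').insert(2, to_graft)  # graft the source div in its place
t_soup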
Output:
<root> I am the target
<div class="td-module-thumb">
don't touch me!
</div><div class="td-module-thumb">
move me to target!
<a class="new-class" href="another_url.html" rel="bookmark_2">
<img class="exit" height="220" src="../other_image_here.jpg" width="356"/>
</a>
</div>
</root>
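As an alternative to deleting and re-inserting, the swap can probably be done in one call by handing `replace_with` the source Tag itself rather than its string form. The sketch below is a minimal illustration, not the method used above: the shortened `target_html` and `source_html` strings and the `replacement` name are made up for the example, and it uses `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-ins for the real target and source documents
target_html = """<root>
<div class="td-module-thumb">keep me</div>
<div class="td-module-thumb"><a href="this_is_an_url.html" class="image-wrap">old</a></div>
</root>"""

source_html = """<root>
<div class="td-module-thumb">irrelevant</div>
<div class="td-module-thumb"><a href="another_url.html" class="new-class">new</a></div>
</root>"""

t_soup = BeautifulSoup(target_html, "html.parser")
s_soup = BeautifulSoup(source_html, "html.parser")

selector = "div.td-module-thumb"
replacement = s_soup.select(selector)[1]  # a bs4.element.Tag

# Pass the Tag itself, not str(tag): a plain string is inserted as text,
# so its angle brackets come out escaped when the soup is serialized.
t_soup.select(selector)[1].replace_with(replacement)

print(t_soup)
```

Note that this moves the tag out of `s_soup`. If the source tree has to stay intact, pass `copy.copy(replacement)` instead.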