Replace an entire block of HTML with another block using BS4 in Python

Question

I want to replace an entire block of markup using bs4. I have a source HTML document and a target HTML document:

from bs4 import BeautifulSoup

t_soup = BeautifulSoup(target_html, 'html.parser')
s_soup = BeautifulSoup(source_html, 'html.parser')

The first snippet is in the target:

//Block of code number 0
<div class="td-module-thumb">
    //Some html code
</div>

//Block of code number 1
<div class="td-module-thumb">
    <a href="this_is_an_url.html" rel="bookmark" class="image-wrap">
       <img width="356" height="220" class="entry-thumb" src="../my_image-356x220.jpg" >
    </a>
</div>

I want to replace the contents of block [1] in the target with the contents of block [1] in the source, namely:

//Block of code number 0
<div class="td-module-thumb">
    //Some html code
</div>

//Block of code number 1
<div class="td-module-thumb">
    <a href="another_url.html" rel="bookmark_2" class="new-class">
       <img width="356" height="220" class="exit" src="../other_image_here.jpg" >
    </a>
</div>

Both blocks use the same <div class="td-module-thumb">.

My code for the replacement is as follows:

left_column_selector = 'div.td-module-thumb'
left_column = s_soup.select(left_column_selector)[1]

Note:

>>> type(s_soup.select(left_column_selector)[1])
<class 'bs4.element.Tag'>

Here are my different attempts at the last line of code, the one that actually performs the replacement:

# #1
t_soup.select(left_column_selector)[1].replace_with(str(left_column))

# #2
t_soup.select(left_column_selector)[1].string.replace_with(left_column)

# #3
t_soup.select(left_column_selector)[1].string.replace_with(left_column.string)

# #4
t_soup.select(left_column_selector)[1].replace_with(left_column.string)

Everything works except the last line of code, so the block in the target is never replaced with the block from the source.
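A small sketch of what goes wrong in the attempts above may help. It assumes a cut-down version of the block and the 'html.parser' backend. Two things are checked: `.string` is `None` for a tag that has more than one child, which is why attempts #2 and #3 cannot call `replace_with` on it, and a `str` passed to `replace_with` is inserted as plain text, so its markup is escaped on output (attempt #1).

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the real block (an assumption for illustration)
block = (
    '<div class="td-module-thumb">\n'
    '  <a href="another_url.html"><img src="x.jpg"></a>\n'
    '</div>'
)
s_soup = BeautifulSoup(block, "html.parser")
t_soup = BeautifulSoup(block.replace("another_url", "this_is_an_url"), "html.parser")

left_column = s_soup.select_one("div.td-module-thumb")

# The div contains whitespace text plus an <a>, i.e. several children,
# so .string is None and has no replace_with method to call.
string_is_none = left_column.string is None

# Attempt #1: the str becomes a text node, not a tag, and is escaped on output.
t_soup.select_one("div.td-module-thumb").replace_with(str(left_column))
escaped_output = str(t_soup)

print(string_is_none)
print(escaped_output)
```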

Tags: python, html, python-3.x, beautifulsoup

Answer


I would do it wholesale, as it were: delete the target blocks, then insert the source block:

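selector = 'div.td-module-thumb'
# Take the first matching block from the source
to_graft = s_soup.select(selector)[0]
# Remove every matching block from the target
for div in t_soup.select(selector):
    div.decompose()
# Assumes the target has a <doc> root element; select_one returns None otherwise
t_soup.select_one('doc').insert(1, to_graft)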

Edit:

Assuming your files look like this:

target = """<root> I am the target
<div class="td-module-thumb">
    don't touch me!
</div>
<div class="td-module-thumb">
    replace me!
    <a href="this_is_an_url.html" rel="bookmark" class="image-wrap">
       <img width="356" height="220" class="entry-thumb" src="../my_image-356x220.jpg" >
    </a>
</div>
</root>
"""
source = """<root><div class="td-module-thumb">
    I'm the irrelevant part of the source
</div>
<div class="td-module-thumb">
    move me to target!
    <a href="another_url.html" rel="bookmark_2" class="new-class">
       <img width="356" height="220" class="exit" src="../other_image_here.jpg" >
    </a>
</div>
</root>
"""

then this should do it:

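from bs4 import BeautifulSoup as bs

t_soup = bs(target, 'lxml')
s_soup = bs(source, 'lxml')
selector = 'div.td-module-thumb'
to_graft = s_soup.select(selector)[1]
to_remove = t_soup.select(selector)
# Drop block [1] from the target
to_remove[1].decompose()
# Insert the source block right after the first div inside <root>
t_soup.select_one('root').insert(2, to_graft)
# lxml wraps the document in <html><body>, so print only <root>
print(t_soup.select_one('root'))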

Output:

<root> I am the target
<div class="td-module-thumb">
    don't touch me!
</div><div class="td-module-thumb">
    move me to target!
    <a class="new-class" href="another_url.html" rel="bookmark_2">
<img class="exit" height="220" src="../other_image_here.jpg" width="356"/>
</a>
</div>

</root>
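A more direct variant is also possible and avoids counting the insert position by hand. The sketch below passes the source `Tag` itself (not its `str` and not its `.string`) to `replace_with`. The shortened HTML strings and the 'html.parser' backend are assumptions made to keep the example self-contained; this is not the original answerer's code.

```python
from bs4 import BeautifulSoup

# Shortened stand-ins for the real documents (assumptions for illustration)
target_html = """<root>
<div class="td-module-thumb">keep me</div>
<div class="td-module-thumb"><a href="this_is_an_url.html" class="image-wrap">old</a></div>
</root>"""
source_html = """<root>
<div class="td-module-thumb">irrelevant</div>
<div class="td-module-thumb"><a href="another_url.html" class="new-class">new</a></div>
</root>"""

t_soup = BeautifulSoup(target_html, "html.parser")
s_soup = BeautifulSoup(source_html, "html.parser")
selector = "div.td-module-thumb"

replacement = s_soup.select(selector)[1]
# replace_with moves the Tag: it is detached from s_soup and placed in t_soup
# exactly where the old block was.
t_soup.select(selector)[1].replace_with(replacement)

result = str(t_soup)
print(result)
```

Because the tag is moved rather than copied, `s_soup` loses that block. If the source must stay intact, passing `copy.copy(replacement)` instead should keep both trees complete.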
