How to detect whether a tag found by BeautifulSoup.find_all is the first/last tag in the searched string?

Problem description

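I am writing a Python function that processes HTML content and treats tables differently from other tags. My main goal is to convert the content from HTML to Markdown with pypandoc, but the table conversion never seems to work. So I want to keep the tables as HTML while converting everything else.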

For example, I want this:

<h3>RRRRRRRRRRRRRR</h3>
<p> Xxxxxxxxxxx II jj</p>
<table>
<tbody>
<tr>
<td>AAAAA</td>
<td>AAAAA</td>
<td>AAAAA</td>
</tr>
<tr>
<td>AAAAA</td>
<td>AAAAA</td>
<td>AAAAA</td>
</tr>
</tbody>
</table>

to be converted to:

### RRRRRRRRRRRRRR
Xxxxxxxxxxx II jj

<table>
<tbody>
<tr>
<td>AAAAA</td>
<td>AAAAA</td>
<td>AAAAA</td>
</tr>
<tr>
<td>AAAAA</td>
<td>AAAAA</td>
<td>AAAAA</td>
</tr>
</tbody>
</table>

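I wrote the function below to generalize this to any tag or find_all search string. However, it does not work if the tag found is the last one in the searched HTML. I would like to add a condition that detects this case. What is the best way to detect whether item is the last or the first item in the soup?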

Alternatively: is there a better way to process everything except the tables?

def do_nothing(input):
    return input

def process_tags_differently(tag_to_process, doc_contents, process_tag=do_nothing, process_out_of_tag=do_nothing):
    soup = BeautifulSoup(doc_contents, "html.parser")
    list_of_tags = soup.find_all(tag_to_process)
    if(len (list_of_tags) > 0):
        for item in list_of_tags:
            split_contents = doc_contents.split(str(item))  # doc_contents split in 3: before, item, after
            part1 = process_out_of_tag(split_contents[0] )
            part2 = str(item)
            ret = part1 + part2
            print(len(split_contents))
            # this part is recursive, to process the rest of the file
            if(len(split_contents) > 1):
                part3 = process_tags_differently(tag_to_process, split_contents[1], process_tag, process_out_of_tag)
                ret = ret + part3
            return ret
    else:
        # no more tags found, process the rest of the doc
        return process_out_of_tag (doc_contents)

Here is a failing test example:

    dc1 = "<h1>xxxxxx</h1><p>aaaa</p><img src='toto.htm'/>"
    result = process_tags_differently('img', dc1, double_str)
    good_result = "<p>aaaa</p><img src='toto.htm'/><img src='toto.htm'/>"

The result is:

process_tags_differently('img', dc1) failed! 
was 
<h1>xxxxxx</h1><p>aaaa</p><img src='toto.htm'/><img src="toto.htm"/>, 
should be 
<h1>xxxxxx</h1><p>aaaa</p><img src='toto.htm'/>
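
A likely cause of this failure, offered as a sketch rather than a confirmed diagnosis: str(item) re-serializes the tag, and BeautifulSoup writes attribute values with double quotes, so the serialized tag no longer matches the single-quoted source text. split then finds no match and returns the whole document as one piece, to which the tag gets appended. The snippet below reproduces this with the dc1 string from the test.

```python
from bs4 import BeautifulSoup

dc1 = "<h1>xxxxxx</h1><p>aaaa</p><img src='toto.htm'/>"
soup = BeautifulSoup(dc1, "html.parser")
item = soup.find("img")

serialized = str(item)          # re-serialized by BeautifulSoup, not the source text
pieces = dc1.split(serialized)  # no match, so the document comes back as one piece

print(serialized)
print(len(pieces))
```

Matching serialized tags against the raw source string is fragile for this reason; working on the parsed tree avoids the mismatch.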

Tags: python, beautifulsoup

Solution


I went with a less clever, more boring, task-specific solution.

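In the code below, I use BeautifulSoup to replace each table with a paragraph that will never occur in the content, namely <p>&&& ïïï ùùù</p>. I then convert the whole batch to Markdown and put the tables back into the content.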
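
Before the full script, the core of the idea can be sketched without pandoc installed. In this sketch, protect_tables and toy_convert are hypothetical names: toy_convert is only a stand-in for the real pypandoc call and handles nothing but h3 and p tags. Only top-level tables are swapped out, on the assumption that a nested table should stay inside its outer table.

```python
from bs4 import BeautifulSoup

PLACEHOLDER = "&&& ïïï ùùù"  # same marker text as in the answer


def toy_convert(html):
    """Hypothetical stand-in for pypandoc: one line of text per <h3> or <p>."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tag in soup.find_all(["h3", "p"]):
        prefix = "### " if tag.name == "h3" else ""
        lines.append(prefix + tag.get_text(strip=True))
    return "\n\n".join(lines)


def protect_tables(html, convert):
    """Convert everything except top-level tables, which are kept as HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Nested tables stay inside their outer table, so only top-level ones are swapped.
    tables = [t for t in soup.find_all("table") if t.find_parent("table") is None]
    for table in tables:
        marker = soup.new_tag("p")
        marker.string = PLACEHOLDER
        table.replace_with(marker)
    converted = convert(str(soup))
    # Restore the tables in document order, one placeholder at a time.
    for table in tables:
        converted = converted.replace(PLACEHOLDER, str(table), 1)
    return converted


html = "<h3>Title</h3><p>before</p><table><tr><td>A</td></tr></table><p>after</p>"
print(protect_tables(html, toy_convert))
```

Passing the converter in as an argument keeps the swap-and-restore logic testable on its own; in the real script the converter would be the pypandoc wrapper.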

import constants as c
import pypandoc
from bs4 import BeautifulSoup

def main():
    soup = BeautifulSoup(doc_contents, "html.parser")
    # Keep only top-level tables; a nested table is restored along with its outer table.
    tables = [t for t in soup.find_all("table") if t.find_parent("table") is None]
    for table in tables:
        # Swap each table for a placeholder paragraph that never occurs in the content.
        divider = BeautifulSoup("<p id='__divider__'>&&& ïïï ùùù</p>", "html.parser")
        table.replace_with(divider)
    cont = convert_snippet(str(soup))

    # Put the tables back, one placeholder at a time, in document order.
    for table in tables:
        cont = cont.replace('&&& ïïï ùùù', str(table), 1)
    output = doc_header + cont
    print(output)
    # Write the result; the with block closes the file when done.
    with open(f"{c.OUTPUT_PATH}/a_test.md", 'w', encoding='utf-8') as md:
        md.write(output)

def convert_snippet(txt):
    ret =  pypandoc.convert_text (
    txt, 
    'md', 
    format='html',
    extra_args=[ '-s', '--wrap=preserve']
    ) 
    return ret

doc_header=""" 
{
    "title": "A test",
    "linkTitle": "A test",
    "weight": "1"
}
"""


doc_contents="""
<div>
<h3>RRRRRRRRRRRRRR</h3>
<p> Xxxxxxxxxxx II jj</p>
<table>
<tbody>
<tr>
<td>AAAAA</td>
<td>AAAAA</td>
<td>AAAAA</td>
</tr>
<tr>
<td>AAAAA</td>
<td>AAAAA</td>
<td>AAAAA</td>
</tr>
</tbody>
</table>
<p> Xxxxxxxxxxx II jj</p>
<p> Xxxxxxxxxxx II jj</p>
<table>
<tbody>
<tr>
<td>BBBBB</td>
<td>AAAAA</td>
<td>BBBBB</td>
</tr>
<tr>
<td>BBBBB</td>
<td>AAAAA</td>
<td>BBBBB</td>
</tr>
</tbody>
</table>
</div>

"""

if __name__ == '__main__':
    main()
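
The script above sidesteps the original question of detecting the first or last match. For completeness, one way to make that check is sketched below; the helper name is_first_or_last is hypothetical. It relies on find_previous and find_next, which search the document in order, so a tag with no earlier match is the first one find_all would return, and a tag with no later match is the last. Note that this means first or last among tags of the same name, not first or last element in the whole document.

```python
from bs4 import BeautifulSoup


def is_first_or_last(tag):
    """Return (is_first, is_last) among tags of the same name, in document order."""
    is_first = tag.find_previous(tag.name) is None
    is_last = tag.find_next(tag.name) is None
    return is_first, is_last


soup = BeautifulSoup("<p>a</p><img src='1'/><p>b</p><img src='2'/>", "html.parser")
for img in soup.find_all("img"):
    print(img["src"], is_first_or_last(img))
```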
