How do I merge multi-line paragraphs in HTML into one?

Problem description

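I have an HTML file that contains the headings and paragraphs of a PDF file. In this file, however, every line of a paragraph is treated as a separate paragraph, so each line gets its own <p> tag, and it is not possible to build a single multi-line paragraph. Can anyone suggest a way to solve this?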

This is what I get:

["<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
  "<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
  "<p>cloud-based management and services. Forti accounts are free which require a license for ",
  "<p>each solution. "]

This is what I want:

['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, FortiWeb Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']

This is what I have done so far:

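import json
from bs4 import BeautifulSoup

paragraphs_1 = []
local_path = "file.json"
# Load the list of HTML fragments from the JSON file.
with open(local_path) as f:
    data = json.load(f)
for x in data:
    soup = BeautifulSoup(x, 'html.parser')
    # Each fragment has its own <p>, so every line becomes a separate entry.
    for paragraph in soup.find_all("p"):
        paragraphs_1.append(paragraph.get_text())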

Tags: python, html, beautifulsoup

Solution


You can use the replace function to get rid of all the <p> tags, like this:

yourtext.replace("<p>", "") 
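Removing the tags alone still leaves one string per line, so the pieces also have to be joined. The following is a minimal sketch of that idea, not a definitive implementation. The helper name `merge_paragraph_lines` is hypothetical, and it assumes the fragments contain only bare `<p>` (and possibly `</p>`) tags with no attributes. It uses only the standard library.

```python
def merge_paragraph_lines(fragments):
    """Merge one-line "<p>..." fragments into a single paragraph string."""
    pieces = []
    for fragment in fragments:
        # Remove the opening and (if present) closing paragraph tags.
        text = fragment.replace("<p>", "").replace("</p>", "")
        # Trim surrounding whitespace so the join below controls spacing.
        text = text.strip()
        if text:
            pieces.append(text)
    # Join the cleaned lines with single spaces.
    return " ".join(pieces)


# Sample input copied from the question.
lines = [
    "<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
    "<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
    "<p>cloud-based management and services. Forti accounts are free which require a license for ",
    "<p>each solution. ",
]

paragraphs_1 = [merge_paragraph_lines(lines)]
print(paragraphs_1)
```

Because each line is stripped before joining, the merged paragraph does not keep the trailing space shown in the desired output. If the tags may carry attributes, parsing each fragment with BeautifulSoup and calling `get_text()` before joining would likely be more robust than plain string replacement.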
