Removing whitespace after using .get_text() in Python

Problem description

I want to remove whitespace from a bog-standard .html file. I am using Python 3.6.2.

My code so far:

#!/usr/bin/python

import re
import logging
import textwrap

from bs4 import BeautifulSoup

print('opening file....')
with open("./scraped_pages/doc.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
    print('closing file...') 
    fp.close()
    print('..... file closed  ...')
    # print out the original text, in this case html source code
    # print(soup)   
    # only retrieve the text from the document, remove all html tags
    soup = soup.get_text()
    print(soup)

    lines = soup.split("\n")
    #Use the list comprehension syntax [line for line in lines if condition] with lines as the previous result and condition as line.strip() != "" to remove any empty lines from lines.
    no_soup = [line for line in lines if line.strip() != ""]

    # Declare an empty string and use a for-loop to iterate over the previous result. 
    no_empty_soup = ""
    # At each iteration, use the syntax str1 += str2 + "\n" to add the current element in the list str2 and a newline to the empty string str1.
    for line in no_soup:
        no_empty_soup += line + "\n"

    print("no empty lines:\n", no_empty_soup)

    soup = no_empty_soup.strip()
    print(soup)
    
    print(textwrap.dedent(soup))

And the doc.html code:

<!DOCTYPE html>
<html lang="en-GB">
<head>
  <title>Head's title</title>
</head>

<body>
  <p class="title"><b>Body's title</b></p>
  <p class="story">line begins
    <a href="http://example.com/element1" class="element" id="link1">1</a>
    <a href="http://example.com/element2" class="element" id="link2"> 2</a>
    <a href="http://example.com/avatar1" class="avatar" id="link3">3</a>
  <p>     line ends</p>
</body>
</html>

The output I get:

Head's title
Body's title
line begins
    1
 2
3
     line ends

And the output I would like to get:

Head's title
Body's title
line begins
1
2
3
line ends 

I don't understand why the whitespace is still there after using .strip() or textwrap.dedent(). Could someone explain?

After using .get_text(), I would have expected the '1' to start in the first column, just like the 'B' in "Body". Any ideas, please?

Thank you, Tommy.

Tags: python

Solution


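Your list comprehension is missing a .strip(). The condition tests line.strip(), but the line that gets kept is the original, unstripped one. It should be: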

no_soup = [line.strip() for line in lines if line.strip() != ""]

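With that change, every kept line loses its leading and trailing whitespace, and the output matches what you expect.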
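As for why .strip() and textwrap.dedent() did not help: here is a minimal sketch using only the standard library. The string `raw` is a hand-built stand-in shaped like the .get_text() output shown in the question (an assumption, not captured from a real run), and the names `kept_as_is` and `kept_stripped` are illustrative only.

```python
import textwrap

# Hand-built stand-in for the text returned by soup.get_text();
# an assumption for illustration, not output from a real run.
raw = (
    "Head's title\n"
    "\n"
    "Body's title\n"
    "line begins\n"
    "    1\n"
    " 2\n"
    "3\n"
    "     line ends\n"
)

lines = raw.split("\n")

# Original comprehension: strip() appears only in the condition,
# so the unstripped line is what ends up in the list.
kept_as_is = [line for line in lines if line.strip() != ""]

# Corrected comprehension: keep the stripped line itself.
kept_stripped = [line.strip() for line in lines if line.strip() != ""]

before = "\n".join(kept_as_is)
after = "\n".join(kept_stripped)

# str.strip() on the whole string trims only its two ends, and
# textwrap.dedent() removes only whitespace common to every line.
# "Head's title" is not indented, so neither call changes anything.
assert before.strip() == before
assert textwrap.dedent(before) == before

print(after)
```

If the intermediate list is not needed, Beautiful Soup can also do the stripping itself: `soup.get_text(separator="\n", strip=True)` strips each text fragment and skips fragments that are whitespace only. Whether the result is line-for-line identical to the version above depends on the markup, so treat that as something to check against your own pages.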

