How do I convert HTML to text in Python?

Problem description

I know this question has many answers, but a lot of them are outdated, and when I do find one that "works", the result still isn't good enough.

This is my current code:

import requests
from bs4 import BeautifulSoup

url = "http://example.com"

req = requests.get(url)


html = req.text


PlainText = BeautifulSoup(html, 'lxml')
print (PlainText.get_text())

This is the output I get:


Example Domain




    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }




Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...


This is the output I want:

Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

More information...

How can I print only the human-readable text from the website?

Tags: python, html, text, beautifulsoup

Solution


Here is a Python program that uses a function to remove everything between a < and the following >, returning only the text that lies outside the tags.

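def striphtmltags(s):
    outside_tag = True  # True while the scan is not inside <...>
    chars = []
    for ch in s:
        if ch == '<':
            outside_tag = False  # a tag starts: stop copying
        if outside_tag:
            chars.append(ch)
        if ch == '>':
            outside_tag = True  # the tag ended: resume copying
    return ''.join(chars).strip()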

html="<html><body><h1>this is the header</h1>this is the main body<font color=blue>this is blue</font><h6>this is the footer</h6></body></html>"
text=striphtmltags(html)

print("text:", text)

This produces:

text: this is the headerthis is the main bodythis is bluethis is the footer
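
Note that the character scan above only drops the tags themselves. Text that sits between <style> and </style> (or <script> and </script>) is kept, so on the page from the question the CSS would still be printed, and adjacent elements run together without line breaks. Below is a minimal sketch of one possible alternative built on the standard library's html.parser.HTMLParser. The names VisibleTextExtractor and html_to_text, the sets of skipped and block-level tags, and the sample page string are all assumptions made for illustration; they are not part of the original answer.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping some elements entirely."""

    # Assumption: content of these elements is never wanted in the output.
    SKIPPED = {"script", "style", "title"}
    # Assumption: these tags start a new output line.
    BLOCKS = {"p", "div", "br", "li", "tr", "pre", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._blocks = [[]]    # list of text fragments per output line

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCKS and self._skip_depth == 0:
            self._blocks.append([])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCKS and self._skip_depth == 0:
            self._blocks.append([])

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._blocks[-1].append(data)

    def text(self):
        # Collapse runs of whitespace inside each block, drop empty blocks.
        lines = (" ".join("".join(block).split()) for block in self._blocks)
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample shaped like the page in the question.
page = (
    "<html><head><title>Example Domain</title>"
    "<style>body { margin: 0; }</style></head>"
    "<body><div><h1>Example Domain</h1>"
    "<p>This domain is for use in illustrative examples in documents. You may use this\n"
    "    domain in literature without prior coordination or asking for permission.</p>"
    "<p><a href='#'>More information...</a></p>"
    "</div></body></html>"
)
print(html_to_text(page))
```

For the sample string this prints the heading, the paragraph joined onto a single line, and the link text, each on its own line, with no CSS. To run it on a live page, the html string obtained from requests in the question could be passed to html_to_text instead. Real-world markup is messier than this sample, so the tag sets will likely need adjusting.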
