Cleaning a string containing a specific kind of junk using Python

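Problem description

I want to extract the meaningful text from this string. How can I clean a string that contains this specific kind of junk?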

'<div dir="auto">I booked a flight ticket from Trivandrum to Mumbai<div 
dir="auto"><br></div><div dir="auto">Amount debited from my 
account.</div><div dir="auto"><br></div><div dir="auto">But 
ticket not received yet.</div><div dir="auto"><br></div><div 
dir="auto">Please check</div></div>
'

Expected output:

I booked a flight from Trivandrum to Mumbai Amount debited from my account. But 
ticket not received yet. Please check

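import re

def cleanhtml(raw_html):
    # Non-greedy pattern for anything that looks like a tag: <...>
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

# raw_html is assumed to hold the input string shown above
cleanhtml(raw_html)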
'&lt;div dir=&quot;auto&quot;&gt;I booked a flight ticket from Trivandrum to 
Mumbai&lt;div dir=&quot;auto&quot;&gt;&lt;br&gt;&lt;/div&gt;&lt;div 
dir=&quot;auto&quot;&gt;Amount debited from&nbsp;my account.&lt;/div&gt;&lt;div 
dir=&quot;auto&quot;&gt;&lt;br&gt;&lt;/div&gt;&lt;div dir=&quot;auto&quot;&gt;But 
ticket not received yet.&lt;/div&gt;&lt;div 
dir=&quot;auto&quot;&gt;&lt;br&gt;&lt;/div&gt;&lt;div dir=&quot;auto&quot;&gt;Please 
check&lt;/div&gt;&lt;/div&gt;&#13;&#10;'

The string is not cleaned. Please suggest some solutions.

Tags: python, string, nlp, data-cleaning

Solution
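One possible approach, offered as a sketch rather than a definitive answer: the output in the question still contains entities such as &lt;, &quot; and &nbsp;, which suggests the input is HTML-escaped, so the pattern '<.*?>' never finds a literal "<" to match. Assuming that is the case, decoding the entities first with the standard library's html.unescape and stripping tags afterwards should give the expected text. The function name clean_html and the shortened sample string below are illustrative and not from the original post.

```python
import html
import re

def clean_html(raw_html):
    # Decode entities (&lt; &quot; &nbsp; &#13; ...) into real characters first,
    # otherwise there are no literal "<" or ">" for the tag pattern to match.
    text = html.unescape(raw_html)
    # Replace each tag with a space so text from adjacent <div>s does not run
    # together. [^>] also matches newlines, so tags split across lines are caught.
    text = re.sub(r'<[^>]+>', ' ', text)
    # str.split() with no arguments splits on any Unicode whitespace, including
    # the non-breaking space produced by &nbsp; and the trailing \r\n.
    return ' '.join(text.split())

# Shortened, hypothetical sample in the same escaped form as the question's output
sample = ('&lt;div dir=&quot;auto&quot;&gt;Amount debited from&nbsp;my account.'
          '&lt;/div&gt;&lt;div dir=&quot;auto&quot;&gt;Please check&lt;/div&gt;&#13;&#10;')
print(clean_html(sample))  # Amount debited from my account. Please check
```

Note that stripping tags with a regular expression is only reasonable for simple, well-formed fragments like this one. For arbitrary HTML, a real parser (for example the standard library's html.parser, or a third-party library such as BeautifulSoup) is likely to be more robust. Also, if the text itself may contain an escaped "<" that is not part of a tag, unescaping before stripping would treat it as markup.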

