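python - Need help joining dictionary items and removing newlines, multiple spaces and special characters
Problem description
A dictionary with 2 URLs and their text: I need to strip out all the multiple spaces, special characters and newlines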
{' https://firsturl.com ': ['\n\n', '\n ', '\n \n \n ', '\n \n ', '\n \n ', '\n \n ', '\n', '\n', '\n ', '\n ', 'Home | Sam ModelInc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n\n \n\n \n \n', '\n', '\n', '\n', '\n', '\n', '\n', 'Skip to main content'],' https://secondurl.com#main-content': ['\n\n', '\n', '\n \n \n', '\n \n', '\n \n', '\n\n', '\n', '\n', '\n', '\n', 'Home | Going to start inc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n \n\n\n \n \n', '\n', '\n', '\n', '\n', '\n', '\n', 'Skip to main content', ' \n ', '\n \n', '\n\n ', '\n\n ', '\n \n \n \n \n ', '\n\n ', '\n ', '\n\n \n ', '\n ', '\n\n \n ', '\n ', 'Brands', '\n', 'About Us', '
Expected output: {' https://firsturl.com ': ['home sam modelInc skip to main content'], ' https://secondurl.com#main-content ': ['home going to start inc skip to main content brands about us syndication direct response']}
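Any help would be appreciated.
Solution
So, let's work through this step by step rather than just throwing some code at you.
The first elements we want to remove are the newlines. We can start with something like this: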
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    # keep only the entries that contain no newline character
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    ex_dict[x] = new_list
If you run it, you'll see that every entry containing a newline has now been filtered out.
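One caveat: the `"\n" not in e` test also drops any entry that has real text plus a newline. Here is a minimal sketch of a more forgiving filter, assuming you would rather keep such entries and discard only the whitespace-only ones. The sample dictionary is made up for illustration and is not the asker's data:

```python
# Hypothetical sample data: one entry mixes real text with a trailing newline.
ex_dict = {"a": ["\n\n", "\n ", "Home | Sam ModelInc\n", " \n ", "Skip to main content"]}

cleaned = {}
for url, items in ex_dict.items():
    # str.strip() returns "" for whitespace-only strings, and "" is falsy,
    # so only entries with visible text survive; strip() also trims the edges.
    cleaned[url] = [e.strip() for e in items if e.strip()]

print(cleaned)
# {'a': ['Home | Sam ModelInc', 'Skip to main content']}
```

With the data shown in the question, both filters should give the same result, because none of the text entries contain a newline.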
Now we are left with the following:
Home | Sam ModelInc
Skip to main content
Home | Going to start inc
Brands
About Us
Syndication
Direct Response
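Based on your expected output, you want to lowercase all the words and remove non-alphabetic characters.
I did some research on how to do that.
In code, it looks like this: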
import re
regex = re.compile('[^a-zA-Z ]')  # had to tweak the linked solution to include spaces
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'   (two spaces remain where "|" was removed)
    new_list = [regex.sub("", e) for e in new_list]
    ex_dict[x] = new_list
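So now, for the first URL, our resulting new_list looks like: ['Home  Sam ModelInc', 'Skip to main content'] (a double space is left behind where the "|" was removed; the last step takes care of it).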
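To see what the pattern does on its own, here is a small sketch that runs it over a few of the strings from the question. `repr()` is used only to make the leftover whitespace visible:

```python
import re

# Same pattern as above: anything that is not an ASCII letter or a space is removed.
regex = re.compile('[^a-zA-Z ]')

samples = ["Home | Sam ModelInc", "Skip to main content", "About Us"]
stripped = [regex.sub("", s) for s in samples]

for before, after in zip(samples, stripped):
    # repr() shows the quotes, so doubled spaces are easy to spot
    print(repr(before), "->", repr(after))
```

Note that this character class also removes digits and any non-ASCII letters, so it only suits text where plain English letters are all you want to keep.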
Next we want to lowercase everything.
import re
regex = re.compile('[^a-zA-Z ]')  # had to tweak the linked solution to include spaces
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'   (two spaces remain where "|" was removed)
    new_list = [regex.sub("", e) for e in new_list]
    new_list = [e.lower() for e in new_list]
    ex_dict[x] = new_list
Finally, we want to join everything together with only a single space between each word.
import re
regex = re.compile('[^a-zA-Z ]')  # had to tweak the linked solution to include spaces
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'   (two spaces remain where "|" was removed)
    new_list = [regex.sub("", e) for e in new_list]
    new_list = [e.lower() for e in new_list]
    # join all entries, then split on any whitespace and re-join with single spaces
    new_list = [" ".join((" ".join(new_list)).split())]
    ex_dict[x] = new_list
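The example dictionary above contains only newline entries, so it ends up holding a single empty string. To see the steps do real work, here is a sketch that wraps them in a function and runs it on a shortened version of the first URL's data. The names `clean_entries`, `NON_LETTERS` and `pages` are made up for this illustration:

```python
import re

# Hypothetical name for the same pattern used in the answer.
NON_LETTERS = re.compile('[^a-zA-Z ]')

def clean_entries(items):
    """Collapse a list of scraped strings into a one-element list of clean text."""
    # 1. drop every entry that contains a newline
    kept = [e for e in items if "\n" not in e]
    # 2. remove non-letter characters, then lowercase
    kept = [NON_LETTERS.sub("", e).lower() for e in kept]
    # 3. join, split on any whitespace, and re-join with single spaces
    return [" ".join(" ".join(kept).split())]

# Shortened sample based on the first URL in the question.
pages = {
    " https://firsturl.com ": ["\n\n", "\n ", "Home | Sam ModelInc", "\n", "Skip to main content"],
}

result = {url: clean_entries(items) for url, items in pages.items()}
print(result)
# {' https://firsturl.com ': ['home sam modelinc skip to main content']}
```

One difference from the expected output in the question: `lower()` turns "ModelInc" into "modelinc", whereas the question shows "modelInc". If that capital letter matters, the lowercasing step would need adjusting.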