Need help joining dictionary items and removing newlines, multiple spaces, and special characters

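Problem description

I have a dictionary with 2 URLs and their text. I need to strip out all the repeated spaces, special characters, and newlines: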

{'https://firsturl.com': ['\n\n', '\n ', '\n \n \n ', '\n \n ', '\n \n ', '\n \n ', '\n', '\n', '\n ', '\n ', 'Home | Sam ModelInc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n\n \n\n \n \n', '\n', '\n', '\n', '\n', '\n', '\n', 'Skip to main content'],'https://secondurl.com#main-content': ['\n\n', '\n', '\n \n \n', '\n \n', '\n \n', '\n\n', '\n', '\n', '\n', '\n', 'Home | Going to start inc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n \n\n\n \n \n', '\n', '\n', '\n', '\n', '\n', '\n', 'Skip to main content', ' \n ', '\n \n', '\n\n ', '\n\n ', '\n \n \n \n \n ', '\n\n ', '\n ', '\n\n \n ', '\n ', '\n\n \n ', '\n ', 'Brands', '\n', 'About Us', '

Expected output: {'https://firsturl.com': ['home sam modelInc skip to main content'], 'https://secondurl.com#main-content': ['home going to start inc skip to main content brands about us syndication direct response']}

Any help would be appreciated.

Tags: python, html, regex, xml, dictionary

Solution


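Let's work through this step by step instead of just throwing some code at you.

The first things we want to remove are the newlines, so we can start with something like this: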

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    ex_dict[x] = new_list

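If you run it, you'll see that every element containing a newline has been filtered out.

With your data, that leaves us with the following: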
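To see the filter on something closer to the question's data, here is a small sketch. The `sample` dictionary is a trimmed, illustrative stand-in for the real input, and the `stripped` variant is my own assumption rather than part of the steps above. It may be worth considering if a text element can itself carry a newline, since the `"\n" not in e` test would drop that element entirely.

```python
# Trimmed, illustrative stand-in for the question's dictionary.
sample = {
    "https://firsturl.com": [
        "\n\n", "\n ", "Home | Sam ModelInc", "\n",
        "Skip to main content", "\nBrands\n",
    ],
}

# Filter from the answer: drop every element that contains a newline.
filtered = {url: [e for e in texts if "\n" not in e]
            for url, texts in sample.items()}
print(filtered)
# {'https://firsturl.com': ['Home | Sam ModelInc', 'Skip to main content']}

# Assumed alternative: strip each element and keep whatever text remains,
# so "\nBrands\n" survives as "Brands".
stripped = {url: [e.strip() for e in texts if e.strip()]
            for url, texts in sample.items()}
print(stripped)
# {'https://firsturl.com': ['Home | Sam ModelInc', 'Skip to main content', 'Brands']}
```

Which of the two is right depends on whether whitespace-padded text should count as content.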

Home | Sam ModelInc
Skip to main content
Home | Going to start inc
Brands
About Us
Syndication
Direct Response

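Based on your expected output, you want to lowercase all the words and remove the non-alphabetic characters.

I did some research on how to do that.

In code, it looks like: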

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    """
    >>> regex.sub("", "Home | Sam ModelInc")
    'Home  Sam ModelInc'
    """
    new_list = [regex.sub("", e) for e in new_list]
    ex_dict[x] = new_list

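So now our resulting new_list looks like: ['Home  Sam ModelInc', 'Skip to main content'] (note the double space left behind where the "|" was removed).

Next we want to lowercase everything.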
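It helps to check what this character class actually removes before relying on it. The sketch below uses the same pattern; the second input string is made up purely for illustration. Because the class only allows ASCII letters and spaces, digits and accented letters are dropped along with the punctuation.

```python
import re

# Same pattern as above: anything that is not an ASCII letter or a space.
non_letters = re.compile("[^a-zA-Z ]")

title = non_letters.sub("", "Home | Sam ModelInc")
print(repr(title))   # 'Home  Sam ModelInc' (two spaces where "|" was)

# Made-up input: digits and the accented "é" are removed as well.
other = non_letters.sub("", "Top 10 cafés")
print(repr(other))   # 'Top  cafs'
```

If numbers or non-English text matter for your pages, the pattern would need to be widened accordingly.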

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    """
    >>> regex.sub("", "Home | Sam ModelInc")
    'Home  Sam ModelInc'
    """
    new_list = [regex.sub("", e) for e in new_list]

    new_list = [e.lower() for e in new_list]
    ex_dict[x] = new_list

Finally, we want to join everything together with exactly one space between each word.

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    """
    >>> regex.sub("", "Home | Sam ModelInc")
    'Home  Sam ModelInc'
    """
    new_list = [regex.sub("", e) for e in new_list]

    new_list = [e.lower() for e in new_list]

    new_list = [" ".join((" ".join(new_list)).split())]
    ex_dict[x] = new_list
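Putting the steps together, one possible way to package them is a small function that returns a new dictionary instead of modifying the original in place. This is only a sketch: the name `clean_dict` and the trimmed `sample` data are hypothetical, not something from the question.

```python
import re

# Anything that is not an ASCII letter or a space.
NON_LETTERS = re.compile("[^a-zA-Z ]")


def clean_dict(pages):
    """Return a new dict mapping each URL to a one-element list of cleaned text."""
    cleaned = {}
    for url, texts in pages.items():
        # 1. Drop elements that contain a newline.
        kept = [e for e in texts if "\n" not in e]
        # 2. Remove non-letter characters and lowercase what is left.
        kept = [NON_LETTERS.sub("", e).lower() for e in kept]
        # 3. Join everything, then split and re-join to collapse repeated spaces.
        cleaned[url] = [" ".join(" ".join(kept).split())]
    return cleaned


# Trimmed, illustrative version of the question's input.
sample = {
    "https://firsturl.com": ["\n\n", "Home | Sam ModelInc", "\n", "Skip to main content"],
    "https://secondurl.com#main-content": [
        "\n", "Home | Going to start inc", "Skip to main content", " \n ",
        "Brands", "About Us", "Syndication", "Direct Response",
    ],
}

result = clean_dict(sample)
print(result)
```

One difference from the expected output in the question: lowercasing turns "ModelInc" into "modelinc", whereas the question shows "modelInc". If that capital letter is intentional, the lowercase step would need adjusting.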
