python - Python: Grouping similar text in a list
Problem description
How can I group the values in an array using fuzzy matching with an 80% similarity threshold?
combined_list = ['magic', 'simple power', 'matrix', 'simple aa', 'madness', 'magics', 'mgcsa', 'simple pws', 'seek', 'dour', 'softy']
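Expected output:
['magic', 'magics'], ['simple pws', 'simple aa'], ['simple power'], ['matrix']
This is what I have so far, but it is far from my goal. It also only handles a small number of values, and I intend to run it on 50,000 records.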
from difflib import SequenceMatcher as sm

combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics', 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
result = list()
result_group = list()

for x in combined_list:
    for name in combined_list:
        if(sm(None, x, name).ratio() >= 0.80):
            result_group.append(name)
        else:
            pass
    result.append(result_group)
    print(result)
    del result_group[:]
print(result)
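The print outside the loop shows only empty lists, while the prints inside the loop contain the values I need, although the output still differs from what I want:
[['magic', 'magics']]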
[['simple power', 'simple pws'], ['simple power', 'simple pws']]
[['matrix'], ['matrix'], ['matrix']]
[['madness'], ['madness'], ['madness'], ['madness']]
[['magic', 'magics'], ['magic', 'magics'], ['magic', 'magics'], ['magic', 'magics'], ['magic', 'magics']]
[['mgcsa'], ['mgcsa'], ['mgcsa'], ['mgcsa'], ['mgcsa'], ['mgcsa']]
[['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws']]
[['seek'], ['seek'], ['seek'], ['seek'], ['seek'], ['seek'], ['seek'], ['seek']]
[['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour']]
[['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft']]
[['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa']]
[[], [], [], [], [], [], [], [], [], [], []]
Solution
from difflib import SequenceMatcher as sm

combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics',
                 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
result = list()
result_group = list()
usedElements = list()  # elements that already belong to a group

for firstName in combined_list:
    # Skip names that were already placed in an earlier group.
    if firstName in usedElements:
        continue
    for secondName in combined_list:
        if sm(None, firstName, secondName).ratio() >= 0.80:
            result_group.append(secondName)
            usedElements.append(secondName)
    # Append a copy, because result_group is cleared and reused below.
    result.append(result_group[:])
    del result_group[:]
print(result)
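I added a way to remove duplicates: every element that has already been placed in a group is put into the usedElements list, and later iterations skip it.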
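The slice copy `result_group[:]` is what makes this version work where the question's code printed empty lists: `result.append(result_group)` stores a reference to the one list that `del result_group[:]` later empties. A minimal sketch of the difference, using made-up variable names (`group`, `aliased`, `copied`) that are not in the original code:

```python
# Appending a list stores a reference to it, not a snapshot of its contents.
group = []
aliased = []  # receives the same list object on every iteration
copied = []   # receives an independent shallow copy on every iteration

for word in ["magic", "seek"]:
    group.append(word)
    aliased.append(group)    # reference to the shared list
    copied.append(group[:])  # slice copy, unaffected by later changes
    del group[:]             # empties the shared list in place

print(aliased)  # [[], []]
print(copied)   # [['magic'], ['seek']]
```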
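This solution keeps single-element groups. If you don't want elements that matched nothing else to appear in the result, you can change the last part of the code to: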
    if len(result_group) > 1:
        result.append(result_group[:])
    del result_group[:]
print(result)
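The same greedy idea can be wrapped in a function. The sketch below rests on a few assumptions: `group_similar`, `threshold` and `min_size` are illustrative names, a set replaces the `usedElements` list for faster membership tests, and, unlike the code above, words already assigned to a group are not added to a later one.

```python
from difflib import SequenceMatcher


def group_similar(words, threshold=0.80, min_size=1):
    """Greedy grouping: each unused word seeds a group of its close matches."""
    used = set()   # words already assigned to some group
    groups = []
    for seed in words:
        if seed in used:
            continue
        # The seed always matches itself (ratio 1.0), so it is in its own group.
        group = [
            other for other in words
            if other not in used
            and SequenceMatcher(None, seed, other).ratio() >= threshold
        ]
        used.update(group)
        if len(group) >= min_size:
            groups.append(group)
    return groups


combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics',
                 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
print(group_similar(combined_list))
print(group_similar(combined_list, min_size=2))
# [['magic', 'magics'], ['simple power', 'simple pws']]
```

Passing `min_size=2` drops the single-element groups, which matches the variant shown above.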
Hope this helps.
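For the 50,000 records mentioned in the question, the nested loop can make up to 2.5 billion `ratio()` calls in the worst case (50,000 × 50,000), so the cost of each comparison matters. `SequenceMatcher` also provides `real_quick_ratio()` and `quick_ratio()`, which are cheaper upper bounds on `ratio()`, so a pair that fails either bound can be rejected without the full computation. A hedged sketch, where the helper name `is_similar` is hypothetical; it only trims the cost per pair and does not reduce the number of pairs:

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.80):
    """Return True if ratio() >= threshold, checking cheap upper bounds first."""
    matcher = SequenceMatcher(None, a, b)
    # `and` short-circuits, so ratio() runs only when both bounds pass.
    return (
        matcher.real_quick_ratio() >= threshold
        and matcher.quick_ratio() >= threshold
        and matcher.ratio() >= threshold
    )


print(is_similar('magic', 'magics'))  # True
print(is_similar('magic', 'seek'))    # False
```

This helper could replace the direct `ratio()` comparison in the loops above without changing which pairs are grouped.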