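python - Removing similar data
Problem description
Removing duplicates from similar data based on measurement inaccuracy
I am struggling with a new problem in Python: filtering duplicate data. In particular, I am looking for a way to apply this to large data with more than 100 rows and more than 25 columns.
Simplified to a small example, using the following DataFrame: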
>>> df
a b c d
0 1.764052 0.400157 0.978738 2.240893
1 1.764052 0.400157 0.978738 2.240893
2 -0.103219 0.410599 0.144044 1.454274
3 0.761038 0.121675 0.443863 0.333674
4 -0.103219 0.410600 0.144044 1.454274
5 1.240291 1.202380 -0.387327 -0.302303
6 1.240291 1.202380 -0.387327 -0.302303
7 1.532779 1.469359 0.154947 0.378163
8 1.230291 1.202380 -0.387327 -0.302303
9 1.230291 1.202380 -0.387327 -0.302303
>>> df1 = df.drop_duplicates()
>>> df1
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 -0.103219 0.410599 0.144044 1.454274
3 0.761038 0.121675 0.443863 0.333674
4 -0.103219 0.410600 0.144044 1.454274
5 1.240291 1.202380 -0.387327 -0.302303
7 1.532779 1.469359 0.154947 0.378163
8 1.230291 1.202380 -0.387327 -0.302303
>>> df2 = df.???  # special code?
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 -0.103219 0.410599 0.144044 1.454274
3 0.761038 0.121675 0.443863 0.333674
5 1.240291 1.202380 -0.387327 -0.302303
7 1.532779 1.469359 0.154947 0.378163
8 1.230291 1.202380 -0.387327 -0.302303
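So drop_duplicates() in pandas is very efficient, super fast, and works well, but it only filters exact duplicates. To reduce the data while accounting for measurement error, I also want to drop similar rows, meaning rows that are the same within a defined measurement error.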
So row 4 should also be dropped, because it is "almost" the same as row 2 (they differ only slightly in column b).
On the other hand, row 8 should be kept: it is similar to row 5 (in column a), but the difference is larger than the measurement inaccuracy.
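As a quick check of the two cases just described, here is a minimal sketch that compares the rows column by column against the per-column tolerances from the question (a: 0.001, b: 0.5, c: 0.5, d: 0.05). The helper name within_tolerance is hypothetical and used only for illustration; the row values are copied from the drop_duplicates() output above.

```python
import numpy as np

# Per-column tolerances for a, b, c, d, as given in the question.
tol = np.array([0.001, 0.5, 0.5, 0.05])

# Rows copied from the drop_duplicates() output above.
row2 = np.array([-0.103219, 0.410599, 0.144044, 1.454274])
row4 = np.array([-0.103219, 0.410600, 0.144044, 1.454274])
row5 = np.array([1.240291, 1.202380, -0.387327, -0.302303])
row8 = np.array([1.230291, 1.202380, -0.387327, -0.302303])


def within_tolerance(x, y, tol):
    """Hypothetical helper: True if every column differs by at most its tolerance."""
    return bool(np.all(np.abs(x - y) <= tol))


rows_2_4_similar = within_tolerance(row2, row4, tol)
rows_5_8_similar = within_tolerance(row5, row8, tol)
print(rows_2_4_similar)  # True: the only difference is 0.000001 in column b
print(rows_5_8_similar)  # False: column a differs by 0.01, above its 0.001 tolerance
```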
Below is one possible solution that works for small data, but unfortunately it is far too slow for large data.
import pandas as pd
tolerances = {'a':0.001,
              'b':0.5,
              'c':0.5,
              'd':0.05}
# Seed the result with the first row (DataFrame.append was removed in pandas 2.0, so pd.concat is used below).
df_clean = df.iloc[[0]]
for i in range(df.shape[0]):
    for j in range(df_clean.shape[0]):
        m = 0  # number of columns of row i that are within tolerance of kept row j
        for key in tolerances:
            if ((df.iloc[i].loc[key] <= df_clean.iloc[j].loc[key]+tolerances[key]) and (df.iloc[i].loc[key] >= df_clean.iloc[j].loc[key]-tolerances[key])):
                m = m+1
            else:
                break
        if m == len(tolerances):
            break  # row i matches kept row j in every column, so skip it
        if j == (df_clean.shape[0]-1):
            # No kept row matched row i, so keep it.
            df_clean = pd.concat([df_clean, df.iloc[[i]]])
df_clean = df_clean.sort_index()
>>> print(df_clean)
a b c d
0 1.764052 0.400157 0.978738 2.240893
1 -0.103219 0.410599 0.144044 1.454274
2 0.761038 0.121675 0.443863 0.333674
4 1.240291 1.202380 -0.387327 -0.302303
5 1.532779 1.469359 0.154947 0.378163
6 1.230291 1.202380 -0.387327 -0.302303
Solution
Here is your input data:
from scipy.spatial.distance import pdist, squareform
import numpy as np
import pandas as pd
data = {'a': {0: '1.764052', 1: '-0.103219', 2: '0.761038', 3: '-0.103219', 4: '1.240291', 5: '1.532779', 6: '1.230291'}, 'b': {0: '0.400157', 1: '0.410599', 2: '0.121675', 3: '0.410600', 4: '1.202380', 5: '1.469359', 6: '1.202380'}, 'c': {0: '0.978738', 1: '0.144044', 2: '0.443863', 3: '0.144044', 4: '-0.387327', 5: '0.154947', 6: '-0.387327'}, 'd': {0: '2.240893', 1: '1.454274', 2: '0.333674', 3: '1.454274', 4: '-0.302303', 5: '0.378163', 6: '-0.302303'}}
df = pd.DataFrame(data, columns=["a", "b", "c", "d"]).astype(float)  # the values above are strings, so convert them to floats
tolerances = {'a': 0.001, 'b': 0.5, 'c': 0.5, 'd': 0.05}
tolerances_values = np.fromiter(tolerances.values(), dtype=float)
>>> print(df)
a b c d
0 1.764052 0.400157 0.978738 2.240893
1 -0.103219 0.410599 0.144044 1.454274
2 0.761038 0.121675 0.443863 0.333674
3 -0.103219 0.410600 0.144044 1.454274
4 1.240291 1.202380 -0.387327 -0.302303
5 1.532779 1.469359 0.154947 0.378163
6 1.230291 1.202380 -0.387327 -0.302303
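You want to drop rows that are similar enough according to the distances you provided: the differences between two rows must not exceed tolerances.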
from scipy.spatial.distance import pdist, squareform
# Define your similarity function between rows.
def is_similar(x, y):
    """
    Returns True if x is similar to y, False otherwise
    """
    diffs = np.abs(y-x) # Look at absolute differences
    similar = all(diffs <= tolerances_values) # True if all column diffs are within tolerances
    return bool(similar)
# Compute similarities on all your dataframe
similarity_values = pdist(df.to_numpy(dtype=float), is_similar)
# Convert np.array() into a pd.DataFrame()
similarity_df = pd.DataFrame(squareform(similarity_values), index=df.index, columns= df.index)
# Get indices of similar rows (dropna() in case stack() keeps the masked NaN entries)
similar_indices = similarity_df[similarity_df == True].stack().dropna().index.tolist()
# Remove symmetric indices (from i,j i,i and j,i only keep i,j)
similar_indices = [sorted(tpl) for tpl in similar_indices if tpl[0] < tpl[1]]
# Flatten
similar_indices = list(set([item for tpl in similar_indices for item in tpl]))
Here you go:
>>> df[~df.index.isin(similar_indices)]
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 0.761038 0.121675 0.443863 0.333674
4 1.240291 1.202380 -0.387327 -0.302303
5 1.532779 1.469359 0.154947 0.378163
6 1.230291 1.202380 -0.387327 -0.302303
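Note that the result above drops both rows of a similar pair (rows 1 and 3), whereas the df_clean output in the question keeps the first one. If keeping the first occurrence is what you want, the following is a minimal sketch of a greedy variant, not part of the original answer. The function name drop_similar_keep_first is hypothetical. It assumes "similar" means within tolerance in every column, and that the first row of each group is the one to keep.

```python
import numpy as np
import pandas as pd


def drop_similar_keep_first(df, tolerances):
    """Hypothetical helper: keep a row only if no previously kept row
    is within tolerance of it in every column."""
    cols = list(tolerances)
    values = df[cols].to_numpy(dtype=float)
    tol = np.array([tolerances[c] for c in cols], dtype=float)
    kept = []  # positions of the rows kept so far
    for i, row in enumerate(values):
        if kept:
            # Compare row i against all kept rows at once via broadcasting.
            close = np.all(np.abs(values[kept] - row) <= tol, axis=1)
            if close.any():
                continue  # row i duplicates an earlier kept row
        kept.append(i)
    return df.iloc[kept]


# Same data and tolerances as in the answer above.
df = pd.DataFrame({
    "a": [1.764052, -0.103219, 0.761038, -0.103219, 1.240291, 1.532779, 1.230291],
    "b": [0.400157, 0.410599, 0.121675, 0.410600, 1.202380, 1.469359, 1.202380],
    "c": [0.978738, 0.144044, 0.443863, 0.144044, -0.387327, 0.154947, -0.387327],
    "d": [2.240893, 1.454274, 0.333674, 1.454274, -0.302303, 0.378163, -0.302303],
})
tolerances = {"a": 0.001, "b": 0.5, "c": 0.5, "d": 0.05}

result = drop_similar_keep_first(df, tolerances)
print(result.index.tolist())  # [0, 1, 2, 4, 5, 6]
```

Because closeness within a tolerance is not transitive, the rows that survive depend on the row order. The outer loop is still plain Python, but each comparison against the kept rows is vectorized, which is usually much cheaper than nested iloc lookups.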
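[Outdated] Another example using cosine_similarity as the distance
Define a function that computes the similarity and retrieves the indices whose similarity is above a threshold: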
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity # any other can be used
def remove_similar(df, distance, threshold):
    distance_df = distance(df)  # pairwise similarity matrix from the function passed in
    # Pairs of distinct rows whose similarity exceeds the threshold
    similar_indices = [(x,y) for (x,y) in np.argwhere(distance_df>threshold) if x != y]
    similar_indices = list(set([item for tpl in similar_indices for item in tpl]))
    return df[~df.index.isin(similar_indices)]
Now you can try distance=cosine_similarity with different thresholds:
>>> remove_similar(df, cosine_similarity, 0.9)
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 0.761038 0.121675 0.443863 0.333674
5 1.532779 1.469359 0.154947 0.378163
>>> remove_similar(df, cosine_similarity, 0.9999999)
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 0.761038 0.121675 0.443863 0.333674
4 1.240291 1.202380 -0.387327 -0.302303
5 1.532779 1.469359 0.154947 0.378163
6 1.230291 1.202380 -0.387327 -0.302303