首页 > 解决方案 > 将 CSV 文件的每个元素与不同 CSV 文件的每个元素进行比较,并找到最相似的元素

问题描述

我有两个需要比较的 CSV 文件。第一个称为 SAP.csv,第二个称为 SAPH.csv。

SAP.csv 有这些单元格:

Notification    Description
5000000001      Detailed Inspection of Masts (2100mm) (3
5000000002      Ceremonial Awnings-Survey and Load Test
5000000003      HPA-Carry out 4000 hour service routine
5000000004      UxE 8 in Number Temperature Probs for C
5000000005      Overhaul valves

...而 SAPH.csv 具有以下单元格:

Notification   Description
4000000015     Detailed Inspection of Masts (2100mm) (3
4000000016     Ceremonial Awnings-Survey and Load Test
4000000017     HPA-Carry out 8000 hour service routine
4000000018     UxE 8 in Number Temperature Probs for C
4000000019     Represerve valves
4000000020     STW System

它们是相似的,但有些行,如第四行(HPA-执行4000小时服务程序与 HPA-执行8000小时服务程序)略有不同。

我想将 SAP.csv 的每个值与 SAPH.csv 的每个值进行比较,并使用余弦相似度找到最相似的行,以便输出看起来像这样(这里的相似度百分比只是示例,而不是什么他们实际上是):

Description
Detailed Inspection of Masts (2100mm) (3 - 100%
Ceremonial Awnings-Survey and Load Test  - 100%
HPA-Carry out 4000 hour service routine  - 85%
UxE 8 in Number Temperature Probs for C  - 90%
Overhaul valves                          - 0%

发布答案编辑

runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')

回溯(最近一次通话最后):

文件“”,第 1 行,在

runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')

运行文件中的文件“C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”,第 786 行

execfile(filename, namespace)

文件“C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”,第 110 行,在 execfile

exec(compile(f.read(), filename, 'exec'), namespace)

文件“C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py”,第 31 行,在

similarity_score = similar(job, description) # Get their similarity

文件“C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py”,第 14 行,类似

similarity = 1-textdistance.Cosine(qval=2).distance(a, b)

文件“C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py”,第 173 行,距离

return self.maximum(*sequences) - self.similarity(*sequences)

文件“C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py”,第 176 行,相似

return self(*sequences)

文件“C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\token_based.py”,第 175 行,调用中

return intersection / pow(prod, 1.0 / len(sequences))

ZeroDivisionError:浮点除以零

由于上述解决方案的第二次编辑

所以最初的请求只有两个输出——描述和相似度得分。

描述来自 SAP 相似度来自于 textdistance calc

解决方案可以修改为以下

通知(这是 SAP 文件中的 10 位数字) 描述(目前是) 相似性(目前是) 通知(此数字来自 SAPH 文件,将是提供相似度分数的数字)

所以一个示例行输出就像这样

80000115360 附加材料 FWD 绳护罩 86.24% 7123456789

这将沿着 A、B、C、D 列

A、B 来自 SAP C 计算 D 来自 SAPH

编辑 3

运行文件中的文件“C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”,第 786 行

execfile(filename, namespace)

文件“C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”,第 110 行,在 execfile

exec(compile(f.read(), filename, 'exec'), namespace)

文件“C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py”,第 16 行,在

SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'})

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 702 行,在 parser_f

return _read(filepath_or_buffer, kwds)

_read 中的文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 429 行

parser = TextFileReader(filepath_or_buffer, **kwds)

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 895 行,在init

self._make_engine(self.engine)

_make_engine 中的文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 1122 行

self._engine = CParserWrapper(self.f, **self.options)

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 1853 行,在init

self._reader = parsers.TextReader(src, **kwds)

文件“pandas/_libs/parsers.pyx”,第 490 行,在 pandas._libs.parsers.TextReader 中。初始化

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py”,第 2017 行,在 pandas_dtype

dtype))

TypeError:不理解数据类型“字符串”

编辑后 4 - 2020 年 10 月 25 日

嗨,所以得到了和我之前想的一样的错误

此电子邮件可能包含 BAE Systems 和/或第三方的专有信息。

运行文件中的文件“C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”,第 786 行

execfile(filename, namespace)

文件“C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”,第 110 行,在 execfile

exec(compile(f.read(), filename, 'exec'), namespace)

文件“C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py”,第 16 行,在

SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 702 行,在 parser_f

return _read(filepath_or_buffer, kwds)

_read 中的文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 435 行

data = parser.read(nrows)

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 1139 行,已读取

ret = self._engine.read(nrows)

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 2421 行,正在读取

data = self._convert_data(data)

_convert_data 中的文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 2487 行

clean_conv, clean_dtypes)

_convert_to_ndarrays 中的文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 1705 行

cvals = self._cast_types(cvals, cast_type, c)

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py”,第 1808 行,在 _cast_types

copy=True, skipna=True)

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py”,第 623 行,在 astype_nansafe

dtype = pandas_dtype(dtype)

文件“C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py”,第 2017 行,在 pandas_dtype

dtype))

TypeError:不理解数据类型“字符串”

我了解了您对分隔符的看法,因此我将 csv 文件上传到 repl.it,它看起来好像“,”是分隔符。

因此修改了代码以适应。当我在 repl.it 上这样做时,它起作用了。

这是我正在使用的代码

导入文本距离

将熊猫导入为 pd

def similar(a, b): # 改编自这里:https ://stackoverflow.com/a/63838615/8402369

similarity = 1-textdistance.Cosine(qval=2).distance(a, b)

return similarity * 100

阅读 CSV

SAP = pd.read_csv('H:\Documents/Python/导入到 Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")

SAPH = pd.read_csv('H:\Documents/Python/导入到 Python/SAP/SAP_History.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")

创建一个 pandas 数据框来存储输出。'Description' 列填充了 SAP['Description'] 的值

score = pd.DataFrame(SAP['Description'], columns = ['Notification (SAP)','Description', 'Similarity', 'Notification (SAPH)'])

存储最高相似度分数的临时变量

最高分 = 0

描述 = 0

迭代 SAP['Description']

在 SAP ['Description'] 中的工作:

high_score = 0 # 在每次迭代中重置最高得分

for description in SAPH['Description']: # Iterate through SAPH['Description']

similarity_score = similar(job, description) # Get their similarity



if(similarity_score > highest_score): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values

  highest_score = similarity_score

  desc = str(description)

if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.

  break

使用最高分数和其他值更新数据框“分数”

打印(SAPH['描述'][SAPH['描述'] == desc])

分数['通知 (SAP)'][分数['描述'] == 工作] = SAP['通知'][SAP['描述'] == 工作]

分数['相似度'][分数['描述'] ==工作] = f'{highest_score}%'

分数['通知 (SAPH)'][分数['描述'] == 工作] = SAPH['通知'][SAPH['描述'] == desc]

打印(分数)

将其输出到 Scores.csv 没有索引列

使用 open('./Scores.csv', 'w') 作为文件:

file.write(scores.__repr__())

在 Spyder (Python 3.7) 上运行

标签: python-3.xcsvcosine-similarity

解决方案


@George_Pipas这个问题的回答演示了一个使用该库的示例textdistance(我在这里解释他的部分答案):

一个解决方案是与textdistance图书馆合作。我将提供一个例子Cosine Similarity

import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')

我们得到:

0.5

因此,我们可以创建一个相似度查找函数:

def similar(a, b):
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)     
    return similarity

根据相似性,如果ab更相似,这将输出一个接近 1 的数字,如果不是,它将输出一个接近 0 的数字。因此,如果a === b,则输出将为1,但如果a !== b,则输出将小于 1。

要获得百分比,您只需将输出乘以 100。像这样:

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b) 
    return similarity * 100

CSV 文件可以通过以下方式轻松读取pandas

# Read the CSVs
SAP = pd.read_csv('SAP.csv') 
SAPH = pd.read_csv('SAPH.csv')

我们创建另一个pandas 数据框来存储我们将计算的结果:

# Create a pandas dataframe to store the output. The column 'SAP' is populated with the values of SAP['Description']
scores = pd.DataFrame({'SAP': SAP['Description']}, columns = ['SAP', 'SAPH', 'Similarity']) 

现在,我们遍历SAP['Description']SAPH['Description'],将每个元素与其他元素进行比较,计算它们的相似性,并将最高的保存到scores

# Temporary variable to store both the highest similarity score, and the 'SAPH' value the score was computed with
highest_score = {"score": 0, "description": ""}

# Iterate though SAP['Description']
for job in SAP['Description']:
  highest_score = {"score": 0, "description": ""} # Reset highest_score at each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job, description) # Get their similarity

    if(similarity_score > highest_score['score']): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
      highest_score['score'] = similarity_score
      highest_score['description'] = description
    if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
      break
  # Update the dataframe 'scores' with highest_score
  scores['SAPH'][scores['SAP'] == job] = highest_score['description'] 
  scores['Similarity'][scores['SAP'] == job] = highest_score['score']

这是一个细分:

  1. 创建一个临时变量highest_score来存储最高计算分数。
  2. 现在我们进行彻底的迭代SAP['Description'],并且在内部进行迭代SAPH['Description']。这允许我们将 ( ) 的每个值与( ) 的SAP['Description']每个值进行比较。jobSAPH['Description']description
  3. 在迭代SAPH['Description']时,我们:
    1. 计算两者的相似度得分jobdescription
    2. 如果它高于保存的分数highest_score,我们会highest_score相应地更新;否则我们继续
    3. 如果similarity_score等于100,我们知道这是一个完美的匹配,不必继续寻找。在这种情况下,我们打破了循环。
  4. SAPH['Description']循环之外,现在我们已经比较job了 , 的每个元素SAPH['Description'](或找到了完美匹配),我们将值保存到scores.

这对 的每个元素重复SAP['Description']

这是scores完成后的样子:

                                        SAP                                      SAPH Similarity
0  Detailed Inspection of Masts (2100mm) (3  Detailed Inspection of Masts (2100mm) (3        100
1   Ceremonial Awnings-Survey and Load Test   Ceremonial Awnings-Survey and Load Test        100
2   HPA-Carry out 4000 hour service routine   HPA-Carry out 8000 hour service routine    94.7368
3   UxE 8 in Number Temperature Probs for C   UxE 8 in Number Temperature Probs for C        100
4                           Overhaul valves                         Represerve valves    53.4522

并将其输出到 CSV 文件后:

# Output it to Scores.csv without the index column (0, 1, 2, 3... far left in scores above). Remove index=False if you want to keep the index column.
scores.to_csv('Scores.csv', index=False)

... Scores.csv看起来像这样:

SAP,SAPH,Similarity
Detailed Inspection of Masts (2100mm) (3,Detailed Inspection of Masts (2100mm) (3,100
Ceremonial Awnings-Survey and Load Test,Ceremonial Awnings-Survey and Load Test,100
HPA-Carry out 4000 hour service routine,HPA-Carry out 8000 hour service routine,94.73684210526315
UxE 8 in Number Temperature Probs for C,UxE 8 in Number Temperature Probs for C,100
Overhaul valves,Represerve valves,53.45224838248488

查看完整代码,在线运行编辑

请注意, textdistance pandas 是为此所需的库。安装它们,如果您还没有它们,请使用:

pip install textdistance pandas

笔记:


编辑:(对于评论中提到的遇到的问题)

这是相似度函数的错误捕获版本:

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
  try: 
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b) 
    return similarity * 100
  except ZeroDivisionError:
    print('There was an error. Here are the values of a and b that were passed')
    print(f'a: {repr(a)}')
    print(f'b: {repr(b)}')
    exit()

推荐阅读