Compare two CSV files and write whether each entry matches to a new CSV file

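Problem description

I have two CSV files: profile.csv and data.csv. profile.csv has data under two columns, company_name and job_description. data.csv has data under the same two columns, company_name and job_description.

What I want is for each description (qualification) in profile.csv to be compared against the descriptions in data.csv, and to get output saying whether each description (qualification) matches or not.

As I see it, the output should look like this: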

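Company - - - - - - - - - - -

PPD Global Ltd

Job description - - - - -

Bachelor's/advanced degree education in a scientific discipline -- matching

Previous experience in regulatory medical writing -- matching

Excellent grammatical, editorial, and proofreading skills -- matching

Effective organizational and planning skills -- matching

Positive attitude, initiative, and adaptability; ability to work effectively in a team -- matching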

Here is what I have tried so far.

It only matches the whole job_description, not each sentence:

import csv

# Map each job_description in profile.csv to its row index for fast lookup.
with open('C:\\Users\\Izzath  Ali\\Desktop\\Data Mining\\profile.csv', 'rt', encoding='utf-8') as csvfile1:
    csvfile1_indices = dict((r[1], i) for i, r in enumerate(csv.reader(csvfile1)))

with open('C:\\Users\\Izzath  Ali\\Desktop\\Data Mining\\data.csv', 'rt', encoding='utf-8') as csvfile2:
    # newline='' is what the csv module expects for files it writes to.
    with open('outputText-mining.csv', 'w', newline='', encoding='utf-8') as results:
        reader = csv.reader(csvfile2)
        writer = csv.writer(results)

        # Copy the header row and add a status column.
        writer.writerow(next(reader, []) + ['status'])

        for row in reader:
            # Looks up the whole description, so only exact full-text matches are found.
            index = csvfile1_indices.get(row[1])
            if index is not None:
                message = '-- matching'
            else:
                message = '-- not matching'
            writer.writerow(row + [message])
# No explicit results.close() is needed: the with block closes the file.

Tags: python, csv, data-mining, text-mining

Solution

I'll simplify your data structure to make a demonstration possible:

file1 = """company1, "sent1. sent2. sent3"
company2, "sent4. sent5. sent6"
company3, "sent7. sent8. sent9"
"""

file2 = """companyA, "sent1. sent20. sent3"
companyB, "sent40. sent5."
companyC, "sent5. sent1. sent60"
"""

First, I load the data into a data structure: a list of dictionaries, one dictionary per company.

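def load_company_dicts(text):
    """Parse 'company, "sent. sent."' lines into a list of one-key dicts."""
    list_of_company_dicts = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, such as one left by a trailing newline
        # Split on the first comma only, so the description stays in one piece.
        company, description = line.split(',', 1)
        print('company:', company)
        # Drop the surrounding quotes, split on '.', and discard empty fragments.
        list_of_sent = [s.strip() for s in description.strip().strip('"').split('.') if s.strip()]
        # Append inside the loop so every company is kept, not just the last one.
        list_of_company_dicts.append({company.strip(): list_of_sent})
    return list_of_company_dicts

list_of_file1_company_dicts = load_company_dicts(file1)
list_of_file2_company_dicts = load_company_dicts(file2)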
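Splitting each line by hand is fine for this simplified data, but it depends on where commas and quotes appear in the real files. As a rough alternative sketch, the standard csv module can do the parsing. The helper name load_sentences and the sample string are made up for illustration, and splitting sentences on '.' is an assumption that will not suit every job description.

```python
import csv
import io

def load_sentences(text):
    """Map each company name to its list of non-empty, stripped sentences."""
    companies = {}
    # skipinitialspace=True lets the reader accept a space between the comma
    # and the opening quote, as in the simplified sample data.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete rows
        company, description = row[0], row[1]
        # Naive sentence split on '.'; empty fragments are discarded.
        companies[company] = [s.strip() for s in description.split('.') if s.strip()]
    return companies

# Hypothetical sample row with a comma inside the quoted description.
sample = 'company1, "sent1. sent2, with a comma. sent3"\n'
parsed = load_sentences(sample)
print(parsed)
```

For real files, the same reader can be given an open file object instead of io.StringIO.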

Then loop over both data structures to find the sentences that the dictionary values have in common:

# Compare every sentence of every file1 company with every sentence of every file2 company.
for file1_company_dict in list_of_file1_company_dicts:
    for company1_name, list_of_sent1 in file1_company_dict.items():
        for sent1 in list_of_sent1:
            for file2_company_dict in list_of_file2_company_dicts:
                for company2_name, list_of_sent2 in file2_company_dict.items():
                    for sent2 in list_of_sent2:
                        if sent1 == sent2:
                            # Both companies list this exact sentence.
                            print(company1_name, company2_name)
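To get closer to the output format asked for in the question, the comparison can also be expressed as a set lookup that labels every profile sentence as matching or not matching. This is only a sketch under some assumptions: the names split_sentences and match_report are invented here, sentences are split naively on '.', the comparison ignores case, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

def split_sentences(description):
    """Naively split a description on '.' and drop empty fragments."""
    return [s.strip() for s in description.split('.') if s.strip()]

def match_report(profile_rows, data_rows):
    """Return (company, sentence, status) for every sentence in profile_rows."""
    # Collect every sentence seen anywhere in the data rows (lowercased).
    known = {s.lower() for _, desc in data_rows for s in split_sentences(desc)}
    report = []
    for company, desc in profile_rows:
        for sentence in split_sentences(desc):
            status = 'matching' if sentence.lower() in known else 'not matching'
            report.append((company, sentence, status))
    return report

# Illustrative rows in the shape (company_name, job_description).
profile_rows = [('company1', 'sent1. sent2. sent3')]
data_rows = [('companyA', 'sent1. sent20. sent3'), ('companyB', 'sent40. sent5.')]

report = match_report(profile_rows, data_rows)

# io.StringIO stands in for open('some_output.csv', 'w', newline='').
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['company_name', 'sentence', 'status'])
writer.writerows(report)
print(out.getvalue())
```

Building the set once avoids rescanning the second file for every sentence. Exact string equality is strict, so real descriptions with slightly different wording would need a looser comparison.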
