Unable to compare 2 Python sets containing strings

Problem description

I created 2 Python sets from 2 different CSV files, each containing some strings.

I am trying to match the 2 sets so that I get their intersection (it should return the strings common to both sets).

This is what my code looks like:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import nltk
# using a context manager to open and read the file
# converted the text file into a CSV file at the source using Notepad++
with open(r'skills.csv', 'r', encoding="utf-8-sig") as f:
    myskills = f.readlines()
    # converting all the strings in the list to lowercase
    list_of_myskills = map(lambda x: x.lower(), myskills)
    set_of_myskills = set(list_of_myskills)
    #print(type(nodup_filtered_content))
print(set_of_myskills)
#open and read by line from the text file
with open(r'list_of_skills.csv', 'r') as f2:
    # using readlines() instead of read(), because it reads line by line
    # (each line as a string object in the Python list)
    contents_f2 = f2.readlines()
    # converting all the strings in the list to lowercase
    list_of_skills = map(lambda x: x.lower(), contents_f2)
    #converting into sets
    set_of_skills = set(list_of_skills)
print(set_of_skills)

This is the function I am using:

def set_compare(set1, set2):
    # a non-empty intersection is truthy
    if set1 & set2:
        print('The matching skills are:', set1 & set2)
    else:
        print("No matching skills")

After I run the code:

    set_compare(set_of_skills,set_of_myskills)

Output:

No matching skills

The contents of 'skills.csv' are:

{'critical thinking,identify user needs,business intelligence,business analysis,teamwork,database,data visualization,data analysis,relational database,mysql,oracle sql,design,entity-relationship,develop ,use-cases ,scenarios,project development ,user requirement,design,sequence diagram,state diagram,identifying,uml diagrams,html5,css3,php,clean,analyze,plot,data,python,pandas,numpy,matplotlib,ipython notebook,spyder,anaconda,jupyterlab,data analysis,data visualization,tableau,database,surveys,prototyping,logical data models,data models,requirement elicitation.,leadreship,mysq,team,prioratization,analyze,articulate,'}


The contents of the file 'list_of_skills.csv':

{'assign passwords and maintain database access,agile development,agile project methodology,amazon web services (aws),analytics,analytical,analyze and recommend database improvements,analyze impact of database changes to the business,audit database access and requests,apis,application and server monitoring tools,applications,application development,attention to detail,architecture,big data,business analytics,business intelligence,business process modeling,cloud applications,cloud based visualizations,cloud hosting services,cloud maintenance tasks,cloud management tools,cloud platforms,cloud scalability,cloud services,cloud systems administration,code,coding,computer,communication,configure database software,configuration,configuration management,content strategy,content management,continually review processes for improvement ,continuous deployment,continuous integration,critical thinking,customer support,database,data analysis,data analytics,data imports,data imports,data intelligence,data mining,data modeling,data science,data strategy,data storage,data visualization tools,data visualizations,database administration,deploying applications in a cloud environment,deployment automation tools,deployment of cloud services,design,desktop support,design,design and build database management system,design principles,design prototypes,design specifications,design tools,develop and secure network structures,develop and test methods to synchronize data ,developer,development,documentation,emerging technologies,file systems,flexibility,front end design,google analytics,hardware,help desk,identify user needs ,implement backup and recovery plan ,implementation,information architecture,information design,information systems,interaction design,interaction flows,"install, maintain, and merge databases ",installation,integrated technologies,integrating security protocols with cloud design,internet,it optimization,it security,it soft skills,it solutions,it support,languages,logical 
thinking,leadership,linux,management,messaging,methodology,metrics,microsoft office,migrating existing workloads into cloud systems,mobile applications,motivation,networks,network operations,networking,open source technology integration,operating systems,operations,optimize queries on live data,optimizing user experiences,optimizing website performance,organization,presentation,programming,problem solving,process flows,product design,product development,prototyping methods,product development,product management,product support,product training,project management,repairs,reporting,research emerging technology,responsive design,review existing solutions,search engine optimization (seo),security,self motivated,self starting,servers,software,software development,software engineering,software quality assurance (qa),solid project management capabilities ,solid understanding of company’s data needs ,storage,strong technical and interpersonal communication ,support,systems software,tablets,team building,team oriented,teamwork,technology,tech skills,technical support,technical writing,testing,time management,tools,touch input navigation,training,troubleshooting,troubleshooting break-fix scenarios,user research,user testing,usability,user-centered design,user experience,user flows,user interface,user interaction diagrams,user research,user testing,ui / ux,utilizing cloud automation tools,virtualization,visual design,web analytics,web applications,web development,web design,web technologies,wireframes,work independently,'}

Although I can see matching keywords in both files, I don't understand why I am not getting any matches in the output.

I am not getting any errors either.

Tags: python, jupyter-notebook, jupyter-lab

Solution


Comparing two sets of strings does not compare the substrings of those strings. What your program is essentially doing is:

foo = {'ABC', 'DEF', 'GHI'}
bar = {'AB', 'CD', 'DE', 'FG', 'HI'}

foo.intersection(bar)  # returns set() (an empty set)

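Just because strings in different sets share characters does not mean the sets intersect. The string 'ABC' is in the first set but not the second, the string 'AB' is in the second but not the first, and so on.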
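The point above can be illustrated with a minimal sketch. The sample strings here are made up for illustration (they are not the real file contents); they mimic what readlines() returns for a file whose skills all sit on one comma-separated line.

```python
# Two one-element sets: each element is a whole comma-separated line,
# mirroring what readlines() produces for a single-line CSV file.
# (Illustrative sample data, not the real files.)
whole_lines_a = {"python,mysql,teamwork,\n"}
whole_lines_b = {"linux,teamwork,python,\n"}

# Set intersection compares whole elements for equality, so nothing matches.
print(whole_lines_a & whole_lines_b)  # set()

# After splitting on commas, each skill becomes its own element.
skills_a = {s for line in whole_lines_a for s in line.strip().split(",") if s}
skills_b = {s for line in whole_lines_b for s in line.strip().split(",") if s}

common = skills_a & skills_b
print(sorted(common))  # ['python', 'teamwork']
```

The `if s` filter drops the empty string left behind by the trailing comma on each line.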

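It is a little unclear what exactly you want to compare between the two CSVs. Do you want to find individual cells that appear in both? Do they also have to match by column? If you provide more information about the expected output, I can edit this answer to be more specific.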

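[Edit] Based on your comment, it looks like what you want is to split those huge strings on commas so that the elements of the sets become individual cells. Currently, each set has only one element, and that element is just one giant string containing many skills. If you replace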

list_of_myskills = map(lambda x: x.lower(), myskills)

with

list_of_myskills = [y.strip().lower() for x in myskills for y in x.split(',')]

and replace the other similar line accordingly, then you will probably get closer to what you expect.
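One caveat: a plain split(',') also splits quoted cells such as "install, maintain, and merge databases " in list_of_skills.csv. A possible alternative, sketched below, is to let the standard csv module parse the rows. The helper name read_skills and the in-memory sample data are assumptions made for illustration, not part of the original code.

```python
import csv
import io


def read_skills(lines):
    """Return a set of stripped, lowercased cells from CSV text lines.

    `lines` is any iterable of strings, e.g. an open file object.
    (Hypothetical helper, named here for illustration only.)
    """
    skills = set()
    for row in csv.reader(lines):
        for cell in row:
            cell = cell.strip().lower()
            if cell:  # skip empty cells, e.g. after a trailing comma
                skills.add(cell)
    return skills


# Small in-memory stand-ins for the two files (sample data, not the real CSVs).
my_file = io.StringIO('Python,MySQL,teamwork ,\n')
all_file = io.StringIO('"install, maintain, and merge databases ",python,teamwork,\n')

my_skills = read_skills(my_file)
all_skills = read_skills(all_file)

matching = my_skills & all_skills
print(sorted(matching))  # ['python', 'teamwork']
```

With the real files, the same helper could be called on an open file object, for example inside `with open('skills.csv', encoding='utf-8-sig', newline='') as f:`. Stripping each cell also removes the stray trailing spaces seen in entries like 'identify user needs ', which would otherwise prevent an exact match.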

