Find a specific string in a column and aggregate it with other columns

Problem description

I have an .xlsx dataset like this:

df = pd.DataFrame({'AuthorName':["Wendelaar Bonga"," Sjoerd E.", "Grätzel"," Michael", "Willett", "Walter C.", "Kessler", "Ronald C.", "Witten, Edward", "Wang, Zhong Lin"],
               'SubjectField': ["Biomedical Engineering", "Inorganic & Nuclear Chemistry", "Organic Chemistry", "Biomedical Engineering", "Developmental Biology", "Mechanical Engineering & Transports", "Biomedical Engineering", "Microbiology", "Cardiovascular System & Hematology", "Biomedical Engineering"],
              'NumberOfPapers':[10, 28, 34, 56, 78, 90, 54, 54, 32, 14],
              'totalAuthorsWithinField':[5, 10, 11, 30, 56, 34, 13, 45,23, 7]})

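The dataset looks like this. I want to find the subject fields that contain "Engineer", then calculate the average number of papers for each engineering field, and display that table together with the total number of authors within each field.

My output should be this table:

I tried this code, but I get an error: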

#add a new column that is 1 if 'Engineer' appears in the Subject Field, else 0
data['isEngineeringRelated']=data['SubjectField'].map(lambda x: 1 if 'Engineer' in x else 0)
#filter for engineering rows
engineering_data = data[data['isEngineeringRelated']==1]
#groupby the engineering fields and count the average number of papers of authors in that field
display(engineering_data.groupby('SubjectField')['NumberOfPapers'].mean())

Tags: python, pandas-groupby, data-analysis

Solution

You can use str.contains to check whether SubjectField contains 'Engineer':

import pandas as pd

df = pd.DataFrame({'AuthorName':["Wendelaar Bonga"," Sjoerd E.", "Grätzel"," Michael", "Willett", "Walter C.", "Kessler", "Ronald C.", "Witten, Edward", "Wang, Zhong Lin"],
               'SubjectField': ["Biomedical Engineering", "Inorganic & Nuclear Chemistry", "Organic Chemistry", "Biomedical Engineering", "Developmental Biology", "Mechanical Engineering & Transports", "Biomedical Engineering", "Microbiology", "Cardiovascular System & Hematology", "Biomedical Engineering"],
              'NumberOfPapers':[10, 28, 34, 56, 78, 90, 54, 54, 32, 14],
              'totalAuthorsWithinField':[5, 10, 11, 30, 56, 34, 13, 45,23, 7]})

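# Boolean column: True where SubjectField contains 'Engineer'
df['isEngineeringRelated'] = df['SubjectField'].str.contains('Engineer')

# Keep only the engineering rows (the boolean column can be used directly as a mask)
engineering_data = df[df['isEngineeringRelated']]

# Per field: average number of papers and total number of authors
engineering_data.groupby('SubjectField').agg({
    'NumberOfPapers': 'mean',
    'totalAuthorsWithinField': 'sum',
}).reset_index()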

                          SubjectField  NumberOfPapers  totalAuthorsWithinField
0               Biomedical Engineering            33.5                       55
1  Mechanical Engineering & Transports            90.0                       34
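
If you need this more than once, the same filter-then-aggregate steps can be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name summarize_engineering_fields, the keyword parameter, and the output column names AvgNumberOfPapers and TotalAuthors are made up for illustration and do not come from the question. It also assumes you want a case-insensitive, literal (non-regex) match and that rows with a missing SubjectField should simply be skipped.

```python
import pandas as pd


def summarize_engineering_fields(df: pd.DataFrame, keyword: str = "Engineer") -> pd.DataFrame:
    """Average papers and total authors for each subject field containing `keyword`.

    Hypothetical helper for illustration; the name and output columns are assumptions.
    """
    # case=False ignores letter case, regex=False treats the keyword as plain text,
    # and na=False counts missing subject fields as non-matches.
    mask = df["SubjectField"].str.contains(keyword, case=False, na=False, regex=False)
    return (
        df[mask]
        .groupby("SubjectField", as_index=False)
        .agg(
            AvgNumberOfPapers=("NumberOfPapers", "mean"),
            TotalAuthors=("totalAuthorsWithinField", "sum"),
        )
    )


# Same subject, paper, and author columns as in the question's sample data.
df = pd.DataFrame({
    "SubjectField": ["Biomedical Engineering", "Inorganic & Nuclear Chemistry", "Organic Chemistry",
                     "Biomedical Engineering", "Developmental Biology",
                     "Mechanical Engineering & Transports", "Biomedical Engineering", "Microbiology",
                     "Cardiovascular System & Hematology", "Biomedical Engineering"],
    "NumberOfPapers": [10, 28, 34, 56, 78, 90, 54, 54, 32, 14],
    "totalAuthorsWithinField": [5, 10, 11, 30, 56, 34, 13, 45, 23, 7],
})

summary = summarize_engineering_fields(df)
print(summary)
```

Named aggregation (the column=(source, function) form) lets you rename the result columns in the same step, which makes it clearer that one column is a mean and the other a sum.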
