Multiple loops and multiple splits on a Pandas DataFrame

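Problem description

I have a CSV file with 22,000 rows of author names.

  1. Each row contains several author names, separated by ";".
  2. Each author name in a row is written in "last name, first name" order.

I want to split them and append the parts as new columns, as shown below.

Original dataset preview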

+------------------------------------+
|           author_full_name         |
+------------------------------------+
| Kahana, M J; Adler, M              |
|Gautam, H; Potdar, G G; Vidya, T N C|
+------------------------------------+

Expected output

+--------------------------------------+--------------------+-----------------------+
| author_full_name                     | author_first_names | author_last_names     |
+--------------------------------------+--------------------+-----------------------+
| Kahana, M J; Adler, M                | M J; M             | Kahana; Adler         |
| Gautam, H; Potdar, G G; Vidya, T N C | H; G G; T N C      | Gautam; Potdar; Vidya |
+--------------------------------------+--------------------+-----------------------+

How can I do this with pandas?

Tags: python, pandas, csv, data-science, data-cleaning

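Solution

The logic is essentially to split on ";" first, then split each resulting value on ",", taking the first piece as the last name and the second piece as the first name.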

>>> [x.split(",")[0] for x in "Gautam, H; Potdar, G G; Vidya, T N C".split(";")]
['Gautam', ' Potdar', ' Vidya']
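
The list comprehension above keeps the leading spaces left over from the ";" split. Below is a minimal sketch of the same split-then-split logic for a single string, with the whitespace stripped. The helper name split_author_names is hypothetical, and the sketch assumes every author is written as "Last, First".

```python
def split_author_names(full_names):
    """Split 'Last, First; Last, First' into (first_names, last_names) strings."""
    first_names, last_names = [], []
    for author in full_names.split(";"):
        author = author.strip()
        if not author:
            continue  # skip empty pieces, e.g. from a trailing ';'
        # partition splits on the first comma only: (last, ',', first)
        last, _, first = author.partition(",")
        last_names.append(last.strip())
        first_names.append(first.strip())
    return "; ".join(first_names), "; ".join(last_names)


first, last = split_author_names("Gautam, H; Potdar, G G; Vidya, T N C")
print(first)  # H; G G; T N C
print(last)   # Gautam; Potdar; Vidya
```

If an author has no comma, this sketch puts the whole string in the last-name list and an empty string in the first-name list; whether that is the right choice depends on the data.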

In pandas, using apply:

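import pandas as pd

df = pd.DataFrame({"Name": ["Gautam, H; Potdar, G G; Vidya, T N C", "Kahana, M J; Adler, M "]})
# Each author is "Last, First": after splitting on ",", index 1 is the first name and index 0 the last name.
# strip() removes the spaces left around each piece by the ";" and "," splits.
df['author_first_names'] = df['Name'].apply(lambda x: "; ".join(ele.split(",")[1].strip() for ele in x.split(";")))
df['author_last_names'] = df['Name'].apply(lambda x: "; ".join(ele.split(",")[0].strip() for ele in x.split(";")))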

df

Output:

------------------------------------|-----------------|------------------------
Gautam, H; Potdar, G G; Vidya, T N C  H; G G; T N C      Gautam; Potdar; Vidya
Kahana, M J; Adler, M                 M J; M             Kahana; Adler
------------------------------------|-----------------|------------------------
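
As an alternative to apply with a lambda, the same result can probably be reached with pandas string methods plus explode. The sketch below is one possible approach rather than the answer's own method. It uses the column name author_full_name from the question and assumes every cell is a non-missing string in which each author contains a comma.

```python
import pandas as pd

df = pd.DataFrame({
    "author_full_name": [
        "Kahana, M J; Adler, M",
        "Gautam, H; Potdar, G G; Vidya, T N C",
    ]
})

# One row per author; the original row index is repeated for each author.
authors = df["author_full_name"].str.split(";").explode().str.strip()

# Split each "Last, First" on the first comma only: column 0 = last, column 1 = first.
parts = authors.str.split(",", n=1, expand=True)
first = parts[1].str.strip()
last = parts[0].str.strip()

# Re-join the pieces per original row; assignment aligns on the index.
df["author_first_names"] = first.groupby(level=0).agg("; ".join)
df["author_last_names"] = last.groupby(level=0).agg("; ".join)

print(df)
```

Rows with missing values or authors without a comma would need extra handling (for example a fillna step) before the final join.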
