python-3.x - pandas groupby and fill columns based on values
Question
Given a DataFrame (https://pastebin.com/MdqWz4Ke)
import numpy as np
import pandas as pd

# some data
data3 = [["Alex","Tampa","A23","1","Ax","Red"],
["Alex","Tampa","A23","1","Ay","Blue"],
["Alex","Tampa","B43","1","Bx","Green"],
["Alex","Tampa","B43","1","By","White"],
["Alex","Tampa","C55","1","Cx","Red"],
["Alex","Tampa","C55","1","Cy","White"],
["Alex","Tampa","C55","2","Cx","Purple"],
["Alex","Tampa","C55","2","Cy","Black"],
["Tim","San Diego","A23","1","Ax","Green"],
["Tim","San Diego","A23","1","Ay","Black"],
["Tim","San Diego","B43","1","Bx","Yellow"],
["Tim","San Diego","B43","1","By","Black"],
["Tim","San Diego","C55","1","Cx","Pink"],
["Tim","San Diego","C55","1","Cy","Orange"],
["Tim","San Diego","A23","2","Ax","Green"],
["Tim","San Diego","A23","2","Ay","Red"],
["Tim","San Diego","B43","2","Bx",""],
["Tim","San Diego","B43","2","By",""],
["Mark","Houston","A23","1","Ax","Purple"],
["Mark","Houston","A23","1","Ay","Yellow"],
["Mark","Houston","B43","1","Bx","Gray"],
["Mark","Houston","B43","1","By","White"],
["Mark","Houston","C55","1","Cx",""],
["Mark","Houston","C55","1","Cy",""],
["Anthony","Seattle","A23","","Ax","Orange"],
["Anthony","Seattle","A23","","Ay","Black"],
["Anthony","Seattle","B43","","Bx","Red"],
["Anthony","Seattle","B43","","By","Black"],
["Anthony","Seattle","C55","","Cx","Blue"],
["Anthony","Seattle","C55","","Cy","Pink"]]
# create dataframe
df3 = pd.DataFrame(data3,columns=[
"Name","City","Domain","Sequence","Group","Value"])
how can I compare the values within each group and use them to conditionally fill columns?
# add Compared columns
df3["Compared Group"] = ""
df3["Compared Value"] = ""
# replace empty or whitespace-only strings with np.nan
df3.replace(r"^\s*$", np.nan, regex=True, inplace=True)
# fillna for missing Sequence and Value
df3.fillna({"Sequence":"N/A","Value":"NULL"},inplace=True)
# expected result
result = [["Alex","Tampa","A23","1","Ax","Red","Ay","Blue"],
["Alex","Tampa","B43","1","Bx","Green","By","White"],
["Alex","Tampa","C55","1","Cx","Red","Cy","White"],
["Alex","Tampa","C55","2","Cx","Purple","Cy","Black"],
["Tim","San Diego","A23","1","Ax","Green","Ay","Black"],
["Tim","San Diego","A23","2","Ax","Green","Ay","Red"],
["Tim","San Diego","B43","1","Bx","Yellow","By","Black"],
["Tim","San Diego","B43","2","Bx","NULL","By","NULL"],
["Tim","San Diego","C55","1","Cx","Pink","Cy","Orange"],
["Mark","Houston","A23","1","Ax","Purple","Ay","Yellow"],
["Mark","Houston","B43","1","Bx","Gray","By","White"],
["Mark","Houston","C55","1","Cx","NULL","Cy","NULL"],
["Anthony","Seattle","A23","","Ax","Orange","Ay","Black"],
["Anthony","Seattle","B43","","Bx","Red","By","Black"],
["Anthony","Seattle","C55","","Cx","Blue","Cy","Pink"]]
result_df = pd.DataFrame(result,columns=[
"Name","City","Domain","Sequence","Group",
"Value","Compared Group","Compared Value"])
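Notes:
If a person's Group value matches another one (Ax to Ay, Bx to By, for example) and the Sequence number is the same, then fill the Compared Group and Compared Value columns with the corresponding Group and Value.
City and Domain are not considered in the comparison, but all columns need to be retained.
Some rows have no Sequence number, so I filled the nulls with N/A in order to have a value to group on. Some rows also have no value in the Value column, so I filled those with NULL so that there is something to put in the Compared Value column.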
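The reason a placeholder such as N/A is needed for grouping can be illustrated with a small sketch. By default, groupby leaves out rows whose key is NaN. Filling the key first keeps them, and in pandas 1.1 or later passing dropna=False is another way to keep them. The three-row frame below is a made-up subset for illustration, not the full data.

```python
import numpy as np
import pandas as pd

# Made-up three-row subset: Anthony has no Sequence number
df = pd.DataFrame({
    "Name": ["Anthony", "Anthony", "Alex"],
    "Sequence": [np.nan, np.nan, "1"],
    "Value": ["Orange", "Black", "Red"],
})

# Default behaviour: rows with a NaN grouping key are left out
default_sizes = df.groupby(["Name", "Sequence"]).size()

# Option 1 (pandas >= 1.1): keep NaN keys as their own group
kept_sizes = df.groupby(["Name", "Sequence"], dropna=False).size()

# Option 2 (what the question does): fill the key before grouping
filled_sizes = df.fillna({"Sequence": "N/A"}).groupby(["Name", "Sequence"]).size()

print(default_sizes.sum(), kept_sizes.sum(), filled_sizes.sum())
```

With the default settings only Alex's row is counted, while both alternatives account for all three rows.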
I created a dictionary that maps the Group values
# map groups with dictionary
group_dict = {"Ax":"Ay","Bx":"By","Cx":"Cy"}
and created a groupby object
# groupby
grouped = df3.groupby(["Name","Sequence","Domain","Group"], group_keys=False)
My initial plan was to use .loc to fill the Compared columns, possibly together with map and the dictionary, but when I try to access the values in a group...
for name in df3["Name"]:
    print(grouped.get_group((name,"Ax")))
I get the following error:
ValueError: must supply a a same-length tuple to get_group with multiple grouping keys
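I assume this is because not all groups contain the same number and type of Group values (for example, Tim has Sequence 1 and 2 for Ax, whereas Alex only has Sequence 1 for Ax). I'm not sure how to proceed from here to merge and transform these rows.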
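As far as the error message itself goes, it refers to the length of the tuple passed to get_group rather than to the contents of the groups. The groupby object was built on four keys (Name, Sequence, Domain, Group), so get_group expects a tuple with four entries in that order. A minimal sketch on a made-up three-row frame:

```python
import pandas as pd

# Made-up three-row subset with the same grouping columns as the question
df = pd.DataFrame(
    [["Alex", "1", "A23", "Ax", "Red"],
     ["Alex", "1", "A23", "Ay", "Blue"],
     ["Tim", "2", "A23", "Ax", "Green"]],
    columns=["Name", "Sequence", "Domain", "Group", "Value"],
)
grouped = df.groupby(["Name", "Sequence", "Domain", "Group"])

# Four grouping keys, so the lookup tuple needs four entries in the same order
alex_ax = grouped.get_group(("Alex", "1", "A23", "Ax"))
print(alex_ax)

# A two-entry tuple cannot be matched against four keys
try:
    grouped.get_group(("Alex", "Ax"))
except (ValueError, KeyError) as exc:
    print(type(exc).__name__, exc)
```

Even with the lookup fixed, each of these groups holds a single Ax or Ay row, so grouping on Group would still not put the two rows of a pair side by side.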
Solution
Given your sample data, you could do the following:
def myfunc(x):
    # take rows 0, 2, 4, ... of the group and
    # renumber them 0, 1, 2, ... with reset_index
    df1 = x.iloc[::2].reset_index(drop=True)
    # take rows 1, 3, 5, ... the same way
    df2 = x.iloc[1::2].reset_index(drop=True)
    # merge on the index to put the two frames side by side
    return df1.merge(df2, left_index=True, right_index=True)

# group by the other columns and select only ['Group', 'Value']
(df3.groupby(['Name', 'City', 'Domain', 'Sequence'])[['Group','Value']]
    .apply(myfunc)               # pair up the rows within each group
    .reset_index(-1, drop=True)  # drop the unnecessary inner index level
    .reset_index()               # turn the grouping columns back into regular columns
)
Output:
       Name       City  Domain  Sequence  Group_x  Value_x  Group_y  Value_y
0      Alex      Tampa     A23         1       Ax      Red       Ay     Blue
1      Alex      Tampa     B43         1       Bx    Green       By    White
2      Alex      Tampa     C55         1       Cx      Red       Cy    White
3      Alex      Tampa     C55         2       Cx   Purple       Cy    Black
4   Anthony    Seattle     A23       N/A       Ax   Orange       Ay    Black
5   Anthony    Seattle     B43       N/A       Bx      Red       By    Black
6   Anthony    Seattle     C55       N/A       Cx     Blue       Cy     Pink
7      Mark    Houston     A23         1       Ax   Purple       Ay   Yellow
8      Mark    Houston     B43         1       Bx     Gray       By    White
9      Mark    Houston     C55         1       Cx     NULL       Cy     NULL
10      Tim  San Diego     A23         1       Ax    Green       Ay    Black
11      Tim  San Diego     A23         2       Ax    Green       Ay      Red
12      Tim  San Diego     B43         1       Bx   Yellow       By    Black
13      Tim  San Diego     B43         2       Bx     NULL       By     NULL
14      Tim  San Diego     C55         1       Cx     Pink       Cy   Orange
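The slicing in myfunc relies on the two rows of each pair sitting next to each other in the data. If that ordering is not guaranteed, one possible alternative is to use the group_dict from the question together with a merge, which also yields the Compared Group and Compared Value column names asked for. This is only a sketch on a small made-up subset of the rows. Matching on Name and Sequence while ignoring Domain is my reading of the notes in the question, and the names left, right and paired are purely illustrative.

```python
import pandas as pd

# Made-up subset; Anthony's pair is deliberately listed in reverse order
rows = [
    ["Alex", "Tampa", "A23", "1", "Ax", "Red"],
    ["Alex", "Tampa", "A23", "1", "Ay", "Blue"],
    ["Tim", "San Diego", "B43", "2", "Bx", "NULL"],
    ["Tim", "San Diego", "B43", "2", "By", "NULL"],
    ["Anthony", "Seattle", "C55", "N/A", "Cy", "Pink"],
    ["Anthony", "Seattle", "C55", "N/A", "Cx", "Blue"],
]
df = pd.DataFrame(
    rows, columns=["Name", "City", "Domain", "Sequence", "Group", "Value"]
)

group_dict = {"Ax": "Ay", "Bx": "By", "Cx": "Cy"}

# Rows whose Group is a dictionary key, plus the Group they should pair with
left = df[df["Group"].isin(list(group_dict))].copy()
left["Compared Group"] = left["Group"].map(group_dict)

# Rows whose Group is a dictionary value, renamed to the Compared columns
right = df[df["Group"].isin(list(group_dict.values()))]
right = right[["Name", "Sequence", "Group", "Value"]].rename(
    columns={"Group": "Compared Group", "Value": "Compared Value"}
)

# Pair each left row with its partner for the same Name and Sequence
paired = left.merge(right, on=["Name", "Sequence", "Compared Group"], how="left")
print(paired)
```

With how="left", a row without a partner would be kept with NaN in Compared Value. If the same Name and Sequence could contain several rows with the same Group, the merge would produce every combination, so this sketch assumes at most one partner per row.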