python - comparing values in two pandas dataframes to keep a running count
Problem description
My apologies for the length of this but I want to explain as fully as possible. I am completely stumped on how to solve this.
The Setup:
I have two dataframes. The first, call it df_01, has a list of all possible values in its first column, with no duplicates. These are all the possible values common to the lists. Each additional column represents an independent list and contains a number giving how many days a given value has been on that list. df_01 has a shape of (9277, 32). These dimensions can change but will generally stay the same. The following is a small example of what it looks like.
df_01 before any actions:
index values list01 list02 ... list30 list31
0 aaa 5 1 ... NaN 83
1 bbb NaN NaN ... NaN 4
2 ccc 20 NaN ... NaN 32
3 ddd 1 27 ... NaN NaN
. . . . ... . .
. . . . ... . .
. . . . ... . .
9274 qqq NaN 15 ... NaN 6
9275 rrr 238 NaN ... NaN 102
9276 sss 3 2 ... NaN NaN
9277 ttt 12 NaN ... NaN 99
This first dataframe (df_01) will always be the values as they were the previous day.
The second dataframe, call it df_02, will always have fewer rows, and its length changes from day to day, but it always has the same number of columns as df_01. It currently has a shape of (1351, 32). In df_02 the first column has all the common values from each list as of today, with no duplicates. The other columns have a 1 if the value is on the list today and NaN if it is not. Here's an example.
df_02 before any actions:
index values list01 list02 ... list30 list31
0 aaa 1 1 ... NaN 1
1 bbb NaN 1 ... 1 1
2 ddd 1 1 ... NaN NaN
. . . . ... . .
. . . . ... . .
. . . . ... . .
1349 qqq NaN NaN ... 1 1
1350 rrr 1 NaN ... 1 NaN
1351 sss NaN 1 ... NaN 1
The Question:
What I want to accomplish is as follows.
1) For every value in each column, if the value exists in the first dataframe (df_01) and not in the second (df_02), its counter in df_01 resets to NaN, on a per-column basis.
2) Then, for every value in each column of df_02, if the value exists in the same column of both dataframes, sum the two values.
3) For example, if aaa on list01 of df_01 is 5 and aaa on list01 of df_02 is 1, then aaa on list01 of df_02 becomes 6. This keeps a running count.
4) If the value is NaN in both, no action is needed.
5) If a value is NaN in df_01 and 1 in df_02, it stays 1.
*** The value in df_02 will always be a 1 or an NaN prior to the summation. It is a binary choice of whether or not the value in the values column is in the individual list that day or not.
*** Notice that the values ccc and ttt are in df_01, as they are possible values, but not in df_02, as they were on none of the lists today.
*** The asterisks around values like *NaN* or *6* denote the values that will change; they would not actually be in the data.
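To make the rules above concrete, they can be written out for a single cell. This is only a sketch of the intended behaviour, not a solution for whole dataframes; the function name `step_cell` is hypothetical and is not part of pandas.

```python
import math


def step_cell(prev, today):
    """Apply the rules to one cell.

    prev  : yesterday's count in df_01 (a number or NaN)
    today : today's flag in df_02 (1 or NaN)
    Returns (new df_01 cell, new df_02 cell).
    """
    if math.isnan(today):
        # Rules 1 and 4: not on the list today, so the counter resets.
        return math.nan, math.nan
    if math.isnan(prev):
        # Rule 5: first day on the list, the 1 stays as it is.
        return math.nan, today
    # Rules 2 and 3: on the list both days, so extend the running count.
    return prev, prev + today


print(step_cell(5, 1))  # (5, 6), the aaa / list01 case from rule 3
```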
The dataframes should look like this after the procedure:
df_01
index values list01 list02 ... list30 list31
0 aaa 5 1 ... NaN 83
1 bbb NaN NaN ... NaN 4
2 ccc NaN NaN ... NaN NaN
3 ddd 1 27 ... NaN NaN
. . . . ... . .
. . . . ... . .
. . . . ... . .
9274 qqq NaN *NaN* ... NaN 6
9275 rrr 238 NaN ... NaN *NaN*
9276 sss *NaN* 2 ... NaN 24
9277 ttt 12 NaN ... NaN 99
df_02
index values list01 list02 ... list30 list31
0 aaa *6* *2* ... NaN *84*
1 bbb NaN 1 ... 1 *5*
2 ddd *2* *28* ... NaN NaN
. . . . ... . .
. . . . ... . .
. . . . ... . .
1349 qqq NaN NaN ... 1 *7*
1350 rrr *239* NaN ... 1 NaN
1351 sss NaN *3* ... NaN *25*
How would I go about accomplishing something like this? I don't even know where to begin. Any ideas, even if not completely working, just to point me in the right direction would be appreciated. Please let me know if anything needs clarification.
Thanks
Solution
import numpy as np

# df1 is yesterday's frame (df_01), df2 is today's frame (df_02)
df1 = df1.set_index('values')
df2 = df2.set_index('values')
cols = [*df1.columns]
for col in cols:
    # Update df1: reset the counter wherever df2 is NaN today
    df1[col].update(df2.loc[df2[col].isnull(), col].fillna('-'))
    df1[col].replace('-', np.nan, inplace=True)
    # Update df2: sum where both frames have numbers
    df2[col].update(df2.loc[~df2[col].isnull(), col] + df1.loc[~df1[col].isnull(), col])
This should do what you want. We loop over each column and update it individually. Make sure the cols list contains the correct columns for your dataframes.
The reason we have to use .fillna('-') in the update to df1 is that update() only copies non-NaN values, so you can't use it to overwrite a value with NaN. We fill with a placeholder first, then replace the placeholder with NaN afterwards.
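As an alternative to the column loop, the same rules can probably be expressed with index alignment and no placeholder. The following is a sketch under assumptions, not a drop-in replacement: the toy frames reuse a few rows from the question, the names `today`, `df_01_new` and `df_02_new` are made up for illustration, and rule 1 is applied literally, so rows missing from df_02 altogether (such as ccc) are reset as well, which the loop above does not do.

```python
import numpy as np
import pandas as pd

# Toy frames built from rows shown in the question (two list columns only).
df_01 = pd.DataFrame({
    "values": ["aaa", "bbb", "ccc", "ddd"],
    "list01": [5, np.nan, 20, 1],
    "list02": [1, np.nan, np.nan, 27],
}).set_index("values")

df_02 = pd.DataFrame({
    "values": ["aaa", "bbb", "ddd"],
    "list01": [1, np.nan, 1],
    "list02": [1, 1, 1],
}).set_index("values")

# Align today's flags to yesterday's index; values absent today become NaN.
today = df_02.reindex(df_01.index)

# Rule 1: keep yesterday's count only where the value is on the list today.
df_01_new = df_01.where(today.notna())

# Rules 2-5: today's 1 plus yesterday's count, with a missing count
# treated as 0. Cells that are NaN today stay NaN because NaN + x is NaN.
df_02_new = df_02 + df_01.reindex(df_02.index).fillna(0)

print(df_01_new)
print(df_02_new)
```

On these toy rows, aaa on list01 goes from 5 to 6 in the new df_02, and ccc is reset to NaN in every column of the new df_01.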