How to label same pandas dataframe rows?

Problem description

I have a large pandas dataframe like this:

log  apple   watermelon  orange  lemon  grapes

1      1         1         yes     0      0
1      2         0         1       0      0
1     True       0         0       0      2
2      0         0         0       0      2
2      1         1         yes     0      0
2      0         0         0       0      2
2      0         0         0       0      2
3     True       0         0       0      2
4      0         0         0       0      2.1
4      0         0         0       0      2.1

How can I label the rows that are the same, for example:

log   apple   watermelon  orange  lemon  grapes   ID

1      1         1         yes     0      0      1
1      2         0         1       0      0      2
1     True       0         0       0      2      3
2      0         0         0       0      2      4
2      1         1         yes     0      0      1
2      0         0         0       0      2      4
2      0         0         0       0      2      4
3     True       0         0       0      2      3
4      0         0         0       0      2.1    5
4      0         0         0       0      2.1    5

I tried to:

df['ID']=df.groupby('log')[df.columns].transform('ID')

And

df['personid'] = df['log'].clip_upper(2) - 2*d.duplicated(subset='apple')
df

However, the above doesn't work because I have a lot of columns, and it's not giving me the expected output. Any idea how to group and label this dataframe?

Tags: python, python-3.x, pandas

Solution


Given

import io
import pandas as pd

x = io.StringIO("""log  apple   watermelon  orange  lemon  grapes

1      1         1         yes     0      0
1      2         0         1       0      0
1     True       0         0       0      2
2      0         0         0       0      2
2      1         1         yes     0      0
2      0         0         0       0      2
2      0         0         0       0      2
3     True       0         0       0      2
4      0         0         0       0      2.1
4      0         0         0       0      2.1""")
# sep=r"\s+" splits on runs of whitespace (delim_whitespace is deprecated in newer pandas)
df2 = pd.read_table(x, sep=r"\s+")

You can first turn each row into a tuple, which makes the rows hashable and comparable, and then use the index and range to create a unique id

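# One tuple per row; apply(tuple, axis=1) is used because transform(tuple, 1)
# may not return a Series of tuples in recent pandas versions
f = df2.apply(tuple, axis=1).to_frame()
# Grouping by the tuple column leaves one index entry per distinct row
k = f.groupby(0).sum()
k['id'] = range(1, len(k.index) + 1)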
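
To see what the tuple step buys, here is a minimal sketch on a made-up frame (the name demo and its values are illustrative, not from the question). It swaps in pd.factorize for the groupby/range step above; factorize numbers distinct keys in order of first appearance, whereas the groupby above numbers them in sorted order.

```python
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only.
demo = pd.DataFrame({"a": [1, 2, 1], "b": ["x", "y", "x"]})

# One tuple per row: tuples are hashable, so identical rows yield equal keys.
row_keys = demo.apply(tuple, axis=1)

# factorize assigns 0, 1, 2, ... to distinct keys in order of first appearance.
codes, _ = pd.factorize(row_keys)
demo["id"] = codes + 1
```

Here the first and third rows are identical, so they end up with the same id.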

Finally

df2['temp_key'] = f[0]
df2 = df2.set_index('temp_key')
df2['id'] = k.id
df2.reset_index().drop(columns='temp_key')

    log     apple   watermelon  orange  lemon   grapes  id
0   1       1       1           yes     0       0.0     1
1   1       2       0           1       0       0.0     2
2   1       True    0           0       0       2.0     3
3   2       0       0           0       0       2.0     4
4   2       1       1           yes     0       0.0     5
5   2       0       0           0       0       2.0     4
6   2       0       0           0       0       2.0     4
7   3       True    0           0       0       2.0     6
8   4       0       0           0       0       2.1     7
9   4       0       0           0       0       2.1     7
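
Note that the output above treats log as part of the row key, so rows 0 and 4 get ids 1 and 5, while the question expects them to share an ID. The following is a sketch of one way to match the question's numbering, assuming that log should be excluded from the comparison and that ids should follow order of first appearance. It uses groupby(...).ngroup() rather than the tuple approach above, and the names df, text and value_cols are chosen here for illustration.

```python
import io
import pandas as pd

text = """log  apple   watermelon  orange  lemon  grapes
1      1         1         yes     0      0
1      2         0         1       0      0
1     True       0         0       0      2
2      0         0         0       0      2
2      1         1         yes     0      0
2      0         0         0       0      2
2      0         0         0       0      2
3     True       0         0       0      2
4      0         0         0       0      2.1
4      0         0         0       0      2.1"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

# Assumption: every column except log decides whether two rows are "the same".
value_cols = [c for c in df.columns if c != "log"]

# sort=False keeps groups in order of first appearance; ngroup() numbers them from 0.
df["ID"] = df.groupby(value_cols, sort=False).ngroup() + 1
print(df)
```

With the sample data this should give the ID column 1, 2, 3, 4, 1, 4, 4, 3, 5, 5, as in the question. Since the columns are listed programmatically, the same call works when there are many columns.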
