Pandas: how to keep the rows duplicated in one column in a new DataFrame when the value in another column changes?

Problem description

I have a Pandas DataFrame, df, that looks like this:

text label
a    country
a    sport
b    cooking
b    cooking
c    travel
c    design
d    tech

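I want two DataFrames: one with the rows that are duplicated in the "text" column when the value in the "label" column changes, and another that keeps everything else.

Expected output, df1: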

text label
a    country
a    sport
c    travel
c    design

And df2:

text label
b    cooking
b    cooking
d    tech

Tags: python, pandas

Solution


Use DataFrame.duplicated on one column or on several columns to build boolean masks:

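# m1: True for rows whose 'text' value occurs more than once
m1 = df.duplicated('text', keep=False)
# m2: True for rows whose ('text', 'label') pair occurs more than once
m2 = df.duplicated(['text','label'], keep=False)
# to compare all columns instead of a subset:
#m2 = df.duplicated(keep=False)
# rows for df2: the pair is repeated, or the 'text' value is not duplicated at all
mask = m2 | ~m1

df1 = df[~mask]
df2 = df[mask]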

print(df1)
  text    label
0    a  country
1    a    sport
4    c   travel
5    c   design

print(df2)
  text    label
2    b  cooking
3    b  cooking
6    d     tech
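
To make it easier to see why each row lands where it does, here is a minimal self-contained sketch of the same approach. The way df is constructed is an assumption based on the sample data in the question, and the assign call is only added to show the intermediate masks next to the data:

```python
import pandas as pd

# Sample data rebuilt from the question (assumed construction)
df = pd.DataFrame({
    'text':  ['a', 'a', 'b', 'b', 'c', 'c', 'd'],
    'label': ['country', 'sport', 'cooking', 'cooking', 'travel', 'design', 'tech'],
})

# True where the 'text' value appears more than once
m1 = df.duplicated('text', keep=False)
# True where the ('text', 'label') pair appears more than once
m2 = df.duplicated(['text', 'label'], keep=False)
# Rows that belong in df2: repeated pair, or 'text' that is not duplicated
mask = m2 | ~m1

# Show the three masks as extra columns for inspection
print(df.assign(m1=m1, m2=m2, mask=mask))

df1 = df[~mask]
df2 = df[mask]
```

With keep=False every member of a duplicated group is flagged, not only the later occurrences, which is why both rows of each pair end up on the same side of the split.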

Another approach is to count the unique values per group and test whether that count equals 1 or not:

mask = df.groupby('text')['label'].transform('nunique').eq(1)
df1 = df[~mask]
df2 = df[mask]
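
As a rough illustration of what transform('nunique') returns here, the following sketch prints the per-row count before it is compared with 1. The df construction and the name n_labels are assumptions added for this example:

```python
import pandas as pd

# Sample data rebuilt from the question (assumed construction)
df = pd.DataFrame({
    'text':  ['a', 'a', 'b', 'b', 'c', 'c', 'd'],
    'label': ['country', 'sport', 'cooking', 'cooking', 'travel', 'design', 'tech'],
})

# Number of distinct labels in each 'text' group, broadcast back to every row
n_labels = df.groupby('text')['label'].transform('nunique')
print(n_labels.tolist())  # [2, 2, 1, 1, 2, 2, 1]

# Groups with exactly one distinct label go to df2, the rest to df1
mask = n_labels.eq(1)
df1 = df[~mask]
df2 = df[mask]
```

Because transform returns a Series aligned with the original index, the mask can be used directly for boolean indexing without any merge.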

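The two approaches give different output if the data changes, for example when a (text, label) pair is repeated inside a group that also contains other labels: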

print(df)
  text    label
0    a  country
1    a    sport
2    a    sport
3    b  cooking
4    b  cooking
5    c   travel
6    c   design
7    d     tech
    

# first approach: masks from DataFrame.duplicated
m1 = df.duplicated('text', keep=False)
m2 = df.duplicated(['text','label'], keep=False)
# to compare all columns instead of a subset:
#m2 = df.duplicated(keep=False)
mask = m2 | ~m1

df1 = df[~mask]
df2 = df[mask]
print(df1)
  text    label
0    a  country
5    c   travel
6    c   design

print(df2)
  text    label
1    a    sport
2    a    sport
3    b  cooking
4    b  cooking
7    d     tech

# second approach: number of unique labels per 'text' group
mask = df.groupby('text')['label'].transform('nunique').eq(1)
df1 = df[~mask]
df2 = df[mask]
print(df1)
  text    label
0    a  country
1    a    sport
2    a    sport
5    c   travel
6    c   design

print(df2)
  text    label
3    b  cooking
4    b  cooking
7    d     tech
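
One way to see exactly where the two approaches disagree is to compute both masks on the same data and select the rows where they differ. This is only a sketch: the df construction and the names mask_dup, mask_nunique and disagree are assumptions introduced for the comparison:

```python
import pandas as pd

# Changed sample data with a repeated (a, sport) row (assumed construction)
df = pd.DataFrame({
    'text':  ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'],
    'label': ['country', 'sport', 'sport', 'cooking', 'cooking',
              'travel', 'design', 'tech'],
})

# First approach: row-wise test on repeated (text, label) pairs
m1 = df.duplicated('text', keep=False)
m2 = df.duplicated(['text', 'label'], keep=False)
mask_dup = m2 | ~m1

# Second approach: group-wise test on the number of distinct labels
mask_nunique = df.groupby('text')['label'].transform('nunique').eq(1)

# Rows that the two approaches classify differently
disagree = df[mask_dup != mask_nunique]
print(disagree)
```

The duplicated version decides row by row, so the repeated (a, sport) rows go to df2 even though group a also contains country. The nunique version decides per group, so all of group a stays in df1. Which one is right depends on whether a group with mixed labels should be kept together.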
