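python - How to assign a unique group name when the same pattern repeats again in a DataFrame using Pandas?
Problem description
I am trying to create a new column, Group (Cluster), using the logic below.
Logic: the script checks the Vendor, Text, and Days columns; records with the same Vendor and Text whose Days value is <= 2 are grouped into one cluster.
My code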
data['Date']=pd.to_datetime(data['Date'],infer_datetime_format=True)
data['Days']=(data['Date'].diff(1).dt.days).fillna(0)
data['Text']=data['Text'].fillna('No Value')
data['Vendor']=data['Vendor'].fillna('No Value')
diff= lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor'])['Days'].transform(diff)
g = data.groupby(['Text', 'Vendor', t], sort=False).ngroup()
data=data.assign(Group=g.add(1).astype(str).radd('Cluster'))
My current output
Invoice Date Text Vendor Days Group
1234567 1/1/2012 Repairs A 0 Cluster1
1234568 2/1/2012 Repairs A 1 Cluster1
1234569 4/1/2012 Repairs A 2 Cluster1
1234570 6/1/2012 Water A 2 Cluster2
1234571 9/1/2012 Water A 3 Cluster2
1234572 9/1/2012 Car A 0 Cluster3
1234573 9/1/2012 Bus A 0 Cluster4
1234574 9/1/2012 Bike A 0 Cluster5
1234575 9/1/2012 Repairs A 0 Cluster6
1234576 10/1/2012 Repairs A 1 Cluster6
1234577 11/1/2012 Repairs A 1 Cluster6
1234578 12/1/2012 Water A 2 Cluster6
1234579 13/1/2012 Water A 1 Cluster2
1234580 14/1/2012 Water A 1 Cluster2
Expected output
Invoice Date Text Vendor Days Group
1234567 1/1/2012 Repairs A 0 Cluster1
1234568 2/1/2012 Repairs A 1 Cluster1
1234569 4/1/2012 Repairs A 2 Cluster1
1234570 6/1/2012 Water A 2 Cluster2
1234571 9/1/2012 Water A 3 Cluster2
1234572 9/1/2012 Car A 0 No Cluster
1234573 9/1/2012 Bus A 0 No Cluster
1234574 9/1/2012 Bike A 0 No Cluster
1234575 9/1/2012 Repairs A 0 Cluster3
1234576 10/1/2012 Repairs A 1 Cluster3
1234577 11/1/2012 Repairs A 1 Cluster3
1234578 12/1/2012 Water A 2 Cluster4
1234579 13/1/2012 Water A 1 Cluster4
1234580 14/1/2012 Water A 1 Cluster4
Test data
Invoice Date Text Vendor Days Group Expected Group
1000001 1/1/2012 Repair A 0 Cluster1 Cluster1
1000003 2/1/2012 Repair A 1 Cluster1 Cluster1
1000005 4/1/2012 Repair A 2 Cluster1 Cluster1
1000007 6/1/2012 Water A 2 No Cluster No Cluster
1000008 9/2/2012 Repair A 34 Cluster2 No Cluster
1000010 9/2/2012 Garden A 0 Cluster3 Cluster2
1000011 10/2/2012 Garden A 1 Cluster3 Cluster2
1000012 15/2/2012 Car A 5 Cluster4 Cluster3
1000013 16/2/2012 Car A 1 Cluster4 Cluster3
1000015 17/2/2012 Car A 1 Cluster4 Cluster3
1234574 17/2/2012 Bike A 0 No Cluster No Cluster
How can this be done in Python?
Solution
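The idea is to create a helper Series, g1, that labels consecutive runs of identical values in the Text and Vendor columns, keep only the rows whose g1 value is duplicated (runs longer than one row), and finally fill in the unmatched rows with Series.reindex: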
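import pandas as pd

# Parse day-first dates and sort so each row is compared with its predecessor.
data['Date'] = pd.to_datetime(data['Date'], infer_datetime_format=True, dayfirst=True)
data.sort_values(['Vendor', 'Date'], inplace=True)
data['Date_Difference'] = (data['Date'].diff(1).dt.days).fillna(0)
data['Text'] = data['Text'].fillna('No Value')
data['Vendor'] = data['Vendor'].fillna('No Value')

# Within each Text/Vendor pair, start a new sub-group when the value jumps by more than 2.
diff = lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor']).Date_Difference.transform(diff)

# g1 increases by 1 whenever Text or Vendor differs from the previous row.
g1 = data[['Text', 'Vendor']].ne(data[['Text', 'Vendor']].shift()).any(axis=1).cumsum()
# m is True only for rows that belong to a run of more than one row.
m = g1.duplicated(keep=False)

# Number the remaining runs in order of first appearance.
g = data[m].groupby([g1, t], sort=False).ngroup()
# Rows filtered out by m get 'No Cluster' when reindexing back to the full index.
clust = g.add(1).astype(str).radd('Cluster').reindex(data.index, fill_value='No Cluster')
data = data.assign(Group=clust)
print(data)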
Invoice Date Text Vendor Days Group Date_Difference
0 1000001 2012-01-01 Repair A 0 Cluster1 0.0
1 1000003 2012-01-02 Repair A 1 Cluster1 1.0
2 1000005 2012-01-04 Repair A 2 Cluster1 2.0
3 1000007 2012-01-06 Water A 2 No Cluster 2.0
4 1000008 2012-02-09 Repair A 34 No Cluster 34.0
5 1000010 2012-02-09 Garden A 0 Cluster2 0.0
6 1000011 2012-02-10 Garden A 1 Cluster2 1.0
7 1000012 2012-02-15 Car A 5 Cluster3 5.0
8 1000013 2012-02-16 Car A 1 Cluster3 1.0
9 1000015 2012-02-17 Car A 1 Cluster3 1.0
10 1234574 2012-02-17 Bike A 0 No Cluster 0.0
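For readers who want to try the approach end to end, here is a minimal self-contained sketch built from the question's test data. The function name label_clusters and the max_gap parameter are hypothetical and not part of the original answer. The sketch also assumes the rows are already sorted by Vendor and Date, and it passes an explicit date format instead of relying on format inference. It returns the helper Series g1 and the mask m alongside the labelled frame so the intermediate steps can be inspected.

```python
import pandas as pd


def label_clusters(df, max_gap=2):
    """Hypothetical helper: label runs of identical Text/Vendor rows."""
    df = df.copy()
    df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
    # Days since the previous row (rows are assumed to be sorted already).
    df['Date_Difference'] = df['Date'].diff().dt.days.fillna(0)

    # Same split rule as in the answer above, applied per Text/Vendor pair.
    split = lambda s: s.diff().fillna(0).gt(max_gap).cumsum()
    t = df.groupby(['Text', 'Vendor'])['Date_Difference'].transform(split)

    keys = df[['Text', 'Vendor']]
    # A new run starts wherever Text or Vendor differs from the previous row.
    g1 = keys.ne(keys.shift()).any(axis=1).cumsum()
    # Runs that contain a single row are not treated as clusters.
    m = g1.duplicated(keep=False)

    # Number the remaining runs in order of first appearance.
    g = df[m].groupby([g1[m], t[m]], sort=False).ngroup()
    labels = 'Cluster' + g.add(1).astype(str)
    # Rows dropped by the mask come back as 'No Cluster'.
    df['Group'] = labels.reindex(df.index, fill_value='No Cluster')
    return df, g1, m


# Test data from the question (day-first dates).
data = pd.DataFrame({
    'Invoice': [1000001, 1000003, 1000005, 1000007, 1000008, 1000010,
                1000011, 1000012, 1000013, 1000015, 1234574],
    'Date': ['1/1/2012', '2/1/2012', '4/1/2012', '6/1/2012', '9/2/2012',
             '9/2/2012', '10/2/2012', '15/2/2012', '16/2/2012',
             '17/2/2012', '17/2/2012'],
    'Text': ['Repair', 'Repair', 'Repair', 'Water', 'Repair', 'Garden',
             'Garden', 'Car', 'Car', 'Car', 'Bike'],
    'Vendor': ['A'] * 11,
})

result, g1, m = label_clusters(data)
print(result[['Invoice', 'Text', 'Group']])
```

The mask m is what separates this from a plain groupby on Text and Vendor: the second Repair block (invoice 1000008) is its own one-row run, so it is labelled No Cluster instead of being merged with the first Repair block.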