python - Vectorizing a for loop
Problem description
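I need to add a month column to the data frame data. When the range from a row's start_date to its end_date contains the first day of one or more months, the row is repeated once for each such month, with that month in the new column. So far I can only do this with a plain for loop. My data has about 300 rows. How can I optimize this with vectorization?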
import pandas as pd
import numpy as np
data = pd.DataFrame(
    {
        "d_month": ['202109', '202109', '202109', '202106', '202106', '202106', '202105', '202105', '202105', '202104',
                    '202104', '202104', '202103', '202103', '202103', ],
        "code": ['A202109', 'B202109', 'C202109', 'A202106', 'B202106', 'C202106', 'A202105', 'B202105', 'C202105',
                 'A202104', 'B202104', 'C202104', 'A202103', 'B202103', 'C202103'],
        "start_date": ['20210118', '20210118', '20210118', '20201019', '20201019', '20201019', '20210322', '20210322',
                       '20210322', '20210222', '20210222', '20210222', '20200720', '20200720', '20200720'],
        "end_date": ['20210917', '20210917', '20210917', '20210618', '20210618', '20210618', '20210521', '20210521',
                     '20210521', '20210416', '20210416', '20210416', '20210319', '20210319', '20210319'], })
data = data.sort_values(by=['d_month', 'code'], ascending=[True, True]).reset_index(drop=True)
result = pd.DataFrame()
s = data['d_month'].sort_values(ascending=True).drop_duplicates()
for i in s.values:
    d1 = str(i) + '01'
    v1 = data[(data.start_date <= d1) & (data.end_date >= d1)].reset_index(drop=True)
    v1['month'] = i
    result = pd.concat([result, v1])
result = result.sort_values(by=['month', 'd_month', 'code'], ascending=[True, True, True]).reset_index(drop=True)
result = result[['month', 'd_month', 'code', 'start_date', 'end_date']]
print('Original data:')
print(data.head(10))
print('Expected data:')
print(result.head(10))
Output:
Original data:
   d_month     code  start_date  end_date
0   202103  A202103    20200720  20210319
1   202103  B202103    20200720  20210319
2   202103  C202103    20200720  20210319
3   202104  A202104    20210222  20210416
4   202104  B202104    20210222  20210416
5   202104  C202104    20210222  20210416
6   202105  A202105    20210322  20210521
7   202105  B202105    20210322  20210521
8   202105  C202105    20210322  20210521
9   202106  A202106    20201019  20210618
Expected data:
    month  d_month     code  start_date  end_date
0  202103   202103  A202103    20200720  20210319
1  202103   202103  B202103    20200720  20210319
2  202103   202103  C202103    20200720  20210319
3  202103   202104  A202104    20210222  20210416
4  202103   202104  B202104    20210222  20210416
5  202103   202104  C202104    20210222  20210416
6  202103   202106  A202106    20201019  20210618
7  202103   202106  B202106    20201019  20210618
8  202103   202106  C202106    20201019  20210618
9  202103   202109  A202109    20210118  20210917
Solution
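The idea is to take all unique months and cross join them with every row by calling merge on a helper column a, then filter with boolean indexing, and finally sort and, if necessary, change the column order: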
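# Helper column used as the key for the cross join
df = data.assign(a=1)
# All unique months, with d_month renamed to month
df1 = df[['a','d_month']].drop_duplicates().rename(columns={'d_month':'month'})
# Pair every row with every month
df = df.merge(df1, on='a')
# Compare against the first day (yyyymm + '01') as the loop does; a bare yyyymm
# string would miss a start_date that falls exactly on the first of the month
first_day = df['month'] + '01'
df = df[(df.start_date <= first_day) & (df.end_date >= first_day)].drop('a', axis=1)
df = df.sort_values(by=['month', 'd_month', 'code'], ignore_index=True)
# Move month (the last column after the merge) to the front
df = df[df.columns[-1:].tolist() + df.columns[:-1].tolist()]
print(df.head(10))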
    month  d_month     code  start_date  end_date
0  202103   202103  A202103    20200720  20210319
1  202103   202103  B202103    20200720  20210319
2  202103   202103  C202103    20200720  20210319
3  202103   202104  A202104    20210222  20210416
4  202103   202104  B202104    20210222  20210416
5  202103   202104  C202104    20210222  20210416
6  202103   202106  A202106    20201019  20210618
7  202103   202106  B202106    20201019  20210618
8  202103   202106  C202106    20201019  20210618
9  202103   202109  A202109    20210118  20210917
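As a variation on the same idea, the cross join can probably be written without the helper column. The sketch below assumes pandas 1.2 or later, where merge accepts how="cross", and that all dates are fixed-width yyyymmdd strings so that string comparison matches chronological order. The function name expand_months and the three-row sample are hypothetical and used only for illustration. The filter compares against the first day of the month (month + '01'), as the loop in the question does.

```python
import pandas as pd


def expand_months(data: pd.DataFrame) -> pd.DataFrame:
    """Pair each row with every month whose first day lies in [start_date, end_date]."""
    # Unique months form the right-hand side of the cross join.
    months = data[["d_month"]].drop_duplicates().rename(columns={"d_month": "month"})
    # how="cross" builds the Cartesian product directly (assumes pandas >= 1.2).
    out = data.merge(months, how="cross")
    # First day of each month as a yyyymmdd string.
    first_day = out["month"] + "01"
    # Boolean indexing keeps only pairs whose date range contains that first day.
    out = out[(out["start_date"] <= first_day) & (out["end_date"] >= first_day)]
    out = out.sort_values(by=["month", "d_month", "code"], ignore_index=True)
    return out[["month", "d_month", "code", "start_date", "end_date"]]


# Hypothetical sample; the last row starts exactly on the first day of a month.
sample = pd.DataFrame(
    {
        "d_month": ["202103", "202104", "202105"],
        "code": ["A202103", "A202104", "A202105"],
        "start_date": ["20200720", "20210222", "20210401"],
        "end_date": ["20210319", "20210416", "20210521"],
    }
)
expanded = expand_months(sample)
print(expanded)
```

With about 300 rows and a handful of distinct months, the intermediate product stays small. For much larger inputs the rows-times-months product grows quickly, so memory use would be worth checking before relying on this approach.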