python for 循环向量化优化

我需要对dataframe增加一列month,当某行数据的开始日期和结束日期包含某些月份的第一天,那就把这些月份添加到month列中,我只能用最原始的for循环来处理,我的数据大概有300行,如果用向量化处理的话,我该如何优化?

import pandas as pd
import numpy as np

data = pd.DataFrame(
    {
        "d_month": ['202109', '202109', '202109', '202106', '202106', '202106', '202105', '202105', '202105', '202104',
                    '202104', '202104', '202103', '202103', '202103', '202102', '202102', '202102', '202101', '202101',
                    '202101'],
        "ts_code": ['A202109', 'B202109', 'C202109', 'A202106', 'B202106', 'C202106', 'A202105', 'B202105', 'C202105',
                    'A202104', 'B202104', 'C202104', 'A202103', 'B202103', 'C202103', 'A202102', 'B202102', 'C202102',
                    'A202101', 'B202101', 'C202101'],
        "first_date": ['20210118', '20210118', '20210118', '20201019', '20201019', '20201019', '20210322', '20210322',
                       '20210322', '20210222', '20210222', '20210222', '20200720', '20200720', '20200720', '20201221',
                       '20201221', '20201221', '20201123', '20201123', '20201123'],
        "end_date": ['20210917', '20210917', '20210917', '20210618', '20210618', '20210618', '20210521', '20210521',
                     '20210521', '20210416', '20210416', '20210416', '20210319', '20210319', '20210319', '20210219',
                     '20210219', '20210219', '20210115', '20210115', '20210115'],

    })
data = data.sort_values(by=['d_month', 'ts_code'], ascending=[True, True]).reset_index(drop=True)
result = pd.DataFrame()
s = data['d_month'].sort_values(ascending=True).drop_duplicates()
for i in s.values:
    d1 = str(i) + '01'
    v1 = data[(data.first_date <= d1) & (data.end_date >= d1)].reset_index(drop=True)
    v1['month'] = i
    result = pd.concat([result, v1])


result = result.sort_values(by=['month', 'd_month', 'ts_code'], ascending=[True, True, True]).reset_index(drop=True)
result = result[['month', 'd_month', 'first_date', 'end_date']]
print('原始数据:')
print(data.head(10))
print('希望得到的数据:')
print(result.head(10))

输出:

原始数据:
  d_month  ts_code first_date  end_date
0  202101  A202101   20201123  20210115
1  202101  B202101   20201123  20210115
2  202101  C202101   20201123  20210115
3  202102  A202102   20201221  20210219
4  202102  B202102   20201221  20210219
5  202102  C202102   20201221  20210219
6  202103  A202103   20200720  20210319
7  202103  B202103   20200720  20210319
8  202103  C202103   20200720  20210319
9  202104  A202104   20210222  20210416
希望得到的数据:
    month d_month first_date  end_date
0  202101  202101   20201123  20210115
1  202101  202101   20201123  20210115
2  202101  202101   20201123  20210115
3  202101  202102   20201221  20210219
4  202101  202102   20201221  20210219
5  202101  202102   20201221  20210219
6  202101  202103   20200720  20210319
7  202101  202103   20200720  20210319
8  202101  202103   20200720  20210319
9  202101  202106   20201019  20210618

Process finished with exit code 0

 

大牛的回答:

Idea is get all unique months and pass to merge for cross join with helper column a for all combinations and then filter in boolean indexing, last sorting and change order of columns if necessary:

df = data.assign(a=1)
df1 = df[['a','d_month']].drop_duplicates().rename(columns={'d_month':'month'})
df = df.merge(df1, on='a')
df = df[(df.start_date <= df['month']) & (df.end_date >= df['month'])].drop('a', axis=1)

df = df.sort_values(by=['month', 'd_month', 'code'], ignore_index=True)
df = df[df.columns[-1:].tolist() + df.columns[:-1].tolist()]

print (df.head(10))
    month d_month     code start_date  end_date
0  202103  202103  A202103   20200720  20210319
1  202103  202103  B202103   20200720  20210319
2  202103  202103  C202103   20200720  20210319
3  202103  202104  A202104   20210222  20210416
4  202103  202104  B202104   20210222  20210416
5  202103  202104  C202104   20210222  20210416
6  202103  202106  A202106   20201019  20210618
7  202103  202106  B202106   20201019  20210618
8  202103  202106  C202106   20201019  20210618
9  202103  202109  A202109   20210118  20210917