Is there a faster groupby correlation in Python?

Question

Hi, I am running some Python code that calculates the correlation between two columns in my pandas DataFrame, grouped by date and id. For example, my df looks like this:

date id    z      x   y
1    A     z1     x1  y1
1    A     z2     x2  y2
....
....
1    D     z_n-1  x2  y2
1    D     z_n    x2  y2

Try not to focus on the subscripts, or what the data actually means. Rather focus on the general form. For a given date, I have multiple repeated observations for a given id and I want to calculate the correlation between "x" and "y" for each id on each date. My df has about 2.4 million rows, which is roughly divided up among 200 dates.

My code to get the correlations works (given enough time the problem is trivial), but it has been running for about 7 hours now. Has anybody written something custom that might run faster? Here is the code:

corr_df = df.groupby(['date','id'])['x'].corr(df['y'])

Tags: python, pandas, performance, pandas-groupby

Answer


I have a similar piece of code, and I think this might be faster. Try:
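# With two grouping keys the result index is (date, id, column), so select 'x' on the last level
corr_series = df.groupby(['date','id'])[['x','y']].corr()['y'].xs('x', level=-1)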

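This way, you are not running the correlation against an external Series (external even though it is taken from df before the grouping); instead, the correlation is computed inside the groupby object.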
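To make the idea concrete, here is a minimal sketch on made-up data. The frame below is a synthetic stand-in for the one in the question (its sizes, values, and the random seed are assumptions, not the asker's data). It computes the per-group correlation inside the groupby and spot-checks one group against NumPy. It says nothing about actual timings on the 2.4-million-row frame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's frame (sizes and values are made up).
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "date": rng.integers(1, 6, size=n),      # 5 hypothetical dates
    "id": rng.choice(list("ABCD"), size=n),  # 4 hypothetical ids
    "x": rng.normal(size=n),
})
df["y"] = 0.5 * df["x"] + rng.normal(size=n)

# One 2x2 correlation matrix per group, stacked under a
# (date, id, column-name) MultiIndex.
corr_matrices = df.groupby(["date", "id"])[["x", "y"]].corr()

# Take the 'y' column and keep only the rows labelled 'x':
# that leaves corr(x, y) once per (date, id) group.
corr_series = corr_matrices["y"].xs("x", level=-1)

# Spot-check a single group against NumPy's Pearson correlation.
first_key = corr_series.index[0]
group = df[(df["date"] == first_key[0]) & (df["id"] == first_key[1])]
expected = np.corrcoef(group["x"], group["y"])[0, 1]
print(first_key, corr_series.loc[first_key], expected)
```

The result is a Series indexed by (date, id), which can be turned back into a flat table with corr_series.reset_index(name='corr') if that is easier to work with.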

Hope this helps.

