python - Is there a faster groupby correlation in Python
Problem description
Hi, I am running some Python code that calculates the correlation between two columns of my pandas DataFrame, grouped by date and id. For example, my df looks like this:
date  id  z      x   y
1     A   z1     x1  y1
1     A   z2     x2  y2
....
....
1     D   z_n-1  x2  y2
1     D   z_n    x2  y2
Try not to focus on the subscripts, or what the data actually means. Rather focus on the general form. For a given date, I have multiple repeated observations for a given id and I want to calculate the correlation between "x" and "y" for each id on each date. My df has about 2.4 million rows, which is roughly divided up among 200 dates.
My code to get the correlations works (the problem is trivial if I wait long enough), but it has been running for about 7 hours now. Has anybody written something custom that might run faster? Here is the code:
corr_df = df.groupby(['date','id'])['x'].corr(df['y'])
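Solution
I have a similar piece of code, and I think this might be faster. Try:
corr_series = df.groupby(['date','id'])[['x','y']].corr()['y'].xs('x', level=-1)
Here corr() returns one 2x2 correlation matrix per (date, id) group; taking the 'y' column and then the 'x' rows leaves a single x-y correlation per group. This way you are not running the correlation against an external Series (external even though it is just a column of df taken before the grouping), but computing the correlation inside the groupby object.
Hope this helps.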
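To make the idea of correlating inside the groupby object concrete, here is a minimal sketch on synthetic data. The DataFrame contents, the sizes, and the names rng, n_rows, corr_matrices, first_key, group and expected are assumptions made up for illustration; only the column names date, id, x and y come from the question. It computes the per-group x-y correlation and cross-checks one group against NumPy.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data (sizes and values are made up).
rng = np.random.default_rng(0)
n_rows = 2_000
df = pd.DataFrame({
    "date": rng.integers(1, 6, size=n_rows),       # 5 hypothetical dates
    "id": rng.choice(list("ABCD"), size=n_rows),   # 4 hypothetical ids
    "x": rng.normal(size=n_rows),
})
df["y"] = 0.5 * df["x"] + rng.normal(size=n_rows)  # y loosely tied to x

# One 2x2 correlation matrix per (date, id) group, stacked in a MultiIndex.
corr_matrices = df.groupby(["date", "id"])[["x", "y"]].corr()

# Keep only the x-vs-y entry: column 'y', rows whose last index level is 'x'.
corr_series = corr_matrices["y"].xs("x", level=-1)

# Cross-check the first group against NumPy's Pearson correlation.
first_key = corr_series.index[0]
group = df[(df["date"] == first_key[0]) & (df["id"] == first_key[1])]
expected = np.corrcoef(group["x"], group["y"])[0, 1]
print(first_key, corr_series.loc[first_key], expected)
```

The result is a Series indexed by (date, id). Whether this is actually faster than the original call on 2.4 million rows depends on the data and the pandas version, so it is worth timing both on a subset before running the full job.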