Nested list comprehension problem on pandas groupby

Problem description

I have an astronomical dataset with several million rows. This is what it looks like:

                     oid               mjd               mag               magerr                     ra                    dec
    0        1809105320673280.0  58338.42578125  20.6175079345703125  0.1499880552291870   12.0123176574707031  56.8318214416503906
    1        1809105320673280.0  58365.42968750  20.7830238342285156  0.1610205173492432   12.0121049880981445  56.8318862915039062
    2        1809105320673280.0  58377.37500000  20.7814407348632812  0.1609148979187012   12.0120792388916016  56.8319053649902344
    3        1809105320673280.0  58389.36328125  20.6266822814941406  0.1505994796752930   12.0119419097900391  56.8318405151367188
    4        1809105320673280.0  58430.28906250  20.7284736633300781  0.1573843955993652   12.0120868682861328  56.8317718505859375
    ...                     ...             ...                  ...                 ...                   ...                  ...
    8474460   381208110301184.0  58711.27343750  19.1085929870605469  0.0534130744636059  257.3913269042968750 -10.2478170394897461
    8474461   381208110301184.0  58723.13671875  19.4006576538085938  0.0655696913599968  257.3913879394531250 -10.2481222152709961
    8474462   381208110301184.0  58726.13281250  19.4201564788818359  0.0664852634072304  257.3913574218750000 -10.2475624084472656
    8474463   381208110301184.0  58737.16796875  19.3793220520019531  0.0645836368203163  257.3914184570312500 -10.2481050491333008
    8474464   381208110301184.0  58765.10937500  19.3963356018066406  0.0653686374425888  257.3912658691406250 -10.2478036880493164

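I need to split it into separate source files. First, I group the data by observation ID (oid). Then I use the minimum ra and dec of each group to compute the angular distance between groups: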

PyAstronomy.pyasl.getAngDist(ra1, dec1, ra2, dec2)

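ra1 and dec1 belong to one group, and ra2 and dec2 belong to another. If the angular distance is below a given threshold, the code writes both groups to the same file.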
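For reference, the angular separation described above can also be written directly with NumPy so that it accepts whole arrays at once. The following is a minimal sketch using the haversine formula, assuming all inputs and the result are in degrees; `ang_dist_deg` is a hypothetical helper name and is not part of PyAstronomy.

```python
import numpy as np


def ang_dist_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between (ra1, dec1) and (ra2, dec2).

    All arguments are in degrees and may be scalars or NumPy arrays;
    the usual NumPy broadcasting rules apply.
    """
    ra1, dec1, ra2, dec2 = (np.radians(np.asarray(v, dtype=np.float64))
                            for v in (ra1, dec1, ra2, dec2))
    d_ra = ra1 - ra2
    d_dec = dec1 - dec2
    # Haversine term; clipped to [0, 1] to guard against rounding error.
    h = np.sin(d_dec / 2.0) ** 2 + np.cos(dec1) * np.cos(dec2) * np.sin(d_ra / 2.0) ** 2
    h = np.clip(h, 0.0, 1.0)
    return np.degrees(2.0 * np.arcsin(np.sqrt(h)))
```

Passing one scalar position together with arrays of other positions returns all separations in a single call, which avoids one Python-level function call per pair.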

The code is:

#!/usr/bin/env python3
import numpy as np
import pandas as pd
import glob
from PyAstronomy import pyasl


def data():
    cols = ['oid', 'mjd', 'mag', 'magerr', 'ra', 'dec']
    threshold = 1.5 / 3600
    df = pd.read_hdf('ztf_dr3.txt',dtype={'8':np.float32, '9':np.float32})
    pd.set_option('display.precision', 16)
    # data = data.apply(pd.to_numeric, errors='coerce')
    df.columns = cols
    edf = pd.DataFrame(columns=cols)
    grouped = df.groupby(['oid'])
    for name, i in grouped:
        edf = edf.append(i, ignore_index=True)
        for name, j in grouped:
            ang_dist = pyasl.getAngDist(i['ra'].min(), i['dec'].min(), j['ra'].min(), j['dec'].min())
            if (ang_dist <= threshold):
                edf = edf.append(j, ignore_index=True)
        edf.to_csv('result/' + str(i['oid'].min()) + '.txt', columns=cols, header=True, index=False)
        edf = pd.DataFrame(columns=cols)

It works, but it is very slow. I tried rewriting it as a list comprehension:

def data():
    pd.set_option('display.precision', 16)
    grouped = pd.read_hdf('ztf_dr3.txt',dtype={'8':np.float32, '9':np.float32}).groupby(['0'])
    edf = pd.DataFrame([[i, j]
                        for name, i in grouped for name, j in grouped
                        if
                        (pyasl.getAngDist(i['8'].min(), i['9'].min(), j['8'].min(), j['9'].min()) <= 1.5 / 3600.0)
                        ])
    return edf.to_csv('result/{}.txt'.format("edf['0'].min()"))

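The problem is that the nested list comprehension uses a lot of memory (about three times as much).

I would appreciate any ideas on how to write this code as a nested comprehension.

Sample data file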

https://drive.google.com/file/d/1naJFXLJOjsQ2nVnGFX5WWH-hBdxikGq3/view?usp=sharing

Edit: it doesn't have to be a nested list comprehension; I just want it to be faster.

Tags: pandas-groupby, astronomy, nested-for-loop

Solution
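One possible way to speed this up, offered as a sketch under stated assumptions rather than a tested answer for the full dataset: compute each oid's minimum ra and dec once, then compare one reference position against all the others in a single NumPy operation, instead of recomputing the minima and calling the distance function inside a nested Python loop. The helper names `unit_vectors` and `neighbours_by_oid` are hypothetical, the column names follow the question, and the angular test is done on 3D unit vectors (chord length) rather than with `getAngDist`.

```python
import numpy as np
import pandas as pd


def unit_vectors(ra_deg, dec_deg):
    """Convert (ra, dec) in degrees to an (n, 3) array of unit vectors."""
    ra = np.radians(ra_deg)
    dec = np.radians(dec_deg)
    return np.column_stack((np.cos(dec) * np.cos(ra),
                            np.cos(dec) * np.sin(ra),
                            np.sin(dec)))


def neighbours_by_oid(df, threshold_deg=1.5 / 3600):
    """Map each oid to the oids whose minimum (ra, dec) lies within threshold_deg."""
    # One pass over the data: minimum ra and dec per oid.
    ref = df.groupby('oid', sort=False)[['ra', 'dec']].min()
    oids = ref.index.to_numpy()
    xyz = unit_vectors(ref['ra'].to_numpy(), ref['dec'].to_numpy())
    # Two unit vectors separated by angle t are a chord of 2*sin(t/2) apart.
    max_chord = 2.0 * np.sin(np.radians(threshold_deg) / 2.0)
    result = {}
    for k, oid in enumerate(oids):
        # Vectorized: distance from reference k to every reference at once.
        chord = np.linalg.norm(xyz - xyz[k], axis=1)
        result[oid] = oids[chord <= max_chord]
    return result
```

With `matches = neighbours_by_oid(df)`, each output file could then be written from `df[df['oid'].isin(matches[oid])]`, so only one subset is held in memory at a time. Each oid matches itself, so its own rows are included once; the original loop appends them twice. The loop still does one pass over the reference table per oid, so for a very large number of oids a spatial index would be a natural next step.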

