How can I utilize vectorization on my Pandas script for efficiency?

Problem description

This is a continuation of my previous post, where I wanted a faster and more efficient alternative to a standard Python loop that performs some summing and multiplication on the elements of each row.

Basically, I have two input files. One is a list of all genotype combinations for a group of SNPs, for example, for 3 SNPs:

    AA   CC   TT
    AT   CC   TT
    TT   CC   TT
    AA   CG   TT
    AT   CG   TT
    TT   CG   TT
    AA   GG   TT
    AT   GG   TT
    TT   GG   TT
    AA   CC   TA
    AT   CC   TA
    TT   CC   TA
    AA   CG   TA
    AT   CG   TA
    TT   CG   TA
    AA   GG   TA
    AT   GG   TA
    TT   GG   TA
    AA   CC   AA
    AT   CC   AA
    TT   CC   AA
    AA   CG   AA
    AT   CG   AA
    TT   CG   AA
    AA   GG   AA
    AT   GG   AA
    TT   GG   AA

And the second is a table, containing some information for each SNP, notably their log(OR) for a disease and the frequency of the risk allele:

SNP1             A       T       1.25    0.223143551314     0.97273 
SNP2             C       G       1.07    0.0676586484738    0.3     
SNP3             T       A       1.08    0.0769610411361    0.1136  

Below is my main code, in which I calculate a 'score' and a 'frequency' for each 'profile'. The score is the sum of the log(OR)s for each risk allele present in the profile, while the frequency is the product of the per-SNP genotype frequencies, assuming Hardy-Weinberg equilibrium:

import pandas as pd

numbers = pd.read_csv(table2, sep="\t", header=None)

combinations = pd.read_csv(table1, sep=" ", header=None)

def score_freq(line):
    score=0
    freq=1
    for j in range(len(line)):
        if line[j][1] != numbers.values[j][1]:   # homozygous for ref
            score+=0
            freq*=(float(1-float(numbers.values[j][6]))*float(1-float(numbers.values[j][6])))
        elif line[j][0] != numbers.values[j][1] and line[j][1] == numbers.values[j][1]: # heterozygous
            score+=(float(numbers.values[j][5]))
            freq*=(2*(float(1-float(numbers.values[j][6]))*float(numbers.values[j][6])))
        elif line[j][0] == numbers.values[j][1]:   # homozygous for risk
            score+=2*(float(numbers.values[j][5]))
            freq*=(float(numbers.values[j][6])*float(numbers.values[j][6]))

        if freq < 1e-05:   # threshold to stop loop in interest of efficiency 
            break

    return pd.Series([score, freq])

combinations[['score', 'freq']] = combinations.apply(lambda row: score_freq(row), axis=1)
#combinations[['score', 'freq']] = score_freq(combinations.values) # vectorization?

print(combinations)

I was referring to this site, where they go over the fastest ways to loop over a Pandas dataframe. I have been able to use the Pandas apply method, but I am not sure how to apply the vectorization method to a Pandas series. Beyond that, please suggest any other way I can make my script more efficient, thanks!

Tags: python, pandas, performance, vectorization

Solution


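I suggest using the NumPy library to make your pandas script more efficient. The idea behind NumPy is that vectorization lets you avoid for loops and process large amounts of data very efficiently. When using NumPy, you essentially convert your data to NumPy arrays. The NumPy documentation covers this extensively. To answer your question, you can perform mathematical operations on a NumPy array like this: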

import numpy as np

a = np.array([1, 2, 3, 4])
a + 1   # adds 1 to every element: array([2, 3, 4, 5])

a * 2   # multiplies every element by 2: array([2, 4, 6, 8])

This is far more efficient than using a for loop in pure Python.
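Applied to the score and frequency calculation in the question, the same idea might look like the sketch below. It is only a sketch under some assumptions: the arrays `risk_allele`, `log_or` and `risk_freq` are hypothetical names holding values typed in from the SNP table in the question (assuming its columns are risk allele, log(OR) and risk-allele frequency), every genotype is exactly two characters, and the number of risk alleles is counted directly instead of reproducing the if/elif branches. The early exit at `freq < 1e-05` is not reproduced, since a vectorized version computes all rows at once.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two input files in the question.
combinations = pd.DataFrame(
    [["AA", "CC", "TT"],
     ["AT", "CG", "TA"],
     ["TT", "GG", "AA"]]
)
# Assumed per-SNP values, in the same column order as `combinations`.
risk_allele = np.array(["A", "C", "T"])
log_or = np.array([0.223143551314, 0.0676586484738, 0.0769610411361])
risk_freq = np.array([0.97273, 0.3, 0.1136])

# Split each two-letter genotype into its first and second allele.
# The apply here runs once per column (per SNP), not once per row.
first = combinations.apply(lambda col: col.str[0]).to_numpy(dtype=str)
second = combinations.apply(lambda col: col.str[1]).to_numpy(dtype=str)

# Copies of the risk allele per genotype (0, 1 or 2), shape (n_profiles, n_snps).
# `risk_allele` broadcasts across the rows.
n_risk = (first == risk_allele).astype(int) + (second == risk_allele).astype(int)

# Score: sum over SNPs of (risk allele copies * log(OR)).
score = n_risk @ log_or

# Hardy-Weinberg genotype frequency per SNP, chosen by risk allele count.
p = risk_freq
hw = np.where(n_risk == 0, (1 - p) ** 2,
              np.where(n_risk == 1, 2 * p * (1 - p), p ** 2))

# Frequency of a profile: product of its per-SNP genotype frequencies.
freq = hw.prod(axis=1)

result = combinations.assign(score=score, freq=freq)
print(result)
```

If you still want the threshold, you could filter afterwards, for example with `result[result["freq"] >= 1e-05]`. Note that this drops rare profiles rather than returning the partial values the original loop would produce.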

Hope this helps.

