python - How can I utilize vectorization on my Pandas script for efficiency?
Problem description
This is a continuation of my previous post, where I wanted a faster and more efficient alternative to a standard Python loop that performs some summing and multiplication on the elements of each row.
Basically, what I have are two file inputs. One is a list of all combinations for a group of SNPs, for example below for 3 SNPs:
AA CC TT
AT CC TT
TT CC TT
AA CG TT
AT CG TT
TT CG TT
AA GG TT
AT GG TT
TT GG TT
AA CC TA
AT CC TA
TT CC TA
AA CG TA
AT CG TA
TT CG TA
AA GG TA
AT GG TA
TT GG TA
AA CC AA
AT CC AA
TT CC AA
AA CG AA
AT CG AA
TT CG AA
AA GG AA
AT GG AA
TT GG AA
And the second is a table, containing some information for each SNP, notably their log(OR) for a disease and the frequency of the risk allele:
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
Below is my main code, in which I am looking to calculate a 'score' and a 'frequency' for each 'profile'. The score is the sum of log(ORs) for each risk allele present in the profile, while the frequency is the product of the per-SNP genotype frequencies, assuming Hardy-Weinberg equilibrium:
import pandas as pd

numbers = pd.read_csv(table2, sep="\t", header=None)
combinations = pd.read_csv(table1, sep=" ", header=None)

def score_freq(line):
    score=0
    freq=1
    for j in range(len(line)):
        if line[j][1] != numbers.values[j][1]: # homozygous for ref
            score+=0
            freq*=(float(1-float(numbers.values[j][6]))*float(1-float(numbers.values[j][6])))
        elif line[j][0] != numbers.values[j][1] and line[j][1] == numbers.values[j][1]: # heterozygous
            score+=(float(numbers.values[j][5]))
            freq*=(2*(float(1-float(numbers.values[j][6]))*float(numbers.values[j][6])))
        elif line[j][0] == numbers.values[j][1]: # homozygous for risk
            score+=2*(float(numbers.values[j][5]))
            freq*=(float(numbers.values[j][6])*float(numbers.values[j][6]))
        if freq < 1e-05: # threshold to stop loop in interest of efficiency
            break
    return pd.Series([score, freq])

combinations[['score', 'freq']] = combinations.apply(lambda row: score_freq(row), axis=1)
#combinations[['score', 'freq']] = score_freq(combinations.values) # vectorization?
print(combinations)
I was referring to this site, where they go over the fastest way to loop over a Pandas dataframe. I have been able to use the Pandas apply method, but I am not sure how to apply the vectorization method to the Pandas series. Beyond that, please suggest any other ways I can make my script more efficient. Thanks!
Solution
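I suggest using the NumPy library to make your pandas script more efficient. The idea behind NumPy is that vectorization lets you avoid for loops and process large amounts of data very efficiently. To use NumPy, you essentially convert your data into NumPy arrays. NumPy's documentation covers this extensively. To answer your question, you can perform arithmetic on a NumPy array like this: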
import numpy as np

a = np.array([1, 2, 3, 4])
a + 1  # adds 1 to every element in the array
a * 2  # multiplies every element in the array by 2
This is much more efficient than using a for loop in pure Python.
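Applied to the data in the question, the same idea might look like the sketch below. It is not the asker's exact logic. It counts risk alleles per genotype regardless of allele order, it assumes the 6-column layout of the table shown in the question (risk allele in column 1, log(OR) in column 4, risk-allele frequency in column 5, rather than the indices 5 and 6 used in the loop), and it drops the early-exit threshold. The small inline frames stand in for the two input files.

```python
import numpy as np
import pandas as pd

# Stand-in for the SNP table. Assumed columns (from the table in the question):
# 0 = SNP, 1 = risk allele, 2 = other allele, 3 = OR, 4 = log(OR), 5 = risk allele frequency
numbers = pd.DataFrame([
    ["SNP1", "A", "T", 1.25, 0.223143551314, 0.97273],
    ["SNP2", "C", "G", 1.07, 0.0676586484738, 0.3],
    ["SNP3", "T", "A", 1.08, 0.0769610411361, 0.1136],
])
# Stand-in for a few rows of the combinations file (one genotype per SNP).
combinations = pd.DataFrame([
    ["AA", "CC", "TT"],
    ["AT", "CG", "TA"],
    ["TT", "GG", "AA"],
])

risk = numbers[1].to_numpy().astype(str)      # shape (n_snps,)
log_or = numbers[4].to_numpy(dtype=float)     # shape (n_snps,)
p = numbers[5].to_numpy(dtype=float)          # shape (n_snps,)
geno = combinations.to_numpy().astype(str)    # shape (n_profiles, n_snps)

# Number of risk alleles (0, 1 or 2) in every genotype; risk broadcasts across rows.
n_risk = np.char.count(geno, risk)

# Score: log(OR) once per risk allele, summed over the SNPs of each profile.
score = (n_risk * log_or).sum(axis=1)

# Hardy-Weinberg genotype frequencies, indexed by risk allele count (rows 0, 1, 2).
hwe = np.stack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])   # shape (3, n_snps)
geno_freq = hwe[n_risk, np.arange(len(p))]                # shape (n_profiles, n_snps)
freq = geno_freq.prod(axis=1)

result = combinations.assign(score=score, freq=freq)
print(result)
```

If the 1e-05 cutoff is still wanted, it could be applied afterwards as a filter, for example result[result["freq"] >= 1e-05]. That discards low-frequency profiles instead of returning partial values the way the loop's break does.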
Hope this helps.