python - Computing a sliding window with a step size in Python
Question
I have this data, loaded with pandas:
SNP = pd.read_csv("C:/Users/sia/Desktop/SNP.txt",delimiter=r"\s+",header=0)
ID Chr Position p
M1 1 4762 0.40
M2 1 77143 0.62
M3 1 130756 0.22
M4 1 227358 0.50
M5 1 265131 0.60
M6 1 568128 0.64
M7 2 2000 0.32
M8 2 18000 0.36
M9 2 60300 0.64
M10 2 71118 0.50
M11 2 71595 0.28
M12 2 200000 0.10
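In Python, how can I get the sum of the p values over a sliding window (100000) with a step size (50000) along the Position column, separately for each Chr, in a new DataFrame like this: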
Chr start end sum.p.slide
1 0 100000 1.02
1 50000 150000 0.84
1 100000 200000 0.22
1 150000 250000 0.50
1 200000 300000 1.10
1 250000 350000 0.60
1 300000 400000 Na
1 350000 450000 Na
1 400000 500000 Na
1 450000 550000 Na
1 500000 600000 0.64
2 0 100000 2.1
2 50000 150000 Na
2 100000 200000 0.1
Answer
I'm sure there is a better way to do this, but here you go.
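import numpy as np
import pandas as pd

df = SNP  # the DataFrame loaded in the question
# Two sets of 100000-wide bins, offset by 50000, together give the 50000 step.
df['range1'] = pd.cut(df.Position, list(range(0, df.Position.max() + 100000, 100000)))
df['range2'] = pd.cut(df.Position, list(range(50000, df.Position.max() + 50000, 100000)))
# observed=False keeps the empty bins so they show up as NaN in the result.
a = df[['range1', 'Chr', 'p']].groupby(['Chr', 'range1'], observed=False).agg({'p': 'sum'})
b = df[['range2', 'Chr', 'p']].groupby(['Chr', 'range2'], observed=False).agg({'p': 'sum'})
# Empty windows sum to 0.0 and are turned into NaN (a genuine sum of 0 would be too).
out = pd.concat([a, b], axis=1).sum(axis=1).replace(0.0, np.nan).reset_index()
out['start'] = out.level_1.apply(lambda x: x.left)
out['end'] = out.level_1.apply(lambda x: x.right)
out.drop(columns=['level_1'], inplace=True)
out.columns = ['Chr', 'sum.p.slide', 'start', 'end']
out[['Chr', 'start', 'end', 'sum.p.slide']]
Output
Chr start end sum.p.slide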
0 1 0 100000 1.02
1 1 50000 150000 0.84
2 1 100000 200000 0.22
3 1 150000 250000 0.50
4 1 200000 300000 1.10
5 1 250000 350000 0.60
6 1 300000 400000 NaN
7 1 350000 450000 NaN
8 1 400000 500000 NaN
9 1 450000 550000 NaN
10 1 500000 600000 0.64
11 2 0 100000 2.10
12 2 50000 150000 1.42
13 2 100000 200000 0.10
14 2 150000 250000 0.10
15 2 200000 300000 NaN
16 2 250000 350000 NaN
17 2 300000 400000 NaN
18 2 350000 450000 NaN
19 2 400000 500000 NaN
20 2 450000 550000 NaN
21 2 500000 600000 NaN
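The answer above works because the window (100000) is exactly twice the step (50000), so two offset sets of bins cover every window. For other window and step combinations, a plain loop over window starts may be easier to follow. The following is only a sketch under a few assumptions that are not in the original post: the helper name `sliding_window_sum` is hypothetical, windows are treated as right-closed intervals (start, end] to match the default behaviour of `pd.cut`, and each chromosome stops at the first window whose end reaches its largest position, which is the layout shown in the question.

```python
import math

import numpy as np
import pandas as pd


def sliding_window_sum(df, window=100_000, step=50_000):
    """Sum `p` per chromosome over windows (start, start + window]."""
    rows = []
    for chrom, grp in df.groupby("Chr"):
        pos = grp["Position"].to_numpy()
        p = grp["p"].to_numpy()
        # Assumption: stop at the first window whose end reaches the
        # largest position on this chromosome.
        last_start = max(0, math.ceil((int(pos.max()) - window) / step) * step)
        for start in range(0, last_start + 1, step):
            end = start + window
            inside = (pos > start) & (pos <= end)
            # Empty windows are reported as NaN rather than 0.
            total = p[inside].sum() if inside.any() else np.nan
            rows.append((chrom, start, end, total))
    return pd.DataFrame(rows, columns=["Chr", "start", "end", "sum.p.slide"])


# Sample data from the question.
snp = pd.DataFrame({
    "ID": [f"M{i}" for i in range(1, 13)],
    "Chr": [1] * 6 + [2] * 6,
    "Position": [4762, 77143, 130756, 227358, 265131, 568128,
                 2000, 18000, 60300, 71118, 71595, 200000],
    "p": [0.40, 0.62, 0.22, 0.50, 0.60, 0.64,
          0.32, 0.36, 0.64, 0.50, 0.28, 0.10],
})
result = sliding_window_sum(snp)
print(result)
```

With the sample data this yields 11 windows for Chr 1 and 3 for Chr 2. The Chr 2 window from 50000 to 150000 contains positions 60300, 71118 and 71595, so it sums to 1.42 as in the answer's output, not the Na shown in the question. The loop is quadratic in the worst case; for large marker sets a sorted-position approach with `np.searchsorted` would likely be faster.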