python - 使用 python 绘制多个样本的 SNP 密度
问题描述
已编辑
你好
我想创建一个 python 程序,将其作为输入:FCV file
和window
并在每个窗口中为所有样本(列)increment value
返回一个带有SNP 密度的图。示例图片如下。
我希望采取的步骤:
- 建立一个 X 碱基宽的窗口并计算该窗口中的多态性数量
- 记录多态计数和窗口的起始位置
- 将窗口向下移动 Y 碱基,计算窗口中的多态性数量。您将计算许多在前一个窗口中计算的相同多态性。
- 记录多态计数和窗口的当前起始位置
- 继续将窗口沿染色体向下移动 Y 碱基,计算多态性,并记录计数和位置数据,直到窗口到达染色体末端
- 对数据框中的所有个人执行此操作
- 为每个人创建(计数、位置)数据的折线图或散点图。图表应为每个人显示一条线
我可以使用 R/Bioconductor 包或 Biopython 来完成,但我需要一个基本的 python 解决方案。请提供任何帮助!谢谢
这是我尝试过的:VCFfile
#!/usr/bin/env python
# libraries
import argparse
import io
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
## Read VCF file
# Read vcf file without headers
def read_vcf(path):
with open(path, 'r') as f:
lines = [l for l in f if not l.startswith('##')]
return pd.read_csv(
io.StringIO(''.join(lines)),
dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
'QUAL': str, 'FILTER': str, 'INFO': str},
sep='\t'
).rename(columns={'#CHROM': 'CHROM'})
df = read_vcf('VCFFile.vcf')
# cleaning data
## format CHROM column
df['CHROM'] = df['CHROM'].str.replace('chr0','').astype(int)
## select useful columns: all columns except not useful ones
df = df[df.columns.difference(['ID', 'INFO', 'REF', 'ALT', 'QUAL', 'FILTER', 'FORMAT'])]
# Get alleles for each sample
def get_alleles(df):
for i in df.columns.difference(['CHROM', 'POS']):
suffix= str(i) + '_genotype'
df[suffix] = df[str(i)].astype(str).str[0:3]
#df.drop(str(i), axis=1)
#df = df[df.columns.drop(str(i))]
# apply the function
get_alleles(df)
# remove original genotype columns
filter_col = [col for col in df if col.endswith('genotype')]
filter_col.append('CHROM')
filter_col.append('POS')
df = df[filter_col]
# replace genotypes: 1/1 by 1, else by 0
list_values = ['0/0', './.', './0', '0/.', '1/0', '0/1']
df = df.replace(to_replace =list_values, value ='NaN')
df = df.replace(to_replace ='1/1', value =1)
现在我想绘制每个样本的 SNP 密度:
# plot SNP density for each sample ==========================================
# get data for each sample
# create a function to select columns
def select_sample(col):
x = df[['POS', str(col)]]
#remove NaN
x = x[x[str(col)] ==1]
return x
sample_1 = select_sample("A_genotype")
sample_2 = select_sample("B_genotype")
sample_3 = select_sample("C_genotype")
sample_4 = select_sample("D_genotype")
sample_5 = select_sample("E_genotype")
sample_6 = select_sample("F_genotype")
sample_7 = select_sample("I_genotype")
sample_8 = select_sample("P_genotype")
我无法添加 incrementValue 以获得如下图。图 1 - 使用 1,000,000 窗口大小和 100,000 增量的多态性密度图
def plot_windowed_variant_density(pos, window_size, incrementValue=None, title, ax):
# setup windows
bins = np.arange(0, pos.max(), window_size)
print(bins)
#incrementValue
#incrementValue = ???????????
# use window midpoints as x coordinate
x = (bins[1:] + bins[:-1])/2
# compute variant density in each window
count, _ = np.histogram(sample['POS'], bins=bins)
y= count
# plot
sns.despine(ax=ax, offset=10)
ax.plot(x, y)
ax.set_xlabel('Chromosome position (Mb)')
ax.set_ylabel('Count')
if title:
ax.set_title(title)
#====================================================
fig, ax = plt.subplots(figsize=(12, 3))
# Apply the function:
for i in [sample_1, sample_2, sample_3, sample_4, sample_5, sample_6, sample_7, sample_8]:
plot_windowed_variant_density(i.POS, 1000000,'test', ax)
解决方案
如果将图形的 ax 添加到函数参数,则可以在同一个图形上创建覆盖。
# plot SNP density ==========================================
def plot_windowed_variant_density(pos, window_size, title, ax):
# setup windows
bins = np.arange(0, pos.max(), window_size)
# use window midpoints as x coordinate
x = (bins[1:] + bins[:-1])/2
# compute variant density in each window
count, _ = np.histogram(pos, bins=bins)
y= count
# plot
sns.despine(ax=ax, offset=10)
ax.plot(x, y)
ax.set_xlabel('Chromosome position (Mb)')
ax.set_ylabel('Count')
if title:
ax.set_title(title)
#====================================================
fig, ax = plt.subplots(figsize=(12, 3))
# Apply the function: I can use a for loop
for i in [sample_1,sample_2,sample_3]:
plot_windowed_variant_density(i.POS, 1000000,'test', ax)
#plot_windowed_variant_density(sample_2.POS, 1000000,'test', ax)
推荐阅读
- javascript - babel 目标到旧版本浏览器,它在“react/jsx-runtime”中找不到导出“jsx”(导入为“_jsx”)(可能的导出:__esModule)
- apache-iotdb - 如何确定子路径是 Apache IoTDB 中的节点还是时间序列
- flutter - 谁能帮我找到这个突出主题
- arrays - PowerShell - 使用已排序的对象数组打印 JSON 输出?
- synchronize - 如何使用 watermelondb 为同步功能传递数据
- visual-studio-code - 如何在可视代码中消除此预定义宏
- microservices - 没有为网关和非反应式微服务生成 JHipster 实体过滤代码
- python - 如何修复 OSError: [WinError 2] 系统找不到指定的文件
- android - 从可组合函数中的另一个类调用方法
- qt - 从 CMake 生成 qt 帮助文件