首页 > 解决方案 > 使用 python 绘制多个样本的 SNP 密度

问题描述

已编辑

你好

我想创建一个 python 程序,将其作为输入:FCV filewindow并在每个窗口中为所有样本(列)increment value返回一个带有SNP 密度的图。示例图片如下。

我希望采取的步骤:

  1. 建立一个 X 碱基宽的窗口并计算该窗口中的多态性数量
  2. 记录多态计数和窗口的起始位置
  3. 将窗口向下移动 Y 碱基,计算窗口中的多态性数量。您将计算许多在前一个窗口中计算的相同多态性。
  4. 记录多态计数和窗口的当前起始位置
  5. 继续将窗口沿染色体向下移动 Y 碱基,计算多态性,并记录计数和位置数据,直到窗口到达染色体末端
  6. 对数据框中的所有个人执行此操作
  7. 为每个人创建(计数、位置)数据的折线图或散点图。图表应为每个人显示一条线

我可以使用 R/Bioconductor 包或 Biopython 来完成,但我需要一个基本的 python 解决方案。请提供任何帮助!谢谢

这是我尝试过的:VCFfile

#!/usr/bin/env python
# libraries
import argparse
import io
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

    ## Read VCF file
# Read vcf file without headers
def read_vcf(path):
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

df = read_vcf('VCFFile.vcf')

# cleaning data
## format CHROM column
df['CHROM'] = df['CHROM'].str.replace('chr0','').astype(int)

## select useful columns: all columns except not useful ones
df = df[df.columns.difference(['ID', 'INFO', 'REF', 'ALT', 'QUAL', 'FILTER', 'FORMAT'])]

# Get alleles for each sample
def get_alleles(df):
    for i in df.columns.difference(['CHROM', 'POS']):
        suffix=  str(i) + '_genotype'
        df[suffix] = df[str(i)].astype(str).str[0:3]
        #df.drop(str(i), axis=1)
        #df = df[df.columns.drop(str(i))]
# apply the function
get_alleles(df)

# remove original genotype columns
filter_col = [col for col in df if col.endswith('genotype')]
filter_col.append('CHROM')
filter_col.append('POS')

df = df[filter_col]

# replace genotypes: 1/1 by 1, else by 0
list_values = ['0/0', './.', './0', '0/.', '1/0', '0/1']
df = df.replace(to_replace =list_values, value ='NaN')
df = df.replace(to_replace ='1/1', value =1)

现在我想绘制每个样本的 SNP 密度:

# plot SNP density for each sample ==========================================
# get data for each sample
# create a function to select columns
def select_sample(col):
    x = df[['POS', str(col)]]
    #remove NaN
    x = x[x[str(col)] ==1]
    return x

sample_1 = select_sample("A_genotype")
sample_2 = select_sample("B_genotype")
sample_3 = select_sample("C_genotype")
sample_4 = select_sample("D_genotype")
sample_5 = select_sample("E_genotype")
sample_6 = select_sample("F_genotype")
sample_7 = select_sample("I_genotype")
sample_8 = select_sample("P_genotype")

我无法添加 incrementValue 以获得如下如下图图。图 1 - 使用 1,000,000 窗口大小和 100,000 增量的多态性密度图

def plot_windowed_variant_density(pos, window_size, incrementValue=None, title, ax):

    # setup windows 
    bins = np.arange(0, pos.max(), window_size)
    print(bins)
    
    #incrementValue
    #incrementValue = ???????????
    
    # use window midpoints as x coordinate
    x = (bins[1:] + bins[:-1])/2
    
    # compute variant density in each window
    count, _ = np.histogram(sample['POS'], bins=bins)
    y= count
    # plot
    sns.despine(ax=ax, offset=10)
    ax.plot(x, y)
    ax.set_xlabel('Chromosome position (Mb)')
    ax.set_ylabel('Count')
    if title:
        ax.set_title(title)
#====================================================

fig, ax = plt.subplots(figsize=(12, 3))
# Apply the function: 
for i in [sample_1, sample_2, sample_3, sample_4, sample_5, sample_6, sample_7, sample_8]:
    plot_windowed_variant_density(i.POS, 1000000,'test', ax)

标签: pythonmatplotlibplotlyseabornbioinformatics

解决方案


如果将图形的 ax 添加到函数参数,则可以在同一个图形上创建覆盖。

# plot SNP density ==========================================
def plot_windowed_variant_density(pos, window_size, title, ax):

    # setup windows 
    bins = np.arange(0, pos.max(), window_size)

    # use window midpoints as x coordinate
    x = (bins[1:] + bins[:-1])/2
    
    # compute variant density in each window
    count, _ = np.histogram(pos, bins=bins)

    y= count

    # plot
    sns.despine(ax=ax, offset=10)
    ax.plot(x, y)
    ax.set_xlabel('Chromosome position (Mb)')
    ax.set_ylabel('Count')
    if title:
        ax.set_title(title)
#====================================================

fig, ax = plt.subplots(figsize=(12, 3))
# Apply the function: I can use a for loop
for i in [sample_1,sample_2,sample_3]:
    plot_windowed_variant_density(i.POS, 1000000,'test', ax)
    #plot_windowed_variant_density(sample_2.POS, 1000000,'test', ax)

在此处输入图像描述


推荐阅读