Parsing a txt file in VCF format

Problem description

I want to extract information from a txt file into a data frame. The data contains the following fields:

1) GENEINFO
2) ID
3) POS
4) ALT
5) CLNSIG
6) CLNDN 

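The txt file is here.

I wrote the following code to try to get the information out of the file, but I don't know how to continue. Could you give me some ideas?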

import io
import pandas as pd


def read_vcf(path):
    # Open the file passed in as `path` rather than a hard-coded file name
    with open(path, 'r') as f:
        # Drop the '##' meta-information lines; keep the header and data rows
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

Tags: python, pandas, vcf-variant-call-format

Solution


You can read the file directly with pandas:

df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')

After that you have a table with the columns 2) ID, 3) POS, and 4) ALT:

print(df[['ID', 'POS', 'ALT']].head())

       ID      POS ALT
0  475283  1014042   A
1  542074  1014122   T
2  183381  1014143   T
3  542075  1014179   T
4  475278  1014217   T
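
To make the effect of comment='#' concrete, here is a minimal sketch on an in-memory sample. The sample text is hypothetical and assumes, as the output above suggests, that the header row of clinvar_final.txt does not start with '#'. In a standard VCF the header row begins with '#CHROM', so comment='#' would skip it as well. pandas also treats a '#' in the middle of a line as the start of a comment and drops the rest of that line.

```python
import io
import pandas as pd

# Hypothetical sample, not the real file: one '##' meta line,
# a header row WITHOUT a leading '#', and one data row.
sample = (
    "##fileformat=VCFv4.1\n"
    "CHROM\tPOS\tID\tREF\tALT\tFILTER\tQUAL\tINFO\n"
    "1\t1014042\t475283\tG\tA\t.\t.\tGENEINFO=ISG15:9636;CLNSIG=Benign\n"
)

# Lines that start with '#' are skipped entirely, so the first
# remaining line is used as the header row.
demo = pd.read_csv(io.StringIO(sample), comment='#', sep='\t')

print(demo.columns.tolist())
print(demo[['ID', 'POS', 'ALT']])
```

If your file uses the standard '#CHROM' header, filtering out only the '##' lines (as in the question's read_vcf function) is likely the safer route.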

The other fields (1) GENEINFO, 5) CLNSIG, 6) CLNDN) are stored together as a single string in the INFO column. You can pull them out with a regex:

df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')

print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())

Result

0    ISG15:9636
1    ISG15:9636
2    ISG15:9636
3    ISG15:9636
4    ISG15:9636
Name: GENEINFO, dtype: object

0                    Benign
1    Uncertain_significance
2                Pathogenic
3    Uncertain_significance
4                    Benign
Name: CLNSIG, dtype: object

0    Immunodeficiency_38_with_basal_ganglia_calcifi...
1    Immunodeficiency_38_with_basal_ganglia_calcifi...
2    Immunodeficiency_38_with_basal_ganglia_calcifi...
3    Immunodeficiency_38_with_basal_ganglia_calcifi...
4    Immunodeficiency_38_with_basal_ganglia_calcifi...
Name: CLNDN, dtype: object
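
If you later need more INFO keys than these three, one alternative to writing a regex per key is to split each INFO string into a dictionary. The following is only a sketch: parse_info is a hypothetical helper name, the two sample strings are made up, and it assumes the usual 'KEY=VALUE;KEY=VALUE' layout with optional valueless flags.

```python
import pandas as pd


def parse_info(info):
    """Split 'KEY=VALUE;KEY=VALUE;FLAG' into a dict (hypothetical helper).

    Entries without '=' are treated as flags and mapped to True.
    """
    fields = {}
    for item in info.split(';'):
        if not item:
            continue  # skip empty pieces, e.g. from a trailing ';'
        key, sep, value = item.partition('=')
        fields[key] = value if sep else True
    return fields


# Made-up INFO strings for illustration only.
info_demo = pd.Series([
    "GENEINFO=ISG15:9636;CLNSIG=Benign;CLNDN=not_specified",
    "GENEINFO=ISG15:9636;CLNSIG=Pathogenic",
])

# One column per key; keys missing from a row become NaN.
parsed = pd.DataFrame(info_demo.map(parse_info).tolist())
print(parsed)
```

This is slower than str.extract on large files because it runs Python code per row, so the regex approach above is probably the better fit when you only need a few keys.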

Complete code:

import pandas as pd

df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')

print(df.columns)

print(df[['ID', 'POS', 'ALT']].head())

df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')

print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())
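
To end up with a single data frame holding exactly the six fields from the question, the steps can be wrapped in one function. This is a sketch under the same assumptions as the answer (tab-separated file, header row without a leading '#'). The names extract_fields and WANTED are hypothetical, and the regex is anchored with (?:^|;) so that a key is only matched at the start of an INFO entry.

```python
import io
import pandas as pd

# Column order follows the list in the question.
WANTED = ['GENEINFO', 'ID', 'POS', 'ALT', 'CLNSIG', 'CLNDN']


def extract_fields(source):
    """Read a tab-separated VCF-like file and return only the requested columns.

    `source` may be a file path or a file-like object (hypothetical helper).
    """
    df = pd.read_csv(source, comment='#', sep='\t')
    for key in ('GENEINFO', 'CLNSIG', 'CLNDN'):
        # expand=False returns a Series; rows without the key become NaN.
        df[key] = df['INFO'].str.extract(rf'(?:^|;){key}=([^;]*)', expand=False)
    return df[WANTED]


# Usage with the file from the question (path assumed to exist):
# result = extract_fields('clinvar_final.txt')
# print(result.head())
```

Because read_csv accepts both paths and file-like objects, the same function can also be tried on an io.StringIO sample before running it on the full file.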
