c++ - How to extract genotype information for each sample as a string from a VCF file using htslib?
Problem description
I am using htslib for extracting all the information contained in a VCF file in C++.
Currently, thanks to the VCF specification and the documentation in the file vcf.h, I have successfully extracted all the metadata information in the header (Meta-Information Lines), and most of the information contained in each row of the body of the file (Data Lines).
However, I don't know how to extract the genotype information (sample columns).
I am using example files from the 1000G project. Below are two rows from one of the files, showing the FORMAT field and two samples (each row has more than 1000 samples, and I would like to extract the data for all of them):
FORMAT HG00096 HG00097
GT:DS:GL 0|0:0.050:-0.48,-0.48,-0.48 0|0:0.050:-0.24,-0.40,-1.49
GT:DS:GL 0|0:0.000:-0.10,-0.69,-4.70 0|0:0.000:-0.05,-0.94,-5.00
I know that this is a heavy task that will take some computation time. I have extracted the names of the sample columns (HG00096, HG00097, ...), but I don't know how to extract the information for each sample, whether as a full string (e.g., "0|0:0.050:-0.48,-0.48,-0.48"), as a collection (array, map, vector...) of key-value pairs (e.g., [("GT", "0|0"), ("DS", "0.050"), ("GL", "-0.48,-0.48,-0.48")]), or simply as an array of values (e.g., ["0|0", "0.050", "-0.48,-0.48,-0.48"]). I would like to do this for each sample.
I have been reading the documentation in the vcf.h file and I think that the function bcf_get_genotypes(hdr,line,dst,ndst) may be suitable for this, but I am not sure how to use it to extract the values as strings. I also suspect that this information is stored behind the 'p' pointer of 'bcf_fmt_t', but that pointer just refers to an array of uint8_t values, and I don't know whether a string (or char array) can be extracted from it in the way I want.
typedef struct bcf_fmt_t {
int id;
int n, size, type;
uint8_t *p;
uint32_t p_len;
uint32_t p_off:31, p_free:1;
} bcf_fmt_t;
Is there a way of doing what I am trying to do?
Solution
I finally figured it out. htslib provides several functions for this, and which one to use depends on the type declared in the header for the FORMAT ID. They are all declared in htslib's vcf.h file:
#define bcf_get_format_int32(hdr,line,tag,dst,ndst) bcf_get_format_values(hdr,line,tag,(void**)(dst),ndst,BCF_HT_INT)
#define bcf_get_format_float(hdr,line,tag,dst,ndst) bcf_get_format_values(hdr,line,tag,(void**)(dst),ndst,BCF_HT_REAL)
#define bcf_get_format_char(hdr,line,tag,dst,ndst) bcf_get_format_values(hdr,line,tag,(void**)(dst),ndst,BCF_HT_STR)
#define bcf_get_genotypes(hdr,line,dst,ndst) bcf_get_format_values(hdr,line,"GT",(void**)(dst),ndst,BCF_HT_INT)
HTSLIB_EXPORT
int bcf_get_format_string(const bcf_hdr_t *hdr, bcf1_t *line, const char *tag, char ***dst, int *ndst);
HTSLIB_EXPORT
int bcf_get_format_values(const bcf_hdr_t *hdr, bcf1_t *line, const char *tag, void **dst, int *ndst, int type);
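These functions return numbers, not text, so the GT string has to be rebuilt from the integers. As far as I understand it, bcf_get_genotypes fills a flat int32_t array holding the same number of slots for every sample (its return value divided by bcf_hdr_nsamples(hdr)). Each slot encodes the allele index and a phase bit, and samples with fewer alleles are padded with bcf_int32_vector_end. Below is a minimal sketch of that decoding step. It deliberately does not include htslib: the three small helpers and kVectorEnd are local stand-ins that I assume mirror bcf_gt_is_missing, bcf_gt_is_phased, bcf_gt_allele and bcf_int32_vector_end from vcf.h, and genotype_strings is a hypothetical name of my own. Check the values against your copy of vcf.h, or simply use the real macros.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Local stand-ins for the htslib GT encoding. Assumption: these match
// bcf_int32_vector_end and the bcf_gt_* macros in your vcf.h.
constexpr int32_t kVectorEnd = INT32_MIN + 1;

inline bool gt_is_missing(int32_t v) { return (v >> 1) == 0; }
inline bool gt_is_phased(int32_t v)  { return (v & 1) != 0; }
inline int  gt_allele(int32_t v)     { return (v >> 1) - 1; }

// Hypothetical helper: rebuilds one GT string per sample.
//   gt       - flat array filled by bcf_get_genotypes
//   ngt      - value returned by bcf_get_genotypes (total number of slots)
//   nsamples - value returned by bcf_hdr_nsamples(hdr)
std::vector<std::string> genotype_strings(const int32_t* gt, int ngt, int nsamples) {
    std::vector<std::string> out;
    if (gt == nullptr || ngt <= 0 || nsamples <= 0) return out;

    const int max_ploidy = ngt / nsamples;  // slots reserved per sample
    out.reserve(nsamples);

    for (int i = 0; i < nsamples; ++i) {
        const int32_t* p = gt + i * max_ploidy;  // this sample's slots
        std::string s;
        for (int j = 0; j < max_ploidy; ++j) {
            if (p[j] == kVectorEnd) break;  // sample has lower ploidy
            // The phase bit of an allele describes the separator before it.
            if (j > 0) s += gt_is_phased(p[j]) ? '|' : '/';
            s += gt_is_missing(p[j]) ? std::string(".")
                                     : std::to_string(gt_allele(p[j]));
        }
        out.push_back(s);
    }
    return out;
}
```

With htslib, the inputs would come from something like `int32_t *gt = NULL; int ngt_arr = 0; int ngt = bcf_get_genotypes(hdr, line, &gt, &ngt_arr);`, with `free(gt)` once all records have been read. The float tags (DS and GL here) seem to follow the same flat, per-sample layout when read with bcf_get_format_float, so the same indexing should apply, and the full "GT:DS:GL" string for a sample can be assembled by joining the pieces with ':'.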