r - Filter read_lines output?
Question
Given a read_lines output:
c("# This data file generated by ffffff at: Wed Jan 13 11:57:32 2011",
"#", "# This file contains raw genotype data, including data that is not used in ffffff reports.",
"# This data has undergone a general quality review however only a subset of markers have been ",
"# individually validated for accuracy. As such, this data is suitable only for research, ",
"# educational, and informational use and not for medical or other use.",
"# ", "# Below is a text version of your data. Fields are TAB-separated",
"# Each line corresponds to a single SNP. For each SNP, we provide its identifier ",
"# (an rsid or an internal id), its location on the reference human genome, and the ",
"# genotype call oriented with respect to the plus strand on the human reference sequence.",
"# We are using reference human assembly build 37 (also known as Annotation Release 104).",
"# Note that it is possible that data downloaded at different times may be different due to ongoing ",
"# improvements in our ability to call genotypes. More information about these changes can be found at:",
"# fffffffff",
"# ", "# More information on reference human assembly builds:",
"# ffffffffffffffff",
"#", "# rsid\tchromosome\tposition\tgenotype", "rs548049170\t1\t69869\tTT",
"rs13328684\t1\t74792\t--", "rs9283150\t1\t565508\tAA", "i713426\t1\t726912\t--",
"rs116587930\t1\t727841\tGG", "rs3131972\t1\t752721\tAG", "rs12184325\t1\t754105\tCC",
"rs12567639\t1\t756268\tAA", "rs114525117\t1\t759036\tGG", "rs12124819\t1\t776546\tAA",
"rs12127425\t1\t794332\tGG", "rs79373928\t1\t801536\tTT", "rs72888853\t1\t815421\t--",
"rs7538305\t1\t824398\tAC", "rs28444699\t1\t830181\tAA", "i713449\t1\t830731\t--",
"rs116452738\t1\t834830\tGG", "rs72631887\t1\t835092\tTT", "rs28678693\t1\t838665\tTT",
"rs4970382\t1\t840753\tCC", "rs4475691\t1\t846808\tCC", "rs72631889\t1\t851390\tGG",
"rs7537756\t1\t854250\tAA", "rs13302982\t1\t861808\tGG", "rs376747791\t1\t863130\tAA",
"rs2880024\t1\t866893\tCC", "rs13302914\t1\t868404\tTT", "rs76723341\t1\t872952\tCC",
"rs2272757\t1\t881627\tAA", "rs35471880\t1\t881918\tGG")
I want to read it with read_csv, but first I need to filter out all the lines that start with #.
How can I parse the file starting from the rows that don't start with #?
Solution
Your file appears to be a tab-separated data set with comment lines that start with #. I'd suggest
readr::read_tsv("your_file", comment="#")
You might also need col_names=FALSE, since it looks like your header row is commented out too (this is awkward; it would be best to fix it upstream if you can).
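As a minimal sketch of both routes (filtering the read_lines output by hand, or letting readr skip the comments), assuming a shortened stand-in for the vector in the question: the variable names lines, data_lines, and genotypes are illustrative, the column names are copied from the commented header line, and the column types are an assumption about the data.

```r
library(readr)

# Shortened stand-in for the read_lines() output shown in the question
lines <- c(
  "# This file contains raw genotype data",
  "#",
  "# rsid\tchromosome\tposition\tgenotype",
  "rs548049170\t1\t69869\tTT",
  "rs13328684\t1\t74792\t--",
  "rs9283150\t1\t565508\tAA"
)

# Column names taken from the commented header line
cols <- c("rsid", "chromosome", "position", "genotype")

# Route 1: drop the comment lines yourself, then parse what is left.
# I() marks the string as literal data rather than a file path.
data_lines <- lines[!startsWith(lines, "#")]
genotypes <- read_tsv(
  I(paste0(paste(data_lines, collapse = "\n"), "\n")),
  col_names = cols,
  col_types = "ccic"  # assumed types: chromosome kept as character
)

# Route 2: let readr skip the comment lines while parsing
genotypes2 <- read_tsv(
  I(paste0(paste(lines, collapse = "\n"), "\n")),
  comment = "#",
  col_names = cols,
  col_types = "ccic"
)

print(genotypes)
```

When reading from disk, route 2 needs no intermediate read_lines step: pass the file path in place of the I(...) string.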