r - tidyverse: readr read_delim errors with tabs and semicolons?
Problem description
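I have a tab-delimited file of genetic variants in which the last field, Otherinfo,
is a long string of semicolon-separated tags. For some reason this makes readr
report errors, as shown below. Is this expected behavior? How can I work around it?
Many thanks.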
> head myanno_AllChr_ExAC38.hg38_multianno.txt
Chr Start End Ref Alt ExAC_ALL ExAC_AFR ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS Otherinfo
1 15847952 15847952 G C . . . . . . . . . 241.9 76196 1 15847952 . G C 241.9 PASS AC=2;AF=0;AN=18332;BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406;culprit=MQ
1 15847963 15847963 A C . . . . . . . . . 1607.1 126156 1 15847963 . A C 1607.1 PASS AC=2;AF=0;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=2;MLEAF=0;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995;culprit=QD
1 15847964 15847966 GCC - . . . . . . . . . 1607.1 126156 1 15847963 . AGCC A 1607.1 PASS AC=63;AF=0.003;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=55;MLEAF=0.002;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995;culprit=QD
1 15847978 15847978 C T . . . . . . . . . 648.41 234344 1 15847978 . C T 648.41 PASS AC=9;AF=0;AN=25894;BaseQRankSum=-0.572;ClippingRankSum=-0.404;DP=234344;ExcessHet=3.348;FS=2.639;InbreedingCoeff=-0.0098;MLEAC=6;MLEAF=0;MQ=58.71;MQRankSum=-0.456;NEGATIVE_TRAIN_SITE;QD=4.13;ReadPosRankSum=-0.456;SOR=0.452;VQSLOD=-1.238;culprit=QD
1 15847979 15847979 G T . . . . . . . . . 315.48 243578 1 15847979 . G T 315.48 PASS AC=1;AF=0;AN=26062;BaseQRankSum=0.301;ClippingRankSum=0.356;DP=243578;ExcessHet=3.1213;FS=0;InbreedingCoeff=-0.0072;MLEAC=1;MLEAF=0;MQ=58.83;MQRankSum=-1.505;QD=12.62;ReadPosRankSum=0.684;SOR=0.495;VQSLOD=-0.1437;culprit=MQRankSum
Running the following command:
variant.freqs <- read_tsv("AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt")
returns:
Parsed with column specification:
cols(
Chr = col_integer(),
Start = col_integer(),
End = col_integer(),
Ref = col_character(),
Alt = col_character(),
ExAC_ALL = col_character(),
ExAC_AFR = col_character(),
ExAC_AMR = col_character(),
ExAC_EAS = col_character(),
ExAC_FIN = col_character(),
ExAC_NFE = col_character(),
ExAC_OTH = col_character(),
ExAC_SAS = col_character(),
Otherinfo = col_character()
)
along with the following warnings:
number of columns of result is not a multiple of vector length (arg 1)
152306 parsing failures.
row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1 NA 14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
file 2 2 NA 14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
row 3 3 NA 14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
col 4 4 NA 14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
expected 5 5 NA 14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
View(variant.freqs)
Solution
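Judging from the sample data, the first line has 14 tab-separated fields and the second has 24: the header does not have enough names for the number of columns in the data.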
> fl = "foo.txt"
> lengths(strsplit(readLines(fl, 2), "\t"))
[1] 14 24
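The check above looks only at the first two lines. To see whether the mismatch holds for every row, one option is base R's `count.fields()`. The snippet below is a sketch, not part of the original answer; the temporary file is a made-up miniature (3 header names, 5 fields per data row) standing in for the real annotation file.

```r
# Hypothetical miniature of the file: 3 header names, 5 fields per data row.
fl <- tempfile(fileext = ".txt")
writeLines(c(
  "Chr\tStart\tOtherinfo",
  "1\t15847952\t.\t241.9\tAC=2;AF=0",
  "1\t15847963\t.\t1607.1\tAC=2;AF=0"
), fl)

# Number of tab-separated fields on every line. Quoting and comment handling
# are switched off so characters such as ' or # are treated as plain data.
n_fields <- count.fields(fl, sep = "\t", quote = "", comment.char = "")

# Tabulate how many lines have each field count.
field_counts <- table(n_fields)
print(field_counts)
```

If the table shows one line with the header's field count and every other line with a single larger count, the data rows are consistent and only the header is short.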
In more detail:
> res = strsplit(readLines(fl, 2), "\t")
> res[[1]][14] # first line, final header
[1] "Otherinfo"
> res[[2]][14] # second line, entry in position 14
[1] "."
> res[[2]][15] # second line, entry in position 15
[1] "241.9"
> res[[2]][24] # second line, entry in position 24
[1] "AC=2;AF=0;AN=18332;BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406;culprit=MQ"
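One possible workaround, sketched here and not taken from the original answer, is to pad the header with placeholder names and pass the full set to `read_tsv()` through `col_names`, skipping the original header line. The file below is a made-up miniature, the `Otherinfo_2`, `Otherinfo_3`, ... names are invented for illustration, and reading every column as character is a simplifying assumption. For the file in the question the same logic would produce 24 names.

```r
library(readr)

# Hypothetical miniature of the file: 3 header names, 5 fields per data row.
fl <- tempfile(fileext = ".txt")
writeLines(c(
  "Chr\tStart\tOtherinfo",
  "1\t15847952\t.\t241.9\tAC=2;AF=0",
  "1\t15847963\t.\t1607.1\tAC=2;AF=0"
), fl)

# Split the header line and the first data line on tabs.
first_two <- strsplit(readLines(fl, n = 2), "\t")
header <- first_two[[1]]
n_data <- length(first_two[[2]])

# Pad the header with placeholder names (made up for this sketch).
n_extra <- n_data - length(header)
extra_names <- if (n_extra > 0) {
  paste0("Otherinfo_", seq_len(n_extra) + 1)
} else {
  character(0)
}
all_names <- c(header, extra_names)

# Skip the original header line and supply the padded names instead.
# All columns are read as character; convert types afterwards as needed.
variant.freqs <- read_tsv(
  fl,
  skip = 1,
  col_names = all_names,
  col_types = cols(.default = col_character())
)
```

The semicolon-separated tags then end up intact in the last column, since semicolons are not delimiters for `read_tsv()`.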