Read selected columns from a CSV, edit them with data.table, and write them back to the same CSV?

Question

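I have a very large CSV: the normalized gene expression matrix of a huuuuuuge single-cell RNAseq dataset. I made the mistake of not converting the mouse gene names to human gene names.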

Here is an example of what the data looks like. I want to change the values in the gene column to their human equivalents.

library(data.table)
x <- fread('https://raw.githubusercontent.com/dbrookeUAB/shared_files/master/example.csv')
x

      gene cell_1   cell_2 cell_3   cell_4 cell_5 cell_6   cell_7 cell_8
  1: Gsk3b      0 1.334471      0 0.000000      0      0 0.000000      0
  2: Fgfr1      0 0.000000      0 0.000000      0      0 0.000000      0
  3:  Cd8a      0 0.000000      0 0.000000      0      0 0.000000      0
  4: Aurkb      0 0.000000      0 0.000000      0      0 0.000000      0
  5:   Tub      0 0.000000      0 0.000000      0      0 0.000000      0
  6: Casp9      0 0.000000      0 0.000000      0      0 0.000000      0
  7:   Cd4      0 0.000000      0 0.000000      0      0 0.000000      0
  8:  Cd19      0 0.000000      0 0.000000      0      0 0.000000      0
  9: Itgam      0 0.875049      0 1.591288      0      0 0.000000      0
 10: Itgax      0 0.000000      0 0.000000      0      0 1.719341      0
        cell_9  cell_10
  1: 0.9982402 0.000000
  2: 0.0000000 0.000000
  3: 0.0000000 0.000000
  4: 0.0000000 0.000000
  5: 0.0000000 0.000000
  6: 0.0000000 0.000000
  7: 0.0000000 0.000000
  8: 0.0000000 0.000000
  9: 0.0000000 1.324255
 10: 0.9982402 0.000000

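I know you can read specific columns with data.table, but I am wondering whether there is a way to write a single column back so that it replaces that column in the original CSV. That seems more efficient than reading the entire dataset just to fix one column.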

Thanks for any thoughts or ideas!

Tags: r, csv, data.table

Solution


I will assume you have a translation from gene to human (gene). Forgive me for not knowing those off the top of my head, so I will just use some LETTERS.

(By the way: the fact that the gene is in the first column makes this much more feasible and robust.)

genes <- unique(fread("example.csv", select = 1))[, human := LETTERS[seq_len(.N)]]
genes[]
#      gene human
#  1: Gsk3b     A
#  2: Fgfr1     B
#  3:  Cd8a     C
#  4: Aurkb     D
#  5:   Tub     E
#  6: Casp9     F
#  7:   Cd4     G
#  8:  Cd19     H
#  9: Itgam     I
# 10: Itgax     J

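Since the source data has no quotes, a simple pattern is enough. I am assuming none of the genes contain spaces (which might prompt some CSV editing tools to add quotes) or commas (which require quotes). Usually, if any value in a column needs quoting, then all values in that column are quoted, but that is only my experience.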
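
That assumption can be checked before running anything. A minimal sketch, using a small made-up file (sample.csv and its rows are hypothetical, not the real dataset), that counts data lines whose first field starts with a double quote:

```shell
# Hypothetical sample standing in for the real CSV; one row has a quoted gene.
tmpdir=$(mktemp -d)
printf '%s\n' 'gene,cell_1,cell_2' 'Gsk3b,0,1.33' '"Foo, bar",0,0' > "$tmpdir/sample.csv"

# Skip the header, then count lines whose first field begins with a quote.
quoted=$(tail -n +2 "$tmpdir/sample.csv" | grep -c '^"')
echo "quoted first fields: $quoted"
```

A count of 0 on the real file would suggest the simple unquoted pattern is safe to use.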

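Assuming there are more than "a few" genes to translate, it is best to put the translations in a file and tell sed to read its commands from there.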

writeLines(genes[, sprintf("s/^%s,/%s,/", gene, human)], "gene_to_human.sed")
readLines("gene_to_human.sed", n = 3)
# [1] "s/^Gsk3b,/A,/" "s/^Fgfr1,/B,/" "s/^Cd8a,/C,/" 
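
The leading ^ and the trailing comma in each generated command are what keep the match confined to the whole first field. A small sketch with two made-up rows (Cd40 and the stray trailing "Cd4," are invented for illustration) shows the effect:

```shell
# Cd4 and Cd40 share a prefix; the second row also has "Cd4," later in the line.
out=$(printf '%s\n' 'Cd4,0,1' 'Cd40,2,Cd4,' | sed 's/^Cd4,/CD4,/')
echo "$out"
# CD4,0,1
# Cd40,2,Cd4,
```

Only the row whose first field is exactly Cd4 changes. Note that the gene names are pasted into the pattern as-is, so a name containing a regex metacharacter such as "." or the "/" delimiter would need escaping first.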

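From here, it is just a system call to sed. I am doing this from R, but frankly it can be done just as easily from the command line (cmd.exe, bash, and so on). Neither route involves a compromise, so use whichever you prefer.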

Sys.which("sed") # included in Rtools
#                               sed 
# "c:\\Rtools40\\usr\\bin\\sed.exe" 
system2("sed", c("-f", "gene_to_human.sed", "example.csv"), stdout = "example2.csv")
readLines("example.csv", n = 3)
# [1] "gene,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10"
# [2] "Gsk3b,0,1.33447078154433,0,0,0,0,0,0,0.998240198583345,0"                   
# [3] "Fgfr1,0,0,0,0,0,0,0,0,0,0"                                                  
readLines("example2.csv", n = 3)
# [1] "gene,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10"
# [2] "A,0,1.33447078154433,0,0,0,0,0,0,0.998240198583345,0"                       
# [3] "B,0,0,0,0,0,0,0,0,0,0"                                                      

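This creates a new file. If you are tight on drive space and/or feel bold and want to truly modify the file in place (overwriting the original), add the -i argument to sed: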

readLines("example.csv", n = 3)
# [1] "gene,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10"
# [2] "Gsk3b,0,1.33447078154433,0,0,0,0,0,0,0.998240198583345,0"                   
# [3] "Fgfr1,0,0,0,0,0,0,0,0,0,0"                                                  
system2("sed", c("-i", "-f", "gene_to_human.sed", "example.csv"))
readLines("example.csv", n = 3)
# [1] "gene,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10"
# [2] "A,0,1.33447078154433,0,0,0,0,0,0,0.998240198583345,0"                       
# [3] "B,0,0,0,0,0,0,0,0,0,0"                                                      

That said, until you are comfortable/satisfied with the results, I think keeping the original file is probably "A Good Thing (tm)".
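
One middle ground, assuming GNU sed, is to give -i a backup suffix so that the edit happens in place but the untouched original is kept alongside it. A minimal sketch on a throwaway two-line file (the temporary directory and its contents are made up for illustration):

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'gene,cell_1' 'Gsk3b,0' > "$tmpdir/example.csv"
printf '%s\n' 's/^Gsk3b,/A,/' > "$tmpdir/gene_to_human.sed"

# -i.bak edits example.csv in place and keeps the original as example.csv.bak
sed -i.bak -f "$tmpdir/gene_to_human.sed" "$tmpdir/example.csv"

edited=$(sed -n '2p' "$tmpdir/example.csv")
backup=$(sed -n '2p' "$tmpdir/example.csv.bak")
echo "$edited"
echo "$backup"
# A,0
# Gsk3b,0
```

This does not save any disk space, since the backup is a full copy, so it only helps with the "keep the original" concern.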


bash equivalent

(Assuming you have already created "gene_to_human.sed".)

$ curl -s -o example.csv https://raw.githubusercontent.com/dbrookeUAB/shared_files/master/example.csv

$ sed -f gene_to_human.sed example.csv > example2.csv

$ head -n 3 example.csv
gene,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10
Gsk3b,0,1.33447078154433,0,0,0,0,0,0,0.998240198583345,0
Fgfr1,0,0,0,0,0,0,0,0,0,0

$ head -n 3 example2.csv
gene,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10
A,0,1.33447078154433,0,0,0,0,0,0,0.998240198583345,0
B,0,0,0,0,0,0,0,0,0,0

Or if you need to split it up:

$ split -l 5 gene_to_human.sed  gene_to_human_split.sed.

$ ll gene_to_human*
-rw-r--r-- 1 r2 197121 144 Jun 12 16:44 gene_to_human.sed
-rw-r--r-- 1 r2 197121  72 Jun 12 17:04 gene_to_human_split.sed.aa
-rw-r--r-- 1 r2 197121  72 Jun 12 17:04 gene_to_human_split.sed.ab

$ cp example.csv example3.csv

$ for sedf in gene_to_human_split.sed.* ; do
    sed -i -f "${sedf}" example3.csv
  done

$ head -n 3 example3.csv
gene,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10
A,0,1.33447078154433,0,0,0,0,0,0,0.998240198583345,0
B,0,0,0,0,0,0,0,0,0,0

sed primer

The sed commands we are using have the form s/old/new/. On the command line:

$ echo "hello world" | sed -e 's/hell/h-e-hockeysticks/'
h-e-hockeystickso world

Multiple commands can be run in several ways, including:

$ echo "hello world, goodbye world" | sed -e 's/hell/quux/;s/good/bad/'
quuxo world, badbye world

$ echo "hello world, goodbye world" | sed -e 's/hell/quuz/' -e 's/good/bad/'
quuzo world, badbye world

These all replace only the first instance on each line; you can add g to make the replacement global:

$ echo "hello world, goodbye world" | sed -e 's/world/globe/'
hello globe, goodbye world

$ echo "hello world, goodbye world" | sed -e 's/world/globe/g'
hello globe, goodbye globe
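
A caveat about multiple commands, which also applies to a generated script like gene_to_human.sed: sed runs every command against each line in order, so a later rule can match what an earlier rule just wrote. If a translated name happens to equal another source name, the row gets translated twice. The sketch below uses made-up names (Aa, Bb, Cc) to show the problem and one possible guard, the t command, which jumps to the end of the script once a substitution has succeeded:

```shell
# Chained: Aa becomes Bb, and then the second rule turns that Bb into Cc.
chained=$(printf '%s\n' 'Aa,1' 'Bb,2' | sed -e 's/^Aa,/Bb,/' -e 's/^Bb,/Cc,/')
echo "$chained"
# Cc,1
# Cc,2

# Guarded: "t" after each rule stops processing the line after the first hit.
guarded=$(printf '%s\n' 'Aa,1' 'Bb,2' | sed -e 's/^Aa,/Bb,/' -e t -e 's/^Bb,/Cc,/' -e t)
echo "$guarded"
# Bb,1
# Cc,2
```

For mouse-to-human gene names such collisions may be rare, but the guard is cheap, and it would matter just as much when the script is split into several passes over the same file.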

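Finally, if there are many, many translations to do, I do not know whether sed will buckle under the weight of too many commands. If it does, this can certainly be done piecemeal. For this example I limit it to 5 at a time, but sed can handle many more.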

Sys.which("split")
#                               split 
# "c:\\Rtools40\\usr\\bin\\split.exe" 
file.copy("example.csv", "example3.csv")
# [1] FALSE
readLines("example3.csv", n = 3)
# [1] "gene,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10"
# [2] "Gsk3b,0,1.33447078154433,0,0,0,0,0,0,0.998240198583345,0"                   
# [3] "Fgfr1,0,0,0,0,0,0,0,0,0,0"                                                  
system2("split", c("-l", "5", "gene_to_human.sed", "gene_to_human_split.sed."))
sedfiles <- list.files(pattern = "gene.*\\.sed\\..*")
sedfiles
# [1] "gene_to_human_split.sed.aa" "gene_to_human_split.sed.ab"
for (sedfile in sedfiles) system2("sed", c("-i", "-f", sedfile, "example3.csv"))
readLines("example3.csv", n = 3)
# [1] "gene,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10"
# [2] "A,0,1.33447078154433,0,0,0,0,0,0,0.998240198583345,0"                       
# [3] "B,0,0,0,0,0,0,0,0,0,0"                                                      
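
If the number of sed commands ever does become a problem, a different technique avoids it entirely: a single awk pass that loads the translations into a lookup table and swaps the first field. This is a swapped-in alternative, not part of the sed approach above, and the two-column mapping file map.csv (mouse,human) plus the tiny inputs are hypothetical:

```shell
tmpdir=$(mktemp -d)
# Hypothetical mapping file (mouse,human) and a tiny stand-in for the real CSV.
printf '%s\n' 'Gsk3b,A' 'Fgfr1,B' > "$tmpdir/map.csv"
printf '%s\n' 'gene,cell_1,cell_2' 'Gsk3b,0,1.33' 'Fgfr1,0,0' 'Cd8a,0,0' > "$tmpdir/example.csv"

# First file: fill the lookup table. Second file: replace field 1 when it
# has a translation; the header and unmapped genes pass through unchanged.
awk -F, -v OFS=, '
  NR == FNR { map[$1] = $2; next }
  FNR > 1 && ($1 in map) { $1 = map[$1] }
  { print }
' "$tmpdir/map.csv" "$tmpdir/example.csv" > "$tmpdir/example_awk.csv"

cat "$tmpdir/example_awk.csv"
# gene,cell_1,cell_2
# A,0,1.33
# B,0,0
# Cd8a,0,0
```

Each row is looked up exactly once, so there is no chained rewriting and no need to split the translations, at the cost of holding the whole mapping in memory. Like the sed version, it assumes the gene column is unquoted.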
