首页 > 解决方案 > R data.table 按行条件从 .csv 或文本文件中读取特定单元格

问题描述

我正在开发一个小型闪亮应用程序,它可以读取和处理原始 ELISA 数据。数据格式如下:

1560200  Version 1.06                                                       
15/09/2021 16:22                                                    
Serial Number:  19103026                                                
Protocol Name:  euroimmune elisa                                                
Read:   Normal   00:05 Shake                                            
Blanks: None                                                
                                                    
450     1   2   3   4   5   6   7   8   9   10  11  12
    A   0.612   0.081   0.638   0   0   0   0   0   0   0   0   0
    B   1.886   0.082   0.502   0   0   0   0   0   0   0   0   0
    C   0.064   1.297   0.072   0   0   0   0   0   0   0   0   0
    D   2.343   0.571   0.069   0   0   0   0   0   0   0   0   0
    E   0.082   0.289   0.621   0   0   0   0   0   0   0   0   0
    F   0.265   3.648   0.224   0   0   0   0   0   0   0   0   0
    G   1.063   0.114   2.411   0   0   0   0   0   0   0   0   0
    H   0.942   0.685   0.09    0   0   0   0   0   0   0   0   0
                                                    
630     1   2   3   4   5   6   7   8   9   10  11  12
    A   0.037   0.035   0.034   0   0   0   0   0   0   0   0   0
    B   0.037   0.035   0.036   0   0   0   0   0   0   0   0   0
    C   0.041   0.039   0.035   0   0   0   0   0   0   0   0   0
    D   0.037   0.033   0.037   0   0   0   0   0   0   0   0   0
    E   0.036   0.033   0.037   0   0   0   0   0   0   0   0   0
    F   0.035   0.041   0.037   0   0   0   0   0   0   0   0   0
    G   0.036   0.034   0.041   0   0   0   0   0   0   0   0   0
    H   0.037   0.035   0.036   0   0   0   0   0   0   0   0   0
                                                    
Delta       1   2   3   4   5   6   7   8   9   10  11  12
    A   0.575   0.046   0.604   0.575   0   0   0   0   0   0   0   0
    B   1.849   0.047   0.466   1.849   0   0   0   0   0   0   0   0
    C   0.023   1.258   0.037   0.023   0   0   0   0   0   0   0   0
    D   2.306   0.538   0.032   2.306   0   0   0   0   0   0   0   0
    E   0.046   0.256   0.584   0.046   0   0   0   0   0   0   0   0
    F   0.23    3.607   0.187   0   0   0   0   0   0   0   0   0
    G   1.027   0.08    2.37    0   0   0   0   0   0   0   0   0
    H   0.905   0.65    0.054   0   0   0   0   0   0   0   0   0


值得注意的是,行标题没有列名,其余列标题仅编号为 1 到 12。如果单元格有结果,则其中会有一个数值 - 否则空白单元格的值为 0。

我有一些代码使用我通常使用的 ELISA 阅读器的输出准确指定从哪些单元格读取数据(通过单元格引用):

# Import raw ELISA data from .csv:
elisa_raw <- fread(file = params$resultats, 
                     stringsAsFactors = FALSE, # read in strings as characters
                     skip = 27, # number of rows to skip
                     select = c(2:14), # which columns to select
                     header = TRUE, # use the first row of the selected data as column names
                     dec = ".", # specify the decimal separator (comma or period)
                     colClasses = list(character = 2, numeric = 3:14)) # predefine column types

这很好用,但是我想升级代码以处理更一般的情况 - 例如,如果我的 ELISA 阅读器上的软件更新,跳过的行数可能会改变。如果我希望我的应用程序可用于拥有不同型号 ELISA 阅读器的人,那么肯定会有不同数量的行或列要跳过。

在所有这些场景中保持不变的一件事是数据的结构:它保持如上,有 8 行和 13 列(第一个是行标题),其中行命名为 A - H,列命名1 - 12. 包含行标题的列通常没有名称。

我想看看是否有一种方法可以有条件地选择要使用此结构导入的数据单元格。我尝试grep -E如下 fread 并在这篇文章中描述:

elisa_raw <- fread(cmd = "grep -E 'A|B|C|D|E|F|G|H' TF_210915_IPB.txt")

我认为这将导入行的第一个值为 A 或 B 或 C 等的所有行。但是这给了我一个错误:

> elisa_raw <- fread(cmd = "grep -E 'A|B|C|D|E|F|G|H' TF_210915_IPB.txt")
'grep' is not recognized as an internal or external command,
operable program or batch file.
Warning messages:
1: In (if (.Platform$OS.type == "unix") system else shell)(paste0("(",  :
  '(grep -E 'A|B|C|D|E|F|G|H' TF_210915_IPB.txt) > C:\Users\Username\AppData\Local\Temp\Rtmpeki6Rv\file4010600782e' execution failed with error code 255
2: In fread(cmd = "grep -E 'A|B|C|D|E|F|G|H' TF_210915_IPB.txt") :
  File 'C:\Users\Username\AppData\Local\Temp\Rtmpeki6Rv\file4010600782e' has size 0. Returning a NULL data.table.

与原始帖子不同,我无法指定列名,因为行标题没有。我正在使用 Windows 10 的 PC;我知道grep -E在这种情况下这是正确的命令,但现在不太确定给出错误消息。

我怎样才能使这项工作或有另一种方法来解决这个问题?

编辑:

为了演示我正在导入的文件类型及其变化方式,我在下面提供了两个不同的原始 ELISA 文件示例的 dput。它们在没有任何规范的情况下被导入到 r 中,以保留它们的结构,如下所示:

# Example file 1: .csv from a Biotek reader
biotek <- fread(file = "Biotek_21091501.csv", stringsAsFactors = FALSE, na.strings = c(""))

# Example file 2: .txt from a ThermoFisher Scientific reader
tfs <- fread(file = "TF_21091501.txt", stringsAsFactors = FALSE, na.strings = c(""))

这些示例文件可以通过使用下面的 dput 在 R 中重新创建对象然后分别导出为 .csv 和 .txt 来重新创建(没有列标题或行名)。

请注意,这两个文件之一实际上具有三个符合我标准的数据结构,而我只需要第三个(名为 delta)。目前,尽管我不介意任何可以带回所有三个的解决方案,因为我可以在之后对它们进行子集化。

# Example data from Biotek ELISA reader 1:
> dput(biotek)
structure(list(V1 = c("1560200  Version 1.06", "15/09/2021 16:22", 
"Serial Number:", "Protocol Name:", "Read:", "Blanks:", "", "450", 
"", "", "", "", "", "", "", "", "", "630", "", "", "", "", "", 
"", "", "", "", "Delta", "", "", "", "", "", "", "", ""), V2 = c("", 
"", "19103026", "euroimmune elisa", "Normal", "None", "", "", 
"A", "B", "C", "D", "E", "F", "G", "H", "", "", "A", "B", "C", 
"D", "E", "F", "G", "H", "", "", "A", "B", "C", "D", "E", "F", 
"G", "H"), V3 = c("", "", "", "", "00:05 Shake", "", "", "1", 
"0.612", "1.886", "0.064", "2.343", "0.082", "0.265", "1.063", 
"0.942", "", "1", "0.037", "0.037", "0.041", "0.037", "0.036", 
"0.035", "0.036", "0.037", "", "1", "0.575", "1.849", "0.023", 
"2.306", "0.046", "0.23", "1.027", "0.905"), V4 = c(NA, NA, NA, 
NA, NA, NA, NA, 2, 0.081, 0.082, 1.297, 0.571, 0.289, 3.648, 
0.114, 0.685, NA, 2, 0.035, 0.035, 0.039, 0.033, 0.033, 0.041, 
0.034, 0.035, NA, 2, 0.046, 0.047, 1.258, 0.538, 0.256, 3.607, 
0.08, 0.65), V5 = c(NA, NA, NA, NA, NA, NA, NA, 3, 0.638, 0.502, 
0.072, 0.069, 0.621, 0.224, 2.411, 0.09, NA, 3, 0.034, 0.036, 
0.035, 0.037, 0.037, 0.037, 0.041, 0.036, NA, 3, 0.604, 0.466, 
0.037, 0.032, 0.584, 0.187, 2.37, 0.054), V6 = c(NA, NA, NA, 
NA, NA, NA, NA, 4, 0, 0, 0, 0, 0, 0, 0, 0, NA, 4, 0, 0, 0, 0, 
0, 0, 0, 0, NA, 4, 0.575, 1.849, 0.023, 2.306, 0.046, 0, 0, 0
), V7 = c(NA, NA, NA, NA, NA, NA, NA, 5L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, NA, 5L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, 5L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), V8 = c(NA, NA, NA, NA, NA, NA, NA, 
6L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, 6L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, NA, 6L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), V9 = c(NA, 
NA, NA, NA, NA, NA, NA, 7L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, 
7L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, 7L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L), V10 = c(NA, NA, NA, NA, NA, NA, NA, 8L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, NA, 8L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
NA, 8L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), V11 = c(NA, NA, NA, 
NA, NA, NA, NA, 9L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, 9L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, 9L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L), V12 = c(NA, NA, NA, NA, NA, NA, NA, 10L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, NA, 10L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, 
10L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), V13 = c(NA, NA, NA, NA, 
NA, NA, NA, 11L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, 11L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, 11L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), V14 = c(NA, NA, NA, NA, NA, NA, NA, 12L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, NA, 12L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
NA, 12L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, 
-36L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000002a5ddbf1ef0>)

# Example data from ThermoFisher Scientific ELISA reader 2:
> dput(tfs)
structure(list(V1 = c("Protocole:", "Format de plaque:", "", 
"Absorbances     Filtre 1: 450nm", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "A", "B", "C", "D", "E", "F", "G", "H"), V2 = c("Demo3 15.09.2021 17:11:54", 
"96 puits", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "1", "0.456", 
"1.337", "0.065", "1.577", "0.123", "0.33", "1.094", "0.291"), 
    V3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 
    0.134, 0.116, 1.112, 0.29, 0.209, 2.915, 0.353, 0.719), V4 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3, 0.513, 0.342, 
    0.109, 0.112, 0.562, 0.238, 2.224, 0.15), V5 = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L), V6 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, 5L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), V7 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L), V8 = c(NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, 7L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), V9 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L), V10 = c(NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, 9L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    V11 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 10L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), V12 = c(NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, 11L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L), V13 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, 12L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, 
-36L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000002a5ddbf1ef0>)

标签: rimportdata.tableconditional-statements

解决方案


推荐阅读