Remove NA strings from table (characters) in R

Problem description

How can I remove NA strings in a simple data frame like the one below, which consists of a single column, in R?

head(test)
Column1 
[1] "Gene1 Gene2 Gene3 NA NA NA NA" 
[2] "Gene41 NAGene218 GeneX NA"
[3] "Gene19 GeneNA NA NA NA NA NA"

Some gene names start or end with 'NA', so to avoid removing those, the gsub regex has to specify where the NA sits in the string. I tried something like: test2 <- gsub('^ NA$', "", test$Column1), with ^ indicating that ' NA' has to be at the start and $ that it has to be at the end of the string. I am sure it's something simple, but I don't understand what I am doing wrong (I am not very familiar with these regex symbols).

[UPDATE] - Desired output

head(test2)
Column1 
[1] "Gene1 Gene2 Gene3" 
[2] "Gene41 NAGene218 GeneX"
[3] "Gene19 GeneNA"

Tags: r, regex, gsub

Solution


You can use

test$Column1 <- gsub("^NA(?:\\s+NA)*\\b\\s*|\\s*\\bNA(?:\\s+NA)*$", "", test$Column1)


Details

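  • ^NA(?:\s+NA)*\b\s* - alternative 1:
    • ^ - start of the string
    • NA - the literal string NA
    • (?:\s+NA)* - zero or more repetitions of one or more whitespace characters followed by NA
    • \b - a word boundary (so that no match occurs in NAGene)
    • \s* - zero or more whitespace characters
  • | - or
  • \s*\bNA(?:\s+NA)*$ - alternative 2:
    • \s* - zero or more whitespace characters
    • \b - a word boundary (so that no match occurs in GeneNA)
    • NA - the literal string NA
    • (?:\s+NA)* - zero or more repetitions of one or more whitespace characters followed by NA
    • $ - end of the string.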
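
A minimal sketch of the pattern applied to the question's sample rows may help when checking it. The data frame is rebuilt by hand from the printout in the question, and the names na_pattern, attempt, cleaned and expected are introduced here for illustration only. The sketch also passes perl = TRUE, which is an addition of this example so that the pattern is read by the PCRE engine; the call in the answer relies on R's default engine.

```r
# Sample data rebuilt by hand from the question's printout
# (the names `test` and `Column1` follow the question).
test <- data.frame(
  Column1 = c(
    "Gene1 Gene2 Gene3 NA NA NA NA",
    "Gene41 NAGene218 GeneX NA",
    "Gene19 GeneNA NA NA NA NA NA"
  ),
  stringsAsFactors = FALSE
)

# Pattern from the answer: a run of whole-word NA tokens anchored at the
# start of the string, or a run anchored at the end of the string.
na_pattern <- "^NA(?:\\s+NA)*\\b\\s*|\\s*\\bNA(?:\\s+NA)*$"

# The question's attempt: ^ and $ together require the whole string to be
# exactly " NA", so none of the sample rows match and nothing is removed.
attempt <- gsub("^ NA$", "", test$Column1)

# perl = TRUE is an assumption of this sketch (PCRE engine), not part of
# the answer's original call.
cleaned <- gsub(na_pattern, "", test$Column1, perl = TRUE)

# Desired output as listed in the question.
expected <- c(
  "Gene1 Gene2 Gene3",
  "Gene41 NAGene218 GeneX",
  "Gene19 GeneNA"
)

stopifnot(identical(attempt, test$Column1))  # the attempt changes nothing
stopifnot(identical(cleaned, expected))      # the pattern gives the desired rows
print(cleaned)
```

The word boundary is what protects NAGene218 and GeneNA: there is no boundary between NA and an adjacent letter, so those tokens can never be matched as part of a leading or trailing run.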
