Replacing a Unicode escape in data frame values

Problem description

I am trying to replace the Unicode escape "U+00F3" in a data frame using sapply, but nothing happens. The column that contains the escape is of type chr.

Here is the call:

tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U+00F3>", replacement= "o")

Edit:

Thanks to Cath's answer below, I added \\ before the +:

tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U\\+00F3>", replacement= "o")

But it did not work.

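I also tried to provide a sample of my dataset, but the problem is that the replacement works on the sample and not on my real data: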

tableExcel <- data.frame("Team" = c("A", "B", "C", "Reducci<U+00F3>n"), "Point" = c(2, 30, 40, 30))
tableExcel$Team <- as.character(tableExcel$Team)   

For more information, this is how I read my Excel file:

tableExcel <- as.data.frame(read_excel("Dataset LOS.xls", sheet = "Liga Squads"))

The structure of my data:

structure(list(Team = c("CHURN", "CHURN", "RESIDENCIAL NPTB", "RESIDENCIAL NPTB", "AUDIENCIAS TV", "AUDIENCIAS TV"), Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA), `2019-S02` = c(0, 0, 10, 10, NA, NA), `2019-S03` = c(93, 88, 46, 19, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13, 13), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5), `2019-S06` = c(NA, NA, 66, 48, 55, 39.5), `2019-S07` = c(131, 112, 103, 63, 40.5, 38)), row.names = c(1L, 2L, 4L, 5L, 7L, 8L), class = "data.frame")

Tags: r, unicode, shiny

Solution


I could not reproduce the problem with gsub. The following works as expected:

tableExcel$Team <- gsub("<U\\+00F3>", "o", tableExcel$Team)

#### OUTPUT ####

              Team  Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1 Reducci<U+00F1>n P. Asig        0        0       93       56       NA       NA    131.0
2            CHURN P. entr        0        0       88       48       NA       NA    112.0
4 Reducci<U+00F2>n P. Asig       50       10       46        0     80.5     66.0    103.0
5 RESIDENCIAL NPTB P. entr        0       10       19        0     49.5     48.0     63.0
7    AUDIENCIAS TV P. Asig       NA       NA       NA       13     42.0     55.0     40.5
8             <NA> P. entr       NA       NA       NA       13     28.5     39.5     38.0
9        Reduccion P. entr       NA       NA       NA       NA       NA       NA       NA
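The unescaped pattern in the question finds nothing because `+` is a regular-expression quantifier: `<U+00F3>` asks for one or more `U` followed by `00F3`, not for a literal plus sign. As an alternative to escaping it, here is a minimal sketch that uses `fixed = TRUE` to treat the pattern as literal text. The vector `teams` is a hypothetical sample, not the asker's data. Since `gsub` is vectorised, `sapply` is not needed.

```r
# Hypothetical sample vector; here the escape is literal text in the string.
teams <- c("A", "Reducci<U+00F3>n", NA)

# Unescaped regex: "+" acts as a quantifier, so nothing is replaced.
unescaped <- gsub("<U+00F3>", "o", teams)

# fixed = TRUE matches the pattern literally, so no escaping is needed.
literal <- gsub("<U+00F3>", "o", teams, fixed = TRUE)

# TRUE where the literal text "<U+00F3>" is actually present in the data.
has_escape <- grepl("<U+00F3>", teams, fixed = TRUE)

print(unescaped)
print(literal)
print(has_escape)
```

If `has_escape` is `FALSE` for a value that still prints as `<U+00F3>`, the string may hold the real character and the escape may only be how the console displays it. In that case neither pattern would match, which could explain why the replacement works on a hand-typed sample but not on data read from a file.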

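However, replacing characters one by one with regular expressions may not be the most efficient way to convert Unicode escapes, because it takes a separate gsub call for each character. Instead, you may want to try stri_unescape_unicode() from the stringi package: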

# install.packages("stringi") # Use if not yet installed.
library(stringi)

# Match exactly four hex digits so each <U+XXXX> escape is rewritten on its own.
tableExcel$Team <- stri_unescape_unicode(gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", tableExcel$Team))

#### OUTPUT ####

              Team  Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1        Reducciñn P. Asig        0        0       93       56       NA       NA    131.0
2            CHURN P. entr        0        0       88       48       NA       NA    112.0
4        Reducciòn P. Asig       50       10       46        0     80.5     66.0    103.0
5 RESIDENCIAL NPTB P. entr        0       10       19        0     49.5     48.0     63.0
7    AUDIENCIAS TV P. Asig       NA       NA       NA       13     42.0     55.0     40.5
8             <NA> P. entr       NA       NA       NA       13     28.5     39.5     38.0
9        Reducción P. entr       NA       NA       NA       NA       NA       NA       NA

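The <U+0000> format is first converted to \\u0000 with gsub and then unescaped. As you can see, this handles several different Unicode characters in a single pass, which makes things much simpler.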
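The two steps can also be tried on a plain character vector. The sketch below assumes the stringi package is installed and that every escape has exactly four hex digits. The vector `x` is a hypothetical sample with two escapes in one string. Restricting the capture group to hex digits keeps each match from running past its own closing `>`.

```r
library(stringi)

# Hypothetical sample: one string with two escapes, one without any, one NA.
x <- c("Reducci<U+00F3>n Espa<U+00F1>a", "CHURN", NA)

# Step 1: rewrite each <U+XXXX> as \uXXXX (exactly four hex digits assumed).
escaped <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x)

# Step 2: convert the \uXXXX escapes into the actual characters.
result <- stri_unescape_unicode(escaped)

print(escaped)
print(result)
```

Strings without an escape pass through unchanged, and `NA` stays `NA`, so the same two calls can be applied to a whole column.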

Data:

tableExcel <- structure(list(Team = c("Reducci<U+00F1>n", "CHURN", "Reducci<U+00F2>n", 
"RESIDENCIAL NPTB", "AUDIENCIAS TV", NA, "Reducci<U+00F3>n"), 
    Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig", 
    "P. entr", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA, 
    NA), `2019-S02` = c(0, 0, 10, 10, NA, NA, NA), `2019-S03` = c(93, 
    88, 46, 19, NA, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13, 
    13, NA), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5, NA), 
    `2019-S06` = c(NA, NA, 66, 48, 55, 39.5, NA), `2019-S07` = c(131, 
    112, 103, 63, 40.5, 38, NA)), row.names = c(1L, 2L, 4L, 5L, 
7L, 8L, 9L), class = "data.frame")
