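r - Replacing a Unicode marker in data frame values
Question
I am trying to replace the Unicode marker "U+00F3" in a data frame using sapply, but nothing happens. The column that contains the marker is of type chr.
Here is the call: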
tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U+00F3>", replacement= "o")
Edit:
Thanks to Cath's answer below, I added \\ before the +:
tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U\\+00F3>", replacement= "o")
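But it did not work.
I also tried to provide a sample of my dataset, but the problem is that the replacement works on the sample and not on my real data: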
tableExcel <- data.frame("Team" = c("A", "B", "C", "Reducci<U+00F3>n"), "Point" = c(2, 30, 40, 30))
tableExcel$Team <- as.character(tableExcel$Team)
For more information, this is how I read my Excel file:
tableExcel <- as.data.frame(read_excel("Dataset LOS.xls", sheet = "Liga Squads"))
The structure of my data:
structure(list(Team = c("CHURN", "CHURN", "RESIDENCIAL NPTB", "RESIDENCIAL NPTB",
"AUDIENCIAS TV", "AUDIENCIAS TV"), Points = c("P. Asig", "P. entr", "P. Asig",
"P. entr", "P. Asig", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA),
`2019-S02` = c(0, 0, 10, 10, NA, NA), `2019-S03` = c(93, 88, 46, 19, NA, NA),
`2019-S04` = c(56, 48, 0, 0, 13, 13), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5),
`2019-S06` = c(NA, NA, 66, 48, 55, 39.5), `2019-S07` = c(131, 112, 103, 63, 40.5, 38)),
row.names = c(1L, 2L, 4L, 5L, 7L, 8L), class = "data.frame")
Answer
I could not reproduce the problem with gsub. The following works as expected:
tableExcel$Team <- gsub("<U\\+00F3>", "o", tableExcel$Team)
#### OUTPUT ####
              Team  Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1 Reducci<U+00F1>n P. Asig        0        0       93       56       NA       NA    131.0
2            CHURN P. entr        0        0       88       48       NA       NA    112.0
4 Reducci<U+00F2>n P. Asig       50       10       46        0     80.5     66.0    103.0
5 RESIDENCIAL NPTB P. entr        0       10       19        0     49.5     48.0     63.0
7    AUDIENCIAS TV P. Asig       NA       NA       NA       13     42.0     55.0     40.5
8             <NA> P. entr       NA       NA       NA       13     28.5     39.5     38.0
9        Reduccion P. entr       NA       NA       NA       NA       NA       NA       NA
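As a side note, the first pattern in the question cannot match because + is a regex quantifier: "<U+00F3>" looks for "<", one or more "U", and then "00F3>" directly, so the literal plus sign is never matched. Here is a minimal sketch of an alternative to escaping it, using a toy vector (an assumption, standing in for tableExcel$Team). It relies on gsub being vectorised, so sapply is not needed, and on fixed = TRUE, which treats the pattern as a literal string.

```r
# Toy vector standing in for tableExcel$Team (illustrative values only).
team <- c("CHURN", "Reducci<U+00F3>n", NA)

# Unescaped regex: "+" is read as a quantifier, so nothing is replaced.
unchanged <- gsub("<U+00F3>", "o", team)

# fixed = TRUE matches the pattern literally; gsub() is vectorised over its
# third argument, so the whole column is handled in one call.
cleaned <- gsub("<U+00F3>", "o", team, fixed = TRUE)

print(unchanged)
print(cleaned)
```

Missing values pass through as NA in both calls, so no special handling is needed for them.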
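However, a regex replacement is probably not the most efficient way to convert Unicode characters, because every distinct character needs its own call to gsub. Instead, you may want to try stri_unescape_unicode() from the stringi package: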
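# install.packages("stringi") # Run once if stringi is not installed yet.
library(stringi)
# Rewrite each "<U+XXXX>" marker as the escape "\uXXXX", then let stringi decode it.
# Matching exactly four hex digits (rather than ".*") keeps each match to one marker.
tableExcel$Team <- stri_unescape_unicode(gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", tableExcel$Team))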
#### OUTPUT ####
              Team  Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1        Reducciñn P. Asig        0        0       93       56       NA       NA    131.0
2            CHURN P. entr        0        0       88       48       NA       NA    112.0
4        Reducciòn P. Asig       50       10       46        0     80.5     66.0    103.0
5 RESIDENCIAL NPTB P. entr        0       10       19        0     49.5     48.0     63.0
7    AUDIENCIAS TV P. Asig       NA       NA       NA       13     42.0     55.0     40.5
8             <NA> P. entr       NA       NA       NA       13     28.5     39.5     38.0
9        Reducción P. entr       NA       NA       NA       NA       NA       NA       NA
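Each marker of the form <U+0000> is first rewritten by gsub as the escape sequence \u0000 (written "\\u0000" inside an R string literal) and is then unescaped. As you can see, this handles several different Unicode characters (ñ, ò and ó here) in one pass, which makes things much simpler.
Data: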
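If installing stringi is not an option, the same idea can be sketched in base R. This is not the approach above but a swapped-in alternative: it collects the distinct markers that are present and replaces each one with the character built by intToUtf8(). The helper name decode_u_markers is hypothetical, and the sketch assumes markers carry four to six hex digits.

```r
# Hypothetical helper (not part of the original answer): decode "<U+XXXX>"
# markers with base R only.
decode_u_markers <- function(x) {
  pattern <- "<U\\+[0-9A-Fa-f]{4,6}>"
  present <- x[!is.na(x)]
  # Distinct markers found anywhere in the non-missing values.
  markers <- unique(unlist(regmatches(present, gregexpr(pattern, present))))
  for (marker in markers) {
    hex <- substr(marker, 4L, nchar(marker) - 1L)  # "<U+00F3>" -> "00F3"
    # One literal replacement per distinct marker; NA values stay NA.
    x <- gsub(marker, intToUtf8(strtoi(hex, 16L)), x, fixed = TRUE)
  }
  x
}

decoded <- decode_u_markers(c("Reducci<U+00F1>n", "CHURN", NA, "Reducci<U+00F3>n"))
print(decoded)
```

This still calls gsub once per distinct marker, but the loop finds the markers itself, so no character has to be listed by hand.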
tableExcel <- structure(list(Team = c("Reducci<U+00F1>n", "CHURN", "Reducci<U+00F2>n",
"RESIDENCIAL NPTB", "AUDIENCIAS TV", NA, "Reducci<U+00F3>n"),
Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig",
"P. entr", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA,
NA), `2019-S02` = c(0, 0, 10, 10, NA, NA, NA), `2019-S03` = c(93,
88, 46, 19, NA, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13,
13, NA), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5, NA),
`2019-S06` = c(NA, NA, 66, 48, 55, 39.5, NA), `2019-S07` = c(131,
112, 103, 63, 40.5, 38, NA)), row.names = c(1L, 2L, 4L, 5L,
7L, 8L, 9L), class = "data.frame")
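One possible reason the same call changes nothing on data read from Excel (an assumption, nothing in this thread confirms it) is that the column holds the real character ó and R only displays it as <U+00F3> when printing, which can happen in a non-UTF-8 locale. In that case the literal text "<U+00F3>" is not in the string at all. A minimal sketch of how to tell the two cases apart, using two illustrative strings:

```r
real   <- "Reducci\u00f3n"     # contains the actual character U+00F3
marker <- "Reducci<U+00F3>n"   # contains the eight literal characters "<U+00F3>"

# The lengths differ: 9 characters versus 16.
n_real   <- nchar(real)
n_marker <- nchar(marker)

# Only the second string contains the literal marker text.
has_marker <- grepl("<U+00F3>", c(real, marker), fixed = TRUE)

# For the real character, match the character itself instead of the marker.
fixed_real <- gsub("\u00f3", "o", real)

print(c(n_real, n_marker))
print(has_marker)
print(fixed_real)
```

Running nchar() on one affected value of tableExcel$Team should show which case applies.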