How do I read a CSV in R that uses the backtick as the string quote character and ¥ as the escape character?

Problem description

I have CSV data in which strings are enclosed in backticks (`) and the yen sign (¥) is the escape character.

Example:

[images of the example data]

I tried reading the raw file and replacing the yen sign with a backslash, but that did not work.

fl <- readLines("data.csv", encoding = "UTF-8")
fl2 <- gsub('¥', "\\", fl)
writeLines(fl2, "Edited_data.txt")
sms_data <- fread("Edited_data.txt", sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".", encoding = "UTF-8")

Expected data frame:

[image of the expected data frame]

Tags: r, csv, fread

Solution


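You can replace the escape sequence with any placeholder you like and change it back after the text has been read. I have reproduced your data here: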

yen <- c("Sentence,Value1,Value2", 
         "`ML Taper, Triology TM`,0,0", 
         "90481 3TBS/¥`10TRYS/1SR PAUL/JOE,0,0", 
         "`D/3,E/4`,0,0")
writeLines(yen, path.expand("~/yen.csv"))

Now the code:

library(data.table)

# Read data without specifying encoding to handle ANSI or UTF8 yens
fl <- readLines(path.expand("~/yen.csv"))

# The yen sign is 0xc2 0xa5 in UTF-8 and 0xa5 in ANSI (Latin-1); normalise to the single-byte form
utf8_yen <- rawToChar(as.raw(c(0xc2, 0xa5)))
ansi_yen <- rawToChar(as.raw(0xa5))
fl <- gsub(utf8_yen, ansi_yen, fl)

# Paste on our backtick to get the backtick escape
yen_tick <- paste0(ansi_yen, "`")

# Replace the escaped backtick with a placeholder, then remove all remaining yen symbols
fl2 <- gsub(yen_tick, "&backtick;", fl)
fl2 <- gsub(ansi_yen, "", fl2)

# Save our modified string and reload it as a dataframe
writeLines(fl2, path.expand("~/Edited_data.txt"))
sms_data <- fread(path.expand("~/Edited_data.txt"),
                  sep = ",", stringsAsFactors = FALSE, quote = "\`", dec = ".")

# Now we can unescape our backticks and we're done
sms_data$Sentence <- gsub("&backtick;", "`", sms_data$Sentence)

So now we have:

sms_data
#>                           Sentence Value1 Value2
#> 1:           ML Taper, Triology TM      0      0
#> 2: 90481 3TBS/`10TRYS/1SR PAUL/JOE      0      0
#> 3:                         D/3,E/4      0      0
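
The same placeholder idea can also be sketched in memory with base R only, which may be handy when writing an intermediate file is inconvenient. This is a minimal sketch, not part of the original answer. It assumes the input is already UTF-8, so the yen sign can be written as "\u00a5", and that the placeholder "&backtick;" never occurs in the real data. The sample lines are hypothetical copies of the data above, and only the escaped backtick (¥`) is handled.

```r
# Hypothetical sample lines; "\u00a5" is the yen sign written as a Unicode escape
lines <- c("Sentence,Value1,Value2",
           "`ML Taper, Triology TM`,0,0",
           "90481 3TBS/\u00a5`10TRYS/1SR PAUL/JOE,0,0",
           "`D/3,E/4`,0,0")

# Assumption: this placeholder does not appear anywhere in the real data
placeholder <- "&backtick;"

# fixed = TRUE treats the pattern as a literal string rather than a regular expression
masked <- gsub("\u00a5`", placeholder, lines, fixed = TRUE)

# Parse in memory; quote = "`" makes the backtick the only quoting character
df <- read.csv(text = masked, quote = "`", stringsAsFactors = FALSE)

# Restore the literal backticks once parsing is done
df$Sentence <- gsub(placeholder, "`", df$Sentence, fixed = TRUE)
```

Using fixed = TRUE avoids any regular-expression interpretation of the pattern. read.csv is used here instead of fread purely to keep the sketch free of package dependencies.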
