How do I remove special characters from file names inside a zip file?

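Problem description

I have a vector of file names from unzip(temp, list = TRUE)[ , "Name"], but the names contain special characters. I guessed the encoding was "windows-1252", because the names are in Portuguese (I'm sure of that) and guess_encoding() gave a confidence of 0.52 for "windows-1252". But when I try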

vector <- unzip(temp, list = TRUE)[ , "Name"]
fixed_names <- iconv(vector, to = "UTF-8", from = "windows-1252")

the characters are converted incorrectly. For example,

“Microdados_Educa\x87\xc6o_Superior_2019/anexos/ANEXO I - Dicion\xa0rio de Vari\xa0veis e Tabelas Auxiliares/C\xa2digo_do_Pa\xa1s_de_Nascimento_ou_Naturaliza\x87\xc6o.xlsx”

should become

“Microdados_Educação_Superior_2019/anexos/ANEXO I - Dicionário de Variáveis e Tabelas Auxiliares/Código_do_País_de_Nascimento_ou_Naturalização.xlsx”

but instead it becomes

“Microdados_Educa‡Æo_Superior_2019/anexos/ANEXO I - Dicion rio de Vari veis e Tabelas Auxiliares/C¢digo_do_Pa¡s_de_Nascimento_ou_Naturaliza‡Æo.xlsx”

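I tried to ignore this and extract everything anyway with unzip(tempfile, vector, junkpaths = TRUE), but all the files come out with wrong names, and "(invalid encoding)" is shown after the file extension. How can I extract them with the correct names?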

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C               LC_TIME=pt_BR.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] here_1.0.1  httr_1.4.2  dplyr_1.0.6 rvest_1.0.0

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13  xml2_1.3.2       magrittr_2.0.1   tidyselect_1.1.1 R6_2.5.0         rlang_0.4.11    
 [7] fansi_0.4.2      stringr_1.4.0    tools_4.1.0      utf8_1.2.1       DBI_1.1.1        selectr_0.4-2   
[13] ellipsis_0.3.2   rprojroot_2.0.2  assertthat_0.2.1 tibble_3.1.2     lifecycle_1.0.0  crayon_1.4.1    
[19] purrr_0.3.4      vctrs_0.3.8      curl_4.3.1       glue_1.4.2       stringi_1.6.2    compiler_4.1.0  
[25] pillar_1.6.1     generics_0.1.0   pkgconfig_2.0.3 

Edit: If I run the same code on Windows, I also get wrong file names, but if I unzip the file manually (right-click and extract), the names are correct.

Tags: r, encoding

Solution
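
The bytes in the example name do not look like windows-1252. They match code page 850 (the DOS code page for Western European languages), where 0x87 is ç, 0xC6 is ã, 0xA0 is á, 0xA1 is í and 0xA2 is ó. The following is a minimal sketch, assuming the archive really stores its names in CP850 and that the local iconv build knows that encoding under the name "CP850". The vector raw_names is a hypothetical stand-in built from fragments of the example name, not the output of a real unzip() call.

```r
# Fragments of the example file name, written with the raw bytes that
# unzip(temp, list = TRUE)[ , "Name"] reported (hypothetical stand-in data).
raw_names <- c("Educa\x87\xc6o", "Dicion\xa0rio", "C\xa2digo", "Pa\xa1s")

# Assumption: the names are encoded in code page 850, not windows-1252.
fixed_names <- iconv(raw_names, from = "CP850", to = "UTF-8")

print(fixed_names)
```

If the assumption holds, the same call can be applied to the real listing, for example iconv(unzip(temp, list = TRUE)[ , "Name"], from = "CP850", to = "UTF-8"). This only repairs the listed names. Files that unzip() has already written to disk keep their original byte names and would still need to be renamed, which is not shown here.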

