Merge data and put zeros for abbreviations that are absent, in R

Question

How can I combine data in the following way? In this dataset

  forest=structure(list(ADR.N.14.0 = c(8140010250001, 8140010250002, 8140010250005
), Соста.C.254 = structure(c(3L, 1L, 2L), .Label = c("10WB", 
"6AS  4WB", "7AS  3WB"), class = "factor"), PLSVYD.N.16.6 = c(3, 
2, 36), PRBPOR.C.254 = structure(c(1L, 2L, 1L), .Label = c("AS", 
"WB"), class = "factor"), NOMYAR.N.16.6 = c(1, 1, 1), KOFPOR1.N.16.6 = c(7, 
10, 6), POR1.C.254 = structure(c(1L, 2L, 1L), .Label = c("AS", 
"WB"), class = "factor"), VOZPOR1.N.16.6 = c(80, 45, 50), VYSPOR1.N.16.6 = c(24, 
17, 19), DEMPOR1.N.16.6 = c(36, 16, 24), POLNOT1.N.16.6 = c(1, 
0.9, 0.8), ZAPZAH1.N.16.6 = c(210, 160, 170), NOMYAR2.N.16.6 = c(1, 
1, 1), KOFSAST2.N.16.6 = c(3, 0, 4), POR2.C.254 = structure(c(2L, 
1L, 2L), .Label = c("AS", "WB"), class = "factor"), VOZPOR2.N.16.6 = c(70, 
45, 40), VYSPOR2.N.16.6 = c(22, 17, 16), DEMPOR2.N.16.6 = c(26, 
22, 16), POLNOT2.N.16.6 = c(0, 0, 0), ZAPZAH2.N.16.6 = c(0, 0, 
0), NOMYAR3.N.16.6 = c(1, 0, 0), KOFSAST3.N.16.6 = c(0, 0, 0), 
    POR3.C.254 = structure(c(2L, 1L, 1L), .Label = c("", "Д"), class = "factor"), 
    VOZPOR3.N.16.6 = c(140, 0, 0), VYSPOR3.N.16.6 = c(20, 0, 
    0), DEMPOR3.N.16.6 = c(40, 0, 0), POLNOT3.N.16.6 = c(0, 0, 
    0), ZAPZAH3.N.16.6 = c(0, 0, 0), NOMYAR4.N.16.6 = c(1, 0, 
    0), KOFSAST4.N.16.6 = c(0, 0, 0), POR4.C.254 = structure(c(2L, 
    1L, 1L), .Label = c("", "ЛИП"), class = "factor"), VOZPOR4.N.16.6 = c(130, 
    0, 0), VYSPOR4.N.16.6 = c(20, 0, 0), DEMPOR4.N.16.6 = c(36, 
    0, 0), POLNOT4.N.16.6 = c(0, 0, 0), ZAPZAH4.N.16.6 = c(0, 
    0, 0), KOFSAST5.N.16.6 = c(0L, NA, NA), POR5.C.255 = structure(c(2L, 
    1L, 1L), .Label = c("", "oak"), class = "factor"), VOZPOR5.N.16.6 = c(0L, 
    NA, NA), VYSPOR5.N.16.6 = c(0L, NA, NA), DEMPOR5.N.16.6 = c(0L, 
    NA, NA), POLNOT5.N.16.6 = c(0L, NA, NA), ZAPZAH5.N.16.6 = c(0L, 
    NA, NA)), class = "data.frame", row.names = c(NA, -3L))

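some variables, for example Соста,C,254 and PRBPOR,C,254, contain abbreviations such as AS and WB.

Here is the tree dictionary, which holds the meaning of these abbreviations: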

tree_dict=structure(list(AS = structure(1L, .Label = "WB", class = "factor"), 
    aspen = structure(1L, .Label = "warty birch", class = "factor")), class = "data.frame", row.names = c(NA, 
-1L))

But the list of abbreviations can be much longer. For example:

td1=structure(list(О = structure(1:2, .Label = c("H", "M"), class = "factor"), 
    Oak = structure(1:2, .Label = c("Hornbeam", "Maple"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

How can I, in each row of the forest data frame, have these variables

KOFPOR,N,16,6
POR,C,254
VOZPOR,N,16,6
VYSPOR,N,16,6
DEMPOR,N,16,6
POLNOT,N,16,6
ZAPZAH,N,16,6

filled with zero for every abbreviation that is absent from that row but present in tree_dict?

The new variables should get the next number (in this example the numeric index runs from 1 to 4), so for oak they would be

KOFPOR5,N,16,6
POR5,C,254
VOZPOR5,N,16,6
VYSPOR5,N,16,6
DEMPOR5,N,16,6
POLNOT5,N,16,6
ZAPZAH5,N,16,6

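The POR,C,254 variable should get the value oak, that is, oak is placed in POR5,C,254, and any abbreviation in any column where one appears should be replaced by its full name from tree_dict.

For example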

7AS  3WB
7 aspen, 3 warty birch

So the desired output for oak would be

output=structure(list(Соста.C.254 = structure(1L, .Label = "7Aspen  3warty birch", class = "factor"), 
    PLSVYD.N.16.6 = 3L, PRBPOR.C.254 = structure(1L, .Label = "Aspen", class = "factor"), 
    NOMYAR.N.16.6 = 1L, KOFPOR1.N.16.6 = 7L, POR1.C.254 = structure(1L, .Label = "Aspen", class = "factor"), 
    VOZPOR1.N.16.6 = 80L, VYSPOR1.N.16.6 = 24L, DEMPOR1.N.16.6 = 36L, 
    POLNOT1.N.16.6 = 1L, ZAPZAH1.N.16.6 = 210L, NOMYAR2.N.16.6 = 1L, 
    KOFSOCT2.N.16.6 = 3L, POR2.C.254 = structure(1L, .Label = "warty birch", class = "factor"), 
    VOZPOR2.N.16.6 = 70L, VYSPOR2.N.16.6 = 22L, DEMPOR2.N.16.6 = 26L, 
    POLNOT2.N.16.6 = 0L, ZAPZAH2.N.16.6 = 0L, NOMYAR3.N.16.6 = 1L, 
    KOFSOCT3.N.16.6 = 0L, POR3.C.254 = structure(1L, .Label = "elm", class = "factor"), 
    VOZPOR3.N.16.6 = 140L, VYSPOR3.N.16.6 = 20L, DEMPOR3.N.16.6 = 40L, 
    POLNOT3.N.16.6 = 0L, ZAPZAH3.N.16.6 = 0L, NOMYAR4.N.16.6 = 1L, 
    KOFSOCT4.N.16.6 = 0L, POR4.C.254 = structure(1L, .Label = "Linden", class = "factor"), 
    VOZPOR4.N.16.6 = 130L, VYSPOR4.N.16.6 = 20L, DEMPOR4.N.16.6 = 36L, 
    POLNOT4.N.16.6 = 0L, ZAPZAH4.N.16.6 = 0L, NOMYAR5.N.16.6 = 1L, 
    KOFSOCT5.N.16.6 = 0L, POR5.C.255 = structure(1L, .Label = "oak", class = "factor"), 
    VOZPOR5.N.16.6 = 0L, VYSPOR5.N.16.6 = 0L, DEMPOR5.N.16.6 = 0L, 
    POLNOT5.N.16.6 = 0L, ZAPZAH5.N.16.6 = 0L), class = "data.frame", row.names = c(NA, 
-1L))

and for maple it would be the sixth:

KOFPOR6,N,16,6
POR6,C,254
VOZPOR6,N,16,6
VYSPOR6,N,16,6
DEMPOR6,N,16,6
POLNOT6,N,16,6
ZAPZAH6,N,16,6

How can such a complicated merge be done?

Tags: r, dplyr, data.table

Answer


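I'm not sure I understand all of your post, especially the part about maple. Also, your tree_dict is only partial: it does not list the "elm" or "Linden" that appear in your example output. However, based on your data and that same output example, here is some code that should help at least to some extent: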

install.packages("data.table")
install.packages("hash")
TD  <- data.frame(tree_dict)

# Your tree_dict structure is not ideally conditioned. Names look like data
# that are part of the translation hash. So we must integrate them as row data
# not just name labels, and row-bind:

TD0 <- data.frame(list(AS="AS", aspen="aspen"))
TD  <- rbind(TD0, TD)

# Using hashes (giving up on table merges as your strings 
# may contain several translation tokens at a time)

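h   <- hash::hash(as.character(TD[[1]]), as.character(TD[[2]]))
forest <- data.table::as.data.table(forest)
# g replaces every abbreviation found in y with its full name.
# keys() is namespace-qualified because the hash package is never attached.
g <- function(y) { for (x in hash::keys(h)) y <- gsub(x, h[[x]], y, fixed = TRUE); y }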

# Now for the expected output, just apply g column-wise:

forest[, lapply(.SD, g)]

# Your structure `output` is the first line of the resulting table. The following
# lines should be OK when using the complete version of `tree_dict`, which
# is cut down in your post.
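
The code above only translates abbreviations; it does not add the zero-filled, next-numbered columns asked about in the question. Below is a minimal base R sketch of that step, under several assumptions: it works on a single row, `forest_row` is a cut-down toy stand-in for one row of `forest`, `dict` is a hypothetical named vector (abbreviation to full name, with a Latin "O" for oak), and `add_missing_species` is an illustrative helper, not a function from any package.

```r
# Toy stand-in for one row of `forest`, with two species slots (1 and 2).
forest_row <- data.frame(
  KOFPOR1.N.16.6 = 7,  POR1.C.254 = "AS", VOZPOR1.N.16.6 = 80,
  KOFSAST2.N.16.6 = 3, POR2.C.254 = "WB", VOZPOR2.N.16.6 = 70,
  stringsAsFactors = FALSE
)

# Hypothetical dictionary as a named vector: abbreviation -> full name.
dict <- c(AS = "aspen", WB = "warty birch", O = "oak")

# Hypothetical helper: for every dictionary abbreviation that does not
# occur in any PORn column of the row, append a new slot with the next
# free number, the full species name, and zeros in the numeric columns.
add_missing_species <- function(row, dict,
                                num_stems = c("VOZPOR", "VYSPOR", "DEMPOR",
                                              "POLNOT", "ZAPZAH")) {
  por_cols <- grep("^POR[0-9]+\\.", names(row), value = TRUE)
  present  <- unlist(lapply(row[por_cols], as.character), use.names = FALSE)
  # Highest slot number already in use.
  n <- max(as.integer(sub("^POR([0-9]+)\\..*$", "\\1", por_cols)))
  for (abbr in setdiff(names(dict), present)) {
    n <- n + 1L
    row[[sprintf("KOFPOR%d.N.16.6", n)]] <- 0
    row[[sprintf("POR%d.C.254", n)]]     <- dict[[abbr]]
    for (s in num_stems) row[[sprintf("%s%d.N.16.6", s, n)]] <- 0
  }
  row
}

res <- add_missing_species(forest_row, dict)
names(res)
```

In this toy row only oak is missing, so a single slot numbered 3 is appended. To process the whole table, the helper could be applied row by row and the results combined with something like `data.table::rbindlist(..., fill = TRUE)`, since different rows may gain different numbers of slots.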
