Match a column against ID strings and assign a new value in a new column

Question

I have this data:

USDfirms <- c("GOOG", "BABA", "0071.TW")
TWRfirms <- c("3231.TW")
JPYfirms <- c("7752.T")

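I am trying to use the grepl function to create a new column. If a ticker in df matches a firm in one of the three character vectors above, a value should be assigned: for example, a ticker matching 3231.TW gets TWRmatch, a ticker matching GOOG gets USDmatch, and so on.

The ticker values will not always match exactly. For example, the ticker 3231 is not identical to 3231.TW, which is why I want grepl to ignore the .TW when matching.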

df <- structure(list(symbol = c("3231.TW", "3231.TW", "3231.TW", "3231.TW", 
"7752.T", "7752.T", "7752.T", "7752.T", "GOOG", "GOOG", "GOOG", 
"GOOG", "BABA", "BABA", "BABA", "BABA"), ticker = c("3231", "3231", 
"3231", "3231", "7752", "7752", "7752", "7752", "GOOG", "GOOG", 
"GOOG", "GOOG", "BABA", "BABA", "BABA", "BABA"), country = c("TW", 
"TW", "TW", "TW", "T", "T", "T", "T", NA, NA, NA, NA, NA, NA, 
NA, NA), year = c(2017L, 2016L, 2015L, 2014L, 2018L, 2017L, 2016L, 
2015L, 2017L, 2016L, 2015L, 2014L, 2018L, 2017L, 2016L, 2015L
)), .Names = c("symbol", "ticker", "country", "year"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 123L, 124L, 125L, 126L, 127L, 128L, 
129L, 130L), class = "data.frame")

Edit:

This does not seem to work:

ifelse(grepl(USDfirms, df$ticker), "yes", "no")

I have also tried:

df$match <- ifelse(USDfirms %in% x$ticker, "yes", "no")

That just gives me "yes" for everything.

Tags: r

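Answer

This is not a perfect solution, but a brute-force approach is to use nested lapply/sapply. It is effectively a double loop: for each element of ticker we go over every element of firm_list (defined further down), check whether the ticker occurs in any of that element's symbols, and if it does, return that element's name.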

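df$firms <- unlist(lapply(df$ticker, function(x)
  # for each ticker x, loop over the groups in firm_list
  unlist(sapply(seq_along(firm_list), function(y) {
    # keep the group name if x occurs in any of its symbols
    if (any(grepl(x, unlist(firm_list[y]))))
      names(firm_list[y])
  }))))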

df

#     symbol ticker country year    firms
#1   3231.TW   3231      TW 2017 TWRfirms
#2   3231.TW   3231      TW 2016 TWRfirms
#3   3231.TW   3231      TW 2015 TWRfirms
#4   3231.TW   3231      TW 2014 TWRfirms
#5    7752.T   7752       T 2018 JPYfirms
#6    7752.T   7752       T 2017 JPYfirms
#7    7752.T   7752       T 2016 JPYfirms
#8    7752.T   7752       T 2015 JPYfirms
#123    GOOG   GOOG    <NA> 2017 USDfirms
#124    GOOG   GOOG    <NA> 2016 USDfirms
#125    GOOG   GOOG    <NA> 2015 USDfirms
#126    GOOG   GOOG    <NA> 2014 USDfirms
#127    BABA   BABA    <NA> 2018 USDfirms
#128    BABA   BABA    <NA> 2017 USDfirms
#129    BABA   BABA    <NA> 2016 USDfirms
#130    BABA   BABA    <NA> 2015 USDfirms

Here all the firms have been moved into a single named list so that they are easier to check.

firm_list <- list(USDfirms = c("GOOG", "BABA", "0071.TW"), 
                  TWRfirms = c("3231.TW"), 
                  JPYfirms = c("7752.T"))
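As a side note on the direction of the match: the ticker is used as the pattern and the firm symbol as the text, because "3231" is a substring of "3231.TW" and not the other way round. The attempt in the question passes the whole USDfirms vector as the pattern, and grepl uses only the first element of a multi-element pattern (with a warning). The following is only a minimal sketch of that idea; the `tickers` vector is a made-up sample and `fixed = TRUE` is my own addition so that the ticker is treated as literal text rather than as a regular expression.

```r
USDfirms <- c("GOOG", "BABA", "0071.TW")

# Hypothetical sample of tickers, for illustration only
tickers <- c("3231", "GOOG", "BABA")

# The ticker is a substring of the symbol, not the other way round
ticker_in_symbol <- grepl("3231", "3231.TW", fixed = TRUE)
symbol_in_ticker <- grepl("3231.TW", "3231", fixed = TRUE)

# One pattern at a time: does each ticker occur in any USD symbol?
usd_match <- vapply(
  tickers,
  function(tk) any(grepl(tk, USDfirms, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)

ifelse(usd_match, "yes", "no")
```

Looping over the tickers keeps each call to grepl down to a single pattern, which is what both approaches in this answer rely on.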

Alternatively, it is shorter and more convenient to build a lookup data frame and then match against it and extract the group name.

ref_df <- data.frame(firms = unlist(firm_list), 
           names = rep(names(firm_list), lengths(firm_list)))

df$firms <- ref_df$names[sapply(df$ticker, function(x) grep(x, ref_df$firms))]


df
#     symbol ticker country year    firms
#1   3231.TW   3231      TW 2017 TWRfirms
#2   3231.TW   3231      TW 2016 TWRfirms
#3   3231.TW   3231      TW 2015 TWRfirms
#4   3231.TW   3231      TW 2014 TWRfirms
#5    7752.T   7752       T 2018 JPYfirms
#6    7752.T   7752       T 2017 JPYfirms
#7    7752.T   7752       T 2016 JPYfirms
#8    7752.T   7752       T 2015 JPYfirms
#123    GOOG   GOOG    <NA> 2017 USDfirms
#124    GOOG   GOOG    <NA> 2016 USDfirms
#125    GOOG   GOOG    <NA> 2015 USDfirms
#126    GOOG   GOOG    <NA> 2014 USDfirms
#127    BABA   BABA    <NA> 2018 USDfirms
#128    BABA   BABA    <NA> 2017 USDfirms
#129    BABA   BABA    <NA> 2016 USDfirms
#130    BABA   BABA    <NA> 2015 USDfirms
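One caveat with the grep-based lookup: if a ticker matches no symbol, or more than one (for instance GOOG against a hypothetical GOOGL), sapply returns a list and the indexing step fails. A possible alternative, sketched here under the assumption that every symbol is the ticker plus an optional dot-and-letters suffix such as .TW or .T, is to strip that suffix and use exact matching with match(). The column name `base` and the ticker "9999" are made up for illustration.

```r
firm_list <- list(USDfirms = c("GOOG", "BABA", "0071.TW"),
                  TWRfirms = c("3231.TW"),
                  JPYfirms = c("7752.T"))

# Flatten to a two-column lookup: values = symbol, ind = group name
lookup <- stack(firm_list)

# Assumption: a symbol is the ticker plus an optional ".SUFFIX" of letters
lookup$base <- sub("\\.[A-Za-z]+$", "", lookup$values)

# "9999" is a made-up ticker with no matching firm
tickers <- c("3231", "7752", "GOOG", "BABA", "9999")

# Exact match on the stripped symbol; unmatched tickers become NA
firms <- as.character(lookup$ind)[match(tickers, lookup$base)]
firms
```

With the data from the question this would be used as df$firms <- as.character(lookup$ind)[match(df$ticker, lookup$base)]. Unmatched tickers get NA instead of raising an error, and a ticker can never match a longer symbol by accident.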
