How to use a for loop to test for matches between columns of different data frames, then save the results to a new data frame

Problem description

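I am trying to create a for loop in R that writes a value to a new data frame whenever the value of a column ("AreaName2") in one data frame (df2) matches the value of a column ("ISLAND") from a different data frame (df1). If there is no match on that first column of df2, I want it to move on to the second pair of columns in df2 and df1 (df2: "AreaName1" and df1: "ARCHIP"). Again, if there is a match, it should be written to the new data frame. If there is still no match, I want it to move on to the third pair of columns (df2: "Country" and df1: "COUNTRY"). If all of the df2 columns are blank, I want to skip that row. If one of the df2 columns contains some information but it does not match df1, I would like that to be flagged somehow, if possible.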

I have made an example of df1, df2 and the desired result:

ID <- c(1,2,3,4,5, 6)
COUNTRY <- c("country1", 'country2', 'country3','country4', 'country5', 'country6')
ARCHIP <- c('archipelago1', 'archipelago2', 'archipelgao3', 'archipelago4', 'archipelago5', 'archipelago6')
ISLAND <- c('someIsland1', 'someIsland2', 'someIsland3', 'someIsland4', 'someIsland5', 'someIsland6')
df1 <- data.frame(ID, COUNTRY, ARCHIP, ISLAND)


Sciname <- c("scientificName1", "scientificName2", "scientificName3", "scientificName4", "scientificName5", "scientificName6")
AreaName2 <- c("someIsland1", NA, "someIsland3", NA, NA, 'unrecognisableIsland')
AreaName1 <- c("archipelago1", "archipelago2", "archipelago3", NA, NA, 'archipelago6')
Country <- c("country1", "country2", "country3", 'country4', NA, 'country6')
df2 <- data.frame(Sciname, Country, AreaName1, AreaName2)


Species <- c("scientificName1","scientificName2", "scientificName3", "scientificName4", 'scientificName6')
Location <- c("someIsland1", "archipelago2", "someIsland3", 'country4', 'UNRECOGNISED')
results <- data.frame(Species, Location)

I was thinking I need to do something like this for each set of columns:

for (i in df2$AreaName2) {
results[[i]] <- if(df2$AreaName2 %in% df1$ISLAND)
}

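But I am not sure how to make it work for each set, or how to make it run through several columns. Perhaps I should create a separate for loop for each set of columns I want to match? Any ideas? Thanks!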

Tags: r, for-loop

Solution


# I like to use tidyverse :)
library(tidyverse)

# First, to create our datasets - (Thank you for providing sample data!)
# I've set this up in a slightly different way, in an attempt to keep our workspace clear.
# I've also used tibble in place of data.frame, to line up with the tidyverse approach.
df1 <- tibble(    ID = 1:6, 
                  COUNTRY = c("country1", 'country2', 'country3','country4', 'country5', 'country6'), 
                  ARCHIP = c('archipelago1', 'archipelago2', 'archipelgao3', 'archipelago4', 'archipelago5', 'archipelago6'), 
                  ISLAND = c('someIsland1', 'someIsland2', 'someIsland3', 'someIsland4', 'someIsland5', 'someIsland6'))

df2 <- tibble(    Sciname = c("scientificName1", "scientificName2", "scientificName3", "scientificName4", "scientificName5", "scientificName6"), 
                  Country = c("country1", "country2", "country3", 'country4', NA, 'country6'), 
                  AreaName1 = c("archipelago1", "archipelago2", "archipelago3", NA, NA, 'archipelago6'),
                  AreaName2 = c("someIsland1", NA, "someIsland3", NA, NA, 'unrecognisableIsland'))


# Rather than use a for loop, I'll use full_join to match the two tables, then filter for the conditions you're looking for. 

# Merge data
join_country <- full_join(df2, df1, by = c("Country" = "COUNTRY"))

# Identify scinames with matching island names
# I use _f to signify my goal here - filtering
island_f <- join_country %>%
  filter(AreaName2 == ISLAND) %>%
  # Keep only relevant columns
  select(Sciname, Location = AreaName2)

# Identify scinames with matching archip names
archip_f <- join_country %>%
  filter(
         # Exclude scinames we've identified with matching island names.
         !(Sciname %in% island_f$Sciname),
         AreaName1 == ARCHIP) %>%
  select(Sciname, Location = AreaName1)

# Identify scinames left over whose country was matched by full_join
country_f <- join_country %>%
  filter(
    # Exclude scinames we've identified with matching island or archip names.
    !(Sciname %in% island_f$Sciname),
    !(Sciname %in% archip_f$Sciname),
    # Drop df1-only rows (no Sciname) and df2 rows whose country found no match in df1 (no ID).
    !is.na(Sciname),
    !is.na(ID)) %>%
  select(Sciname, Location = Country)

sciname_location <- bind_rows(island_f, 
                              archip_f,
                              country_f) %>%
  arrange(Sciname)

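# Finally, to identify df2 records that are populated but don't match at all, we can use anti_join
# against the matched scinames, then drop rows where every location column is blank.
records_no_match <- df2 %>%
  anti_join(sciname_location, by = "Sciname") %>%
  filter(!(is.na(Country) & is.na(AreaName1) & is.na(AreaName2)))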
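
The join above only compares island and archipelago names within rows that share a country. As an alternative sketch, assuming a match against any row of df1 is acceptable, the cascade described in the question (island first, then archipelago, then country) can be written with `%in%` and `case_when`. The object name `matched` and the "UNRECOGNISED" label are illustrative choices, not part of the original answer, and the sample data is rebuilt here so the block runs on its own.

```r
library(dplyr)

# Sample data from the question (island names spelled as in the answer above).
df1 <- data.frame(
  ID      = 1:6,
  COUNTRY = paste0("country", 1:6),
  ARCHIP  = c("archipelago1", "archipelago2", "archipelgao3",
              "archipelago4", "archipelago5", "archipelago6"),
  ISLAND  = paste0("someIsland", 1:6),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Sciname   = paste0("scientificName", 1:6),
  Country   = c("country1", "country2", "country3", "country4", NA, "country6"),
  AreaName1 = c("archipelago1", "archipelago2", "archipelago3", NA, NA, "archipelago6"),
  AreaName2 = c("someIsland1", NA, "someIsland3", NA, NA, "unrecognisableIsland"),
  stringsAsFactors = FALSE
)

matched <- df2 %>%
  # Skip rows where all three location columns are blank.
  filter(!(is.na(Country) & is.na(AreaName1) & is.na(AreaName2))) %>%
  # case_when takes the first condition that is TRUE, which gives the
  # island -> archipelago -> country priority order.
  mutate(Location = case_when(
    AreaName2 %in% df1$ISLAND  ~ AreaName2,
    AreaName1 %in% df1$ARCHIP  ~ AreaName1,
    Country   %in% df1$COUNTRY ~ Country,
    TRUE                       ~ "UNRECOGNISED"  # populated, but nothing matched
  )) %>%
  select(Species = Sciname, Location)
```

Because `NA %in% x` is FALSE when `x` contains no NA, blank cells simply fall through to the next condition. Note that under this fall-through rule scientificName6 resolves to "archipelago6" rather than being flagged, since its archipelago does match even though its island does not.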

You can learn more about relational data in Chapter 13 of R for Data Science.

Please let me know if you have any questions!
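
For readers who would still prefer the explicit loop the question asks about, here is a minimal base R sketch of the same cascade. It assumes a value counts as matched if it appears anywhere in the paired df1 column; `column_pairs`, `species`, `location` and the "UNRECOGNISED" label are hypothetical names introduced for illustration.

```r
# Sample data from the question (island names spelled consistently).
df1 <- data.frame(
  ID      = 1:6,
  COUNTRY = paste0("country", 1:6),
  ARCHIP  = c("archipelago1", "archipelago2", "archipelgao3",
              "archipelago4", "archipelago5", "archipelago6"),
  ISLAND  = paste0("someIsland", 1:6),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Sciname   = paste0("scientificName", 1:6),
  Country   = c("country1", "country2", "country3", "country4", NA, "country6"),
  AreaName1 = c("archipelago1", "archipelago2", "archipelago3", NA, NA, "archipelago6"),
  AreaName2 = c("someIsland1", NA, "someIsland3", NA, NA, "unrecognisableIsland"),
  stringsAsFactors = FALSE
)

# Column pairs in priority order: names are df2 columns, values are df1 columns.
column_pairs <- c(AreaName2 = "ISLAND", AreaName1 = "ARCHIP", Country = "COUNTRY")

species  <- character(0)
location <- character(0)

for (i in seq_len(nrow(df2))) {
  # The three location values for this row, named by their df2 column.
  values <- vapply(names(column_pairs),
                   function(col) as.character(df2[[col]][i]),
                   character(1))

  # Skip rows where every location column is blank.
  if (all(is.na(values))) next

  # Try each pair in turn and stop at the first match.
  found <- "UNRECOGNISED"
  for (col in names(column_pairs)) {
    if (values[[col]] %in% df1[[column_pairs[[col]]]]) {
      found <- values[[col]]
      break
    }
  }

  species  <- c(species, df2$Sciname[i])
  location <- c(location, found)
}

results <- data.frame(Species = species, Location = location,
                      stringsAsFactors = FALSE)
```

Growing `species` and `location` inside the loop is fine for a handful of rows; for large tables a vectorised or join-based approach will usually be faster.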

