首页 > 解决方案 > 在另一个数据框中搜索一列名称,并结合其他列的数据获取结果

问题描述

我想创建一个基于两个不同数据框的数据框。

第一个具有期刊名称及其各自的影响因子。第二个数据框包含我要搜索的期刊的名称。

df1:

Full Journal Title  Journal Impact Factor

CA-A CANCER JOURNAL FOR CLINICIANS  223.679
Nature Reviews Materials    74.449
NEW ENGLAND JOURNAL OF MEDICINE 70.670
LANCET  59.102
NATURE REVIEWS DRUG DISCOVERY   57.618
CHEMICAL REVIEWS    54.301
Nature Energy   54.000
NATURE REVIEWS CANCER   51.848
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION    51.273
NATURE REVIEWS IMMUNOLOGY   44.019
NATURE REVIEWS GENETICS 43.704
NATURE REVIEWS MOLECULAR CELL BIOLOGY   43.351
NATURE  43.070
and continues...

str(df1)
data.frame':    12541 obs. of  2 variables:
 $ my.journal: Factor w/ 11879 levels "","2D Materials",..: 4155 1872 8866 8999 8033 8861 2143 8841 8856 5795 ...
 $ jcr       : Factor w/ 4732 levels "","0.000","0.006",..: 4731 2905 4614 4613 4337 4336 4335 4334 4333 4332 ...

df2:

     my.journal
1                           Bioscience journal
2            Summa phytopathologica (impresso)
3            Summa phytopathologica (impresso)
4            Summa phytopathologica (impresso)
5  Australian journal of crop science (online)
6            Summa phytopathologica (impresso)
7                       Summa phytopathologica
8      Pesquisa agropecuaria tropical (online)
9      Crop breeding and applied biotechnology
10             Genetics and molecular research
11                    Tropical plant pathology
12             Genetics and molecular research
13     Perspectivas online: biológicas e saúde
14            Científica (jaboticabal. online)
15     Journal of plant physiology & pathology
16                    Tropical plant pathology
17           Summa phytopathologica (impresso)

> str(df2)
'data.frame':   17 obs. of  1 variable:
 $ my.journal: Factor w/ 11 levels "Australian journal of crop science (online)",..: 2 10 10 10 1 10 9 8 4 5 ...

我想要另一个 df (df3),其中 df2 中的期刊在 df1 中搜索,如果匹配给我这样的东西(没有NA):

北美,我希望期刊影响因子与我的 df2 中的期刊相对应。

df3

journal                                     jcr   total
   <chr>                                       <fct> <int>
 1 Summa phytopathologica (impresso)           NA        5
 2 Genetics and molecular research             NA        2
 3 Tropical plant pathology                    NA        2
 4 Australian journal of crop science (online) NA        1
 5 Bioscience journal                          NA        1
 6 Científica (jaboticabal. online)            NA        1
 7 Crop breeding and applied biotechnology     NA        1
 8 Journal of plant physiology & pathology     NA        1
 9 Perspectivas online: biológicas e saúde     NA        1
10 Pesquisa agropecuaria tropical (online)     NA        1
11 Summa phytopathologica                      NA        1

我开始使用 R 几个月了,但我不知道如何开始解决这个问题。

这两个数据框位于链接df1 和 df2

标签: rdataframeleft-join

解决方案


更新

一种解决方案是使用joinwith dplyr

library(dplyr)

df1 <- read.table("df1.txt", skip = 1, header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table("df2.txt", header = TRUE, stringsAsFactors = FALSE)

df1 <- df1 %>%
  mutate(Full.Journal.Title = toupper(Full.Journal.Title))

df2 <- df2 %>%
  mutate(my.journal = toupper(my.journal))

df2 %>%
  left_join(df1, by = c("my.journal" = "Full.Journal.Title")) %>%
  group_by(my.journal, Journal.Impact.Factor) %>%
  summarize(total = n()) %>%
  arrange(desc(total))

my.journal                                  Journal.Impact.Factor total
   <chr>                                       <chr>                 <int>
 1 SUMMA PHYTOPATHOLOGICA (IMPRESSO)           NA                        5
 2 GENETICS AND MOLECULAR RESEARCH             NA                        2
 3 TROPICAL PLANT PATHOLOGY                    1.254                     2
 4 AUSTRALIAN JOURNAL OF CROP SCIENCE (ONLINE) NA                        1
 5 BIOSCIENCE JOURNAL                          0.375                     1
 6 CIENTíFICA (JABOTICABAL. ONLINE)            NA                        1
 7 CROP BREEDING AND APPLIED BIOTECHNOLOGY     1.026                     1
 8 JOURNAL OF PLANT PHYSIOLOGY & PATHOLOGY     NA                        1
 9 PERSPECTIVAS ONLINE: BIOLóGICAS E SAúDE   NA                        1
10 PESQUISA AGROPECUARIA TROPICAL (ONLINE)     NA                        1
11 SUMMA PHYTOPATHOLOGICA                      NA                        1

完成这项工作需要注意的几点:

  • 在标题中读取df1似乎占用了 2 行,因此跳过了第一行(因为这更接近您之前的示例)
  • read.table包括stringsAsFactors = FALSE如果您不希望它们作为因素
  • 有些期刊名称是大写的,有些是小写的。联接区分大小写,因此包括在内toupper以在联接之前使所有内容都大写(作为替代方案,如果您想保持原始数据帧不变,可以将其嵌入toupper内部)left_join

如果这是您的想法,请告诉我。


推荐阅读