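How do I extract certain strings from a column of a CSV file?

Problem description

The CSV file I was given has one column that is a bit messy. I want to collect all the departments/institutions (for example Department of Y1, College of X2, and College of X3) so that I can count how many unique institutions are in my file.

Sample data from the CSV file looks like this: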

Name           Affiliation
Anna           Faculty of X1, Department of Y1, Z1 University, City A, Country 1
Isabela        Institute of W1, College of X2, Department of Y2, University of Z2, Country 1
Wally          Institute of W2, College of X3, Department of Y2, University of Z2, Country 1

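This continues for thousands of rows.

My goal is to end up with something like a data frame that has a column containing these unique institutions. Is there a way to do this using RStudio (preferably) or Python?

Tags: python, r

Solution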


Here is an option using the stringr package in R.

library(stringr)

# Keywords that mark a comma-separated part as an institution
TargetPrefixes <- c("Faculty", "Department", "University", "College", "Institute")

# Split every affiliation on commas and flatten into one character vector
parts <- unlist(str_split(data$Affiliation, ","))
trim <- str_trim(parts, side = "both")

# Keep the parts that contain at least one of the keywords
matches <- sapply(trim, function(x) any(str_detect(x, TargetPrefixes)))
result <- unique(trim[matches])
result
[1] "Faculty of X1"    "Department of Y1" "Z1 University"    "Institute of W1"  "College of X2"    "Department of Y2" "University of Z2" "Institute of W2"  "College of X3"

Data

data <- structure(list(Name = c("Anna", "Isabela", "Wally"), Affiliation = c("Faculty of X1, Department of Y1, Z1 University, City A, Country 1", 
"Institute of W1, College of X2, Department of Y2, University of Z2, Country 1", 
"Institute of W2, College of X3, Department of Y2, University of Z2, Country 1"
)), class = "data.frame", row.names = c(NA, -3L))
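
Since the question also allows Python, the same split, trim, and filter idea can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the helper names `extract_units` and `count_units` are made up for illustration, the keyword list is an assumption based on the answer's list (with "University" spelled out), and the sample rows are inlined through `io.StringIO` in place of the real file.

```python
import csv
import io
from collections import Counter

# Assumed keyword list, mirroring the R answer; adjust to your data.
KEYWORDS = ("Faculty", "Department", "University", "College", "Institute")


def extract_units(affiliation, keywords=KEYWORDS):
    """Split one affiliation on commas and keep the parts containing a keyword."""
    parts = [part.strip() for part in affiliation.split(",")]
    return [part for part in parts if any(key in part for key in keywords)]


def count_units(rows, column="Affiliation"):
    """Count how many rows mention each unit (each unit counted once per row)."""
    counts = Counter()
    for row in rows:
        counts.update(set(extract_units(row[column])))
    return counts


# Inlined sample data; with a real file, pass csv.DictReader(open(path, newline="")).
sample = io.StringIO(
    "Name,Affiliation\n"
    'Anna,"Faculty of X1, Department of Y1, Z1 University, City A, Country 1"\n'
    'Isabela,"Institute of W1, College of X2, Department of Y2, University of Z2, Country 1"\n'
    'Wally,"Institute of W2, College of X3, Department of Y2, University of Z2, Country 1"\n'
)

counts = count_units(csv.DictReader(sample))
for unit, n_rows in sorted(counts.items()):
    print(unit, n_rows)
print(len(counts), "unique units")
```

The keys of `counts` are the unique institutions and the values are the number of rows that mention each one, so `len(counts)` gives the number of unique institutions. Matching is by substring, as in the R answer, so a part such as "Z1 University" is kept even though the keyword is not at the start.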
