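python - How do I extract certain strings from a column of a CSV file?
Question
The CSV file I was given has one rather messy column. I want to collect all the departments/institutions (for example Department of Y1, College of X2, and College of X3) so that I can count how many unique institutions appear in my file.
Sample data from the CSV file looks like this: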
Name Affiliation
Anna Faculty of X1, Department of Y1, Z1 University, City A, Country 1
Isabela Institute of W1, College of X2, Department of Y2, University of Z2, Country 1
Wally Institute of W2, College of X3, Department of Y2, University of Z2, Country 1
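This goes on for thousands of rows.
My goal is to end up with something like a data frame that has a column containing those unique institutions. Is there a way to do this with RStudio (preferably) or Python?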
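Answer
Here is one option using the stringr package in R. It splits each affiliation on commas, trims the pieces, and keeps the unique pieces that contain one of the target keywords.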
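library(stringr)

# Keywords that mark a department- or institution-level unit
TargetPrefixes <- c("Faculty", "Department", "University", "College", "Institute")

# Split each affiliation on commas, flatten the list, and trim whitespace
split <- str_split(data$Affiliation, ",")
split <- unlist(split)
trim <- str_trim(split, side = "both")

# Keep the pieces that contain at least one keyword, then de-duplicate
matches <- sapply(trim, function(x) any(str_detect(x, TargetPrefixes)))
result <- unique(trim[matches])
result
[1] "Faculty of X1" "Department of Y1" "Z1 University" "Institute of W1" "College of X2" "Department of Y2" "University of Z2" "Institute of W2" "College of X3"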
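Since the question also allows Python, the same split, trim, and filter steps can be sketched with the standard library alone. This is only an illustration under assumptions: the `affiliations` list stands in for the Affiliation column, and the `KEYWORDS` tuple and the `unique_institutions` helper are hypothetical names, not part of the original answer.

```python
# Hypothetical sample mirroring the question's Affiliation column.
affiliations = [
    "Faculty of X1, Department of Y1, Z1 University, City A, Country 1",
    "Institute of W1, College of X2, Department of Y2, University of Z2, Country 1",
    "Institute of W2, College of X3, Department of Y2, University of Z2, Country 1",
]

# Keywords assumed to mark a department- or institution-level unit.
KEYWORDS = ("Faculty", "Department", "University", "College", "Institute")


def unique_institutions(rows, keywords=KEYWORDS):
    """Return the distinct comma-separated pieces that contain a keyword,
    in order of first appearance."""
    seen = {}
    for row in rows:
        for piece in row.split(","):
            piece = piece.strip()
            if any(k in piece for k in keywords):
                seen.setdefault(piece, None)  # dicts preserve insertion order
    return list(seen)


institutions = unique_institutions(affiliations)
print(institutions)
print(len(institutions))  # number of unique institutions
```

Substring matching is used here rather than a strict prefix test so that names such as "Z1 University", where the keyword comes last, are also kept.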
Data
data <- structure(list(Name = c("Anna", "Isabela", "Wally"), Affiliation = c("Faculty of X1, Department of Y1, Z1 University, City A, Country 1",
"Institute of W1, College of X2, Department of Y2, University of Z2, Country 1",
"Institute of W2, College of X3, Department of Y2, University of Z2, Country 1"
)), class = "data.frame", row.names = c(NA, -3L))
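The question also mentions counting institutions across the whole file. A rough Python sketch of that step, reading rows with the standard `csv` module and tallying with `collections.Counter`, might look like the following. It assumes a comma-delimited file whose Affiliation field is quoted (because it contains commas); the `csv_text` string and the `KEYWORDS` tuple are stand-ins invented for this example.

```python
import csv
import io
from collections import Counter

# Stand-in for the real file; assumes a comma-delimited CSV whose
# Affiliation field is quoted because it contains commas.
csv_text = '''Name,Affiliation
Anna,"Faculty of X1, Department of Y1, Z1 University, City A, Country 1"
Isabela,"Institute of W1, College of X2, Department of Y2, University of Z2, Country 1"
Wally,"Institute of W2, College of X3, Department of Y2, University of Z2, Country 1"
'''

# Keywords assumed to mark a department- or institution-level unit.
KEYWORDS = ("Faculty", "Department", "University", "College", "Institute")

counts = Counter()
# For a file on disk, open(path, newline="") would replace io.StringIO(...).
for record in csv.DictReader(io.StringIO(csv_text)):
    for piece in record["Affiliation"].split(","):
        piece = piece.strip()
        if any(k in piece for k in KEYWORDS):
            counts[piece] += 1

# One line per institution, most frequent first.
for name, n in counts.most_common():
    print(f"{name}\t{n}")
```

The keys of `counts` are the unique institutions, and `len(counts)` gives how many there are; the values show how many rows mention each one.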