r - Extracting text between keywords in R
Question
The text I extracted from a PDF looks like this:
String<-"A recorded vote was taken. In favour: Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Yemen, Zambia, Zimbabwe Against: None Abstaining: Malawi, Palau The draft resolution"
I want to extract the countries that voted in favour, against, and abstained.
What I have done so far is
library(stringr)
in.fav<-str_locate_all(String, "In favour:")[[1]]
against<-str_locate_all(String, "Against:")[[1]]
abstain<-str_locate_all(String, "Abstaining:")[[1]]
to get the positions of the keywords "In favour:", "Against:" and "Abstaining:".
I then extracted the countries in favour and against using:
Favour<-str_trim(str_sub(String,start=in.fav[1,"end"]+1,end=against[1,"start"]-1))
Against<-str_trim(str_sub(String,start=against[1,"end"]+1,end=abstain[1,"start"]-1))
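But I am struggling to get the abstaining countries, Malawi and Palau, because no specific keyword marks the end of the list of abstaining countries.
I thought I could predefine a list of country names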
Names<-c("Vanuatu", "Venezuela (Bolivarian Republic of)", "Viet Nam", "Yemen", "Zambia", "Zimbabwe", "Malawi", "Palau")
and then look at
str_sub(String,start=abstain[1,"start"]+1)
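and extract up to the first word that is not contained in Names, but I did not succeed. Any help, or other ideas on how to efficiently get the lists of countries in favour, against, and abstaining, would be much appreciated. More specifically, I would like output like this: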
Results<-list()
Results$favour<-c("Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Yemen, Zambia, Zimbabwe")
Results$against<-c("None")
Results$abstain<-c("Malawi", "Palau")
Many thanks. Martin
Answer
Using stringr, you can try:
String<-"A recorded vote was taken. In favour: Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Yemen, Zambia, Zimbabwe Against: None Abstaining: Malawi, Palau The draft resolution"
# get those in favor
in.fav <- str_extract(String, "(?<=In favour: ).*(?=Against)")
# get those against
against <- str_extract(String, "(?<= Against: ).*(?=Abstaining)")
# get those abstaining
# assuming the end of the sentence is always (The bill <blank>)
abstaining <- str_extract(String, "(?<= Abstaining: ).*(?=The)")
You can then turn each result into a list:
str_split(in.fav, ", ") -> in.fav.list
str_split(against, ", ") -> against.list
str_split(abstaining, ", ") -> abstain.list
This yields (for the countries in favour):
[[1]]
[1] "Vanuatu" "Venezuela (Bolivarian Republic of)" "Viet Nam" "Yemen" "Zambia"
[6] "Zimbabwe "
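Note the trailing space in "Zimbabwe " above: the lookahead stops right before the next keyword, so the space in front of it is kept. A minimal sketch of one way to tidy this up and assemble the `Results` list from the question follows. The helper name `extract_between` is hypothetical (it is not part of stringr), and it assumes the keywords contain no regex metacharacters and that "The draft" really follows the abstaining countries, as in the sample string.

```r
library(stringr)

String <- "A recorded vote was taken. In favour: Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Yemen, Zambia, Zimbabwe Against: None Abstaining: Malawi, Palau The draft resolution"

# Hypothetical helper: return the comma-separated items found between
# the keyword `start` and the keyword `end`, with whitespace trimmed.
# Assumes neither keyword contains regex metacharacters.
extract_between <- function(text, start, end) {
  pattern <- paste0("(?<=", start, ").*?(?=", end, ")")
  found <- str_extract(text, pattern)
  str_trim(str_split(found, ",")[[1]])
}

Results <- list(
  favour  = extract_between(String, "In favour:", "Against:"),
  against = extract_between(String, "Against:", "Abstaining:"),
  # Assumption: the abstaining list is followed by "The draft"
  abstain = extract_between(String, "Abstaining:", "The draft")
)
```

Trimming after the split, rather than before, removes stray whitespace from every item and not only from the ends of the extracted string.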
Edit: since "The" is not always the first word after the countries in the abstaining category:
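# take everything after "Abstaining: " (the lookbehind alone would match an empty string)
abstaining <- str_extract(String, "(?<=Abstaining: ).*")
str_split(abstaining, ", ") -> abstain.list
# Create vector of countries
# (`countries` is a character vector of known country names that you supply,
# e.g. Names from the question)
country.list <- c(countries)
# Create new list for updated abstained countries
abstain.list.updated <- list()
k <- 1
for(piece in abstain.list[[1]]){
  # the last piece still carries trailing text (e.g. "Palau The draft resolution"),
  # so check whether the piece starts with a known country name
  hit <- country.list[startsWith(piece, country.list)]
  if(length(hit) > 0){
    hit[1] -> abstain.list.updated[[k]]
    k <- k + 1
  }
}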
The resulting list will be messier than the one above, but you can adjust the for() loop to produce nicer output.
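As a loop-free variant of the same idea, the sketch below looks up each known country name in the text after "Abstaining:" and returns the matches in the order they appear. It reuses `Names` from the question as the list of known countries, which is an assumption: a real list would need to cover every country that can occur. It also matches names anywhere in the trailing text and by plain substring, so a name contained in another name (for example "Niger" inside "Nigeria") would produce a false hit.

```r
library(stringr)

String <- "A recorded vote was taken. In favour: Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Yemen, Zambia, Zimbabwe Against: None Abstaining: Malawi, Palau The draft resolution"

# Known country names (taken from the question; assumed to be complete)
Names <- c("Vanuatu", "Venezuela (Bolivarian Republic of)", "Viet Nam",
           "Yemen", "Zambia", "Zimbabwe", "Malawi", "Palau")

# Everything after the "Abstaining: " keyword
tail_text <- str_extract(String, "(?<=Abstaining: ).*")

# Position of each name in the tail (-1 when absent); fixed = TRUE makes
# the parentheses in "Venezuela (...)" match literally
pos <- vapply(
  Names,
  function(nm) as.integer(regexpr(nm, tail_text, fixed = TRUE)),
  integer(1)
)

# Keep the names that were found, ordered by where they occur
abstain <- names(sort(pos[pos > 0]))
```

Matching against a fixed list avoids guessing which word ends the list of countries, at the cost of maintaining that list.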