Extracting text between keywords in R

Question

The text I extracted from a PDF looks like this:

String<-"A recorded vote was taken. In favour: Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Yemen, Zambia, Zimbabwe Against: None Abstaining: Malawi, Palau The draft resolution"

I want to extract the countries that voted in favour, voted against, and abstained.

What I have done so far is

library(stringr)
in.fav<-str_locate_all(String, "In favour:")[[1]]
against<-str_locate_all(String, "Against:")[[1]]
abstain<-str_locate_all(String, "Abstaining:")[[1]]

to get the positions of the keywords "In favour:", "Against:" and "Abstaining:".

Then I extracted the countries in favour and against using:

Favour<-str_trim(str_sub(String,start=in.fav[1,"end"]+1,end=against[1,"start"]-1))

Against<-str_trim(str_sub(String,start=against[1,"end"]+1,end=abstain[1,"start"]-1))

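But I am struggling to get the countries that abstained, Malawi and Palau, because there is no specific keyword marking the end of the list of abstaining countries.

I thought I could predefine a list of country names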

Names<-c("Vanuatu", "Venezuela (Bolivarian Republic of)", "Viet  Nam", "Yemen", "Zambia", "Zimbabwe", "Malawi", "Palau")

and then look at

str_sub(String,start=abstain[1,"end"]+1)

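and extract text up to the first word that is not contained in Names, but I did not succeed. Any help, or other ideas on how to efficiently get the lists of countries in favour, against and abstaining, would be much appreciated. More specifically, I would like output like this: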

Results<-list()

Results$favour<-c("Vanuatu", "Venezuela (Bolivarian Republic of)", "Viet Nam", "Yemen", "Zambia", "Zimbabwe")

Results$against<-c("None")

Results$abstain<-c("Malawi", "Palau")

Many thanks. Martin

Tags: r, regex, stringr

Answer


Using stringr you can try:

String<-"A recorded vote was taken. In favour: Vanuatu, Venezuela (Bolivarian Republic of), Viet  Nam, Yemen, Zambia, Zimbabwe Against: None Abstaining: Malawi, Palau The draft resolution"

# get those in favor
in.fav <- str_extract(String, "(?<=In favour: ).*(?=Against)") 
# get those against
against <- str_extract(String, "(?<= Against: ).*(?=Abstaining)")
# get those abstaining
# assuming the list is always followed by the word "The" (here: "The draft resolution")
abstaining <- str_extract(String, "(?<= Abstaining: ).*(?=The)")

Then you can turn each result into a list:

str_split(in.fav, ", ") -> in.fav.list
str_split(against, ", ") -> against.list
str_split(abstaining, ", ") -> abstain.list

This produces (for the countries in favour):

[[1]]
[1] "Vanuatu"                            "Venezuela (Bolivarian Republic of)" "Viet  Nam"                          "Yemen"                              "Zambia"                            
[6] "Zimbabwe "                         
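
The trailing space in "Zimbabwe " appears because the space before "Against" is part of the match. The following is a minimal sketch, not part of the original answer, that trims each piece and collects everything into the Results list asked for in the question. It assumes the three keywords always appear in this order and that the abstaining list is followed by "The " (which would break for a country name that itself contains "The "). The helper name extract_between is hypothetical.

```r
library(stringr)

String <- "A recorded vote was taken. In favour: Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Yemen, Zambia, Zimbabwe Against: None Abstaining: Malawi, Palau The draft resolution"

# Hypothetical helper: take the text between two keywords,
# split it on commas and trim the whitespace around each piece
extract_between <- function(x, start, end) {
  pattern <- paste0("(?<=", start, ").*?(?=", end, ")")
  str_trim(str_split(str_extract(x, pattern), ",")[[1]])
}

Results <- list(
  favour  = extract_between(String, "In favour:", "Against:"),
  against = extract_between(String, "Against:", "Abstaining:"),
  # Assumption: the abstaining list is followed by "The "
  abstain = extract_between(String, "Abstaining:", "The ")
)
```

The lazy quantifier `.*?` stops at the first occurrence of the closing keyword, which matters if the same keyword shows up again later in the text.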

Edit: since "The" is not always the first word after the countries in the abstaining category:

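# Take everything after "Abstaining: " up to the end of the string
abstaining <- str_extract(String, "(?<=Abstaining: ).*")
str_split(abstaining, ", ") -> abstain.list

# Create vector of countries
# (`countries` is a character vector of country names that you supply)
country.list <- c(countries)

# Create new list for updated abstained countries
abstain.list.updated <- list()
k <- 1

for(i in seq_along(abstain.list[[1]])){
 # Keep only the pieces that exactly match a known country name
 if(abstain.list[[1]][i] %in% country.list){
  abstain.list[[1]][i] -> abstain.list.updated[[k]]
  k <- k + 1
 }
}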

The resulting list will be messier than the one above, but you can adjust the for() loop to produce better output.
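
One limitation of splitting on ", " is that the last piece still carries whatever text follows the list (here "Palau The draft resolution"), so an exact %in% comparison skips it. A minimal sketch of one way around this, assuming "Abstaining:" is present in the text and a vector of known country names is available: repeatedly strip a known name from the front of the remaining text and stop at the first piece that does not start with one. The short country.names vector below is illustrative only, not a complete list.

```r
library(stringr)

String <- "A recorded vote was taken. In favour: Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Yemen, Zambia, Zimbabwe Against: None Abstaining: Malawi, Palau The draft resolution"

# Illustrative only: in practice supply the full list of country names
country.names <- c("Vanuatu", "Venezuela (Bolivarian Republic of)", "Viet Nam",
                   "Yemen", "Zambia", "Zimbabwe", "Malawi", "Palau")

# Text after "Abstaining:", with leading whitespace removed
# (assumes the keyword is present, otherwise this would be NA)
rest <- str_trim(str_extract(String, "(?<=Abstaining:).*"), side = "left")

abstain <- character(0)
repeat {
  # Known names that the remaining text starts with
  hits <- country.names[startsWith(rest, country.names)]
  if (length(hits) == 0) break
  # Prefer the longest name if several match
  hit <- hits[which.max(nchar(hits))]
  abstain <- c(abstain, hit)
  # Drop the matched name plus any following commas and whitespace
  rest <- str_replace(str_sub(rest, nchar(hit) + 1), "^[,\\s]*", "")
}
```

After the loop, abstain holds the matched names and rest holds the text that follows the list, so no closing keyword is needed.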

