r - How to extract certain items from unstructured text?

I have a very unstructured data frame (df) in R that includes a text column.

An example of df$text looks like this:

John Smith 3.8 GPA johnsmith@gmail.com, https://link.com

I am trying to extract the GPA from this field and save it to a new column called df$GPA, but I can't get it to work.

I have tried:

df$gpa <- sub('[0-9].[0-9] GPA',"\\1", df$text)

But this returns the entire block of text.

I am also trying to extract the URL, but I'm not sure how to do that either. Does anyone have any suggestions?

Tags: r

Here is a solution using str_extract from the stringr package with a positive lookahead, (?=GPA):

library(stringr)
# the whitespace sits inside the lookahead so the match has no trailing space
df$GPA <- str_extract(df$text, "\\d+\\.\\d+(?=\\sGPA)")

A sub solution with a backreference is:

df$GPA <- sub(".*(\\d+\\.\\d+).*", "\\1", df$text)

Result:

df
                                                      text GPA
1 John Smith 3.8 GPA johnsmith@gmail.com, https://link.com 3.8
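
One thing to watch with the sub approach: sub returns the input string unchanged when the pattern does not match, so a row without a GPA would end up holding the whole text. Below is a rough base R sketch that leaves such rows as NA and stores the GPA as a number. The second row (Jane Doe) is made-up sample data, and anchoring the pattern on the literal word GPA is my assumption about how the real data looks.

```r
# Hypothetical sample data: the second row has no GPA at all
df <- data.frame(
  text = c(
    "John Smith 3.8 GPA johnsmith@gmail.com, https://link.com",
    "Jane Doe janedoe@example.com"
  ),
  stringsAsFactors = FALSE
)

# Find the first decimal number that is followed by whitespace and "GPA"
m <- regexpr("\\d+\\.\\d+(?=\\s+GPA)", df$text, perl = TRUE)

# regexpr() returns -1 for rows where nothing matched
found <- m != -1

# Start with NA everywhere, then fill in only the rows that matched;
# regmatches() returns just the matched substrings, in row order
df$GPA <- NA_real_
df$GPA[found] <- as.numeric(regmatches(df$text, m))
```

Because the column is numeric, it can be used directly in something like mean(df$GPA, na.rm = TRUE).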

Data:

df <- data.frame(text = "John Smith 3.8 GPA johnsmith@gmail.com, https://link.com")
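
For the URL part of the question, something along these lines might work. This is only a sketch in base R: it assumes every URL starts with http:// or https:// and runs until the next whitespace or comma, which will not cover every valid URL. The column name df$url is just an illustrative choice.

```r
df <- data.frame(
  text = "John Smith 3.8 GPA johnsmith@gmail.com, https://link.com",
  stringsAsFactors = FALSE
)

# Assumed URL shape: scheme, then any run of characters that are
# neither whitespace nor a comma
url_pattern <- "https?://[^\\s,]+"

# Locate the first URL in each row; -1 means no URL was found
m <- regexpr(url_pattern, df$text, perl = TRUE)
found <- m != -1

# Rows without a URL stay NA
df$url <- NA_character_
df$url[found] <- regmatches(df$text, m)
```

If stringr is already loaded, str_extract(df$text, "https?://[^\\s,]+") should give the same first match per row and also returns NA where there is none.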