Regex using lookarounds to scrape data isn't working

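Problem description

I'm trying to figure out why one of my regex commands works and the other doesn't. Here are two examples of the strings I'm extracting from. The newline junk left over from scraping is consistent, so I'm doing my best to use it to my advantage: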

"\n\tMenghe a'Nyam\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-5, 215lb (196cm, 97kg) \n  \n\n  \n\n  \n  \n  \n\n  School: Canisius\n\n\n\n\n\n  More player info\n\n\n\n\n\n"

"\n\tJordan Aaberg\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Guard\n\n\n\n  6-9, 225lb (206cm, 102kg) \n  \n\n  Hometown: Rothsay, MN\n\n\n\n  \n\n  High School: Rothsay\n\n\n\n  \n  \n  \n\n  School: North Dakota State\n\n\n\n\n\n  More player info\n\n\n\n\n\n"

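My goal is to extract the data I need from these strings, such as position (Forward and Guard, respectively) and, most importantly, height (6-5 and 6-9, respectively). I successfully pulled the position with: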

test <- df %>%
  mutate(position = str_extract(player, "(?<=Position:\n  \n  ).*?(?=\n\n\n\n  \\d-\\d)")) 

However, when I add another column for height using a similar lookaround approach, it returns NA:

test <- df %>%
  mutate(position = str_extract(player, "(?<=Position:\n  \n  ).*?(?=\n\n\n\n  \\d-\\d)")) %>%
  mutate(height = str_extract(player, "(?<=\\w+\n\n\n\n  ).*?(?=, \\d{3}lb)"))

In case it helps, here is a sample of the result of the call above for the first 3 rows of my df:

structure(list(player = c("\n\tMenghe a'Nyam\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-5, 215lb (196cm, 97kg) \n  \n\n  \n\n  \n  \n  \n\n  School: Canisius\n\n\n\n\n\n  More player info\n\n\n\n\n\n"  , 
"\n\tJordan Aaberg\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-9, 225lb (206cm, 102kg) \n  \n\n  Hometown: Rothsay, MN\n\n\n\n  \n\n  High School: Rothsay\n\n\n\n  \n  \n  \n\n  School: North Dakota State\n\n\n\n\n\n  More player info\n\n\n\n\n\n"  , 
"\n\tKarl Aaker\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-5, 210lb (196cm, 95kg) \n  \n\n  Hometown: Reno, NV\n\n\n\n  \n\n  \n  \n  \n\n  School: Portland\n\n\n\n\n\n  More player info\n\n\n\n\n\n"  
), position = c("Forward", "Forward", "Forward"), height = c(NA_character_, 
NA_character_, NA_character_)), row.names = c(NA, 3L), class = "data.frame")    

Tags: r, regex

Solution


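You can remove the + after \w, because the ICU regex engine (which stringr uses) does not support patterns of unbounded length inside a lookbehind, and use \s to match any whitespace: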

test <- df %>%
  mutate(position = str_extract(player, "(?<=Position:\n  \n  ).*?(?=\n\n\n\n  \\d-\\d)")) %>%
  mutate(height = str_extract(player, "(?<=\\w\n{4}\\s{2}).*?(?=,\\s+\\d{3}lb)"))
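
To check the corrected pattern in isolation, here is a minimal sketch that applies both extractions to a plain character vector instead of a data frame. The two strings are shortened copies of the samples from the question (an assumption made for brevity), and only stringr is needed:

```r
library(stringr)

# Shortened copies of the two sample strings from the question
# (trailing fields trimmed; the part around position and height is unchanged).
player <- c(
  "\n\tMenghe a'Nyam\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-5, 215lb (196cm, 97kg) \n  \n\n  School: Canisius\n",
  "\n\tJordan Aaberg\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Guard\n\n\n\n  6-9, 225lb (206cm, 102kg) \n  \n\n  Hometown: Rothsay, MN\n"
)

# Position: text between the "Position:" marker and the height line.
position <- str_extract(player, "(?<=Position:\n  \n  ).*?(?=\n\n\n\n  \\d-\\d)")

# Height: bounded lookbehind (a single \w, no +), then the lazy match,
# then a lookahead for the weight in pounds.
height <- str_extract(player, "(?<=\\w\n{4}\\s{2}).*?(?=,\\s+\\d{3}lb)")

print(position)  # "Forward" "Guard"
print(height)    # "6-5" "6-9"
```

The same two patterns can be dropped into the mutate() calls unchanged.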


Details

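  • (?<=\w\n{4}\s{2}) - immediately before the match, there must be one word character, then 4 newlines, then any 2 whitespace characters
  • .*? - any 0 or more characters other than line break characters, as few as possible
  • (?=,\s+\d{3}lb) - immediately after the match, there must be a comma, one or more whitespace characters, 3 digits, and the substring lb.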
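
If the lookbehind length limit gets in the way, one alternative (not part of the original answer, shown only as a sketch) is to match the surrounding context normally and pull out the height with a capture group via str_match(). Because the context is consumed rather than looked behind, \w+ is allowed here. The sample strings below are shortened copies of the ones in the question:

```r
library(stringr)

# Shortened sample strings (assumption: only the relevant part is kept).
player <- c(
  "Position:\n  \n  Forward\n\n\n\n  6-5, 215lb (196cm, 97kg) \n",
  "Position:\n  \n  Guard\n\n\n\n  6-9, 225lb (206cm, 102kg) \n"
)

# str_match() returns a character matrix: column 1 is the full match,
# column 2 is the first capture group (the height).
m <- str_match(player, "\\w+\n{4}\\s{2}(.*?),\\s+\\d{3}lb")
height <- m[, 2]

print(height)  # "6-5" "6-9"
```

The trade-off is an extra indexing step in exchange for no restriction on quantifiers in the leading context.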
