Regex help in R: splitting a string at each occurrence of digit(dot)digit

Question

I have the following sample data frame:

dat <- data.frame(date= c("Sep2020", "Oct2020", "Nov2020", "Dec2020"), 
                  txt= c("1.1 What is the Constitution?     1.2 The original charter, which replaced the Articles of Confederation      1.3 hat all States would be equal.  ", 
                         "4.4 What is the Bill of Rights?      4.5    The 9th and 10th amendments are general ",
                         "5.1  in criminal prosecution to a speedy and public    5.2  War, three amendments were ratified (1865  5.3   13. The most recent amendment, the 27th, was",
                         "6.2  the case of the proposed equal rights amendment, the Congress exten      6.3     but the proposed Amendment was never ratifie          6.4  tification deadline. The 38th State, Michig"))

I want to split the data frame so that a new row starts at each digit(dot)digit token. The final data frame should look like this:

dat2 <-data.frame(date= c("Sep2020", "Sep2020", "Sep2020", "Oct2020", "Oct2020", "Nov2020", "Nov2020", "Nov2020", "Dec2020", "Dec2020", "Dec2020"), 
                txt= c("1.1 What is the Constitution?","1.2 The original charter, which replaced the Articles of Confederation","1.3 hat all States would be equal.  ", 
                       "4.4 What is the Bill of Rights?",      "4.5    The 9th and 10th amendments are general ",
                       "5.1  in criminal prosecution to a speedy and public",    "5.2  War, three amendments were ratified (1865",  "5.3   13. The most recent amendment, the 27th, was",
                       "6.2  the case of the proposed equal rights amendment, the Congress exten", "6.3     but the proposed Amendment was never ratifie", "6.4  tification deadline. The 38th State, Michig"))

Here is what I have so far:

library(dplyr)
library(stringr)
library(tidyr)
dat <- dat %>% 
  mutate(parsed = str_extract_all(txt, "(\\d{1}\\.\\d{1,2})")) %>% 
  unnest(parsed) 

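I can get the numbers, but not the text between them. I'm a beginner with regular expressions and don't know how to say, for example, that I want everything between 1.1 and 1.2.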

Thanks!

Tags: r, regex

Answer


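We can use separate_rows from tidyr, splitting on whitespace that is immediately followed by a number(dot)number token. The lookahead (?=...) only checks that the number is there without consuming it, so each new row keeps its leading number: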

library(tidyr)
library(dplyr)
dat %>% 
    separate_rows(txt, sep = "\\s+(?=\\d+\\.\\d+)")
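To see what the separator pattern does on its own, here is a minimal sketch using base R's strsplit on a single string. The string x is a shortened, made-up variant of one cell of dat$txt, and strsplit with perl = TRUE is swapped in for separate_rows purely for illustration; it is not part of the answer above.

# Hypothetical single string shaped like one cell of dat$txt
x <- "5.1  in criminal prosecution    5.2  War, three amendments (1865  5.3   13. The most recent"

# Split on runs of whitespace, but only where digits, a dot, and digits follow.
# The lookahead (?=...) does not consume the number, so each piece keeps it.
# perl = TRUE is needed because base R's default regex engine has no lookahead.
pieces <- strsplit(x, "\\s+(?=\\d+\\.\\d+)", perl = TRUE)[[1]]

print(pieces)

Note that "13." is not treated as a split point, because the pattern requires at least one digit after the dot and "13." is followed by a space.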
