Is there a general way to remove substrings that start with a number and end with a capital letter in R?

Question

It's hard to describe, but basically I'm trying to find a general way to turn this:

    [1]" On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…" 
    [2]" Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online"

into this:

    [1] "95 E Kennedy Blvd"
    [2] "231 3rd St"

using R. I know it involves regular expressions, but I'm not as fluent in them as I'd like to be.

Thanks!

Tags: r, regex, gsub

Solution

Your expected output doesn't follow a very robust rule, but judging from your sample data, you can get what you are after with this regex:

^.*?(\d{2,}.*?[a-z])[A-Z].*

and replace the match with \1, since group 1 captures the text you want.

R code demo:

sub("^.*?(\\d{2,}.*?[a-z])[A-Z].*", "\\1", "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…")
sub("^.*?(\\d{2,}.*?[a-z])[A-Z].*", "\\1", "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online")

This prints what you expect:

[1] "95 E Kennedy Blvd"
[1] "231 3rd St"
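
Because sub() is vectorized over its input, the same pattern can be applied to a whole character vector at once instead of one string at a time. The following is a minimal sketch, assuming the scraped listings are stored in a vector named listings (a hypothetical name) and using shortened copies of the two sample strings:

```r
# Hypothetical vector name; the strings are shortened copies of the samples above.
listings <- c(
  "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants",
  "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi Bars"
)

# Group 1 starts at the first run of two or more digits and ends at the first
# lowercase letter that is immediately followed by an uppercase letter.
pattern <- "^.*?(\\d{2,}.*?[a-z])[A-Z].*"

# sub() is vectorized, so a single call processes every element.
addresses <- sub(pattern, "\\1", listings)
print(addresses)
```

Keep in mind that sub() returns an element unchanged when the pattern does not match it, so a listing without an address in this shape would pass through as-is.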

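Edit: OK, \d{2,} may be somewhat data-dependent, so here is a different approach. The capture now starts with one or more digits, \d+, which must be followed by one or more whitespace characters, \s+. Since the match should stop just before Lakewood, the regex also uses a positive lookahead, (?=Lakewood). This gives an updated and better regex: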

^.*?(\d+\s+.*?)(?=Lakewood).*
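
If this regex is used with base R's sub(), note that a lookahead such as (?=Lakewood) is a Perl-style construct, so the call needs perl = TRUE; the default regex engine does not support it. A minimal sketch, using a shortened copy of the first sample string (the variable names x and result are only illustrative):

```r
# Shortened copy of the first sample string; x is an illustrative name.
x <- "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants"

# perl = TRUE selects the PCRE engine, which understands the (?=...) lookahead.
result <- sub("^.*?(\\d+\\s+.*?)(?=Lakewood).*", "\\1", x, perl = TRUE)
print(result)
```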

Now, if you prefer, you can even use str_match to extract the text directly with the regex \d+\s+.*?(?=Lakewood), using the following lines of code:

library(stringr)

str_match("On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…", "\\d+\\s+.*?(?=Lakewood)")
str_match("Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online", "\\d+\\s+.*?(?=Lakewood)")

which prints:

     [,1]               
[1,] "95 E Kennedy Blvd"
     [,1]        
[1,] "231 3rd St"
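
Since the question asks for a general method, hard-coding Lakewood may be too specific for listings from other towns. One possible generalization, offered only as a sketch, swaps the city name for the boundary used by the first regex: stop at a lowercase letter that is immediately followed by an uppercase letter. This assumes every street address ends in a lowercase letter and runs straight into a capitalized word, which will not hold for all data (a street name with an internal capital, for example, would be cut short). The example uses base R's regexpr() and regmatches() with perl = TRUE, and the vector name listings is hypothetical:

```r
# Hypothetical vector name; the strings are shortened copies of the samples above.
listings <- c(
  "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants",
  "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi Bars"
)

# Digits, whitespace, then as little text as possible up to a lowercase letter
# that is directly followed by an uppercase letter (checked but not consumed).
pattern <- "\\d+\\s+.*?[a-z](?=[A-Z])"

# regexpr() locates the first match in each element; regmatches() extracts it.
m <- regexpr(pattern, listings, perl = TRUE)
addresses <- regmatches(listings, m)
print(addresses)
```

regmatches() drops elements that have no match, so the result can be shorter than the input vector.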
