Remove URLs or any recurring phrase from all rows of a data frame in R

Problem description

I have the following data frame named bbchealth:

head(bbchealth)
# A tibble: 6 x 1
  Tweets                                                    
  <chr>                                                     
1 Breast cancer risk test devised http://bbc.in/1CimpJF     
2 GP workload harming care - BMA poll http://bbc.in/1ChTBRv 
3 Short people's 'heart risk greater' http://bbc.in/1ChTANp 
4 New approach against HIV 'promising' http://bbc.in/1E6jAjt
5 Coalition 'undermined NHS' - doctors http://bbc.in/1CnLwK7
6 Review of case against NHS manager http://bbc.in/1Ffj6ci  

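As you can see, each row contains one tweet that ends with a URL. I want to remove only this URL, without affecting the rest of the data frame.

If I try something like rm_url, I get the following: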

[1] "c(\"Breast cancer risk test devised \"GP workload harming care - BMA poll \"Short people's 'heart risk greater' \"New approach against HIV 'promising' \"Coalition 'undermined NHS' - doctors \"Review of case against NHS manager \"\\\"VIDEO: 'All day is empty, what am I going to do?' \"VIDEO: 'Overhaul needed' for end-of-life care \"Care for dying 'needs overhaul' \"VIDEO: NHS: Labour and Tory key policies \"Have GP services got worse? \"A&amp;E waiting hits new worst level \"Parties row over GP opening hours \"Why strenuous runs may not be so bad after all \"VIDEO: Health surcharge for non-EU patients \"VIDEO: Skin cancer spike 'from 60s holidays' \"\.........

That is, a vector (?) made up of the tweet strings with the URLs removed.

The code I used was rm_url(bbchealth, replacement = "").

If I use gsub("http.*","",bbchealth), I get the following output:

[1] "c(\"Breast cancer risk test devised "

However, this is not what I want. I want to keep the columnar structure, that is:

# A tibble: 6 x 1
  Tweets                                                    
  <chr>                                                     
1 Breast cancer risk test devised  
2 GP workload harming care - BMA poll 
3 Short people's 'heart risk greater'  
4 New approach against HIV 'promising' 
5 Coalition 'undermined NHS' - doctors 
6 Review of case against NHS manager 

How can I do this?

Tags: r

Solution

Here you go, using stringi:

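# Small reproducible sample of the tweets
dt <- data.frame(
  Tweets = c(
    "Breast cancer risk test devised http://bbc.in/1CimpJF ",
    "GP workload harming care - BMA poll http://bbc.in/1ChTBRv",
    "Short people's 'heart risk greater' http://bbc.in/1ChTANp "
  )
)

library(stringi)

# Operate on the column (a character vector), not on the whole data frame.
# The pattern matches one whitespace character, then "http://", then
# everything up to the end of the string; the result goes into a new column.
dt$Tweets2 <- stringi::stri_replace_all_regex(dt$Tweets, "\\shttp://.*$", "")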
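
The same idea also works in base R, without extra packages. The calls in the question most likely failed because gsub() was given the whole data frame: it coerces its input with as.character(), which turns the Tweets column into a single deparsed "c(...)" string, and "http.*" then deletes everything after the first URL. Passing the column itself avoids that. The sketch below is only an illustration under a few assumptions: the third row and its example.com URL are made up for the demo, and "VIDEO: " stands in for "any recurring phrase" because it appears in the rm_url output above.

```r
# Toy stand-in for bbchealth. The first two rows come from the question;
# the third row's URL is a made-up placeholder (assumption, not real data).
bbchealth <- data.frame(
  Tweets = c(
    "Breast cancer risk test devised http://bbc.in/1CimpJF ",
    "GP workload harming care - BMA poll http://bbc.in/1ChTBRv",
    "VIDEO: Health surcharge for non-EU patients http://example.com/abc"
  ),
  stringsAsFactors = FALSE
)

# Remove anything that looks like an http/https URL, row by row.
# gsub() is vectorised over a character vector, so the column keeps its length.
bbchealth$Tweets <- gsub("https?://\\S+", "", bbchealth$Tweets, perl = TRUE)

# Remove a recurring literal phrase; fixed = TRUE means no regex escaping is needed.
bbchealth$Tweets <- gsub("VIDEO: ", "", bbchealth$Tweets, fixed = TRUE)

# Drop the leading/trailing whitespace left behind by the removals.
bbchealth$Tweets <- trimws(bbchealth$Tweets)

print(bbchealth)
```

Because only the Tweets column is reassigned, the object stays a data frame with one row per tweet. If a URL sits in the middle of a tweet rather than at the end, this sketch would leave a double space behind, which may or may not matter for your use.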
