r - Remove URLs or any recurring phrase from all rows of a data frame in R
Problem description
I have the following data frame named bbchealth:
head(bbchealth)
# A tibble: 6 x 1
Tweets
<chr>
1 Breast cancer risk test devised http://bbc.in/1CimpJF
2 GP workload harming care - BMA poll http://bbc.in/1ChTBRv
3 Short people's 'heart risk greater' http://bbc.in/1ChTANp
4 New approach against HIV 'promising' http://bbc.in/1E6jAjt
5 Coalition 'undermined NHS' - doctors http://bbc.in/1CnLwK7
6 Review of case against NHS manager http://bbc.in/1Ffj6ci
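As you can see, each row contains one tweet, and each tweet ends with a URL. I want to remove only that URL, without affecting the rest of the data frame.
If I try something like rm_url, I get the following: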
[1] "c(\"Breast cancer risk test devised \"GP workload harming care - BMA poll \"Short people's 'heart risk greater' \"New approach against HIV 'promising' \"Coalition 'undermined NHS' - doctors \"Review of case against NHS manager \"\\\"VIDEO: 'All day is empty, what am I going to do?' \"VIDEO: 'Overhaul needed' for end-of-life care \"Care for dying 'needs overhaul' \"VIDEO: NHS: Labour and Tory key policies \"Have GP services got worse? \"A&E waiting hits new worst level \"Parties row over GP opening hours \"Why strenuous runs may not be so bad after all \"VIDEO: Health surcharge for non-EU patients \"VIDEO: Skin cancer spike 'from 60s holidays' \"\.........
That is, a vector (?) made up of the tweet strings with the URLs removed.
The code I used was rm_url(bbchealth, replacement = "").
If I use gsub("http.*","",bbchealth), I get the following output:
[1] "c(\"Breast cancer risk test devised "
However, this is not what I want. I want to keep the columnar structure, that is:
# A tibble: 6 x 1
Tweets
<chr>
1 Breast cancer risk test devised
2 GP workload harming care - BMA poll
3 Short people's 'heart risk greater'
4 New approach against HIV 'promising'
5 Coalition 'undermined NHS' - doctors
6 Review of case against NHS manager
How can I do this?
Solution
Here you go, using the stringi package:
dt <- data.frame(
Tweets = c(
"Breast cancer risk test devised http://bbc.in/1CimpJF ",
"GP workload harming care - BMA poll http://bbc.in/1ChTBRv",
"Short people's 'heart risk greater' http://bbc.in/1ChTANp "
)
)
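library(stringi)
# Remove the whitespace before the URL and everything from "http://" to the end of the string.
# Only the Tweets column is passed in, so the data frame structure is left intact.
dt$Tweets2 <- stringi::stri_replace_all_regex(dt$Tweets, "\\shttp://.*$", "")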
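The attempts in the question most likely failed because the whole data frame was passed to the function: gsub() coerces its input with as.character(), which collapses each column of a data frame into a single deparsed string such as "c(\"...\", ...)". A minimal base R sketch of the same idea as the answer above, applied to the column instead, might look like this. The third row's URL is a made-up placeholder, and the assumption that "VIDEO: " is the recurring phrase to drop is for illustration only.

```r
# Sketch only: sample rows rebuilt here so the block runs on its own.
# The example.com URL is a placeholder, not taken from the original data.
bbchealth <- data.frame(
  Tweets = c(
    "Breast cancer risk test devised http://bbc.in/1CimpJF ",
    "GP workload harming care - BMA poll http://bbc.in/1ChTBRv",
    "VIDEO: NHS: Labour and Tory key policies http://example.com/placeholder"
  ),
  stringsAsFactors = FALSE
)

# Operate on the column (a character vector), not on the data frame itself.
# The pattern removes optional whitespace, then http:// or https://,
# then everything up to the end of the string.
bbchealth$Tweets <- gsub("\\s*https?://.*$", "", bbchealth$Tweets)

# A recurring literal phrase can be removed the same way; fixed = TRUE
# treats the pattern as plain text rather than a regular expression.
bbchealth$Tweets <- gsub("VIDEO: ", "", bbchealth$Tweets, fixed = TRUE)

print(bbchealth)
```

Assigning the result back to bbchealth$Tweets overwrites the column in place, so the object stays a one-column data frame; the same assignment should also work when bbchealth is a tibble.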