r - Remove non-English observations from a data frame
Question
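I have a data frame of Twitter data. I have cleaned the tweet text and added it as a vector, clean_text.
However, many of the observations are in languages other than English, and they interfere with my text analysis. How can I remove all non-English observations from the data frame?
Here is a reproducible sample of my data frame, BrexitTweets.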
structure(list(`Tweet ID` = c(746280472381107968, 746280472355929984,
746280472154603008, 746280472129342976, 746280472083332992, 746280472037170944,
746280471831645952, 746280471814888960, 746280471777185024, 746280471756180992,
746280471743565056, 746280471705844992, 746280471680658944, 746280471676488960,
746280471676455936, 746280471617757056, 746280471613570944, 746280471600992000,
746280471525469952, 746280471403847040), Time = c("24/06/2016 10:55:04",
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04",
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04",
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04",
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04",
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04",
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04",
"24/06/2016 10:55:04"), `Tweet Type` = c("Tweet", "Retweet",
"Retweet", "Retweet", "Retweet", "Retweet", "Tweet", "Retweet",
"Tweet", "Retweet", "Tweet", "Tweet", "Retweet", "Tweet", "Retweet",
"Retweet", "Retweet", "Tweet", "Retweet", "Retweet"), `Retweeted By` = c(NA,
"misyed_", "Skuys", "priyadarshibbc", "Amaranta_2012", "ECCA_Nordic",
NA, "Dat_Sync", NA, "SirDeGuz", NA, NA, "RoGreca_", NA, "30SecondsToMoon",
"StuartGray", "DataDebate", NA, "alek_dev", "addi_GrBj"), `Number of Retweets` = c(0,
251, 4, 14, 2, 39, 0, 6462, 0, 1391, 0, 0, 31595, 0, 27, 15,
35, 0, 6462, 20521), `Number of Followers` = c(6079, 434717,
16036, 345319, 4566, 3223810, 109145, 560, 78, 1957, 766, 1299,
2155087, 235, 1925, 735, 8045, 159, 560, 128027), `Number Following` = c(2314,
1994, 12403, 344855, 1012, 765, 333, 236, 132, 1407, 294, 1381,
1, 338, 725, 1601, 831, 969, 236, 1606), clean_text = c("mayagoodfellow as always making sense of it all for us ive never felt less welcome in this country brexit httpstcoiai5xa9ywv",
"never underestimate power of stupid people in a democracy brexit",
"gana el brexit reino unido decide abandonar la unión europea httpstco66cwudtsxu vía elmundoes",
"uk prime minister set to resign brexit httpstco0bxbdmiswm",
"oye junckereu que dice la ciudadanía de uk que tus tratados se los pasan por sus urnas brexit httpstcoedqfkl",
"a quick guide to brexit and beyond after britain votes to quit eu httpstcos1xkzrumvg httpstcocniutojkt0",
"this selfinflicted wound will be his legacy cameron falls on sword after brexit euref httpstcoegph3qonbj httpstcohbyhxodeda",
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o",
"this is a very good summary no biasspinagenda of the legal ramifications of the leave result brexit httpstcolobtyo48ng",
"you cant make this up cornwall votes out immediately pleads to keep eu cash this was never a rehearsal httpstco",
"brexit httpstconwutx2owcs", "brexit primer anàlisi de les conseqüencies en món de lesport httpstcon3bdrqz5cf via iusport unioesports",
"no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q",
"es ist nicht immer klug das volk entscheiden zu lassen brexit",
"gli studenti europei verranno considerati extraeuropei e rimarranno senza assistenza sanitaria assurdo brexit",
"i wouldnt mind so much but the result is based on a pack of lies and unaccountable promises democracy didnt win brexit pro",
"brexit einfach erklärt httpstcou7jhlhrpim", "brexit httpstcoiive3hsj26",
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o",
"absolutely brilliant poll on brexit by yougov httpstcoepevg1moaw"
)), .Names = c("Tweet ID", "Time", "Tweet Type", "Retweeted By",
"Number of Retweets", "Number of Followers", "Number Following",
"clean_text"), row.names = c(NA, 20L), class = c("tbl_df", "tbl",
"data.frame"))
Answer
Take a look at the textcat package.
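# install.packages("textcat")  # run once if the package is not installed
library(textcat)
library(dplyr)
# Guess the language of each cleaned tweet
BrexitTweets$Languages <- textcat(BrexitTweets$clean_text)
# Keep only the rows that textcat labelled as English
BrexitTweets <- BrexitTweets %>% filter(Languages == "english")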
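One caveat: textcat guesses the language from n-gram profiles, so very short tweets (for example, a lone hashtag plus a link) can end up with an unexpected label or with NA. A possible refinement, sketched below, is to restrict the profiles to the languages you expect to see. The candidate list and the small `tweets` data frame are assumptions made up for illustration, not part of the original answer; adjust them to your data.

```r
library(textcat)

# Hypothetical miniature of BrexitTweets; only the clean_text column matters here.
tweets <- data.frame(
  clean_text = c(
    "never underestimate power of stupid people in a democracy brexit",
    "gana el brexit reino unido decide abandonar la unión europea",
    "es ist nicht immer klug das volk entscheiden zu lassen brexit",
    "brexit"
  ),
  stringsAsFactors = FALSE
)

# Assumption: these are the only languages worth telling apart in this data.
candidates <- c("english", "spanish", "german", "italian", "catalan")

# Keep only the matching profiles from the byte-profile database shipped with textcat.
profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% candidates]

# Classify each text against the reduced set of profiles.
tweets$Languages <- textcat(tweets$clean_text, p = profiles)

# %in% returns FALSE for NA, so unclassified rows are dropped rather than kept as NA rows.
english_only <- tweets[tweets$Languages %in% "english", , drop = FALSE]
```

It is worth running table(tweets$Languages, useNA = "ifany") before filtering, to see which labels textcat actually produced for your tweets.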