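r - Text mining using the grep function
Problem description
I am having trouble scoring my data. The data set is shown below; text holds the tweets I want to run text mining and sentiment analysis on.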
| text | call | bills | location |
|---|---|---|---|
| -the bill was not generated | 0 | bill | 0 |
| -tried to raise the complaint | 0 | 0 | 0 |
| -the location update failed | 0 | 0 | location |
| -the call drop has increased in my location | call | 0 | location |
| -nobody in the location received bill,so call ASAP | call | bill | location |
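This is dummy data, where text is the column I am trying to mine. I use the grepl function in R to create the category columns (for example bill, call, location): if "bill" appears in a row, that row gets "bill" in the bill column, and likewise for every other category.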
vdftweet$app = ifelse(grepl('app',tolower(vdftweet$text)),'app',0)
table(vdftweet$app)
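The part I cannot figure out is this:
I want to create a new column, "category_name", in which each row gives the name of the category or categories it belongs to. If a tweet falls into more than 3 categories, it should be labelled "others"; otherwise it should give the category names.
Solution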
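There are a couple of ways to do this with the `tidyverse` package. In the first approach, `mutate` is used to add the category names as columns to the text data.frame, similar to what you already have. `gather` is then used to convert this to key-value format, where the categories are the values in the `category_name` column.
The alternative approach goes directly to the key-value format, where the categories are the values in the `category_name` column. Rows are repeated if they fall into multiple categories. If you do not need the first form, with categories as column names, the alternative is more flexible when adding new categories and requires less processing.
In both approaches, `str_match` holds the regular expression that matches a category against the text. The patterns here are simple, but more complex patterns can be used if needed.
The code is as follows: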
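As an illustration of what "more complex patterns" might look like, here is a minimal sketch. The sample tweets and the two patterns are invented for this example and are not part of the original data; it uses `str_extract`, a close relative of `str_match` that returns a plain character vector.

```r
library(stringr)

# Hypothetical tweets, chosen to show where a bare "bill"/"call" pattern over-matches
tweets <- c("the bills were not generated",
            "billing is fine, nobody called",
            "callback requested")

# \\b is a word boundary: "call" no longer matches inside "callback",
# and the optional suffixes let one pattern cover a few inflected forms.
bill_pattern <- regex("\\bbills?\\b", ignore_case = TRUE)
call_pattern <- regex("\\bcall(s|ed)?\\b", ignore_case = TRUE)

# NA means the tweet does not belong to that category
bill_hit <- str_extract(tweets, bill_pattern)
call_hit <- str_extract(tweets, call_pattern)
```

Whether "billing" or "callback" should count as a hit depends on the data, so the boundaries and suffixes above are a choice to adapt rather than a rule.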
library(tidyverse)
#
# read dummy data into data frame
#
dummy_dat <- read.table(header = TRUE,stringsAsFactors = FALSE,
strip.white=TRUE, sep="\n",
text= "text
-the bill was not generated
-tried to raise the complaint
-the location update failed
-the call drop has increased in my location
-nobody in the location received bill,so call ASAP")
#
# form data frame with categories as columns
#
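dummy_cats <- dummy_dat %>%
  mutate(text = tolower(text),
         # str_match() returns a matrix; [, 1] keeps the full match as a plain vector
         bill = str_match(text, pattern = "bill")[, 1],
         call = str_match(text, pattern = "call")[, 1],
         location = str_match(text, pattern = "location")[, 1],
         # rows that match none of the categories are labelled "other"
         other = ifelse(is.na(bill) & is.na(call) & is.na(location), "other", NA))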
#
# convert categories as columns to key-value format
# with categories as values in category_name column
#
dummy_cat_name <- dummy_cats %>%
  gather(key = type, value = category_name, -text, na.rm = TRUE) %>%
  select(-type)
#
#---------------------------------------------------------------------------
#
# ALTERNATIVE: go directly from text data to key-value format with categories
# as values under category_name
# Rows are repeated if they fall into multiple categories
# Rows with no categories are put in category other
#
dummy_dat <- dummy_dat %>% mutate(text = tolower(text))
dummy_cat_name1 <- data.frame(text = NULL, category_name = NULL)
for (category in c("bill", "call", "location")) {
  # keep only the rows whose text matches this category
  temp <- dummy_dat %>%
    mutate(category_name = str_match(text, pattern = category)[, 1]) %>%
    na.omit()
  dummy_cat_name1 <- dummy_cat_name1 %>% bind_rows(temp)
}
# rows that matched no category are labelled "other"
dummy_cat_name1 <- left_join(dummy_dat, dummy_cat_name1, by = "text") %>%
  mutate(category_name = ifelse(is.na(category_name), "other", category_name))
The result is
dummy_cat_name1
text                                                category_name
-the bill was not generated                         bill
-tried to raise the complaint                       other
-the location update failed                         location
-the call drop has increased in my location         call
-the call drop has increased in my location         location
-nobody in the location received bill,so call asap  bill
-nobody in the location received bill,so call asap  call
-nobody in the location received bill,so call asap  location
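The question also asks for one `category_name` per tweet, falling back to a catch-all label when a tweet hits too many categories. The answer above does not cover that part, so what follows is only a rough sketch under stated assumptions: `label_tweet` and `max_categories` are hypothetical names introduced here, and the threshold is set to 2 rather than the 3 mentioned in the question, purely so the fallback is visible with three categories.

```r
library(stringr)

tweets <- data.frame(
  text = c("-the bill was not generated",
           "-tried to raise the complaint",
           "-the call drop has increased in my location",
           "-nobody in the location received bill,so call ASAP"),
  stringsAsFactors = FALSE
)

categories <- c("bill", "call", "location")

# Hypothetical helper: returns a single label for one tweet.
# "other" covers both "no category matched" and "too many categories matched".
label_tweet <- function(text, categories, max_categories) {
  matched <- categories[str_detect(tolower(text), categories)]
  if (length(matched) == 0 || length(matched) > max_categories) {
    return("other")
  }
  paste(matched, collapse = ", ")
}

# max_categories = 2 is an assumption for illustration only
tweets$category_name <- vapply(tweets$text, label_tweet, character(1),
                               categories = categories, max_categories = 2,
                               USE.NAMES = FALSE)
```

With this threshold the last tweet, which matches all three categories, is labelled "other", while the tweet matching two categories keeps both names joined by a comma.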