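r - Lost a document during tokenization
Problem description
I am losing one row of data during tokenization.
The dataset (data, shown here as dput() output) contains three documents: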
structure(list(ID = c("N12277Y", "N12284X", "N12291W"), corrected = c("I am living in I like living in I would not like to emigrate because you never hardly see your parents at all and brothers and sisters I would be nursing in a hospital I will drive a car and I would like to wear fashionable clothes I am married I like having parties and going out on nights If I had a girl and a boy I would call the girl and I would call the boy The little girl is two and the little boy is one month. My hobbies are making dresses knitting and Swimming I like going on holiday I like going to other countries. ",
"I do not know. ", "I emigrated* to Australia* last year. I have have a small farm* just outside Sydney. I have 250 acres* of land and on that I *****ly plow and keepanimals on. I go into Town (Sydney) about twice a week mostly to get ca*** and hay, my wife does all the Shopping. So I don't have to worry about that. We have two girls one is twelve and the other is ten. the oldest has just got to the stage of pop and Horse riding, the younger one has just finished her first play with the school and she came in yesterday saying that* the c***** teacher* said that she was the best of all we have just got over the worst summer* for years. The sun was so hot - that it dried* up all the ***nds and all the crop*. 500 sheep and 100 cows died* with lack of water and we almost dried up as well. But we seem to have* got over that and we are all back to normal again. The two Children went back to school after the summer* holidays three weeks ago. The road* is* very dust and one of s* friends was injured with a * up thought* from the dust. I miss the football a lot but U have plenty of cricket*. The school is about three miles away its only a little place but it only cost two pounds every three weeks. There isnt so much field* in England there is only a pinch* compared to here well there isnt much more to tell so goodbye. "
), father = structure(c(2L, 2L, 1L), .Label = c("1", "2"), class = "factor"),
financial = structure(c(1L, 1L, 1L), .Label = "1", class = "factor")), row.names = 598:600, class = "data.frame")
Then I ran the following code:
library(dplyr)
library(tidytext)
library(SnowballC)
tokens <- data %>%
  unnest_tokens(output = "word", token = "words", input = corrected) %>%
  anti_join(stop_words) %>%        # remove stop words
  mutate(word = wordStem(word))    # stem words

essay_matrix <- tokens %>%
  count(ID, word) %>%
  cast_dtm(document = ID, term = word, value = n, weighting = tm::weightTfIdf)
But the output shows that the matrix contains only 2 documents:
<<DocumentTermMatrix (documents: 2, terms: 87)>>
Non-/sparse entries: 84/90
Sparsity : 52%
Maximal term length: 9
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
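I tracked the problem down: the second row (ID N12284X) causes this error:
Error in (function (cl, name, valueClass) : assignment of an object of class "numeric" is not valid for @'Dim' in an object of class "dgTMatrix"; is(value, "integer") is not TRUE
I am not sure why this row is a problem. I have more than 4000 entries, and this is the only row that triggers the error. Can anyone help?
Thanks in advance.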
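Solution
As @MrFlick pointed out, every word in "I do not know" is a stop word, so the document is empty once stop words are removed. An empty document contributes no rows to count(ID, word), which is why it never reaches the document-term matrix.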
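The explanation above can be checked directly. The following is a minimal sketch, assuming the stop_words table that ships with tidytext; the one-row data frame doc is a hypothetical stand-in for the problematic row, not part of the original code.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame holding only the problematic document
doc <- data.frame(
  ID = "N12284X",
  corrected = "I do not know. ",
  stringsAsFactors = FALSE
)

# Tokenize into lowercase words, then drop everything listed in stop_words
remaining <- doc %>%
  unnest_tokens(output = "word", input = corrected) %>%
  anti_join(stop_words, by = "word")

# Number of tokens that survive stop-word removal for this document
nrow(remaining)
```

If nrow(remaining) is 0, the document has no tokens left and will be missing from any count or matrix built from the token table.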
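To work around it, I removed the empty documents with the code below and used data_ready for the subsequent analysis.
# documents that made it into the matrix
data_ready <- data[data$ID %in% essay_matrix[["dimnames"]][["Docs"]], ]
# documents that became empty after stop-word removal
data_empty <- data[!data$ID %in% essay_matrix[["dimnames"]][["Docs"]], ]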
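The same split can also be done before the matrix is built, by comparing the IDs in the original data with the IDs that still appear in the token table. This is a sketch of that alternative rather than the approach used above; the three-row data frame and the name empty_ids are illustrative assumptions.

```r
library(dplyr)
library(tidytext)

# Toy data (hypothetical): document "B" consists only of stop words
data <- data.frame(
  ID = c("A", "B", "C"),
  corrected = c("I like going on holiday",
                "I do not know. ",
                "We have a small farm"),
  stringsAsFactors = FALSE
)

tokens <- data %>%
  unnest_tokens(output = "word", input = corrected) %>%
  anti_join(stop_words, by = "word")

# IDs present in the data but absent from the tokens have no words left
empty_ids <- setdiff(data$ID, tokens$ID)

data_ready <- data[!data$ID %in% empty_ids, ]
data_empty <- data[data$ID %in% empty_ids, ]
```

Checking against tokens instead of essay_matrix means the empty documents can be identified even if building the matrix fails.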