What is the code to find which words differ between two dfms?

Problem description

I have two dfms, and I would like to know which words are missing or different between them. For example:

library(quanteda)

df1 <- data.frame(Text = c("Stackoverflow is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."), stringsAsFactors = FALSE)

corpus1 <- corpus(df1, text_field = "Text")

df2 <- data.frame(Text = c("Stackoverflow is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."), stringsAsFactors = FALSE)
corpus2 <- corpus(df2, text_field = "Text")

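# Tokenize first: newer quanteda versions expect tokens (not a corpus) as input to dfm()
dfm1 <- dfm(tokens(corpus1, remove_punct = TRUE))

dfm2 <- dfm(tokens(corpus2, remove_punct = TRUE))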

I would like to see which words in dfm2 are not in dfm1. Thank you very much for your help!

Tags: r, quanteda

Solution


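The answer above works well. However, I think this can be done more cleanly with dfm_select. Passing dfm2 as the pattern with selection = "remove" drops every feature of dfm1 that also occurs in dfm2, so only the words unique to dfm1 remain: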

dfm_select(dfm1, pattern = dfm2, selection = "remove")
#> Document-feature matrix of: 1 document, 10 features (0.0% sparse).
#> 1 x 10 sparse Matrix of class "dfm"
#>        features
#> docs    so stack immensely useful thank guys sort this out for
#>   text1  1     1         1      1     1    1    1    1   1   1
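
If only the word lists are needed rather than a dfm, one alternative is to compare the feature names directly with base R's setdiff. The following is a minimal sketch, not the method from the answer above: it assumes the same two texts as in the question, builds the dfms from character vectors instead of data frames for brevity, and uses the illustrative variable names only_in_dfm1 and only_in_dfm2.

```r
library(quanteda)

# Same two texts as in the question, as plain character vectors
txt1 <- "Stackoverflow is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."
txt2 <- "Stackoverflow is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."

# Tokenize, drop punctuation, then build the document-feature matrices
dfm1 <- dfm(tokens(txt1, remove_punct = TRUE))
dfm2 <- dfm(tokens(txt2, remove_punct = TRUE))

# Features present in one dfm but absent from the other
only_in_dfm1 <- setdiff(featnames(dfm1), featnames(dfm2))
only_in_dfm2 <- setdiff(featnames(dfm2), featnames(dfm1))

print(only_in_dfm1)
print(only_in_dfm2)
```

Running setdiff in both directions makes the asymmetry explicit: with these texts, the second text is contained in the first, so only_in_dfm2 should be empty, while only_in_dfm1 should hold the same ten words shown in the dfm_select output.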
