Using textstat_simil with a dictionary or globs in Quanteda

Question

I looked into the documentation, but as far as I understand, there is no way to use the textstat_simil function with a dictionary or globs. What would be the best way of approaching something like the below?

txt <- "It is raining. It rains a lot during the rainy season"
rain_dfm <- dfm(txt)
textstat_simil(rain_dfm, "rain", method = "cosine", margin = "features")

Do I need to use tokens_replace to change "rain*" to "rain", or is there another way to do this? In this case, stemming would do the trick, but what about cases where that is not feasible?

Tags: r, quanteda

Answer


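This is possible, but first you need to collapse the glob matches into a single feature using dfm_lookup(). (Note: there are other ways to do this, such as tokenizing and then using tokens_lookup() or tokens_replace(), but I think the lookup approach is more straightforward, and it is also what you asked about in the question.)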

Also note that for feature similarity you must have more than one document, which is why I have added two more here.

library(quanteda)

txt <- c("It is raining. It rains a lot during the rainy season",
         "Raining today, and it rained yesterday.",
         "When it's raining it must be rainy season.")

rain_dfm <- dfm(txt)

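Then use a dictionary to convert the glob matches (glob is the default pattern type) for "rain*" into "rain", while keeping the other features. (In this particular case you are correct that dfm_wordstem() could accomplish the same thing.)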

rain_dfm <- dfm_lookup(rain_dfm, 
                       dictionary(list(rain = "rain*")), 
                       exclusive = FALSE,
                       capkeys = FALSE)
rain_dfm
## Document-feature matrix of: 3 documents, 17 features (52.9% sparse).
## 3 x 17 sparse Matrix of class "dfm"
##        features
## docs    it is rain . a lot during the season today , and yesterday when it's must be
##   text1  2  1    3 1 1   1      1   1      1     0 0   0         0    0    0    0  0
##   text2  1  0    2 1 0   0      0   0      0     1 1   1         1    0    0    0  0
##   text3  1  0    2 1 0   0      0   0      1     0 0   0         0    1    1    1  1
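
To make the glob step more concrete, here is a minimal base R sketch of which features a pattern like "rain*" would pick up. It uses glob2rx() from the utils package purely as an illustration; this is an assumption for demonstration and not necessarily how quanteda matches dictionary values internally. The feats vector is a hand-typed subset of the lowercased tokens from the three example texts.

```r
# Hand-typed subset of the lowercased tokens from the example texts
# (illustrative only; not extracted from the dfm programmatically).
feats <- c("it", "is", "raining", "rains", "rainy", "rained",
           "season", "today", "yesterday")

# glob2rx() converts a glob pattern into an anchored regular expression.
pattern <- glob2rx("rain*")

# Features matched by the glob, i.e. those that end up collapsed
# into the single "rain" dictionary key.
matched <- feats[grepl(pattern, feats, ignore.case = TRUE)]
print(matched)
```

Everything that matches is counted under the dictionary key, which is why the "rain" column above holds 3, 2 and 2 for the three documents.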

Now you can compute the cosine similarity of every feature with the target feature "rain":

textstat_simil(rain_dfm, selection = "rain", method = "cosine", margin = "features")
##                rain
## it        0.9901475
## is        0.7276069
## rain      1.0000000
## .         0.9801961
## a         0.7276069
## lot       0.7276069
## during    0.7276069
## the       0.7276069
## season    0.8574929
## today     0.4850713
## ,         0.4850713
## and       0.4850713
## yesterday 0.4850713
## when      0.4850713
## it's      0.4850713
## must      0.4850713
## be        0.4850713
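
As a sanity check on these numbers, the cosine similarity can be recomputed by hand in base R. This is only a sketch: the counts matrix is copied manually from a few columns of the dfm printed above, and the cosine() helper is a hypothetical name introduced for illustration, not a quanteda function.

```r
# Counts per document (text1, text2, text3), copied by hand from the
# dfm printed above; only a few features are included.
counts <- cbind(
  rain   = c(3, 2, 2),
  it     = c(2, 1, 1),
  is     = c(1, 0, 0),
  season = c(1, 0, 1),
  today  = c(0, 1, 0)
)

# Cosine similarity: dot product divided by the product of the norms.
# (Hypothetical helper for illustration.)
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of each feature column to the "rain" column.
sims <- apply(counts, 2, cosine, b = counts[, "rain"])
print(round(sims, 7))
```

For example, "it" has counts (2, 1, 1) and "rain" has (3, 2, 2), so the similarity is 10 / (sqrt(6) * sqrt(17)), which agrees with the 0.9901475 reported above.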
