Replacing several ngrams in quanteda

Question

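In the text of my news articles, I would like to convert several different ngrams that refer to the same political party into an acronym. I want to do this so that a sentiment dictionary does not confuse the words in a party's name (Liberal Party) with the same word in a different context (liberal helpings).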

I can do this below with str_replace_all, and I know about the tokens_compound() function in quanteda, but it does not seem to do exactly what I need.

library(stringr)
text<-c('a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals')
text1<-str_replace_all(text, '(liberal party)|liberals', 'olp')
text2<-str_replace_all(text1, '(new democrats)|new democratic party', 'ndp')

Should I somehow preprocess the text before turning it into a corpus? Or is there a way to do this in quanteda?

Here is some expanded sample code that illustrates the problem better:

library("quanteda")

text <- c('a text about some political parties called the new democratic party 
the new democrats and the liberal party and the liberals. I would like the 
word democratic to be counted in the dfm but not the words new democratic. 
The same goes for liberal helpings but not liberal party')
partydict <- dictionary(list(
    olp = c("liberal party", "liberals"),
    ndp = c("new democrats", "new democratic party"),
    sentiment = c("liberal", "democratic")
))

dfm(text, dictionary = partydict)

This example counts democratic in both the new democratic sense and the plain democratic sense, but I would like those counted separately.

Tags: r, text-mining, quanteda

Answer


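You want the function tokens_lookup(), after defining a dictionary whose keys are the canonical party labels and whose values list all the ngram variants of each party name. Setting exclusive = FALSE keeps the tokens that are not matched, so the call in effect substitutes the canonical party name for every variant.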

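In the example below, I have modified your input text slightly to show how the party names are combined, as distinct from phrases that use "liberal" but not "liberal party".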

library("quanteda")

text<-c('a text about some political parties called the new democratic party 
         which is conservative the new democrats and the liberal party and the 
         liberals which are liberal helping poor people')
toks <- tokens(text)

partydict <- dictionary(list(
    olp = c("liberal party", "the liberals"),
    ndp = c("new democrats", "new democratic party")
))

(toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE))
## tokens from 1 document.
## text1 :
##  [1] "a"            "text"         "about"        "some"         "political"    "parties"     
##  [7] "called"       "the"          "NDP"          "which"        "is"           "conservative"
## [13] "the"          "NDP"          "and"          "the"          "OLP"          "and"         
## [19] "OLP"          "which"        "are"          "liberal"      "helping"      "poor"        
## [25] "people"   

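This has replaced the party-name variants with the party keys. A dfm built from these new tokens keeps the uses of (for example) "liberal" that may be linked to sentiment, while every "liberal party" has already been merged and replaced with "OLP". Applying a dictionary to the dfm now works for your example of "liberal" in "liberal helping" without confusing it with the "liberal" in the party name.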

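# Sentiment dictionary, applied after the party names have been replaced
sentdict <- dictionary(list(
    left = c("liberal", "left"),
    right = c("conservative")
))

# Nested call instead of a pipe, so no pipe operator needs to be loaded
dfm_lookup(dfm(toks2), dictionary = sentdict, exclusive = FALSE)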
## Document-feature matrix of: 1 document, 19 features (0% sparse).
## 1 x 19 sparse Matrix of class "dfm"
##        features
## docs    olp ndp a text about some political parties called the which is RIGHT and LEFT are helping
##  text1   2   2 1    1     1    1         1       1      1   3     2  1     1   2    1   1       1
##        features
## docs    poor people
##  text1    1      1
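
To make the effect of the two-step approach concrete, here is a minimal sketch that compares a naive sentiment lookup with a lookup performed after the party names have been replaced. The input sentence and the one-key `sentdict` are hypothetical stand-ins chosen for illustration; they are not from the question and are not a real sentiment dictionary.

```r
library("quanteda")

# Hypothetical sentence: each sentiment word appears once inside a party
# name and once on its own
txt <- "the new democratic party held a democratic vote and the liberal party served liberal helpings"
toks <- tokens(txt)

partydict <- dictionary(list(
    olp = c("liberal party", "liberals"),
    ndp = c("new democrats", "new democratic party")
))
# Hypothetical one-key dictionary standing in for a real sentiment dictionary
sentdict <- dictionary(list(sentiment = c("liberal", "democratic")))

# Naive: look up the sentiment words directly, party names included
naive <- dfm_lookup(dfm(toks), dictionary = sentdict)

# Two-step: replace the party names first, then look up the sentiment words
toks_party <- tokens_lookup(toks, partydict, exclusive = FALSE)
twostep <- dfm_lookup(dfm(toks_party), dictionary = sentdict)

naive_count <- as.matrix(naive)[1, "sentiment"]
twostep_count <- as.matrix(twostep)[1, "sentiment"]
print(c(naive = naive_count, twostep = twostep_count))
```

The naive count should include the "liberal" and "democratic" that sit inside the party names, while the two-step count should cover only the standalone uses.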

Two additional notes:

  1. If you do not want the keys capitalized in the replacement tokens, set capkeys = FALSE.

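  2. You can choose a different match type with the valuetype argument, including valuetype = "regex". (Note that alternation patterns like the ones in your example are easy to get wrong, because the scope of the | operator depends on how the branches are grouped: in the ndp pattern, | separates "(new democrats)" from "new democratic party". With tokens_lookup() you do not need to worry about this!)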
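
The two notes above can be sketched as follows. The shortened input sentence and `regexdict` are assumptions made for illustration. The regex variant also assumes that, for multi-word values, each whitespace-separated element is matched as a regular expression against a single token, which is why every element is anchored with `^` and `$`.

```r
library("quanteda")

# Shortened, hypothetical input sentence
toks <- tokens("the new democratic party and the liberals which are liberal helping poor people")

partydict <- dictionary(list(
    olp = c("liberal party", "liberals"),
    ndp = c("new democrats", "new democratic party")
))

# Note 1: capkeys = FALSE leaves the keys as written in the dictionary
toks_lower <- tokens_lookup(toks, partydict, exclusive = FALSE, capkeys = FALSE)
print(toks_lower)

# Note 2: valuetype = "regex" treats the values as regular expressions.
# Assumption: each whitespace-separated element is matched against one
# token, so unanchored "liberal" would also match "liberals".
regexdict <- dictionary(list(
    olp = c("^liberal$ ^party$", "^liberals$"),
    ndp = c("^new$ ^democrats$", "^new$ ^democratic$ ^party$")
))
toks_regex <- tokens_lookup(toks, regexdict, valuetype = "regex", exclusive = FALSE)
print(toks_regex)
```

For fixed phrases like these, the default glob matching is simpler; the regex form is mainly useful when a value has to cover spelling variants of a single token.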

