Cosine similarity: I want to understand the values I get



Example Data:
Document_A <- "Ich gehe heute einkaufen" ##(English: I am going shopping today) 
Document_B <- "Das Wetter ist heute gut" ##(English: The weather today is good)


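library(textstem)   ## provides lemmatize_strings()
library(textTinyR)  ## provides Doc2Vec
## lemma_data and mystopwords are defined elsewhere (not shown here)

tkn <- function(SelectedGroup){
  ## Lemmatization
  abc <- lemmatize_strings(SelectedGroup, dictionary = lemma_data)
  ## Delete Punctuation etc.
  abc <- gsub("[[:punct:]]", "", abc[1])
  ## Transform everything to lower case letters
  abc <- tolower(abc)
  ## Elimination of stopwords
  abc <- tm::removeWords(tm::scan_tokenizer(abc), mystopwords)
  ## Return the token vector
  abc
}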

treatment_tkn <- tkn(Document_A)
control_tkn <- tkn(Document_B)


word_embedding <- "C:/Users/Aaron/Desktop/Testordner/embeddings_german.txt" 
 ## This file was downloaded from http://vectors.nlpl.eu/repository/# (language: German, ID = 45).
 ## Properties: vector size: 100, window: 10, corpus: German CoNLL17 corpus, vocabulary size: 4946997, algorithm: Word2Vec Continuous Skipgram, lemmatization: False

 ## apply the word embeddings vectors on the treatment and control variable
token_list <- list(treatment_tkn,control_tkn)
init <- Doc2Vec$new(token_list = token_list, word_vector_FILE = word_embedding)
out = init$doc2vec_methods(method = "sum_sqrt")
 ## Calculate the cosine similarity of the two groups (measures the angle between the two vectors)
Cosine_similarity <- cosine(out[1,], out[2,])
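
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. Here is a minimal base-R sketch of that formula. The helper name cosine_manual and the toy vectors are made up for illustration and are not part of any of the packages used above:

```r
# Cosine similarity written out in base R (hypothetical helper, for illustration only).
cosine_manual <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

a <- c(1, 0)
b <- c(0, 1)

cosine_manual(a, a)   # same direction:     1
cosine_manual(a, b)   # orthogonal:         0
cosine_manual(a, -a)  # opposite direction: -1
```

So in principle the value can range from -1 to 1, and 0 would correspond to vectors that are orthogonal.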

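However, even for this example (where in the end only the two words einkaufen and wetter are compared), I get a cosine similarity of 0.51, even though these words are not related at all.

The same happens with all kinds of documents. When I compare a German document with an English one, the value is around 0.6. Isn't that quite high? When both documents are German and really do share some content, the value is mostly above 0.9.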
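
One way to put such values in context might be to compare them against a baseline of unrelated vectors. The sketch below uses purely synthetic random vectors, not the real embeddings. It only illustrates that if all vectors share a common component (modeled here by a hypothetical constant offset of 1), the cosine similarity between otherwise unrelated vectors is no longer centered on 0. Whether this applies to the embedding file above is an assumption I have not verified:

```r
# Synthetic illustration only: random vectors, NOT the real word embeddings.
set.seed(1)
n_vec  <- 200   # number of made-up "documents"
n_dim  <- 100   # same dimensionality as the embedding file
offset <- 1     # hypothetical component shared by every vector

noise <- matrix(rnorm(n_vec * n_dim), nrow = n_vec)

# Mean cosine similarity over all distinct pairs of rows of m.
mean_pairwise_cosine <- function(m) {
  unit <- m / sqrt(rowSums(m^2))     # scale every row to length 1
  sims <- unit %*% t(unit)           # all pairwise cosine similarities
  mean(sims[upper.tri(sims)])
}

sim_centered <- mean_pairwise_cosine(noise)           # close to 0
sim_offset   <- mean_pairwise_cosine(noise + offset)  # clearly above 0
```

If something similar holds for real document vectors, a value of 0.5 or 0.6 would have to be judged relative to the typical similarity between unrelated documents rather than relative to 0.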


Thanks :)

Tags: r, text-mining, cosine-similarity, doc2vec

